dms2_batch_bcsubamp
Overview
The dms2_batch_bcsubamp program processes FASTQ files generated by Barcoded-subamplicon sequencing to count the frequencies of mutations at each site for a set of samples, and then summarize the results.
The dms2_batch_bcsubamp program simply runs dms2_bcsubamp for each sample listed in a batch file specified by --batchfile.
Specifically, as described in Command-line usage, you can specify a few sample-specific arguments in the --batchfile.
All other arguments are specified using the normal option syntax (e.g., --bclen BCLEN) and are shared between all samples specified in --batchfile.
The result is the output for each individual run of dms2_bcsubamp plus the summary plots described in Output files.
The Doud2016 example illustrates the usage of dms2_batch_bcsubamp on a real dataset.
Because dms2_batch_bcsubamp simply runs dms2_bcsubamp on each sample specified by the --batchfile argument described below, see the dms2_bcsubamp Algorithm for assembling and aligning subamplicons and the dms2_bcsubamp Command-line usage for details that help in understanding many of the arguments in the dms2_batch_bcsubamp Command-line usage below.
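As a rough illustration of how the shared arguments and the batch file fit together, the following sketch assembles a command line in Python. The file names (batch.csv, refseq.fasta), the alignment specs, and the helper build_batch_command are hypothetical placeholders, not values taken from a real analysis; only the option names come from this page.

```python
import shlex


def build_batch_command(batchfile, refseq, alignspecs, summaryprefix,
                        outdir="results", ncpus=-1, bclen=8):
    """Return an argument list for dms2_batch_bcsubamp (hypothetical helper).

    Sample-specific values live in the batch file; everything passed here
    is shared by all samples.
    """
    return [
        "dms2_batch_bcsubamp",
        "--batchfile", batchfile,
        "--refseq", refseq,
        "--alignspecs", *alignspecs,   # one or more space-separated specs
        "--summaryprefix", summaryprefix,
        "--outdir", outdir,
        "--ncpus", str(ncpus),
        "--bclen", str(bclen),
    ]


# Placeholder inputs chosen only for illustration.
cmd = build_batch_command(
    batchfile="batch.csv",
    refseq="refseq.fasta",
    alignspecs=["1,285,36,37", "286,570,31,32"],
    summaryprefix="summary",
)
print(shlex.join(cmd))

# To actually run the program (requires dms_tools2 to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if a path contains spaces.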
Command-line usage
Perform many runs of dms2_bcsubamp and plot results. Part of dms_tools2 (version 2.6.6) written by the Bloom Lab.
usage: dms2_batch_bcsubamp [-h] [--outdir OUTDIR] [--ncpus NCPUS]
                           [--use_existing {yes,no}] [-v] --refseq REFSEQ
                           --alignspecs ALIGNSPECS [ALIGNSPECS ...]
                           [--bclen BCLEN] [--fastqdir FASTQDIR]
                           [--R2 R2 [R2 ...]] [--R1trim R1TRIM [R1TRIM ...]]
                           [--R2trim R2TRIM [R2TRIM ...]] [--bclen2 BCLEN2]
                           [--chartype {codon}] [--maxmuts MAXMUTS]
                           [--minq MINQ] [--minreads MINREADS]
                           [--minfraccall MINFRACCALL] [--minconcur MINCONCUR]
                           [--sitemask SITEMASK] [--purgeread PURGEREAD]
                           [--purgebc PURGEBC] [--bcinfo] [--bcinfo_csv]
                           --batchfile BATCHFILE --summaryprefix SUMMARYPREFIX
Named Arguments
- --outdir
Output files to this directory (create if needed).
- --ncpus
Number of CPUs to use, -1 is all available.
Default: -1
Multiple runs of dms2_bcsubamp can be performed in parallel on the different samples specified by --batchfile. This argument determines how many CPUs are used if running multiple jobs.
- --use_existing
Possible choices: yes, no
If files with names of expected output already exist, do not re-run.
Default: “no”
- -v, --version
show program’s version number and exit
- --refseq
Align subamplicons to gene in this FASTA file.
- --alignspecs
Subamplicon alignment positions as ‘REFSEQSTART,REFSEQEND,R1START,R2START’. REFSEQSTART is nt (1, 2, … numbering) in ‘refseq’ where nt R1START in R1 aligns. REFSEQEND is nt in ‘refseq’ where nt R2START in R2 aligns.
- --bclen
Length of NNN… barcode at start of each read. Assumed to be the same for R1 and R2; use --bclen2 if this is not the case.
Default: 8
- --fastqdir
R1 and R2 files in this directory.
- --R2
Read 2 (R2) FASTQ files assumed to have same names as R1 but with ‘_R1’ replaced by ‘_R2’. If that is not the case, provide names here.
- --R1trim
Trim R1 from 3’ end to this length. One value for all reads or values for each subamplicon in --alignspecs.
- --R2trim
Like ‘--R1trim’, but for R2.
- --bclen2
If R1 and R2 have different length barcodes, use --bclen for R1 length and --bclen2 for R2 length.
- --chartype
Possible choices: codon
Character type for which we count mutations.
Default: “codon”
- --maxmuts
Max allowed mismatches in alignment of subamplicon; mismatches counted in terms of character ‘--chartype’.
Default: 4
- --minq
Only call nucleotides with Q score >= this.
Default: 15
- --minreads
Require this many reads in a barcode to agree to call consensus nucleotide identity.
Default: 2
- --minfraccall
Retain only barcodes where trimmed consensus sequence for each read has >= this frac sites called.
Default: 0.95
- --minconcur
Only call consensus identity for barcode when >= this fraction of reads concur.
Default: 0.75
- --sitemask
Use to only consider mutations at a subset of sites. Should be a CSV file with column named site listing all sites to include.
- --purgeread
Randomly purge read pairs with this probability to subsample data.
Default: 0
- --purgebc
Randomly purge barcodes with this probability to subsample data.
Default: 0
- --bcinfo
Create file with suffix ‘bcinfo.txt.gz’ with info about each barcode.
Default: False
- --bcinfo_csv
Store ‘bcinfo’ file as a CSV with the suffix ‘bcinfo.csv.gz’. Only has an effect if --bcinfo is used.
Default: False
- --batchfile
CSV file specifying each dms2_bcsubamp run. Must have these columns: name, R1. Can optionally have columns R1trim and R2trim with spaces delimiting subamplicon-specific trimming. If R1trim / R2trim are in the batch file, do not also give values for --R1trim and --R2trim. Other columns are ignored, so other dms2_bcsubamp args should be passed as separate command line args rather than in --batchfile.
- --summaryprefix
Prefix of output summary plots.
As detailed in Output files below, dms2_batch_bcsubamp creates a variety of plots summarizing the output. These files are in the directory specified by --outdir, and have the prefix specified here. This prefix should only contain letters, numbers, dashes, and spaces. Underscores are not allowed as they are a LaTeX special character.
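The --batchfile format described above can be produced with Python's standard csv module. The sketch below is one possible way to do so, assuming hypothetical sample names, FASTQ file names, and trim lengths; batchfile_text is an illustrative helper and not part of dms_tools2.

```python
import csv
import io

REQUIRED = ("name", "R1")          # columns the batch file must have
OPTIONAL = ("R1trim", "R2trim")    # optional per-sample trimming columns


def batchfile_text(samples):
    """Return CSV text for a --batchfile from a list of sample dicts."""
    for sample in samples:
        missing = [col for col in REQUIRED if not sample.get(col)]
        if missing:
            raise ValueError(f"sample {sample!r} lacks columns {missing}")
    # Only emit optional columns that at least one sample uses.
    present = [col for col in OPTIONAL if any(col in s for s in samples)]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(REQUIRED) + present,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for sample in samples:
        row = dict(sample)
        for col in present:
            value = row.get(col)
            # Subamplicon-specific trims are space-delimited within one cell.
            if isinstance(value, (list, tuple)):
                row[col] = " ".join(str(v) for v in value)
        writer.writerow(row)
    return buf.getvalue()


# Hypothetical samples; names, files, and trim lengths are placeholders.
samples = [
    {"name": "mutDNA-1", "R1": "mutDNA-1_R1.fastq.gz",
     "R1trim": [200, 200], "R2trim": [170, 170]},
    {"name": "mutvirus-1", "R1": "mutvirus-1_R1.fastq.gz",
     "R1trim": [200, 200], "R2trim": [170, 170]},
]
text = batchfile_text(samples)
print(text)
```

The returned text can be written to a file and passed to --batchfile. If the trim columns are included this way, --R1trim and --R2trim should not also be given on the command line.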
Output files
Running dms2_batch_bcsubamp produces a variety of output files, all of which will be found in the directory specified by --outdir.
Results for each sample
The program dms2_bcsubamp is run on each sample specified by --batchfile, so you will create all of the dms2_bcsubamp Output files.
Summary files
Plots are created that summarize the output for all samples specified by --batchfile.
These files have the prefix specified by --summaryprefix.
So for instance, if you run dms2_batch_bcsubamp with the arguments --outdir results --summaryprefix summary then these files will have the prefix ./results/summary.
They will have the suffixes listed below:
.log: a text file that logs the progress of the program.
_readstats.pdf: plot of reads for each sample.
_bcstats.pdf: plot of barcodes for each sample.
_readsperbc.pdf: plot of distribution of the number of reads per-barcode for each sample.
_depth.pdf: plot of number of counts called at each site for each sample.
_mutfreq.pdf: plot of mutation frequency at each site for each sample.
_codonmuttypes.pdf: plot of average frequency of different types of codon mutations.
_codonmuttypes.csv: numerical data in _codonmuttypes.pdf.
_codonntchanges.pdf: plot of average frequency of codon mutations with different numbers of nucleotide changes.
_singlentchanges.pdf: plot of frequencies of different types of nucleotide mutations among codons with just one nucleotide change.
_cumulmutcounts.pdf: plot of the fraction of mutations that occur \(\leq\) a given number of times.
Examples and more detailed explanations of these plots can be found in the Doud2016 example.
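To make the naming rule concrete, the sketch below checks a candidate --summaryprefix against the stated character restrictions and lists where a few of the summary files would be expected. The helper names are hypothetical, the regular expression is only an interpretation of "letters, numbers, dashes, and spaces", and only a subset of the suffixes is included.

```python
import os
import re

# A subset of the summary-file suffixes listed above.
SUFFIXES = [".log", "_readstats.pdf", "_bcstats.pdf",
            "_depth.pdf", "_mutfreq.pdf"]


def is_valid_summaryprefix(prefix):
    """True if prefix has only letters, numbers, dashes, and spaces."""
    return re.fullmatch(r"[A-Za-z0-9\- ]+", prefix) is not None


def expected_summary_files(outdir, summaryprefix):
    """Return expected paths of some summary files (illustrative only)."""
    if not is_valid_summaryprefix(summaryprefix):
        raise ValueError(f"invalid --summaryprefix: {summaryprefix!r}")
    return [os.path.join(outdir, summaryprefix + suffix)
            for suffix in SUFFIXES]


paths = expected_summary_files("results", "summary")
for path in paths:
    print(path)
```

A check like this can be run before launching a long job, since a prefix containing an underscore is not allowed.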
Memory usage
As described in the Memory usage section for dms2_bcsubamp, each iteration of that program can consume substantial memory.
Running it multiple times in parallel with dms2_batch_bcsubamp will therefore consume even more memory.
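One way to keep parallel runs within available memory is to pass an explicit value to --ncpus instead of the default of -1. The sketch below picks such a value from a per-job memory estimate; the numbers are hypothetical, since this page does not state how much memory a run needs, and choose_ncpus is an illustrative helper rather than part of dms_tools2.

```python
import os


def choose_ncpus(mem_per_job_gb, total_mem_gb, max_cpus=None):
    """Pick a job count limited by both CPU count and memory."""
    if max_cpus is None:
        max_cpus = os.cpu_count() or 1
    jobs_by_memory = int(total_mem_gb // mem_per_job_gb)
    # Always allow at least one job, even if the estimate exceeds memory.
    return max(1, min(max_cpus, jobs_by_memory))


# Assumed figures: 8 GB per dms2_bcsubamp run, 32 GB total, 16 CPUs.
ncpus = choose_ncpus(mem_per_job_gb=8, total_mem_gb=32, max_cpus=16)
print(f"--ncpus {ncpus}")
```

The per-job estimate would need to be measured on your own data, for example by running dms2_bcsubamp on a single sample first.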