dms2_batch_bcsubamp

Overview

The dms2_batch_bcsubamp program processes FASTQ files generated by Barcoded-subamplicon sequencing to count the frequencies of mutations at each site for a set of samples, and then summarizes the results.

The dms2_batch_bcsubamp program simply runs dms2_bcsubamp for each sample listed in a batch file specified by --batchfile. Specifically, as described in Command-line usage, you can specify a few sample-specific arguments in the --batchfile. All other arguments are specified using the normal option syntax (e.g., --bclen BCLEN) and are shared between all samples specified in --batchfile. The result is the output for each individual run of dms2_bcsubamp plus the summary plots described in Output files.

The Doud2016 example illustrates the usage of dms2_batch_bcsubamp on a real dataset.

Because dms2_batch_bcsubamp simply runs dms2_bcsubamp on each sample specified by the --batchfile argument described below, the dms2_bcsubamp Algorithm for assembling and aligning subamplicons and the dms2_bcsubamp Command-line usage are helpful for understanding many of the arguments in the dms2_batch_bcsubamp Command-line usage below.

Command-line usage

Perform many runs of dms2_bcsubamp and plot results. Part of dms_tools2 (version 2.6.6) written by the Bloom Lab.

usage: dms2_batch_bcsubamp [-h] [--outdir OUTDIR] [--ncpus NCPUS]
                           [--use_existing {yes,no}] [-v] --refseq REFSEQ
                           --alignspecs ALIGNSPECS [ALIGNSPECS ...]
                           [--bclen BCLEN] [--fastqdir FASTQDIR]
                           [--R2 R2 [R2 ...]] [--R1trim R1TRIM [R1TRIM ...]]
                           [--R2trim R2TRIM [R2TRIM ...]] [--bclen2 BCLEN2]
                           [--chartype {codon}] [--maxmuts MAXMUTS]
                           [--minq MINQ] [--minreads MINREADS]
                           [--minfraccall MINFRACCALL] [--minconcur MINCONCUR]
                           [--sitemask SITEMASK] [--purgeread PURGEREAD]
                           [--purgebc PURGEBC] [--bcinfo] [--bcinfo_csv]
                           --batchfile BATCHFILE --summaryprefix SUMMARYPREFIX
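As a rough illustration of how the required and optional arguments fit together, the sketch below assembles one possible command line. The file names (refseq.fasta, batch.csv), the two --alignspecs values, and the CPU count are hypothetical placeholders, not values from any real dataset.

```python
import shlex

# Hypothetical inputs: substitute your own reference sequence, batch file,
# and subamplicon alignment specs.
args = [
    "dms2_batch_bcsubamp",
    "--refseq", "refseq.fasta",
    "--alignspecs", "1,285,36,37", "286,570,31,32",
    "--batchfile", "batch.csv",
    "--summaryprefix", "summary",
    "--outdir", "results",
    "--ncpus", "4",
    "--use_existing", "yes",
]

# Join into a single shell-safe string for display or logging.
command = shlex.join(args)
print(command)
```

To actually launch the program, the same list could be passed to subprocess.run(args, check=True), assuming dms2_batch_bcsubamp is installed and on the PATH.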

Named Arguments

--outdir

Output files to this directory (create if needed).

--ncpus

Number of CPUs to use, -1 is all available.

Default: -1

Multiple runs of dms2_bcsubamp can be performed in parallel on the different samples specified by --batchfile. This argument determines how many CPUs are used if running multiple jobs.

--use_existing

Possible choices: yes, no

If files with names of expected output already exist, do not re-run.

Default: “no”

-v, --version

show program’s version number and exit

--refseq

Align subamplicons to gene in this FASTA file.

--alignspecs

Subamplicon alignment positions as ‘REFSEQSTART,REFSEQEND,R1START,R2START’. REFSEQSTART is nt (1, 2, … numbering) in ‘refseq’ where nt R1START in R1 aligns. REFSEQEND is nt in ‘refseq’ where nt R2START in R2 aligns.
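The four-field format above can be checked before launching a long run. The following is a minimal sketch, not part of dms_tools2: the AlignSpec type and parse_alignspec helper are hypothetical names, and the only validation assumed is that each spec holds four positive integers.

```python
from typing import NamedTuple


class AlignSpec(NamedTuple):
    """One subamplicon spec, using the field names from the help text."""
    refseqstart: int
    refseqend: int
    r1start: int
    r2start: int


def parse_alignspec(spec: str) -> AlignSpec:
    """Parse 'REFSEQSTART,REFSEQEND,R1START,R2START' into integers."""
    fields = spec.split(",")
    if len(fields) != 4:
        raise ValueError(f"expected 4 comma-separated values, got {spec!r}")
    values = [int(field) for field in fields]
    if any(value < 1 for value in values):
        # Assumption: all positions use 1, 2, ... numbering.
        raise ValueError(f"positions must be >= 1 in {spec!r}")
    return AlignSpec(*values)


# Hypothetical example value.
print(parse_alignspec("1,285,36,37"))
```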

--bclen

Length of NNN… barcode at start of each read. Assumed to be same for R1 and R2; use --bclen2 if this is not the case.

Default: 8

--fastqdir

R1 and R2 files in this directory.

--R2

Read 2 (R2) FASTQ files assumed to have same names as R1 but with ‘_R1’ replaced by ‘_R2’. If that is not the case, provide names here.

--R1trim

Trim R1 from 3’ end to this length. One value for all reads or values for each subamplicon in --alignspecs.

--R2trim

Like ‘--R1trim’, but for R2.

--bclen2

If R1 and R2 have different length barcodes, use --bclen for R1 length and --bclen2 for R2 length.

--chartype

Possible choices: codon

Character type for which we count mutations.

Default: “codon”

--maxmuts

Max allowed mismatches in alignment of subamplicon; mismatches counted in terms of character ‘--chartype’.

Default: 4

--minq

Only call nucleotides with Q score >= this.

Default: 15

--minreads

Require this many reads in a barcode to agree to call consensus nucleotide identity.

Default: 2

--minfraccall

Retain only barcodes where trimmed consensus sequence for each read has >= this frac sites called.

Default: 0.95

--minconcur

Only call consensus identity for barcode when >= this fraction of reads concur.

Default: 0.75
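The interplay of --minreads and --minconcur can be pictured with a small per-site sketch. This is not the dms2_bcsubamp implementation (see the dms2_bcsubamp Algorithm for that); it assumes that uncalled positions are written as N, that --minreads counts reads with a called identity at the site, and that --minconcur is the fraction of those called identities that must agree. The function name call_consensus is hypothetical.

```python
from collections import Counter


def call_consensus(calls, minreads=2, minconcur=0.75):
    """Return a consensus nucleotide for one site, or 'N' if none is called.

    calls: nucleotide calls at a single site, one per read sharing a barcode.
    """
    # Assumption: 'N' marks a position that was not called in a read.
    called = [nt for nt in calls if nt != "N"]
    if len(called) < minreads:
        return "N"
    nt, count = Counter(called).most_common(1)[0]
    if count / len(called) >= minconcur:
        return nt
    return "N"


print(call_consensus(["A", "A", "A", "C"]))  # 3 of 4 reads agree
print(call_consensus(["A", "A", "C"]))       # only 2 of 3 reads agree
```

With the default thresholds, three of four concurring reads are enough to call the site, while two of three are not.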

--sitemask

Use to only consider mutations at a subset of sites. Should be a CSV file with column named site listing all sites to include.

--purgeread

Randomly purge read pairs with this probability to subsample data.

Default: 0

--purgebc

Randomly purge barcodes with this probability to subsample data.

Default: 0

--bcinfo

Create file with suffix ‘bcinfo.txt.gz’ with info about each barcode.

Default: False

--bcinfo_csv

Store ‘bcinfo’ file as a csv with the suffix ‘bcinfo.csv.gz’. Only has an effect if --bcinfo is used.

Default: False

--batchfile

CSV file specifying each dms2_bcsubamp run. Must have these columns: name, R1. Can optionally have columns R1trim and R2trim with spaces delimiting subamplicon-specific trimming. If R1trim / R2trim in batch file, do not also give values for --R1trim and --R2trim. Other columns are ignored, so other dms2_bcsubamp args should be passed as separate command line args rather than in --batchfile.
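For illustration, a batch file with the two required columns could be generated as in the sketch below. The sample names and FASTQ file names are hypothetical, and write_batchfile is a helper invented for this example rather than part of dms_tools2.

```python
import csv
import io

# Hypothetical samples: each row needs at least 'name' and 'R1'.
samples = [
    {"name": "mutDNA", "R1": "mutDNA_R1.fastq.gz"},
    {"name": "mutvirus", "R1": "mutvirus_R1.fastq.gz"},
]


def write_batchfile(rows, handle):
    """Write rows as CSV with the required 'name' and 'R1' columns."""
    writer = csv.DictWriter(handle, fieldnames=["name", "R1"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer here; use open("batch.csv", "w", newline="")
# to write a real file instead.
buffer = io.StringIO()
write_batchfile(samples, buffer)
batch_text = buffer.getvalue()
print(batch_text)
```

Optional R1trim and R2trim columns would be added to fieldnames in the same way, with space-delimited values when the trimming differs between subamplicons.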

--summaryprefix

Prefix of output summary plots.

As detailed in Output files below, dms2_batch_bcsubamp creates a variety of plots summarizing the output. These files are in the directory specified by --outdir, and have the prefix specified here. This prefix should only contain letters, numbers, dashes, and spaces. Underscores are not allowed as they are a LaTeX special character.
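The character rule for the prefix is easy to check ahead of time. The sketch below encodes only the rule as stated here (letters, numbers, dashes, and spaces); the helper name is hypothetical, and restricting letters to ASCII is an assumption.

```python
import re

# Assumption: "letters" means ASCII letters; dashes and spaces are allowed.
_ALLOWED_PREFIX = re.compile(r"[A-Za-z0-9\- ]+")


def is_valid_summaryprefix(prefix: str) -> bool:
    """Return True if prefix has only letters, numbers, dashes, and spaces."""
    return _ALLOWED_PREFIX.fullmatch(prefix) is not None


print(is_valid_summaryprefix("summary"))     # True
print(is_valid_summaryprefix("my_summary"))  # False: contains an underscore
```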

Output files

Running dms2_batch_bcsubamp produces a variety of output files, all of which will be found in the directory specified by --outdir.

Results for each sample

The program dms2_bcsubamp is run on each sample specified by --batchfile, so you will create all of the dms2_bcsubamp Output files.

Summary files

Plots are created that summarize the output for all samples specified by --batchfile. These files have the prefix specified by --summaryprefix. So for instance, if you run dms2_batch_bcsubamp with the arguments --outdir results --summaryprefix summary then these files will have the prefix ./results/summary. They will have the suffixes listed below:

  • .log: a text file that logs the progress of the program.

  • _readstats.pdf: plot of reads for each sample.

  • _bcstats.pdf: plot of barcodes for each sample.

  • _readsperbc.pdf: plot of the distribution of the number of reads per barcode for each sample.

  • _depth.pdf: plot of number of counts called at each site for each sample.

  • _mutfreq.pdf: plot of mutation frequency at each site for each sample.

  • _codonmuttypes.pdf: plot of average frequency of different types of codon mutations.

  • _codonmuttypes.csv: numerical data in _codonmuttypes.pdf.

  • _codonntchanges.pdf: plot of average frequency of codon mutations with different numbers of nucleotide changes.

  • _singlentchanges.pdf: plot of frequencies of different types of nucleotide mutations among codons with just one nucleotide change.

  • _cumulmutcounts.pdf: plot of the fraction of mutations that occur ≤ a given number of times.

Examples and more detailed explanations of these plots can be found in the Doud2016 example.
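As a small sketch of the naming scheme described above, the expected summary file paths can be built from --outdir and --summaryprefix. Only a subset of the suffixes is listed and the rest can be added the same way; the function name summary_files is hypothetical.

```python
import os

# A subset of the suffixes listed above; extend as needed.
SUFFIXES = [
    ".log",
    "_readstats.pdf",
    "_bcstats.pdf",
    "_readsperbc.pdf",
    "_depth.pdf",
    "_mutfreq.pdf",
]


def summary_files(outdir, summaryprefix):
    """Return expected summary file paths for the given outdir and prefix."""
    return [os.path.join(outdir, summaryprefix + suffix)
            for suffix in SUFFIXES]


for path in summary_files("results", "summary"):
    # Report whether each expected file is present after a run.
    status = "found" if os.path.exists(path) else "missing"
    print(f"{path}: {status}")
```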

Memory usage

As described in the Memory usage section for dms2_bcsubamp, each run of that program can consume substantial memory. Running several instances in parallel with dms2_batch_bcsubamp increases memory use accordingly, so the number of parallel jobs set by --ncpus may need to be reduced if memory is limiting.