dms2_bcsubamp

Overview

The dms2_bcsubamp program processes the FASTQ files generated by Barcoded-subamplicon sequencing to count the frequencies of each identity at each site and generate summary statistics.

In general, you will probably have an easier time if you use the dms2_batch_bcsubamp program rather than running dms2_bcsubamp directly.

Algorithm for assembling and aligning subamplicons

The algorithm implemented by dms2_bcsubamp is as follows:

  1. Read pairs are discarded if either read fails the Illumina chastity filter.

  2. All sites in each read are classified as “high quality” if the site is not an N (ambiguous nucleotide) and has a Q-score of at least --minq, or as “low quality” otherwise. At low-quality sites, the identity in the read is then set to N for the remainder of the analysis.

  3. We discard any read pairs for which there are any N nucleotides in the barcode.

  4. We then collect all remaining read pairs for each barcode (concatenating the barcodes on the paired reads for a total barcode length that is twice --bclen) and perform all subsequent operations on these barcodes. Note that if you have different length barcodes on the two reads (or all the barcode on one read), you can use the --bclen2 option below.

  5. Any barcode with less than --minreads read pairs is discarded.

  6. The R1 and R2 reads for each barcode are each assembled into consensus sequences. The consensus sequences is built by applying the following algorithm to each site:

    1. We collect the identities for all reads, ignoring N nucleotides.

    2. If there are less than --minreads called (non N) nucleotides, set the consensus identity for that site to N.

    3. If there are at least --minreads called nucleotides, set the consensus identity to N if the fraction of called identities that are identical is less than --minconcur, and set it to the consensus called identity if at least --minconcur of the reads are identical at the site.

  7. We attempt to align the consensus read sequences for each barcode to subamplicon positions specified by --alignspecs. The attempted alignment does not accommodate gaps; the consensus sequence must align gaplessly at the position specified by --alignspecs. In any region where the reads overlap (which may happen near the center of the subamplicon), if both consensus reads report high quality nucleotides, the identity is considered ambiguous if the sequences disagree. If one consensus read is ambiguous, we take the identity for the non-ambiguous read. The alignment procedure is as follows:

    1. We trim the reads from the 3’ end using the --R1trim and --R2trim parameters for that subamplicon.

    2. We then attempt to align the consensus reads at each position indicated in --alignspecs. We consider an alignment acceptable if there are no more than --maxmuts mutations relative to refseq (with mutations counted in terms of character --chartype) and if the fraction of called nucleotides in the aligned consensus is at least --minfraccall. We take the first valid alignment that we find. But unless you are working with a very repetitive sequence or specify a very large --maxmuts, there should be at most one valid alignment.

  8. For each aligned subamplicon, we count the identities at each site. We call an identity if at least --minreads reads with non-ambiguous identities span that site, and at least --minconcur of these agree on the identity. If both reads span a site and they disagree, we call that site as N.

  9. The results are reported in the Output files.

Command-line usage

Align barcoded subamplicons and count mutations. Part of dms_tools2 (version 2.6.6) written by the Bloom Lab.

usage: dms2_bcsubamp [-h] [--outdir OUTDIR] [--ncpus NCPUS]
                     [--use_existing {yes,no}] [-v] --refseq REFSEQ
                     --alignspecs ALIGNSPECS [ALIGNSPECS ...] [--bclen BCLEN]
                     [--fastqdir FASTQDIR] [--R2 R2 [R2 ...]]
                     [--R1trim R1TRIM [R1TRIM ...]]
                     [--R2trim R2TRIM [R2TRIM ...]] [--bclen2 BCLEN2]
                     [--chartype {codon}] [--maxmuts MAXMUTS] [--minq MINQ]
                     [--minreads MINREADS] [--minfraccall MINFRACCALL]
                     [--minconcur MINCONCUR] [--sitemask SITEMASK]
                     [--purgeread PURGEREAD] [--purgebc PURGEBC] [--bcinfo]
                     [--bcinfo_csv] --name NAME --R1 R1 [R1 ...]

Named Arguments

--outdir

Output files to this directory (create if needed).

--ncpus

Number of CPUs to use, -1 is all available.

Default: -1

--use_existing

Possible choices: yes, no

If files with names of expected output already exist, do not re-run.

Default: “no”

-v, --version

show program’s version number and exit

--refseq

Align subamplicons to gene in this FASTA file.

--alignspecs

Subamplicon alignment positions as ‘REFSEQSTART,REFSEQEND,R1START,R2START’. REFSEQSTART is nt (1, 2, … numbering) in ‘refseq’ where nt R1START in R1 aligns. REFSEQEND is nt in ‘refseq’ where nt R2START in R2 aligns.’

It is important to set --alignspecs so that you don’t count the part of the subamplicon that is the barocde or in the primer binding site, since the nucleotide identities in this region do not come from the templates being sequenced. See Barcoded-subamplicon sequencing for an example.

The alignments will fail if you don’t set --alignspecs exactly correctly for each subamplicon. So if you have reads failing to align to part of your gene, carefully check --alignspecs to make sure it is correct.

--bclen

Length of NNN… barcode at start of each read. Assumed to be same for R1 and R2, use –bclen2 if this is not the case.

Default: 8

This is the length of the barcode on each primer. So a value of 8 corresponds to a total of 16 N nucleotides on the two primers. See Barcoded-subamplicon sequencing for an example of primers with 8 N nucleotides.

--fastqdir

R1 and R2 files in this directory.

This option can be useful if the --R1 / --R2 files are found in a common directory. Instead of repeatedly listing that directory name, you can just provide it here and then give the file (or subdirectory/file) names for the files via --R1 / --R2.

--R2

Read 2 (R2) FASTQ files assumed to have same names as R1 but with ‘_R1’ replaced by ‘_R2’. If that is not case, provide names here.

Most pipelines for generating Illumina FASTQ files have the read 1 sequences in a file that contains the string _R1 and the read 2 sequences in a file that contains the string _R2. If this is the case, the R2 file name can just be guessed from the R1 file name. However, you can use this option if your R2 files have a different name that you need to specify manually.

--R1trim

Trim R1 from 3’ end to this length. One value for all reads or values for each subamplicon in --alignspecs.

Often your reads will be longer than needed, so it is helpful to trim the low-quality nucleotides that tend to be at the end of long readfs.

If you specify one number, then all R1 reads are trimmed to this length. If you specify a list of numbers equal to the entries for --alignspecs, then each read is trimmed differently depending on which subamplicon it aligns to.

Note that the trimming is done to the initial read, and so the trimmed length specifies here includes the barcode, the primer binding site, and then the part of the read that provides useful sequence information.

--R2trim

Like ‘–R1trim’, but for R2.

--bclen2

If R1 and R2 have different length barcodes, use –bclen for R1 length and –bclen2 for R2 length.

--chartype

Possible choices: codon

Character type for which we count mutations.

Default: “codon”

If chartype is codon then --refseq must provide a valid coding sequence (length a multiple of 3).

--maxmuts

Max allowed mismatches in alignment of subamplicon; mismatches counted in terms of character ‘–chartype’.

Default: 4

--minq

Only call nucleotides with Q score >= this.

Default: 15

--minreads

Require this many reads in a barcode to agree to call consensus nucleotide identity.

Default: 2

--minfraccall

Retain only barcodes where trimmed consensus sequence for each read has >= this frac sites called.

Default: 0.95

--minconcur

Only call consensus identity for barcode when >= this fraction of reads concur.

Default: 0.75

--sitemask

Use to only consider mutations at a subset of sites. Should be a CSV file with column named site listing all sites to include.

--purgeread

Randomly purge read pairs with this probability to subsample data.

Default: 0

Why would you want to purge some of the read pairs? You may be trying to determine whether sequencing to a higher depth will improve your results. If you set --purgeread to a value > 0 (say 0.5), you’ll see how the results would be affected if you had fewer reads. If these results are noticeably worse, this supports that idea that you might be in regime where more reads would help.

--purgebc

Randomly purge barcodes with this probability to subsample data.

Default: 0

This option differs from --purgeread in that it purges barcodes rather than reads. So this gives you some indication of how your results would change if you bottlenecked to fewer unique molecules prior to the round 2 PCR to attach the barcodes to each molecule.

--bcinfo

Create file with suffix ‘bcinfo.txt.gz’ with info about each barcode.

Default: False

This will be a very large file and creating it will take some time, so only use this option if you need to look at this file for debugging.

--bcinfo_csv

Store ‘bcinfo’ file as a csv with the suffix ‘bcinfo.csv.gz’. Only has an effect if –bcinfo is used.

Default: False

--name

Sample name used for output files.

The Output files will have a prefix equal to the name specified here. This name should only contain letters, numbers, dashes, and spaces. Underscores are not allowed as they are a LaTex special character.

--R1

Read 1 (R1) FASTQ files, can be gzipped. See also ‘–fastqdir’.

You can specify multiple files using the * wildcard character, as in reads_R1_L*.fastq.gz.

If you have multiple files that are all in the same directory, it may be convenient to specify this directory using --fastqdir and then avoid repeatedly listing the directory name for each file.

Output files

The program produces a variety of output files. These files all have the prefix specified by --outdir and --name. For instance, if you use --outdir results --name sample-1, then the output files will have the prefix ./results/sample-1 and the suffixes described below.

Here are the specific output files:

Log file

This file has the suffix .log. It is a text file that logs the progress of the program.

Read statistics file

This file has the suffix _readstats.csv. It gives the statistics on all the processed reads. For instance:

fail filter,low Q barcode,total
0,785663,12691176

Reads-per-barcode file

This file has the suffix _readsperbc.csv. It gives the number of reads associated with each barcode. For instance:

number of reads,number of barcodes
1,1248294
2,400579
3,471893
4,456661
5,376158
6,272020
7,174295
8,102660
9,56302
10,28079
11,13125
12,5747
13,2518
14,1035
15,403
16,177
17,78
18,34
19,12
20,9
21,1
22,2
24,1
25,1
75,1

Barcode statistics file

This file has the suffix _bcstats.csv. It provides information on how many of the barcodes could be successfully aligned. For instance:

aligned,not alignable,too few reads,total
2129522,232269,1248294,3610085

Counts file

This is output file that has the results that you will probably use for subsequent analyses. It has the suffix _codoncounts.csv if you are using --chartype codon. It gives the number of called identities at each site in the sequence, as well as the wildtype sequence. For instance, here are the first few lines:

site,wildtype,AAA,AAC,AAG,AAT,ACA,ACC,ACG,ACT,AGA,AGC,AGG,AGT,ATA,ATC,ATG,ATT,CAA,CAC,CAG,CAT,CCA,CCC,CCG,CCT,CGA,CGC,CGG,CGT,CTA,CTC,CTG,CTT,GAA,GAC,GAG,GAT,GCA,GCC,GCG,GCT,GGA,GGC,GGG,GGT,GTA,GTC,GTG,GTT,TAA,TAC,TAG,TAT,TCA,TCC,TCG,TCT,TGA,TGC,TGG,TGT,TTA,TTC,TTG,TTT
1,ATG,0,0,0,0,0,0,2,0,0,0,0,0,8,0,333985,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,AAG,16,20,333132,41,13,12,27,14,8,6,67,8,9,13,29,9,10,11,12,8,10,15,15,11,6,9,3,7,8,10,17,4,3,7,49,7,9,14,9,4,10,7,7,7,9,11,11,5,14,14,11,6,13,16,15,14,9,9,15,8,9,11,8,15
3,GCA,2,3,8,3,34,11,7,6,7,6,9,8,4,3,5,0,6,14,10,12,6,8,7,10,5,11,7,6,6,1,3,12,19,6,11,9,333250,10,6,9,15,3,5,5,37,9,9,7,8,4,8,3,23,5,7,8,6,11,7,10,7,9,3,6

Detailed barcode information file

This file has the suffix _bcinfo.txt.gz. It is a very large gzipped text file that contains information on all the reads and barcodes. The format should be self explanatory. This file is only created if you use the --bcinfo option, and may be helpful for debugging if your reads aren’t aligning as expected.

Memory usage

dms2_bcsubamp stores all of the reads in the FASTQ files in memory. Therefore, it uses a substantial amount of memory, typically around a gigabyte per million paired-end sequencing reads. However, for typical data sets such memory usage is well within the capacity of modern large-memory nodes.