dms_barcodedsubamplicons

Overview

This program can process the FASTQ reads generated by Barcoded subamplicon sequencing to count the frequencies of each identity at each site, and to generate summary statistics.

If you do such an analysis for several samples, you can summarize the results with dms_summarizealignments.

After you install dms_tools, this program will be available to run at the command line.

Barcoded subamplicon sequencing

Barcoded subamplicon sequencing involves using PCR to make a series of subamplicons from the gene to be sequenced. Each of these subamplicons contains a random barcode (a string of N nucleotides) at each end that provides a unique identifier. The barcoded subamplicons are then diluted such that the number of unique ssDNA molecules is less than the read depth, and then this dilution is amplified by primers that also add the Illumina adaptors. Finally, these molecules are sequenced.

In analyzing the sequencing, we can then identify barcodes that are sequenced multiple times, and a consensus sequence for the subamplicon can be built from the multiple reads. The dms_barcodedsubamplicons script performs this analysis.

To understand how the sequencing is done in detail, example primers are shown below for sequencing of the hemagglutinin (HA) from A/WSN/1933 (H1N1) influenza:

Rnd1for1: 5'-CTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNaagcaggggaaaataaaaacaaccaaa-3'                                                                                                                                                                                                                                                                                                                                                             Rnd1for427: 5'-CTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNccaaggaaagttcatggcccaac-3'                                                                                                                                                                                                                                                                                                                                                           Rnd1for850: 5'-CTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNgtttgagtccggcatcatcacc-3'                                                                                                                                                                                                                                                                                                                                              Rnd1for1276: 5'-CTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNcaacaacttagaaaaaaggatggaaaatttaaataaa-3'
    WSN HA with coding sequence in caps: 5'-agcaaaagcaggggaaaataaaaacaaccaaaATGAAGGCAAAACTACTGGTCCTGTTATATGCATTTGTAGCTACAGATGCAGACACAATATGTATAGGCTACCATGCGAACAACTCAACCGACACTGTTGACACAATACTCGAGAAGAATGTGGCAGTGACACATTCTGTTAACCTGCTCGAAGACAGCCACAACGGGAAACTATGTAAATTAAAAGGAATAGCCCCACTACAATTGGGGAAATGTAACATCACCGGATGGCTCTTGGGAAATCCAGAATGCGACTCACTGCTTCCAGCGAGATCATGGTCCTACATTGTAGAAACACCAAACTCTGAGAATGGAGCATGTTATCCAGGAGATCTCATCGACTATGAGGAACTGAGGGAGCAATTGAGCTCAGTATCATCATTAGAAAGATTCGAAATATTTCCCAAGGAAAGTTCATGGCCCAACCACACATTCAACGGAGTAACAGTATCATGCTCCCATAGGGGAAAAAGCAGTTTTTACAGAAATTTGCTATGGCTGACGAAGAAGGGGGATTCATACCCAAAGCTGACCAATTCCTATGTGAACAATAAAGGGAAAGAAGTCCTTGTACTATGGGGTGTTCATCACCCGTCTAGCAGTGATGAGCAACAGAGTCTCTATAGTAATGGAAATGCTTATGTCTCTGTAGCGTCTTCAAATTATAACAGGAGATTCACCCCGGAAATAGCTGCAAGGCCCAAAGTAAGAGATCAACATGGGAGGATGAACTATTACTGGACCTTGCTAGAACCCGGAGACACAATAATATTTGAGGCAACTGGTAATCTAATAGCACCATGGTATGCTTTCGCACTGAGTAGAGGGTTTGAGTCCGGCATCATCACCTCAAACGCGTCAATGCATGAGTGTAACACGAAGTGTCAAACACCCCAGGGAGCTATAAACAGCAATCTCCCTTTCCAGAATATACACCCAGTCACAATAGGAGAGTGCCCAAAATATGTCAGGAGTACCAAATTGAGGATGGTTACAGGACTAAGAAACATCCCATCCATTCAATACAGAGGTCTATTTGGAGCCATTGCTGGTTTTATTGAGGGGGGATGGACTGGAATGATAGATGGATGGTATGGTTATCATCATCAGAATGAACAGGGATCAGGCTATGCAGCGGATCAAAAAAGCACACAAAATGCCATTAACGGGATTACAAACAAGGTGAACTCTGTTATCGAGAAAATGAACACTCAATTCACAGCTGTGGGTAAAGAATTCAACAACTTAGAAAAAAGGATGGAAAATTTAAATAAAAAAGTTGATGATGGGTTTCTGGACATTTGGACATATAATGCAGAATTGTTAGTTCTACTGGAAAATGAAAGGACTTTGGATTTCCATGACTTAAATGTGAAGAATCTGTACGAGAAAGTAAAAAGCCAATTAAAGAATAATGCCAAAGAAATCGGAAATGGGTGTTTTGAGTTCTACCACAAGTGTGACAATGAATGCATGGAAAGTGTAAGAAATGGGACTTATGATTATCCAAAATATTCAGAAGAATCAAAGTTGAACAGGGAAAAGATAGATGGAGTGAAATTGGAATCAATGGGGGTGTATCAGATTCTGGCGATCTACTCAACTGTCGCCAGTTCACTGGTGCTTTTGGTCTCCCTGGGGGCAATCAGTTTCTGGATGTGTTCTAATGGGTCTTTGCAGTGCAGAATATGCATCTGAgattaggatttcagaaatataaggaaaaacacccttgtttctact-3'
    WSN HA with coding sequence in caps: 3'-tcgttttcgtccccttttatttttgttggtttTACTTCCGTTTTGATGACCAGGACAATATACGTAAACATCGATGTCTACGTCTGTGTTATACATATCCGATGGTACGCTTGTTGAGTTGGCTGTGACAACTGTGTTATGAGCTCTTCTTACACCGTCACTGTGTAAGACAATTGGACGAGCTTCTGTCGGTGTTGCCCTTTGATACATTTAATTTTCCTTATCGGGGTGATGTTAACCCCTTTACATTGTAGTGGCCTACCGAGAACCCTTTAGGTCTTACGCTGAGTGACGAAGGTCGCTCTAGTACCAGGATGTAACATCTTTGTGGTTTGAGACTCTTACCTCGTACAATAGGTCCTCTAGAGTAGCTGATACTCCTTGACTCCCTCGTTAACTCGAGTCATAGTAGTAATCTTTCTAAGCTTTATAAAGGGTTCCTTTCAAGTACCGGGTTGGTGTGTAAGTTGCCTCATTGTCATAGTACGAGGGTATCCCCTTTTTCGTCAAAAATGTCTTTAAACGATACCGACTGCTTCTTCCCCCTAAGTATGGGTTTCGACTGGTTAAGGATACACTTGTTATTTCCCTTTCTTCAGGAACATGATACCCCACAAGTAGTGGGCAGATCGTCACTACTCGTTGTCTCAGAGATATCATTACCTTTACGAATACAGAGACATCGCAGAAGTTTAATATTGTCCTCTAAGTGGGGCCTTTATCGACGTTCCGGGTTTCATTCTCTAGTTGTACCCTCCTACTTGATAATGACCTGGAACGATCTTGGGCCTCTGTGTTATTATAAACTCCGTTGACCATTAGATTATCGTGGTACCATACGAAAGCGTGACTCATCTCCCAAACTCAGGCCGTAGTAGTGGAGTTTGCGCAGTTACGTACTCACATTGTGCTTCACAGTTTGTGGGGTCCCTCGATATTTGTCGTTAGAGGGAAAGGTCTTATATGTGGGTCAGTGTTATCCTCTCACGGGTTTTATACAGTCCTCATGGTTTAACTCCTACCAATGTCCTGATTCTTTGTAGGGTAGGTAAGTTATGTCTCCAGATAAACCTCGGTAACGACCAAAATAACTCCCCCCTACCTGACCTTACTATCTACCTACCATACCAATAGTAGTAGTCTTACTTGTCCCTAGTCCGATACGTCGCCTAGTTTTTTCGTGTGTTTTACGGTAATTGCCCTAATGTTTGTTCCACTTGAGACAATAGCTCTTTTACTTGTGAGTTAAGTGTCGACACCCATTTCTTAAGTTGTTGAATCTTTTTTCCTACCTTTTAAATTTATTTTTTCAACTACTACCCAAAGACCTGTAAACCTGTATATTACGTCTTAACAATCAAGATGACCTTTTACTTTCCTGAAACCTAAAGGTACTGAATTTACACTTCTTAGACATGCTCTTTCATTTTTCGGTTAATTTCTTATTACGGTTTCTTTAGCCTTTACCCACAAAACTCAAGATGGTGTTCACACTGTTACTTACGTACCTTTCACATTCTTTACCCTGAATACTAATAGGTTTTATAAGTCTTCTTAGTTTCAACTTGTCCCTTTTCTATCTACCTCACTTTAACCTTAGTTACCCCCACATAGTCTAAGACCGCTAGATGAGTTGACAGCGGTCAAGTGACCACGAAAACCAGAGGGACCCCCGTTAGTCAAAGACCTACACAAGATTACCCAGAAACGTCACGTCTTATACGTAGACTctaatcctaaagtctttatattcctttttgtgggaacaaagatga-5'
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Rnd1rev426: 3'-gtgtgtaagttgcctcattgtcatagtacNNNNNNNNTCTAGCCTTCTCGTGTGCAGACTTGAGG-5'                                                                                                                                                                                                                                                                                                                                                    Rnd1rev849: 3'-agtttgcgcagttacgtactcacNNNNNNNNTCTAGCCTTCTCGTGTGCAGACTTGAGG-5'                                                                                                                                                                                                                                                                                                                                                                 Rnd1rev1275: 3'-actactacccaaagacctgtaaaNNNNNNNNTCTAGCCTTCTCGTGTGCAGACTTGAGG-5'                                                                                                                                                                                                                                                                                                                                                    Rnd1rev1698: 3'-ctaatcctaaagtctttatattcctttttgtgggaaNNNNNNNNTCTAGCCTTCTCGTGTGCAGACTTGAGG-5'

Briefly, the steps are as follows:

  1. The library of HA genes is subjected to PCR with primers to create subamplicons of \(\approx 425\) internal nucleotides each as well as additional adaptor and primer-binding sequences. In the example above, PCRs are performed with the following primer pairs:

    • Rnd1for1 and Rnd1rev426: these primers create a subamplicon spanning nucleotides 1 to 426 (codons 1 to 142) of the WSN HA, which is the part of the sequence shown in caps above.
    • Rnd1for427 and Rnd1rev849: these primers create a subamplicon spanning nucleotides 427 to 849 (codons 143 to 283).
    • Rnd1for850 and Rnd1rev1275: these primers create a subamplicon spanning nucleotides 850 to 1275 (codons 284 to 425).
    • Rnd1for1276 and Rnd1rev1698: these primers create a subamplicon spanning nucleotides 1276 to 1698 (codons 426 to 566).
  2. The subamplicons are then diluted so that the total number of ssDNA molecules is less than the sequencing depth. Since each subamplicon molecule has 8 N nucleotides at each end, there are \(4^{16} \approx 4.3 \times 10^9\) unique barcodes. Since the number of unique barcodes is much greater than the total number of molecules in the dilution, each barcode will generally be associated with just one molecule.

  3. The barcoded subamplicons are further amplified by primers that add the Illumina adaptors, and perhaps Illumina indices for multiplexing. For instance, here are two example primers for this second round of PCR:

    >Rnd2forUniversal: forward primer for the round 2 PCRs that adds the Illumina adaptor
    AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCC
    
    >Rnd2revIndex: reverse primer for the round 2 PCRs that adds the TruSeq index 1 (in lower case) as well as the Illumina adaptor
    CAAGCAGAAGACGGCATACGAGATcgtgatGTGACTGGAGTTCAGACGTGTGCTCTTCC
    
  4. The products of the round 2 PCRs are sequenced with overlapping paired-end reads of sufficient length to cover the entire subamplicon, and the sequencing data are analyzed with the standard Illumina pipeline to generate FASTQ files with the R1 and R2 reads.

  5. The FASTQ files are analyzed with dms_barcodedsubamplicons. For the example here, the command might be:

    dms_barcodedsubamplicons WSN_HA_ WSN-HA.fasta R1.fastq.gz R2.fastq.gz 1,426,36,38 427,849,32,32 850,1275,31,37 1276,1698,46,45
    

    The meanings of the arguments are detailed in the Command-line usage. Briefly:

    • WSN_HA_ is the prefix attached to the created output files.

    • WSN-HA.fasta should be the name of a FASTA file containing the part of the gene that we are sequencing. In this example, it would be only the capitalized portion of the sequence shown above (as the lowercase parts are simply flanking sequences for primer binding). For instance, here is such a file:

      >WSN HA coding sequence
      ATGAAGGCAAAACTACTGGTCCTGTTATATGCATTTGTAGCTACAGATGCAGACACAATATGTATAGGCTACCATGCGAACAACTCAACCGACACTGTTGACACAATACTCGAGAAGAATGTGGCAGTGACACATTCTGTTAACCTGCTCGAAGACAGCCACAACGGGAAACTATGTAAATTAAAAGGAATAGCCCCACTACAATTGGGGAAATGTAACATCACCGGATGGCTCTTGGGAAATCCAGAATGCGACTCACTGCTTCCAGCGAGATCATGGTCCTACATTGTAGAAACACCAAACTCTGAGAATGGAGCATGTTATCCAGGAGATCTCATCGACTATGAGGAACTGAGGGAGCAATTGAGCTCAGTATCATCATTAGAAAGATTCGAAATATTTCCCAAGGAAAGTTCATGGCCCAACCACACATTCAACGGAGTAACAGTATCATGCTCCCATAGGGGAAAAAGCAGTTTTTACAGAAATTTGCTATGGCTGACGAAGAAGGGGGATTCATACCCAAAGCTGACCAATTCCTATGTGAACAATAAAGGGAAAGAAGTCCTTGTACTATGGGGTGTTCATCACCCGTCTAGCAGTGATGAGCAACAGAGTCTCTATAGTAATGGAAATGCTTATGTCTCTGTAGCGTCTTCAAATTATAACAGGAGATTCACCCCGGAAATAGCTGCAAGGCCCAAAGTAAGAGATCAACATGGGAGGATGAACTATTACTGGACCTTGCTAGAACCCGGAGACACAATAATATTTGAGGCAACTGGTAATCTAATAGCACCATGGTATGCTTTCGCACTGAGTAGAGGGTTTGAGTCCGGCATCATCACCTCAAACGCGTCAATGCATGAGTGTAACACGAAGTGTCAAACACCCCAGGGAGCTATAAACAGCAATCTCCCTTTCCAGAATATACACCCAGTCACAATAGGAGAGTGCCCAAAATATGTCAGGAGTACCAAATTGAGGATGGTTACAGGACTAAGAAACATCCCATCCATTCAATACAGAGGTCTATTTGGAGCCATTGCTGGTTTTATTGAGGGGGGATGGACTGGAATGATAGATGGATGGTATGGTTATCATCATCAGAATGAACAGGGATCAGGCTATGCAGCGGATCAAAAAAGCACACAAAATGCCATTAACGGGATTACAAACAAGGTGAACTCTGTTATCGAGAAAATGAACACTCAATTCACAGCTGTGGGTAAAGAATTCAACAACTTAGAAAAAAGGATGGAAAATTTAAATAAAAAAGTTGATGATGGGTTTCTGGACATTTGGACATATAATGCAGAATTGTTAGTTCTACTGGAAAATGAAAGGACTTTGGATTTCCATGACTTAAATGTGAAGAATCTGTACGAGAAAGTAAAAAGCCAATTAAAGAATAATGCCAAAGAAATCGGAAATGGGTGTTTTGAGTTCTACCACAAGTGTGACAATGAATGCATGGAAAGTGTAAGAAATGGGACTTATGATTATCCAAAATATTCAGAAGAATCAAAGTTGAACAGGGAAAAGATAGATGGAGTGAAATTGGAATCAATGGGGGTGTATCAGATTCTGGCGATCTACTCAACTGTCGCCAGTTCACTGGTGCTTTTGGTCTCCCTGGGGGCAATCAGTTTCTGGATGTGTTCTAATGGGTCTTTGCAGTGCAGAATATGCATCTGA
      
    • R1.fasta.gz and R2.fastq.gz are the FASTQ files with the R1 and R2 reads.

    • The remaining arguments specify where the subamplicons might align For instance, 1,426,36,38 means that for the first subamplicon, the first sequenced nucleotide (first one outside primer binding site) aligns to nucleotide 1 of the reference sequence in WSN-HA.fasta, the last sequence nucleotide aligns to nucleotide 426 of the reference sequence, the first nucleotide in R1 that we want to consider in the alignment is site 36 (there is an 8 nucleotide barcode plus a 27-nucleotide primer-binding site), and the first nucleotide in R2 that we want to consider in the alignment is site 38 (there is an 8-nucleotide barcode plus a 29-nucleotide primer-binding site). The arguments 427,849,32,32, 850,1275,31,37, and 1276,1698,46,45 are the similar specifications for the other three subamplicons.

Algorithm for assembling and aligning subamplicons

The algorithm implemented by dms_barcodedsubamplicons is as follows:

  1. Any read pair in which either read fails the Illumina chastity filter is discarded.

  2. All sites in each read pair are classified as “high quality” if the site is not an N (ambiguous nucleotide) and has a Q-score of at least --minq, or as “low quality” otherwise. For all low-quality sites, the called identity in the read is simply set to N (so we treat a site with a Q-score of less than --minq as equivalent to an N).

  3. We discard any read pairs for which any of the following conditions are met:

    • There is a low-quality site in the barcode region of either read.
    • Either read has more than --maxlowqfrac of its nucleotides that are low quality.
  4. We then collect all remaining read pairs for each barcode (concatenating the barcodes on the paired reads to give a total barcode length that is twice --barcodelength) and perform all subsequent operations on these barcodes.

  5. Any barcode with less than --minreadsperbarcode read pairs is discarded.

  6. The R1 and R2 reads for each remaining barcode are assembled into consensus sequences. At each site, these consensus sequences have the consensus nucleotide if that nucleotide is found in at least --minreadconcurrence of the reads; otherwise the nucleotide is set to N. During the process of building these consensus sequences, the barcode is discarded if any of the following conditions are met for either the R1 or R2 read:

    • The reads cannot all be made the same length by trimming no more --maxreadtrim from the 3’ end.
    • Any of the reads fail to have at least --minreadidentity identical and high-quality (not N) nucleotides when compared pairwise.
  7. For each remaining barcode, we attempt to align the consensus sequence to each of the subamplicons specified by alignspecs. The attempted alignment does not accommodate gaps; the consensus sequence for the reads is required to align gaplessly at the position specified by alignspecs. In any region where the reads overlap (which may happen near the center of the subamplicon), if both consensus sequences (R1 and R2) report high quality nucleotides, the identity is considered ambiguous if the sequences disagree. An alignment is considered valid only if the following conditions are met:

    • The total fraction of non-high-quality nucleotides in the aligned region exceeds --maxlowqfrac.
    • The subamplicon alignment has no more than --maxmuts mutations relative to refseq when mutations are counted in terms of --chartype (so if --chartype is codon, this is the number of codon mutations).

    If the above criteria are met, then the identities (wildtype or mutant) are counted for each site of --chartype where the are no ambiguous nucleotides in the alignment, and this barcoded subamplicon contributes to the totals reported in the counts.txt output file. If the alignment criteria are not met, the barcode is discarded as un-alignable. Note that unless you are working with an extremely repetitive gene or specify an extremely large value of --maxmuts relative to the read length, spurious alignments should not occur.

Command-line usage

Gathers barcoded subamplicons, aligns to reference sequence, and counts mutations. This script is part of dms_tools (version 1.1.20) written by the Bloom Lab (see https://github.com/jbloomlab/dms_tools/graphs/contributors for all contributors). Detailed documentation is at http://jbloomlab.github.io/dms_tools/

usage: dms_barcodedsubamplicons [-h] [--barcodelength BARCODELENGTH]
                                [--chartype {codon}] [--maxmuts MAXMUTS]
                                [--minq MINQ] [--maxlowqfrac MAXLOWQFRAC]
                                [--minreadsperbarcode MINREADSPERBARCODE]
                                [--minreadidentity MINREADIDENTITY]
                                [--minreadconcurrence MINREADCONCURRENCE]
                                [--maxreadtrim MAXREADTRIM]
                                [--R1_is_antisense]
                                [--R1trimlength R1TRIMLENGTH]
                                [--R2trimlength R2TRIMLENGTH] [--barcodeinfo]
                                [--purgereadfrac PURGEREADFRAC]
                                [--purgefrac PURGEFRAC] [--seed SEED] [-v]
                                outprefix refseq r1files r2files alignspecs
                                [alignspecs ...]

Positional Arguments

outprefix

Prefix for output files. The suffixes of the created files are: “counts.txt” - character counts at each site; “summarystats.txt” - barcode summary stats; “.log” - file that logs progress; “barcodeinfo.txt.gz” - lists all barcodes if using “–barcodeinfo”. Existing files with these names are removed.

For instance, if outprefix is WSN_HA_ then the created output files will be WSN_HA_counts.txt, WSN_HA_summarystats.txt, WSN_HA.log, and WSN_HA_barcodeinfo.txt.gz if you use the –barcodeinfo option. The format of the output files is described in Output files.

refseq Existing FASTA file containing gene to which we are aligning subamplicons and counting mutations.
r1files Comma-separated list of R1 FASTQ files (no spaces). Files can optionally be gzipped (extension .gz).
r2files Like “r1files” but R2 files. Must be same number of comma-separated entries as for “r1files”.
alignspecs

This argument is repeated to specify each subamplicon. Each occurrence is four comma-delimited integers (no spaces): “REFSEQSTART,REFSEQEND,R1START,R2START”. REFSEQSTART is nucleotide (1, 2, … numbering) in “refseq” where nucleotide R1START in R1 aligns. REFSEQEND is nucleotide in “refseq” where nucleotide R2START in R2 aligns.

It is important to set alignspecs so that you don’t count the part of the subamplicon that is in the primer binding site, since the nucleotide identities in this region come from the primers rather than the templates being sequenced. See Barcoded subamplicon sequencing for an example.

Note also that the alignments will fail if you don’t set alignspecs exactly correctly for each subamplicon. So if you have reads failing to align to part of your gene, carefully check alignspecs to make sure it is correct (you can use the example in Barcoded subamplicon sequencing to make sure you understand things properly).

Named Arguments

--barcodelength
 

Length of barcodes (NNN… nucleotides) at the beginning of R1 and R2 reads.

Default: 8

This is the length of the barcode on each primer. So a value of 8 corresponds to a total of 16 N nucleotides on the two primers. See Barcoded subamplicon sequencing for an example of primers with 8 N nucleotides.

--chartype

Possible choices: codon

Character for which we are counting mutations. Currently “codon” is only allowed value (in the future “nucleotide” might be added).

Default: “codon”

If chartype is codon then refseq must provide a valid coding sequence (length a multiple of 3).

--maxmuts

Only align subamplicons (consensus from a barcode) if <= this many mismatches with “refseq” counted in terms of “chartype”.

Default: 4

--minq

Only consider nucleotides with Q scores >= this number.

Default: 15

--maxlowqfrac

Only retain read pairs if no “N” or Q < “minq” nucleotides in barcodes, and total fraction of such nucleotides is <= this number in each read individually and in the eventual subamplicon built from the barcodes.

Default: 0.075

--minreadsperbarcode
 

Retain only barcodes with >= this many reads that align gaplessly with >= “minreadidentity” identical high-quality nucleotides; should be >= 2.

Default: 2

--minreadidentity
 

Retain only barcodes where all reads have >= this fraction of identical high-quality nucleotides that align gaplessly.

Default: 0.9

--minreadconcurrence
 

For retained barcodes, only make mutation calls when >= this fraction of reads concur.

Default: 0.75

--maxreadtrim

If R1 or R2 reads for same barcode are not all same length, trim up to this many nucleotides from 3’ end; if still not same length then discard.

Default: 3

This option exists because it appears that Illumina runs of a given read length don’t always return all of the reads with exactly that read length.

--R1_is_antisense
 

Is the R1 read in the antisense direction of “refseq”?

Default: False

--R1trimlength Before performing any operations on the read pairs, trim nucleotides away from 3’ end of R1 read to reach the read length specified here.
--R2trimlength Before performing any operations on the read pairs, trim nucleotides away from 3’ end of R2 read to reach the read length specified here.
--barcodeinfo

If you specify this option, create a file with suffix “barcodeinfo.txt.gz” containing information for each barcode. This file is quite large, and its creation will about double the program run time.

Default: False

--purgereadfrac
 

Randomly purge read pairs with this probability, thereby subsampling the data. You might want a value > 0 to estimate how the results depend on the sequencing depth. See also “–purgefrac”.

Default: 0

Why would you ever want to purge some of the read pairs? You may be trying to determine whether sequencing to a higher depth (which will give you more barcodes with at least –nreadsperbarcode reads) will improve the quality of your results. If you set –purgereadfrac to a value > 0 (say 0.5), you’ll see how the results would be affected if you had fewer reads. If these results are noticeably worse by some objective measure, this supports that idea that your read depth might be in regime where greater depth would help.

--purgefrac

Like “–purgereadfrac” but purges barcodes rather than read pairs. In deciding which option to use, remember that the number of unique barcodes sequenced does not necessarily increase linearly with read depth.

Default: 0

--seed

Random number seed used to select reads for purging when using “–purgereadfrac” or “–purgefrac”.

Default: 1

-v, --version show program’s version number and exit

Output files

The formats of the output files are given below in terms of the file suffix that is appended to --outprefix. The data in the first two output files (counts.txt and summarystats.txt) can be used as input to dms_summarizealignments to make plots that summarize the results for several samples.

  • counts.txt : This file gives the counts of the different characters of --chartype in the Deep mutational scanning counts file format. So if --chartype is codon, then these will be the counts of codon identities. Note that files in Deep mutational scanning counts file format can be used directly as input to programs like dms_inferprefs and dms_inferdiffprefs.

  • summarystats.txt : This text file summarizes the numbers of reads and barcodes parsed. For instance, here is an example:

    total read pairs = 4580258
    read pairs that fail Illumina filter = 0
    low quality read pairs = 915131
    barcodes randomly purged = 0
    discarded barcodes with 1 reads = 844768
    aligned barcodes with 2 reads = 466387
    discarded barcodes with 2 reads = 6457
    un-alignable barcodes with 2 reads = 40753
    aligned barcodes with 3 reads = 262149
    discarded barcodes with 3 reads = 5426
    un-alignable barcodes with 3 reads = 24891
    aligned barcodes with 4 reads = 120828
    discarded barcodes with 4 reads = 3281
    un-alignable barcodes with 4 reads = 7675
    aligned barcodes with 5 reads = 45342
    discarded barcodes with 5 reads = 1516
    un-alignable barcodes with 5 reads = 2863
    aligned barcodes with 6 reads = 14545
    discarded barcodes with 6 reads = 577
    un-alignable barcodes with 6 reads = 934
    aligned barcodes with 7 reads = 4060
    discarded barcodes with 7 reads = 203
    un-alignable barcodes with 7 reads = 243
    aligned barcodes with 8 reads = 996
    discarded barcodes with 8 reads = 65
    un-alignable barcodes with 8 reads = 52
    aligned barcodes with 9 reads = 250
    discarded barcodes with 9 reads = 20
    un-alignable barcodes with 9 reads = 11
    aligned barcodes with 10 reads = 51
    discarded barcodes with 10 reads = 3
    un-alignable barcodes with 10 reads = 3
    aligned barcodes with 11 reads = 10
    discarded barcodes with 11 reads = 1
    aligned barcodes with 12 reads = 2
    
  • .log : This is a text file that logs the progress of the program.

  • barcodeinfo.txt.gz : This file is only created if the --barcodeinfo option is used. In this case, the file will give all barcodes and their associated reads, and will explain whether or not the barcode was successfully retained during the alignment process.

Memory usage

dms_barcodedsubamplicons stores all of the reads in the FASTQ files in memory. Therefore, it uses a substantial amount of memory, typically around a gigabyte per million paired-end sequencing reads. However, for typical data sets such memory usage is well within the capacity of modern large-memory nodes.