dms_barcodedsubamplicons
¶
Contents
Overview¶
This program can process the FASTQ reads generated by Barcoded subamplicon sequencing to count the frequencies of each identity at each site, and to generate summary statistics.
If you do such an analysis for several samples, you can summarize the results with dms_summarizealignments.
After you install dms_tools, this program will be available to run at the command line.
Barcoded subamplicon sequencing¶
Barcoded subamplicon sequencing involves using PCR to make a series of subamplicons from the gene to be sequenced. Each of these subamplicons contains a random barcode (a string of N
nucleotides) at each end that provides a unique identifier. The barcoded subamplicons are then diluted such that the number of unique ssDNA molecules is less than the read depth, and then this dilution is amplified by primers that also add the Illumina adaptors. Finally, these molecules are sequenced.
In analyzing the sequencing, we can then identify barcodes that are sequenced multiple times, and a consensus sequence for the subamplicon can be built from the multiple reads. The dms_barcodedsubamplicons
script performs this analysis.
To understand how the sequencing is done in detail, example primers are shown below for sequencing of the hemagglutinin (HA) from A/WSN/1933 (H1N1) influenza:
Rnd1for1: 5'-CTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNaagcaggggaaaataaaaacaaccaaa-3' Rnd1for427: 5'-CTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNccaaggaaagttcatggcccaac-3' Rnd1for850: 5'-CTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNgtttgagtccggcatcatcacc-3' Rnd1for1276: 5'-CTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNcaacaacttagaaaaaaggatggaaaatttaaataaa-3'
WSN HA with coding sequence in caps: 5'-agcaaaagcaggggaaaataaaaacaaccaaaATGAAGGCAAAACTACTGGTCCTGTTATATGCATTTGTAGCTACAGATGCAGACACAATATGTATAGGCTACCATGCGAACAACTCAACCGACACTGTTGACACAATACTCGAGAAGAATGTGGCAGTGACACATTCTGTTAACCTGCTCGAAGACAGCCACAACGGGAAACTATGTAAATTAAAAGGAATAGCCCCACTACAATTGGGGAAATGTAACATCACCGGATGGCTCTTGGGAAATCCAGAATGCGACTCACTGCTTCCAGCGAGATCATGGTCCTACATTGTAGAAACACCAAACTCTGAGAATGGAGCATGTTATCCAGGAGATCTCATCGACTATGAGGAACTGAGGGAGCAATTGAGCTCAGTATCATCATTAGAAAGATTCGAAATATTTCCCAAGGAAAGTTCATGGCCCAACCACACATTCAACGGAGTAACAGTATCATGCTCCCATAGGGGAAAAAGCAGTTTTTACAGAAATTTGCTATGGCTGACGAAGAAGGGGGATTCATACCCAAAGCTGACCAATTCCTATGTGAACAATAAAGGGAAAGAAGTCCTTGTACTATGGGGTGTTCATCACCCGTCTAGCAGTGATGAGCAACAGAGTCTCTATAGTAATGGAAATGCTTATGTCTCTGTAGCGTCTTCAAATTATAACAGGAGATTCACCCCGGAAATAGCTGCAAGGCCCAAAGTAAGAGATCAACATGGGAGGATGAACTATTACTGGACCTTGCTAGAACCCGGAGACACAATAATATTTGAGGCAACTGGTAATCTAATAGCACCATGGTATGCTTTCGCACTGAGTAGAGGGTTTGAGTCCGGCATCATCACCTCAAACGCGTCAATGCATGAGTGTAACACGAAGTGTCAAACACCCCAGGGAGCTATAAACAGCAATCTCCCTTTCCAGAATATACACCCAGTCACAATAGGAGAGTGCCCAAAATATGTCAGGAGTACCAAATTGAGGATGGTTACAGGACTAAGAAACATCCCATCCATTCAATACAGAGGTCTATTTGGAGCCATTGCTGGTTTTATTGAGGGGGGATGGACTGGAATGATAGATGGATGGTATGGTTATCATCATCAGAATGAACAGGGATCAGGCTATGCAGCGGATCAAAAAAGCACACAAAATGCCATTAACGGGATTACAAACAAGGTGAACTCTGTTATCGAGAAAATGAACACTCAATTCACAGCTGTGGGTAAAGAATTCAACAACTTAGAAAAAAGGATGGAAAATTTAAATAAAAAAGTTGATGATGGGTTTCTGGACATTTGGACATATAATGCAGAATTGTTAGTTCTACTGGAAAATGAAAGGACTTTGGATTTCCATGACTTAAATGTGAAGAATCTGTACGAGAAAGTAAAAAGCCAATTAAAGAATAATGCCAAAGAAATCGGAAATGGGTGTTTTGAGTTCTACCACAAGTGTGACAATGAATGCATGGAAAGTGTAAGAAATGGGACTTATGATTATCCAAAATATTCAGAAGAATCAAAGTTGAACAGGGAAAAGATAGATGGAGTGAAATTGGAATCAATGGGGGTGTATCAGATTCTGGCGATCTACTCAACTGTCGCCAGTTCACTGGTGCTTTTGGTCTCCCTGGGGGCAATCAGTTTCTGGATGTGTTCTAATGGGTCTTTGCAGTGCAGAATATGCATCTGAgattaggatttcagaaatataaggaaaaacacccttgtttctact-3'
WSN HA with coding sequence in caps: 3'-tcgttttcgtccccttttatttttgttggtttTACTTCCGTTTTGATGACCAGGACAATATACGTAAACATCGATGTCTACGTCTGTGTTATACATATCCGATGGTACGCTTGTTGAGTTGGCTGTGACAACTGTGTTATGAGCTCTTCTTACACCGTCACTGTGTAAGACAATTGGACGAGCTTCTGTCGGTGTTGCCCTTTGATACATTTAATTTTCCTTATCGGGGTGATGTTAACCCCTTTACATTGTAGTGGCCTACCGAGAACCCTTTAGGTCTTACGCTGAGTGACGAAGGTCGCTCTAGTACCAGGATGTAACATCTTTGTGGTTTGAGACTCTTACCTCGTACAATAGGTCCTCTAGAGTAGCTGATACTCCTTGACTCCCTCGTTAACTCGAGTCATAGTAGTAATCTTTCTAAGCTTTATAAAGGGTTCCTTTCAAGTACCGGGTTGGTGTGTAAGTTGCCTCATTGTCATAGTACGAGGGTATCCCCTTTTTCGTCAAAAATGTCTTTAAACGATACCGACTGCTTCTTCCCCCTAAGTATGGGTTTCGACTGGTTAAGGATACACTTGTTATTTCCCTTTCTTCAGGAACATGATACCCCACAAGTAGTGGGCAGATCGTCACTACTCGTTGTCTCAGAGATATCATTACCTTTACGAATACAGAGACATCGCAGAAGTTTAATATTGTCCTCTAAGTGGGGCCTTTATCGACGTTCCGGGTTTCATTCTCTAGTTGTACCCTCCTACTTGATAATGACCTGGAACGATCTTGGGCCTCTGTGTTATTATAAACTCCGTTGACCATTAGATTATCGTGGTACCATACGAAAGCGTGACTCATCTCCCAAACTCAGGCCGTAGTAGTGGAGTTTGCGCAGTTACGTACTCACATTGTGCTTCACAGTTTGTGGGGTCCCTCGATATTTGTCGTTAGAGGGAAAGGTCTTATATGTGGGTCAGTGTTATCCTCTCACGGGTTTTATACAGTCCTCATGGTTTAACTCCTACCAATGTCCTGATTCTTTGTAGGGTAGGTAAGTTATGTCTCCAGATAAACCTCGGTAACGACCAAAATAACTCCCCCCTACCTGACCTTACTATCTACCTACCATACCAATAGTAGTAGTCTTACTTGTCCCTAGTCCGATACGTCGCCTAGTTTTTTCGTGTGTTTTACGGTAATTGCCCTAATGTTTGTTCCACTTGAGACAATAGCTCTTTTACTTGTGAGTTAAGTGTCGACACCCATTTCTTAAGTTGTTGAATCTTTTTTCCTACCTTTTAAATTTATTTTTTCAACTACTACCCAAAGACCTGTAAACCTGTATATTACGTCTTAACAATCAAGATGACCTTTTACTTTCCTGAAACCTAAAGGTACTGAATTTACACTTCTTAGACATGCTCTTTCATTTTTCGGTTAATTTCTTATTACGGTTTCTTTAGCCTTTACCCACAAAACTCAAGATGGTGTTCACACTGTTACTTACGTACCTTTCACATTCTTTACCCTGAATACTAATAGGTTTTATAAGTCTTCTTAGTTTCAACTTGTCCCTTTTCTATCTACCTCACTTTAACCTTAGTTACCCCCACATAGTCTAAGACCGCTAGATGAGTTGACAGCGGTCAAGTGACCACGAAAACCAGAGGGACCCCCGTTAGTCAAAGACCTACACAAGATTACCCAGAAACGTCACGTCTTATACGTAGACTctaatcctaaagtctttatattcctttttgtgggaacaaagatga-5'
Rnd1rev426: 3'-gtgtgtaagttgcctcattgtcatagtacNNNNNNNNTCTAGCCTTCTCGTGTGCAGACTTGAGG-5' Rnd1rev849: 3'-agtttgcgcagttacgtactcacNNNNNNNNTCTAGCCTTCTCGTGTGCAGACTTGAGG-5' Rnd1rev1275: 3'-actactacccaaagacctgtaaaNNNNNNNNTCTAGCCTTCTCGTGTGCAGACTTGAGG-5' Rnd1rev1698: 3'-ctaatcctaaagtctttatattcctttttgtgggaaNNNNNNNNTCTAGCCTTCTCGTGTGCAGACTTGAGG-5'
Briefly, the steps are as follows:
The library of HA genes is subjected to PCR with primers to create subamplicons of \(\approx 425\) internal nucleotides each as well as additional adaptor and primer-binding sequences. In the example above, PCRs are performed with the following primer pairs:
Rnd1for1
andRnd1rev426
: these primers create a subamplicon spanning nucleotides 1 to 426 (codons 1 to 142) of the WSN HA, which is the part of the sequence shown in caps above.Rnd1for427
andRnd1rev849
: these primers create a subamplicon spanning nucleotides 427 to 849 (codons 143 to 283).Rnd1for850
andRnd1rev1275
: these primers create a subamplicon spanning nucleotides 850 to 1275 (codons 284 to 425).Rnd1for1276
andRnd1rev1698
: these primers create a subamplicon spanning nucleotides 1276 to 1698 (codons 426 to 566).The subamplicons are then diluted so that the total number of ssDNA molecules is less than the sequencing depth. Since each subamplicon molecule has 8
N
nucleotides at each end, there are \(4^{16} \approx 4.3 \times 10^9\) unique barcodes. Since the number of unique barcodes is much greater than the total number of molecules in the dilution, each barcode will generally be associated with just one molecule.The barcoded subamplicons are further amplified by primers that add the Illumina adaptors, and perhaps Illumina indices for multiplexing. For instance, here are two example primers for this second round of PCR:
>Rnd2forUniversal: forward primer for the round 2 PCRs that adds the Illumina adaptor AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCC >Rnd2revIndex: reverse primer for the round 2 PCRs that adds the TruSeq index 1 (in lower case) as well as the Illumina adaptor CAAGCAGAAGACGGCATACGAGATcgtgatGTGACTGGAGTTCAGACGTGTGCTCTTCCThe products of the round 2 PCRs are sequenced with overlapping paired-end reads of sufficient length to cover the entire subamplicon, and the sequencing data are analyzed with the standard Illumina pipeline to generate FASTQ files with the R1 and R2 reads.
The FASTQ files are analyzed with
dms_barcodedsubamplicons
. For the example here, the command might be:dms_barcodedsubamplicons WSN_HA_ WSN-HA.fasta R1.fastq.gz R2.fastq.gz 1,426,36,38 427,849,32,32 850,1275,31,37 1276,1698,46,45The meanings of the arguments are detailed in the Command-line usage. Briefly:
WSN_HA_
is the prefix attached to the created output files.
WSN-HA.fasta
should be the name of a FASTA file containing the part of the gene that we are sequencing. In this example, it would be only the capitalized portion of the sequence shown above (as the lowercase parts are simply flanking sequences for primer binding). For instance, here is such a file:>WSN HA coding sequence ATGAAGGCAAAACTACTGGTCCTGTTATATGCATTTGTAGCTACAGATGCAGACACAATATGTATAGGCTACCATGCGAACAACTCAACCGACACTGTTGACACAATACTCGAGAAGAATGTGGCAGTGACACATTCTGTTAACCTGCTCGAAGACAGCCACAACGGGAAACTATGTAAATTAAAAGGAATAGCCCCACTACAATTGGGGAAATGTAACATCACCGGATGGCTCTTGGGAAATCCAGAATGCGACTCACTGCTTCCAGCGAGATCATGGTCCTACATTGTAGAAACACCAAACTCTGAGAATGGAGCATGTTATCCAGGAGATCTCATCGACTATGAGGAACTGAGGGAGCAATTGAGCTCAGTATCATCATTAGAAAGATTCGAAATATTTCCCAAGGAAAGTTCATGGCCCAACCACACATTCAACGGAGTAACAGTATCATGCTCCCATAGGGGAAAAAGCAGTTTTTACAGAAATTTGCTATGGCTGACGAAGAAGGGGGATTCATACCCAAAGCTGACCAATTCCTATGTGAACAATAAAGGGAAAGAAGTCCTTGTACTATGGGGTGTTCATCACCCGTCTAGCAGTGATGAGCAACAGAGTCTCTATAGTAATGGAAATGCTTATGTCTCTGTAGCGTCTTCAAATTATAACAGGAGATTCACCCCGGAAATAGCTGCAAGGCCCAAAGTAAGAGATCAACATGGGAGGATGAACTATTACTGGACCTTGCTAGAACCCGGAGACACAATAATATTTGAGGCAACTGGTAATCTAATAGCACCATGGTATGCTTTCGCACTGAGTAGAGGGTTTGAGTCCGGCATCATCACCTCAAACGCGTCAATGCATGAGTGTAACACGAAGTGTCAAACACCCCAGGGAGCTATAAACAGCAATCTCCCTTTCCAGAATATACACCCAGTCACAATAGGAGAGTGCCCAAAATATGTCAGGAGTACCAAATTGAGGATGGTTACAGGACTAAGAAACATCCCATCCATTCAATACAGAGGTCTATTTGGAGCCATTGCTGGTTTTATTGAGGGGGGATGGACTGGAATGATAGATGGATGGTATGGTTATCATCATCAGAATGAACAGGGATCAGGCTATGCAGCGGATCAAAAAAGCACACAAAATGCCATTAACGGGATTACAAACAAGGTGAACTCTGTTATCGAGAAAATGAACACTCAATTCACAGCTGTGGGTAAAGAATTCAACAACTTAGAAAAAAGGATGGAAAATTTAAATAAAAAAGTTGATGATGGGTTTCTGGACATTTGGACATATAATGCAGAATTGTTAGTTCTACTGGAAAATGAAAGGACTTTGGATTTCCATGACTTAAATGTGAAGAATCTGTACGAGAAAGTAAAAAGCCAATTAAAGAATAATGCCAAAGAAATCGGAAATGGGTGTTTTGAGTTCTACCACAAGTGTGACAATGAATGCATGGAAAGTGTAAGAAATGGGACTTATGATTATCCAAAATATTCAGAAGAATCAAAGTTGAACAGGGAAAAGATAGATGGAGTGAAATTGGAATCAATGGGGGTGTATCAGATTCTGGCGATCTACTCAACTGTCGCCAGTTCACTGGTGCTTTTGGTCTCCCTGGGGGCAATCAGTTTCTGGATGTGTTCTAATGGGTCTTTGCAGTGCAGAATATGCATCTGA
R1.fasta.gz
andR2.fastq.gz
are the FASTQ files with the R1 and R2 reads.The remaining arguments specify where the subamplicons might align For instance,
1,426,36,38
means that for the first subamplicon, the first sequenced nucleotide (first one outside primer binding site) aligns to nucleotide 1 of the reference sequence inWSN-HA.fasta
, the last sequence nucleotide aligns to nucleotide 426 of the reference sequence, the first nucleotide in R1 that we want to consider in the alignment is site 36 (there is an 8 nucleotide barcode plus a 27-nucleotide primer-binding site), and the first nucleotide in R2 that we want to consider in the alignment is site 38 (there is an 8-nucleotide barcode plus a 29-nucleotide primer-binding site). The arguments427,849,32,32
,850,1275,31,37
, and1276,1698,46,45
are the similar specifications for the other three subamplicons.
Algorithm for assembling and aligning subamplicons¶
The algorithm implemented by dms_barcodedsubamplicons
is as follows:
Any read pair in which either read fails the Illumina chastity filter is discarded.
All sites in each read pair are classified as “high quality” if the site is not an
N
(ambiguous nucleotide) and has a Q-score of at least--minq
, or as “low quality” otherwise. For all low-quality sites, the called identity in the read is simply set toN
(so we treat a site with a Q-score of less than--minq
as equivalent to anN
).We discard any read pairs for which any of the following conditions are met:
- There is a low-quality site in the barcode region of either read.
- Either read has more than
--maxlowqfrac
of its nucleotides that are low quality.
We then collect all remaining read pairs for each barcode (concatenating the barcodes on the paired reads to give a total barcode length that is twice
--barcodelength
) and perform all subsequent operations on these barcodes.Any barcode with less than
--minreadsperbarcode
read pairs is discarded.The R1 and R2 reads for each remaining barcode are assembled into consensus sequences. At each site, these consensus sequences have the consensus nucleotide if that nucleotide is found in at least
--minreadconcurrence
of the reads; otherwise the nucleotide is set toN
. During the process of building these consensus sequences, the barcode is discarded if any of the following conditions are met for either the R1 or R2 read:- The reads cannot all be made the same length by trimming no more
--maxreadtrim
from the 3’ end. - Any of the reads fail to have at least
--minreadidentity
identical and high-quality (notN
) nucleotides when compared pairwise.
- The reads cannot all be made the same length by trimming no more
For each remaining barcode, we attempt to align the consensus sequence to each of the subamplicons specified by
alignspecs
. The attempted alignment does not accommodate gaps; the consensus sequence for the reads is required to align gaplessly at the position specified byalignspecs
. In any region where the reads overlap (which may happen near the center of the subamplicon), if both consensus sequences (R1 and R2) report high quality nucleotides, the identity is considered ambiguous if the sequences disagree. An alignment is considered valid only if the following conditions are met:- The total fraction of non-high-quality nucleotides in the aligned region exceeds
--maxlowqfrac
. - The subamplicon alignment has no more than
--maxmuts
mutations relative torefseq
when mutations are counted in terms of--chartype
(so if--chartype
iscodon
, this is the number of codon mutations).
If the above criteria are met, then the identities (wildtype or mutant) are counted for each site of
--chartype
where the are no ambiguous nucleotides in the alignment, and this barcoded subamplicon contributes to the totals reported in thecounts.txt
output file. If the alignment criteria are not met, the barcode is discarded as un-alignable. Note that unless you are working with an extremely repetitive gene or specify an extremely large value of--maxmuts
relative to the read length, spurious alignments should not occur.- The total fraction of non-high-quality nucleotides in the aligned region exceeds
Command-line usage¶
Gathers barcoded subamplicons, aligns to reference sequence, and counts mutations. This script is part of dms_tools (version 1.1.20) written by the Bloom Lab (see https://github.com/jbloomlab/dms_tools/graphs/contributors for all contributors). Detailed documentation is at http://jbloomlab.github.io/dms_tools/
usage: dms_barcodedsubamplicons [-h] [--barcodelength BARCODELENGTH]
[--chartype {codon}] [--maxmuts MAXMUTS]
[--minq MINQ] [--maxlowqfrac MAXLOWQFRAC]
[--minreadsperbarcode MINREADSPERBARCODE]
[--minreadidentity MINREADIDENTITY]
[--minreadconcurrence MINREADCONCURRENCE]
[--maxreadtrim MAXREADTRIM]
[--R1_is_antisense]
[--R1trimlength R1TRIMLENGTH]
[--R2trimlength R2TRIMLENGTH] [--barcodeinfo]
[--purgereadfrac PURGEREADFRAC]
[--purgefrac PURGEFRAC] [--seed SEED] [-v]
outprefix refseq r1files r2files alignspecs
[alignspecs ...]
Positional Arguments¶
outprefix | Prefix for output files. The suffixes of the created files are: “counts.txt” - character counts at each site; “summarystats.txt” - barcode summary stats; “.log” - file that logs progress; “barcodeinfo.txt.gz” - lists all barcodes if using “–barcodeinfo”. Existing files with these names are removed. For instance, if outprefix is WSN_HA_ then the created output files will be WSN_HA_counts.txt, WSN_HA_summarystats.txt, WSN_HA.log, and WSN_HA_barcodeinfo.txt.gz if you use the –barcodeinfo option. The format of the output files is described in Output files. |
refseq | Existing FASTA file containing gene to which we are aligning subamplicons and counting mutations. |
r1files | Comma-separated list of R1 FASTQ files (no spaces). Files can optionally be gzipped (extension .gz). |
r2files | Like “r1files” but R2 files. Must be same number of comma-separated entries as for “r1files”. |
alignspecs | This argument is repeated to specify each subamplicon. Each occurrence is four comma-delimited integers (no spaces): “REFSEQSTART,REFSEQEND,R1START,R2START”. REFSEQSTART is nucleotide (1, 2, … numbering) in “refseq” where nucleotide R1START in R1 aligns. REFSEQEND is nucleotide in “refseq” where nucleotide R2START in R2 aligns. It is important to set alignspecs so that you don’t count the part of the subamplicon that is in the primer binding site, since the nucleotide identities in this region come from the primers rather than the templates being sequenced. See Barcoded subamplicon sequencing for an example. Note also that the alignments will fail if you don’t set alignspecs exactly correctly for each subamplicon. So if you have reads failing to align to part of your gene, carefully check alignspecs to make sure it is correct (you can use the example in Barcoded subamplicon sequencing to make sure you understand things properly). |
Named Arguments¶
--barcodelength | |
Length of barcodes (NNN… nucleotides) at the beginning of R1 and R2 reads. Default: 8 This is the length of the barcode on each primer. So a value of 8 corresponds to a total of 16 N nucleotides on the two primers. See Barcoded subamplicon sequencing for an example of primers with 8 N nucleotides. | |
--chartype | Possible choices: codon Character for which we are counting mutations. Currently “codon” is only allowed value (in the future “nucleotide” might be added). Default: “codon” If chartype is codon then refseq must provide a valid coding sequence (length a multiple of 3). |
--maxmuts | Only align subamplicons (consensus from a barcode) if <= this many mismatches with “refseq” counted in terms of “chartype”. Default: 4 |
--minq | Only consider nucleotides with Q scores >= this number. Default: 15 |
--maxlowqfrac | Only retain read pairs if no “N” or Q < “minq” nucleotides in barcodes, and total fraction of such nucleotides is <= this number in each read individually and in the eventual subamplicon built from the barcodes. Default: 0.075 |
--minreadsperbarcode | |
Retain only barcodes with >= this many reads that align gaplessly with >= “minreadidentity” identical high-quality nucleotides; should be >= 2. Default: 2 | |
--minreadidentity | |
Retain only barcodes where all reads have >= this fraction of identical high-quality nucleotides that align gaplessly. Default: 0.9 | |
--minreadconcurrence | |
For retained barcodes, only make mutation calls when >= this fraction of reads concur. Default: 0.75 | |
--maxreadtrim | If R1 or R2 reads for same barcode are not all same length, trim up to this many nucleotides from 3’ end; if still not same length then discard. Default: 3 This option exists because it appears that Illumina runs of a given read length don’t always return all of the reads with exactly that read length. |
--R1_is_antisense | |
Is the R1 read in the antisense direction of “refseq”? Default: False | |
--R1trimlength | Before performing any operations on the read pairs, trim nucleotides away from 3’ end of R1 read to reach the read length specified here. |
--R2trimlength | Before performing any operations on the read pairs, trim nucleotides away from 3’ end of R2 read to reach the read length specified here. |
--barcodeinfo | If you specify this option, create a file with suffix “barcodeinfo.txt.gz” containing information for each barcode. This file is quite large, and its creation will about double the program run time. Default: False |
--purgereadfrac | |
Randomly purge read pairs with this probability, thereby subsampling the data. You might want a value > 0 to estimate how the results depend on the sequencing depth. See also “–purgefrac”. Default: 0 Why would you ever want to purge some of the read pairs? You may be trying to determine whether sequencing to a higher depth (which will give you more barcodes with at least –nreadsperbarcode reads) will improve the quality of your results. If you set –purgereadfrac to a value > 0 (say 0.5), you’ll see how the results would be affected if you had fewer reads. If these results are noticeably worse by some objective measure, this supports that idea that your read depth might be in regime where greater depth would help. | |
--purgefrac | Like “–purgereadfrac” but purges barcodes rather than read pairs. In deciding which option to use, remember that the number of unique barcodes sequenced does not necessarily increase linearly with read depth. Default: 0 |
--seed | Random number seed used to select reads for purging when using “–purgereadfrac” or “–purgefrac”. Default: 1 |
-v, --version | show program’s version number and exit |
Output files¶
The formats of the output files are given below in terms of the file suffix that is appended to --outprefix
. The data in the first two output files (counts.txt
and summarystats.txt
) can be used as input to dms_summarizealignments to make plots that summarize the results for several samples.
counts.txt
: This file gives the counts of the different characters of--chartype
in the Deep mutational scanning counts file format. So if--chartype
iscodon
, then these will be the counts of codon identities. Note that files in Deep mutational scanning counts file format can be used directly as input to programs like dms_inferprefs and dms_inferdiffprefs.summarystats.txt
: This text file summarizes the numbers of reads and barcodes parsed. For instance, here is an example:total read pairs = 4580258 read pairs that fail Illumina filter = 0 low quality read pairs = 915131 barcodes randomly purged = 0 discarded barcodes with 1 reads = 844768 aligned barcodes with 2 reads = 466387 discarded barcodes with 2 reads = 6457 un-alignable barcodes with 2 reads = 40753 aligned barcodes with 3 reads = 262149 discarded barcodes with 3 reads = 5426 un-alignable barcodes with 3 reads = 24891 aligned barcodes with 4 reads = 120828 discarded barcodes with 4 reads = 3281 un-alignable barcodes with 4 reads = 7675 aligned barcodes with 5 reads = 45342 discarded barcodes with 5 reads = 1516 un-alignable barcodes with 5 reads = 2863 aligned barcodes with 6 reads = 14545 discarded barcodes with 6 reads = 577 un-alignable barcodes with 6 reads = 934 aligned barcodes with 7 reads = 4060 discarded barcodes with 7 reads = 203 un-alignable barcodes with 7 reads = 243 aligned barcodes with 8 reads = 996 discarded barcodes with 8 reads = 65 un-alignable barcodes with 8 reads = 52 aligned barcodes with 9 reads = 250 discarded barcodes with 9 reads = 20 un-alignable barcodes with 9 reads = 11 aligned barcodes with 10 reads = 51 discarded barcodes with 10 reads = 3 un-alignable barcodes with 10 reads = 3 aligned barcodes with 11 reads = 10 discarded barcodes with 11 reads = 1 aligned barcodes with 12 reads = 2
.log
: This is a text file that logs the progress of the program.barcodeinfo.txt.gz
: This file is only created if the--barcodeinfo
option is used. In this case, the file will give all barcodes and their associated reads, and will explain whether or not the barcode was successfully retained during the alignment process.
Memory usage¶
dms_barcodedsubamplicons
stores all of the reads in the FASTQ files in memory. Therefore, it uses a substantial amount of memory, typically around a gigabyte per million paired-end sequencing reads. However, for typical data sets such memory usage is well within the capacity of modern large-memory nodes.