.. _dms_barcodedsubamplicons: ========================================== ``dms_barcodedsubamplicons`` ========================================== .. contents:: Overview ------------- This program can process the FASTQ reads generated by `Barcoded subamplicon sequencing`_ to count the frequencies of each identity at each site, and to generate summary statistics. If you do such an analysis for several samples, you can summarize the results with :ref:`dms_summarizealignments`. After you install `dms_tools`_, this program will be available to run at the command line. Barcoded subamplicon sequencing ----------------------------------- Barcoded subamplicon sequencing involves using PCR to make a series of subamplicons from the gene to be sequenced. Each of these subamplicons contains a random barcode (a string of ``N`` nucleotides) at each end that provides a unique identifier. The barcoded subamplicons are then diluted such that the number of unique ssDNA molecules is less than the read depth, and then this dilution is amplified by primers that also add the Illumina adaptors. Finally, these molecules are sequenced. In analyzing the sequencing, we can then identify barcodes that are sequenced multiple times, and a consensus sequence for the subamplicon can be built from the multiple reads. The ``dms_barcodedsubamplicons`` script performs this analysis. To understand how the sequencing is done in detail, example primers are shown below for sequencing of the hemagglutinin (HA) from A/WSN/1933 (H1N1) influenza:: Rnd1for1: 5'-CTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNaagcaggggaaaataaaaacaaccaaa-3' Rnd1for427: 5'-CTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNccaaggaaagttcatggcccaac-3' Rnd1for850: 5'-CTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNgtttgagtccggcatcatcacc-3' Rnd1for1276: 5'-CTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNcaacaacttagaaaaaaggatggaaaatttaaataaa-3' WSN HA with coding sequence in caps: 5'-agcaaaagcaggggaaaataaaaacaaccaaaATGAAGGCAAAACTACTGGTCCTGTTATATGCATTTGTAGCTACAGATGCAGACACAATATGTATAGGCTACCATGCGAACAACTCAACCGACACTGTTGACACAATACTCGAGAAGAATGTGGCAGTGACACATTCTGTTAACCTGCTCGAAGACAGCCACAACGGGAAACTATGTAAATTAAAAGGAATAGCCCCACTACAATTGGGGAAATGTAACATCACCGGATGGCTCTTGGGAAATCCAGAATGCGACTCACTGCTTCCAGCGAGATCATGGTCCTACATTGTAGAAACACCAAACTCTGAGAATGGAGCATGTTATCCAGGAGATCTCATCGACTATGAGGAACTGAGGGAGCAATTGAGCTCAGTATCATCATTAGAAAGATTCGAAATATTTCCCAAGGAAAGTTCATGGCCCAACCACACATTCAACGGAGTAACAGTATCATGCTCCCATAGGGGAAAAAGCAGTTTTTACAGAAATTTGCTATGGCTGACGAAGAAGGGGGATTCATACCCAAAGCTGACCAATTCCTATGTGAACAATAAAGGGAAAGAAGTCCTTGTACTATGGGGTGTTCATCACCCGTCTAGCAGTGATGAGCAACAGAGTCTCTATAGTAATGGAAATGCTTATGTCTCTGTAGCGTCTTCAAATTATAACAGGAGATTCACCCCGGAAATAGCTGCAAGGCCCAAAGTAAGAGATCAACATGGGAGGATGAACTATTACTGGACCTTGCTAGAACCCGGAGACACAATAATATTTGAGGCAACTGGTAATCTAATAGCACCATGGTATGCTTTCGCACTGAGTAGAGGGTTTGAGTCCGGCATCATCACCTCAAACGCGTCAATGCATGAGTGTAACACGAAGTGTCAAACACCCCAGGGAGCTATAAACAGCAATCTCCCTTTCCAGAATATACACCCAGTCACAATAGGAGAGTGCCCAAAATATGTCAGGAGTACCAAATTGAGGATGGTTACAGGACTAAGAAACATCCCATCCATTCAATACAGAGGTCTATTTGGAGCCATTGCTGGTTTTATTGAGGGGGGATGGACTGGAATGATAGATGGATGGTATGGTTATCATCATCAGAATGAACAGGGATCAGGCTATGCAGCGGATCAAAAAAGCACACAAAATGCCATTAACGGGATTACAAACAAGGTGAACTCTGTTATCGAGAAAATGAACACTCAATTCACAGCTGTGGGTAAAGAATTCAACAACTTAGAAAAAAGGATGGAAAATTTAAATAAAAAAGTTGATGATGGGTTTCTGGACATTTGGACATATAATGCAGAATTGTTAGTTCTACTGGAAAATGAAAGGACTTTGGATTTCCATGACTTAAATGTGAAGAATCTGTACGAGAAAGTAAAAAGCCAATTAAAGAATAATGCCAAAGAAATCGGAAATGGGTGTTTTGAGTTCTACCACAAGTGTGACAATGAATGCATGGAAAGTGTAAGAAATGGGACTTATGATTATCCAAAATATTCAGAAGAATCAAAGTTGAACAGGGAAAAGATAGATGGAGTGAAATTGGAATCAATGGGGGTGTATCAGATTCTGGCGATCTACTCAACTGTCGCCAGTTCACTGGTGCTTTTGGTCTCCCTGGGGGCAATCAGTTTCTGGATGTGTTCTAATGGGTCTTTGCAGTGCAGAATATGCATCTGAgattaggatttcagaaatataaggaaaaacacccttgtttctact-3' WSN HA with coding sequence in caps: 3'-tcgttttcgtccccttttatttttgttggtttTACTTCCGTTTTGATGACCAGGACAATATACGTAAACATCGATGTCTACGTCTGTGTTATACATATCCGATGGTACGCTTGTTGAGTTGGCTGTGACAACTGTGTTATGAGCTCTTCTTACACCGTCACTGTGTAAGACAATTGGACGAGCTTCTGTCGGTGTTGCCCTTTGATACATTTAATTTTCCTTATCGGGGTGATGTTAACCCCTTTACATTGTAGTGGCCTACCGAGAACCCTTTAGGTCTTACGCTGAGTGACGAAGGTCGCTCTAGTACCAGGATGTAACATCTTTGTGGTTTGAGACTCTTACCTCGTACAATAGGTCCTCTAGAGTAGCTGATACTCCTTGACTCCCTCGTTAACTCGAGTCATAGTAGTAATCTTTCTAAGCTTTATAAAGGGTTCCTTTCAAGTACCGGGTTGGTGTGTAAGTTGCCTCATTGTCATAGTACGAGGGTATCCCCTTTTTCGTCAAAAATGTCTTTAAACGATACCGACTGCTTCTTCCCCCTAAGTATGGGTTTCGACTGGTTAAGGATACACTTGTTATTTCCCTTTCTTCAGGAACATGATACCCCACAAGTAGTGGGCAGATCGTCACTACTCGTTGTCTCAGAGATATCATTACCTTTACGAATACAGAGACATCGCAGAAGTTTAATATTGTCCTCTAAGTGGGGCCTTTATCGACGTTCCGGGTTTCATTCTCTAGTTGTACCCTCCTACTTGATAATGACCTGGAACGATCTTGGGCCTCTGTGTTATTATAAACTCCGTTGACCATTAGATTATCGTGGTACCATACGAAAGCGTGACTCATCTCCCAAACTCAGGCCGTAGTAGTGGAGTTTGCGCAGTTACGTACTCACATTGTGCTTCACAGTTTGTGGGGTCCCTCGATATTTGTCGTTAGAGGGAAAGGTCTTATATGTGGGTCAGTGTTATCCTCTCACGGGTTTTATACAGTCCTCATGGTTTAACTCCTACCAATGTCCTGATTCTTTGTAGGGTAGGTAAGTTATGTCTCCAGATAAACCTCGGTAACGACCAAAATAACTCCCCCCTACCTGACCTTACTATCTACCTACCATACCAATAGTAGTAGTCTTACTTGTCCCTAGTCCGATACGTCGCCTAGTTTTTTCGTGTGTTTTACGGTAATTGCCCTAATGTTTGTTCCACTTGAGACAATAGCTCTTTTACTTGTGAGTTAAGTGTCGACACCCATTTCTTAAGTTGTTGAATCTTTTTTCCTACCTTTTAAATTTATTTTTTCAACTACTACCCAAAGACCTGTAAACCTGTATATTACGTCTTAACAATCAAGATGACCTTTTACTTTCCTGAAACCTAAAGGTACTGAATTTACACTTCTTAGACATGCTCTTTCATTTTTCGGTTAATTTCTTATTACGGTTTCTTTAGCCTTTACCCACAAAACTCAAGATGGTGTTCACACTGTTACTTACGTACCTTTCACATTCTTTACCCTGAATACTAATAGGTTTTATAAGTCTTCTTAGTTTCAACTTGTCCCTTTTCTATCTACCTCACTTTAACCTTAGTTACCCCCACATAGTCTAAGACCGCTAGATGAGTTGACAGCGGTCAAGTGACCACGAAAACCAGAGGGACCCCCGTTAGTCAAAGACCTACACAAGATTACCCAGAAACGTCACGTCTTATACGTAGACTctaatcctaaagtctttatattcctttttgtgggaacaaagatga-5' Rnd1rev426: 3'-gtgtgtaagttgcctcattgtcatagtacNNNNNNNNTCTAGCCTTCTCGTGTGCAGACTTGAGG-5' Rnd1rev849: 3'-agtttgcgcagttacgtactcacNNNNNNNNTCTAGCCTTCTCGTGTGCAGACTTGAGG-5' Rnd1rev1275: 3'-actactacccaaagacctgtaaaNNNNNNNNTCTAGCCTTCTCGTGTGCAGACTTGAGG-5' Rnd1rev1698: 3'-ctaatcctaaagtctttatattcctttttgtgggaaNNNNNNNNTCTAGCCTTCTCGTGTGCAGACTTGAGG-5' Briefly, the steps are as follows: 1) The library of HA genes is subjected to PCR with primers to create subamplicons of :math:`\approx 425` internal nucleotides each as well as additional adaptor and primer-binding sequences. In the example above, PCRs are performed with the following primer pairs: - ``Rnd1for1`` and ``Rnd1rev426``: these primers create a subamplicon spanning nucleotides 1 to 426 (codons 1 to 142) of the WSN HA, which is the part of the sequence shown in caps above. - ``Rnd1for427`` and ``Rnd1rev849``: these primers create a subamplicon spanning nucleotides 427 to 849 (codons 143 to 283). - ``Rnd1for850`` and ``Rnd1rev1275``: these primers create a subamplicon spanning nucleotides 850 to 1275 (codons 284 to 425). - ``Rnd1for1276`` and ``Rnd1rev1698``: these primers create a subamplicon spanning nucleotides 1276 to 1698 (codons 426 to 566). 2) The subamplicons are then diluted so that the total number of ssDNA molecules is less than the sequencing depth. Since each subamplicon molecule has 8 ``N`` nucleotides at each end, there are :math:`4^{16} \approx 4.3 \times 10^9` unique barcodes. Since the number of unique barcodes is much greater than the total number of molecules in the dilution, each barcode will generally be associated with just one molecule. 3) The barcoded subamplicons are further amplified by primers that add the Illumina adaptors, and perhaps Illumina indices for multiplexing. For instance, here are two example primers for this second round of PCR:: >Rnd2forUniversal: forward primer for the round 2 PCRs that adds the Illumina adaptor AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCC >Rnd2revIndex: reverse primer for the round 2 PCRs that adds the TruSeq index 1 (in lower case) as well as the Illumina adaptor CAAGCAGAAGACGGCATACGAGATcgtgatGTGACTGGAGTTCAGACGTGTGCTCTTCC 4) The products of the round 2 PCRs are sequenced with overlapping paired-end reads of sufficient length to cover the entire subamplicon, and the sequencing data are analyzed with the standard Illumina pipeline to generate FASTQ files with the R1 and R2 reads. 5) The FASTQ files are analyzed with ``dms_barcodedsubamplicons``. For the example here, the command might be:: dms_barcodedsubamplicons WSN_HA_ WSN-HA.fasta R1.fastq.gz R2.fastq.gz 1,426,36,38 427,849,32,32 850,1275,31,37 1276,1698,46,45 The meanings of the arguments are detailed in the `Command-line usage`_. Briefly: - ``WSN_HA_`` is the prefix attached to the created output files. - ``WSN-HA.fasta`` should be the name of a FASTA file containing the part of the gene that we are sequencing. In this example, it would be **only** the capitalized portion of the sequence shown above (as the lowercase parts are simply flanking sequences for primer binding). For instance, here is such a file:: >WSN HA coding sequence ATGAAGGCAAAACTACTGGTCCTGTTATATGCATTTGTAGCTACAGATGCAGACACAATATGTATAGGCTACCATGCGAACAACTCAACCGACACTGTTGACACAATACTCGAGAAGAATGTGGCAGTGACACATTCTGTTAACCTGCTCGAAGACAGCCACAACGGGAAACTATGTAAATTAAAAGGAATAGCCCCACTACAATTGGGGAAATGTAACATCACCGGATGGCTCTTGGGAAATCCAGAATGCGACTCACTGCTTCCAGCGAGATCATGGTCCTACATTGTAGAAACACCAAACTCTGAGAATGGAGCATGTTATCCAGGAGATCTCATCGACTATGAGGAACTGAGGGAGCAATTGAGCTCAGTATCATCATTAGAAAGATTCGAAATATTTCCCAAGGAAAGTTCATGGCCCAACCACACATTCAACGGAGTAACAGTATCATGCTCCCATAGGGGAAAAAGCAGTTTTTACAGAAATTTGCTATGGCTGACGAAGAAGGGGGATTCATACCCAAAGCTGACCAATTCCTATGTGAACAATAAAGGGAAAGAAGTCCTTGTACTATGGGGTGTTCATCACCCGTCTAGCAGTGATGAGCAACAGAGTCTCTATAGTAATGGAAATGCTTATGTCTCTGTAGCGTCTTCAAATTATAACAGGAGATTCACCCCGGAAATAGCTGCAAGGCCCAAAGTAAGAGATCAACATGGGAGGATGAACTATTACTGGACCTTGCTAGAACCCGGAGACACAATAATATTTGAGGCAACTGGTAATCTAATAGCACCATGGTATGCTTTCGCACTGAGTAGAGGGTTTGAGTCCGGCATCATCACCTCAAACGCGTCAATGCATGAGTGTAACACGAAGTGTCAAACACCCCAGGGAGCTATAAACAGCAATCTCCCTTTCCAGAATATACACCCAGTCACAATAGGAGAGTGCCCAAAATATGTCAGGAGTACCAAATTGAGGATGGTTACAGGACTAAGAAACATCCCATCCATTCAATACAGAGGTCTATTTGGAGCCATTGCTGGTTTTATTGAGGGGGGATGGACTGGAATGATAGATGGATGGTATGGTTATCATCATCAGAATGAACAGGGATCAGGCTATGCAGCGGATCAAAAAAGCACACAAAATGCCATTAACGGGATTACAAACAAGGTGAACTCTGTTATCGAGAAAATGAACACTCAATTCACAGCTGTGGGTAAAGAATTCAACAACTTAGAAAAAAGGATGGAAAATTTAAATAAAAAAGTTGATGATGGGTTTCTGGACATTTGGACATATAATGCAGAATTGTTAGTTCTACTGGAAAATGAAAGGACTTTGGATTTCCATGACTTAAATGTGAAGAATCTGTACGAGAAAGTAAAAAGCCAATTAAAGAATAATGCCAAAGAAATCGGAAATGGGTGTTTTGAGTTCTACCACAAGTGTGACAATGAATGCATGGAAAGTGTAAGAAATGGGACTTATGATTATCCAAAATATTCAGAAGAATCAAAGTTGAACAGGGAAAAGATAGATGGAGTGAAATTGGAATCAATGGGGGTGTATCAGATTCTGGCGATCTACTCAACTGTCGCCAGTTCACTGGTGCTTTTGGTCTCCCTGGGGGCAATCAGTTTCTGGATGTGTTCTAATGGGTCTTTGCAGTGCAGAATATGCATCTGA - ``R1.fasta.gz`` and ``R2.fastq.gz`` are the FASTQ files with the R1 and R2 reads. - The remaining arguments specify where the subamplicons might align For instance, ``1,426,36,38`` means that for the first subamplicon, the first sequenced nucleotide (first one outside primer binding site) aligns to nucleotide 1 of the reference sequence in ``WSN-HA.fasta``, the last sequence nucleotide aligns to nucleotide 426 of the reference sequence, the first nucleotide in R1 that we want to consider in the alignment is site 36 (there is an 8 nucleotide barcode plus a 27-nucleotide primer-binding site), and the first nucleotide in R2 that we want to consider in the alignment is site 38 (there is an 8-nucleotide barcode plus a 29-nucleotide primer-binding site). The arguments ``427,849,32,32``, ``850,1275,31,37``, and ``1276,1698,46,45`` are the similar specifications for the other three subamplicons. Algorithm for assembling and aligning subamplicons ---------------------------------------------------- The algorithm implemented by ``dms_barcodedsubamplicons`` is as follows: 1) Any read pair in which either read fails the Illumina chastity filter is discarded. 2) All sites in each read pair are classified as "high quality" if the site is not an ``N`` (ambiguous nucleotide) and has a Q-score of at least ``--minq``, or as "low quality" otherwise. For all low-quality sites, the called identity in the read is simply set to ``N`` (so we treat a site with a Q-score of less than ``--minq`` as equivalent to an ``N``). 3) We discard any read pairs for which any of the following conditions are met: - There is a low-quality site in the barcode region of either read. - Either read has more than ``--maxlowqfrac`` of its nucleotides that are low quality. 4) We then collect all remaining read pairs for each barcode (concatenating the barcodes on the paired reads to give a total barcode length that is twice ``--barcodelength``) and perform all subsequent operations on these barcodes. 5) Any barcode with less than ``--minreadsperbarcode`` read pairs is discarded. 6) The R1 and R2 reads for each remaining barcode are assembled into consensus sequences. At each site, these consensus sequences have the consensus nucleotide if that nucleotide is found in at least ``--minreadconcurrence`` of the reads; otherwise the nucleotide is set to ``N``. During the process of building these consensus sequences, the barcode is discarded if any of the following conditions are met for either the R1 or R2 read: - The reads cannot all be made the same length by trimming no more ``--maxreadtrim`` from the 3' end. - Any of the reads fail to have at least ``--minreadidentity`` identical and high-quality (not ``N``) nucleotides when compared pairwise. 7) For each remaining barcode, we attempt to align the consensus sequence to each of the subamplicons specified by ``alignspecs``. The attempted alignment does **not** accommodate gaps; the consensus sequence for the reads is required to align gaplessly at the position specified by ``alignspecs``. In any region where the reads overlap (which may happen near the center of the subamplicon), if both consensus sequences (R1 and R2) report high quality nucleotides, the identity is considered ambiguous if the sequences disagree. An alignment is considered valid only if the following conditions are met: - The total fraction of non-high-quality nucleotides in the aligned region exceeds ``--maxlowqfrac``. - The subamplicon alignment has no more than ``--maxmuts`` mutations relative to ``refseq`` when mutations are counted in terms of ``--chartype`` (so if ``--chartype`` is ``codon``, this is the number of **codon** mutations). If the above criteria are met, then the identities (wildtype or mutant) are counted for each site of ``--chartype`` where the are no ambiguous nucleotides in the alignment, and this barcoded subamplicon contributes to the totals reported in the ``counts.txt`` output file. If the alignment criteria are not met, the barcode is discarded as un-alignable. Note that unless you are working with an extremely repetitive gene or specify an extremely large value of ``--maxmuts`` relative to the read length, spurious alignments should not occur. Command-line usage --------------------- .. argparse:: :module: parsearguments :func: BarcodedSubampliconsParser :prog: dms_barcodedsubamplicons outprefix For instance, if ``outprefix`` is ``WSN_HA_`` then the created output files will be ``WSN_HA_counts.txt``, ``WSN_HA_summarystats.txt``, ``WSN_HA.log``, and ``WSN_HA_barcodeinfo.txt.gz`` if you use the ``--barcodeinfo`` option. The format of the output files is described in `Output files`_. alignspecs It is important to set ``alignspecs`` so that you don't count the part of the subamplicon that is in the primer binding site, since the nucleotide identities in this region come from the primers rather than the templates being sequenced. See `Barcoded subamplicon sequencing`_ for an example. Note also that the alignments will fail if you don't set ``alignspecs`` exactly correctly for each subamplicon. So if you have reads failing to align to part of your gene, carefully check ``alignspecs`` to make sure it is correct (you can use the example in `Barcoded subamplicon sequencing`_ to make sure you understand things properly). \-\-barcodelength This is the length of the barcode on **each** primer. So a value of 8 corresponds to a total of 16 ``N`` nucleotides on the two primers. See `Barcoded subamplicon sequencing`_ for an example of primers with 8 ``N`` nucleotides. \-\-chartype If ``chartype`` is ``codon`` then ``refseq`` must provide a valid coding sequence (length a multiple of 3). \-\-maxreadtrim This option exists because it appears that Illumina runs of a given read length don't always return all of the reads with exactly that read length. \-\-purgereadfrac Why would you ever want to purge some of the read pairs? You may be trying to determine whether sequencing to a higher depth (which will give you more barcodes with at least ``--nreadsperbarcode`` reads) will improve the quality of your results. If you set ``--purgereadfrac`` to a value > 0 (say 0.5), you'll see how the results would be affected if you had fewer reads. If these results are noticeably worse by some objective measure, this supports that idea that your read depth might be in regime where greater depth would help. Output files -------------- The formats of the output files are given below in terms of the file suffix that is appended to ``--outprefix``. The data in the first two output files (``counts.txt`` and ``summarystats.txt``) can be used as input to :ref:`dms_summarizealignments` to make plots that summarize the results for several samples. * ``counts.txt`` : This file gives the counts of the different characters of ``--chartype`` in the :ref:`dms_counts` format. So if ``--chartype`` is ``codon``, then these will be the counts of codon identities. Note that files in :ref:`dms_counts` format can be used directly as input to programs like :ref:`dms_inferprefs` and :ref:`dms_inferdiffprefs`. * ``summarystats.txt`` : This text file summarizes the numbers of reads and barcodes parsed. For instance, here is an example:: total read pairs = 4580258 read pairs that fail Illumina filter = 0 low quality read pairs = 915131 barcodes randomly purged = 0 discarded barcodes with 1 reads = 844768 aligned barcodes with 2 reads = 466387 discarded barcodes with 2 reads = 6457 un-alignable barcodes with 2 reads = 40753 aligned barcodes with 3 reads = 262149 discarded barcodes with 3 reads = 5426 un-alignable barcodes with 3 reads = 24891 aligned barcodes with 4 reads = 120828 discarded barcodes with 4 reads = 3281 un-alignable barcodes with 4 reads = 7675 aligned barcodes with 5 reads = 45342 discarded barcodes with 5 reads = 1516 un-alignable barcodes with 5 reads = 2863 aligned barcodes with 6 reads = 14545 discarded barcodes with 6 reads = 577 un-alignable barcodes with 6 reads = 934 aligned barcodes with 7 reads = 4060 discarded barcodes with 7 reads = 203 un-alignable barcodes with 7 reads = 243 aligned barcodes with 8 reads = 996 discarded barcodes with 8 reads = 65 un-alignable barcodes with 8 reads = 52 aligned barcodes with 9 reads = 250 discarded barcodes with 9 reads = 20 un-alignable barcodes with 9 reads = 11 aligned barcodes with 10 reads = 51 discarded barcodes with 10 reads = 3 un-alignable barcodes with 10 reads = 3 aligned barcodes with 11 reads = 10 discarded barcodes with 11 reads = 1 aligned barcodes with 12 reads = 2 * ``.log`` : This is a text file that logs the progress of the program. * ``barcodeinfo.txt.gz`` : This file is only created if the ``--barcodeinfo`` option is used. In this case, the file will give all barcodes and their associated reads, and will explain whether or not the barcode was successfully retained during the alignment process. Memory usage --------------------------- ``dms_barcodedsubamplicons`` stores all of the reads in the FASTQ files in memory. Therefore, it uses a substantial amount of memory, typically around a gigabyte per million paired-end sequencing reads. However, for typical data sets such memory usage is well within the capacity of modern large-memory nodes. .. include:: weblinks.txt