.. _dms2_bcsubamp:

==========================================
``dms2_bcsubamp``
==========================================

.. contents::
   :local:

Overview
-------------
The ``dms2_bcsubamp`` program processes the FASTQ files generated by :ref:`bcsubamp` to count the frequencies of each identity at each site and generate summary statistics.

In general, you will probably have an easier time if you use the :ref:`dms2_batch_bcsubamp` program rather than running ``dms2_bcsubamp`` directly.

.. _bcsubamp_algorithm:

Algorithm for assembling and aligning subamplicons
---------------------------------------------------
The algorithm implemented by ``dms2_bcsubamp`` is as follows:

1) Read pairs are discarded if either read fails the Illumina chastity filter.

2) All sites in each read are classified as "high quality" if the site is not an ``N`` (ambiguous nucleotide) and has a Q-score of at least ``--minq``, or as "low quality" otherwise. At low-quality sites, the identity in the read is then set to ``N`` for the remainder of the analysis.

3) We discard any read pairs for which there are any ``N`` nucleotides in the barcode.

4) We then collect all remaining read pairs for each barcode (concatenating the barcodes on the paired reads for a total barcode length that is twice ``--bclen``) and perform all subsequent operations on these barcodes. Note that if you have different length barcodes on the two reads (or all the barcode on one read), you can use the ``--bclen2`` option below.

5) Any barcode with less than ``--minreads`` read pairs is discarded.

6) The R1 and R2 reads for each barcode are each assembled into consensus sequences. The consensus sequences is built by applying the following algorithm to each site:

    a. We collect the identities for all reads, ignoring ``N`` nucleotides.
    
    b. If there are less than ``--minreads`` called (non ``N``) nucleotides, set the consensus identity for that site to ``N``.
    
    c. If there are at least ``--minreads`` called nucleotides, set the consensus identity to ``N`` if the fraction of called identities that are identical is less than ``--minconcur``, and set it to the consensus called identity if at least ``--minconcur`` of the reads are identical at the site.

7) We attempt to align the consensus read sequences for each barcode to subamplicon positions specified by ``--alignspecs``. The attempted alignment does **not** accommodate gaps; the consensus sequence must align gaplessly at the position specified by ``--alignspecs``. In any region where the reads overlap (which may happen near the center of the subamplicon), if both consensus reads report high quality nucleotides, the identity is considered ambiguous if the sequences disagree. If one consensus read is ambiguous, we take the identity for the non-ambiguous read. The alignment procedure is as follows:

    a. We trim the reads from the 3' end using the ``--R1trim`` and ``--R2trim`` parameters for that subamplicon.

    b. We then attempt to align the consensus reads at each position indicated in ``--alignspecs``. We consider an alignment acceptable if there are no more than ``--maxmuts`` mutations relative to ``refseq`` (with mutations counted in terms of character ``--chartype``) and if the fraction of called nucleotides in the aligned consensus is at least ``--minfraccall``. We take the first valid alignment that we find. But unless you are working with a very repetitive sequence or specify a very large ``--maxmuts``, there should be at most one valid alignment.

8) For each aligned subamplicon, we count the identities at each site. We call an identity if at least ``--minreads`` reads with non-ambiguous identities span that site, and at least ``--minconcur`` of these agree on the identity. If both reads span a site and they disagree, we call that site as ``N``.

9) The results are reported in the `Output files`_.


.. _bcsubamp_commandlineusage:

Command-line usage
----------------------------------------
.. argparse::
   :module: dms_tools2.parseargs
   :func: bcsubampParser
   :prog: dms2_bcsubamp

   \-\-name
    The `Output files`_ will have a prefix equal to the name specified here.
    This name should only contain letters, numbers, dashes, and spaces.
    Underscores are **not** allowed as they are a LaTex special character.

   \-\-alignspecs
    It is important to set ``--alignspecs`` so that you don't count the part of the subamplicon that is the barocde or in the primer binding site, since the nucleotide identities in this region do not come from the templates being sequenced. See :ref:`bcsubamp` for an example.

    The alignments will fail if you don't set ``--alignspecs`` exactly correctly for each subamplicon. So if you have reads failing to align to part of your gene, carefully check ``--alignspecs`` to make sure it is correct.

   \-\-bclen
    This is the length of the barcode on **each** primer. So a value of 8 corresponds to a total of 16 ``N`` nucleotides on the two primers. See :ref:`bcsubamp` for an example of primers with 8 ``N`` nucleotides.

   \-\-R1
    You can specify multiple files using the ``*`` wildcard character, as in ``reads_R1_L*.fastq.gz``.

    If you have multiple files that are all in the same directory, it may be convenient to specify this directory using ``--fastqdir`` and then avoid repeatedly listing the directory name for each file.

   \-\-fastqdir
    This option can be useful if the ``--R1`` / ``--R2`` files are found in a common directory. 
    Instead of repeatedly listing that directory name, you can just provide it here and then give the file (or subdirectory/file) names for the files via ``--R1`` / ``--R2``.

   \-\-R2
    Most pipelines for generating Illumina FASTQ files have the read 1 sequences in a file that contains the string ``_R1`` and the read 2 sequences in a file that contains the string ``_R2``. If this is the case, the R2 file name can just be guessed from the R1 file name. However, you can use this option if your R2 files have a different name that you need to specify manually.

   \-\-R1trim
    Often your reads will be longer than needed, so it is helpful to trim the low-quality nucleotides that tend to be at the end of long readfs.

    If you specify one number, then **all** R1 reads are trimmed to this length. If you specify a list of numbers equal to the entries for ``--alignspecs``, then each read is trimmed differently depending on which subamplicon it aligns to.

    Note that the trimming is done to the initial read, and so the trimmed length specifies here includes the barcode, the primer binding site, and then the part of the read that provides useful sequence information.

   \-\-chartype
    If ``chartype`` is ``codon`` then ``--refseq`` must provide a valid coding sequence (length a multiple of 3).

   \-\-purgeread
    Why would you want to purge some of the read pairs? You may be trying to determine whether sequencing to a higher depth will improve your results. If you set ``--purgeread`` to a value > 0 (say 0.5), you'll see how the results would be affected if you had fewer reads. If these results are noticeably worse, this supports that idea that you might be in regime where more reads would help.

   \-\-purgebc
    This option differs from ``--purgeread`` in that it purges **barcodes** rather than reads. So this gives you some indication of how your results would change if you bottlenecked to fewer unique molecules prior to the round 2 PCR to attach the barcodes to each molecule.

   \-\-bcinfo
    This will be a very large file and creating it will take some time, so only use this option if you need to look at this file for debugging.

.. _bcsubamp_outputfiles:

Output files
--------------
The program produces a variety of output files. 
These files all have the prefix specified by ``--outdir`` and ``--name``.
For instance, if you use ``--outdir results --name sample-1``, then the output files will have the prefix ``./results/sample-1`` and the suffixes described below.

Here are the specific output files:

Log file
+++++++++++
This file has the suffix ``.log``. 
It is a text file that logs the progress of the program.

Read statistics file
++++++++++++++++++++++
This file has the suffix ``_readstats.csv``. 
It gives the statistics on all the processed reads. 
For instance::

    fail filter,low Q barcode,total
    0,785663,12691176

Reads-per-barcode file
+++++++++++++++++++++++++
This file has the suffix ``_readsperbc.csv``.
It gives the number of reads associated with each barcode.
For instance::

    number of reads,number of barcodes
    1,1248294
    2,400579
    3,471893
    4,456661
    5,376158
    6,272020
    7,174295
    8,102660
    9,56302
    10,28079
    11,13125
    12,5747
    13,2518
    14,1035
    15,403
    16,177
    17,78
    18,34
    19,12
    20,9
    21,1
    22,2
    24,1
    25,1
    75,1

Barcode statistics file
+++++++++++++++++++++++++
This file has the suffix ``_bcstats.csv``.
It provides information on how many of the barcodes could be successfully aligned.
For instance::

    aligned,not alignable,too few reads,total
    2129522,232269,1248294,3610085

Counts file
+++++++++++++
This is output file that has the results that you will probably use for subsequent analyses.
It has the suffix ``_codoncounts.csv`` if you are using ``--chartype codon``.
It gives the number of called identities at each site in the sequence, as well as the wildtype sequence.
For instance, here are the first few lines::

    site,wildtype,AAA,AAC,AAG,AAT,ACA,ACC,ACG,ACT,AGA,AGC,AGG,AGT,ATA,ATC,ATG,ATT,CAA,CAC,CAG,CAT,CCA,CCC,CCG,CCT,CGA,CGC,CGG,CGT,CTA,CTC,CTG,CTT,GAA,GAC,GAG,GAT,GCA,GCC,GCG,GCT,GGA,GGC,GGG,GGT,GTA,GTC,GTG,GTT,TAA,TAC,TAG,TAT,TCA,TCC,TCG,TCT,TGA,TGC,TGG,TGT,TTA,TTC,TTG,TTT
    1,ATG,0,0,0,0,0,0,2,0,0,0,0,0,8,0,333985,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    2,AAG,16,20,333132,41,13,12,27,14,8,6,67,8,9,13,29,9,10,11,12,8,10,15,15,11,6,9,3,7,8,10,17,4,3,7,49,7,9,14,9,4,10,7,7,7,9,11,11,5,14,14,11,6,13,16,15,14,9,9,15,8,9,11,8,15
    3,GCA,2,3,8,3,34,11,7,6,7,6,9,8,4,3,5,0,6,14,10,12,6,8,7,10,5,11,7,6,6,1,3,12,19,6,11,9,333250,10,6,9,15,3,5,5,37,9,9,7,8,4,8,3,23,5,7,8,6,11,7,10,7,9,3,6

Detailed barcode information file
++++++++++++++++++++++++++++++++++
This file has the suffix ``_bcinfo.txt.gz``. 
It is a very large gzipped text file that contains information on all the reads and barcodes. 
The format should be self explanatory.
This file is only created if you use the ``--bcinfo`` option, and may be helpful for debugging if your reads aren't aligning as expected.

.. _bcsubamp_memoryusage:

Memory usage
---------------------------
``dms2_bcsubamp`` stores all of the reads in the FASTQ files in memory. Therefore, it uses a substantial amount of memory, typically around a gigabyte per million paired-end sequencing reads. However, for typical data sets such memory usage is well within the capacity of modern large-memory nodes.

.. include:: weblinks.txt