dms_subassemble¶
Overview¶
This program can process the FASTQ reads generated by Subassembly sequencing to subassemble gene variants identified by a unique random barcode sequence at their 3’ end. It provides linkage information across sites.
After you install dms_tools, this program will be available to run at the command line.
Subassembly sequencing¶
Subassembly is a technique that uses short-read sequencing to sequence longer sequences (e.g. full-length genes) by attaching a unique barcode (a string of N nucleotides) to each variant of the sequence, and then associating short reads that together span the entire sequence with this barcode to build up the full sequence. The technique was originally described in Hiatt et al. (2010).
The dms_subassemble program currently works for the case in which the unique barcode is at the 3’ end of the gene, and the fragments to be subassembled are generated by PCR at defined locations. The R1 read captures the barcode, and the R2 read begins at one of the defined PCR locations and captures part of the variant’s sequence.
Here we give an example of the experimental workflow to subassemble recA from E. coli.
In order to subassemble the gene, it is necessary to generate appropriately sized subamplicons that contain the barcode sequence at their 3’ ends (if the subamplicons are too large, then they won’t form clusters efficiently). To do this, the plasmids were cut with XbaI & NdeI or XbaI & BstEII to remove 640 bp or 587 bp upstream of the barcode, respectively. The digested plasmids were then recircularized and used as template for the subamplicons spanning nucleotide positions -1 through 545. Uncut plasmid was used as template for the remaining subamplicons, which span nucleotide positions 481 through the terminator and barcode sequences downstream of the recA gene. Each subamplicon was amplified with a common reverse primer annealing to an Illumina adapter sequence that has been incorporated into the plasmid immediately downstream of the barcode sequence, and with one of 7 different forward primers that anneal to different regions of the recA gene. The forward primers were designed such that the 3’ end of each primer is \(\le 200\) bp from the 3’ end of the next forward primer. The template and primer combinations used to generate the subamplicons are shown below:
Rnd1F-1: 5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTccggcatgacaggagtaaa-3’
Rnd1F172: 5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTcttggggcaggtggtctg-3’
Rnd1F354: 5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTtcgacaacctgctgtgctc-3’
XbaI & NdeI digested recA with coding sequence in caps: 5’-ccggtattacccggcatgacaggagtaaaaATGGCTATCGACGAAAACAAACAGAAAGCGTTGGCGGCAGCACTGGGCCAGATTGAGAAACAATTTGGTAAAGGCTCCATCATGCGCCTGGGTGAAGACCGTTCCATGGATGTGGAAACCATCTCTACCGGTTCGCTTTCACTGGATATCGCGCTTGGGGCAGGTGGTCTGCCGATGGGCCGTATCGTCGAAATCTACGGACCGGAATCTTCCGGTAAAACCACGCTGACGCTGCAGGTGATCGCCGCAGCGCAGCGTGAAGGTAAAACCTGTGCGTTTATCGATGCTGAACACGCGCTGGACCCAATCTACGCACGTAAACTGGGCGTCGATATCGACAACCTGCTGTGCTCCCAGCCGGACACCGGCGAGCAGGCACTGGAAATCTGTGACGCCCTGGCGCGTTCTGGCGCAGTAGACGTTATCGTCGTTGACTCCGTGGCGGCACTGACGCCGAAAGCGGAAATCGAAGGCGAAATCGGCGACTCTCATActagaNNNNNNNNNNNNNNNNNNagatcggaagagcgtcgtgtagggaaagagtgt-3’
XbaI & NdeI digested recA with coding sequence in caps: 3’-ggccataatgggccgtactgtcctcattttTACCGATAGCTGCTTTTGTTTGTCTTTCGCAACCGCCGTCGTGACCCGGTCTAACTCTTTGTTAAACCATTTCCGAGGTAGTACGCGGACCCACTTCTGGCAAGGTACCTACACCTTTGGTAGAGATGGCCAAGCGAAAGTGACCTATAGCGCGAACCCCGTCCACCAGACGGCTACCCGGCATAGCAGCTTTAGATGCCTGGCCTTAGAAGGCCATTTTGGTGCGACTGCGACGTCCACTAGCGGCGTCGCGTCGCACTTCCATTTTGGACACGCAAATAGCTACGACTTGTGCGCGACCTGGGTTAGATGCGTGCATTTGACCCGCAGCTATAGCTGTTGGACGACACGAGGGTCGGCCTGTGGCCGCTCGTCCGTGACCTTTAGACACTGCGGGACCGCGCAAGACCGCGTCATCTGCAATAGCAGCAACTGAGGCACCGCCGTGACTGCGGCTTTCGCCTTTAGCTTCCGCTTTAGCCGCTGAGAGTATgatctNNNNNNNNNNNNNNNNNNtctagccttctcgcagcacatccctttctcaca-5’
Rnd1R: 3’-TCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5’
Rnd1F-1: 5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTccggcatgacaggagtaaa-3’
Rnd1F172: 5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTcttggggcaggtggtctg-3’
Rnd1F354: 5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTtcgacaacctgctgtgctc-3’
XbaI & BstEII digested recA with coding sequence in caps: 5’-ccggtattacccggcatgacaggagtaaaaATGGCTATCGACGAAAACAAACAGAAAGCGTTGGCGGCAGCACTGGGCCAGATTGAGAAACAATTTGGTAAAGGCTCCATCATGCGCCTGGGTGAAGACCGTTCCATGGATGTGGAAACCATCTCTACCGGTTCGCTTTCACTGGATATCGCGCTTGGGGCAGGTGGTCTGCCGATGGGCCGTATCGTCGAAATCTACGGACCGGAATCTTCCGGTAAAACCACGCTGACGCTGCAGGTGATCGCCGCAGCGCAGCGTGAAGGTAAAACCTGTGCGTTTATCGATGCTGAACACGCGCTGGACCCAATCTACGCACGTAAACTGGGCGTCGATATCGACAACCTGCTGTGCTCCCAGCCGGACACCGGCGAGCAGGCACTGGAAATCTGTGACGCCCTGGCGCGTTCTGGCGCAGTAGACGTTATCGTCGTTGACTCCGTGGCGGCACTGACGCCGAAAGCGGAAATCGAAGGCGAAATCGGCGACTCTCATATGGGCCTTGCGGCACGTATGATGAGCCAGGCGATGCGTAAGCTGGCGGGTAACctagaNNNNNNNNNNNNNNNNNNagatcggaagagcgtcgtgtagggaaagagtgt-3’
XbaI & BstEII digested recA with coding sequence in caps: 3’-ggccataatgggccgtactgtcctcattttTACCGATAGCTGCTTTTGTTTGTCTTTCGCAACCGCCGTCGTGACCCGGTCTAACTCTTTGTTAAACCATTTCCGAGGTAGTACGCGGACCCACTTCTGGCAAGGTACCTACACCTTTGGTAGAGATGGCCAAGCGAAAGTGACCTATAGCGCGAACCCCGTCCACCAGACGGCTACCCGGCATAGCAGCTTTAGATGCCTGGCCTTAGAAGGCCATTTTGGTGCGACTGCGACGTCCACTAGCGGCGTCGCGTCGCACTTCCATTTTGGACACGCAAATAGCTACGACTTGTGCGCGACCTGGGTTAGATGCGTGCATTTGACCCGCAGCTATAGCTGTTGGACGACACGAGGGTCGGCCTGTGGCCGCTCGTCCGTGACCTTTAGACACTGCGGGACCGCGCAAGACCGCGTCATCTGCAATAGCAGCAACTGAGGCACCGCCGTGACTGCGGCTTTCGCCTTTAGCTTCCGCTTTAGCCGCTGAGAGTATACCCGGAACGCCGTGCATACTACTCGGTCCGCTACGCATTCGACCGCCCATTGgatctNNNNNNNNNNNNNNNNNNtctagccttctcgcagcacatccctttctcaca-5’
Rnd1R: 3’-TCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5’
Rnd1F481: 5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTggaaatcgaaggcgaaatc-3’
Rnd1F633: 5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTgcaacccggaaaccactac-3’
Rnd1F809: 5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTcctctacggcgaaggtatca-3’
Rnd1F941: 5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTtgcctggctgaaagataacc-3’
E. coli recA with coding sequence in caps: 5’-ccggtattacccggcatgacaggagtaaaaATGGCTATCGACGAAAACAAACAGAAAGCGTTGGCGGCAGCACTGGGCCAGATTGAGAAACAATTTGGTAAAGGCTCCATCATGCGCCTGGGTGAAGACCGTTCCATGGATGTGGAAACCATCTCTACCGGTTCGCTTTCACTGGATATCGCGCTTGGGGCAGGTGGTCTGCCGATGGGCCGTATCGTCGAAATCTACGGACCGGAATCTTCCGGTAAAACCACGCTGACGCTGCAGGTGATCGCCGCAGCGCAGCGTGAAGGTAAAACCTGTGCGTTTATCGATGCTGAACACGCGCTGGACCCAATCTACGCACGTAAACTGGGCGTCGATATCGACAACCTGCTGTGCTCCCAGCCGGACACCGGCGAGCAGGCACTGGAAATCTGTGACGCCCTGGCGCGTTCTGGCGCAGTAGACGTTATCGTCGTTGACTCCGTGGCGGCACTGACGCCGAAAGCGGAAATCGAAGGCGAAATCGGCGACTCTCATATGGGCCTTGCGGCACGTATGATGAGCCAGGCGATGCGTAAGCTGGCGGGTAACCTGAAGCAGTCCAACACGCTGCTGATCTTCATCAACCAGATCCGTATGAAAATTGGTGTGATGTTCGGCAACCCGGAAACCACTACCGGTGGTAACGCGCTGAAATTCTACGCCTCTGTTCGTCTCGACATCCGTCGTATCGGCGCGGTGAAAGAGGGCGAAAACGTGGTGGGTAGCGAAACCCGCGTGAAAGTGGTGAAGAACAAAATCGCTGCGCCGTTTAAACAGGCTGAATTCCAGATCCTCTACGGCGAAGGTATCAACTTCTACGGCGAACTGGTTGACCTGGGCGTAAAAGAGAAGCTGATCGAGAAAGCAGGCGCGTGGTACAGCTACAAAGGTGAGAAGATCGGTCAGGGTAAAGCGAATGCGACTGCCTGGCTGAAAGATAACCCGGAAACCGCGAAAGAGATCGAGAAGAAAGTACGTGAGTTGCTGCTGAGCAACCCGAACTCAACGCCGGATTTCTCTGTAGATGATAGCGAAGGCGTAGCAGAAACTAACGAAGATTTTTAAtcgtcttgtttgatacacaagggtcgcatctgcggcccttttgcttttttaagttgtaaggatatgccattctagaNNNNNNNNNNNNNNNNNNagatcggaagagcgtcgtgtagggaaagagtgt-3’
E. coli recA with coding sequence in caps: 3’-ggccataatgggccgtactgtcctcattttTACCGATAGCTGCTTTTGTTTGTCTTTCGCAACCGCCGTCGTGACCCGGTCTAACTCTTTGTTAAACCATTTCCGAGGTAGTACGCGGACCCACTTCTGGCAAGGTACCTACACCTTTGGTAGAGATGGCCAAGCGAAAGTGACCTATAGCGCGAACCCCGTCCACCAGACGGCTACCCGGCATAGCAGCTTTAGATGCCTGGCCTTAGAAGGCCATTTTGGTGCGACTGCGACGTCCACTAGCGGCGTCGCGTCGCACTTCCATTTTGGACACGCAAATAGCTACGACTTGTGCGCGACCTGGGTTAGATGCGTGCATTTGACCCGCAGCTATAGCTGTTGGACGACACGAGGGTCGGCCTGTGGCCGCTCGTCCGTGACCTTTAGACACTGCGGGACCGCGCAAGACCGCGTCATCTGCAATAGCAGCAACTGAGGCACCGCCGTGACTGCGGCTTTCGCCTTTAGCTTCCGCTTTAGCCGCTGAGAGTATACCCGGAACGCCGTGCATACTACTCGGTCCGCTACGCATTCGACCGCCCATTGGACTTCGTCAGGTTGTGCGACGACTAGAAGTAGTTGGTCTAGGCATACTTTTAACCACACTACAAGCCGTTGGGCCTTTGGTGATGGCCACCATTGCGCGACTTTAAGATGCGGAGACAAGCAGAGCTGTAGGCAGCATAGCCGCGCCACTTTCTCCCGCTTTTGCACCACCCATCGCTTTGGGCGCACTTTCACCACTTCTTGTTTTAGCGACGCGGCAAATTTGTCCGACTTAAGGTCTAGGAGATGCCGCTTCCATAGTTGAAGATGCCGCTTGACCAACTGGACCCGCATTTTCTCTTCGACTAGCTCTTTCGTCCGCGCACCATGTCGATGTTTCCACTCTTCTAGCCAGTCCCATTTCGCTTACGCTGACGGACCGACTTTCTATTGGGCCTTTGGCGCTTTCTCTAGCTCTTCTTTCATGCACTCAACGACGACTCGTTGGGCTTGAGTTGCGGCCTAAAGAGACATCTACTATCGCTTCCGCATCGTCTTTGATTGCTTCTAAAAATTagcagaacaaactatgtgttcccagcgtagacgccgggaaaacgaaaaaattcaacattcctatacggtaagatctNNNNNNNNNNNNNNNNNNtctagccttctcgcagcacatccctttctcaca-5’
Rnd1R: 3’-TCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5’
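The spacing rule for the forward primers can be checked with a few lines of Python. This is only a sketch: it uses the read start sites reported in the example output files of this documentation (1, 172, 354, 481, 633, 809, 941) and assumes that each start site lies immediately 3’ of one forward primer, so that the distance between consecutive start sites equals the distance between consecutive primer 3’ ends. The function name is hypothetical.

```python
# Read start sites from the example output in this documentation; each is
# ASSUMED to lie immediately 3' of one forward primer.
start_sites = [1, 172, 354, 481, 633, 809, 941]
MAX_SPACING = 200  # design limit stated in the text (bp)


def primer_spacings(sites):
    """Return the distances between consecutive start sites (sorted)."""
    ordered = sorted(sites)
    return [b - a for a, b in zip(ordered, ordered[1:])]


spacings = primer_spacings(start_sites)
for site, gap in zip(sorted(start_sites), spacings):
    print(f"site {site}: next start site is {gap} bp downstream")
print("all within limit:", all(gap <= MAX_SPACING for gap in spacings))
```

Under that assumption, every gap in this example is below the 200 bp limit, which is what lets an R2 read of \(\ge 200\) nucleotides reach the next subamplicon.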
After these subamplicons are generated in the Round 1 PCR, a second round of PCR is used to add the remaining Illumina adapter sequences and TruSeq index sequences for multiplexing. For example, here are two primers used in the Round 2 PCR reactions:
>Rnd2IndexF: forward primer for the Round 2 PCR that adds the Illumina adaptor (uppercase) and a TruSeq index, shown here as 6 “n” nucleotides
5’-CAAGCAGAAGACGGCATACGAGATnnnnnngtgactggagttcagacgtgtgctcttcc-3’
>Rnd2R: reverse primer for the Round 2 PCR that adds the Illumina adaptor (uppercase)
5’-AATGATACGGCGACCACCGAGATCTacactctttccctacacgacgctcttccgatct-3’
The products of the Round 2 PCR are sequenced with asymmetric paired-end reads with Read1 sequencing the barcode (\(\ge 18\) nucleotides), and with Read2 of sufficient length to overlap with the next subamplicon (\(\ge 200\) nucleotides). The sequencing data are analyzed with the standard Illumina pipeline to generate FASTQ files containing the R1 and R2 reads.
Subassembly algorithm¶
dms_subassemble implements the following algorithm:
1. The paired R1 and R2 reads are read from r1files and r2files, and any nucleotide that has a Q score < --minq is converted to an N (i.e. considered ambiguous).
2. The R1 read is only processed to extract the barcode (the first --barcodelength nucleotides), and any remaining nucleotides in R1 are discarded. The read pair is then discarded if:
   - There are any N nucleotides in the barcode.
   - The fraction of N nucleotides in the trimmed R2 read is > --maxlowqfrac.
3. An attempt is made to gaplessly align the trimmed R2 read at each site specified by alignspecs. Briefly, alignspecs specifies pairs of numbers REFSEQSTART and R2START. For each R2 read, we try to align the read so that its nucleotide R2START (1, 2, … numbering) is placed at nucleotide REFSEQSTART in refseq. If --trimR2 has its default value of auto, we only try to align up to where the next subamplicon would be (effectively trimming the unneeded part of R2 from its 3’ end). If the read gaplessly aligns with no more than --maxmuts mutations of character type --chartype, the alignment is considered successful. Reads that fail to align are written to <outprefix>_unaligned.txt (depending on the value of --no_write_unaligned) and then discarded. Otherwise, reads are retained.
4. After collecting all alignable trimmed R2 reads for a barcode, we then see if we have enough coverage to subassemble the gene. We call identities in the gene only if the character (which may be a codon character rather than a nucleotide character, see --chartype) has at least --minreadspersite non-ambiguous (not N) reads covering it and \(\ge\) --minreadconcurrence of these reads concur on its identity.
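The per-read part of the algorithm (quality masking, barcode extraction, filtering, and gapless alignment) can be sketched in Python as below. This is not the dms_tools source code. The function names are hypothetical, quality scores are passed as already-decoded integers, the --trimR2 trimming is omitted, and the handling of partially covered codons and of codons containing N (both skipped here) is an assumption rather than documented behavior.

```python
def mask_low_quality(seq, quals, minq=20):
    """Replace nucleotides whose Q score is < minq with N."""
    return "".join(nt if q >= minq else "N" for nt, q in zip(seq, quals))


def codon_mismatches(r2, refseq, refseqstart, r2start):
    """Count mismatched codons in a gapless alignment.

    Nucleotide r2start of r2 is placed at nucleotide refseqstart of refseq
    (both 1-based). ASSUMPTION: only codons fully covered by the read are
    compared, and read codons containing N are skipped.
    """
    aligned = r2[r2start - 1:]
    offset = refseqstart - 1           # 0-based refseq index of aligned[0]
    end = min(len(refseq), offset + len(aligned))
    icodon = -(-offset // 3)           # first fully covered codon (ceiling)
    nmuts = 0
    while 3 * icodon + 3 <= end:
        ref_codon = refseq[3 * icodon: 3 * icodon + 3]
        start = 3 * icodon - offset
        read_codon = aligned[start: start + 3]
        if "N" not in read_codon and read_codon != ref_codon:
            nmuts += 1
        icodon += 1
    return nmuts


def process_pair(r1, q1, r2, q2, refseq, alignspecs,
                 barcodelength=18, minq=20, maxlowqfrac=0.15, maxmuts=4):
    """Return (barcode, refseqstart, aligned R2), or None if discarded."""
    r1 = mask_low_quality(r1, q1, minq)
    r2 = mask_low_quality(r2, q2, minq)
    barcode = r1[:barcodelength]       # the rest of R1 is ignored
    if "N" in barcode or not r2 or r2.count("N") / len(r2) > maxlowqfrac:
        return None
    for refseqstart, r2start in alignspecs:    # try each allowed site
        if codon_mismatches(r2, refseq, refseqstart, r2start) <= maxmuts:
            return barcode, refseqstart, r2[r2start - 1:]
    return None


# Toy data (hypothetical, far shorter than real reads).
refseq = "ATGGCTATCGACGAA"
r1, q1 = "ACGTACGG", [30] * 8
r2, q2 = "CCCGCTATAGACGAA", [30] * 15   # 3-nt primer region, then one mutant codon
print(process_pair(r1, q1, r2, q2, refseq, [(4, 4)],
                   barcodelength=6, maxmuts=1))
```

In the toy call, the first three nucleotides of R2 stand in for the primer-derived region, so the read is aligned from its fourth nucleotide onward and the single ATC to ATA codon change counts as one mutation.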
Command-line usage¶
Subassemble barcoded variants with linkage from short-read sequences. R1 has the barcode, R2 has the sequence fragment. This script is part of dms_tools (version 1.1.20) written by the Bloom Lab (see https://github.com/jbloomlab/dms_tools/graphs/contributors for all contributors). Detailed documentation is at http://jbloomlab.github.io/dms_tools/
usage: dms_subassemble [-h] [--barcodelength BARCODELENGTH]
[--trimR2 {auto,none}] [--minq MINQ]
[--maxlowqfrac MAXLOWQFRAC] [--maxmuts MAXMUTS]
[--minreadspersite MINREADSPERSITE]
[--minreadconcurrence MINREADCONCURRENCE]
[--chartype {codon}] [--no_write_barcode_reads]
[--purgefrac PURGEFRAC] [-v]
outprefix refseq r1files r2files alignspecs
[alignspecs ...]
Positional Arguments¶
outprefix | Prefix for output files. See Output files for a list of the created files. |
refseq | Existing FASTA file containing wildtype gene we are subassembling. This file should specify a valid in-frame coding sequence. |
r1files | Comma-separated list of R1 FASTQ files (no spaces). Files can optionally be gzipped (extension .gz). |
r2files | Like ‘r1files’ but for R2. Must be same number of comma-separated entries as for ‘r1files’. |
alignspecs | This argument is repeated to specify each possible alignment location for R2. Each specification is two comma-delimited integers (no spaces): ‘REFSEQSTART,R2START’. REFSEQSTART is the nucleotide (1, 2, … numbering) in ‘refseq’ where nucleotide R2START in R2 aligns. It is important to set alignspecs so that you don’t count the part of the subamplicon that is in the primer binding site, since the nucleotide identities in this region come from the primers rather than the templates being sequenced. Typically, R2START would be one greater than the length of the gene-binding region of the primer to avoid this. The alignments will fail if you don’t set alignspecs exactly correctly, as the program only tries gapless alignment. |
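As an illustration of the rule for choosing R2START, the sketch below derives one alignspecs entry from a forward primer. It assumes that R2 begins at the first nucleotide of the primer's gene-binding region (written in lowercase in the primer listings above) and that this region lies entirely within refseq. The helper name is hypothetical, and only the first 180 nucleotides of the recA coding sequence are included.

```python
# First 180 nt of the recA coding sequence shown earlier, in blocks of 10.
REFSEQ_PREFIX = (
    "ATGGCTATCG" "ACGAAAACAA" "ACAGAAAGCG" "TTGGCGGCAG" "CACTGGGCCA"
    "GATTGAGAAA" "CAATTTGGTA" "AAGGCTCCAT" "CATGCGCCTG" "GGTGAAGACC"
    "GTTCCATGGA" "TGTGGAAACC" "ATCTCTACCG" "GTTCGCTTTC" "ACTGGATATC"
    "GCGCTTGGGG" "CAGGTGGTCT" "GCCGATGGGC"
)

# Rnd1F172: adapter in uppercase, gene-binding region in lowercase.
RND1F172 = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTcttggggcaggtggtctg"


def alignspec_for_primer(primer, refseq):
    """Return (REFSEQSTART, R2START) for a primer binding within refseq.

    ASSUMPTION: R2 starts at the first nucleotide of the lowercase
    gene-binding region, so the first template-derived nucleotide in R2 is
    the one immediately after that region.
    """
    binding = "".join(c for c in primer if c.islower()).upper()
    idx = refseq.find(binding)
    if idx < 0:
        raise ValueError("gene-binding region not found in refseq")
    r2start = len(binding) + 1            # skip primer-derived nucleotides
    refseqstart = idx + len(binding) + 1  # 1-based site just 3' of the primer
    return refseqstart, r2start


refseqstart, r2start = alignspec_for_primer(RND1F172, REFSEQ_PREFIX)
print(f"{refseqstart},{r2start}")
```

For Rnd1F172 this gives a start site of 172, matching the site names used in the example output files. A primer such as Rnd1F-1, which binds partly upstream of the coding sequence, would need its entry worked out by hand.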
Named Arguments¶
--barcodelength | Length of barcode (NNN…) which starts at beginning of R1. Default: 18 |
--trimR2 | Possible choices: auto, none Trim R2 read? If ‘auto’, trim until start of next subamplicon specified by ‘alignspecs’; if ‘none’ no trimming. Default: “auto” |
--minq | Nucleotides with Q scores < this number are converted to N. Default: 20 |
--maxlowqfrac | Only retain if fraction of N nucleotides in R2 read <= this. Default: 0.15 |
--maxmuts | Only align read if <= this many mismatches with “refseq” counted in terms of “chartype”. Default: 4 |
--minreadspersite | Only call a site for a barcode when >= this many reads give it a non-ambiguous identity. Default: 2 |
--minreadconcurrence | Only call sites when >= this fraction of reads concur. Default: 0.75 |
--chartype | Possible choices: codon Character for which we are counting mutations. Currently “codon” is the only allowed value (in the future “nucleotide” might be added). Default: “codon” |
--no_write_barcode_reads | Don’t write all barcodes and assigned reads to a file (saves space/time to not do this). Default: False. If you set this option, then the program does not create the All reads by barcode file. If you are not debugging, you may not want this file as it is very large. |
--purgefrac | Randomly purge reads with this probability (subsample the data). Default: 0 |
-v, --version | show program’s version number and exit |
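The way --minreadspersite and --minreadconcurrence combine when calling a site can be illustrated with a short Python sketch. This is one plausible reading of the rule rather than the program's actual code. The function name is hypothetical, and treating any codon that contains an N as ambiguous is an assumption.

```python
from collections import Counter


def call_site(chars, minreadspersite=2, minreadconcurrence=0.75):
    """Return the consensus character at one site, or None if uncallable.

    chars holds the characters (codons here) observed at this site across
    all aligned reads for one barcode. ASSUMPTION: a codon containing any N
    counts as ambiguous and is ignored.
    """
    unambiguous = [c for c in chars if "N" not in c]
    if len(unambiguous) < minreadspersite:
        return None                      # not enough coverage
    char, count = Counter(unambiguous).most_common(1)[0]
    if count / len(unambiguous) >= minreadconcurrence:
        return char                      # enough reads agree
    return None                          # reads disagree too much


print(call_site(["GGC", "GGC", "GGC", "GCA"]))  # 3 of 4 reads agree
print(call_site(["GGC", "GCA"]))                # only half agree
print(call_site(["GGC", "NNC"]))                # one unambiguous read
```

With the default thresholds, the first site is called because exactly 75% of its reads agree, while the other two are left uncalled.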
Output files¶
The following output files are created. Each has the prefix specified by outprefix with the following suffixes.
Subassembled variants file¶
This file has the suffix _subassembled_variants.txt. It is the primary output file, and lists all barcodes that can be subassembled according to the parameters passed to dms_subassemble.
It is in the Subassembly file format. Note that for codon sequences (--chartype of codon), the mutations are numbered according to the codon position in 1, 2, … numbering, not the nucleotide position.
Here are a few example lines:
TATTACATCTGCCCCCAA ATGGCTATCGACGAAAACAAACAGAAAGCGTTGGCGGCAGCACTGGGCCAGATTGAGAAACAATTTGGTAAAGGCTCCATCATGCGCCTGGGTGAAGACCGTTCCATGGATGTGGAAACCATCTCTACCGGTTCGCTTTCACTGGATATCGCGCTTGGGGCAGGTGGTCTGCCGATGGGCCGTATCGTCGAAATCTACGGACCGGAATCTTCCGGTAAAACCACGCTGACGCTGCAGGTGATCGCCGCAGCGCAGCGTGAAGGTAAAACCTGTGCGTTTATCGATGCTGAACACGCGCTGGACCCAATCTACGCACGTAAACTGGCAGTCGATATCGACAACCTGCTGTGCTCCCAGCCGGACACCGGCGAGCAGGCACTGGAAATCTGTGACGCCCTGGCGCGTTCTGGCGCAGTAGACGTTATCGTCGTTGACTCCGTGGCGGCACTGACGCCGAAAGCGGAAATCGAAGGCGAAATCGGCGACTCTCATATGGGCCTTGCGGCACGTATGATGAGCCAGGCGATGCGTAAGCTGGCGGGTAACCTGAAGCAGTCCAACACGCTGCTGATCTTCATCAACCAGATCCGTATGAAAATTGGTGTGATGTTCGGCAACCCGGAAACCACTACCGGTGGTAACGCGCTGAAATTCTACGCCTCTGTTCGTCTCGACATCCGTCGTATCGGCGCGGTGAAAGAGGGCGAAAACGTGGTGGGTAGCGAAACCCGCGTGAAAGTGGTGAAGAACAAAATCGCTGCGCCGTTTAAACAGGCTGAATTCCAGATCCTCTACGGCGAAGGTATCAACTTCTACGGCGAACTGGTTGACCTGGGCGTAAAAGAGAAGCTGATCGAGAAAGCAGGCGCGTGGTACAGCTACAAAGGTGAGAAGATCGGTCAGGGTAAAGCGAATGCGACTGCCTGGCTGAAAGATAACCCGGAAACCGCGAAAGAGATCGAGAAGAAAGTACGTGAGTTGCTGCTGAGCAACCCGAACTCAACGCCGGATTTCTCTGTAGATGATAGCGAAGGCGTAGCAGAAACTAACGAAGATTTTTAA GGC109GCA
AACTCTGTGTTCCCATCA ATGGCTATCGACGAAAACAAACAGAAAGCGTTGGCGGCAGCACTGGGCCAGATTGAGAAACAATTTGGTAAAGGCTCCATCATGCGCCTGGGTGAAGACCGTTCCATGGATGTGGAAACCATCTCTACCGGTTCGCTTTCACTGGATATCGCGCTTGGGGCAGGTGGTCTGCCGATGGGCCGTATCGTCGAAATCTACGGACCGGAATCTTCCGGTAAAACCACGCTGACGCTGCAGGTGATCGCCGCAGCGCAGCGTGAAGGTAAAACCTGTGCGTTTATCGATGCTGAACACGCGCTGGACCCAATCTACGCACGTAAACTGGGCGTCGATATCGACAACCTGCTGTGCTCCCAGCCGGACACCGGCGAGCAGGCACTGGAAATCTGTGACGCCCTGGCGCGTTCTGGCGCAGTAGACGTTATCGTCGTTGACTCCGTGGCGGCACTGACGCCGAAAGCGGAAATCGAAGGCGAAATCGGCGACTCTCATATGGGCCTTGCGGCACGTATGATGAGCCAGGCGATGCGTAAGCTGGCGGGTAACCTGAAGCAGTCCAACACGCTGCTGATCTTCATCAACCAGATCCGTATGAAAATTGGTGTGATGTTCGGCAACCCGGAAACCACTACCGGTGGTAACGCGCTGAAATTCTACGCCTCTGTTCGTCTCGACATCCGTCGTATCGGCGCGGTGAAAGAGGGCGAAAACGTGGTGGGTAGCGAAACCCGCGTGAAAGTGGTGAAGAACAAAATCGCTGCGCCGTTTAAACAGGCTGAATTCCAGATCCTCTACGGCGAAGGTATCAACTTCTACGGCGAACTGGTTGACCTGGGCGTAAAAGAGAAGCTGATCGAGAAAGCAGGCGCGTGGTACAGCTACAAAGGTGAGAAGATCGGTCAGGGTAAAGCGAATGCGACTGCCTGGCTGAAAGATAACCCGGAAACCGCGAAAGAGATCGAGAAGAAAGTACGTGAGTTGCTGCTGAGCAACCCGAACTCAACGCCGGATTTCTCTGTAGATGATAGCGAAGGCGTAGCAGAAACTAACGAAGATTTTTAA no_mutations
GTTAACCGATCAACGCAA ATGGCTATCGACGAAAACAAACAGAAAGCGTTGGCGGCAGCACTGGGCCAGATTGAGAAACAATTTGGTAAAGGCTCCATCATGCGCCTGGGTGAAGACCGTTCCATGGATGTGGAAACCATCTCTACCGGTTCGCTTGCACTGGATATCGCGCTTGGGGCAGGTGGTCTGCCGATGGGCCGTATCGTCGAAATCTACGGACCGGAATCTTCCGGTAAAACCACGCTGACGCTGCAGGTGATCGCCGCAGCGCAGCGTGAAGGTAAAACCTGTGCGTTTATCGATGCTGAACACGCGCTGGACCCAATCTACGCACGTAAACTGGGCGTCGATATCGACAACCTGCTGTGCTCCCAGCCGGACACCGGCGAGCAGGCACTGGAAATCTGTGACGCCCTGGCGCGTTCTGGCGCAGTAGACGTTATCGTCGTTGACTCCGTGGCGGCACTGACGCCGAAAGCGGAAATCGAAGGCGAAATCGGCGACTCTCATATGGGCCTTGCGGCACGTATGATGAGCCAGGCGATGCGTAAGCTGGCGGGTAACCTGAAGCAGTCCAACACGCTGCTGATCTTCATCAACCAGATCCGTATGAAAATTGGTGTGATGTTCGGCAACCCGGAAACCACTACCGGTGGTAACGCGCTGAAATTCTACGCCTCTGTTCGTCTCGACATCCGTCGTATCGGGGCGGTGAAAGAGGGCGAAAACGTGGTGGGTAGCGAAACCCGCGTGAAAGTGGTGAAGAACAAAATCGCTGCGCCGTTTAAACAGGCTGAATTCCAGATCCTCTACGGCGAAGGTATCAACTTCTACGGCGAACTGGTTGACCTGGGCGTAAAAGAGAAGCTGATCGAGAAAGCAGGCGCGTGGTACAGCTACAAAGGTGAGAAGATCGGTCAGGGTAAAGCGAATGCGACTGCCTGGCTGAAAGATAACCCGGAAACCGCGAAAGAGATCGAGAAGAAAGTACGTGAGTTGCTGCTGAGCAACCCGAACTCAACGCCGGATTTCTCTGTAGATGATAGCGAAGGCGTAGCAGAAACTAACGAAGATTTTTAA TCA47GCA,GGC230GGG
Each line lists a barcode, then the sequence subassembled for that barcode, and finally any mutations relative to refseq.
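A line of this file can be parsed with a few lines of Python. The sketch below assumes that the three fields are separated by whitespace and that codon mutations take the form wildtype codon, codon number, mutant codon (as in GGC109GCA). The function name is hypothetical, and the sequence in the usage example is truncated to a placeholder for brevity.

```python
import re

# Codon mutation such as GGC109GCA: wildtype codon, 1-based codon number,
# mutant codon.
MUT_RE = re.compile(r"^([ACGT]{3})(\d+)([ACGT]{3})$")


def parse_variant_line(line):
    """Split one line into (barcode, sequence, list of mutations)."""
    barcode, seq, muts = line.split()   # ASSUMES whitespace-delimited fields
    if muts == "no_mutations":
        return barcode, seq, []
    parsed = []
    for mut in muts.split(","):
        match = MUT_RE.match(mut)
        if match is None:
            raise ValueError(f"unrecognized mutation string: {mut}")
        wildtype, site, mutant = match.groups()
        parsed.append((wildtype, int(site), mutant))
    return barcode, seq, parsed


# The sequence field is shortened to a placeholder here; real lines carry
# the full subassembled gene.
example = "GTTAACCGATCAACGCAA ATGGCT TCA47GCA,GGC230GGG"
barcode, seq, mutations = parse_variant_line(example)
print(barcode, mutations)
```

Because the codon numbers are 1-based, the nucleotides of codon n sit at positions 3n - 2 through 3n of the sequence field.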
All reads by barcode file¶
This file has the suffix _all_reads_by_barcode.txt. It is a very large text file that lists each read that matches each barcode (both those successfully subassembled and those that are not). It also explains why each barcode was or was not subassembled.
This file is not created if you set the --no_write_barcode_reads option.
Summary statistics file¶
This file has the suffix _summarystats.txt. It lists summary statistics about the subassembly. Here is an example:
barcodes (total) = 642641
barcodes successfully subassembled = 98281
barcodes with at least one alignable read = 615204
read pairs (total) = 29611710
read pairs aligned at site 1 = 2442839
read pairs aligned at site 172 = 1944582
read pairs aligned at site 354 = 2794256
read pairs aligned at site 481 = 2352744
read pairs aligned at site 633 = 2982037
read pairs aligned at site 809 = 3024340
read pairs aligned at site 941 = 2758838
read pairs purged due to low quality = 8965508
read pairs that are alignable = 18299636
read pairs that are alignable and map to a subassembled barcode = 8068574
read pairs that are unalignable = 2346566
read pairs that fail Illumina filter = 0
sites with insufficient concurrence due to mismatch between mutant and wildtype characters = 52936
sites with insufficient concurrence due to mismatch between two mutant characters = 68862
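The summary statistics are easy to load into a dictionary for quick sanity checks. The sketch below assumes that every line has the form "description = integer", as in the example above. The function name is hypothetical, and only a subset of the example lines is included.

```python
def parse_summary_stats(text):
    """Parse 'description = integer' lines into a dict."""
    stats = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = line.rsplit("=", 1)
        stats[key.strip()] = int(value)
    return stats


# Subset of the example lines shown above.
example = """\
barcodes (total) = 642641
barcodes successfully subassembled = 98281
read pairs (total) = 29611710
read pairs purged due to low quality = 8965508
read pairs that are alignable = 18299636
read pairs that are unalignable = 2346566
read pairs that fail Illumina filter = 0
"""

stats = parse_summary_stats(example)
frac = stats["barcodes successfully subassembled"] / stats["barcodes (total)"]
accounted = sum(stats[key] for key in (
    "read pairs purged due to low quality",
    "read pairs that are alignable",
    "read pairs that are unalignable",
    "read pairs that fail Illumina filter",
))
print(f"fraction of barcodes subassembled: {frac:.3f}")
print("read pairs accounted for:", accounted == stats["read pairs (total)"])
```

In this example, roughly 15% of all barcodes are subassembled, and the four read-pair categories add up to the total number of read pairs.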
Alignable reads per barcode file¶
This file has the suffix _alignablereadsperbarcode.txt. It gives the distribution of the number of alignable reads per barcode. Here is an example of the first few lines:
nreads nbarcodes
0 27437
1 153500
2 10580
3 8714
4 8670
5 8825
6 8991
7 8958
Mutations among subassembled variants file¶
This file has the suffix _nmuts_among_subassembled.txt. It gives the distribution of the number of mutations per variant among subassembled variants. Here is an example:
nmuts nvariants
0 35614
1 32442
2 17902
3 7705
4 3046
5 1113
6 316
7 107
8 33
9 2
10 1
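The distribution above can be summarized with a short Python sketch that computes the number of variants and the mean number of mutations per variant. It assumes a header line followed by whitespace-delimited integer pairs, as in the example. The function names are hypothetical.

```python
def parse_distribution(text):
    """Parse a header line plus 'value count' rows into a dict."""
    rows = [line.split() for line in text.strip().splitlines()[1:]]
    return {int(value): int(count) for value, count in rows}


def mean_of_distribution(dist):
    """Mean of the values, weighted by their counts."""
    total = sum(dist.values())
    return sum(value * count for value, count in dist.items()) / total


example = """\
nmuts nvariants
0 35614
1 32442
2 17902
3 7705
4 3046
5 1113
6 316
7 107
8 33
9 2
10 1
"""

dist = parse_distribution(example)
nvariants = sum(dist.values())
mean_nmuts = mean_of_distribution(dist)
print(f"{nvariants} variants, mean of {mean_nmuts:.2f} mutations per variant")
```

The variant count recovered here equals the "barcodes successfully subassembled" entry in the example summary statistics, which is a useful cross-check between the two files.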
Read start sites file¶
This file has the suffix _refseqstarts.txt. It gives the number of reads that start at each of the positions in refseq specified in alignspecs. Here is an example:
refseqstart nreads
1 2442839
172 1944582
354 2794256
481 2352744
633 2982037
809 3024340
941 2758838