illuminabarcodeparser
Defines IlluminaBarcodeParser
to parse barcodes from Illumina reads.
- class dms_variants.illuminabarcodeparser.IlluminaBarcodeParser(*, bclen=None, upstream='', upstream2='', downstream='', downstream2='', upstream_mismatch=0, upstream2_mismatch=0, downstream_mismatch=0, downstream2_mismatch=0, valid_barcodes=None, bc_orientation='R1', minq=20, chastity_filter=True, list_all_valid_barcodes=True)
Bases: object
Parser for Illumina barcodes.
Note
Barcodes should be read by R1 and optionally R2. Expected arrangement is
5’-[R2_start]-upstream2-upstream-barcode-downstream-downstream2-[R1_start]-3’
R1 anneals downstream of the barcode and reads backwards. If R2 is used, it anneals upstream of the barcode and reads forward. There can be sequences (upstream and downstream) on either side of the barcode: downstream must fully cover the region between the R1 start and the barcode, and if using R2 then upstream must fully cover the region between the R2 start and the barcode. However, it is fine if R1 reads backwards past upstream, or if R2 reads forward past downstream. The upstream2 and downstream2 arguments can be used to require additional flanking sequences. Normally these would just be rolled into upstream and downstream, but you can specify them separately if they are actually additional indices for which you want to set different mismatch criteria.
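The arrangement described in the note can be illustrated with a small standard-library sketch. All sequences and the reverse_complement helper below are hypothetical and chosen only for illustration; the sketch shows the read geometry, not how the parser itself is implemented.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[nt] for nt in reversed(seq))


# Hypothetical example sequences, written 5'->3' on the top strand.
upstream2 = "GG"
upstream = "ACATGA"
barcode = "GATACA"
downstream = "CTAGT"
downstream2 = "AA"

amplicon = upstream2 + upstream + barcode + downstream + downstream2

# R2 (if used) starts at the 5' end and reads the top strand forward.
r2_read = amplicon

# R1 starts at the 3' end and reads backwards, so it sees the
# reverse complement of the top strand.
r1_read = reverse_complement(amplicon)

# In R1 the elements therefore appear reverse-complemented and in reversed
# order: downstream2, downstream, barcode, upstream, upstream2.
start = len(downstream2) + len(downstream)
barcode_in_r1 = r1_read[start : start + len(barcode)]

print(barcode_in_r1)                      # barcode as read by R1
print(reverse_complement(barcode_in_r1))  # same barcode in top-strand orientation
```

In this sketch R1 must read through downstream2 and downstream before it reaches the barcode, which is why those flanks have to cover the whole region between the R1 start and the barcode.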
- Parameters:
bclen (int or None) – Barcode length; None if length determined from valid_barcodes.
upstream (str) – Sequence upstream of the barcode; empty str if no such sequence.
downstream (str) – Sequence downstream of barcode; empty str if no such sequence.
upstream_mismatch (int) – Max number of mismatches allowed in upstream.
downstream_mismatch (int) – Max number of mismatches allowed in downstream.
valid_barcodes (None or iterable such as list) – If not None, only retain barcodes listed here. Use if you know the possible valid barcodes ahead of time.
bc_orientation ({'R1', 'R2'}) – Is the barcode defined in the orientation read by R1 or R2?
minq (int) – Require >= this Q score for all bases in barcode for at least one read.
chastity_filter (bool) – Drop any reads that fail Illumina chastity filter.
list_all_valid_barcodes (bool) – If using valid_barcodes, barcode sets returned by IlluminaBarcodeParser.parse() include all valid barcodes even if no counts.
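To make the flank and mismatch parameters concrete, here is a minimal sketch of mismatch-tolerant flank matching. The helpers count_mismatches and extract_barcode are hypothetical and are not part of dms_variants; the sketch assumes the read and both flanks are supplied in the same orientation and that the read starts at the first flank. The library may implement matching differently.

```python
def count_mismatches(observed, expected):
    """Count positions where two equal-length sequences differ."""
    if len(observed) != len(expected):
        raise ValueError("sequences must have the same length")
    return sum(o != e for o, e in zip(observed, expected))


def extract_barcode(read, *, flank_before, flank_after, bclen,
                    before_mismatch=0, after_mismatch=0):
    """Return the barcode in `read`, or None if the read is too short or a flank fails.

    Hypothetical helper: `read`, `flank_before`, and `flank_after` must all
    be in the same orientation, and the read must start at `flank_before`.
    """
    bc_start = len(flank_before)
    bc_end = bc_start + bclen
    needed = bc_end + len(flank_after)
    if len(read) < needed:
        return None  # read does not cover the specified region
    if count_mismatches(read[:bc_start], flank_before) > before_mismatch:
        return None  # too many mismatches in the first flank
    if count_mismatches(read[bc_end:needed], flank_after) > after_mismatch:
        return None  # too many mismatches in the second flank
    return read[bc_start:bc_end]


# Hypothetical R1-style read: flank, 6-nt barcode, flank.
exact = "ACTAG" + "TGTATC" + "TCATGT"
one_off = "ACTAC" + "TGTATC" + "TCATGT"  # one mismatch in the first flank

print(extract_barcode(exact, flank_before="ACTAG", flank_after="TCATGT", bclen=6))
print(extract_barcode(one_off, flank_before="ACTAG", flank_after="TCATGT", bclen=6))
print(extract_barcode(one_off, flank_before="ACTAG", flank_after="TCATGT", bclen=6,
                      before_mismatch=1))
```

With the default tolerance of 0 the read carrying one flank mismatch is rejected; raising the tolerance for that flank to 1 recovers the barcode.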
- bclen
Length of barcodes.
- Type:
int
- upstream
Sequence upstream of barcode.
- Type:
str
- upstream2
Second sequence upstream of barcode.
- Type:
str
- downstream
Sequence downstream of barcode.
- Type:
str
- downstream2
Second sequence downstream of barcode.
- Type:
str
- upstream_mismatch
Max number of mismatches allowed in upstream.
- Type:
int
- upstream2_mismatch
Max number of mismatches allowed in upstream2.
- Type:
int
- downstream_mismatch
Max number of mismatches allowed in downstream.
- Type:
int
- downstream2_mismatch
Max number of mismatches allowed in downstream2.
- Type:
int
- valid_barcodes
If not None, set of barcodes to retain.
- Type:
None or set
- bc_orientation
Is the barcode defined in the orientation read by R1 or R2?
- Type:
{‘R1’, ‘R2’}
- minq
Require >= this Q score for all bases in barcode for at least one read.
- Type:
int
- chastity_filter
Drop any reads that fail Illumina chastity filter.
- Type:
bool
- list_all_valid_barcodes
If using valid_barcodes, barcode sets returned by IlluminaBarcodeParser.parse() include all valid barcodes even if no counts.
- Type:
bool
- VALID_NTS = 'ACGTN'
Valid nucleotide characters in FASTQ files.
- Type:
str
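The minq attribute above requires a minimum Q score at every barcode base in at least one read. The sketch below shows one possible reading of that rule: a barcode passes if some read has Q >= minq at all of its barcode positions. The helpers phred_scores and passes_minq are hypothetical, and the Phred+33 quality encoding is an assumption about the FASTQ files; the library's exact rule for combining R1 and R2 qualities may differ.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Q scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_minq(barcode_quals, minq=20):
    """Return True if at least one read has Q >= minq at every barcode base.

    `barcode_quals` holds one quality string per read covering the barcode
    (one entry for R1 only, two entries for R1 and R2).
    """
    return any(
        all(q >= minq for q in phred_scores(qual))
        for qual in barcode_quals
    )


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
print(phred_scores("I#5"))
print(passes_minq(["I#II"]))          # single read with one low-quality base
print(passes_minq(["I#II", "IIII"]))  # second read is high quality throughout
```

Under this interpretation, two reads that each contain a different low-quality base would both fail, even though every position is high quality in one of them.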
- parse(r1files, *, r2files=None, add_cols=None, outer_flank_fates=False)
Parse barcodes from files.
- Parameters:
r1files (str or list) – Name of R1 FASTQ file, or list of such files. Can be gzipped.
r2files (None, str, or list) – None or empty list if not using R2, or like r1files for R2.
add_cols (None or dict) – If dict, specify names and values (e.g., sample or library names) to be added to returned data frames.
outer_flank_fates (bool) – If True and outer flanking regions are used, the output fates list separately the reads that fail only the outer flanking regions (upstream2 or downstream2). Otherwise, such failures are grouped with the “unparseable barcode” fate.
- Returns:
- The 2-tuple (barcodes, fates), where:
barcodes is pandas DataFrame giving number of observations of each barcode (columns are “barcode” and “count”).
fates is pandas DataFrame giving total number of reads with each fate (columns “fate” and “count”). Possible fates:
- “failed chastity filter”
- “valid barcode”
- “invalid barcode”: not in barcode whitelist
- “R1 / R2 disagree” (if using r2files)
- “low quality barcode”: sequencing quality low
- “unparseable barcode”: invalid flank sequence, N in barcode
- “read too short”: read is too short to cover specified region
- “invalid outer flank”: if using outer_flank_fates and upstream2 or downstream2 fails.
Note that these data frames also include any columns specified by add_cols.
- Return type:
tuple
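The shape of the returned tables can be sketched without pandas. The tabulate helper below is hypothetical and not part of dms_variants: it takes per-read (barcode, fate) pairs and builds rows with the column names described above, including zero-count rows for valid barcodes and any extra add_cols columns. The sort order is an assumption made for readability.

```python
from collections import Counter


def tabulate(parsed_reads, *, valid_barcodes=None,
             list_all_valid_barcodes=True, add_cols=None):
    """Build barcode and fate tables from per-read (barcode, fate) pairs.

    Hypothetical helper: returns two lists of row dicts rather than
    pandas DataFrames.
    """
    add_cols = add_cols or {}
    barcode_counts = Counter()
    fate_counts = Counter()
    for barcode, fate in parsed_reads:
        fate_counts[fate] += 1
        if fate == "valid barcode":
            barcode_counts[barcode] += 1
    if valid_barcodes is not None and list_all_valid_barcodes:
        for barcode in valid_barcodes:
            barcode_counts.setdefault(barcode, 0)  # keep unobserved barcodes

    def by_count(item):
        # Assumed ordering: descending count, then alphabetical.
        return (-item[1], item[0])

    barcodes = [{"barcode": bc, "count": n, **add_cols}
                for bc, n in sorted(barcode_counts.items(), key=by_count)]
    fates = [{"fate": fate, "count": n, **add_cols}
             for fate, n in sorted(fate_counts.items(), key=by_count)]
    return barcodes, fates


reads = [
    ("AC", "valid barcode"),
    ("AC", "valid barcode"),
    ("GT", "invalid barcode"),
    (None, "low quality barcode"),
]
barcodes, fates = tabulate(reads, valid_barcodes=["AC", "TT"],
                           add_cols={"library": "lib_1"})
for row in barcodes + fates:
    print(row)
```

Each list of row dicts could be passed to pandas.DataFrame to obtain a table with the same columns.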