illuminabarcodeparser
Defines IlluminaBarcodeParser to parse barcodes from Illumina reads.
- class dms_variants.illuminabarcodeparser.IlluminaBarcodeParser(*, bclen=None, upstream='', downstream='', upstream_mismatch=0, downstream_mismatch=0, valid_barcodes=None, bc_orientation='R1', minq=20, chastity_filter=True, list_all_valid_barcodes=True)
Bases: object
Parser for Illumina barcodes.
Note
Barcodes should be read by R1 and optionally R2. Expected arrangement is
5’-[R2_start]-upstream-barcode-downstream-[R1_start]-3’
R1 anneals downstream of barcode and reads backwards. If R2 is used, it anneals upstream of barcode and reads forward. There can be sequences (upstream and downstream) on either side of the barcode: downstream must fully cover region between R1 start and barcode, and if using R2 then upstream must fully cover region between R2 start and barcode. However, it is fine if R1 reads backwards past upstream, and if R2 reads forward past downstream.
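The arrangement described in this note can be illustrated with a small, library-independent sketch. The sequences are made-up placeholders, and the sketch assumes each read starts exactly at the corresponding end of the template; it is not taken from the package itself.

```python
# Hypothetical sequences, chosen only to illustrate the layout.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


upstream, barcode, downstream = "ACATGA", "CCAA", "GACT"
template = upstream + barcode + downstream  # 5'-upstream-barcode-downstream-3'

r2_read = template                      # R2 reads forward from the 5' end
r1_read = reverse_complement(template)  # R1 reads backwards from the 3' end

# R1 first crosses downstream, then the barcode (both reverse-complemented).
assert r1_read.startswith(
    reverse_complement(downstream) + reverse_complement(barcode)
)
# R2 first crosses upstream, then the barcode.
assert r2_read.startswith(upstream + barcode)
```

This is why downstream must cover everything between the R1 start and the barcode: any uncovered bases would shift where the barcode falls in the read.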
- Parameters
bclen (int or None) – Barcode length; None if length determined from valid_barcodes.
upstream (str) – Sequence upstream of the barcode; empty str if no such sequence.
downstream (str) – Sequence downstream of barcode; empty str if no such sequence.
upstream_mismatch (int) – Max number of mismatches allowed in upstream.
downstream_mismatch (int) – Max number of mismatches allowed in downstream.
valid_barcodes (None or iterable such as list) – If not None, only retain barcodes listed here. Use if you know the possible valid barcodes ahead of time.
bc_orientation ({'R1', 'R2'}) – Is the barcode defined in the orientation read by R1 or R2?
minq (int) – Require >= this Q score for all bases in barcode for at least one read.
chastity_filter (bool) – Drop any reads that fail Illumina chastity filter.
list_all_valid_barcodes (bool) – If using valid_barcodes, barcode sets returned by IlluminaBarcodeParser.parse() include all valid barcodes, even those with zero counts.
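As a hedged sketch of how these parameters fit together, the snippet below collects keyword arguments that mirror the documented signature and passes them to the constructor. The sequences and values are placeholders, and `make_parser` is a hypothetical helper (not part of the package) that imports lazily so the snippet loads even where dms_variants is not installed.

```python
# Keyword arguments mirror the documented signature; the sequences and
# values are made-up placeholders, not recommendations.
parser_kwargs = dict(
    bclen=4,
    upstream="ACATGA",
    downstream="GACT",
    upstream_mismatch=1,
    downstream_mismatch=1,
    bc_orientation="R1",
    minq=20,
)


def make_parser(**kwargs):
    """Hypothetical helper: build the parser from keyword-only arguments."""
    # Imported here so that defining this function does not need the package.
    from dms_variants.illuminabarcodeparser import IlluminaBarcodeParser

    return IlluminaBarcodeParser(**kwargs)


# Usage, assuming dms_variants is installed:
# parser = make_parser(**parser_kwargs)
```

All constructor arguments are keyword-only (note the leading `*` in the signature), so positional calls are not accepted.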
- bclen
Length of barcodes.
- Type
int
- upstream
Sequence upstream of barcode.
- Type
str
- downstream
Sequence downstream of barcode.
- Type
str
- upstream_mismatch
Max number of mismatches allowed in upstream.
- Type
int
- downstream_mismatch
Max number of mismatches allowed in downstream.
- Type
int
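One way to picture the mismatch limits is as a Hamming-distance check on each flanking sequence. The sketch below assumes mismatches are substitutions only (no insertions or deletions); the function names are hypothetical and the package's internal matching may be implemented differently.

```python
def count_mismatches(observed, expected):
    """Count positions at which two equal-length sequences differ."""
    if len(observed) != len(expected):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(observed, expected))


def flank_ok(observed, expected, max_mismatch):
    """True if the observed flank is within max_mismatch of the expected one."""
    return count_mismatches(observed, expected) <= max_mismatch


# One substitution (A -> T at the third position) in a placeholder flank.
print(flank_ok("ACTTGA", "ACATGA", max_mismatch=0))
print(flank_ok("ACTTGA", "ACATGA", max_mismatch=1))
```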
- valid_barcodes
If not None, set of barcodes to retain.
- Type
None or set
- bc_orientation
Is the barcode defined in the orientation read by R1 or R2?
- Type
{'R1', 'R2'}
- minq
Require >= this Q score for all bases in barcode for at least one read.
- Type
int
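The "all bases, at least one read" rule can be sketched as follows. This assumes Phred+33 quality encoding, which is common for current Illumina FASTQ files but is not stated in this documentation; `phred_scores` and `passes_minq` are hypothetical names.

```python
def phred_scores(qual, offset=33):
    """Decode a FASTQ quality string into integer Q scores (assumes Phred+33)."""
    return [ord(char) - offset for char in qual]


def passes_minq(qual_strings, minq=20):
    """True if at least one read has every barcode base at Q >= minq.

    qual_strings holds the barcode-region quality string from each read
    (R1 only, or R1 and R2).
    """
    return any(
        all(q >= minq for q in phred_scores(qual)) for qual in qual_strings
    )


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
print(passes_minq(["II#I"]))          # single read with one low-quality base
print(passes_minq(["II#I", "IIII"]))  # second read rescues the barcode
```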
- chastity_filter
Drop any reads that fail Illumina chastity filter.
- Type
bool
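For orientation, the chastity flag usually lives in the FASTQ header line. The sketch below assumes the CASAVA 1.8+ header layout, in which the comment after the space reads `read:is_filtered:control:index` and `Y` marks a read that failed the filter. How the package itself reads this flag is not described here, and `fails_chastity` is a hypothetical helper.

```python
def fails_chastity(header):
    """True if a CASAVA 1.8+ style FASTQ header is flagged as filtered."""
    parts = header.split()
    if len(parts) < 2:
        return False  # no flag present; treat the read as passing
    fields = parts[1].split(":")
    return len(fields) >= 2 and fields[1] == "Y"


# Placeholder headers in the assumed layout.
print(fails_chastity("@M01:1:000:1:1101:15589:1332 1:N:0:ATCACG"))
print(fails_chastity("@M01:1:000:1:1101:15589:1332 1:Y:0:ATCACG"))
```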
- list_all_valid_barcodes
If using valid_barcodes, barcode sets returned by IlluminaBarcodeParser.parse() include all valid barcodes, even those with zero counts.
- Type
bool
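The interaction between valid_barcodes and list_all_valid_barcodes can be sketched in plain Python. `tabulate` is a hypothetical function, and the sort order (descending count, then barcode) is a choice made for this illustration rather than documented behavior.

```python
from collections import Counter


def tabulate(observed, valid_barcodes=None, list_all_valid_barcodes=True):
    """Count barcodes, optionally restricting to and padding with a whitelist."""
    counts = Counter(observed)
    if valid_barcodes is not None:
        valid = set(valid_barcodes)
        # Keep only whitelisted barcodes.
        counts = Counter({bc: n for bc, n in counts.items() if bc in valid})
        if list_all_valid_barcodes:
            # Add whitelisted barcodes that were never observed, with count 0.
            for bc in valid:
                counts.setdefault(bc, 0)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


observed = ["AAAA", "AAAA", "CCCC", "GGGG"]
print(tabulate(observed, valid_barcodes=["AAAA", "CCCC", "TTTT"]))
```

Listing zero-count barcodes makes it easier to compare samples, since every sample then reports the same set of barcodes.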
- VALID_NTS = 'ACGTN'
Valid nucleotide characters in FASTQ files.
- Type
str
- parse(r1files, *, r2files=None, add_cols=None)
Parse barcodes from files.
- Parameters
r1files (str or list) – Name of R1 FASTQ file, or list of such files. Can be gzipped.
r2files (None, str, or list) – None or empty list if not using R2, or like r1files for R2.
add_cols (None or dict) – If dict, specify names and values (e.g., sample or library names) to be added to returned data frames.
- Returns
- The 2-tuple (barcodes, fates), where:
barcodes is pandas DataFrame giving number of observations of each barcode (columns are “barcode” and “count”).
fates is pandas DataFrame giving total number of reads with each fate (columns “fate” and “count”). Possible fates:
- “failed chastity filter”
- “valid barcode”
- “invalid barcode”: not in barcode whitelist
- “R1 / R2 disagree” (if using r2files)
- “low quality barcode”: sequencing quality low
- “unparseable barcode”: invalid flank sequence, N in barcode
Note that these data frames also include any columns specified by add_cols.
- Return type
tuple
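A call to parse might be wrapped as in the sketch below, which unpacks the 2-tuple and uses add_cols to label both data frames. `parse_sample` and the "sample" column name are hypothetical; only the `parse(r1files, *, r2files=None, add_cols=None)` signature comes from this page.

```python
def parse_sample(parser, r1files, sample, r2files=None):
    """Parse one sample and label both returned data frames with its name.

    parser is expected to be an IlluminaBarcodeParser (or any object with a
    compatible parse method).
    """
    barcodes, fates = parser.parse(
        r1files,
        r2files=r2files,              # keyword-only in the documented signature
        add_cols={"sample": sample},  # adds a "sample" column to both frames
    )
    return barcodes, fates


# Usage, assuming a configured parser and a FASTQ file at a placeholder path:
# barcodes, fates = parse_sample(parser, "sample1_R1.fastq.gz", "sample1")
```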