illuminabarcodeparser

Defines IlluminaBarcodeParser to parse barcodes from Illumina reads.

class dms_variants.illuminabarcodeparser.IlluminaBarcodeParser(*, bclen=None, upstream='', upstream2='', downstream='', downstream2='', upstream_mismatch=0, upstream2_mismatch=0, downstream_mismatch=0, downstream2_mismatch=0, valid_barcodes=None, bc_orientation='R1', minq=20, chastity_filter=True, list_all_valid_barcodes=True)

Bases: object

Parser for Illumina barcodes.

Note

Barcodes should be read by R1 and optionally R2. The expected arrangement is:

5’-[R2_start]-upstream2-upstream-barcode-downstream-downstream2-[R1_start]-3’

R1 anneals downstream of the barcode and reads backwards. If R2 is used, it anneals upstream of the barcode and reads forward. Flanking sequences (upstream and downstream) may sit on either side of the barcode: downstream must fully cover the region between the R1 start and the barcode, and if R2 is used, upstream must fully cover the region between the R2 start and the barcode. It is fine, however, if R1 reads backwards past upstream or if R2 reads forward past downstream.

Use upstream2 and downstream2 to require additional flanking sequences. Normally these would simply be rolled into upstream and downstream, but specifying them separately is useful when they are used to parse additional indices that need their own mismatch criteria.
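The read geometry described in this note can be sketched in plain Python. This is only an illustration of the arrangement, not code from the package: the sequences and the reverse_complement helper are hypothetical, and the outer flanks are left empty for brevity.

```python
# Hypothetical example sequences, chosen only for illustration.
upstream2 = ""
upstream = "ACATGA"
barcode = "TTGC"
downstream = "GACT"
downstream2 = ""

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Template written 5' to 3', following the arrangement shown above.
template = upstream2 + upstream + barcode + downstream + downstream2

# R2 starts at the 5' end and reads forward along the template.
r2_read = template

# R1 starts at the 3' end and reads backwards, so it sees the
# reverse complement of the template.
r1_read = reverse_complement(template)

# R1 therefore begins with the reverse complement of the downstream
# flank(s), followed by the reverse complement of the barcode.
r1_flank = reverse_complement(downstream + downstream2)
r1_barcode = r1_read[len(r1_flank):len(r1_flank) + len(barcode)]

# R2 begins with the upstream flank(s), followed by the barcode itself.
r2_flank = upstream2 + upstream
r2_barcode = r2_read[len(r2_flank):len(r2_flank) + len(barcode)]

print("R1 read:", r1_read)
print("R2 read:", r2_read)
print("barcode as seen by R1:", r1_barcode)
print("barcode as seen by R2:", r2_barcode)
```

The two reads see the barcode in opposite orientations, which is why the parser needs bc_orientation to know which orientation the barcode is defined in.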

Parameters:
  • bclen (int or None) – Barcode length; None if length determined from valid_barcodes.

  • upstream (str) – Sequence upstream of the barcode; empty str if no such sequence.

  • downstream (str) – Sequence downstream of barcode; empty str if no such sequence.

  • upstream_mismatch (int) – Max number of mismatches allowed in upstream.

  • downstream_mismatch (int) – Max number of mismatches allowed in downstream.

  • valid_barcodes (None or iterable such as list) – If not None, only retain barcodes listed here. Use if you know the possible valid barcodes ahead of time.

  • bc_orientation ({'R1', 'R2'}) – Is the barcode defined in the orientation read by R1 or R2?

  • minq (int) – Require >= this Q score for all bases in barcode for at least one read.

  • chastity_filter (bool) – Drop any reads that fail Illumina chastity filter.

  • list_all_valid_barcodes (bool) – If using valid_barcodes, barcode sets returned by IlluminaBarcodeParser.parse() include all valid barcodes even if no counts.
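To make the flank and mismatch parameters concrete, here is a minimal sketch of mismatch-tolerant flank matching. It is not the package's implementation: count_mismatches and extract_barcode are hypothetical helpers, and the sketch assumes the read starts exactly at the first base of upstream.

```python
def count_mismatches(observed, expected):
    """Count positions at which two equal-length sequences differ."""
    if len(observed) != len(expected):
        raise ValueError("sequences must have the same length")
    return sum(o != e for o, e in zip(observed, expected))


def extract_barcode(read, upstream, downstream, bclen,
                    upstream_mismatch=0, downstream_mismatch=0):
    """Return the barcode if both flanks match within tolerance, else None.

    Simplifying assumption: `read` begins at the first base of `upstream`.
    """
    needed = len(upstream) + bclen + len(downstream)
    if len(read) < needed:
        return None  # read too short to cover flanks and barcode

    bc_start = len(upstream)
    bc_end = bc_start + bclen
    observed_up = read[:bc_start]
    observed_down = read[bc_end:needed]

    if count_mismatches(observed_up, upstream) > upstream_mismatch:
        return None
    if count_mismatches(observed_down, downstream) > downstream_mismatch:
        return None
    return read[bc_start:bc_end]


# Exact flanks: the barcode is recovered.
print(extract_barcode("ACATGATTGCGACT", "ACATGA", "GACT", bclen=4))

# One mismatch in upstream: rejected unless upstream_mismatch allows it.
print(extract_barcode("ACATCATTGCGACT", "ACATGA", "GACT", bclen=4))
print(extract_barcode("ACATCATTGCGACT", "ACATGA", "GACT", bclen=4,
                      upstream_mismatch=1))
```

Keeping separate tolerances per flank mirrors the separate upstream_mismatch and downstream_mismatch parameters.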

bclen

Length of barcodes.

Type:

int

upstream

Sequence upstream of barcode.

Type:

str

upstream2

Second sequence upstream of barcode.

Type:

str

downstream

Sequence downstream of barcode.

Type:

str

downstream2

Second sequence downstream of barcode.

Type:

str

upstream_mismatch

Max number of mismatches allowed in upstream.

Type:

int

upstream2_mismatch

Max number of mismatches allowed in upstream2.

Type:

int

downstream_mismatch

Max number of mismatches allowed in downstream.

Type:

int

downstream2_mismatch

Max number of mismatches allowed in downstream2.

Type:

int

valid_barcodes

If not None, set of barcodes to retain.

Type:

None or set

bc_orientation

Is the barcode defined in the orientation read by R1 or R2?

Type:

{'R1', 'R2'}

minq

Require >= this Q score for all bases in barcode for at least one read.

Type:

int
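The minq rule ("all bases in barcode for at least one read") can be sketched as follows. This is an illustration rather than the package's code: phred_scores and passes_minq are hypothetical helpers, and the sketch assumes Phred+33 quality encoding, as used in current Illumina FASTQ files.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Q scores.

    Assumes Phred+33 encoding (an assumption of this sketch).
    """
    return [ord(ch) - offset for ch in quality_string]


def passes_minq(barcode_quality_strings, minq=20):
    """Return True if at least one read has every barcode base at Q >= minq.

    `barcode_quality_strings` holds the quality characters covering the
    barcode, one string per read (R1 only, or R1 and R2).
    """
    return any(
        all(q >= minq for q in phred_scores(qs))
        for qs in barcode_quality_strings
    )


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
print(passes_minq(["II#I"]))          # one read, one low-quality base
print(passes_minq(["II#I", "IIII"]))  # second read is high quality throughout
```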

chastity_filter

Drop any reads that fail Illumina chastity filter.

Type:

bool

list_all_valid_barcodes

If using valid_barcodes, barcode sets returned by IlluminaBarcodeParser.parse() include all valid barcodes even if no counts.

Type:

bool
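The interaction between valid_barcodes and list_all_valid_barcodes can be illustrated with a small tally. This sketch is not the package's implementation: tally_barcodes is a hypothetical helper that returns a plain dict rather than a data frame.

```python
from collections import Counter


def tally_barcodes(observed, valid_barcodes=None, list_all_valid_barcodes=True):
    """Count observed barcodes, optionally restricted to a set of valid ones.

    Returns (counts, n_invalid), where counts maps barcode to count and
    n_invalid is the number of observations not in `valid_barcodes`.
    """
    valid = set(valid_barcodes) if valid_barcodes is not None else None
    counts = Counter()
    n_invalid = 0
    for bc in observed:
        if valid is None or bc in valid:
            counts[bc] += 1
        else:
            n_invalid += 1

    # Optionally report valid barcodes that were never observed as zeros.
    if valid is not None and list_all_valid_barcodes:
        for bc in valid:
            counts.setdefault(bc, 0)
    return dict(counts), n_invalid


observed = ["AAAA", "AAAA", "CCCC"]
print(tally_barcodes(observed, valid_barcodes=["AAAA", "GGGG"]))
print(tally_barcodes(observed, valid_barcodes=["AAAA", "GGGG"],
                     list_all_valid_barcodes=False))
```

Listing zero-count barcodes is convenient when results from several samples are later combined, since every sample then reports the same set of barcodes.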

VALID_NTS = 'ACGTN'

Valid nucleotide characters in FASTQ files.

Type:

str

parse(r1files, *, r2files=None, add_cols=None, outer_flank_fates=False)

Parse barcodes from files.

Parameters:
  • r1files (str or list) – Name of R1 FASTQ file, or list of such files. Can be gzipped.

  • r2files (None, str, or list) – None or empty list if not using R2, or like r1files for R2.

  • add_cols (None or dict) – If dict, specify names and values (e.g., sample or library names) to be added to returned data frames.

  • outer_flank_fates (bool) – If True, if using outer flanking regions then in the output fates specify reads that fail just the outer flanking regions (upstream2 or downstream2). Otherwise, such failures will be grouped with the “unparseable barcode” fate.

Returns:

The 2-tuple (barcodes, fates), where:
  • barcodes is a pandas DataFrame giving the number of observations of each barcode (columns are “barcode” and “count”).

  • fates is a pandas DataFrame giving the total number of reads with each fate (columns “fate” and “count”). Possible fates:

      - “failed chastity filter”
      - “valid barcode”
      - “invalid barcode”: not in barcode whitelist
      - “R1 / R2 disagree” (if using r2files)
      - “low quality barcode”: sequencing quality low
      - “unparseable barcode”: invalid flank sequence, N in barcode
      - “read too short”: read is too short to cover specified region
      - “invalid outer flank”: if using outer_flank_fates and upstream2 or downstream2 fails

Note that these data frames also include any columns specified by add_cols.

Return type:

tuple
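A typical call might look like the sketch below, which uses only the constructor and parse() signatures documented above. The count_barcodes wrapper, the flank sequences, the barcode length, and the sample name are all hypothetical placeholders; substitute values matching your own library design and check that the flanks are given in the orientation the parser expects.

```python
def count_barcodes(r1files, r2files=None, sample="sample_1"):
    """Parse barcodes from FASTQ files and return (barcodes, fates).

    Hypothetical wrapper for illustration; all sequence values below
    are placeholders.
    """
    # Imported here so this block can be loaded even when the package
    # is not installed; calling the function does require dms_variants.
    from dms_variants.illuminabarcodeparser import IlluminaBarcodeParser

    parser = IlluminaBarcodeParser(
        bclen=16,               # placeholder barcode length
        upstream="ACATGA",      # placeholder flank sequences
        downstream="GACT",
        upstream_mismatch=1,
        downstream_mismatch=1,
        minq=20,
    )

    # add_cols attaches the sample name as an extra column in both
    # returned data frames.
    barcodes, fates = parser.parse(
        r1files,
        r2files=r2files,
        add_cols={"sample": sample},
    )
    return barcodes, fates
```

Inspecting fates first is a quick way to see how many reads were lost to each failure mode before working with the barcode counts.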