barcodes

Operations for sequence barcodes and UMIs.

class dms_tools2.barcodes.IlluminaBarcodeParser(*, bclen=None, upstream='', downstream='', upstream_mismatch=0, downstream_mismatch=0, valid_barcodes=None, rc_barcode=True, minq=20, chastity_filter=True, list_all_valid_barcodes=True)[source]

Bases: object

Parser for Illumina barcodes.

The barcodes should be read by R1 and optionally R2. The arrangement of elements is shown below:

5'-[R2_start]-upstream-barcode-downstream-[R1_start]-3'

R1 anneals downstream of the barcode and reads backwards. If R2 is used, it anneals upstream of the barcode and reads forward. There can be sequences (upstream and downstream) on either side of the barcode: downstream must fully cover the region between where R1 starts and the barcode, and if you are using R2 then upstream must fully cover the region between where R2 starts and the barcode. However, it is fine if R1 reads backwards past upstream, and if R2 reads forward past downstream.

Args:
bclen (int or None)

Length of the barcode, or None if length is to be determined from valid_barcodes argument.

upstream (str)

Sequence upstream of the barcode.

downstream (str)

Sequence downstream of barcode.

upstream_mismatch (int)

Max number of mismatches allowed in upstream.

downstream_mismatch (int)

Like upstream_mismatches but for downstream.

valid_barcodes (None or iterable such as list, Series)

If not None, only retain barcodes listed here. Use if you know the set of possible valid barcodes.

rc_barcode (bool)

Parse the reverse complement of the barcode (the orientation read by R1).

minq (int)

Require at least this quality score for all bases in barcode.

chastity_filter (bool)

Drop any reads that fail Illumina chastity filter.

list_all_valid_barcodes (bool)

If using valid_barcodes, then barcode sets returned by IlluminaBarcodeParser.parse includes all valid barcodes even if no counts.

To use, first initialize a IlluminaBarcodeParser, then parse barcodes using IlluminaBarcodeParser.parse. Barcodes are retained as valid only if R1 and R2 agree at every nucleotide in barcode and if at each site at least one read has a quality of at least minq.

Here is an example. Imagine we are parsing 4 nucleotide barcodes that have the following construction:

5'-[R2 binding site]-ACATGA-NNNN-GACT-[R1 binding site]-3'

First, we initialize an appropriate IlluminaBarcodeParser:

>>> parser = IlluminaBarcodeParser(
...              bclen=4,
...              upstream='ACATGA',
...              downstream='GACT'
...              )

Now we write some test FASTQ files. We write valid test reads and some invalid reads. The header for each read explains why it is valid / invalid. We use quality scores of ? (30) or + (10) for high- and low-quality bases:

>>> r1file = '_temp_R1.fastq'
>>> r2file = '_temp_R2.fastq'
>>> with open(r1file, 'w') as f1, open(r2file, 'w') as f2:
...
...     # valid TACG barcode, full flanking regions
...     _ = f1.write(
...         '@valid_CGTA_barcode_full_flanking_region\n'
...         'AGTCCGTATCATGT\n'
...         '+\n'
...         '??????????????\n')
...     _ = f2.write(
...         '@valid_CGTA_barcode_full_flanking_region\n'
...         'ACATGATACGGACT\n'
...         '+\n'
...         '??????????????\n')
...
...     # valid CGTA barcode, partial flanking regions
...     _ = f1.write(
...         '@valid_CGTA_barcode_partial_flanking_region\n'
...         'AGTCCGTATCAT\n'
...         '+\n'
...         '????????????\n')
...     _ = f2.write(
...         '@valid_CGTA_barcode_partial_flanking_region\n'
...         'ACATGATACG\n'
...         '+\n'
...         '??????????\n')
...
...     # valid GCCG barcode, extended flanking regions
...     _ = f1.write(
...         '@valid_GCCG_barcode_extended_flanking_region\n'
...         'AGTCGCCGTCATGTTAC\n'
...         '+\n'
...         '?????????????????\n')
...     _ = f2.write(
...         '@valid_GCCG_barcode_extended_flanking_region\n'
...         'ACATGACGGCGACTGAC\n'
...         '+\n'
...         '?????????????????\n')
...
...     # AAGT barcode in R1 but R2 differs
...     _ = f1.write(
...         '@AAGT_R1_barcode_but_R2_differs\n'
...         'AGTCAAGTTCATGT\n'
...         '+\n'
...         '??????????????\n')
...     _ = f2.write(
...         '@AAGT_R1_barcode_but_R2_differs\n'
...         'ACATGAACTAGACT\n'
...         '+\n'
...         '??????????????\n')
...
...     # same site low quality in R1 and R2
...     _ = f1.write(
...         '@low_quality_site_in_R1_and_R2\n'
...         'AGTCCGTATCATGT\n'
...         '+\n'
...         '?????+????????\n')
...     _ = f2.write(
...         '@low_quality_site_in_R1_and_R2\n'
...         'ACATGATACGGACT\n'
...         '+\n'
...         '????????+?????\n')
...
...     # different site low quality in R1 and R2
...     _ = f1.write(
...         '@AGTA_with_low_quality_site_in_R1\n'
...         'AGTCAGTATCATGT\n'
...         '+\n'
...         '?????+????????\n')
...     _ = f2.write(
...         '@AGTA_with_low_quality_site_in_R1\n'
...         'ACATGATACTGACT\n'
...         '+\n'
...         '?????????+????\n')
...
...     # N in barcode
...     _ = f1.write(
...         '@N_in_barcode\n'
...         'AGTCCGTNTCATGT\n'
...         '+\n'
...         '??????????????\n')
...     _ = f2.write(
...         '@N_in_barcode\n'
...         'ACATGATACGGACT\n'
...         '+\n'
...         '??????????????\n')
...
...     # GGAG barcode, one mismatch in each flanking region
...     _ = f1.write(
...         '@GGAG_barcode_one_mismatch_per_flank\n'
...         'GGTCGGAGTCATGA\n'
...         '+\n'
...         '??????????????\n')
...     _ = f2.write(
...         '@GGAG_barcode_one_mismatch_per_flank\n'
...         'TCATGACTCCGACG\n'
...         '+\n'
...         '??????????????\n')
...
...     # GGAG barcode, two mismatch in a flanking region
...     _ = f1.write(
...         '@GGAG_barcode_two_mismatch_in_a_flank\n'
...         'GGTCGGAGTCATAA\n'
...         '+\n'
...         '??????????????\n')
...     _ = f2.write(
...         '@GGAG_barcode_two_mismatch_in_a_flank\n'
...         'TCATGACTCCGACG\n'
...         '+\n'
...         '??????????????\n')

Now parse the barcodes using both the R1 and R2 files:

>>> barcodes, fates = parser.parse(r1file, r2file)
>>> print(barcodes.to_csv(sep=' ', index=False).strip())
barcode count
CGTA 2
AGTA 1
GCCG 1
>>> print(fates.to_csv(sep=' ', index=False).strip())
fate count
"valid barcode" 4
"unparseable barcode" 3
"R1 / R2 disagree" 1
"low quality barcode" 1

Now we parse just using R1. We gain the barcode where R1 and R2 disagree, but lose the one where R1 is low quality at a position where R2 is OK:

>>> barcodes, fates = parser.parse(r1file)
>>> print(barcodes.to_csv(sep=' ', index=False).strip())
barcode count
CGTA 2
AAGT 1
GCCG 1
>>> print(fates.to_csv(sep=' ', index=False).strip())
fate count
"valid barcode" 4
"unparseable barcode" 3
"low quality barcode" 2

Now create a parser that allows a mismatch in each flanking region, and check that we recover a “GGAG” barcode:

>>> parser_mismatch = IlluminaBarcodeParser(
...              bclen=4,
...              upstream='ACATGA',
...              downstream='GACT',
...              upstream_mismatch=1,
...              downstream_mismatch=1,
...              )
>>> barcodes_mismatch, fates_mismatch = parser_mismatch.parse(r1file, r2file)
>>> print(barcodes_mismatch.to_csv(sep=' ', index=False).strip())
barcode count
CGTA 2
AGTA 1
GCCG 1
GGAG 1
>>> print(fates_mismatch.to_csv(sep=' ', index=False).strip())
fate count
"valid barcode" 5
"unparseable barcode" 2
"R1 / R2 disagree" 1
"low quality barcode" 1

Now parse the barcodes using valid_barcodes to set a barcode whitelist:

>>> parser_wl = IlluminaBarcodeParser(
...              upstream='ACATGA',
...              downstream='GACT',
...              valid_barcodes={'CGTA', 'AGTA', 'TAAT'}
...              )
>>> barcodes_wl, fates_wl = parser_wl.parse(r1file, r2file)
>>> print(barcodes_wl.to_csv(sep=' ', index=False).strip())
barcode count
CGTA 2
AGTA 1
TAAT 0
>>> print(fates_wl.to_csv(sep=' ', index=False).strip())
fate count
"unparseable barcode" 3
"valid barcode" 3
"R1 / R2 disagree" 1
"invalid barcode" 1
"low quality barcode" 1

Remove the test FASTQ files:

>>> os.remove(r1file)
>>> os.remove(r2file)
VALID_NTS = 'ACGTN'

valid nucleotide characters

parse(r1files, r2files=None)[source]

Parses barcodes from files.

Args:
r1files (str or list)

Name of R1 FASTQ file, or list of such files Can optionally be gzipped.

r2files (None, str, or list)

None or empty list if not using R2, otherwise like R1.

Returns:

The 2-tuple (barcodes, fates). In this 2-tuple:

  • barcodes is a pandas DataFrame giving the number of observations of each barcode. The columns are named “barcode” and “count”.

  • fates is a pandas DataFrame giving the total number of reads with each fate. The columns are named “fate” and “count”.

    • “valid barcode”

    • “invalid barcode”: not in our barcode whitelist

    • “R1 / R2 disagree”

    • “low quality barcode”: sequencing quality low

    • “unparseable barcode”: invalid flanking sequences or N in barcode.

dms_tools2.barcodes.almost_duplicated(barcodes, threshold=1)[source]

Identifies nearly identical barcodes.

This function mimics the pandas duplicated function except it can also mark as duplicated almost but not exactly identical sequences.

Args:
barcodes (list or pandas Series)

Lists all the barcodes, which should be strings of same length.

threshold (int)

Max number of mismatches for barcodes to be considered almost identical.

Returns:

A pandas Series of the same length as barcodes filled with bools that indicates whether a barcode is a near duplicate. For each group of barcodes within threshold of each other, only one barcode will have an entry of False (not duplicated) and the rest will be True (duplicated). The barcode listed as not duplicated is chosen as follows: (1) it has the most abundant barcode in the group, (2) among barcodes that are equivalent abundant we take the one listed first in barcodes. Groups are computed using the directional method of umi_tools.

When threshold is zero, just like pandas.duplicated:

>>> barcodes = ['CTC', 'ATG', 'ATG', 'ATA']
>>> almost_duplicated(barcodes, threshold=0).equals(
...     pandas.Series([False, False, True, False]))
True
>>> pandas.Series(barcodes).duplicated().equals(
...     pandas.Series([False, False, True, False]))
True

But when threshold is 1, also identifies almost equal barcodes (in this case, ones that differ by <= 1 mutation:

>>> almost_duplicated(barcodes, threshold=1).equals(
...     pandas.Series([False, False, True, True]))
True

All barcodes are within a distance of two, so all are near duplicates. Note how we mark as not duplicated (False) the first one listed among the most abundant barcode (‘ATG’):

>>> almost_duplicated(pandas.Series(barcodes), threshold=2).equals(
...     pandas.Series([True, False, True, True]))
True
dms_tools2.barcodes.simpleConsensus(df, *, barcode_col='barcode', substitution_col='substitutions', insertion_col='insertions', deletion_col='deletions', library_col=None, max_sub_diffs=1, max_indel_diffs=2, max_minor_muts=1)[source]

Simple method to get consensus of mutations within barcode.

Args:
df (pandas Data Frame)

Holds variants and their barcodes. Each row gives a sequence variant and its barcode. There need to be columns with the names given by the next four arguments described below.

barcode_col (str)

Name of column holding barcodes.

substitution_col (str)

Name of column holding substitutions as list of strings.

insertion_col (str)

Name of column holding insertions as list of strings.

deletion_col (str)

Name of column holding insertions as list of strings.

library_col (None or str)

If we have multiple libraries, analyze each barcode only within its library. In that case, library_col should be name of column giving library name.

max_sub_diffs (int)

Drop any barcode where any variant differs from all other variants for that barcode by more than this many substitution (point mutation) differences.

max_indel_diffs (int)

Drop any barcode where any variant differs from all other variants for that barcode by more than this many indel (insertion or deletion) differences.

max_minor_muts (int)

Drop any barcode where there is a minor (non-consensus) mutation found more than this many times.

Returns:

The 2-tuple (consensus, dropped). These are each data frames:

  • consensus is a new data frame with a row for each barcode for which we could call the consensus. The columns have the same names as barcode_col, substitution_col, insertion_col, deletion_col, and (optionally) library_col–but the the three columns for the mutations now just list the consensus mutations of that type. In addition, there is a new column called “variant_call_support” that gives the number of sequences supporting the call of that barcode.

  • dropped simply contains all rows in the original df that correspond to sequences that were dropped due to max_diffs or max_minor_muts not being satisfied. There is also a column called “drop_reason” that gives the reason that the barcode was dropped.

The approach is as follows:

  1. Group all variants by library and barcode.

  2. If there are multiple sequences, check if any of them differ from all the others by more than max_diffs mutations total (taking substitutions, insertions, and deletions together). If so, drop the entire barcode. The reason is that if there are variants that are very different, it becomes likely that it isn’t just sequencing error, but rather something is wrong with that barcode (multiple variants with same barcode or strand exchange).

  3. Take the consensus of the sequences, which means keeping mutations that are present in greater than half of the variants. Note that this calling scheme means that the consensus being called is dependent on the reference used to call the mutations, which is an important caveat if you are calling variants relative to multiple different parent sequences.

  4. If there are any minor mutations (mutations not in consensus) that are present in more than max_minor_muts variants or missing from more than max_minor_muts variants, then drop that barcode. The reason is that recurring minor mutations also suggest some problem more complex than sequencing error that may render the whole barcode family invalid.

Note that this method returns a consensus even if there is just one sequence for the barcode (in that case, this sequence is the consensus). This is fine–if you want to get consensus calls that are more strongly supported, simply filter the returned consensus data frame for larger values of variant_call_support, as the more sequences that support a call the more accurate it is expected to be.

Here is an example:

>>> df = pandas.DataFrame([
...     ('s1', 'AG', ['A2C'], [], ['del5to7']),
...     ('s1', 'AG', ['A2C'], [], []),
...     ('s1', 'TA', ['G3A'], ['ins4len3'], []),
...     ('s2', 'TA', ['C5A', 'T6C'], [], []),
...     ('s2', 'TA', ['T6C'], ['ins5len1'], []),
...     ('s2', 'TA', ['T6C'], [], []),
...     ('s2', 'TG', ['T6A'], [], []),
...     ('s2', 'TG', ['A2G'], [], []),
...     ('s2', 'GG', [], [], ['del1to4']),
...     ('s2', 'GG', ['A1C'], [], []),
...     ('s2', 'AA', [], [], []),
...     ('s2', 'AA', [], [], []),
...     ('s2', 'AA', ['T6C'], [], []),
...     ('s2', 'AA', ['T6C'], [], []),
...     ('s3', 'AA', ['T6G'], ['ins1len1'], ['del1to2']),
...     ('s3', 'AA', ['T6G'], [], ['del5to7'])],
...     columns=['library', 'barcode', 'substitutions',
...              'insertions', 'deletions']
...     )
>>> consensus, dropped = simpleConsensus(df, library_col='library')
>>> pandas.set_option('display.max_columns', 10)
>>> pandas.set_option('display.width', 500)
>>> consensus
  library barcode substitutions  insertions deletions  variant_call_support
0      s1      AG         [A2C]          []        []                     2
1      s1      TA         [G3A]  [ins4len3]        []                     1
2      s2      GG            []          []        []                     2
3      s2      TA         [T6C]          []        []                     3
>>> dropped
  library barcode substitutions  insertions  deletions           drop_reason
0      s2      TG         [T6A]          []         []  excess substitutions
1      s2      TG         [A2G]          []         []  excess substitutions
2      s2      AA            []          []         []     excess minor muts
3      s2      AA            []          []         []     excess minor muts
4      s2      AA         [T6C]          []         []     excess minor muts
5      s2      AA         [T6C]          []         []     excess minor muts
6      s3      AA         [T6G]  [ins1len1]  [del1to2]         excess indels
7      s3      AA         [T6G]          []  [del5to7]         excess indels
dms_tools2.barcodes.tidy_split(df, column, sep=' ', keep=False)[source]

Split values of a column and expand so new DataFrame has one split value per row. Filters rows where the column is missing.

Taken from https://stackoverflow.com/a/39946744

Args:
dfpandas DataFrame

dataframe with the column to split and expand

columnstr

the column to split and expand

sepstr

the string used to split the column’s values

keepbool

whether to retain the presplit value as it’s own row

Returns:
pandas DataFrame

Returns a dataframe with the same columns as df.