barcodes

Utility functions to process and analyze barcodes.

dms_variants.barcodes.inverse_simpson_index(barcodecounts, *, barcodecol='barcode', countcol='count', groupcols='library')

Inverse Simpson index (reciprocal of the probability that two randomly drawn barcodes are the same).

Parameters:
  • barcodecounts (pandas.DataFrame) – Data frame with barcode counts.

  • barcodecol (str) – Column in barcodecounts listing all unique barcodes.

  • countcol (str) – Column in barcodecounts with counts of each barcode.

  • groupcols (str, list, or None) – Columns in barcodecounts by which we group for calculations.

Return type:

pandas.DataFrame

Example

>>> import pandas as pd
>>> from dms_variants.barcodes import inverse_simpson_index
>>> barcodecounts = pd.DataFrame.from_records(
...        [('lib1', 'AA', 10),
...         ('lib1', 'AT', 20),
...         ('lib1', 'AC', 30),
...         ('lib2', 'AA', 5)],
...        columns=['library', 'barcode', 'count'])
>>> inverse_simpson_index(barcodecounts)
  library  inverse_simpson_index
0    lib1               2.571429
1    lib2               1.000000
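
The numbers in the example above can be reproduced by hand: within each group, take each barcode's fraction of the total counts, sum the squared fractions, and take the reciprocal. The following is a minimal sketch of that arithmetic in plain Python, not the library's implementation; the function name `inverse_simpson_sketch` and the tuple-based input are assumptions made for illustration.

```python
from collections import defaultdict


def inverse_simpson_sketch(records):
    """Inverse Simpson index per group (hypothetical helper, not library code).

    `records` is an iterable of (group, barcode, count) tuples.
    """
    # Collect the counts of every barcode within each group.
    counts_by_group = defaultdict(list)
    for group, _barcode, count in records:
        counts_by_group[group].append(count)

    result = {}
    for group, counts in counts_by_group.items():
        total = sum(counts)
        # Probability that two barcodes drawn with replacement are the same.
        simpson = sum((c / total) ** 2 for c in counts)
        result[group] = 1 / simpson
    return result


records = [
    ("lib1", "AA", 10),
    ("lib1", "AT", 20),
    ("lib1", "AC", 30),
    ("lib2", "AA", 5),
]
indices = inverse_simpson_sketch(records)
print(indices)
```

For `lib1` the fractions are 1/6, 1/3 and 1/2, so the sum of squares is 14/36 and the index is 36/14, which rounds to the 2.571429 shown above. A group with a single barcode, such as `lib2`, always has an index of 1.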

dms_variants.barcodes.rarefyBarcodes(barcodecounts, *, barcodecol='barcode', countcol='count', maxpoints=100000, logspace=True)

Rarefaction curve of barcode observations.

Note

Uses the analytical formula for rarefaction given at https://en.wikipedia.org/wiki/Rarefaction_(ecology)#Derivation

Parameters:
  • barcodecounts (pandas.DataFrame) – Data frame with counts to rarefy.

  • barcodecol (str) – Column in barcodecounts listing all unique barcodes.

  • countcol (str) – Column in barcodecounts with observed counts of each barcode.

  • maxpoints (int) – Calculate the rarefaction curve at no more than this many points. Capping the number of points is useful because the curve is costly to calculate at many points.

  • logspace (bool) – Logarithmically space the points. If False, space linearly.
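
One way to picture how maxpoints and logspace interact is sketched below. This is an illustration under stated assumptions, not the library's code: the helper name `rarefaction_points` is hypothetical, and it assumes the points run from 1 to the total number of counts, with every integer used when the total does not exceed maxpoints.

```python
def rarefaction_points(total, maxpoints, logspace=True):
    """Pick sample depths for a rarefaction curve (hypothetical helper).

    Returns sorted, unique integers between 1 and `total` inclusive.
    """
    if maxpoints >= total:
        # Few enough counts that every depth can be evaluated.
        return list(range(1, total + 1))
    if maxpoints < 2:
        raise ValueError("maxpoints must be at least 2 in this sketch")

    steps = maxpoints - 1
    if logspace:
        # Evenly spaced exponents: dense at low depth, sparse at high depth.
        raw = [total ** (i / steps) for i in range(maxpoints)]
    else:
        # Evenly spaced depths from 1 to total.
        raw = [1 + i * (total - 1) / steps for i in range(maxpoints)]

    # Round to integers, clamp to the valid range, and drop duplicates.
    return sorted({min(total, max(1, round(x))) for x in raw})


log_points = rarefaction_points(1000, 4)
lin_points = rarefaction_points(1000, 4, logspace=False)
all_points = rarefaction_points(7, 100000)
print(log_points, lin_points, all_points)
```

Logarithmic spacing concentrates points at low sample depths, where a rarefaction curve typically changes fastest.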

Returns:

A data frame with columns ‘ncounts’ and ‘nbarcodes’ giving the expected number of unique barcodes observed at each total number of sampled counts.

Return type:

pandas.DataFrame

Example

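>>> import random
>>> import numpy
>>> import pandas as pd
>>> from dms_variants.barcodes import rarefyBarcodes
>>> barcodecounts = pd.DataFrame({'barcode': ['A', 'G', 'C', 'T'],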
...                               'count': [4, 2, 1, 0]})
>>> rarefaction_curve = rarefyBarcodes(barcodecounts)
>>> rarefaction_curve
   ncounts  nbarcodes
0        1   1.000000
1        2   1.666667
2        3   2.114286
3        4   2.428571
4        5   2.666667
5        6   2.857143
6        7   3.000000

Verify this result matches what is obtained by random sampling:

>>> random.seed(1)
>>> barcodelist = []
>>> for tup in barcodecounts.itertuples(index=False):
...     barcodelist += [tup.barcode] * tup.count
>>> nrand = 10000
>>> ncounts = list(range(1, barcodecounts['count'].sum() + 1))
>>> nbarcodes = []
>>> for ncount in ncounts:
...     nbarcodes.append(sum(len(set(random.sample(barcodelist, ncount)))
...                      for _ in range(nrand)) / nrand)
>>> sim_rarefaction_curve = pd.DataFrame({'ncounts': ncounts,
...                                       'nbarcodes': nbarcodes})
>>> numpy.allclose(rarefaction_curve, sim_rarefaction_curve, atol=1e-2)
True
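
The analytical formula referenced in the note gives the expected number of distinct barcodes in a sample of n counts as K minus the sum, over barcodes, of C(N - N_i, n) / C(N, n), where N is the total count, N_i is the count of barcode i, and K is the number of barcodes. The following is a minimal sketch of that formula using only the standard library; it is not the library's implementation, and the function name `rarefaction_sketch` is an assumption made for illustration.

```python
from math import comb


def rarefaction_sketch(counts):
    """Expected number of unique barcodes at each sample depth.

    Hypothetical helper, not library code. `counts` lists the observed
    count of each barcode; returns (ncounts, nbarcodes) tuples.
    """
    # Barcodes with zero counts can never be sampled, so drop them.
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    nbarcodes = len(counts)

    curve = []
    for n in range(1, total + 1):
        ways = comb(total, n)
        # C(total - c, n) / C(total, n) is the probability that a barcode
        # with count c is absent from a sample of n counts drawn without
        # replacement; comb returns 0 when n > total - c.
        expected_missing = sum(comb(total - c, n) for c in counts) / ways
        curve.append((n, nbarcodes - expected_missing))
    return curve


curve = rarefaction_sketch([4, 2, 1, 0])
for ncounts, expected in curve:
    print(ncounts, round(expected, 6))
```

Run on the counts from the example above (4, 2, 1, 0), this reproduces the nbarcodes column, for instance 1.666667 at a depth of 2 and 3.0 at a depth of 7. It evaluates every depth from 1 to N, so it is only practical for small totals.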