barcodes
Utility functions to process and analyze barcodes.
- dms_variants.barcodes.inverse_simpson_index(barcodecounts, *, barcodecol='barcode', countcol='count', groupcols='library')
Inverse Simpson index (reciprocal of the probability that two randomly drawn barcodes are the same).
- Parameters:
barcodecounts (pandas.DataFrame) – Data frame with barcode counts.
barcodecol (str) – Column in barcodecounts listing all unique barcodes.
countcol (str) – Column in barcodecounts with counts of each barcode.
groupcols (str, list, or None) – Columns in barcodecounts by which we group for calculations.
- Return type:
pandas.DataFrame
Example
>>> barcodecounts = pd.DataFrame.from_records(
...     [('lib1', 'AA', 10),
...      ('lib1', 'AT', 20),
...      ('lib1', 'AC', 30),
...      ('lib2', 'AA', 5)],
...     columns=['library', 'barcode', 'count'])
>>> inverse_simpson_index(barcodecounts)
  library  inverse_simpson_index
0    lib1               2.571429
1    lib2               1.000000
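The values in the example above can be checked by hand. The following is a minimal standard-library sketch, not the library's implementation: it assumes the index is computed per group as 1 / Σ pᵢ², where pᵢ is the fraction of counts belonging to barcode i. The helper name `inverse_simpson` and the `records` list are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def inverse_simpson(counts):
    """Return 1 / sum(p_i ** 2) for a sequence of non-negative counts.

    Hypothetical helper for illustration; not part of dms_variants.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must sum to a positive number")
    return 1.0 / sum((c / total) ** 2 for c in counts)


# Same records as the example above: (library, barcode, count).
records = [
    ("lib1", "AA", 10),
    ("lib1", "AT", 20),
    ("lib1", "AC", 30),
    ("lib2", "AA", 5),
]

# Group the counts by library, mirroring groupcols='library'.
counts_by_library = defaultdict(list)
for library, _barcode, count in records:
    counts_by_library[library].append(count)

indices = {lib: inverse_simpson(c) for lib, c in counts_by_library.items()}
```

For lib1 the frequencies are 1/6, 1/3, and 1/2, so the sum of squares is 14/36 and the index is 36/14, about 2.571429. A library with a single barcode always gives an index of 1.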
- dms_variants.barcodes.rarefyBarcodes(barcodecounts, *, barcodecol='barcode', countcol='count', maxpoints=100000, logspace=True)
Rarefaction curve of barcode observations.
Note
Uses analytical formula for rarefaction defined here: https://en.wikipedia.org/wiki/Rarefaction_(ecology)#Derivation
- Parameters:
barcodecounts (pandas.DataFrame) – Data frame with counts to rarefy.
barcodecol (str) – Column in barcodecounts listing all unique barcodes.
countcol (str) – Column in barcodecounts with observed counts of each barcode.
maxpoints (int) – Calculate the rarefaction curve at no more than this many points. This limit helps because the curve is costly to calculate at many points.
logspace (bool) – Logarithmically space the points. If False, space linearly.
- Returns:
A data frame with columns ‘ncounts’ and ‘nbarcodes’ giving the expected number of unique barcodes observed (‘nbarcodes’) for each total number of observed counts (‘ncounts’).
- Return type:
pandas.DataFrame
Example
>>> barcodecounts = pd.DataFrame({'barcode': ['A', 'G', 'C', 'T'],
...                               'count': [4, 2, 1, 0]})
>>> rarefaction_curve = rarefyBarcodes(barcodecounts)
>>> rarefaction_curve
   ncounts  nbarcodes
0        1   1.000000
1        2   1.666667
2        3   2.114286
3        4   2.428571
4        5   2.666667
5        6   2.857143
6        7   3.000000
Verify this result matches what is obtained by random sampling:
>>> random.seed(1)
>>> barcodelist = []
>>> for tup in barcodecounts.itertuples(index=False):
...     barcodelist += [tup.barcode] * tup.count
>>> nrand = 10000
>>> ncounts = list(range(1, barcodecounts['count'].sum() + 1))
>>> nbarcodes = []
>>> for ncount in ncounts:
...     nbarcodes.append(sum(len(set(random.sample(barcodelist, ncount)))
...                          for _ in range(nrand)) / nrand)
>>> sim_rarefaction_curve = pd.DataFrame({'ncounts': ncounts,
...                                       'nbarcodes': nbarcodes})
>>> numpy.allclose(rarefaction_curve, sim_rarefaction_curve, atol=1e-2)
True
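The analytical formula referenced in the Note above can also be evaluated directly. The sketch below is an illustration under stated assumptions, not the library's implementation: it assumes sampling without replacement, so a barcode with count c is missing from a sample of n counts with probability C(N − c, n) / C(N, n), where N is the total count. The function name `expected_unique` is hypothetical, and the sketch ignores the `maxpoints` and `logspace` options.

```python
from math import comb


def expected_unique(counts, n):
    """Expected number of distinct barcodes among n counts drawn without replacement.

    Hypothetical helper for illustration; not part of dms_variants.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total count")
    denom = comb(total, n)
    # A barcode with count c is absent from the sample with probability
    # comb(total - c, n) / comb(total, n). Summing the complements over
    # barcodes with nonzero counts gives the expected number observed.
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)


# Same counts as the example above (barcodes A, G, C, T).
counts = [4, 2, 1, 0]
curve = [(n, expected_unique(counts, n)) for n in range(1, sum(counts) + 1)]
```

With these counts, `curve` reproduces the `nbarcodes` column shown above: for n = 2 the absent-probabilities are 3/21, 10/21, and 15/21, giving 3 − 28/21 ≈ 1.666667. The barcode with a count of zero never contributes, which is why the curve plateaus at 3 rather than 4.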