consensus¶
Get consensus of sequences / mutations. Useful for grouping sequences within a barcode.
- class alignparse.consensus.Mutations(substitutions, deletions, insertions)¶
Bases:
tuple
- deletions¶
Alias for field number 1
- insertions¶
Alias for field number 2
- substitutions¶
Alias for field number 0
- alignparse.consensus.add_mut_info_cols(df, *, mutation_col='mutations', sub_str_col=None, del_str_col=None, ins_str_col=None, indel_str_col=None, n_sub_col=None, n_del_col=None, n_ins_col=None, n_indel_col=None, overwrite_cols=False)[source]¶
Expand information about mutations in a data frame.
- Parameters:
df (pandas.DataFrame) – Data frame with each row giving information about a variant.
mutation_col (str) – Name of column in df with str containing space-delimited list of mutations in format parseable by
process_mut_str()
.sub_str_col (str or None) – Name of added column with string giving substitutions.
del_str_col (str or None) – Name of added column with string giving deletions.
ins_str_col (str or None) – Name of added column with string giving insertions.
indel_str_col (str or None) – Name of added column with string giving deletions and insertions.
n_sub_col (str or None) – Name of added column with number of substitutions.
n_del_col (str or None) – Name of added column with number of deletions.
n_ins_col (str or None) – Name of added column with number of insertions.
n_indel_col (str or None) – Name of added column with number of deletions and insertions.
overwrite_cols (bool) – If column to be created is already in df, overwrite it?
- Returns:
A copy of df with the indicated columns added.
- Return type:
pandas.DataFrame
Example
Data frame for which we get mutation info:
>>> df = pd.DataFrame({ ... 'name': ['seq1', 'seq2', 'seq3'], ... 'mutations': ['A6G del7to8 T3A', '', 'T5A ins5AAT del8to9'], ... })
Get info on substitutions:
>>> add_mut_info_cols(df, sub_str_col='subs', n_sub_col='nsubs') name mutations subs nsubs 0 seq1 A6G del7to8 T3A T3A A6G 2 1 seq2 0 2 seq3 T5A ins5AAT del8to9 T5A 1
Get info on substitutions and indels:
>>> pd.set_option('display.max_columns', 10) >>> add_mut_info_cols(df, sub_str_col='subs', n_sub_col='nsubs', ... indel_str_col='indels', n_indel_col='nindels') name mutations subs indels nsubs nindels 0 seq1 A6G del7to8 T3A T3A A6G del7to8 2 1 1 seq2 0 0 2 seq3 T5A ins5AAT del8to9 T5A del8to9 ins5AAT 1 2
Just get counts for all types of mutations:
>>> add_mut_info_cols(df, n_sub_col='nsubs', n_del_col='ndels', ... n_ins_col='nins', n_indel_col='nindels') name mutations nsubs ndels nins nindels 0 seq1 A6G del7to8 T3A 2 1 0 1 1 seq2 0 0 0 0 2 seq3 T5A ins5AAT del8to9 1 1 1 2
- alignparse.consensus.empirical_accuracy(df, *, group_cols='barcode', upstream_group_cols='library', mutation_col='mutations', accuracy_col='accuracy', sort_mutations=True)[source]¶
Accuracy from number of identical sequences in a group (i.e., barcode).
Note
The accuracy \(1 - \epsilon\) is calculated as follows:
Let \(\epsilon\) be the probability a sequence has an error. Consider a group of \(n\) sequences that should be identical in the absence of errors (i.e., they all have the same barcode), and assume errors only make sequences different (the chance of two sequences having the same error is very low). We would like to calculate the probability of observing \(u\) unique sequences in the group given \(\epsilon\) and \(n\).
First, consider the case where there are two sequences in the group (\(n = 2\)). The probability of having \(u = 1\) unique sequences in the group is the probability that neither have an error:
\[\Pr\left(u=1 | n=2, \epsilon\right) = \left(1 - \epsilon\right)^2.\]The probability of having \(u = 2\) unique sequences is the the probability that either one or both have errors:
\[\Pr\left(u=2 | n=2, \epsilon\right) = 2\left(1 - \epsilon\right)\epsilon + \epsilon^2.\]Generalizing to arbitrary \(n\), we have:
\[\Pr\left(u | n, \epsilon\right) = \binom{n}{u-1} \left(1 - \epsilon\right)^{n - u + 1} \epsilon^{u - 1} + \delta_{un} \epsilon^n\]where \(\binom{n}{u-1}\) is the binomial coefficient and \(\delta_{un}\) is the Kronecker delta.
Let there be \(G\) groups of sequences, with the size and number of unique sequences in group \(g\) be \(n_g\) and \(u_g\) (where \(g = 1, \ldots, G\)). The overall likelihood \(L\) of the observed numbers of unique sequences given the group sizes and error rate is:
\[\begin{split}L &=& \Pr\left(\left\{u_g\right\}|\left\{n_g\right\},\epsilon\right) \\ &=& \prod_g \Pr\left(u_g | n_g, \epsilon\right).\end{split}\]To find the maximum likelihood of error rate, we simply use numerical optimization to find the value of \(\epsilon\) that maximizes \(L\). In practice, we actually maximize \(\ln\left(L\right)\).
- Parameters:
df (pandas.DataFrame) – Each row gives data for a different sequence.
group_cols (str or list) – Column(s) in df indicating how we group sequences before computing fraction within group that is identical.
upstream_group_cols (str or None or list) – Column(s) in df that we use to group before doing the grouping for computing fraction identical. So a different average fraction will be returned for each of these upstream groups.
mutation_col (str) – Column in df that we compare when determining if sequences are identical. Typically list of mutations or sequences themselves.
accuracy_col (str) – Name of column in returned data frame that gives empirical accuracy.
sort_mutations (bool) – Useful if you have strings of space-delimited mutations not guaranteed to be consistently ordered. If True, sort such space-delimited mutations.
- Returns:
Has all columns in upstream_group_cols plus the new column with name given by accuracy_col.
- Return type:
pandas.DataFrame
Example
>>> df = pd.DataFrame({ ... 'barcode': ['AT', 'AT', 'TG', 'TA', 'TA', 'TA'], ... 'mutations': ['A1G', 'A1G', '', 'T2A C6G', 'T5C', 'C6G T2A']}) >>> with pd.option_context('display.precision', 4): ... empirical_accuracy(df, upstream_group_cols=None) accuracy 0 0.7692
If we do the same without sorting the mutations, we get lower values as ‘T2A C6G’ and ‘C6G T2A’ are then considered as different:
>>> with pd.option_context('display.precision', 4): ... empirical_accuracy(df, upstream_group_cols=None, ... sort_mutations=False) accuracy 0 0.4766
Now another example with two libraries and non-standard column names. Note that we only get results for the two libraries with barcodes found multiple times:
>>> df2 = pd.DataFrame({ ... 'bc' :['AT', 'AT', 'TG', 'TA', 'TA', 'TA', 'TA'], ... 'var' :['v1', 'v1', 'v2', 'v3', 'v4', 'v3', 'v3'], ... 'lib' :['s1', 's1', 's2', 's3', 's3', 's3', 's4']}) >>> with pd.option_context('display.precision', 4): ... empirical_accuracy(df2, upstream_group_cols='lib', group_cols='bc', ... mutation_col='var', sort_mutations=False) lib accuracy 0 s1 1.0000 1 s3 0.6667
- alignparse.consensus.process_mut_str(s)[source]¶
Process a string of mutations.
- Parameters:
s (str) – Space-delimited mutations. Substitutions in form ‘A6T’. Deletions in form ‘del5to7’. Insertions in form ‘ins6len2’ or ‘ins6TA’. Negative site numbers are allowed.
- Returns:
- The tuple is named Mutations and has the following elements in order:
substitutions: list of substitutions in order of sequence.
deletions: list of deletions in order of sequence.
insertions: list of insertions in order of sequence.
- Return type:
namedtuple
Example
>>> s = 'A1T del5to7 G9C ins6len2 del12to12' >>> process_mut_str(s) Mutations(substitutions=['A1T', 'G9C'], deletions=['del5to7', 'del12to12'], insertions=['ins6len2']) >>> s = 'A11T del5to7 G-9C ins-6GA del12to15 ins13AAT' >>> process_mut_str(s) Mutations(substitutions=['G-9C', 'A11T'], deletions=['del5to7', 'del12to15'], insertions=['ins-6GA', 'ins13AAT'])
- alignparse.consensus.simple_mutconsensus(df, *, group_cols=('library', 'barcode'), mutation_col='mutations', max_sub_diffs=1, max_indel_diffs=2, max_minor_sub_frac=0.1, max_minor_indel_frac=0.25, max_minor_greater_or_equal=False, min_support=1, support_col='variant_call_support')[source]¶
Get simple consensus of mutations with group (i.e., barcode).
- Parameters:
df (pandas.DataFrame) – Each row gives data on mutations for a different sequence.
group_cols (list, tuple, or str) – Group all sequences that share values in these column(s) of df.
mutation_col (str) – Column in df with mutations in form that can be processed by
process_mut_str()
.max_sub_diffs (int or None) – Drop groups where any variant differs from all other variants by more than this many substitution (point mutation) differences. Set to
None
if no limit.max_indel_diffs (int or None) – Drop groups where any variant differs from all other variants by more than this many indel differences. Set to
None
if no limit.max_minor_sub_frac (float) – Drop groups with a minor (non-consensus) substitution in > the ceiling of this fraction times the number of sequences in group.
max_minor_indel_frac (float) – Drop groups with a minor (non-consensus) indel in > the ceiling of this fraction times the number of sequences in group.
max_minor_greater_or_equal (bool) – For
max_minor_sub_frac
andmax_minor_indel_frac
, use >= rather than >. This is may help if you want to be conservative in avoiding bad consensuses. But don’t reject if zero differences.min_support (int) – Require at least this many supporting sequences to call consensus.
support_col (str) – Name of column in returned consensus data frame with number of sequences supporting the consensus call.
Note
The rationale behind this consensus calling scheme is that we want to build a consensus except in the following two cases, each of which indicate a likely problem beyond simple rare sequencing errors (such as strand exchange or barcode collisions):
A sequence within the group differs by too much from others in group (given by max_sub_diffs and max_indel_diffs).
There are “minor” (non-consensus) mutations present at too high frequency (given by max_minor_indel_frac and max_minor_sub_frac).
- Returns:
The returned tuple (consensus, dropped) consists of two pandas data frames:
consensus: Gives consensus for all groups for which this can be constructed. Columns are group_cols, column with name given by mutation_col that gives consensus mutations, and a column with name given by support_col that gives the number of variants used to generate the consensus.
dropped: Gives each dropped group, the reason it was dropped, and the number of variants (sequences) in that group.
- Return type:
tuple
Example
Create a dummy CSV file with our variants and mutations:
>>> f = io.StringIO() >>> _ = f.write(textwrap.dedent(''' ... library,barcode,mutations ... lib_1,AG,A2C del5to7 ... lib_1,AG,A2C ... lib_1,TA,G3A ins4len3 ... lib_1,TA,G3A ins4len3 ... lib_1,TA,G3A ins4len3 ... lib_2,TA,C5A T-6C ... lib_2,TA,ins5len1 T-6C ... lib_2,TA,T-6C ... lib_2,TA,A4G T-6C ... lib_2,TG,T6A ... lib_2,TG,A2G ... lib_2,GG,del1to4 ... lib_2,GG,A1C ... lib_2,AA, ... lib_2,AA, ... lib_2,AA,T6C ... lib_2,AA,T6C ... lib_3,AA,ins1len1 del1to2 T6G ... lib_3,AA,del5to7 T6G ... ''')) >>> _ = f.seek(0)
Read into data frame. Note use of na_filter=False so empty strings are read as such rather than as na values:
>>> df = pd.read_csv(f, na_filter=False)
Get the consensus mutations, and the dropped libraries / barcodes:
>>> consensus, dropped = simple_mutconsensus(df) >>> consensus library barcode mutations variant_call_support 0 lib_1 AG A2C 2 1 lib_1 TA G3A ins4len3 3 2 lib_2 GG 2 3 lib_2 TA T-6C 4 >>> dropped library barcode drop_reason nseqs 0 lib_2 AA minor subs too frequent 4 1 lib_2 TG subs diff too large 2 2 lib_3 AA indels diff too large 2
Get the consensus just for library 1:
>>> lib1_consensus, _ = simple_mutconsensus(df.query('library == "lib_1"'), ... group_cols='barcode') >>> lib1_consensus barcode mutations variant_call_support 0 AG A2C 2 1 TA G3A ins4len3 3
Set
max_sub_diffs
to None:>>> consensus, dropped = simple_mutconsensus(df, max_sub_diffs=None) >>> consensus library barcode mutations variant_call_support 0 lib_1 AG A2C 2 1 lib_1 TA G3A ins4len3 3 2 lib_2 GG 2 3 lib_2 TA T-6C 4 4 lib_2 TG 2 >>> dropped library barcode drop_reason nseqs 0 lib_2 AA minor subs too frequent 4 1 lib_3 AA indels diff too large 2
Set
max_minor_greater_or_equal
to True:>>> consensus, dropped = simple_mutconsensus( ... df, max_minor_greater_or_equal=True) >>> consensus library barcode mutations variant_call_support 0 lib_1 TA G3A ins4len3 3 >>> dropped library barcode drop_reason nseqs 0 lib_1 AG minor indels too frequent 2 1 lib_2 AA minor subs too frequent 4 2 lib_2 GG minor subs too frequent 2 3 lib_2 TA minor subs too frequent 4 4 lib_2 TG subs diff too large 2 5 lib_3 AA indels diff too large 2
Use
min_support
:>>> consensus, dropped = simple_mutconsensus(df, min_support=3) >>> consensus library barcode mutations variant_call_support 0 lib_1 TA G3A ins4len3 3 1 lib_2 TA T-6C 4 >>> dropped library barcode drop_reason nseqs 0 lib_1 AG too few sequences 2 1 lib_2 AA minor subs too frequent 4 2 lib_2 GG too few sequences 2 3 lib_2 TG too few sequences 2 4 lib_3 AA too few sequences 2