consensus¶

Get consensus of sequences / mutations. Useful for grouping sequences within a barcode.

class alignparse.consensus.Mutations(substitutions, deletions, insertions)¶

Bases: tuple

deletions¶: Alias for field number 1

insertions¶: Alias for field number 2

substitutions¶: Alias for field number 0

alignparse.consensus.add_mut_info_cols(df, *, mutation_col='mutations', sub_str_col=None, del_str_col=None, ins_str_col=None, indel_str_col=None, n_sub_col=None, n_del_col=None, n_ins_col=None, n_indel_col=None, overwrite_cols=False)[source]¶

Expand information about mutations in a data frame.

Parameters:

df (pandas.DataFrame) – Data frame with each row giving information about a variant.
mutation_col (str) – Name of column in df with str containing space-delimited list of mutations in format parseable by process_mut_str().
sub_str_col (str or None) – Name of added column with string giving substitutions.
del_str_col (str or None) – Name of added column with string giving deletions.
ins_str_col (str or None) – Name of added column with string giving insertions.
indel_str_col (str or None) – Name of added column with string giving deletions and insertions.
n_sub_col (str or None) – Name of added column with number of substitutions.
n_del_col (str or None) – Name of added column with number of deletions.
n_ins_col (str or None) – Name of added column with number of insertions.
n_indel_col (str or None) – Name of added column with number of deletions and insertions.
overwrite_cols (bool) – If column to be created is already in df, overwrite it?

Returns:

A copy of df with the indicated columns added.

Return type:

pandas.DataFrame

Example

Data frame for which we get mutation info:

>>> df = pd.DataFrame({
...       'name': ['seq1', 'seq2', 'seq3'],
...       'mutations': ['A6G del7to8 T3A', '', 'T5A ins5AAT del8to9'],
...       })

Get info on substitutions:

>>> add_mut_info_cols(df, sub_str_col='subs', n_sub_col='nsubs')
   name            mutations     subs  nsubs
0  seq1      A6G del7to8 T3A  T3A A6G      2
1  seq2                                    0
2  seq3  T5A ins5AAT del8to9      T5A      1

Get info on substitutions and indels:

>>> pd.set_option('display.max_columns', 10)
>>> add_mut_info_cols(df, sub_str_col='subs', n_sub_col='nsubs',
...                   indel_str_col='indels', n_indel_col='nindels')
   name            mutations     subs           indels  nsubs  nindels
0  seq1      A6G del7to8 T3A  T3A A6G          del7to8      2        1
1  seq2                                                     0        0
2  seq3  T5A ins5AAT del8to9      T5A  del8to9 ins5AAT      1        2

Just get counts for all types of mutations:

>>> add_mut_info_cols(df, n_sub_col='nsubs', n_del_col='ndels',
...                   n_ins_col='nins', n_indel_col='nindels')
   name            mutations  nsubs  ndels  nins  nindels
0  seq1      A6G del7to8 T3A      2      1     0        1
1  seq2                           0      0     0        0
2  seq3  T5A ins5AAT del8to9      1      1     1        2

alignparse.consensus.empirical_accuracy(df, *, group_cols='barcode', upstream_group_cols='library', mutation_col='mutations', accuracy_col='accuracy', sort_mutations=True)[source]¶

Accuracy from number of identical sequences in a group (i.e., barcode).

Note

The accuracy \(1 - \epsilon\) is calculated as follows:

Let \(\epsilon\) be the probability a sequence has an error. Consider a group of \(n\) sequences that should be identical in the absence of errors (i.e., they all have the same barcode), and assume errors only make sequences different (the chance of two sequences having the same error is very low). We would like to calculate the probability of observing \(u\) unique sequences in the group given \(\epsilon\) and \(n\).

First, consider the case where there are two sequences in the group (\(n = 2\)). The probability of having \(u = 1\) unique sequences in the group is the probability that neither have an error:

\[\Pr\left(u=1 | n=2, \epsilon\right) = \left(1 - \epsilon\right)^2.\]

The probability of having \(u = 2\) unique sequences is the the probability that either one or both have errors:

\[\Pr\left(u=2 | n=2, \epsilon\right) = 2\left(1 - \epsilon\right)\epsilon + \epsilon^2.\]

Generalizing to arbitrary \(n\), we have:

\[\Pr\left(u | n, \epsilon\right) = \binom{n}{u-1} \left(1 - \epsilon\right)^{n - u + 1} \epsilon^{u - 1} + \delta_{un} \epsilon^n\]

where \(\binom{n}{u-1}\) is the binomial coefficient and \(\delta_{un}\) is the Kronecker delta.

Let there be \(G\) groups of sequences, with the size and number of unique sequences in group \(g\) be \(n_g\) and \(u_g\) (where \(g = 1, \ldots, G\)). The overall likelihood \(L\) of the observed numbers of unique sequences given the group sizes and error rate is:

\[\begin{split}L &=& \Pr\left(\left\{u_g\right\}|\left\{n_g\right\},\epsilon\right) \\ &=& \prod_g \Pr\left(u_g | n_g, \epsilon\right).\end{split}\]

To find the maximum likelihood of error rate, we simply use numerical optimization to find the value of \(\epsilon\) that maximizes \(L\). In practice, we actually maximize \(\ln\left(L\right)\).

Parameters:

df (pandas.DataFrame) – Each row gives data for a different sequence.
group_cols (str or list) – Column(s) in df indicating how we group sequences before computing fraction within group that is identical.
upstream_group_cols (str or None or list) – Column(s) in df that we use to group before doing the grouping for computing fraction identical. So a different average fraction will be returned for each of these upstream groups.
mutation_col (str) – Column in df that we compare when determining if sequences are identical. Typically list of mutations or sequences themselves.
accuracy_col (str) – Name of column in returned data frame that gives empirical accuracy.
sort_mutations (bool) – Useful if you have strings of space-delimited mutations not guaranteed to be consistently ordered. If True, sort such space-delimited mutations.

Returns:

Has all columns in upstream_group_cols plus the new column with name given by accuracy_col.

Return type:

pandas.DataFrame

Example

>>> df = pd.DataFrame({
...         'barcode': ['AT', 'AT', 'TG', 'TA', 'TA', 'TA'],
...         'mutations': ['A1G', 'A1G', '', 'T2A C6G', 'T5C', 'C6G T2A']})
>>> with pd.option_context('display.precision', 4):
...     empirical_accuracy(df, upstream_group_cols=None)
   accuracy
0    0.7692

If we do the same without sorting the mutations, we get lower values as ‘T2A C6G’ and ‘C6G T2A’ are then considered as different:

>>> with pd.option_context('display.precision', 4):
...     empirical_accuracy(df, upstream_group_cols=None,
...                        sort_mutations=False)
   accuracy
0    0.4766

Now another example with two libraries and non-standard column names. Note that we only get results for the two libraries with barcodes found multiple times:

>>> df2 = pd.DataFrame({
...     'bc'  :['AT', 'AT', 'TG', 'TA', 'TA', 'TA', 'TA'],
...     'var' :['v1', 'v1', 'v2', 'v3', 'v4', 'v3', 'v3'],
...     'lib' :['s1', 's1', 's2', 's3', 's3', 's3', 's4']})
>>> with pd.option_context('display.precision', 4):
...     empirical_accuracy(df2, upstream_group_cols='lib', group_cols='bc',
...                        mutation_col='var', sort_mutations=False)
  lib  accuracy
0  s1    1.0000
1  s3    0.6667

alignparse.consensus.process_mut_str(s)[source]¶

Process a string of mutations.

Parameters:

s (str) – Space-delimited mutations. Substitutions in form ‘A6T’. Deletions in form ‘del5to7’. Insertions in form ‘ins6len2’ or ‘ins6TA’. Negative site numbers are allowed.

Returns:

The tuple is named Mutations and has the following elements in order:

substitutions: list of substitutions in order of sequence.
deletions: list of deletions in order of sequence.
insertions: list of insertions in order of sequence.

Return type:

namedtuple

Example

>>> s = 'A1T del5to7 G9C ins6len2 del12to12'
>>> process_mut_str(s)  
Mutations(substitutions=['A1T', 'G9C'],
          deletions=['del5to7', 'del12to12'],
          insertions=['ins6len2'])
>>> s = 'A11T del5to7 G-9C ins-6GA del12to15 ins13AAT'
>>> process_mut_str(s)  
Mutations(substitutions=['G-9C', 'A11T'],
          deletions=['del5to7', 'del12to15'],
          insertions=['ins-6GA', 'ins13AAT'])

alignparse.consensus.simple_mutconsensus(df, *, group_cols=('library', 'barcode'), mutation_col='mutations', max_sub_diffs=1, max_indel_diffs=2, max_minor_sub_frac=0.1, max_minor_indel_frac=0.25, max_minor_greater_or_equal=False, min_support=1, support_col='variant_call_support')[source]¶

Get simple consensus of mutations with group (i.e., barcode).

Parameters:

df (pandas.DataFrame) – Each row gives data on mutations for a different sequence.
group_cols (list, tuple, or str) – Group all sequences that share values in these column(s) of df.
mutation_col (str) – Column in df with mutations in form that can be processed by process_mut_str().
max_sub_diffs (int or None) – Drop groups where any variant differs from all other variants by more than this many substitution (point mutation) differences. Set to None if no limit.
max_indel_diffs (int or None) – Drop groups where any variant differs from all other variants by more than this many indel differences. Set to None if no limit.
max_minor_sub_frac (float) – Drop groups with a minor (non-consensus) substitution in > the ceiling of this fraction times the number of sequences in group.
max_minor_indel_frac (float) – Drop groups with a minor (non-consensus) indel in > the ceiling of this fraction times the number of sequences in group.
max_minor_greater_or_equal (bool) – For max_minor_sub_frac and max_minor_indel_frac, use >= rather than >. This is may help if you want to be conservative in avoiding bad consensuses. But don’t reject if zero differences.
min_support (int) – Require at least this many supporting sequences to call consensus.
support_col (str) – Name of column in returned consensus data frame with number of sequences supporting the consensus call.

Note

The rationale behind this consensus calling scheme is that we want to build a consensus except in the following two cases, each of which indicate a likely problem beyond simple rare sequencing errors (such as strand exchange or barcode collisions):

A sequence within the group differs by too much from others in group (given by max_sub_diffs and max_indel_diffs).

There are “minor” (non-consensus) mutations present at too high frequency (given by max_minor_indel_frac and max_minor_sub_frac).

Returns:

The returned tuple (consensus, dropped) consists of two pandas data frames:

consensus: Gives consensus for all groups for which this can be constructed. Columns are group_cols, column with name given by mutation_col that gives consensus mutations, and a column with name given by support_col that gives the number of variants used to generate the consensus.

dropped: Gives each dropped group, the reason it was dropped, and the number of variants (sequences) in that group.

Return type:

tuple

Example

Create a dummy CSV file with our variants and mutations:

>>> f = io.StringIO()
>>> _ = f.write(textwrap.dedent('''
...        library,barcode,mutations
...        lib_1,AG,A2C del5to7
...        lib_1,AG,A2C
...        lib_1,TA,G3A ins4len3
...        lib_1,TA,G3A ins4len3
...        lib_1,TA,G3A ins4len3
...        lib_2,TA,C5A T-6C
...        lib_2,TA,ins5len1 T-6C
...        lib_2,TA,T-6C
...        lib_2,TA,A4G T-6C
...        lib_2,TG,T6A
...        lib_2,TG,A2G
...        lib_2,GG,del1to4
...        lib_2,GG,A1C
...        lib_2,AA,
...        lib_2,AA,
...        lib_2,AA,T6C
...        lib_2,AA,T6C
...        lib_3,AA,ins1len1 del1to2 T6G
...        lib_3,AA,del5to7 T6G
...        '''))
>>> _ = f.seek(0)

Read into data frame. Note use of na_filter=False so empty strings are read as such rather than as na values:

>>> df = pd.read_csv(f, na_filter=False)

Get the consensus mutations, and the dropped libraries / barcodes:

>>> consensus, dropped = simple_mutconsensus(df)
>>> consensus
  library barcode     mutations  variant_call_support
0   lib_1      AG           A2C                     2
1   lib_1      TA  G3A ins4len3                     3
2   lib_2      GG                                   2
3   lib_2      TA          T-6C                     4
>>> dropped
  library barcode              drop_reason  nseqs
0   lib_2      AA  minor subs too frequent      4
1   lib_2      TG      subs diff too large      2
2   lib_3      AA    indels diff too large      2

Get the consensus just for library 1:

>>> lib1_consensus, _ = simple_mutconsensus(df.query('library == "lib_1"'),
...                                         group_cols='barcode')
>>> lib1_consensus
  barcode     mutations  variant_call_support
0      AG           A2C                     2
1      TA  G3A ins4len3                     3

Set max_sub_diffs to None:

>>> consensus, dropped = simple_mutconsensus(df, max_sub_diffs=None)
>>> consensus
  library barcode     mutations  variant_call_support
0   lib_1      AG           A2C                     2
1   lib_1      TA  G3A ins4len3                     3
2   lib_2      GG                                   2
3   lib_2      TA          T-6C                     4
4   lib_2      TG                                   2
>>> dropped
  library barcode              drop_reason  nseqs
0   lib_2      AA  minor subs too frequent      4
1   lib_3      AA    indels diff too large      2

Set max_minor_greater_or_equal to True:

>>> consensus, dropped = simple_mutconsensus(
...         df, max_minor_greater_or_equal=True)
>>> consensus
  library barcode     mutations  variant_call_support
0   lib_1      TA  G3A ins4len3                     3
>>> dropped
  library barcode                drop_reason  nseqs
0   lib_1      AG  minor indels too frequent      2
1   lib_2      AA    minor subs too frequent      4
2   lib_2      GG    minor subs too frequent      2
3   lib_2      TA    minor subs too frequent      4
4   lib_2      TG        subs diff too large      2
5   lib_3      AA      indels diff too large      2

Use min_support:

>>> consensus, dropped = simple_mutconsensus(df, min_support=3)
>>> consensus
  library barcode     mutations  variant_call_support
0   lib_1      TA  G3A ins4len3                     3
1   lib_2      TA          T-6C                     4
>>> dropped
  library barcode              drop_reason  nseqs
0   lib_1      AG        too few sequences      2
1   lib_2      AA  minor subs too frequent      4
2   lib_2      GG        too few sequences      2
3   lib_2      TG        too few sequences      2
4   lib_3      AA        too few sequences      2

consensus¶

alignparse

Navigation

Related Topics