utils

Miscellaneous utility functions.

class polyclonal.utils.MutationParser(alphabet, letter_suffixed_sites=False)

Bases: object

Parse mutation strings like ‘A5G’.

Parameters:
  • alphabet (array-like) – Valid single-character letters in alphabet.

  • letter_suffixed_sites (bool) – Allow sites suffixed by lowercase letters, such as “214a”. In this case, sites returned by MutationParser.parse_mut() are str rather than int.

Example

>>> mutparser = MutationParser(polyclonal.AAS)
>>> mutparser.parse_mut('A5G')
('A', 5, 'G')
>>> mutparser.parse_mut('K7-')
Traceback (most recent call last):
  ...
ValueError: invalid mutation K7-
>>> mutparser_gap = MutationParser(polyclonal.AAS_WITHGAP)
>>> mutparser_gap.parse_mut('K7-')
('K', 7, '-')
>>> mutparser.parse_mut("E214aA")
Traceback (most recent call last):
  ...
ValueError: invalid mutation E214aA
>>> mutparser_letter_suffix = MutationParser(polyclonal.AAS, True)
>>> mutparser_letter_suffix.parse_mut('A5G')
('A', '5', 'G')
>>> mutparser_letter_suffix.parse_mut('E214aA')
('E', '214a', 'A')
>>> mutparser.parse_mut("A-1G")
('A', -1, 'G')
parse_mut(mutation)

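Parse a mutation string and return the tuple (wildtype, site, mutant). The site is an int unless letter_suffixed_sites is True, in which case it is a str.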
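The parsing behavior shown in the examples above can be approximated with a regular expression. The following is a minimal sketch, not polyclonal's actual implementation: the function name parse_mut_sketch and its argument layout are hypothetical, and it assumes every letter in the alphabet is a single character.

```python
import re


def parse_mut_sketch(mutation, alphabet, letter_suffixed_sites=False):
    """Hypothetical stand-in for MutationParser.parse_mut (illustration only)."""
    # Escape each letter so characters such as '-' or '*' are taken literally.
    chars = "".join(re.escape(c) for c in alphabet)
    # A site is an optionally negative integer, optionally followed by one
    # lowercase letter when letter-suffixed sites are allowed.
    site = r"-?\d+[a-z]?" if letter_suffixed_sites else r"-?\d+"
    pattern = rf"(?P<wt>[{chars}])(?P<site>{site})(?P<mut>[{chars}])"
    match = re.fullmatch(pattern, mutation)
    if match is None:
        raise ValueError(f"invalid mutation {mutation}")
    # Sites stay as str when suffixes are allowed; otherwise convert to int.
    site_value = match.group("site")
    if not letter_suffixed_sites:
        site_value = int(site_value)
    return match.group("wt"), site_value, match.group("mut")
```

Using fullmatch rather than match rejects trailing characters, which is why 'E214aA' fails unless letter_suffixed_sites is True.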

polyclonal.utils.shift_mut_site(mut_str, shift)

Shift site in string of mutations.

Parameters:
  • mut_str (str) – String of space-delimited amino-acid substitution mutations.

  • shift (int) – Amount to shift sites (add this to current site number).

Returns:

Mutation string with sites shifted.

Return type:

str

Example

>>> shift_mut_site('A1G K7A', 2)
'A3G K9A'
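To make the shifting rule concrete, here is a minimal sketch of one way to get the same result. It is not the library's code: shift_mut_site_sketch is a hypothetical name, and the pattern assumes uppercase single-letter identities (plus '*' and '-') with integer sites.

```python
import re


def shift_mut_site_sketch(mut_str, shift):
    """Hypothetical illustration: add `shift` to the site of each mutation."""
    shifted = []
    for mut in mut_str.split():
        # wildtype letter, integer site (possibly negative), mutant letter
        match = re.fullmatch(r"([A-Z*-])(-?\d+)([A-Z*-])", mut)
        if match is None:
            raise ValueError(f"invalid mutation {mut}")
        wildtype, site, mutant = match.groups()
        shifted.append(f"{wildtype}{int(site) + shift}{mutant}")
    return " ".join(shifted)


print(shift_mut_site_sketch("A1G K7A", 2))  # prints: A3G K9A
```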
polyclonal.utils.site_level_variants(df, *, original_alphabet=('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'), wt_char='w', mut_char='m', letter_suffixed_sites=False)

Re-define variants simply in terms of which sites are mutated.

This function is useful if you have a data frame of variants and you want to simplify them from full mutations to just indicating whether sites are mutated.

Parameters:
  • df (pandas.DataFrame) – Must include a column named ‘aa_substitutions’.

  • original_alphabet (array-like) – Valid single-letter characters in the original (mutation-level) alphabet.

  • wt_char (str) – Single letter used to represent wildtype identity at all sites.

  • mut_char (str) – Single letter used to represent mutant identity at all sites.

  • letter_suffixed_sites (bool) – Same meaning as for MutationParser.

Returns:

Copy of df with ‘aa_substitutions’ in site-level encoding.

Return type:

pandas.DataFrame

Example

>>> import pandas as pd
>>> df = pd.DataFrame.from_records(
...         [('AA', 'M1A', 1.0),
...          ('AC', '', 0.0),
...          ('AG', 'M1A C53T', 1.0),
...          ],
...        columns=['barcode', 'aa_substitutions', 'escape'],
...        )
>>> site_level_variants(df)
  barcode aa_substitutions  escape
0      AA              w1m     1.0
1      AC                      0.0
2      AG         w1m w53m     1.0
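The site-level encoding amounts to replacing each mutation's wildtype and mutant letters with wt_char and mut_char while keeping the site. The sketch below illustrates that idea under simplifying assumptions and is not the library's implementation: site_level_sketch is a hypothetical name, and it handles only integer sites and uppercase single-letter identities.

```python
import re

import pandas as pd


def site_level_sketch(df, wt_char="w", mut_char="m"):
    """Hypothetical illustration of site-level re-encoding."""

    def to_site_level(substitutions):
        encoded = []
        for mut in substitutions.split():
            match = re.fullmatch(r"[A-Z](-?\d+)[A-Z]", mut)
            if match is None:
                raise ValueError(f"invalid mutation {mut}")
            # Keep only the site; the identities collapse to wt/mut characters.
            encoded.append(f"{wt_char}{match.group(1)}{mut_char}")
        return " ".join(encoded)

    # assign() returns a new data frame, so the input is left unmodified.
    return df.assign(aa_substitutions=df["aa_substitutions"].map(to_site_level))


variants = pd.DataFrame(
    {
        "barcode": ["AA", "AC", "AG"],
        "aa_substitutions": ["M1A", "", "M1A C53T"],
        "escape": [1.0, 0.0, 1.0],
    }
)
print(site_level_sketch(variants)["aa_substitutions"].tolist())
# prints: ['w1m', '', 'w1m w53m']
```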
polyclonal.utils.tidy_to_corr(df, sample_col, label_col, value_col, *, group_cols=None, return_type='tidy_pairs', method='pearson')

Pairwise correlations between samples in tidy data frame.

Parameters:
  • df (pandas.DataFrame) – Tidy data frame.

  • sample_col (str) – Column in df with name of sample.

  • label_col (str) – Column in df with labels for variable to correlate.

  • value_col (str) – Column in df with values to correlate.

  • group_cols (None, str, or list) – Additional columns used to group results.

  • return_type ({'tidy_pairs', 'matrix'}) – Return results as tidy dataframe of pairwise correlations or correlation matrix.

  • method (str) – A correlation method passable to pandas.DataFrame.corr.

Returns:

Pairwise correlations in the format specified by return_type. Correlations are calculated only among values whose label is shared between the two samples.

Return type:

pandas.DataFrame

Example

Define data frame with data to correlate:

>>> import pandas as pd
>>> df = pd.DataFrame({
...        'sample': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
...        'barcode': ['A', 'C', 'G', 'G', 'A', 'C', 'T', 'G', 'C', 'A'],
...        'score': [1, 2, 3, 3, 1.5, 2, 4, 1, 2, 3],
...        'group': ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'y', 'y', 'y'],
...        })

Pairwise correlations between all samples ignoring group:

>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score')
  sample_1 sample_2  correlation
0        a        a     1.000000
1        b        a     0.981981
2        c        a    -1.000000
3        a        b     0.981981
4        b        b     1.000000
5        c        b    -0.981981
6        a        c    -1.000000
7        b        c    -0.981981
8        c        c     1.000000

The same but as a matrix rather than in tidy format:

>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score', return_type='matrix')
  sample         a         b         c
0      a  1.000000  0.981981 -1.000000
1      b  0.981981  1.000000 -0.981981
2      c -1.000000 -0.981981  1.000000

Now group before computing correlations:

>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score', group_cols='group')
  group sample_1 sample_2  correlation
0     x        a        a     1.000000
1     x        b        a     0.981981
2     x        a        b     0.981981
3     x        b        b     1.000000
4     y        c        c     1.000000
>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score', group_cols='group',
...              return_type='matrix')
  group sample         a         b    c
0     x      a  1.000000  0.981981  NaN
1     x      b  0.981981  1.000000  NaN
2     y      c       NaN       NaN  1.0
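For readers who want to see how the ungrouped correlations can arise, the following sketch pivots the tidy data to one column per sample and calls pandas.DataFrame.corr, which uses only labels present in both samples of each pair. This is an assumed approach for illustration, not necessarily how tidy_to_corr is implemented; corr_sketch is a hypothetical name, and grouping via group_cols is not covered.

```python
import pandas as pd


def corr_sketch(df, sample_col, label_col, value_col, method="pearson"):
    """Hypothetical illustration: correlation matrix and tidy pairs."""
    # One row per label, one column per sample; missing labels become NaN.
    wide = df.pivot(index=label_col, columns=sample_col, values=value_col)
    # corr() ignores NaN pairwise, so only shared labels contribute.
    matrix = wide.corr(method=method)
    # Reshape the square matrix into one row per (sample_1, sample_2) pair.
    tidy = (
        matrix.rename_axis(index="sample_1")
        .reset_index()
        .melt(id_vars="sample_1", var_name="sample_2", value_name="correlation")
    )
    return matrix, tidy


data = pd.DataFrame(
    {
        "sample": ["a", "a", "a", "b", "b", "b", "b", "c", "c", "c"],
        "barcode": ["A", "C", "G", "G", "A", "C", "T", "G", "C", "A"],
        "score": [1, 2, 3, 3, 1.5, 2, 4, 1, 2, 3],
    }
)
matrix, tidy = corr_sketch(data, "sample", "barcode", "score")
print(matrix.round(6))
```

Note that pivot raises an error if a sample has the same label more than once, so duplicated labels would need to be aggregated first.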