utils
Miscellaneous utility functions.
- class polyclonal.utils.MutationParser(alphabet, letter_suffixed_sites=False)
Bases: object
Parse mutation strings like ‘A5G’.
- Parameters:
alphabet (array-like) – Valid single-character letters in alphabet.
letter_suffixed_sites (bool) – Allow sites suffixed by lowercase letters, such as “214a”. In this case, the sites returned by MutationParser.parse_mut() are str rather than int.
Example
>>> mutparser = MutationParser(polyclonal.AAS)
>>> mutparser.parse_mut('A5G')
('A', 5, 'G')

>>> mutparser.parse_mut('K7-')
Traceback (most recent call last):
  ...
ValueError: invalid mutation K7-

>>> mutparser_gap = MutationParser(polyclonal.AAS_WITHGAP)
>>> mutparser_gap.parse_mut('K7-')
('K', 7, '-')

>>> mutparser.parse_mut("E214aA")
Traceback (most recent call last):
  ...
ValueError: invalid mutation E214aA

>>> mutparser_letter_suffix = MutationParser(polyclonal.AAS, True)
>>> mutparser_letter_suffix.parse_mut('A5G')
('A', '5', 'G')
>>> mutparser_letter_suffix.parse_mut('E214aA')
('E', '214a', 'A')

>>> mutparser.parse_mut("A-1G")
('A', -1, 'G')
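The parsing behavior shown in these examples can be approximated with a regular expression. The following is a minimal sketch, not the class's actual implementation: `make_parser` is a hypothetical helper, and `AAS` here is a stand-in assumed to match the 20-letter amino-acid alphabet listed as the default `original_alphabet` elsewhere on this page.

```python
import re

# Stand-in alphabet (assumption): the 20 amino acids.
AAS = tuple("ACDEFGHIKLMNPQRSTVWY")


def make_parser(alphabet, letter_suffixed_sites=False):
    """Return a function that parses strings such as 'A5G' (hypothetical helper)."""
    # Escape each letter so characters such as '-' or '*' are taken literally.
    chars = "".join(re.escape(c) for c in alphabet)
    # Sites may be negative; optionally allow one trailing lowercase letter.
    site_pattern = r"-?\d+[a-z]?" if letter_suffixed_sites else r"-?\d+"
    regex = re.compile(rf"([{chars}])({site_pattern})([{chars}])")

    def parse_mut(mutation):
        match = regex.fullmatch(mutation)
        if match is None:
            raise ValueError(f"invalid mutation {mutation}")
        wt, site, mut = match.groups()
        # Keep sites as str when letter suffixes are allowed, else convert to int.
        return (wt, site if letter_suffixed_sites else int(site), mut)

    return parse_mut


parse = make_parser(AAS)
print(parse("A5G"))  # ('A', 5, 'G')
```

Returning every site as str when `letter_suffixed_sites` is true keeps the return type uniform, which matches the `('A', '5', 'G')` result in the example above.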
- polyclonal.utils.shift_mut_site(mut_str, shift)
Shift site in string of mutations.
- Parameters:
mut_str (str) – String of space-delimited amino-acid substitution mutations.
shift (int) – Amount to shift sites (add this to current site number).
- Returns:
Mutation string with sites shifted.
- Return type:
str
Example
>>> shift_mut_site('A1G K7A', 2)
'A3G K9A'
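For illustration, the shifting logic can be sketched in a few lines. This is an assumption about how such a function might work, not the library's code: `shift_sites` is a hypothetical name, and the pattern assumes uppercase letters, `*`, or `-` as wildtype and mutant characters with integer sites.

```python
import re


def shift_sites(mut_str, shift):
    """Add `shift` to the site number of each space-delimited mutation."""
    shifted = []
    for mutation in mut_str.split():
        # wildtype character, integer site (possibly negative), mutant character
        match = re.fullmatch(r"([A-Z*-])(-?\d+)([A-Z*-])", mutation)
        if match is None:
            raise ValueError(f"invalid mutation {mutation}")
        wt, site, mut = match.groups()
        shifted.append(f"{wt}{int(site) + shift}{mut}")
    return " ".join(shifted)


print(shift_sites("A1G K7A", 2))  # A3G K9A
```

An empty string passes through unchanged, so unmutated variants need no special handling in this sketch.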
- polyclonal.utils.site_level_variants(df, *, original_alphabet=('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'), wt_char='w', mut_char='m', letter_suffixed_sites=False)
Re-define variants simply in terms of which sites are mutated.
This function is useful if you have a data frame of variants and you want to simplify them from full mutations to just indicating whether sites are mutated.
- Parameters:
df (pandas.DataFrame) – Must include a column named ‘aa_substitutions’.
original_alphabet (array-like) – Valid single-letter characters in the original (mutation-level) alphabet.
wt_char (str) – Single letter used to represent wildtype identity at all sites.
mut_char (str) – Single letter used to represent mutant identity at all sites.
letter_suffixed_sites (bool) – Same meaning as for MutationParser.
- Returns:
Copy of df with ‘aa_substitutions’ in site-level encoding.
- Return type:
pandas.DataFrame
Example
>>> import pandas as pd
>>> df = pd.DataFrame.from_records(
...     [('AA', 'M1A', 1.0),
...      ('AC', '', 0.0),
...      ('AG', 'M1A C53T', 1.0),
...      ],
...     columns=['barcode', 'aa_substitutions', 'escape'],
... )
>>> site_level_variants(df)
  barcode aa_substitutions  escape
0      AA              w1m     1.0
1      AC                      0.0
2      AG         w1m w53m     1.0
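The per-string conversion behind this example can be sketched as follows. This is a simplified illustration under stated assumptions, not the function's implementation: `to_site_level` is a hypothetical helper that works on a single substitution string, assumes uppercase letters for the original alphabet, and always accepts letter-suffixed sites.

```python
import re


def to_site_level(aa_substitutions, wt_char="w", mut_char="m"):
    """Replace each mutation's wildtype and mutant letters with wt_char and mut_char."""
    converted = []
    for mutation in aa_substitutions.split():
        # Capture only the site; the specific amino acids are discarded.
        match = re.fullmatch(r"[A-Z](-?\d+[a-z]?)[A-Z]", mutation)
        if match is None:
            raise ValueError(f"invalid mutation {mutation}")
        converted.append(f"{wt_char}{match.group(1)}{mut_char}")
    return " ".join(converted)


print(to_site_level("M1A C53T"))  # w1m w53m
```

With pandas, a helper like this could be applied to the whole column via `df['aa_substitutions'].map(to_site_level)` on a copy of the data frame.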
- polyclonal.utils.tidy_to_corr(df, sample_col, label_col, value_col, *, group_cols=None, return_type='tidy_pairs', method='pearson')
Pairwise correlations between samples in tidy data frame.
- Parameters:
df (pandas.DataFrame) – Tidy data frame.
sample_col (str) – Column in df with name of sample.
label_col (str) – Column in df with labels for variable to correlate.
value_col (str) – Column in df with values to correlate.
group_cols (None, str, or list) – Additional columns used to group results.
return_type ({'tidy_pairs', 'matrix'}) – Return results as tidy dataframe of pairwise correlations or correlation matrix.
method (str) – A correlation method passable to pandas.DataFrame.corr.
- Returns:
Holds pairwise correlations in format specified by return_type. Correlations only calculated among values with shared label among samples.
- Return type:
pandas.DataFrame
Example
Define data frame with data to correlate:
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'sample': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
...     'barcode': ['A', 'C', 'G', 'G', 'A', 'C', 'T', 'G', 'C', 'A'],
...     'score': [1, 2, 3, 3, 1.5, 2, 4, 1, 2, 3],
...     'group': ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'y', 'y', 'y'],
... })
Pairwise correlations between all samples ignoring group:
>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score')
  sample_1 sample_2  correlation
0        a        a     1.000000
1        b        a     0.981981
2        c        a    -1.000000
3        a        b     0.981981
4        b        b     1.000000
5        c        b    -0.981981
6        a        c    -1.000000
7        b        c    -0.981981
8        c        c     1.000000
The same but as a matrix rather than in tidy format:
>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score', return_type='matrix')
  sample         a         b         c
0      a  1.000000  0.981981 -1.000000
1      b  0.981981  1.000000 -0.981981
2      c -1.000000 -0.981981  1.000000
Now group before computing correlations:
>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score', group_cols='group')
  group sample_1 sample_2  correlation
0     x        a        a     1.000000
1     x        b        a     0.981981
2     x        a        b     0.981981
3     x        b        b     1.000000
4     y        c        c     1.000000
>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score', group_cols='group',
...              return_type='matrix')
  group sample         a         b    c
0     x      a  1.000000  0.981981  NaN
1     x      b  0.981981  1.000000  NaN
2     y      c       NaN       NaN  1.0
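To make the "shared label" rule concrete, the ungrouped Pearson case can be sketched in plain Python. This is an illustrative sketch rather than the function's implementation: `pearson` and `pairwise_corr` are hypothetical helpers, and the input is assumed to be an iterable of (sample, label, value) tuples instead of a data frame.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom else float("nan")


def pairwise_corr(records):
    """Correlate every pair of samples over the labels they share."""
    by_sample = {}
    for sample, label, value in records:
        by_sample.setdefault(sample, {})[label] = value
    result = {}
    for s1, s2 in product(by_sample, repeat=2):
        # Labels present in only one of the two samples are ignored.
        shared = sorted(by_sample[s1].keys() & by_sample[s2].keys())
        result[(s1, s2)] = pearson(
            [by_sample[s1][k] for k in shared],
            [by_sample[s2][k] for k in shared],
        )
    return result


records = [
    ("a", "A", 1), ("a", "C", 2), ("a", "G", 3),
    ("b", "G", 3), ("b", "A", 1.5), ("b", "C", 2), ("b", "T", 4),
    ("c", "G", 1), ("c", "C", 2), ("c", "A", 3),
]
corr = pairwise_corr(records)
print(round(corr[("a", "b")], 6))  # 0.981981
```

Barcode 'T' occurs only in sample 'b', so it does not contribute to any pair involving 'b' and another sample, which is why the a-b correlation above is computed from barcodes A, C, and G alone.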