utils
Miscellaneous utility functions.
- dms_variants.utils.cumul_rows_by_count(df, *, count_col='count', n_col='n_rows', tot_col='total_rows', group_cols=None, group_cols_as_str=False)
Cumulative number of rows with >= each count.
- Parameters:
df (pandas.DataFrame) – Data frame with rows to analyze.
count_col (str) – Column in df with count for row.
n_col (str) – Name of column in result giving the cumulative number of rows with at least that count.
tot_col (str) – Name of column in result giving total number of rows.
group_cols (None or list) – Group by these columns and analyze each group separately.
group_cols_as_str (bool) – Convert any group_cols columns to str. This is needed if calling in R using reticulate.
- Returns:
Gives cumulative counts. For each count in count_col, column n_col gives the number of rows with >= that many counts, and tot_col gives the total number of rows.
- Return type:
pandas.DataFrame
Examples
>>> df = pd.DataFrame({'sample': ['a', 'a', 'b', 'b', 'a', 'a'],
...                    'count': [9, 0, 1, 4, 3, 3]})
>>> cumul_rows_by_count(df)
   count  n_rows  total_rows
0      9       1           6
1      4       2           6
2      3       4           6
3      1       5           6
4      0       6           6
>>> cumul_rows_by_count(df, group_cols=['sample'])
  sample  count  n_rows  total_rows
0      a      9       1           4
1      a      3       3           4
2      a      0       4           4
3      b      4       1           2
4      b      1       2           2
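For intuition, the ungrouped computation can be sketched directly in pandas (a re-implementation for illustration only, not the package's code):

import pandas as pd

df = pd.DataFrame({'count': [9, 0, 1, 4, 3, 3]})
thresholds = sorted(df['count'].unique(), reverse=True)
cumul = pd.DataFrame({
    'count': thresholds,
    'n_rows': [(df['count'] >= t).sum() for t in thresholds],  # rows with >= t counts
    'total_rows': len(df),
})
# reproduces the ungrouped example above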
- dms_variants.utils.integer_breaks(x)
Integer breaks for axes labels.
- Parameters:
x (array-like) – Numerical data values.
- Returns:
Integer tick locations.
- Return type:
numpy.ndarray
Example
>>> integer_breaks([0.5, 0.7, 1.2, 3.7, 7, 17])
array([ 0.,  2.,  4.,  6.,  8., 10., 12., 14., 16., 18.])
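The output appears consistent with what matplotlib's MaxNLocator produces with integer=True, so a plausible equivalent looks like the sketch below (an assumption about the implementation, not a quote of it):

import matplotlib.ticker

def integer_breaks_sketch(x):
    # integer-valued tick locations spanning the range of the data
    locator = matplotlib.ticker.MaxNLocator(integer=True)
    return locator.tick_values(min(x), max(x))

integer_breaks_sketch([0.5, 0.7, 1.2, 3.7, 7, 17])
# expected to agree with the example above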
- dms_variants.utils.latex_sci_not(xs)
Convert a list of numbers to LaTeX scientific notation.
- Parameters:
xs (list) – Numbers to format.
- Returns:
Formatted strings for numbers.
- Return type:
list
Examples
>>> latex_sci_not([0, 3, 3120, -0.0000927])
['$0$', '$3$', '$3.1 \\times 10^{3}$', '$-9.3 \\times 10^{-5}$']

>>> latex_sci_not([0.001, 1, 1000, 1e6])
['$0.001$', '$1$', '$10^{3}$', '$10^{6}$']

>>> latex_sci_not([-0.002, 0.003, 0.000011])
['$-0.002$', '$0.003$', '$1.1 \\times 10^{-5}$']

>>> latex_sci_not([-0.1, 0.0, 0.1, 0.2])
['$-0.1$', '$0$', '$0.1$', '$0.2$']

>>> latex_sci_not([0, 1, 2])
['$0$', '$1$', '$2$']
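A typical use is labeling plot axes; here is a hypothetical sketch with matplotlib (the tick positions are made up for illustration):

import matplotlib.pyplot as plt
from dms_variants.utils import latex_sci_not

ticks = [1e-5, 1e-3, 0.1, 10]  # hypothetical tick positions
fig, ax = plt.subplots()
ax.set_xscale('log')
ax.set_xticks(ticks)
ax.set_xticklabels(latex_sci_not(ticks))  # LaTeX-formatted labels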
- dms_variants.utils.reverse_complement(s, *, use_cutils=True)
Get reverse complement of DNA sequence.
- Parameters:
s (str) – DNA sequence.
use_cutils (bool) – Use faster C-extension implementation.
- Returns:
Reverse complement of s.
- Return type:
str
Example
>>> s = 'ATGCAAN'
>>> reverse_complement(s)
'NTTGCAT'
>>> reverse_complement(s, use_cutils=False) == reverse_complement(s)
True
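The underlying operation is simple enough to sketch in pure Python: complement each base, then reverse the string (for illustration; the package's C extension is faster):

COMPLEMENT = str.maketrans('ATCGN', 'TAGCN')

def rev_comp(s):
    # complement each nucleotide, then reverse
    return s.translate(COMPLEMENT)[::-1]

assert rev_comp('ATGCAAN') == 'NTTGCAT'  # matches the example above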
- dms_variants.utils.scores_to_prefs(df, mutation_col, score_col, base, wt_score=0, missing='average', alphabet=('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'), exclude_chars=('*',), returnformat='wide', stringency_param=1)
Convert functional scores to amino-acid preferences.
Preferences are calculated from functional scores as follows. Let \(y_{r,a}\) be the score of the variant with the single mutation of site \(r\) to \(a\) (when \(a\) is the wildtype character, then \(y_{r,a}\) is the score of the wildtype sequence). Then the preference \(\pi_{r,a}\) is
\[\pi_{r,a} = \frac{b^{y_{r,a}}}{\sum_{a'} b^{y_{r,a'}}}\]
where \(b\) is the base for the exponent. This definition ensures that the preferences sum to one at each site. These preferences can be displayed in logo plots or used as input to phydms.
Note
The “flatness” of the preferences is determined by the exponent base. A smaller base yields flatter preferences. There is no obvious “best” base, as different values correspond to different linear scalings of the scores. A recommended approach is simply to choose a value of base (such as 10) and then re-scale the preferences by using phydms to optimize a stringency parameter, as described in the phydms documentation. Note that phydms has an upper bound on the largest stringency parameter it can fit, so if you are hitting this upper bound, pre-scale the preferences to be less flat by using a larger value of base.
- Parameters:
df (pandas.DataFrame) – Data frame holding the functional scores.
mutation_col (str) – Column in df with mutations, in this format: ‘M1A’.
score_col (str) – Column in df with functional scores.
base (float) – Base to which the exponent is taken in computing the preferences. Make sure not to choose an excessively small value if using in phydms or the preferences will be too flat. In the examples below we use 2, but you may want a larger value.
wt_score (float) – Functional score for wildtype sequence.
missing ({'average', 'site_average', 'error'}) – How to handle mutants with no estimated score: estimate the phenotype as the average of all single mutants, as the average of all single mutants at that site, or raise an error.
alphabet (list or tuple) – Characters (e.g., amino acids) for which we compute preferences.
exclude_chars (tuple or list) – Characters to exclude when calculating preferences (and when averaging values for missing mutants). For instance, you might want to exclude stop codons even if they are in df.
returnformat ({'tidy', 'wide'}) – Return preferences in tidy or wide format data frame.
stringency_param (float) – Re-scale preferences by this stringency parameter. This involves raising each preference to the power of stringency_param and then re-normalizing. A similar effect can be achieved by changing base.
- Returns:
For returnformat of 'wide', a data frame where the first column is named 'site', the other columns are named for each character, and rows give the preferences at each site. For 'tidy', columns are 'wildtype', 'site', 'mutant', and 'preference'.
- Return type:
pandas.DataFrame
Example
>>> func_scores_df = pd.DataFrame(
...     {'aa_substitutions': ['M1A', 'M1C', 'A2M', 'A2C', 'M1*'],
...      'func_score': [-0.1, -2.3, 0.8, -1.2, -3.0]})

>>> (scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                  alphabet=['M', 'A', 'C'], exclude_chars=['*'])
...  ).round(2)
   site     M     A     C
0     1  0.47  0.44  0.10
1     2  0.55  0.31  0.14

>>> (scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                  alphabet=['M', 'A', 'C', '*'], exclude_chars=[])
...  ).round(2)
   site     M     A     C     *
0     1  0.44  0.41  0.09  0.06
1     2  0.48  0.28  0.12  0.12

>>> (scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                  alphabet=['M', 'A', 'C', '*'], exclude_chars=[],
...                  missing='site_average')
...  ).round(2)
   site     M     A     C     *
0     1  0.44  0.41  0.09  0.06
1     2  0.43  0.25  0.11  0.22

>>> scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                 alphabet=['M', 'A', 'C', '*'], exclude_chars=[],
...                 missing='error')
Traceback (most recent call last):
  ...
ValueError: missing functional scores for some mutations

>>> (scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                  alphabet=['M', 'A', 'C'], exclude_chars=['*'],
...                  returnformat='tidy')
...  ).round(2)
  wildtype  site mutant  preference
0        M     1      C        0.10
1        A     2      C        0.14
2        A     2      A        0.31
3        M     1      A        0.44
4        M     1      M        0.47
5        A     2      M        0.55

>>> (scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                  alphabet=['M', 'A', 'C'], exclude_chars=['*'],
...                  stringency_param=3)
...  ).round(2)
   site     M     A     C
0     1  0.55  0.45  0.00
1     2  0.83  0.16  0.01

>>> (scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                  alphabet=['M', 'A', 'C', '*'], exclude_chars=[],
...                  returnformat='tidy')
...  ).round(2)
  wildtype  site mutant  preference
0        M     1      *        0.06
1        M     1      C        0.09
2        A     2      C        0.12
3        A     2      *        0.12
4        A     2      A        0.28
5        M     1      A        0.41
6        M     1      M        0.44
7        A     2      M        0.48
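As a sanity check on the formula above, the site-1 preferences from the first example can be computed by hand: M is wildtype (so its score is wt_score = 0), M1A scores -0.1, M1C scores -2.3, and the base \(b\) is 2.

weights = {'M': 2 ** 0.0, 'A': 2 ** -0.1, 'C': 2 ** -2.3}  # b ** y for each character
total = sum(weights.values())
{a: round(w / total, 2) for a, w in weights.items()}
# {'M': 0.47, 'A': 0.44, 'C': 0.1} -- matches the site-1 row of the first example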
- dms_variants.utils.single_nt_accessible(codon, aa, codon_encode_aa='raise')
Is amino acid accessible from codon by single-nucleotide change?
- Parameters:
codon (str) – The codon.
aa (str) – The amino acid.
codon_encode_aa ({'raise', 'true', 'false'}) – What to do if codon already encodes aa: raise an error, return True, or return False.
- Return type:
bool
Example
>>> single_nt_accessible('GGG', 'E')
True
>>> single_nt_accessible('GGC', 'E')
False
>>> single_nt_accessible('GGG', 'G')
Traceback (most recent call last):
  ...
ValueError: `codon` GGG already encodes `aa` G (see `codon_encode_aa`)
>>> single_nt_accessible('GGG', 'G', codon_encode_aa='true')
True
>>> single_nt_accessible('TTT', 'L')
True
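The question can be answered by enumerating the nine single-nucleotide neighbors of the codon and translating each; below is a sketch using Biopython for the codon table (an assumption for illustration; the package does not necessarily use Biopython):

from Bio.Seq import Seq  # assumes Biopython is available for translation

def accessible(codon, aa):
    # check all codons one nucleotide change away from `codon`
    for i, nt in enumerate(codon):
        for alt in 'ACGT':
            if alt == nt:
                continue
            neighbor = codon[:i] + alt + codon[i + 1:]
            if str(Seq(neighbor).translate()) == aa:
                return True
    return False

assert accessible('GGG', 'E')      # GGG -> GAG encodes Glu
assert not accessible('GGC', 'E')  # would need two changes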
- dms_variants.utils.tidy_split(df, column, sep=' ', keep=False)
Split values of column and expand into new rows.
Note
Taken from https://stackoverflow.com/a/39946744
- Parameters:
df (pandas.DataFrame) – Data frame with the column to split and expand.
column (str) – Name of column to split and expand.
sep (str) – The string used to split the column’s values.
keep (bool) – Retain the pre-split value as its own row.
- Returns:
Data frame with the same columns as df. Rows with null values in column are filtered out.
- Return type:
pandas.DataFrame
Example
>>> df = pd.DataFrame({'col1': ['A', 'B', 'C'],
...                    'col2': ['d e', float('nan'), 'f']})
>>> tidy_split(df, 'col2')
  col1 col2
0    A    d
0    A    e
2    C    f
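In pandas 0.25+, a similar result can be obtained with str.split followed by DataFrame.explode (a sketch of the equivalent operation, not the package's code):

exploded = (df.dropna(subset=['col2'])                          # rows lacking col2 are dropped
              .assign(col2=lambda d: d['col2'].str.split(' '))  # split on the separator
              .explode('col2'))                                 # one row per split value
# same rows as tidy_split(df, 'col2') above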
- dms_variants.utils.tidy_to_corr(df, sample_col, label_col, value_col, *, group_cols=None, return_type='tidy_pairs', method='pearson')
Pairwise correlations between samples in tidy data frame.
- Parameters:
df (pandas.DataFrame) – Tidy data frame.
sample_col (str) – Column in df with name of sample.
label_col (str) – Column in df with labels for variable to correlate.
value_col (str) – Column in df with values to correlate.
group_cols (None, str, or list) – Additional columns used to group results.
return_type ({'tidy_pairs', 'matrix'}) – Return results as tidy dataframe of pairwise correlations or correlation matrix.
method (str) – A correlation method that can be passed to pandas.DataFrame.corr.
- Returns:
Holds pairwise correlations in the format specified by return_type. Correlations are only calculated among values with a shared label across samples.
- Return type:
pandas.DataFrame
Example
Define data frame with data to correlate:
>>> df = pd.DataFrame({
...     'sample': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
...     'barcode': ['A', 'C', 'G', 'G', 'A', 'C', 'T', 'G', 'C', 'A'],
...     'score': [1, 2, 3, 3, 1.5, 2, 4, 1, 2, 3],
...     'group': ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'y', 'y', 'y'],
...     })
Pairwise correlations between all samples ignoring group:
>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score')
  sample_1 sample_2  correlation
0        a        a     1.000000
1        b        a     0.981981
2        c        a    -1.000000
3        a        b     0.981981
4        b        b     1.000000
5        c        b    -0.981981
6        a        c    -1.000000
7        b        c    -0.981981
8        c        c     1.000000
The same but as a matrix rather than in tidy format:
>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score', return_type='matrix')
  sample         a         b         c
0      a  1.000000  0.981981 -1.000000
1      b  0.981981  1.000000 -0.981981
2      c -1.000000 -0.981981  1.000000
Now group before computing correlations:
>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score', group_cols='group')
  group sample_1 sample_2  correlation
0     x        a        a     1.000000
1     x        b        a     0.981981
2     x        a        b     0.981981
3     x        b        b     1.000000
4     y        c        c     1.000000

>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score', group_cols='group',
...              return_type='matrix')
  group sample         a         b    c
0     x      a  1.000000  0.981981  NaN
1     x      b  0.981981  1.000000  NaN
2     y      c       NaN       NaN  1.0
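For the ungrouped case, the same correlation matrix can be reproduced with plain pandas by pivoting to a barcode-by-sample table and calling .corr() (a sketch of the equivalent computation, not the package's code):

matrix = (df.pivot_table(index='barcode', columns='sample', values='score')
            .corr(method='pearson'))
# pairwise Pearson correlations between samples over shared barcodes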