utils

Miscellaneous utility functions.

dms_variants.utils.cumul_rows_by_count(df, *, count_col='count', n_col='n_rows', tot_col='total_rows', group_cols=None, group_cols_as_str=False)[source]

Cumulative number of rows with >= each count.

Parameters:
  • df (pandas.DataFrame) – Data frame with rows to analyze.

  • count_col (str) – Column in df with count for row.

  • n_col (str) – Name of column in result giving cumulative number of rows.

  • tot_col (str) – Name of column in result giving total number of rows.

  • group_cols (None or list) – Group by these columns and analyze each group separately.

  • group_cols_as_str (bool) – Convert any group_cols columns to str. This is needed if calling in R using reticulate.

Returns:

Cumulative counts. For each value of count_col, column n_col gives the number of rows with a count >= that value, and tot_col gives the total number of rows.

Return type:

pandas.DataFrame

Examples

>>> df = pd.DataFrame({'sample': ['a', 'a', 'b', 'b', 'a', 'a'],
...                    'count': [9, 0, 1, 4, 3, 3]})
>>> cumul_rows_by_count(df)
   count  n_rows  total_rows
0      9       1           6
1      4       2           6
2      3       4           6
3      1       5           6
4      0       6           6
>>> cumul_rows_by_count(df, group_cols=['sample'])
  sample  count  n_rows  total_rows
0      a      9       1           4
1      a      3       3           4
2      a      0       4           4
3      b      4       1           2
4      b      1       2           2
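The cumulative tally can be reproduced with plain pandas operations; a rough sketch (not necessarily the package's implementation) for the ungrouped case:

```python
import pandas as pd

df = pd.DataFrame({'count': [9, 0, 1, 4, 3, 3]})

# Tally rows per count, then cumulate from the largest count down,
# so each row gives the number of rows with >= that count.
out = (df.groupby('count').size()
         .sort_index(ascending=False)
         .cumsum()
         .rename('n_rows')
         .reset_index()
         .assign(total_rows=len(df)))
```

This yields the same count / n_rows / total_rows table as the first doctest above.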
dms_variants.utils.integer_breaks(x)[source]

Integer breaks for axes labels.

Note

The breaks can be passed to plotnine as in:

scale_x_continuous(breaks=integer_breaks)
Parameters:

x (array-like) – Numerical data values.

Returns:

Integer tick locations.

Return type:

numpy.ndarray

Example

>>> integer_breaks([0.5, 0.7, 1.2, 3.7, 7, 17])
array([ 0.,  2.,  4.,  6.,  8., 10., 12., 14., 16., 18.])
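The idea behind the breaks can be illustrated with a toy sketch that uses a fixed step (the real function chooses the spacing automatically; `integer_breaks_sketch` and its `step` parameter are hypothetical):

```python
import numpy as np

def integer_breaks_sketch(x, step=2):
    # Hypothetical helper: integer ticks spanning the data range,
    # using a hard-coded step rather than an adaptive one.
    lo = np.floor(np.min(x))
    hi = np.ceil(np.max(x))
    return np.arange(lo, hi + step, step, dtype=float)

breaks = integer_breaks_sketch([0.5, 0.7, 1.2, 3.7, 7, 17])
```

For this input the sketch reproduces the example output above.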
dms_variants.utils.latex_sci_not(xs)[source]

Convert a list of numbers to LaTeX scientific notation.

Parameters:

xs (list) – Numbers to format.

Returns:

Formatted strings for numbers.

Return type:

list

Examples

>>> latex_sci_not([0, 3, 3120, -0.0000927])
['$0$', '$3$', '$3.1 \\times 10^{3}$', '$-9.3 \\times 10^{-5}$']
>>> latex_sci_not([0.001, 1, 1000, 1e6])
['$0.001$', '$1$', '$10^{3}$', '$10^{6}$']
>>> latex_sci_not([-0.002, 0.003, 0.000011])
['$-0.002$', '$0.003$', '$1.1 \\times 10^{-5}$']
>>> latex_sci_not([-0.1, 0.0, 0.1, 0.2])
['$-0.1$', '$0$', '$0.1$', '$0.2$']
>>> latex_sci_not([0, 1, 2])
['$0$', '$1$', '$2$']
dms_variants.utils.reverse_complement(s, *, use_cutils=True)[source]

Get reverse complement of DNA sequence.

Parameters:
  • s (str) – DNA sequence.

  • use_cutils (bool) – Use faster C-extension implementation.

Returns:

Reverse complement of s.

Return type:

str

Example

>>> s = 'ATGCAAN'
>>> reverse_complement(s)
'NTTGCAT'
>>> reverse_complement(s, use_cutils=False) == reverse_complement(s)
True
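A minimal pure-Python sketch of the operation (not the package's C-extension implementation, and covering only the unambiguous nucleotides plus N):

```python
# Complement table for A, T, C, G, and N.
_COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}

def rev_comp(s):
    """Reverse complement a DNA sequence (sketch)."""
    return ''.join(_COMPLEMENT[nt] for nt in reversed(s))
```

For the example above, `rev_comp('ATGCAAN')` gives `'NTTGCAT'`.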
dms_variants.utils.scores_to_prefs(df, mutation_col, score_col, base, wt_score=0, missing='average', alphabet=('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'), exclude_chars=('*',), returnformat='wide', stringency_param=1)[source]

Convert functional scores to amino-acid preferences.

Preferences are calculated from functional scores as follows. Let \(y_{r,a}\) be the score of the variant with the single mutation of site \(r\) to \(a\) (when \(a\) is the wildtype character, \(y_{r,a}\) is the score of the wildtype sequence). Then the preference \(\pi_{r,a}\) is

\[\pi_{r,a} = \frac{b^{y_{r,a}}}{\sum_{a'} b^{y_{r,a'}}}\]

where \(b\) is the base for the exponent. This definition ensures that the preferences sum to one at each site. These preferences can be displayed in logo plots or used as input to phydms.
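The formula can be applied directly; a short numpy sketch reproducing site 1 of the first example below (wildtype M at wt_score = 0, mutants M1A at -0.1 and M1C at -2.3, base = 2):

```python
import numpy as np

base = 2.0
y = np.array([0.0, -0.1, -2.3])  # scores for M, A, C at site 1

# pi_{r,a} = b**y_{r,a} / sum_a' b**y_{r,a'}; sums to one at the site
pi = base**y / (base**y).sum()
pi.round(2)  # M ~ 0.47, A ~ 0.44, C ~ 0.10
```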

Note

The “flatness” of the preferences is determined by the exponent base: a smaller base yields flatter preferences. There is no obvious “best” base, as different values correspond to different linear scalings of the scores. A recommended approach is simply to choose a value of base (such as 10) and then re-scale the preferences by using phydms to optimize a stringency parameter. Note that phydms has an upper bound on the largest stringency parameter it can fit, so if you are hitting this upper bound, pre-scale the preferences to be less flat by using a larger value of base.

Parameters:
  • df (pandas.DataFrame) – Data frame holding the functional scores.

  • mutation_col (str) – Column in df with mutations, in this format: ‘M1A’.

  • score_col (str) – Column in df with functional scores.

  • base (float) –

    Base to which the exponent is taken in computing the preferences. Make sure not to choose an excessively small value if using the preferences in phydms, or they will be too flat. In the examples below we use 2, but you may want a larger value.

  • wt_score (float) – Functional score for wildtype sequence.

  • missing ({'average', 'site_average', 'error'}) – What to do when there is no estimate of the score for a mutant: estimate the phenotype as the average of all single mutants ('average'), as the average of all single mutants at that site ('site_average'), or raise an error ('error').

  • alphabet (list or tuple) – Characters (e.g., amino acids) for which we compute preferences.

  • exclude_chars (tuple or list) – Characters to exclude when calculating preferences (and when averaging values for missing mutants). For instance, you might want to exclude stop codons even if they are in df.

  • returnformat ({'tidy', 'wide'}) – Return preferences in tidy or wide format data frame.

  • stringency_param (float) – Re-scale preferences by this stringency parameter: each preference is raised to the power of stringency_param and the preferences are then re-normalized. A similar effect can be achieved by changing base.

Returns:

Data frame where first column is named ‘site’, other columns are named for each character, and rows give preferences for each site.

Return type:

pandas.DataFrame

Example

>>> func_scores_df = pd.DataFrame(
...         {'aa_substitutions': ['M1A', 'M1C', 'A2M', 'A2C', 'M1*'],
...          'func_score':       [-0.1,  -2.3,   0.8,  -1.2,  -3.0,]})
>>> (scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                  alphabet=['M', 'A', 'C'], exclude_chars=['*'])
...  ).round(2)
   site     M     A     C
0     1  0.47  0.44  0.10
1     2  0.55  0.31  0.14
>>> (scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                  alphabet=['M', 'A', 'C', '*'], exclude_chars=[])
...  ).round(2)
   site     M     A     C     *
0     1  0.44  0.41  0.09  0.06
1     2  0.48  0.28  0.12  0.12
>>> (scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                  alphabet=['M', 'A', 'C', '*'], exclude_chars=[],
...                  missing='site_average')
...  ).round(2)
   site     M     A     C     *
0     1  0.44  0.41  0.09  0.06
1     2  0.43  0.25  0.11  0.22
>>> scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                 alphabet=['M', 'A', 'C', '*'], exclude_chars=[],
...                 missing='error')
Traceback (most recent call last):
    ...
ValueError: missing functional scores for some mutations
>>> (scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                  alphabet=['M', 'A', 'C'], exclude_chars=['*'],
...                  returnformat='tidy')
...  ).round(2)
  wildtype  site mutant  preference
0        M     1      C        0.10
1        A     2      C        0.14
2        A     2      A        0.31
3        M     1      A        0.44
4        M     1      M        0.47
5        A     2      M        0.55
>>> (scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                  alphabet=['M', 'A', 'C'], exclude_chars=['*'],
...                  stringency_param=3)
...  ).round(2)
   site     M     A     C
0     1  0.55  0.45  0.00
1     2  0.83  0.16  0.01
>>> (scores_to_prefs(func_scores_df, 'aa_substitutions', 'func_score', 2,
...                  alphabet=['M', 'A', 'C', '*'], exclude_chars=[],
...                  returnformat='tidy')
...  ).round(2)
  wildtype  site mutant  preference
0        M     1      *        0.06
1        M     1      C        0.09
2        A     2      C        0.12
3        A     2      *        0.12
4        A     2      A        0.28
5        M     1      A        0.41
6        M     1      M        0.44
7        A     2      M        0.48
dms_variants.utils.single_nt_accessible(codon, aa, codon_encode_aa='raise')[source]

Is amino acid accessible from codon by single-nucleotide change?

Parameters:
  • codon (str) – The codon.

  • aa (str) – The amino acid.

  • codon_encode_aa ({'raise', 'true', 'false'}) – If codon encodes aa, raise an error, return True, or return False.

Return type:

bool

Example

>>> single_nt_accessible('GGG', 'E')
True
>>> single_nt_accessible('GGC', 'E')
False
>>> single_nt_accessible('GGG', 'G')
Traceback (most recent call last):
  ...
ValueError: `codon` GGG already encodes `aa` G (see `codon_encode_aa`)
>>> single_nt_accessible('GGG', 'G', codon_encode_aa='true')
True
>>> single_nt_accessible('TTT', 'L')
True
dms_variants.utils.tidy_split(df, column, sep=' ', keep=False)[source]

Split values of column and expand into new rows.

Parameters:
  • df (pandas.DataFrame) – Data frame with the column to split and expand.

  • column (str) – Name of column to split and expand.

  • sep (str) – The string used to split the column’s values.

  • keep (bool) – Retain the pre-split value as its own row.

Returns:

Data frame with the same columns as df. Rows with null values in column are dropped.

Return type:

pandas.DataFrame

Example

>>> df = pd.DataFrame({'col1': ['A', 'B', 'C'],
...                    'col2': ['d e', float('nan'), 'f']})
>>> tidy_split(df, 'col2')
  col1 col2
0    A    d
0    A    e
2    C    f
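For reference, the same result can be obtained with pandas built-ins (a sketch, not necessarily the package's implementation):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C'],
                   'col2': ['d e', float('nan'), 'f']})

# Drop rows with a null split column, split on the separator,
# then expand the resulting lists into one row per element.
out = (df.dropna(subset=['col2'])
         .assign(col2=lambda d: d['col2'].str.split(' '))
         .explode('col2'))
```

As in the example above, the expanded rows keep the index of the row they came from.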
dms_variants.utils.tidy_to_corr(df, sample_col, label_col, value_col, *, group_cols=None, return_type='tidy_pairs', method='pearson')[source]

Pairwise correlations between samples in tidy data frame.

Parameters:
  • df (pandas.DataFrame) – Tidy data frame.

  • sample_col (str) – Column in df with name of sample.

  • label_col (str) – Column in df with labels for variable to correlate.

  • value_col (str) – Column in df with values to correlate.

  • group_cols (None, str, or list) – Additional columns used to group results.

  • return_type ({'tidy_pairs', 'matrix'}) – Return results as a tidy data frame of pairwise correlations or as a correlation matrix.

  • method (str) – A correlation method passable to pandas.DataFrame.corr.

Returns:

Holds pairwise correlations in format specified by return_type. Correlations only calculated among values with shared label among samples.

Return type:

pandas.DataFrame

Example

Define data frame with data to correlate:

>>> df = pd.DataFrame({
...        'sample': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
...        'barcode': ['A', 'C', 'G', 'G', 'A', 'C', 'T', 'G', 'C', 'A'],
...        'score': [1, 2, 3, 3, 1.5, 2, 4, 1, 2, 3],
...        'group': ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'y', 'y', 'y'],
...        })

Pairwise correlations between all samples ignoring group:

>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score')
  sample_1 sample_2  correlation
0        a        a     1.000000
1        b        a     0.981981
2        c        a    -1.000000
3        a        b     0.981981
4        b        b     1.000000
5        c        b    -0.981981
6        a        c    -1.000000
7        b        c    -0.981981
8        c        c     1.000000

The same but as a matrix rather than in tidy format:

>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score', return_type='matrix')
  sample         a         b         c
0      a  1.000000  0.981981 -1.000000
1      b  0.981981  1.000000 -0.981981
2      c -1.000000 -0.981981  1.000000
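The ungrouped matrix is essentially what pandas computes if the tidy data are first pivoted to wide form, one column per sample (a sketch of the idea, not necessarily the package's implementation):

```python
import pandas as pd

df = pd.DataFrame({
    'sample': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
    'barcode': ['A', 'C', 'G', 'G', 'A', 'C', 'T', 'G', 'C', 'A'],
    'score': [1, 2, 3, 3, 1.5, 2, 4, 1, 2, 3],
})

# One row per barcode, one column per sample; labels missing in a
# sample become NaN and are dropped pairwise by .corr().
wide = df.pivot(index='barcode', columns='sample', values='score')
corr = wide.corr(method='pearson')
```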

Now group before computing correlations:

>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score', group_cols='group')
  group sample_1 sample_2  correlation
0     x        a        a     1.000000
1     x        b        a     0.981981
2     x        a        b     0.981981
3     x        b        b     1.000000
4     y        c        c     1.000000
>>> tidy_to_corr(df, sample_col='sample', label_col='barcode',
...              value_col='score', group_cols='group',
...              return_type='matrix')
  group sample         a         b    c
0     x      a  1.000000  0.981981  NaN
1     x      b  0.981981  1.000000  NaN
2     y      c       NaN       NaN  1.0
dms_variants.utils.translate(codonseq)[source]

Translate codon sequence.

Parameters:

codonseq (str) – Codon sequence. Gaps currently not allowed.

Returns:

Amino-acid sequence.

Return type:

str

Example

>>> translate('ATGGGATAA')
'MG*'
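A toy sketch of the translation step, with a deliberately tiny codon table covering only the codons in the example (a real implementation uses the full 64-codon standard genetic code; `toy_translate` is hypothetical):

```python
# Toy codon table: only the three codons in the example above.
CODON_TO_AA = {'ATG': 'M', 'GGA': 'G', 'TAA': '*'}

def toy_translate(codonseq):
    """Translate a codon sequence (length must be a multiple of 3)."""
    if len(codonseq) % 3 != 0:
        raise ValueError('sequence length not a multiple of 3')
    return ''.join(CODON_TO_AA[codonseq[i:i + 3]]
                   for i in range(0, len(codonseq), 3))
```

For the example above, `toy_translate('ATGGGATAA')` gives `'MG*'`.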