syn_selection

Identifies enrichment of synonymous codons.

dms_tools2.syn_selection.syn_selection_by_codon(counts_pre, counts_post, pseudocount=0.5)

Identify sites with selection on synonymous codons.

For each codon at a site (codon_x), runs a two-tailed Fisher's exact test, which returns a P-value reflecting the significance of codon_x enrichment.

After calculating the Fisher P-value, a pseudocount is added for calculation of the odds ratio, which reflects the enrichment of codon_x after selection (relative to other synonymous codons).

  • The pseudocount removes zero counts, which would otherwise give NA or inf odds ratios.

  • Rows where only one codon is represented pre-selection are dropped.
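The calculation described above can be sketched for a single codon as follows. This is a minimal illustration, not the dms_tools2 implementation: `codon_enrichment` is a hypothetical helper, the input dicts are assumed to hold only the synonymous codons for one amino acid at one site, and the exact layout of the contingency table is an assumption. It uses `scipy.stats.fisher_exact` for the P-value on the raw counts, then adds the pseudocount to every codon count before forming the odds ratio.

```python
from scipy.stats import fisher_exact


def codon_enrichment(pre, post, codon, pseudocount=0.5):
    """Hypothetical helper (not part of dms_tools2).

    pre, post: dicts mapping each synonymous codon to its raw count
    before and after selection (assumed to share the same keys).
    Returns (odds_ratio, P) for `codon` versus the other codons.
    """
    # Raw counts of all other synonymous codons.
    other_pre = sum(n for c, n in pre.items() if c != codon)
    other_post = sum(n for c, n in post.items() if c != codon)

    # Two-tailed Fisher's exact test on the raw counts.
    table = [[post[codon], other_post], [pre[codon], other_pre]]
    _, p = fisher_exact(table, alternative="two-sided")

    # Add the pseudocount to each codon count, then compute the odds ratio.
    n_other = len(pre) - 1
    codon_pre = pre[codon] + pseudocount
    codon_post = post[codon] + pseudocount
    other_pre = other_pre + pseudocount * n_other
    other_post = other_post + pseudocount * n_other
    odds_ratio = (codon_post / other_post) / (codon_pre / other_pre)
    return odds_ratio, p


# Isoleucine codon counts at one site, before and after selection.
pre = {"ATT": 5, "ATC": 100, "ATA": 10}
post = {"ATT": 5, "ATC": 50, "ATA": 75}
odds_ratio, p = codon_enrichment(pre, post, "ATA", pseudocount=1)
print(f"odds ratio = {odds_ratio:.4f}, P = {p:.3g}")
```

An odds ratio above 1 means the codon gained ground on its synonymous alternatives after selection; a value below 1 means it lost ground.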

Args:
counts_pre (str or pandas.DataFrame)

CSV file giving pre-selection codon counts, with columns named ‘site’ and ‘wildtype’ plus one column per codon. Can also be a pandas DataFrame holding the same data.

counts_post (str or pandas.DataFrame)

Like counts_pre, but giving the post-selection codon counts.

pseudocount (float or int, default 0.5)

Number to add to each codon count before calculating the odds ratio.

Returns:
A pandas DataFrame with the following columns (in the example below, the reported counts include the pseudocount):
  • ‘site’

  • ‘wildtype’ : wildtype codon at site

  • ‘codon’ : codon we are analyzing at site

  • ‘aa’ : amino acid

  • ‘codon_pre’ : counts for codon of interest pre-selection

  • ‘aa_pre’ : counts for all codons for amino acid pre-selection

  • ‘codon_post’ : counts for codon of interest post-selection

  • ‘aa_post’ : counts for all codons for amino acid post-selection

  • ‘odds_ratio’ : enrichment of codon post-selection, relative to the other synonymous codons

  • ‘P’ : P-value calculated using Fisher’s exact test

Example:

>>> import pandas as pd
>>> from dms_tools2.syn_selection import syn_selection_by_codon
>>> pd.set_option('display.max_columns', None)  # display all columns
>>> pd.set_option('expand_frame_repr', False)  # do not break lines
>>> counts_pre = pd.DataFrame.from_records(
...         [(1, 'ATC', 5, 100, 10),
...          (2, 'ATT', 50, 10, 10),
...          ],
...         columns=['site', 'wildtype', 'ATT', 'ATC', 'ATA'],
...         )
>>> counts_post = pd.DataFrame.from_records(
...         [(1, 'ATC', 5, 50, 75),
...          (2, 'ATT', 50, 9, 11),
...          ],
...         columns=['site', 'wildtype', 'ATT', 'ATC', 'ATA'],
...         )
>>> syn_selection_by_codon(counts_pre, counts_post, 1)
   site wildtype codon aa  codon_pre  aa_pre  codon_post  aa_post  odds_ratio             P
0     1      ATC   ATT  I          6     118           6      133    0.881890  1.000000e+00
1     1      ATC   ATC  I        101     118          51      133    0.104685  1.798192e-15
2     1      ATC   ATA  I         11     118          76      133   12.969697  7.500709e-17
3     2      ATT   ATT  I         51      73          51       73    1.000000  1.000000e+00
4     2      ATT   ATC  I         11      73          10       73    0.894661  1.000000e+00
5     2      ATT   ATA  I         11      73          12       73    1.108793  1.000000e+00