# prefs¶

Performs operations related to estimating site-specific amino-acid preferences.

Uses pystan to perform MCMC for Bayesian inferences.

dms_tools2.prefs.PRIOR_MIN_VALUE = 1e-07

minimum value for Dirichlet prior elements

class dms_tools2.prefs.StanModelDifferentErr(verbose=False)[source]

Bases: object

pystan model when error_model is different.

For use by inferSitePrefs.

class dms_tools2.prefs.StanModelNoneErr(verbose=False)[source]

Bases: object

pystan model when error_model is none.

For use by inferSitePrefs.

class dms_tools2.prefs.StanModelSameErr(verbose=False)[source]

Bases: object

pystan model when error_model is same.

For use by inferSitePrefs.

dms_tools2.prefs.aafreqsFromAlignment(alignmentfile, codon_to_aa, ignore_gaps=True, ignore_stop=True)[source]

Get amino-acid frequencies at each site in alignment.

Args:
alignmentfile (str)

FASTA file with alignment of proteins or coding sequences.

codon_to_aa (bool)

If True, translate codon alignment to amino acids.

ignore_gaps (bool)

Ignore gaps when calculating frequencies.

ignore_stop (bool)

Ignore stop codons when calculating frequencies.

Returns:

A pandas.DataFrame with columns being site (1, 2, … numbering) and other columns being amino acids and values giving frequencies in alignment.

>>> with tempfile.NamedTemporaryFile(mode='w') as f:
...     x = f.write('>seq1\n'
...                 'ATGGGGCAG\n'
...                 '>seq2\n'
...                 '---AGGCAG\n'
...                 '>seq3\n'
...                 'ATGTGACAG')
...     f.flush()
...     aafreqs = aafreqsFromAlignment(f.name, codon_to_aa=True)
>>> aas_counts = ['M', 'G', 'R', 'Q']
>>> aas_nocounts = [a for a in dms_tools2.AAS if a not in aas_counts]
>>> (0 == aafreqs[aas_nocounts].values).all()
True
>>> expected_counts = pandas.DataFrame.from_dict(
...         collections.OrderedDict([
...         ('site', [1, 2, 3]), ('M', [1.0, 0.0, 0.0]),
...         ('G', [0.0, 0.5, 0]), ('R', [0.0, 0.5, 0.0]),
...         ('Q', [0.0, 0.0, 1.0])]))
>>> expected_counts.equals(aafreqs[['site'] + aas_counts])
True

dms_tools2.prefs.avgPrefs(prefsfiles)[source]

Gets average of site-specific preferences.

Args:
prefsfiles (list)

List of CSV files containing preferences, must all be for same sites and characters.

Returns:

A pandas.DataFrame containing the average of the preferences in prefsfiles. In this returned data frame, site is the index

>>> tf1 = tempfile.NamedTemporaryFile
>>> tf2 = tempfile.NamedTemporaryFile
>>> with tf1(mode='w') as file1, tf2(mode='w') as file2:
...     x = file1.write('site,A,C,G,T\n'
...                 '10,0.2,0.2,0.5,0.1\n'
...                 '2a,0.3,0.3,0.3,0.1')
...     file1.flush()
...     x = file2.write('site,A,C,G,T\n'
...                 '10,0.4,0.1,0.1,0.4\n'
...                 '2a,0.3,0.4,0.1,0.2')
...     file2.flush()
...     avg = avgPrefs([file1.name, file2.name])
>>> (avg['site'] == ['2a', '10']).all()
True
>>> numpy.allclose(avg['A'], [0.3, 0.3])
True
>>> numpy.allclose(avg['C'], [0.35, 0.15])
True
>>> numpy.allclose(avg['G'], [0.2, 0.3])
True
>>> numpy.allclose(avg['T'], [0.15, 0.25])
True

dms_tools2.prefs.inferPrefsByRatio(charlist, sites, wts, pre, post, errpre, errpost, pseudocount)[source]

Site-specific preferences from normalized enrichment ratios.

Calculates site-specific preference $$\pi_{r,a}$$ of each site $$r$$ for each character $$a$$ using re-normalized enrichment ratios. This function accomplishes the same goal as inferSitePrefs, but is much simpler and faster, although less statistically rigorous in principle.

Specifically, this function calculates the preferences as

$\pi_{r,a} = \frac{\phi_{r,a}}{\sum_{a'} \phi_{r,a'}}$

where $$\phi_{r,a}$$ is the enrichment ratio of character $$a$$ relative to wildtype after selection.

The actual definition of these enrichment ratios is somewhat complex due to the need to correct for errors and add a pseudocount. We have 4 sets of counts:

• pre: pre-selection counts

• post: post-selection counts

• errpre: error-control for pre-selection counts

• errpost: error-control for post-selection counts

For each set of counts $$s$$, let $$N_r^s$$ (e.g., $$N_r^{pre}$$) denote the total counts for this sample at site $$r$$, and let $$n_{r,a}^{s}$$ (e.g., $$n_{r,a}^{pre}$$) denote the counts at that site for character $$a$$.

Because some of the counts are low, we add a pseudocount $$P$$ to each observation. Importantly, we need to scale this pseudocount by the total depth for that sample at that site in order to avoid systematically estimating different frequencies purely as a function of sequencing depth, which will bias the preference estimates. Therefore, given the pseudocount value $$P$$ defined via the --pseudocount argument to dms2_prefs, we calculate the scaled pseudocount for sample $$s$$ and site $$r$$ as

$P_r^s = P \times \frac{N_r^s}{\min\limits_{s'} N_r^{s'}}.$

This equation makes the pseudocount $$P$$ for the lowest-depth sample, and scales up the pseodocounts for all other samples proportional to their depth.

With this pseudocount, the estimated frequency of character $$a$$ at site $$r$$ in sample $$s$$ is

$f_{r,a}^s = \frac{n_{r,a}^s + P_r^s}{N_{r}^s + A \times P_r^s}$

where $$A$$ is the number of characters in the alphabet (e.g., 20 for amino acids without stop codons).

The error-corrected estimates of the frequency of character $$a$$ before and after selection are then

\begin{align}\begin{aligned}f_{r,a}^{before} &= \max\left(\frac{P_r^{pre}}{N_{r}^{pre} + A \times P_r^{pre}}, \; f_{r,a}^{pre} + \delta_{a,\rm{wt}\left(r\right)} - f_{r,a}^{errpre}\right)\\f_{r,a}^{after} &= \max\left(\frac{P_r^{post}}{N_{r}^{post} + A \times P_r^{post}}, \; f_{r,a}^{post} + \delta_{a,\rm{wt}\left(r\right)} - f_{r,a}^{errpost}\right)\end{aligned}\end{align}

where $$\delta_{a,\rm{wt}\left(r\right)}$$ is the Kronecker-delta, equal to 1 if $$a$$ is the same as the wildtype character $$\rm{wt}\left(r\right)$$ at site $$r$$, and 0 otherwise. We use the $$\max$$ operator to ensure that even if the error-control frequency exceeds the actual estimate, our estimated frequency never drops below the pseudocount over the depth of the uncorrected sample.

Given these definitions, we then simply calculate the enrichment ratios as

$\phi_{r,a} = \frac{\left(f_{r,a}^{after}\right) / \left(f_{r,\rm{wt}\left(r\right)}^{after}\right)} {\left(f_{r,a}^{before}\right) / \left(f_{r,\rm{wt}\left(r\right)}^{before}\right)}$

In the case where we are not using any error-controls, then we simply set $$f_{r,\rm{wt}}^{errpre} = f_{r,\rm{wt}}^{errpost} = \delta_{a,\rm{wt}\left(r\right)}$$.

Args:
charlist (list)

Valid characters (e.g., codons, amino acids, nts).

sites (list)

Sites to analyze.

wts (list)

wts[r] is the wildtype character at site sites[r].

pre (pandas.DataFrame)

Gives pre-selection counts. Should have columns with names of ‘site’ and all characters in charlist. The rows give the counts of each character at that site.

post (pandas.DataFrame)

Like pre but for post-selection counts.

errpre (None or pandas.DataFrame)

Like pre but for pre-selection error-control counts, or None if there is no such control.

errpost (None or pandas.DataFrame)

Like pre but for post-selection error-control counts, or None if there is no such control.

pseudocount (float or int)

The pseudocount to add to each observation.

Returns:

A pandas.DataFrame holding the preferences. The columns of this dataframe are ‘site’ and all characters in charlist. For each site in sites, the rows give the preference for that character.

dms_tools2.prefs.inferSitePrefs(charlist, wtchar, error_model, counts, priors, seed=1, niter=10000, increasetries=5, n_jobs=1, r_max=1.1, neff_min=100, nchains=4, increasefac=2)[source]

Infers site-specific preferences by MCMC for a specific site.

Infer the site-specific preferences $$\pi_{r,a}$$ for some site $$r$$ for each character $$a$$ by integrating over the posterior defined by the product of the following priors and likelihoods, where $$\boldsymbol{\mathbf{\pi_r}}$$ indicates the vector of $$\pi_{r,a}$$ values for all characters:

\begin{align}\begin{aligned}\Pr\left(\boldsymbol{\mathbf{\pi_r}}\right) &= \mathrm{Dirichlet}\left(\boldsymbol{\mathbf{a_{\pi,r}}}\right)\\\Pr\left(\boldsymbol{\mathbf{\mu_r}}\right) &= \mathrm{Dirichlet}\left(\boldsymbol{\mathbf{a_{\mu,r}}}\right)\\\Pr\left(\boldsymbol{\mathbf{\epsilon_r}}\right) &= \mathrm{Dirichlet}\left(\boldsymbol{\mathbf{a_{\epsilon,r}}}\right)\\\Pr\left(\boldsymbol{\mathbf{\rho_r}}\right) &= \mathrm{Dirichlet}\left(\boldsymbol{\mathbf{a_{\rho,r}}}\right)\\\Pr\left(\boldsymbol{\mathbf{n_r^{\rm{pre}}}} \mid \boldsymbol{\mathbf{\mu_r}}, \boldsymbol{\mathbf{\epsilon_r}}\right) &= \mathrm{Multinomial}\left(\boldsymbol{\mathbf{n_r^{\rm{pre}}}}; \boldsymbol{\mathbf{\mu_r}} + \boldsymbol{\mathbf{\epsilon_r}} - \boldsymbol{\mathbf{\delta_r}}\right)\\\Pr\left(\boldsymbol{\mathbf{n_r^{\rm{post}}}} \mid \boldsymbol{\mathbf{\mu_r}}, \boldsymbol{\mathbf{\epsilon_r}}\right) &= \mathrm{Multinomial}\left(\boldsymbol{\mathbf{n_r^{\rm{post}}}}; \frac{\boldsymbol{\mathbf{\mu_r}} \circ \boldsymbol{\mathbf{\pi_r}}}{\boldsymbol{\mathbf{\mu_r}} \cdot \boldsymbol{\mathbf{\pi_r}}} + \boldsymbol{\mathbf{\rho_r}} - \boldsymbol{\mathbf{\delta_r}}\right)\\\Pr\left(\boldsymbol{\mathbf{n_r^{\rm{err,pre}}}} \mid \boldsymbol{\mathbf{\epsilon_r}}\right) &= \mathrm{Multinomial}\left(\boldsymbol{\mathbf{n_r^{\rm{err,pre}}}}; \boldsymbol{\mathbf{\epsilon_r}}\right)\\\Pr\left(\boldsymbol{\mathbf{n_r^{\rm{err,post}}}} \mid \boldsymbol{\mathbf{\rho_r}}\right) &= \mathrm{Multinomial}\left(\boldsymbol{\mathbf{n_r^{\rm{err,post}}}}; \boldsymbol{\mathbf{\rho_r}}\right)\end{aligned}\end{align}

where $$\boldsymbol{\mathbf{\delta_r}}$$ is a vector that is zero except for a one at the element corresponding to the wildtype.

The MCMC tries to guarantee convergence via the parameters specified by r_max, neff_min, nchains, niter, increasefac, and increasetries.

Args:
charlist (list)

List of valid characters (e.g., codons, amino acids, nts).

wtchar (str)

Wildtype character at the site.

error_model (str or object)

Specifies how errors are estimated. Passing error model objects is faster if calling this function repeatedly as they will not need to be compiled. Can be:

• The str none or instance of StanModelNoneErr: no errors ($$\boldsymbol{\mathbf{\epsilon_r}} = \boldsymbol{\mathbf{\rho_r}} = \mathbf{0}$$)

• The str same or instance of StanModelSameErr: same error rates pre and post-selection ($$\boldsymbol{\mathbf{\epsilon_r}} = \boldsymbol{\mathbf{\rho_r}}$$).

• The str different or instance of StanModelDifferentErr: different error rates pre- and post-selection ($$\boldsymbol{\mathbf{\epsilon_r}} \ne \boldsymbol{\mathbf{\rho_r}}$$).

counts (dict)

Deep sequencing counts. Each string key should specify a dict keyed by all characters in charlist with the values giving integer counts for that character. The keys:

• pre: $$\boldsymbol{\mathbf{n_r^{\rm{pre}}}}$$

• post: $$\boldsymbol{\mathbf{n_r^{\rm{post}}}}$$

• err: required if error_model is same, specifies $$\boldsymbol{\mathbf{n_r^{\rm{err,pre}}}} = \boldsymbol{\mathbf{n_r^{\rm{err,post}}}}$$

• errpre and errpost: required if error_model is different, specify $$\boldsymbol{\mathbf{n_r^{\rm{err,pre}}}}$$ and $$\boldsymbol{\mathbf{n_r^{\rm{err,post}}}}$$.

priors (dict)

Specifies parameter vectors for Dirichlet priors. Each string key should specify a dict keyed by all characters and values giving the prior for that character. Values less than PRIOR_MIN_VALUE are set to PRIOR_MIN_VALUE. Keys are

• pir_prior_params: $$\boldsymbol{\mathbf{a_{\pi,r}}}$$

• mur_prior_params: $$\boldsymbol{\mathbf{a_{\mu,r}}}$$

• epsilonr_prior_params: only required if error_model is same or different, specifies $$\boldsymbol{\mathbf{a_{\epsilon,r}}}$$

• rhor_prior_params : only required if error_model is different, specifies $$\boldsymbol{\mathbf{a_{\rho,r}}}$$

seed (int)

Random number seed for MCMC.

n_jobs (int)

Number of CPUs to use, -1 means all available.

niter, increasetries, r_max, neff_min, nchains, increasefac

Specify MCMC convergence. They all have reasonable defaults. The MCMC is considered to have converged if the mean Gelman-Rubin R statistic (http://www.jstor.org/stable/2246093) over all $$\pi_{r,x}$$ values <= r_max and the mean effective sample size is >= neff_min. The MCMC first runs with nchains chains and niter iterations. If it fails to converge, it increases the iterations by a factor of increasefac and tries again, and repeats until it converges or has tried increasetries times to increase the number of iterations. If the effective sample size exceeds 3 times neff_min then we allow R to be 1 + 1.5 (r_max - 1).

Returns:
The tuple (converged, pi_means, pi_95credint, logstring) where:
• converged is True if the MCMC converged, False otherwise.

• pi_means is dict keyed by characters in charlist

with the value giving the $$\pi_{r,a}$$.

• pi_95credint is dict keyed by characters, values are 2-tuples

giving median-centered credible interval for $$\pi_{r,a}$$.

• logstring is a string describing MCMC run and convergence.

dms_tools2.prefs.prefsEntropy(prefs, charlist)[source]

Calculate site entropy and number of effective characters.

The site entropy $$h_r$$ for site $$r$$ is defined as $$h_r = -\sum_{x} \pi_{r,x} \log\left(\pi_{r,x}\right)$$ where $$\pi_{r,x}$$ is the preference for character (e.g., amino acid) $$x$$ at site $$r$$, and the log is the natural logarithm.

The number of effective characters at site $$r$$ is $$N_{\rm{eff}, r} = \exp\left(h_r\right)$$.

Args:
prefs (pandas DataFrame)

Data frame with the preferences. Should have a column named site and a column for each character in charlist.

charlist (list)

List of the characters of interest, for example the amino acids.

Returns:

A copy of prefs that has additional columns named entropy and neffective, giving the site entropy and number of effective characters for each site. For the entropies, log is taken to base $$e$$.

>>> charlist = ['A', 'C', 'G', 'T']
>>> prefs = pandas.DataFrame({
...         'site':[1, 2, 3],
...         'A':[0.25, 0.6, 1 / 3.],
...         'C':[0.25, 0.3, 1 / 3.],
...         'G':[0.25, 0.1, 1 / 3.],
...         'T':[0.25, 0.0, 0.0],
...         })
>>> prefs_entropy = prefsEntropy(prefs, charlist)
>>> (set(['entropy', 'neffective', 'site'] + charlist) ==
...         set(prefs_entropy.columns))
True
>>> h2 = -0.6 * math.log(0.6) - 0.3 * math.log(0.3) - 0.1 * math.log(0.1)
>>> numpy.allclose(prefs_entropy['entropy'], [math.log(4), h2, math.log(3)])
True
>>> numpy.allclose(prefs_entropy['neffective'], [4, math.exp(h2), 3])
True

dms_tools2.prefs.prefsToMutEffects(prefs, charlist)[source]

Converts amino acid preferences to effects of specific mutations.

If the preference of site $$r$$ for amino acid $$a$$ is $$\pi_{r,a}$$, then the estimated effect (e.g., ratio of enrichment ratios) for mutating that site from $$x$$ to $$y$$ is $$\frac{\pi_{r,y}}{\pi_{r,x}}$$. Very small values indicate disfavored mutations, and very large values indicate highly favored mutations. The logarithm base 2 of the expected effects is also a useful measure – negative values are disfavored mutations, positive values are favored ones.

Args:
prefs (pandas DataFrame)

Preferences to analyze. The columns should be site and a column for every character in charlist.

charlist (list)

The list of characters that we are analyzing. For instance, dms_tools2.AAS for amino acids.

Returns:

A pandas Data Frame where the columns are:

• site: the site

• initial: the initial character (e.g., amino acid)

• final: the final character after the mutation

• mutation: mutation in the string form A1G

• effect: the effect of the mutation

• log2effect: the log2 of the effect of the mutation.

>>> charlist = ['A', 'C', 'G']
>>> prefs = pandas.DataFrame({
...         'site':[1, 2],
...         'A':[0.25, 0.25],
...         'C':[0.25, 0.5],
...         'G':[0.5, 0.25],
...         })
>>> effects = prefsToMutEffects(prefs, charlist)
>>> set(effects.columns) == {'site', 'initial', 'final',
...         'mutation', 'effect', 'log2effect'}
True
>>> numpy.allclose(effects[effects['initial'] == 'A']['effect'],
...         [1, 1, 2, 1, 2, 1])
True
>>> numpy.allclose(effects[effects['initial'] == 'C']['effect'],
...         [1, 1, 2, 0.5, 1, 0.5])
True
>>> numpy.allclose(effects['effect'], 2**effects['log2effect'])
True

dms_tools2.prefs.prefsToMutFromWtEffects(prefs, charlist, wts)[source]

Converts preferences effects of mutations away from wildtype.

This function is similar to prefsToMutEffects() but computes the effects of mutations away from a given wildtype sequence.

Args:
prefs

Same meaning as for prefsToMutEffects().

charlist

Same meaning as for prefsToMutEffects().

wts (pandas DataFrame)

Has columns named “site” and “wildtype” giving the wildtype amino acid for each site. The sites must match those in prefs.

Returns:

A pandas DataFrame similar to that returned by prefsToMutEffects() except that it only contains mutations away from the wildtype character at each site, and the columns specifying the wildtype and mutant characters are named “wildtype” and “mutant” rather than “initial” and “final”.

>>> charlist = ['A', 'C', 'G']
>>> prefs = pandas.DataFrame({
...         'site':[1, 2],
...         'A':[0.25, 0.25],
...         'C':[0.25, 0.5],
...         'G':[0.5, 0.25],
...         })
>>> wts = pandas.DataFrame({
...         'site':[1, 2],
...         'wildtype':['G', 'C']
...         })
>>> prefsToMutFromWtEffects(prefs, charlist, wts)
site wildtype mutant mutation  effect  log2effect
0     1        G      A      G1A     0.5        -1.0
1     1        G      C      G1C     0.5        -1.0
2     1        G      G      G1G     1.0         0.0
3     2        C      A      C2A     0.5        -1.0
4     2        C      C      C2C     1.0         0.0
5     2        C      G      C2G     0.5        -1.0

dms_tools2.prefs.rescalePrefs(prefs, stringency)[source]

Re-scale amino acid preferences by stringency parameter.

If the initial preference of site $$r$$ for amino-acid $$a$$ is $$\pi_{r,a}$$, then the re-scaled preference is

$\frac{\left(\pi_{r,a}\right)^{\beta}}{\sum_{a'} \left(\pi_{r,a'}\right)}$

where $$\beta$$ is the stringency parameter.

Args:
prefs (pandas.DataFrame)

Columns are ‘site’ and then every character (e.g., amino acid).

stringency (float >= 0)

The stringency parameter.

Returns:

A data frame in the same format as prefs but with the re-scaled preferences.

>>> prefs = pandas.DataFrame(dict(
...         [('site', [1, 2]), ('A', [0.24, 0.03]), ('C', [0.04, 0.43])]
...         + [(aa, [0.04, 0.03]) for aa in dms_tools2.AAS if
...         aa not in ['A', 'C']]))
>>> numpy.allclose(1, prefs.drop('site', axis=1).sum(axis=1))
True
>>> rescaled = rescalePrefs(prefs, 1.0)
>>> all([numpy.allclose(prefs[c], rescaled[c]) for c in prefs.columns])
True
>>> rescaled2 = rescalePrefs(prefs, 2.0)
>>> all(rescaled2['site'] == prefs['site'])
True
>>> numpy.allclose(rescaled2['A'], [0.6545, 0.0045], atol=1e-3)
True
>>> numpy.allclose(rescaled2['C'], [0.0182, 0.9153], atol=1e-3)
True
`