# globalepistasis¶

Implements global epistasis models that are based on (but extend in some ways) those described in Otwinoski et al (2018). See also Sailer and Harms (2017) and Otwinoski (2018).

## Global epistasis function¶

The global epistasis function is defined as follows. Let $$v$$ be a variant. We convert $$v$$ into a binary representation with respect to some wildtype sequence. This representation is a vector $$\mathbf{b}\left(v\right)$$ with element $$b\left(v\right)_m$$ equal to 1 if the variant has mutation $$m$$ and 0 otherwise, and $$m$$ ranging over all $$M$$ mutations observed in the overall set of variants (so $$\mathbf{b}\left(v\right)$$ is of length $$M$$). Variants can be converted into this binary form using binarymap.

We define a latent effect for each mutation $$m$$, which we denote as $$\beta_m$$. The latent effects of mutations contribute additively to the latent phenotype, and the latent phenotype of the wildtype sequence is $$\beta_{\rm{wt}}$$. So the latent phenotype of variant $$v$$ is:

(1)$\phi\left(v\right) = \beta_{\rm{wt}} + \sum_{m=1}^M \beta_m b\left(v\right)_m.$

The predicted observed phenotype $$p\left(v\right)$$ is a function of the latent phenotype:

(2)$p\left(v\right) = g\left(\phi\left(v\right)\right)$

where $$g$$ is the global epistasis function.

We define the following global epistasis functions:

### No epistasis function¶

No epistasis, so the observed phenotype is just the latent phenotype:

(3)$g\left(x\right) = x.$

This function is implemented as NoEpistasis.

### Monotonic spline epistasis function¶

This is the function used in Otwinoski et al (2018). It transforms the latent phenotype to the observed phenotype using monotonic I-splines with linear extrapolation outside the spline boundaries:

(4)$\begin{split}g\left(x\right) = \begin{cases} c_{\alpha} + \sum_{m=1}^M \alpha_{m} I_m\left(x\right) & \rm{if\;} L \le x \le U, \\ c_{\alpha} + \sum_{m=1}^M \alpha_m \left[I_m\left(L\right) + \left(x - L\right) \left.\frac{\partial I_m\left(y\right)} {\partial y}\right\rvert_{y=L} \right] & \rm{if\;} x < L, \\ c_{\alpha} + \sum_{m=1}^M \alpha_m \left[I_m\left(U\right) + \left(x - U\right) \left.\frac{\partial I_m\left(y\right)} {\partial y}\right\rvert_{y=U} \right] & \rm{if\;} x > U, \end{cases}\end{split}$

where $$c_{\alpha}$$ is an arbitrary number giving the minimum observed phenotype, the $$\alpha_m$$ coefficients are all $$\ge 0$$, $$I_m$$ indicates a family of I-splines defined via dms_variants.ispline.Isplines_total, and $$L$$ and $$U$$ are the lower and upper bounds on the regions over which the I-splines are defined. Note how when $$x$$ is outside the range of the I-splines, we linearly extrapolate $$g$$ from its range boundaries to calculate.

This function is implemented as MonotonicSplineEpistasis. By default, the I-splines are of order 3 and are defined on a mesh of four evenly spaced points such that the total number of I-splines is $$M=5$$ (although these options can be adjusted when initializing a MonotonicSplineEpistasis model).

The latent effects are scaled so that their mean absolute value is one, and the latent phenotype of the wildtype is set to zero.

### Multiple latent phenotypes¶

Although this package allows multiple latent phenotypes, we do not recommend using them as the models generally do not seem to converge in fitting in a useful way with multiple latent phenotypes.

Equations (1) and (2) can be generalized to the case where multiple latent phenotypes contribute to the observed phenotype. Specifically, let there be $$k = 1, \ldots, K$$ different latent phenotypes, and let $$\beta_m^k$$ denote the effect of mutation $$m$$ on latent phenotype $$k$$. Then we generalize Equation (1) to

(5)$\phi_k\left(v\right) = \beta_{\rm{wt}}^k + \sum_{m=1}^M \beta_m^k b\left(v\right)_m,$

and Equation (2) to

(6)$p\left(v\right) = \sum_{k=1}^K g_k\left(\phi_k\left(v\right)\right),$

where $$\phi_k\left(v\right)$$ is the $$k$$-th latent phenotype of variant $$v$$, and $$g_k$$ is the $$k$$-th global epistasis function.

Note that it does not make sense to fit multiple latent phenotypes to a non-epistatic model as a linear combination of linear effects reduces to a simple linear model (in other words, a multi-latent phenotype non-epistatic model is no differen than a one-latent phenotype non-epistatic model).

## Likelihood calculation¶

We defined a likelihood capturing how well the model describes the actual data, and then fit the models by finding the parameters that maximize this likelihood. This means that different epistasis functions (as described in Global epistasis function) can be compared via their likelihoods after correcting for the number of parameters (e.g. by AIC).

We consider several different forms for calculating the likelihood. Note that epistasis functions can only be compared within the same form of the likelihood on the same dataset–you cannot compare likelihoods calculated using different methods, or on different datasets.

### Gaussian likelihood¶

This is the form of the likelihood used in Otwinoski et al (2018). It is most appropriate when we have functional scores for variants along with good estimates of normally distributed errors on these functional scores.

For each variant $$v$$, we have an experimentally measured functional score $$y_v$$ and optionally an estimate of the error (variance) $$\sigma^2_{y_v}$$ in this functional score measurement. If no error estimates are available, then we set $$\sigma^2_{y_v} = 0$$.

The goal of the fitting is to parameterize the model so the observed phenotype $$p\left(v\right)$$ predicted by the model is as close as possible to the measured functional score $$y_v$$. Following Otwinoski et al (2018), we assume the likelihood of measuring a functional score $$y_v$$ is normally distributed around the model prediction $$p\left(v\right)$$ with variance $$\sigma^2_{y_v} + \sigma^2_{\rm{HOC}}$$, where $$\sigma^2_{\rm{HOC}}$$ is the un-modeled house-of-cards epistasis (although in practice it could also represent experimental noise not capture in the variance estimates). So the overall log likelihood of the model is

(7)$\mathcal{L} = \sum_{v=1}^V \ln\left[N\left(y_v \mid p\left(v\right), \sigma^2_{y_v} + \sigma^2_{\rm{HOC}}\right)\right]$

where $$V$$ is the number of variants and $$N$$ is the normal distribution defined by

(8)$N\left(y \mid \mu, \sigma^2\right) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left(-\frac{\left(y - \mu\right)^2}{2 \sigma^2}\right).$

This likelihood calculation is implemented as GaussianLikelihood.

### Cauchy likelihood¶

This form of the likelihood assumes that the difference between the measured functional scores and the model’s observed phenotypes follows a Cauchy distribution.

A potential advantage over the Gaussian likelihood is the fatter tails of the Cauchy distribution relative to the Gaussian distribution. This could be advantageous if some measurements are very large outliers (either due to un-modeled epistasis or experimental factors such as mis-calling of variant sequences) in ways that are not capture in the functional score variance estimates $$\sigma^2_{y_v}$$. Such outlier measurements will have less influence on the overall model fit for the Cauchy likelihood relative to the Gaussian likelihood.

Specifically, we compute the overall log likelihood as

(9)$\mathcal{L} = -\sum_{v=1}^V \ln\left[\pi \sqrt{\gamma^2 + \sigma^2_{y_v}} \left(1 + \frac{\left[y_v - p\left(v\right)\right]^2} {\gamma^2 + \sigma^2_{y_v}} \right) \right]$

where $$\gamma$$ is the scale parameters, and the functional score variance estimates $$\sigma^2_{y_v}$$ are incorporated in heuristic way (with no real theoretical basis) that qualitatively captures the fact that larger variance estimates give a broader distribution. If variance estimates are not available then $$\sigma^2_{y_v}$$ is set to zero.

This likelihood calculation is implemented as CauchyLikelihood.

### Bottleneck likelihood¶

This form of the likelihood is appropriate when most noise in the experiment comes from a bottleneck when passaging the library from the pre-selection to post-selection condition. This will be the case when the total pre- and post- selection sequencing depths greatly exceed the number of variants that were physically passaged from the pre-selection library to the post- selection condition. At least in Bloom lab viral deep mutational scanning experiments, this situation is quite common.

A full derivation of the log likelihood in this situation is given in Bottleneck likelihood for global epistasis models. As explained in those calculations, the likelihood is computed from the experimental observables $$f_v^{\text{pre}}$$ and $$f_v^{\text{post}}$$, which represent the pre- and post-selection frequencies of variant $$v$$. They are computed from the pre- and post-selection counts $$n_v^{\text{pre}}$$ and $$n_v^{\text{post}}$$ of the variants as

(10)$\begin{split}f_v^{\text{pre}} &=& \frac{n_v^{\text{pre}} + C} {\sum_{v'=1}^V \left(n_{v'}^{\text{pre}} + C \right)} \\ f_v^{\text{post}} &=& \frac{n_v^{\text{post}} + C} {\sum_{v'=1}^V \left(n_{v'}^{\text{post}} + C \right)}\end{split}$

where $$C$$ is a pseudocount which by default is 0.5.

Using the bottleneck likelihood also requires an estimation of the experimental bottleneck $$N_{\rm{bottle}}$$ when passaging the library from the pre- to post-selection conditions. A smaller bottleneck will correspond to more “noise” in the experiment, since random bottlenecking changes the frequencies of variants unpredictably. You can either estimate $$N_{\rm{bottle}}$$ experimentally or from fluctuations in the relative frequencies of wildtype or synonymous variants, such as via dms_variants.bottlenecks.estimateBottleneck().

Given these experimentally measured parameters, the overall log likelihood is:

(11)$\mathcal{L} = \sum_{v=1}^V \left[ n_v^{\rm{bottle}} \ln\left(N_{\rm{bottle}} f_v^{\text{pre}}\right) - \ln \Gamma \left(n_v^{\rm{bottle}} + 1\right) \right] - N_{\rm{bottle}}$

where $$\Gamma$$ is the gamma function and $$n_v^{\rm{bottle}}$$ is the estimated number of copies of variant $$v$$ that made it through the bottleneck, which is defined in terms of the phenotype $$p\left(v\right)$$ as

(12)$n_v^{\rm{bottle}} = \frac{f_v^{\rm{post}} N_{\rm{bottle}} \sum_{v'=1}^V f_{v'}^{\text{pre}} 2^{p\left(v'\right)}} {2^{p\left(v\right)}}.$

The free parameters are therefore the $$p\left(v\right)$$ values (the $$n_v^{\rm{bottle}}$$ values are hidden variables that are not explicitly estimated). Note that Eq. (12) uses an exponent base of 2, but it can be set to arbitrary positive values in the actual implementation.

After fitting the observed phenotypes, the parameters are re-scaled so that the observed phenotype of wildtype is zero (i.e., $$p\left(\rm{wt}\right) = 0$$).

This likelihood calculation is implemented as BottleneckLikelihood.

## The model classes¶

The epistasis models are defined in a set of classes. All these classes inherit their main functionality from the AbstractEpistasis abstract base class.

There are subclasses of AbstractEpistasis that implement the global epistasis functions and likelihood calculation methods. Specifically, the following classes implement a Global epistasis function:

and the following classes each implement a Likelihood calculation:

However, those classes can still not be directly instantianted, as a fully concrete model subclass must have both a global epistasis function and a likelihood calculation method. The following classes implement both, and so can be directly instantiated for use in analyses:

## Details of fitting¶

### Fitting workflow¶

The fitting workflow for a single latent phenotype is similar to that described in Otwinoski et al (2018):

1. The latent effects are fit under an additive (non-epistatic) model using least squares. The residuals from this fit are then used to estimate $$\sigma^2_{\rm{HOC}}$$ for Gaussian likelihood, or $$\gamma^2$$ for Cauchy likelihood.

2. If there are any parameters in the epistasis function, they are set to reasonable initial values. For MonotonicSplineEpistasis this involves setting the mesh to go from 0 to 1, scaling the latent effects so that the latent phenotypes range from 0 to 1, setting $$c_{\alpha}$$ to the minimum functional score and setting the weights $$\alpha_m$$ to equal values such that the max of the epistasis function is the same as the max functional score.

3. The overall model is fit by maximum likelihood.

4. For MonotonicSplineEpistasis, the latent effects and wildtype latent phenotype are rescaled so that the mean absolute value latent effect is one and the wildtype latent phenotype is zero.

### Fitting multiple latent phenotypes¶

When there are multiple latent phenotypes (see Multiple latent phenotypes), the fitting workflow changes. To fit a model with $$K > 1$$ latent phenotypes, first fit a model of the same type to the same data with $$K - 1$$ latent phenotype. The values from that fit for the first $$K - 1$$ latent phenotypes are used to initialize all parameters relevant to those first $$K - 1$$ latent phenotypes and the associated global epistasis functions and likelihood calculations.

Then parameters relevant to latent phenotype $$K$$ are set so the initial contribution of this phenotype to the overall observed phenotype is zero. Specifically the latent effects $$\beta_m^K$$ for phenotype $$K$$ are all set to zero and $$\beta_{\rm{wt}}^K$$ is chosen so that $$0 = g_K \left(\beta_{\rm{wt}}^K\right)$$. For a MonotonicSplineEpistasis model, initial parameters for $$g_K$$ are chosen so that over the mesh, $$g_K$$ spans +/- the absolute value of the largest residual of the model with $$K - 1$$ latent phenotypes.

After initializing the paramters in this way, the entire model (all latent phenotypes) is fit by maximum likelihood.

### Conveniently fitting and comparing several models¶

To conveniently fit and compare several models, use fit_models(). This is especially useful when you are including models with multiple latent phenotypes.

### Vector representation of $$\beta_{\rm{wt}}$$¶

For the purposes of the optimization (and in the equations below), we change how $$\beta_{\rm{wt}}$$ is represented to simplify the calculations. Specifically, to the binary encoding $$\mathbf{b}\left(v\right)$$ of each variant, we append a 1 so that the encodings are now of length $$M + 1$$. We then define $$\beta_{M + 1} = \beta_{\rm{wt}}$$. Then Eq. (1) can be rewritten as

(13)$\phi\left(v\right) = \sum_{m=1}^{M+1} \beta_m b\left(v\right)_m$

enabling $$\beta_{\rm{wt}}$$ to just be handled like the other $$\beta_m$$ parameters.

### Optimization¶

The optimization is performed by AbstractEpistasis.fit(). There are several options to that method about how to do the optimization; by default it uses a L-BFGS-B algorithm with exact gradients calculated as below.

For the optimization, we use the following gradients:

Gradient of latent phenotype with respect to latent effects:

(14)$\begin{split}\frac{\partial \phi_j\left(v\right)}{\partial \beta_m^k} = \begin{cases} b\left(v_m\right) & \rm{if\;} j = k, \\ 0 & \rm{otherwise.} \\ \end{cases}\end{split}$

Gradient of observed phenotype with respect to latent phenotypes:

(15)$\begin{split}\frac{\partial p\left(v\right)}{\partial \beta_m^k} &=& \left.\frac{\partial g_k\left(x\right)}{\partial x} \right\rvert_{x = \phi_k\left(v\right)} \times \frac{\partial \phi_k\left(v\right)}{\partial \beta_m^k} \\ &=& \left.\frac{\partial g_k\left(x\right)}{\partial x} \right\rvert_{x = \phi_k\left(v\right)} \times b\left(v_m\right)\end{split}$

Derivative of the likelihood with respect to latent effects:

(16)$\frac{\partial \mathcal{L}}{\partial \beta_m^k} = \sum_{v=1}^V \frac{\mathcal{L}} {\partial p\left(v\right)} \times \frac{\partial p\left(v\right)}{\partial \beta_m^k}.$

Derivative of Gaussian likelihood (Eq. (7)) with respect to observed phenotype:

(17)$\frac{\partial \mathcal{L}}{\partial p\left(v\right)} = \frac{y_v - p\left(v\right)} {\sigma_{y_v}^2 + \sigma^2_{\rm{HOC}}}.$

Derivative of Gaussian likelihood (Eq. (7)) with respect to house-of-cards epistasis:

(18)$\frac{\partial \mathcal{L}}{\partial \sigma^2_{\rm{HOC}}} = \sum_{v=1}^V \frac{1}{2} \left[\left(\frac{\partial \mathcal{L}} {\partial p\left(v\right)}\right)^2 - \frac{1}{\sigma_{y_v}^2 + \sigma_{\rm{HOC}}^2} \right].$

Derivative of Cauchy likelihood (Eq. (9)) with respect to observed phenotype:

(19)$\frac{\partial \mathcal{L}}{\partial p\left(v\right)} = \frac{2\left[y_v - p\left(v\right)\right]} {\gamma^2 + \sigma^2_{y_v} + \left[y_v - p\left(v\right)\right]^2}.$

Derivative of Cauchy likelihood (Eq. (9)) with respect to scale parameter:

(20)$\frac{\partial \mathcal{L}}{\partial \gamma} = \sum_{v=1}^{V} \frac{\gamma\left(\left[y_v - p\left(v\right)\right]^2 - \gamma^2 - \sigma^2_{y_v} \right)} {\left(\gamma^2 + \sigma^2_{y_v}\right) \left(\gamma^2 + \sigma^2_{y_v} + \left[y_v - p\left(v\right)\right]^2\right) }$

Derivative of Bottleneck likelihood with respect to $$p\left(v\right)$$:

$\begin{split}\frac{\partial n_v^{\text{bottle}}} {\partial p\left(v'\right)} &=& \begin{cases} \frac{\left(\ln 2\right) f_v^{\text{post}} N_{\text{bottle}} f_{v}^{\text{pre}} 2^{p\left(v\right)}} {2^{p\left(v\right)}} - \left(\ln 2\right) n_v^{\text{bottle}} & \rm{if\;} v = v', \\ \frac{\left(\ln 2\right) f_v^{\text{post}} N_{\text{bottle}} f_{v'}^{\text{pre}} 2^{p\left(v'\right)}} {2^{p\left(v\right)}} & \text{otherwise} \end{cases} \\ &=& \frac{\left(\ln 2\right) f_v^{\text{post}} N_{\text{bottle}} f_{v'}^{\text{pre}} 2^{p\left(v'\right)}} {2^{p\left(v\right)}} - \delta_{v,v'} \left(\ln 2\right) n_{v'}^{\text{bottle}}\end{split}$
(21)$\begin{split}\frac{\partial \mathcal{L}} {\partial p\left(v'\right)} &=& \sum_{v=1}^V \frac{\partial n_v^{\text{bottle}}}{\partial p\left(v'\right)} \ln\left( N_{\text{bottle}} f_v^{\text{pre}} \right) - \frac{\partial n_v^{\text{bottle}}}{\partial p\left(v'\right)} \psi_0\left(n_v^{\text{bottle}} + 1\right) \\ &=& \left(\ln 2\right) f_{v'}^{\text{pre}} 2^{p\left(v'\right)} N_{\text{bottle}} \left(\sum_{v=1}^V \frac{f_v^{\text{post}}} {2^{p\left(v\right)}} \left[\ln\left(N_{\text{bottle}} f_v^{\text{pre}}\right) - \psi_0\left(n_v^{\text{bottle}} + 1\right) \right] \right) - \left(\ln 2\right) n_{v'}^{\text{bottle}} \left[\ln\left(N_{\text{bottle}} f_{v'}^{\text{pre}}\right) - \psi_0\left(n_{v'}^{\text{bottle}} + 1\right) \right]\end{split}$

where $$\psi_0$$ is the digamma function.

Derivative of Monotonic spline epistasis function with respect to its parameters:

(22)$\frac{\partial g_k\left(x\right)}{\partial c_{\alpha}^k} = 1$
(23)$\frac{\partial g_k\left(x\right)}{\partial \alpha^k_m} = I^k_m\left(x\right)$

## Detailed documentation of models¶

class dms_variants.globalepistasis.AbstractEpistasis(binarymap, *, n_latent_phenotypes=1, model_one_less_latent=None)[source]

Bases: abc.ABC

Abstract base class for epistasis models.

Parameters
• binarymap – Contains the variants, their functional scores, and score variances.

• n_latent_phenotypes (int) – Number of distinct latent phenotypes. See Multiple latent phenotypes.

• model_one_less_latent (None or AbstractEpistasis) – If n_latent_phenotypes > 1, should be a fit model of the same type fit the same binarymap for one less latent phenotype. This is used to initialize the parameters. See Fitting multiple latent phenotypes.

Note

This is an abstract base class. It implements most of the epistasis model functionality, but requires subclasses to define the actual Global epistasis function and Likelihood calculation.

add_phenotypes_to_df(df, *, substitutions_col=None, latent_phenotype_col='latent_phenotype', observed_phenotype_col='observed_phenotype', phenotype_col_overwrite=False, unknown_as_nan=False)[source]

Add predicted phenotypes to data frame of variants.

Parameters
• df (pandas.DataFrame) – Data frame containing variants.

• substitutions_col (str or None) – Column in df giving variants as substitution strings in format that can be processed by AbstractEpistasis.binarymap. If None, defaults to the substitutions_col attribute of that binary map.

• latent_phenotype_col (str) – Column(s) added to df containing predicted latent phenotypes. If there are multiple latent phenotypes, this string is suffixed with the latent phenotype number (i.e., ‘latent_phenotype_1’).

• observed_phenotype_col (str) – Column added to df containing predicted observed phenotypes.

• phenotype_col_overwrite (bool) – If the specified latent or observed phenotype column already exist in df, overwrite it? If False, raise an error.

• unknown_as_nan (bool) – If some of the substitutions in a variant are not present in the model (not in AbstractEpistasis.binarymap) set the phenotypes to nan (not a number)? If False, raise an error.

Returns

A copy of df with the phenotypes added. Phenotypes are predicted based on the current state of the model.

Return type

pandas.DataFrame

property aic

Aikake Information Criterion given current log likelihood.

Type

float

property binarymap

//github.com/jbloomlab/binarymap>_

The binary map is set during initialization of the model.

Type

BinaryMap <https

enrichments(observed_phenotypes, base=2)[source]

Calculated enrichment ratios from observed phenotypes.

Note

In many cases, the functional scores used to fit the model are the logarithm (most commonly base 2) of experimentally observed enrichments For example, this is how functional scores are calculated by dms_variants.codonvarianttable.CodonVariantTable.func_scores(). In that case, the predicted enrichment value $$E\left(v\right)$$ for each variant $$v$$ can be computed from the observed phenotype $$p\left(v\right)$$ as:

$E\left(v\right) = B^{p\left(v\right) - p\left(\rm{wt}\right)}$

where $$p\left(\rm{wt}\right)$$ is the observed phenotype of wildtype, and $$B$$ is the base for the exponent (by default $$B = 2$$).

Parameters
• observed_phenotypes (float or numpy.ndarray) – The observed phenotypes.

• base (float) – The base for the exponent used to convert observed phenotypes to enrichments.

Returns

The enrichments.

Return type

float or numpy.ndarray

abstract epistasis_func(latent_phenotype, k=None)[source]

The Global epistasis function $$g$$.

Parameters
• latent_phenotype (numpy.ndarray) – Latent phenotype(s) of one or more variants.

• k (int or None) – Latent phenotype number (1 <= k <= n_latent_phenotypes), or can be None if just one latent phenotype. See Eq. Multiple latent phenotypes.

Returns

Result of applying global epistasis function $$g_k$$ to latent phenotypes.

Return type

numpy.ndarray

property epistasis_func_params_dict

Parameters for the Global epistasis function.

Maps names of parameters defining the global epistasis function to their current values.

Type

dict

Fit all model params to maximum likelihood values.

Parameters

• optimize_method ({'L-BFGS-B', 'TNC'}) – Optimization method used by scipy.optimize.minimize.

• ftol (float) – Function convergence tolerance for optimization, used by scipy.optimize.minimize.

• clearcache (bool) – Clear the cache after model fitting? This slightly increases the time needed to compute properties after fitting, but greatly saves memory usage.

Returns

The results of optimizing the full model.

Return type

scipy.optimize.OptimizeResult

property latent_effects_df

Latent effects of mutations.

For each single mutation in AbstractEpistasis.binarymap, gives current predicted latent effect of that mutation. If there are multiple latent phenotypes (AbstractEpistasis.n_latent_phenotypes > 1), also indicates the phenotype number of the latent effect.

Type

pandas.DataFrame

latent_phenotype_wt(k=None)[source]

Latent phenotype of wildtype.

Parameters

k (int or None) – Which latent phenotype to get (1 <= k <= AbstractEpistasis.n_latent_phenotypes). If there is just one latent phenotype, can also be None.

Returns

Wildtype latent phenotype, which is $$\beta_{\rm{wt}}$$ in Eq. (1) or $$\beta_{\rm{wt}}^k$$ in Eq. (5).

Return type

float

property likelihood_calc_params_dict

Parameters for the Likelihood calculation.

Maps names of parameters defining the likelihood calculation to their current values.

Type

dict

abstract property loglik

Current log likelihood of model.

Type

float

property n_latent_phenotypes

number of latent phenotypes, see Multiple latent phenotypes.

Type

int

property nparams

Total number of parameters in model.

Type

int

property phenotypes_df

Phenotypes of variants used to fit model.

For each variant in AbstractEpistasis.binarymap, gives the current predicted latent and observed phenotypes as well as the functional score and its variance.

Type

pandas.DataFrame

phenotypes_frombinary(binary_variants, phenotype, *, wt_col=False, k=None)[source]

Phenotypes from binary variant representations.

Parameters
• binary_variants (scipy.sparse.csr.csr_matrix or numpy.ndarray) –

Binary variants in form used by BinaryMap.

• phenotype ({'latent', 'observed'}) – Calculate the latent or observed phenotype.

• wt_col (bool) – Set to True if binary_variants contains a terminal column of ones to enable calculations in the form given by Eq. (13).

• k (int or None) – If phenotype is ‘latent’, which latent phenotype to use (1 <= k (<= AbstractEpistasis.n_latent_phenotypes). If there is just one latent phenotype, can also be None. Has no meaning if phenotype is ‘observed’.

Returns

Latent phenotypes calculated using Eq. (1) or observed phenotypes calculated using Eq. (2) (or Eqs. (5) or (6)).

Return type

numpy.ndarray

preferences(phenotype, base, *, missing='average', exclude_chars=('*',), returnformat='wide', stringency_param=1, k=None)[source]

Get preference of each site for each character.

Use the latent or observed phenotype to estimate the preference $$\pi_{r,a}$$ of each site $$r$$ for each character (e.g., amino acid) $$a$$. These preferences can be displayed in logo plots or used as input to phydms in experimentally informed substitution models.

The preferences are calculated from the phenotypes as follows. Let $$p_{r,a}$$ be the phenotype of the variant with the single mutation of site $$r$$ to $$a$$ (when $$a$$ is the wildtype character, then $$p_{r,a}$$ is the phenotype of the wildtype sequence). Then the preference $$\pi_{r,a}$$ is defined as

$\pi_{r,a} = \frac{b^{p_{r,a}}}{\sum_{a'} b^{p_{r,a'}}}$

where $$b$$ is the base for the exponent. This definition ensures that the preferences sum to one at each site.

The alphabet from which the characters are drawn and the site numbers are extracted from AbstractEpistasis.binarymap.

Note

The “flatness” of the preferences is determined by the exponent base. A smaller base yields flatter preferences. There is no obvious “best” base as different values correspond to different linear scalings of of the phenotype. A recommended approach is simply to choose a value of base and then re-scale the preferences by using phydms to optimize a stringency parameter (see here). The stringency parameter and the base chosen here both apply the same transformation to the data: linear scaling of the phenotypes. But note that phydms has an upper bound on the largest stringency parameter it can fit, so if you are hitting this upper bound then pre-scale the preferences to be less flat by using a larger value of base. In particular, the latent phenotypes from many of the epistasis models are scaled during fitting to have a relatively tight range, so you may need a large value of base such as 50.

Parameters
• phenotype ({'observed', 'latent'}) – Calculate the preferences from observed or latent phenotypes? Note that if there are multiple latent phenotypes, you must also set k.

• base (float) – Base to which the exponent is taken in computing the preferences.

• missing ({'average', 'site_average', 'error'}) – What to do when there is no estimate of the phenotype for one of the single mutants? Estimate the phenotype as the average of all single mutants, as the average of all single mutants at that site, or raise an error.

• exclude_chars (tuple or list) – Characters to exclude when calculating preferences (and when averaging values for missing mutants). For instance, you might want to exclude stop codons.

• returnformat ({'tidy', 'wide'}) – Return preferences in tidy or wide format data frame.

• stringency_param (float) – Re-scale preferences by this stringency parameter. This involves raising each preference to the power of stringency_param, and then re-normalizes. A similar effect can be achieved by changing base.

• k (int or None) – Which latent phenotype to use (1 <= k <= AbstractEpistasis.n_latent_phenotypes). If there is just one latent phenotype, can also be None. Has no meaning if phenotype is ‘observed’.

Returns

Data frame where first column is named ‘site’, other columns are named for each character, and rows give preferences for each site.

Return type

pandas.DataFrame

single_mut_effects(phenotype, *, include_wildtype=True, standardize_range=True, k=None)[source]

Effects of single mutations on latent or observed phenotype.

For the effects on observed phenotype, this is how much the mutation changes the observed phenotype relative to wildtype. Effects are reported only for mutations present in AbstractEpistasis.binarymap.

Parameters
• phenotype ({'latent', 'observed'}) – Get effect on this phenotype. If there are multiple latent phenotypes, you must also set k.

• include_wildtype (bool) – Include the effect of “mutating” to wildtype identity at a site (always zero).

• standardize_range (bool) – Scale effects so that the mean absolute value effect is one (scaling is done before including wildtype).

• k (int or None) – Which latent phenotype to use (1 <= k <= AbstractEpistasis.n_latent_phenotypes). If there is just one latent phenotype, can also be None. Has no meaning if phenotype is ‘observed’.

Returns

The effects of all single mutations. Columns are:

• ’mutation’: mutation as str

• ’wildtype’: wildtype identity at site

• ’site’: site number

• ’mutant’: mutant identity at site

• ’effect’: effect of mutation on latent or observed phenotype

Return type

pandas.DataFrame

class dms_variants.globalepistasis.BottleneckLikelihood(binarymap, bottleneck, *, n_latent_phenotypes=1, model_one_less_latent=None, pseudocount=0.5, base=2.0)[source]

Bottleneck likelihood calculation.

Note

Subclass of AbstractEpistasis that implements the Bottleneck likelihood.

Note

The AbstractEpistasis.binarymap must have non-None counts in its n_pre and n_post attributes, since as described in Bottleneck likelihood, the model is actually fit to the BottleneckLikelihood.f_pre and BottleneckLikelihood.f_post values calculated from these counts.

Parameters
• bottleneck (float) – The estimated size of the experimental bottleneck between the pre- and post-selection conditions. This is the $$N_{\rm{bottle}}$$ parameter described in Bottleneck likelihood.

• pseudocount (float) – The pseudocount used when converting the counts to frequencies vi Eq. (10).

• base (float) – The exponent base in Eq. (12) used when exponentiating the observed phenotypes $$p\left(v\right)$$. It is written as 2 in Eq. (12).

property bottleneck

Bottleneck pre- to post-selection, $$N_{\rm{bottle}]$$.

Type

int

property f_post

Post-selection frequency of each variant.

The $$f_v^{\rm{post}}$$ values in Eq. (10).

Type

numpy.ndarray

property f_pre

Pre-selection frequency of each variant.

The $$f_v^{\rm{pre}}$$ values in Eq. (10).

Type

numpy.ndarray

property loglik

Current log likelihood from Eq. (11).

Type

float

class dms_variants.globalepistasis.CauchyLikelihood(binarymap, *, n_latent_phenotypes=1, model_one_less_latent=None)[source]

Cauchy likelihood calculation.

Note

Subclass of AbstractEpistasis that implements the Cauchy likelihood.

property loglik

Current log likelihood from Eq. (9).

Type

float

exception dms_variants.globalepistasis.EpistasisFittingError[source]

Bases: Exception

Error fitting an epistasis model.

exception dms_variants.globalepistasis.EpistasisFittingWarning[source]

Bases: Warning

Warning when fitting epistasis model.

class dms_variants.globalepistasis.GaussianLikelihood(binarymap, *, n_latent_phenotypes=1, model_one_less_latent=None)[source]

Gaussian likelihood calculation.

Note

Subclass of AbstractEpistasis that implements the Gaussian likelihood.

property loglik

Current log likelihood from Eq. (7).

Type

float

class dms_variants.globalepistasis.MonotonicSplineEpistasis(binarymap, *, n_latent_phenotypes=1, model_one_less_latent=None, spline_order=3, meshpoints=4, **kwargs)[source]

Monotonic spline global epistasis model.

Parameters
• spline_order (int) – Order of the I-splines defining the global epistasis function.

• meshpoints (int) – Number of evenly spaced mesh points for the I-spline defining the global epistasis function.

alpha_ms(k=None)[source]

$$\alpha_m$$ in Eq. (4).

Parameters

k (int or None) – Which global epistasis function to get I-splines for (1 <= k <= AbstractEpistasis.n_latent_phenotypes). If there is just one latent phenotype, can also be None.

Returns

$$\alpha_m$$ values for global epistasis function k.

Return type

numpy.ndarray

c_alpha(k=None)[source]

$$c_{\alpha}$$ in Eq. (4).

Parameters

k (int or None) – Which global epistasis function to get I-splines for (1 <= k <= AbstractEpistasis.n_latent_phenotypes). If there is just one latent phenotype, can also be None.

Returns

$$c_{\alpha}$$ for global epistasis function k.

Return type

float

epistasis_func(latent_phenotype, k=None)[source]

Global epistasis function $$g$$ in Eq. (4).

Concrete implementation of AbstractEpistasis.epistasis_func().

class dms_variants.globalepistasis.MonotonicSplineEpistasisBottleneckLikelihood(binarymap, bottleneck, *, n_latent_phenotypes=1, model_one_less_latent=None, spline_order=3, meshpoints=4, pseudocount=0.5, base=2)[source]

Monotonic spline global epistasis model with bottleneck likelihood.

class dms_variants.globalepistasis.MonotonicSplineEpistasisCauchyLikelihood(binarymap, *, n_latent_phenotypes=1, model_one_less_latent=None, spline_order=3, meshpoints=4, **kwargs)[source]

Monotonic spline global epistasis model with Cauchy likelihood.

Note

This class implements the Monotonic spline epistasis function with a Cauchy likelihood. See documentation for the base classes MonotonicSplineEpistasis, CauchyLikelihood, and AbstractEpistasis for details.

class dms_variants.globalepistasis.MonotonicSplineEpistasisGaussianLikelihood(binarymap, *, n_latent_phenotypes=1, model_one_less_latent=None, spline_order=3, meshpoints=4, **kwargs)[source]

Monotonic spline global epistasis model with Gaussian likelihood.

class dms_variants.globalepistasis.NoEpistasis(binarymap, *, n_latent_phenotypes=1, model_one_less_latent=None)[source]

Non-epistatic model.

Note

Subclass of AbstractEpistasis that implements the No epistasis function.

epistasis_func(latent_phenotype, k=None)[source]

Global epistasis function $$g$$ in Eq. (3).

Concrete implementation of AbstractEpistasis.epistasis_func().

class dms_variants.globalepistasis.NoEpistasisBottleneckLikelihood(binarymap, bottleneck, *, n_latent_phenotypes=1, model_one_less_latent=None, pseudocount=0.5, base=2.0)[source]

No-epistasis model with bottleneck likelihood.

Note

This class implements the No epistasis function with a Bottleneck likelihood. See documentation for the base classes NoEpistasis, BottleneckLikelihood, and AbstractEpistasis for details.

class dms_variants.globalepistasis.NoEpistasisCauchyLikelihood(binarymap, *, n_latent_phenotypes=1, model_one_less_latent=None)[source]

No-epistasis model with Cauchy likelihood.

Note

This class implements the No epistasis function with a Cauchy likelihood. See documentation for the base classes NoEpistasis, CauchyLikelihood, and AbstractEpistasis for details.

class dms_variants.globalepistasis.NoEpistasisGaussianLikelihood(binarymap, *, n_latent_phenotypes=1, model_one_less_latent=None)[source]

No-epistasis model with Gaussian likelihood.

Note

This class implements the No epistasis function with a Gaussian likelihood. See documentation for the base classes NoEpistasis, GaussianLikelihood, and AbstractEpistasis for details.

dms_variants.globalepistasis.fit_models(binarymap, likelihood, *, bottleneck=None, max_latent_phenotypes=1)[source]

Fit and compare global epistasis models.

This function is useful when you want to examine the fit of several different models to the same data. It does the following:

1. Fits a non-epistatic model to the data.

2. Fits a global epistasis model with $$K = 1$$ latent phenotypes to the data. If the global epistasis model outperforms the no- epistasis model by AIC, proceed to next step. Otherwise stop.

3. Fit a global epistasis model with $$K = 2$$ latent phenotypes. If this model outperforms (by AIC) the model with $$K - 1$$ latent phenotypes, repeat for $$K = 3$$ etc until adding more latent phenotypes no longer improves fit. Note that it only does continues this process while $$K \le$$ max_latent_phenotypes, so set max_latent_phenotypes > 1 if you want to fit multiple latent phenotypes.

Note

All of the fitting is done with the same likelihood-calculation method because you can not compare models fit with different likelihood- calculation methods.

Parameters
• binarymap – Contains the variants, their functional scores, and score variances. The models are fit to these data.

• likelihood ({'Gaussian', 'Cauchy', 'Bottleneck'}) – Likelihood calculation method to use when fitting models. See Likelihood calculation.

• bottleneck (float or None) – Required if using ‘Bottleneck’ likelihood. In that case, is the experimentally estimated bottleneck between the pre- and post-selection conditions.

• max_latent_phenotypes (int) – Maximum number of latent phenotypes that are potentially be fit. See the $$K$$ parameter in Multiple latent phenotypes.

Returns

Summarizes the results of the model fitting and contains the fit models. Columns are:

• ’description’: description of model

• ’n_latent_phenotypes’: number of latent phenotypes in model

• ’AIC’: AIC

• ’nparams’: number of parameters

• ’log_likelihood’: log likelihood

• ’model’: the actual model (subclass of AbstractEpistasis)

• ’fitting_time’: time in seconds that it took to fit model

The data frame is sorted from best to worst model by AIC.

Return type

pandas.DataFrame