Details on fitting¶
Here we describe how the polyclonal package actually fits the models.
The basic idea is to minimize the difference between the predicted and measured variant-level escape probabilities, using the loss function and regularization terms described below.
Implementation¶
The fitting is implemented in the Polyclonal.fit function, which allows adjustment of the weights for the regularization. By default, the optimization uses the gradient-based L-BFGS-B method implemented in scipy.optimize.minimize and simply continues until the minimization converges.
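As a rough illustration of this kind of gradient-based optimization, the sketch below minimizes a toy least-squares loss with scipy.optimize.minimize using the L-BFGS-B method. The loss, the data, and the name `loss_and_grad` are hypothetical stand-ins chosen for illustration; this is not the loss or the calling code used inside Polyclonal.fit.

```python
import numpy as np
from scipy.optimize import minimize


def loss_and_grad(params, x, y):
    """Toy least-squares loss and its analytic gradient (not the polyclonal loss)."""
    slope, intercept = params
    resid = slope * x + intercept - y
    loss = np.sum(resid**2)
    grad = np.array([np.sum(2 * resid * x), np.sum(2 * resid)])
    return loss, grad


x = np.linspace(0.0, 1.0, 20)
y = 3.0 * x - 1.0  # noise-free toy data generated with slope 3 and intercept -1

# jac=True tells scipy that the function returns (loss, gradient) together
result = minimize(
    loss_and_grad,
    x0=np.zeros(2),
    args=(x, y),
    jac=True,
    method="L-BFGS-B",
)
```

Supplying the analytic gradient, rather than letting scipy approximate it numerically, is what makes this approach practical when a model has many parameters.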
Some key details about the fitting are as follows:
By default, the fitting first fits a “site-level” model in which all mutations at a site are lumped together so there are just two characters (wildtype and mutant). The escape values from this initial site-level fitting are then used to initialize the full mutation-level escape values, which are then further optimized. The idea is that first fitting a simpler model with fewer parameters helps get the parameters into a “reasonable” space before full model optimization. This option is implemented via the fit_site_level_first parameter to Polyclonal.fit, and it is recommended to use this approach as testing indicates it helps.
By default, if you are using free-parameter Hill coefficients or non-neutralized fractions, a model with those fixed is fit first via fit_fixed_first. It is fit with stronger regularization on the activities (via fit_fixed_first_reg_activity_weight) to keep epitopes from dropping too low in activity to be picked up in subsequent all-parameter optimization. When this model is being used, the site model is fit with this fixed model, not the later full model.
By default, multiple variants with the same mutations are treated as independent measurements that are each fit. This can be changed to “collapse” them to a single variant that is then given a weight proportional to the number of constituent variants, and that weight is used when calculating the loss. This option is set via collapse_identical_variants during the initialization of a Polyclonal object. It speeds up fitting without (usually) substantially changing the fitting results. However, do not collapse if you are using bootstrapping.
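The idea of collapsing identical variants can be sketched as follows. This is a minimal illustration only: the function name, the tuple layout, and the choice to average the escape probabilities of identical variants are assumptions made for the example, not a description of how the polyclonal package does it internally.

```python
from collections import defaultdict


def collapse_identical_variants(measurements):
    """Collapse rows sharing mutations and concentration into one weighted row.

    `measurements` is a list of (mutations, concentration, prob_escape) tuples.
    Returns a list of (mutations, concentration, mean_prob_escape, weight).
    """
    groups = defaultdict(list)
    for mutations, concentration, prob_escape in measurements:
        # Sort mutations so "M1A K2R" and "K2R M1A" count as the same variant.
        key = (tuple(sorted(mutations.split())), concentration)
        groups[key].append(prob_escape)
    return [
        (" ".join(muts), conc, sum(vals) / len(vals), len(vals))
        for (muts, conc), vals in groups.items()
    ]


data = [  # hypothetical toy measurements
    ("M1A", 1.0, 0.20),
    ("M1A", 1.0, 0.40),
    ("M1A K2R", 1.0, 0.90),
]
collapsed = collapse_identical_variants(data)
```

In this sketch the two `M1A` rows become a single row with weight 2, so a weighted loss would count that row twice while only evaluating the model once.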
The Polyclonal.fit function also allows you to adjust the weights on the regularizations. The defaults should be sensible, but you may want to try adjusting them. You can also adjust the
Loss function¶
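We use a scaled Pseudo-Huber loss function on the difference between the predicted and measured escape probabilities. Note that the Pseudo-Huber function is defined as $\hat{h}_\delta(x) = \delta^2\left(\sqrt{1 + (x/\delta)^2} - 1\right)$, where $\delta$ sets the scale at which the function transitions from quadratic to linear.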
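For intuition, here is a small sketch of the standard Pseudo-Huber function together with one possible scaling. Dividing by delta is an assumption about what “scaled” means here (it makes the slope approach one in the linear regime); the function names are hypothetical and not taken from the package.

```python
import math


def pseudo_huber(x, delta):
    """Standard Pseudo-Huber function: quadratic near zero, linear far from it."""
    return delta**2 * (math.sqrt(1.0 + (x / delta) ** 2) - 1.0)


def scaled_pseudo_huber(x, delta):
    """Pseudo-Huber divided by delta (assumed scaling), so the far-field slope is ~1."""
    return pseudo_huber(x, delta) / delta


delta = 0.1
small = scaled_pseudo_huber(0.001, delta)  # close to x**2 / (2 * delta)
large = scaled_pseudo_huber(10.0, delta)  # close to abs(x) - delta
```

The two evaluations show the two regimes: residuals much smaller than delta are penalized quadratically, while residuals much larger than delta are penalized roughly linearly, which limits the influence of outliers.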
Specifically, let
Regularization¶
We also regularize the mutation escape values (
Most mutations should not mediate escape.
When a site is involved in escape for a given epitope, most mutations at a site will have similar-ish effects.
Epitopes should be mostly unique: a site involved in escape should usually only mediate escape from a single epitope.
Epitopes should be relatively spatially compact (requires structural information).
Epitope activities should be small (or negative) except when there is clear evidence to the contrary.
Regularization of escape values¶
We regularize the escape values
where
Regularization of spread of escape values at each site and epitope¶
We regularize the variance of the escape values at each site, so that
where
Regularization of spatial spread of epitopes¶
To regularize the spatial spread of epitopes, we first define a differentiable measure of the average absolute value of escape at a site for an epitope
where
We then further assume that we have an experimental measure of
The regularization term is then:
Note how this term has weights enabling regularization on both the distances and squared distances. The factor of
Regularization of epitope uniqueness¶
To regularize to ensure epitopes contain largely unique sites, we define the following term which uses the differentiable average absolute value of escape at a site for an epitope
where
Regularization of epitope uniqueness ¶
This is a second regularization to ensure epitopes contain largely unique sites, but this one operates on the squared product of escape at a site, so it penalizes very large escape at the same site more strongly and weak shared constraint less strongly:
where
Regularization of epitope activities¶
We regularize the epitope activities
where
Regularization of Hill coefficients¶
We regularize the Hill coefficients
where
Regularization of non-neutralizable fraction¶
We regularize the non-neutralizable fraction
where
Gradients used for optimization¶
Here are the formulas used to calculate the gradients in the optimization.
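Analytic gradients of this kind are commonly sanity-checked against finite differences. The sketch below does this for a single term, assuming a scaled Pseudo-Huber function of the form $\delta\left(\sqrt{1 + (x/\delta)^2} - 1\right)$; that form and the function names are assumptions for illustration, and this is not the package's own test code.

```python
import math


def scaled_pseudo_huber(x, delta):
    """Assumed scaled Pseudo-Huber: delta * (sqrt(1 + (x / delta)**2) - 1)."""
    return delta * (math.sqrt(1.0 + (x / delta) ** 2) - 1.0)


def scaled_pseudo_huber_grad(x, delta):
    """Analytic derivative of the function above with respect to x."""
    return x / (delta * math.sqrt(1.0 + (x / delta) ** 2))


def central_difference(func, x, eps=1e-6):
    """Numerical derivative of func at x using a central finite difference."""
    return (func(x + eps) - func(x - eps)) / (2.0 * eps)


delta = 0.1
max_abs_error = max(
    abs(
        scaled_pseudo_huber_grad(x, delta)
        - central_difference(lambda z: scaled_pseudo_huber(z, delta), x)
    )
    for x in (-2.0, -0.05, 0.0, 0.01, 0.3, 5.0)
)
```

A small `max_abs_error` indicates that the analytic derivative agrees with the numerical one at the tested points.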
Gradient of loss function¶
For the loss function, the gradients are as follows:
See below for how the sub-components that lead to these were calculated.
Calculating ¶
We have
Calculating ¶
First, note
Next, note
where the last step uses the simplification here.
Finally, note
Putting it all together, we have:
Calculating ¶
The only difference from above is the sign, so:
Calculating ¶
Calculating ¶
Gradients of regularizations¶
Calculating ¶
Calculating ¶
Calculating ¶
Calculating ¶
where
Calculating ¶
Calculating ¶
Calculating ¶
Calculating ¶
Calculating ¶
Bootstrapping¶
For the bootstrapping implemented by PolyclonalBootstrap, we start with a single model pre-fit to all the data. We then draw bootstrap replicates of the data used to fit that model, by default sampling the same variants at each concentration (see the sample_by option of PolyclonalBootstrap). We then fit a model to each bootstrap replicate, starting from the parameter values of the pre-fit model. Finally, the fit parameters or predictions from the models are summarized.
Note that mutations may not be present in some bootstrap replicates if they are only in a few variants, and this can be assessed by looking at the frac_bootstrap_replicates column in the output from PolyclonalBootstrap.
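The resampling idea, and why rare mutations can drop out of some replicates, can be sketched as follows. The helper names and toy variants are hypothetical, and the sketch ignores concentrations entirely; it is not how PolyclonalBootstrap is implemented.

```python
import random


def bootstrap_variants(variants, n_replicates, seed=0):
    """Draw bootstrap replicates by sampling variants with replacement."""
    rng = random.Random(seed)
    return [
        [rng.choice(variants) for _ in variants] for _ in range(n_replicates)
    ]


def frac_replicates_with_mutation(replicates, mutation):
    """Fraction of replicates in which at least one variant carries `mutation`."""
    n_with = sum(
        any(mutation in variant.split() for variant in rep) for rep in replicates
    )
    return n_with / len(replicates)


variants = ["M1A", "M1A K2R", "K2R", "G3P", "M1A", "K2R"]  # toy data
replicates = bootstrap_variants(variants, n_replicates=200)
frac_g3p = frac_replicates_with_mutation(replicates, "G3P")
```

Because `G3P` occurs in only one of the six toy variants, a sizeable share of replicates will not contain it at all, which is the situation that the fraction of bootstrap replicates is meant to flag.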