polyclonal¶
Defines Polyclonal
objects for handling antibody mixtures.
- class polyclonal.polyclonal.Polyclonal(*, data_to_fit=None, activity_wt_df=None, mut_escape_df=None, hill_coefficient_df=None, non_neutralized_frac_df=None, site_escape_df=None, n_epitopes=None, spatial_distances=None, collapse_identical_variants=False, alphabet=('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'), sites=None, epitope_colors=('#0072B2', '#CC79A7', '#009E73', '#17BECF', '#BCDB22'), init_missing='zero', data_mut_escape_overlap='exact_match', check_concentration_scale=4)[source]¶
Bases:
object
Represent polyclonal antibody mixtures targeting multiple epitopes.
Note
At several concentrations \(c\) of an antibody mixture, we measure \(p_v\left(c\right)\), the probability that variant \(v\) is not bound (or neutralized). We assume antibodies act independently on one of \(E\) epitopes, so the probability \(U_e\left(v, c\right)\) that \(v\) is unbound at concentration \(c\) is related to the probability that epitope \(e\) is unbound by
(1)¶\[p_v\left(c\right) = \prod_{e=1}^E U_e\left(v, c\right).\]We furthermore assume that \(U_e\left(v, c\right)\) is related to the total binding activity \(\phi_e\left(v\right)\) of antibodies targeting epitope \(e\) on variant \(v\) by
(2)¶\[U_e\left(v,c\right)= \frac{1-t_e}{1+\left[c\exp\left(-\phi_e\left(v\right)\right)\right]^{n_e}}+t_e\]where smaller (more negative) values of \(\phi_e\left(v\right)\) correspond to higher overall binding activity against epitope \(e\) variant \(v\), \(n_e\) is the Hill coefficient, and \(t_e\) is the non-neutralized fraction.
We define \(\phi_e\left(v\right)\) in terms of the underlying quantities of biological interest as
(3)¶\[\phi_e\left(v\right) = -a_{\rm{wt}, e} + \sum_{m=1}^M \beta_{m,e} b\left(v\right)_m,\]where \(a_{\rm{wt}, e}\) is the activity of the serum against epitope \(e\) for the “wildtype” (unmutated) protein (larger values indicate higher activity against this epitope), \(\beta_{m,e}\) is the extent to which mutation \(m\) (where \(1 \le m \le M\)) escapes binding from antibodies targeting epitope \(e\) (larger values indicate more escape by this mutation), and \(b\left(v\right)_m\) is 1 if variant \(v\) has mutation \(m\) and 0 otherwise.
Note
You can initialize a
Polyclonal
object in three ways:With known epitope activities \(a_{\rm{wt}, e}\) and mutation-escape values \(\beta_{m,e}\), and no data. Use this approach if you already know these values and just want to visualize the polyclonal antibody mixture properties or predict escape of variants. To do this, initialize with
activity_wt_df
andmut_escape_df
storing the known values, anddata_to_fit=None
.With data to fit the epitope activities and mutation-escape values, and initial guesses for the epitope activities and mutation-escape values. To do this, initialize with
data_to_fit
holding the data,activity_wt_df
holding initial guesses of activities, andmut_escape_df
orsite_escape_df
holding initial guesses for mutation escapes (see alsoinit_missing
anddata_mut_escape_overlap
). Then callPolyclonal.fit()
.With data to fit the epitope activities and mutation-escape values, but no initial guesses. To do this, initialize with
data_to_fit
holding the data,activity_wt_df=None
,mut_escape_df=None
, andn_epitopes
holding the number of epitopes. Then callPolyclonal.fit()
.
For the first and second approaches, you can optionally set initial values for \(n_e\) (the Hill coefficient) and \(t_e\) (the non-neutralizable fraction) via
hill_coefficient_df
andnon_neutralized_frac_df
. If these are not set, we initialize \(n_e = 1\) and \(t_e = 0\).- Parameters:
data_to_fit (pandas.DataFrame or None) – Should have columns named ‘aa_substitutions’, ‘concentration’, and ‘prob_escape’. The ‘aa_substitutions’ column defines each variant \(v\) as a string of substitutions (e.g., ‘M3A K5G’). The ‘prob_escape’ column gives the \(p_v\left(c\right)\) value for each variant at each concentration \(c\).
activity_wt_df (pandas.DataFrame or None) – Should have columns named ‘epitope’ and ‘activity’, giving the names of the epitopes and the activity against epitope in the wildtype protein, \(a_{\rm{wt}, e}\).
mut_escape_df (pandas.DataFrame or None) – Should have columns named ‘mutation’, ‘epitope’, and ‘escape’ that give the \(\beta_{m,e}\) values (in the ‘escape’ column), with mutations written like “G7M”.
hill_coefficient_df (pandas.DataFrame or None) – Should have columns named ‘epitope’ and ‘hill_coefficient’ that sets initial \(n_e\), or set to None for \(n_e = 1\).
non_neutralized_frac_df (pandas.DataFrame or None) – Should have columns named ‘epitope’ and ‘non_neutralized_frac’ that sets initial \(t_e\), or set to None for \(t_e = 0\).
site_escape_df (pandas.DataFrame or None) – Use if you want to initialize all mutations at a given site to have the same \(\beta_{m,e}\) values. In this case, columns should be ‘site’, ‘epitope’, and ‘escape’. This option is mutually exclusive with
mut_escape_df
.n_epitopes (int or None) – If initializing with
activity_wt_df=None
, specifies number of epitopes.spatial_distances (pandas.DataFrame) – Gives inter-residue distances. Must have columns “site_1”, “site_2”, and “distance” where the distance is typically in angstroms. Distances may be missing for some pairs. Used for the spatial regularization.
collapse_identical_variants ({'mean', 'median', False}) – If identical variants in
data_to_fit
(same ‘aa_substitutions’), collapse them and make weight proportional to number of collapsed variants? Collapse by taking mean or median of ‘prob_escape’, or (if False) do not collapse at all. Collapsing will make fitting faster, but not a good idea if you are doing bootstrapping.alphabet (array-like) – Allowed characters in mutation strings.
sites (array-like or None) – By default, sites are assumed to be sequential integer values are and inferred from
data_to_fit
ormut_escape_df
. However, you can also have non-sequential integer sites, or sites with lower-case letter suffixes (eg, 214a) if your protein is numbered against a reference that it has indels relative to. In that case, provide list of all expected in order here; we require that order to be natsorted.epitope_colors (array-like or dict) – Maps each epitope to the color used for plotting. Either a dict keyed by each epitope, or an array of colors that are sequentially assigned to the epitopes.
init_missing ('zero' or int) – How to initialize activities or mutation-escape values not specified in
activity_wt_df
ormut_escape_df
/site_escape_df
. If ‘zero’, set mutation-escapes to zero and activities uniformly spaced from 1 to 0. Otherwise draw uniformly from between 0 and 1 using specified random number seed.data_mut_escape_overlap ({'exact_match', 'fill_to_data', 'prune_to_data'}) – If
data_to_fit
andmut_escape_df
(orsite_escape_df
) both specified, what if they don’t specify same mutations. If ‘exact_match’, raise error. If ‘fill_to_data’, then take sites / wildtypes / mutations fromdata_to_fit
and fill init values from any not specified inmut_escape_df
as indicated byinit_missing
. If ‘prune_to_data’, remove any extra mutations frommut_escape_df
that are not indata_to_fit
.check_concentration_scale (float or None) – The optimization may have problems if concentrations in
data_to_fit
are provided in units where the numbers are very large or very small. Ifcheck_concentration_scale
is not None, check concentrations are not so large or small that the magnitude of the log10 of any concentration exceeds this value.
- epitopes¶
Names of all epitopes.
- Type:
tuple
- mutations¶
All mutations for which we have escape values.
- Type:
tuple
- alphabet¶
Allowed characters in mutation strings.
- Type:
tuple
- sites¶
List of all sites. These are the sites provided via the
sites
parameter, or inferred fromdata_to_fit
ormut_escape_df
if that isn’t provided. If sequential_integer_sites is False, these are str, otherwise int.- Type:
tuple
- sequential_integer_sites¶
True if sites are sequential and integer, False otherwise.
- Type:
bool
- wts¶
Keyed by site, value is wildtype at that site.
- Type:
dict
- epitope_colors¶
Maps each epitope to its color.
- Type:
dict
- data_to_fit¶
Data to fit as passed when initializing this
Polyclonal
object. If usingcollapse_identical_variants
, then identical variants are collapsed on columns ‘concentration’, ‘aa_substitutions’, and ‘prob_escape’, and a column ‘weight’ is added to represent number of collapsed variants. Also, row-order may be changed.- Type:
pandas.DataFrame or None
- spatial_distances¶
Spatial distances passed when initializing this
Polyclonal
object.- Type:
pandas.DataFrame or None
- mutations_times_seen¶
If data_to_fit is not None, keyed by all mutations with escape values and values are number of variants in which the mutation is seen. It is formally calculated as the number of variants with mutation across all concentrations divided by the number of concentrations, so can have non-integer values if there are variants only observed at some concentrations.
- Type:
frozendict.frozendict or None
- distance_matrix¶
If
data_to_fit
is not None andspatial_distances
is not None, then this is a matrix giving the pairwise distances between residues. Pairs with no distance have values of pd.NA and diagonal values are 0.- Type:
pandas.DataFrame or None
Example
Simple example with two epitopes (e1 and e2) and a few mutations where we know the activities and mutation-level escape values ahead of time:
>>> activity_wt_df = pd.DataFrame({'epitope': ['e1', 'e2'], ... 'activity': [ 2.0, 1.0]}) >>> mut_escape_df = pd.DataFrame({ ... 'mutation': ['M1C', 'M1C', 'G2A', 'G2A', 'A4K', 'A4K', 'A4L', 'A4L'], ... 'epitope': [ 'e1', 'e2', 'e1', 'e2', 'e1', 'e2', 'e1', 'e2'], ... 'escape': [ 2.0, 0.0, 3.0, 0.0, 0.0, 2.5, 0.0, 1.5], ... }) >>> model = Polyclonal( ... activity_wt_df=activity_wt_df, ... mut_escape_df=mut_escape_df, ... collapse_identical_variants="mean", ... ) >>> model.epitopes ('e1', 'e2') >>> model.mutations ('M1C', 'G2A', 'A4K', 'A4L') >>> model.mutations_times_seen is None True >>> model.sites (1, 2, 4) >>> model.wts {1: 'M', 2: 'G', 4: 'A'} >>> model.curve_specs_df epitope activity hill_coefficient non_neutralized_frac 0 e1 2.0 1.0 0.0 1 e2 1.0 1.0 0.0 >>> model.mut_escape_df epitope site wildtype mutant mutation escape 0 e1 1 M C M1C 2.0 1 e1 2 G A G2A 3.0 2 e1 4 A K A4K 0.0 3 e1 4 A L A4L 0.0 4 e2 1 M C M1C 0.0 5 e2 2 G A G2A 0.0 6 e2 4 A K A4K 2.5 7 e2 4 A L A4L 1.5
We can also summarize the mutation-level escape at the site level:
>>> pd.set_option("display.max_columns", None) >>> pd.set_option("display.width", 89) >>> model.mut_escape_site_summary_df() epitope site wildtype mean total positive max min total negative n mutations 0 e1 1 M 2.0 2.0 2.0 2.0 0.0 1 1 e1 2 G 3.0 3.0 3.0 3.0 0.0 1 2 e1 4 A 0.0 0.0 0.0 0.0 0.0 2 3 e2 1 M 0.0 0.0 0.0 0.0 0.0 1 4 e2 2 G 0.0 0.0 0.0 0.0 0.0 1 5 e2 4 A 2.0 4.0 2.5 1.5 0.0 2
Note that we can not initialize a
Polyclonal
object if we are missing escape estimates for any mutations for any epitopes:>>> Polyclonal(activity_wt_df=activity_wt_df, ... mut_escape_df=mut_escape_df.head(n=5)) Traceback (most recent call last): ... ValueError: invalid set of mutations for epitope='e2'
Now make a data frame with some variants:
>>> variants_df = pd.DataFrame.from_records( ... [('AA', ''), ... ('AC', 'M1C'), ... ('AG', 'G2A'), ... ('AT', 'A4K'), ... ('TA', 'A4L'), ... ('CA', 'M1C G2A'), ... ('CG', 'M1C A4K'), ... ('CC', 'G2A A4K'), ... ('TC', 'G2A A4L'), ... ('CT', 'M1C G2A A4K'), ... ('TG', 'M1C G2A A4L'), ... ('GA', 'M1C'), ... ], ... columns=['barcode', 'aa_substitutions'])
Get the escape probabilities predicted on these variants from the values in the
Polyclonal
object:>>> escape_probs = model.prob_escape(variants_df=variants_df, ... concentrations=[1.0, 2.0, 4.0]) >>> escape_probs.round(3) barcode aa_substitutions concentration predicted_prob_escape 0 AA 1.0 0.032 1 AT A4K 1.0 0.097 2 TA A4L 1.0 0.074 3 AG G2A 1.0 0.197 4 CC G2A A4K 1.0 0.598 5 TC G2A A4L 1.0 0.455 6 AC M1C 1.0 0.134 7 GA M1C 1.0 0.134 8 CG M1C A4K 1.0 0.409 9 CA M1C G2A 1.0 0.256 10 CT M1C G2A A4K 1.0 0.779 11 TG M1C G2A A4L 1.0 0.593 12 AA 2.0 0.010 13 AT A4K 2.0 0.044 14 TA A4L 2.0 0.029 15 AG G2A 2.0 0.090 16 CC G2A A4K 2.0 0.398 17 TC G2A A4L 2.0 0.260 18 AC M1C 2.0 0.052 19 GA M1C 2.0 0.052 20 CG M1C A4K 2.0 0.230 21 CA M1C G2A 2.0 0.141 22 CT M1C G2A A4K 2.0 0.629 23 TG M1C G2A A4L 2.0 0.411 24 AA 4.0 0.003 25 AT A4K 4.0 0.017 26 TA A4L 4.0 0.010 27 AG G2A 4.0 0.034 28 CC G2A A4K 4.0 0.214 29 TC G2A A4L 4.0 0.118 30 AC M1C 4.0 0.017 31 GA M1C 4.0 0.017 32 CG M1C A4K 4.0 0.106 33 CA M1C G2A 4.0 0.070 34 CT M1C G2A A4K 4.0 0.441 35 TG M1C G2A A4L 4.0 0.243
We can also get predicted escape probabilities by including concentrations in the data frame passed to
Polyclonal.prob_escape()
:>>> model.prob_escape( ... variants_df=pd.concat([variants_df.assign(concentration=c) ... for c in [1.0, 2.0, 4.0]]) ... ).equals(escape_probs) True
We can also compute the IC50s:
>>> model.icXX(variants_df).round(3) barcode aa_substitutions IC50 0 AA 0.085 1 AC M1C 0.230 2 AG G2A 0.296 3 AT A4K 0.128 4 TA A4L 0.117 5 CA M1C G2A 0.355 6 CG M1C A4K 0.722 7 CC G2A A4K 1.414 8 TC G2A A4L 0.858 9 CT M1C G2A A4K 3.237 10 TG M1C G2A A4L 1.430 11 GA M1C 0.230
Or the IC90s:
>>> model.icXX(variants_df, x=0.9, col='IC90').round(3) barcode aa_substitutions IC90 0 AA 0.464 1 AC M1C 1.260 2 AG G2A 1.831 3 AT A4K 0.976 4 TA A4L 0.782 5 CA M1C G2A 2.853 6 CG M1C A4K 4.176 7 CC G2A A4K 7.473 8 TC G2A A4L 4.532 9 CT M1C G2A A4K 18.717 10 TG M1C G2A A4L 9.532 11 GA M1C 1.260
Or the fold change IC90s of all mutations: >>> model.mut_icXX_df( … x=0.9, … icXX_col=”IC90”, … log_fold_change_icXX_col=”log2_fold_change_IC90”, … ).round(2)
site wildtype mutant IC90 log2_fold_change_IC90
0 1 M C 1.26 1.44 1 1 M M 0.46 0.00 2 2 G A 1.83 1.98 3 2 G G 0.46 0.00 4 4 A A 0.46 0.00 5 4 A K 0.98 1.07 6 4 A L 0.78 0.76
Example
Initialize with
escape_probs
created above as data to fit. In order to do this, we need to change the name of the column with the predicted escape probs to just be escape probs as we are now assuming these are the real values:>>> data_to_fit = ( ... escape_probs ... .rename(columns={'predicted_prob_escape': 'prob_escape'}) ... )
>>> model_data = Polyclonal( ... data_to_fit=data_to_fit, ... n_epitopes=2, ... collapse_identical_variants="mean", ... )
The mutations are those in
data_to_fit
:>>> model_data.mutations ('M1C', 'G2A', 'A4K', 'A4L') >>> {key: int(val) for (key, val) in sorted(model_data.mutations_times_seen.items())} {'A4K': 4, 'A4L': 3, 'G2A': 6, 'M1C': 6}
The activities are evenly spaced from 1 to 0, while the mutation escapes are all initialized to zero:
>>> model_data.curve_specs_df epitope activity hill_coefficient non_neutralized_frac 0 1 1.0 1.0 0.0 1 2 0.0 1.0 0.0 >>> model_data.mut_escape_df epitope site wildtype mutant mutation escape times_seen 0 1 1 M C M1C 0.0 6 1 1 2 G A G2A 0.0 6 2 1 4 A K A4K 0.0 4 3 1 4 A L A4L 0.0 3 4 2 1 M C M1C 0.0 6 5 2 2 G A G2A 0.0 6 6 2 4 A K A4K 0.0 4 7 2 4 A L A4L 0.0 3
You can initialize to random numbers by setting
init_missing
to seed (in this example we also don’t include all variants for one concentration):>>> model_data2 = Polyclonal( ... data_to_fit=data_to_fit.head(30), ... n_epitopes=2, ... init_missing=1, ... collapse_identical_variants="mean", ... ) >>> model_data2.curve_specs_df.round(3) epitope activity hill_coefficient non_neutralized_frac 0 1 0.417 1.0 0.0 1 2 0.720 1.0 0.0
You can set some or all mutation escapes to initial values:
>>> model_data3 = Polyclonal( ... data_to_fit=data_to_fit, ... activity_wt_df=activity_wt_df, ... mut_escape_df=pd.DataFrame({'epitope': ['e1'], ... 'mutation': ['M1C'], ... 'escape': [4]}), ... data_mut_escape_overlap='fill_to_data', ... collapse_identical_variants="mean", ... ) >>> model_data3.mut_escape_df epitope site wildtype mutant mutation escape times_seen 0 e1 1 M C M1C 4.0 6 1 e1 2 G A G2A 0.0 6 2 e1 4 A K A4K 0.0 4 3 e1 4 A L A4L 0.0 3 4 e2 1 M C M1C 0.0 6 5 e2 2 G A G2A 0.0 6 6 e2 4 A K A4K 0.0 4 7 e2 4 A L A4L 0.0 3
You can initialize sites to escape values via
site_activity_df
:>>> model_data4 = Polyclonal( ... data_to_fit=data_to_fit, ... activity_wt_df=activity_wt_df, ... site_escape_df=pd.DataFrame.from_records( ... [('e1', 1, 1.0), ('e1', 4, 0.0), ... ('e2', 1, 0.0), ('e2', 4, 2.0)], ... columns=['epitope', 'site', 'escape'], ... ), ... data_mut_escape_overlap='fill_to_data', ... collapse_identical_variants="mean", ... ) >>> model_data4.mut_escape_df epitope site wildtype mutant mutation escape times_seen 0 e1 1 M C M1C 1.0 6 1 e1 2 G A G2A 0.0 6 2 e1 4 A K A4K 0.0 4 3 e1 4 A L A4L 0.0 3 4 e2 1 M C M1C 0.0 6 5 e2 2 G A G2A 0.0 6 6 e2 4 A K A4K 2.0 4 7 e2 4 A L A4L 2.0 3
Fit the data using
Polyclonal.fit()
, and make sure the new predicted escape probabilities are close to the real ones being fit. Reduce weight on regularization since there is so little data in this toy example:>>> for m in [model_data, model_data2, model_data3, model_data4]: ... opt_res = m.fit( ... reg_escape_weight=0.001, ... reg_spread_weight=0.001, ... reg_activity_weight=0.0001, ... reg_uniqueness2_weight=0, ... fix_hill_coefficient=True, ... fix_non_neutralized_frac=True, ... ) ... pred_df = m.prob_escape(variants_df=data_to_fit) ... if not numpy.allclose(pred_df['prob_escape'], ... pred_df['predicted_prob_escape'], ... atol=0.01): ... raise ValueError(f"wrong predictions\n{pred_df}") ... if not numpy.allclose( ... activity_wt_df['activity'].sort_values(), ... m.activity_wt_df['activity'].sort_values(), ... atol=0.1, ... ): ... raise ValueError(f"wrong activities\n{m.activity_wt_df}") ... if not numpy.allclose( ... mut_escape_df['escape'].sort_values(), ... m.mut_escape_df['escape'].sort_values(), ... atol=0.05, ... ): ... raise ValueError(f"wrong escapes\n{m.mut_escape_df}")
The map on next line is to make doctest pass even with negative zeros
>>> ( ... model_data.mut_escape_site_summary_df() ... .round(1).map(lambda x: 0.0 if x == 0 else x) ... ) epitope site wildtype mean total positive max min total negative n mutations 0 1 1 M 0.0 0.0 0.0 0.0 0.0 1 1 1 2 G 0.0 0.0 0.0 0.0 0.0 1 2 1 4 A 2.0 4.0 2.5 1.5 0.0 2 3 2 1 M 2.0 2.0 2.0 2.0 0.0 1 4 2 2 G 3.0 3.0 3.0 3.0 0.0 1 5 2 4 A 0.0 0.0 0.0 0.0 0.0 2
You can also exclude mutations to specific characters (typically you would want to do this for stop codons and/or gaps):
>>> ( ... model_data.mut_escape_site_summary_df(exclude_chars={"C", "K"}) ... .round(1).map(lambda x: 0.0 if x == 0 else x) ... ) epitope site wildtype mean total positive max min total negative n mutations 0 1 2 G 0.0 0.0 0.0 0.0 0.0 1 1 1 4 A 1.5 1.5 1.5 1.5 0.0 1 2 2 2 G 3.0 3.0 3.0 3.0 0.0 1 3 2 4 A 0.0 0.0 0.0 0.0 0.0 1
Example
You can convert a
Polyclonal
model into a site-level model via the transformation ofpolyclonal.utils.site_level_variants()
. The site-level model is anotherPolyclonal
model that just keeps track of whether or not sites are mutated using a 2-letter wildtype/mutant alphabet, and is generated usingPolyclonal.site_level_model()
:>>> model_site = model_data4.site_level_model() >>> model_site.alphabet ('w', 'm') >>> (model_site.mut_escape_df ... .assign(escape=lambda x: x['escape'].abs()).round(1)) epitope site wildtype mutant mutation escape times_seen 0 e1 1 w m w1m 2.0 5 1 e1 2 w m w2m 3.0 6 2 e1 4 w m w4m 0.0 7 3 e2 1 w m w1m 0.0 5 4 e2 2 w m w2m 0.0 6 5 e2 4 w m w4m 2.0 7 >>> model_site.data_to_fit.head(n=5).round(3) concentration aa_substitutions weight prob_escape 0 1.0 1 0.032 1 1.0 w1m 1 0.134 2 1.0 w1m w2m 1 0.256 3 1.0 w1m w2m w4m 2 0.686 4 1.0 w1m w4m 1 0.409
Example
Epitope assignments are arbitrary, so you can calculate correlations among their mutation-escape values and create harmonized models that use the same label to refer to epitopes with similar mutation-escape values.
First, correlations of a model to itself:
>>> ref_model = copy.deepcopy(model) >>> model.mut_escape_corr(ref_model).round(3) ref_epitope self_epitope correlation 0 e1 e1 1.000 1 e1 e2 -0.907 2 e2 e1 -0.907 3 e2 e2 1.000
Make another model with epitope assignments inverted and a few mutations missing:
>>> inverted_model = Polyclonal( ... activity_wt_df=( ... activity_wt_df ... .assign(epitope=lambda x: x["epitope"].map({"e1": "e2", "e2": "e1"})) ... ), ... mut_escape_df=( ... mut_escape_df ... .query("mutation != 'A4K'") ... .assign(epitope=lambda x: x["epitope"].map({"e1": "e2", "e2": "e1"})) ... ), ... ) >>> inverted_model.mut_escape_corr(ref_model).round(3) ref_epitope self_epitope correlation 0 e1 e1 -0.945 1 e1 e2 1.000 2 e2 e1 1.000 3 e2 e2 -0.945
Now actually get epitope-harmonized models:
>>> model_harmonized, harmonize_df = model.epitope_harmonized_model(ref_model) >>> harmonize_df self_initial_epitope self_harmonized_epitope ref_epitope correlation 0 e1 e1 e1 1.0 1 e2 e2 e2 1.0 >>> assert model.mut_escape_df.equals(model_harmonized.mut_escape_df)
>>> inverted_harmonized, harmonize_df = inverted_model.epitope_harmonized_model( ... ref_model ... ) >>> harmonize_df self_initial_epitope self_harmonized_epitope ref_epitope correlation 0 e1 e2 e2 1.0 1 e2 e1 e1 1.0 >>> inverted_harmonized.mut_escape_df epitope site wildtype mutant mutation escape 0 e1 1 M C M1C 2.0 1 e1 2 G A G2A 3.0 2 e1 4 A L A4L 0.0 3 e2 1 M C M1C 0.0 4 e2 2 G A G2A 0.0 5 e2 4 A L A4L 1.5
Example
Filter variants by how often they are seen in data:
>>> model_data.filter_variants_by_seen_muts(variants_df) ... barcode aa_substitutions 0 AA 1 AC M1C 2 AG G2A 3 AT A4K 4 TA A4L 5 CA M1C G2A 6 CG M1C A4K 7 CC G2A A4K 8 TC G2A A4L 9 CT M1C G2A A4K 10 TG M1C G2A A4L 11 GA M1C
>>> model_data.filter_variants_by_seen_muts(variants_df, min_times_seen=5) ... barcode aa_substitutions 0 AA 1 AC M1C 2 AG G2A 3 CA M1C G2A 4 GA M1C
>>> model_data.filter_variants_by_seen_muts(variants_df, min_times_seen=4) ... barcode aa_substitutions 0 AA 1 AC M1C 2 AG G2A 3 AT A4K 4 CA M1C G2A 5 CG M1C A4K 6 CC G2A A4K 7 CT M1C G2A A4K 8 GA M1C
Example
Add spatial distances:
>>> spatial_distances = pd.DataFrame( ... [(1, 2, 1.5), (1, 4, 7.5)], ... columns=["site_1", "site_2", "distance"], ... ) >>> model_spatial = Polyclonal( ... data_to_fit=data_to_fit, ... n_epitopes=2, ... spatial_distances=spatial_distances, ... collapse_identical_variants="mean", ... ) >>> model_spatial.distance_matrix 1 2 4 1 0.0 1.5 7.5 2 1.5 0.0 <NA> 4 7.5 <NA> 0.0
Fit with lower regularization as data so small
>>> opt_res = model_spatial.fit( ... reg_escape_weight=0.001, ... reg_spread_weight=0.001, ... reg_activity_weight=0.0001, ... reg_spatial_weight=0.001, ... reg_spatial2_weight=0.0001, ... reg_uniqueness2_weight=0, ... ) >>> pred_df = model_spatial.prob_escape(variants_df=data_to_fit) >>> if not numpy.allclose( ... pred_df['prob_escape'], ... pred_df['predicted_prob_escape'], ... atol=0.01, ... ): ... raise ValueError(f"wrong predictions\n{pred_df}") >>> if not numpy.allclose( ... activity_wt_df['activity'].sort_values(), ... m.activity_wt_df['activity'].sort_values(), ... atol=0.1, ... ): ... raise ValueError(f"wrong activities\n{m.activity_wt_df}") >>> if not numpy.allclose( ... mut_escape_df['escape'].sort_values(), ... m.mut_escape_df['escape'].sort_values(), ... atol=0.05, ... ): ... raise ValueError(f"wrong escapes\n{m.mut_escape_df}")
- DEFAULT_SCIPY_MINIMIZE_KWARGS = {'method': 'L-BFGS-B', 'options': {'ftol': 1e-07, 'maxfun': 10000000.0, 'maxiter': 1000000.0}}¶
default
scipy_minimize_kwargs
tofit
.- Type:
frozendict.frozendict
- activity_wt_barplot(**kwargs)[source]¶
Bar plot of activity against each epitope, \(a_{\rm{wt},e}\).
Note
Consider
Polyclonal.curves_plot()
as better way to show similar data.- Parameters:
**kwargs – Keyword args for
polyclonal.plot.activity_wt_barplot()
.- Returns:
Interactive plot.
- Return type:
altair.Chart
- property activity_wt_df¶
Activities \(a_{\rm{wt,e}}\) for epitopes.
- Type:
pandas.DataFrame
- property curve_specs_df¶
activities, Hill coefficients, and non-neutralized fracs.
- Type:
pandas.DataFrame
- curves_plot(**kwargs)[source]¶
Plot neutralization / binding curve for unmutated protein at each epitope.
This curve effectively illustrates the epitope activity, Hill curve coefficient, and non-neutralizable fraction.
- Parameters:
**kwargs – Keyword args for
polyclonal.plot.curves_plot()
.- Returns:
Interactive plot
- Return type:
altair.Chart
- epitope_harmonized_model(ref_poly)[source]¶
Get a copy of model with epitopes “harmonized” with a reference model.
Epitopes are unidentifiable, meaning there is no guarantee that we will assign the same epitope labels across multiple models fit to similar data. This function returns a copy of the current model where the epitope labels are harmonized with those of another model with the same epitope labels. The harmonization is done by putting into correspondence the epitopes with the best correlated mutation-escape values.
- Parameters:
ref_poly (
Polyclonal
) – The reference polyclonal object to harmonize epitope labels with.- Returns:
harmonized_model, harmonize_df – harmonized_model is a
Polyclonal
object that is a copy of the current model but with epitopes harmonized, andPolyclonal.epitope_colors
taken from ref_poly. harmonize_df is a pandas.DataFrame that shows how the epitopes were re-assigned and the correlations in their escape values.- Return type:
tuple
- Raises:
PolyclonalHarmonizeError – Raise this error if epitopes cannot be harmonized one-to-one.
- filter_variants_by_seen_muts(variants_df, min_times_seen=1, subs_col='aa_substitutions')[source]¶
Remove variants that contain mutations not seen during model fitting.
- Parameters:
variants_df (pandas.DataFrame) – Contains variants as rows.
min_times_seen (int) – Require mutations to be seen >= this many times in data used to fit model.
subs_col (str) – Column in variants_df with mutations in each variant.
- Returns:
variants_df – Copy of input dataframe, with rows of variants that have unseen mutations removed.
- Return type:
pandas.DataFrame
- fit(*, loss_delta=0.1, reg_escape_weight=0.05, reg_escape_delta=0.1, reg_spread_weight=0.1, site_avg_abs_escape_epsilon=0.1, reg_spatial_weight=0.0, reg_spatial2_weight=0.0001, reg_uniqueness_weight=0, reg_uniqueness2_weight=0.1, reg_activity_weight=2.0, reg_activity_delta=0.1, reg_hill_coefficient_weight=25.0, reg_non_neutralized_frac_weight=1000.0, fit_site_level_first=True, scipy_minimize_kwargs={'method': 'L-BFGS-B', 'options': {'ftol': 1e-07, 'maxfun': 10000000.0, 'maxiter': 1000000.0}}, log=None, logfreq=None, fix_hill_coefficient=False, fix_non_neutralized_frac=False, activity_bounds=(None, None), hill_coefficient_bounds=(1e-08, None), non_neutralized_frac_bounds=(0, 0.5), beta_bounds=(None, None), fit_fixed_first=True, fit_fixed_first_reg_activity_weight=10.0, log_desc='')[source]¶
Fit parameters (activities and mutation escapes) to the data.
Requires
Polyclonal.data_to_fit
be set at initialization of thisPolyclonal
object. After calling this method, the \(a_{\rm{wt},e}\) and \(\beta_{m,e}\) have been optimized, and can be accessed using other methods of thePolyclonal
object.- Parameters:
loss_delta (float) – Pseudo-Huber \(\delta\) parameter for loss on \(p_v\left(c\right)\) fitting.
reg_escape_weight (float) – Strength of Pseudo-Huber regularization on \(\beta_{m,e}\).
reg_escape_delta (float) – Pseudo-Huber \(\delta\) for regularizing \(\beta_{m,e}\).
reg_spread_weight (float) – Strength of regularization on variance of \(\beta_{m,e}\) values at each site.
site_avg_abs_escape_epsilon (float) – The epsilon value used when computing a differentiable measure of the average absolute value of escape at a site for each epitope.
reg_spatial_weight (float) – Strength of regularization of spatial distance between \(\beta_{m,e}\) values at each site. Only meaningful if
Polyclonal.distance_matrix
is not None.reg_spatial2_weight (float) – Strength of regularization of squared spatial distance between \(\beta_{m,e}\) values at each site. Only meaningful if
Polyclonal.distance_matrix
is not None.reg_uniqueness_weight (float) – Strength of regularization on epitope uniqueness.
reg_uniqueness2_weight (float) – Strength of regularization on uniqueness of squared escape. values at each site across epitopes.
site_avg_abs_escape_epsilon – Epsilon value for caclulating site average absolute escape.
reg_activity_weight (float) – Strength of Pseudo-Huber regularization on \(a_{\rm{wt},e}\).
reg_activity_delta (float) – Pseudo-Huber \(\delta\) for regularizing \(a_{\rm{wt},e}\).
reg_hill_coefficient_weight (float) – Weight of regularization on Hill coefficients \(n_e\) that differ from 1.
reg_non_neutralized_frac_weight (float) – Weight of regularization on non-neutralized fractions \(t_e\) that differ from 0.
fit_site_level_first (bool) – First fit a site-level model, then use those activities / escapes to initialize fit of this model. Generally works better.
scipy_minimize_kwargs (dict) – Keyword arguments passed to
scipy.optimize.minimize
.log (None or writable file-like object) – Where to log output. If
None
, usesys.stdout
.logfreq (None or int) – How frequently to write updates on fitting to
log
.fix_hill_coefficient (bool) – Do not optimize the hill coefficients \(n_e\) and instead keep fixed at current values.
fix_non_neutralized_frac (bool) – Do not optimize the non-neutralized fractions \(t_e\) and instead keep fixed at current values.
activity_bounds (2-tuple) – When optimizing activities, keep in these bounds. Set to (None, None) for no bounds.
hill_coefficient_bounds (2-tuple) – When optimizing Hill coefficient, keep in these bounds.
non_neutralized_frac_bounds (2-tuple) – When optimizing non-neutralized fraction, keep in these bounds.
beta_bounds (2-tuple) – When optimizing escape values (\(\beta_m\)), keep in these bounds.
fit_fixed_first (bool) – In fitting, if either the Hill coefficient or non-neutralized fraction are free (either fix_non_neutralized_frac or fix_hill_coefficient is False) then first fit a model with both of these fixed. After fitting that model, use its values to then fit with these free. The site model (if being fit, see fit_site_level_first) is then only fit in the initial fitting with the Hill coefficient and non-neutralized fraction fixed.
fit_fixed_first_reg_activity_weight (float) – The activity regularization if first fitting a fixed model via fit_fixed_first. It can be helpful to have this > reg_activity_weight as it avoids epitopes getting activities so negative that they are lost in later fitting.
log_desc (str) – A description included on the logging string header.
- Return type:
scipy.optimize.OptimizeResult
- property hill_coefficient_df¶
Hill coefficients \(n_e\) for epitopes.
- Type:
pandas.DataFrame
- icXX(variants_df, *, x=0.5, col='IC50', min_c=1e-08, max_c=100000000.0)[source]¶
Concentration at which a given fraction is neutralized (eg, IC50).
- Parameters:
variants_df (pandas.DataFrame) – Data frame defining variants. Should have column named ‘aa_substitutions’ that defines variants as space-delimited strings of substitutions (e.g., ‘M1A K3T’).
x (float) – Compute concentration at which this fraction is neutralized for each variant. So set to 0.5 for IC50, and 0.9 for IC90.
col (str) – Name of column in returned data frame with the ICXX value.
min_c (float) – Minimum allowed icXX, truncate values < this at this.
max_c (float) – Maximum allowed icXX, truncate values > this at this.
- Returns:
Copy of
variants_df
with added columncol
containing icXX.- Return type:
pandas.DataFrame
- mut_escape_corr(ref_poly)[source]¶
Correlation of mutation-escape values with another model.
For each epitope, how well is this model’s mutation-escape values correlation with another model?
Mutations present in only one model are ignored.
- Parameters:
ref_poly (
Polyclonal
) – Other (reference) polyclonal model with which we calculate correlations.- Returns:
corr_df – Pairwise epitope correlations for escape.
- Return type:
pandas.DataFrame
- property mut_escape_df¶
Escape \(\beta_{m,e}\) for each mutation.
- Type:
pandas.DataFrame
- mut_escape_pdb_b_factor(*, input_pdbfile, chains, metric, outdir=None, outfile='{metric}-{epitope}.pdb', missing_metric=0, min_times_seen=1)[source]¶
Create PDB files with B factors from a site’s mutation escape.
- Parameters:
input_pdbfile (str) – Path to input PDB file.
chains (str or array-like) – Single chain or list of them to re-color.
metric (str) – Which site-level summary metric to use. Can be any metric in
Polyclonal.mut_escape_site_summary_df()
.outdir (str) – Output directory for created PDB files.
outfile (str) – Output file name, with formatting used to replace metric and epitope in curly brackets.
missing_metric (float or dict) – How do we handle sites in PDB that are missing in escape metric? If a float, reassign B factors for all missing sites to this value. If a dict, should be keyed by chain and assign all missing sites in each chain to indicated value.
min_times_seen (int) – Value passed to
Polyclonal.mut_escape_site_summary_df()
.
- Returns:
Gives name of created B-factor re-colored PDB for each epitope.
- Return type:
pandas.DataFrame
- mut_escape_plot(*, biochem_order_aas=True, prefix_epitope=None, df_to_merge=None, **kwargs)[source]¶
Make plot of the mutation escape values, \(\beta_{m,e}\).
- Parameters:
biochem_order_aas (bool) – Biochemically order the amino-acid alphabet in
Polyclonal.alphabet
by passing it throughpolyclonal.alphabets.biochem_order_aas()
.prefix_epitope (bool or None) – Do we add the prefix “epitope “ to the epitope labels? If None, do only if epitope is integer.
df_to_merge (None or pandas.DataFrame or list) – To include additional properties, specify data frame or list of them which are merged with
Polyclonal.mut_escape_df
before being passed topolyclonal.plot.lineplot_and_heatmap()
. Properties will only be included in plot if relevant columns are passed topolyclonal.plot.lineplot_and_heatmap()
via addtl_slider_stats, addtl_tooltip_stats, or site_zoom_bar_color_col.**kwargs – Keyword args for
polyclonal.plot.lineplot_and_heatmap()
.
- Returns:
Interactive line plot and heatmap.
- Return type:
altair.Chart
- mut_escape_site_summary_df(*, min_times_seen=1, mutation_whitelist=None, exclude_chars=frozenset({'*'}))[source]¶
Site-level summaries of mutation escape.
- Parameters:
min_times_seen (int) – Only include in summaries mutations seen in at least this many variants.
mutation_whitelist (None or set) – Only include in summaries these mutations.
exclude_chars (set or list) – Exclude mutations to these characters when calculating site summaries. Useful if you want to ignore stop codons (
*
), and perhaps in some cases also gaps (-
).
- Return type:
pandas.DataFrame
- mut_icXX_df(*, x, icXX_col, log_fold_change_icXX_col, min_c=1e-08, max_c=100000000.0, logbase=2, check_wt_icXX=(1e-05, 100000.0))[source]¶
Get data frame of ICXX and log fold change induced by each mutation.
- Parameters:
x (float) – Compute this ICXX, so 0.5 means IC50 aand 0.9 means IC90.
icXX_col (str) – Name of column with ICXX values.
log_fold_change_icXX_col (str) – Name of column with log fold change ICXX values.
min_c (float) – Minimum allowed icXX, truncate values < this at this.
max_c (float) – Maximum allowed icXX, truncate values > this at this.
logbase (float) – Compute log fold-change in ICXX to this base.
check_wt_icXX (None or 2-tuple) – If a 2-tuple, raise an error if the ICXX for the unmutated wildtype is outside the range (min_icXX, max_icXX). This check is because the clipping imposed by min_c and max_c will become a problem if the wildtype ICXX is very different than one. In general, concentration units should be used when fitting so that ICXX for wildtype is within a orders of magnitude of 1.
- Returns:
Has columns “site”, “wildtype”, “mutant”, icXX_col, log_fold_change_icXX_col, and “times_seen” (if available). Note that the dataframe does include entry for the wildtype (wildtype same as mutant) at each site with a non-wildtype mutation.
- Return type:
pandas.DataFrame
- mut_icXX_plot(*, x=0.9, icXX_col='IC90', log_fold_change_icXX_col='log2 fold change IC90', min_c=1e-08, max_c=100000000.0, logbase=2, check_wt_icXX=(1e-05, 100000.0), biochem_order_aas=True, df_to_merge=None, positive_color='#0072B2', negative_color='#E69F00', **kwargs)[source]¶
Make plot of log fold-change in ICXX induced by each mutation.
- Parameters:
x (float) – Same meaning as for
Polyclonal.mut_icXX_df()
.icXX_col (str) – Same meaning as for
Polyclonal.mut_icXX_df()
.log_fold_change_icXX_col (str) – Same meaning as for
Polyclonal.mut_icXX_df()
.min_c (float) – Same meaning as for
Polyclonal.mut_icXX_df()
.max_c (float) – Same meaning as for
Polyclonal.mut_icXX_df()
.logbase (float) – Same meaning as for
Polyclonal.mut_icXX_df()
.check_wt_icXX (None or 2-tuple) – Same meaning as for
Polyclonal.mut_icXX_df()
.biochem_order_aas (bool) – Biochemically order the amino-acid alphabet in
Polyclonal.alphabet
by passing it throughpolyclonal.alphabets.biochem_order_aas()
.df_to_merge (None or pandas.DataFrame or list) – To include additional properties, specify data frame or list of them which are merged with
Polyclonal.mut_escape_df
before being passed topolyclonal.plot.lineplot_and_heatmap()
. Properties will only be included in plot if relevant columns are passed topolyclonal.plot.lineplot_and_heatmap()
via addtl_slider_stats, addtl_tooltip_stats, or site_zoom_bar_color_col.positive_color (str) – Color for positive log fold change in heatmap.
negative_color (str) – Color for negative log fold change in heatmap.
**kwargs – Keyword args for
polyclonal.plot.lineplot_and_heatmap()
.
- Returns:
Interactive line plot and heatmap.
- Return type:
altair.Chart
- property non_neutralized_frac_df¶
non-neutralizable fractions \(t_e\) for epitopes.
- Type:
pandas.DataFrame
- prob_escape(*, variants_df, concentrations=None)[source]¶
Compute predicted probability of escape \(p_v\left(c\right)\).
Computed using current mutation-escape values \(\beta_{m,e}\) and epitope activities \(a_{\rm{wt},e}\) stored in this
Polyclonal
object.- Parameters:
variants_df (pandas.DataFrame) – Input data frame defining variants. Should have a column named ‘aa_substitutions’ that defines variants as space-delimited strings of substitutions (e.g., ‘M1A K3T’). Should also have a column ‘concentration’ if
concentrations=None
.concentrations (array-like or None) – Concentrations at which we compute probability of escape.
- Returns:
Version of
variants_df
with columns named ‘concentration’ and ‘predicted_prob_escape’ giving predicted probability of escape \(p_v\left(c\right)\) for each variant at each concentration.- Return type:
pandas.DataFrame
- site_level_model(*, aggregate_mut_escapes='mean', collapse_identical_variants='mean')[source]¶
Model with mutations collapsed at site level.
- Parameters:
aggregate_mut_escapes ({'mean'}) – How to aggregate mutation-level escape values to site-level ones in
mut_escape_df
.collapse_identical_variants ({"mean", "median", False}) – Same meaning as for
Polyclonal
initialization.
- Return type:
- exception polyclonal.polyclonal.PolyclonalConcentrationError[source]¶
Bases:
Exception
Error in concentrations passed to
Polyclonal
indata_to_fit
.
- exception polyclonal.polyclonal.PolyclonalFitError[source]¶
Bases:
Exception
Error fitting in
Polyclonal.fit()
.
- exception polyclonal.polyclonal.PolyclonalHarmonizeError[source]¶
Bases:
Exception
Error harmonizing epitopes in
Polyclonal.epitope_harmonized_model()
.