Fit more complex curves¶
In the previous section we fit a model to data that was simulated from neutralization curves with a Hill coefficient of one and a non-neutralized fraction of zero.
Now we fit a model to data simulated with a Hill coefficient different from one and a non-neutralized fraction greater than zero. We also repeat the fitting while constraining the Hill coefficient to one and the non-neutralized fraction to zero, and get much worse results. This underscores the importance of letting both of these parameters remain free.
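As a reminder of what these two parameters mean, here is a minimal sketch of a generalized Hill neutralization curve, assuming the illustrative parameterization p(c) = t + (1 - t) / (1 + (c / IC50)^n); the function name and exact form are just for illustration, not polyclonal's internal parameterization:

```python
import numpy as np


def prob_not_neutralized(c, ic50, n=1.0, t=0.0):
    """Fraction not neutralized at concentration c for a generalized Hill curve.

    n is the Hill coefficient and t the non-neutralized fraction; with n=1 and
    t=0 this reduces to the classic 1 / (1 + c / ic50) form.
    """
    c = np.asarray(c, dtype=float)
    return t + (1 - t) / (1 + (c / ic50) ** n)


# at c = ic50, half of the neutralizable fraction escapes
print(prob_not_neutralized(1.0, ic50=1.0))  # 0.5
# a steeper curve (n > 1) with a plateau raised by the floor t
print(prob_not_neutralized(1.0, ic50=1.0, n=2, t=0.05))
```

A fixed n = 1 and t = 0 cannot mimic a steep curve with a plateau, which is why constraining them degrades the fits below.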
[1]:
import copy
import requests
import tempfile
import numpy
import pandas as pd
import polyclonal
import polyclonal.plot
data_to_fit = pd.read_csv("2epitope_escape.csv", na_filter=False)
data_to_fit
[1]:
|  | barcode | aa_substitutions | concentration | prob_escape | true IC90 |
| --- | --- | --- | --- | --- | --- |
| 0 | AAAAAAACGTCAGGAG |  | 0.25 | 0.026970 | 0.1156 |
| 1 | AAAAAAGTCGATGACA |  | 0.25 | 0.001488 | 0.1156 |
| 2 | AAAAACGTATCGAGCA |  | 0.25 | 0.004059 | 0.1156 |
| 3 | AAAAACGTTCTTATAC |  | 0.25 | 0.005425 | 0.1156 |
| 4 | AAAAATGGAGTATTCT |  | 0.25 | 0.000000 | 0.1156 |
| ... | ... | ... | ... | ... | ... |
| 119995 | GCAAGCTCGCTCTTCT | Y473C Q474E | 4.00 | 0.000000 | 0.1156 |
| 119996 | ACAGGTGTCATCCAAT | Y473S R408W | 4.00 | 0.009743 | 0.1156 |
| 119997 | GCAACTTTCGTGTAGG | E484A L518R | 4.00 | 0.016030 | 0.5521 |
| 119998 | CGTACTATGTATCCAC | Y489I H519I | 4.00 | 0.000000 | 0.1156 |
| 119999 | TTACGGTTGTGCTCAG | Y489Q L455P | 4.00 | 0.000000 | 0.1156 |

120000 rows × 5 columns
For spatial regularization (encouraging epitopes to consist of structurally proximal residues), we read the inter-residue distances in angstroms from PDB 6XM4:
[2]:
# we read the PDB from the web into a temporary file and get the distances from that;
# you could also download the file manually and then read from it
r = requests.get("https://files.rcsb.org/download/6XM4.pdb")
with tempfile.NamedTemporaryFile() as tmpf:
    _ = tmpf.write(r.content)
    tmpf.flush()
    spatial_distances = polyclonal.pdb_utils.inter_residue_distances(tmpf.name, ["A"])
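The returned spatial distances are essentially a table of pairwise residue distances. As a rough illustration of the underlying idea (not polyclonal's actual implementation, and using made-up toy coordinates rather than ones parsed from the PDB), pairwise distances from C-alpha coordinates can be computed like this:

```python
import itertools

import numpy as np
import pandas as pd

# toy C-alpha coordinates keyed by residue number (hypothetical values)
coords = {331: (1.0, 0.0, 0.0), 332: (4.0, 0.0, 0.0), 333: (4.0, 3.0, 0.0)}

# Euclidean distance for every unordered pair of residues
records = [
    (r1, r2, float(np.linalg.norm(np.subtract(xyz1, xyz2))))
    for (r1, xyz1), (r2, xyz2) in itertools.combinations(coords.items(), 2)
]

toy_distances = pd.DataFrame(records, columns=["site_1", "site_2", "distance"])
print(toy_distances)
```

The column names here are illustrative; the actual data frame returned by `inter_residue_distances` may name them differently.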
Initialize a Polyclonal model with two epitopes:
[3]:
reg_escape_weight = 0.02
model = polyclonal.Polyclonal(
data_to_fit=data_to_fit,
n_epitopes=2,
spatial_distances=spatial_distances,
)
model_fixed = copy.deepcopy(model)
Now fit the Polyclonal model with all free parameters:
[4]:
# NBVAL_IGNORE_OUTPUT
opt_res = model.fit(logfreq=200, reg_escape_weight=reg_escape_weight)
#
# Fitting site-level fixed Hill coefficient and non-neutralized frac model.
# Starting optimization of 348 parameters at Tue Apr 4 16:34:12 2023.
step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
0 0.055909 16121 16112 0 0 0 0 0 9.047 0 0
175 9.5794 1221.9 1144.2 1.9793 0 15.674 0 3.0786 56.895 0 0
# Successfully finished at Tue Apr 4 16:34:22 2023.
#
# Fitting fixed Hill coefficient and non-neutralized frac model.
# Starting optimization of 3866 parameters at Tue Apr 4 16:34:22 2023.
step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
0 0.090747 2457.7 1372.7 31.043 3.0584e-30 15.674 0 981.41 56.901 0 0
200 17.627 645.38 477.87 32.528 13.949 60.16 0 3.316 57.556 0 0
400 35.431 640.29 474.45 32.048 14.125 58.683 0 3.4311 57.553 0 0
600 53.205 640.1 474.29 31.84 14.363 58.621 0 3.4377 57.55 0 0
611 54.096 640.1 474.28 31.841 14.361 58.621 0 3.4385 57.552 0 0
# Successfully finished at Tue Apr 4 16:35:16 2023.
#
# Fitting model.
# Starting optimization of 3870 parameters at Tue Apr 4 16:35:16 2023.
step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
0 0.080651 594.06 474.28 31.841 14.361 58.621 0 3.4385 11.51 0 0
200 18.191 547.4 423.55 30.916 13.502 53.444 0 4.1208 10.707 8.459 2.7029
400 36.342 542.64 415.69 29.562 11.908 53.278 0 6.9181 10.343 11.924 3.0154
600 55.062 537.9 404.51 28.17 10.614 53.477 0 13.126 10.123 14.773 3.1084
800 72.73 536.44 404.03 28.008 10.614 52.903 0 12.766 10.119 14.904 3.1021
1000 91.718 535.85 404.53 27.779 10.546 52.645 0 12.15 10.128 14.972 3.0996
1200 110.15 535.53 404.09 27.491 10.418 52.647 0 12.189 10.099 15.446 3.1514
1352 123.88 535.31 404.12 27.272 10.353 52.165 0 12.456 10.081 15.72 3.1422
# Successfully finished at Tue Apr 4 16:37:20 2023.
Next fit a model fixing the Hill coefficient to its default of one and the non-neutralized fraction to its default of zero:
[5]:
# NBVAL_IGNORE_OUTPUT
opt_res_fixed = model_fixed.fit(
logfreq=200,
reg_escape_weight=reg_escape_weight,
fix_hill_coefficient=True,
fix_non_neutralized_frac=True,
)
#
# Fitting site-level model.
# Starting optimization of 348 parameters at Tue Apr 4 16:37:22 2023.
step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
0 0.055571 16114 16112 0 0 0 0 0 1.8094 0 0
101 5.307 1519 1491.5 0.67984 0 9.3701 0 0.20813 17.211 0 0
# Successfully finished at Tue Apr 4 16:37:27 2023.
#
# Fitting model.
# Starting optimization of 3866 parameters at Tue Apr 4 16:37:27 2023.
step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
0 0.090302 1815.7 1712.1 9.8048 1.7883e-31 9.3701 0 67.253 17.211 0 0
80 9.3994 1239.3 1147 14.143 3.0956 54.245 0 0.032042 20.706 0 0
# Successfully finished at Tue Apr 4 16:37:37 2023.
Now look at the actual curve specifications used to simulate the data:
[6]:
pd.read_csv("2epitope_model_curve_specs.csv")
[6]:
|  | epitope | activity | hill_coefficient | non_neutralized_frac |
| --- | --- | --- | --- | --- |
| 0 | class 2 | 3.2 | 1.2 | 0.01 |
| 1 | class 3 | 2.4 | 1.6 | 0.05 |
Now look at the curve specs for the non-fixed model fit. Note how the Hill coefficients and non-neutralized fractions are fit to values different from one and zero.
[7]:
# NBVAL_IGNORE_OUTPUT
model.curve_specs_df.round(2)
[7]:
|  | epitope | activity | hill_coefficient | non_neutralized_frac |
| --- | --- | --- | --- | --- |
| 0 | 1 | 2.32 | 1.66 | 0.05 |
| 1 | 2 | 2.92 | 1.43 | 0.02 |
And the curve specs for fitting the fixed model:
[8]:
model_fixed.curve_specs_df.round(2)
[8]:
|  | epitope | activity | hill_coefficient | non_neutralized_frac |
| --- | --- | --- | --- | --- |
| 0 | 1 | 5.21 | 1.0 | 0.0 |
| 1 | 2 | -5.34 | 1.0 | 0.0 |
Plot the curves for the non-fixed model:
[9]:
# NBVAL_IGNORE_OUTPUT
model.curves_plot()
[9]:
Versus the fixed model:
[10]:
# NBVAL_IGNORE_OUTPUT
model_fixed.curves_plot()
[10]:
Here are plots of the escape values inferred using the non-fixed model. Note how it successfully captures both epitopes:
[11]:
# NBVAL_IGNORE_OUTPUT
model.mut_escape_plot()
[11]:
And here are the escape values from the fixed model; note that it fails to capture both epitopes:
[12]:
# NBVAL_IGNORE_OUTPUT
model_fixed.mut_escape_plot()
[12]:
Now correlate the model-predicted and actual IC90s for each variant. The full (non-fixed) model correlates better with the true log10 IC90s:
[13]:
# NBVAL_IGNORE_OUTPUT
ic90s = (
data_to_fit[["aa_substitutions", "true IC90"]]
.drop_duplicates()
.pipe(model.icXX, x=0.9, col="model IC90")
.pipe(model_fixed.icXX, x=0.9, col="fixed model IC90")
.assign(
log10_true_IC90=lambda x: numpy.log10(x["true IC90"]),
log10_model_IC90=lambda x: numpy.log10(x["model IC90"]),
log10_fixed_model_IC90=lambda x: numpy.log10(x["fixed model IC90"]),
)
)
# print the correlations
ic90_corrs = (
ic90s[[c for c in ic90s.columns if c.startswith("log")]]
.corr(numeric_only=True)
.round(2)
)
display(ic90_corrs)
assert (
ic90_corrs.at["log10_true_IC90", "log10_model_IC90"]
> ic90_corrs.at["log10_true_IC90", "log10_fixed_model_IC90"]
)
|  | log10_true_IC90 | log10_model_IC90 | log10_fixed_model_IC90 |
| --- | --- | --- | --- |
| log10_true_IC90 | 1.00 | 0.97 | 0.79 |
| log10_model_IC90 | 0.97 | 1.00 | 0.84 |
| log10_fixed_model_IC90 | 0.79 | 0.84 | 1.00 |
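The icXX calls above numerically solve for the concentration at which a given fraction of a variant is neutralized. Here is a rough sketch of that idea for a single Hill curve, using bisection; the `icxx` function is a hypothetical helper for illustration, not polyclonal's implementation:

```python
def icxx(x, ic50, n=1.0, t=0.0, lo=1e-8, hi=1e8, tol=1e-10):
    """Concentration at which fraction x of a single Hill curve is neutralized.

    Solves 1 - (t + (1 - t) / (1 + (c / ic50) ** n)) == x by bisection.
    Returns inf when the non-neutralized fraction t makes x unreachable.
    """
    frac_escape = lambda c: t + (1 - t) / (1 + (c / ic50) ** n)
    if frac_escape(hi) > 1 - x:  # plateau above target: no finite solution
        return float("inf")
    while hi - lo > tol * max(1.0, lo):
        mid = (lo + hi) / 2
        if frac_escape(mid) > 1 - x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# with n=1 and t=0, the IC90 is 9 times the IC50
print(round(icxx(0.9, ic50=1.0), 4))  # 9.0
```

This also shows why a non-neutralized fraction matters so much for IC90s: if t exceeds 10%, the variant has no finite IC90 at all.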
We also examine the correlation between the “true” and inferred mutation-escape values, \(\beta_{m,e}\):
[14]:
# NBVAL_IGNORE_OUTPUT
mut_escape = (
pd.read_csv("RBD_mut_escape_df.csv")
.query("epitope != 'class 1'") # not used in simulation
.assign(epitope=lambda x: "true " + x["epitope"])
.pivot_table(index="mutation", columns="epitope", values="escape")
.reset_index()
)
for m, mname in [(model, "model"), (model_fixed, "fixed model")]:
# Sort so model with biggest average escape is first. This makes testing more
# robust as it is sort of random which epitope name gets assigned to biggest:
model_df = (
m.mut_escape_df.assign(
mean_escape=lambda x: x.groupby("epitope")["escape"].transform("mean")
)
.sort_values("mean_escape", ascending=False)
.pivot_table(index="mutation", columns="epitope", values="escape", sort=False)
.reset_index()
)
model_df.columns = ["mutation", f"{mname} epitope A", f"{mname} epitope B"]
mut_escape = mut_escape.merge(model_df, on="mutation", validate="one_to_one")
mut_escape_corr = mut_escape.corr(numeric_only=True).drop(
columns=[c for c in mut_escape.columns if not c.startswith("true")],
index=[c for c in mut_escape.columns if c.startswith("true")],
errors="ignore",
)
display(mut_escape_corr.round(2))
assert (
mut_escape_corr.query("index.str.startswith('model')").max().max()
> mut_escape_corr.query("index.str.startswith('fixed')").max().max()
)
|  | true class 2 | true class 3 |
| --- | --- | --- |
| model epitope A | -0.06 | 0.96 |
| model epitope B | 0.98 | -0.06 |
| fixed model epitope A | 0.67 | 0.52 |
| fixed model epitope B | -0.07 | -0.05 |
As the correlations above show, the non-fixed model captures both epitopes well, while the fixed model captures only one.
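The gap in correlations has a simple intuition: if data follow a steep Hill curve (Hill coefficient above one) but the fit is constrained to a coefficient of one, no choice of IC50 can reproduce the curve. A minimal sketch on one synthetic curve, using a least-squares grid search rather than polyclonal's optimizer:

```python
import numpy as np


def hill(c, ic50, n=1.0):
    """Fraction not neutralized for a simple Hill curve (no plateau)."""
    return 1 / (1 + (c / ic50) ** n)


conc = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
observed = hill(conc, ic50=1.0, n=1.6)  # simulated with a steep Hill coefficient

ic50_grid = np.logspace(-2, 2, 401)  # candidate IC50s spanning four decades


def best_sse(n):
    """Best achievable sum of squared errors over the IC50 grid at fixed n."""
    preds = hill(conc, ic50_grid[:, None], n)  # broadcast: (grid, concentration)
    return ((preds - observed) ** 2).sum(axis=1).min()


print(f"free n=1.6 lack of fit: {best_sse(1.6):.4f}")  # essentially zero
print(f"fixed n=1 lack of fit: {best_sse(1.0):.4f}")   # irreducible error
```

With the true Hill coefficient in the model class, the fit is exact; constrained to n = 1, a systematic lack-of-fit error remains no matter the IC50, which is the same distortion the fixed polyclonal model shows above.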