Fit more complex curves

In the previous section we fit a model to data that was simulated from neutralization curves with a Hill coefficient of one and a non-neutralized fraction of zero.

Now we fit a model to data simulated with a Hill coefficient different from one and a non-neutralized fraction different from zero. We also fit the same data while constraining the Hill coefficient to one and the non-neutralized fraction to zero, and get much worse results. This underscores the importance of making both of these parameters free.
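
As a reminder of how these two parameters enter the model (written schematically here; see the polyclonal biophysical-model documentation for the exact parameterization and sign conventions), each epitope \(e\) contributes a Hill-style neutralization curve of the form

\[
U_e\left(v, c\right) = t_e + \frac{1 - t_e}{1 + \left[c \exp\left(-\phi_e\left(v\right)\right)\right]^{n_e}},
\]

where \(n_e\) is the Hill coefficient, \(t_e\) is the non-neutralized fraction, and \(\phi_e(v)\) depends on the epitope activity and the mutation-escape values \(\beta_{m,e}\). Setting \(n_e = 1\) and \(t_e = 0\) recovers the simpler curves used in the previous section.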

[1]:
import copy
import requests
import tempfile

import numpy

import pandas as pd

import polyclonal
import polyclonal.plot

data_to_fit = pd.read_csv("2epitope_escape.csv", na_filter=None)

data_to_fit
[1]:
barcode aa_substitutions concentration prob_escape true IC90
0 AAAAAAACGTCAGGAG 0.25 0.026970 0.1156
1 AAAAAAGTCGATGACA 0.25 0.001488 0.1156
2 AAAAACGTATCGAGCA 0.25 0.004059 0.1156
3 AAAAACGTTCTTATAC 0.25 0.005425 0.1156
4 AAAAATGGAGTATTCT 0.25 0.000000 0.1156
... ... ... ... ... ...
119995 GCAAGCTCGCTCTTCT Y473C Q474E 4.00 0.000000 0.1156
119996 ACAGGTGTCATCCAAT Y473S R408W 4.00 0.009743 0.1156
119997 GCAACTTTCGTGTAGG E484A L518R 4.00 0.016030 0.5521
119998 CGTACTATGTATCCAC Y489I H519I 4.00 0.000000 0.1156
119999 TTACGGTTGTGCTCAG Y489Q L455P 4.00 0.000000 0.1156

120000 rows × 5 columns

For spatial regularization (encouraging each epitope to consist of structurally proximal residues), we read the inter-residue distances in angstroms from PDB 6XM4:

[2]:
# we read the PDB from the webpage into a temporary file and get the distances from that.
# you could also just download the file manually and then read from it.
r = requests.get("https://files.rcsb.org/download/6XM4.pdb")
with tempfile.NamedTemporaryFile() as tmpf:
    _ = tmpf.write(r.content)
    tmpf.flush()
    spatial_distances = polyclonal.pdb_utils.inter_residue_distances(tmpf.name, ["A"])
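
If you want to sanity-check the download, you can inspect the returned data frame of pairwise residue distances (an optional check, not part of the original workflow):

# optional check: the result should be a non-empty data frame of
# residue-pair distances in angstroms
print(spatial_distances.shape)
spatial_distances.head()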

Initialize a Polyclonal model with two epitopes:

[3]:
reg_escape_weight = 0.02

model = polyclonal.Polyclonal(
    data_to_fit=data_to_fit,
    n_epitopes=2,
    spatial_distances=spatial_distances,
)

# identical copy that we will fit with the Hill coefficient and
# non-neutralized fraction fixed to their defaults
model_fixed = copy.deepcopy(model)

Now fit the Polyclonal model with all free parameters:

[4]:
# NBVAL_IGNORE_OUTPUT
opt_res = model.fit(logfreq=200, reg_escape_weight=reg_escape_weight)
#
# Fitting site-level fixed Hill coefficient and non-neutralized frac model.
# Starting optimization of 348 parameters at Tue Apr  4 16:34:12 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
           0    0.055909       16121       16112           0           0           0              0               0        9.047                    0                        0
         175      9.5794      1221.9      1144.2      1.9793           0      15.674              0          3.0786       56.895                    0                        0
# Successfully finished at Tue Apr  4 16:34:22 2023.
#
# Fitting fixed Hill coefficient and non-neutralized frac model.
# Starting optimization of 3866 parameters at Tue Apr  4 16:34:22 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
           0    0.090747      2457.7      1372.7      31.043  3.0584e-30      15.674              0          981.41       56.901                    0                        0
         200      17.627      645.38      477.87      32.528      13.949       60.16              0           3.316       57.556                    0                        0
         400      35.431      640.29      474.45      32.048      14.125      58.683              0          3.4311       57.553                    0                        0
         600      53.205       640.1      474.29       31.84      14.363      58.621              0          3.4377        57.55                    0                        0
         611      54.096       640.1      474.28      31.841      14.361      58.621              0          3.4385       57.552                    0                        0
# Successfully finished at Tue Apr  4 16:35:16 2023.
#
# Fitting model.
# Starting optimization of 3870 parameters at Tue Apr  4 16:35:16 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
           0    0.080651      594.06      474.28      31.841      14.361      58.621              0          3.4385        11.51                    0                        0
         200      18.191       547.4      423.55      30.916      13.502      53.444              0          4.1208       10.707                8.459                   2.7029
         400      36.342      542.64      415.69      29.562      11.908      53.278              0          6.9181       10.343               11.924                   3.0154
         600      55.062       537.9      404.51       28.17      10.614      53.477              0          13.126       10.123               14.773                   3.1084
         800       72.73      536.44      404.03      28.008      10.614      52.903              0          12.766       10.119               14.904                   3.1021
        1000      91.718      535.85      404.53      27.779      10.546      52.645              0           12.15       10.128               14.972                   3.0996
        1200      110.15      535.53      404.09      27.491      10.418      52.647              0          12.189       10.099               15.446                   3.1514
        1352      123.88      535.31      404.12      27.272      10.353      52.165              0          12.456       10.081                15.72                   3.1422
# Successfully finished at Tue Apr  4 16:37:20 2023.

Next fit a model fixing the Hill coefficient to its default of one and the non-neutralized fraction to its default of zero:

[5]:
# NBVAL_IGNORE_OUTPUT
opt_res_fixed = model_fixed.fit(
    logfreq=200,
    reg_escape_weight=reg_escape_weight,
    fix_hill_coefficient=True,
    fix_non_neutralized_frac=True,
)
#
# Fitting site-level model.
# Starting optimization of 348 parameters at Tue Apr  4 16:37:22 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
           0    0.055571       16114       16112           0           0           0              0               0       1.8094                    0                        0
         101       5.307        1519      1491.5     0.67984           0      9.3701              0         0.20813       17.211                    0                        0
# Successfully finished at Tue Apr  4 16:37:27 2023.
#
# Fitting model.
# Starting optimization of 3866 parameters at Tue Apr  4 16:37:27 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
           0    0.090302      1815.7      1712.1      9.8048  1.7883e-31      9.3701              0          67.253       17.211                    0                        0
          80      9.3994      1239.3        1147      14.143      3.0956      54.245              0        0.032042       20.706                    0                        0
# Successfully finished at Tue Apr  4 16:37:37 2023.

Now look at the actual curve specs used to simulate the data:

[6]:
pd.read_csv("2epitope_model_curve_specs.csv")
[6]:
epitope activity hill_coefficient non_neutralized_frac
0 class 2 3.2 1.2 0.01
1 class 3 2.4 1.6 0.05

Now look at the curve specs fit by the non-fixed model. Note how the Hill coefficients and non-neutralized fractions are fit to values different from one and zero.

[7]:
# NBVAL_IGNORE_OUTPUT

model.curve_specs_df.round(2)
[7]:
epitope activity hill_coefficient non_neutralized_frac
0 1 2.32 1.66 0.05
1 2 2.92 1.43 0.02
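
As a quick optional check (not part of the original analysis), we can assert programmatically that the freely fit parameters moved away from their defaults:

# optional check: with free fitting, the Hill coefficients should differ from 1
# and the non-neutralized fractions should be greater than 0
assert (model.curve_specs_df["hill_coefficient"] != 1).all()
assert (model.curve_specs_df["non_neutralized_frac"] > 0).all()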

And the curve specs fit by the fixed model:

[8]:
model_fixed.curve_specs_df.round(2)
[8]:
epitope activity hill_coefficient non_neutralized_frac
0 1 5.21 1.0 0.0
1 2 -5.34 1.0 0.0

Plot the curves for the non-fixed model:

[9]:
# NBVAL_IGNORE_OUTPUT
model.curves_plot()
[9]:

Versus the fixed model:

[10]:
# NBVAL_IGNORE_OUTPUT
model_fixed.curves_plot()
[10]:

Here are plots of the escape values inferred with the non-fixed model. Note how it successfully captures both epitopes:

[11]:
# NBVAL_IGNORE_OUTPUT
model.mut_escape_plot()
[11]:

And here are the escape values inferred with the fixed model, which fails to capture both epitopes:

[12]:
# NBVAL_IGNORE_OUTPUT
model_fixed.mut_escape_plot()
[12]:

Now correlate the model-predicted and actual IC90s for each variant. The full (non-fixed) model correlates better with the true log10 IC90s:
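
In the notation above, the IC90 of a variant \(v\) is (schematically) the concentration \(c\) at which 90% of the variant is neutralized, i.e. the \(c\) satisfying

\[
p_v\left(c\right) = 0.1,
\]

where \(p_v(c)\) is the predicted probability of escape (the product of the per-epitope curves \(U_e(v,c)\)); passing x=0.9 to the icXX method corresponds to this quantity.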

[13]:
# NBVAL_IGNORE_OUTPUT

ic90s = (
    data_to_fit[["aa_substitutions", "true IC90"]]
    .drop_duplicates()
    .pipe(model.icXX, x=0.9, col="model IC90")
    .pipe(model_fixed.icXX, x=0.9, col="fixed model IC90")
    .assign(
        log10_true_IC90=lambda x: numpy.log10(x["true IC90"]),
        log10_model_IC90=lambda x: numpy.log10(x["model IC90"]),
        log10_fixed_model_IC90=lambda x: numpy.log10(x["fixed model IC90"]),
    )
)

# print the correlations
ic90_corrs = (
    ic90s[[c for c in ic90s.columns if c.startswith("log")]]
    .corr(numeric_only=True)
    .round(2)
)
display(ic90_corrs)

assert (
    ic90_corrs.at["log10_true_IC90", "log10_model_IC90"]
    > ic90_corrs.at["log10_true_IC90", "log10_fixed_model_IC90"]
)
log10_true_IC90 log10_model_IC90 log10_fixed_model_IC90
log10_true_IC90 1.00 0.97 0.79
log10_model_IC90 0.97 1.00 0.84
log10_fixed_model_IC90 0.79 0.84 1.00

We also examine the correlation between the “true” and inferred mutation-escape values, \(\beta_{m,e}\):

[14]:
# NBVAL_IGNORE_OUTPUT

mut_escape = (
    pd.read_csv("RBD_mut_escape_df.csv")
    .query("epitope != 'class 1'")  # not used in simulation
    .assign(epitope=lambda x: "true " + x["epitope"])
    .pivot_table(index="mutation", columns="epitope", values="escape")
    .reset_index()
)

for m, mname in [(model, "model"), (model_fixed, "fixed model")]:
    # Sort so model with biggest average escape is first. This makes testing more
    # robust as it is sort of random which epitope name gets assigned to biggest:
    model_df = (
        m.mut_escape_df.assign(
            mean_escape=lambda x: x.groupby("epitope")["escape"].transform("mean")
        )
        .sort_values("mean_escape", ascending=False)
        .pivot_table(index="mutation", columns="epitope", values="escape", sort=False)
        .reset_index()
    )
    model_df.columns = ["mutation", f"{mname} epitope A", f"{mname} epitope B"]
    mut_escape = mut_escape.merge(model_df, on="mutation", validate="one_to_one")

mut_escape_corr = mut_escape.corr(numeric_only=True).drop(
    columns=[c for c in mut_escape.columns if not c.startswith("true")],
    index=[c for c in mut_escape.columns if c.startswith("true")],
    errors="ignore",
)

display(mut_escape_corr.round(2))

assert (
    mut_escape_corr.query("index.str.startswith('model')").max().max()
    > mut_escape_corr.query("index.str.startswith('fixed')").max().max()
)
true class 2 true class 3
model epitope A -0.06 0.96
model epitope B 0.98 -0.06
fixed model epitope A 0.67 0.52
fixed model epitope B -0.07 -0.05

As can be seen from the correlations above, the non-fixed model captures both epitopes well, while the fixed model captures just one epitope.