Library mutation rate
We’ll use simulated data to show how the average mutation rate of variants in a DMS library affects the performance of Polyclonal models.
[1]:
import os
import pickle
import altair as alt
import numpy as np
import pandas as pd
import polyclonal
First, we read in four simulated “noisy” libraries, each measured at three serum concentrations. The variants in these libraries were simulated to contain a Poisson-distributed number of mutations. The libraries differ in their average number of mutations per gene (1, 2, 3, or 4) and are named accordingly.
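As a purely illustrative aside (not part of the simulation pipeline used in this notebook), drawing per-variant mutation counts from a Poisson distribution could look like the sketch below; `avg_muts` and `n_variants` are hypothetical placeholders, and `np` is the numpy imported above.

# Illustrative sketch only: sample how many mutations each variant carries
# for a library whose average mutation rate is `avg_muts` per gene.
rng = np.random.default_rng(seed=0)
avg_muts = 2
n_variants = 10
mutation_counts = rng.poisson(lam=avg_muts, size=n_variants)
mutation_counts  # small non-negative integers with mean approximately avg_muts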
[2]:
noisy_data = (
    pd.read_csv("RBD_variants_escape_noisy.csv", na_filter=None)
    .query("concentration in [0.25, 1, 4]")
    .reset_index(drop=True)
)
noisy_data
[2]:
| | library | barcode | concentration | prob_escape | aa_substitutions | IC90 |
|---|---|---|---|---|---|---|
| 0 | avg1muts | AAAAAATGTTCTATCC | 0.25 | 0.000000 | | 0.08212 |
| 1 | avg1muts | AAAAACAATCCGGACT | 0.25 | 0.000000 | | 0.08212 |
| 2 | avg1muts | AAAAACGCGGTCACTT | 0.25 | 0.018470 | | 0.08212 |
| 3 | avg1muts | AAAAACTTGGCTAGCT | 0.25 | 0.003051 | | 0.08212 |
| 4 | avg1muts | AAAAAGCAAGGCCCAG | 0.25 | 0.000000 | | 0.08212 |
| ... | ... | ... | ... | ... | ... | ... |
| 479995 | avg3muts | CTTAAAATAGCTGGTC | 4.00 | 0.000000 | Y508W | 0.08212 |
| 479996 | avg4muts | GGTCAATTATGTCGGG | 4.00 | 0.000000 | Y508W | 0.08212 |
| 479997 | avg1muts | GGAACGACAGTGATCG | 0.25 | 0.000000 | Y508W T531H | 0.08212 |
| 479998 | avg1muts | GGAACGACAGTGATCG | 1.00 | 0.000000 | Y508W T531H | 0.08212 |
| 479999 | avg1muts | GGAACGACAGTGATCG | 4.00 | 0.000000 | Y508W T531H | 0.08212 |
480000 rows × 6 columns
Now, we’ll fit a Polyclonal model to the data in each library. We’ll initialize each Polyclonal model with the same values. We know from prior work the three most important epitopes and a key mutation in each, so we use this prior knowledge to “seed” initial guesses that assign large escape values to a key site in each epitope:
- site 417 for class 1 epitope, which is often the least important
- site 484 for class 2 epitope, which is often the dominant one
- site 444 for class 3 epitope, which is often the second most dominant one
Additionally, we’ll store fit models as pickle files, so that we can conveniently load them in the future without having to fit again.
[3]:
avg_mut_rates = [1, 2, 3, 4]

# Make a directory to house pickled models
os.makedirs("fit_polyclonal_models", exist_ok=True)


def fit_polyclonal(n):
    """
    Fit `Polyclonal` model to data with a specific average mutation rate.
    Returns fit `Polyclonal` object.
    """
    poly_abs = polyclonal.Polyclonal(
        data_to_fit=noisy_data.query(f"library == 'avg{n}muts'"),
        activity_wt_df=pd.DataFrame.from_records(
            [
                ("1", 1.0),
                ("2", 3.0),
                ("3", 2.0),
            ],
            columns=["epitope", "activity"],
        ),
        site_escape_df=pd.DataFrame.from_records(
            [
                ("1", 417, 10.0),
                ("2", 484, 10.0),
                ("3", 444, 10.0),
            ],
            columns=["epitope", "site", "escape"],
        ),
        data_mut_escape_overlap="fill_to_data",
    )
    poly_abs.fit(reg_escape_weight=0.01, reg_uniqueness2_weight=0)
    return poly_abs


# Store all fit models in a dictionary for future lookup
fit_models = {}

for n in avg_mut_rates:
    # These are the keys for fit models
    model_string = f"noisy_[0.25, 1, 4]conc_{n}muts"

    if os.path.exists(f"fit_polyclonal_models/{model_string}.pkl"):
        # If the pickled model exists in the fit_polyclonal_models directory,
        # load it and update fit_models
        model = pickle.load(open(f"fit_polyclonal_models/{model_string}.pkl", "rb"))
        fit_models.update({model_string: model})
        print(f"Model on data with {n} average mutations was already fit.")
    else:
        # Else, fit a model using fit_polyclonal(), save it to the
        # fit_polyclonal_models directory, and update fit_models
        model = fit_polyclonal(n)
        fit_models.update({model_string: model})
        pickle.dump(model, open(f"fit_polyclonal_models/{model_string}.pkl", "wb"))
        print(f"Model on data with {n} average mutations fit and saved.")
Model on data with 1 average mutations was already fit.
Model on data with 2 average mutations was already fit.
Model on data with 3 average mutations was already fit.
Model on data with 4 average mutations was already fit.
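Individual fits can be pulled from the fit_models dictionary by their key. As a minimal usage sketch (the key below is just one of the keys constructed in the loop above), a model’s inferred mutation-level escape values are available from its mut_escape_df attribute, which the next cells use to compute correlations with the true values.

# Minimal sketch: look up the model fit to the library averaging 2 mutations
# per gene and peek at its inferred mutation-escape values.
model_2muts = fit_models["noisy_[0.25, 1, 4]conc_2muts"]
model_2muts.mut_escape_df.head()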
We can look at the correlation between the “true” and inferred mutation-escape values, \(\beta_{m,e}\), for the fit models. These mutation-escape values represent the extent to which mutations mediate escape from specific epitopes.
[4]:
all_corrs = pd.DataFrame({"epitope": [], "correlation (R^2)": [], "mutation_rate": []})

for n in avg_mut_rates:
    model = fit_models[f"noisy_[0.25, 1, 4]conc_{n}muts"]

    mut_escape_pred = pd.read_csv("RBD_mut_escape_df.csv").merge(
        (
            model.mut_escape_df.assign(
                epitope=lambda x: "class " + x["epitope"].astype(str)
            ).rename(columns={"escape": "predicted escape"})
        ),
        on=["mutation", "epitope"],
        validate="one_to_one",
    )

    corr = (
        mut_escape_pred.groupby("epitope")
        .apply(lambda x: x["escape"].corr(x["predicted escape"]) ** 2)
        .rename("correlation (R^2)")
        .reset_index()
    )

    all_corrs = pd.concat(
        [all_corrs, corr.assign(mutation_rate=[f"avg{n}muts"] * len(corr.index))]
    )
[5]:
# NBVAL_IGNORE_OUTPUT

alt.Chart(all_corrs).mark_circle(size=125).encode(
    x=alt.X("mutation_rate:O", sort=alt.EncodingSortField("x", order="descending")),
    y="correlation (R^2):Q",
    column="epitope:N",
    tooltip=["mutation_rate", alt.Tooltip("correlation (R^2)", format=".3f")],
    color=alt.Color(
        "epitope", scale=alt.Scale(range=["#0072B2", "#CC79A7", "#4C3549"]), legend=None
    ),
).properties(width=200, height=200, title="inferred vs. true mutation escape values")
[5]:
(Altair chart: “inferred vs. true mutation escape values”, plotting the per-epitope correlation (R²) for each library mutation rate.)
Summary
An average of at least 2 mutations per gene is needed to infer the true mutation-escape values for all epitopes. When there is an average of 1 mutation per gene, the correlation is highest for the most immunodominant epitope 2 and lowest for the most subdominant epitope 1. This is expected: a variant carrying only a single mutation in a subdominant epitope is still neutralized by antibodies targeting the more dominant epitopes, so we should not observe escape for such variants.