Library size
We’ll use simulated data to show how the number of variants in a deep mutational scanning (DMS) library affects the performance of Polyclonal models.
[1]:
import os
import pickle
import altair as alt
import numpy as np
import pandas as pd
import polyclonal
First, we read in a simulated “noisy” dataset containing 30,000 variants measured at three different serum concentrations. The variants in this library were simulated to carry a Poisson-distributed number of mutations, with an average of three mutations per gene (we sanity-check this average below).
[2]:
noisy_data = (
    pd.read_csv("RBD_variants_escape_noisy.csv", na_filter=None)
    .query("library == 'avg3muts'")
    .query("concentration in [0.25, 1, 4]")
    .reset_index(drop=True)
)
noisy_data
[2]:
|        | library  | barcode          | concentration | prob_escape | aa_substitutions | IC90    |
|--------|----------|------------------|---------------|-------------|------------------|---------|
| 0      | avg3muts | AAAACTGCTGAGGAGA | 0.25          | 0.054700    |                  | 0.08212 |
| 1      | avg3muts | AAAAGCAGGCTACTCT | 0.25          | 0.000000    |                  | 0.08212 |
| 2      | avg3muts | AAAAGCTATAGGTGCC | 0.25          | 0.007613    |                  | 0.08212 |
| 3      | avg3muts | AAAAGGTATTAGTGGC | 0.25          | 0.001363    |                  | 0.08212 |
| 4      | avg3muts | AAAAGTGCCTTCGTTA | 0.25          | 0.000000    |                  | 0.08212 |
| ...    | ...      | ...              | ...           | ...         | ...              | ...     |
| 119995 | avg3muts | GAGCATGATCGACGAA | 1.00          | 0.000000    | Y508V H519I      | 0.10830 |
| 119996 | avg3muts | GAGCATGATCGACGAA | 4.00          | 0.000000    | Y508V H519I      | 0.10830 |
| 119997 | avg3muts | CTTAAAATAGCTGGTC | 0.25          | 0.000000    | Y508W            | 0.08212 |
| 119998 | avg3muts | CTTAAAATAGCTGGTC | 1.00          | 0.012260    | Y508W            | 0.08212 |
| 119999 | avg3muts | CTTAAAATAGCTGGTC | 4.00          | 0.000000    | Y508W            | 0.08212 |
120000 rows × 6 columns
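As a quick sanity check on the simulation, we can count the space-delimited entries in aa_substitutions (empty for unmutated variants, as in the first rows above); if the library matches its description, the mean should be close to three. This snippet is just an illustrative check, not part of the fitting workflow:

# count mutations per variant at a single concentration so each
# variant is counted once; the mean should be close to 3
n_muts = (
    noisy_data.query("concentration == 0.25")["aa_substitutions"]
    .str.split()
    .map(len)
)
print(f"mean mutations per variant: {n_muts.mean():.2f}")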
We’ll randomly subsample smaller fractions of variants in our library and fit a Polyclonal
model to each of the subsets.
[3]:
library_sizes = [1000, 2500, 5000, 10000, 20000, 30000]

# Make a directory to house pickled models
os.makedirs("fit_polyclonal_models", exist_ok=True)


def fit_polyclonal(n_variants):
    """
    Fit `Polyclonal` model on `n_variants` variants randomly subsampled from a dataset.

    Returns fit `Polyclonal` object.
    """
    poly_abs = polyclonal.Polyclonal(
        data_to_fit=(
            noisy_data.groupby("concentration")
            .apply(lambda x: x.sample(n=n_variants, random_state=123))
            .reset_index(drop=True)
        ),
        activity_wt_df=pd.DataFrame.from_records(
            [
                ("1", 1.0),
                ("2", 3.0),
                ("3", 2.0),
            ],
            columns=["epitope", "activity"],
        ),
        site_escape_df=pd.DataFrame.from_records(
            [
                ("1", 417, 10.0),
                ("2", 484, 10.0),
                ("3", 444, 10.0),
            ],
            columns=["epitope", "site", "escape"],
        ),
        data_mut_escape_overlap="fill_to_data",
    )
    poly_abs.fit(reg_escape_weight=0.01, reg_uniqueness2_weight=0)
    return poly_abs


# Store all fit models in a dictionary for future lookup
fit_models = {}

for n in library_sizes:
    # These are the keys for fit models
    model_string = f"noisy_[0.25, 1, 4]conc_3muts_{n}vars"
    # If the pickled model exists in the fit_polyclonal_models directory,
    # load it and update fit_models
    if os.path.exists(f"fit_polyclonal_models/{model_string}.pkl"):
        with open(f"fit_polyclonal_models/{model_string}.pkl", "rb") as f:
            model = pickle.load(f)
        fit_models[model_string] = model
        print(f"Model with {n} variants was already fit.")
    else:
        # Else, fit a model using fit_polyclonal(), save it to the
        # fit_polyclonal_models directory, and update fit_models
        model = fit_polyclonal(n)
        fit_models[model_string] = model
        with open(f"fit_polyclonal_models/{model_string}.pkl", "wb") as f:
            pickle.dump(model, f)
        print(f"Model with {n} variants fit and saved.")
Model with 1000 variants was already fit.
Model with 2500 variants was already fit.
Model with 5000 variants was already fit.
Model with 10000 variants was already fit.
Model with 20000 variants was already fit.
Model with 30000 variants was already fit.
We can look at the correlation between the “true” and inferred mutation-escape values, \(\beta_{m,e}\), for the fit models. These mutation-escape values represent the extent to which mutations mediate escape from specific epitopes.
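As an illustrative peek before computing correlations, we can pull out the strongest inferred escape mutations from the largest-library model via its mut_escape_df attribute (the same attribute the next cell uses):

# top inferred escape mutations per epitope for the 30,000-variant model
model_30k = fit_models["noisy_[0.25, 1, 4]conc_3muts_30000vars"]
(
    model_30k.mut_escape_df.sort_values("escape", ascending=False)
    .groupby("epitope")
    .head(3)  # three largest mutation-escape values per epitope
)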
[4]:
all_corrs = pd.DataFrame({"epitope": [], "correlation (R^2)": [], "num_variants": []})

for n in library_sizes:
    model = fit_models[f"noisy_[0.25, 1, 4]conc_3muts_{n}vars"]
    mut_escape_pred = pd.read_csv("RBD_mut_escape_df.csv").merge(
        (
            model.mut_escape_df.assign(
                epitope=lambda x: "class " + x["epitope"].astype(str)
            ).rename(columns={"escape": "predicted escape"})
        ),
        on=["mutation", "epitope"],
        validate="one_to_one",
    )
    corr = (
        mut_escape_pred.groupby("epitope")
        .apply(lambda x: x["escape"].corr(x["predicted escape"]) ** 2)
        .rename("correlation (R^2)")
        .reset_index()
    )
    all_corrs = pd.concat(
        [all_corrs, corr.assign(num_variants=[str(n)] * len(corr.index))]
    )
[5]:
# NBVAL_IGNORE_OUTPUT
base = (
    alt.Chart(all_corrs)
    .mark_point()
    .encode(
        alt.X("num_variants:Q"),
        alt.Y("correlation (R^2):Q"),
        alt.Color(
            "epitope:N", scale=alt.Scale(range=["#0072B2", "#CC79A7", "#4C3549"])
        ),
        tooltip=[
            "num_variants",
            alt.Tooltip("correlation (R^2)", format=".3f"),
            "epitope",
        ],
    )
)

base + base.transform_loess(
    "num_variants", "correlation (R^2)", groupby=["epitope"]
).mark_line(size=2.5).properties(title="inferred vs. true mutation escape values")
[5]:
[interactive chart: correlation (R^2) between inferred and true mutation-escape values vs. num_variants, colored by epitope]
Additionally, we’ll look at the correlation between the “true” and predicted IC90s for each of the fit models. To do this, we’ll predict the IC90s of variants in a separate library with a different (higher) mutation rate. We therefore read in the “exact” simulated data from a library containing variants with an average of four mutations per gene.
[6]:
exact_data = (
    pd.read_csv("RBD_variants_escape_exact.csv", na_filter=None)
    .query('library == "avg4muts"')
    .query("concentration in [1]")
    .reset_index(drop=True)
)
We’ll make the comparison on a log scale and clip IC90s at values >50, as anything above that is likely to be far outside the dynamic range given the concentrations used.
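To make the effect of this clipping concrete, here is a minimal illustration with made-up IC90 values:

# a few hypothetical IC90s: without the ceiling, the two extreme values
# would dominate a log-scale comparison
demo_ic90s = pd.Series([0.08, 1.2, 75.0, 400.0])
print(np.log10(demo_ic90s.clip(upper=50)))  # 75 and 400 are both capped at 50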
[7]:
ic90_corrs = pd.DataFrame({"correlation (R^2)": [], "num_variants": []})

max_ic90 = 50

for n in library_sizes:
    model = fit_models[f"noisy_[0.25, 1, 4]conc_3muts_{n}vars"]
    ic90s = (
        exact_data[["aa_substitutions", "IC90"]]
        .assign(IC90=lambda x: x["IC90"].clip(upper=max_ic90))
        .drop_duplicates()
    )
    ic90s = model.filter_variants_by_seen_muts(ic90s)
    ic90s = model.icXX(ic90s, x=0.9, col="predicted_IC90", max_c=max_ic90)
    ic90s = ic90s.assign(
        log_IC90=lambda x: np.log10(x["IC90"]),
        predicted_log_IC90=lambda x: np.log10(x["predicted_IC90"]),
    )
    corr = ic90s["log_IC90"].corr(ic90s["predicted_log_IC90"]) ** 2
    ic90_corrs = pd.concat(
        [
            ic90_corrs,
            pd.DataFrame({"correlation (R^2)": corr, "num_variants": [str(n)]}),
        ]
    )
[8]:
# NBVAL_IGNORE_OUTPUT
base = (
    alt.Chart(ic90_corrs)
    .mark_point()
    .encode(
        alt.X("num_variants:Q"),
        alt.Y("correlation (R^2):Q"),
        tooltip=["num_variants", alt.Tooltip("correlation (R^2)", format=".3f")],
    )
)

base + base.transform_loess("num_variants", "correlation (R^2)").mark_line(
    size=2.5
).properties(title="predicted vs. true IC90")
[8]:
[interactive chart: correlation (R^2) between predicted and true log IC90s vs. num_variants]
Summary
Library size is an important factor. Libraries with more than 20,000 (functional) multiply-mutated variants yielded noticeably more accurate inference of the true mutation-escape values at each epitope, particularly the subdominant epitopes.