Library size

We’ll use simulated data to show how the number of variants in a deep mutational scanning (DMS) library affects the performance of Polyclonal models.

[1]:
import os
import pickle

import altair as alt
import numpy as np
import pandas as pd
import polyclonal

First, we read in a simulated “noisy” dataset containing 30,000 variants measured at three different serum concentrations. The variants in this library were simulated to contain a Poisson-distributed number of mutations, averaging three mutations per gene.

[2]:
noisy_data = (
    pd.read_csv("RBD_variants_escape_noisy.csv", na_filter=None)
    .query("library == 'avg3muts'")
    .query("concentration in [0.25, 1, 4]")
    .reset_index(drop=True)
)

noisy_data
[2]:
        library           barcode  concentration  prob_escape aa_substitutions     IC90
0      avg3muts  AAAACTGCTGAGGAGA           0.25     0.054700                   0.08212
1      avg3muts  AAAAGCAGGCTACTCT           0.25     0.000000                   0.08212
2      avg3muts  AAAAGCTATAGGTGCC           0.25     0.007613                   0.08212
3      avg3muts  AAAAGGTATTAGTGGC           0.25     0.001363                   0.08212
4      avg3muts  AAAAGTGCCTTCGTTA           0.25     0.000000                   0.08212
...         ...               ...            ...          ...              ...      ...
119995 avg3muts  GAGCATGATCGACGAA           1.00     0.000000      Y508V H519I  0.10830
119996 avg3muts  GAGCATGATCGACGAA           4.00     0.000000      Y508V H519I  0.10830
119997 avg3muts  CTTAAAATAGCTGGTC           0.25     0.000000            Y508W  0.08212
119998 avg3muts  CTTAAAATAGCTGGTC           1.00     0.012260            Y508W  0.08212
119999 avg3muts  CTTAAAATAGCTGGTC           4.00     0.000000            Y508W  0.08212

120000 rows × 6 columns
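
As a quick sanity check on the simulation (an illustrative snippet, not part of the original notebook), we can count the space-delimited entries in aa_substitutions at one concentration; the mean should be close to the simulated average of three mutations per gene:

# count mutations per variant at a single concentration; because the CSV was
# read with na_filter=None, unmutated variants have an empty string, which
# splits to an empty list and so counts as zero mutations
n_muts = (
    noisy_data.query("concentration == 0.25")["aa_substitutions"]
    .str.split()
    .str.len()
)
print(f"mean mutations per variant: {n_muts.mean():.2f}")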

We’ll randomly subsample successively larger subsets of variants from our library and fit a Polyclonal model to each subset.

[3]:
library_sizes = [1000, 2500, 5000, 10000, 20000, 30000]

# Make a directory to house pickled models
os.makedirs("fit_polyclonal_models", exist_ok=True)


def fit_polyclonal(n_variants):
    """
    Fit `Polyclonal` model on n_variants variants randomly subsampled from a dataset.
    Returns fit `Polyclonal` object.
    """
    poly_abs = polyclonal.Polyclonal(
        data_to_fit=(
            noisy_data.groupby("concentration")
            .apply(lambda x: x.sample(n=n_variants, random_state=123))
            .reset_index(drop=True)
        ),
        activity_wt_df=pd.DataFrame.from_records(
            [
                ("1", 1.0),
                ("2", 3.0),
                ("3", 2.0),
            ],
            columns=["epitope", "activity"],
        ),
        site_escape_df=pd.DataFrame.from_records(
            [
                ("1", 417, 10.0),
                ("2", 484, 10.0),
                ("3", 444, 10.0),
            ],
            columns=["epitope", "site", "escape"],
        ),
        data_mut_escape_overlap="fill_to_data",
    )
    poly_abs.fit(reg_escape_weight=0.01, reg_uniqueness2_weight=0)
    return poly_abs


# Store all fit models in a dictionary for future lookup
fit_models = {}

for n in library_sizes:
    # These are the keys for fit models
    model_string = f"noisy_[0.25, 1, 4]conc_3muts_{n}vars"

    model_path = f"fit_polyclonal_models/{model_string}.pkl"

    # If a pickled model already exists in the fit_polyclonal_models
    # directory, load it and update fit_models
    if os.path.exists(model_path):
        with open(model_path, "rb") as f:
            model = pickle.load(f)
        fit_models[model_string] = model
        print(f"Model with {n} variants was already fit.")
    else:
        # Otherwise, fit a model with fit_polyclonal(), save it to the
        # fit_polyclonal_models directory, and update fit_models
        model = fit_polyclonal(n)
        fit_models[model_string] = model
        with open(model_path, "wb") as f:
            pickle.dump(model, f)
        print(f"Model with {n} variants fit and saved.")
Model with 1000 variants was already fit.
Model with 2500 variants was already fit.
Model with 5000 variants was already fit.
Model with 10000 variants was already fit.
Model with 20000 variants was already fit.
Model with 30000 variants was already fit.
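
Before comparing to the “true” values, it can help to peek at what a fit model actually inferred. As a brief illustration (not part of the original notebook), here is how we might look at the first few inferred mutation-escape values from the model fit to the full library; mut_escape_df is the same attribute used in the comparison below:

# the key format matches the model_string defined in the loop above
largest_model = fit_models["noisy_[0.25, 1, 4]conc_3muts_30000vars"]

# `mut_escape_df` holds one inferred escape value per mutation per epitope
print(largest_model.mut_escape_df.head())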

We can look at the correlation between the “true” and inferred mutation-escape values, \(\beta_{m,e}\), for the fit models. These mutation-escape values represent the extent to which mutations mediate escape from specific epitopes.

[4]:
all_corrs = pd.DataFrame({"epitope": [], "correlation (R^2)": [], "num_variants": []})

for n in library_sizes:
    model = fit_models[f"noisy_[0.25, 1, 4]conc_3muts_{n}vars"]

    mut_escape_pred = pd.read_csv("RBD_mut_escape_df.csv").merge(
        (
            model.mut_escape_df.assign(
                epitope=lambda x: "class " + x["epitope"].astype(str)
            ).rename(columns={"escape": "predicted escape"})
        ),
        on=["mutation", "epitope"],
        validate="one_to_one",
    )

    corr = (
        mut_escape_pred.groupby("epitope")
        .apply(lambda x: x["escape"].corr(x["predicted escape"]) ** 2)
        .rename("correlation (R^2)")
        .reset_index()
    )

    all_corrs = pd.concat(
        [all_corrs, corr.assign(num_variants=[str(n)] * len(corr.index))]
    )
[5]:
# NBVAL_IGNORE_OUTPUT
base = (
    alt.Chart(all_corrs)
    .mark_point()
    .encode(
        alt.X("num_variants:Q"),
        alt.Y("correlation (R^2):Q"),
        alt.Color(
            "epitope:N", scale=alt.Scale(range=["#0072B2", "#CC79A7", "#4C3549"])
        ),
        tooltip=[
            "num_variants",
            alt.Tooltip("correlation (R^2)", format=".3f"),
            "epitope",
        ],
    )
)
base + base.transform_loess(
    "num_variants", "correlation (R^2)", groupby=["epitope"]
).mark_line(size=2.5).properties(title="inferred vs. true mutation escape values")
[5]:
[interactive Altair chart: “inferred vs. true mutation escape values” — points with LOESS trend lines of correlation (R^2) vs. num_variants, colored by epitope]

Additionally, we’ll look at the correlation between “true” and predicted IC90s for each of the fit models. To do this, we’ll predict the IC90s of variants in a separate library with a different (higher) mutation rate. We therefore read in the “exact” simulated data from a library containing variants with an average of four mutations per gene.

[6]:
exact_data = (
    pd.read_csv("RBD_variants_escape_exact.csv", na_filter=None)
    .query('library == "avg4muts"')
    .query("concentration in [1]")
    .reset_index(drop=True)
)
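
As an illustrative check paralleling the one above (not part of the original notebook), we can confirm that this held-out library indeed averages about four mutations per variant:

# the mean of the space-delimited mutation counts should be roughly 4
exact_n_muts = exact_data["aa_substitutions"].str.split().str.len()
print(f"mean mutations per variant: {exact_n_muts.mean():.2f}")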

We’ll make the comparison on a log scale, clipping IC90s at 50, since values above that are likely well outside the dynamic range given the concentrations used.

[7]:
ic90_corrs = pd.DataFrame({"correlation (R^2)": [], "num_variants": []})

max_ic90 = 50
for n in library_sizes:
    model = fit_models[f"noisy_[0.25, 1, 4]conc_3muts_{n}vars"]

    ic90s = (
        exact_data[["aa_substitutions", "IC90"]]
        .assign(IC90=lambda x: x["IC90"].clip(upper=max_ic90))
        .drop_duplicates()
    )
    ic90s = model.filter_variants_by_seen_muts(ic90s)
    ic90s = model.icXX(ic90s, x=0.9, col="predicted_IC90", max_c=max_ic90)

    ic90s = ic90s.assign(
        log_IC90=lambda x: np.log10(x["IC90"]),
        predicted_log_IC90=lambda x: np.log10(x["predicted_IC90"]),
    )

    corr = ic90s["log_IC90"].corr(ic90s["predicted_log_IC90"]) ** 2

    ic90_corrs = pd.concat(
        [
            ic90_corrs,
            pd.DataFrame({"correlation (R^2)": corr, "num_variants": [str(n)]}),
        ]
    )
[8]:
# NBVAL_IGNORE_OUTPUT
base = (
    alt.Chart(ic90_corrs)
    .mark_point()
    .encode(
        alt.X("num_variants:Q"),
        alt.Y("correlation (R^2):Q"),
        tooltip=["num_variants", alt.Tooltip("correlation (R^2)", format=".3f")],
    )
)
base + base.transform_loess("num_variants", "correlation (R^2)").mark_line(
    size=2.5
).properties(title="predicted vs. true IC90")
[8]:
[interactive Altair chart: “predicted vs. true IC90” — points with a LOESS trend line of correlation (R^2) vs. num_variants]

Summary

Library size is an important factor. Libraries with more than 20,000 (functional) multiply-mutated variants gave more accurate inference of the true mutation-escape values at each epitope, particularly the subdominant epitopes.