Library size

We’ll use simulated data to show how the number of variants in a deep mutational scanning (DMS) library affects the performance of Polyclonal models.

[1]:
import os
import pickle

import altair as alt
import numpy as np
import pandas as pd
import polyclonal

First, we read in a simulated “noisy” dataset containing 30,000 variants measured at three different serum concentrations. The variants in this library were simulated to contain a Poisson-distributed number of mutations, averaging three mutations per gene.

[2]:
noisy_data = (
    pd.read_csv("RBD_variants_escape_noisy.csv", na_filter=None)
    .query("library == 'avg3muts'")
    .query("concentration in [0.25, 1, 4]")
    .reset_index(drop=True)
)

noisy_data
[2]:
        library           barcode  concentration  prob_escape aa_substitutions     IC90
0      avg3muts  AAAACTGCTGAGGAGA           0.25     0.054700                   0.08212
1      avg3muts  AAAAGCAGGCTACTCT           0.25     0.000000                   0.08212
2      avg3muts  AAAAGCTATAGGTGCC           0.25     0.007613                   0.08212
3      avg3muts  AAAAGGTATTAGTGGC           0.25     0.001363                   0.08212
4      avg3muts  AAAAGTGCCTTCGTTA           0.25     0.000000                   0.08212
...         ...               ...            ...          ...              ...      ...
119995 avg3muts  GAGCATGATCGACGAA           1.00     0.000000      Y508V H519I  0.10830
119996 avg3muts  GAGCATGATCGACGAA           4.00     0.000000      Y508V H519I  0.10830
119997 avg3muts  CTTAAAATAGCTGGTC           0.25     0.000000            Y508W  0.08212
119998 avg3muts  CTTAAAATAGCTGGTC           1.00     0.012260            Y508W  0.08212
119999 avg3muts  CTTAAAATAGCTGGTC           4.00     0.000000            Y508W  0.08212

120000 rows × 6 columns
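
As a quick sanity check on the simulation (an illustrative snippet, not part of the original notebook), we can count the space-delimited entries in aa_substitutions at one concentration; the mean should be close to the simulated average of three mutations per gene:

# count mutations per variant at a single concentration; because the CSV was
# read with na_filter=None, unmutated variants have an empty string, which
# splits to an empty list and so counts as zero mutations
n_muts = (
    noisy_data.query("concentration == 0.25")["aa_substitutions"]
    .str.split()
    .str.len()
)
print(f"mean mutations per variant: {n_muts.mean():.2f}")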

We’ll randomly subsample successively larger subsets of variants from our library and fit a Polyclonal model to each subset.

[3]:
library_sizes = [1000, 2500, 5000, 10000, 20000, 30000]

# Make a directory to house pickled models
os.makedirs("fit_polyclonal_models", exist_ok=True)


def fit_polyclonal(n_variants):
    """
    Fit `Polyclonal` model on n_variants variants randomly subsampled from a dataset.
    Returns fit `Polyclonal` object.
    """
    poly_abs = polyclonal.Polyclonal(
        data_to_fit=(
            noisy_data.groupby("concentration")
            .apply(lambda x: x.sample(n=n_variants, random_state=123))
            .reset_index(drop=True)
        ),
        activity_wt_df=pd.DataFrame.from_records(
            [
                ("1", 1.0),
                ("2", 3.0),
                ("3", 2.0),
            ],
            columns=["epitope", "activity"],
        ),
        site_escape_df=pd.DataFrame.from_records(
            [
                ("1", 417, 10.0),
                ("2", 484, 10.0),
                ("3", 444, 10.0),
            ],
            columns=["epitope", "site", "escape"],
        ),
        data_mut_escape_overlap="fill_to_data",
    )
    poly_abs.fit(reg_escape_weight=0.01, reg_uniqueness2_weight=0)
    return poly_abs


# Store all fit models in a dictionary for future lookup
fit_models = {}

for n in library_sizes:
    # These are the keys for fit models
    model_string = f"noisy_[0.25, 1, 4]conc_3muts_{n}vars"

    model_path = f"fit_polyclonal_models/{model_string}.pkl"

    # If a pickled model already exists in the fit_polyclonal_models
    # directory, load it and update fit_models
    if os.path.exists(model_path):
        with open(model_path, "rb") as f:
            model = pickle.load(f)
        fit_models[model_string] = model
        print(f"Model with {n} variants was already fit.")
    else:
        # Otherwise, fit a model with fit_polyclonal(), save it to the
        # fit_polyclonal_models directory, and update fit_models
        model = fit_polyclonal(n)
        fit_models[model_string] = model
        with open(model_path, "wb") as f:
            pickle.dump(model, f)
        print(f"Model with {n} variants fit and saved.")
Model with 1000 variants was already fit.
Model with 2500 variants was already fit.
Model with 5000 variants was already fit.
Model with 10000 variants was already fit.
Model with 20000 variants was already fit.
Model with 30000 variants was already fit.
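
Before comparing to the “true” values, it can help to peek at what a fit model actually inferred. As a brief illustration (not part of the original notebook), here is how we might look at the first few inferred mutation-escape values from the model fit to the full library; mut_escape_df is the same attribute used in the comparison below:

# the key format matches the model_string defined in the loop above
largest_model = fit_models["noisy_[0.25, 1, 4]conc_3muts_30000vars"]

# `mut_escape_df` holds one inferred escape value per mutation per epitope
print(largest_model.mut_escape_df.head())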

We can look at the correlation between the “true” and inferred mutation-escape values, \(\beta_{m,e}\), for the fit models. These mutation-escape values represent the extent to which mutations mediate escape from specific epitopes.

[4]:
all_corrs = pd.DataFrame({"epitope": [], "correlation (R^2)": [], "num_variants": []})

for n in library_sizes:
    model = fit_models[f"noisy_[0.25, 1, 4]conc_3muts_{n}vars"]

    mut_escape_pred = pd.read_csv("RBD_mut_escape_df.csv").merge(
        (
            model.mut_escape_df.assign(
                epitope=lambda x: "class " + x["epitope"].astype(str)
            ).rename(columns={"escape": "predicted escape"})
        ),
        on=["mutation", "epitope"],
        validate="one_to_one",
    )

    corr = (
        mut_escape_pred.groupby("epitope")
        .apply(lambda x: x["escape"].corr(x["predicted escape"]) ** 2)
        .rename("correlation (R^2)")
        .reset_index()
    )

    all_corrs = pd.concat(
        [all_corrs, corr.assign(num_variants=[str(n)] * len(corr.index))]
    )
[5]:
# NBVAL_IGNORE_OUTPUT
base = (
    alt.Chart(all_corrs)
    .mark_point()
    .encode(
        alt.X("num_variants:Q"),
        alt.Y("correlation (R^2):Q"),
        alt.Color(
            "epitope:N", scale=alt.Scale(range=["#0072B2", "#CC79A7", "#4C3549"])
        ),
        tooltip=[
            "num_variants",
            alt.Tooltip("correlation (R^2)", format=".3f"),
            "epitope",
        ],
    )
)
base + base.transform_loess(
    "num_variants", "correlation (R^2)", groupby=["epitope"]
).mark_line(size=2.5).properties(title="inferred vs. true mutation escape values")
[5]:
[interactive Altair chart: “inferred vs. true mutation escape values” — points with LOESS trend lines of correlation (R^2) vs. num_variants, colored by epitope]

Additionally, we’ll look at the correlation between “true” and predicted IC90s for each of the fit models. To do this, we’ll predict the IC90s of variants in a separate library with a different (higher) mutation rate. We therefore read in the “exact” simulated data from a library containing variants with an average of four mutations per gene.

[6]:
exact_data = (
    pd.read_csv("RBD_variants_escape_exact.csv", na_filter=None)
    .query('library == "avg4muts"')
    .query("concentration in [1]")
    .reset_index(drop=True)
)
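
As an illustrative check paralleling the one above (not part of the original notebook), we can confirm that this held-out library indeed averages about four mutations per variant:

# the mean of the space-delimited mutation counts should be roughly 4
exact_n_muts = exact_data["aa_substitutions"].str.split().str.len()
print(f"mean mutations per variant: {exact_n_muts.mean():.2f}")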

We’ll make the comparison on a log scale, clipping IC90s at 50, since values above that are likely well outside the dynamic range given the concentrations used.

[7]:
ic90_corrs = pd.DataFrame({"correlation (R^2)": [], "num_variants": []})

max_ic90 = 50
for n in library_sizes:
    model = fit_models[f"noisy_[0.25, 1, 4]conc_3muts_{n}vars"]

    ic90s = (
        exact_data[["aa_substitutions", "IC90"]]
        .assign(IC90=lambda x: x["IC90"].clip(upper=max_ic90))
        .drop_duplicates()
    )
    ic90s = model.filter_variants_by_seen_muts(ic90s)
    ic90s = model.icXX(ic90s, x=0.9, col="predicted_IC90", max_c=max_ic90)

    ic90s = ic90s.assign(
        log_IC90=lambda x: np.log10(x["IC90"]),
        predicted_log_IC90=lambda x: np.log10(x["predicted_IC90"]),
    )

    corr = ic90s["log_IC90"].corr(ic90s["predicted_log_IC90"]) ** 2

    ic90_corrs = pd.concat(
        [
            ic90_corrs,
            pd.DataFrame({"correlation (R^2)": corr, "num_variants": [str(n)]}),
        ]
    )
[8]:
# NBVAL_IGNORE_OUTPUT
base = (
    alt.Chart(ic90_corrs)
    .mark_point()
    .encode(
        alt.X("num_variants:Q"),
        alt.Y("correlation (R^2):Q"),
        tooltip=["num_variants", alt.Tooltip("correlation (R^2)", format=".3f")],
    )
)
base + base.transform_loess("num_variants", "correlation (R^2)").mark_line(
    size=2.5
).properties(title="predicted vs. true IC90")
[8]:
[interactive Altair chart: “predicted vs. true IC90” — points with a LOESS trend line of correlation (R^2) vs. num_variants]

Summary

Library size is an important factor. Libraries with more than 20,000 (functional) multiply-mutated variants gave more accurate inference of the true mutation-escape values at each epitope, particularly the subdominant epitopes.