Simulate experiment

We will simulate data from a deep mutational scanning experiment using the RBD antibody mix in order to provide hypothetical data on which to fit a Polyclonal model.

Simulate three epitopes

We first define a Polyclonal object that represents the antibody mix we want to simulate. This mix is again based on the three RBD antibodies targeting class 1, 2, and 3 epitopes.

[1]:
import pandas as pd

import polyclonal

# read data needed to initialize `Polyclonal` object
activity_wt_df = pd.read_csv("RBD_activity_wt_df.csv")
mut_escape_df = pd.read_csv("RBD_mut_escape_df.csv")

# create `Polyclonal` object with actual antibody mix
model_3ep = polyclonal.Polyclonal(
    activity_wt_df=activity_wt_df, mut_escape_df=mut_escape_df
)

print(f"Epitopes: {model_3ep.epitopes}")
print(f"Number of mutations: {len(model_3ep.mutations)}")
print(f"Number of sites: {len(model_3ep.sites)}")
Epitopes: ('class 1', 'class 2', 'class 3')
Number of mutations: 1932
Number of sites: 173

Now we simulate libraries of variants with an average of 1, 2, 3, or 4 mutations per gene. These libraries only contain mutations for which we defined mutation escape above. We create a dms_variants CodonVariantTable for the RBD with these mutations. Note that because the CodonVariantTable requires sequential 1, 2, … numbering of sites, we have to shift sites back and forth with an offset of 330, since the RBD begins at residue 331.
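
For example, polyclonal.utils.shift_mut_site (used in the cell below) performs this shifting on mutation strings; the specific mutation shown here is just to illustrate the 330-site offset, and the expected results in the comments are what that offset implies:

import polyclonal.utils

# shift a mutation from RBD numbering to sequential numbering and back;
# with an offset of 330, "F486P" should correspond to "F156P"
polyclonal.utils.shift_mut_site("F486P", shift=-330)
polyclonal.utils.shift_mut_site("F156P", shift=330)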

[2]:
import Bio.SeqIO

import dms_variants.simulate

import polyclonal.utils

# read the coding sequence of the RBD region of interest
# (note that the RBD starts at residue 331)
geneseq = str(Bio.SeqIO.read("RBD_seq.fasta", "fasta").seq)

allowed_aa_muts = model_3ep.mut_escape_df["mutation"].unique()
print(f"There are {len(allowed_aa_muts)} allowed amino-acid mutations.")

variants = dms_variants.simulate.simulate_CodonVariantTable(
    geneseq=geneseq,
    bclen=16,
    library_specs={
        f"avg{m}muts": {"avgmuts": m, "nvariants": 40000} for m in [1, 2, 3, 4]
    },
    allowed_aa_muts=[polyclonal.utils.shift_mut_site(m, -330) for m in allowed_aa_muts],
)
There are 1932 allowed amino-acid mutations.

Now look at the distribution of the number of amino-acid mutations per variant in the library. The mutation rate is relatively high because we pre-screened for tolerated mutations and need variants with multiple mutations to decompose the antibody mix:

[3]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotNumMutsHistogram(
    mut_type="aa", samples=None, libraries=variants.libraries
)
_ = p.draw()

We also look at the mutation rate across the gene. Note that this plot uses sequential numbering of the sequence, not RBD numbering. The per-site mutation frequency is uneven because we only include tolerated mutations, and there are different numbers of tolerated mutations per site:

[4]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotMutFreqs(
    variant_type="all", mut_type="aa", samples=None, libraries=variants.libraries
)
_ = p.draw()

Now we get the data frame with the amino-acid substitutions for each variant, re-adding site offset to get back to RBD numbering:

[5]:
variants_df = variants.barcode_variant_df[
    ["library", "barcode", "aa_substitutions", "n_aa_substitutions"]
].assign(
    aa_substitutions=lambda x: x["aa_substitutions"].apply(
        polyclonal.utils.shift_mut_site, shift=330
    )
)

variants_df
[5]:
library barcode aa_substitutions n_aa_substitutions
0 avg1muts AAAAAAAATTTACGCG F486P 1
1 avg1muts AAAAAAATCCTCAATT T385A 1
2 avg1muts AAAAAACTATGCACCT Q498N 1
3 avg1muts AAAAAAGACGCTGTGC P337S 1
4 avg1muts AAAAAATAGTTCTGAT S371F R408V 2
... ... ... ... ...
159995 avg4muts TTTTTTGCGTTTATCT V367T 1
159996 avg4muts TTTTTTTCCGCTCTAA Y369R C391T N440K G447C K462G 5
159997 avg4muts TTTTTTTGTAAGCGAG N334L A348S K386S S443G Q493N V503S 6
159998 avg4muts TTTTTTTGTTTTCAGC L335A P337S R357E Y396H E406W G485R T500E 7
159999 avg4muts TTTTTTTTGCGGCCGT K444A P527S 2

160000 rows × 4 columns

Now we use our Polyclonal object to compute the predicted probability of escape (\(p_v\left(c\right)\)) at multiple serum concentrations, given the mutation escape values \(\beta_{m,e}\) and activities \(a_{\rm{wt},e}\) stored in the Polyclonal object. Since we will use these as the “true” values in our simulation, we rename the column of predicted probabilities of escape to simply be the probabilities of escape:

[6]:
variants_escape = model_3ep.prob_escape(
    variants_df=variants_df, concentrations=[0.125, 0.25, 0.5, 1, 2, 4]
)

variants_escape = variants_escape.rename(
    columns={"predicted_prob_escape": "prob_escape"}
)

variants_escape
[6]:
library barcode aa_substitutions n_aa_substitutions concentration prob_escape
0 avg1muts AAAAAATGTTCTATCC 0 0.125 0.048594
1 avg1muts AAAAACAATCCGGACT 0 0.125 0.048594
2 avg1muts AAAAACGCGGTCACTT 0 0.125 0.048594
3 avg1muts AAAAACTTGGCTAGCT 0 0.125 0.048594
4 avg1muts AAAAAGCAAGGCCCAG 0 0.125 0.048594
... ... ... ... ... ... ...
959995 avg2muts ATTCACCACCAGGTAG Y508W 1 4.000 0.000006
959996 avg2muts TAGAATGTATTATTCT Y508W 1 4.000 0.000006
959997 avg3muts CTTAAAATAGCTGGTC Y508W 1 4.000 0.000006
959998 avg4muts GGTCAATTATGTCGGG Y508W 1 4.000 0.000006
959999 avg1muts GGAACGACAGTGATCG Y508W T531H 2 4.000 0.000006

960000 rows × 6 columns
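
To build intuition for what prob_escape is computing, here is a minimal sketch of the underlying calculation for a single variant. It is illustrative only (not the package's internal code), and it assumes the default curve shape (Hill coefficient of 1, non-neutralized fraction of 0) and that activity_wt_df and mut_escape_df have the epitope/activity and epitope/mutation/escape columns used to initialize the model:

import numpy


def sketch_prob_escape(aa_substitutions, c):
    """Hand-compute p_v(c) for one variant; for illustration only."""
    muts = aa_substitutions.split()
    p = 1.0
    for _, row in activity_wt_df.iterrows():
        # escape values (beta_m,e) for this epitope, indexed by mutation
        betas = mut_escape_df[mut_escape_df["epitope"] == row["epitope"]].set_index(
            "mutation"
        )["escape"]
        # phi_e(v) = -a_wt,e + sum of beta_m,e over the variant's mutations
        phi = -row["activity"] + sum(betas.get(m, 0) for m in muts)
        # U_e(v, c) = 1 / (1 + c * exp(-phi_e(v))); p_v(c) is the product over epitopes
        p *= 1 / (1 + c * numpy.exp(-phi))
    return p


# for instance, the unmutated variant at the lowest concentration
sketch_prob_escape("", 0.125)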

Plot some features of the variants, namely the distribution of probability of escape for each number of mutations. As expected, variants with many mutations are more likely to escape the serum:

[7]:
# NBVAL_IGNORE_OUTPUT

from plotnine import *

p = (
    ggplot(
        variants_escape.assign(
            n_aa_substitutions=lambda x: x["n_aa_substitutions"].map(
                lambda n: str(n) if n <= 6 else ">6"
            ),
            concentration=lambda x: pd.Categorical(x["concentration"]),
        )
    )
    + aes("n_aa_substitutions", "prob_escape", fill="concentration")
    + geom_boxplot(outlier_size=0.5, outlier_alpha=0.1)
    + facet_wrap("~ library", ncol=1)
    + theme_classic()
    + theme(figure_size=(9, 7))
)

_ = p.draw()

Also, just compute the overall average probability of escape for each library:

[8]:
(
    variants_escape.groupby(["library", "concentration"], observed=False).aggregate(
        {"prob_escape": "mean"}
    )
)
[8]:
prob_escape
library concentration
avg1muts 0.125 0.086326
0.250 0.033071
0.500 0.012991
1.000 0.005777
2.000 0.003006
4.000 0.001809
avg2muts 0.125 0.125877
0.250 0.060015
0.500 0.030075
1.000 0.016740
2.000 0.010386
4.000 0.007009
avg3muts 0.125 0.165698
0.250 0.090470
0.500 0.052089
1.000 0.032763
2.000 0.022442
4.000 0.016411
avg4muts 0.125 0.204486
0.250 0.122522
0.500 0.077355
1.000 0.052786
2.000 0.038664
4.000 0.029815

We also want to account for the fact that a real experiment will have some noise in the data. This noise is likely to come in two forms:

  1. Inaccuracies in the measurements of the probabilities of escape, \(p_v\left(c\right)\).

  2. Occasional mis-assignment of which mutations are in a variant due to sequencing errors.

We therefore make a “noisy” version of these escape data that incorporates both sources of noise: we add Gaussian measurement error to the escape probabilities (truncating them to stay between 0 and 1), and we add or subtract a mutation from occasional variants to represent sequencing errors.


[10]:
import numpy

numpy.random.seed(1)  # seed for reproducible output


def add_subtract_mutation(subs):
    """Sometimes add or subtract a mutation."""
    subs = subs.split()
    sub_sites = [int(sub[1:-1]) for sub in subs]
    rand = numpy.random.random()
    if rand < 0.01:  # add mutation with 1% probability
        mut = numpy.random.choice(model_3ep.mutations)
        while int(mut[1:-1]) in sub_sites:
            mut = numpy.random.choice(model_3ep.mutations)
        subs.append(mut)
    elif rand > 0.99 and subs:  # subtract mutation with 1% probability
        subs.pop(numpy.random.randint(len(subs)))
    return " ".join(subs)


# only keep needed columns in variants_escape
variants_escape = variants_escape.drop(columns="n_aa_substitutions")

# simulate noisy variants
noisy_variants_escape = (
    variants_escape.assign(
        # add Gaussian noise with standard deviation of 0.03
        prob_escape=lambda x: (
            x["prob_escape"] + numpy.random.normal(scale=0.03, size=len(x))
        ).clip(lower=0, upper=1),
    )
    # add rare sequencing errors to variants
    .drop(columns="aa_substitutions").merge(
        variants_df[["library", "barcode", "aa_substitutions"]].assign(
            aa_substitutions=lambda x: x["aa_substitutions"].map(add_subtract_mutation)
        ),
        how="left",
        on=["library", "barcode"],
        validate="many_to_one",
    )
)

noisy_variants_escape
[10]:
library barcode concentration prob_escape aa_substitutions
0 avg1muts AAAAAATGTTCTATCC 0.125 0.097324
1 avg1muts AAAAACAATCCGGACT 0.125 0.030241
2 avg1muts AAAAACGCGGTCACTT 0.125 0.032749
3 avg1muts AAAAACTTGGCTAGCT 0.125 0.016405
4 avg1muts AAAAAGCAAGGCCCAG 0.125 0.074556
... ... ... ... ... ...
959995 avg2muts ATTCACCACCAGGTAG 4.000 0.074859 Y508W
959996 avg2muts TAGAATGTATTATTCT 4.000 0.000000 Y508W
959997 avg3muts CTTAAAATAGCTGGTC 4.000 0.000000 Y508W
959998 avg4muts GGTCAATTATGTCGGG 4.000 0.000000 Y508W
959999 avg1muts GGAACGACAGTGATCG 4.000 0.000000 Y508W T531H

960000 rows × 5 columns

For both the exact and noisy data frames, we will also compute each variant's IC90 (the serum concentration at which 90% of the variant is neutralized) based on the simulated antibody mix:

[ ]:
variants_escape = model_3ep.icXX(variants_escape, x=0.9, col="IC90")

noisy_variants_escape = model_3ep.icXX(noisy_variants_escape, x=0.9, col="IC90")

noisy_variants_escape

These data frames give the variants \(v\) and their probabilities of escape \(p_v\left(c\right)\) at each serum concentration \(c\), which is what would be measured in a deep mutational scanning experiment. The goal is then to infer from these data the properties of the escape mutations and the serum, \(a_{\rm{wt},e}\) and \(\beta_{m,e}\). We write the data frames to CSV files:

[ ]:
variants_escape.to_csv(
    "RBD_variants_escape_exact.csv", index=False, float_format="%.4g"
)

noisy_variants_escape.to_csv(
    "RBD_variants_escape_noisy.csv", index=False, float_format="%.4g"
)
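
For context, these CSVs are the inputs a later fitting step would use. The sketch below shows roughly how the activities and escape values could be inferred from the noisy data; the parameter choices are illustrative, and the actual fitting workflow (including regularization options) is covered in the separate fitting notebooks:

# read back the simulated noisy escape data; na_filter=False keeps the empty
# aa_substitutions of unmutated variants as empty strings rather than NaN
data_to_fit = pd.read_csv("RBD_variants_escape_noisy.csv", na_filter=False)

# initialize a three-epitope model from the data and fit it
fit_model = polyclonal.Polyclonal(data_to_fit=data_to_fit, n_epitopes=3)
_ = fit_model.fit()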

Simulate library with two epitopes, non-one Hill coefficients, and non-zero non-neutralized fractions

Create a two-epitope model with Hill coefficients not equal to one and non-neutralized fractions greater than zero. To do this, we simply keep two of the epitopes (class 2 and class 3) from the above model:

[ ]:
model_2ep = polyclonal.Polyclonal(
    activity_wt_df=activity_wt_df.query("epitope != 'class 1'"),
    mut_escape_df=mut_escape_df.query("epitope != 'class 1'"),
    hill_coefficient_df=pd.DataFrame(
        {"epitope": ["class 2", "class 3"], "hill_coefficient": [1.2, 1.6]}
    ),
    non_neutralized_frac_df=pd.DataFrame(
        {"epitope": ["class 2", "class 3"], "non_neutralized_frac": [0.01, 0.05]},
    ),
)

model_2ep.curve_specs_df
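
For intuition on what these two parameters do, here is a minimal sketch of a per-epitope escape curve, assuming a generalized Hill-style functional form; it is illustrative only, not the package's internal code. Roughly, a Hill coefficient above one steepens the concentration dependence, and a non-zero non-neutralized fraction sets a floor below which escape cannot drop:

import numpy


def sketch_epitope_escape(c, phi, hill_coefficient=1.0, non_neutralized_frac=0.0):
    """Illustrative per-epitope escape curve; not the package's internal code."""
    # hill_coefficient > 1 steepens the transition with concentration;
    # non_neutralized_frac sets a floor on escape at high concentration
    return non_neutralized_frac + (1 - non_neutralized_frac) / (
        1 + (c * numpy.exp(-phi)) ** hill_coefficient
    )


# compare the default curve shape to the class 3 parameters used above
print(sketch_epitope_escape(4, phi=-2))
print(
    sketch_epitope_escape(4, phi=-2, hill_coefficient=1.6, non_neutralized_frac=0.05)
)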

Now compute the probability of escape for variants from the library with an average of two mutations per gene, adding some noise. We also compute the true IC90 from the model. This is the simulated library:

[ ]:
variants_2ep = (
    model_2ep.prob_escape(
        variants_df=variants_df.query("library == 'avg2muts'"),
        concentrations=[0.25, 1, 4],
    )
    .rename(columns={"predicted_prob_escape": "prob_escape"})
    .drop(columns=["n_aa_substitutions", "library"])
    .assign(
        # add Gaussian noise with standard deviation of 0.03
        prob_escape=lambda x: (
            x["prob_escape"] + numpy.random.normal(scale=0.03, size=len(x))
        ).clip(lower=0, upper=1),
        # add rare sequencing errors
        aa_substitutions=lambda x: x["aa_substitutions"].map(add_subtract_mutation),
    )
)

variants_2ep = model_2ep.icXX(variants_2ep, x=0.9, col="true IC90")

print("Average escape at each concentration:")
display(variants_2ep.groupby("concentration").aggregate({"prob_escape": "mean"}))

We write details on the 2-epitope model to files:

[ ]:
model_2ep.curve_specs_df.to_csv("2epitope_model_curve_specs.csv", index=False)

variants_2ep.to_csv("2epitope_escape.csv", index=False, float_format="%.4g")