Real antibody example

In this notebook we fit real data from a monoclonal antibody deep mutational scanning experiment. The monoclonal antibody in question is LyCoV1404, and the deep mutational scanning is against the SARS-CoV-2 Omicron BA.1 spike, using a library designed to contain only mutations that are thought to be well tolerated.

First, import the Python modules:

[1]:
import time

import altair as alt

import polyclonal

import pandas as pd

Get the data to fit

Now we read the deep mutational scanning measurements, which quantify the “probability of escape” (fraction not neutralized) for each variant. Then let’s look at some key columns:

[2]:
# read data with `na_filter=False` so empty `aa_substitutions` are read as
# empty strings rather than NA
data_to_fit = pd.read_csv(
    "Lib-2_2022-06-22_thaw-1_LyCoV-1404_1_prob_escape.csv",
    na_filter=False,
)

# these data to fit have columns numbering mutations in both sequential and
# reference site numbering; here we use sequential site numbers (these will
# not match Wuhan-Hu-1 reference numbering due to indels)
data_to_fit = data_to_fit.rename(
    columns={"aa_substitutions_sequential": "aa_substitutions"}
)

data_to_fit.head()
[2]:
antibody_concentration barcode aa_substitutions aa_substitutions_reference n_aa_substitutions prob_escape prob_escape_uncensored no-antibody_count
0 0.654 AAGAGCTTACTCTCGA S443P S1167P S446P S1170P 2 0.9004 0.9004 2890
1 0.654 ATAAGATAGATTTAGG K441T A519S K444T A522S 2 0.8006 0.8006 2217
2 0.654 GATCGAGTGTGTAGCA F152T K441E F157T K444E 2 0.5797 0.5797 2966
3 0.654 CCAAACGGTATGATGA Q14H V67P D212H N445T I1224M Q14H V67P D215H N448T I1227M 5 0.4125 0.4125 4131
4 0.654 TTACTGTGCAACCCAA F163L N445S F168L N448S 2 0.3600 0.3600 4275

Each row in the above data frame is a different variant (defined by its barcode) that has some set of amino-acid substitutions and has its probability of escape (fraction not neutralized) measured at a given antibody concentration (which can be in arbitrary units).

The probability of escape should in reality be between 0 and 1 (since you can’t have more than complete escape), but due to noise in the measurements the values will sometimes be >1. We therefore have the censored probabilities of escape (prob_escape) that we will actually fit, plus the uncensored values. We expect some uncensored values to be >1, but it’s good to confirm that on average they aren’t much different from the censored values (if they are, that could indicate an experimental problem).
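As a toy illustration (not the real data), this censoring can be done with pandas’ `clip`; the `toy` data frame and its values here are made up for demonstration:

```python
import pandas as pd

# hypothetical uncensored measurements; values >1 arise from experimental noise
toy = pd.DataFrame({"prob_escape_uncensored": [0.20, 1.07, 0.95, 1.30]})

# censor to [0, 1] so the fit never sees probabilities above complete escape
toy["prob_escape"] = toy["prob_escape_uncensored"].clip(lower=0, upper=1)

print(toy["prob_escape"].tolist())  # → [0.2, 1.0, 0.95, 1.0]
```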

Note also that the experimental estimation of the probability of escape is ultimately based on sequencing counts. If the no-antibody count is not sufficiently high, the measurements will be very noisy. The above data have been pre-filtered to only include variants with at least moderately high no-antibody counts.
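A minimal sketch of such a pre-filter, using a made-up data frame and a hypothetical count threshold (the actual filtering was done upstream of this notebook):

```python
import pandas as pd

# made-up variants with their no-antibody sequencing counts
toy = pd.DataFrame(
    {
        "barcode": ["AAGA", "ATAA", "GATC"],
        "no-antibody_count": [5, 120, 2890],
    }
)

min_count = 20  # hypothetical threshold for "moderately high" counts
filtered = toy[toy["no-antibody_count"] >= min_count]

print(filtered["barcode"].tolist())  # → ['ATAA', 'GATC']
```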

Typically we have the same number of variants at each antibody concentration, and they should have the same statistics for the no-antibody counts since a single no-antibody control is used alongside all the antibody selections. The below table confirms some key points are as expected:

- we have the same variants for each concentration (although this is not a strict requirement for model fitting)
- the mean probability of escape (fraction not neutralized) decreases with increasing antibody concentration
- the uncensored probabilities of escape are on average very similar to the censored ones we will actually fit (if this isn’t true it’s a red flag about your data!)
- the mean and minimum no-antibody counts are the same for all concentrations (as they all use the same no-antibody control) and are reasonably large, outside the range where statistical noise is expected to have a major effect.

[3]:
display(
    data_to_fit.groupby("antibody_concentration")
    .aggregate(
        n_variants=pd.NamedAgg("barcode", "nunique"),
        mean_prob_escape=pd.NamedAgg("prob_escape", "mean"),
        mean_prob_escape_uncensored=pd.NamedAgg("prob_escape_uncensored", "mean"),
        mean_no_antibody_count=pd.NamedAgg("no-antibody_count", "mean"),
        min_no_antibody_count=pd.NamedAgg("no-antibody_count", "min"),
    )
    .round(2)
)
n_variants mean_prob_escape mean_prob_escape_uncensored mean_no_antibody_count min_no_antibody_count
antibody_concentration
0.654 117389 0.04 0.05 558.01 20
2.616 117389 0.03 0.04 558.01 20
10.460 117389 0.02 0.03 558.01 20

We will also plot the mean probability of escape across variants with each number of mutations. The expectation is that variants with more mutations will tend to have more escape, and again the censored and uncensored values should look similar. You can see that this is true in the plot below. Note that the y-axis is a symlog scale; mouse over points for details:

[4]:
# NBVAL_IGNORE_OUTPUT

max_aa_subs = 4  # group if >= this many substitutions

mean_prob_escape = (
    data_to_fit.assign(
        n_subs=lambda x: (
            x["n_aa_substitutions"]
            .clip(upper=max_aa_subs)
            .map(lambda n: str(n) if n < max_aa_subs else f">{max_aa_subs - 1}")
        )
    )
    .groupby(["antibody_concentration", "n_subs"], as_index=False)
    .aggregate({"prob_escape": "mean", "prob_escape_uncensored": "mean"})
    .rename(
        columns={
            "prob_escape": "censored to [0, 1]",
            "prob_escape_uncensored": "not censored",
        }
    )
    .melt(
        id_vars=["antibody_concentration", "n_subs"],
        var_name="censored",
        value_name="probability escape",
    )
)

mean_prob_escape_chart = (
    alt.Chart(mean_prob_escape)
    .encode(
        x=alt.X("antibody_concentration"),
        y=alt.Y(
            "probability escape",
            scale=alt.Scale(type="symlog", constant=0.05),
        ),
        column=alt.Column("censored", title=None),
        color=alt.Color("n_subs", title="n substitutions"),
        tooltip=[
            alt.Tooltip(c, format=".3g") if mean_prob_escape[c].dtype == float else c
            for c in mean_prob_escape.columns
        ],
    )
    .mark_line(point=True, size=0.5)
    .properties(width=200, height=125)
    .configure_axis(grid=False)
)

mean_prob_escape_chart
[4]:

Fit polyclonal model

We will now fit a polyclonal model to the data.

In general, an important consideration when fitting these models is how many epitopes to include. However, here it’s just a monoclonal antibody (not polyclonal serum), so we know we should just have one epitope.

We use an alphabet that includes stop (*) and gap (-) characters in addition to the 20 amino acids, in case some variants have such characters.

An important consideration: when fitting data to a single antibody, there is often more power to resolve mutational effects. So we increase the regularization on the mutation escape values via the reg_escape_weight parameter. This seems to often help for monoclonal antibodies, but hurts resolution of subdominant epitopes in sera:

[5]:
# NBVAL_IGNORE_OUTPUT

reg_escape_weight = 0.1  # somewhat stronger regularization on mutation escape

model = polyclonal.Polyclonal(
    # `polyclonal` expects the concentration column to be named "concentration"
    data_to_fit=data_to_fit.rename(columns={"antibody_concentration": "concentration"}),
    n_epitopes=1,
    alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
)

_ = model.fit(logfreq=100, reg_escape_weight=reg_escape_weight)
#
# Fitting site-level fixed Hill coefficient and non-neutralized frac model.
# Starting optimization of 1248 parameters at Mon Jan 29 16:31:23 2024.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
           0     0.13355       37333       37314           0           0           0              0               0       18.641                    0                        0
         100      14.986        4910        4825      38.888           0           0              0               0       46.108                    0                        0
         177      26.404      4908.2      4822.5      39.589           0           0              0               0       46.145                    0                        0
# Successfully finished at Mon Jan 29 16:31:49 2024.
#
# Fitting fixed Hill coefficient and non-neutralized frac model.
# Starting optimization of 8450 parameters at Mon Jan 29 16:31:50 2024.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
           0      2.2456      7347.8      7034.8      266.82  3.6355e-31           0              0               0       46.145                    0                        0
         100      35.417      6864.9      6706.7      95.063      10.674           0              0               0       52.402                    0                        0
         200      65.438      6855.5      6701.8      89.625      11.611           0              0               0       52.473                    0                        0
         265      80.776      6854.7      6701.4      89.038      11.746           0              0               0       52.456                    0                        0
# Successfully finished at Mon Jan 29 16:33:11 2024.
#
# Fitting model.
# Starting optimization of 8452 parameters at Mon Jan 29 16:33:11 2024.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
           0     0.23519      6812.7      6701.4      89.038      11.746           0              0               0       10.491                    0                        0
          19      5.2434      6750.3      6638.1      89.041      11.747           0              0               0       10.444              0.90715                 0.076633
# Successfully finished at Mon Jan 29 16:33:16 2024.

Visualize how mutations affect escape

Now we look at the results of the fitting.

First, we just look at the activity of the epitope. For a monoclonal antibody with just one epitope, this isn’t very complicated to interpret: the one epitope should have significantly positive activity. If it doesn’t, something is wrong. Here, the activity of the epitope is positive as expected:

[6]:
model.curve_specs_df.round(1)
[6]:
epitope activity hill_coefficient non_neutralized_frac
0 1 4.4 1.2 0.0

The more interesting measurements are how mutations affect escape.

In visualizing these results, there is an important parameter we need to consider: the number of times a mutation is seen across variants. A mutation seen in just one variant is more susceptible to being poorly measured, as there are fewer data supporting it. So we set a min_times_seen parameter specifying how many variants must contain a mutation in order for its results to be shown:

[7]:
min_times_seen = 3

Now we create a plot showing how each mutation affects escape. But first we add data giving the functional effects of mutations and the reference (rather than sequential) site numbering:

[8]:
muteffects = pd.read_csv("Omicron_BA.1_muteffects_observed.csv").rename(
    columns={
        "sequential_site": "site",
        "effect": "functional effect",
    },
)[["site", "reference_site", "mutant", "functional effect"]]

muteffects.head()
[8]:
site reference_site mutant functional effect
0 1 1 I -2.2303
1 1 1 M 0.0000
2 1 1 T -2.4823
3 1 1 V -2.3068
4 2 2 F 0.0000

Now we make the plot, merging in the functional effects. Note that we specify the functional effect as an additional slider stat, but one that hides rather than filters, so mutations with negative functional effects are just grayed out on the heatmap:

[9]:
# NBVAL_IGNORE_OUTPUT

model.mut_escape_plot(
    df_to_merge=muteffects,
    addtl_tooltip_stats=["reference_site"],
    addtl_slider_stats={"functional effect": muteffects["functional effect"].min()},
    slider_binding_range_kwargs={"times_seen": {"max": 50}},
    addtl_slider_stats_hide_not_filter=["functional effect"],
)
[9]:

In the heatmap above, each cell is a different mutation (those without measurements are grayed out). The x characters mark the wildtype identity, and you can mouse over points for details. You can use the zoom bar to zoom to specific regions. You can use the times_seen slider to see mutations seen in fewer variants (or in more variants, if you increase the value). Most sites don’t have much escape, so you can use the site slider to show just the sites where there is a mutation with lots of escape (specifically, this slider only shows sites with mutations that have escape \(\ge\) some percent of the maximal escaping mutation).
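The site slider’s behavior can also be mimicked programmatically. Here is a minimal sketch on a made-up escape data frame, keeping sites whose strongest mutation reaches a chosen fraction of the overall maximum escape (`frac` is an illustrative value, not a parameter of `polyclonal`):

```python
import pandas as pd

# made-up per-mutation escape values at a few sites
escape = pd.DataFrame(
    {
        "site": [444, 444, 447, 100, 100],
        "escape": [6.7, 2.0, 6.3, 0.1, 0.05],
    }
)

frac = 0.5  # hypothetical: keep sites reaching >= 50% of the max escaping mutation
site_max = escape.groupby("site")["escape"].max()
strong_sites = sorted(site_max[site_max >= frac * escape["escape"].max()].index)

print(strong_sites)  # → [444, 447]
```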

Here is a similar plot but now showing the fold change in IC90s rather than the mutation escape values, and showing negative values (no floor at zero):

[10]:
# NBVAL_IGNORE_OUTPUT

model.mut_icXX_plot(
    df_to_merge=muteffects,
    addtl_tooltip_stats=["reference_site"],
    addtl_slider_stats={"functional effect": muteffects["functional effect"].min()},
    slider_binding_range_kwargs={"times_seen": {"max": 50}},
    addtl_slider_stats_hide_not_filter=["functional effect"],
    init_floor_at_zero=False,
)
[10]:

Note that you can use the sites_to_show option to just show sites in a specific range:

[11]:
# NBVAL_IGNORE_OUTPUT

model.mut_escape_plot(
    df_to_merge=muteffects,
    addtl_tooltip_stats=["reference_site"],
    addtl_slider_stats={"functional effect": muteffects["functional effect"].min()},
    slider_binding_range_kwargs={"times_seen": {"max": 50}},
    addtl_slider_stats_hide_not_filter=["functional effect"],
    sites_to_show={"include_range": (434, 451)},
)
[11]:

Or just show certain sites:

[12]:
# NBVAL_IGNORE_OUTPUT

model.mut_escape_plot(
    df_to_merge=muteffects,
    addtl_tooltip_stats=["reference_site"],
    addtl_slider_stats={"functional effect": muteffects["functional effect"].min()},
    slider_binding_range_kwargs={"times_seen": {"max": 50}},
    addtl_slider_stats_hide_not_filter=["functional effect"],
    sites_to_show={"include": [436, 440, 441, 442, 443, 444, 445, 447, 449]},
)
[12]:

List the top sites of escape:

[13]:
# NBVAL_IGNORE_OUTPUT

top_sites = (
    model.mut_escape_df.query("times_seen >= @min_times_seen")
    .sort_values("escape", ascending=False)
    .groupby("site", as_index=False, sort=False)
    .first()[["site", "escape"]]
    .rename(columns={"escape": "largest_mutation_escape_at_site"})
    .head(n=5)
    .round(1)
)

top_sites
[13]:
site largest_mutation_escape_at_site
0 444 6.7
1 447 6.3
2 496 6.2
3 441 6.1
4 443 5.9
[14]:
assert set(top_sites["site"]).issubset({447, 444, 441, 442, 443, 445, 496})