Real antibody example¶
In this notebook we fit real data from a monoclonal antibody deep mutational scanning experiment. The monoclonal antibody in question is LyCoV1404, and the deep mutational scanning is against the SARS-CoV-2 Omicron BA.1 spike, using a library designed to contain only mutations that are thought to be well tolerated.
First, import the Python modules:
[1]:
import time
import altair as alt
import polyclonal
import pandas as pd
Get the data to fit¶
Now we read the deep mutational scanning measurements, which quantify the “probability of escape” (fraction not neutralized) for each variant. Then let’s look at some key columns:
[2]:
# read data with `na_filter=None` so empty aa_substitutions are read as empty strings rather than NA
data_to_fit = pd.read_csv(
"Lib-2_2022-06-22_thaw-1_LyCoV-1404_1_prob_escape.csv",
na_filter=None,
)
# these data to fit have columns numbering mutations in both sequential and
# reference site numbering; here we use sequential site numbers (these will
# not match Wuhan-Hu-1 reference numbering due to indels)
data_to_fit = data_to_fit.rename(
columns={"aa_substitutions_sequential": "aa_substitutions"}
)
data_to_fit.head()
[2]:
| | antibody_concentration | barcode | aa_substitutions | aa_substitutions_reference | n_aa_substitutions | prob_escape | prob_escape_uncensored | no-antibody_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.654 | AAGAGCTTACTCTCGA | S443P S1167P | S446P S1170P | 2 | 0.9004 | 0.9004 | 2890 |
| 1 | 0.654 | ATAAGATAGATTTAGG | K441T A519S | K444T A522S | 2 | 0.8006 | 0.8006 | 2217 |
| 2 | 0.654 | GATCGAGTGTGTAGCA | F152T K441E | F157T K444E | 2 | 0.5797 | 0.5797 | 2966 |
| 3 | 0.654 | CCAAACGGTATGATGA | Q14H V67P D212H N445T I1224M | Q14H V67P D215H N448T I1227M | 5 | 0.4125 | 0.4125 | 4131 |
| 4 | 0.654 | TTACTGTGCAACCCAA | F163L N445S | F168L N448S | 2 | 0.3600 | 0.3600 | 4275 |
Each row in the above data frame is a different variant (defined by its barcode) that has some set of amino-acid substitutions and has its probability of escape (fraction not neutralized) measured at a given antibody concentration (which can be in arbitrary units).
The probability of escape should in reality be between 0 and 1 (since a variant cannot show more than complete escape), but due to noise in the measurements some values will be >1. We therefore fit the censored probabilities of escape (`prob_escape`), but also keep the uncensored values. We expect some uncensored values to be >1, but it's good to confirm that on average they aren't much different from the censored values (if they are, that could indicate an experimental problem).
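For intuition, the censoring is just a clip to the valid probability range [0, 1]. Here is a minimal sketch with made-up values (not the actual pipeline code):

```python
import pandas as pd

# made-up uncensored probabilities of escape; noise can push values above 1
uncensored = pd.Series([0.25, 1.08, 0.97, 1.31])

# censoring clips to [0, 1], as in the `prob_escape` column
censored = uncensored.clip(lower=0, upper=1)

print(censored.tolist())  # [0.25, 1.0, 0.97, 1.0]
```

If the censored and uncensored means differ a lot, many measurements are far above 1, which is the red flag described above.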
Note also that the experimental estimation of the probability of escape is ultimately based on sequencing counts. If the no-antibody count is not sufficiently high, the measurements will be very noisy. The above data have been pre-filtered to only include variants with at least moderately high pre-antibody counts.
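To see why low counts are noisy, here is a rough back-of-the-envelope sketch (purely illustrative numbers, not how the pre-filtering was actually done): the binomial standard error of a fraction estimated from n counts shrinks as 1/sqrt(n):

```python
import math

p = 0.05  # an assumed true escape fraction, for illustration
for n in (20, 500):
    # binomial standard error of the estimated fraction from n counts
    se = math.sqrt(p * (1 - p) / n)
    print(f"count={n}: standard error ~{se:.3f}")
```

At a count of 20 the statistical noise is comparable in size to a small escape signal, which is why variants with very low no-antibody counts are filtered out.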
Typically we have the same number of variants at each antibody concentration, and they should have the same statistics for the no-antibody counts since a single no-antibody control is used alongside all the antibody selections. The below table confirms some key points are as expected:

- we have the same variants for each concentration (although this is not a strict requirement for model fitting)
- the mean probability of escape (fraction not neutralized) decreases with increasing antibody concentration
- the uncensored probabilities of escape are on average very similar to the censored ones we will actually fit (if this isn't true, it's a red flag about your data!)
- the mean and minimum no-antibody counts are the same for all concentrations (as they all use the same no-antibody control) and are reasonably large, outside the range where statistical noise is expected to have a major effect.
[3]:
display(
data_to_fit.groupby("antibody_concentration")
.aggregate(
n_variants=pd.NamedAgg("barcode", "nunique"),
mean_prob_escape=pd.NamedAgg("prob_escape", "mean"),
mean_prob_escape_uncensored=pd.NamedAgg("prob_escape_uncensored", "mean"),
mean_no_antibody_count=pd.NamedAgg("no-antibody_count", "mean"),
min_no_antibody_count=pd.NamedAgg("no-antibody_count", "min"),
)
.round(2)
)
| antibody_concentration | n_variants | mean_prob_escape | mean_prob_escape_uncensored | mean_no_antibody_count | min_no_antibody_count |
|---|---|---|---|---|---|
| 0.654 | 117389 | 0.04 | 0.05 | 558.01 | 20 |
| 2.616 | 117389 | 0.03 | 0.04 | 558.01 | 20 |
| 10.460 | 117389 | 0.02 | 0.03 | 558.01 | 20 |
We will also plot the mean probability of escape across variants with each number of mutations. The expectation is that variants with more mutations will tend to show more escape, and again the censored and uncensored values should look similar. You can see that is true in the plot below. Note that the y-axis uses a symlog scale; mouse over points for details:
[4]:
# NBVAL_IGNORE_OUTPUT
max_aa_subs = 4 # group if >= this many substitutions
mean_prob_escape = (
data_to_fit.assign(
n_subs=lambda x: (
x["n_aa_substitutions"]
.clip(upper=max_aa_subs)
.map(lambda n: str(n) if n < max_aa_subs else f">{max_aa_subs - 1}")
)
)
.groupby(["antibody_concentration", "n_subs"], as_index=False)
.aggregate({"prob_escape": "mean", "prob_escape_uncensored": "mean"})
.rename(
columns={
"prob_escape": "censored to [0, 1]",
"prob_escape_uncensored": "not censored",
}
)
.melt(
id_vars=["antibody_concentration", "n_subs"],
var_name="censored",
value_name="probability escape",
)
)
mean_prob_escape_chart = (
alt.Chart(mean_prob_escape)
.encode(
x=alt.X("antibody_concentration"),
y=alt.Y(
"probability escape",
scale=alt.Scale(type="symlog", constant=0.05),
),
column=alt.Column("censored", title=None),
color=alt.Color("n_subs", title="n substitutions"),
tooltip=[
alt.Tooltip(c, format=".3g") if mean_prob_escape[c].dtype == float else c
for c in mean_prob_escape.columns
],
)
.mark_line(point=True, size=0.5)
.properties(width=200, height=125)
.configure_axis(grid=False)
)
mean_prob_escape_chart
[4]:
Fit `polyclonal` model¶
We will now fit a `polyclonal` model to the data.
In general, an important consideration when fitting these models is how many epitopes to include. However, here it’s just a monoclonal antibody (not polyclonal serum), so we know we should just have one epitope.
We use an alphabet that includes stop (`*`) and gap (`-`) characters in addition to the 20 amino acids, in case some variants have such characters.
An important consideration: when fitting data for a single antibody, there is often more power to resolve mutational effects, so we increase the regularization on the mutation escape values via the `reg_escape_weight` parameter. This often helps for monoclonal antibodies, but hurts resolution of subdominant epitopes in sera:
[5]:
# NBVAL_IGNORE_OUTPUT
reg_escape_weight = 0.1 # somewhat stronger regularization on mutation escape
model = polyclonal.Polyclonal(
# `polyclonal` expects the concentration column to be named "concentration"
data_to_fit=data_to_fit.rename(columns={"antibody_concentration": "concentration"}),
n_epitopes=1,
alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
)
_ = model.fit(logfreq=100, reg_escape_weight=reg_escape_weight)
#
# Fitting site-level fixed Hill coefficient and non-neutralized frac model.
# Starting optimization of 1248 parameters at Mon Jan 29 16:31:23 2024.
step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
0 0.13355 37333 37314 0 0 0 0 0 18.641 0 0
100 14.986 4910 4825 38.888 0 0 0 0 46.108 0 0
177 26.404 4908.2 4822.5 39.589 0 0 0 0 46.145 0 0
# Successfully finished at Mon Jan 29 16:31:49 2024.
#
# Fitting fixed Hill coefficient and non-neutralized frac model.
# Starting optimization of 8450 parameters at Mon Jan 29 16:31:50 2024.
step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
0 2.2456 7347.8 7034.8 266.82 3.6355e-31 0 0 0 46.145 0 0
100 35.417 6864.9 6706.7 95.063 10.674 0 0 0 52.402 0 0
200 65.438 6855.5 6701.8 89.625 11.611 0 0 0 52.473 0 0
265 80.776 6854.7 6701.4 89.038 11.746 0 0 0 52.456 0 0
# Successfully finished at Mon Jan 29 16:33:11 2024.
#
# Fitting model.
# Starting optimization of 8452 parameters at Mon Jan 29 16:33:11 2024.
step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
0 0.23519 6812.7 6701.4 89.038 11.746 0 0 0 10.491 0 0
19 5.2434 6750.3 6638.1 89.041 11.747 0 0 0 10.444 0.90715 0.076633
# Successfully finished at Mon Jan 29 16:33:16 2024.
Visualize how mutations affect escape¶
Now we look at the results of the fitting.
First, we just look at the activity of the epitope. For a monoclonal antibody with just one epitope, this isn’t very complicated to interpret: the one epitope should have significantly positive activity. If it doesn’t, something is wrong. Here, the activity of the epitope is positive as expected:
[6]:
model.curve_specs_df.round(1)
[6]:
| | epitope | activity | hill_coefficient | non_neutralized_frac |
|---|---|---|---|---|
| 0 | 1 | 4.4 | 1.2 | 0.0 |
The more interesting measurements are how mutations affect escape.
In visualizing these results, an important parameter to consider is the number of times each mutation is seen across variants. A mutation seen in just one variant is more likely to be poorly measured, as fewer data support it. We therefore set a `min_times_seen` parameter specifying how many variants must contain a mutation in order to show results for it:
[7]:
min_times_seen = 3
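The filtering itself is just a `query` on the mutation-escape data frame (the real one is `model.mut_escape_df`). Here is a self-contained sketch with a made-up table:

```python
import pandas as pd

# made-up table with the same columns used for filtering
mut_escape_df = pd.DataFrame(
    {
        "mutation": ["K444T", "A522S", "G446D"],
        "escape": [6.7, 0.2, 3.1],
        "times_seen": [12, 2, 5],
    }
)

min_times_seen = 3
# `@` references the local Python variable inside the query string
reliable = mut_escape_df.query("times_seen >= @min_times_seen")
print(reliable["mutation"].tolist())  # ['K444T', 'G446D']
```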
Now we create a plot showing how each mutation affects escape. But first we read in data giving the mutation functional effects and the reference (rather than sequential) site numbering:
[8]:
muteffects = pd.read_csv("Omicron_BA.1_muteffects_observed.csv").rename(
columns={
"sequential_site": "site",
"effect": "functional effect",
},
)[["site", "reference_site", "mutant", "functional effect"]]
muteffects.head()
[8]:
| | site | reference_site | mutant | functional effect |
|---|---|---|---|---|
| 0 | 1 | 1 | I | -2.2303 |
| 1 | 1 | 1 | M | 0.0000 |
| 2 | 1 | 1 | T | -2.4823 |
| 3 | 1 | 1 | V | -2.3068 |
| 4 | 2 | 2 | F | 0.0000 |
Now make the plot, merging in the functional effects. Note that we specify functional effect as an additional slider stat, but one that hides rather than filters, so mutations with functional effects below the slider value are just grayed out on the heatmap rather than removed:
[9]:
# NBVAL_IGNORE_OUTPUT
model.mut_escape_plot(
df_to_merge=muteffects,
addtl_tooltip_stats=["reference_site"],
addtl_slider_stats={"functional effect": muteffects["functional effect"].min()},
slider_binding_range_kwargs={"times_seen": {"max": 50}},
addtl_slider_stats_hide_not_filter=["functional effect"],
)
[9]:
In the heatmap above, each cell is a different mutation (those without measurements are grayed out). The `x` characters mark the wildtype identity, and you can mouse over points for details. You can use the zoom bar to zoom to specific regions, and the `times_seen` slider to see mutations seen in fewer variants (or in more variants, if you increase the value). Most sites don't have much escape, so you can use the site slider to show just sites where there is a mutation with lots of escape (specifically, that slider only shows sites with mutations that have escape \(\ge\) some percent of the maximal escaping mutation).
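That site-filtering criterion can be sketched as follows with made-up numbers (an illustration of the criterion only, not the plotting code):

```python
import pandas as pd

# made-up per-mutation escape values at a few sites
df = pd.DataFrame({"site": [441, 441, 444, 522], "escape": [2.0, 6.1, 6.7, 0.1]})

frac_of_max = 0.10  # show sites with a mutation >= 10% of the maximal escape
site_max = df.groupby("site")["escape"].max()
shown = site_max[site_max >= frac_of_max * site_max.max()].index.tolist()
print(shown)  # [441, 444]
```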
Here is a similar plot but now showing the fold change in IC90s rather than the mutation escape values, and showing negative values (no floor at zero):
[10]:
# NBVAL_IGNORE_OUTPUT
model.mut_icXX_plot(
df_to_merge=muteffects,
addtl_tooltip_stats=["reference_site"],
addtl_slider_stats={"functional effect": muteffects["functional effect"].min()},
slider_binding_range_kwargs={"times_seen": {"max": 50}},
addtl_slider_stats_hide_not_filter=["functional effect"],
init_floor_at_zero=False,
)
[10]:
Note that you can use the `sites_to_show` option to show just sites in a specific range:
[11]:
# NBVAL_IGNORE_OUTPUT
model.mut_escape_plot(
df_to_merge=muteffects,
addtl_tooltip_stats=["reference_site"],
addtl_slider_stats={"functional effect": muteffects["functional effect"].min()},
slider_binding_range_kwargs={"times_seen": {"max": 50}},
addtl_slider_stats_hide_not_filter=["functional effect"],
sites_to_show={"include_range": (434, 451)},
)
[11]:
Or just show certain sites:
[12]:
# NBVAL_IGNORE_OUTPUT
model.mut_escape_plot(
df_to_merge=muteffects,
addtl_tooltip_stats=["reference_site"],
addtl_slider_stats={"functional effect": muteffects["functional effect"].min()},
slider_binding_range_kwargs={"times_seen": {"max": 50}},
addtl_slider_stats_hide_not_filter=["functional effect"],
sites_to_show={"include": [436, 440, 441, 442, 443, 444, 445, 447, 449]},
)
[12]:
List the top sites of escape:
[13]:
# NBVAL_IGNORE_OUTPUT
top_sites = (
model.mut_escape_df.query("times_seen >= @min_times_seen")
.sort_values("escape", ascending=False)
.groupby("site", as_index=False, sort=False)
.first(1)[["site", "escape"]]
.rename(columns={"escape": "largest_mutation_escape_at_site"})
.head(n=5)
.round(1)
)
top_sites
[13]:
| | site | largest_mutation_escape_at_site |
|---|---|---|
| 0 | 444 | 6.7 |
| 1 | 447 | 6.3 |
| 2 | 496 | 6.2 |
| 3 | 441 | 6.1 |
| 4 | 443 | 5.9 |
[14]:
assert set(top_sites["site"]).issubset({447, 444, 441, 442, 443, 445, 496})