Reference site numbering¶
So far all the preceding examples have involved analysis of proteins with sequential integer site numbers (e.g., 1, 2, 3, …). This sequential integer numbering can also start at sites other than one (e.g., 331, 332, 333, …).
However, often proteins are numbered in relation to alignment to sequential integer numbering of some reference homolog. For instance, SARS-CoV-2 spike sequences are usually numbered in relation to the Wuhan-Hu-1 reference. But if the protein has accumulated insertions or deletions relative to the reference, there may be missing sites (gaps) or insertions (typically numbered as 214
, 214a
, 214b
, etc).
This notebook shows how to perform an analysis and analyze the results with non-sequential reference site numbering. The advantage of doing this is that the results can directly be visualized/analyzed using the conventional site numbering scheme.
It does this by analyzing deep mutational scanning of the SARS-CoV-2 Omicron BA.1 spike, which has several indels relative to the Wuhan-Hu-1 numbering reference.
Fit and visualize a model with non-integer site numbering¶
First, import Python modules:
[1]:
import polyclonal
import pandas as pd
Now read the data to fit:
[2]:
# read data w `na_filter=None` so empty aa_substitutions read as such rather than NA
data_to_fit = pd.read_csv(
"Lib-2_2022-06-22_thaw-1_LyCoV-1404_1_prob_escape.csv",
na_filter=None,
)
data_to_fit.head()
[2]:
antibody_concentration | barcode | aa_substitutions_sequential | aa_substitutions_reference | n_aa_substitutions | prob_escape | prob_escape_uncensored | no-antibody_count | |
---|---|---|---|---|---|---|---|---|
0 | 0.654 | AAGAGCTTACTCTCGA | S443P S1167P | S446P S1170P | 2 | 0.9004 | 0.9004 | 2890 |
1 | 0.654 | ATAAGATAGATTTAGG | K441T A519S | K444T A522S | 2 | 0.8006 | 0.8006 | 2217 |
2 | 0.654 | GATCGAGTGTGTAGCA | F152T K441E | F157T K444E | 2 | 0.5797 | 0.5797 | 2966 |
3 | 0.654 | CCAAACGGTATGATGA | Q14H V67P D212H N445T I1224M | Q14H V67P D215H N448T I1227M | 5 | 0.4125 | 0.4125 | 4131 |
4 | 0.654 | TTACTGTGCAACCCAA | F163L N445S | F168L N448S | 2 | 0.3600 | 0.3600 | 4275 |
Notice how these data have sites numbered in two different schemes:
aa_substitutions_sequential
: sequential integer numbering (1, 2, 3, …) of the protein used in the experiment. You could just analyze the mutations this way, but then the results will not be in standard reference numbering scheme.aa_substitutions_reference
: the referencey based site numbering, which skips sites with indels and has some non-numeric sites where there are insertions (eg,214a
), as for example in the variant shown below:
[3]:
data_to_fit.query("aa_substitutions_reference.str.contains('a')").head(n=1)
[3]:
antibody_concentration | barcode | aa_substitutions_sequential | aa_substitutions_reference | n_aa_substitutions | prob_escape | prob_escape_uncensored | no-antibody_count | |
---|---|---|---|---|---|---|---|---|
32 | 0.654 | CTAGATAAATCCCTGC | E209A K441N | E214aA K444N | 2 | 0.3965 | 0.3965 | 2214 |
Here we will use the reference based amino-acid substitutions, so create a column with the name used by Polyclonal
(the “aa_substitutions” column) that uses this numbering scheme:
[4]:
data_to_fit["aa_substitutions"] = data_to_fit["aa_substitutions_reference"]
Importantly, in order to use reference based numbering that is not sequential integer, you also have to provide a list of the sites in order. The reason is that otherwise it’s not possible for Polyclonal
to figure out for instance if sites are just missing from the data are are actually deletions.
Here we read a data frame that maps sequential to reference site numbering, and then use that to extract the list of reference sites. Note this plot also contains the protein regions, which we will use later:
[5]:
site_numbering_map = pd.read_csv("BA.1_site_numbering_map.csv")
display(site_numbering_map.head())
print("Note how some reference sites differ from sequential ones due to indels:")
display(
site_numbering_map[
site_numbering_map["sequential_site"].astype(str)
!= site_numbering_map["reference_site"]
].head()
)
sites = site_numbering_map["reference_site"].tolist()
sequential_site | reference_site | region | |
---|---|---|---|
0 | 0 | -1 | other |
1 | 1 | 1 | other |
2 | 2 | 2 | other |
3 | 3 | 3 | other |
4 | 4 | 4 | other |
Note how some reference sites differ from sequential ones due to indels:
sequential_site | reference_site | region | |
---|---|---|---|
0 | 0 | -1 | other |
69 | 69 | 71 | NTD |
70 | 70 | 72 | NTD |
71 | 71 | 73 | NTD |
72 | 72 | 74 | NTD |
Now initialize and fit the Polyclonal
model, but pass the sites
argument so we can use these non-sequential-integer reference sites. (If you don’t pass the sites
argument, sites are assumed to be sequential integer):
[6]:
model = polyclonal.Polyclonal(
# `polyclonal` expects the concentration column to be named "concentration"
data_to_fit=data_to_fit.rename(columns={"antibody_concentration": "concentration"}),
n_epitopes=1,
alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
sites=sites,
)
Note how the model has its sequential_integer_sites
attribute set to False
:
[7]:
assert set(model.sites) == set(sites)
assert model.sequential_integer_sites is False
Now fit the model:
[8]:
# NBVAL_IGNORE_OUTPUT
_ = model.fit(
logfreq=200,
reg_escape_weight=0.1,
fix_hill_coefficient=True,
fix_non_neutralized_frac=True,
)
#
# Fitting site-level model.
# Starting optimization of 1248 parameters at Mon Oct 16 20:36:36 2023.
step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
0 0.075336 37318 37314 0 0 0 0 0 3.7281 0 0
177 16.341 4871.2 4822.7 39.285 0 0 0 0 9.2641 0 0
# Successfully finished at Mon Oct 16 20:36:53 2023.
#
# Fitting model.
# Starting optimization of 8450 parameters at Mon Oct 16 20:36:53 2023.
step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
0 3.5293 7305.4 7031.5 264.69 4.1727e-31 0 0 0 9.2641 0 0
200 106.55 6812.9 6701.1 89.496 11.721 0 0 0 10.565 0 0
257 114.37 6812.4 6700.3 89.672 11.869 0 0 0 10.564 0 0
# Successfully finished at Mon Oct 16 20:38:47 2023.
Look at output. First we look at the mut_escape_df
. The site entries are str:
[9]:
# NBVAL_IGNORE_OUTPUT
assert all(model.mut_escape_df["site"].astype(str) == model.mut_escape_df["site"])
assert set(model.mut_escape_df["site"]).issubset(sites)
model.mut_escape_df.head().round(2)
[9]:
epitope | site | wildtype | mutant | mutation | escape | times_seen | |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | M | I | M1I | -0.00 | 1 |
1 | 1 | 1 | M | T | M1T | -0.01 | 2 |
2 | 1 | 2 | F | I | F2I | 0.00 | 2 |
3 | 1 | 2 | F | L | F2L | 0.01 | 14 |
4 | 1 | 2 | F | S | F2S | -0.01 | 14 |
The same is true for mut_escape_site_summary_df
:
[10]:
# NBVAL_IGNORE_OUTPUT
assert all(
model.mut_escape_site_summary_df()["site"].astype(str)
== model.mut_escape_site_summary_df()["site"]
)
assert set(model.mut_escape_site_summary_df()["site"]).issubset(sites)
model.mut_escape_site_summary_df().head().round(2)
[10]:
epitope | site | wildtype | mean | total positive | max | min | total negative | n mutations | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | M | -0.00 | 0.00 | -0.00 | -0.01 | -0.01 | 2 |
1 | 1 | 2 | F | 0.01 | 0.03 | 0.02 | -0.01 | -0.01 | 4 |
2 | 1 | 3 | V | 0.06 | 0.65 | 0.59 | -0.17 | -0.21 | 8 |
3 | 1 | 4 | F | 0.22 | 1.14 | 1.07 | -0.02 | -0.04 | 5 |
4 | 1 | 5 | L | 0.00 | 2.16 | 1.09 | -1.02 | -2.12 | 20 |
Now plot the escape values. Note how we merge in the data frame with sequential site numbers and protein regions and use it to color the zoom bar and add sequential sites to the lineplot and heatmap tooltips:
[11]:
# NBVAL_IGNORE_OUTPUT
model.mut_escape_plot(
df_to_merge=site_numbering_map.rename(columns={"reference_site": "site"}),
addtl_tooltip_stats=["sequential_site"],
site_zoom_bar_color_col="region",
addtl_slider_stats={"times_seen": 2},
)
[11]:
Confirm same results for reference and sequential integer number¶
To demonstrate how the results are the same regardless of which numbering scheme is used for the fitting, we also fit a model with the sequentially numbered variants. This section of the notebook can almost be considered a test rather than an example.
[12]:
# NBVAL_IGNORE_OUTPUT
model_sequential = polyclonal.Polyclonal(
data_to_fit=(
data_to_fit.drop(columns="aa_substitutions").rename(
columns={
"antibody_concentration": "concentration",
"aa_substitutions_sequential": "aa_substitutions",
}
)
),
n_epitopes=1,
alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
)
assert model_sequential.sequential_integer_sites is True
_ = model_sequential.fit(
logfreq=200,
reg_escape_weight=0.1,
fix_hill_coefficient=True,
fix_non_neutralized_frac=True,
)
#
# Fitting site-level model.
# Starting optimization of 1248 parameters at Mon Oct 16 20:38:56 2023.
step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
0 0.076008 37318 37314 0 0 0 0 0 3.7281 0 0
191 16.705 4871.2 4822.4 39.515 0 0 0 0 9.2632 0 0
# Successfully finished at Mon Oct 16 20:39:12 2023.
#
# Fitting model.
# Starting optimization of 8450 parameters at Mon Oct 16 20:39:13 2023.
step time_sec loss fit_loss reg_escape reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity reg_hill_coefficient reg_non_neutralized_frac
0 0.11586 7306.6 7031.5 265.85 2.1605e-31 0 0 0 9.2632 0 0
200 27.761 6813 6701.1 89.589 11.718 0 0 0 10.559 0 0
282 38.933 6812.4 6700.3 89.705 11.873 0 0 0 10.564 0 0
# Successfully finished at Mon Oct 16 20:39:52 2023.
Now make sure the fitting gives nearly the same result regardless of which site numbering is used. First check the activity values:
[13]:
pd.testing.assert_frame_equal(
model_sequential.curve_specs_df,
model.curve_specs_df,
atol=0.02,
)
Now compare the mutation-escape values. In order to do this, we have to re-number the sequential values to reference numbering:
[14]:
min_times_seen = 50
mut_escape = model.mut_escape_df.drop(columns="mutation")
mut_escape_sequential = model_sequential.mut_escape_df.assign(
site=lambda x: x["site"].map(
site_numbering_map.set_index("sequential_site")["reference_site"].to_dict()
)
).drop(columns="mutation")
# have to use fairly big atol to test this, so also do correlations
pd.testing.assert_frame_equal(
mut_escape,
mut_escape_sequential,
atol=1.5,
)
assert 0.99 < mut_escape["escape"].corr(mut_escape_sequential["escape"])
Test of averaging¶
Just as a toy test, average the reference-site model with itself and make sure it still displays reference-based site numbers:
[15]:
avg_model = polyclonal.PolyclonalAverage(
pd.DataFrame({"number": [1, 2], "model": [model, model]}),
)
[16]:
# NBVAL_IGNORE_OUTPUT
avg_model.mut_escape_plot()
[16]:
You cannot average models that don’t have the same site-numbering setup in terms of sequential or non-sequential:
[17]:
# NBVAL_RAISES_EXCEPTION
# this will give an error
polyclonal.PolyclonalAverage(
pd.DataFrame({"number": [1, 2], "model": [model_sequential, model]}),
)
---------------------------------------------------------------------------
PolyclonalHarmonizeError Traceback (most recent call last)
Cell In[17], line 4
1 # NBVAL_RAISES_EXCEPTION
2
3 # this will give an error
----> 4 polyclonal.PolyclonalAverage(
5 pd.DataFrame({"number": [1, 2], "model": [model_sequential, model]}),
6 )
File ~/polyclonal/polyclonal/polyclonal_collection.py:1459, in PolyclonalAverage.__init__(self, models_df, harmonize_to, default_avg_to_plot)
1456 if harmonize_to is None:
1457 harmonize_to = models_df.iloc[0]["model"]
-> 1459 models_df["model"] = [
1460 m.epitope_harmonized_model(harmonize_to)[0] for m in models_df["model"]
1461 ]
1463 super().__init__(models_df, default_avg_to_plot=default_avg_to_plot)
File ~/polyclonal/polyclonal/polyclonal_collection.py:1460, in <listcomp>(.0)
1456 if harmonize_to is None:
1457 harmonize_to = models_df.iloc[0]["model"]
1459 models_df["model"] = [
-> 1460 m.epitope_harmonized_model(harmonize_to)[0] for m in models_df["model"]
1461 ]
1463 super().__init__(models_df, default_avg_to_plot=default_avg_to_plot)
File ~/polyclonal/polyclonal/polyclonal.py:3233, in Polyclonal.epitope_harmonized_model(self, ref_poly)
3227 raise PolyclonalHarmonizeError(
3228 "The models being harmonized have different epitope labels:\n"
3229 f"{self.epitopes=}\n{ref_poly.epitopes=}"
3230 )
3232 if self.sequential_integer_sites != ref_poly.sequential_integer_sites:
-> 3233 raise PolyclonalHarmonizeError(
3234 "Models being harmonized don't have same `sequential_integer_sites`"
3235 )
3237 corr_df = (
3238 self.mut_escape_corr(ref_poly)
3239 .sort_values(["self_epitope", "correlation"], ascending=[True, False])
3240 .reset_index(drop=True)
3241 )
3243 harmonize_df = (
3244 corr_df.rename(columns={"self_epitope": "self_initial_epitope"})
3245 .groupby("self_initial_epitope", as_index=False)
(...)
3254 ]
3255 )
PolyclonalHarmonizeError: Models being harmonized don't have same `sequential_integer_sites`