polyclonal

Defines Polyclonal objects for handling antibody mixtures.

class polyclonal.polyclonal.Polyclonal(*, data_to_fit=None, activity_wt_df=None, mut_escape_df=None, hill_coefficient_df=None, non_neutralized_frac_df=None, site_escape_df=None, n_epitopes=None, spatial_distances=None, collapse_identical_variants=False, alphabet=('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'), sites=None, epitope_colors=('#0072B2', '#CC79A7', '#009E73', '#17BECF', '#BCDB22'), init_missing='zero', data_mut_escape_overlap='exact_match', check_concentration_scale=4)[source]

Bases: object

Represent polyclonal antibody mixtures targeting multiple epitopes.

Note

At several concentrations \(c\) of an antibody mixture, we measure \(p_v\left(c\right)\), the probability that variant \(v\) is not bound (or neutralized). We assume antibodies act independently on one of \(E\) epitopes, so the probability \(U_e\left(v, c\right)\) that \(v\) is unbound at concentration \(c\) is related to the probability that epitope \(e\) is unbound by

(1)\[p_v\left(c\right) = \prod_{e=1}^E U_e\left(v, c\right).\]

We furthermore assume that \(U_e\left(v, c\right)\) is related to the total binding activity \(\phi_e\left(v\right)\) of antibodies targeting epitope \(e\) on variant \(v\) by

(2)\[U_e\left(v,c\right)= \frac{1-t_e}{1+\left[c\exp\left(-\phi_e\left(v\right)\right)\right]^{n_e}}+t_e\]

where smaller (more negative) values of \(\phi_e\left(v\right)\) correspond to higher overall binding activity against epitope \(e\) variant \(v\), \(n_e\) is the Hill coefficient, and \(t_e\) is the non-neutralized fraction.

We define \(\phi_e\left(v\right)\) in terms of the underlying quantities of biological interest as

(3)\[\phi_e\left(v\right) = -a_{\rm{wt}, e} + \sum_{m=1}^M \beta_{m,e} b\left(v\right)_m,\]

where \(a_{\rm{wt}, e}\) is the activity of the serum against epitope \(e\) for the “wildtype” (unmutated) protein (larger values indicate higher activity against this epitope), \(\beta_{m,e}\) is the extent to which mutation \(m\) (where \(1 \le m \le M\)) escapes binding from antibodies targeting epitope \(e\) (larger values indicate more escape by this mutation), and \(b\left(v\right)_m\) is 1 if variant \(v\) has mutation \(m\) and 0 otherwise.

Note

You can initialize a Polyclonal object in three ways:

  1. With known epitope activities \(a_{\rm{wt}, e}\) and mutation-escape values \(\beta_{m,e}\), and no data. Use this approach if you already know these values and just want to visualize the polyclonal antibody mixture properties or predict escape of variants. To do this, initialize with activity_wt_df and mut_escape_df storing the known values, and data_to_fit=None.

  2. With data to fit the epitope activities and mutation-escape values, and initial guesses for the epitope activities and mutation-escape values. To do this, initialize with data_to_fit holding the data, activity_wt_df holding initial guesses of activities, and mut_escape_df or site_escape_df holding initial guesses for mutation escapes (see also init_missing and data_mut_escape_overlap). Then call Polyclonal.fit().

  3. With data to fit the epitope activities and mutation-escape values, but no initial guesses. To do this, initialize with data_to_fit holding the data, activity_wt_df=None, mut_escape_df=None, and n_epitopes holding the number of epitopes. Then call Polyclonal.fit().

For the first and second approaches, you can optionally set initial values for \(n_e\) (the Hill coefficient) and \(t_e\) (the non-neutralizable fraction) via hill_coefficient_df and non_neutralized_frac_df. If these are not set, we initialize \(n_e = 1\) and \(t_e = 0\).

Parameters:
  • data_to_fit (pandas.DataFrame or None) – Should have columns named ‘aa_substitutions’, ‘concentration’, and ‘prob_escape’. The ‘aa_substitutions’ column defines each variant \(v\) as a string of substitutions (e.g., ‘M3A K5G’). The ‘prob_escape’ column gives the \(p_v\left(c\right)\) value for each variant at each concentration \(c\).

  • activity_wt_df (pandas.DataFrame or None) – Should have columns named ‘epitope’ and ‘activity’, giving the names of the epitopes and the activity against epitope in the wildtype protein, \(a_{\rm{wt}, e}\).

  • mut_escape_df (pandas.DataFrame or None) – Should have columns named ‘mutation’, ‘epitope’, and ‘escape’ that give the \(\beta_{m,e}\) values (in the ‘escape’ column), with mutations written like “G7M”.

  • hill_coefficient_df (pandas.DataFrame or None) – Should have columns named ‘epitope’ and ‘hill_coefficient’ that sets initial \(n_e\), or set to None for \(n_e = 1\).

  • non_neutralized_frac_df (pandas.DataFrame or None) – Should have columns named ‘epitope’ and ‘non_neutralized_frac’ that sets initial \(t_e\), or set to None for \(t_e = 0\).

  • site_escape_df (pandas.DataFrame or None) – Use if you want to initialize all mutations at a given site to have the same \(\beta_{m,e}\) values. In this case, columns should be ‘site’, ‘epitope’, and ‘escape’. This option is mutually exclusive with mut_escape_df.

  • n_epitopes (int or None) – If initializing with activity_wt_df=None, specifies number of epitopes.

  • spatial_distances (pandas.DataFrame) – Gives inter-residue distances. Must have columns “site_1”, “site_2”, and “distance” where the distance is typically in angstroms. Distances may be missing for some pairs. Used for the spatial regularization.

  • collapse_identical_variants ({'mean', 'median', False}) – If identical variants in data_to_fit (same ‘aa_substitutions’), collapse them and make weight proportional to number of collapsed variants? Collapse by taking mean or median of ‘prob_escape’, or (if False) do not collapse at all. Collapsing will make fitting faster, but not a good idea if you are doing bootstrapping.

  • alphabet (array-like) – Allowed characters in mutation strings.

  • sites (array-like or None) – By default, sites are assumed to be sequential integer values are and inferred from data_to_fit or mut_escape_df. However, you can also have non-sequential integer sites, or sites with lower-case letter suffixes (eg, 214a) if your protein is numbered against a reference that it has indels relative to. In that case, provide list of all expected in order here; we require that order to be natsorted.

  • epitope_colors (array-like or dict) – Maps each epitope to the color used for plotting. Either a dict keyed by each epitope, or an array of colors that are sequentially assigned to the epitopes.

  • init_missing ('zero' or int) – How to initialize activities or mutation-escape values not specified in activity_wt_df or mut_escape_df / site_escape_df. If ‘zero’, set mutation-escapes to zero and activities uniformly spaced from 1 to 0. Otherwise draw uniformly from between 0 and 1 using specified random number seed.

  • data_mut_escape_overlap ({'exact_match', 'fill_to_data', 'prune_to_data'}) – If data_to_fit and mut_escape_df (or site_escape_df) both specified, what if they don’t specify same mutations. If ‘exact_match’, raise error. If ‘fill_to_data’, then take sites / wildtypes / mutations from data_to_fit and fill init values from any not specified in mut_escape_df as indicated by init_missing. If ‘prune_to_data’, remove any extra mutations from mut_escape_df that are not in data_to_fit.

  • check_concentration_scale (float or None) – The optimization may have problems if concentrations in data_to_fit are provided in units where the numbers are very large or very small. If check_concentration_scale is not None, check concentrations are not so large or small that the magnitude of the log10 of any concentration exceeds this value.

epitopes

Names of all epitopes.

Type:

tuple

mutations

All mutations for which we have escape values.

Type:

tuple

alphabet

Allowed characters in mutation strings.

Type:

tuple

sites

List of all sites. These are the sites provided via the sites parameter, or inferred from data_to_fit or mut_escape_df if that isn’t provided. If sequential_integer_sites is False, these are str, otherwise int.

Type:

tuple

sequential_integer_sites

True if sites are sequential and integer, False otherwise.

Type:

bool

wts

Keyed by site, value is wildtype at that site.

Type:

dict

epitope_colors

Maps each epitope to its color.

Type:

dict

data_to_fit

Data to fit as passed when initializing this Polyclonal object. If using collapse_identical_variants, then identical variants are collapsed on columns ‘concentration’, ‘aa_substitutions’, and ‘prob_escape’, and a column ‘weight’ is added to represent number of collapsed variants. Also, row-order may be changed.

Type:

pandas.DataFrame or None

spatial_distances

Spatial distances passed when initializing this Polyclonal object.

Type:

pandas.DataFrame or None

mutations_times_seen

If data_to_fit is not None, keyed by all mutations with escape values and values are number of variants in which the mutation is seen. It is formally calculated as the number of variants with mutation across all concentrations divided by the number of concentrations, so can have non-integer values if there are variants only observed at some concentrations.

Type:

frozendict.frozendict or None

distance_matrix

If data_to_fit is not None and spatial_distances is not None, then this is a matrix giving the pairwise distances between residues. Pairs with no distance have values of pd.NA and diagonal values are 0.

Type:

pandas.DataFrame or None

Example

Simple example with two epitopes (e1 and e2) and a few mutations where we know the activities and mutation-level escape values ahead of time:

>>> activity_wt_df = pd.DataFrame({'epitope':  ['e1', 'e2'],
...                                'activity': [ 2.0,  1.0]})
>>> mut_escape_df = pd.DataFrame({
...   'mutation': ['M1C', 'M1C', 'G2A', 'G2A', 'A4K', 'A4K', 'A4L', 'A4L'],
...   'epitope':  [ 'e1',  'e2',  'e1',  'e2',  'e1',  'e2',  'e1',  'e2'],
...   'escape':   [  2.0,   0.0,   3.0,   0.0,  0.0,    2.5,   0.0,   1.5],
...   })
>>> model = Polyclonal(
...     activity_wt_df=activity_wt_df,
...     mut_escape_df=mut_escape_df,
...     collapse_identical_variants="mean",
... )
>>> model.epitopes
('e1', 'e2')
>>> model.mutations
('M1C', 'G2A', 'A4K', 'A4L')
>>> model.mutations_times_seen is None
True
>>> model.sites
(1, 2, 4)
>>> model.wts
{1: 'M', 2: 'G', 4: 'A'}
>>> model.curve_specs_df
  epitope  activity  hill_coefficient  non_neutralized_frac
0      e1       2.0               1.0                   0.0
1      e2       1.0               1.0                   0.0
>>> model.mut_escape_df
  epitope  site wildtype mutant mutation  escape
0      e1     1        M      C      M1C     2.0
1      e1     2        G      A      G2A     3.0
2      e1     4        A      K      A4K     0.0
3      e1     4        A      L      A4L     0.0
4      e2     1        M      C      M1C     0.0
5      e2     2        G      A      G2A     0.0
6      e2     4        A      K      A4K     2.5
7      e2     4        A      L      A4L     1.5

We can also summarize the mutation-level escape at the site level:

>>> pd.set_option("display.max_columns", None)
>>> pd.set_option("display.width", 89)
>>> model.mut_escape_site_summary_df()
  epitope  site wildtype  mean  total positive  max  min  total negative  n mutations
0      e1     1        M   2.0             2.0  2.0  2.0             0.0            1
1      e1     2        G   3.0             3.0  3.0  3.0             0.0            1
2      e1     4        A   0.0             0.0  0.0  0.0             0.0            2
3      e2     1        M   0.0             0.0  0.0  0.0             0.0            1
4      e2     2        G   0.0             0.0  0.0  0.0             0.0            1
5      e2     4        A   2.0             4.0  2.5  1.5             0.0            2

Note that we can not initialize a Polyclonal object if we are missing escape estimates for any mutations for any epitopes:

>>> Polyclonal(activity_wt_df=activity_wt_df,
...            mut_escape_df=mut_escape_df.head(n=5))
Traceback (most recent call last):
  ...
ValueError: invalid set of mutations for epitope='e2'

Now make a data frame with some variants:

>>> variants_df = pd.DataFrame.from_records(
...         [('AA', ''),
...          ('AC', 'M1C'),
...          ('AG', 'G2A'),
...          ('AT', 'A4K'),
...          ('TA', 'A4L'),
...          ('CA', 'M1C G2A'),
...          ('CG', 'M1C A4K'),
...          ('CC', 'G2A A4K'),
...          ('TC', 'G2A A4L'),
...          ('CT', 'M1C G2A A4K'),
...          ('TG', 'M1C G2A A4L'),
...          ('GA', 'M1C'),
...          ],
...         columns=['barcode', 'aa_substitutions'])

Get the escape probabilities predicted on these variants from the values in the Polyclonal object:

>>> escape_probs = model.prob_escape(variants_df=variants_df,
...                                  concentrations=[1.0, 2.0, 4.0])
>>> escape_probs.round(3)
   barcode aa_substitutions  concentration  predicted_prob_escape
0       AA                             1.0                  0.032
1       AT              A4K            1.0                  0.097
2       TA              A4L            1.0                  0.074
3       AG              G2A            1.0                  0.197
4       CC          G2A A4K            1.0                  0.598
5       TC          G2A A4L            1.0                  0.455
6       AC              M1C            1.0                  0.134
7       GA              M1C            1.0                  0.134
8       CG          M1C A4K            1.0                  0.409
9       CA          M1C G2A            1.0                  0.256
10      CT      M1C G2A A4K            1.0                  0.779
11      TG      M1C G2A A4L            1.0                  0.593
12      AA                             2.0                  0.010
13      AT              A4K            2.0                  0.044
14      TA              A4L            2.0                  0.029
15      AG              G2A            2.0                  0.090
16      CC          G2A A4K            2.0                  0.398
17      TC          G2A A4L            2.0                  0.260
18      AC              M1C            2.0                  0.052
19      GA              M1C            2.0                  0.052
20      CG          M1C A4K            2.0                  0.230
21      CA          M1C G2A            2.0                  0.141
22      CT      M1C G2A A4K            2.0                  0.629
23      TG      M1C G2A A4L            2.0                  0.411
24      AA                             4.0                  0.003
25      AT              A4K            4.0                  0.017
26      TA              A4L            4.0                  0.010
27      AG              G2A            4.0                  0.034
28      CC          G2A A4K            4.0                  0.214
29      TC          G2A A4L            4.0                  0.118
30      AC              M1C            4.0                  0.017
31      GA              M1C            4.0                  0.017
32      CG          M1C A4K            4.0                  0.106
33      CA          M1C G2A            4.0                  0.070
34      CT      M1C G2A A4K            4.0                  0.441
35      TG      M1C G2A A4L            4.0                  0.243

We can also get predicted escape probabilities by including concentrations in the data frame passed to Polyclonal.prob_escape():

>>> model.prob_escape(
...         variants_df=pd.concat([variants_df.assign(concentration=c)
...                                for c in [1.0, 2.0, 4.0]])
...         ).equals(escape_probs)
True

We can also compute the IC50s:

>>> model.icXX(variants_df).round(3)
   barcode aa_substitutions   IC50
0       AA                   0.085
1       AC              M1C  0.230
2       AG              G2A  0.296
3       AT              A4K  0.128
4       TA              A4L  0.117
5       CA          M1C G2A  0.355
6       CG          M1C A4K  0.722
7       CC          G2A A4K  1.414
8       TC          G2A A4L  0.858
9       CT      M1C G2A A4K  3.237
10      TG      M1C G2A A4L  1.430
11      GA              M1C  0.230

Or the IC90s:

>>> model.icXX(variants_df, x=0.9, col='IC90').round(3)
   barcode aa_substitutions    IC90
0       AA                    0.464
1       AC              M1C   1.260
2       AG              G2A   1.831
3       AT              A4K   0.976
4       TA              A4L   0.782
5       CA          M1C G2A   2.853
6       CG          M1C A4K   4.176
7       CC          G2A A4K   7.473
8       TC          G2A A4L   4.532
9       CT      M1C G2A A4K  18.717
10      TG      M1C G2A A4L   9.532
11      GA              M1C   1.260

Or the fold change IC90s of all mutations: >>> model.mut_icXX_df( … x=0.9, … icXX_col=”IC90”, … log_fold_change_icXX_col=”log2_fold_change_IC90”, … ).round(2)

site wildtype mutant IC90 log2_fold_change_IC90

0 1 M C 1.26 1.44 1 1 M M 0.46 0.00 2 2 G A 1.83 1.98 3 2 G G 0.46 0.00 4 4 A A 0.46 0.00 5 4 A K 0.98 1.07 6 4 A L 0.78 0.76

Example

Initialize with escape_probs created above as data to fit. In order to do this, we need to change the name of the column with the predicted escape probs to just be escape probs as we are now assuming these are the real values:

>>> data_to_fit = (
...     escape_probs
...     .rename(columns={'predicted_prob_escape': 'prob_escape'})
... )
>>> model_data = Polyclonal(
...     data_to_fit=data_to_fit,
...     n_epitopes=2,
...     collapse_identical_variants="mean",
... )

The mutations are those in data_to_fit:

>>> model_data.mutations
('M1C', 'G2A', 'A4K', 'A4L')
>>> {key: int(val) for (key, val) in sorted(model_data.mutations_times_seen.items())}
{'A4K': 4, 'A4L': 3, 'G2A': 6, 'M1C': 6}

The activities are evenly spaced from 1 to 0, while the mutation escapes are all initialized to zero:

>>> model_data.curve_specs_df
  epitope  activity  hill_coefficient  non_neutralized_frac
0       1       1.0               1.0                   0.0
1       2       0.0               1.0                   0.0
>>> model_data.mut_escape_df
  epitope  site wildtype mutant mutation  escape  times_seen
0       1     1        M      C      M1C     0.0           6
1       1     2        G      A      G2A     0.0           6
2       1     4        A      K      A4K     0.0           4
3       1     4        A      L      A4L     0.0           3
4       2     1        M      C      M1C     0.0           6
5       2     2        G      A      G2A     0.0           6
6       2     4        A      K      A4K     0.0           4
7       2     4        A      L      A4L     0.0           3

You can initialize to random numbers by setting init_missing to seed (in this example we also don’t include all variants for one concentration):

>>> model_data2 = Polyclonal(
...     data_to_fit=data_to_fit.head(30),
...     n_epitopes=2,
...     init_missing=1,
...     collapse_identical_variants="mean",
... )
>>> model_data2.curve_specs_df.round(3)
  epitope  activity  hill_coefficient  non_neutralized_frac
0       1     0.417               1.0                   0.0
1       2     0.720               1.0                   0.0

You can set some or all mutation escapes to initial values:

>>> model_data3 = Polyclonal(
...     data_to_fit=data_to_fit,
...     activity_wt_df=activity_wt_df,
...     mut_escape_df=pd.DataFrame({'epitope': ['e1'],
...                                 'mutation': ['M1C'],
...                                 'escape': [4]}),
...     data_mut_escape_overlap='fill_to_data',
...     collapse_identical_variants="mean",
... )
>>> model_data3.mut_escape_df
  epitope  site wildtype mutant mutation  escape  times_seen
0      e1     1        M      C      M1C     4.0           6
1      e1     2        G      A      G2A     0.0           6
2      e1     4        A      K      A4K     0.0           4
3      e1     4        A      L      A4L     0.0           3
4      e2     1        M      C      M1C     0.0           6
5      e2     2        G      A      G2A     0.0           6
6      e2     4        A      K      A4K     0.0           4
7      e2     4        A      L      A4L     0.0           3

You can initialize sites to escape values via site_activity_df:

>>> model_data4 = Polyclonal(
...     data_to_fit=data_to_fit,
...     activity_wt_df=activity_wt_df,
...     site_escape_df=pd.DataFrame.from_records(
...         [('e1', 1, 1.0), ('e1', 4, 0.0),
...          ('e2', 1, 0.0), ('e2', 4, 2.0)],
...         columns=['epitope', 'site', 'escape'],
...     ),
...     data_mut_escape_overlap='fill_to_data',
...     collapse_identical_variants="mean",
... )
>>> model_data4.mut_escape_df
  epitope  site wildtype mutant mutation  escape  times_seen
0      e1     1        M      C      M1C     1.0           6
1      e1     2        G      A      G2A     0.0           6
2      e1     4        A      K      A4K     0.0           4
3      e1     4        A      L      A4L     0.0           3
4      e2     1        M      C      M1C     0.0           6
5      e2     2        G      A      G2A     0.0           6
6      e2     4        A      K      A4K     2.0           4
7      e2     4        A      L      A4L     2.0           3

Fit the data using Polyclonal.fit(), and make sure the new predicted escape probabilities are close to the real ones being fit. Reduce weight on regularization since there is so little data in this toy example:

>>> for m in [model_data, model_data2, model_data3, model_data4]:
...     opt_res = m.fit(
...         reg_escape_weight=0.001,
...         reg_spread_weight=0.001,
...         reg_activity_weight=0.0001,
...         reg_uniqueness2_weight=0,
...         fix_hill_coefficient=True,
...         fix_non_neutralized_frac=True,
...     )
...     pred_df = m.prob_escape(variants_df=data_to_fit)
...     if not numpy.allclose(pred_df['prob_escape'],
...                           pred_df['predicted_prob_escape'],
...                           atol=0.01):
...          raise ValueError(f"wrong predictions\n{pred_df}")
...     if not numpy.allclose(
...              activity_wt_df['activity'].sort_values(),
...              m.activity_wt_df['activity'].sort_values(),
...              atol=0.1,
...              ):
...          raise ValueError(f"wrong activities\n{m.activity_wt_df}")
...     if not numpy.allclose(
...              mut_escape_df['escape'].sort_values(),
...              m.mut_escape_df['escape'].sort_values(),
...              atol=0.05,
...              ):
...          raise ValueError(f"wrong escapes\n{m.mut_escape_df}")

The map on next line is to make doctest pass even with negative zeros

>>> (
...     model_data.mut_escape_site_summary_df()
...     .round(1).map(lambda x: 0.0 if x == 0 else x)
... )
  epitope  site wildtype  mean  total positive  max  min  total negative  n mutations
0       1     1        M   0.0             0.0  0.0  0.0             0.0            1
1       1     2        G   0.0             0.0  0.0  0.0             0.0            1
2       1     4        A   2.0             4.0  2.5  1.5             0.0            2
3       2     1        M   2.0             2.0  2.0  2.0             0.0            1
4       2     2        G   3.0             3.0  3.0  3.0             0.0            1
5       2     4        A   0.0             0.0  0.0  0.0             0.0            2

You can also exclude mutations to specific characters (typically you would want to do this for stop codons and/or gaps):

>>> (
...     model_data.mut_escape_site_summary_df(exclude_chars={"C", "K"})
...     .round(1).map(lambda x: 0.0 if x == 0 else x)
... )
  epitope  site wildtype  mean  total positive  max  min  total negative  n mutations
0       1     2        G   0.0             0.0  0.0  0.0             0.0            1
1       1     4        A   1.5             1.5  1.5  1.5             0.0            1
2       2     2        G   3.0             3.0  3.0  3.0             0.0            1
3       2     4        A   0.0             0.0  0.0  0.0             0.0            1

Example

You can convert a Polyclonal model into a site-level model via the transformation of polyclonal.utils.site_level_variants(). The site-level model is another Polyclonal model that just keeps track of whether or not sites are mutated using a 2-letter wildtype/mutant alphabet, and is generated using Polyclonal.site_level_model():

>>> model_site = model_data4.site_level_model()
>>> model_site.alphabet
('w', 'm')
>>> (model_site.mut_escape_df
...  .assign(escape=lambda x: x['escape'].abs()).round(1))
  epitope  site wildtype mutant mutation  escape  times_seen
0      e1     1        w      m      w1m     2.0           5
1      e1     2        w      m      w2m     3.0           6
2      e1     4        w      m      w4m     0.0           7
3      e2     1        w      m      w1m     0.0           5
4      e2     2        w      m      w2m     0.0           6
5      e2     4        w      m      w4m     2.0           7
>>> model_site.data_to_fit.head(n=5).round(3)
   concentration aa_substitutions  weight  prob_escape
0            1.0                        1        0.032
1            1.0              w1m       1        0.134
2            1.0          w1m w2m       1        0.256
3            1.0      w1m w2m w4m       2        0.686
4            1.0          w1m w4m       1        0.409

Example

Epitope assignments are arbitrary, so you can calculate correlations among their mutation-escape values and create harmonized models that use the same label to refer to epitopes with similar mutation-escape values.

First, correlations of a model to itself:

>>> ref_model = copy.deepcopy(model)
>>> model.mut_escape_corr(ref_model).round(3)
  ref_epitope self_epitope  correlation
0          e1           e1        1.000
1          e1           e2       -0.907
2          e2           e1       -0.907
3          e2           e2        1.000

Make another model with epitope assignments inverted and a few mutations missing:

>>> inverted_model = Polyclonal(
...     activity_wt_df=(
...         activity_wt_df
...         .assign(epitope=lambda x: x["epitope"].map({"e1": "e2", "e2": "e1"}))
...     ),
...     mut_escape_df=(
...         mut_escape_df
...         .query("mutation != 'A4K'")
...         .assign(epitope=lambda x: x["epitope"].map({"e1": "e2", "e2": "e1"}))
...     ),
... )
>>> inverted_model.mut_escape_corr(ref_model).round(3)
  ref_epitope self_epitope  correlation
0          e1           e1       -0.945
1          e1           e2        1.000
2          e2           e1        1.000
3          e2           e2       -0.945

Now actually get epitope-harmonized models:

>>> model_harmonized, harmonize_df = model.epitope_harmonized_model(ref_model)
>>> harmonize_df
  self_initial_epitope self_harmonized_epitope ref_epitope  correlation
0                   e1                      e1          e1          1.0
1                   e2                      e2          e2          1.0
>>> assert model.mut_escape_df.equals(model_harmonized.mut_escape_df)
>>> inverted_harmonized, harmonize_df = inverted_model.epitope_harmonized_model(
...     ref_model
... )
>>> harmonize_df
  self_initial_epitope self_harmonized_epitope ref_epitope  correlation
0                   e1                      e2          e2          1.0
1                   e2                      e1          e1          1.0
>>> inverted_harmonized.mut_escape_df
  epitope  site wildtype mutant mutation  escape
0      e1     1        M      C      M1C     2.0
1      e1     2        G      A      G2A     3.0
2      e1     4        A      L      A4L     0.0
3      e2     1        M      C      M1C     0.0
4      e2     2        G      A      G2A     0.0
5      e2     4        A      L      A4L     1.5

Example

Filter variants by how often they are seen in data:

>>> model_data.filter_variants_by_seen_muts(variants_df)
... 
   barcode aa_substitutions
0       AA
1       AC              M1C
2       AG              G2A
3       AT              A4K
4       TA              A4L
5       CA          M1C G2A
6       CG          M1C A4K
7       CC          G2A A4K
8       TC          G2A A4L
9       CT      M1C G2A A4K
10      TG      M1C G2A A4L
11      GA              M1C
>>> model_data.filter_variants_by_seen_muts(variants_df, min_times_seen=5)
... 
  barcode aa_substitutions
0      AA
1      AC              M1C
2      AG              G2A
3      CA          M1C G2A
4      GA              M1C
>>> model_data.filter_variants_by_seen_muts(variants_df, min_times_seen=4)
... 
  barcode aa_substitutions
0      AA
1      AC              M1C
2      AG              G2A
3      AT              A4K
4      CA          M1C G2A
5      CG          M1C A4K
6      CC          G2A A4K
7      CT      M1C G2A A4K
8      GA              M1C

Example

Add spatial distances:

>>> spatial_distances = pd.DataFrame(
...     [(1, 2, 1.5), (1, 4, 7.5)],
...     columns=["site_1", "site_2", "distance"],
... )
>>> model_spatial = Polyclonal(
...     data_to_fit=data_to_fit,
...     n_epitopes=2,
...     spatial_distances=spatial_distances,
...     collapse_identical_variants="mean",
... )
>>> model_spatial.distance_matrix
     1     2     4
1  0.0   1.5   7.5
2  1.5   0.0  <NA>
4  7.5  <NA>   0.0

Fit with lower regularization as data so small

>>> opt_res = model_spatial.fit(
...     reg_escape_weight=0.001,
...     reg_spread_weight=0.001,
...     reg_activity_weight=0.0001,
...     reg_spatial_weight=0.001,
...     reg_spatial2_weight=0.0001,
...     reg_uniqueness2_weight=0,
... )
>>> pred_df = model_spatial.prob_escape(variants_df=data_to_fit)
>>> if not numpy.allclose(
...     pred_df['prob_escape'],
...     pred_df['predicted_prob_escape'],
...     atol=0.01,
... ):
...     raise ValueError(f"wrong predictions\n{pred_df}")
>>> if not numpy.allclose(
...      activity_wt_df['activity'].sort_values(),
...      m.activity_wt_df['activity'].sort_values(),
...      atol=0.1,
... ):
...     raise ValueError(f"wrong activities\n{m.activity_wt_df}")
>>> if not numpy.allclose(
...      mut_escape_df['escape'].sort_values(),
...      m.mut_escape_df['escape'].sort_values(),
...      atol=0.05,
... ):
...     raise ValueError(f"wrong escapes\n{m.mut_escape_df}")
DEFAULT_SCIPY_MINIMIZE_KWARGS = {'method': 'L-BFGS-B', 'options': {'ftol': 1e-07, 'maxfun': 10000000.0, 'maxiter': 1000000.0}}

default scipy_minimize_kwargs to fit.

Type:

frozendict.frozendict

activity_wt_barplot(**kwargs)[source]

Bar plot of activity against each epitope, \(a_{\rm{wt},e}\).

Note

Consider Polyclonal.curves_plot() as better way to show similar data.

Parameters:

**kwargs – Keyword args for polyclonal.plot.activity_wt_barplot().

Returns:

Interactive plot.

Return type:

altair.Chart

property activity_wt_df

Activities \(a_{\rm{wt,e}}\) for epitopes.

Type:

pandas.DataFrame

property curve_specs_df

activities, Hill coefficients, and non-neutralized fracs.

Type:

pandas.DataFrame

curves_plot(**kwargs)[source]

Plot neutralization / binding curve for unmutated protein at each epitope.

This curve effectively illustrates the epitope activity, Hill curve coefficient, and non-neutralizable fraction.

Parameters:

**kwargs – Keyword args for polyclonal.plot.curves_plot().

Returns:

Interactive plot

Return type:

altair.Chart

epitope_harmonized_model(ref_poly)[source]

Get a copy of model with epitopes “harmonized” with a reference model.

Epitopes are unidentifiable, meaning there is no guarantee that we will assign the same epitope labels across multiple models fit to similar data. This function returns a copy of the current model where the epitope labels are harmonized with those of another model with the same epitope labels. The harmonization is done by putting into correspondence the epitopes with the best correlated mutation-escape values.

Parameters:

ref_poly (Polyclonal) – The reference polyclonal object to harmonize epitope labels with.

Returns:

harmonized_model, harmonize_dfharmonized_model is a Polyclonal object that is a copy of the current model but with epitopes harmonized, and Polyclonal.epitope_colors taken from ref_poly. harmonize_df is a pandas.DataFrame that shows how the epitopes were re-assigned and the correlations in their escape values.

Return type:

tuple

Raises:

PolyclonalHarmonizeError – Raise this error if epitopes cannot be harmonized one-to-one.

filter_variants_by_seen_muts(variants_df, min_times_seen=1, subs_col='aa_substitutions')[source]

Remove variants that contain mutations not seen during model fitting.

Parameters:
  • variants_df (pandas.DataFrame) – Contains variants as rows.

  • min_times_seen (int) – Require mutations to be seen >= this many times in data used to fit model.

  • subs_col (str) – Column in variants_df with mutations in each variant.

Returns:

variants_df – Copy of input dataframe, with rows of variants that have unseen mutations removed.

Return type:

pandas.DataFrame

fit(*, loss_delta=0.1, reg_escape_weight=0.05, reg_escape_delta=0.1, reg_spread_weight=0.1, site_avg_abs_escape_epsilon=0.1, reg_spatial_weight=0.0, reg_spatial2_weight=0.0001, reg_uniqueness_weight=0, reg_uniqueness2_weight=0.1, reg_activity_weight=2.0, reg_activity_delta=0.1, reg_hill_coefficient_weight=25.0, reg_non_neutralized_frac_weight=1000.0, fit_site_level_first=True, scipy_minimize_kwargs={'method': 'L-BFGS-B', 'options': {'ftol': 1e-07, 'maxfun': 10000000.0, 'maxiter': 1000000.0}}, log=None, logfreq=None, fix_hill_coefficient=False, fix_non_neutralized_frac=False, activity_bounds=(None, None), hill_coefficient_bounds=(1e-08, None), non_neutralized_frac_bounds=(0, 0.5), beta_bounds=(None, None), fit_fixed_first=True, fit_fixed_first_reg_activity_weight=10.0, log_desc='')[source]

Fit parameters (activities and mutation escapes) to the data.

Requires Polyclonal.data_to_fit be set at initialization of this Polyclonal object. After calling this method, the \(a_{\rm{wt},e}\) and \(\beta_{m,e}\) have been optimized, and can be accessed using other methods of the Polyclonal object.

Parameters:
  • loss_delta (float) – Pseudo-Huber \(\delta\) parameter for loss on \(p_v\left(c\right)\) fitting.

  • reg_escape_weight (float) – Strength of Pseudo-Huber regularization on \(\beta_{m,e}\).

  • reg_escape_delta (float) – Pseudo-Huber \(\delta\) for regularizing \(\beta_{m,e}\).

  • reg_spread_weight (float) – Strength of regularization on variance of \(\beta_{m,e}\) values at each site.

  • site_avg_abs_escape_epsilon (float) – The epsilon value used when computing a differentiable measure of the average absolute value of escape at a site for each epitope.

  • reg_spatial_weight (float) – Strength of regularization of spatial distance between \(\beta_{m,e}\) values at each site. Only meaningful if Polyclonal.distance_matrix is not None.

  • reg_spatial2_weight (float) – Strength of regularization of squared spatial distance between \(\beta_{m,e}\) values at each site. Only meaningful if Polyclonal.distance_matrix is not None.

  • reg_uniqueness_weight (float) – Strength of regularization on epitope uniqueness.

  • reg_uniqueness2_weight (float) – Strength of regularization on uniqueness of squared escape. values at each site across epitopes.

  • site_avg_abs_escape_epsilon – Epsilon value for caclulating site average absolute escape.

  • reg_activity_weight (float) – Strength of Pseudo-Huber regularization on \(a_{\rm{wt},e}\).

  • reg_activity_delta (float) – Pseudo-Huber \(\delta\) for regularizing \(a_{\rm{wt},e}\).

  • reg_hill_coefficient_weight (float) – Weight of regularization on Hill coefficients \(n_e\) that differ from 1.

  • reg_non_neutralized_frac_weight (float) – Weight of regularization on non-neutralized fractions \(t_e\) that differ from 0.

  • fit_site_level_first (bool) – First fit a site-level model, then use those activities / escapes to initialize fit of this model. Generally works better.

  • scipy_minimize_kwargs (dict) – Keyword arguments passed to scipy.optimize.minimize.

  • log (None or writable file-like object) – Where to log output. If None, use sys.stdout.

  • logfreq (None or int) – How frequently to write updates on fitting to log.

  • fix_hill_coefficient (bool) – Do not optimize the hill coefficients \(n_e\) and instead keep fixed at current values.

  • fix_non_neutralized_frac (bool) – Do not optimize the non-neutralized fractions \(t_e\) and instead keep fixed at current values.

  • activity_bounds (2-tuple) – When optimizing activities, keep in these bounds. Set to (None, None) for no bounds.

  • hill_coefficient_bounds (2-tuple) – When optimizing Hill coefficient, keep in these bounds.

  • non_neutralized_frac_bounds (2-tuple) – When optimizing non-neutralized fraction, keep in these bounds.

  • beta_bounds (2-tuple) – When optimizing escape values (\(\beta_m\)), keep in these bounds.

  • fit_fixed_first (bool) – In fitting, if either the Hill coefficient or non-neutralized fraction are free (either fix_non_neutralized_frac or fix_hill_coefficient is False) then first fit a model with both of these fixed. After fitting that model, use its values to then fit with these free. The site model (if being fit, see fit_site_level_first) is then only fit in the initial fitting with the Hill coefficient and non-neutralized fraction fixed.

  • fit_fixed_first_reg_activity_weight (float) – The activity regularization if first fitting a fixed model via fit_fixed_first. It can be helpful to have this > reg_activity_weight as it avoids epitopes getting activities so negative that they are lost in later fitting.

  • log_desc (str) – A description included on the logging string header.

Return type:

scipy.optimize.OptimizeResult

property hill_coefficient_df

Hill coefficients \(n_e\) for epitopes.

Type:

pandas.DataFrame

icXX(variants_df, *, x=0.5, col='IC50', min_c=1e-08, max_c=100000000.0)[source]

Concentration at which a given fraction is neutralized (eg, IC50).

Parameters:
  • variants_df (pandas.DataFrame) – Data frame defining variants. Should have column named ‘aa_substitutions’ that defines variants as space-delimited strings of substitutions (e.g., ‘M1A K3T’).

  • x (float) – Compute concentration at which this fraction is neutralized for each variant. So set to 0.5 for IC50, and 0.9 for IC90.

  • col (str) – Name of column in returned data frame with the ICXX value.

  • min_c (float) – Minimum allowed icXX, truncate values < this at this.

  • max_c (float) – Maximum allowed icXX, truncate values > this at this.

Returns:

Copy of variants_df with added column col containing icXX.

Return type:

pandas.DataFrame

mut_escape_corr(ref_poly)[source]

Correlation of mutation-escape values with another model.

For each epitope, how well is this model’s mutation-escape values correlation with another model?

Mutations present in only one model are ignored.

Parameters:

ref_poly (Polyclonal) – Other (reference) polyclonal model with which we calculate correlations.

Returns:

corr_df – Pairwise epitope correlations for escape.

Return type:

pandas.DataFrame

property mut_escape_df

Escape \(\beta_{m,e}\) for each mutation.

Type:

pandas.DataFrame

mut_escape_pdb_b_factor(*, input_pdbfile, chains, metric, outdir=None, outfile='{metric}-{epitope}.pdb', missing_metric=0, min_times_seen=1)[source]

Create PDB files with B factors from a site’s mutation escape.

Parameters:
  • input_pdbfile (str) – Path to input PDB file.

  • chains (str or array-like) – Single chain or list of them to re-color.

  • metric (str) – Which site-level summary metric to use. Can be any metric in Polyclonal.mut_escape_site_summary_df().

  • outdir (str) – Output directory for created PDB files.

  • outfile (str) – Output file name, with formatting used to replace metric and epitope in curly brackets.

  • missing_metric (float or dict) – How do we handle sites in PDB that are missing in escape metric? If a float, reassign B factors for all missing sites to this value. If a dict, should be keyed by chain and assign all missing sites in each chain to indicated value.

  • min_times_seen (int) – Value passed to Polyclonal.mut_escape_site_summary_df().

Returns:

Gives name of created B-factor re-colored PDB for each epitope.

Return type:

pandas.DataFrame

mut_escape_plot(*, biochem_order_aas=True, prefix_epitope=None, df_to_merge=None, **kwargs)[source]

Make plot of the mutation escape values, \(\beta_{m,e}\).

Parameters:
Returns:

Interactive line plot and heatmap.

Return type:

altair.Chart

mut_escape_site_summary_df(*, min_times_seen=1, mutation_whitelist=None, exclude_chars=frozenset({'*'}))[source]

Site-level summaries of mutation escape.

Parameters:
  • min_times_seen (int) – Only include in summaries mutations seen in at least this many variants.

  • mutation_whitelist (None or set) – Only include in summaries these mutations.

  • exclude_chars (set or list) – Exclude mutations to these characters when calculating site summaries. Useful if you want to ignore stop codons (*), and perhaps in some cases also gaps (-).

Return type:

pandas.DataFrame

mut_icXX_df(*, x, icXX_col, log_fold_change_icXX_col, min_c=1e-08, max_c=100000000.0, logbase=2, check_wt_icXX=(1e-05, 100000.0))[source]

Get data frame of ICXX and log fold change induced by each mutation.

Parameters:
  • x (float) – Compute this ICXX, so 0.5 means IC50 aand 0.9 means IC90.

  • icXX_col (str) – Name of column with ICXX values.

  • log_fold_change_icXX_col (str) – Name of column with log fold change ICXX values.

  • min_c (float) – Minimum allowed icXX, truncate values < this at this.

  • max_c (float) – Maximum allowed icXX, truncate values > this at this.

  • logbase (float) – Compute log fold-change in ICXX to this base.

  • check_wt_icXX (None or 2-tuple) – If a 2-tuple, raise an error if the ICXX for the unmutated wildtype is outside the range (min_icXX, max_icXX). This check is because the clipping imposed by min_c and max_c will become a problem if the wildtype ICXX is very different than one. In general, concentration units should be used when fitting so that ICXX for wildtype is within a orders of magnitude of 1.

Returns:

Has columns “site”, “wildtype”, “mutant”, icXX_col, log_fold_change_icXX_col, and “times_seen” (if available). Note that the dataframe does include entry for the wildtype (wildtype same as mutant) at each site with a non-wildtype mutation.

Return type:

pandas.DataFrame

mut_icXX_plot(*, x=0.9, icXX_col='IC90', log_fold_change_icXX_col='log2 fold change IC90', min_c=1e-08, max_c=100000000.0, logbase=2, check_wt_icXX=(1e-05, 100000.0), biochem_order_aas=True, df_to_merge=None, positive_color='#0072B2', negative_color='#E69F00', **kwargs)[source]

Make plot of log fold-change in ICXX induced by each mutation.

Parameters:
Returns:

Interactive line plot and heatmap.

Return type:

altair.Chart

property non_neutralized_frac_df

non-neutralizable fractions \(t_e\) for epitopes.

Type:

pandas.DataFrame

prob_escape(*, variants_df, concentrations=None)[source]

Compute predicted probability of escape \(p_v\left(c\right)\).

Computed using current mutation-escape values \(\beta_{m,e}\) and epitope activities \(a_{\rm{wt},e}\) stored in this Polyclonal object.

Parameters:
  • variants_df (pandas.DataFrame) – Input data frame defining variants. Should have a column named ‘aa_substitutions’ that defines variants as space-delimited strings of substitutions (e.g., ‘M1A K3T’). Should also have a column ‘concentration’ if concentrations=None.

  • concentrations (array-like or None) – Concentrations at which we compute probability of escape.

Returns:

Version of variants_df with columns named ‘concentration’ and ‘predicted_prob_escape’ giving predicted probability of escape \(p_v\left(c\right)\) for each variant at each concentration.

Return type:

pandas.DataFrame

site_level_model(*, aggregate_mut_escapes='mean', collapse_identical_variants='mean')[source]

Model with mutations collapsed at site level.

Parameters:
  • aggregate_mut_escapes ({'mean'}) – How to aggregate mutation-level escape values to site-level ones in mut_escape_df.

  • collapse_identical_variants ({"mean", "median", False}) – Same meaning as for Polyclonal initialization.

Return type:

Polyclonal

exception polyclonal.polyclonal.PolyclonalConcentrationError[source]

Bases: Exception

Error in concentrations passed to Polyclonal in data_to_fit.

exception polyclonal.polyclonal.PolyclonalFitError[source]

Bases: Exception

Error fitting in Polyclonal.fit().

exception polyclonal.polyclonal.PolyclonalHarmonizeError[source]

Bases: Exception

Error harmonizing epitopes in Polyclonal.epitope_harmonized_model().