polyclonal_collection

Defines PolyclonalCollection for handling collections of multiple Polyclonal objects.

PolyclonalCollection is a base class for the following specific use-case classes:

  • PolyclonalAverage

  • PolyclonalBootstrap

class polyclonal.polyclonal_collection.PolyclonalAverage(models_df, *, region_col=None, harmonize_to=None, default_avg_to_plot='median')[source]

Bases: PolyclonalCollection

Average several Polyclonal objects.

Parameters:
  • models_df (pandas.DataFrame) – Same meaning as for PolyclonalCollection. However, the resulting collection of models will have copies of these models rather than the actual objects in models_df.

  • region_col (str or None) – Same meaning as for PolyclonalCollection.

  • harmonize_to (PolyclonalCollection or None) – When harmonizing the epitopes, harmonize to this model. If None, just harmonize to the first model in models_df.

  • default_avg_to_plot ({"mean", "median"}) – What type of average do the plotting methods plot by default?

Other attributes of PolyclonalCollection.

Inherited from base class.

class polyclonal.polyclonal_collection.PolyclonalBootstrap(root_polyclonal, n_bootstrap_samples, *, n_threads=-1, seed=0, sample_by='barcode', default_avg_to_plot='mean')[source]

Bases: PolyclonalCollection

Bootstrap Polyclonal objects.

Parameters:
  • root_polyclonal (Polyclonal) – The Polyclonal object created with the full dataset, from which bootstrapped samples are drawn. The bootstrapped models are also initialized to the mutation effects and activities of this model, so it is highly recommended that this object has already been fit to the full dataset.

  • n_bootstrap_samples (int) – Number of bootstrapped Polyclonal models to fit.

  • seed (int) – Random seed for reproducibility.

  • n_threads (int) – Number of threads to use for multiprocessing, -1 means all available.

  • sample_by (str or None) – Passed to create_bootstrap_sample(). Should generally be ‘barcode’ if you have the same variants at all concentrations, and probably None otherwise.

  • default_avg_to_plot ({"mean", "median"}) – What type of average do the plotting methods plot by default?

root_polyclonal

The root polyclonal object passed as a parameter.

Type:

Polyclonal

n_threads

Number of threads for multiprocessing.

Type:

int

Other attributes of PolyclonalCollection.

Inherited from base class.

fit_models(failures='error', **kwargs)[source]

Fits bootstrapped Polyclonal models.

The fit models will then be in PolyclonalCollection.models, with any models that fail fitting set to None. Their epitopes will also be harmonized with PolyclonalBootstrap.root_polyclonal.

Parameters:
  • failures ({"error", "tolerate"}) – Whether to tolerate failures in model fitting or raise an error when a fit fails. An error is always raised if all models fail.

  • **kwargs – Keyword arguments for polyclonal.polyclonal.Polyclonal.fit(). If not specified otherwise, fit_site_level_first is set to False, since models are initialized to “good” values from the root object.

Returns:

Number of model fits that succeeded and failed.

Return type:

(n_fit, n_failed)
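
The default handling of fit_site_level_first described above can be sketched as follows. This is only an illustration of the "default unless specified otherwise" behavior, not the library's code; resolve_fit_kwargs and other_option are hypothetical names.

```python
def resolve_fit_kwargs(**kwargs):
    """Apply the documented default unless the caller overrides it.

    Hypothetical helper: it only shows how a keyword default can be
    filled in before the arguments are forwarded to a fit method.
    """
    # Only set the value if the caller did not pass it explicitly.
    kwargs.setdefault("fit_site_level_first", False)
    return kwargs


# No override: the default is filled in.
print(resolve_fit_kwargs())
# Explicit override wins; other (hypothetical) options pass through unchanged.
print(resolve_fit_kwargs(fit_site_level_first=True, other_option=1))
```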

class polyclonal.polyclonal_collection.PolyclonalCollection(models_df, *, default_avg_to_plot, region_col=None)[source]

Bases: object

Handle a collection of Polyclonal objects.

Parameters:
  • models_df (pandas.DataFrame) – Data frame of models. Should have one column named “model” that holds the Polyclonal models; all other columns are descriptors for the models (e.g., “replicate”). The combination of descriptors in each row must be unique.

  • default_avg_to_plot ({"mean", "median"}) – By default when plotting, plot either “mean” or “median”.

  • region_col (None or str) – Use this option if you want to include only sites in a specific region of the protein for specific models (this is useful, for instance, if you split the protein into halves in two different libraries). In this case, region_col should be a column in models_df whose values are the list of sites to use for that specific model.

models

List of the models in models_df. All models must have same epitopes.

Type:

list

model_descriptors

A list of the same length as models, with each entry being a dict keyed by descriptor label with the descriptor for that model as the value. All models must have the same descriptor labels. E.g., [{"replicate": 1}, {"replicate": 2}]. The descriptor labels are all columns in models_df except the one named “model”.

Type:

list

descriptor_names

The names that key the entries in PolyclonalCollection.model_descriptors.

Type:

list

unique_descriptor_names

Names of descriptors in PolyclonalCollection.descriptor_names that are not shared across all models.

Type:

list
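
As a rough illustration of how the descriptor attributes relate to one another, the sketch below derives descriptor_names and unique_descriptor_names from a list of descriptor dicts. The descriptor values are invented for the example, and the code is not taken from the library.

```python
# Hypothetical descriptor records, one dict per model (invented values).
model_descriptors = [
    {"replicate": 1, "library": "A", "antibody": "mAb1"},
    {"replicate": 2, "library": "A", "antibody": "mAb1"},
    {"replicate": 1, "library": "B", "antibody": "mAb1"},
]

# All models must share the same descriptor labels.
descriptor_names = list(model_descriptors[0])
assert all(list(d) == descriptor_names for d in model_descriptors)

# The combination of descriptor values must be unique for each model.
rows = [tuple(d[name] for name in descriptor_names) for d in model_descriptors]
assert len(set(rows)) == len(rows)

# Descriptors that take more than one value across the models.
unique_descriptor_names = [
    name
    for name in descriptor_names
    if len({d[name] for d in model_descriptors}) > 1
]
print(unique_descriptor_names)
```

Here "antibody" is shared by all models, so it is a descriptor name but not a unique descriptor name.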

epitopes

Same meaning as for Polyclonal.epitopes, extracted from PolyclonalCollection.models.

Type:

tuple

epitope_colors

Same meaning as for Polyclonal.epitope_colors, extracted from PolyclonalCollection.models.

Type:

dict

alphabet

Same meaning as for Polyclonal.alphabet, extracted from PolyclonalCollection.models.

Type:

array-like

sequential_integer_sites

Same meaning as for Polyclonal.sequential_integer_sites, extracted from PolyclonalCollection.models.

Type:

bool

sites

All sites for which the model is defined.

Type:

tuple

default_avg_to_plot

By default when plotting, plot either “mean” or “median”.

Type:

{“mean”, “median”}

regions

List of same length as PolyclonalCollection.models with each entry being the set of sites that are being used in returned results for that model. If region_col is None, this is all sites (PolyclonalCollection.sites), but if region_col is used to define regions for different models then the different sets of sites may differ for models.

Type:

list

n_models_by_site

Keyed by each site in PolyclonalCollection.sites, with the value being the number of models for which that site is in region for that model (this will just be the number of models when not using region_col).

Type:

dict
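
The relationship between sites, regions, and n_models_by_site can be sketched in plain Python. This is a minimal sketch under the definitions above, not the library's implementation; regions_and_counts is a hypothetical helper, and the input values mirror the toy example in this section.

```python
from collections import Counter


def regions_and_counts(sites, region_lists, n_models):
    """Return (regions, n_models_by_site) for a collection.

    sites: all sites for which the models are defined.
    region_lists: per-model site lists (what a region_col column would
        hold), or None to mimic region_col=None.
    n_models: number of models in the collection.
    """
    if region_lists is None:
        # Without region_col, every model uses all sites.
        regions = [set(sites) for _ in range(n_models)]
    else:
        regions = [set(region) for region in region_lists]
    # Count in how many models' regions each site appears.
    counts = Counter(site for region in regions for site in region)
    return regions, {site: counts[site] for site in sites}


regions, n_models_by_site = regions_and_counts((3, 5, 6), [[3, 5], [3, 5, 6]], 2)
print(regions)
print(n_models_by_site)
```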

Example

Create a toy example collection of two models built from slightly different data.

>>> data_to_fit = pd.DataFrame.from_records(
...     [("M3A", 1, 0), ("K5G", 1, 1)],
...     columns=["aa_substitutions", "concentration", "prob_escape"],
... )
>>> data_to_fit2 = pd.DataFrame.from_records(
...     [("M3A", 1, 0), ("K5G", 1, 1), ("L6T", 1, 0.5)],
...     columns=["aa_substitutions", "concentration", "prob_escape"],
... )
>>> models_df = pd.DataFrame(
...     {
...         "model": [
...             polyclonal.Polyclonal(data_to_fit=data_to_fit, n_epitopes=1),
...             polyclonal.Polyclonal(data_to_fit=data_to_fit2, n_epitopes=1),
...         ],
...         "description": ["model_1", "model_2"],
...     }
... )
>>> model_collection = polyclonal.PolyclonalCollection(
...     models_df, default_avg_to_plot="mean",
... )
>>> model_collection.sites
(3, 5, 6)
>>> model_collection.regions == [{3, 5, 6}, {3, 5, 6}]
True
>>> model_collection.n_models_by_site
{3: 2, 5: 2, 6: 2}

Now create a toy example with different regions for each model:

>>> models_df = pd.DataFrame(
...     {
...         "model": [
...             polyclonal.Polyclonal(data_to_fit=data_to_fit, n_epitopes=1),
...             polyclonal.Polyclonal(data_to_fit=data_to_fit2, n_epitopes=1),
...         ],
...         "description": ["model_1", "model_2"],
...         "region": [[3, 5], [3, 5, 6]],
...     }
... )
>>> model_region = polyclonal.PolyclonalCollection(
...     models_df, default_avg_to_plot="mean", region_col="region",
... )
>>> model_region.sites
(3, 5, 6)
>>> assert model_region.regions == [{3, 5}, {3, 5, 6}], model_region.regions
>>> model_region.n_models_by_site
{3: 2, 5: 2, 6: 1}

activity_wt_barplot(avg_type=None, **kwargs)[source]

Bar plot of epitope activities averaged across models.

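Parameters:
  • avg_type ({"mean", "median", None}) – Type of average to plot, None defaults to PolyclonalCollection.default_avg_to_plot.

  • **kwargs – Keyword args passed on to the underlying bar-plot function.

Returns: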

Interactive plot, with error bars showing standard deviation.

Return type:

altair.Chart

property activity_wt_df

Epitope activities summarized across models.

Type:

pandas.DataFrame

property activity_wt_df_replicates

Epitope activities for all models.

Type:

pandas.DataFrame

property curve_specs_df

Activities, Hill coefficients, and non-neutralized fractions.

Values summarized across models.

Type:

pandas.DataFrame

property curve_specs_df_replicates

Activities, Hill coefficients, and non-neutralized fractions.

Per-replicate values.

Type:

pandas.DataFrame

curves_plot(*, avg_type=None, per_model_lines=5, **kwargs)[source]

Plot neutralization / binding curve for unmutated protein at each epitope.

This curve effectively illustrates the epitope activity, Hill curve coefficient, and non-neutralizable fraction.

Parameters:
  • avg_type ({"mean", "median", None}) – Type of average to plot, None defaults to PolyclonalCollection.default_avg_to_plot.

  • per_model_lines (int) – Do we plot thin lines for each model, or just the average? If the number of models in the collection is <= this number, then we plot per-model lines; otherwise we just plot the average. A value of -1 means we always plot per-model lines.

  • **kwargs – Keyword args for polyclonal.plot.curves_plot()

Returns:

Interactive plot.

Return type:

altair.Chart
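
The per_model_lines rule described in the parameters above amounts to a simple threshold check. A minimal sketch, with plot_per_model_lines as a hypothetical helper name:

```python
def plot_per_model_lines(n_models: int, per_model_lines: int = 5) -> bool:
    """Return True if thin per-model lines should be drawn.

    Follows the documented rule: -1 always draws per-model lines,
    otherwise they are drawn only when the collection has at most
    per_model_lines models.
    """
    if per_model_lines == -1:
        return True
    return n_models <= per_model_lines


print(plot_per_model_lines(3))      # small collection
print(plot_per_model_lines(20))     # large collection, average only
print(plot_per_model_lines(20, -1)) # forced per-model lines
```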

property hill_coefficient_df

Hill coefficients summarized across models.

Type:

pandas.DataFrame

property hill_coefficient_df_replicates

Hill coefficients for all models.

Type:

pandas.DataFrame

icXX(variants_df, **kwargs)[source]

Predicted concentration at which a variant is neutralized across all models.

Parameters:
  • variants_df (pandas.DataFrame) – Data frame defining variants. Should have column named ‘aa_substitutions’ that defines variants as space-delimited strings of substitutions (e.g., ‘M1A K3T’).

  • **kwargs (Dictionary) – Keyword args for polyclonal.polyclonal.Polyclonal.icXX()

Returns:

De-duplicated copy of variants_df with added column col containing icXX and summary stats for each variant across all models.

Return type:

pandas.DataFrame

icXX_replicates(variants_df, **kwargs)[source]

Concentration at which a given fraction is neutralized (e.g., IC50) for all models.

Parameters:
  • variants_df (pandas.DataFrame) – Data frame defining variants. Should have column named ‘aa_substitutions’ that defines variants as space-delimited strings of substitutions (e.g., ‘M1A K3T’).

  • **kwargs (Dictionary) – Keyword args for polyclonal.polyclonal.Polyclonal.icXX()

Returns:

Copy of variants_df with added column col containing icXX, and model descriptors. Variants with a mutation lacking in a particular model are missing in that row.

Return type:

pandas.DataFrame

mut_escape_corr(method='pearson', min_times_seen=1)[source]

Correlation of mutation escape values across models for each epitope.

Parameters:
  • method (str) – A correlation method passable to pandas.DataFrame.corr.

  • min_times_seen (int) – Only include mutations with a times_seen >= this value.

Returns:

Tidy data frame giving correlations between models for all epitopes. The models are labeled by their descriptors suffixed with “_1” and “_2” for the two models being compared.

Return type:

pandas.DataFrame
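
The idea behind the pairwise correlation can be sketched in plain Python for a single epitope. This is an illustrative sketch, not the library's implementation: the data are invented, the choice to apply min_times_seen to each model's own times_seen is an assumption, and only Pearson correlation is shown (the real method accepts any method understood by pandas.DataFrame.corr).

```python
import math
from itertools import combinations

# Hypothetical per-model values: model -> {mutation: (escape, times_seen)}
escape = {
    "rep1": {"M3A": (0.1, 3), "K5G": (2.0, 4), "L6T": (1.0, 1)},
    "rep2": {"M3A": (0.2, 2), "K5G": (4.0, 5), "L6T": (2.1, 2)},
}


def pearson(x, y):
    """Pearson correlation of two equal-length lists (nan if undefined)."""
    if len(x) < 2:
        return float("nan")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def escape_corr(escape, min_times_seen=1):
    """Correlate escape values for each unordered pair of models."""
    records = []
    for m1, m2 in combinations(sorted(escape), 2):
        # Keep mutations measured in both models and seen often enough
        # in each of them (assumption about how the filter is applied).
        shared = [
            mut
            for mut in escape[m1]
            if mut in escape[m2]
            and escape[m1][mut][1] >= min_times_seen
            and escape[m2][mut][1] >= min_times_seen
        ]
        x = [escape[m1][mut][0] for mut in shared]
        y = [escape[m2][mut][0] for mut in shared]
        records.append(
            {"model_1": m1, "model_2": m2, "n": len(shared),
             "correlation": pearson(x, y)}
        )
    return records


print(escape_corr(escape))
print(escape_corr(escape, min_times_seen=2))
```

Raising min_times_seen drops L6T here, because it was seen only once in the first model.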

mut_escape_corr_heatmap(method='pearson', min_times_seen=1, plot_corr2=True, **kwargs)[source]

Heatmap of mutation-escape correlation among models at each epitope.

Parameters:
  • method (str) – A correlation method passable to pandas.DataFrame.corr.

  • min_times_seen (int) – Only include mutations with a times_seen >= this value.

  • plot_corr2 (bool) – Plot squared correlation (e.g., \(R^2\) rather than \(R\)).

  • **kwargs – Keyword args for polyclonal.plot.corr_heatmap()

property mut_escape_df

Mutation escape summarized across models.

Type:

pandas.DataFrame

property mut_escape_df_replicates

Mutation escape by model.

Type:

pandas.DataFrame

property mut_escape_df_w_model_values

Summarized mutation escape plus per model values.

Like PolyclonalCollection.mut_escape_df but then having additional columns giving per-model escape.

Type:

pandas.DataFrame

mut_escape_plot(*, biochem_order_aas=True, avg_type=None, init_n_models=None, prefix_epitope=None, df_to_merge=None, per_model_tooltip=None, scale_stat_col=1, **kwargs)[source]

Make plot of mutation escape values.

Parameters:
  • biochem_order_aas (bool) – Biochemically order amino-acid alphabet PolyclonalCollection.alphabet by passing it through polyclonal.alphabets.biochem_order_aas().

  • avg_type ({"mean", "median", "min_magnitude", None}) – Type of average to plot, None defaults to PolyclonalCollection.default_avg_to_plot.

  • init_n_models (None or int) – Initially only show mutations found in at least this number of models in the collection. A value of None corresponds to choosing a value that is >= half the number of total replicates.

  • prefix_epitope (bool or None) – Do we add the prefix “epitope “ to the epitope labels? If None, do only if epitope is integer.

  • df_to_merge (None or pandas.DataFrame or list) – To include additional properties, specify data frame or list of them which are merged with Polyclonal.mut_escape_df before being passed to polyclonal.plot.lineplot_and_heatmap(). Properties will only be included in plot if relevant columns are passed to polyclonal.plot.lineplot_and_heatmap() via addtl_slider_stats, addtl_tooltip_stats, or site_zoom_bar_color_col.

  • per_model_tooltip (None or bool) – In the heatmap, do the tooltips report per-model escape values or the standard deviation across models? If None, report per-model values when there are <= 5 models and the standard deviation when there are > 5 models. If True, always report per-model values. If False, always report the standard deviation.

  • scale_stat_col (float) – Scale the escape values by this factor before plotting.

  • **kwargs – Keyword args for polyclonal.plot.lineplot_and_heatmap()

Returns:

Interactive heat maps and line plots.

Return type:

altair.Chart
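
The avg_type and per_model_tooltip options can be illustrated with two small helpers. Both are hypothetical sketches rather than library code. In particular, treating "min_magnitude" as the signed per-model value closest to zero is an assumption, since this page does not define it.

```python
import statistics


def average_escape(values, avg_type="median"):
    """Summarize one mutation's escape values across models."""
    if avg_type == "mean":
        return statistics.mean(values)
    if avg_type == "median":
        return statistics.median(values)
    if avg_type == "min_magnitude":
        # Assumption: keep the signed value with the smallest absolute value.
        return min(values, key=abs)
    raise ValueError("invalid avg_type: " + repr(avg_type))


def resolve_per_model_tooltip(per_model_tooltip, n_models):
    """Apply the documented rule: None means per-model values for <= 5 models."""
    if per_model_tooltip is None:
        return n_models <= 5
    return per_model_tooltip


values = [0.4, -0.1, 1.2]  # invented escape values from three models
for avg_type in ("mean", "median", "min_magnitude"):
    print(avg_type, average_escape(values, avg_type))
print(resolve_per_model_tooltip(None, 3), resolve_per_model_tooltip(None, 8))
```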

mut_escape_site_summary_df(**kwargs)[source]

Site-level summaries of mutation escape across models.

Parameters:

**kwargs – Keyword arguments to polyclonal.polyclonal.Polyclonal.mut_escape_site_summary_df(). In particular, you may want to use min_times_seen.

Returns:

The different site-summary metrics (‘mean’, ‘total positive’, etc) are in different rows for each site and epitope. The ‘frac_models’ column refers to models with measurements for any mutation at that site.

Return type:

pandas.DataFrame

mut_escape_site_summary_df_replicates(**kwargs)[source]

Site-level summaries of mutation escape for models.

Parameters:

**kwargs – Keyword arguments to polyclonal.polyclonal.Polyclonal.mut_escape_site_summary_df().

Return type:

pandas.DataFrame

mut_icXX_df(**kwargs)[source]

Get data frame of log fold change ICXX induced by each mutation.

Parameters:

**kwargs – Keyword arguments to Polyclonal.mut_icXX_df()

Returns:

Log fold change ICXX for each mutation.

Return type:

pandas.DataFrame
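
As a reminder of what a log fold change in ICXX means numerically, here is a small sketch. It assumes the fold change is taken relative to the ICXX of the unmutated protein and ignores any clipping of concentrations (such as the min_c and max_c options); log_fold_change_icxx is a hypothetical helper, not part of the library.

```python
import math


def log_fold_change_icxx(ic_mut, ic_wt, logbase=2):
    """Log fold change in ICXX of a mutant relative to the unmutated protein.

    Assumption: fold change is ic_mut / ic_wt, so positive values mean
    a higher concentration is needed to neutralize the mutant.
    """
    if ic_mut <= 0 or ic_wt <= 0:
        raise ValueError("ICXX values must be positive")
    return math.log(ic_mut / ic_wt, logbase)


# Invented values: a 4-fold increase and a 2-fold decrease in IC90.
print(log_fold_change_icxx(4.0, 1.0))
print(log_fold_change_icxx(0.5, 1.0))
```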

mut_icXX_df_replicates(**kwargs)[source]

Get data frame of ICXX and log fold change for each mutation by model.

Parameters:

**kwargs – Keyword arguments to Polyclonal.mut_icXX_df()

Returns:

Data frame of ICXX and log fold change for each model.

Return type:

pandas.DataFrame

mut_icXX_df_w_model_values(**kwargs)[source]

Log fold change ICXX induced by each mutation, plus per-model values.

Like PolyclonalCollection.mut_icXX_df but then having additional columns giving per-model ICXXs.

Parameters:

**kwargs – Keyword arguments to Polyclonal.mut_icXX_df()

Returns:

Log fold change ICXX for each mutation, plus per model values.

Return type:

pandas.DataFrame

mut_icXX_plot(*, x=0.9, icXX_col='IC90', log_fold_change_icXX_col='log2 fold change IC90', min_c=1e-08, max_c=100000000.0, logbase=2, check_wt_icXX=(1e-05, 100000.0), biochem_order_aas=True, df_to_merge=None, positive_color='#0072B2', negative_color='#E69F00', avg_type=None, init_n_models=None, per_model_tooltip=None, scale_stat_col=1, **kwargs)[source]

Parameters:
  • x (float) – Same meaning as for Polyclonal.mut_icXX_df().

  • icXX_col (str) – Same meaning as for Polyclonal.mut_icXX_df().

  • log_fold_change_icXX_col (str) – Same meaning as for Polyclonal.mut_icXX_df().

  • min_c (float) – Same meaning as for Polyclonal.mut_icXX_df().

  • max_c (float) – Same meaning as for Polyclonal.mut_icXX_df().

  • logbase (float) – Same meaning as for Polyclonal.mut_icXX_df().

  • check_wt_icXX (None or 2-tuple) – Same meaning as for Polyclonal.mut_icXX_df().

  • biochem_order_aas (bool) – Biochemically order the amino-acid alphabet in Polyclonal.alphabet by passing it through polyclonal.alphabets.biochem_order_aas().

  • df_to_merge (None or pandas.DataFrame or list) – To include additional properties, specify data frame or list of them which are merged with Polyclonal.mut_escape_df before being passed to polyclonal.plot.lineplot_and_heatmap(). Properties will only be included in plot if relevant columns are passed to polyclonal.plot.lineplot_and_heatmap() via addtl_slider_stats, addtl_tooltip_stats, or site_zoom_bar_color_col.

  • positive_color (str) – Color for positive log fold change in heatmap.

  • negative_color (str) – Color for negative log fold change in heatmap.

  • avg_type ({"mean", "median", "min_magnitude", None}) – Type of average to plot, None defaults to PolyclonalCollection.default_avg_to_plot.

  • init_n_models (None or int) – Initially only show mutations found in at least this number of models in the collection. A value of None corresponds to choosing a value that is >= half the number of total replicates.

  • per_model_tooltip (None or bool) – In the heatmap, do the tooltips report per-model escape values or the standard deviation across models? If None, report per-model values when there are <= 5 models and the standard deviation when there are > 5 models. If True, always report per-model values. If False, always report the standard deviation.

  • scale_stat_col (float) – Scale the escape values by this factor before plotting.

  • **kwargs – Keyword args for polyclonal.plot.lineplot_and_heatmap()

Returns:

Interactive heat map and line plot.

Return type:

altair.Chart

property non_neutralized_frac_df

Non-neutralizable fraction summarized across models.

Type:

pandas.DataFrame

property non_neutralized_frac_df_replicates

Non-neutralizable fraction for all models.

Type:

pandas.DataFrame

prob_escape(variants_df, **kwargs)[source]

Summary of predicted probability of escape across all models.

Parameters:
  • variants_df (pandas.DataFrame) – Input data frame defining variants. Should have a column named ‘aa_substitutions’ that defines variants as space-delimited strings of substitutions (e.g., ‘M1A K3T’). Should also have a column ‘concentration’ if concentrations=None.

  • **kwargs (Dictionary) – Keyword args for polyclonal.polyclonal.Polyclonal.prob_escape()

Returns:

De-duplicated copy of variants_df with columns named ‘concentration’ and ‘mean’, ‘median’, and ‘std’ giving corresponding summary stats of predicted probability of escape \(p_v\left(c\right)\) for each variant at each concentration across models.

Return type:

pandas.DataFrame
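
The summary step can be pictured as grouping per-model predictions by variant and concentration and then computing the statistics. The sketch below uses invented predictions and plain Python; the use of the sample standard deviation and the extra n_models field are assumptions made for illustration, not documented behavior.

```python
import statistics
from collections import defaultdict

# Hypothetical per-model predictions:
# (aa_substitutions, concentration, predicted_prob_escape)
per_model = [
    [("M1A", 1.0, 0.10), ("M1A K3T", 1.0, 0.60)],  # model 1
    [("M1A", 1.0, 0.20), ("M1A K3T", 1.0, 0.80)],  # model 2
    [("M1A", 1.0, 0.30)],  # model 3 lacks a mutation in "M1A K3T"
]

# Collect the predictions for each (variant, concentration) pair.
grouped = defaultdict(list)
for predictions in per_model:
    for variant, concentration, prob in predictions:
        grouped[(variant, concentration)].append(prob)

# Summary statistics across the models that made a prediction.
summary = {
    key: {
        "mean": statistics.mean(vals),
        "median": statistics.median(vals),
        # Assumption: sample standard deviation; undefined for one value.
        "std": statistics.stdev(vals) if len(vals) > 1 else float("nan"),
        "n_models": len(vals),  # hypothetical extra field
    }
    for key, vals in grouped.items()
}
for key, stats in summary.items():
    print(key, stats)
```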

prob_escape_replicates(variants_df, **kwargs)[source]

Compute predicted probability of escape \(p_v\left(c\right)\).

Uses all models to make predictions on variants_df.

Parameters:
  • variants_df (pandas.DataFrame) – Input data frame defining variants. Should have a column named ‘aa_substitutions’ that defines variants as space-delimited strings of substitutions (e.g., ‘M1A K3T’). Should also have a column ‘concentration’ if concentrations=None.

  • **kwargs (Dictionary) – Keyword args for polyclonal.polyclonal.Polyclonal.prob_escape()

Returns:

Version of variants_df with columns named ‘concentration’ and ‘predicted_prob_escape’ giving predicted probability of escape \(p_v\left(c\right)\) for each variant at each concentration and model. Variants with a mutation lacking in a particular model are missing in that row.

Return type:

pandas.DataFrame

exception polyclonal.polyclonal_collection.PolyclonalCollectionFitError[source]

Bases: Exception

Error fitting models.

polyclonal.polyclonal_collection.create_bootstrap_sample(df, seed=0, group_by_col='concentration', sample_by='barcode')[source]

Bootstrap sample of data frame.

Parameters:
  • df (pandas.DataFrame) – Data frame to be bootstrapped.

  • seed (int) – Random number seed.

  • group_by_col (str or None) – Group by this column and bootstrap each group separately.

  • sample_by (str or None) – For each group, sample the same entities in this column. Requires each group to have the same unique set of values in this column.

Returns:

bootstrap_df – Data frame with the same number of rows as df and the same number of samples per group_by_col.

Return type:

pandas.DataFrame
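
The grouping and sample_by logic can be sketched on a list of row dicts using only the standard library. This is a conceptual sketch, not the library's implementation: bootstrap_rows is a hypothetical helper, it assumes each entity appears once per group, and its random draws will not match those of create_bootstrap_sample().

```python
import random


def bootstrap_rows(rows, seed=0, group_by_col="concentration", sample_by="barcode"):
    """Bootstrap a list of row dicts, group by group."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        key = row[group_by_col] if group_by_col is not None else None
        groups.setdefault(key, []).append(row)

    if sample_by is None:
        # Resample rows with replacement, independently within each group.
        return [rng.choice(group) for group in groups.values() for _ in group]

    # Every group must contain the same entities in the sample_by column.
    entity_sets = [sorted(r[sample_by] for r in g) for g in groups.values()]
    if any(e != entity_sets[0] for e in entity_sets):
        raise ValueError(f"elements in {sample_by=} differ in {group_by_col=}")

    # Draw entities once, then take the same entities from every group.
    entities = entity_sets[0]
    drawn = [rng.choice(entities) for _ in entities]
    sample = []
    for group in groups.values():
        by_entity = {r[sample_by]: r for r in group}
        sample.extend(by_entity[e] for e in drawn)
    return sample


rows = [
    {"aa_substitutions": "", "concentration": 1, "barcode": "AA"},
    {"aa_substitutions": "M1A", "concentration": 1, "barcode": "AC"},
    {"aa_substitutions": "G2C", "concentration": 1, "barcode": "AG"},
    {"aa_substitutions": "", "concentration": 2, "barcode": "AA"},
    {"aa_substitutions": "M1A", "concentration": 2, "barcode": "AC"},
    {"aa_substitutions": "G2C", "concentration": 2, "barcode": "AG"},
]
for row in bootstrap_rows(rows):
    print(row)
```

With sample_by set, the same barcodes are drawn at every concentration; with sample_by=None, each concentration is resampled independently.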

Example

>>> df_groups_same_barcode = pd.DataFrame({
...     "aa_substitutions": ["", "M1A", "G2C", "", "M1A", "G2C"],
...     "concentration": [1, 1, 1, 2, 2, 2],
...     "barcode": ["AA", "AC", "AG", "AA", "AC", "AG"],
... })

Same variants for each concentration:

>>> create_bootstrap_sample(df_groups_same_barcode)
  aa_substitutions  concentration barcode
0                               1      AA
1              M1A              1      AC
2                               1      AA
3                               2      AA
4              M1A              2      AC
5                               2      AA

Different variants for each concentration:

>>> create_bootstrap_sample(df_groups_same_barcode, sample_by=None, seed=2)
  aa_substitutions  concentration barcode
0                               1      AA
1              M1A              1      AC
2                               1      AA
3              G2C              2      AG
4                               2      AA
5              M1A              2      AC

Can’t use sample_by if concentrations don’t have same barcodes:

>>> create_bootstrap_sample(df_groups_same_barcode.head(5))
Traceback (most recent call last):
 ...
ValueError: elements in sample_by='barcode' differ in group_by_col='concentration'

polyclonal.polyclonal_collection.fit_models(models, n_threads, failures='error', **kwargs)[source]

Fit collection of Polyclonal models.

Enables fitting of multiple models simultaneously using multiple threads.

Parameters:
  • models (list) – List of Polyclonal models to fit.

  • n_threads (int) – Number of threads (CPUs, cores) to use for fitting. Set to -1 to use all CPUs available.

  • failures ({"error", "tolerate"}) – What to do if fitting fails for a model. If “error”, raise an error; if “tolerate”, just return None for models that failed optimization.

  • **kwargs – Keyword arguments for polyclonal.polyclonal.Polyclonal.fit().

Returns:

Number of models that fit successfully, number of models that failed, and list of the fit models. Since Polyclonal objects are mutable, you can also access the fit models in their original data structure.

Return type:

(n_fit, n_failed, fit_models)
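
The overall pattern (fit several models concurrently, record failures as None, and either raise or tolerate them) can be sketched with the standard library. This is a sketch under stated assumptions and not the library's code: ToyModel, FitError, and fit_all are hypothetical stand-ins, a thread pool is used here for simplicity, and raising when every fit fails is borrowed from the PolyclonalBootstrap.fit_models() description.

```python
import os
from concurrent.futures import ThreadPoolExecutor


class FitError(Exception):
    """Hypothetical stand-in for a model-fitting failure."""


class ToyModel:
    """Hypothetical stand-in for a model with a fit() method."""

    def __init__(self, name, will_fail=False):
        self.name = name
        self.will_fail = will_fail
        self.fitted = False

    def fit(self, **kwargs):
        if self.will_fail:
            raise FitError(f"could not fit {self.name}")
        self.fitted = True


def _try_fit(model, fit_kwargs):
    """Fit one model, returning None instead of raising on failure."""
    try:
        model.fit(**fit_kwargs)
        return model
    except FitError:
        return None


def fit_all(models, n_threads=-1, failures="error", **kwargs):
    """Fit models concurrently; return (n_fit, n_failed, fit_models)."""
    if failures not in {"error", "tolerate"}:
        raise ValueError("invalid failures: " + repr(failures))
    if n_threads == -1:
        n_threads = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        fit_models = list(pool.map(lambda m: _try_fit(m, kwargs), models))
    n_failed = sum(m is None for m in fit_models)
    n_fit = len(fit_models) - n_failed
    # Raise on any failure in "error" mode, and always if nothing was fit.
    if n_failed and (failures == "error" or n_fit == 0):
        raise FitError(f"{n_failed} of {len(fit_models)} fits failed")
    return n_fit, n_failed, fit_models


models = [ToyModel("a"), ToyModel("b", will_fail=True), ToyModel("c")]
n_fit, n_failed, fit_models = fit_all(models, n_threads=2, failures="tolerate")
print(n_fit, n_failed, [m.name if m else None for m in fit_models])
```

Because the models are fit in place, the successfully fit objects in the returned list are the same objects as in the input list.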