polyclonal_collection¶
Defines PolyclonalCollection for handling collections of multiple Polyclonal objects.
PolyclonalCollection is a base class for the following specific use-case classes:
- PolyclonalAverage for averaging several Polyclonal objects.
- PolyclonalBootstrap for bootstrapping a model.
- class polyclonal.polyclonal_collection.PolyclonalAverage(models_df, *, region_col=None, harmonize_to=None, default_avg_to_plot='median')[source]¶
Bases:
PolyclonalCollection
Average several Polyclonal objects.
- Parameters:
models_df (pandas.DataFrame) – Same meaning as for PolyclonalCollection. However, the resulting collection of models will have copies of these models rather than the actual objects in models_df.
region_col (str or None) – Same meaning as for PolyclonalCollection.
harmonize_to (PolyclonalCollection or None) – When harmonizing the epitopes, harmonize to this model. If None, harmonize to the first model in models_df.
default_avg_to_plot ({"mean", "median"}) – Type of average that the plotting methods plot by default.
- Other attributes of PolyclonalCollection.
Inherited from base class.
- class polyclonal.polyclonal_collection.PolyclonalBootstrap(root_polyclonal, n_bootstrap_samples, *, n_threads=-1, seed=0, sample_by='barcode', default_avg_to_plot='mean')[source]¶
Bases:
PolyclonalCollection
Bootstrap Polyclonal objects.
- Parameters:
root_polyclonal (Polyclonal) – The polyclonal object created with the full dataset to draw bootstrapped samples from. The bootstrapped samples are also initialized to the mutation effects and activities of this model, so it is highly recommended that this object has already been fit to the full dataset.
n_bootstrap_samples (int) – Number of bootstrapped Polyclonal models to fit.
seed (int) – Random seed for reproducibility.
n_threads (int) – Number of threads to use for multiprocessing, -1 means all available.
sample_by – Passed to create_bootstrap_sample(). Should generally be ‘barcode’ if you have the same variants at all concentrations, and possibly None otherwise.
default_avg_to_plot ({"mean", "median"}) – Type of average that the plotting methods plot by default.
- root_polyclonal¶
The root polyclonal object passed as a parameter.
- Type:
Polyclonal
- n_threads¶
Number of threads for multiprocessing.
- Type:
int
- Other attributes of PolyclonalCollection.
Inherited from base class.
- fit_models(failures='error', **kwargs)[source]¶
Fits bootstrapped Polyclonal models.
The fit models will then be in PolyclonalCollection.models, with any models that fail fitting set to None. Their epitopes will also be harmonized with PolyclonalBootstrap.root_polyclonal.
- Parameters:
failures ({"error", "tolerate"}) – Whether to tolerate failures in model fitting or raise an error on failure. An error is always raised if all models fail.
**kwargs – Keyword arguments for polyclonal.polyclonal.Polyclonal.fit(). If not specified otherwise, fit_site_level_first is set to False, since models are initialized to “good” values from the root object.
- Returns:
Number of model fits that succeeded and failed.
- Return type:
(n_fit, n_failed)
- class polyclonal.polyclonal_collection.PolyclonalCollection(models_df, *, default_avg_to_plot, region_col=None)[source]¶
Bases:
object
Handle a collection of Polyclonal objects.
- Parameters:
models_df (pandas.DataFrame) – Data frame of models. Should have one column named “model” that has Polyclonal models; the other columns are descriptors for each model (e.g., “replicate”, etc). The descriptors for each row must be unique.
default_avg_to_plot ({"mean", "median"}) – By default when plotting, plot either “mean” or “median”.
region_col (None or str) – Use this option if you want to only include sites in a specific region of the protein for specific models (this is useful for instance if you split the protein into halves in two different libraries). In this case, region_col should be a column in models_df with the values being the list of sites to use for that specific model.
- models¶
List of the models in models_df. All models must have same epitopes.
- Type:
list
- model_descriptors¶
A list of the same length as models, with each entry being a dict keyed by descriptors and values being the descriptor for that model. All models must have the same descriptor labels. E.g., [{"replicate": 1}, {"replicate": 2}]. The descriptor labels are all columns in models_df except the one named “model”.
- Type:
list
- descriptor_names¶
The names that key the entries in PolyclonalCollection.model_descriptors.
- Type:
list
- unique_descriptor_names¶
Names of descriptors in PolyclonalCollection.descriptor_names that are not shared across all models.
- Type:
list
- epitopes¶
Same meaning as for epitopes, extracted from PolyclonalCollection.models.
- Type:
tuple
- epitope_colors¶
Same meaning as for epitope_colors, extracted from PolyclonalCollection.models.
- Type:
dict
- alphabet¶
Same meaning as for alphabet, extracted from PolyclonalCollection.models.
- Type:
array-like
- sequential_integer_sites¶
Same as for sequential_integer_sites, extracted from PolyclonalCollection.models.
- Type:
bool
- sites¶
All sites for which the model is defined.
- Type:
tuple
- default_avg_to_plot¶
By default when plotting, plot either “mean” or “median”.
- Type:
{“mean”, “median”}
- regions¶
List of same length as PolyclonalCollection.models with each entry being the set of sites that are being used in returned results for that model. If region_col is None, this is all sites (PolyclonalCollection.sites), but if region_col is used to define regions for different models then the sets of sites may differ among models.
- Type:
list
- n_models_by_site¶
Keyed by each site in PolyclonalCollection.sites, with the value being the number of models for which that site is in region for that model (this will just be the number of models when not using region_col).
- Type:
dict
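The relationship between model_descriptors, descriptor_names, and unique_descriptor_names described above can be illustrated with a minimal sketch in plain Python. The helper name summarize_descriptors is hypothetical and not part of the package; the sketch only assumes what is stated above (shared descriptor labels, unique descriptors per model) and does not reproduce the library's implementation.

```python
def summarize_descriptors(model_descriptors):
    """Return (descriptor_names, unique_descriptor_names).

    Hypothetical helper for illustration only; not part of polyclonal.
    `model_descriptors` is a list of dicts, one per model.
    """
    if not model_descriptors:
        raise ValueError("no models")
    names = list(model_descriptors[0])
    # All models must have the same descriptor labels.
    for d in model_descriptors:
        if set(d) != set(names):
            raise ValueError("models have different descriptor labels")
    # The combination of descriptor values must be unique for each model.
    combos = [tuple(d[n] for n in names) for d in model_descriptors]
    if len(set(combos)) != len(combos):
        raise ValueError("descriptors are not unique for each model")
    # Keep only descriptors whose value is not shared across all models.
    unique_names = [
        n for n in names if len({d[n] for d in model_descriptors}) > 1
    ]
    return names, unique_names


descriptors = [
    {"replicate": 1, "antibody": "serum_A"},
    {"replicate": 2, "antibody": "serum_A"},
]
names, unique_names = summarize_descriptors(descriptors)
print(names)         # ['replicate', 'antibody']
print(unique_names)  # ['replicate']
```

Here "antibody" is a descriptor name but not a unique one, because both models share the same value for it.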
Example
Create a toy example collection of two models.
>>> data_to_fit = pd.DataFrame.from_records(
...     [("M3A", 1, 0), ("K5G", 1, 1)],
...     columns=["aa_substitutions", "concentration", "prob_escape"],
... )
>>> data_to_fit2 = pd.DataFrame.from_records(
...     [("M3A", 1, 0), ("K5G", 1, 1), ("L6T", 1, 0.5)],
...     columns=["aa_substitutions", "concentration", "prob_escape"],
... )
>>> models_df = pd.DataFrame(
...     {
...         "model": [
...             polyclonal.Polyclonal(data_to_fit=data_to_fit, n_epitopes=1),
...             polyclonal.Polyclonal(data_to_fit=data_to_fit2, n_epitopes=1),
...         ],
...         "description": ["model_1", "model_2"],
...     }
... )
>>> model_collection = polyclonal.PolyclonalCollection(
...     models_df, default_avg_to_plot="mean",
... )
>>> model_collection.sites
(3, 5, 6)
>>> model_collection.regions == [{3, 5, 6}, {3, 5, 6}]
True
>>> model_collection.n_models_by_site
{3: 2, 5: 2, 6: 2}
Now create a toy example with different regions for each model:
>>> models_df = pd.DataFrame(
...     {
...         "model": [
...             polyclonal.Polyclonal(data_to_fit=data_to_fit, n_epitopes=1),
...             polyclonal.Polyclonal(data_to_fit=data_to_fit2, n_epitopes=1),
...         ],
...         "description": ["model_1", "model_2"],
...         "region": [[3, 5], [3, 5, 6]],
...     }
... )
>>> model_region = polyclonal.PolyclonalCollection(
...     models_df, default_avg_to_plot="mean", region_col="region",
... )
>>> model_region.sites
(3, 5, 6)
>>> assert model_region.regions == [{3, 5}, {3, 5, 6}], model_region.regions
>>> model_region.n_models_by_site
{3: 2, 5: 2, 6: 1}
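The values of sites, regions, and n_models_by_site in the two examples above follow from simple set arithmetic. The sketch below reproduces them in plain Python without the package. The function name sites_regions_counts and its inputs are hypothetical, chosen only to mirror the attribute descriptions.

```python
def sites_regions_counts(model_sites, regions=None):
    """Mimic the `sites`, `regions`, and `n_models_by_site` attributes.

    Hypothetical helper for illustration only; not part of polyclonal.
    `model_sites` lists the sites each model is defined for, and `regions`
    optionally lists the sites to use for each model (as with region_col).
    """
    # All sites defined in any model, in sorted order.
    all_sites = tuple(sorted(set().union(*model_sites)))
    if regions is None:
        # Without region_col, every model uses all sites.
        region_sets = [set(all_sites) for _ in model_sites]
    else:
        region_sets = [set(r) for r in regions]
    # Count the models whose region includes each site.
    n_models_by_site = {
        site: sum(site in r for r in region_sets) for site in all_sites
    }
    return all_sites, region_sets, n_models_by_site


# Same toy setup as above: model 1 sees sites 3 and 5, model 2 sees 3, 5, 6.
print(sites_regions_counts([[3, 5], [3, 5, 6]]))
# ((3, 5, 6), [{3, 5, 6}, {3, 5, 6}], {3: 2, 5: 2, 6: 2})
print(sites_regions_counts([[3, 5], [3, 5, 6]], regions=[[3, 5], [3, 5, 6]]))
# ((3, 5, 6), [{3, 5}, {3, 5, 6}], {3: 2, 5: 2, 6: 1})
```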
- activity_wt_barplot(avg_type=None, **kwargs)[source]¶
Bar plot of epitope activities averaged across models.
- Parameters:
avg_type ({"mean", "median", None}) – Type of average to plot, None defaults to PolyclonalCollection.default_avg_to_plot.
**kwargs – Keyword arguments for polyclonal.plot.activity_wt_barplot().
- Returns:
Interactive plot, with error bars showing standard deviation.
- Return type:
altair.Chart
- property activity_wt_df¶
Epitope activities summarized across models.
- Type:
pandas.DataFrame
- property activity_wt_df_replicates¶
Epitope activities for all models.
- Type:
pandas.DataFrame
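The difference between the per-model ("replicates") view and the summarized view can be sketched in plain Python. This is only an illustration under assumptions: the function name, the row layout, and the choice of sample standard deviation are hypothetical, and the actual column names of the returned data frames are not specified here.

```python
import statistics
from collections import defaultdict


def summarize_across_models(replicate_rows):
    """Summarize per-model values for each epitope.

    Hypothetical helper for illustration only; not part of polyclonal.
    `replicate_rows` is a list of (model_label, epitope, activity) tuples.
    """
    by_epitope = defaultdict(list)
    for _model, epitope, activity in replicate_rows:
        by_epitope[epitope].append(activity)
    summary = {}
    for epitope, values in by_epitope.items():
        summary[epitope] = {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            # Sample standard deviation (an assumption); undefined for one model.
            "std": statistics.stdev(values) if len(values) > 1 else float("nan"),
            "n_models": len(values),
        }
    return summary


rows = [
    ("rep1", "A", 2.0), ("rep2", "A", 4.0), ("rep3", "A", 3.0),
    ("rep1", "B", 1.0), ("rep2", "B", 1.5), ("rep3", "B", 3.5),
]
summary = summarize_across_models(rows)
print(summary["A"]["mean"], summary["A"]["median"])  # 3.0 3.0
print(summary["B"]["mean"], summary["B"]["median"])  # 2.0 1.5
```

For epitope B the mean and median differ, which is why the choice made by default_avg_to_plot can matter when one model is an outlier.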
- property curve_specs_df¶
Activities, Hill coefficients, and non-neutralized fractions.
Values summarized across models.
- Type:
pandas.DataFrame
- property curve_specs_df_replicates¶
Activities, Hill coefficients, and non-neutralized fractions.
Per-replicate values.
- Type:
pandas.DataFrame
- curves_plot(*, avg_type=None, per_model_lines=5, **kwargs)[source]¶
Plot neutralization / binding curve for unmutated protein at each epitope.
This curve effectively illustrates the epitope activity, Hill curve coefficient, and non-neutralizable fraction.
- Parameters:
avg_type ({"mean", "median", None}) – Type of average to plot, None defaults to PolyclonalCollection.default_avg_to_plot.
per_model_lines (int) – Whether to plot thin lines for each model or just the average. If the number of models in the collection is <= this number, per-model lines are plotted; otherwise only the average is plotted. A value of -1 means per-model lines are always plotted.
**kwargs – Keyword args for polyclonal.plot.curves_plot()
- Returns:
Interactive plot.
- Return type:
altair.Chart
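The per_model_lines rule described in the parameters above amounts to a small decision function. The sketch below restates it in Python. The function name is hypothetical, and the library's internal code may be organized differently.

```python
def show_per_model_lines(n_models, per_model_lines=5):
    """Decide whether thin per-model lines are drawn.

    Hypothetical helper restating the documented rule; not part of polyclonal.
    """
    if per_model_lines == -1:
        # -1 means per-model lines are always plotted.
        return True
    # Otherwise plot them only when the collection is small enough.
    return n_models <= per_model_lines


print(show_per_model_lines(3))                       # True
print(show_per_model_lines(8))                       # False
print(show_per_model_lines(8, per_model_lines=-1))   # True
```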
- property hill_coefficient_df¶
Hill coefficients summarized across models.
- Type:
pandas.DataFrame
- property hill_coefficient_df_replicates¶
Hill coefficients for all models.
- Type:
pandas.DataFrame
- icXX(variants_df, **kwargs)[source]¶
Predicted concentration at which a variant is neutralized across all models.
- Parameters:
variants_df (pandas.DataFrame) – Data frame defining variants. Should have column named ‘aa_substitutions’ that defines variants as space-delimited strings of substitutions (e.g., ‘M1A K3T’).
**kwargs (Dictionary) – Keyword args for icXX()
- Returns:
De-duplicated copy of variants_df with added column col containing icXX and summary stats for each variant across all models.
- Return type:
pandas.DataFrame
- icXX_replicates(variants_df, **kwargs)[source]¶
Concentration at which a given fraction is neutralized (e.g., IC50) for all models.
- Parameters:
variants_df (pandas.DataFrame) – Data frame defining variants. Should have column named ‘aa_substitutions’ that defines variants as space-delimited strings of substitutions (e.g., ‘M1A K3T’).
**kwargs (Dictionary) – Keyword args for icXX()
- Returns:
Copy of variants_df with added column col containing icXX, and model descriptors. Variants with a mutation lacking in a particular model are missing in that row.
- Return type:
pandas.DataFrame
- mut_escape_corr(method='pearson', min_times_seen=1)[source]¶
Correlation of mutation escape values across models for each epitope.
- Parameters:
method (str) – A correlation method passable to pandas.DataFrame.corr.
min_times_seen (int) – Only include mutations with a times_seen >= this value.
- Returns:
Tidy data frame giving correlations between models for all epitopes. The models are labeled by their descriptors suffixed with “_1” and “_2” for the two models being compared.
- Return type:
pandas.DataFrame
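As a rough illustration of what a pairwise correlation between two models involves, the sketch below computes a Pearson correlation over the mutations shared by two models after applying a min_times_seen filter. It is written in plain Python rather than with pandas.DataFrame.corr, and the function names, the input layout, and the choice to apply the filter to both models are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def escape_corr(model_1, model_2, min_times_seen=1):
    """Correlate escape values for mutations present in both models.

    Hypothetical helper; each model maps mutation -> (escape, times_seen).
    """
    shared = sorted(
        m
        for m in model_1
        if m in model_2
        and model_1[m][1] >= min_times_seen
        and model_2[m][1] >= min_times_seen
    )
    return pearson([model_1[m][0] for m in shared],
                   [model_2[m][0] for m in shared])


model_1 = {"M3A": (0.1, 5), "K5G": (2.0, 4), "L6T": (1.0, 1), "A7C": (0.5, 3)}
model_2 = {"M3A": (0.2, 6), "K5G": (1.8, 2), "L6T": (3.0, 1), "A7C": (0.6, 3)}
r = escape_corr(model_1, model_2, min_times_seen=2)  # L6T is filtered out
print(round(r, 3), round(r ** 2, 3))  # R and R^2
```

Raising min_times_seen removes the poorly sampled mutation L6T, which is the one on which the two toy models disagree most.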
- mut_escape_corr_heatmap(method='pearson', min_times_seen=1, plot_corr2=True, **kwargs)[source]¶
Heatmap of mutation-escape correlation among models at each epitope.
- Parameters:
method (str) – A correlation method passable to pandas.DataFrame.corr.
min_times_seen (int) – Only include mutations with a times_seen >= this value.
plot_corr2 (bool) – Plot squared correlation (e.g., \(R^2\) rather than \(R\)).
**kwargs – Keyword args for
polyclonal.plot.corr_heatmap()
- property mut_escape_df¶
Mutation escape summarized across models.
- Type:
pandas.DataFrame
- property mut_escape_df_replicates¶
Mutation escape by model.
- Type:
pandas.DataFrame
- property mut_escape_df_w_model_values¶
Summarized mutation escape plus per model values.
Like PolyclonalCollection.mut_escape_df but with additional columns giving per-model escape.
- Type:
pandas.DataFrame
- mut_escape_plot(*, biochem_order_aas=True, avg_type=None, init_n_models=None, prefix_epitope=None, df_to_merge=None, per_model_tooltip=None, scale_stat_col=1, **kwargs)[source]¶
Make plot of mutation escape values.
- Parameters:
biochem_order_aas (bool) – Biochemically order amino-acid alphabet PolyclonalCollection.alphabet by passing it through polyclonal.alphabets.biochem_order_aas().
avg_type ({"mean", "median", "min_magnitude", None}) – Type of average to plot, None defaults to PolyclonalCollection.default_avg_to_plot.
init_n_models (None or int) – Initially only show mutations found in at least this number of models in the collection. A value of None corresponds to choosing a value that is >= half the number of total replicates.
prefix_epitope (bool or None) – Do we add the prefix “epitope “ to the epitope labels? If None, do only if epitope is integer.
df_to_merge (None or pandas.DataFrame or list) – To include additional properties, specify data frame or list of them which are merged with Polyclonal.mut_escape_df before being passed to polyclonal.plot.lineplot_and_heatmap(). Properties will only be included in plot if relevant columns are passed to polyclonal.plot.lineplot_and_heatmap() via addtl_slider_stats, addtl_tooltip_stats, or site_zoom_bar_color_col.
per_model_tooltip (None or bool) – In the heatmap, do the tooltips report per-model escape values or the standard deviation across models? If None then report per-model when <= 5 models and standard deviation if > 5 models. If True, always report per-model values. If False, always report standard deviation.
scale_stat_col (float) – Scale the escape values by this factor before plotting.
**kwargs – Keyword args for polyclonal.plot.lineplot_and_heatmap()
- Returns:
Interactive heat maps and line plots.
- Return type:
altair.Chart
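The None defaults for init_n_models and per_model_tooltip described in the parameters above can be restated as two small helpers. This is a sketch under assumptions: the function names are hypothetical, and reading "a value that is >= half the number of total replicates" as the smallest such integer (a ceiling) is an interpretation rather than something the text confirms.

```python
import math


def resolve_init_n_models(init_n_models, n_models):
    """Pick the initial minimum number of models a mutation must be found in.

    Hypothetical helper; the ceiling is an assumed reading of the docs.
    """
    if init_n_models is None:
        return math.ceil(n_models / 2)
    return init_n_models


def resolve_per_model_tooltip(per_model_tooltip, n_models):
    """Return True for per-model tooltips, False for standard deviation."""
    if per_model_tooltip is None:
        # Documented rule: per-model values when there are <= 5 models.
        return n_models <= 5
    return per_model_tooltip


print(resolve_init_n_models(None, 5))      # 3
print(resolve_per_model_tooltip(None, 6))  # False
```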
- mut_escape_site_summary_df(**kwargs)[source]¶
Site-level summaries of mutation escape across models.
- Parameters:
**kwargs – Keyword arguments to polyclonal.polyclonal.Polyclonal.mut_escape_site_summary_df(). In particular, you may want to use min_times_seen.
- Returns:
The different site-summary metrics (‘mean’, ‘total positive’, etc) are in different rows for each site and epitope. The ‘frac_models’ column refers to models with measurements for any mutation at that site.
- Return type:
pandas.DataFrame
- mut_escape_site_summary_df_replicates(**kwargs)[source]¶
Site-level summaries of mutation escape for models.
- Parameters:
**kwargs – Keyword arguments to mut_escape_site_summary_df().
- Return type:
pandas.DataFrame
- mut_icXX_df(**kwargs)[source]¶
Get data frame of log fold change ICXX induced by each mutation.
- Parameters:
**kwargs – Keyword arguments to
mut_icXX_df()
- Returns:
Log fold change ICXX for each mutation.
- Return type:
pandas.DataFrame
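The quantity reported here is a log fold change: the ICXX of the mutant relative to the ICXX of the unmutated protein, on a log scale. The sketch below shows that arithmetic only. The function name is hypothetical, and clipping both values to the interval from min_c to max_c before taking the ratio is an assumption about how those bounds are applied, not a statement of the library's behavior.

```python
import math


def log_fold_change_icxx(ic_mut, ic_wt, logbase=2, min_c=1e-8, max_c=1e8):
    """Log fold change in ICXX of a mutant relative to wildtype.

    Hypothetical helper for illustration only; not part of polyclonal.
    """
    def clip(c):
        # Assumed handling of the concentration bounds.
        return min(max(c, min_c), max_c)

    return math.log(clip(ic_mut) / clip(ic_wt), logbase)


# A mutation that raises the IC90 four-fold gives a log2 fold change near 2.
print(log_fold_change_icxx(0.4, 0.1))
# A mutation that halves the IC90 gives a log2 fold change near -1.
print(log_fold_change_icxx(0.05, 0.1))
```

Positive values mean a higher concentration is needed to neutralize the mutant, and negative values mean a lower one.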
- mut_icXX_df_replicates(**kwargs)[source]¶
Get data frame of ICXX and log fold change for each mutation by model.
- Parameters:
**kwargs – Keyword arguments to
mut_icXX_df()
- Returns:
Data frame of ICXX and log fold change for each model.
- Return type:
pandas.DataFrame
- mut_icXX_df_w_model_values(**kwargs)[source]¶
Log fold change ICXX induced by each mutation, plus per-model values.
Like PolyclonalCollection.mut_icXX_df but with additional columns giving per-model ICXXs.
- Parameters:
**kwargs – Keyword arguments to
mut_icXX_df()
- Returns:
Log fold change ICXX for each mutation, plus per model values.
- Return type:
pandas.DataFrame
- mut_icXX_plot(*, x=0.9, icXX_col='IC90', log_fold_change_icXX_col='log2 fold change IC90', min_c=1e-08, max_c=100000000.0, logbase=2, check_wt_icXX=(1e-05, 100000.0), biochem_order_aas=True, df_to_merge=None, positive_color='#0072B2', negative_color='#E69F00', avg_type=None, init_n_models=None, per_model_tooltip=None, scale_stat_col=1, **kwargs)[source]¶
- Parameters:
x (float) – Same meaning as for Polyclonal.mut_icXX_df().
icXX_col (str) – Same meaning as for Polyclonal.mut_icXX_df().
log_fold_change_icXX_col (str) – Same meaning as for Polyclonal.mut_icXX_df().
min_c (float) – Same meaning as for Polyclonal.mut_icXX_df().
max_c (float) – Same meaning as for Polyclonal.mut_icXX_df().
logbase (float) – Same meaning as for Polyclonal.mut_icXX_df().
check_wt_icXX (None or 2-tuple) – Same meaning as for Polyclonal.mut_icXX_df().
biochem_order_aas (bool) – Biochemically order the amino-acid alphabet in Polyclonal.alphabet by passing it through polyclonal.alphabets.biochem_order_aas().
df_to_merge (None or pandas.DataFrame or list) – To include additional properties, specify data frame or list of them which are merged with Polyclonal.mut_escape_df before being passed to polyclonal.plot.lineplot_and_heatmap(). Properties will only be included in plot if relevant columns are passed to polyclonal.plot.lineplot_and_heatmap() via addtl_slider_stats, addtl_tooltip_stats, or site_zoom_bar_color_col.
positive_color (str) – Color for positive log fold change in heatmap.
negative_color (str) – Color for negative log fold change in heatmap.
avg_type ({"mean", "median", "min_magnitude", None}) – Type of average to plot, None defaults to PolyclonalCollection.default_avg_to_plot.
init_n_models (None or int) – Initially only show mutations found in at least this number of models in the collection. A value of None corresponds to choosing a value that is >= half the number of total replicates.
per_model_tooltip (None or bool) – In the heatmap, do the tooltips report per-model escape values or the standard deviation across models? If None then report per-model when <= 5 models and standard deviation if > 5 models. If True, always report per-model values. If False, always report standard deviation.
scale_stat_col (float) – Scale the escape values by this factor before plotting.
**kwargs – Keyword args for polyclonal.plot.lineplot_and_heatmap()
- Returns:
Interactive heat map and line plot.
- Return type:
altair.Chart
- property non_neutralized_frac_df¶
Non-neutralizable fraction summarized across models.
- Type:
pandas.DataFrame
- property non_neutralized_frac_df_replicates¶
Non-neutralizable fraction for all models.
- Type:
pandas.DataFrame
- prob_escape(variants_df, **kwargs)[source]¶
Summary of predicted probability of escape across all models.
- Parameters:
variants_df (pandas.DataFrame) – Input data frame defining variants. Should have a column named ‘aa_substitutions’ that defines variants as space-delimited strings of substitutions (e.g., ‘M1A K3T’). Should also have a column ‘concentration’ if concentrations=None.
**kwargs (Dictionary) – Keyword args for prob_escape()
- Returns:
De-duplicated copy of variants_df with columns named ‘concentration’ and ‘mean’, ‘median’, and ‘std’ giving corresponding summary stats of predicted probability of escape \(p_v\left(c\right)\) for each variant at each concentration across models.
- Return type:
pandas.DataFrame
- prob_escape_replicates(variants_df, **kwargs)[source]¶
Compute predicted probability of escape \(p_v\left(c\right)\).
Uses all models to make predictions on variants_df.
- Parameters:
variants_df (pandas.DataFrame) – Input data frame defining variants. Should have a column named ‘aa_substitutions’ that defines variants as space-delimited strings of substitutions (e.g., ‘M1A K3T’). Should also have a column ‘concentration’ if concentrations=None.
**kwargs (Dictionary) – Keyword args for prob_escape()
- Returns:
Version of variants_df with columns named ‘concentration’ and ‘predicted_prob_escape’ giving predicted probability of escape \(p_v\left(c\right)\) for each variant at each concentration and model. Variants with a mutation lacking in a particular model are missing in that row.
- Return type:
pandas.DataFrame
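The statement that variants with a mutation lacking in a particular model are missing can be pictured with a minimal sketch. It does not compute any predictions. It only shows which variant and model combinations would appear as rows, under the assumption that a variant is dropped for a model whenever any of its substitutions is unknown to that model. The function name and input layout are hypothetical.

```python
def rows_present_by_model(variants, models):
    """List the (variant, model) rows expected in a per-model result.

    Hypothetical helper for illustration only; not part of polyclonal.
    `variants` are space-delimited substitution strings, and `models` is a
    list of (descriptor dict, set of mutations known to that model).
    """
    rows = []
    for descriptors, known_mutations in models:
        for variant in variants:
            mutations = variant.split()  # '' (unmutated) gives an empty list
            if all(m in known_mutations for m in mutations):
                rows.append({"aa_substitutions": variant, **descriptors})
    return rows


variants = ["", "M3A", "M3A L6T"]
models = [
    ({"replicate": 1}, {"M3A", "K5G"}),
    ({"replicate": 2}, {"M3A", "K5G", "L6T"}),
]
for row in rows_present_by_model(variants, models):
    print(row)
```

The variant 'M3A L6T' appears only for replicate 2, because replicate 1 has no value for L6T.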
- exception polyclonal.polyclonal_collection.PolyclonalCollectionFitError[source]¶
Bases:
Exception
Error fitting models.
- polyclonal.polyclonal_collection.create_bootstrap_sample(df, seed=0, group_by_col='concentration', sample_by='barcode')[source]¶
Bootstrap sample of data frame.
- Parameters:
df (pandas.DataFrame) – Dataframe to be bootstrapped
seed (int) – Random number seed.
group_by_col (string or None) – Group by this column and bootstrap each group separately.
sample_by (str or None) – For each group, sample the same entities in this column. Requires each group to have same unique set of rows for this column.
- Returns:
bootstrap_df – Dataframe with same number of rows as df and same number of samples per group_by_col.
- Return type:
pandas.DataFrame
Example
>>> df_groups_same_barcode = pd.DataFrame({
...     "aa_substitutions": ["", "M1A", "G2C", "", "M1A", "G2C"],
...     "concentration": [1, 1, 1, 2, 2, 2],
...     "barcode": ["AA", "AC", "AG", "AA", "AC", "AG"],
... })
Same variants for each concentration:
>>> create_bootstrap_sample(df_groups_same_barcode)
  aa_substitutions  concentration barcode
0                               1      AA
1              M1A              1      AC
2                               1      AA
3                               2      AA
4              M1A              2      AC
5                               2      AA
Different variants for each concentration:
>>> create_bootstrap_sample(df_groups_same_barcode, sample_by=None, seed=2)
  aa_substitutions  concentration barcode
0                               1      AA
1              M1A              1      AC
2                               1      AA
3              G2C              2      AG
4                               2      AA
5              M1A              2      AC
Can’t use sample_by if concentrations don’t have same barcodes:
>>> create_bootstrap_sample(df_groups_same_barcode.head(5))
Traceback (most recent call last):
...
ValueError: elements in sample_by='barcode' differ in group_by_col='concentration'
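The grouping and sampling logic illustrated by the examples above can be sketched in plain Python. This is not the library's implementation and will not reproduce its random draws. The function name bootstrap_rows and the list-of-dicts input are hypothetical, and the sketch assumes each entity in sample_by occurs once per group.

```python
import random
from collections import defaultdict


def bootstrap_rows(rows, seed=0, group_by="concentration", sample_by="barcode"):
    """Bootstrap a list of dict rows, group by group.

    Hypothetical stand-in for illustration only; not part of polyclonal.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        key = row[group_by] if group_by is not None else None
        groups[key].append(row)
    if sample_by is None:
        # Sample rows with replacement independently within each group.
        return [dict(rng.choice(g)) for g in groups.values() for _ in g]
    # Every group must contain the same set of entities.
    keys = [sorted(r[sample_by] for r in g) for g in groups.values()]
    if any(k != keys[0] for k in keys):
        raise ValueError(
            f"elements in {sample_by=} differ in group_by_col={group_by!r}"
        )
    # Draw entities once, then take those same entities from every group.
    sampled = rng.choices(keys[0], k=len(keys[0]))
    out = []
    for g in groups.values():
        lookup = {r[sample_by]: r for r in g}
        out.extend(dict(lookup[k]) for k in sampled)
    return out


rows = [
    {"aa_substitutions": s, "concentration": c, "barcode": b}
    for s, c, b in [
        ("", 1, "AA"), ("M1A", 1, "AC"), ("G2C", 1, "AG"),
        ("", 2, "AA"), ("M1A", 2, "AC"), ("G2C", 2, "AG"),
    ]
]
for row in bootstrap_rows(rows):
    print(row)
```

Drawing the barcodes once and reusing them for every concentration keeps each sampled variant's full concentration series together, which is the point of sample_by.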
- polyclonal.polyclonal_collection.fit_models(models, n_threads, failures='error', **kwargs)[source]¶
Fit collection of Polyclonal models.
Enables fitting of multiple models simultaneously using multiple threads.
- Parameters:
models (list) – List of Polyclonal models to fit.
n_threads (int) – Number of threads (CPUs, cores) to use for fitting. Set to -1 to use all CPUs available.
failures ({"error", "tolerate"}) – What to do if fitting fails for a model. If “error”, raise an error; if “tolerate”, return None for models that failed optimization.
**kwargs – Keyword arguments for polyclonal.polyclonal.Polyclonal.fit().
- Returns:
Number of models that fit successfully, number of models that failed, and list of the fit models. Since Polyclonal objects are mutable, you can also access the fit models in their original data structure.
- Return type:
(n_fit, n_failed, fit_models)
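The failures policy and the (n_fit, n_failed, fit_models) return value can be illustrated with a generic sketch built on the standard library. Everything here is an assumption for illustration: fit_all, fit_one, and FitError are hypothetical names, a thread pool stands in for whatever parallelism the package uses, and raising when every model fails mirrors the behavior described for PolyclonalBootstrap.fit_models rather than anything stated for this function.

```python
import os
from concurrent.futures import ThreadPoolExecutor


class FitError(Exception):
    """Hypothetical error raised when every model fails to fit."""


def fit_all(models, fit_one, n_threads=-1, failures="error"):
    """Fit models in parallel; return (n_fit, n_failed, fit_models).

    Hypothetical stand-in for illustration only; not part of polyclonal.
    `fit_one` is a callable that fits a single model in place.
    """
    if failures not in {"error", "tolerate"}:
        raise ValueError(f"invalid {failures=}")
    if n_threads == -1:
        n_threads = os.cpu_count() or 1  # -1 means all available CPUs

    def try_fit(model):
        try:
            fit_one(model)
            return model
        except Exception:
            if failures == "error":
                raise
            return None  # tolerated failure

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(try_fit, models))
    n_failed = sum(m is None for m in results)
    n_fit = len(results) - n_failed
    if results and n_fit == 0:
        raise FitError("all models failed to fit")
    return n_fit, n_failed, results


def toy_fit(model):
    # Toy fitting routine: fails for models flagged as bad.
    if not model["ok"]:
        raise RuntimeError("optimization failed")
    model["fit"] = True


toy_models = [{"ok": True}, {"ok": False}, {"ok": True}]
n_fit, n_failed, fitted = fit_all(toy_models, toy_fit, failures="tolerate")
print(n_fit, n_failed)  # 2 1
```

Because the toy models are mutated in place, the successfully fit entries of toy_models are the same objects as those in the returned list, matching the note about mutability above.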