codonvarianttable¶
Defines CodonVariantTable
objects for storing and
handling codon variants of a gene.
- class dms_variants.codonvarianttable.CodonVariantTable(*, barcode_variant_file, geneseq, allowgaps=False, substitutions_are_codon=False, extra_cols=None, substitutions_col='substitutions', primary_target=None)[source]¶
Bases:
object
Store and handle counts of barcoded codon variants of a gene.
- Parameters:
barcode_variant_file (str) – CSV file giving barcodes and variants. Must have columns “library”, “barcode”, “substitutions” (nucleotide mutations in 1, …, numbering in a format like “G301A A302T G856C”), and “variant_call_support” (sequences supporting barcode-variant call). Additional columns are removed unless they are specified extra_cols.
geneseq (str) – Sequence of wildtype protein-coding gene.
substitutions_are_codon (bool) – If True, then “substitutions” column in barcode_variant_file gives substitutions as codon rather than nucleotide mutations (e.g., “ATG1ATA GTA5CCC” for substitutions at codons 1 and 5.
allowgaps (bool) – Allow in-frame codon-length gaps (
---
) as valid codon substitution.extra_cols (list) – Additional columns in barcode_variant_file to retain when creating barcode_variant_df and variant_count_df attributes.
substitutions_col (str) – Name of substitutions column in barcode_variant_file (use if you want it to be something other than “substitutions”).
primary_target (str or None) – Use this option if you have additional targets beyond the main gene for which we are analyzing variants. This might be the case if you have spiked other genes into the library. If this option is set to something other than None, then there must be a column in barcode_variant_file named “target” and one of these targets must be equal to ‘primary_target’. If there are other targets, they should not have any substitutions as we don’t parse substitutions in non-primary targets. Instead, substitutions_col for secondary targets should be empty or just have the name of the secondary target.
- geneseq¶
Wild-type sequence passed at initialization.
- Type:
str
- sites¶
List of all codon sites in 1, 2, … numbering.
- Type:
list
- codons¶
codons[r] is wildtype codon at site r, ordered by sites.
- Type:
collections.OrderedDict
- aas¶
aas[r] is wildtype amino acid at site r, ordered by sites.
- Type:
collections.OrderedDict
- libraries¶
List of libraries in barcode_variant_file.
- Type:
list
- barcode_variant_df¶
Info about codon mutations parsed from barcode_variant_file. For non-primary targets, the mutations column just hold the target name.
- Type:
pandas.DataFrame
- variant_count_df¶
Initially None, but after data added with
CodonVariantTable.addSampleCounts
, holds counts of each variant for each sample. Differs from barcode_variant_df in that the former just holds barcode-variant definitions, whereas variant_count_df has counts for each sample.- Type:
pandas.DataFrame or None
- primary_target¶
If multiple targets, name of the main target for which we are calling variants.
- Type:
str or None
- static addMergedLibraries(df, *, all_lib='all libraries')[source]¶
Add data to df for all libraries merged.
- Parameters:
df (pandas.DataFrame) – Data frame that includes columns named ‘library’ and ‘barcode’.
all_lib (str) – Name given to library that is merge of all other libraries.
- Returns:
If df only has data for one library, just returns df. Otherwise returns copy of df with new library with name given by all_lib that contains data for all individual libraries with the ‘barcode’ column giving library name followed by a hyphen and barcode.
- Return type:
pandas.DataFrame
- addSampleCounts(library, sample, barcodecounts)[source]¶
Add variant counts for a sample to variant_count_df.
Note
If you have many samples to add at once, it is faster to use
CodonVariantTable.add_sample_counts_df()
rather than repeatedly calling this method.- Parameters:
library (str) – Valid library for the
CodonVariantTable
.sample (str) – Sample name, must not already be in
CodonVariantTable.samples
for library.barcodecounts (pandas.DataFrame) – Counts for each variant by barcode. Must have columns “barcode” and “count”. The “barcode” column must have all barcodes in
CodonVariantTable.valid_barcodes
for library. Such data frames are returned bydms_variants.illuminabarcodeparser.IlluminaBarcodeParser
.
- classmethod add_frac_counts(variant_count_df)[source]¶
Add fraction of counts from each variant in library/sample.
- Parameters:
variant_count_df (pandas.DataFrame) – Same format as
CodonVariantTable
.- Returns:
A copy of variant_count_df with added column ‘frac_counts’ that gives fraction of all counts in that library / sample for that variant.
- Return type:
pandas.DataFrame
Example
>>> variant_count_df = pd.DataFrame.from_records( ... [('lib1', 's1', 'AA', 1), ... ('lib1', 's1', 'AT', 3), ... ('lib1', 's2', 'GG', 0), ... ('lib1', 's2', 'GA', 10), ... ('lib2', 's1', 'CC', 5), ... ('lib2', 's1', 'GA', 5), ... ], ... columns=['library', 'sample', 'barcode', 'count'], ... ) >>> CodonVariantTable.add_frac_counts(variant_count_df) library sample barcode count frac_counts 0 lib1 s1 AA 1 0.25 1 lib1 s1 AT 3 0.75 2 lib1 s2 GG 0 0.00 3 lib1 s2 GA 10 1.00 4 lib2 s1 CC 5 0.50 5 lib2 s1 GA 5 0.50
- add_full_seqs(df, *, aa_seq_col='aa_sequence', codon_seq_col='codon_sequence')[source]¶
Add full sequences to data frame, for primary target only.
- Parameters:
df (pandas.DataFrame) – Data frame that specifies substitutions. Might be the barcode_variant_df or variant_count_df attributes, or the result of calling
CodonVariantTable.func_scores()
.aa_seq_col (str or None) – Name of added column with amino-acid sequences. In order to add, df must have column named ‘aa_substitutions’.
codon_seq_col (str or None) – Name of added column with codon sequences. In order to add, df must have column named ‘codon_substitutions’.
- Returns:
Copy of df with columns aa_seq_col and/or codon_seq_col, and only the primary target retained if there is a ‘target’ column in df.
- Return type:
pandas.DataFrame
- add_sample_counts_df(counts_df)[source]¶
Add variant counts for several samples to variant_count_df.
- Parameters:
counts_df (pandas.DataFrame) – Must have columns ‘library’, ‘sample’, ‘barcode’, and ‘count’. The sample must not already be in CodonVariantTable.samples for that library. The barcode columns must have all barcodes for that library including zero-count ones.
- avgCountsPerVariant(*, libraries='all', samples='all', min_support=1, sample_rename=None, by_target=True)[source]¶
Get average counts per variant.
- Parameters:
libraries ({'all', 'all_only', list}) – Include all libraries including a merge, only a merge of all libraries, or a list of libraries.
samples ({'all', list}) – Include all samples or just samples in list.
min_support (int) – Only include variants with variant call support >= this.
sample_rename (dict or None) – Rename samples by specifying original name as key and new name as value.
by_target (bool) – If True, also group counts by target if multiple targets.
- Returns:
Average counts per variant for each library and sample, and possibly target.
- Return type:
pandas.DataFrame
- static classifyVariants(df, *, variant_class_col='variant_class', max_aa=2, syn_as_wt=False, primary_target=None, non_primary_target_class='secondary target', class_as_categorical=False)[source]¶
Classifies codon variants in df.
- Parameters:
df (pandas.DataFrame) – Must have columns named ‘aa_substitutions’, ‘n_aa_substitutions’, and ‘n_codon_substitutions’. Such a data frame can be obtained via variant_count_df or barcode_variant_df attributes of a
CodonVariantTable
or viaCodonVariantTable.func_scores()
.variant_class_col (str) – Name of column added to df that contains variant classification. Overwritten if already exists.
max_aa (int) – When classifying, group all with >= this many amino-acid mutations.
syn_as_wt (bool) – Do not have a category of ‘synonymous’ and instead classify synonymous variants as ‘wildtype’. If using this option, df does not need column ‘n_codon_substitutions’.
primary_target (None or str) – If df has a column named ‘target’, then this must specify primary target (e.g.,
CodonVariantTable.primary_target
). Variants not from primary target are classified as non_primary_target_class.non_primary_target_class (str) – Classification used for non-primary targets.
class_as_categorical (bool) – Return variant_class as a categorical variable with a reasonable ordering.
- Returns:
- Copy of df with column specified by variant_class_col as:
’wildtype’: no codon mutations
’synonymous’: only synonymous codon mutations
’stop’: at least one stop-codon mutation
’{n_aa} nonsynonymous’ where n_aa is number of amino-acid mutations, or is ‘>{max_aa}’ if more than max_aa.
potentially non_primary_target_class.
- Return type:
pandas.DataFrame
Example
>>> df = pd.DataFrame.from_records( ... [('AAA', '', 0, 0), ... ('AAG', '', 0, 1), ... ('ATA', 'M1* G5K', 2, 3), ... ('GAA', 'G5H', 1, 2), ... ('CTT', 'M1C G5C', 2, 3), ... ('CTT', 'M1A L3T G5C', 3, 3), ... ('AAG', '', 0, 1), ... ], ... columns=['barcode', 'aa_substitutions', ... 'n_aa_substitutions', 'n_codon_substitutions'] ... ) >>> df_classify = CodonVariantTable.classifyVariants(df) >>> all(df_classify.columns == ['barcode', 'aa_substitutions', ... 'n_aa_substitutions', ... 'n_codon_substitutions', ... 'variant_class']) True >>> df_classify[['barcode', 'variant_class']] barcode variant_class 0 AAA wildtype 1 AAG synonymous 2 ATA stop 3 GAA 1 nonsynonymous 4 CTT >1 nonsynonymous 5 CTT >1 nonsynonymous 6 AAG synonymous >>> df_syn_as_wt = CodonVariantTable.classifyVariants(df, ... syn_as_wt=True) >>> df_syn_as_wt[['barcode', 'variant_class']] barcode variant_class 0 AAA wildtype 1 AAG wildtype 2 ATA stop 3 GAA 1 nonsynonymous 4 CTT >1 nonsynonymous 5 CTT >1 nonsynonymous 6 AAG wildtype
Now show how we need to specify how to handle when multiple targets:
>>> df['target'] = ['secondary'] + ['primary'] * 6 >>> (CodonVariantTable.classifyVariants(df) ... [['target', 'barcode', 'variant_class']]) Traceback (most recent call last): ... ValueError: `df` has "target" so give `primary_target`
We need to specify how to handle targets:
>>> (CodonVariantTable.classifyVariants( ... df, ... primary_target='primary', ... non_primary_target_class='homolog') ... [['target', 'barcode', 'variant_class']]) target barcode variant_class 0 secondary AAA homolog 1 primary AAG synonymous 2 primary ATA stop 3 primary GAA 1 nonsynonymous 4 primary CTT >1 nonsynonymous 5 primary CTT >1 nonsynonymous 6 primary AAG synonymous
- classmethod codonToAAMuts(codon_mut_str, allowgaps=False)[source]¶
Convert string of codon mutations to amino-acid mutations.
- Parameters:
codon_mut_str (str) – Codon mutations, delimited by a space and in 1, … numbering.
allowgaps (bool) – Allow gap codon (
---
) translated to gap amino acid (--
).
- Returns:
Amino acid mutations in 1, … numbering.
- Return type:
str
Example
>>> CodonVariantTable.codonToAAMuts('ATG1GTG GGA2GGC TGA3AGA') 'M1V *3R'
- escape_scores(sample_df, score_type, *, pseudocount=0.5, by='barcode', logbase=2, floor_B=0.01, floor_E=0.01, ceil_B=1.0, ceil_E=1.0)[source]¶
Compute a score that represents escape from binding.
Note
Here we couch the explanation in terms of variant escape from antibody binding.
Let \(v\) be a variant, let \(B_v\) be the fraction of this variant that is bound by antibody, and let \(E_v = 1 - B_v\) be the fraction that escapes binding. A variant that completely escapes binding has \(B_v = 0\) and \(E_v = 1\); a variant that is completely bound has \(B_v = 1\) and \(E_v = 0\).
We define the escape score \(s_v\) in one of three ways depending on the value of score_type parameter:
As the escape fraction, so \(s_v = E_v\).
As the log of the escape fraction, so \(s_v = \log_b E_v\) where \(b\) is the logarithm base.
As minus the log of the binding fraction, so \(s_v = -\log_b B_v\).
In all cases, larger values of \(s_v\) indicate more escape.
We calculate \(E_v\) as follows. Let \(f_v^{\rm{pre}}\) be the frequency of \(v\) prior to selection for binding (so \(\sum_v f_v^{\rm{pre}} = 1\)). Then the fraction of the library that is \(v\) after selecting for unbound variants is
\[f_v^{\rm{post}} = \frac{f_v^{\rm{pre}} \times E_v} {\sum_{v'} f_{v'}^{\rm{pre}} \times E_{v'}}.\]Note that the denominator of the above equation, \(F = \sum_v f_v^{\rm{pre}} \times E_v\), represents the overall fraction of the library that escapes binding, which we assume is directly measured experimentally.
We can easily solve for \(E_v\) as
\[E_v = \frac{F \times f_v^{\rm{post}}}{f_v^{\rm{pre}}},\]and can similarly obtain \(B_v\) from \(B_v = 1 - E_v\).
We calculate \(E_v\) directly from the actual counts of the variants pre- and post-selection. Let \(n_v^{\rm{pre}}\) and \(n_v^{\rm{post}}\) be the counts of variant \(v\) pre- and post-selection after adding a pseudocount of \(P \ge 0\), and let \(N^{\rm{pre}} = \sum_v n_v^{\rm{pre}}\) and \(N^{\rm{post}} = \sum_v n_v^{\rm{post}}\) be the total counts of all variants pre- and post-selection. Then:
\[E_v = F \times \frac{n_v^{\rm{post}} N^{\rm{pre}}} {n_v^{\rm{pre}} N^{\rm{post}}}\]A complication is that \(s_v\) can be undefined for certain values of \(E_v\) or \(B_v\). Specifically, when using the log escape fraction definition (\(s_v = \log_b E_v\)) then \(s_v\) is undefined if \(E_v = 0\), so we put a floor on \(E_v\) by defining \(s_v = \log_b \max\left(E_v, E_{\rm{floor}}\right)\). When using the minus log binding fraction definition (\(s_v = -\log_b B_v\)), then \(s_v\) is undefined if \(B_v \le 0\), so we put a floor on \(B_v\) by defining defining \(s_v = -\log_b \max\left(B_v, B_{\rm{floor}}\right)\). We similarly define ceilings on \(E_v\) and \(B_v\), although these ceilings are optional.
We also estimate the variance \(\sigma_{s_v}^2\) on \(s_v\) from the variances on the counts, which we assume are \(\sigma_{n_v^{\rm{pre}}}^2 = n_v^{\rm{pre}}\) and \(\sigma_{n_v^{\rm{post}}}^2 = n_v^{\rm{post}}\) from Poisson counting statistics. To do this, we propagate the errors:
\[\begin{split}\sigma_{s_v}^2 &=& \left(\frac{\partial s_v}{\partial n_v^{\rm{pre}}}\right)^2 \sigma_{n_v^{\rm{pre}}}^2 + \left(\frac{\partial s_v}{\partial n_v^{\rm{post}}}\right)^2 \sigma_{n_v^{\rm{post}}}^2 \\ &=& \left(\frac{\partial s_v}{\partial n_v^{\rm{pre}}}\right)^2 n_v^{\rm{pre}} + \left(\frac{\partial s_v}{\partial n_v^{\rm{post}}}\right)^2 n_v^{\rm{post}}.\end{split}\]We calculate the derivatives of \(s_v\) with respect to the counts numerically with a step size of one rather than analytically, since analytical calculations are confounded by the floors on \(E_v\) or \(B_v\).
- Parameters:
sample_df (pandas.DataFrame) – Comparisons we use to compute the functional scores. Should have these columns: ‘pre_sample’ (pre-selection sample), ‘post_sample’ (post-selection sample), ‘library’, ‘name’ (name for output), ‘frac_escape’ (the overall fraction escaping \(F\)).
score_type ({'frac_escape', 'log_escape', 'minus_log_bind'}) – How to define escape score: \(E_v\) if ‘frac_escape’; \(\log_b E_v\) if ‘log_escape’; \(-\log_b B_v\) if ‘minus_log_bind’.
pseudocount (float) – Pseudocount added to each count.
by ({'barcode', 'aa_substitutions', 'codon_substitutions'}) – Compute effects for each barcode”, set of amino-acid substitutions, or set of codon substitutions. In the last two cases, all barcodes with each set of substitutions are combined.
logbase (float) – Base for logarithm when calculating functional score.
floor_B (float) – Floor assigned to \(B_v\), \(B_{\rm{floor}}\) if score_type is ‘minus_log_bind’.
floor_E (float) – Floor assigned to \(E_v\), \(E_{\rm{floor}}\) if score_type is ‘frac_escape’ or ‘log_escape’.
ceil_B (float or None) – Ceiling assigned to \(B_v\), or None if no ceiling.
ceil_E (float or None) – Ceiling assigned to \(E_v\), or None if no ceiling.
- Returns:
- Has the following columns:
’name’: specified in sample_df
’library’: the library
’pre_sample’: specified in sample_df
’post_sample’: specified in sample_df
the grouping used to compute scores (the value of by)
’score’: \(s_v\)
’score_var’: \(\sigma_{s_v}^2\)
’pre_count’: \(n_v^{\rm{pre}}\) (without pseudocount)
’post_count’: \(n_v^{\rm{post}}\) (without pseudocount)
as many of ‘aa_substitutions’, ‘n_aa_substitutions’, ‘codon_substitutions’, and ‘n_codon_substitutions’ as makes sense to retain given value of by.
- Return type:
pandas.DataFrame
- classmethod from_variant_count_df(*, variant_count_df_file, geneseq, drop_all_libs=True, allowgaps=False, primary_target=None, extra_cols=None)[source]¶
CodonVariantTable
from CSV of variant_count_df.Note
Use this method when you have written a CSV file of variant_count_df attribute of a
CodonVariantTable
, and wish to re-initialize.- Parameters:
variant_count_df_file (str) – Name of CSV file containing the variant_count_df. Must have following columns: “barcode”, “library”, “variant_call_support”, “codon_substitutions”, “sample”, and “count”.
geneseq (str) – Sequence of wildtype protein-coding gene.
drop_all_libs (bool) – If there is a library named “all libraries”, drop it as it probably added by
CodonVariantTable.addMergedLibraries()
and duplicates information for the individual libraries.allowgaps (bool) – Meaning described in main
CodonVariantTable
doc string.primary_target (None or str) – Meaning described in main
CodonVariantTable
doc string.extra_cols (list) – Meaning described in main
CodonVariantTable
doc string.
- Return type:
- func_scores(preselection, *, pseudocount=0.5, by='barcode', libraries='all', syn_as_wt=False, logbase=2, permit_zero_wt=False, permit_self_comp=False)[source]¶
Get data frame with functional scores for variants.
Note
The functional score is calculated from the change in counts for a variant pre- and post-selection using the formula in Otwinoski et al (2018).
Specifically, let \(n^v_{pre}\) and \(n^v_{post}\) be the counts of variant \(v\) pre- and post-selection, and let \(n^{wt}_{pre}\) and \(n^{wt}_{post}\) be the summed counts of all wildtype variants pre- and post- selection.
Then the functional score of the variant is:
\[f_v = \log_b\left(\frac{n^v_{post} / n^{wt}_{post}}{n^v_{pre} / n^{wt}_{pre}}\right).\]The variance due to Poisson sampling statistics is:
\[\sigma^2_v = \frac{1}{\left(\ln b\right)^2} \left(\frac{1}{n^v_{post}} + \frac{1}{n^{wt}_{post}} + \frac{1}{n^v_{pre}} + \frac{1}{n^{wt}_{pre}}\right)\]where \(b\) is logarithm base (see logbase parameter).
For both calculations, a pseudocount (see pseudocount parameter) is added to each count first. The wildtype counts are computed across all fully wildtype variants (see syn_as_wt for how this is defined).
If there are multiple targets, the functional scores for all targets are relative to the wildtype of the primary target.
- Parameters:
preselection (str or dict) – Pre-selection sample. If the same for all post-selection then provide the name as str. If it differs among post-selection samples, then provide a dict keyed by each post-selection sample with the pre-selection sample being the value.
pseudocount (float) – Pseudocount added to each count.
by ({'barcode', 'aa_substitutions', 'codon_substitutions'}) – Compute effects for each barcode”, set of amino-acid substitutions, or set of codon substitutions. In the last two cases, all barcodes with each set of substitutions are combined. If you use “aa_substitutions” then it may be more sensible to set syn_as_wt to True.
syn_as_wt (bool) – In formula for functional scores, consider variants with only synonymous mutations when determining wildtype counts? If False, only variants with no mutations of any type contribute.
libraries ({'all', 'all_only', list}) – Perform calculation for all libraries including a merge (named “all libraries”), only for the merge of all libraries, or for the libraries in the list. If by is ‘barcode’, then the barcodes for the merge have the library name pre-pended.
logbase (float) – Base for logarithm when calculating functional score.
permit_zero_wt (bool) – If the wildtype counts are zero for any sample, raise an error or permit the calculation to proceed just using pseudocount?
permit_self_comp (bool) – Permit comparisons for sample pre- and post-selection?
- Returns:
- Has the following columns:
”library”: the library
”pre_sample”: the pre-selection sample
”post_sample”: the post-selection sample
value corresponding to grouping used to compute effects (by)
”func_score”: the functional score
”func_score_var”: variance on the functional score
”pre_count”: pre-selection counts
”post_count: post-selection counts
”pre_count_wt”: pre-selection counts for all wildtype
”post_count_wt”: post-selection counts for all wildtype
”pseudocount”: the pseudocount value
as many of “aa_substitutions”, “n_aa_substitutions”, “codon_substitutions”, and “n_codon_substitutions” as can be retained given value of by.
- Return type:
pandas.DataFrame
- mutCounts(variant_type, mut_type, *, libraries='all', samples='all', min_support=1, sample_rename=None)[source]¶
Get counts of each individual mutations (only in primary target).
- Parameters:
variant_type ({'single', 'all'}) – Include just single mutants, or all mutants?
other_parameters – Same as for
CodonVariantTable.plotNumMutsHistogram
.
- Returns:
Tidy data frame with columns named “library”, “sample”, “mutation”, “count”, “mutation_type”, and “site”. If there are multiple targets, only returns counts for the primary target.
- Return type:
pandas.DataFrame
- n_variants_df(*, libraries='all', samples='all', min_support=1, variant_type='all', mut_type=None, sample_rename=None, primary_target_only=False)[source]¶
Get number variants per library / sample (and target if specified).
- Parameters:
variant_type ({'single', 'all'}) – Include all variants or just those with <= 1 mut_type mutation.
mut_type ({'aa', 'codon', None}) – If variant_type is ‘single’, indicate what type of single mutants we are filtering for.
primary_target_only (bool) – Only return counts for the primary target.
args (All other) – Same as for
CodonVariantTable.plotNumMutsHistogram
.
- Return type:
pandas.DataFrame
- numCodonMutsByType(variant_type, *, libraries='all', samples='all', min_support=1, sample_rename=None)[source]¶
Get average nonsynonymous, synonymous, stop mutations per variant.
These statistics are only for the primary target.
- Parameters:
all_parameters – Same as for
CodonVariantTable.plotNumCodonMutsByType()
.- Returns:
Data frame with average mutations of each type per variant.
- Return type:
pandas.DataFrame
- plotAvgCountsPerVariant(*, libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, sample_rename=None, one_lib_facet=False, by_target=True)[source]¶
Plot average counts per variant.
- Parameters:
libraries ({'all', 'all_only', list}) – Include all libraries including a merge, only a merge of all libraries, or a list of libraries.
samples ({'all', list}) – Include all samples or just samples in list.
by_target (bool) – If True, also group counts by target if multiple targets.
other_parameters – Same as for
CodonVariantTable.plotNumMutsHistogram
.
- Return type:
plotnine.ggplot.ggplot
- plotCountsPerVariant(*, ystat='frac_counts', logy=True, by_variant_class=False, classifyVariants_kwargs=None, variant_type='all', libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, mut_type='aa', sample_rename=None, one_lib_facet=False, primary_target_only=False)[source]¶
Plot variant index versus counts (or frac counts).
- Parameters:
ystat ({'frac_counts', 'count'}) – Is y-axis counts from variant, or fraction of counts in library / sample from variant?
logy (bool) – Show the y-axis on a log scale. If so, all values of 0 are set to half the minimum observed value > 0, and dashed line is drawn to indicate that points below it are not observed.
other_parameters – Same as for
CodonVariantTable.plotCumulVariantCounts()
.
- Return type:
plotnine.ggplot.ggplot
- plotCumulMutCoverage(variant_type, mut_type, *, libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, max_count=None, sample_rename=None, one_lib_facet=False)[source]¶
Frac mutations seen <= some number of times (primary target only).
- Parameters:
variant_type ({'single', 'all'}) – Include all variants or just those with <=1 mut_type mutation.
max_count (None or int) – Plot cumulative fraction plot out to this number of observations. If None, a reasonable value is automatically determined.
other_parameters – Same as for
CodonVariantTable.plotNumMutsHistogram
.
- Return type:
plotnine.ggplot.ggplot
- plotCumulVariantCounts(*, variant_type='all', libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, mut_type='aa', tot_variants_hline=True, sample_rename=None, one_lib_facet=False, primary_target_only=True)[source]¶
Plot number variants with >= that each number of counts.
- Parameters:
variant_type ({'single', 'all'}) – Include all variants or just those with <=1 mut_type mutation.
tot_variants_hline (bool) – Include dotted horizontal line indicating total number of variants.
primary_target_only (bool) – Only show counts for the primary target.
other_parameters – Same as for
CodonVariantTable.plotNumMutsHistogram()
.
- Return type:
plotnine.ggplot.ggplot
- plotMutFreqs(variant_type, mut_type, *, libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, sample_rename=None, one_lib_facet=False)[source]¶
Mutation frequency along length of gene (primary target only).
- Parameters:
parameters (All) – Same as for
CodonVariantTable.plotCumulMutCoverage()
.- Return type:
plotnine.ggplot.ggplot
- plotMutHeatmap(variant_type, mut_type, *, count_or_frequency='frequency', libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, sample_rename=None, one_lib_facet=False)[source]¶
Heatmap of mutation counts or frequencies (for primary target only).
- Parameters:
count_or_frequency ({'count', 'frequency'}) – Plot mutation counts or frequencies?
other_parameters – Same as for
CodonVariantTable.plotCumulMutCoverage()
.
- Return type:
plotnine.ggplot.ggplot
- plotNumCodonMutsByType(variant_type, *, libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, ylabel=None, sample_rename=None, one_lib_facet=False)[source]¶
Plot average nonsynonymous, synonymous, stop mutations per variant.
These statistics are only for the primary target.
- Parameters:
variant_type ({'single', 'all'}) – Include all variants or just those with <=1 codon mutation.
ylabel (None or str) – If not None, specify y-axis label (otherwise it is autoset).
other_parameters – Same as for
CodonVariantTable.plotNumMutsHistogram
.
- Return type:
plotnine.ggplot.ggplot
- plotNumMutsHistogram(mut_type, *, libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, max_muts=None, sample_rename=None, one_lib_facet=False)[source]¶
Plot histogram of num mutations per variant (primary target only).
- Parameters:
mut_type ({'codon' or 'aa'}) – Mutation type.
libraries ({'all', 'all_only', list}) – Include all libraries including a merge, only a merge of all libraries, or a list of libraries.
samples ({'all', None, list}) – Include all samples, a list of samples, or None to just count each barcoded variant once.
plotfile (None or str) – Name of file to which to save plot.
orientation ({'h', 'v'}) – Facet libraries horizontally or vertically?
widthscale (float) – Expand width of plot by this factor.
heightscale (float) – Expand height of plot by this factor.
min_support (int) – Only include variants with variant call support >= this.
max_muts (int or None) – Group together all variants with >= this many mutations.
sample_rename (dict or None) – Rename samples by specifying original name as key and new name as value.
one_lib_facet (bool) – Plot library facet title even if just one library.
- Return type:
plotnine.ggplot.ggplot
- plotVariantSupportHistogram(*, libraries='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, max_support=None, sample_rename=None, one_lib_facet=False, primary_target_only=False)[source]¶
Plot histogram of variant call support for variants.
- Parameters:
max_support (int or None) – Group together all variants with >= this support.
other_parameters – Same as for
CodonVariantTable.plotNumMutsHistogram
.primary_target_only (bool) – Only include variants that are of the primary target.
- Return type:
plotnine.ggplot.ggplot
- prob_escape(*, selections_df, neut_standard_target='neut_standard', by='barcode', min_neut_standard_frac=0.001, min_neut_standard_count=1000.0, ceil_n_aa_substitutions=4, drop_neut_standard_target=True, primary_target_only=False)[source]¶
Compute probability of escape relative to a neutralization standard.
Assumes one of the targets is the neutralization standard and is unaffected by antibody selection. Then computes escape probability of each variant as \(p_v = \left(n_v^{sel}/S^{sel}\right) / \left(n_v^0/S^0\right)\) where \(n_v^{sel}\) is the counts of variant \(v\) in the selected condition, \(n_v^0\) is the counts of variant \(v\) in the no-antibody contition, \(S^{sel}\) is the total counts of the neutralization standard in the selected condition, and \(S^0\) is the total counts of the neutralization standard in the no-antibody condition.
- Parameters:
selections_df (pandas.DataFrame) – How to pair samples. Should have columns ‘library’, ‘antibody_sample’, and ‘no-antibody_sample’ for each row. The ‘library’ and ‘antibody_sample’ columns must be unique for each row.
neut_standard_target (str) – Name of target that is neutralization standard.
by ({'barcode', 'aa_substitutions', 'codon_substitutions'}) – Compute for each barcode”, set of amino-acid substitutions, or set of codon substitutions. In the last two cases, all barcodes with each set of substitutions are combined.
min_neut_standard_frac (float) – Raise error if neutralization standard not at least this fraction of the total counts for each library / sample.
min_neut_standard_count (int) – Raise error if not at least this many counts for neutralization standard for each library / sample.
ceil_n_aa_substitutions (int) – When computing the returned neutralization data frame, group variants with >= this many amino-acid substitutions.
drop_neut_standard_target (bool) – Drop the neutralization standard variant-level results from the returned data frames.
primary_target_only (bool) – Drop everything except the primary target and neutralization standard target before beginning calculations.
- Returns:
This tuple consists of three pandas data frames:
prob_escape: gives the escape probabilities, and has the following columns:
’library’
’antibody_sample’
’no-antibody_sample’
as many of ‘target’, ‘aa_substitutions’, ‘n_aa_substitutions’, ‘codon_substitutions’, and ‘n_codon_substitutions’ as makes sense to retain given value of by.
’prob_escape’: censored to be between 0 and 1
’prob_escape_uncensored’: not censored to be between 0 and 1
’antibody_count’: counts for variant in antibody condition
’no-antibody_count’: counts for variant no-antibody condition
’antibody_neut_standard_count’: counts for neut standard
’no-antibody_neut_standard_count’: counts for neut standard
neut_standard_fracs : a version of selections_df that also has columns antibody_sample_frac, no-antibody_sample_frac, antibody_sample_count, no-antibody_sample_count giving total counts and fraction of neutralization standard for each row in selections_df.
neutralization for each target and number of amino-acid (up to ceil_n_aa_substitutions), gives the overall probability escape in a column prob_escape.
- Return type:
(prob_escape, neut_standard_fracs, neutralization)
- samples(library)[source]¶
List of all samples for library.
- Parameters:
library (str) – Valid library for the
CodonVariantTable
.- Returns:
All samples for which barcode counts have been added.
- Return type:
list
- subs_to_seq(subs, subs_type='codon')[source]¶
Convert substitutions to full sequence.
- Parameters:
subs (str) – Space delimited substitutions in 1, … numbering.
subs_type ({'codon', 'aa'}) – Are substitutions codon or amino acid?
- Returns:
Sequence created by these mutations, either codon or amino acid depending on value of subs_type.
- Return type:
str
Example
>>> geneseq = 'ATGGGATGA' >>> with tempfile.NamedTemporaryFile(mode='w') as f: ... _ = f.write('library,barcode,substitutions,' ... 'variant_call_support') ... f.flush() ... variants = CodonVariantTable( ... barcode_variant_file=f.name, ... geneseq=geneseq ... ) >>> variants.subs_to_seq('GGA2CGT ATG1GTG') 'GTGCGTTGA' >>> variants.subs_to_seq('*3W M1C', subs_type='aa') 'CGW'
- valid_barcodes(library)[source]¶
Set of valid barcodes for library.
- Parameters:
library (str) – Name of a valid library.
- Return type:
set
- writeCodonCounts(single_or_all, *, outdir=None, include_all_libs=False)[source]¶
Write codon counts files for all libraries and samples.
Only writes the counts for the primary target.
Note
Useful if you want to analyze individual mutations using dms_tools2. File format: https://jbloomlab.github.io/dms_tools2/dms2_bcsubamp.html#counts-file
- Parameters:
single_or_all ({'single', 'all'}) – If ‘single’, then counts just from single-codon mutants and wildtype, and count all wildtype codons and just mutated codons for single-codon mutants. If ‘all’, count all codons for all variants at all sites. Appropriate if enrichment of each mutation is supposed to represent its effect for ‘single’, and if enrichment of mutation is supposed to represent its average effect across genetic backgrounds in the library for ‘all’ provided mutations are Poisson distributed.
outdir (None or str) – Name of directory to write counts, created if it does not exist. Use None to write to current directory.
include_all_libs (bool) – Include data for a library (named “all-libraries”) that has summmed data for all individual libraries if multiple libraries.
- Returns:
Gives names of created files. Has columns “library”, “sample”, and “countfile”. The “countfile” columns gives name of the created CSV file,
<library>_<sample>_codoncounts.csv
.- Return type:
pandas.DataFrame
- dms_variants.codonvarianttable.filter_by_subs_observed(df, min_single_counts, min_any_counts, *, and_vs_or='and', subs_col='aa_substitutions', n_subs_col='n_aa_substitutions')[source]¶
Filter for variants by observations substitutions in entire data frame.
Filter data frames of the type returned by
CodonVariantTable
to get just variants that have substitutions that are observed in some other number of variants.- Parameters:
df (pandas.DataFrame) – Data frame containing variants as rows.
min_single_counts (int) – Only keep variants with substitutions that are observed as single- substitution variants in at least this many variants.
min_any_counts (int) – Only keep variants with substitutions that are observed in at least this many variants with any number of substitutions.
and_vs_or ({'and', 'or'}) – Require variants to pass the min_single_counts and the ‘min_any_counts’ filters, or just require to pass one or the other.
subs_col ('aa_substitutions') – Column in df with the substitutions as space-delimited strings.
n_subs_col ('n_aa_substitutions') – Column in df with the number of substitutions per variant.
- Returns:
A copy of df that only contains the variants that pass the filters.
- Return type:
pandas.DataFrame
Examples
Create a data frame of variants to filter:
>>> df = pd.DataFrame.from_records([ ... ('var1', '', 0), ... ('var2', 'M1A', 1), ... ('var3', 'M1A G2A', 2), ... ('var4', 'M1A G2C', 2), ... ('var5', 'G2A', 1), ... ('var6', 'M1A', 1), ... ('var7', 'M1C', 1), ... ], ... columns=['barcode', 'aa_substitutions', 'n_aa_substitutions'])
Calling with min_single_counts=1 and min_any_counts=1 gets only variants with substitutions that are observed in single-substitution variants:
>>> filter_by_subs_observed(df, 1, 1) barcode aa_substitutions n_aa_substitutions 0 var1 0 1 var2 M1A 1 2 var3 M1A G2A 2 3 var5 G2A 1 4 var6 M1A 1 5 var7 M1C 1
We can also require substitutions to be seen multiple times as single variants:
>>> filter_by_subs_observed(df, 2, 1) barcode aa_substitutions n_aa_substitutions 0 var1 0 1 var2 M1A 1 2 var6 M1A 1
Or that substitutions be seen a specified number of times in both single- and multi-substitution contexts:
>>> filter_by_subs_observed(df, 1, 2) barcode aa_substitutions n_aa_substitutions 0 var1 0 1 var2 M1A 1 2 var3 M1A G2A 2 3 var5 G2A 1 4 var6 M1A 1
We can also make the requirement or rather than and:
>>> filter_by_subs_observed(df, 1, 2, and_vs_or='or') barcode aa_substitutions n_aa_substitutions 0 var1 0 1 var2 M1A 1 2 var3 M1A G2A 2 3 var5 G2A 1 4 var6 M1A 1 5 var7 M1C 1
Do not filter any variants:
>>> filter_by_subs_observed(df, 0, 0) barcode aa_substitutions n_aa_substitutions 0 var1 0 1 var2 M1A 1 2 var3 M1A G2A 2 3 var4 M1A G2C 2 4 var5 G2A 1 5 var6 M1A 1 6 var7 M1C 1