codonvarianttable

Defines CodonVariantTable objects for storing and handling codon variants of a gene.

class dms_variants.codonvarianttable.CodonVariantTable(*, barcode_variant_file, geneseq, allowgaps=False, substitutions_are_codon=False, extra_cols=None, substitutions_col='substitutions', primary_target=None)[source]

Bases: object

Store and handle counts of barcoded codon variants of a gene.

Parameters:
  • barcode_variant_file (str) – CSV file giving barcodes and variants. Must have columns “library”, “barcode”, “substitutions” (nucleotide mutations in 1, …, numbering in a format like “G301A A302T G856C”), and “variant_call_support” (sequences supporting barcode-variant call). Additional columns are removed unless they are specified extra_cols.

  • geneseq (str) – Sequence of wildtype protein-coding gene.

  • substitutions_are_codon (bool) – If True, then “substitutions” column in barcode_variant_file gives substitutions as codon rather than nucleotide mutations (e.g., “ATG1ATA GTA5CCC” for substitutions at codons 1 and 5.

  • allowgaps (bool) – Allow in-frame codon-length gaps (---) as valid codon substitution.

  • extra_cols (list) – Additional columns in barcode_variant_file to retain when creating barcode_variant_df and variant_count_df attributes.

  • substitutions_col (str) – Name of substitutions column in barcode_variant_file (use if you want it to be something other than “substitutions”).

  • primary_target (str or None) – Use this option if you have additional targets beyond the main gene for which we are analyzing variants. This might be the case if you have spiked other genes into the library. If this option is set to something other than None, then there must be a column in barcode_variant_file named “target” and one of these targets must be equal to ‘primary_target’. If there are other targets, they should not have any substitutions as we don’t parse substitutions in non-primary targets. Instead, substitutions_col for secondary targets should be empty or just have the name of the secondary target.

geneseq

Wild-type sequence passed at initialization.

Type:

str

sites

List of all codon sites in 1, 2, … numbering.

Type:

list

codons

codons[r] is wildtype codon at site r, ordered by sites.

Type:

collections.OrderedDict

aas

aas[r] is wildtype amino acid at site r, ordered by sites.

Type:

collections.OrderedDict

libraries

List of libraries in barcode_variant_file.

Type:

list

barcode_variant_df

Info about codon mutations parsed from barcode_variant_file. For non-primary targets, the mutations column just hold the target name.

Type:

pandas.DataFrame

variant_count_df

Initially None, but after data added with CodonVariantTable.addSampleCounts, holds counts of each variant for each sample. Differs from barcode_variant_df in that the former just holds barcode-variant definitions, whereas variant_count_df has counts for each sample.

Type:

pandas.DataFrame or None

primary_target

If multiple targets, name of the main target for which we are calling variants.

Type:

str or None

static addMergedLibraries(df, *, all_lib='all libraries')[source]

Add data to df for all libraries merged.

Parameters:
  • df (pandas.DataFrame) – Data frame that includes columns named ‘library’ and ‘barcode’.

  • all_lib (str) – Name given to library that is merge of all other libraries.

Returns:

If df only has data for one library, just returns df. Otherwise returns copy of df with new library with name given by all_lib that contains data for all individual libraries with the ‘barcode’ column giving library name followed by a hyphen and barcode.

Return type:

pandas.DataFrame

addSampleCounts(library, sample, barcodecounts)[source]

Add variant counts for a sample to variant_count_df.

Note

If you have many samples to add at once, it is faster to use CodonVariantTable.add_sample_counts_df() rather than repeatedly calling this method.

Parameters:
classmethod add_frac_counts(variant_count_df)[source]

Add fraction of counts from each variant in library/sample.

Parameters:

variant_count_df (pandas.DataFrame) – Same format as CodonVariantTable.

Returns:

A copy of variant_count_df with added column ‘frac_counts’ that gives fraction of all counts in that library / sample for that variant.

Return type:

pandas.DataFrame

Example

>>> variant_count_df = pd.DataFrame.from_records(
...        [('lib1', 's1', 'AA', 1),
...         ('lib1', 's1', 'AT', 3),
...         ('lib1', 's2', 'GG', 0),
...         ('lib1', 's2', 'GA', 10),
...         ('lib2', 's1', 'CC', 5),
...         ('lib2', 's1', 'GA', 5),
...         ],
...        columns=['library', 'sample', 'barcode', 'count'],
...        )
>>> CodonVariantTable.add_frac_counts(variant_count_df)
  library sample barcode  count  frac_counts
0    lib1     s1      AA      1         0.25
1    lib1     s1      AT      3         0.75
2    lib1     s2      GG      0         0.00
3    lib1     s2      GA     10         1.00
4    lib2     s1      CC      5         0.50
5    lib2     s1      GA      5         0.50
add_full_seqs(df, *, aa_seq_col='aa_sequence', codon_seq_col='codon_sequence')[source]

Add full sequences to data frame, for primary target only.

Parameters:
  • df (pandas.DataFrame) – Data frame that specifies substitutions. Might be the barcode_variant_df or variant_count_df attributes, or the result of calling CodonVariantTable.func_scores().

  • aa_seq_col (str or None) – Name of added column with amino-acid sequences. In order to add, df must have column named ‘aa_substitutions’.

  • codon_seq_col (str or None) – Name of added column with codon sequences. In order to add, df must have column named ‘codon_substitutions’.

Returns:

Copy of df with columns aa_seq_col and/or codon_seq_col, and only the primary target retained if there is a ‘target’ column in df.

Return type:

pandas.DataFrame

add_sample_counts_df(counts_df)[source]

Add variant counts for several samples to variant_count_df.

Parameters:

counts_df (pandas.DataFrame) – Must have columns ‘library’, ‘sample’, ‘barcode’, and ‘count’. The sample must not already be in CodonVariantTable.samples for that library. The barcode columns must have all barcodes for that library including zero-count ones.

avgCountsPerVariant(*, libraries='all', samples='all', min_support=1, sample_rename=None, by_target=True)[source]

Get average counts per variant.

Parameters:
  • libraries ({'all', 'all_only', list}) – Include all libraries including a merge, only a merge of all libraries, or a list of libraries.

  • samples ({'all', list}) – Include all samples or just samples in list.

  • min_support (int) – Only include variants with variant call support >= this.

  • sample_rename (dict or None) – Rename samples by specifying original name as key and new name as value.

  • by_target (bool) – If True, also group counts by target if multiple targets.

Returns:

Average counts per variant for each library and sample, and possibly target.

Return type:

pandas.DataFrame

static classifyVariants(df, *, variant_class_col='variant_class', max_aa=2, syn_as_wt=False, primary_target=None, non_primary_target_class='secondary target', class_as_categorical=False)[source]

Classifies codon variants in df.

Parameters:
  • df (pandas.DataFrame) – Must have columns named ‘aa_substitutions’, ‘n_aa_substitutions’, and ‘n_codon_substitutions’. Such a data frame can be obtained via variant_count_df or barcode_variant_df attributes of a CodonVariantTable or via CodonVariantTable.func_scores().

  • variant_class_col (str) – Name of column added to df that contains variant classification. Overwritten if already exists.

  • max_aa (int) – When classifying, group all with >= this many amino-acid mutations.

  • syn_as_wt (bool) – Do not have a category of ‘synonymous’ and instead classify synonymous variants as ‘wildtype’. If using this option, df does not need column ‘n_codon_substitutions’.

  • primary_target (None or str) – If df has a column named ‘target’, then this must specify primary target (e.g., CodonVariantTable.primary_target). Variants not from primary target are classified as non_primary_target_class.

  • non_primary_target_class (str) – Classification used for non-primary targets.

  • class_as_categorical (bool) – Return variant_class as a categorical variable with a reasonable ordering.

Returns:

Copy of df with column specified by variant_class_col as:
  • ’wildtype’: no codon mutations

  • ’synonymous’: only synonymous codon mutations

  • ’stop’: at least one stop-codon mutation

  • ’{n_aa} nonsynonymous’ where n_aa is number of amino-acid mutations, or is ‘>{max_aa}’ if more than max_aa.

  • potentially non_primary_target_class.

Return type:

pandas.DataFrame

Example

>>> df = pd.DataFrame.from_records(
...         [('AAA', '', 0, 0),
...          ('AAG', '', 0, 1),
...          ('ATA', 'M1* G5K', 2, 3),
...          ('GAA', 'G5H', 1, 2),
...          ('CTT', 'M1C G5C', 2, 3),
...          ('CTT', 'M1A L3T G5C', 3, 3),
...          ('AAG', '', 0, 1),
...          ],
...         columns=['barcode', 'aa_substitutions',
...                  'n_aa_substitutions', 'n_codon_substitutions']
...         )
>>> df_classify = CodonVariantTable.classifyVariants(df)
>>> all(df_classify.columns == ['barcode', 'aa_substitutions',
...                             'n_aa_substitutions',
...                             'n_codon_substitutions',
...                             'variant_class'])
True
>>> df_classify[['barcode', 'variant_class']]
  barcode     variant_class
0     AAA          wildtype
1     AAG        synonymous
2     ATA              stop
3     GAA   1 nonsynonymous
4     CTT  >1 nonsynonymous
5     CTT  >1 nonsynonymous
6     AAG        synonymous
>>> df_syn_as_wt = CodonVariantTable.classifyVariants(df,
...                                                   syn_as_wt=True)
>>> df_syn_as_wt[['barcode', 'variant_class']]
  barcode     variant_class
0     AAA          wildtype
1     AAG          wildtype
2     ATA              stop
3     GAA   1 nonsynonymous
4     CTT  >1 nonsynonymous
5     CTT  >1 nonsynonymous
6     AAG          wildtype

Now show how we need to specify how to handle when multiple targets:

>>> df['target'] = ['secondary'] + ['primary'] * 6
>>> (CodonVariantTable.classifyVariants(df)
...  [['target', 'barcode', 'variant_class']])
Traceback (most recent call last):
    ...
ValueError: `df` has "target" so give `primary_target`

We need to specify how to handle targets:

>>> (CodonVariantTable.classifyVariants(
...                         df,
...                         primary_target='primary',
...                         non_primary_target_class='homolog')
...  [['target', 'barcode', 'variant_class']])
      target barcode     variant_class
0  secondary     AAA           homolog
1    primary     AAG        synonymous
2    primary     ATA              stop
3    primary     GAA   1 nonsynonymous
4    primary     CTT  >1 nonsynonymous
5    primary     CTT  >1 nonsynonymous
6    primary     AAG        synonymous
classmethod codonToAAMuts(codon_mut_str, allowgaps=False)[source]

Convert string of codon mutations to amino-acid mutations.

Parameters:
  • codon_mut_str (str) – Codon mutations, delimited by a space and in 1, … numbering.

  • allowgaps (bool) – Allow gap codon (---) translated to gap amino acid (--).

Returns:

Amino acid mutations in 1, … numbering.

Return type:

str

Example

>>> CodonVariantTable.codonToAAMuts('ATG1GTG GGA2GGC TGA3AGA')
'M1V *3R'
escape_scores(sample_df, score_type, *, pseudocount=0.5, by='barcode', logbase=2, floor_B=0.01, floor_E=0.01, ceil_B=1.0, ceil_E=1.0)[source]

Compute a score that represents escape from binding.

Note

Here we couch the explanation in terms of variant escape from antibody binding.

Let \(v\) be a variant, let \(B_v\) be the fraction of this variant that is bound by antibody, and let \(E_v = 1 - B_v\) be the fraction that escapes binding. A variant that completely escapes binding has \(B_v = 0\) and \(E_v = 1\); a variant that is completely bound has \(B_v = 1\) and \(E_v = 0\).

We define the escape score \(s_v\) in one of three ways depending on the value of score_type parameter:

  1. As the escape fraction, so \(s_v = E_v\).

  2. As the log of the escape fraction, so \(s_v = \log_b E_v\) where \(b\) is the logarithm base.

  3. As minus the log of the binding fraction, so \(s_v = -\log_b B_v\).

In all cases, larger values of \(s_v\) indicate more escape.

We calculate \(E_v\) as follows. Let \(f_v^{\rm{pre}}\) be the frequency of \(v\) prior to selection for binding (so \(\sum_v f_v^{\rm{pre}} = 1\)). Then the fraction of the library that is \(v\) after selecting for unbound variants is

\[f_v^{\rm{post}} = \frac{f_v^{\rm{pre}} \times E_v} {\sum_{v'} f_{v'}^{\rm{pre}} \times E_{v'}}.\]

Note that the denominator of the above equation, \(F = \sum_v f_v^{\rm{pre}} \times E_v\), represents the overall fraction of the library that escapes binding, which we assume is directly measured experimentally.

We can easily solve for \(E_v\) as

\[E_v = \frac{F \times f_v^{\rm{post}}}{f_v^{\rm{pre}}},\]

and can similarly obtain \(B_v\) from \(B_v = 1 - E_v\).

We calculate \(E_v\) directly from the actual counts of the variants pre- and post-selection. Let \(n_v^{\rm{pre}}\) and \(n_v^{\rm{post}}\) be the counts of variant \(v\) pre- and post-selection after adding a pseudocount of \(P \ge 0\), and let \(N^{\rm{pre}} = \sum_v n_v^{\rm{pre}}\) and \(N^{\rm{post}} = \sum_v n_v^{\rm{post}}\) be the total counts of all variants pre- and post-selection. Then:

\[E_v = F \times \frac{n_v^{\rm{post}} N^{\rm{pre}}} {n_v^{\rm{pre}} N^{\rm{post}}}\]

A complication is that \(s_v\) can be undefined for certain values of \(E_v\) or \(B_v\). Specifically, when using the log escape fraction definition (\(s_v = \log_b E_v\)) then \(s_v\) is undefined if \(E_v = 0\), so we put a floor on \(E_v\) by defining \(s_v = \log_b \max\left(E_v, E_{\rm{floor}}\right)\). When using the minus log binding fraction definition (\(s_v = -\log_b B_v\)), then \(s_v\) is undefined if \(B_v \le 0\), so we put a floor on \(B_v\) by defining defining \(s_v = -\log_b \max\left(B_v, B_{\rm{floor}}\right)\). We similarly define ceilings on \(E_v\) and \(B_v\), although these ceilings are optional.

We also estimate the variance \(\sigma_{s_v}^2\) on \(s_v\) from the variances on the counts, which we assume are \(\sigma_{n_v^{\rm{pre}}}^2 = n_v^{\rm{pre}}\) and \(\sigma_{n_v^{\rm{post}}}^2 = n_v^{\rm{post}}\) from Poisson counting statistics. To do this, we propagate the errors:

\[\begin{split}\sigma_{s_v}^2 &=& \left(\frac{\partial s_v}{\partial n_v^{\rm{pre}}}\right)^2 \sigma_{n_v^{\rm{pre}}}^2 + \left(\frac{\partial s_v}{\partial n_v^{\rm{post}}}\right)^2 \sigma_{n_v^{\rm{post}}}^2 \\ &=& \left(\frac{\partial s_v}{\partial n_v^{\rm{pre}}}\right)^2 n_v^{\rm{pre}} + \left(\frac{\partial s_v}{\partial n_v^{\rm{post}}}\right)^2 n_v^{\rm{post}}.\end{split}\]

We calculate the derivatives of \(s_v\) with respect to the counts numerically with a step size of one rather than analytically, since analytical calculations are confounded by the floors on \(E_v\) or \(B_v\).

Parameters:
  • sample_df (pandas.DataFrame) – Comparisons we use to compute the functional scores. Should have these columns: ‘pre_sample’ (pre-selection sample), ‘post_sample’ (post-selection sample), ‘library’, ‘name’ (name for output), ‘frac_escape’ (the overall fraction escaping \(F\)).

  • score_type ({'frac_escape', 'log_escape', 'minus_log_bind'}) – How to define escape score: \(E_v\) if ‘frac_escape’; \(\log_b E_v\) if ‘log_escape’; \(-\log_b B_v\) if ‘minus_log_bind’.

  • pseudocount (float) – Pseudocount added to each count.

  • by ({'barcode', 'aa_substitutions', 'codon_substitutions'}) – Compute effects for each barcode”, set of amino-acid substitutions, or set of codon substitutions. In the last two cases, all barcodes with each set of substitutions are combined.

  • logbase (float) – Base for logarithm when calculating functional score.

  • floor_B (float) – Floor assigned to \(B_v\), \(B_{\rm{floor}}\) if score_type is ‘minus_log_bind’.

  • floor_E (float) – Floor assigned to \(E_v\), \(E_{\rm{floor}}\) if score_type is ‘frac_escape’ or ‘log_escape’.

  • ceil_B (float or None) – Ceiling assigned to \(B_v\), or None if no ceiling.

  • ceil_E (float or None) – Ceiling assigned to \(E_v\), or None if no ceiling.

Returns:

Has the following columns:
  • ’name’: specified in sample_df

  • ’library’: the library

  • ’pre_sample’: specified in sample_df

  • ’post_sample’: specified in sample_df

  • the grouping used to compute scores (the value of by)

  • ’score’: \(s_v\)

  • ’score_var’: \(\sigma_{s_v}^2\)

  • ’pre_count’: \(n_v^{\rm{pre}}\) (without pseudocount)

  • ’post_count’: \(n_v^{\rm{post}}\) (without pseudocount)

  • as many of ‘aa_substitutions’, ‘n_aa_substitutions’, ‘codon_substitutions’, and ‘n_codon_substitutions’ as makes sense to retain given value of by.

Return type:

pandas.DataFrame

classmethod from_variant_count_df(*, variant_count_df_file, geneseq, drop_all_libs=True, allowgaps=False, primary_target=None, extra_cols=None)[source]

CodonVariantTable from CSV of variant_count_df.

Note

Use this method when you have written a CSV file of variant_count_df attribute of a CodonVariantTable, and wish to re-initialize.

Parameters:
  • variant_count_df_file (str) – Name of CSV file containing the variant_count_df. Must have following columns: “barcode”, “library”, “variant_call_support”, “codon_substitutions”, “sample”, and “count”.

  • geneseq (str) – Sequence of wildtype protein-coding gene.

  • drop_all_libs (bool) – If there is a library named “all libraries”, drop it as it probably added by CodonVariantTable.addMergedLibraries() and duplicates information for the individual libraries.

  • allowgaps (bool) – Meaning described in main CodonVariantTable doc string.

  • primary_target (None or str) – Meaning described in main CodonVariantTable doc string.

  • extra_cols (list) – Meaning described in main CodonVariantTable doc string.

Return type:

CodonVariantTable

func_scores(preselection, *, pseudocount=0.5, by='barcode', libraries='all', syn_as_wt=False, logbase=2, permit_zero_wt=False, permit_self_comp=False)[source]

Get data frame with functional scores for variants.

Note

The functional score is calculated from the change in counts for a variant pre- and post-selection using the formula in Otwinoski et al (2018).

Specifically, let \(n^v_{pre}\) and \(n^v_{post}\) be the counts of variant \(v\) pre- and post-selection, and let \(n^{wt}_{pre}\) and \(n^{wt}_{post}\) be the summed counts of all wildtype variants pre- and post- selection.

Then the functional score of the variant is:

\[f_v = \log_b\left(\frac{n^v_{post} / n^{wt}_{post}}{n^v_{pre} / n^{wt}_{pre}}\right).\]

The variance due to Poisson sampling statistics is:

\[\sigma^2_v = \frac{1}{\left(\ln b\right)^2} \left(\frac{1}{n^v_{post}} + \frac{1}{n^{wt}_{post}} + \frac{1}{n^v_{pre}} + \frac{1}{n^{wt}_{pre}}\right)\]

where \(b\) is logarithm base (see logbase parameter).

For both calculations, a pseudocount (see pseudocount parameter) is added to each count first. The wildtype counts are computed across all fully wildtype variants (see syn_as_wt for how this is defined).

If there are multiple targets, the functional scores for all targets are relative to the wildtype of the primary target.

Parameters:
  • preselection (str or dict) – Pre-selection sample. If the same for all post-selection then provide the name as str. If it differs among post-selection samples, then provide a dict keyed by each post-selection sample with the pre-selection sample being the value.

  • pseudocount (float) – Pseudocount added to each count.

  • by ({'barcode', 'aa_substitutions', 'codon_substitutions'}) – Compute effects for each barcode”, set of amino-acid substitutions, or set of codon substitutions. In the last two cases, all barcodes with each set of substitutions are combined. If you use “aa_substitutions” then it may be more sensible to set syn_as_wt to True.

  • syn_as_wt (bool) – In formula for functional scores, consider variants with only synonymous mutations when determining wildtype counts? If False, only variants with no mutations of any type contribute.

  • libraries ({'all', 'all_only', list}) – Perform calculation for all libraries including a merge (named “all libraries”), only for the merge of all libraries, or for the libraries in the list. If by is ‘barcode’, then the barcodes for the merge have the library name pre-pended.

  • logbase (float) – Base for logarithm when calculating functional score.

  • permit_zero_wt (bool) – If the wildtype counts are zero for any sample, raise an error or permit the calculation to proceed just using pseudocount?

  • permit_self_comp (bool) – Permit comparisons for sample pre- and post-selection?

Returns:

Has the following columns:
  • ”library”: the library

  • ”pre_sample”: the pre-selection sample

  • ”post_sample”: the post-selection sample

  • value corresponding to grouping used to compute effects (by)

  • ”func_score”: the functional score

  • ”func_score_var”: variance on the functional score

  • ”pre_count”: pre-selection counts

  • ”post_count: post-selection counts

  • ”pre_count_wt”: pre-selection counts for all wildtype

  • ”post_count_wt”: post-selection counts for all wildtype

  • ”pseudocount”: the pseudocount value

  • as many of “aa_substitutions”, “n_aa_substitutions”, “codon_substitutions”, and “n_codon_substitutions” as can be retained given value of by.

Return type:

pandas.DataFrame

mutCounts(variant_type, mut_type, *, libraries='all', samples='all', min_support=1, sample_rename=None)[source]

Get counts of each individual mutations (only in primary target).

Parameters:
Returns:

Tidy data frame with columns named “library”, “sample”, “mutation”, “count”, “mutation_type”, and “site”. If there are multiple targets, only returns counts for the primary target.

Return type:

pandas.DataFrame

n_variants_df(*, libraries='all', samples='all', min_support=1, variant_type='all', mut_type=None, sample_rename=None, primary_target_only=False)[source]

Get number variants per library / sample (and target if specified).

Parameters:
  • variant_type ({'single', 'all'}) – Include all variants or just those with <= 1 mut_type mutation.

  • mut_type ({'aa', 'codon', None}) – If variant_type is ‘single’, indicate what type of single mutants we are filtering for.

  • primary_target_only (bool) – Only return counts for the primary target.

  • args (All other) – Same as for CodonVariantTable.plotNumMutsHistogram.

Return type:

pandas.DataFrame

numCodonMutsByType(variant_type, *, libraries='all', samples='all', min_support=1, sample_rename=None)[source]

Get average nonsynonymous, synonymous, stop mutations per variant.

These statistics are only for the primary target.

Parameters:

all_parameters – Same as for CodonVariantTable.plotNumCodonMutsByType().

Returns:

Data frame with average mutations of each type per variant.

Return type:

pandas.DataFrame

plotAvgCountsPerVariant(*, libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, sample_rename=None, one_lib_facet=False, by_target=True)[source]

Plot average counts per variant.

Parameters:
  • libraries ({'all', 'all_only', list}) – Include all libraries including a merge, only a merge of all libraries, or a list of libraries.

  • samples ({'all', list}) – Include all samples or just samples in list.

  • by_target (bool) – If True, also group counts by target if multiple targets.

  • other_parameters – Same as for CodonVariantTable.plotNumMutsHistogram.

Return type:

plotnine.ggplot.ggplot

plotCountsPerVariant(*, ystat='frac_counts', logy=True, by_variant_class=False, classifyVariants_kwargs=None, variant_type='all', libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, mut_type='aa', sample_rename=None, one_lib_facet=False, primary_target_only=False)[source]

Plot variant index versus counts (or frac counts).

Parameters:
  • ystat ({'frac_counts', 'count'}) – Is y-axis counts from variant, or fraction of counts in library / sample from variant?

  • logy (bool) – Show the y-axis on a log scale. If so, all values of 0 are set to half the minimum observed value > 0, and dashed line is drawn to indicate that points below it are not observed.

  • other_parameters – Same as for CodonVariantTable.plotCumulVariantCounts().

Return type:

plotnine.ggplot.ggplot

plotCumulMutCoverage(variant_type, mut_type, *, libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, max_count=None, sample_rename=None, one_lib_facet=False)[source]

Frac mutations seen <= some number of times (primary target only).

Parameters:
  • variant_type ({'single', 'all'}) – Include all variants or just those with <=1 mut_type mutation.

  • max_count (None or int) – Plot cumulative fraction plot out to this number of observations. If None, a reasonable value is automatically determined.

  • other_parameters – Same as for CodonVariantTable.plotNumMutsHistogram.

Return type:

plotnine.ggplot.ggplot

plotCumulVariantCounts(*, variant_type='all', libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, mut_type='aa', tot_variants_hline=True, sample_rename=None, one_lib_facet=False, primary_target_only=True)[source]

Plot number variants with >= that each number of counts.

Parameters:
  • variant_type ({'single', 'all'}) – Include all variants or just those with <=1 mut_type mutation.

  • tot_variants_hline (bool) – Include dotted horizontal line indicating total number of variants.

  • primary_target_only (bool) – Only show counts for the primary target.

  • other_parameters – Same as for CodonVariantTable.plotNumMutsHistogram().

Return type:

plotnine.ggplot.ggplot

plotMutFreqs(variant_type, mut_type, *, libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, sample_rename=None, one_lib_facet=False)[source]

Mutation frequency along length of gene (primary target only).

Parameters:

parameters (All) – Same as for CodonVariantTable.plotCumulMutCoverage().

Return type:

plotnine.ggplot.ggplot

plotMutHeatmap(variant_type, mut_type, *, count_or_frequency='frequency', libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, sample_rename=None, one_lib_facet=False)[source]

Heatmap of mutation counts or frequencies (for primary target only).

Parameters:
Return type:

plotnine.ggplot.ggplot

plotNumCodonMutsByType(variant_type, *, libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, ylabel=None, sample_rename=None, one_lib_facet=False)[source]

Plot average nonsynonymous, synonymous, stop mutations per variant.

These statistics are only for the primary target.

Parameters:
  • variant_type ({'single', 'all'}) – Include all variants or just those with <=1 codon mutation.

  • ylabel (None or str) – If not None, specify y-axis label (otherwise it is autoset).

  • other_parameters – Same as for CodonVariantTable.plotNumMutsHistogram.

Return type:

plotnine.ggplot.ggplot

plotNumMutsHistogram(mut_type, *, libraries='all', samples='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, min_support=1, max_muts=None, sample_rename=None, one_lib_facet=False)[source]

Plot histogram of num mutations per variant (primary target only).

Parameters:
  • mut_type ({'codon' or 'aa'}) – Mutation type.

  • libraries ({'all', 'all_only', list}) – Include all libraries including a merge, only a merge of all libraries, or a list of libraries.

  • samples ({'all', None, list}) – Include all samples, a list of samples, or None to just count each barcoded variant once.

  • plotfile (None or str) – Name of file to which to save plot.

  • orientation ({'h', 'v'}) – Facet libraries horizontally or vertically?

  • widthscale (float) – Expand width of plot by this factor.

  • heightscale (float) – Expand height of plot by this factor.

  • min_support (int) – Only include variants with variant call support >= this.

  • max_muts (int or None) – Group together all variants with >= this many mutations.

  • sample_rename (dict or None) – Rename samples by specifying original name as key and new name as value.

  • one_lib_facet (bool) – Plot library facet title even if just one library.

Return type:

plotnine.ggplot.ggplot

plotVariantSupportHistogram(*, libraries='all', plotfile=None, orientation='h', widthscale=1, heightscale=1, max_support=None, sample_rename=None, one_lib_facet=False, primary_target_only=False)[source]

Plot histogram of variant call support for variants.

Parameters:
  • max_support (int or None) – Group together all variants with >= this support.

  • other_parameters – Same as for CodonVariantTable.plotNumMutsHistogram.

  • primary_target_only (bool) – Only include variants that are of the primary target.

Return type:

plotnine.ggplot.ggplot

prob_escape(*, selections_df, neut_standard_target='neut_standard', by='barcode', min_neut_standard_frac=0.001, min_neut_standard_count=1000.0, ceil_n_aa_substitutions=4, drop_neut_standard_target=True, primary_target_only=False)[source]

Compute probability of escape relative to a neutralization standard.

Assumes one of the targets is the neutralization standard and is unaffected by antibody selection. Then computes escape probability of each variant as \(p_v = \left(n_v^{sel}/S^{sel}\right) / \left(n_v^0/S^0\right)\) where \(n_v^{sel}\) is the counts of variant \(v\) in the selected condition, \(n_v^0\) is the counts of variant \(v\) in the no-antibody contition, \(S^{sel}\) is the total counts of the neutralization standard in the selected condition, and \(S^0\) is the total counts of the neutralization standard in the no-antibody condition.

Parameters:
  • selections_df (pandas.DataFrame) – How to pair samples. Should have columns ‘library’, ‘antibody_sample’, and ‘no-antibody_sample’ for each row. The ‘library’ and ‘antibody_sample’ columns must be unique for each row.

  • neut_standard_target (str) – Name of target that is neutralization standard.

  • by ({'barcode', 'aa_substitutions', 'codon_substitutions'}) – Compute for each barcode”, set of amino-acid substitutions, or set of codon substitutions. In the last two cases, all barcodes with each set of substitutions are combined.

  • min_neut_standard_frac (float) – Raise error if neutralization standard not at least this fraction of the total counts for each library / sample.

  • min_neut_standard_count (int) – Raise error if not at least this many counts for neutralization standard for each library / sample.

  • ceil_n_aa_substitutions (int) – When computing the returned neutralization data frame, group variants with >= this many amino-acid substitutions.

  • drop_neut_standard_target (bool) – Drop the neutralization standard variant-level results from the returned data frames.

  • primary_target_only (bool) – Drop everything except the primary target and neutralization standard target before beginning calculations.

Returns:

This tuple consists of three pandas data frames:

  • prob_escape: gives the escape probabilities, and has the following columns:

    • ’library’

    • ’antibody_sample’

    • ’no-antibody_sample’

    • as many of ‘target’, ‘aa_substitutions’, ‘n_aa_substitutions’, ‘codon_substitutions’, and ‘n_codon_substitutions’ as makes sense to retain given value of by.

    • ’prob_escape’: censored to be between 0 and 1

    • ’prob_escape_uncensored’: not censored to be between 0 and 1

    • ’antibody_count’: counts for variant in antibody condition

    • ’no-antibody_count’: counts for variant no-antibody condition

    • ’antibody_neut_standard_count’: counts for neut standard

    • ’no-antibody_neut_standard_count’: counts for neut standard

  • neut_standard_fracs : a version of selections_df that also has columns antibody_sample_frac, no-antibody_sample_frac, antibody_sample_count, no-antibody_sample_count giving total counts and fraction of neutralization standard for each row in selections_df.

  • neutralization for each target and number of amino-acid (up to ceil_n_aa_substitutions), gives the overall probability escape in a column prob_escape.

Return type:

(prob_escape, neut_standard_fracs, neutralization)

samples(library)[source]

List of all samples for library.

Parameters:

library (str) – Valid library for the CodonVariantTable.

Returns:

All samples for which barcode counts have been added.

Return type:

list

subs_to_seq(subs, subs_type='codon')[source]

Convert substitutions to full sequence.

Parameters:
  • subs (str) – Space delimited substitutions in 1, … numbering.

  • subs_type ({'codon', 'aa'}) – Are substitutions codon or amino acid?

Returns:

Sequence created by these mutations, either codon or amino acid depending on value of subs_type.

Return type:

str

Example

>>> geneseq = 'ATGGGATGA'
>>> with tempfile.NamedTemporaryFile(mode='w') as f:
...     _ = f.write('library,barcode,substitutions,'
...                 'variant_call_support')
...     f.flush()
...     variants = CodonVariantTable(
...                 barcode_variant_file=f.name,
...                 geneseq=geneseq
...                 )
>>> variants.subs_to_seq('GGA2CGT ATG1GTG')
'GTGCGTTGA'
>>> variants.subs_to_seq('*3W M1C', subs_type='aa')
'CGW'
valid_barcodes(library)[source]

Set of valid barcodes for library.

Parameters:

library (str) – Name of a valid library.

Return type:

set

writeCodonCounts(single_or_all, *, outdir=None, include_all_libs=False)[source]

Write codon counts files for all libraries and samples.

Only writes the counts for the primary target.

Note

Useful if you want to analyze individual mutations using dms_tools2. File format: https://jbloomlab.github.io/dms_tools2/dms2_bcsubamp.html#counts-file

Parameters:
  • single_or_all ({'single', 'all'}) – If ‘single’, then counts just from single-codon mutants and wildtype, and count all wildtype codons and just mutated codons for single-codon mutants. If ‘all’, count all codons for all variants at all sites. Appropriate if enrichment of each mutation is supposed to represent its effect for ‘single’, and if enrichment of mutation is supposed to represent its average effect across genetic backgrounds in the library for ‘all’ provided mutations are Poisson distributed.

  • outdir (None or str) – Name of directory to write counts, created if it does not exist. Use None to write to current directory.

  • include_all_libs (bool) – Include data for a library (named “all-libraries”) that has summmed data for all individual libraries if multiple libraries.

Returns:

Gives names of created files. Has columns “library”, “sample”, and “countfile”. The “countfile” columns gives name of the created CSV file, <library>_<sample>_codoncounts.csv.

Return type:

pandas.DataFrame

dms_variants.codonvarianttable.filter_by_subs_observed(df, min_single_counts, min_any_counts, *, and_vs_or='and', subs_col='aa_substitutions', n_subs_col='n_aa_substitutions')[source]

Filter for variants by observations substitutions in entire data frame.

Filter data frames of the type returned by CodonVariantTable to get just variants that have substitutions that are observed in some other number of variants.

Parameters:
  • df (pandas.DataFrame) – Data frame containing variants as rows.

  • min_single_counts (int) – Only keep variants with substitutions that are observed as single- substitution variants in at least this many variants.

  • min_any_counts (int) – Only keep variants with substitutions that are observed in at least this many variants with any number of substitutions.

  • and_vs_or ({'and', 'or'}) – Require variants to pass the min_single_counts and the ‘min_any_counts’ filters, or just require to pass one or the other.

  • subs_col ('aa_substitutions') – Column in df with the substitutions as space-delimited strings.

  • n_subs_col ('n_aa_substitutions') – Column in df with the number of substitutions per variant.

Returns:

A copy of df that only contains the variants that pass the filters.

Return type:

pandas.DataFrame

Examples

Create a data frame of variants to filter:

>>> df = pd.DataFrame.from_records([
...          ('var1', '', 0),
...          ('var2', 'M1A', 1),
...          ('var3', 'M1A G2A', 2),
...          ('var4', 'M1A G2C', 2),
...          ('var5', 'G2A', 1),
...          ('var6', 'M1A', 1),
...          ('var7', 'M1C', 1),
...          ],
...          columns=['barcode', 'aa_substitutions', 'n_aa_substitutions'])

Calling with min_single_counts=1 and min_any_counts=1 gets only variants with substitutions that are observed in single-substitution variants:

>>> filter_by_subs_observed(df, 1, 1)
  barcode aa_substitutions  n_aa_substitutions
0    var1                                    0
1    var2              M1A                   1
2    var3          M1A G2A                   2
3    var5              G2A                   1
4    var6              M1A                   1
5    var7              M1C                   1

We can also require substitutions to be seen multiple times as single variants:

>>> filter_by_subs_observed(df, 2, 1)
  barcode aa_substitutions  n_aa_substitutions
0    var1                                    0
1    var2              M1A                   1
2    var6              M1A                   1

Or that substitutions be seen a specified number of times in both single- and multi-substitution contexts:

>>> filter_by_subs_observed(df, 1, 2)
  barcode aa_substitutions  n_aa_substitutions
0    var1                                    0
1    var2              M1A                   1
2    var3          M1A G2A                   2
3    var5              G2A                   1
4    var6              M1A                   1

We can also make the requirement or rather than and:

>>> filter_by_subs_observed(df, 1, 2, and_vs_or='or')
  barcode aa_substitutions  n_aa_substitutions
0    var1                                    0
1    var2              M1A                   1
2    var3          M1A G2A                   2
3    var5              G2A                   1
4    var6              M1A                   1
5    var7              M1C                   1

Do not filter any variants:

>>> filter_by_subs_observed(df, 0, 0)
  barcode aa_substitutions  n_aa_substitutions
0    var1                                    0
1    var2              M1A                   1
2    var3          M1A G2A                   2
3    var4          M1A G2C                   2
4    var5              G2A                   1
5    var6              M1A                   1
6    var7              M1C                   1