simulate

Simulate data.

class dms_variants.simulate.MultiLatentSigmoidPhenotypeSimulator(sigmoid_phenotype_simulators)[source]

Bases: object

Simulate phenotype that derives from several latent phenotypes.

Note

The observed phenotype is the sum of several sigmoid-transformed latent phenotypes. Specifically, let there be \(k = 1, \ldots, K\) latent phenotypes, each of which is transformed into an observed phenotype \(p_{\rm{observed}}^k as described in the docs for :class:`SigmoidPhenotypeSimulator\). The overall observed phenotype is:

\[p_{\rm{observed}} = \sum_k p_{\rm{observed}}^k\]

and the observed enrichment is:

\[E_{\rm{observed}} = 2^{p_{\rm{observed}}}.\]
Parameters:

sigmoid_phenotype_simulators (list) – A list of SigmoidPhenotypeSimulator objects, each of which transforms one of the latent phenotypes to a component of the observed phenotype.

n_latent_phenotypes

The number of latent phenotypes.

Type:

int

latentPhenotype(subs, k)[source]

Latent phenotype \(k\) of a variant.

Parameters:
Returns:

Latent phenotype k.

Return type:

float

observedEnrichment(subs)[source]

Observed enrichment of a variant.

Parameters:

subs (str) – Space-delimited list of amino-acid substitutions.

Returns:

Observed enrichment relative to wildtype.

Return type:

float

observedPhenotype(subs)[source]

Observed phenotype of a variant.

Parameters:

subs (str) – Space-delimited list of amino-acid substitutions.

Returns:

Observed phenotype of a variant.

Return type:

float

plotMutsHistogram(value, *, k=None, mutant_order=1, bins=30, wt_vline=True)[source]

Plot distribution of phenotype for all mutants of a given order.

Parameters:
  • value ({'latentPhenotype', 'observedPhenotype', 'observedEnrichment'}) – What value to plot.

  • k (int or None) – If value is latentPhenotype, which phenotype (1 <= `k <= MultiLatentSigmoidPhenotypeSimulator.n_latent_phenotypes) to plot.

  • mutant_order (int) – Plot mutations of this order. Currently only works for 1 (single mutants).

  • bins (int) – Number of bins in histogram.

  • wt_vline (bool) – Draw a vertical line at the wildtype value.

Returns:

Histogram of phenotype for all mutants.

Return type:

plotnine.ggplot.ggplot

class dms_variants.simulate.SigmoidPhenotypeSimulator(geneseq, *, seed=1, wt_latent=5, norm_weights=((0.4, -0.7, 1.5), (0.6, -7, 3.5)), stop_effect=-15, min_observed_enrichment=0.001)[source]

Bases: object

Simulate phenotypes under sigmoid global epistasis model.

Note

Mutational effects on latent phenotype are simulated to follow compound normal distribution; latent phenotype maps to observed enrichment via sigmoid, and the observed phenotype is the \(\log_2\) of the enrichment. This distinction between latent and observed phenotype parallels the “global epistasis” models of Otwinoski et al and Sailer and Harms.

Specifically, let \(p_{\rm{latent}}\) be the latent phenotype of a variant, let \(p_{\rm{observed}}\) be the observed phenotype, and let \(E_{\rm{observed}}\) be the observed enrichment. Also let \(p_{\rm{latent, wt}}\) be the latent phenotype of the wildtype. Then:

\[\begin{split}E_{\rm{observed}} &=& \frac{\left(1 + \exp\left(-p_{\rm{latent, wt}}\right)\right) \left(1 - E_{\rm{min}}\right)} {1 + \exp\left(-p_{\rm{latent}}\right)} + E_{\rm{min}} \\ p_{\rm{observed}} &=& \log_2 E_{\rm{observed}}\end{split}\]

where \(0 \le E_{\rm{min}} < 1\) is the minimum possible observed enrichment. Typically \(E_{\rm{min}} > 0\) in real experiments as pseudocounts are used so estimates of enrichments can’t be zero.

Parameters:
  • geneseq (str) – Codon sequence of wild-type gene.

  • seed (int or None) – Random number seed.

  • wt_latent (float) – Latent phenotype of wildtype.

  • norm_weights (list or tuple of tuples) – Specify compound normal distribution of mutational effects on latent phenotype as (weight, mean, sd) for each Gaussian.

  • stop_effect (float) – Effect of stop codon at any position.

  • min_observed_enrichment (float) – Minimum possible observed enrichment.

wt_latent

Wildtype latent phenotype.

Type:

float

muteffects

Effect on latent phenotype of each amino-acid mutation.

Type:

dict

latentPhenotype(subs)[source]

Latent phenotype of a variant.

Parameters:

subs (str) – Space-delimited list of amino-acid substitutions.

Returns:

Latent phenotype of variant.

Return type:

float

latentToObserved(latent, value)[source]

Observed enrichment or observed phenotype from latent phenotype.

Parameters:
  • latent (float or numpy.ndarray) – The latent phenotype.

  • value ({'enrichment', 'phenotype'}) – Get the observed enrichment or the observed phenotype?

Returns:

The observed enrichment or phenotype.

Return type:

float or numpy.ndarray

observedEnrichment(subs)[source]

Observed enrichment of a variant.

Parameters:

subs (str) – Space-delimited list of amino-acid substitutions.

Returns:

Observed enrichment relative to wildtype.

Return type:

float

observedPhenotype(subs)[source]

Observed phenotype of a variant.

Parameters:

subs (str) – Space-delimited list of amino-acid substitutions.

Returns:

Observed phenotype of variant.

Return type:

float

plotLatentVsObserved(value, *, latent_min=-15, latent_max=10, npoints=200, wt_vline=True)[source]

Plot observed enrichment/phenotype as function of latent phenotype.

Parameters:
  • value ({'enrichment', 'phenotype'}) – Do we plot observed enrichment or observed phenotype?

  • latent_min (float) – Smallest value of latent phenotype on plot.

  • latent_max (float) – Largest value of latent phenotype on plot.

  • npoints (int) – Plot a line fit to this many points.

  • wt_vline (bool) – Draw a vertical line at the wildtype latent phenotype.

Returns:

Plot of observed enrichment or phenotype as function of latent phenotype.

Return type:

plotnine.ggplot.ggplot

plotMutsHistogram(value, *, mutant_order=1, bins=30, wt_vline=True)[source]

Plot distribution of phenotype for all mutants of a given order.

Parameters:
  • value ({'latentPhenotype', 'observedPhenotype', 'observedEnrichment'}) – What value to plot.

  • mutant_order (int) – Plot mutations of this order. Currently only works for 1 (single mutants).

  • bins (int) – Number of bins in histogram.

  • wt_vline (bool) – Draw a vertical line at the wildtype value.

Returns:

Histogram of phenotype for all mutants.

Return type:

plotnine.ggplot.ggplot

dms_variants.simulate.codon_muts(codonseq, nmuts, nvariants)[source]

Get random codon mutations to sequence.

Parameters:
  • codonseq (str) – A valid codon sequence

  • nmuts (int) – Number of mutations to make per variant.

  • nvariants (int) – Number of variants to generate. There is no guarantee that the variants are unique.

Returns:

  • list – List of mutations in each of the nvariants mutated variants in the form ‘ATG1TAC GGC3GGG’ where the number refers to the codon position.

  • Seed random.seed for reproducible output.

Example

>>> codonseq = 'ATGGAAGGGCATGGA'
>>> random.seed(1)
>>> codon_muts(codonseq, nmuts=2, nvariants=3)
['ATG1CAC GAA2ACT', 'CAT4CTT GGA5GGG', 'GAA2ACG CAT4GAA']
dms_variants.simulate.mutate_seq(seq, nmuts)[source]

Mutate a nucleotide sequence.

Parameters:
  • seq (str) – Sequence to mutate.

  • nmuts (int) – Number of mutations to make to sequence.

Returns:

  • str – Mutagenized sequence, all upper-case.

  • Seed random.seed for reproducible output.

Example

>>> random.seed(1)
>>> mutate_seq('ATGCAA', 2)
'AAGCGA'
dms_variants.simulate.rand_seq(seqlen, *, exclude=None, nseqs=1)[source]

Random nucleotide sequence(s).

Parameters:
  • seqlen (int) – Length of sequence(s).

  • exclude (None or list) – Do not generate any of the sequences in this list.

  • nseqs (int) – If >1, return a list of this many sequences.

Returns:

  • str or list – A random sequence or list of them depending on value of nseqs. If multiple sequences, they are all unique.

  • Seed random.seed for reproducible output.

Example

>>> random.seed(1)
>>> rand_seq(4)
'CAGA'
>>> random.seed(1)
>>> rand_seq(5, nseqs=3)
['CAGAT', 'TTTCA', 'TATTA']
>>> random.seed(1)
>>> rand_seq(5, nseqs=3, exclude=['TTTCA'])
['CAGAT', 'TATTA', 'TGCAG']
dms_variants.simulate.simulateSampleCounts(*, variants, phenotype_func, variant_error_rate, pre_sample, post_samples, secondary_target_phenotypes=None, pre_sample_name='pre-selection', seed=1)[source]

Simulate pre- and post-selection variant counts.

Parameters:
  • variants (class:dms_variants.codonvarianttable.CodonVariantTable) – Holds variants used in simulation.

  • phenotype_func (function) – Takes a space-delimited string of amino-acid substitutions and returns a number giving enrichment of variant relative to wildtype. For instance, SigmoidPhenotypeSimulator.observedEnrichment(). If there are multiple targets in variants, this function is only applied to the primary target. See secondary_target_phenotypes for the secondary targets.

  • variant_error_rate (float) – Rate at which variants mis-called. Provide the probability that a variant has a spuriously called (or missing) codon mutation; each variant then has a random codon mutation added or removed with this probability before being passed to phenotype_func.

  • pre_sample (pandas.DataFrame or dict) – Counts of each variant pre-selection. To specify counts, provide data frame with columns “library”, “barcode”, and “count”. To simulate, provide dict with keys “total_count” and “uniformity”, and for each library, we simulate pre-selection counts as a draw of “total_count” counts from a multinomial parameterized by pre- selection frequencies drawn from a Dirichlet distribution with concentration parameter of “uniformity” (5 is reasonable value).

  • post_samples (dict) – Keyed by name of each sample with value another dict keyed by ‘total_count’, ‘noise’, and ‘bottleneck’. Counts drawn from multinomial parameterized by pre-selection frerquencies after passing through bottleneck of indicated size, and adding noise by mutiplying phenotype by a random variable with mean 1 and standard deviation specified by ‘noise’ (0 is no noise).

  • secondary_target_phenotypes (None or dict) – If there are secondary targets in variants, this should be keyed by each secondary target name and give the associated phenotypes as values.

  • pre_sample_name (str) – Name used for the pre-selection sample.

  • seed (None or int) – If not None, random number seed.

Returns:

Data frame with the following columns:
  • ”library”

  • ”barcode”

  • ”sample”

  • ”count”

Return type:

pandas.DataFrame

dms_variants.simulate.simulate_CodonVariantTable(*, geneseq, bclen, library_specs, seed=1, variant_call_support=0.5, allowed_aa_muts=None, allowgaps=False)[source]

Simulate dms_variants.codonvarianttable.CodonVariantTable.

Note

Only simulates the variants, not counts for samples. To simulate counts, use simulateSampleCounts().

Parameters:
  • geneseq (str) – Sequence of wildtype protein-coding gene.

  • bclen (int) – Length of the barcodes; must enable complexity at least 10-fold greater than max number of variants.

  • library_specs (dict) – Specifications for each simulated library. Keys are ‘avgmuts’ and ‘nvariants’. Mutations per variant are Poisson distributed.

  • seed (int or None) – Random number seed or None to set no seed.

  • variant_call_support (int > 0) – Support is floor of 1 plus draw from exponential distribution with this mean.

  • allowed_aa_muts (None or array-like) – If we only allow some amino-acid substitutions, provide them. Specify mutations like ‘E1E’ if you want to allow synonymous codon mutations.

  • allowgaps (bool) – Allow gaps as a codon character.

Return type:

dms_variants.codonvarianttable.CodonVariantTable

Example

>>> geneseq = 'ATGGGCAGC'
>>> bclen = 4
>>> library_specs = {'lib1': {'avgmuts': 3, 'nvariants': 5}}
>>> variants =  simulate_CodonVariantTable(geneseq=geneseq,
...                                        bclen=bclen,
...                                        library_specs=library_specs)
>>> variants.barcode_variant_df[['barcode', 'aa_substitutions']]
  barcode aa_substitutions
0    AGTC      M1L G2Y S3F
1    ATTC              S3R
2    CAAA      M1C G2S S3A
3    CGAT              M1N
4    TAAT              M1V
>>> allowed_aa_muts = ['M1M', 'M1L', 'M1A', 'G2A', 'G2K', 'S3C', 'S3S']
>>> variants =  simulate_CodonVariantTable(geneseq=geneseq,
...                                        bclen=bclen,
...                                        library_specs=library_specs,
...                                        allowed_aa_muts=allowed_aa_muts,
...                                        seed=2)
>>> variants.barcode_variant_df[['barcode', 'aa_substitutions']]
  barcode aa_substitutions
0    AAAA          M1L G2A
1    CAAC          M1A G2A
2    GTTG              M1A
3    TTAA              G2A
4    TTCA          M1L G2A