simulate

Simulate data.

class dms_variants.simulate.MultiLatentSigmoidPhenotypeSimulator(sigmoid_phenotype_simulators)[source]

Bases: object

Simulate phenotype that derives from several latent phenotypes.

Note

The observed phenotype is the sum of several sigmoid-transformed latent phenotypes. Specifically, let there be k=1,,K latent phenotypes, each of which is transformed into an observed phenotype pobservedkasdescribedinthedocsfor:class:SigmoidPhenotypeSimulator. The overall observed phenotype is:

pobserved=kpobservedk

and the observed enrichment is:

Eobserved=2pobserved.
Parameters:

sigmoid_phenotype_simulators (list) – A list of SigmoidPhenotypeSimulator objects, each of which transforms one of the latent phenotypes to a component of the observed phenotype.

n_latent_phenotypes

The number of latent phenotypes.

Type:

int

latentPhenotype(subs, k)[source]

Latent phenotype k of a variant.

Parameters:
Returns:

Latent phenotype k.

Return type:

float

observedEnrichment(subs)[source]

Observed enrichment of a variant.

Parameters:

subs (str) – Space-delimited list of amino-acid substitutions.

Returns:

Observed enrichment relative to wildtype.

Return type:

float

observedPhenotype(subs)[source]

Observed phenotype of a variant.

Parameters:

subs (str) – Space-delimited list of amino-acid substitutions.

Returns:

Observed phenotype of a variant.

Return type:

float

plotMutsHistogram(value, *, k=None, mutant_order=1, bins=30, wt_vline=True)[source]

Plot distribution of phenotype for all mutants of a given order.

Parameters:
  • value ({'latentPhenotype', 'observedPhenotype', 'observedEnrichment'}) – What value to plot.

  • k (int or None) – If value is latentPhenotype, which phenotype (1 <= `k <= MultiLatentSigmoidPhenotypeSimulator.n_latent_phenotypes) to plot.

  • mutant_order (int) – Plot mutations of this order. Currently only works for 1 (single mutants).

  • bins (int) – Number of bins in histogram.

  • wt_vline (bool) – Draw a vertical line at the wildtype value.

Returns:

Histogram of phenotype for all mutants.

Return type:

plotnine.ggplot.ggplot

class dms_variants.simulate.SigmoidPhenotypeSimulator(geneseq, *, seed=1, wt_latent=5, norm_weights=((0.4, -0.7, 1.5), (0.6, -7, 3.5)), stop_effect=-15, min_observed_enrichment=0.001)[source]

Bases: object

Simulate phenotypes under sigmoid global epistasis model.

Note

Mutational effects on latent phenotype are simulated to follow compound normal distribution; latent phenotype maps to observed enrichment via sigmoid, and the observed phenotype is the log2 of the enrichment. This distinction between latent and observed phenotype parallels the “global epistasis” models of Otwinoski et al and Sailer and Harms.

Specifically, let platent be the latent phenotype of a variant, let pobserved be the observed phenotype, and let Eobserved be the observed enrichment. Also let platent,wt be the latent phenotype of the wildtype. Then:

Eobserved=(1+exp(platent,wt))(1Emin)1+exp(platent)+Eminpobserved=log2Eobserved

where 0Emin<1 is the minimum possible observed enrichment. Typically Emin>0 in real experiments as pseudocounts are used so estimates of enrichments can’t be zero.

Parameters:
  • geneseq (str) – Codon sequence of wild-type gene.

  • seed (int or None) – Random number seed.

  • wt_latent (float) – Latent phenotype of wildtype.

  • norm_weights (list or tuple of tuples) – Specify compound normal distribution of mutational effects on latent phenotype as (weight, mean, sd) for each Gaussian.

  • stop_effect (float) – Effect of stop codon at any position.

  • min_observed_enrichment (float) – Minimum possible observed enrichment.

wt_latent

Wildtype latent phenotype.

Type:

float

muteffects

Effect on latent phenotype of each amino-acid mutation.

Type:

dict

latentPhenotype(subs)[source]

Latent phenotype of a variant.

Parameters:

subs (str) – Space-delimited list of amino-acid substitutions.

Returns:

Latent phenotype of variant.

Return type:

float

latentToObserved(latent, value)[source]

Observed enrichment or observed phenotype from latent phenotype.

Parameters:
  • latent (float or numpy.ndarray) – The latent phenotype.

  • value ({'enrichment', 'phenotype'}) – Get the observed enrichment or the observed phenotype?

Returns:

The observed enrichment or phenotype.

Return type:

float or numpy.ndarray

observedEnrichment(subs)[source]

Observed enrichment of a variant.

Parameters:

subs (str) – Space-delimited list of amino-acid substitutions.

Returns:

Observed enrichment relative to wildtype.

Return type:

float

observedPhenotype(subs)[source]

Observed phenotype of a variant.

Parameters:

subs (str) – Space-delimited list of amino-acid substitutions.

Returns:

Observed phenotype of variant.

Return type:

float

plotLatentVsObserved(value, *, latent_min=-15, latent_max=10, npoints=200, wt_vline=True)[source]

Plot observed enrichment/phenotype as function of latent phenotype.

Parameters:
  • value ({'enrichment', 'phenotype'}) – Do we plot observed enrichment or observed phenotype?

  • latent_min (float) – Smallest value of latent phenotype on plot.

  • latent_max (float) – Largest value of latent phenotype on plot.

  • npoints (int) – Plot a line fit to this many points.

  • wt_vline (bool) – Draw a vertical line at the wildtype latent phenotype.

Returns:

Plot of observed enrichment or phenotype as function of latent phenotype.

Return type:

plotnine.ggplot.ggplot

plotMutsHistogram(value, *, mutant_order=1, bins=30, wt_vline=True)[source]

Plot distribution of phenotype for all mutants of a given order.

Parameters:
  • value ({'latentPhenotype', 'observedPhenotype', 'observedEnrichment'}) – What value to plot.

  • mutant_order (int) – Plot mutations of this order. Currently only works for 1 (single mutants).

  • bins (int) – Number of bins in histogram.

  • wt_vline (bool) – Draw a vertical line at the wildtype value.

Returns:

Histogram of phenotype for all mutants.

Return type:

plotnine.ggplot.ggplot

dms_variants.simulate.codon_muts(codonseq, nmuts, nvariants)[source]

Get random codon mutations to sequence.

Parameters:
  • codonseq (str) – A valid codon sequence

  • nmuts (int) – Number of mutations to make per variant.

  • nvariants (int) – Number of variants to generate. There is no guarantee that the variants are unique.

Returns:

  • list – List of mutations in each of the nvariants mutated variants in the form ‘ATG1TAC GGC3GGG’ where the number refers to the codon position.

  • Seed random.seed for reproducible output.

Example

>>> codonseq = 'ATGGAAGGGCATGGA'
>>> random.seed(1)
>>> codon_muts(codonseq, nmuts=2, nvariants=3)
['ATG1CAC GAA2ACT', 'CAT4CTT GGA5GGG', 'GAA2ACG CAT4GAA']
dms_variants.simulate.mutate_seq(seq, nmuts)[source]

Mutate a nucleotide sequence.

Parameters:
  • seq (str) – Sequence to mutate.

  • nmuts (int) – Number of mutations to make to sequence.

Returns:

  • str – Mutagenized sequence, all upper-case.

  • Seed random.seed for reproducible output.

Example

>>> random.seed(1)
>>> mutate_seq('ATGCAA', 2)
'AAGCGA'
dms_variants.simulate.rand_seq(seqlen, *, exclude=None, nseqs=1)[source]

Random nucleotide sequence(s).

Parameters:
  • seqlen (int) – Length of sequence(s).

  • exclude (None or list) – Do not generate any of the sequences in this list.

  • nseqs (int) – If >1, return a list of this many sequences.

Returns:

  • str or list – A random sequence or list of them depending on value of nseqs. If multiple sequences, they are all unique.

  • Seed random.seed for reproducible output.

Example

>>> random.seed(1)
>>> rand_seq(4)
'CAGA'
>>> random.seed(1)
>>> rand_seq(5, nseqs=3)
['CAGAT', 'TTTCA', 'TATTA']
>>> random.seed(1)
>>> rand_seq(5, nseqs=3, exclude=['TTTCA'])
['CAGAT', 'TATTA', 'TGCAG']
dms_variants.simulate.simulateSampleCounts(*, variants, phenotype_func, variant_error_rate, pre_sample, post_samples, secondary_target_phenotypes=None, pre_sample_name='pre-selection', seed=1)[source]

Simulate pre- and post-selection variant counts.

Parameters:
  • variants (class:dms_variants.codonvarianttable.CodonVariantTable) – Holds variants used in simulation.

  • phenotype_func (function) – Takes a space-delimited string of amino-acid substitutions and returns a number giving enrichment of variant relative to wildtype. For instance, SigmoidPhenotypeSimulator.observedEnrichment(). If there are multiple targets in variants, this function is only applied to the primary target. See secondary_target_phenotypes for the secondary targets.

  • variant_error_rate (float) – Rate at which variants mis-called. Provide the probability that a variant has a spuriously called (or missing) codon mutation; each variant then has a random codon mutation added or removed with this probability before being passed to phenotype_func.

  • pre_sample (pandas.DataFrame or dict) – Counts of each variant pre-selection. To specify counts, provide data frame with columns “library”, “barcode”, and “count”. To simulate, provide dict with keys “total_count” and “uniformity”, and for each library, we simulate pre-selection counts as a draw of “total_count” counts from a multinomial parameterized by pre- selection frequencies drawn from a Dirichlet distribution with concentration parameter of “uniformity” (5 is reasonable value).

  • post_samples (dict) – Keyed by name of each sample with value another dict keyed by ‘total_count’, ‘noise’, and ‘bottleneck’. Counts drawn from multinomial parameterized by pre-selection frerquencies after passing through bottleneck of indicated size, and adding noise by mutiplying phenotype by a random variable with mean 1 and standard deviation specified by ‘noise’ (0 is no noise).

  • secondary_target_phenotypes (None or dict) – If there are secondary targets in variants, this should be keyed by each secondary target name and give the associated phenotypes as values.

  • pre_sample_name (str) – Name used for the pre-selection sample.

  • seed (None or int) – If not None, random number seed.

Returns:

Data frame with the following columns:
  • ”library”

  • ”barcode”

  • ”sample”

  • ”count”

Return type:

pandas.DataFrame

dms_variants.simulate.simulate_CodonVariantTable(*, geneseq, bclen, library_specs, seed=1, variant_call_support=0.5, allowed_aa_muts=None, allowgaps=False)[source]

Simulate dms_variants.codonvarianttable.CodonVariantTable.

Note

Only simulates the variants, not counts for samples. To simulate counts, use simulateSampleCounts().

Parameters:
  • geneseq (str) – Sequence of wildtype protein-coding gene.

  • bclen (int) – Length of the barcodes; must enable complexity at least 10-fold greater than max number of variants.

  • library_specs (dict) – Specifications for each simulated library. Keys are ‘avgmuts’ and ‘nvariants’. Mutations per variant are Poisson distributed.

  • seed (int or None) – Random number seed or None to set no seed.

  • variant_call_support (int > 0) – Support is floor of 1 plus draw from exponential distribution with this mean.

  • allowed_aa_muts (None or array-like) – If we only allow some amino-acid substitutions, provide them. Specify mutations like ‘E1E’ if you want to allow synonymous codon mutations.

  • allowgaps (bool) – Allow gaps as a codon character.

Return type:

dms_variants.codonvarianttable.CodonVariantTable

Example

>>> geneseq = 'ATGGGCAGC'
>>> bclen = 4
>>> library_specs = {'lib1': {'avgmuts': 3, 'nvariants': 5}}
>>> variants =  simulate_CodonVariantTable(geneseq=geneseq,
...                                        bclen=bclen,
...                                        library_specs=library_specs)
>>> variants.barcode_variant_df[['barcode', 'aa_substitutions']]
  barcode aa_substitutions
0    AGTC      M1L G2Y S3F
1    ATTC              S3R
2    CAAA      M1C G2S S3A
3    CGAT              M1N
4    TAAT              M1V
>>> allowed_aa_muts = ['M1M', 'M1L', 'M1A', 'G2A', 'G2K', 'S3C', 'S3S']
>>> variants =  simulate_CodonVariantTable(geneseq=geneseq,
...                                        bclen=bclen,
...                                        library_specs=library_specs,
...                                        allowed_aa_muts=allowed_aa_muts,
...                                        seed=2)
>>> variants.barcode_variant_df[['barcode', 'aa_substitutions']]
  barcode aa_substitutions
0    AAAA          M1L G2A
1    CAAC          M1A G2A
2    GTTG              M1A
3    TTAA              G2A
4    TTCA          M1L G2A