simulate¶
Simulate data.
- class dms_variants.simulate.MultiLatentSigmoidPhenotypeSimulator(sigmoid_phenotype_simulators)[source]¶
Bases:
object
Simulate phenotype that derives from several latent phenotypes.
Note
The observed phenotype is the sum of several sigmoid-transformed latent phenotypes. Specifically, let there be
latent phenotypes, each of which is transformed into an observed phenotype . The overall observed phenotype is:and the observed enrichment is:
- Parameters:
sigmoid_phenotype_simulators (list) – A list of
SigmoidPhenotypeSimulator
objects, each of which transforms one of the latent phenotypes to a component of the observed phenotype.
- n_latent_phenotypes¶
The number of latent phenotypes.
- Type:
int
- latentPhenotype(subs, k)[source]¶
Latent phenotype
of a variant.- Parameters:
subs (str) – Space-delimited list of amino-acid substitutions.
k (int) – Latent phenotype to get (1 <= k <=
MultiLatentSigmoidPhenotypeSimulator.n_latent_phenotypes
).
- Returns:
Latent phenotype k.
- Return type:
float
- observedEnrichment(subs)[source]¶
Observed enrichment of a variant.
- Parameters:
subs (str) – Space-delimited list of amino-acid substitutions.
- Returns:
Observed enrichment relative to wildtype.
- Return type:
float
- observedPhenotype(subs)[source]¶
Observed phenotype of a variant.
- Parameters:
subs (str) – Space-delimited list of amino-acid substitutions.
- Returns:
Observed phenotype of a variant.
- Return type:
float
- plotMutsHistogram(value, *, k=None, mutant_order=1, bins=30, wt_vline=True)[source]¶
Plot distribution of phenotype for all mutants of a given order.
- Parameters:
value ({'latentPhenotype', 'observedPhenotype', 'observedEnrichment'}) – What value to plot.
k (int or None) – If value is latentPhenotype, which phenotype (1 <= `k <=
MultiLatentSigmoidPhenotypeSimulator.n_latent_phenotypes
) to plot.mutant_order (int) – Plot mutations of this order. Currently only works for 1 (single mutants).
bins (int) – Number of bins in histogram.
wt_vline (bool) – Draw a vertical line at the wildtype value.
- Returns:
Histogram of phenotype for all mutants.
- Return type:
plotnine.ggplot.ggplot
- class dms_variants.simulate.SigmoidPhenotypeSimulator(geneseq, *, seed=1, wt_latent=5, norm_weights=((0.4, -0.7, 1.5), (0.6, -7, 3.5)), stop_effect=-15, min_observed_enrichment=0.001)[source]¶
Bases:
object
Simulate phenotypes under sigmoid global epistasis model.
Note
Mutational effects on latent phenotype are simulated to follow compound normal distribution; latent phenotype maps to observed enrichment via sigmoid, and the observed phenotype is the
of the enrichment. This distinction between latent and observed phenotype parallels the “global epistasis” models of Otwinoski et al and Sailer and Harms.Specifically, let
be the latent phenotype of a variant, let be the observed phenotype, and let be the observed enrichment. Also let be the latent phenotype of the wildtype. Then:where
is the minimum possible observed enrichment. Typically in real experiments as pseudocounts are used so estimates of enrichments can’t be zero.- Parameters:
geneseq (str) – Codon sequence of wild-type gene.
seed (int or None) – Random number seed.
wt_latent (float) – Latent phenotype of wildtype.
norm_weights (list or tuple of tuples) – Specify compound normal distribution of mutational effects on latent phenotype as (weight, mean, sd) for each Gaussian.
stop_effect (float) – Effect of stop codon at any position.
min_observed_enrichment (float) – Minimum possible observed enrichment.
- wt_latent¶
Wildtype latent phenotype.
- Type:
float
- muteffects¶
Effect on latent phenotype of each amino-acid mutation.
- Type:
dict
- latentPhenotype(subs)[source]¶
Latent phenotype of a variant.
- Parameters:
subs (str) – Space-delimited list of amino-acid substitutions.
- Returns:
Latent phenotype of variant.
- Return type:
float
- latentToObserved(latent, value)[source]¶
Observed enrichment or observed phenotype from latent phenotype.
- Parameters:
latent (float or numpy.ndarray) – The latent phenotype.
value ({'enrichment', 'phenotype'}) – Get the observed enrichment or the observed phenotype?
- Returns:
The observed enrichment or phenotype.
- Return type:
float or numpy.ndarray
- observedEnrichment(subs)[source]¶
Observed enrichment of a variant.
- Parameters:
subs (str) – Space-delimited list of amino-acid substitutions.
- Returns:
Observed enrichment relative to wildtype.
- Return type:
float
- observedPhenotype(subs)[source]¶
Observed phenotype of a variant.
- Parameters:
subs (str) – Space-delimited list of amino-acid substitutions.
- Returns:
Observed phenotype of variant.
- Return type:
float
- plotLatentVsObserved(value, *, latent_min=-15, latent_max=10, npoints=200, wt_vline=True)[source]¶
Plot observed enrichment/phenotype as function of latent phenotype.
- Parameters:
value ({'enrichment', 'phenotype'}) – Do we plot observed enrichment or observed phenotype?
latent_min (float) – Smallest value of latent phenotype on plot.
latent_max (float) – Largest value of latent phenotype on plot.
npoints (int) – Plot a line fit to this many points.
wt_vline (bool) – Draw a vertical line at the wildtype latent phenotype.
- Returns:
Plot of observed enrichment or phenotype as function of latent phenotype.
- Return type:
plotnine.ggplot.ggplot
- plotMutsHistogram(value, *, mutant_order=1, bins=30, wt_vline=True)[source]¶
Plot distribution of phenotype for all mutants of a given order.
- Parameters:
value ({'latentPhenotype', 'observedPhenotype', 'observedEnrichment'}) – What value to plot.
mutant_order (int) – Plot mutations of this order. Currently only works for 1 (single mutants).
bins (int) – Number of bins in histogram.
wt_vline (bool) – Draw a vertical line at the wildtype value.
- Returns:
Histogram of phenotype for all mutants.
- Return type:
plotnine.ggplot.ggplot
- dms_variants.simulate.codon_muts(codonseq, nmuts, nvariants)[source]¶
Get random codon mutations to sequence.
- Parameters:
codonseq (str) – A valid codon sequence
nmuts (int) – Number of mutations to make per variant.
nvariants (int) – Number of variants to generate. There is no guarantee that the variants are unique.
- Returns:
list – List of mutations in each of the nvariants mutated variants in the form ‘ATG1TAC GGC3GGG’ where the number refers to the codon position.
Seed random.seed for reproducible output.
Example
>>> codonseq = 'ATGGAAGGGCATGGA' >>> random.seed(1) >>> codon_muts(codonseq, nmuts=2, nvariants=3) ['ATG1CAC GAA2ACT', 'CAT4CTT GGA5GGG', 'GAA2ACG CAT4GAA']
- dms_variants.simulate.mutate_seq(seq, nmuts)[source]¶
Mutate a nucleotide sequence.
- Parameters:
seq (str) – Sequence to mutate.
nmuts (int) – Number of mutations to make to sequence.
- Returns:
str – Mutagenized sequence, all upper-case.
Seed random.seed for reproducible output.
Example
>>> random.seed(1) >>> mutate_seq('ATGCAA', 2) 'AAGCGA'
- dms_variants.simulate.rand_seq(seqlen, *, exclude=None, nseqs=1)[source]¶
Random nucleotide sequence(s).
- Parameters:
seqlen (int) – Length of sequence(s).
exclude (None or list) – Do not generate any of the sequences in this list.
nseqs (int) – If >1, return a list of this many sequences.
- Returns:
str or list – A random sequence or list of them depending on value of nseqs. If multiple sequences, they are all unique.
Seed random.seed for reproducible output.
Example
>>> random.seed(1) >>> rand_seq(4) 'CAGA'
>>> random.seed(1) >>> rand_seq(5, nseqs=3) ['CAGAT', 'TTTCA', 'TATTA']
>>> random.seed(1) >>> rand_seq(5, nseqs=3, exclude=['TTTCA']) ['CAGAT', 'TATTA', 'TGCAG']
- dms_variants.simulate.simulateSampleCounts(*, variants, phenotype_func, variant_error_rate, pre_sample, post_samples, secondary_target_phenotypes=None, pre_sample_name='pre-selection', seed=1)[source]¶
Simulate pre- and post-selection variant counts.
Note
Add to
dms_variants.codonvarianttable.CodonVariantTable
usingdms_variants.codonvarianttable.CodonVariantTable.addSampleCounts()
- Parameters:
variants (class:dms_variants.codonvarianttable.CodonVariantTable) – Holds variants used in simulation.
phenotype_func (function) – Takes a space-delimited string of amino-acid substitutions and returns a number giving enrichment of variant relative to wildtype. For instance,
SigmoidPhenotypeSimulator.observedEnrichment()
. If there are multiple targets in variants, this function is only applied to the primary target. See secondary_target_phenotypes for the secondary targets.variant_error_rate (float) – Rate at which variants mis-called. Provide the probability that a variant has a spuriously called (or missing) codon mutation; each variant then has a random codon mutation added or removed with this probability before being passed to phenotype_func.
pre_sample (pandas.DataFrame or dict) – Counts of each variant pre-selection. To specify counts, provide data frame with columns “library”, “barcode”, and “count”. To simulate, provide dict with keys “total_count” and “uniformity”, and for each library, we simulate pre-selection counts as a draw of “total_count” counts from a multinomial parameterized by pre- selection frequencies drawn from a Dirichlet distribution with concentration parameter of “uniformity” (5 is reasonable value).
post_samples (dict) – Keyed by name of each sample with value another dict keyed by ‘total_count’, ‘noise’, and ‘bottleneck’. Counts drawn from multinomial parameterized by pre-selection frerquencies after passing through bottleneck of indicated size, and adding noise by mutiplying phenotype by a random variable with mean 1 and standard deviation specified by ‘noise’ (0 is no noise).
secondary_target_phenotypes (None or dict) – If there are secondary targets in variants, this should be keyed by each secondary target name and give the associated phenotypes as values.
pre_sample_name (str) – Name used for the pre-selection sample.
seed (None or int) – If not None, random number seed.
- Returns:
- Data frame with the following columns:
”library”
”barcode”
”sample”
”count”
- Return type:
pandas.DataFrame
- dms_variants.simulate.simulate_CodonVariantTable(*, geneseq, bclen, library_specs, seed=1, variant_call_support=0.5, allowed_aa_muts=None, allowgaps=False)[source]¶
Simulate
dms_variants.codonvarianttable.CodonVariantTable
.Note
Only simulates the variants, not counts for samples. To simulate counts, use
simulateSampleCounts()
.- Parameters:
geneseq (str) – Sequence of wildtype protein-coding gene.
bclen (int) – Length of the barcodes; must enable complexity at least 10-fold greater than max number of variants.
library_specs (dict) – Specifications for each simulated library. Keys are ‘avgmuts’ and ‘nvariants’. Mutations per variant are Poisson distributed.
seed (int or None) – Random number seed or None to set no seed.
variant_call_support (int > 0) – Support is floor of 1 plus draw from exponential distribution with this mean.
allowed_aa_muts (None or array-like) – If we only allow some amino-acid substitutions, provide them. Specify mutations like ‘E1E’ if you want to allow synonymous codon mutations.
allowgaps (bool) – Allow gaps as a codon character.
- Return type:
Example
>>> geneseq = 'ATGGGCAGC' >>> bclen = 4 >>> library_specs = {'lib1': {'avgmuts': 3, 'nvariants': 5}} >>> variants = simulate_CodonVariantTable(geneseq=geneseq, ... bclen=bclen, ... library_specs=library_specs) >>> variants.barcode_variant_df[['barcode', 'aa_substitutions']] barcode aa_substitutions 0 AGTC M1L G2Y S3F 1 ATTC S3R 2 CAAA M1C G2S S3A 3 CGAT M1N 4 TAAT M1V
>>> allowed_aa_muts = ['M1M', 'M1L', 'M1A', 'G2A', 'G2K', 'S3C', 'S3S'] >>> variants = simulate_CodonVariantTable(geneseq=geneseq, ... bclen=bclen, ... library_specs=library_specs, ... allowed_aa_muts=allowed_aa_muts, ... seed=2) >>> variants.barcode_variant_df[['barcode', 'aa_substitutions']] barcode aa_substitutions 0 AAAA M1L G2A 1 CAAC M1A G2A 2 GTTG M1A 3 TTAA G2A 4 TTCA M1L G2A