simulate¶
Simulate data.
- class dms_variants.simulate.MultiLatentSigmoidPhenotypeSimulator(sigmoid_phenotype_simulators)[source]¶
Bases:
object
Simulate phenotype that derives from several latent phenotypes.
Note
The observed phenotype is the sum of several sigmoid-transformed latent phenotypes. Specifically, let there be \(k = 1, \ldots, K\) latent phenotypes, each of which is transformed into an observed phenotype \(p_{\rm{observed}}^k as described in the docs for :class:`SigmoidPhenotypeSimulator\). The overall observed phenotype is:
\[p_{\rm{observed}} = \sum_k p_{\rm{observed}}^k\]and the observed enrichment is:
\[E_{\rm{observed}} = 2^{p_{\rm{observed}}}.\]- Parameters:
sigmoid_phenotype_simulators (list) – A list of
SigmoidPhenotypeSimulator
objects, each of which transforms one of the latent phenotypes to a component of the observed phenotype.
- n_latent_phenotypes¶
The number of latent phenotypes.
- Type:
int
- latentPhenotype(subs, k)[source]¶
Latent phenotype \(k\) of a variant.
- Parameters:
subs (str) – Space-delimited list of amino-acid substitutions.
k (int) – Latent phenotype to get (1 <= k <=
MultiLatentSigmoidPhenotypeSimulator.n_latent_phenotypes
).
- Returns:
Latent phenotype k.
- Return type:
float
- observedEnrichment(subs)[source]¶
Observed enrichment of a variant.
- Parameters:
subs (str) – Space-delimited list of amino-acid substitutions.
- Returns:
Observed enrichment relative to wildtype.
- Return type:
float
- observedPhenotype(subs)[source]¶
Observed phenotype of a variant.
- Parameters:
subs (str) – Space-delimited list of amino-acid substitutions.
- Returns:
Observed phenotype of a variant.
- Return type:
float
- plotMutsHistogram(value, *, k=None, mutant_order=1, bins=30, wt_vline=True)[source]¶
Plot distribution of phenotype for all mutants of a given order.
- Parameters:
value ({'latentPhenotype', 'observedPhenotype', 'observedEnrichment'}) – What value to plot.
k (int or None) – If value is latentPhenotype, which phenotype (1 <= `k <=
MultiLatentSigmoidPhenotypeSimulator.n_latent_phenotypes
) to plot.mutant_order (int) – Plot mutations of this order. Currently only works for 1 (single mutants).
bins (int) – Number of bins in histogram.
wt_vline (bool) – Draw a vertical line at the wildtype value.
- Returns:
Histogram of phenotype for all mutants.
- Return type:
plotnine.ggplot.ggplot
- class dms_variants.simulate.SigmoidPhenotypeSimulator(geneseq, *, seed=1, wt_latent=5, norm_weights=((0.4, -0.7, 1.5), (0.6, -7, 3.5)), stop_effect=-15, min_observed_enrichment=0.001)[source]¶
Bases:
object
Simulate phenotypes under sigmoid global epistasis model.
Note
Mutational effects on latent phenotype are simulated to follow compound normal distribution; latent phenotype maps to observed enrichment via sigmoid, and the observed phenotype is the \(\log_2\) of the enrichment. This distinction between latent and observed phenotype parallels the “global epistasis” models of Otwinoski et al and Sailer and Harms.
Specifically, let \(p_{\rm{latent}}\) be the latent phenotype of a variant, let \(p_{\rm{observed}}\) be the observed phenotype, and let \(E_{\rm{observed}}\) be the observed enrichment. Also let \(p_{\rm{latent, wt}}\) be the latent phenotype of the wildtype. Then:
\[\begin{split}E_{\rm{observed}} &=& \frac{\left(1 + \exp\left(-p_{\rm{latent, wt}}\right)\right) \left(1 - E_{\rm{min}}\right)} {1 + \exp\left(-p_{\rm{latent}}\right)} + E_{\rm{min}} \\ p_{\rm{observed}} &=& \log_2 E_{\rm{observed}}\end{split}\]where \(0 \le E_{\rm{min}} < 1\) is the minimum possible observed enrichment. Typically \(E_{\rm{min}} > 0\) in real experiments as pseudocounts are used so estimates of enrichments can’t be zero.
- Parameters:
geneseq (str) – Codon sequence of wild-type gene.
seed (int or None) – Random number seed.
wt_latent (float) – Latent phenotype of wildtype.
norm_weights (list or tuple of tuples) – Specify compound normal distribution of mutational effects on latent phenotype as (weight, mean, sd) for each Gaussian.
stop_effect (float) – Effect of stop codon at any position.
min_observed_enrichment (float) – Minimum possible observed enrichment.
- wt_latent¶
Wildtype latent phenotype.
- Type:
float
- muteffects¶
Effect on latent phenotype of each amino-acid mutation.
- Type:
dict
- latentPhenotype(subs)[source]¶
Latent phenotype of a variant.
- Parameters:
subs (str) – Space-delimited list of amino-acid substitutions.
- Returns:
Latent phenotype of variant.
- Return type:
float
- latentToObserved(latent, value)[source]¶
Observed enrichment or observed phenotype from latent phenotype.
- Parameters:
latent (float or numpy.ndarray) – The latent phenotype.
value ({'enrichment', 'phenotype'}) – Get the observed enrichment or the observed phenotype?
- Returns:
The observed enrichment or phenotype.
- Return type:
float or numpy.ndarray
- observedEnrichment(subs)[source]¶
Observed enrichment of a variant.
- Parameters:
subs (str) – Space-delimited list of amino-acid substitutions.
- Returns:
Observed enrichment relative to wildtype.
- Return type:
float
- observedPhenotype(subs)[source]¶
Observed phenotype of a variant.
- Parameters:
subs (str) – Space-delimited list of amino-acid substitutions.
- Returns:
Observed phenotype of variant.
- Return type:
float
- plotLatentVsObserved(value, *, latent_min=-15, latent_max=10, npoints=200, wt_vline=True)[source]¶
Plot observed enrichment/phenotype as function of latent phenotype.
- Parameters:
value ({'enrichment', 'phenotype'}) – Do we plot observed enrichment or observed phenotype?
latent_min (float) – Smallest value of latent phenotype on plot.
latent_max (float) – Largest value of latent phenotype on plot.
npoints (int) – Plot a line fit to this many points.
wt_vline (bool) – Draw a vertical line at the wildtype latent phenotype.
- Returns:
Plot of observed enrichment or phenotype as function of latent phenotype.
- Return type:
plotnine.ggplot.ggplot
- plotMutsHistogram(value, *, mutant_order=1, bins=30, wt_vline=True)[source]¶
Plot distribution of phenotype for all mutants of a given order.
- Parameters:
value ({'latentPhenotype', 'observedPhenotype', 'observedEnrichment'}) – What value to plot.
mutant_order (int) – Plot mutations of this order. Currently only works for 1 (single mutants).
bins (int) – Number of bins in histogram.
wt_vline (bool) – Draw a vertical line at the wildtype value.
- Returns:
Histogram of phenotype for all mutants.
- Return type:
plotnine.ggplot.ggplot
- dms_variants.simulate.codon_muts(codonseq, nmuts, nvariants)[source]¶
Get random codon mutations to sequence.
- Parameters:
codonseq (str) – A valid codon sequence
nmuts (int) – Number of mutations to make per variant.
nvariants (int) – Number of variants to generate. There is no guarantee that the variants are unique.
- Returns:
list – List of mutations in each of the nvariants mutated variants in the form ‘ATG1TAC GGC3GGG’ where the number refers to the codon position.
Seed random.seed for reproducible output.
Example
>>> codonseq = 'ATGGAAGGGCATGGA' >>> random.seed(1) >>> codon_muts(codonseq, nmuts=2, nvariants=3) ['ATG1CAC GAA2ACT', 'CAT4CTT GGA5GGG', 'GAA2ACG CAT4GAA']
- dms_variants.simulate.mutate_seq(seq, nmuts)[source]¶
Mutate a nucleotide sequence.
- Parameters:
seq (str) – Sequence to mutate.
nmuts (int) – Number of mutations to make to sequence.
- Returns:
str – Mutagenized sequence, all upper-case.
Seed random.seed for reproducible output.
Example
>>> random.seed(1) >>> mutate_seq('ATGCAA', 2) 'AAGCGA'
- dms_variants.simulate.rand_seq(seqlen, *, exclude=None, nseqs=1)[source]¶
Random nucleotide sequence(s).
- Parameters:
seqlen (int) – Length of sequence(s).
exclude (None or list) – Do not generate any of the sequences in this list.
nseqs (int) – If >1, return a list of this many sequences.
- Returns:
str or list – A random sequence or list of them depending on value of nseqs. If multiple sequences, they are all unique.
Seed random.seed for reproducible output.
Example
>>> random.seed(1) >>> rand_seq(4) 'CAGA'
>>> random.seed(1) >>> rand_seq(5, nseqs=3) ['CAGAT', 'TTTCA', 'TATTA']
>>> random.seed(1) >>> rand_seq(5, nseqs=3, exclude=['TTTCA']) ['CAGAT', 'TATTA', 'TGCAG']
- dms_variants.simulate.simulateSampleCounts(*, variants, phenotype_func, variant_error_rate, pre_sample, post_samples, secondary_target_phenotypes=None, pre_sample_name='pre-selection', seed=1)[source]¶
Simulate pre- and post-selection variant counts.
Note
Add to
dms_variants.codonvarianttable.CodonVariantTable
usingdms_variants.codonvarianttable.CodonVariantTable.addSampleCounts()
- Parameters:
variants (class:dms_variants.codonvarianttable.CodonVariantTable) – Holds variants used in simulation.
phenotype_func (function) – Takes a space-delimited string of amino-acid substitutions and returns a number giving enrichment of variant relative to wildtype. For instance,
SigmoidPhenotypeSimulator.observedEnrichment()
. If there are multiple targets in variants, this function is only applied to the primary target. See secondary_target_phenotypes for the secondary targets.variant_error_rate (float) – Rate at which variants mis-called. Provide the probability that a variant has a spuriously called (or missing) codon mutation; each variant then has a random codon mutation added or removed with this probability before being passed to phenotype_func.
pre_sample (pandas.DataFrame or dict) – Counts of each variant pre-selection. To specify counts, provide data frame with columns “library”, “barcode”, and “count”. To simulate, provide dict with keys “total_count” and “uniformity”, and for each library, we simulate pre-selection counts as a draw of “total_count” counts from a multinomial parameterized by pre- selection frequencies drawn from a Dirichlet distribution with concentration parameter of “uniformity” (5 is reasonable value).
post_samples (dict) – Keyed by name of each sample with value another dict keyed by ‘total_count’, ‘noise’, and ‘bottleneck’. Counts drawn from multinomial parameterized by pre-selection frerquencies after passing through bottleneck of indicated size, and adding noise by mutiplying phenotype by a random variable with mean 1 and standard deviation specified by ‘noise’ (0 is no noise).
secondary_target_phenotypes (None or dict) – If there are secondary targets in variants, this should be keyed by each secondary target name and give the associated phenotypes as values.
pre_sample_name (str) – Name used for the pre-selection sample.
seed (None or int) – If not None, random number seed.
- Returns:
- Data frame with the following columns:
”library”
”barcode”
”sample”
”count”
- Return type:
pandas.DataFrame
- dms_variants.simulate.simulate_CodonVariantTable(*, geneseq, bclen, library_specs, seed=1, variant_call_support=0.5, allowed_aa_muts=None, allowgaps=False)[source]¶
Simulate
dms_variants.codonvarianttable.CodonVariantTable
.Note
Only simulates the variants, not counts for samples. To simulate counts, use
simulateSampleCounts()
.- Parameters:
geneseq (str) – Sequence of wildtype protein-coding gene.
bclen (int) – Length of the barcodes; must enable complexity at least 10-fold greater than max number of variants.
library_specs (dict) – Specifications for each simulated library. Keys are ‘avgmuts’ and ‘nvariants’. Mutations per variant are Poisson distributed.
seed (int or None) – Random number seed or None to set no seed.
variant_call_support (int > 0) – Support is floor of 1 plus draw from exponential distribution with this mean.
allowed_aa_muts (None or array-like) – If we only allow some amino-acid substitutions, provide them. Specify mutations like ‘E1E’ if you want to allow synonymous codon mutations.
allowgaps (bool) – Allow gaps as a codon character.
- Return type:
Example
>>> geneseq = 'ATGGGCAGC' >>> bclen = 4 >>> library_specs = {'lib1': {'avgmuts': 3, 'nvariants': 5}} >>> variants = simulate_CodonVariantTable(geneseq=geneseq, ... bclen=bclen, ... library_specs=library_specs) >>> variants.barcode_variant_df[['barcode', 'aa_substitutions']] barcode aa_substitutions 0 AGTC M1L G2Y S3F 1 ATTC S3R 2 CAAA M1C G2S S3A 3 CGAT M1N 4 TAAT M1V
>>> allowed_aa_muts = ['M1M', 'M1L', 'M1A', 'G2A', 'G2K', 'S3C', 'S3S'] >>> variants = simulate_CodonVariantTable(geneseq=geneseq, ... bclen=bclen, ... library_specs=library_specs, ... allowed_aa_muts=allowed_aa_muts, ... seed=2) >>> variants.barcode_variant_df[['barcode', 'aa_substitutions']] barcode aa_substitutions 0 AAAA M1L G2A 1 CAAC M1A G2A 2 GTTG M1A 3 TTAA G2A 4 TTCA M1L G2A