Libraries with secondary target genes¶

A CodonVariantTable is designed to handle mutations to a specific target gene. But in some cases, you might also want to include specific targets that are not point mutants of this primary target gene, but rather are entirely different genes that are spiked into the library as a control or for comparison. Because these secondary targets might not be closely related to the primary target gene, we do not want to define them simply by their mutations relative to the primary target. Rather, we handle them as different targets by adding a “targets” column in the data.

This notebook illustrates how to do this.

Setup for analysis¶

Import Python modules / packages:

[1]:

import itertools
import random
import tempfile
import warnings

import numpy

import pandas as pd

from plotnine import *

import dms_variants.codonvarianttable
import dms_variants.plotnine_themes
import dms_variants.simulate
from dms_variants.constants import CBPALETTE, CODONS_NOSTOP

Set parameters that define the simulated data for the primary target gene. Except for that fact that we use a short gene to reduce the computational run-time of this notebook, these simulation parameters are designed to reflect what we might observe in a real Bloom lab deep mutational scanning experiment:

[2]:

primary_target = "primary_target_gene"
seed = 1  # random number seed
genelength = 30  # gene length in codons
libs = ["lib_1", "lib_2"]  # distinct libraries of gene
variants_per_lib = 500 * genelength  # variants per library
avgmuts = 2.0  # average codon mutations per variant
bclen = 16  # length of nucleotide barcode for each variant
variant_error_rate = 0.005  # rate at which variant sequence mis-called
avgdepth_per_variant = 200  # average per-variant sequencing depth
lib_uniformity = 5  # uniformity of library pre-selection
noise = 0.02  # random noise in selections
bottlenecks = {  # bottlenecks from pre- to post-selection
    "tight_bottle": variants_per_lib * 5,
    "loose_bottle": variants_per_lib * 100,
}

We also define some secondary targets:

[3]:

secondary_targets = [f"secondary_target_{i}" for i in range(1, 4)]

Seed random number generator for reproducible output:

[4]:

random.seed(seed)
numpy.random.seed(seed)

Set pandas display options to show large chunks of Data Frames in this example:

[5]:

pd.set_option("display.max_columns", 20)
pd.set_option("display.width", 500)

Hide warnings that clutter output:

[6]:

warnings.simplefilter("ignore")

Set the gray grid plotnine theme defined in dms_variants.plotnine_themes as this gives a nice appearance for the plots:

[7]:

theme_set(dms_variants.plotnine_themes.theme_graygrid())

Simulate library with primary and secondary variants¶

Simulate wildtype gene sequence:

[8]:

geneseq = "".join(random.choices(CODONS_NOSTOP, k=genelength))
print(f"Wildtype gene of {genelength} codons:\n{geneseq}")

Wildtype gene of 30 codons:
AGATCCGTGATTCTGCGTGCTTACACCAACTCACGGGTGAAACGTGTAATCTTATGCAACAACGACTTACCTATCCGCAACATCCGGCTG

Generate a CodonVariantTable using simulate_CodonVariantTable function, then get the barcode-variant data frame for it. This data frame gives the variants for the primary target:

[9]:

primary_bc_variant_df = dms_variants.simulate.simulate_CodonVariantTable(
    geneseq=geneseq,
    bclen=bclen,
    library_specs={
        lib: {"avgmuts": avgmuts, "nvariants": variants_per_lib} for lib in libs
    },
    seed=seed,
).barcode_variant_df

primary_bc_variant_df.head()

[9]:

	library	barcode	variant_call_support	codon_substitutions	aa_substitutions	n_codon_substitutions	n_aa_substitutions
0	lib_1	AAAAAAAACGTTTTGT	1	CTG5TGG TCA11TCT	L5W	2	1
1	lib_1	AAAAAAAGGCTTATAC	5	TCA11TGC	S11C	1	1
2	lib_1	AAAAAAATCGTCCGTG	2			0	0
3	lib_1	AAAAAAGACAACACCG	5	ATC25TTC	I25F	1	1
4	lib_1	AAAAAAGATGTGGTGG	3	TCA11TTA CGG12GGT CGT15TGA ATC25GGA ATC28CGC	S11L R12G R15* I25G I28R	5	5

This data frame gives variants for the primary target, so add a column specifying that fact:

[10]:

primary_bc_variant_df = primary_bc_variant_df.assign(target=primary_target)

Now add some barcodes for each secondary targets. We simulate barcodes for each secondary target and each library. Note that the secondary targets are defined genes, so we don’t add any information on substitutions for these:

[11]:

existing_barcodes = set(primary_bc_variant_df["barcode"])

secondary_bc_variant_df = []
for lib, secondary_target in itertools.product(libs, secondary_targets):
    secondary_target_barcodes = dms_variants.simulate.rand_seq(
        seqlen=bclen,
        exclude=existing_barcodes,
        nseqs=random.randint(50, 100),
    )
    existing_barcodes = existing_barcodes.union(secondary_target_barcodes)
    secondary_bc_variant_df.append(
        pd.DataFrame(
            {
                "target": secondary_target,
                "barcode": secondary_target_barcodes,
                "library": lib,
                "variant_call_support": 10,
            }
        )
    )

secondary_bc_variant_df = pd.concat(secondary_bc_variant_df, ignore_index=True)

secondary_bc_variant_df.head()

[11]:

	target	barcode	library	variant_call_support
0	secondary_target_1	TCCTGTACGATATCCC	lib_1	10
1	secondary_target_1	GGGTGTGGCGTTTCCT	lib_1	10
2	secondary_target_1	TAGCCTCTATTGCCTT	lib_1	10
3	secondary_target_1	CGGAGACGTCACGGCG	lib_1	10
4	secondary_target_1	GGGGACAATTATCCCT	lib_1	10

Concatenate the data frame for the primary target (which has mutants) with that for the secondary targets. Because there are no mutations for the secondary targets, fill that information is as empty strings (for substitution columns) or 0 (for number of mutations). This data frame below ends the simulation of the library, and represents the type of information you would typically have at the outset of an experiment:

[12]:

bc_variant_df = pd.concat(
    [primary_bc_variant_df, secondary_bc_variant_df], sort=False
).assign(
    codon_substitutions=lambda x: x["codon_substitutions"].fillna(""),
    aa_substitutions=lambda x: x["aa_substitutions"].fillna(""),
    n_codon_substitutions=lambda x: x["n_codon_substitutions"].fillna(0).astype(int),
    n_aa_substitutions=lambda x: x["n_aa_substitutions"].fillna(0).astype(int),
)

bc_variant_df

[12]:

	library	barcode	variant_call_support	codon_substitutions	aa_substitutions	n_codon_substitutions	n_aa_substitutions	target
0	lib_1	AAAAAAAACGTTTTGT	1	CTG5TGG TCA11TCT	L5W	2	1	primary_target_gene
1	lib_1	AAAAAAAGGCTTATAC	5	TCA11TGC	S11C	1	1	primary_target_gene
2	lib_1	AAAAAAATCGTCCGTG	2			0	0	primary_target_gene
3	lib_1	AAAAAAGACAACACCG	5	ATC25TTC	I25F	1	1	primary_target_gene
4	lib_1	AAAAAAGATGTGGTGG	3	TCA11TTA CGG12GGT CGT15TGA ATC25GGA ATC28CGC	S11L R12G R15* I25G I28R	5	5	primary_target_gene
...	...	...	...	...	...	...	...	...
429	lib_2	GCGGCCTGGAGTTGAA	10			0	0	secondary_target_3
430	lib_2	TGATCACTACGTCCTT	10			0	0	secondary_target_3
431	lib_2	TTTAGGAATCTCCCGT	10			0	0	secondary_target_3
432	lib_2	TGTGCTTAGAGCACCG	10			0	0	secondary_target_3
433	lib_2	TTCTTACTGAAGTTCC	10			0	0	secondary_target_3

30434 rows × 8 columns

Now create the CodonVariantTable. Note how we specify the primary_target option to the name of that target; this is required when using multiple targets:

[13]:

with tempfile.NamedTemporaryFile() as f:
    bc_variant_df.to_csv(f.name, index=False)
    f.flush()
    variants = dms_variants.codonvarianttable.CodonVariantTable(
        barcode_variant_file=f.name,
        geneseq=geneseq,
        substitutions_are_codon=True,
        substitutions_col="codon_substitutions",
        primary_target=primary_target,
    )

We can get basic information about the CodonVariantTable, such as the sites, wildtype codons, and wildtype amino acids. Below we do this for the first few sites. Note that all of this information is for the primary target:

[14]:

variants.sites[:5]

[14]:

[1, 2, 3, 4, 5]

[15]:

list(variants.codons[r] for r in variants.sites[:5])

[15]:

['AGA', 'TCC', 'GTG', 'ATT', 'CTG']

[16]:

list(variants.aas[r] for r in variants.sites[:5])

[16]:

['R', 'S', 'V', 'I', 'L']

The different libraries in the table:

[17]:

variants.libraries

[17]:

['lib_1', 'lib_2']

Here is the data frame that contains all of the information on the barcoded variants. Note that the list of substitutions for the secondary targets just repeat the target name (so you are not confused with thinking mutations are actually enumerated in these secondary targets):

[18]:

variants.barcode_variant_df

[18]:

	target	library	barcode	variant_call_support	codon_substitutions	aa_substitutions	n_codon_substitutions	n_aa_substitutions
0	primary_target_gene	lib_1	AAAAAAAACGTTTTGT	1	CTG5TGG TCA11TCT	L5W	2	1
1	primary_target_gene	lib_1	AAAAAAAGGCTTATAC	5	TCA11TGC	S11C	1	1
2	primary_target_gene	lib_1	AAAAAAATCGTCCGTG	2			0	0
3	primary_target_gene	lib_1	AAAAAAGACAACACCG	5	ATC25TTC	I25F	1	1
4	primary_target_gene	lib_1	AAAAAAGATGTGGTGG	3	TCA11TTA CGG12GGT CGT15TGA ATC25GGA ATC28CGC	S11L R12G R15* I25G I28R	5	5
...	...	...	...	...	...	...	...	...
30429	secondary_target_3	lib_2	TGTGCTTAGAGCACCG	10	secondary_target_3	secondary_target_3	0	0
30430	secondary_target_3	lib_2	TTCCGACACGTTGACC	10	secondary_target_3	secondary_target_3	0	0
30431	secondary_target_3	lib_2	TTCTATTACGCTGGGG	10	secondary_target_3	secondary_target_3	0	0
30432	secondary_target_3	lib_2	TTCTTACTGAAGTTCC	10	secondary_target_3	secondary_target_3	0	0
30433	secondary_target_3	lib_2	TTTAGGAATCTCCCGT	10	secondary_target_3	secondary_target_3	0	0

30434 rows × 8 columns

Analyze library composition¶

A CodonVariantTable has methods for summarizing and plotting properties of the barcoded variants. These methods can be used to analyze the barcoded variants themselves, or to analyze the frequency (or counts) of variants in the library in samples that have undergone some type of selection.

We have not yet added counts of the variants in any specific samples, so we just analyze the composition of the variant library itself. This is done by setting samples=None in the method calls below.

Number of variants in each library:

[19]:

variants.n_variants_df(samples=None)

[19]:

	target	library	sample	count
0	primary_target_gene	lib_1	barcoded variants	15000
1	primary_target_gene	lib_2	barcoded variants	15000
2	primary_target_gene	all libraries	barcoded variants	30000
3	secondary_target_1	lib_1	barcoded variants	52
4	secondary_target_1	lib_2	barcoded variants	82
5	secondary_target_1	all libraries	barcoded variants	134
6	secondary_target_2	lib_1	barcoded variants	89
7	secondary_target_2	lib_2	barcoded variants	69
8	secondary_target_2	all libraries	barcoded variants	158
9	secondary_target_3	lib_1	barcoded variants	87
10	secondary_target_3	lib_2	barcoded variants	55
11	secondary_target_3	all libraries	barcoded variants	142

We can also get this information for just the primary target:

[20]:

variants.n_variants_df(samples=None, primary_target_only=True)

[20]:

	library	sample	count
0	lib_1	barcoded variants	15000
1	lib_2	barcoded variants	15000
2	all libraries	barcoded variants	30000

Plot distribution of variant call supports, grouping together all variants with support \(\ge 8\):

[21]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotVariantSupportHistogram(max_support=10)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_39_0.png

Same plot for only the primary target:

[22]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotVariantSupportHistogram(max_support=10, primary_target_only=True)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_41_0.png

The set of valid barcodes for each library:

[23]:

for lib in libs:
    print(f"First few barcodes for library {lib}:")
    print(sorted(variants.valid_barcodes(lib))[:3])

First few barcodes for library lib_1:
['AAAAAAAACGTTTTGT', 'AAAAAAAGGCTTATAC', 'AAAAAAATCGTCCGTG']
First few barcodes for library lib_2:
['AAAAAACCCAGACTTA', 'AAAAAACCGGCGGCAG', 'AAAAAATCTACTGCGT']

Plot the number of amino-acid mutations per variant. Note that this plot is only for the primary target:

[24]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotNumMutsHistogram("aa", samples=None)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_45_0.png

Average number of codon mutations per variant of each type of mutation. We make these plots for: 1. Just single-mutant and wildtype variants 2. For all variants

Again, these are only for the primary target, since mutations are only defined for that target:

[25]:

# NBVAL_IGNORE_OUTPUT

for mut_type in ["single", "all"]:
    p = variants.plotNumCodonMutsByType(mut_type, samples=None)
    p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
    _ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_47_0.png

_images/codonvariant_sim_data_multi_targets_47_1.png

Here are the numerical data in the plots above:

[26]:

variants.numCodonMutsByType("single", samples=None).round(3)

[26]:

	library	sample	mutation_type	num_muts_count	count	number
0	lib_1	barcoded variants	nonsynonymous	3615	6112	0.591
1	lib_1	barcoded variants	synonymous	194	6112	0.032
2	lib_1	barcoded variants	stop	211	6112	0.035
3	lib_2	barcoded variants	nonsynonymous	3716	6114	0.608
4	lib_2	barcoded variants	synonymous	235	6114	0.038
5	lib_2	barcoded variants	stop	179	6114	0.029
6	all libraries	barcoded variants	nonsynonymous	7331	12226	0.600
7	all libraries	barcoded variants	synonymous	429	12226	0.035
8	all libraries	barcoded variants	stop	390	12226	0.032

[27]:

variants.numCodonMutsByType("all", samples=None).round(3)

[27]:

	library	sample	mutation_type	num_muts_count	count	number
0	lib_1	barcoded variants	nonsynonymous	26916	15000	1.794
1	lib_1	barcoded variants	synonymous	1510	15000	0.101
2	lib_1	barcoded variants	stop	1424	15000	0.095
3	lib_2	barcoded variants	nonsynonymous	26960	15000	1.797
4	lib_2	barcoded variants	synonymous	1498	15000	0.100
5	lib_2	barcoded variants	stop	1371	15000	0.091
6	all libraries	barcoded variants	nonsynonymous	53876	30000	1.796
7	all libraries	barcoded variants	synonymous	3008	30000	0.100
8	all libraries	barcoded variants	stop	2795	30000	0.093

Examine how well amino-acid mutations to the primary target are sampled in the library by looking at the fraction of mutations seen <= some number of times. Here we do that for amino-acid mutations, making separate plots for single mutants and all mutants:

[28]:

# NBVAL_IGNORE_OUTPUT

for variant_type in ["single", "all"]:
    p = variants.plotCumulMutCoverage(variant_type, mut_type="aa", samples=None)
    _ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_52_0.png

_images/codonvariant_sim_data_multi_targets_52_1.png

We can also get the numerical information plotted above (here for single mutants only):

[29]:

variants.mutCounts("single", "aa", samples=None).head(n=5)

[29]:

	library	sample	mutation	count	mutation_type	site
0	lib_1	barcoded variants	V16L	25	nonsynonymous	16
1	lib_1	barcoded variants	C19R	21	nonsynonymous	19
2	lib_1	barcoded variants	R12S	21	nonsynonymous	12
3	lib_1	barcoded variants	R26L	21	nonsynonymous	26
4	lib_1	barcoded variants	S11R	21	nonsynonymous	11

Here are the frequencies of mutations along the primary target gene, looking both at single mutants and all mutants:

[30]:

# NBVAL_IGNORE_OUTPUT

for variant_type in ["single", "all"]:
    p = variants.plotMutFreqs(variant_type, "codon", samples=None)
    _ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_56_0.png

_images/codonvariant_sim_data_multi_targets_56_1.png

We can also look at mutation frequencies in the primary target gene a heat-map form:

[31]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotMutHeatmap("all", "aa", samples=None)
_ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_58_0.png

Simulate counts for samples¶

An experiment involves subjecting the library to different selections and looking at how the frequencies of the variants changes by using sequencing to count the barcodes in each condition.

Here, we simulate an experiment by simulating variant counts for samples that have undergone various selections.

For these simulations, we first need to define a “phenotype” for each variant. The phenotype represents how much the frequency of the variant is expected to increase or decrease after selection.

Define phenotype function¶

First, we define a “phenotype” function. We will do this using a SigmoidPhenotypeSimulator, which follow the “global epistasis” concepts of Otwinoski et al and Sailer and Harms to define the phenotype in two steps: an underlying latent phenotype that mutations affect additively, and then an observed phenotype that is a non-linear function of the latent phenotype. The variants are then simulated according to their observed enrichments, which are the exponentials of the observed phenotypes.

First, we initialize the simulator:

[32]:

phenosimulator = dms_variants.simulate.SigmoidPhenotypeSimulator(geneseq, seed=seed)

Plot the simulated relationship of the latent phenotype with the observed enrichment and phenotype, with a dashed vertical line indicating the wildtype latent phenotype:

[33]:

# NBVAL_IGNORE_OUTPUT

for value in ["enrichment", "phenotype"]:
    p = phenosimulator.plotLatentVsObserved(value)
    _ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_62_0.png

_images/codonvariant_sim_data_multi_targets_62_1.png

Plot the latent phenotype, observed phenotype, and observed enrichment of all single amino-acid mutants, with a dashed vertical line indicating the wildtype:

[34]:

# NBVAL_IGNORE_OUTPUT

for value in ["latentPhenotype", "observedPhenotype", "observedEnrichment"]:
    p = phenosimulator.plotMutsHistogram(value)
    _ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_64_0.png

_images/codonvariant_sim_data_multi_targets_64_1.png

_images/codonvariant_sim_data_multi_targets_64_2.png

Phenotypes for secondary targets¶

We also specify phenotypes as observed enrichments for the secondary targets:

[35]:

secondary_target_phenotypes = {
    secondary_target: round(random.uniform(0.2, 2), 3)
    for secondary_target in secondary_targets
}

# display the observed enrichments for the secondary targets
pd.Series(secondary_target_phenotypes).rename("observed_enrichment").to_frame()

[35]:

	observed_enrichment
secondary_target_1	0.883
secondary_target_2	1.211
secondary_target_3	1.789

Simulate variant counts¶

Now we use simulateSampleCounts to simulate counts of variants when selection on each variant is proportional to its observed enrichment. In these simulations, we can fine-tune the simulations to reflect real experiments, such as by setting the error-rate in variant calling, bottlenecks when going from the pre- to post-selection samples, and non-uniformity in library composition.

Here we simulate using several bottlenecks in the library going from the pre- to post-selection samples, since in real experiments this seems to be the biggest source of noise / error:

[36]:

counts = dms_variants.simulate.simulateSampleCounts(
    variants=variants,
    phenotype_func=phenosimulator.observedEnrichment,
    variant_error_rate=variant_error_rate,
    pre_sample={
        "total_count": variants_per_lib * numpy.random.poisson(avgdepth_per_variant),
        "uniformity": lib_uniformity,
    },
    pre_sample_name="pre-selection",
    post_samples={
        name: {
            "noise": noise,
            "total_count": variants_per_lib
            * numpy.random.poisson(avgdepth_per_variant),
            "bottleneck": bottle,
        }
        for name, bottle in bottlenecks.items()
    },
    seed=seed,
    secondary_target_phenotypes=secondary_target_phenotypes,
)

Data frame with the simulated counts:

[37]:

counts.head(6)

[37]:

	library	barcode	sample	count
0	lib_1	AAAAAAAACGTTTTGT	pre-selection	317
1	lib_1	AAAAAAAGGCTTATAC	pre-selection	133
2	lib_1	AAAAAAATCGTCCGTG	pre-selection	127
3	lib_1	AAAAAAGACAACACCG	pre-selection	97
4	lib_1	AAAAAAGATGTGGTGG	pre-selection	324
5	lib_1	AAAAAAGTCATACTCG	pre-selection	118

Add counts to variant table¶

Now add the simulated counts for each library / sample to the CodonVariantTable:

[38]:

variants.add_sample_counts_df(counts)

Confirm that we have added the expected number of counts per library / sample:

[39]:

# NBVAL_IGNORE_OUTPUT

variants.n_variants_df(libraries="all_only")

[39]:

	target	library	sample	count
0	primary_target_gene	all libraries	pre-selection	5439967
1	primary_target_gene	all libraries	loose_bottle	5366820
2	primary_target_gene	all libraries	tight_bottle	5916597
3	secondary_target_1	all libraries	pre-selection	23756
4	secondary_target_1	all libraries	loose_bottle	54373
5	secondary_target_1	all libraries	tight_bottle	57359
6	secondary_target_2	all libraries	pre-selection	29184
7	secondary_target_2	all libraries	loose_bottle	92595
8	secondary_target_2	all libraries	tight_bottle	98230
9	secondary_target_3	all libraries	pre-selection	27093
10	secondary_target_3	all libraries	loose_bottle	126212
11	secondary_target_3	all libraries	tight_bottle	137814

Look only for primary target:

[40]:

# NBVAL_IGNORE_OUTPUT

variants.n_variants_df(primary_target_only=True)

[40]:

	library	sample	count
0	lib_1	pre-selection	2719365
1	lib_1	loose_bottle	2674231
2	lib_1	tight_bottle	2949688
3	lib_2	pre-selection	2720602
4	lib_2	loose_bottle	2692589
5	lib_2	tight_bottle	2966909
6	all libraries	pre-selection	5439967
7	all libraries	loose_bottle	5366820
8	all libraries	tight_bottle	5916597

Analyze sample variant counts¶

A CodonVariantTable has methods for summarizing and plotting the variant counts for different samples. These methods are mostly the same as we used above to analyze the variant composition of the libraries themselves, but now we set samples to the samples that we want to analyze (typically 'all'). Therefore, rather than each variant always counting once, each variant is counted in each sample in proportion to how many counts it has.

In the rawest form, we can directly access a data frame giving the counts of each variant in each sample:

[41]:

# NBVAL_IGNORE_OUTPUT

variants.variant_count_df

[41]:

	target	library	sample	barcode	count	variant_call_support	codon_substitutions	aa_substitutions	n_codon_substitutions	n_aa_substitutions
0	primary_target_gene	lib_1	pre-selection	AACCCGTTCACCACCA	692	2	ATC17CGA CGG29TGC	I17R R29C	2	2
1	primary_target_gene	lib_1	pre-selection	ACTACGGACGATTATT	633	1	GCT7TCT AAA14GAT AAC20CGA	A7S K14D N20R	3	3
2	primary_target_gene	lib_1	pre-selection	GGACTATCTCAGTATG	615	2			0	0
3	primary_target_gene	lib_1	pre-selection	CCAGAAAACCCACCAA	614	7	TTA18CTC		1	0
4	primary_target_gene	lib_1	pre-selection	ACGGCCTAAACTCACA	611	3	CGT6GTA GAC22CAC	R6V D22H	2	2
...	...	...	...	...	...	...	...	...	...	...
91297	secondary_target_3	lib_2	tight_bottle	CAACCTAATTGGGATC	187	10	secondary_target_3	secondary_target_3	0	0
91298	secondary_target_3	lib_2	tight_bottle	TGGGCACGATGCAATA	186	10	secondary_target_3	secondary_target_3	0	0
91299	secondary_target_3	lib_2	tight_bottle	ACGAGTACCCCACTTC	182	10	secondary_target_3	secondary_target_3	0	0
91300	secondary_target_3	lib_2	tight_bottle	TATATCGGCGTGAAGA	176	10	secondary_target_3	secondary_target_3	0	0
91301	secondary_target_3	lib_2	tight_bottle	CAGTGGTAATTAAACG	0	10	secondary_target_3	secondary_target_3	0	0

91302 rows × 10 columns

Get the average number of counts per variant:

[42]:

# NBVAL_IGNORE_OUTPUT

variants.avgCountsPerVariant().round(1)

[42]:

	target	library	sample	avg_counts_per_variant
0	primary_target_gene	lib_1	pre-selection	181.3
1	primary_target_gene	lib_1	loose_bottle	178.3
2	primary_target_gene	lib_1	tight_bottle	196.6
3	primary_target_gene	lib_2	pre-selection	181.4
4	primary_target_gene	lib_2	loose_bottle	179.5
5	primary_target_gene	lib_2	tight_bottle	197.8
6	primary_target_gene	all libraries	pre-selection	181.3
7	primary_target_gene	all libraries	loose_bottle	178.9
8	primary_target_gene	all libraries	tight_bottle	197.2
9	secondary_target_1	lib_1	pre-selection	166.1
10	secondary_target_1	lib_1	loose_bottle	378.2
11	secondary_target_1	lib_1	tight_bottle	401.7
12	secondary_target_1	lib_2	pre-selection	184.4
13	secondary_target_1	lib_2	loose_bottle	423.2
14	secondary_target_1	lib_2	tight_bottle	444.8
15	secondary_target_1	all libraries	pre-selection	177.3
16	secondary_target_1	all libraries	loose_bottle	405.8
17	secondary_target_1	all libraries	tight_bottle	428.1
18	secondary_target_2	lib_1	pre-selection	173.2
19	secondary_target_2	lib_1	loose_bottle	556.6
20	secondary_target_2	lib_1	tight_bottle	611.9
21	secondary_target_2	lib_2	pre-selection	199.5
22	secondary_target_2	lib_2	loose_bottle	624.0
23	secondary_target_2	lib_2	tight_bottle	634.3
24	secondary_target_2	all libraries	pre-selection	184.7
25	secondary_target_2	all libraries	loose_bottle	586.0
26	secondary_target_2	all libraries	tight_bottle	621.7
27	secondary_target_3	lib_1	pre-selection	190.6
28	secondary_target_3	lib_1	loose_bottle	880.0
29	secondary_target_3	lib_1	tight_bottle	919.1
30	secondary_target_3	lib_2	pre-selection	191.1
31	secondary_target_3	lib_2	loose_bottle	902.7
32	secondary_target_3	lib_2	tight_bottle	1051.8
33	secondary_target_3	all libraries	pre-selection	190.8
34	secondary_target_3	all libraries	loose_bottle	888.8
35	secondary_target_3	all libraries	tight_bottle	970.5

Above the counts are grouped by target. To not group by target:

[43]:

variants.avgCountsPerVariant(by_target=False).round(1)

[43]:

	library	sample	avg_counts_per_variant
0	lib_1	pre-selection	181.2
1	lib_1	loose_bottle	185.2
2	lib_1	tight_bottle	203.9
3	lib_2	pre-selection	181.5
4	lib_2	loose_bottle	185.5
5	lib_2	tight_bottle	204.2
6	all libraries	pre-selection	181.4
7	all libraries	loose_bottle	185.3
8	all libraries	tight_bottle	204.0

We also plot the average counts per variant:

[44]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotAvgCountsPerVariant(heightscale=1.5, orientation="h")
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_84_0.png

Here is the same plot not faceted by target:

[45]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotAvgCountsPerVariant(orientation="h", by_target=False)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_86_0.png

Plot the number of counts for each variant in each sample. The horizontal dashed line shows the total number of variants. The plot shows that all variants are well-sampled in the pre-selection libraries, but that post- selection some variants are sampled more or less. This is expected since selection will decrease and increase the frequency of variants. By default, this plot shows only the primary target:

[46]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotCumulVariantCounts()
_ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_88_0.png

Similar plot showing all targets:

[47]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotCumulVariantCounts(primary_target_only=False)
_ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_90_0.png

Number of counts per variant:

[48]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotCountsPerVariant(libraries=variants.libraries, by_variant_class=True)
_ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_92_0.png

Similar plot only for primary targets:

[49]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotCountsPerVariant(
    libraries=variants.libraries, by_variant_class=True, primary_target_only=True
)

_ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_94_0.png

Distribution of the number of amino-acid mutations per variant in each sample. This is only for the primary target. As expected, mutations go down after selection:

[50]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotNumMutsHistogram(mut_type="aa")
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_96_0.png

Average number of mutations of each type per variant among just single mutants (and wildtype) and among all variants, again only for primary target:

[51]:

# NBVAL_IGNORE_OUTPUT

for variant_type in ["single", "all"]:
    p = variants.plotNumCodonMutsByType(variant_type)
    p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
    _ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_98_0.png

_images/codonvariant_sim_data_multi_targets_98_1.png

Here are the numerical data plotted above (just for the individual libraries and all mutations):

[52]:

# NBVAL_IGNORE_OUTPUT

variants.numCodonMutsByType(variant_type="all", libraries=libs).round(3)

[52]:

	library	sample	mutation_type	num_muts_count	count	number
0	lib_1	pre-selection	nonsynonymous	4864651	2719365	1.789
1	lib_1	pre-selection	synonymous	275311	2719365	0.101
2	lib_1	pre-selection	stop	257132	2719365	0.095
3	lib_1	loose_bottle	nonsynonymous	2330462	2674231	0.871
4	lib_1	loose_bottle	synonymous	270348	2674231	0.101
5	lib_1	loose_bottle	stop	809	2674231	0.000
6	lib_1	tight_bottle	nonsynonymous	2574335	2949688	0.873
7	lib_1	tight_bottle	synonymous	289539	2949688	0.098
8	lib_1	tight_bottle	stop	948	2949688	0.000
9	lib_2	pre-selection	nonsynonymous	4872431	2720602	1.791
10	lib_2	pre-selection	synonymous	279586	2720602	0.103
11	lib_2	pre-selection	stop	249866	2720602	0.092
12	lib_2	loose_bottle	nonsynonymous	2433719	2692589	0.904
13	lib_2	loose_bottle	synonymous	271738	2692589	0.101
14	lib_2	loose_bottle	stop	1253	2692589	0.000
15	lib_2	tight_bottle	nonsynonymous	2669151	2966909	0.900
16	lib_2	tight_bottle	synonymous	288678	2966909	0.097
17	lib_2	tight_bottle	stop	1060	2966909	0.000

Here are mutation frequencies as a function of primary sequence among all variants of the primary target:

[53]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotMutFreqs(variant_type="all", mut_type="codon")
_ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_102_0.png

Plot how thoroughly amino-acid mutations are sampled, doing this separately among all variants and single-mutant variants for the primary target. The plots below show that the stop mutations are sampled very poorly post-selection because they are eliminated during selection:

[54]:

# NBVAL_IGNORE_OUTPUT

for variant_type in ["single", "all"]:
    p = variants.plotCumulMutCoverage(variant_type=variant_type, mut_type="aa")
    _ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_104_0.png

_images/codonvariant_sim_data_multi_targets_104_1.png

Functional scores for variants¶

The CodonVariantTable.func_scores method calculates a functional score for each variant based on its change in frequency from pre- to post-selection.

To calculate these scores, we need to pair each post-selection sample with a pre-selection one. In this case, the pre-selection sample is named ‘pre-selection’ for all post-selection samples:

[55]:

func_scores = variants.func_scores("pre-selection", libraries=libs)

The resulting data frame has a functional score (and its variance) for each barcoded variant:

[56]:

# NBVAL_IGNORE_OUTPUT

func_scores.round(3)

[56]:

	target	library	pre_sample	post_sample	barcode	func_score	func_score_var	pre_count	post_count	pre_count_wt	post_count_wt	pseudocount	codon_substitutions	n_codon_substitutions	aa_substitutions	n_aa_substitutions
0	primary_target_gene	lib_1	pre-selection	loose_bottle	AACCCGTTCACCACCA	-1.060	0.005	692	847	381697	973820	0.5	ATC17CGA CGG29TGC	2	I17R R29C	2
1	primary_target_gene	lib_1	pre-selection	loose_bottle	ACTACGGACGATTATT	-3.845	0.022	633	112	381697	973820	0.5	GCT7TCT AAA14GAT AAC20CGA	3	A7S K14D N20R	3
2	primary_target_gene	lib_1	pre-selection	loose_bottle	GGACTATCTCAGTATG	-0.188	0.005	615	1378	381697	973820	0.5		0		0
3	primary_target_gene	lib_1	pre-selection	loose_bottle	CCAGAAAACCCACCAA	-0.052	0.005	614	1512	381697	973820	0.5	TTA18CTC	1		0
4	primary_target_gene	lib_1	pre-selection	loose_bottle	ACGGCCTAAACTCACA	-3.228	0.016	611	166	381697	973820	0.5	CGT6GTA GAC22CAC	2	R6V D22H	2
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
60863	secondary_target_3	lib_2	pre-selection	tight_bottle	TGTGCTTAGAGCACCG	0.953	0.033	74	412	359334	1028011	0.5	secondary_target_3	0	secondary_target_3	0
60864	secondary_target_3	lib_2	pre-selection	tight_bottle	CCTGGATTTCTGATTT	1.546	0.034	69	580	359334	1028011	0.5	secondary_target_3	0	secondary_target_3	0
60865	secondary_target_3	lib_2	pre-selection	tight_bottle	CAGTGGTAATTAAACG	-8.614	4.193	68	0	359334	1028011	0.5	secondary_target_3	0	secondary_target_3	0
60866	secondary_target_3	lib_2	pre-selection	tight_bottle	CTTGTTGTATACTAGC	2.630	0.042	52	929	359334	1028011	0.5	secondary_target_3	0	secondary_target_3	0
60867	secondary_target_3	lib_2	pre-selection	tight_bottle	ACTGACCATGATAGTC	1.247	0.084	28	193	359334	1028011	0.5	secondary_target_3	0	secondary_target_3	0

60868 rows × 16 columns

We can also calculate functional scores at the level of amino-acid or codon substitutions rather than at the level of variants. This calculation groups all variants with the same substitutions before calculating the functional score. Here are scores grouping by amino-acid substitutions; we also set syn_as_wt=True to include variants with only synonymous mutations in the counts of wild type in this case:

[57]:

# NBVAL_IGNORE_OUTPUT

aa_func_scores = variants.func_scores(
    "pre-selection", by="aa_substitutions", syn_as_wt=True, libraries=libs
)
aa_func_scores.round(3)

[57]:

	target	library	pre_sample	post_sample	aa_substitutions	func_score	func_score_var	pre_count	post_count	pre_count_wt	post_count_wt	pseudocount	n_aa_substitutions
0	primary_target_gene	lib_1	pre-selection	loose_bottle	I17R R29C	-1.062	0.005	692	847	418909	1070114	0.5	2
1	primary_target_gene	lib_1	pre-selection	loose_bottle	A7S K14D N20R	-3.846	0.022	633	112	418909	1070114	0.5	3
2	primary_target_gene	lib_1	pre-selection	loose_bottle		0.000	0.000	418909	1070114	418909	1070114	0.5	0
3	primary_target_gene	lib_1	pre-selection	loose_bottle	R6V D22H	-3.230	0.016	611	166	418909	1070114	0.5	2
4	primary_target_gene	lib_1	pre-selection	loose_bottle	R1A	0.119	0.002	1863	5168	418909	1070114	0.5	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...
35995	secondary_target_1	lib_2	pre-selection	tight_bottle	secondary_target_1	-0.237	0.000	15119	36473	404611	1150576	0.5	0
35996	secondary_target_2	lib_1	pre-selection	tight_bottle	secondary_target_2	0.339	0.000	15418	54461	418909	1169889	0.5	0
35997	secondary_target_2	lib_2	pre-selection	tight_bottle	secondary_target_2	0.161	0.000	13766	43769	404611	1150576	0.5	0
35998	secondary_target_3	lib_1	pre-selection	tight_bottle	secondary_target_3	0.788	0.000	16580	79965	418909	1169889	0.5	0
35999	secondary_target_3	lib_2	pre-selection	tight_bottle	secondary_target_3	0.952	0.000	10513	57849	404611	1150576	0.5	0

36000 rows × 13 columns

We can plot the distribution of the functional scores for variants. These plots are most informative if we classify variants by the “types” of mutations they have, which we do here using the CodonVariantTable.classifyVariants method, which adds a column called variant_class to the data frame. Note that we have to specify the primary target and how to classify non-primary targets:

[58]:

func_scores = dms_variants.codonvarianttable.CodonVariantTable.classifyVariants(
    func_scores,
    primary_target=variants.primary_target,
    non_primary_target_class="secondary target",
)

(func_scores[["target", "codon_substitutions", "aa_substitutions", "variant_class"]])

[58]:

	target	codon_substitutions	aa_substitutions	variant_class
0	primary_target_gene	ATC17CGA CGG29TGC	I17R R29C	>1 nonsynonymous
1	primary_target_gene	GCT7TCT AAA14GAT AAC20CGA	A7S K14D N20R	>1 nonsynonymous
2	primary_target_gene			wildtype
3	primary_target_gene	TTA18CTC		synonymous
4	primary_target_gene	CGT6GTA GAC22CAC	R6V D22H	>1 nonsynonymous
...	...	...	...	...
60863	secondary_target_3	secondary_target_3	secondary_target_3	secondary target
60864	secondary_target_3	secondary_target_3	secondary_target_3	secondary target
60865	secondary_target_3	secondary_target_3	secondary_target_3	secondary target
60866	secondary_target_3	secondary_target_3	secondary_target_3	secondary target
60867	secondary_target_3	secondary_target_3	secondary_target_3	secondary target

60868 rows × 4 columns

Now we use plotnine to plot the distributions of scores in ggplot2-like syntax, coloring by the variant class. This plot shows the expected behavior for different variant classes; for instance, stop codon variants tend to have low scores and synonymous variants tend to have wildtype-like (near 0) scores. As expected, there is more noise with a tighter bottleneck:

[59]:

# NBVAL_IGNORE_OUTPUT

p = (
    ggplot(func_scores, aes("variant_class", "func_score"))
    + geom_violin(aes(fill="variant_class"))
    + ylab("functional score")
    + xlab("")
    + facet_grid("post_sample ~ library")
    + theme(
        figure_size=(2.75 * len(libs), 2 * len(bottlenecks)),
        axis_text_x=element_text(angle=90),
        panel_grid_major_x=element_blank(),  # no vertical grid lines
    )
    + scale_fill_manual(values=CBPALETTE[1:])
)

_ = p.draw(show=True)

_images/codonvariant_sim_data_multi_targets_114_0.png

[ ]: