Libraries with secondary target genes

A CodonVariantTable is designed to handle mutations to a specific target gene. But in some cases, you might also want to include specific targets that are not point mutants of this primary target gene, but rather are entirely different genes that are spiked into the library as a control or for comparison. Because these secondary targets might not be closely related to the primary target gene, we do not want to define them simply by their mutations relative to the primary target. Rather, we handle them as different targets by adding a “targets” column in the data.

This notebook illustrates how to do this.

Setup for analysis

Import Python modules / packages:

[1]:
import itertools
import random
import tempfile
import warnings

import numpy

import pandas as pd

from plotnine import *

import dms_variants.codonvarianttable
import dms_variants.plotnine_themes
import dms_variants.simulate
from dms_variants.constants import CBPALETTE, CODONS_NOSTOP

Set parameters that define the simulated data for the primary target gene. Except for that fact that we use a short gene to reduce the computational run-time of this notebook, these simulation parameters are designed to reflect what we might observe in a real Bloom lab deep mutational scanning experiment:

[2]:
primary_target = "primary_target_gene"
seed = 1  # random number seed
genelength = 30  # gene length in codons
libs = ["lib_1", "lib_2"]  # distinct libraries of gene
variants_per_lib = 500 * genelength  # variants per library
avgmuts = 2.0  # average codon mutations per variant
bclen = 16  # length of nucleotide barcode for each variant
variant_error_rate = 0.005  # rate at which variant sequence mis-called
avgdepth_per_variant = 200  # average per-variant sequencing depth
lib_uniformity = 5  # uniformity of library pre-selection
noise = 0.02  # random noise in selections
bottlenecks = {  # bottlenecks from pre- to post-selection
    "tight_bottle": variants_per_lib * 5,
    "loose_bottle": variants_per_lib * 100,
}

We also define some secondary targets:

[3]:
secondary_targets = [f"secondary_target_{i}" for i in range(1, 4)]

Seed random number generator for reproducible output:

[4]:
random.seed(seed)
numpy.random.seed(seed)

Set pandas display options to show large chunks of Data Frames in this example:

[5]:
pd.set_option("display.max_columns", 20)
pd.set_option("display.width", 500)

Hide warnings that clutter output:

[6]:
warnings.simplefilter("ignore")

Set the gray grid plotnine theme defined in dms_variants.plotnine_themes as this gives a nice appearance for the plots:

[7]:
theme_set(dms_variants.plotnine_themes.theme_graygrid())

Simulate library with primary and secondary variants

Simulate wildtype gene sequence:

[8]:
geneseq = "".join(random.choices(CODONS_NOSTOP, k=genelength))
print(f"Wildtype gene of {genelength} codons:\n{geneseq}")
Wildtype gene of 30 codons:
AGATCCGTGATTCTGCGTGCTTACACCAACTCACGGGTGAAACGTGTAATCTTATGCAACAACGACTTACCTATCCGCAACATCCGGCTG

Generate a CodonVariantTable using simulate_CodonVariantTable function, then get the barcode-variant data frame for it. This data frame gives the variants for the primary target:

[9]:
primary_bc_variant_df = dms_variants.simulate.simulate_CodonVariantTable(
    geneseq=geneseq,
    bclen=bclen,
    library_specs={
        lib: {"avgmuts": avgmuts, "nvariants": variants_per_lib} for lib in libs
    },
    seed=seed,
).barcode_variant_df

primary_bc_variant_df.head()
[9]:
library barcode variant_call_support codon_substitutions aa_substitutions n_codon_substitutions n_aa_substitutions
0 lib_1 AAAAAAAACGTTTTGT 1 CTG5TGG TCA11TCT L5W 2 1
1 lib_1 AAAAAAAGGCTTATAC 5 TCA11TGC S11C 1 1
2 lib_1 AAAAAAATCGTCCGTG 2 0 0
3 lib_1 AAAAAAGACAACACCG 5 ATC25TTC I25F 1 1
4 lib_1 AAAAAAGATGTGGTGG 3 TCA11TTA CGG12GGT CGT15TGA ATC25GGA ATC28CGC S11L R12G R15* I25G I28R 5 5

This data frame gives variants for the primary target, so add a column specifying that fact:

[10]:
primary_bc_variant_df = primary_bc_variant_df.assign(target=primary_target)

Now add some barcodes for each secondary targets. We simulate barcodes for each secondary target and each library. Note that the secondary targets are defined genes, so we don’t add any information on substitutions for these:

[11]:
existing_barcodes = set(primary_bc_variant_df["barcode"])

secondary_bc_variant_df = []
for lib, secondary_target in itertools.product(libs, secondary_targets):
    secondary_target_barcodes = dms_variants.simulate.rand_seq(
        seqlen=bclen,
        exclude=existing_barcodes,
        nseqs=random.randint(50, 100),
    )
    existing_barcodes = existing_barcodes.union(secondary_target_barcodes)
    secondary_bc_variant_df.append(
        pd.DataFrame(
            {
                "target": secondary_target,
                "barcode": secondary_target_barcodes,
                "library": lib,
                "variant_call_support": 10,
            }
        )
    )

secondary_bc_variant_df = pd.concat(secondary_bc_variant_df, ignore_index=True)

secondary_bc_variant_df.head()
[11]:
target barcode library variant_call_support
0 secondary_target_1 TCCTGTACGATATCCC lib_1 10
1 secondary_target_1 GGGTGTGGCGTTTCCT lib_1 10
2 secondary_target_1 TAGCCTCTATTGCCTT lib_1 10
3 secondary_target_1 CGGAGACGTCACGGCG lib_1 10
4 secondary_target_1 GGGGACAATTATCCCT lib_1 10

Concatenate the data frame for the primary target (which has mutants) with that for the secondary targets. Because there are no mutations for the secondary targets, fill that information is as empty strings (for substitution columns) or 0 (for number of mutations). This data frame below ends the simulation of the library, and represents the type of information you would typically have at the outset of an experiment:

[12]:
bc_variant_df = pd.concat(
    [primary_bc_variant_df, secondary_bc_variant_df], sort=False
).assign(
    codon_substitutions=lambda x: x["codon_substitutions"].fillna(""),
    aa_substitutions=lambda x: x["aa_substitutions"].fillna(""),
    n_codon_substitutions=lambda x: x["n_codon_substitutions"].fillna(0).astype(int),
    n_aa_substitutions=lambda x: x["n_aa_substitutions"].fillna(0).astype(int),
)

bc_variant_df
[12]:
library barcode variant_call_support codon_substitutions aa_substitutions n_codon_substitutions n_aa_substitutions target
0 lib_1 AAAAAAAACGTTTTGT 1 CTG5TGG TCA11TCT L5W 2 1 primary_target_gene
1 lib_1 AAAAAAAGGCTTATAC 5 TCA11TGC S11C 1 1 primary_target_gene
2 lib_1 AAAAAAATCGTCCGTG 2 0 0 primary_target_gene
3 lib_1 AAAAAAGACAACACCG 5 ATC25TTC I25F 1 1 primary_target_gene
4 lib_1 AAAAAAGATGTGGTGG 3 TCA11TTA CGG12GGT CGT15TGA ATC25GGA ATC28CGC S11L R12G R15* I25G I28R 5 5 primary_target_gene
... ... ... ... ... ... ... ... ...
429 lib_2 GCGGCCTGGAGTTGAA 10 0 0 secondary_target_3
430 lib_2 TGATCACTACGTCCTT 10 0 0 secondary_target_3
431 lib_2 TTTAGGAATCTCCCGT 10 0 0 secondary_target_3
432 lib_2 TGTGCTTAGAGCACCG 10 0 0 secondary_target_3
433 lib_2 TTCTTACTGAAGTTCC 10 0 0 secondary_target_3

30434 rows × 8 columns

Now create the CodonVariantTable. Note how we specify the primary_target option to the name of that target; this is required when using multiple targets:

[13]:
with tempfile.NamedTemporaryFile() as f:
    bc_variant_df.to_csv(f.name, index=False)
    f.flush()
    variants = dms_variants.codonvarianttable.CodonVariantTable(
        barcode_variant_file=f.name,
        geneseq=geneseq,
        substitutions_are_codon=True,
        substitutions_col="codon_substitutions",
        primary_target=primary_target,
    )

We can get basic information about the CodonVariantTable, such as the sites, wildtype codons, and wildtype amino acids. Below we do this for the first few sites. Note that all of this information is for the primary target:

[14]:
variants.sites[:5]
[14]:
[1, 2, 3, 4, 5]
[15]:
list(variants.codons[r] for r in variants.sites[:5])
[15]:
['AGA', 'TCC', 'GTG', 'ATT', 'CTG']
[16]:
list(variants.aas[r] for r in variants.sites[:5])
[16]:
['R', 'S', 'V', 'I', 'L']

The different libraries in the table:

[17]:
variants.libraries
[17]:
['lib_1', 'lib_2']

Here is the data frame that contains all of the information on the barcoded variants. Note that the list of substitutions for the secondary targets just repeat the target name (so you are not confused with thinking mutations are actually enumerated in these secondary targets):

[18]:
variants.barcode_variant_df
[18]:
target library barcode variant_call_support codon_substitutions aa_substitutions n_codon_substitutions n_aa_substitutions
0 primary_target_gene lib_1 AAAAAAAACGTTTTGT 1 CTG5TGG TCA11TCT L5W 2 1
1 primary_target_gene lib_1 AAAAAAAGGCTTATAC 5 TCA11TGC S11C 1 1
2 primary_target_gene lib_1 AAAAAAATCGTCCGTG 2 0 0
3 primary_target_gene lib_1 AAAAAAGACAACACCG 5 ATC25TTC I25F 1 1
4 primary_target_gene lib_1 AAAAAAGATGTGGTGG 3 TCA11TTA CGG12GGT CGT15TGA ATC25GGA ATC28CGC S11L R12G R15* I25G I28R 5 5
... ... ... ... ... ... ... ... ...
30429 secondary_target_3 lib_2 TGTGCTTAGAGCACCG 10 secondary_target_3 secondary_target_3 0 0
30430 secondary_target_3 lib_2 TTCCGACACGTTGACC 10 secondary_target_3 secondary_target_3 0 0
30431 secondary_target_3 lib_2 TTCTATTACGCTGGGG 10 secondary_target_3 secondary_target_3 0 0
30432 secondary_target_3 lib_2 TTCTTACTGAAGTTCC 10 secondary_target_3 secondary_target_3 0 0
30433 secondary_target_3 lib_2 TTTAGGAATCTCCCGT 10 secondary_target_3 secondary_target_3 0 0

30434 rows × 8 columns

Analyze library composition

A CodonVariantTable has methods for summarizing and plotting properties of the barcoded variants. These methods can be used to analyze the barcoded variants themselves, or to analyze the frequency (or counts) of variants in the library in samples that have undergone some type of selection.

We have not yet added counts of the variants in any specific samples, so we just analyze the composition of the variant library itself. This is done by setting samples=None in the method calls below.

Number of variants in each library:

[19]:
variants.n_variants_df(samples=None)
[19]:
target library sample count
0 primary_target_gene lib_1 barcoded variants 15000
1 primary_target_gene lib_2 barcoded variants 15000
2 primary_target_gene all libraries barcoded variants 30000
3 secondary_target_1 lib_1 barcoded variants 52
4 secondary_target_1 lib_2 barcoded variants 82
5 secondary_target_1 all libraries barcoded variants 134
6 secondary_target_2 lib_1 barcoded variants 89
7 secondary_target_2 lib_2 barcoded variants 69
8 secondary_target_2 all libraries barcoded variants 158
9 secondary_target_3 lib_1 barcoded variants 87
10 secondary_target_3 lib_2 barcoded variants 55
11 secondary_target_3 all libraries barcoded variants 142

We can also get this information for just the primary target:

[20]:
variants.n_variants_df(samples=None, primary_target_only=True)
[20]:
library sample count
0 lib_1 barcoded variants 15000
1 lib_2 barcoded variants 15000
2 all libraries barcoded variants 30000

Plot distribution of variant call supports, grouping together all variants with support \(\ge 8\):

[21]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotVariantSupportHistogram(max_support=10)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_39_0.png

Same plot for only the primary target:

[22]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotVariantSupportHistogram(max_support=10, primary_target_only=True)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_41_0.png

The set of valid barcodes for each library:

[23]:
for lib in libs:
    print(f"First few barcodes for library {lib}:")
    print(sorted(variants.valid_barcodes(lib))[:3])
First few barcodes for library lib_1:
['AAAAAAAACGTTTTGT', 'AAAAAAAGGCTTATAC', 'AAAAAAATCGTCCGTG']
First few barcodes for library lib_2:
['AAAAAACCCAGACTTA', 'AAAAAACCGGCGGCAG', 'AAAAAATCTACTGCGT']

Plot the number of amino-acid mutations per variant. Note that this plot is only for the primary target:

[24]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotNumMutsHistogram("aa", samples=None)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_45_0.png

Average number of codon mutations per variant of each type of mutation. We make these plots for: 1. Just single-mutant and wildtype variants 2. For all variants

Again, these are only for the primary target, since mutations are only defined for that target:

[25]:
# NBVAL_IGNORE_OUTPUT

for mut_type in ["single", "all"]:
    p = variants.plotNumCodonMutsByType(mut_type, samples=None)
    p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
    _ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_47_0.png
_images/codonvariant_sim_data_multi_targets_47_1.png

Here are the numerical data in the plots above:

[26]:
variants.numCodonMutsByType("single", samples=None).round(3)
[26]:
library sample mutation_type num_muts_count count number
0 lib_1 barcoded variants nonsynonymous 3615 6112 0.591
1 lib_1 barcoded variants synonymous 194 6112 0.032
2 lib_1 barcoded variants stop 211 6112 0.035
3 lib_2 barcoded variants nonsynonymous 3716 6114 0.608
4 lib_2 barcoded variants synonymous 235 6114 0.038
5 lib_2 barcoded variants stop 179 6114 0.029
6 all libraries barcoded variants nonsynonymous 7331 12226 0.600
7 all libraries barcoded variants synonymous 429 12226 0.035
8 all libraries barcoded variants stop 390 12226 0.032
[27]:
variants.numCodonMutsByType("all", samples=None).round(3)
[27]:
library sample mutation_type num_muts_count count number
0 lib_1 barcoded variants nonsynonymous 26916 15000 1.794
1 lib_1 barcoded variants synonymous 1510 15000 0.101
2 lib_1 barcoded variants stop 1424 15000 0.095
3 lib_2 barcoded variants nonsynonymous 26960 15000 1.797
4 lib_2 barcoded variants synonymous 1498 15000 0.100
5 lib_2 barcoded variants stop 1371 15000 0.091
6 all libraries barcoded variants nonsynonymous 53876 30000 1.796
7 all libraries barcoded variants synonymous 3008 30000 0.100
8 all libraries barcoded variants stop 2795 30000 0.093

Examine how well amino-acid mutations to the primary target are sampled in the library by looking at the fraction of mutations seen <= some number of times. Here we do that for amino-acid mutations, making separate plots for single mutants and all mutants:

[28]:
# NBVAL_IGNORE_OUTPUT

for variant_type in ["single", "all"]:
    p = variants.plotCumulMutCoverage(variant_type, mut_type="aa", samples=None)
    _ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_52_0.png
_images/codonvariant_sim_data_multi_targets_52_1.png

We can also get the numerical information plotted above (here for single mutants only):

[29]:
variants.mutCounts("single", "aa", samples=None).head(n=5)
[29]:
library sample mutation count mutation_type site
0 lib_1 barcoded variants V16L 25 nonsynonymous 16
1 lib_1 barcoded variants C19R 21 nonsynonymous 19
2 lib_1 barcoded variants R12S 21 nonsynonymous 12
3 lib_1 barcoded variants R26L 21 nonsynonymous 26
4 lib_1 barcoded variants S11R 21 nonsynonymous 11

Here are the frequencies of mutations along the primary target gene, looking both at single mutants and all mutants:

[30]:
# NBVAL_IGNORE_OUTPUT

for variant_type in ["single", "all"]:
    p = variants.plotMutFreqs(variant_type, "codon", samples=None)
    _ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_56_0.png
_images/codonvariant_sim_data_multi_targets_56_1.png

We can also look at mutation frequencies in the primary target gene a heat-map form:

[31]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotMutHeatmap("all", "aa", samples=None)
_ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_58_0.png

Simulate counts for samples

An experiment involves subjecting the library to different selections and looking at how the frequencies of the variants changes by using sequencing to count the barcodes in each condition.

Here, we simulate an experiment by simulating variant counts for samples that have undergone various selections.

For these simulations, we first need to define a “phenotype” for each variant. The phenotype represents how much the frequency of the variant is expected to increase or decrease after selection.

Define phenotype function

First, we define a “phenotype” function. We will do this using a SigmoidPhenotypeSimulator, which follow the “global epistasis” concepts of Otwinoski et al and Sailer and Harms to define the phenotype in two steps: an underlying latent phenotype that mutations affect additively, and then an observed phenotype that is a non-linear function of the latent phenotype. The variants are then simulated according to their observed enrichments, which are the exponentials of the observed phenotypes.

First, we initialize the simulator:

[32]:
phenosimulator = dms_variants.simulate.SigmoidPhenotypeSimulator(geneseq, seed=seed)

Plot the simulated relationship of the latent phenotype with the observed enrichment and phenotype, with a dashed vertical line indicating the wildtype latent phenotype:

[33]:
# NBVAL_IGNORE_OUTPUT

for value in ["enrichment", "phenotype"]:
    p = phenosimulator.plotLatentVsObserved(value)
    _ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_62_0.png
_images/codonvariant_sim_data_multi_targets_62_1.png

Plot the latent phenotype, observed phenotype, and observed enrichment of all single amino-acid mutants, with a dashed vertical line indicating the wildtype:

[34]:
# NBVAL_IGNORE_OUTPUT

for value in ["latentPhenotype", "observedPhenotype", "observedEnrichment"]:
    p = phenosimulator.plotMutsHistogram(value)
    _ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_64_0.png
_images/codonvariant_sim_data_multi_targets_64_1.png
_images/codonvariant_sim_data_multi_targets_64_2.png

Phenotypes for secondary targets

We also specify phenotypes as observed enrichments for the secondary targets:

[35]:
secondary_target_phenotypes = {
    secondary_target: round(random.uniform(0.2, 2), 3)
    for secondary_target in secondary_targets
}

# display the observed enrichments for the secondary targets
pd.Series(secondary_target_phenotypes).rename("observed_enrichment").to_frame()
[35]:
observed_enrichment
secondary_target_1 0.883
secondary_target_2 1.211
secondary_target_3 1.789

Simulate variant counts

Now we use simulateSampleCounts to simulate counts of variants when selection on each variant is proportional to its observed enrichment. In these simulations, we can fine-tune the simulations to reflect real experiments, such as by setting the error-rate in variant calling, bottlenecks when going from the pre- to post-selection samples, and non-uniformity in library composition.

Here we simulate using several bottlenecks in the library going from the pre- to post-selection samples, since in real experiments this seems to be the biggest source of noise / error:

[36]:
counts = dms_variants.simulate.simulateSampleCounts(
    variants=variants,
    phenotype_func=phenosimulator.observedEnrichment,
    variant_error_rate=variant_error_rate,
    pre_sample={
        "total_count": variants_per_lib * numpy.random.poisson(avgdepth_per_variant),
        "uniformity": lib_uniformity,
    },
    pre_sample_name="pre-selection",
    post_samples={
        name: {
            "noise": noise,
            "total_count": variants_per_lib
            * numpy.random.poisson(avgdepth_per_variant),
            "bottleneck": bottle,
        }
        for name, bottle in bottlenecks.items()
    },
    seed=seed,
    secondary_target_phenotypes=secondary_target_phenotypes,
)

Data frame with the simulated counts:

[37]:
counts.head(6)
[37]:
library barcode sample count
0 lib_1 AAAAAAAACGTTTTGT pre-selection 317
1 lib_1 AAAAAAAGGCTTATAC pre-selection 133
2 lib_1 AAAAAAATCGTCCGTG pre-selection 127
3 lib_1 AAAAAAGACAACACCG pre-selection 97
4 lib_1 AAAAAAGATGTGGTGG pre-selection 324
5 lib_1 AAAAAAGTCATACTCG pre-selection 118

Add counts to variant table

Now add the simulated counts for each library / sample to the CodonVariantTable:

[38]:
variants.add_sample_counts_df(counts)

Confirm that we have added the expected number of counts per library / sample:

[39]:
# NBVAL_IGNORE_OUTPUT

variants.n_variants_df(libraries="all_only")
[39]:
target library sample count
0 primary_target_gene all libraries pre-selection 5439967
1 primary_target_gene all libraries loose_bottle 5366820
2 primary_target_gene all libraries tight_bottle 5916597
3 secondary_target_1 all libraries pre-selection 23756
4 secondary_target_1 all libraries loose_bottle 54373
5 secondary_target_1 all libraries tight_bottle 57359
6 secondary_target_2 all libraries pre-selection 29184
7 secondary_target_2 all libraries loose_bottle 92595
8 secondary_target_2 all libraries tight_bottle 98230
9 secondary_target_3 all libraries pre-selection 27093
10 secondary_target_3 all libraries loose_bottle 126212
11 secondary_target_3 all libraries tight_bottle 137814

Look only for primary target:

[40]:
# NBVAL_IGNORE_OUTPUT

variants.n_variants_df(primary_target_only=True)
[40]:
library sample count
0 lib_1 pre-selection 2719365
1 lib_1 loose_bottle 2674231
2 lib_1 tight_bottle 2949688
3 lib_2 pre-selection 2720602
4 lib_2 loose_bottle 2692589
5 lib_2 tight_bottle 2966909
6 all libraries pre-selection 5439967
7 all libraries loose_bottle 5366820
8 all libraries tight_bottle 5916597

Analyze sample variant counts

A CodonVariantTable has methods for summarizing and plotting the variant counts for different samples. These methods are mostly the same as we used above to analyze the variant composition of the libraries themselves, but now we set samples to the samples that we want to analyze (typically 'all'). Therefore, rather than each variant always counting once, each variant is counted in each sample in proportion to how many counts it has.

In the rawest form, we can directly access a data frame giving the counts of each variant in each sample:

[41]:
# NBVAL_IGNORE_OUTPUT

variants.variant_count_df
[41]:
target library sample barcode count variant_call_support codon_substitutions aa_substitutions n_codon_substitutions n_aa_substitutions
0 primary_target_gene lib_1 pre-selection AACCCGTTCACCACCA 692 2 ATC17CGA CGG29TGC I17R R29C 2 2
1 primary_target_gene lib_1 pre-selection ACTACGGACGATTATT 633 1 GCT7TCT AAA14GAT AAC20CGA A7S K14D N20R 3 3
2 primary_target_gene lib_1 pre-selection GGACTATCTCAGTATG 615 2 0 0
3 primary_target_gene lib_1 pre-selection CCAGAAAACCCACCAA 614 7 TTA18CTC 1 0
4 primary_target_gene lib_1 pre-selection ACGGCCTAAACTCACA 611 3 CGT6GTA GAC22CAC R6V D22H 2 2
... ... ... ... ... ... ... ... ... ... ...
91297 secondary_target_3 lib_2 tight_bottle CAACCTAATTGGGATC 187 10 secondary_target_3 secondary_target_3 0 0
91298 secondary_target_3 lib_2 tight_bottle TGGGCACGATGCAATA 186 10 secondary_target_3 secondary_target_3 0 0
91299 secondary_target_3 lib_2 tight_bottle ACGAGTACCCCACTTC 182 10 secondary_target_3 secondary_target_3 0 0
91300 secondary_target_3 lib_2 tight_bottle TATATCGGCGTGAAGA 176 10 secondary_target_3 secondary_target_3 0 0
91301 secondary_target_3 lib_2 tight_bottle CAGTGGTAATTAAACG 0 10 secondary_target_3 secondary_target_3 0 0

91302 rows × 10 columns

Get the average number of counts per variant:

[42]:
# NBVAL_IGNORE_OUTPUT

variants.avgCountsPerVariant().round(1)
[42]:
target library sample avg_counts_per_variant
0 primary_target_gene lib_1 pre-selection 181.3
1 primary_target_gene lib_1 loose_bottle 178.3
2 primary_target_gene lib_1 tight_bottle 196.6
3 primary_target_gene lib_2 pre-selection 181.4
4 primary_target_gene lib_2 loose_bottle 179.5
5 primary_target_gene lib_2 tight_bottle 197.8
6 primary_target_gene all libraries pre-selection 181.3
7 primary_target_gene all libraries loose_bottle 178.9
8 primary_target_gene all libraries tight_bottle 197.2
9 secondary_target_1 lib_1 pre-selection 166.1
10 secondary_target_1 lib_1 loose_bottle 378.2
11 secondary_target_1 lib_1 tight_bottle 401.7
12 secondary_target_1 lib_2 pre-selection 184.4
13 secondary_target_1 lib_2 loose_bottle 423.2
14 secondary_target_1 lib_2 tight_bottle 444.8
15 secondary_target_1 all libraries pre-selection 177.3
16 secondary_target_1 all libraries loose_bottle 405.8
17 secondary_target_1 all libraries tight_bottle 428.1
18 secondary_target_2 lib_1 pre-selection 173.2
19 secondary_target_2 lib_1 loose_bottle 556.6
20 secondary_target_2 lib_1 tight_bottle 611.9
21 secondary_target_2 lib_2 pre-selection 199.5
22 secondary_target_2 lib_2 loose_bottle 624.0
23 secondary_target_2 lib_2 tight_bottle 634.3
24 secondary_target_2 all libraries pre-selection 184.7
25 secondary_target_2 all libraries loose_bottle 586.0
26 secondary_target_2 all libraries tight_bottle 621.7
27 secondary_target_3 lib_1 pre-selection 190.6
28 secondary_target_3 lib_1 loose_bottle 880.0
29 secondary_target_3 lib_1 tight_bottle 919.1
30 secondary_target_3 lib_2 pre-selection 191.1
31 secondary_target_3 lib_2 loose_bottle 902.7
32 secondary_target_3 lib_2 tight_bottle 1051.8
33 secondary_target_3 all libraries pre-selection 190.8
34 secondary_target_3 all libraries loose_bottle 888.8
35 secondary_target_3 all libraries tight_bottle 970.5

Above the counts are grouped by target. To not group by target:

[43]:
variants.avgCountsPerVariant(by_target=False).round(1)
[43]:
library sample avg_counts_per_variant
0 lib_1 pre-selection 181.2
1 lib_1 loose_bottle 185.2
2 lib_1 tight_bottle 203.9
3 lib_2 pre-selection 181.5
4 lib_2 loose_bottle 185.5
5 lib_2 tight_bottle 204.2
6 all libraries pre-selection 181.4
7 all libraries loose_bottle 185.3
8 all libraries tight_bottle 204.0

We also plot the average counts per variant:

[44]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotAvgCountsPerVariant(heightscale=1.5, orientation="h")
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_84_0.png

Here is the same plot not faceted by target:

[45]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotAvgCountsPerVariant(orientation="h", by_target=False)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_86_0.png

Plot the number of counts for each variant in each sample. The horizontal dashed line shows the total number of variants. The plot shows that all variants are well-sampled in the pre-selection libraries, but that post- selection some variants are sampled more or less. This is expected since selection will decrease and increase the frequency of variants. By default, this plot shows only the primary target:

[46]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotCumulVariantCounts()
_ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_88_0.png

Similar plot showing all targets:

[47]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotCumulVariantCounts(primary_target_only=False)
_ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_90_0.png

Number of counts per variant:

[48]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotCountsPerVariant(libraries=variants.libraries, by_variant_class=True)
_ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_92_0.png

Similar plot only for primary targets:

[49]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotCountsPerVariant(
    libraries=variants.libraries, by_variant_class=True, primary_target_only=True
)

_ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_94_0.png

Distribution of the number of amino-acid mutations per variant in each sample. This is only for the primary target. As expected, mutations go down after selection:

[50]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotNumMutsHistogram(mut_type="aa")
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_96_0.png

Average number of mutations of each type per variant among just single mutants (and wildtype) and among all variants, again only for primary target:

[51]:
# NBVAL_IGNORE_OUTPUT

for variant_type in ["single", "all"]:
    p = variants.plotNumCodonMutsByType(variant_type)
    p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
    _ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_98_0.png
_images/codonvariant_sim_data_multi_targets_98_1.png

Here are the numerical data plotted above (just for the individual libraries and all mutations):

[52]:
# NBVAL_IGNORE_OUTPUT

variants.numCodonMutsByType(variant_type="all", libraries=libs).round(3)
[52]:
library sample mutation_type num_muts_count count number
0 lib_1 pre-selection nonsynonymous 4864651 2719365 1.789
1 lib_1 pre-selection synonymous 275311 2719365 0.101
2 lib_1 pre-selection stop 257132 2719365 0.095
3 lib_1 loose_bottle nonsynonymous 2330462 2674231 0.871
4 lib_1 loose_bottle synonymous 270348 2674231 0.101
5 lib_1 loose_bottle stop 809 2674231 0.000
6 lib_1 tight_bottle nonsynonymous 2574335 2949688 0.873
7 lib_1 tight_bottle synonymous 289539 2949688 0.098
8 lib_1 tight_bottle stop 948 2949688 0.000
9 lib_2 pre-selection nonsynonymous 4872431 2720602 1.791
10 lib_2 pre-selection synonymous 279586 2720602 0.103
11 lib_2 pre-selection stop 249866 2720602 0.092
12 lib_2 loose_bottle nonsynonymous 2433719 2692589 0.904
13 lib_2 loose_bottle synonymous 271738 2692589 0.101
14 lib_2 loose_bottle stop 1253 2692589 0.000
15 lib_2 tight_bottle nonsynonymous 2669151 2966909 0.900
16 lib_2 tight_bottle synonymous 288678 2966909 0.097
17 lib_2 tight_bottle stop 1060 2966909 0.000

Here are mutation frequencies as a function of primary sequence among all variants of the primary target:

[53]:
# NBVAL_IGNORE_OUTPUT

p = variants.plotMutFreqs(variant_type="all", mut_type="codon")
_ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_102_0.png

Plot how thoroughly amino-acid mutations are sampled, doing this separately among all variants and single-mutant variants for the primary target. The plots below show that the stop mutations are sampled very poorly post-selection because they are eliminated during selection:

[54]:
# NBVAL_IGNORE_OUTPUT

for variant_type in ["single", "all"]:
    p = variants.plotCumulMutCoverage(variant_type=variant_type, mut_type="aa")
    _ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_104_0.png
_images/codonvariant_sim_data_multi_targets_104_1.png

Functional scores for variants

The CodonVariantTable.func_scores method calculates a functional score for each variant based on its change in frequency from pre- to post-selection.

To calculate these scores, we need to pair each post-selection sample with a pre-selection one. In this case, the pre-selection sample is named ‘pre-selection’ for all post-selection samples:

[55]:
func_scores = variants.func_scores("pre-selection", libraries=libs)

The resulting data frame has a functional score (and its variance) for each barcoded variant:

[56]:
# NBVAL_IGNORE_OUTPUT

func_scores.round(3)
[56]:
target library pre_sample post_sample barcode func_score func_score_var pre_count post_count pre_count_wt post_count_wt pseudocount codon_substitutions n_codon_substitutions aa_substitutions n_aa_substitutions
0 primary_target_gene lib_1 pre-selection loose_bottle AACCCGTTCACCACCA -1.060 0.005 692 847 381697 973820 0.5 ATC17CGA CGG29TGC 2 I17R R29C 2
1 primary_target_gene lib_1 pre-selection loose_bottle ACTACGGACGATTATT -3.845 0.022 633 112 381697 973820 0.5 GCT7TCT AAA14GAT AAC20CGA 3 A7S K14D N20R 3
2 primary_target_gene lib_1 pre-selection loose_bottle GGACTATCTCAGTATG -0.188 0.005 615 1378 381697 973820 0.5 0 0
3 primary_target_gene lib_1 pre-selection loose_bottle CCAGAAAACCCACCAA -0.052 0.005 614 1512 381697 973820 0.5 TTA18CTC 1 0
4 primary_target_gene lib_1 pre-selection loose_bottle ACGGCCTAAACTCACA -3.228 0.016 611 166 381697 973820 0.5 CGT6GTA GAC22CAC 2 R6V D22H 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
60863 secondary_target_3 lib_2 pre-selection tight_bottle TGTGCTTAGAGCACCG 0.953 0.033 74 412 359334 1028011 0.5 secondary_target_3 0 secondary_target_3 0
60864 secondary_target_3 lib_2 pre-selection tight_bottle CCTGGATTTCTGATTT 1.546 0.034 69 580 359334 1028011 0.5 secondary_target_3 0 secondary_target_3 0
60865 secondary_target_3 lib_2 pre-selection tight_bottle CAGTGGTAATTAAACG -8.614 4.193 68 0 359334 1028011 0.5 secondary_target_3 0 secondary_target_3 0
60866 secondary_target_3 lib_2 pre-selection tight_bottle CTTGTTGTATACTAGC 2.630 0.042 52 929 359334 1028011 0.5 secondary_target_3 0 secondary_target_3 0
60867 secondary_target_3 lib_2 pre-selection tight_bottle ACTGACCATGATAGTC 1.247 0.084 28 193 359334 1028011 0.5 secondary_target_3 0 secondary_target_3 0

60868 rows × 16 columns

We can also calculate functional scores at the level of amino-acid or codon substitutions rather than at the level of variants. This calculation groups all variants with the same substitutions before calculating the functional score. Here are scores grouping by amino-acid substitutions; we also set syn_as_wt=True to include variants with only synonymous mutations in the counts of wild type in this case:

[57]:
# NBVAL_IGNORE_OUTPUT

aa_func_scores = variants.func_scores(
    "pre-selection", by="aa_substitutions", syn_as_wt=True, libraries=libs
)
aa_func_scores.round(3)
[57]:
target library pre_sample post_sample aa_substitutions func_score func_score_var pre_count post_count pre_count_wt post_count_wt pseudocount n_aa_substitutions
0 primary_target_gene lib_1 pre-selection loose_bottle I17R R29C -1.062 0.005 692 847 418909 1070114 0.5 2
1 primary_target_gene lib_1 pre-selection loose_bottle A7S K14D N20R -3.846 0.022 633 112 418909 1070114 0.5 3
2 primary_target_gene lib_1 pre-selection loose_bottle 0.000 0.000 418909 1070114 418909 1070114 0.5 0
3 primary_target_gene lib_1 pre-selection loose_bottle R6V D22H -3.230 0.016 611 166 418909 1070114 0.5 2
4 primary_target_gene lib_1 pre-selection loose_bottle R1A 0.119 0.002 1863 5168 418909 1070114 0.5 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
35995 secondary_target_1 lib_2 pre-selection tight_bottle secondary_target_1 -0.237 0.000 15119 36473 404611 1150576 0.5 0
35996 secondary_target_2 lib_1 pre-selection tight_bottle secondary_target_2 0.339 0.000 15418 54461 418909 1169889 0.5 0
35997 secondary_target_2 lib_2 pre-selection tight_bottle secondary_target_2 0.161 0.000 13766 43769 404611 1150576 0.5 0
35998 secondary_target_3 lib_1 pre-selection tight_bottle secondary_target_3 0.788 0.000 16580 79965 418909 1169889 0.5 0
35999 secondary_target_3 lib_2 pre-selection tight_bottle secondary_target_3 0.952 0.000 10513 57849 404611 1150576 0.5 0

36000 rows × 13 columns

We can plot the distribution of the functional scores for variants. These plots are most informative if we classify variants by the “types” of mutations they have, which we do here using the CodonVariantTable.classifyVariants method, which adds a column called variant_class to the data frame. Note that we have to specify the primary target and how to classify non-primary targets:

[58]:
func_scores = dms_variants.codonvarianttable.CodonVariantTable.classifyVariants(
    func_scores,
    primary_target=variants.primary_target,
    non_primary_target_class="secondary target",
)

(func_scores[["target", "codon_substitutions", "aa_substitutions", "variant_class"]])
[58]:
target codon_substitutions aa_substitutions variant_class
0 primary_target_gene ATC17CGA CGG29TGC I17R R29C >1 nonsynonymous
1 primary_target_gene GCT7TCT AAA14GAT AAC20CGA A7S K14D N20R >1 nonsynonymous
2 primary_target_gene wildtype
3 primary_target_gene TTA18CTC synonymous
4 primary_target_gene CGT6GTA GAC22CAC R6V D22H >1 nonsynonymous
... ... ... ... ...
60863 secondary_target_3 secondary_target_3 secondary_target_3 secondary target
60864 secondary_target_3 secondary_target_3 secondary_target_3 secondary target
60865 secondary_target_3 secondary_target_3 secondary_target_3 secondary target
60866 secondary_target_3 secondary_target_3 secondary_target_3 secondary target
60867 secondary_target_3 secondary_target_3 secondary_target_3 secondary target

60868 rows × 4 columns

Now we use plotnine to plot the distributions of scores in ggplot2-like syntax, coloring by the variant class. This plot shows the expected behavior for different variant classes; for instance, stop codon variants tend to have low scores and synonymous variants tend to have wildtype-like (near 0) scores. As expected, there is more noise with a tighter bottleneck:

[59]:
# NBVAL_IGNORE_OUTPUT

p = (
    ggplot(func_scores, aes("variant_class", "func_score"))
    + geom_violin(aes(fill="variant_class"))
    + ylab("functional score")
    + xlab("")
    + facet_grid("post_sample ~ library")
    + theme(
        figure_size=(2.75 * len(libs), 2 * len(bottlenecks)),
        axis_text_x=element_text(angle=90),
        panel_grid_major_x=element_blank(),  # no vertical grid lines
    )
    + scale_fill_manual(values=CBPALETTE[1:])
)

_ = p.draw(show=True)
_images/codonvariant_sim_data_multi_targets_114_0.png
[ ]: