Variants with deletion mutations¶

In some cases, you might want to include variants with in-frame codon-length deletions. There is an option to do this using --- as the codon gap character, and - as the resulting amino-acid gap character. Note that only codon-length in-frame deletions are allowed.

This notebook illustrates working with such variants.

Setup for analysis¶

Import Python modules / packages:

[1]:

# NBVAL_IGNORE_OUTPUT

import random
import warnings

import pandas as pd

from plotnine import *

import dms_variants.codonvarianttable
import dms_variants.plotnine_themes
import dms_variants.simulate
from dms_variants.constants import CBPALETTE, CODONS_WITHGAP, CODONS_NOSTOP

Set parameters that define simulated data:

[2]:

seed = 1  # random number seed
genelength = 30  # gene length in codons
libs = ["lib_1", "lib_2"]  # distinct libraries of gene
variants_per_lib = 500 * genelength  # variants per library
avgmuts = 2.0  # average codon mutations per variant
bclen = 16  # length of nucleotide barcode for each variant
variant_error_rate = 0.005  # rate at which variant sequence mis-called

Seed random number generator for reproducible output:

[3]:

random.seed(seed)

Set pandas display options to show large chunks of Data Frames in this example:

[4]:

pd.set_option("display.max_columns", 20)
pd.set_option("display.width", 500)

Hide warnings that clutter output:

[5]:

warnings.simplefilter("ignore")

Set the gray grid plotnine theme defined in dms_variants.plotnine_themes as this gives a nice appearance for the plots:

[6]:

theme_set(dms_variants.plotnine_themes.theme_graygrid())

Simulate library of variants¶

Simulate wildtype gene sequence:

[7]:

geneseq = "".join(random.choices(CODONS_NOSTOP, k=genelength))
print(f"Wildtype gene of {genelength} codons:\n{geneseq}")

Wildtype gene of 30 codons:
AGATCCGTGATTCTGCGTGCTTACACCAACTCACGGGTGAAACGTGTAATCTTATGCAACAACGACTTACCTATCCGCAACATCCGGCTG

Generate a CodonVariantTable using simulate_CodonVariantTable function:

[8]:

variants = dms_variants.simulate.simulate_CodonVariantTable(
    geneseq=geneseq,
    bclen=bclen,
    library_specs={
        lib: {"avgmuts": avgmuts, "nvariants": variants_per_lib} for lib in libs
    },
    seed=seed,
    allowgaps=True,
)

We can get basic information about the CodonVariantTable, such as the sites, wildtype codons, and wildtype amino acids. Below we do this for the first few sites:

[9]:

variants.sites[:5]

[9]:

[1, 2, 3, 4, 5]

[10]:

list(variants.codons[r] for r in variants.sites[:5])

[10]:

['AGA', 'TCC', 'GTG', 'ATT', 'CTG']

[11]:

list(variants.aas[r] for r in variants.sites[:5])

[11]:

['R', 'S', 'V', 'I', 'L']

The different libraries in the table:

[12]:

variants.libraries

[12]:

['lib_1', 'lib_2']

Here is the data frame that contains all of the information on the barcoded variants (just looking at the first few lines)

[13]:

variants.barcode_variant_df.head(n=6)

[13]:

	library	barcode	variant_call_support	codon_substitutions	aa_substitutions	n_codon_substitutions	n_aa_substitutions
0	lib_1	AAAAAACTAAAAGTCG	2	CGT15AGC ATC25GAA	R15S I25E	2	2
1	lib_1	AAAAACAATGCCGCAT	1	ACC9TCG TTA18TAC	T9S L18Y	2	2
2	lib_1	AAAAACAGACCTACAA	1	TCC2CCG TAC8TTC	S2P Y8F	2	2
3	lib_1	AAAAACCGTGGTTCTT	1	AGA1GAC CCT24GGC	R1D P24G	2	2
4	lib_1	AAAAACTCTATCTCAT	1	CTG5GTA GCT7TGG TCA11ACG CGG29CTG	L5V A7W S11T R29L	4	4
5	lib_1	AAAAACTTGAGCATGA	6	GTG3TCT GCT7TGG TAC8AAT TGC19TGA GAC22AGG AAC2...	V3S A7W Y8N C19* D22R N27D	6	6

Analyze library composition¶

A CodonVariantTable has methods for summarizing and plotting properties of the barcoded variants. These methods can be used to analyze the barcoded variants themselves, or to analyze the frequency (or counts) of variants in the library in samples that have undergone some type of selection.

We have not yet added counts of the variants in any specific samples, so we just analyze the composition of the variant library itself. This is done by setting samples=None in the method calls below.

Number of variants in each library:

[14]:

variants.n_variants_df(samples=None)

[14]:

	library	sample	count
0	lib_1	barcoded variants	15000
1	lib_2	barcoded variants	15000
2	all libraries	barcoded variants	30000

Plot distribution of variant call supports, grouping together all variants with support \(\ge 8\):

[15]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotVariantSupportHistogram(max_support=10)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)

_images/codonvariant_sim_data_w_gaps_27_0.png

The set of valid barcodes for each library:

[16]:

for lib in libs:
    print(f"First few barcodes for library {lib}:")
    print(sorted(variants.valid_barcodes(lib))[:3])

First few barcodes for library lib_1:
['AAAAAACTAAAAGTCG', 'AAAAACAATGCCGCAT', 'AAAAACAGACCTACAA']
First few barcodes for library lib_2:
['AAAAAAACCCAGACTT', 'AAAAAACATCCGATCC', 'AAAAAACCTTCTGCCG']

Plot the number of amino-acid mutations per variant:

[17]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotNumMutsHistogram("aa", samples=None)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)

_images/codonvariant_sim_data_w_gaps_31_0.png

Number of codon mutations per variant:

[18]:

# NBVAL_IGNORE_OUTPUT

p = variants.plotNumMutsHistogram("codon", samples=None)
p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
_ = p.draw(show=True)

_images/codonvariant_sim_data_w_gaps_33_0.png

Average number of codon mutations per variant of each type of mutation. We make these plots for: 1. Just single-mutant and wildtype variants 2. For all variants

[19]:

# NBVAL_IGNORE_OUTPUT

for mut_type in ["single", "all"]:
    p = variants.plotNumCodonMutsByType(mut_type, samples=None, heightscale=1.5)
    p = p + theme(panel_grid_major_x=element_blank())  # no vertical grid lines
    _ = p.draw(show=True)

_images/codonvariant_sim_data_w_gaps_35_0.png

_images/codonvariant_sim_data_w_gaps_35_1.png

Here are the numerical data in the plots above:

[20]:

variants.numCodonMutsByType("single", samples=None).round(3)

[20]:

	library	sample	mutation_type	num_muts_count	count	number
0	lib_1	barcoded variants	nonsynonymous	3611	6112	0.591
1	lib_1	barcoded variants	synonymous	194	6112	0.032
2	lib_1	barcoded variants	stop	215	6112	0.035
3	lib_1	barcoded variants	deletion	56	6112	0.009
4	lib_2	barcoded variants	nonsynonymous	3746	6114	0.613
5	lib_2	barcoded variants	synonymous	188	6114	0.031
6	lib_2	barcoded variants	stop	196	6114	0.032
7	lib_2	barcoded variants	deletion	69	6114	0.011
8	all libraries	barcoded variants	nonsynonymous	7357	12226	0.602
9	all libraries	barcoded variants	synonymous	382	12226	0.031
10	all libraries	barcoded variants	stop	411	12226	0.034
11	all libraries	barcoded variants	deletion	125	12226	0.010

[21]:

variants.numCodonMutsByType("all", samples=None).round(3)

[21]:

	library	sample	mutation_type	num_muts_count	count	number
0	lib_1	barcoded variants	nonsynonymous	26917	15000	1.794
1	lib_1	barcoded variants	synonymous	1476	15000	0.098
2	lib_1	barcoded variants	stop	1457	15000	0.097
3	lib_1	barcoded variants	deletion	445	15000	0.030
4	lib_2	barcoded variants	nonsynonymous	26919	15000	1.795
5	lib_2	barcoded variants	synonymous	1485	15000	0.099
6	lib_2	barcoded variants	stop	1425	15000	0.095
7	lib_2	barcoded variants	deletion	458	15000	0.031
8	all libraries	barcoded variants	nonsynonymous	53836	30000	1.795
9	all libraries	barcoded variants	synonymous	2961	30000	0.099
10	all libraries	barcoded variants	stop	2882	30000	0.096
11	all libraries	barcoded variants	deletion	903	30000	0.030

Examine how well amino-acid mutations are sampled in the library by looking at the fraction of mutations seen <= some number of times. Here we do that for amino- acid mutations, making separate plots for single mutants and all mutants:

[22]:

# NBVAL_IGNORE_OUTPUT

for variant_type in ["single", "all"]:
    p = variants.plotCumulMutCoverage(variant_type, mut_type="aa", samples=None)
    _ = p.draw(show=True)

_images/codonvariant_sim_data_w_gaps_40_0.png

_images/codonvariant_sim_data_w_gaps_40_1.png

We can also get the numerical information plotted above (here for single mutants only):

[23]:

variants.mutCounts("single", "aa", samples=None).head(n=5)

[23]:

	library	sample	mutation	count	mutation_type	site
0	lib_1	barcoded variants	T9L	24	nonsynonymous	9
1	lib_1	barcoded variants	A7R	21	nonsynonymous	7
2	lib_1	barcoded variants	R1L	21	nonsynonymous	1
3	lib_1	barcoded variants	N10L	20	nonsynonymous	10
4	lib_1	barcoded variants	N27S	20	nonsynonymous	27

Here are the frequencies of mutations along the gene, looking both at single mutants and all mutants:

[24]:

# NBVAL_IGNORE_OUTPUT

for variant_type in ["single", "all"]:
    p = variants.plotMutFreqs(variant_type, "codon", samples=None)
    _ = p.draw(show=True)

_images/codonvariant_sim_data_w_gaps_44_0.png

_images/codonvariant_sim_data_w_gaps_44_1.png

We can also look at mutation frequencies in a heat-map form:

[25]:

p = variants.plotMutHeatmap("all", "aa", samples=None)
_ = p.draw(show=True)

_images/codonvariant_sim_data_w_gaps_46_0.png

Variants with deletion mutations¶

Setup for analysis¶

Simulate library of variants¶

Analyze library composition¶

dms_variants

Navigation

Related Topics