dms_inferprefs

Overview

dms_inferprefs is a program included with the dms_tools package. It infers the site-specific preferences \(\pi_{r,x}\) for each character \(x\) (an amino acid, codon, or nucleotide) at each site \(r\). The inference is done by MCMC using PyStan.

For a detailed description of the algorithm implemented by this program, see Algorithm to infer site-specific preferences.

After you install dms_tools, this program will be available to run at the command line.

Command-line usage

Infer site-specific preferences for amino acids, nucleotides, or codons. This script is part of dms_tools (version 1.1.20) written by the Bloom Lab (see https://github.com/jbloomlab/dms_tools/graphs/contributors for all contributors). Detailed documentation is at http://jbloomlab.github.io/dms_tools/

usage: dms_inferprefs [-h] [--chartype {codon_to_aa,DNA,codon,aa}]
                      [--errpre ERRPRE] [--errpost ERRPOST] [--excludestop]
                      [--pi_alpha PI_ALPHA] [--mu_alpha MU_ALPHA]
                      [--err_alpha ERR_ALPHA] [--logfile LOGFILE]
                      [--ncpus NCPUS] [--seed SEED]
                      [--sites SITES [SITES ...]]
                      [--ratio_estimation MINCOUNTS] [-v]
                      n_pre n_post outfile

Positional Arguments

n_pre

Existing file with pre-selection counts from deep sequencing. For nucleotides, header line should be “# POSITION WT A C G T” and then subsequent lines give wildtype and counts for each site (i.e. “1 G 1013 23 19 47”); for codons just use the 64 codons instead (i.e. “AAA AAC AAG …”).

This file gives the values described in inferprefs_algorithm.

See dms_counts for full specification of the format for this file.

n_post

Existing file with post-selection counts from deep sequencing. Same format as “n_pre”.

This file gives the values described in inferprefs_algorithm.

See dms_counts for full specification of the format for this file.

outfile

Created output file with site-specific preferences; overwritten if it already exists.

In addition to the inferred preferences , this file also gives the site entropies calculated from these preferences and the median-centered 95% credible intervals for each preference.

See preferences_file for full specification of the format for this file.

Named Arguments

--chartype

Possible choices: codon_to_aa, DNA, codon, aa

Characters for which preferences are inferred: “codon_to_aa” = counts for codons and inferred preferences for amino acids; “DNA” = counts and inferred preference for DNA; “codon” = counts and inferred preferences for codons; “aa” = counts and inferred preferences for amino acids (possibly including stop codons, see “–excludestop”).

Default: “codon_to_aa”

Note that the default value is “codon_to_aa”.

A value of “codon_to_aa” corresponds to using codon characters for the deep sequencing data, but inferring preferences for amino acids, as described in chartype_codon_to_aa This would be the sensible approach if you assume that selection is operating on the amino-acid level. Note that the priors over the mutagenesis rates assume that all codon mutations are made at equal frequencies (NNN libraries).

A value of “DNA” corresponds to using DNA characters for both the deep sequencing data and inferred preferences, as described in chartype_DNA

A value of “codon” corresponds to using codon characters for both the deep sequencing data and inferred preferences, as described in chartype_codon You would prefer this option over “codon_to_aa” if you thought that there was different selection on different synonymous mutations. Note that the priors over the mutagenesis rates assume that all codon mutations are made at equal frequencies (NNN libraries).

A value of “aa” corresponds to using amino-acid characters for both the deep sequence data and inferred preferences, as described in chartype_aa. You should only use this option if your data make it impossible to use “codon_to_aa”. Note that the priors for this option assume that all amino-acid mutations are made at equal frequencies.

--errpre

Control used to estimate errors in counts from “n_pre”. If “none” then the counts in “n_pre” are assumed to have no errors; otherwise this should be a counts file with the same format as “n_pre” giving the counts for sequencing unmutated gene. Currently if this option is not “none” then –errpost also cannot be “none”.

Default: none

Setting this option to “none” corresponds to setting to zero the error rates described in inferprefs_algorithm.

If you set this option to the name of a file, that file is assumed to give the counts for sequencing the pre-selection error control () and is used to estimate the error rate described in inferprefs_algorithm. This file should have the format specified by dms_counts.

--errpost

Control used to estimate errors in counts from “n_post”. If “none” then the counts in “n_post” are assumed to have no errors; otherwise this should be a counts file with the same format as “n_pre” giving the counts for sequencing unmutated gene. If you have the same error control for “n_pre” and “n_post”, then set that file name for both this option and –errpre. Currently if this option is not “none” then –errpre also cannot be “none”.

Default: none

Setting this option to “none” corresponds to setting to zero the error rates described in inferprefs_algorithm.

If you set this option to the name of a file, that file is assumed to give the counts for sequencing the post-selection error control () and is used to estimate the error rate described in inferprefs_algorithm. This file should have the format specified by dms_counts.

--excludestop

Exclude stop codons as a possible amino acid if using “–chartype” of “codon_to_aa” or “aa”. Stop codons are only excluded if this option is specified; otherwise they are included, and are required to be specified if “–chartype” is “aa”.

Default: False

By default, when using –chartype codon_to_aa or –chartype aa, we infer preferences for 21 amino acids, with stop codons (denoted by *) one of the possibilities. If you specify the –excludestop option, then we constrain the preference for stop codons to be zero regardless of whether or not there are counts for these codons in the data, and so only infer preferences for the 20 non-stop amino acids.

--pi_alpha

Concentration parameter for Dirichlet prior over preferences (pi).

Default: 1.0

This is the parameter described in inferprefs_algorithm.

Smaller values favor distributions of preferences that are more unequal among the possible characters at each site; larger values favor distributions of preferences more equal among the characters at each site.

--mu_alpha

Concentration parameter for Dirichlet prior over mutagenesis rate (mu).

Default: 1.0

This is the parameter described in inferprefs_algorithm.

--err_alpha

Concentration parameter for Dirichlet priors over error rates (epsilon, rho).

Default: 1.0

This gives the parameters and described in inferprefs_algorithm; currently these two parameters must be equal.

--logfile

Log progress to this file; overwritten if it already exists.

Default: “Base name of “outfile” with extension “.log”“

--ncpus

Number of CPUs to use; set to -1 to use all available CPUs.

Default: 1

--seed

Random number seed.

Default: 1

--sites Only perform the inference for the specified sites, which should be a space separated list such as “–sites 1 2 10”. All of these sites must have data in the counts files.
--ratio_estimation
 

Rather than use MCMC to estimate the preferences using a statistical model, we simply compute them by calculating enrichment ratios relative to wildtype post- and pre-selection, and then normalizing these ratios to sum to one at each site. The single argument for this option is MINCOUNTS, which is a number > 0. If the counts for a character are less than MINCOUNTS, either due to low counts or error correction, then the counts for that character are changed to MINCOUNTS to avoid estimating ratios of zero, less than zero, or infinity.

A suggested value for mincounts is one. In general, larger values of mincounts conceptually corresponds to a stronger prior favoring all preferences being equal.

Preferences are computed using ratio estimation as follows. For each site, we set the enrichment ratio of the wildtype character equal to one

Next, for each non-wildtype character , we calculate the enrichment ratio relative to as

where is the value of mincounts. When error_model is none, then all terms involving the error corrections (with superscript err) are ignored and is set to zero; otherwise is one.

Next, for each character , including , we calculate the preference for as

-v, --version show program’s version number and exit

Output

The inferred preferences are in the Preferences file specified by outfile. A summary of the progress is in the file specified by logfile. In addition, some information will be written to standard output. Do not worry about Informational Messages.

Because standard output will include lots of informational messages that obscure what is happening, it is most informative to track the progress by watching logfile.

Do not worry about Informational Messages

When you run dms_inferprefs, you may get a lot of Informational Message output on the screen. These messages look like this:

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
stan::prob::dirichlet_log(N4stan5agrad3varE): prior sample sizes[1]  is 0:0, but must be > 0!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

You do not need to worry about these messages. They are automatically produced by PyStan. Although it seems like there are a lot of these messages, running dms_inferprefs on a 500-residue gene will invoke at least \(4 \times 500 = 2000\) different MCMC chains, all of which may sporadically produce this warning.

dms_inferprefs automatically tests for convergence and described in Runtime and convergence and Inferring the preferences by MCMC. If the program actually fails to converge for any site, you will not get an outfile and the program will terminate with an error message.

Runtime and convergence

Because dms_inferprefs uses MCMC to infer the preferences as described in Algorithm to infer site-specific preferences, it may take the program a while to run. The program will run fastest if you use multiple CPUs (such as by using the option --ncpus -1 to use all available CPUs). It should take anywhere from a few minutes to a few hours to complete on a multi-processor machine, depending on the size of the protein, which characters are being used (nucleotides are fastest, codons are slowest), and whether error controls are being included in the analysis (via --errpre and --errpost).

The program automatically checks for MCMC convergence using the criteria described in Inferring the preferences by MCMC. The program will terminate with an error if it fails to converge; otherwise it converged properly.

If the program is taking too long, you can also just use the --ratio_estimation option to speed things up dramatically. This method is somewhat less accurate when the sequencing counts are fairly low (see Software for the analysis and visualization of deep mutational scanning data (Bloom, 2015)), however it still should give fairly good results.

Examples

Imagine we have data from codon-level mutagenesis of an influenza gene followed by functional sequencing and deep sequencing. We want to determine amino acid preferences. Let’s say the pre-selection mutant library codon counts are in a file called mutDNA_codoncounts.txt and the post-selection codon counts are in a file called mutvirus_codoncounts.txt. If we are not trying to account for error rates, we would use the command:

dms_inferprefs mutDNA_codoncounts.txt mutvirus_codoncounts.txt amino_acid_preferences_no_errors.txt

The inferred preferences would then be in the created file amino_acid_preferences_no_errors.txt. The log of the output would be in the file amino_acid_preferences_no_errors.log.

The command above would include preferences for 21 different possible amino acids, as stop codons are considered a possible amino acid by default. If you do not want stop codons to be considered a possible amino acid (essentially constraining the preference for stop codons to zero), then use:

dms_inferprefs --excludestop mutDNA_codoncounts.txt mutvirus_codoncounts.txt amino_acid_preferences_no_errors.txt

If we instead wanted to account for error rates, and we had sequenced unmutated plasmid to generate the codon counts file DNA_codoncounts.txt and had sequenced unmutated virus to generate the codon counts file virus_codoncounts.txt, we would use the command:

dms_inferprefs --errpre DNA_codoncounts.txt --errpost virus_codoncounts.txt mutDNA_codoncounts.txt mutvirus_codoncounts.txt amino_acid_preferences_corrected_for_errors.txt

If we ran that command and it was taking too long, we might want to use more CPUs if we have multi-core processors. If we want to use 12 CPUs, we would run:

dms_inferprefs --ncpus 12 --errpre DNA_codoncounts.txt --errpost virus_codoncounts.txt mutDNA_codoncounts.txt mutvirus_codoncounts.txt amino_acid_preferences_no_errors.txt

MCMC is a stochastic algorithm. If we wanted to re-run it with a different random number seed to ensure we get the same results, we could use a different random number seed:

dms_inferprefs --seed 2 --ncpus 12 --errpre DNA_codoncounts.txt --errpost virus_codoncounts.txt mutDNA_codoncounts.txt mutvirus_codoncounts.txt amino_acid_preferences_no_errors.txt