dms_inferdiffprefs

Overview

dms_inferdiffprefs is a program included with the dms_tools package. It infers the differential preference \(\Delta\pi_{r,x}\) for each character \(x\) (an amino acid, codon, or nucleotide) at each site \(r\) in an alternative selection versus a control selection. Values of \(\Delta\pi_{r,x} > 0\) indicate that \(x\) is favored in the alternative selection versus the control selection at site \(r\).

The inference is done by MCMC using PyStan.

For a detailed description of the algorithm implemented by this program, see Algorithm to infer differential preferences.

After you install dms_tools, this program will be available to run at the command line.

Should you use this program?

Note that we are no longer certain if the method implemented in this program is superior to the much simpler dms_diffselection. So consider trying that program as well.

Command-line usage

Infer differential preferences for amino acids, nucleotides, or codons. This script is part of dms_tools (version 1.1.20) written by the Bloom Lab (see https://github.com/jbloomlab/dms_tools/graphs/contributors for all contributors). Detailed documentation is at http://jbloomlab.github.io/dms_tools/

usage: dms_inferdiffprefs [-h] [--chartype {codon_to_aa,DNA,codon,aa}]
                          [--err ERR] [--excludestop]
                          [--alpha_start ALPHA_START]
                          [--alpha_pis1 ALPHA_PIS1] [--alpha_err ALPHA_ERR]
                          [--alpha_deltapi ALPHA_DELTAPI] [--logfile LOGFILE]
                          [--ncpus NCPUS] [--sites SITES [SITES ...]]
                          [--ratio_estimation MINCOUNTS] [--seed SEED] [-v]
                          n_start n_s1 n_s2 outfile

Positional Arguments

n_start

Existing file with counts from deep sequencing of starting library. For nucleotides, header line should be “# POSITION WT A C G T” and then subsequent lines give wildtype and counts for each site (i.e. “1 G 1013 23 19 47”); for codons use the 64 codons instead (i.e. “AAA AAC AAG …”).

This file gives the values described in inferdiffprefs_algorithm.

See dms_counts for full specification of the format for this file.

n_s1

Existing file with counts from deep sequencing library subject to selection 1. Typically this is the control selection; differential preferences are for changes in selection 2 relative to selection 1. File has same format as “n_start”.

This file gives the values described in inferdiffprefs_algorithm. These are for selection , which is taken as the “control” selection.

See dms_counts for full specification of the format for this file.

n_s2

Existing file with counts from deep sequencing library subject to selection 2. Typically this is the alternate selection; differential preferences are for changes in selection 2 relative to selection 1. File has same format as “n_start”.

This file gives the values described in inferdiffprefs_algorithm. These are for selection , which is taken as the “alternative” selection to the “control” selection .

See dms_counts for full specification of the format for this file.

outfile

Created output file with differential preferences; overwritten if it already exists.

In addition to the inferred differential preferences for each character in selection versus selection , this file also gives the root-mean-square (RMS) differential preference for each site, and the posterior probabilities that these preferences are greater than or less than zero.

Note that the preferences are for relative to , so means that is preferred more at site in selection than selection .

See diffpreferences_file for full specification of the format for this file.

Named Arguments

--chartype

Possible choices: codon_to_aa, DNA, codon, aa

Characters for which differential preferences are inferred: “codon_to_aa” = counts for codons and inferred differential preferences for amino acids; “DNA” = counts and inferred differential preference for DNA; “codon” = counts and inferred differential preferences for codons; “aa” = counts and inferred differential preferences for amino acids (possibly including stop codons, see “–excludestop”).

Default: “codon_to_aa”

Note that the default value is “codon_to_aa”.

A value of “codon_to_aa” corresponds to using codon characters for the deep sequencing data, but inferring differential preferences for amino acids, as described in chartype_codon_to_aa This would be the sensible approach if you assume that selection is operating on the amino-acid level. Note that the priors over the mutagenesis rates assume that all codon mutations are made at equal frequencies (NNN libraries).

A value of “DNA” corresponds to using DNA characters for both the deep sequencing data and inferred preferences, as described in chartype_DNA

A value of “codon” corresponds to using codon characters for both the deep sequencing data and inferred preferences, as described in chartype_codon You would prefer this option over “codon_to_aa” if you thought that there was different selection on different synonymous mutations. Note that the priors over the mutagenesis rates assume that all codon mutations are made at equal frequencies (NNN libraries).

A value of “aa” corresponds to using amino-acid characters for both the deep sequence data and inferred preferences, as described in chartype_aa. You should only use this option if your data make it impossible to use “codon_to_aa”. Note that the priors for this option assume that all amino-acid mutations are made at equal frequencies.

--err

Control used to estimate errors in counts for all libraries (“n_start”, “n_s1”, “n_s2”). If “none” then the counts are assumed to have no errors; otherwise this should be a counts file with the same format as “n_start” giving the counts for sequencing unmutated gene.

Default: none

Setting this option to “none” corresponds to setting to zero the error rates described in inferdiffprefs_algorithm.

If you set this option to the name of a file, that file is assumed to give the counts for sequencing the error control () and is used to estimate the error rate described in inferdiffprefs_algorithm. This file should have the format specified by dms_counts.

--excludestop

Exclude stop codons as a possible amino acid if using “–chartype” of “codon_to_aa” or “aa”. Stop codons are only excluded if this option is specified; otherwise they are included, and are required to be specified if “–chartype” is “aa”.

Default: False

By default, when using –chartype codon_to_aa or –chartype aa, we infer differential preferences for 21 amino acids, with stop codons (denoted by *) one of the possibilities. If you specify the –excludestop option, then we constrain the differential preference for stop codons to be zero regardless of whether or not there are counts for these codons in the data, and so only infer differential preferences for the 20 non-stop amino acids.

--alpha_start

Concentration parameter for Dirichlet prior over starting frequencies.

Default: 1.0

This is the parameter described in inferdiffprefs_algorithm.

Smaller values favor distributions of starting frequencies that are more unequal among the possible characters at each site; larger values favor distributions of frequencies more equal among the characters at each site.

--alpha_pis1

Concentration parameter for Dirichlet prior over preferences for selection 1.

Default: 1.0

This is the parameter described in inferdiffprefs_algorithm.

Smaller values favor distributions of preferences that are more unequal among the possible characters at each site; larger values favor distributions of preferences more equal among the characters at each site.

--alpha_err

Concentration parameter for Dirichlet priors over error rate.

Default: 1.0

This gives the parameter described in inferdiffprefs_algorithm.

Smaller values favor distributions of error rates that are more unequal among the possible characters at each site; larger values favor distributions of errors more equal among the characters at each site.

--alpha_deltapi
 

Concentration parameter for Dirichlet priors over differential preferences. Larger values correspond to stronger expectation of differential preferences of zero.

Default: 2.0

This gives the parameter described in inferdiffprefs_algorithm.

Larger values correspond to a stronger prior expectation that the differential preference is zero. You might therefore consider using a value in order to enforce a prior expectation that there is no difference between the selections.

--logfile

Log progress to this file; overwritten if it already exists.

Default: “Base name of “outfile” with extension “.log”“

--ncpus

Number of CPUs to use; set to -1 to use all available CPUs.

Default: 1

--sites Only perform the inference for the specified sites, which should be a space separated list such as “–sites 1 2 10”. All of these sites must have data in the counts files.
--ratio_estimation
 

Rather than use MCMC to estimate the differential preferences using a statistical model, we simply compute them by calculating enrichment ratios relative to wildtype for both selections, then normalizing these ratios to sum to one at each site, and finally taking the difference. The single argument for this option is MINCOUNTS, which is a number > 0. If the counts for a character are less than MINCOUNTS, either due to low counts or error correction, then the counts for that character are changed to MINCOUNTS to avoid estimating ratios of zero, less than zero, or infinity.

A suggested value for mincounts is one. In general, larger values of mincounts conceptually corresponds to a stronger prior favoring all differential preferences being zero.

Differential preferences are computed using ratio estimation as follows. For each site, we set the enrichment ratio of the wildtype character equal to one. We do this for each of the two selections and :

Next, for each non-wildtype character , we calculate the enrichment ratio relative to for and as:

and

where is the value of mincounts. When error_model is none, then all terms involving the error corrections (with superscript err) are ignored and is set to zero; otherwise is one.

Next, for each character , including , we compute the preferences as normalized enrichment ratios for both and as:

and

and finally calculate the differential preference as

--seed

Random number seed.

Default: 1

-v, --version show program’s version number and exit

Output

The inferred differential preferences are in the Differential preferences file specified by outfile. A summary of the progress is in the file specified by logfile. In addition, some information will be written to standard output. Do not worry about Informational Messages.

Becase standard output will include lots of informational messages that obscure what is happening, it will be more informative to track the progress by watching logfile.

Do not worry about Informational Messages

When you run dms_inferdiffprefs, you may get a lot of Informational Message output on the screen. These messages look like this:

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
stan::prob::dirichlet_log(N4stan5agrad3varE): prior sample sizes[1]  is 0:0, but must be > 0!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

You do not need to worry about these messages. They are automatically produced by PyStan. Although it seems like there are a lot of these messages, running dms_inferdiffprefs on a 500-residue gene will invoke at least \(4 \times 500 = 2000\) different MCMC chains, all of which may sporadically produce this warning.

dms_inferdiffprefs automatically tests for convergence and described in Runtime and convergence and Inferring the preferences by MCMC. If the program actually fails to converge for any site, you will not get an outfile and the program will terminate with an error message.

Runtime and convergence

Because dms_inferdiffprefs uses MCMC to infer the preferences as described in Algorithm to infer differential preferences, it may take the program a while to run. The program will run fastest if you use multiple CPUs (such as by using the option --ncpus -1 to use all available CPUs). It should take anywhere from a few minutes to a few hours to complete on a multi-processor machine, depending on the size of the protein, which characters are being used (nucleotides are fastest, codons are slowest), and whether error controls are being included in the analysis (via --err).

The program automatically checks for MCMC convergence using the criteria described in Inferring the preferences by MCMC. The program will terminate with an error if it fails to converge; otherwise it converged properly.

You can make the program run dramatically faster by using the --ratio_estimation option. This option forgoes the Bayesian estimation and regularizing prior over the differential preferences, but will still be reasonably accurate if you have a fairly large number of counts.

Examples

Imagine we have created a library of codon mutants of an influenza gene and selected for functional variants. We deep sequence this starting pool of functional variants and put the codon counts in a file called mutvirus_codoncounts.txt. We subject the starting pool of functional mutant variants to immune selection by passaging in the presence of antibody, and also do a control passage in the absence of immune selection. The codon counts from deep sequencing these pools are in the files immuneselected_codoncounts.txt and controlselected_codoncounts.txt, respectively. We also do control sequencing of unmutated virus to quantify the error rates, and put the codon counts in the file unmutatedcontrol_codoncounts.txt. We could then compute the differential preference \(\Delta\pi_{r,a}\) for each amino acid \(a\) at each site \(r\) in the immune selected viruses versus the controls using:

dms_inferdiffprefs --err unmutatedcontrol_codoncounts.txt mutvirus_codoncounts.txt controlselected_codoncounts.txt immuneselected_codoncounts.txt differentialpreferences.txt

The differential preferences are then in the created file called differentialpreferences.txt, with \(\Delta\pi_{r,a} > 0\) indicating that amino acid \(a\) is preferentially favored at \(r\) in the immune selected versus the control selected sample. A log file summarizing output will be written to differentialpreferences.log.