the phydms program

Overview

phydms can perform phylogenetic analyses with Experimentally Informed Codon Models as well as with some standard non-site-specific substitution models (variants of the YNGKP models described in Yang, Nielsen, Goldman, and Krabbe Pederson, Genetics, 155:431-449). In addition, phydms can be used to detect site-specific diversifying or differential selection using the Experimentally Informed Codon Models as described in Bloom, Biology Direct, 12:1.

See below for information on the Command-line usage and Output files.

Command-line usage

Phylogenetic analysis informed by deep mutational scanning data. The ‘phydms’ package is written by the Bloom lab (see https://github.com/jbloomlab/phydms/contributors). Version 2.4.0. Full documentation at http://jbloomlab.github.io/phydms

usage: phydms [-h] [--brlen {scale,optimize}] [--gammaomega] [--gammabeta]
              [--omegabysite] [--omegabysite_fixsyn] [--diffprefsbysite]
              [--diffprefsprior DIFFPREFSPRIOR] [--fitphi] [--randprefs]
              [--avgprefs] [--divpressure DIVPRESSURE] [--ncpus NCPUS]
              [--fitprefsmethod {1,2}] [--ncats NCATS] [--minbrlen MINBRLEN]
              [--minpref MINPREF] [--seed SEED] [--initparams INITPARAMS]
              [--profile] [--opt_details] [--nograd] [-v]
              alignment tree model outprefix

Positional Arguments

alignment

Existing FASTA file with aligned codon sequences.

Should contain aligned DNA codon sequences. Stop codons are not allowed, except if a stop codon is the terminal character in all sequences, in which case they are automatically trimmed.

No checking is done to remove identical sequences, so you may want to remove such sequences yourself if there a lot of them.

The headers must be unique strings that do not contain any of the following: spaces, commas, colons, semicolons, parentheses, square brackets, and single or double quotes.

tree

Existing Newick file giving input tree.

Typically you infer this tree using a program such as RAxML or codonPhyML. The topology is fixed to that in tree. How the branch lengths are handled during optimization depends on --brlen. The branch lengths in tree are assumed to be in codon substitutions per site.

model

Substitution model: ExpCM_<prefsfile> or YNGKP_<m> (where <m> is M0, M5). For ExpCM, <prefsfile> has first column labeled ‘site’ and others labeled by 1-letter amino-acid code.

YNGKP_<m> is a non-site-specific Goldman-Yang style codon model from Yang, Nielsen, Goldman, and Krabbe Pederson, Genetics, 155:431-449. The <m> indicates the model variant. Codon frequencies are set empirically using the CF3X4 method described by Pond et al, PLoS One, 5:e11230 (this sets the codon frequencies to the the product of nucleotide frequency parameters at each of the three positions after correcting for stop codons). The transition-transversion ratio \(\kappa\) is optimized by maximum likelihood. The nonsynonymous-synonymous ratio \(\omega\) is optimized differently depending on the variant:

-YNGKP_M0 : a single \(\omega\) is optimized by maximum likelihood.

-YNGKP_M5 : \(\omega\) drawn from a gamma distribution with the two parameters optimized by maximum likelihood. The distribution has --ncats categories.

ExpCM_<prefsfile> is an Experimentally Informed Codon Models with amino-acid preferences taken from the file prefsfile. The preferences file must specify a preference for the amino acid encoded by every site in alignment, using sequential 1, 2, … numbering. Any stop-codon preferences specified in the file are ignored. The preferences file must be a comma-, tab-, or space-delimited file with the first column giving the site and the rest giving the preference for each amino-acid (see detailed example below). Importantly, the amino-acid preferences must be obtained independently from the sequence alignment being analyzed. A deep mutational scanning experiment is an independent means of obtaining the preferences, but estimating them from the amino-acid frequencies in the alignment of homologs is not a valid approach as you are then estimating the preferences from the same sequences that you are subsequently analyzing.

outprefix

Output file prefix.

If this prefix contains a directory name, that directory is created if it does not already exist.

See Output files for a description of the files created with this prefix. Any existing files with these names are deleted at the start of the program’s execution.

Named Arguments

--brlen

Possible choices: scale, optimize

How to handle branch lengths: scale by single parameter or optimize each one

Default: “optimize”

The tree topology is fixed to that in tree. But there are several options for how the branch lengths are handled:

-scale: fix the relative length of the branches to the values in tree, but scale them all by a single parameter \(\mu\) that is optimized. This approach will be faster.

-optimize: optimize all branch lengths as free parameters. This approach will be more accurate, but slower.

--gammaomega

Omega for ExpCM from gamma distribution rather than single value. To achieve same for YNGKP, use ‘model’ of YNGKP_M5.

Default: False

Use this option for a model of ExpCM if you want \(\omega\) to be drawn from --ncats gamma-distributed categories, similar to a YNGKP_M5 model. This option adds one extra free parameter, as we are now optimizing two parameters for the \(\omega\) distribution. It will also increase run-time by about 5-fold if you use the default for --ncats.

You do not use this option with YNGKP models; for those, simply set model to YNGKP_M5 to get gamma-distributed \(\omega\).

--gammabeta

Beta for ExpCM from gamma distribution rather than single value.

Default: False

--omegabysite

Fit omega (dN/dS) for each site.

Default: False

If using a YNGKP model, then the \(\omega_r\) value is nearly analogous that obtained using the FEL model described by Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222. If using and ExpCM, then \(\omega_r\) has the meaning described in Experimentally Informed Codon Models. Essentially, we fix all other model / tree parameters and then compare a model that fits a synonymous and nonsynonymous rate to each site to a null model that only fits a synonymous rate; there is evidence for \(\omega_r \ne 1\) if fitting both nonsynonymous and synonymous rate gives sufficiently better likelihood than fitting synonymous rate alone. See also the --omegabysite_fixsyn option.

--omegabysite_fixsyn

For ‘–omegabysite’, assign all sites same dS rather than fit for each site.

Default: False

This option is meaningful only if you are using --omegabysite. If you use this option, then we compare a model in which we fit a nonsynonymous rate to each site to a model in which we fit nothing. The synonymous rate is not fit, and so is assumed to be equal to the overall value fit for the tree. According to Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222, in some cases this can yield greater power if there is relatively limited data. However, it comes with the risk of giving spurious results if there is substantial variation in the synonymous substitution rate among sites.

--diffprefsbysite

Fit differential preferences for each site.

Default: False

This option can only be used with ExpCM models, not with YNGKP models.

The fitting of site-specific differential selection is designed to identify sites that prefer different amino acids in the natural alignment than expected from the deep mutational scanning data. This approach is described in Bloom, Biology Direct, 12:1.

The prior that is used is specified by --diffprefsprior.

--diffprefsprior

Regularizing prior for ‘–diffprefsbysite’: ‘invquadratic,C1,C2’ is prior in Bloom, Biology Direct, 12:1.

Default: invquadratic,150,0.5

How do we regularize the differential preferences when using --diffprefsbysite?

To use the inverse-quadratic prior in Bloom, Biology Direct, 12:1, specify the string invquadratic,C1,C2 where C1 and C2 are numbers > 0 that specify how concentrated the prior is.

--fitphi

Fit ExpCM phi rather than setting so stationary state matches alignment frequencies.

Default: False

This option is not typically recommended. It will typically lead to only very slight improvements in log likelihood at substantial computational cost.

--randprefs

Randomize preferences among sites for ExpCM.

Default: False

Only for ExpCM models. This option randomly reassigns the preferences among sites. This can be used as a control: randomizing the preferences should make them lose their efficacy for describing evolution.

--avgprefs

Average preferences across sites for ExpCM.

Default: False

Only for ExpCM models. This option computes an average of each preference across sites (\(\pi_a = \frac{1}{L} \sum_r \pi_{r,a}\) where \(r = 1, \ldots, L\)), and then uses these average preferences for all sites. This can be used as a control, as it merges all the information in the preferences into a non-site-specific model.

--divpressure

Known diversifying pressure at sites: file with column 1 = position, column 2 = diversification pressure; columns space-, tab-, or comma-delimited.

Only for ExpCM models. This option specifies the name of the file with the predetermined diversifying pressure for each site, \(\delta_{r}\). phydms will fit \(\omega\) and \(\omega_{2}\) as described in Experimentally Informed Codon Models.

You can not use --divpressure and --fitphi simultaneously.

--ncpus

Use this many CPUs; -1 means all available.

Default: 1

Using multiples CPUs accelerates the by-site fitting (e.g., --omegabysite) as multiple sites can be analyzed at once. However, it does not accelerate the initial optimization of the model and tree, as this calculation is not currently parallelized.

--fitprefsmethod

Possible choices: 1, 2

Implementation to we use when fitting prefs.

Default: 2

As described in Numerical implementation, there are two different internal parameter representations that can be used when optimizing the preferences as free parameters. Both in principle should give the same result, but in practice one method might be more efficient than the other or work better in rare cases where the optimization runs into problems.

--ncats

Number of categories for gamma-distribution.

Default: 4

Determines the number of categories when using --gammaomega and model of YNGKP_M5. More categories leads to longer run-time, values of 4-5 are usually adequate.

--minbrlen

Adjust all branch lengths in starting ‘tree’ to >= this.

Default: 1e-06

All branches with lengths less than this value will be set to this value in the initial starting tree. Branches can still end up with lengths less than this after subsequent optimization of this starting tree.

--minpref

Adjust all preferences in ExpCM ‘prefsfile’ to >= this.

Default: 0.002

Adjust all amino-acid preferences in the ExpCM prefsfile to be at least this large. If this number is too close to zero, you will get runtime errors.

--seed

Random number seed.

Default: 1

The random number seed affects the outcomes when there is randomization, as when using --randprefs.

--initparams

Initialize model params from this file, which should be format of ‘*_modelparams.txt’ file created by ‘phydms’ with this model.

--profile

Profile likelihood maximization, write pstats files. For code-development purposes.

Default: False

--opt_details

Print details about optimization

Default: False

--nograd

Do not use gradients for likelihood maximization.

Default: False

-v, --version

show program’s version number and exit

Preferences file format

The preferences file should follow the dms_tools2 preferences file format and be a comma-, tab-, or space-delimited file with the first column giving the site and the rest giving the preference for each amino-acid:

site A C D E F G H I K L M N P Q R S T V W Y
1 0.044  0.046 0.048 0.0461 0.046 0.0411 0.052 0.027 0.046 0.043 0.172 0.045 0.045 0.045 0.044 0.041 0.037 0.028 0.046 0.048
2 0.658 0.006 0.005 0.029 0.005 0.019 0.007 0.004 0.007 0.001 0.036 0.005 0.009 0.003 0.005 0.014 0.013 0.148 0.009 0.005

Output files

Running phydms will create the following output files, all with the prefix specified by outprefix.

Log file

This is a text file with the suffix _log.log that records information about the program’s progress.

Log likelihood file

This file has the suffix _loglikelihood.txt. It simply gives the optimized log likelihood. Here is an example of a file’s contents:

log likelihood = -4415.24

Model parameters

This file has the suffix _modelparams.txt.

Here is an example of the contents for an ExpCM model. It gives the values of the parameters described in Experimentally Informed Codon Models; note that \(\mu\) is not included as it is confounded with the branch lengths:

beta = 2.99307
kappa = 6.31364
omega = 0.780782
phiA = 0.372278
phiC = 0.196416
phiG = 0.224367

Here is an example of the contents for a YNGKP_M0 model. In this file, phi0A is the corrected empirical frequency of A at the first codon site, phi0A is the frequency of A at the second codon site, etc.:

kappa = 5.78484
omega = 0.11308
phi0A = 0.340656
phi0C = 0.153395
phi0G = 0.308294
phi1A = 0.320503
phi1C = 0.206698
phi1G = 0.236844
phi2A = 0.331485
phi2C = 0.202991
phi2G = 0.243997

If you use a model with a gamma-distributed \(\omega\) (i.e., the --gammarates option for an ExpCM, or the YNGKP_M5 model) or \(\beta\), rather than have a single value for the parameter, there are instead two parameters that determine the gamma distribution. For a gamma-distributed \(\omega\), these are the shape parameter \(\alpha_{\omega}\) (denoted alpha_omega) and the inverse scale parameter \(\beta_{\omega}\) (denoted by beta_omega). The mean and variance of the omega distribution are \(\alpha_{\omega}/ \beta_{\omega}\) and \(\alpha_{\omega} / \left(\beta_{\omega}\right)^2\), respectively. To get the exact values, use phydmslib.models.DiscreteGamma (alpha_omega, beta_omega, ncats) where ncats is the value set by --ncats. Here is an example of the model parameter file contents for an ExpCM with --gammaomega:

alpha_omega = 0.835183
beta = 3.01549
beta_omega = 0.976053
kappa = 6.3424
phiA = 0.372475
phiC = 0.196471
phiG = 0.22418

The shape and scale parameters for gamma-distributed \(\beta\) are \(\alpha_{\beta}\) and \(\beta_{\beta}\), respectively.

Tree file

This file has the suffix _tree.newick, and gives the optimized tree in Newick format. The branch lengths are in units of codon substitutions per site.

Site-specific omega file

This file has the suffix _omegabysite.txt, and is created only if using the --omegabysite option. This file gives the \(\omega_r\) values optimized for each site. In the case of a YNGKP model, these are site-specific dN/dS ratios that should be essentially analogous to those obtained under the FEL model described by Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222. In the case of an ExpCM model, these values indicate diversifying selection for nonsynonymous amino-acid change as described in Bloom, Biology Direct, 12:1 (see also Experimentally Informed Codon Models).

Here is an example of the first few lines of a file. The entries are tab separated:

site    omega   P   dLnL    Q
213 0.000   0.000216    6.841   0.108
55  0.000   0.000649    5.815   0.162
354 0.000   0.00166 4.948   0.177
298 0.000   0.00181 4.865   0.177

The P-values are for rejection of the null hypothesis that \(\omega_r = 1\) (calculated using a \(\chi_1^2\) test). These P-values are not corrected for multiple hypothesis testing; so you should instead look at the Q-values, which give the false discovery rate if this site and all above it are considered to have \(\omega \ne 1\). The sites are sorted in the file by Q-value.

For this type of test, the quantity of interest is typically not so much the exact value of \(\omega_r\), but rather the P-value (or Q-value) that \(\omega_r \ne 1\). That P-value (or Q-value) is what indicates the credence that you should lend to the idea that a site is under diversifying selection if \(\omega_r > 1\).

Note that the Q-values are computed separately for \(\omega > 1\) and \(\omega < 1\), since we do not necessarily expect a symmetric distribution of \(\omega\) values.

Site-specific differential selection

This file has the suffix _diffprefsbysite.txt, and is created only if using the --diffprefsbysite option to an ExpCM. These are the differential preferences (the \(\Delta\pi_{r,a}\) values) described in Bloom, Biology Direct, 12:1 and Experimentally Informed Codon Models.

Here is an example of the first few lines of a file. The entries are tab separated:

site    dpi_A   dpi_C   dpi_D   dpi_E   dpi_F   dpi_G   dpi_H   dpi_I   dpi_K   dpi_L   dpi_M   dpi_N   dpi_P   dpi_Q   dpi_R   dpi_S   dpi_T   dpi_V   dpi_W   dpi_Y   half_sum_abs_dpi
375 -0.0004 -0.0040 0.0995  -0.8408 -0.0014 0.4839  -0.0029 -0.0003 -0.0062 -0.0333 -0.0092 -0.0039 -0.0003 -0.0340 -0.0000 -0.0036 -0.0073 0.3659  -0.0013 -0.0003 0.9492
421 -0.0107 -0.0062 -0.0092 0.1521  -0.0009 -0.0004 -0.0078 -0.0124 -0.0003 -0.0168 -0.0036 -0.0271 -0.0000 -0.0019 -0.0060 -0.0259 -0.0022 -0.0116 -0.0003 -0.0085 0.1521
283 -0.0126 -0.0124 -0.0001 -0.0031 -0.0052 -0.0129 -0.0131 -0.0047 -0.0014 -0.0085 -0.0077 -0.0012 0.1502  -0.0123 -0.0004 -0.0237 -0.0164 -0.0089 -0.0054 -0.0000 0.1502
294 -0.0078 -0.0087 -0.0112 0.1458  -0.0084 -0.0079 -0.0098 -0.0001 -0.0064 -0.0028 -0.0079 -0.0134 -0.0000 -0.0126 -0.0104 -0.0156 -0.0110 -0.0115 -0.0001 -0.0002 0.1458
127 -0.0088 -0.0006 -0.0010 0.1423  -0.0021 -0.0179 -0.0059 -0.0096 -0.0208 -0.0100 -0.0021 -0.0095 -0.0007 -0.0066 -0.0073 -0.0114 -0.0146 -0.0075 -0.0010 -0.0049 0.1423
289 -0.0079 -0.0127 -0.0005 -0.0002 -0.0228 -0.0005 -0.0154 -0.0156 -0.0033 -0.0167 -0.0113 -0.0034 -0.0004 -0.0004 -0.0094 -0.0020 -0.0028 -0.0133 -0.0006 0.1391  0.1391

The first column gives the site numbers, subsequent columns give the differential preference (\(\Delta\pi_{r,a}\)) for each amino acid. The last column gives the half absolute sum of the differential preferences, \(\sum_a |\Delta\pi_{r,a}|\), at each site. This quantity can range from zero to one. The sites are sorted with the highest half absolute sum differential preference first.