the phydms
program¶
Contents
Overview¶
phydms
can perform phylogenetic analyses with Experimentally Informed Codon Models as well as with some standard non-site-specific substitution models (variants of the YNGKP models described in Yang, Nielsen, Goldman, and Krabbe Pederson, Genetics, 155:431-449). In addition, phydms
can be used to detect site-specific diversifying or differential selection using the Experimentally Informed Codon Models as described in Bloom, Biology Direct, 12:1.
See below for information on the Command-line usage and Output files.
Command-line usage¶
Phylogenetic analysis informed by deep mutational scanning data. The ‘phydms’ package is written by the Bloom lab (see https://github.com/jbloomlab/phydms/contributors). Version 2.4.0. Full documentation at http://jbloomlab.github.io/phydms
usage: phydms [-h] [--brlen {scale,optimize}] [--gammaomega] [--gammabeta]
[--omegabysite] [--omegabysite_fixsyn] [--diffprefsbysite]
[--diffprefsprior DIFFPREFSPRIOR] [--fitphi] [--randprefs]
[--avgprefs] [--divpressure DIVPRESSURE] [--ncpus NCPUS]
[--fitprefsmethod {1,2}] [--ncats NCATS] [--minbrlen MINBRLEN]
[--minpref MINPREF] [--seed SEED] [--initparams INITPARAMS]
[--profile] [--opt_details] [--nograd] [-v]
alignment tree model outprefix
Positional Arguments¶
- alignment
Existing FASTA file with aligned codon sequences.
Should contain aligned DNA codon sequences. Stop codons are not allowed, except if a stop codon is the terminal character in all sequences, in which case they are automatically trimmed.
No checking is done to remove identical sequences, so you may want to remove such sequences yourself if there a lot of them.
The headers must be unique strings that do not contain any of the following: spaces, commas, colons, semicolons, parentheses, square brackets, and single or double quotes.
- tree
Existing Newick file giving input tree.
Typically you infer this tree using a program such as RAxML or codonPhyML. The topology is fixed to that in
tree
. How the branch lengths are handled during optimization depends on--brlen
. The branch lengths intree
are assumed to be in codon substitutions per site.- model
Substitution model: ExpCM_<prefsfile> or YNGKP_<m> (where <m> is M0, M5). For ExpCM, <prefsfile> has first column labeled ‘site’ and others labeled by 1-letter amino-acid code.
YNGKP_<m> is a non-site-specific Goldman-Yang style codon model from Yang, Nielsen, Goldman, and Krabbe Pederson, Genetics, 155:431-449. The <m> indicates the model variant. Codon frequencies are set empirically using the CF3X4 method described by Pond et al, PLoS One, 5:e11230 (this sets the codon frequencies to the the product of nucleotide frequency parameters at each of the three positions after correcting for stop codons). The transition-transversion ratio \(\kappa\) is optimized by maximum likelihood. The nonsynonymous-synonymous ratio \(\omega\) is optimized differently depending on the variant:
-YNGKP_M0 : a single \(\omega\) is optimized by maximum likelihood.
-YNGKP_M5 : \(\omega\) drawn from a gamma distribution with the two parameters optimized by maximum likelihood. The distribution has
--ncats
categories.ExpCM_<prefsfile> is an Experimentally Informed Codon Models with amino-acid preferences taken from the file
prefsfile
. The preferences file must specify a preference for the amino acid encoded by every site inalignment
, using sequential 1, 2, … numbering. Any stop-codon preferences specified in the file are ignored. The preferences file must be a comma-, tab-, or space-delimited file with the first column giving the site and the rest giving the preference for each amino-acid (see detailed example below). Importantly, the amino-acid preferences must be obtained independently from the sequence alignment being analyzed. A deep mutational scanning experiment is an independent means of obtaining the preferences, but estimating them from the amino-acid frequencies in the alignment of homologs is not a valid approach as you are then estimating the preferences from the same sequences that you are subsequently analyzing.- outprefix
Output file prefix.
If this prefix contains a directory name, that directory is created if it does not already exist.
See Output files for a description of the files created with this prefix. Any existing files with these names are deleted at the start of the program’s execution.
Named Arguments¶
- --brlen
Possible choices: scale, optimize
How to handle branch lengths: scale by single parameter or optimize each one
Default: “optimize”
The tree topology is fixed to that in
tree
. But there are several options for how the branch lengths are handled:-scale: fix the relative length of the branches to the values in
tree
, but scale them all by a single parameter \(\mu\) that is optimized. This approach will be faster.-optimize: optimize all branch lengths as free parameters. This approach will be more accurate, but slower.
- --gammaomega
Omega for ExpCM from gamma distribution rather than single value. To achieve same for YNGKP, use ‘model’ of YNGKP_M5.
Default: False
Use this option for a model of ExpCM if you want \(\omega\) to be drawn from
--ncats
gamma-distributed categories, similar to a YNGKP_M5 model. This option adds one extra free parameter, as we are now optimizing two parameters for the \(\omega\) distribution. It will also increase run-time by about 5-fold if you use the default for--ncats
.You do not use this option with YNGKP models; for those, simply set
model
to YNGKP_M5 to get gamma-distributed \(\omega\).- --gammabeta
Beta for ExpCM from gamma distribution rather than single value.
Default: False
- --omegabysite
Fit omega (dN/dS) for each site.
Default: False
If using a YNGKP model, then the \(\omega_r\) value is nearly analogous that obtained using the FEL model described by Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222. If using and ExpCM, then \(\omega_r\) has the meaning described in Experimentally Informed Codon Models. Essentially, we fix all other model / tree parameters and then compare a model that fits a synonymous and nonsynonymous rate to each site to a null model that only fits a synonymous rate; there is evidence for \(\omega_r \ne 1\) if fitting both nonsynonymous and synonymous rate gives sufficiently better likelihood than fitting synonymous rate alone. See also the
--omegabysite_fixsyn
option.- --omegabysite_fixsyn
For ‘–omegabysite’, assign all sites same dS rather than fit for each site.
Default: False
This option is meaningful only if you are using
--omegabysite
. If you use this option, then we compare a model in which we fit a nonsynonymous rate to each site to a model in which we fit nothing. The synonymous rate is not fit, and so is assumed to be equal to the overall value fit for the tree. According to Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222, in some cases this can yield greater power if there is relatively limited data. However, it comes with the risk of giving spurious results if there is substantial variation in the synonymous substitution rate among sites.- --diffprefsbysite
Fit differential preferences for each site.
Default: False
This option can only be used with ExpCM models, not with YNGKP models.
The fitting of site-specific differential selection is designed to identify sites that prefer different amino acids in the natural alignment than expected from the deep mutational scanning data. This approach is described in Bloom, Biology Direct, 12:1.
The prior that is used is specified by
--diffprefsprior
.- --diffprefsprior
Regularizing prior for ‘–diffprefsbysite’: ‘invquadratic,C1,C2’ is prior in Bloom, Biology Direct, 12:1.
Default: invquadratic,150,0.5
How do we regularize the differential preferences when using
--diffprefsbysite
?To use the inverse-quadratic prior in Bloom, Biology Direct, 12:1, specify the string invquadratic,C1,C2 where C1 and C2 are numbers > 0 that specify how concentrated the prior is.
- --fitphi
Fit ExpCM phi rather than setting so stationary state matches alignment frequencies.
Default: False
This option is not typically recommended. It will typically lead to only very slight improvements in log likelihood at substantial computational cost.
- --randprefs
Randomize preferences among sites for ExpCM.
Default: False
Only for ExpCM models. This option randomly reassigns the preferences among sites. This can be used as a control: randomizing the preferences should make them lose their efficacy for describing evolution.
- --avgprefs
Average preferences across sites for ExpCM.
Default: False
Only for ExpCM models. This option computes an average of each preference across sites (\(\pi_a = \frac{1}{L} \sum_r \pi_{r,a}\) where \(r = 1, \ldots, L\)), and then uses these average preferences for all sites. This can be used as a control, as it merges all the information in the preferences into a non-site-specific model.
- --divpressure
Known diversifying pressure at sites: file with column 1 = position, column 2 = diversification pressure; columns space-, tab-, or comma-delimited.
Only for ExpCM models. This option specifies the name of the file with the predetermined diversifying pressure for each site, \(\delta_{r}\).
phydms
will fit \(\omega\) and \(\omega_{2}\) as described in Experimentally Informed Codon Models.You can not use
--divpressure
and--fitphi
simultaneously.- --ncpus
Use this many CPUs; -1 means all available.
Default: 1
Using multiples CPUs accelerates the by-site fitting (e.g.,
--omegabysite
) as multiple sites can be analyzed at once. However, it does not accelerate the initial optimization of the model and tree, as this calculation is not currently parallelized.- --fitprefsmethod
Possible choices: 1, 2
Implementation to we use when fitting prefs.
Default: 2
As described in Numerical implementation, there are two different internal parameter representations that can be used when optimizing the preferences as free parameters. Both in principle should give the same result, but in practice one method might be more efficient than the other or work better in rare cases where the optimization runs into problems.
- --ncats
Number of categories for gamma-distribution.
Default: 4
Determines the number of categories when using
--gammaomega
and model of YNGKP_M5. More categories leads to longer run-time, values of 4-5 are usually adequate.- --minbrlen
Adjust all branch lengths in starting ‘tree’ to >= this.
Default: 1e-06
All branches with lengths less than this value will be set to this value in the initial starting tree. Branches can still end up with lengths less than this after subsequent optimization of this starting tree.
- --minpref
Adjust all preferences in ExpCM ‘prefsfile’ to >= this.
Default: 0.002
Adjust all amino-acid preferences in the ExpCM
prefsfile
to be at least this large. If this number is too close to zero, you will get runtime errors.- --seed
Random number seed.
Default: 1
The random number seed affects the outcomes when there is randomization, as when using
--randprefs
.- --initparams
Initialize model params from this file, which should be format of ‘*_modelparams.txt’ file created by ‘phydms’ with this model.
- --profile
Profile likelihood maximization, write pstats files. For code-development purposes.
Default: False
- --opt_details
Print details about optimization
Default: False
- --nograd
Do not use gradients for likelihood maximization.
Default: False
- -v, --version
show program’s version number and exit
Preferences file format¶
The preferences file should follow the dms_tools2 preferences file format and be a comma-, tab-, or space-delimited file with the first column giving the site and the rest giving the preference for each amino-acid:
site A C D E F G H I K L M N P Q R S T V W Y
1 0.044 0.046 0.048 0.0461 0.046 0.0411 0.052 0.027 0.046 0.043 0.172 0.045 0.045 0.045 0.044 0.041 0.037 0.028 0.046 0.048
2 0.658 0.006 0.005 0.029 0.005 0.019 0.007 0.004 0.007 0.001 0.036 0.005 0.009 0.003 0.005 0.014 0.013 0.148 0.009 0.005
Output files¶
Running phydms
will create the following output files, all with the prefix specified by outprefix
.
Log file¶
This is a text file with the suffix _log.log
that records information about the program’s progress.
Log likelihood file¶
This file has the suffix _loglikelihood.txt
. It simply gives the optimized log likelihood. Here is an example of a file’s contents:
log likelihood = -4415.24
Model parameters¶
This file has the suffix _modelparams.txt
.
Here is an example of the contents for an ExpCM model. It gives the values of the parameters described in Experimentally Informed Codon Models; note that \(\mu\) is not included as it is confounded with the branch lengths:
beta = 2.99307
kappa = 6.31364
omega = 0.780782
phiA = 0.372278
phiC = 0.196416
phiG = 0.224367
Here is an example of the contents for a YNGKP_M0 model. In this file, phi0A is the corrected empirical frequency of A at the first codon site, phi0A is the frequency of A at the second codon site, etc.:
kappa = 5.78484
omega = 0.11308
phi0A = 0.340656
phi0C = 0.153395
phi0G = 0.308294
phi1A = 0.320503
phi1C = 0.206698
phi1G = 0.236844
phi2A = 0.331485
phi2C = 0.202991
phi2G = 0.243997
If you use a model with a gamma-distributed \(\omega\) (i.e., the --gammarates
option for an ExpCM, or the YNGKP_M5 model) or \(\beta\), rather than have a single value for the parameter, there are instead two parameters that determine the gamma distribution.
For a gamma-distributed \(\omega\), these are the shape parameter \(\alpha_{\omega}\) (denoted alpha_omega) and the inverse scale parameter \(\beta_{\omega}\) (denoted by beta_omega).
The mean and variance of the omega distribution are \(\alpha_{\omega}/ \beta_{\omega}\) and \(\alpha_{\omega} / \left(\beta_{\omega}\right)^2\), respectively.
To get the exact values, use phydmslib.models.DiscreteGamma
(alpha_omega, beta_omega, ncats)
where ncats is the value set by --ncats
. Here is an example of the model parameter file contents for an ExpCM with --gammaomega
:
alpha_omega = 0.835183
beta = 3.01549
beta_omega = 0.976053
kappa = 6.3424
phiA = 0.372475
phiC = 0.196471
phiG = 0.22418
The shape and scale parameters for gamma-distributed \(\beta\) are \(\alpha_{\beta}\) and \(\beta_{\beta}\), respectively.
Tree file¶
This file has the suffix _tree.newick
, and gives the optimized tree in Newick format.
The branch lengths are in units of codon substitutions per site.
Site-specific omega file¶
This file has the suffix _omegabysite.txt
, and is created only if using the --omegabysite
option. This file gives the \(\omega_r\) values optimized for each site. In the case of a YNGKP model, these are site-specific dN/dS ratios that should be essentially analogous to those obtained under the FEL model described by Kosakovsky Pond and Frost, Mol Biol Evol, 22:1208-1222. In the case of an ExpCM model, these values indicate diversifying selection for nonsynonymous amino-acid change as described in Bloom, Biology Direct, 12:1 (see also Experimentally Informed Codon Models).
Here is an example of the first few lines of a file. The entries are tab separated:
site omega P dLnL Q
213 0.000 0.000216 6.841 0.108
55 0.000 0.000649 5.815 0.162
354 0.000 0.00166 4.948 0.177
298 0.000 0.00181 4.865 0.177
The P-values are for rejection of the null hypothesis that \(\omega_r = 1\) (calculated using a \(\chi_1^2\) test). These P-values are not corrected for multiple hypothesis testing; so you should instead look at the Q-values, which give the false discovery rate if this site and all above it are considered to have \(\omega \ne 1\). The sites are sorted in the file by Q-value.
For this type of test, the quantity of interest is typically not so much the exact value of \(\omega_r\), but rather the P-value (or Q-value) that \(\omega_r \ne 1\). That P-value (or Q-value) is what indicates the credence that you should lend to the idea that a site is under diversifying selection if \(\omega_r > 1\).
Note that the Q-values are computed separately for \(\omega > 1\) and \(\omega < 1\), since we do not necessarily expect a symmetric distribution of \(\omega\) values.
Site-specific differential selection¶
This file has the suffix _diffprefsbysite.txt
, and is created only if using the --diffprefsbysite
option to an ExpCM. These are the differential preferences (the \(\Delta\pi_{r,a}\) values) described in Bloom, Biology Direct, 12:1 and Experimentally Informed Codon Models.
Here is an example of the first few lines of a file. The entries are tab separated:
site dpi_A dpi_C dpi_D dpi_E dpi_F dpi_G dpi_H dpi_I dpi_K dpi_L dpi_M dpi_N dpi_P dpi_Q dpi_R dpi_S dpi_T dpi_V dpi_W dpi_Y half_sum_abs_dpi
375 -0.0004 -0.0040 0.0995 -0.8408 -0.0014 0.4839 -0.0029 -0.0003 -0.0062 -0.0333 -0.0092 -0.0039 -0.0003 -0.0340 -0.0000 -0.0036 -0.0073 0.3659 -0.0013 -0.0003 0.9492
421 -0.0107 -0.0062 -0.0092 0.1521 -0.0009 -0.0004 -0.0078 -0.0124 -0.0003 -0.0168 -0.0036 -0.0271 -0.0000 -0.0019 -0.0060 -0.0259 -0.0022 -0.0116 -0.0003 -0.0085 0.1521
283 -0.0126 -0.0124 -0.0001 -0.0031 -0.0052 -0.0129 -0.0131 -0.0047 -0.0014 -0.0085 -0.0077 -0.0012 0.1502 -0.0123 -0.0004 -0.0237 -0.0164 -0.0089 -0.0054 -0.0000 0.1502
294 -0.0078 -0.0087 -0.0112 0.1458 -0.0084 -0.0079 -0.0098 -0.0001 -0.0064 -0.0028 -0.0079 -0.0134 -0.0000 -0.0126 -0.0104 -0.0156 -0.0110 -0.0115 -0.0001 -0.0002 0.1458
127 -0.0088 -0.0006 -0.0010 0.1423 -0.0021 -0.0179 -0.0059 -0.0096 -0.0208 -0.0100 -0.0021 -0.0095 -0.0007 -0.0066 -0.0073 -0.0114 -0.0146 -0.0075 -0.0010 -0.0049 0.1423
289 -0.0079 -0.0127 -0.0005 -0.0002 -0.0228 -0.0005 -0.0154 -0.0156 -0.0033 -0.0167 -0.0113 -0.0034 -0.0004 -0.0004 -0.0094 -0.0020 -0.0028 -0.0133 -0.0006 0.1391 0.1391
The first column gives the site numbers, subsequent columns give the differential preference (\(\Delta\pi_{r,a}\)) for each amino acid. The last column gives the half absolute sum of the differential preferences, \(\sum_a |\Delta\pi_{r,a}|\), at each site. This quantity can range from zero to one. The sites are sorted with the highest half absolute sum differential preference first.