How to use phydms

Overview

The idea behind phydms is that deep mutational scanning data can inform quantitative models of protein evolution. These models can be used to compare selection measured in the lab to that in natural evolution.

For extensive background, see the following references:

  1. Bloom, Mol Biol Evol, 31:1956-1978 introduces the idea of using deep mutational scanning to build quantitative phylogenetic substitution models. Bloom, Mol Biol Evol, 31:2753-2769 makes several extensions to this idea.

  2. Bloom, Biology Direct, 12:1 describes methods to test whether sites are evolving differently in nature than expected from the deep mutational scanning.

  3. Hilton, Doud, and Bloom, PeerJ, 2017 describes the phydms software package.

How to get deep mutational scanning data into a usable form

The Experimentally Informed Codon Models used by phydms incorporate deep mutational scanning data in the form of amino-acid preferences. Each site has a preference for each amino acid, and these preferences sum to one at each site.

Your deep mutational scanning data may or may not already be in this form. For instance, dms_tools2 directly outputs amino-acid preferences. But many other methods of analyzing deep mutational scanning data give enrichment ratios or functional scores. It is easy to convert these scores or enrichment ratios into amino-acid preferences: just normalize the enrichment ratios to sum to one at each site. See the Tutorial for details.
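The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not part of phydms: the function name ratios_to_preferences and the dictionary layout are hypothetical, and it assumes the enrichment ratios are non-negative and on a linear scale (log-scale scores would need to be exponentiated first).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ratios_to_preferences(ratios):
    """Normalize one site's enrichment ratios so they sum to one.

    `ratios` is a hypothetical input format: a dict mapping each of the
    20 amino acids to a non-negative, linear-scale enrichment ratio.
    """
    missing = [aa for aa in AMINO_ACIDS if aa not in ratios]
    if missing:
        raise ValueError(f"no enrichment ratio for: {missing}")
    if any(ratios[aa] < 0 for aa in AMINO_ACIDS):
        raise ValueError("enrichment ratios must be non-negative")
    total = sum(ratios[aa] for aa in AMINO_ACIDS)
    if total <= 0:
        raise ValueError("at least one enrichment ratio must be positive")
    # Divide each ratio by the site total so the preferences sum to one.
    return {aa: ratios[aa] / total for aa in AMINO_ACIDS}


# Made-up example site: wildtype alanine has ratio 1, every mutant 0.1.
example_ratios = {aa: 0.1 for aa in AMINO_ACIDS}
example_ratios["A"] = 1.0
example_prefs = ratios_to_preferences(example_ratios)
```

Applying the function to every site gives one set of preferences per site. The file format that phydms expects for those preferences is not shown here; see the Tutorial for that.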

If you are missing data for some mutations, you have to interpolate them somehow. If you are only missing a few mutations, a reasonable approach is to estimate their effect as equal to the average for all other mutations. You can’t simply leave some estimates out – any mutation can happen in evolution, so an evolutionary model has to include an estimate for each mutation’s effects.
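One way to fill in a few missing mutations is sketched below. The function name fill_missing_ratios is hypothetical and not part of phydms. The sketch assumes that "the average for all other mutations" means the mean enrichment ratio of the measured non-wildtype amino acids at the same site; averaging over the whole protein would be an equally plausible reading.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fill_missing_ratios(ratios, wildtype):
    """Return a copy of one site's ratios with all 20 amino acids present.

    `ratios` maps measured amino acids to enrichment ratios (hypothetical
    input format). Unmeasured mutations are assigned the mean ratio of
    the measured mutations at this site; the wildtype is excluded from
    that mean because it is not a mutation.
    """
    if wildtype not in ratios:
        raise ValueError("the wildtype amino acid must have a ratio")
    measured = [ratios[aa] for aa in AMINO_ACIDS
                if aa in ratios and aa != wildtype]
    if not measured:
        raise ValueError("no measured mutations at this site")
    mean_effect = sum(measured) / len(measured)
    return {aa: ratios.get(aa, mean_effect) for aa in AMINO_ACIDS}


# Made-up example: only two mutations were measured at a site with
# wildtype alanine.
partial_ratios = {"A": 1.0, "C": 0.2, "D": 0.4}
filled_ratios = fill_missing_ratios(partial_ratios, wildtype="A")
```

In this example the 17 unmeasured mutations each receive the mean of the two measured ones, and the measured values are left unchanged. The filled ratios still need to be normalized to sum to one at the site before they can serve as preferences.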

You cannot simply estimate amino-acid preferences as the frequencies in a natural sequence alignment and use them in phydms. That approach is circular: the preferences need to come from data independent of the natural sequences. A deep mutational scanning experiment is independent; a natural alignment is probably not. If you want to try to infer the preferences from the alignment, consider Bayesian approaches like the ones here or here. These Bayesian approaches don’t have the same problem with overfitting, although they are vastly more computationally expensive and have other shortcomings.

Tutorial

The best way to learn to use phydms is to look at the tutorials. You can access the tutorials by clicking here.