dms_diffselection
¶
Contents
Overview¶
dms_diffselection
is a program included with the dms_tools package.
It is designed to assess the differential selection on each mutation between a library that is mock-treated and subjected to some selective condition.
After you install dms_tools, this program will be available to run at the command line.
Differential selection¶
Say you have a deep mutational scanning library that is mock-treated and selected. For instance, this might be a virus library that is passaged in the absence and presence of immune selection. For each site \(r\) in the gene, let \(n_{r,x}^{\rm{mock}}\) be the number of observed counts of character \(x\) (might be amino acid, nucleotide, or codon) at site \(r\) in the mock-treated condition, and let \(n_{r,x}^{\rm{selected}}\) be the number of observed counts of \(x\) at \(r\) in the selected condition. Let \(\operatorname{wt}\left(r\right)\) be denote the wildtype character at \(r\). Then the relative enrichment of the mutant relative to the wildtype after selection is
where \(P > 0\) is a pseudocount that is added to each observed count (specified by --pseudocount
option), and \(f_{r, \rm{selected}}\) and \(f_{r, \rm{mock}}\) are defined so that the pseudocount is scaled up for the library (mock or selected) with higher depth at site \(r\):
The reason for scaling the pseudocount by library depth is that the mock and selected libraries may often be sequenced at different depths. In that case, if the same pseudocount is added to both, then estimates of \(E_{r,x}\) will tend to be systematically different than one even if the relative counts for the wildtype and mutant amino acid are the same in both two conditions. Scaling the pseudocounts by the ratio of depths fixes this problem. If you do not want to do this scaling, see the --no-scale-pseudocounts
option.
Note that by definition, \(E_{r,\operatorname{wt}\left(r\right)}\) is always one.
We then quantify the differential selection \(s_{r,x}\) for \(x\) at \(r\) in the selected versus control condition as:
It is these \(s_{r,x}\) values that are reported by dms_diffselection
. Note that the choice of logarithm base is arbitrary; we have chosen 2 here. Note that by definition, \(s_{r,\operatorname{wt}\left(r\right)}\) is always zero.
When \(s_{r,x} > 0\), then \(x\) is more favored in the selection versus mock condition. When \(s_{r,x} < 0\), then \(x\) is less favored in the selection versus mock condition.
Larger values of the pseudocount \(P\) will avoid spuriously estimating strong selection when in fact you just have a lot of statistical noise due to small counts.
Error correction¶
You can optionally correct for the potential inflation of some counts by sequencing errors by using the --errorcontrolcounts
option to dms_diffselection
.
In this case, the counts defined in Equation (1) are adjusted as follows.
Let \(n_{r,x}\) be the counts for character \(x\) at site \(r\) in the mock or selected sample (these are the numbers in the mockcounts
or selectecounts
files).
Let \(n^{\rm{err}}_{r,x}\) be the number of counts of \(x\) at site \(r\) in the error control specified by --errorcontrolcounts
.
Define
When \(x \ne \operatorname{wt}\left(r\right)\) then \(\epsilon_{r,x}\) is the rate of errors to \(x\) at site \(r\), and when \(x = \operatorname{wt}\left(r\right)\) then \(\epsilon_{r,x}\) is one minus the rate of errors away from the wildtype at site \(r\).
We then adjust observed counts \(n_{r,x}\) to the error-corrected counts \(\hat{n}_{r,x}\) by
These adjusted counts are then used in place of the un-adjusted counts in Equation (1) above.
Note that if you are using --chartype
of codon_to_aa
, the corrections are done on the codon counts, and these corrected which are then aggregated into the amino acid counts before computing the differential selection.
Command-line usage¶
Differential selection between mock-treated and selected samples. Quantified as log2[{(n_sel_mut + f_sel * P) / (n_sel_wt + f_sel * P)} / {(n_mock_mut + f_mock * P) / (n_mock_wt + f_mock * P)}] where P is pseudocount, f_sel = max(1, N_sel / N_mock), and f_mock = max(1, N_mock / N_sel) where N_mock and N_sel are the sequencing depth for mock and selected libraries at that site. This script is part of dms_tools (version 1.1.20) written by the Bloom Lab (see https://github.com/jbloomlab/dms_tools/graphs/contributors for all contributors). Detailed documentation is at http://jbloomlab.github.io/dms_tools/
usage: dms_diffselection [-h] [--errorcontrolcounts ERRORCONTROLCOUNTS]
[--pseudocount PSEUDOCOUNT] [--mincounts MINCOUNTS]
[--no-scale-pseudocounts]
[--chartype {codon_to_aa,DNA,codon,aa}]
[--includestop] [-v]
mockcounts selectedcounts outprefix
Positional Arguments¶
mockcounts | File with counts from mock-selected library. For nucleotides, header line should be “# POSITION WT A C G T” and then subsequent lines give wildtype and counts for each site (i.e. “1 G 1013 23 19 47”); for codons use the 64 codons instead (i.e. “AAA AAC AAG …”). This file gives the values described in Differential selection. Should be in the format of a dms_counts file. |
selectedcounts | File with counts from selected library This file gives the values described in Differential selection. Should be in the format of a dms_counts file. |
outprefix | Prefix for output files. Suffixes are ‘mutdiffsel.txt’ and ‘sitediffsel.txt’ For instance, if outprefix is antibodyselection_ then the output files are antibodyselection_mutdiffsel.txt and antibodyselection_sitediffsel.txt. See Output for an explanation of the contents of these files. |
Named Arguments¶
--errorcontrolcounts | |
Counts file for error control (e.g., wildtype gene or virus) to correct mutation counts. You can use this option if you have estimated your sequencing error rate by sequencing a “wildtype” sample where you expect all errors come from sequencing (or other related processes such as PCR or reverse transcription). The counts will then be adjusted to correct for the error rates estimated from this control. This file gives the values described in Error correction. Should be in the format of a dms_counts file. | |
--pseudocount | Pseudocount added to each count for selected library. By default, pseudocounts is added to library with smaller depth, and pseudcount for other library is scaled by relative depth at that site (see –no-scale-pseudocounts). Default: 10 This is the parameter described in the Equation E_rx. Note that by default this pseudocount is for the library with greater depth at that site, and the pseudocount for the other library is scaled up by the ratio of the relative depths as described by Equations f_rselected and f_rmock. |
--mincounts | Only report differential selection for mutations for which at least one of mock or selected counts is >= this number. Mutations that do not meet this threshold have differential selection written as “NaN”. Default: 0 If you set this option to anything other than its default of 0, then you will get differential selection ( values) of NaN (not a number) for mutations that have very low counts both before and after selection. In some cases, this may be preferable than estimating a value for a mutation with very little data. Note that if you keep –mincounts at its default of 0 but have a reasonable value of –pseudocount, then mutations with few or no counts will have differential selection values of 0 estimated for them. |
--no-scale-pseudocounts | |
Use this option if you do NOT want to scale the pseudocounts for each library by relative depth, but instead want to use unscaled value specified by ‘–pseudocount’. Default: False If you use this option, the scaling and in Equations f_rselected and f_rmock are always set to one. Note that this can give biased estimates if the depth is unequal in the two libraries. | |
--chartype | Possible choices: codon_to_aa, DNA, codon, aa Characters: “codon_to_aa” = counts for codons and selection for amino acids; “DNA” = counts and selection for DNA; “codon” = counts and selection for codons; “aa” = counts and selection for amino acids (possibly including stop codons, see “–includestop”). Default: “codon_to_aa” A value of codon_to_aa means we have codon characters for the deep sequencing data, but infer selection for amino acids. Essentially, we sum the counts of all codons for each amino acid, and then do the calculation described in Overview on these amino-acid counts. |
--includestop | Include stop codons as a possible amino acid if using “–chartype” of “codon_to_aa” or “aa”. Default: False If this is true, treat stop codons (denoted by *) as a possible amino acid. Otherwise simply ignore any counts for stop codons. |
-v, --version | show program’s version number and exit |
Example usage¶
Let’s say that mock_counts.txt
has the codon counts in a mock-selected library, and antibody_counts
has these same counts in a library selected with an antibody. To determine selection on the amino acids, you could then run:
dms_diffselection mock_counts.txt antibody_counts.txt antibodyselection_
The created files would be antibodyselection_mutdiffsel.txt
and antibodyselection_sitediffsel.txt
.
Output¶
Running dms_diffselection
will cause some descriptive output to be printed to standard output.
In addition, two files with the prefix outprefix
are created (these files are overwritten if they already exist). These files are:
The
*mutdiffsel.txt
file gives the \(s_{r,x}\) values in a CSV file. These are sorted by \(s_{r,x}\) value. The site number and wildtype character are also indicated. Here are the first few lines of an example file:site,wt,mut,diffsel 156,G,S,7.15912988065 146,N,D,6.85510232461 157,K,S,6.78492241017 159,S,D,6.61445818152 153,S,K,6.55644791926 153,S,I,6.45786904581 158,S,A,6.31170638356In these files, the entry for \(x\) equal to the wildtype character is always
NaN
(not a number). So for instance, given the data above, we would also have entries like:156,G,G,NaN 146,N,N,NaNIn addition, if
--mincounts
is set to something other than zero, then there may be mutations for which an \(s_{r,x}\) value cannot be computed. In that case, that value is also reported asNaN
(not a number) in this output file.
*sitediffsel.txt
file summarizes total differential selection on each site. It provides three summary statistics:
- abs_diffsel is the sum of the absolute values of the selection on all mutations at the site, and so is a measure of the total selection at this site. It is computed as \(\sum_x \left|s_{r,x}\right|\).
- positive_diffsel is a sum of the values of positively selected mutations at a site, and so is a measure of the total selection for favorable mutations at this site. It is computed as \(\sum_x \max\left(0, s_{r,x}\right)\).
- negative_diffsel is a sum of the values of negatively selected mutations at a site, and so is a measure of the total selection against mutations at this site. It is computed as \(\sum_x \min\left(0, s_{r,x}\right)\).
The file gives all three of these quantities for each site, sorted by abs_diffsel. Here are the first few lines of an example file:
site,abs_diffsel,positive_diffsel,negative_diffsel 157,92.5132004464,92.5132004464,0.0 153,85.7023127899,84.7159706476,-0.986342142299 136,56.4005369298,56.3664732104,-0.0340637194023 158,56.3773123754,56.3773123754,0.0 156,56.0690809464,56.0690809464,0.0 175,31.899455678,3.79784216691,-28.1016135111Note that if
--mincounts
is not 0, then some of the \(s_{r,x}\) values may beNaN
(not a number). In that case, any characters \(x\) that do not have a value for the \(s_{r,x}\) value do not contribute to the sums used to define abs_diffsel, positive_diffsel, and negative_diffsel above (in other words, they are counted as if the differential selection for character \(x\) is 0). If all characters have a differential selection ofNaN
, then the three summed values shown in this output file areNaN
as well.
These output files are in a form where they can directly read into pandas data frames via that package’s read_csv
function.
You can visualize the *mutdiffsel.txt
files with dms_logoplot.