dms_diffselection

Overview

dms_diffselection is a program included with the dms_tools package.

It is designed to assess the differential selection on each mutation between a library that is mock-treated and subjected to some selective condition.

After you install dms_tools, this program will be available to run at the command line.

Differential selection

Say you have a deep mutational scanning library that is mock-treated and selected. For instance, this might be a virus library that is passaged in the absence and presence of immune selection. For each site \(r\) in the gene, let \(n_{r,x}^{\rm{mock}}\) be the number of observed counts of character \(x\) (might be amino acid, nucleotide, or codon) at site \(r\) in the mock-treated condition, and let \(n_{r,x}^{\rm{selected}}\) be the number of observed counts of \(x\) at \(r\) in the selected condition. Let \(\operatorname{wt}\left(r\right)\) be denote the wildtype character at \(r\). Then the relative enrichment of the mutant relative to the wildtype after selection is

(1)\[E_{r,x} = \frac{\left(n_{r,x}^{\rm{selected}} + f_{r, \rm{selected}} \times P\right) / \left(n_{r,\operatorname{wt}\left(r\right)}^{\rm{selected}} + f_{r, \rm{selected}} \times P\right)}{\left(n_{r,x}^{\rm{mock}} + f_{r, \rm{mock}} \times P\right) / \left(n_{r,\operatorname{wt}\left(r\right)}^{\rm{mock}} + f_{r, \rm{mock}} \times P\right)}\]

where \(P > 0\) is a pseudocount that is added to each observed count (specified by --pseudocount option), and \(f_{r, \rm{selected}}\) and \(f_{r, \rm{mock}}\) are defined so that the pseudocount is scaled up for the library (mock or selected) with higher depth at site \(r\):

(2)\[f_{r, \rm{selected}} = \max\left[1, \left(\sum_x n_{r,x}^{\rm{selected}}\right) / \left(\sum_x n_{r,x}^{\rm{mock}}\right)\right]\]
(3)\[f_{r, \rm{mock}} = \max\left[1, \left(\sum_x n_{r,x}^{\rm{mock}}\right) / \left(\sum_x n_{r,x}^{\rm{selected}}\right)\right].\]

The reason for scaling the pseudocount by library depth is that the mock and selected libraries may often be sequenced at different depths. In that case, if the same pseudocount is added to both, then estimates of \(E_{r,x}\) will tend to be systematically different than one even if the relative counts for the wildtype and mutant amino acid are the same in both two conditions. Scaling the pseudocounts by the ratio of depths fixes this problem. If you do not want to do this scaling, see the --no-scale-pseudocounts option.

Note that by definition, \(E_{r,\operatorname{wt}\left(r\right)}\) is always one.

We then quantify the differential selection \(s_{r,x}\) for \(x\) at \(r\) in the selected versus control condition as:

(4)\[s_{r,x} = \log_2 E_{r,x}\]

It is these \(s_{r,x}\) values that are reported by dms_diffselection. Note that the choice of logarithm base is arbitrary; we have chosen 2 here. Note that by definition, \(s_{r,\operatorname{wt}\left(r\right)}\) is always zero.

When \(s_{r,x} > 0\), then \(x\) is more favored in the selection versus mock condition. When \(s_{r,x} < 0\), then \(x\) is less favored in the selection versus mock condition.

Larger values of the pseudocount \(P\) will avoid spuriously estimating strong selection when in fact you just have a lot of statistical noise due to small counts.

Error correction

You can optionally correct for the potential inflation of some counts by sequencing errors by using the --errorcontrolcounts option to dms_diffselection.

In this case, the counts defined in Equation (1) are adjusted as follows. Let \(n_{r,x}\) be the counts for character \(x\) at site \(r\) in the mock or selected sample (these are the numbers in the mockcounts or selectecounts files). Let \(n^{\rm{err}}_{r,x}\) be the number of counts of \(x\) at site \(r\) in the error control specified by --errorcontrolcounts. Define

(5)\[\epsilon_{r,x} = \left(n^{\rm{err}}_{r,x}\right) / \left(\sum_y n^{\rm{err}}_{r,y}\right).\]

When \(x \ne \operatorname{wt}\left(r\right)\) then \(\epsilon_{r,x}\) is the rate of errors to \(x\) at site \(r\), and when \(x = \operatorname{wt}\left(r\right)\) then \(\epsilon_{r,x}\) is one minus the rate of errors away from the wildtype at site \(r\).

We then adjust observed counts \(n_{r,x}\) to the error-corrected counts \(\hat{n}_{r,x}\) by

(6)\[\begin{split}\hat{n}_{r,x} = \begin{cases} \max\left[\left(\sum_y n_{r,y}\right) \left(\frac{n_{r,x}}{\sum_y n_{r,y}} - \epsilon_{r,x}\right), 0\right] & \mbox{if } x \ne \operatorname{wt}\left(r\right) \\ n_{r,x} / \epsilon_{r,x} & \mbox{if } x = \operatorname{wt}\left(r\right) \\ \end{cases}\end{split}\]

These adjusted counts are then used in place of the un-adjusted counts in Equation (1) above.

Note that if you are using --chartype of codon_to_aa, the corrections are done on the codon counts, and these corrected which are then aggregated into the amino acid counts before computing the differential selection.

Command-line usage

Differential selection between mock-treated and selected samples. Quantified as log2[{(n_sel_mut + f_sel * P) / (n_sel_wt + f_sel * P)} / {(n_mock_mut + f_mock * P) / (n_mock_wt + f_mock * P)}] where P is pseudocount, f_sel = max(1, N_sel / N_mock), and f_mock = max(1, N_mock / N_sel) where N_mock and N_sel are the sequencing depth for mock and selected libraries at that site. This script is part of dms_tools (version 1.1.20) written by the Bloom Lab (see https://github.com/jbloomlab/dms_tools/graphs/contributors for all contributors). Detailed documentation is at http://jbloomlab.github.io/dms_tools/

usage: dms_diffselection [-h] [--errorcontrolcounts ERRORCONTROLCOUNTS]
                         [--pseudocount PSEUDOCOUNT] [--mincounts MINCOUNTS]
                         [--no-scale-pseudocounts]
                         [--chartype {codon_to_aa,DNA,codon,aa}]
                         [--includestop] [-v]
                         mockcounts selectedcounts outprefix

Positional Arguments

mockcounts

File with counts from mock-selected library. For nucleotides, header line should be “# POSITION WT A C G T” and then subsequent lines give wildtype and counts for each site (i.e. “1 G 1013 23 19 47”); for codons use the 64 codons instead (i.e. “AAA AAC AAG …”).

This file gives the values described in Differential selection.

Should be in the format of a dms_counts file.

selectedcounts

File with counts from selected library

This file gives the values described in Differential selection.

Should be in the format of a dms_counts file.

outprefix

Prefix for output files. Suffixes are ‘mutdiffsel.txt’ and ‘sitediffsel.txt’

For instance, if outprefix is antibodyselection_ then the output files are antibodyselection_mutdiffsel.txt and antibodyselection_sitediffsel.txt. See Output for an explanation of the contents of these files.

Named Arguments

--errorcontrolcounts
 

Counts file for error control (e.g., wildtype gene or virus) to correct mutation counts.

You can use this option if you have estimated your sequencing error rate by sequencing a “wildtype” sample where you expect all errors come from sequencing (or other related processes such as PCR or reverse transcription). The counts will then be adjusted to correct for the error rates estimated from this control.

This file gives the values described in Error correction.

Should be in the format of a dms_counts file.

--pseudocount

Pseudocount added to each count for selected library. By default, pseudocounts is added to library with smaller depth, and pseudcount for other library is scaled by relative depth at that site (see –no-scale-pseudocounts).

Default: 10

This is the parameter described in the Equation E_rx. Note that by default this pseudocount is for the library with greater depth at that site, and the pseudocount for the other library is scaled up by the ratio of the relative depths as described by Equations f_rselected and f_rmock.

--mincounts

Only report differential selection for mutations for which at least one of mock or selected counts is >= this number. Mutations that do not meet this threshold have differential selection written as “NaN”.

Default: 0

If you set this option to anything other than its default of 0, then you will get differential selection ( values) of NaN (not a number) for mutations that have very low counts both before and after selection. In some cases, this may be preferable than estimating a value for a mutation with very little data.

Note that if you keep –mincounts at its default of 0 but have a reasonable value of –pseudocount, then mutations with few or no counts will have differential selection values of 0 estimated for them.

--no-scale-pseudocounts
 

Use this option if you do NOT want to scale the pseudocounts for each library by relative depth, but instead want to use unscaled value specified by ‘–pseudocount’.

Default: False

If you use this option, the scaling and in Equations f_rselected and f_rmock are always set to one. Note that this can give biased estimates if the depth is unequal in the two libraries.

--chartype

Possible choices: codon_to_aa, DNA, codon, aa

Characters: “codon_to_aa” = counts for codons and selection for amino acids; “DNA” = counts and selection for DNA; “codon” = counts and selection for codons; “aa” = counts and selection for amino acids (possibly including stop codons, see “–includestop”).

Default: “codon_to_aa”

A value of codon_to_aa means we have codon characters for the deep sequencing data, but infer selection for amino acids. Essentially, we sum the counts of all codons for each amino acid, and then do the calculation described in Overview on these amino-acid counts.

--includestop

Include stop codons as a possible amino acid if using “–chartype” of “codon_to_aa” or “aa”.

Default: False

If this is true, treat stop codons (denoted by *) as a possible amino acid. Otherwise simply ignore any counts for stop codons.

-v, --version show program’s version number and exit

Example usage

Let’s say that mock_counts.txt has the codon counts in a mock-selected library, and antibody_counts has these same counts in a library selected with an antibody. To determine selection on the amino acids, you could then run:

dms_diffselection mock_counts.txt antibody_counts.txt antibodyselection_

The created files would be antibodyselection_mutdiffsel.txt and antibodyselection_sitediffsel.txt.

Output

Running dms_diffselection will cause some descriptive output to be printed to standard output.

In addition, two files with the prefix outprefix are created (these files are overwritten if they already exist). These files are:

  • The *mutdiffsel.txt file gives the \(s_{r,x}\) values in a CSV file. These are sorted by \(s_{r,x}\) value. The site number and wildtype character are also indicated. Here are the first few lines of an example file:

    site,wt,mut,diffsel
    156,G,S,7.15912988065
    146,N,D,6.85510232461
    157,K,S,6.78492241017
    159,S,D,6.61445818152
    153,S,K,6.55644791926
    153,S,I,6.45786904581
    158,S,A,6.31170638356
    

    In these files, the entry for \(x\) equal to the wildtype character is always NaN (not a number). So for instance, given the data above, we would also have entries like:

    156,G,G,NaN
    146,N,N,NaN
    

    In addition, if --mincounts is set to something other than zero, then there may be mutations for which an \(s_{r,x}\) value cannot be computed. In that case, that value is also reported as NaN (not a number) in this output file.

  • *sitediffsel.txt file summarizes total differential selection on each site. It provides three summary statistics:

    • abs_diffsel is the sum of the absolute values of the selection on all mutations at the site, and so is a measure of the total selection at this site. It is computed as \(\sum_x \left|s_{r,x}\right|\).
    • positive_diffsel is a sum of the values of positively selected mutations at a site, and so is a measure of the total selection for favorable mutations at this site. It is computed as \(\sum_x \max\left(0, s_{r,x}\right)\).
    • negative_diffsel is a sum of the values of negatively selected mutations at a site, and so is a measure of the total selection against mutations at this site. It is computed as \(\sum_x \min\left(0, s_{r,x}\right)\).

    The file gives all three of these quantities for each site, sorted by abs_diffsel. Here are the first few lines of an example file:

    site,abs_diffsel,positive_diffsel,negative_diffsel
    157,92.5132004464,92.5132004464,0.0
    153,85.7023127899,84.7159706476,-0.986342142299
    136,56.4005369298,56.3664732104,-0.0340637194023
    158,56.3773123754,56.3773123754,0.0
    156,56.0690809464,56.0690809464,0.0
    175,31.899455678,3.79784216691,-28.1016135111
    

    Note that if --mincounts is not 0, then some of the \(s_{r,x}\) values may be NaN (not a number). In that case, any characters \(x\) that do not have a value for the \(s_{r,x}\) value do not contribute to the sums used to define abs_diffsel, positive_diffsel, and negative_diffsel above (in other words, they are counted as if the differential selection for character \(x\) is 0). If all characters have a differential selection of NaN, then the three summed values shown in this output file are NaN as well.

These output files are in a form where they can directly read into pandas data frames via that package’s read_csv function.

You can visualize the *mutdiffsel.txt files with dms_logoplot.