# Differential selection¶

## Overview¶

Deep mutational scanning libraries are often subject to multiple different selection pressures with the goal of identifying mutations that are differentially selected in one condition versus the other. For instance, you might select a library of viral mutants with a monoclonal antibody, and also perform a mock selection without antibody.

One way to quantify the differential selection on mutations in the selected versus mock condition is in terms of the *differential selection* measure defined here.
A visualization of differential selection is shown in Fig. 5.
Essentially, most mutations are **not** selected differently in the selected (e.g., antibody) treatment versus the mock selection – but a small number are under strong differential selection.

The dms_tools2 software contains programs to estimate the differential selection from the counts of mutations generated from the deep sequencing data. Typically you would want to estimate the differential selection using dms2_batch_diffsel, which in turn calls dms2_diffsel. These programs create visualizations of the results, and you can also visualize the differential selection with dms2_logoplot as in Fig. 5.

The Examples include some analyses of this type; for instance see:

## Definition of differential selection¶

The definition of *differential selection* used here was originally introduced in Doud et al (2017).
A schematic of the overall integrated experimental and computational workflow is shown in Fig. 6.

### Formula for differential selection¶

For each site \(r\) in the gene, let \(n_{r,x}^{\rm{mock}}\) be the number of observed counts of character \(x\) (e.g., an amino acid) at site \(r\) in the mock-treated condition, and let \(n_{r,x}^{\rm{selected}}\) be the number of observed counts of \(x\) at \(r\) in the selected condition. Let \(\operatorname{wt}\left(r\right)\) be denote the wildtype character at \(r\). Then the relative enrichment of the mutant relative to the wildtype after selection is

where \(P > 0\) is a pseudocount that is added to each observed count (specified by `--pseudocount`

option to dms2_diffsel / dms2_batch_diffsel), and \(f_{r, \rm{selected}}\) and \(f_{r, \rm{mock}}\) are defined so that the pseudocount is scaled up for the library (*mock* or *selected*) with higher depth at site \(r\):

The reason for scaling the pseudocount by library depth is that the *mock* and *selected* libraries may often be sequenced at different depths.
If the same pseudocount is added to both, then estimates of \(E_{r,x}\) will be systematically different than one even if the relative counts for the wildtype and mutant amino acid are the same in both two conditions.
Scaling the pseudocounts by the ratio of depths fixes this problem.
Larger values of the pseudocount \(P\) will avoid spuriously estimating differential selection when there is just statistical noise due to low counts, but this comes at the cost of reducing sensitivity.
If you are using `--chartype`

of `codon_to_aa`

, the counts for amino acids are aggregated **before** doing the calculations above.

Note that by definition, \(E_{r,\operatorname{wt}\left(r\right)}\) is always one.

We quantify the *differential selection* \(s_{r,x}\) for the mutation to \(x\) at site \(r\) in the selected versus control condition as:

These \(s_{r,x}\) are reported as the *mutation differential selection* values by dms2_diffsel / dms2_batch_diffsel, and can be plotted in the logo plots above using dms2_logoplot.
When \(s_{r,x} > 0\), then \(x\) is more favored in the selection versus mock condition.
When \(s_{r,x} < 0\), then \(x\) is less favored in the selection versus mock condition.
By definition, \(s_{r,\operatorname{wt}\left(r\right)}\) is always zero.

There is a mutation differential selection value for each mutation at a site – for instance, there will be 19 such values for amino acid characters (the wildtype value is always one by definition). For visualization and analysis, it is sometimes useful to summarize the mutation differential selection values at a site with a single number. This can be done in several ways:

We define

*absolute site differential selection*in terms of the*mutation differential selection*values at that site as(22)¶\[\mbox{abs_diffsel} = \sum_x \left|s_{r,x}\right|.\]The absolute site differential selection is equivalent to the total height of the letter stack in both the positive and negative directions in logoplots such as Fig. 5.

We define

*positive site differential selection*in terms of the*mutation differential selection*values at that site as(23)¶\[\mbox{positive_diffsel} = \sum_x \max\left(0, s_{r,x}\right).\]The positive site differential selection is equivalent to the total height of the letter stack in just the positive direction in logoplots such as Fig. 5.

We define

*negative site differential selection*in terms of the*mutation differential selection*values at that site as(24)¶\[\mbox{negative_diffsel} = \sum_x \min\left(0, s_{r,x}\right).\]The negative site differential selection is equivalent to the total height of the letter stack in just the negative direction in logoplots such as Fig. 5.

We define

*maximum site differential selection*in terms of the*mutation differential selection*values at that site as(25)¶\[\mbox{max_diffsel} = \sum_x \max\left(0, \max_x\left(s_{r,x}\right)\right).\]The maximum site differential selection is equivalent to the tallest letter in the positive direction in logoplots such as Fig. 5, or zero if no letters are in the positive direction.

We define

*minimum site differential selection*in terms of the*mutation differential selection*values at that site as(26)¶\[\mbox{min_diffsel} = \sum_x \min\left(0, \min_x\left(s_{r,x}\right)\right).\]The maximum site differential selection is equivalent to the tallest letter in the negative direction in logoplots such as Fig. 5, or zero if no letters are in the negative direction.

### Error correction¶

You can optionally correct for the potential inflation of some counts by sequencing errors by using the `--err`

option to dms2_diffsel / dms2_batch_diffsel.
To do this, you should use some control where the **only** mutations will come from sequencing / library-prep-induced errors that are presumed to occur at the same rate as the actual samples being analyzed.
Typically, this would be deep sequencing of a wildtype control.

With an error control, the counts are adjusted as follows.
Let \(n_{r,x}\) be the counts for character \(x\) at site \(r\) in the *mock* or *selected* sample.
Let \(n^{\rm{err}}_{r,x}\) be the number of counts of \(x\) at site \(r\) in the error control.
Define

When \(x \ne \operatorname{wt}\left(r\right)\) then \(\epsilon_{r,x}\) is the rate of errors to \(x\) at site \(r\), and when \(x = \operatorname{wt}\left(r\right)\) then \(\epsilon_{r,x}\) is one minus the rate of errors away from the wildtype at site \(r\).

We then adjust observed counts \(n_{r,x}\) to the error-corrected counts \(\hat{n}_{r,x}\) by

These adjusted counts are then used in place of the un-adjusted counts in Equation (18) above.

Note that if you are using `--chartype`

of `codon_to_aa`

, the corrections are done on the codon counts, and these corrected codon counts are then aggregated into the amino acid counts before computing the differential selection.

## Programs and examples¶

The dms_tools2 software contains programs to analyze differential selection:

- dms2_diffsel estimates differential selection from counts files.
- dms2_batch_diffsel runs dms2_diffsel on a set of samples and summarizes the results. If you have multiple related replicates, it is easier to use dms2_batch_diffsel rather than runnning dms2_diffsel directly.
- dms2_logoplot can be used to create logo plots that visualize differential selection.

Here are some example analyses: