dms2_diffsel¶
Overview¶
The dms2_diffsel program processes files giving the number of observed counts of characters in a selected and a mock-selected condition to estimate Differential selection.
If you have multiple related replicates or samples (or even if you have just one), you should probably use the dms2_batch_diffsel program rather than running dms2_diffsel directly.
This is because dms2_batch_diffsel runs dms2_diffsel, but then also makes summary plots.
Command-line usage¶
Estimate differential selection from mutation counts. Part of dms_tools2 (version 2.6.6) written by the Bloom Lab.
usage: dms2_diffsel [-h] [--outdir OUTDIR] [--ncpus NCPUS]
[--use_existing {yes,no}] [-v] [--indir INDIR]
[--chartype {codon_to_aa}] [--excludestop {yes,no}]
[--pseudocount PSEUDOCOUNT] [--mincount MINCOUNT] --name
NAME --sel SEL --mock MOCK [--err ERR]
Named Arguments¶
- --outdir
Output files to this directory (create if needed).
- --ncpus
Number of CPUs to use, -1 is all available.
Default: -1
- --use_existing
Possible choices: yes, no
If files with names of expected output already exist, do not re-run.
Default: “no”
- -v, --version
show program’s version number and exit
- --indir
Input counts files in this directory.
This option can be useful if the counts files are found in a common directory. Instead of repeatedly listing that directory name, you can just provide it here.
- --chartype
Possible choices: codon_to_aa
Characters for which differential selection is estimated. codon_to_aa = amino acids from codon counts.
Default: “codon_to_aa”
- --excludestop
Possible choices: yes, no
Exclude stop codons as a possible amino acid?
Default: “yes”
- --pseudocount
Pseudocount added to each count for sample with smaller depth; pseudocount for other sample scaled by relative depth.
Default: 5
- --mincount
Report as NaN the diffsel of mutations for which both selected and mock-selected samples have < this many counts.
Default: 0
- --name
Name used for output files.
The output files will have a prefix equal to the name specified here. This name should only contain letters, numbers, dashes, and spaces. Underscores are not allowed as they are a LaTeX special character.
- --sel
Post-selection counts file or prefix used when creating this file.
The counts files have the format of the files created by programs such as dms2_bcsubamp. Specifically, they must have the following columns: ‘site’, ‘wildtype’, and then a column for each possible character (e.g., codon).
- --mock
Like --sel, but for mock-selection counts.
- --err
Like --sel, but for error-control to correct mutation counts.
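The --pseudocount and --mincount descriptions above can be illustrated with a minimal Python sketch. It is not the dms_tools2 implementation: the function names are hypothetical, and "depth" is assumed here to mean the total counts of a sample at a site.

```python
def scaled_pseudocounts(sel_depth, mock_depth, pseudocount=5.0):
    """Return (sel_pseudocount, mock_pseudocount).

    The sample with smaller depth gets `pseudocount`; the other sample's
    pseudocount is scaled up by the ratio of the two depths.
    Assumption: depth is the total count at a site and is positive.
    """
    if sel_depth <= 0 or mock_depth <= 0:
        raise ValueError("this sketch only handles positive depths")
    if sel_depth <= mock_depth:
        return pseudocount, pseudocount * mock_depth / sel_depth
    return pseudocount * sel_depth / mock_depth, pseudocount


def below_mincount(sel_count, mock_count, mincount=0):
    """True if BOTH counts are below `mincount`, i.e. diffsel would be NaN."""
    return sel_count < mincount and mock_count < mincount


print(scaled_pseudocounts(100, 400))      # shallower selected sample
print(below_mincount(2, 3, mincount=5))   # both counts under the threshold
```

With the default --mincount of 0, `below_mincount` is never true, so no mutation is masked for low counts.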
Output files¶
The output files all have the prefix specified by --outdir and --name.
For instance, if you use --outdir results --name replicate-1, then the output files will have the prefix ./results/replicate-1 and the suffixes described below.
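As a rough illustration of how the options and the output prefix fit together, the following Python sketch assembles a dms2_diffsel command line and the paths of the two CSV files documented in this section. The helper names and the counts file names (`selected_codoncounts.csv`, `mock_codoncounts.csv`) are hypothetical; only the flags and suffixes come from this page. The sketch builds the command without running it.

```python
import os
import shlex


def diffsel_command(name, sel, mock, outdir=None, indir=None, err=None):
    """Build the argument list for dms2_diffsel (hypothetical helper)."""
    cmd = ["dms2_diffsel", "--name", name, "--sel", sel, "--mock", mock]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    if indir is not None:
        cmd += ["--indir", indir]
    if err is not None:
        cmd += ["--err", err]
    return cmd


def expected_outputs(name, outdir="."):
    """Paths formed from the output prefix plus the two CSV suffixes."""
    prefix = os.path.join(outdir, name)
    return [prefix + suffix for suffix in ("_mutdiffsel.csv", "_sitediffsel.csv")]


cmd = diffsel_command(
    name="replicate-1",
    sel="selected_codoncounts.csv",   # hypothetical file name
    mock="mock_codoncounts.csv",      # hypothetical file name
    outdir="results",
)
print(shlex.join(cmd))
print(expected_outputs("replicate-1", outdir="results"))
```

To actually run the program, the list returned by `diffsel_command` could be passed to `subprocess.run(cmd, check=True)`, assuming dms2_diffsel is installed and on the PATH.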
Here are the specific output files:
Mutation differential selection file¶
This file has the suffix _mutdiffsel.csv.
It gives the differential selection for each mutation at each site, which is the \(s_{r,x}\) value defined in Equation (21) of the Differential selection section.
The mutation differential selection values are shown as NaN for the wildtype identity at a site.
If --mincount is greater than zero, the differential selection may also be undefined for some mutations due to low counts, and any such undefined differential selection values are also shown as NaN.
Here are the first and last few lines of a _mutdiffsel.csv file:
site,wildtype,mutation,mutdiffsel
156,G,S,8.20616611420281
157,K,S,7.843970369736835
146,N,D,7.839003488125009
157,K,I,7.636912452846413
153,S,I,7.618894947835398
...
560,Q,Q,NaN
561,C,C,NaN
562,R,R,NaN
563,I,I,NaN
564,C,C,NaN
565,I,I,NaN
Note that the file is sorted from largest to smallest mutation differential selection, with NaN values last.
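A file in this format can be read with Python's standard library alone. The sketch below parses a few of the rows shown above from an in-memory string (so it runs without any file on disk), converts the mutdiffsel column to floats, and drops the NaN entries. The function name is hypothetical, and sites are kept as text rather than converted to integers.

```python
import csv
import io
import math

# A few rows copied from the example above.
SAMPLE = """site,wildtype,mutation,mutdiffsel
156,G,S,8.20616611420281
157,K,S,7.843970369736835
560,Q,Q,NaN
"""


def read_mutdiffsel(handle):
    """Parse a _mutdiffsel.csv-style handle into a list of dicts."""
    rows = []
    for rec in csv.DictReader(handle):
        rows.append({
            "site": rec["site"],                      # kept as text
            "wildtype": rec["wildtype"],
            "mutation": rec["mutation"],
            "mutdiffsel": float(rec["mutdiffsel"]),   # "NaN" parses to nan
        })
    return rows


rows = read_mutdiffsel(io.StringIO(SAMPLE))
defined = [r for r in rows if not math.isnan(r["mutdiffsel"])]

# Because the file is sorted from largest to smallest, the first row is the top hit.
top = defined[0]
label = f"{top['wildtype']}{top['site']}{top['mutation']}"
print(label, top["mutdiffsel"])
```

For a real output file, `io.StringIO(SAMPLE)` would be replaced by an open file handle.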
Site differential selection file¶
This file has the suffix _sitediffsel.csv.
It gives several measures that summarize the differential selection at each site.
All values in the _sitediffsel.csv file can be calculated from the values in the _mutdiffsel.csv file, but both files are written for convenience.
Specifically, it gives the following quantities:
- abs_diffsel is the total mutation differential selection (both positive and negative) at a site, as defined in Equation (22) in the Differential selection section.
- positive_diffsel is the total positive mutation differential selection at a site, as defined in Equation (23) in the Differential selection section.
- negative_diffsel is the total negative mutation differential selection at a site, as defined in Equation (24) in the Differential selection section.
- max_diffsel is the maximum mutation differential selection at a site, as defined in Equation (25) in the Differential selection section.
- min_diffsel is the minimum mutation differential selection at a site, as defined in Equation (26) in the Differential selection section.
Here are the first and last lines of a _sitediffsel.csv file:
site,abs_diffsel,positive_diffsel,negative_diffsel,max_diffsel,min_diffsel
157,103.49320489879341,103.49320489879341,0.0,7.843970369736835,0.0
153,83.14675113695142,81.29729853121766,-1.8494526057337632,7.618894947835398,-1.5180701665199172
156,50.6281426671937,50.6281426671937,0.0,8.20616611420281,0.0
158,44.54232019822232,44.09637071052413,-0.4459494876981922,6.721775476391176,-0.3980554768143438
175,32.69746741931557,0.0,-32.69746741931557,0.0,-3.5655146998011067
...
274,16.78563070533967,3.9335896820646017,-12.85204102327507,2.5127254463901822,-1.5263088393038542
171,16.687556395725117,5.582191979487671,-11.105364416237446,5.286815935040484,-1.3671039380789383
229,16.589450103457924,0.0,-16.589450103457924,0.0,-2.4947190310966607
299,16.3418067952726,0.0,-16.3418067952726,0.0,-1.596540281615842
499,16.33889299748386,6.3537025445869935,-9.985190452896864,3.564608664594661,-2.218441347118039
If all mutations at a site have a mutation differential selection of NaN (which can be the case if --mincount is > 0), then the site differential selection values are reported as 0.
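Since the site-level values can be derived from the mutation-level ones, here is a minimal Python sketch of one way to do so. It is not the dms_tools2 code, and Equations (22) to (26) are not reproduced here. Two points are assumptions inferred from the example rows above: abs_diffsel equals positive_diffsel minus negative_diffsel, and max_diffsel and min_diffsel are clamped at zero (for example, site 175 has max_diffsel 0.0 although all its selection is negative). The function name and the input values are illustrative only.

```python
import math
from collections import defaultdict


def site_summaries(mut_rows):
    """Summarize (site, mutdiffsel) pairs per site; NaN values are ignored."""
    by_site = defaultdict(list)
    for site, value in mut_rows:
        by_site[site].append(value)

    summaries = {}
    for site, values in by_site.items():
        defined = [v for v in values if not math.isnan(v)]
        positive = sum((v for v in defined if v > 0), 0.0)
        negative = sum((v for v in defined if v < 0), 0.0)
        summaries[site] = {
            "abs_diffsel": positive - negative,
            "positive_diffsel": positive,
            "negative_diffsel": negative,
            # Clamping at zero is an assumption based on the example rows.
            "max_diffsel": max([0.0] + defined),
            "min_diffsel": min([0.0] + defined),
        }
    return summaries


# Made-up values for illustration, not taken from a real experiment.
nan = float("nan")
example = [("153", 7.5), ("153", 2.0), ("153", -1.5), ("153", nan),
           ("200", nan), ("200", nan)]
result = site_summaries(example)
print(result["153"])
print(result["200"])  # all-NaN site: every summary is 0.0
```

A site whose mutations are all NaN ends up with every summary equal to 0.0, matching the behavior described above.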