dms2_prefs

Overview

The dms2_prefs program processes files giving the number of observed counts of characters pre- and post-selection to estimate Amino-acid preferences.

If you have multiple replicates, you should probably use the dms2_batch_prefs program rather than running dms2_prefs directly.

Command-line usage

Estimate preferences from mutation counts. Part of dms_tools2 (version 2.6.6) written by the Bloom Lab.

usage: dms2_prefs [-h] [--outdir OUTDIR] [--ncpus NCPUS]
                  [--use_existing {yes,no}] [-v] [--method {ratio,bayesian}]
                  [--indir INDIR] [--chartype {codon_to_aa}]
                  [--excludestop {yes,no}] [--conc Cprefs Cmut Cerr]
                  [--pseudocount PSEUDOCOUNT] --pre PRE --post POST --name
                  NAME [--err ERRPRE ERRPOST]

Named Arguments

--outdir

Output files to this directory (create if needed).

--ncpus

Number of CPUs to use, -1 is all available.

Default: -1

--use_existing

Possible choices: yes, no

If files with names of expected output already exist, do not re-run.

Default: “no”

-v, --version

show program’s version number and exit

--method

Possible choices: ratio, bayesian

Method to estimate preferences: normalized enrichment ratios or Bayesian inference.

Default: “bayesian”

--indir

Input counts files in this directory.

This option can be useful if the counts files are found in a common directory. Instead of repeatedly listing that directory name, you can just provide it here.

--chartype

Possible choices: codon_to_aa

Characters for which preferences are estimated. codon_to_aa = amino acids from codon counts.

Default: “codon_to_aa”

--excludestop

Possible choices: yes, no

Exclude stop codons as a possible amino acid?

Default: “yes”

--conc

Concentration parameters for priors for --method bayesian. Priors are over preferences, mutagenesis rate, and error rate(s).

Default: [1, 1, 1]

--pseudocount

Pseudocount used with --method ratio.

Default: 1

--pre

Pre-selection counts file or prefix used when creating this file.

The counts files have the format of the files created by programs such as dms2_bcsubamp. Specifically, they must have the following columns: ‘site’, ‘wildtype’, and then a column for each possible character (e.g., codon).

--post

Like --pre but for post-selection counts.

--name

Name used for output files.

The Output files will have a prefix equal to the name specified here. This name should only contain letters, numbers, dashes, and spaces. Underscores are not allowed as they are a LaTex special character.

--err

Like --pre but for counts for error control(s) for --pre and --post. Specify same file twice for same control for both.

Output files

The output files all have the prefix specified by --outdir and --name. For instance, if you use --outdir results --name replicate-1, then the output files will have the prefix ./results/replicate-1 and the suffixes described below.

Here are the specific output files:

Log file

This file has the suffix .log. It is a text file that logs the progress of the program.

Preferences file

This file has the suffix _prefs.csv. It gives the estimate preference for each character at each site. For instance:

site,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y
1,0.05201542494870646,0.05006892743588129,0.04935929298955449,0.04561969825792428,0.05027584699870549,0.049473267870167204,0.04966234700207467,0.05035563145797863,0.05180290264027603,0.052502385368045426,0.05323583603885511,0.046134643822643394,0.051011235226750974,0.052905549301723524,0.04290757945900044,0.049609243914675485,0.06315116938502258,0.04129711732370402,0.051893451567002306,0.04671844899130815
2,0.009456820014302609,0.0887864717361851,0.03202933078899705,0.01243744054076179,0.012598634010240075,0.016079011606326327,0.14360828098089565,0.05270848041464168,0.05479041395821132,0.059805747035781634,0.10546956692273086,0.03356178777566189,0.012037714482087113,0.08503842285654405,0.017824696236264426,0.027336538959474643,0.05028936751898547,0.029519124422321352,0.01629464236061828,0.1403275073789687
3,0.094394784492365,0.033233499951948485,0.10037681454416572,0.041772952245424946,0.01871075286571138,0.010914843906391419,0.01994461441568695,0.09430640509845868,0.010261045290749045,0.050955385392754314,0.06764316761334091,0.06593302352530313,0.047625012474641924,0.017370598629944167,0.1082951339123566,0.04003184839931041,0.07144380858649375,0.026212403552398438,0.02646517359744569,0.05410873150510903
4,0.07817657215908004,0.03148741643399614,0.005538443259083886,0.018851757050952038,0.0034453072574090094,0.030655060310952557,0.03370373802129379,0.023488641120853936,0.05342118049856918,0.05175840113766944,0.2235830210977376,0.07104192962903758,0.03487046604114975,0.0796424680240337,0.052235719104467615,0.02309884775188897,0.05227025898510587,0.04266732483424344,0.04636513033841905,0.04369831694405645

Program run time

If you run dms_prefs with --method ratio then it will run very quickly.

If you run it with --method bayesian then the runtime will be somewhat longer due to the MCMC. Exactly how long depends on whether you are using error controls for the counts (the --err option). If you use different files for the pre- and post-selection error controls, and are using --chartype codon_to_aa then the program will typically take about 4 or 5 hours if you give it 4 CPUs. If you give it more CPUs, or using the same (or no) error control for pre- and post-selection, then it will be faster.