dms_correlate
Overview
dms_correlate is a program included with the dms_tools package. It computes the Pearson correlation between pairs of preferences, differential preferences, or differential selections.
After you install dms_tools, this program will be available to run at the command line.
Command-line usage
Determine and plot the Pearson correlation between pairs of preferences, differential selections, or differential preferences. This script is part of dms_tools (version 1.1.20) written by the Bloom Lab (see https://github.com/jbloomlab/dms_tools/graphs/contributors for all contributors). Detailed documentation is at http://jbloomlab.github.io/dms_tools/
usage: dms_correlate [-h] [--name1 NAME1] [--name2 NAME2] [--excludestop]
                     [--noplot] [--alpha ALPHA] [--markersize MARKERSIZE]
                     [--plot_title PLOT_TITLE] [--corr_on_plot] [--r2]
                     [--rms_dpi] [--pref_entropy] [--enrichment] [-v]
                     [--restrictdiffsel {None,positive,negative}]
                     file1 file2 outfileprefix
Positional Arguments
file1 | Existing file giving the first set of preferences, differential selections (either site or mutation), or differential preferences. Should be in the format of a preferences_file, a diffpreferences_file, or a site or mutation differential selection file returned by dms_diffselection. |
file2 | Existing file giving the second set of data; must be the same type of data, and in the same format, as “file1”. |
outfileprefix | Prefix for the names of the created output files (these files are overwritten if they already exist). The correlations are written to a file with this prefix and the suffix “.txt”. Unless you use the option “--noplot”, a scatter plot is written to a file with this prefix and the suffix “.pdf”. In the correlations text file, the first line gives the Pearson correlation coefficient in the format: “R = 0.5312”. The second line gives the P-value in the format: “P = 0.0000131”. The third line gives the number of points in the format: “N = 4200”. |
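The three-line text format described for the correlations file is simple enough to read back programmatically. The following is a minimal sketch, not part of dms_tools: the function name `parse_correlation_file` is hypothetical, and it assumes the file contains exactly the “R = …”, “P = …”, and “N = …” lines described above.

```python
import re

def parse_correlation_file(text):
    """Parse the 'R = ...', 'P = ...', 'N = ...' lines into a dict.

    Hypothetical helper (not a dms_tools function); it assumes the
    three-line format described for the ".txt" output file.
    """
    values = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([RPN])\s*=\s*(\S+)\s*$", line)
        if match:
            key, raw = match.groups()
            # N is a count of points; R and P are floating-point numbers.
            values[key] = int(raw) if key == "N" else float(raw)
    missing = {"R", "P", "N"} - values.keys()
    if missing:
        raise ValueError("missing fields: " + ", ".join(sorted(missing)))
    return values

# Example text taken from the format strings quoted above.
example = "R = 0.5312\nP = 0.0000131\nN = 4200\n"
result = parse_correlation_file(example)
print(result)
```

In practice the text would come from reading the file named by the output prefix plus “.txt”.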
Named Arguments
--name1 | Name used for the data in “file1” in the correlation plot. The string specified here uses LaTeX formatting; names with spaces should be enclosed in quotes. Default: “data 1” |
--name2 | Name used for the data in “file2” in the correlation plot. The string specified here uses LaTeX formatting; names with spaces should be enclosed in quotes. Default: “data 2” |
--excludestop | For amino-acid data, remove stop codons (denoted by “*”). Stop codons are removed only if this argument is specified. When this option is used, data for stop codons are removed by re-normalizing preferences to sum to one, and differential preferences to sum to zero. Default: False |
--noplot | Normally this script creates a PDF scatter plot. If this option is specified, then no such plot will be created. Default: False |
--alpha | The transparency (alpha value) for the points on the scatter plot. A value of 1.0 corresponds to no transparency; values close to zero give high transparency. Transparency (alpha < 1) might be helpful if you have many points on top of each other. Default: 0.1 |
--markersize | The size of the marker for the points on the scatter plot. Default: 4 |
--plot_title | Title put at the top of the correlation plot. The string specified here uses LaTeX formatting; titles with spaces should be enclosed in quotes. A value of “None” means no title. Default: “None” |
--corr_on_plot | If this option is used, then the correlation coefficient will be visually displayed on the plot. Default: False |
--r2 | If this option is used, the correlation coefficient displayed on the plot when using “--corr_on_plot” will be R-squared rather than R. Default: False |
--rms_dpi | If “file1” and “file2” specify differential preferences, this argument specifies that we compute the correlation between the root-mean-square (RMS) differential preference at each site rather than between the differential preferences themselves. Default: False |
--pref_entropy | If “file1” and “file2” specify preferences, this argument specifies that we compute the correlation between the site entropy of the preferences (logarithm base 2, so bits) at each site rather than between the preferences themselves. Default: False |
--enrichment | If this option is set, we plot the enrichment ratio for all mutations on a log scale rather than plotting the preferences. The computed correlations are then also for the log-transformed enrichment ratios. The enrichment ratio for the wildtype identity is always one, and so is not included. Default: False. The enrichment ratio $\phi_{r,x}$ for character $x$ at site $r$ is defined by $\phi_{r,x} = \pi_{r,x} / \pi_{r,\operatorname{wt}(r)}$, where $\pi_{r,x}$ is the preference of $r$ for $x$, and $\operatorname{wt}(r)$ is the wildtype identity at $r$. Note that we do not include enrichment ratios for wildtype characters (which are one by definition) in the correlations or plots, and that the enrichment ratios are log transformed before plotting and computing correlations. |
-v, --version | show program’s version number and exit |
--restrictdiffsel | Possible choices: None, positive, negative. Specify “positive” or “negative” to restrict the plotted correlation in site differential selection to positive or negative selection. Only meaningful if file1 and file2 are sitediffsel files. Default: “None” |
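To make the per-site transformations behind “--excludestop”, “--pref_entropy”, and “--enrichment” concrete, here is a small sketch for a single site. It is not the dms_tools implementation: the function names are hypothetical, the enrichment ratio is assumed to be a character's preference divided by the wildtype character's preference (consistent with the wildtype ratio being one), and log base 10 is an arbitrary choice for the log transform.

```python
import math

def drop_stop(prefs):
    """Remove the stop codon ("*") and re-normalize the rest to sum to one."""
    kept = {x: p for x, p in prefs.items() if x != "*"}
    total = sum(kept.values())
    return {x: p / total for x, p in kept.items()}

def site_entropy_bits(prefs):
    """Shannon entropy of one site's preferences, in bits (log base 2)."""
    return -sum(p * math.log2(p) for p in prefs.values() if p > 0)

def log_enrichment_ratios(prefs, wildtype):
    """log10 of pi_x / pi_wt for every non-wildtype character x.

    Assumes strictly positive preferences; the wildtype character is
    skipped because its ratio is one by definition.
    """
    pi_wt = prefs[wildtype]
    return {x: math.log10(p / pi_wt) for x, p in prefs.items() if x != wildtype}

# Hypothetical preferences for one site (nucleotide alphabet for brevity).
site = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
entropy = site_entropy_bits(site)
ratios = log_enrichment_ratios(site, wildtype="A")
print(entropy)
print(ratios)
```

Because changing the base of the logarithm only rescales every value by the same positive constant, the Pearson correlation of log-transformed enrichment ratios does not depend on the base chosen.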
Examples
Suppose that two experiments have measured site-specific preferences, stored in the files prefs1.txt and prefs2.txt. To correlate them, run the command:
dms_correlate prefs1.txt prefs2.txt pref1_vs_pref2_corr --name1 "sample 1" --name2 "sample 2"
This will create the files pref1_vs_pref2_corr.txt and pref1_vs_pref2_corr.pdf.
Here are the contents of pref1_vs_pref2_corr.txt:
R = 0.540113
P = 0
N = 11844
The scatter plot is written to pref1_vs_pref2_corr.pdf (image not shown).
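When more than two preference files are available, the same command can be generated for every pair. The sketch below only builds and prints the command lines; it uses just the positional arguments and the “--name1”, “--name2”, and “--corr_on_plot” options documented above. The helper name `build_commands`, the third file prefs3.txt, and the output-prefix naming scheme are assumptions made for illustration.

```python
import itertools
import os
import shlex

def build_commands(pref_files, corr_on_plot=True):
    """Return one dms_correlate argument list per unordered pair of files.

    Hypothetical helper: plot names and the output prefix are derived
    from the file names with their extensions removed.
    """
    commands = []
    for file1, file2 in itertools.combinations(pref_files, 2):
        name1 = os.path.splitext(os.path.basename(file1))[0]
        name2 = os.path.splitext(os.path.basename(file2))[0]
        cmd = ["dms_correlate", file1, file2, f"{name1}_vs_{name2}_corr",
               "--name1", name1, "--name2", name2]
        if corr_on_plot:
            cmd.append("--corr_on_plot")
        commands.append(cmd)
    return commands

# prefs3.txt is a made-up third replicate for this example.
commands = build_commands(["prefs1.txt", "prefs2.txt", "prefs3.txt"])
for cmd in commands:
    print(shlex.join(cmd))
```

Once dms_tools is installed, each argument list could be passed to `subprocess.run(cmd, check=True)` to run the comparisons.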