dms_correlate

Overview

dms_correlate is a program included with the dms_tools package. It computes the correlations between pairs of preferences, differential preferences, or differential selections.

After you install dms_tools, this program will be available to run at the command line.

Command-line usage

Determine and plot the Pearson correlation between pairs of preferences, differential selections, or differential preferences. This script is part of dms_tools (version 1.1.20) written by the Bloom Lab (see https://github.com/jbloomlab/dms_tools/graphs/contributors for all contributors). Detailed documentation is at http://jbloomlab.github.io/dms_tools/

usage: dms_correlate [-h] [--name1 NAME1] [--name2 NAME2] [--excludestop]
                     [--noplot] [--alpha ALPHA] [--markersize MARKERSIZE]
                     [--plot_title PLOT_TITLE] [--corr_on_plot] [--r2]
                     [--rms_dpi] [--pref_entropy] [--enrichment] [-v]
                     [--restrictdiffsel {None,positive,negative}]
                     file1 file2 outfileprefix

Positional Arguments

file1

Existing file giving first set of preferences, differential selections (either site or mutation), or differential preferences.

Should be in the formats of a preferences_file, a diffpreferences_file, or a site or mutation differential selection file returned by dms_diffselection.

file2

Existing file giving the second set of data; must be the same type of data as “file1”.

Must be in the same format as “file1”.

outfileprefix

Prefix for the names of the created output files (these files are overwritten if they already exist). The correlations are written to a file with this prefix and the suffix “.txt”. Unless you use the option “--noplot”, a scatter plot is written to a file with this prefix and the suffix “.pdf”.

In the correlations text file, the first line gives the Pearson correlation coefficient in the format: “R = 0.5312”. The second line gives the P-value in the format: “P = 0.0000131”. The third line gives the number of points in the format: “N = 4200”.
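As a minimal sketch of reading the correlations text file from your own scripts, the following Python parses the three documented lines. The function name parse_correlation_text is hypothetical and not part of dms_tools; the sketch assumes only the “R = …”, “P = …”, “N = …” layout described above.

```python
import re


def parse_correlation_text(text):
    """Parse 'R = ...', 'P = ...' and 'N = ...' lines into a dict.

    Hypothetical helper (not part of dms_tools); it assumes only the
    three-line layout described in the documentation.
    """
    values = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([RPN])\s*=\s*(\S+)\s*$", line)
        if match:
            key, raw = match.groups()
            # N is a count of points; R and P are floating-point numbers.
            values[key] = int(raw) if key == "N" else float(raw)
    return values


example = "R = 0.5312\nP = 0.0000131\nN = 4200\n"
result = parse_correlation_text(example)
print(result)
```

In practice you would pass the contents of the “.txt” file created for your outfileprefix instead of the example string.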

Named Arguments

--name1

Name used for the data in “file1” in the correlation plot. The string specified here uses LaTeX formatting; names with spaces should be enclosed in quotes.

Default: “data 1”

--name2

Name used for the data in “file2” in the correlation plot. The string specified here uses LaTeX formatting; names with spaces should be enclosed in quotes.

Default: “data 2”

--excludestop

If the data are for amino acids, specify this option to remove stop codons (denoted by “*”); stop codons are kept unless this option is given. When stop codons are removed, preferences are re-normalized to sum to one and differential preferences are re-normalized to sum to zero.

Default: False
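The re-normalization of preferences can be illustrated with a short Python sketch. This is an illustration under assumptions rather than the dms_tools implementation: the function name is hypothetical, one site's preferences are assumed to be held in a dict keyed by character, and the differential-preference case is not shown because the text does not say how the zero sum is restored.

```python
def remove_stop_and_renormalize(prefs):
    """Drop '*' from one site's preferences and rescale the rest to sum to one.

    Hypothetical helper for illustration; not the dms_tools code.
    """
    kept = {aa: p for aa, p in prefs.items() if aa != "*"}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("remaining preferences must have a positive sum")
    return {aa: p / total for aa, p in kept.items()}


# Toy site with only two amino acids plus the stop codon.
site_prefs = {"A": 0.5, "C": 0.3, "*": 0.2}
renormalized = remove_stop_and_renormalize(site_prefs)
print(renormalized)
```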

--noplot

Normally this script creates a PDF scatter plot. If this option is specified, then no such plot will be created.

Default: False

--alpha

The transparency (alpha value) for the points on the scatter plot. A value of 1.0 corresponds to no transparency; values close to zero give high transparency. Transparency (alpha < 1) might be helpful if you have many points on top of each other.

Default: 0.1

--markersize

The size of the marker for the points on the scatter plot.

Default: 4

--plot_title

Title put at the top of the correlation plot. The string specified here uses LaTeX formatting; titles with spaces should be enclosed in quotes. A value of “None” means no title.

Default: “None”

--corr_on_plot

If this option is used, then the correlation coefficient will be visually displayed on the plot.

Default: False

--r2

If this option is used, the correlation coefficient displayed on the plot when using “--corr_on_plot” will show R-squared rather than the R value.

Default: False

--rms_dpi

If “file1” and “file2” specify differential preferences, this argument specifies that we compute the correlation between the root-mean-square (RMS) differential preference at each site rather than between the differential preferences themselves.

Default: False

--pref_entropy

If “file1” and “file2” specify preferences, this argument specifies that we compute the correlation between the site entropy of the preferences (logarithm base 2, so bits) at each site rather than between the preferences themselves.

Default: False
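For orientation, a site entropy in bits can be sketched in Python as below. This assumes the standard Shannon entropy formula with logarithm base 2; the function name and the dict layout for a site's preferences are hypothetical and are not taken from dms_tools.

```python
import math


def site_entropy_bits(prefs):
    """Shannon entropy (base 2, so bits) of one site's preferences.

    Hypothetical helper; terms with zero preference contribute nothing.
    """
    return -sum(p * math.log2(p) for p in prefs.values() if p > 0)


# Four equally preferred characters give the maximum entropy of 2 bits.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
uniform_entropy = site_entropy_bits(uniform)
print(uniform_entropy)
```

With this option, each site contributes one point (its entropy in each file) to the correlation rather than one point per character.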

--enrichment

If this option is set, we plot the enrichment ratio for all mutations on a log scale rather than plotting the preferences. The computed correlations are also then for the log-transformed enrichment ratios. The enrichment ratio for the wildtype identity is always one, and so is not included.

Default: False

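The enrichment ratio $\phi_{r,x}$ for character $x$ at site $r$ is defined by

$$\phi_{r,x} = \frac{\pi_{r,x}}{\pi_{r,\operatorname{wt}(r)}}$$

where $\pi_{r,x}$ is the preference of $r$ for $x$, and $\operatorname{wt}(r)$ is the wildtype identity at $r$.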

Note that we do not include enrichment ratios for wildtype characters (which are one by definition) in the correlations or plots, and that the enrichment ratios are log transformed before plotting and computing correlations.
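A small Python sketch of this transformation follows, taking the enrichment ratio to be the preference for a character divided by the preference for the wildtype character at that site. The function name is hypothetical, and the base-2 logarithm is an assumption (the text says only that a log scale is used); the Pearson correlation itself does not depend on the base, since changing base only rescales every value by a constant.

```python
import math


def log_enrichment_ratios(prefs, wildtype):
    """Return log2 enrichment ratios for the non-wildtype characters at one site.

    Hypothetical helper; assumes all preferences are strictly positive.
    The wildtype character is skipped because its ratio is one by definition.
    """
    wt_pref = prefs[wildtype]
    return {
        char: math.log2(p / wt_pref)
        for char, p in prefs.items()
        if char != wildtype
    }


site_prefs = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
ratios = log_enrichment_ratios(site_prefs, wildtype="A")
print(ratios)
```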

-v, --version

Show program’s version number and exit.

--restrictdiffsel

Possible choices: None, positive, negative

Specify “positive” or “negative” to restrict plotted correlation in site differential selection to positive or negative selection. Only meaningful if file1 and file2 are sitediffsel files.

Default: “None”

Examples

Imagine that you have measured site-specific preferences in two experiments, and that the results are in the files prefs1.txt and prefs2.txt. Then run the command:

dms_correlate prefs1.txt prefs2.txt pref1_vs_pref2_corr --name1 "sample 1" --name2 "sample 2"

This will create the files pref1_vs_pref2_corr.txt and pref1_vs_pref2_corr.pdf.
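If you drive the program from a Python script rather than typing the command, one possible approach is to assemble the argument list and hand it to subprocess. The helper below is a hypothetical sketch; it uses only the positional arguments and the “--name1”, “--name2”, and “--noplot” options listed in the usage above.

```python
import shlex


def build_dms_correlate_command(file1, file2, outfileprefix,
                                name1=None, name2=None, noplot=False):
    """Assemble the dms_correlate argument list (hypothetical helper)."""
    cmd = ["dms_correlate", file1, file2, outfileprefix]
    if name1 is not None:
        cmd += ["--name1", name1]
    if name2 is not None:
        cmd += ["--name2", name2]
    if noplot:
        cmd.append("--noplot")
    return cmd


command = build_dms_correlate_command(
    "prefs1.txt", "prefs2.txt", "pref1_vs_pref2_corr",
    name1="sample 1", name2="sample 2",
)
# Show a shell-quoted form of the command for logging.
print(shlex.join(command))
```

With dms_tools installed, the list can be run with subprocess.run(command, check=True). Passing a list avoids shell quoting problems for names that contain spaces.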

Here are the contents of pref1_vs_pref2_corr.txt:

R = 0.540113
P = 0
N = 11844

Here is an image of pref1_vs_pref2_corr.pdf:

[Image: pref1_vs_pref2_corr.pdf]