Data file formats

Below is information about the formats of the input and output files used by the programs in dms_tools.

Deep mutational scanning counts file

The programs dms_inferprefs and dms_inferdiffprefs take their input data from files giving the counts of each possible character (codon, nucleotide, or amino acid) at each site in the gene. You will need to generate these files after using some other program to analyze your deep sequencing data.

Here is an example of the first few lines of a valid file giving nucleotide counts:

# POSITION WT A T C G
1 A 310818 13 0 37
2 T 16 311782 37 4
3 G 27 27 11 312520
4 A 313647 13 4 31

A file giving codon counts would be structured similarly, except that rather than having the four nucleotide identities (A, T, C, G), the file should have the 64 codon identities (AAA, AAC, AAG, …). A file giving amino-acid counts would have the 20 amino-acid identities, possibly also including stop codons indicated by *. The file is allowed to have additional lines and columns, but they are ignored as detailed below. For instance, this file would be handled just like the one above (the additional lines and the columns for a and N would be ignored, and the different order of characters is fine, as is the additional space between columns):

# Line ignored because it begins with # but isn't followed by POSITION WT
#
# The columns headed N and a are ignored since they are not valid upper-case nucleotides.
# The fact that the nucleotides are not in alphabetical order is irrelevant.
# POSITION WT G A T C N a
1 A 37 310818 13 0 0 10
2 T 4 16 311782 37 1 1
# The extra spaces between columns in the next line are irrelevant
3  G   312520   27 27  11 0 0
4 A 31 313647 13 4 50 49

Specifically, the requirements for the file are as follows:

  • Any lines beginning with # are treated as comments or headers, and do not contain data.

  • Columns are whitespace delimited. They can be separated by one or more spaces, or by one or more tabs.

  • Before the first data line (a line not beginning with #) there must be a header line beginning with # POSITION WT and then followed in arbitrary order by entries specifying valid DNA nucleotides, valid DNA codons, or valid amino acids. The order of the entries in this header line establishes the order of the data in the subsequent lines.

  • Case matters: all nucleotides, amino acids, and codons must be upper case.

  • It is OK to intersperse columns that do not represent valid codons, amino acids, or nucleotides as long as all of the codons, amino acids, or nucleotides are present. This might be useful if your file uses N to denote ambiguous nucleotides (which are ignored as not being definitive counts), or uses lower-case letters (such as a) to denote unsure calls.

  • For every data line, the first two entries will be POSITION and WT:

    • POSITION is the position number. These do not have to be consecutive integers, so it is OK to skip sites (e.g. 1, 2, 5) or have negative site numbers (e.g. -1, 0, 1) or have sites with letter suffixes (e.g. 1, 2, 2A, 3). All of these options are useful when referring to numbering schemes that are non-sequential, as is sometimes the case in gene numbering. Note however that each position number must be unique.
    • WT must give a valid wildtype identity of the site as a nucleotide, amino acid, or codon or be ? if no wildtype is specified.

The *_codoncounts.txt and *_ntcounts.txt files generated by mapmuts represent valid deep mutational scanning counts input files.
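The requirements above can be illustrated with a small reader. This is only a sketch written for this document, not the parser used by dms_tools: the function name read_counts, the chartype labels, and the choice to skip blank lines are all assumptions. Positions are kept as strings so that labels such as 2A or -1 survive unchanged.

```python
import itertools

NTS = ["A", "C", "G", "T"]
REQUIRED = {
    "nucleotide": NTS,
    "codon": ["".join(c) for c in itertools.product(NTS, repeat=3)],
    "aminoacid": list("ACDEFGHIKLMNPQRSTVWY"),
}


def read_counts(lines, chartype="nucleotide"):
    """Return {position: (wildtype, {character: count})} for a counts file.

    Hypothetical helper for illustration; not part of dms_tools.
    """
    required = REQUIRED[chartype]
    # Stop codons (*) are optional in amino-acid files.
    allowed = set(required) | ({"*"} if chartype == "aminoacid" else set())
    columns = None  # column index -> character, set by the header line
    counts = {}
    for line in lines:
        if line.startswith("#"):
            fields = line[1:].split()
            if fields[:2] == ["POSITION", "WT"]:
                # Keep only columns naming valid upper-case characters.
                columns = {i: c for i, c in enumerate(fields) if c in allowed}
                missing = set(required) - set(columns.values())
                if missing:
                    raise ValueError(f"header lacks columns: {sorted(missing)}")
            continue  # other comment lines carry no data
        entries = line.split()  # runs of spaces or tabs both delimit columns
        if not entries:
            continue  # assumption: blank lines are skipped
        if columns is None:
            raise ValueError("data line found before '# POSITION WT' header")
        position, wt = entries[0], entries[1]
        if position in counts:
            raise ValueError(f"duplicate position {position}")
        if wt != "?" and wt not in allowed:
            raise ValueError(f"invalid wildtype {wt!r} at position {position}")
        counts[position] = (wt, {c: int(entries[i]) for i, c in columns.items()})
    return counts


example = """\
# An ordinary comment line
# POSITION WT G A T C N a
1 A 37 310818 13 0 0 10
3  G   312520   27 27  11 0 0
""".splitlines()
parsed = read_counts(example)
print(parsed["3"])
```

Because the header fixes the column order, the N and a columns are dropped simply by never being entered into the column map.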

Preferences file

The program dms_inferprefs infers the site-specific preferences \(\pi_{r,x}\) and writes these to a file. Here is an example of the format of one of these preferences files for nucleotide characters:

# POSITION WT SITE_ENTROPY PI_A PI_C PI_G PI_T PI_A_95 PI_C_95 PI_G_95 PI_T_95
1 G 1.61 0.21 0.09 0.58 0.12 0.17,0.28 0.04,0.12 0.51,0.69 0.090,0.150

Specifically, there is a header line beginning with # that gives the column headers. Columns are whitespace delimited (by one or more spaces or tabs). The columns are as follows:

  • The first column gives the position number, using the same numbering as in the Deep mutational scanning counts files that were analyzed to infer the preferences.
  • The second column WT gives the wildtype identity at that site.
  • The third column gives the site entropy in bits, computed from the preferences using the formula \(h_r = -\sum\limits_x \pi_{r,x} \log_2 \pi_{r,x}\) (so higher site entropies indicate sites that tolerate more characters, for instance more amino acids).
  • The next set of columns gives the preferences for each character (PI_A for \(\pi_{r,A}\), etc).
  • Optionally, there is a last set of columns giving the 95% median-centered credible interval for each preference, with the lower and upper bounds separated by a comma with no spaces.

If the preferences are for amino acids, the format is the same except there are now 20 or 21 columns of the form PI_A, PI_C, PI_D, etc. There are 20 if we are excluding stop codons, and 21 if we are including stop codons. Stop codons are indicated by *.

If the preferences are for codons, the format is the same except there are now 64 columns of the form PI_AAA, PI_AAC, PI_AAG, …, PI_TTG, PI_TTT.

When the credible intervals are present, these columns are named PI_A_95, etc.
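As a rough illustration of this layout, the sketch below splits the example line above into its parts and recomputes the site entropy from the preferences. The helper names parse_prefs_line and site_entropy are made up for this example and are not dms_tools functions; the code assumes the column names follow the PI_x and PI_x_95 pattern described above.

```python
import math

header = "# POSITION WT SITE_ENTROPY PI_A PI_C PI_G PI_T PI_A_95 PI_C_95 PI_G_95 PI_T_95"
line = "1 G 1.61 0.21 0.09 0.58 0.12 0.17,0.28 0.04,0.12 0.51,0.69 0.090,0.150"


def parse_prefs_line(header, line):
    """Split one preferences data line into its parts (hypothetical helper)."""
    names = header.lstrip("#").split()
    row = dict(zip(names, line.split()))
    prefs, intervals = {}, {}
    for name, value in row.items():
        if not name.startswith("PI_"):
            continue
        if name.endswith("_95"):
            # Credible interval: lower and upper bound joined by a comma.
            lower, upper = value.split(",")
            intervals[name[3:-3]] = (float(lower), float(upper))
        else:
            prefs[name[3:]] = float(value)
    return row["POSITION"], row["WT"], float(row["SITE_ENTROPY"]), prefs, intervals


def site_entropy(prefs):
    """Site entropy in bits: h_r = -sum_x pi_{r,x} * log2(pi_{r,x})."""
    return -sum(p * math.log2(p) for p in prefs.values() if p > 0)


position, wt, entropy, prefs, intervals = parse_prefs_line(header, line)
print(f"reported {entropy}, recomputed {site_entropy(prefs):.2f}")
```

For the example line, the recomputed entropy agrees with the reported SITE_ENTROPY to the two decimals shown in the file.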

Differential preferences file

The program dms_inferdiffprefs infers the differential preferences \(\Delta\pi_{r,x}\) and writes these to a file. Here is an example of the format of one of these files for nucleotide characters:

# POSITION WT RMS_dPI dPI_A dPI_C dPI_G dPI_T Pr_dPI_A_lt0 Pr_dPI_A_gt0 Pr_dPI_C_lt0 Pr_dPI_C_gt0 Pr_dPI_G_lt0 Pr_dPI_G_gt0 Pr_dPI_T_lt0 Pr_dPI_T_gt0
1 G 0.241 0.02 -0.4 0.2 0.18 0.41 0.59 0.90 0.10 0.2 0.8 0.3 0.7

Specifically, there is a header line beginning with # that gives the column headers. Columns are whitespace delimited (by one or more spaces or tabs). The columns are as follows:

  • The first column gives the position number, using the same numbering as in the Deep mutational scanning counts files that were analyzed to infer the differential preferences.
  • The second column WT gives the wildtype identity at the site or is ? if no wildtype identity is specified.
  • The third column gives the root-mean-square differential preference for that site, \(\mathrm{RMS}_{\Delta\pi_r} = \sqrt{\frac{1}{\mathcal{N}_x}\sum\limits_x \left(\Delta\pi_{r,x}\right)^2}\), where \(\mathcal{N}_x\) is the number of characters \(x\) (4 in the nucleotide example above).
  • The next set of columns gives the differential preferences for each character (dPI_A for \(\Delta\pi_{r,A}\), etc).
  • Optionally, there is a last set of columns giving, for each character, the posterior probability that its differential preference is less than zero (Pr_dPI_A_lt0, etc.) or greater than zero (Pr_dPI_A_gt0, etc.).

If the differential preferences are for amino acids, the format is the same except there are now 20 or 21 columns of the form dPI_A, dPI_C, dPI_D, etc. There are 20 if we are excluding stop codons, and 21 if we are including stop codons. Stop codons are indicated by *.

If the differential preferences are for codons, the format is the same except there are now 64 columns of the form dPI_AAA, dPI_AAC, dPI_AAG, …, dPI_TTG, dPI_TTT.
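The relationship between the RMS_dPI column and the dPI_x columns can be checked on the nucleotide example line above. The sketch below is illustrative only: the helper name rms_dpi is hypothetical, and the code assumes that the differential-preference columns are exactly those whose names start with dPI_.

```python
import math

header = ("# POSITION WT RMS_dPI dPI_A dPI_C dPI_G dPI_T "
          "Pr_dPI_A_lt0 Pr_dPI_A_gt0 Pr_dPI_C_lt0 Pr_dPI_C_gt0 "
          "Pr_dPI_G_lt0 Pr_dPI_G_gt0 Pr_dPI_T_lt0 Pr_dPI_T_gt0")
line = "1 G 0.241 0.02 -0.4 0.2 0.18 0.41 0.59 0.90 0.10 0.2 0.8 0.3 0.7"

names = header.lstrip("#").split()
row = dict(zip(names, line.split()))

# Differential preferences keyed by character, e.g. {"A": 0.02, ...}.
dpi = {n[4:]: float(v) for n, v in row.items() if n.startswith("dPI_")}
# Optional posterior probabilities keyed by (character, "lt0" or "gt0").
posteriors = {
    tuple(n[7:].rsplit("_", 1)): float(v)
    for n, v in row.items()
    if n.startswith("Pr_dPI_")
}


def rms_dpi(dpi):
    """Root-mean-square differential preference over all characters."""
    return math.sqrt(sum(d ** 2 for d in dpi.values()) / len(dpi))


print(f"reported {row['RMS_dPI']}, recomputed {rms_dpi(dpi):.3f}")
```

Dividing by the number of characters (here 4) before taking the square root is what reproduces the reported value for this line.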

Subassembly file

This file format matches barcodes to subassembled variants. It is the primary output file of dms_subassemble.

Each line has three space-delimited entries:

  1. The first entry is the barcode.
  2. The second entry is the sequence corresponding to that barcode.
  3. The third entry is a comma-separated list (no spaces) of the mutations in the sequence relative to the wildtype parent. If there are no mutations, this entry is no_mutations. Sites are numbered sequentially (1, 2, …) when naming these mutations. Note that if the mutations are codon mutations, the numbers refer to the codon number, not the nucleotide number.

The following restrictions apply:

  • All barcodes must be the same length.
  • Each barcode must be unique.
  • The second and third entries (the sequence of the subassembled variant and the mutations) must correspond to the same wildtype sequence.

Here are a few example lines:

TATTACATCTGCCCCCAA ATGGCTATCGACGAAAACAAACAGAAAGCGTTGGCGGCAGCACTGGGCCAGATTGAGAAACAATTTGGTAAAGGCTCCATCATGCGCCTGGGTGAAGACCGTTCCATGGATGTGGAAACCATCTCTACCGGTTCGCTTTCACTGGATATCGCGCTTGGGGCAGGTGGTCTGCCGATGGGCCGTATCGTCGAAATCTACGGACCGGAATCTTCCGGTAAAACCACGCTGACGCTGCAGGTGATCGCCGCAGCGCAGCGTGAAGGTAAAACCTGTGCGTTTATCGATGCTGAACACGCGCTGGACCCAATCTACGCACGTAAACTGGCAGTCGATATCGACAACCTGCTGTGCTCCCAGCCGGACACCGGCGAGCAGGCACTGGAAATCTGTGACGCCCTGGCGCGTTCTGGCGCAGTAGACGTTATCGTCGTTGACTCCGTGGCGGCACTGACGCCGAAAGCGGAAATCGAAGGCGAAATCGGCGACTCTCATATGGGCCTTGCGGCACGTATGATGAGCCAGGCGATGCGTAAGCTGGCGGGTAACCTGAAGCAGTCCAACACGCTGCTGATCTTCATCAACCAGATCCGTATGAAAATTGGTGTGATGTTCGGCAACCCGGAAACCACTACCGGTGGTAACGCGCTGAAATTCTACGCCTCTGTTCGTCTCGACATCCGTCGTATCGGCGCGGTGAAAGAGGGCGAAAACGTGGTGGGTAGCGAAACCCGCGTGAAAGTGGTGAAGAACAAAATCGCTGCGCCGTTTAAACAGGCTGAATTCCAGATCCTCTACGGCGAAGGTATCAACTTCTACGGCGAACTGGTTGACCTGGGCGTAAAAGAGAAGCTGATCGAGAAAGCAGGCGCGTGGTACAGCTACAAAGGTGAGAAGATCGGTCAGGGTAAAGCGAATGCGACTGCCTGGCTGAAAGATAACCCGGAAACCGCGAAAGAGATCGAGAAGAAAGTACGTGAGTTGCTGCTGAGCAACCCGAACTCAACGCCGGATTTCTCTGTAGATGATAGCGAAGGCGTAGCAGAAACTAACGAAGATTTTTAA GGC109GCA
AACTCTGTGTTCCCATCA ATGGCTATCGACGAAAACAAACAGAAAGCGTTGGCGGCAGCACTGGGCCAGATTGAGAAACAATTTGGTAAAGGCTCCATCATGCGCCTGGGTGAAGACCGTTCCATGGATGTGGAAACCATCTCTACCGGTTCGCTTTCACTGGATATCGCGCTTGGGGCAGGTGGTCTGCCGATGGGCCGTATCGTCGAAATCTACGGACCGGAATCTTCCGGTAAAACCACGCTGACGCTGCAGGTGATCGCCGCAGCGCAGCGTGAAGGTAAAACCTGTGCGTTTATCGATGCTGAACACGCGCTGGACCCAATCTACGCACGTAAACTGGGCGTCGATATCGACAACCTGCTGTGCTCCCAGCCGGACACCGGCGAGCAGGCACTGGAAATCTGTGACGCCCTGGCGCGTTCTGGCGCAGTAGACGTTATCGTCGTTGACTCCGTGGCGGCACTGACGCCGAAAGCGGAAATCGAAGGCGAAATCGGCGACTCTCATATGGGCCTTGCGGCACGTATGATGAGCCAGGCGATGCGTAAGCTGGCGGGTAACCTGAAGCAGTCCAACACGCTGCTGATCTTCATCAACCAGATCCGTATGAAAATTGGTGTGATGTTCGGCAACCCGGAAACCACTACCGGTGGTAACGCGCTGAAATTCTACGCCTCTGTTCGTCTCGACATCCGTCGTATCGGCGCGGTGAAAGAGGGCGAAAACGTGGTGGGTAGCGAAACCCGCGTGAAAGTGGTGAAGAACAAAATCGCTGCGCCGTTTAAACAGGCTGAATTCCAGATCCTCTACGGCGAAGGTATCAACTTCTACGGCGAACTGGTTGACCTGGGCGTAAAAGAGAAGCTGATCGAGAAAGCAGGCGCGTGGTACAGCTACAAAGGTGAGAAGATCGGTCAGGGTAAAGCGAATGCGACTGCCTGGCTGAAAGATAACCCGGAAACCGCGAAAGAGATCGAGAAGAAAGTACGTGAGTTGCTGCTGAGCAACCCGAACTCAACGCCGGATTTCTCTGTAGATGATAGCGAAGGCGTAGCAGAAACTAACGAAGATTTTTAA no_mutations
TCTAGGGTGCGATCTATT ATGGCTATCGACGAAAACAAACAGAAAGCGTTGGCGGCAGCACTGGGCCAGATTGAGAAACAATTTGGTAAAGGCTCCATCATGCGCCTGGGTGAAGACTCATCCATGGATGTGGAAACCATCTCTACCGGTTCGCTTTCACTGGATATCGCGCTTGGGGCAGGTGGTCTGCCGATGGGCCGTATCGTCGAAATCTACGGACCGGAATCTTCCGGTAAAACCACGCTGACGCTGCAGGTGATCGCCGCAGCGCAGCGTGAAGGTAAAACCTGTGCGTTTATCGATGCTGAACACGCGCTGGACCCAATCTACGCACGTAAACTGGGCGTCGATATCGACAACCTGCTGTGCTCCCAGCCGGACACCGGCGAGCAGGCACTGGAAATCTGTGACGCCCTGGCGCGTTCTGGCGCAGTAGACGTTATCGTCGTTGACTCCGTGGCGGCACTGACGCCGAAAGCGGAAATCGAAGGCGAAATCGGCGACTCTCATATGGGCCTTGCGGCACGTATGATGAGCCAGGCGATGCGTAAGCTGGCGGGTAACCTGAAGCAGTCCAACACGCTGCTGATCTTCATCAACCAGATCCGTATGAAAATTGGTGTGATGTTCGGCAACCCGGAAACCACTACCGGTGGTAACGCGCTGAAATTCTACGCCTCTGTTCGTCTCGACATCCGTCGTATCGGCGCGGTGAAAGAGGGCGAAAACGTGGTGGGTAGCGAAACCCGCGTGAAAGTGGTGAAGAACAAAATCGCTGCGCCGTTTAAACAGGCTGAATTCCAGATCCTCTACGGCGAAGGTATCAACTTCTACGGCGAACTGGTTGACCTGGGCGTAAAAGAGAAGCTGATCGAGAAAGCAGGCGCGTGGTACAGCTACAAAGGTGAGAAGATCGGTCAGGGTAAAGCGAATGCGACTGCCTGGCTGAAAGATAACCCGGAAACCGCGAAAGAGATCGAGAAGAAAGTACGTGAGTTGCTGCTGAGCAACCCGAACTCAACGCCGGATTTCTCTGTAATGGATAGCGAAGGCGTAGCAGAAACTAACGAAGATTTTTAA CGT34TCA,GAT341ATG
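The restrictions above can be checked mechanically. The following is a minimal sketch, not the dms_subassemble implementation: the helper names are hypothetical, the two input lines are a shortened toy example made up for illustration, and the code assumes each mutation is written as wildtype, 1-based site, mutant (as in GGC109GCA) with wildtype and mutant of equal length. Undoing the listed mutations in each variant should then yield one shared wildtype sequence.

```python
import re

MUTATION = re.compile(r"^([A-Z*]+)(\d+)([A-Z*]+)$")


def read_subassembly(lines):
    """Return {barcode: (sequence, [mutations])}; hypothetical helper."""
    variants = {}
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines are skipped
        barcode, sequence, muts = line.split()
        if barcode in variants:
            raise ValueError(f"duplicate barcode {barcode}")
        mutations = [] if muts == "no_mutations" else muts.split(",")
        variants[barcode] = (sequence, mutations)
    if len({len(b) for b in variants}) > 1:
        raise ValueError("barcodes differ in length")
    return variants


def revert_to_wildtype(sequence, mutations):
    """Undo each listed mutation to recover the implied wildtype sequence."""
    chars = list(sequence)
    for mutation in mutations:
        match = MUTATION.match(mutation)
        if match is None:
            raise ValueError(f"cannot parse mutation {mutation!r}")
        wt, site, mutant = match.groups()
        # Codon 2 starts at nucleotide index 3, and so on.
        start = (int(site) - 1) * len(wt)
        if sequence[start:start + len(mutant)] != mutant:
            raise ValueError(f"{mutation} does not match the sequence")
        chars[start:start + len(wt)] = wt
    return "".join(chars)


# Toy lines for illustration only (real barcodes and genes are much longer).
toy = [
    "ACGT ATGGCTATC no_mutations",
    "TTAG ATGGCAATC GCT2GCA",
]
variants = read_subassembly(toy)
wildtypes = {revert_to_wildtype(seq, muts) for seq, muts in variants.values()}
print(wildtypes)
```

If the set of recovered wildtypes has more than one member, the second and third entries of some line disagree about the parent sequence.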