Python API

dms_tools is written in Python, so if you install the package you can access the following API for the package via Python.

dms_tools package

This package contains utilities for analyzing deep mutational scanning (DMS) data.

The package webpage is https://github.com/jbloom/dms_tools

The package is typically run via a series of command line scripts. However, it also provides the following Python API.

Modules in the package

  • mcmc : Perform MCMC inference
  • parsearguments : Parses command line arguments to scripts.
  • file_io : File input / output in dms_tools formats.
  • plot : make plots
  • weblogo : use weblogolib to make logo plots.
  • utils : assorted utilities for dms_tools
  • simulate : simulate data to test inferences.

Package-level variables

The following variables are defined for this package:

  • aminoacids : list of the 20 amino acids in one-letter upper-case code.
  • aminoacids_nostop : another name for aminoacids.
  • aminoacids_withstop : like aminoacids but also includes the stop codon (*).
  • nts : a list of the four upper-case DNA nucleotides.
  • codons : a list of the 64 codons as upper-case three-letter strings.
  • codon_to_aa : dictionary keyed by codons in codons, values are amino acids.
  • aa_to_codons : dictionary keyed by all amino acids in aminoacids, values list of all encoding codons.

mcmc module

Module that performs MCMC to infer preferences and differential preferences.

MCMC done using Stan (http://mc-stan.org/) via pystan.

Written by Jesse Bloom.

Functions in this module

  • InferSitePreferences : use MCMC to infer site-specific preferences.
  • InferSitePreferencesFromEnrichmentRatios :infers site-specific preferences directly from enrichment ratios.
  • InferSiteDiffPreferencesFromEnrichmentRatios :infers site-specific differential preferences directly from enrichment ratios.
  • InferSiteDiffPrefs : use MCMC to infer site-specific differential preferences.
  • StanModelNoneErr : pystan model used by InferSitePreferences.
  • StanModelSameErr : pystan model used by InferSitePreferences.
  • StanModelDifferentErr : pystan model used by InferSitePreferences.
  • StanModelDiffPrefNoErr : pystan model used by InferSiteDiffPrefs.
  • StanModelDiffPrefSameErr : pystan model used by InferSiteDiffPrefs.

Function documentation

mcmc.InferSiteDiffPreferencesFromEnrichmentRatios(characterlist, wtchar, error_model, counts, pseudocounts=1)

Infers site-specific differential preferences from enrichment ratios.

This function mirrors the operations performed by InferSiteDiffPrefs, expect the preferences are calculated directly from enrichment ratios. Enrichment ratios are computed using the same algorithm as in the function InferSitePreferencesFromEnrichmentRatios.

characterlist, wtchar, error_model, and counts have the same meaning as for InferSitePreferences.

pseudocounts is a number > 0. If the counts for a character are less than pseudocounts, either due to low counts or error correction, then the counts for that character are changed to pseudocounts to avoid estimating ratios of zero, less than zero, or infinity.

The return value is: (converged, deltapi_means, pr_deltapi_gt0, pr_deltapi_lt0, logstring), where the tuple entries have the same meaning as for InferSiteDiffPrefs except that pr_deltapi_gt0 and pr_deltapi_lt0 are None since they cannot be estimated from direct enrichment ratio calculation as it is not a statistical model, and converged is True since this calculation always converges.

Differential preferences are computed using ratio estimation as follows. For each site, we set the enrichment ratio of the wildtype character \(\rm{wt}\) equal to one. We do this for each of the two selections \(s1\) and \(s2\):

\[\phi_{wt}^{\rm{s1}} = \phi_{wt}^{\rm{s2}} = 1\]

Next, for each non-wildtype character \(x\), we calculate the enrichment ratio relative to \(\rm{wt}\) for \(s1\) and \(s2\) as:

\[\phi_{x}^{s1} = \frac{ \left(\frac{ \max\left(\frac{\mathcal{P}}{N_r^{\rm{s1}}}, \frac{n_{r,x}^{\rm{s1}}}{N_r^{\rm{s1}}} - \frac{n_{r,x}^{\rm{err}}}{N_r^{\rm{err}}}\right) }{ \frac{n_{r,\rm{wt}}^{\rm{s1}}}{N_r^{\rm{s1}}} + \delta - \frac{n_{r,\rm{wt}}^{\rm{err}}}{N_r^{\rm{err}}} }\right) } { \left(\frac{ \max\left(\frac{\mathcal{P}}{N_r^{\rm{pre}}}, \frac{n_{r,x}^{\rm{pre}}}{N_r^{\rm{pre}}} - \frac{n_{r,x}^{\rm{err}}}{N_r^{\rm{err}}}\right) }{ \frac{n_{r,\rm{wt}}^{\rm{pre}}}{N_r^{\rm{pre}}} + \delta - \frac{n_{r,\rm{wt}}^{\rm{err}}}{N_r^{\rm{err}}} }\right) }\]

and

\[\phi_{x}^{s2} = \frac{ \left(\frac{ \max\left(\frac{\mathcal{P}}{N_r^{\rm{s2}}}, \frac{n_{r,x}^{\rm{s2}}}{N_r^{\rm{s2}}} - \frac{n_{r,x}^{\rm{err}}}{N_r^{\rm{err}}}\right) }{ \frac{n_{r,\rm{wt}}^{\rm{s2}}}{N_r^{\rm{s2}}} + \delta - \frac{n_{r,\rm{wt}}^{\rm{err}}}{N_r^{\rm{err}}} }\right) } { \left(\frac{ \max\left(\frac{\mathcal{P}}{N_r^{\rm{pre}}}, \frac{n_{r,x}^{\rm{pre}}}{N_r^{\rm{pre}}} - \frac{n_{r,x}^{\rm{err}}}{N_r^{\rm{err}}}\right) }{ \frac{n_{r,\rm{wt}}^{\rm{pre}}}{N_r^{\rm{pre}}} + \delta - \frac{n_{r,\rm{wt}}^{\rm{err}}}{N_r^{\rm{err}}} }\right) }\]

where \(\mathcal{P}\) is the value of mincounts. When error_model is none, then all terms involving the error corrections (with superscript err) are ignored and \(\delta\) is set to zero; otherwise \(\delta\) is one.

Next, for each character \(x\), including \(\rm{wt}\), we compute the preferences as normalized enrichment ratios for both \(s1\) and \(s2\) as:

\[\pi_{x}^{s1} = \frac{\phi_x^{s1}}{\sum_y \phi_y^{s1}}\]

and

\[\pi_{x}^{s2} = \frac{\phi_x^{s2}}{\sum_y \phi_y^{s2}}\]

and finally calculate the differential preference as

\[\Delta\pi_{x} = \pi_{x}^{s2} - \pi_{x}^{s1}\]

For testing the code using doctest:

>>> # Hypothetical data for a site
>>> characterlist = ['A', 'T', 'G', 'C']
>>> wtchar = 'A'
>>> counts = {}
>>> counts['nrs1'] = {'A':310923, 'T':13, 'C':0, 'G':37}
>>> counts['nrs2'] = {'A':328715, 'T':0, 'C':30, 'G':55}
>>> counts['nrstart'] = {'A':390818, 'T':50, 'C':0, 'G':80}
>>> counts['nrerr'] = {'A':310818, 'T':0, 'C':0, 'G':40}
>>> # Using error_model = 'none'
>>> (converged, deltapi_means, pr_deltapi_gt0, pr_deltapi_lt0, logstring) = InferSiteDiffPreferencesFromEnrichmentRatios(characterlist, wtchar, 'none', counts, pseudocounts=1)
>>> [round(deltapi_means[x], 9) for x in characterlist]
[-0.289284007, -0.102619748, -0.161880651, 0.553784406]
>>> # Using error_model = 'same'
>>> (converged, deltapi_means, pr_deltapi_gt0, pr_deltapi_lt0, logstring) = InferSiteDiffPreferencesFromEnrichmentRatios(characterlist, wtchar, 'same', counts, pseudocounts=1)
>>> [round(deltapi_means[x], 9) for x in characterlist]
[-0.35391081, -0.123807582, -0.002459048, 0.48017744]
mcmc.InferSiteDiffPrefs(characterlist, wtchar, error_model, counts, priors, seed=1, niter=10000, n_jobs=1, r_max=1.1, neff_min=100, nchains=4, increasefactor=2, increasetries=6)

Infers site-specific differntial preferences by MCMC for a specific site.

The algorithm is explained in the documentation for the dms_tools package.

The MCMC automatically tries to guarantee convergence via the parameters specified by r_max, neff_min, nchains, niter, increasefactor, and increasetries.

Here are the calling variables:

  • characterlist is a list of all valid characters; entries must be unique. Typically a list of all amino acids, codons, or nucleotides. Characters are case sensitive.

  • wtchar is a character that is in characterlist and defines the wildtype character at the site.

  • error_model specifies how the errors are estimated. It can be one of the following strings:

    • none : no errors
    • same : same error rates for starting library and selections \(s1\) and \(s2\).

    Alternatively, it could be an instance of StanModelDiffPrefNoErr or StanModelDiffPrefSameErr. These are the pystan models that implement the three strings described above. If you pass these model objects, then they will not need to be compiled by this function (strings require models to be compiled). So it is advantageous to pass the pystan models rather than the strings if calling this function repeatedly.

  • counts is a dictionary specifying the deep sequencing counts. Each string key should specify a dictionary keyed by all characters and with values giving the integer counts for that character. Keys:

    • nrstart : counts for starting library
    • nrs1 : counts for selection \(s1\)
    • nrs2 : counts for selection \(s2\)
    • nrerr : only required if error_model is same, specifis counts for error control.
  • priors is a dictionary specifying the parameter vectors for the Dirichlet priors. Each string key should specify a dictionary keyed by all characters and with values giving the prior for that character. Keys:

    • pirs1_prior_params : specifies Dirichlet prior vector for \(\pi_r^{s1}\)
    • frstart_prior_params : specifies Dirichlet prior vector for \(f_r^{\textrm{start}}\)
    • deltapi_concentration : specifies floating number giving concentration parameter for \(\Delta\pi_r\)
    • xir_prior_params : only required if error_model is same, specified Dirichlet prior vector for error rates \(\xi_r\)

    Any values less than PRIOR_MIN_VALUE (a constant specified in this module) are automatically set to PRIOR_MIN_VALUE.

  • seed : integer seed for MCMC. Calling multiple times with same seed and data will give identical results provided architecture for floating point math on the computer is also the same.

  • n_jobs : The number of CPUs to use. If -1, use all available CPUs.

  • The following parameters specify how the MCMC tries to guarantee convergence. They all have reasonable default values, so you probably don’t need to change them. The MCMC is considered to have converged if the mean Gelman-Rubin R statistic (http://www.jstor.org/stable/2246093) over all \(\pi_{r,x}\) values is less than or equal to r_max, and the mean effective sample size is greater than or equal to neff_min. The MCMC first runs with nchains chains and niter iterations. If it fails to converge, it then increases the number of iterations by a factor of increasefactor and tries again, and repeats until it converges or until it has tried increasetries times to increase the number of iterations.

The return value is: (converged, deltapi_means, pr_deltapi_gt0, pr_deltapi_lt0, logstring). The entries in this tuple have the following meanings:

  • converged is True if the MCMC converges and False otherwise.
  • deltapi_means is a dictionary keyed by all characters in characterlist with the value for each character giving the \(\Delta\pi_{r,a}\) value.
  • pr_deltapi_gt0 is a dictionary keyed by all characters in characterlist with the value being the posterior probability that \(\Delta\pi_{r,a} > 0\).
  • pr_deltapi_lt0 is a dictionary keyed by all characters in characterlist with the value being the posterior probability that \(\Delta\pi_{r,a} < 0\).
  • logstring is a string that can be printed to give summary information about the MCMC run and its convergence.
mcmc.InferSitePreferences(characterlist, wtchar, error_model, counts, priors, seed=1, niter=10000, increasetries=6, n_jobs=1, r_max=1.1, neff_min=100, nchains=4, increasefactor=2)

Infers site-specific preferences by MCMC for a specific site.

Uses MCMC to infer the site-specific preferences \(\pi_{r,a}\) for some site \(r\) for each character \(a\) by integrating over the posterior defined by the product of the following priors and likelihoods, where for instance \(\boldsymbol{\mathbf{\pi_r}}\) is used to indicate the vector of \(\pi_{r,a}\) values for all characters:

\[ \begin{align}\begin{aligned}\Pr\left(\boldsymbol{\mathbf{\pi_r}}\right) = \mathrm{Dirichlet}\left(\boldsymbol{\mathbf{a_{\pi,r}}}\right)\\\Pr\left(\boldsymbol{\mathbf{\mu_r}}\right) = \mathrm{Dirichlet}\left(\boldsymbol{\mathbf{a_{\mu,r}}}\right)\\\Pr\left(\boldsymbol{\mathbf{\epsilon_r}}\right) = \mathrm{Dirichlet}\left(\boldsymbol{\mathbf{a_{\epsilon,r}}}\right)\\\Pr\left(\boldsymbol{\mathbf{\rho_r}}\right) = \mathrm{Dirichlet}\left(\boldsymbol{\mathbf{a_{\rho,r}}}\right)\\\Pr\left(\boldsymbol{\mathbf{n_r^{\rm{pre}}}} \mid \boldsymbol{\mathbf{\mu_r}}, \boldsymbol{\mathbf{\epsilon_r}}\right) = \mathrm{Multinomial}\left(\boldsymbol{\mathbf{n_r^{\rm{pre}}}}; \boldsymbol{\mathbf{\mu_r}} + \boldsymbol{\mathbf{\epsilon_r}} - \boldsymbol{\mathbf{\delta_r}}\right)\\\Pr\left(\boldsymbol{\mathbf{n_r^{\rm{post}}}} \mid \boldsymbol{\mathbf{\mu_r}}, \boldsymbol{\mathbf{\epsilon_r}}\right) = \mathrm{Multinomial}\left(\boldsymbol{\mathbf{n_r^{\rm{post}}}}; \frac{\boldsymbol{\mathbf{\mu_r}} \circ \boldsymbol{\mathbf{\pi_r}}}{\boldsymbol{\mathbf{\mu_r}} \cdot \boldsymbol{\mathbf{\pi_r}}} + \boldsymbol{\mathbf{\rho_r}} - \boldsymbol{\mathbf{\delta_r}}\right)\\\Pr\left(\boldsymbol{\mathbf{n_r^{\rm{err,pre}}}} \mid \boldsymbol{\mathbf{\epsilon_r}}\right) = \mathrm{Multinomial}\left(\boldsymbol{\mathbf{n_r^{\rm{err,pre}}}}; \boldsymbol{\mathbf{\epsilon_r}}\right)\\\Pr\left(\boldsymbol{\mathbf{n_r^{\rm{err,post}}}} \mid \boldsymbol{\mathbf{\rho_r}}\right) = \mathrm{Multinomial}\left(\boldsymbol{\mathbf{n_r^{\rm{err,post}}}}; \boldsymbol{\mathbf{\rho_r}}\right)\end{aligned}\end{align} \]

where \(\boldsymbol{\mathbf{\delta_r}}\) is a vector that is zero except for a one at the element corresponding to the wildtype character.

The MCMC automatically tries to guarantee convergence via the parameters specified by r_max, neff_min, nchains, niter, increasefactor, and increasetries.

Here are the calling variables:

  • characterlist is a list of all valid characters; entries must be unique.

    Typically a list of all amino acids, codons, or nucleotides. Characters are case sensitive.

  • wtchar is a character that is in characterlist and defines the wildtype character at the site.

  • error_model specifies how the errors are estimated. It might be one of the following strings:

    • none : no errors (\(\boldsymbol{\mathbf{\epsilon_r}} = \boldsymbol{\mathbf{\rho_r}} = \mathbf{0}\))
    • same : same error rates for pre and post-selection (\(\boldsymbol{\mathbf{\epsilon_r}} = \boldsymbol{\mathbf{\rho_r}}\)).
    • different : different error rates for pre and post selection (\(\boldsymbol{\mathbf{\epsilon_r}} \ne \boldsymbol{\mathbf{\rho_r}}\)).

    Alternatively, it could be an instance of StanModelNoneErr, StanModelSameErr, or StanModelDifferentErr. These are the pystan models that implement the three strings described above. If you pass these model objects, then they will not need to be compiled by this function (strings require models to be compiled). So it is advantageous to pass the pystan models rather than the strings if calling this function repeatedly.

  • counts is a dictionary specifying the deep sequencing counts. Each string key should specify a dictionary keyed by all characters and with values giving the integer counts for that character. Keys:

    • nrpre : specifies \(\boldsymbol{\mathbf{n_r^{\rm{pre}}}}\)
    • nrpost : specifies \(\boldsymbol{\mathbf{n_r^{\rm{post}}}}\)
    • nrerr : only required if error_model is same, specifies \(\boldsymbol{\mathbf{n_r^{\rm{err,pre}}}} = \boldsymbol{\mathbf{n_r^{\rm{err,post}}}}\)
    • nrerrpre and nrerrpost : only required if error_model is different, specify \(\boldsymbol{\mathbf{n_r^{\rm{err,pre}}}}\) and \(\boldsymbol{\mathbf{n_r^{\rm{err,post}}}}\).
  • priors is a dictionary specifying the parameter vectors for the Dirichlet priors. Each string key should specify a dictionary keyed by all characters and with values giving the prior for that character. Keys:

    • pir_prior_params : specifies \(\boldsymbol{\mathbf{a_{\pi,r}}}\)
    • mur_prior_params : specifies \(\boldsymbol{\mathbf{a_{\mu,r}}}\)
    • epsilonr_prior_params : only required if error_model is same or different, specifies \(\boldsymbol{\mathbf{a_{\epsilon,r}}}\)
    • rhor_prior_params : only required if error_model is different, specifies \(\boldsymbol{\mathbf{a_{\rho,r}}}\)

    Any values less than PRIOR_MIN_VALUE (a constant specified in this module) are automatically set to PRIOR_MIN_VALUE.

  • seed : integer seed for MCMC. Calling multiple times with same seed and data will give identical results provided architecture for floating point math on the computer is also the same.

  • n_jobs : The number of CPUs to use. If -1, use all available CPUs.

  • The following parameters specify how the MCMC tries to guarantee convergence. They all have reasonable default values, so you probably don’t need to change them. The MCMC is considered to have converged if the mean Gelman-Rubin R statistic (http://www.jstor.org/stable/2246093) over all \(\pi_{r,x}\) values is less than or equal to r_max, and the mean effective sample size is greater than or equal to neff_min. The MCMC first runs with nchains chains and niter iterations. If it fails to converge, it then increases the number of iterations by a factor of increasefactor and tries again, and repeats until it converges or until it has tried increasetries times to increase the number of iterations. Additionally, if the effective sample size exceeds 3 times neff_min then we allow R to be 1 + 1.5 (r_min - 1).

The return value is: (converged, pi_means, pi_95credint, logstring). The entries in this tuple have the following meanings:

  • converged is True if the MCMC converges and False otherwise.
  • pi_means is a dictionary keyed by all characters in characterlist with the value for each character giving the \(\pi_{r,a}\) value.
  • pi_95credint is a dictionary keyed by all characters in characterlist with the value being a 2-tuple giving the lower and upper limits on the median-centered credible interval for \(\pi_{r,a}\).
  • logstring is a string that can be printed to give summary information about the MCMC run and its convergence.
mcmc.InferSitePreferencesFromEnrichmentRatios(characterlist, wtchar, error_model, counts, pseudocounts=1)

Infers site-specific preferences from enrichment ratios.

This function mirrors the operations performed by InferSitePreferences, expect the preferences are calculated directly from enrichment ratios.

characterlist, wtchar, error_model, and counts have the same meaning as for InferSitePreferences.

pseudocounts is a number > 0. If the counts for a character are less than pseudocounts, either due to low counts or error correction, then the counts for that character are changed to pseudocounts to avoid estimating ratios of zero, less than zero, or infinity.

Briefly, we set the enrichment ratio of the wildtype character \(\rm{wt}\) at this site equal to one

\[\phi_{wt} = 1\]

Then, for each non-wildtype character \(x\), we calculate the enrichment relative to \(\rm{wt}\) as

\[\phi_{x} = \frac{ \left(\frac{ \max\left(\frac{\mathcal{P}}{N_r^{\rm{post}}}, \frac{n_{r,x}^{\rm{post}}}{N_r^{\rm{post}}} - \frac{n_{r,x}^{\rm{errpost}}}{N_r^{\rm{errpost}}}\right) }{ \frac{n_{r,\rm{wt}}^{\rm{post}}}{N_r^{\rm{post}}} + \delta - \frac{n_{r,\rm{wt}}^{\rm{errpost}}}{N_r^{\rm{errpost}}} }\right) } { \left(\frac{ \max\left(\frac{\mathcal{P}}{N_r^{\rm{pre}}}, \frac{n_{r,x}^{\rm{pre}}}{N_r^{\rm{pre}}} - \frac{n_{r,x}^{\rm{errpre}}}{N_r^{\rm{errpre}}}\right) }{ \frac{n_{r,\rm{wt}}^{\rm{pre}}}{N_r^{\rm{pre}}} + \delta - \frac{n_{r,\rm{wt}}^{\rm{errpre}}}{N_r^{\rm{errpre}}} }\right) }\]

where \(\mathcal{P}\) is the value of pseudocounts. When error_model is none, then all terms involving the error corrections (with superscript err) are ignored and \(\delta\) is set to zero; otherwise \(\delta\) is one.

Next, for each character \(x\), including \(\rm{wt}\), we calculate the preference for \(x\) as

\[\pi_x = \frac{\phi_x}{\sum_y \phi_y}.\]

The return value is: (converged, pi_means, pi_95credint, logstring), where the tuple entries have the same meaning as for InferSitePreferences except that pi_95credint is None since no credible intervals can be estimated from direct enrichment ratio calculation as it is not a statistical model, and converged is True since this calculation always converges.

For testing the code using doctest:

>>> # Hypothetical data for a site
>>> characterlist = ['A', 'T', 'G', 'C']
>>> wtchar = 'A'
>>> counts = {}
>>> counts['nrpost'] = {'A':310923, 'T':13, 'C':0, 'G':37}
>>> counts['nrerrpost'] = {'A':310818, 'T':0, 'C':0, 'G':40}
>>> counts['nrerr'] = {'A':310818, 'T':0, 'C':0, 'G':40} # Same as 'nrerrpost'
>>> counts['nrpre'] = {'A':390818, 'T':50, 'C':0, 'G':80}
>>> counts['nrerrpre'] = {'A':390292, 'T':0, 'C':5, 'G':9}
>>> # Using error_model = 'none'
>>> (converged, pi, pi95, logstring) = InferSitePreferencesFromEnrichmentRatios(characterlist, wtchar, 'none', counts, pseudocounts=1)
>>> [round(pi[x], 9) for x in characterlist]
[0.315944301, 0.10325369, 0.18367243, 0.397129578]
>>> # Using error_model = 'same'
>>> (converged, pi, pi95, logstring) = InferSitePreferencesFromEnrichmentRatios(characterlist, wtchar, 'same', counts, pseudocounts=1)
>>> [round(pi[x], 9) for x in characterlist]
[0.380792732, 0.124446795, 0.016118953, 0.47864152]
>>> # Using error_model = 'different'
>>> (converged, pi, pi95, logstring) = InferSitePreferencesFromEnrichmentRatios(characterlist, wtchar, 'different', counts, pseudocounts=1)
>>> [round(pi[x], 9) for x in characterlist]
[0.384418849, 0.125620184, 0.006806413, 0.483154554]
class mcmc.StanModelDiffPrefNoErr

Bases: object

PyStan model for error_model of ‘none’ for InferSiteDiffPrefs.

class mcmc.StanModelDiffPrefSameErr

Bases: object

PyStan model for error_model of ‘same’ for InferSiteDiffPrefs.

class mcmc.StanModelDifferentErr

PyStan model for error_model of ‘different’ for InferSitePreferences.

class mcmc.StanModelNoneErr

Bases: object

PyStan model for error_model of ‘none’ for InferSitePreferences.

class mcmc.StanModelSameErr

PyStan model for error_model of ‘same’ for InferSitePreferences.

file_io module

Module that reads and writes files.

Written by Jesse Bloom.

Functions in this module

  • Versions : Returns string giving versions of utilized packages.
  • ReadDMSCounts : reads deep mutational scanning counts.
  • WriteDMSCounts : writes deep mutational scanning counts.
  • ReadMultipleDMSCountsFiles : reads multiple counts files for the same sequence.
  • WritePreferences : writes site-specific preferences.
  • ReadPreferences : reads the preferences file written by WritePreferences.
  • WriteDiffPrefs : writes site-specific differential preferences.
  • ReadDiffPrefs : reads differential preferences file written by WriteDiffPrefs.
  • ReadMultiPrefOrDiffPref : reads multiple preferences or differential preferences files.
  • IteratePairedFASTQ : iterate over paired-read FASTQ files.
  • ReadSummaryStats : reads alignment summary statistics.
  • ReadSubassembledVariants : reads subassembled variants

Function documentation

file_io.IteratePairedFASTQ(r1files, r2files, gzipped, applyfilter, usegzip=False)

Iterates over FASTQ file pairs for paired-end sequencing reads.

This function has been tested for its ability to read FASTQ files generated by the Illumina Casava 1.8 pipeline for paired-end reads. For FASTQ files generated by the Casava 1.8 pipeline it will ensure that the read directions are appropriate. If the sequence headers do NOT match the format of the Casava 1.8 pipeline, they will need to match the default format written by fastq-dump when extracting FASTQ files from SRA data, using the option ‘–split-files’ to generate r1 files and r2 files appropriately.

CALLING VARIABLES:

r1files : list of file name(s) containing the R1 reads.

r2files : list of file name(s) containing the R2 reads. The files specified here must specify the exact same number of reads as those in r1files, and they must be in the same sequence (i.e. the first R1 read found going through r1files in order must match the first R2 read found going through r2files in order, etc).

gzipped : Boolean switch specifying whether the FASTQ files specified by r1files and r2files are gzipped. If set to True, they are gzipped. This iterator will then read them without unzipping the files.

applyfilter : Boolean switch specifying whether we remove read pairs in which one or more of the reads failed the Illumina chastity filter. If True, the iterator simply returns False each time it encounters a read pair where the chastity filter is failed. The chastity filter is indicated by Y or N in the sequence header – values of Y indicate a failed filter. Note this option will only work if the sequence headers are the ones provided by the Illumina Casava 1.8 pipeline; for FASTQ files downloaded from the SRA using fastq-dump this filter information will not be present in the sequence header and the value of applyfilter will no longer be meaningful and all read pairs will be returned.

usegzip : Boolean switch that is meaningful only if gzipped is True. This specifies that we use the Python gzip module to unzip the reads. Otherwise we use the system gzip -cd command, which may be faster. You should definitely specify usegzip to True if you are only reading a few lines, as with usegzip set to False the entire file is unzipped before anything happens. False by default.

RESULT OF THIS FUNCTION:

The function will iterate over all read pairs in the indicated files. It will first try to match sequence headers that follow the format of the Illumina Casava 1.8 pipeline, and if that fails it will match headers more permissively if they begin with the default structure written by fastq-dump: @(samplename).(read number).

For each iteration, it will return:

  • The return variable is False if applyfilter == True and one of the reads in the pair failed the Illumina chastity filter and the sequence headers are of the type written by the Illumina Casava 1.8 pipeline.
  • Otherwise return variable is the following tuple: (name, r1, r2, q1, q2) where:
    • name : name of the read pair (a string)
    • r1 : sequence of the first read R1 (a string)
    • r2 : sequence of the second read R2 (a string)
    • q1 : string of Q-scores for the R1 read
    • q2 : string of Q-scores for the R2 read
file_io.ReadDMSCounts(f, chartype, translate_codon_to_aa=False, return_as_df=False)

Reads deep mutational scanning counts.

f is a readable file-like object that contains the counts, or a string that gives the name of a file.

chartype is the type of character. Valid values are the following strings:

  • DNA : DNA nucleotides, those listed in the nts variable defined in this module.
  • codon : DNA codons, those listed in the codons variable defined in this module.
  • aminoacids_nostop : amino acids, not including stop codons.
  • aminoacids_withstop : amino acids, including stop codons (*).

The counts are returned in the dictionary counts. The dictionary is keyed by strings giving the position (e.g. ‘1’, ‘2’, or ‘5A’). For each position r, counts[r] is a dictionary keyed by the string WT and characters (nucleotides, codons, or amino-acids) specified by chartype, in upper case. counts[r][‘WT’] is the wildtype identity at the site; counts[r][x] is the number of counts for character x. In addition, counts[r][‘F_{0}’.format(x)] is the frequency of counts of x, and counts[r][‘COUNTS’] is the total number of counts at site r. If there are no counts at a site, the frequency of counts for each character x at the site is set to float(NaN).

Alternatively, if return_as_df is set as True, the counts are returned as a pandas dataframe.

If chartype is codon and the option translate_codon_to_aa is set as True, then the returned counts will be for amino-acids as translated by the codon counts.

The specifications on the format of the contents of f are as follows:

  • Any lines beginning with # are treated as comments or headers, and do not contain data.

  • Columns are space delimited. They can separated by one or more spaces, or by one or more tabs.

  • Before the first data line (a line not beginning with #) there must be a header line beginning with # POSITION WT and then followed in arbitrary order by entries specifying valid DNA nucleotides or valid DNA codons. The order of the entries in this header line establish the order of the data in subsequent lines.

  • The case does matter, all nucleotides and codons must be upper case.

  • It is OK to intersperse columns that do not represent valid codons or nucleotides as long as all of the codons or nucleotides are present. This might be useful if you file is using N to denote ambiguous nucleotides (which are ignored as not being definitive counts), or is using lower-case letters (such as a to denote unsure calls).

  • For every data line, the first two entries will be POSITION and WT:

    • POSITION is the position number. These do not have to be consecutive integers, so it is OK to skip sites (e.g. 1, 2, 5) or have negative site numbers (e.g. -1, 0, 1) or have sites with letter suffixes (e.g. 1, 2, 2A, 3). All of these options are useful when referring to numbering schemes that are non-sequential, as is sometimes the case in gene numbering. Note however that each position number must be unique.
    • WT must give a valid wildtype identity of the site as a
      nucleotide or codon.

Example of parsing nucleotide counts files with minimal required format and extra lines / columns.

>>> f = cStringIO.StringIO()
>>> f.write('# POSITION WT A T C G\n')
>>> f.write('1 A 310818 13 0 37\n')
>>> f.write('2 T 16 311782 37 4\n')
>>> f.write('2A G 27 27 11 312520\n')
>>> f.write('3 A 313647 13 4 31\n')
>>> f.seek(0)
>>> f2 = cStringIO.StringIO()
>>> f2.write('# extra comment line\n')
>>> f2.write('# POSITION WT N A T G C a\n')
>>> f2.write('1 A 1 310818  13 37 0 15\n')
>>> f2.write('2 T 15 16 311782   4 37\t 0\n')
>>> f2.write('# another comment line\n')
>>> f2.write('2A G 7 27 27 312520 11 1\n')
>>> f2.write('3 A 0 313647 13 31 4 2\n')
>>> f2.seek(0)
>>> d = ReadDMSCounts(f, 'DNA')
>>> d2 = ReadDMSCounts(f2, 'DNA')
>>> d == d2
True
>>> d['1']['WT'] == 'A'
True
>>> d['2A']['WT'] == 'G'
True
>>> d['2']['T'] == 311782
True
>>> d['2']['C'] == 37
True
>>> 1 in d
False
>>> 'N' in d2['1']
False
>>> 'a' in d2['2']
False
>>> f.seek(0)
>>> ReadDMSCounts(f, 'codon')
Traceback (most recent call last):
    ...
ValueError: file does not have a valid header for codon

Example for codons

>>> nts = ['A', 'C', 'G', 'T']
>>> codons = []
>>> for nt1 in nts:
...     for nt2 in nts:
...         for nt3 in nts:
...             codons.append('%s%s%s' % (nt1, nt2, nt3))
>>> f = cStringIO.StringIO()
>>> f.write('# POSITION WT %s\n' % ' '.join(codons))
>>> f.write('1 AAA %s\n' % ' '.join([str(i) for i in range(len(codons))]))
>>> f.seek(0)
>>> d = ReadDMSCounts(f, 'codon')
>>> d['1']['AAC'] == 1
True
>>> d['1']['WT'] == 'AAA'
True
>>> d['1']['WT'] == 'AAa'
False
>>> codons[5] = codons[5].lower()
>>> f2 = cStringIO.StringIO()
>>> f2.write('# POSITION WT %s\n' % ' '.join(codons))
>>> f2.write('1 AAA %s\n' % ' '.join([str(i) for i in range(len(codons))]))
>>> f2.seek(0)
>>> d = ReadDMSCounts(f2, 'codon')
Traceback (most recent call last):
    ...
ValueError: file does not have a valid header for codon
file_io.ReadDiffPrefs(f)

Reads the differential preferences written by WriteDiffPrefs.

f is the name of an existing file or a readable file-like object.

The return value is the tuple: (sites, wts, deltapi_means, pr_deltapi_lt0, pr_deltapi_gt0, rms) where sites, wts, deltapi_means, pr_deltapi_lt0, and pr_deltapi_gt0 all have the same values used to write the file with WriteDiffPrefs and rms is a dictionary with rms[r] giving the root-mean-square differential preference for each r in sites.

See docstring of WriteDiffPrefs for example usage.

file_io.ReadMultiPrefOrDiffPref(infiles, removestop=False, sortsites=True)

Reads multiple preferences or differential preferences files.

infiles is a list of preferences or differential preferences files in the formats read by ReadPreferences or ReadDiffPrefs. The files can be a mix of these two formats. Each of the files must have data for the same set of sites or a ValueError is raised. In addition, the files must use the same sets of characters or a ValueError is raised. Valid character sets are:

  • nucleotides
  • codons
  • amino acids without stop codons
  • amino acids with stop codons
  • a mix of amino acids with and without stop codons if removestop is True.

removestop specifies whether we remove stop codons from amino acids. If this is done, all preferences are renormalized to sum to one by scaling them, and all differential preferences are renormalized to sum to zero by adding or subtracting a fixed amount to each.

sortsites specifies that we do a natural sort on the site numbers in infiles rather than keeping them in their original order.

The return value is as follows: (sites, characters, wts, databyfile). In this tuple:

sites is a list of all site numbers (as strings) for which we have preferences or differential preferences.

characters is a list of the characters that are being used (i.e. nucleotides, codons, or amino acids with or without stop codons).

wts is a dictionary with wts[r] giving the wildtype character at site r for all r in sites. If the files do not all have the same wildtype at r, then wts[r] = ‘?’.

databyfile is a dictionary keyed by each file name in infiles If that file is a preferences file, then the value is the dictionary pi_means that would be returned by ReadPreferences. If that file is a differential preferences file, then the value is the dictionary deltapi_means that would be returned by ReadDiffPrefs.

file_io.ReadMultipleDMSCountsFiles(filelist, chartype)

Reads multiple deep mutational scanning counts for the same sequence.

filelist is a list of names of files that can be read by ReadDMSCounts using chartype.

This function reads the counts for all files in filelist. It confirms that each file specifies counts for the same set of sites, and that the wildtype sequence specified in each file is the same. It then returns the following tuple: (sites, wts, counts). In this tuple:

  • sites is a list of strings giving the sites, sorted so that the numeric portion of the string is interpreted as a number.
  • wts is a dictionary, with wts[r] giving the wildtype character for site r for each r in sites.
  • counts is a dictionary keyed by each file in filelist, with the value for each file name being the counts dictonary for that file in the format returned by ReadDMSCounts.
file_io.ReadPreferences(f)

Reads the site-specific preferences written by WritePreferences.

f is the name of an existing file or a readable file-like object.

The return value is the tuple: (sites, wts, pi_means, pi_95credint, h) where sites, wts, pi_means, and pi_95credint will all have the same values used to write the file with WritePreferences, and h is a dictionary with h[r] giving the site entropy (log base 2) for each r in sites.

See docstring of WritePreferences for example usage.

file_io.ReadSubassembledVariants(infile, assumechartype='codon')

Reads subassembled variants file produced by dms_subassemble.

infile : name of a *_subassembled_variants.txt file produced by dms_subassemble.

assumechartype is the type of nucleotide character assumed if it is not possible to auto-determine this from infile, which will be the case only if there are no mutations.

The return value is (bc_dict, bc_len, wtseq, chartype)

bc_dict is a dictionary keyed by each barcode, and with values being the list of mutations at that site.

bc_len is the integer length of barcodes.

wtseq is a string giving the wildtype sequence

chartype indicates the character type of the sequence / mutations as a string, which is auto-determined from infile. The chartype can be any of codon, DNA, or aminoacid. If all variants are unmutated and the sequence is a nucleotide sequence, then it is not possible to auto- determine whether the character type is codon or DNA. In that case, it is assumed to have the type of assumechartype.

file_io.ReadSummaryStats(f)

Reads alignment summary statistics.

f should be either a readable file-like object or a string giving the name of an existing file that contains alignment summary statistics. For instance, this might be the summarystats.txt file produced by dms_barcodedsubamplicons. Here is example content from such a file:

total read pairs = 4580258
read pairs that fail Illumina filter = 0
low quality read pairs = 915131
barcodes randomly purged = 0
discarded barcodes with 1 reads = 844768
aligned barcodes with 2 reads = 466387
discarded barcodes with 2 reads = 6457
un-alignable barcodes with 2 reads = 40753
aligned barcodes with 3 reads = 262149
discarded barcodes with 3 reads = 5426
un-alignable barcodes with 3 reads = 24891
aligned barcodes with 4 reads = 120828
discarded barcodes with 4 reads = 3281
un-alignable barcodes with 4 reads = 7675
aligned barcodes with 5 reads = 45342
discarded barcodes with 5 reads = 1516
un-alignable barcodes with 5 reads = 2863

The return value is the 2-tuple (keylist, stats). keylist is a list of all of the keys in f (these are the strings before the equals sign) in the same order as they appear in the file. stats is a dictionary keyed by each key in keylist and with the value being an integer giving the number of counts for that key. The reason for returning keylist as well as stats is that keylist preserves the order of the entries in f, whereas the dictionary stats will have an arbitrary order.

Any lines in f that begin with # are treated as comments and are not processed into the return variables.

file_io.Versions()

Returns a string with version information.

You would call this function if you want a string giving detailed information on the version of dms_tools and the associated packages that it uses. This string might be useful if you want to print this information at the beginning of a script.

file_io.WriteDMSCounts(f, counts)

Writes deep mutational scanning counts file.

This function is the inverse of ReadDMSCounts.

f is a writable file-like object, or a string that gives the name of a file that is created.

counts is a dictionary of the same format returned by ReadDMSCounts. These counts are written to f.

There is no chartype argument – that is automatically determined from counts.

>>> counts = {'1':{'A':2, 'C':3, 'G':4, 'T':1, 'WT':'A'}, '2':{'A':1, 'C':1, 'G':3, 'T':4, 'WT':'C'}}
>>> counts['1']['COUNTS'] = 10
>>> counts['2']['COUNTS'] = 9
>>> counts['1']['F_A'] = counts['1']['A'] / float(counts['1']['COUNTS'])
>>> counts['1']['F_C'] = counts['1']['C'] / float(counts['1']['COUNTS'])
>>> counts['1']['F_G'] = counts['1']['G'] / float(counts['1']['COUNTS'])
>>> counts['1']['F_T'] = counts['1']['T'] / float(counts['1']['COUNTS'])
>>> counts['2']['F_A'] = counts['2']['A'] / float(counts['2']['COUNTS'])
>>> counts['2']['F_C'] = counts['2']['C'] / float(counts['2']['COUNTS'])
>>> counts['2']['F_G'] = counts['2']['G'] / float(counts['2']['COUNTS'])
>>> counts['2']['F_T'] = counts['2']['T'] / float(counts['2']['COUNTS'])
>>> f = cStringIO.StringIO()
>>> WriteDMSCounts(f, counts)
>>> f.seek(0)
>>> readcounts = ReadDMSCounts(f, 'DNA')
>>> counts == readcounts
True
file_io.WriteDiffPrefs(f, sites, wts, deltapi_means, pr_deltapi_lt0, pr_deltapi_gt0)

Writes differential preferences to a file.

f is a writeable file-like object to which we write the differential
preferences, or a string giving the name of a file that we create.
sites is a list of all site numbers (as strings) for which
we have preferences.
wts is a dictionary with wts[r] giving the wildtype character
at site r for all r in sites.
deltapi_means is a dictionary keyed by all sites (as strings). For each
site r, deltapi_means[r] is a dictionary where deltapi_means[r][x] is the preference of r for character x, where the characters are nucleotide, codons, or amino acids.
pr_deltapi_gt0 and pr_deltapi_lt0 can either both be None,
or can both be dictionaries keyed by all sites in deltapi_means that have as their values the posterior probability that particular differential preference is greater than or less than zero, respectively.

The written file has the following format:

# POSITION WT RMS_dPI dPI_A dPI_C dPI_G dPI_T Pr_dPI_A_lt0 Pr_dPI_A_gt0 Pr_dPI_C_lt0 Pr_dPI_C_gt0 Pr_dPI_G_lt0 Pr_dPI_G_gt0 Pr_dPI_T_lt0 Pr_dPI_T_gt0
1 G 0.241 0.02 -0.4 0.2 0.18 0.41 0.59 0.90 0.10 0.2 0.8 0.3 0.7

In this format, the first header line begins with # and indicates the characters as the suffix to dPI_. The posterior probabilities are indicated by columns of the form Pr_dPI_A_gt0 for greater than zero, and Pr_dPI_A_lt0 for less than zero. The RMS_dPI column gives the root-mean-square differential preference as \(\textrm{RMS}_{\Delta\pi_r} = \sqrt{\frac{1}{\mathcal{N}_x}\sum_x \left(\Delta\pi_{r,x}\right)^2}\)

>>> sites = ['1']
>>> deltapi_means = {'1':{'A':-0.2, 'C':-0.3, 'G':0.4, 'T':0.1}}
>>> wts = {'1':'G'}
>>> f = cStringIO.StringIO()
>>> WriteDiffPrefs(f, sites, wts, deltapi_means, None, None)
>>> f.seek(0)
>>> lines = f.readlines()
>>> lines[0] == '# POSITION WT RMS_dPI dPI_A dPI_C dPI_G dPI_T\n'
True
>>> entries = lines[1].split()
>>> entries[0] == str(sites[0])
True
>>> entries[1] == wts[sites[0]]
True
>>> 1.0e-6 > abs(float(entries[2]) - math.sqrt(sum([x**2 for x in deltapi_means[sites[0]].values()]) / float(len(deltapi_means[sites[0]].values()))))
True
>>> 1.0e-6 > abs(float(entries[3]) - deltapi_means[sites[0]]['A'])
True
>>> f2 = cStringIO.StringIO()
>>> pr_deltapi_gt0 = {'1':{'A':0.95, 'C':0.5, 'G':0.7, 'T':0.5}}
>>> pr_deltapi_lt0 = {'1':{'A':0.05, 'C':0.5, 'G':0.3, 'T':0.5}}
>>> WriteDiffPrefs(f2, sites, wts, deltapi_means, pr_deltapi_lt0, pr_deltapi_gt0)
>>> f2.seek(0)
>>> lines = f2.readlines()
>>> lines[0] == '# POSITION WT RMS_dPI dPI_A dPI_C dPI_G dPI_T Pr_dPI_A_lt0 Pr_dPI_A_gt0 Pr_dPI_C_lt0 Pr_dPI_C_gt0 Pr_dPI_G_lt0 Pr_dPI_G_gt0 Pr_dPI_T_lt0 Pr_dPI_T_gt0\n'
True
>>> f.seek(0)
>>> tup = ReadDiffPrefs(f)
>>> len(tup) == 6
True
>>> tup[0] == sites
True
>>> tup[1] == wts
True
>>> tup[2] == deltapi_means
True
>>> tup[3] == tup[4] == None
True
>>> 1.0e-6 > tup[5]['1'] - float(entries[2])
True
>>> f.close()
>>> f2.seek(0)
>>> tup = ReadDiffPrefs(f2)
>>> len(tup) == 6
True
>>> tup[0] == sites
True
>>> tup[1] == wts
True
>>> tup[2] == deltapi_means
True
>>> tup[3] == pr_deltapi_lt0
True
>>> tup[4] == pr_deltapi_gt0
True
>>> 1.0e-6 > tup[5]['1'] - float(entries[2])
True
>>> f2.close()
file_io.WritePreferences(f, sites, wts, pi_means, pi_95credint)

Writes site-specific preferences to a file.

f is a writeable file-like object to which we write the preferences,
or a string giving the name of a file that we create.
sites is a list of all site numbers (as strings) for which
we have preferences.
wts is a dictionary with wts[r] giving the wildtype character
at site r for all r in sites.
pi_means is a dictionary keyed by all sites (as strings). For each
site r, pi_means[r] is a dictionary where pi_means[r][x] is the preference of r for character x, where the characters are nucleotide, codons, or amino acids.
pi_95credint can either be None or a dictionary keyed by all
sites that also key pi_means, and has as values 2-tuples giving the upper and lower bounds of the 95% credible interval. (If it is a dictionary with any value of None, that is the same as setting the overall variable to None).

The written file has the following format:

# POSITION WT SITE_ENTROPY PI_A PI_C PI_G PI_T PI_A_95 PI_C_95 PI_G_95 PI_T_95
1 G 1.61 0.21 0.09 0.58 0.12 0.17,0.28 0.04,0.12 0.51,0.69 0.09,0.15

In this format, the first header line begins with # and indicates the characters as the suffix to PI_. The 95% credible intervals (if given) are indicated by columns beginning with PI_ and ending with _95. The SITE_ENTROPY column gives the site entropy in bits (log base 2) as \(h_r = \sum_x \pi_{r,x} \times \log_2 \pi_{r,x}\).

The POSITION header can also be SITE.

>>> sites = ['1']
>>> pi_means = {'1':{'A':0.2, 'C':0.3, 'G':0.4, 'T':0.1}}
>>> wts = {'1':'G'}
>>> pi_95credint = {'1':{'A':(0.1, 0.3), 'C':(0.25, 0.35), 'G':(0.3, 0.5), 'T':(0.05, 0.15)}}
>>> f = cStringIO.StringIO()
>>> WritePreferences(f, sites, wts, pi_means, pi_95credint)
>>> f.seek(0)
>>> lines = f.readlines()
>>> lines[0] == '# POSITION WT SITE_ENTROPY PI_A PI_C PI_G PI_T PI_A_95 PI_C_95 PI_G_95 PI_T_95\n'
True
>>> f.seek(0)
>>> tup = ReadPreferences(f)
>>> tup[0] == sites
True
>>> tup[3] == pi_95credint
True
>>> f2 = cStringIO.StringIO()
>>> WritePreferences(f2, sites, wts, pi_means, None)
>>> f2.seek(0)
>>> lines = f2.readlines()
>>> lines[0] == '# POSITION WT SITE_ENTROPY PI_A PI_C PI_G PI_T\n'
True
>>> f2.seek(0)
>>> (sites2, wts2, pi_means2, pi_95credint2, h) = ReadPreferences(f2)
>>> sites == sites2
True
>>> wts == wts2
True
>>> pi_95credint2 == None
True
>>> r = sites[0]
>>> all([abs(pi_means[r][x] - pi_means2[r][x]) < 1.0e-6 for x in pi_means[r].keys()])
True
>>> print "%.3f" % h[r]
1.846

utils module

Module with utilities for manipulating data related to dms_tools.

Written by Jesse Bloom.

Functions in this module

  • NaturalSort : Performs a natural sort (alphanumeric sort) on a list.
  • AvgMutRate : Returns the average mutation rate across all sites.
  • SumCodonToAA : Sums all codon entries for each amino acid.
  • SiteEntropy : Computes site entropy.
  • RMS : Computes root-mean-square value.
  • RemoveStopFromPreferences : removes stop codon
  • RemoveStopFromDiffPrefs : removes stop codon
  • SumPrefsDiffPrefs : sums preferences and differential preferences
  • SumCounts : sums counts at each site across a set of experimental samples
  • Pref_or_DiffPref : determines whether data represent preferences or differential preferences.
  • PrefsToEnrichments : converts preferences to enrichment ratios.
  • AdjustErrorCounts : adjust error counts so they don’t exceed actual counts.
  • TrimReads : trim the ends for a pair of reads.
  • CheckReadQuality : checks quality of pair of reads.
  • BuildReadConsensus : builds consensus of reads if they are sufficiently similar.
  • AlignSubamplicon : attempt to align subamplicon at specified position.
  • AlignRead : attempt to align a read at a specified position
  • Subassemble : subassembles a read from per site counts
  • ClassifyCodonCounts : classifies codon mutations.
  • CodonMutsCumulFracs : cumulative counts of codon mutations.
  • ParseNSMutFreqBySite : Parse nonsynonymous mutation frequencies at each site from a DMS counts file.

Function documentation

utils.AdjustErrorCounts(wt, counts, maxexcess)

Adjust error counts so that they don’t exceed counts of interest.

This deals with the case when the error control estimates a higher rate of a mutation than is found in the actual sample. This will confound the MCMC inference by dms_inferprefs and dms_inferdiffprefs. So this script adjusts down the error counts until they are compatbile with the counts in the actual sample.

Returns a dictionary like counts but with adjusted counts.

This is a function utilized to clean counts data for dms_inferprefs and dms_inferdiffprefs.

Essentially, it checks if the mutation rate in the error control is higher than that of the sample for which it is supposed to control. If so, it reduces the mutation counts for the error control so that they only exceed the controlled for sample by maxexcess.

>>> wt = 'A'
>>> counts = {'nrpre':{'A':500, 'C':10, 'G':40, 'T':20}, 'nrerrpre':{'A':250, 'C':1, 'G':30, 'T':10}, 'nrpost':{'A':500, 'C':20, 'G':20, 'T':20}, 'nrerrpost':{'A':1000, 'C':30, 'G':20, 'T':50}}
>>> adjusted_counts = AdjustErrorCounts(wt, counts, maxexcess=1)
>>> counts['nrpre'] == adjusted_counts['nrpre']
True
>>> counts['nrpost'] == adjusted_counts['nrpost']
True
>>> adjusted_counts['nrerrpre'] == {'A':250, 'C':1, 'G':21, 'T':10}
True
>>> adjusted_counts['nrerrpost'] == {'A':1000, 'C':30, 'G':20, 'T':41}
True
>>> wt = 'A'
>>> counts = {'nrpre':{'A':500, 'C':10, 'G':40, 'T':20}, 'nrpost':{'A':500, 'C':20, 'G':20, 'T':20}, 'nrerr':{'A':1000, 'C':30, 'G':20, 'T':50}}
>>> adjusted_counts = AdjustErrorCounts(wt, counts, maxexcess=1)
>>> counts['nrpre'] == adjusted_counts['nrpre']
True
>>> counts['nrpost'] == adjusted_counts['nrpost']
True
>>> adjusted_counts['nrerr'] == {'A':1000, 'C':21, 'G':20, 'T':41}
True
>>> counts = {'nrstart':{'A':500, 'C':10, 'G':40, 'T':20}, 'nrs1':{'A':1000, 'C':20, 'G':80, 'T':40}, 'nrs2':{'A':500, 'C':20, 'G':20, 'T':20}, 'nrerr':{'A':1000, 'C':30, 'G':20, 'T':50}}
>>> adjusted_counts = AdjustErrorCounts(wt, counts, maxexcess=1)
>>> counts['nrstart'] == adjusted_counts['nrstart']
True
>>> counts['nrs1'] == adjusted_counts['nrs1']
True
>>> counts['nrs2'] == adjusted_counts['nrs2']
True
>>> adjusted_counts['nrerr'] == {'A':1000, 'C':21, 'G':20, 'T':41}
True
utils.AlignRead(refseq, read, refseqstart, maxmuts, counts, chartype)

Attempt to gaplessly align a read at a specific location.

refseq and read are strings, assumed to be upper case.

Tries to align refseq starting at position refseqstart (in 1, 2, … numbering), counting the number of mutations. If the read aligns with \(\le\) maxmuts mutations, then returns True and updates counts as described below; otherwise returns False and does nothing to counts. Note that characters with ambiguous nucleotides (N) nucleotides are ignored, and not counted as mutations but also not recorded as counts.

chartype is the type of character for which we count mutations. Allowable values:

  • ‘codon’ : this requires refseq to be an in-frame gene, and only counts identities when codon is fully spanned.

counts is a dictionary that keyed by every site of chartype in refseq using 1, 2, … numbering.

This functions modifies counts as follows: counts[isite][char] is incremented (creating entry for char if not already present) if the read aligns with a character of char at isite.

>>> refseq = 'ATGGGACCC'
>>> read = 'GGGAC'
>>> counts = {1:{}, 2:{}, 3:{}}
>>> AlignRead(refseq, read, 3, 0, counts, 'codon')
True
>>> counts[1] == {} and counts[2] == {'GGA':1} and counts[3] == {}
True
>>> read2 = 'GGTAC'
>>> AlignRead(refseq, read2, 3, 0, counts, 'codon')
False
>>> AlignRead(refseq, read2, 3, 1, counts, 'codon')
True
>>> counts[1] == {} and counts[2] == {'GGA':1, 'GTA':1} and counts[3] == {}
True
>>> AlignRead(refseq, read2, 3, 1, counts, 'codon')
True
>>> counts[1] == {} and counts[2] == {'GGA':1, 'GTA':2} and counts[3] == {}
True
>>> read3 = 'GGNAC'
>>> AlignRead(refseq, read3, 3, 1, counts, 'codon')
True
>>> counts[1] == {} and counts[2] == {'GGA':1, 'GTA':2} and counts[3] == {}
True
>>> refseq = 'ATGGGACCC'
>>> read = 'GGACCC'
>>> counts = {1:{}, 2:{}, 3:{}}
>>> AlignRead(refseq, read, 4, 1, counts, 'codon')
True
>>> counts[1] == {} and counts[2] == {'GGA':1} and counts[3] == {'CCC':1}
True
>>> read = 'NGACCC'
>>> AlignRead(refseq, read, 4, 1, counts, 'codon')
True
>>> counts[1] == {} and counts[2] == {'GGA':1} and counts[3] == {'CCC':2}
True
utils.AlignSubamplicon(refseq, r1, r2, refseqstart, refseqend, maxmuts, maxN, chartype, counts, use_cutils=True)

Tries to align subamplicon into reference sequence.

The return value is (True, nmuts) if the subamplicon aligns and False otherwise. In the case where the subamplicon aligns, nmuts is the number of mutations of the type indicated by chartype. If the subamplicon aligns, then the character counts in counts are updated. A subamplicon is considered to align if there are <= maxmuts mutations of chartype. Characters that contain ambiguous nucleotides (N) are not counted as mutations or recorded as counts. If r1 and r2 overlap, identities are counted as ambiguous unless both reads agree. If r1 and r2 do not reach far enough to overlap, and uncalled identities in between are called as ambiguous.

Note that if using the default fast C implementation (which is done if use_cutils is True), then there is not much error checking on the validity of the calling variables. Therefore, you could end up with a segmentation fault with invalid calling variables and use_cutils of False.

Here are the calling variables:

  • refseq : string giving the reference sequencing to which we attempt to align the subamplicon. Assumed uppercase nucleotides.

  • r1 : string giving the forward read, assumed uppercase nucleotides with N for ambiguous / low-quality sites. Note that if this read has 5’ portions (such as a barcode) that we aren’t aligning, they should have been trimmed prior to calling this function.

  • r2 : like r1 but for reverse read.

  • refseqstart : nucleotide in refseq (in 1, 2, … numbering) where

    first nucleotide in r1 aligns.

  • refseqend : nucleotide in refseq (in 1, 2, … numbering) where first nucleotide in r2 aligns (note that r2 then reads backwards towards 5’ end of refseq).

  • maxmuts : maximum number of mutations of character type chartype that are allowed; subamplicons with more than this many mutations are considered not to align and False is returned.

  • maxN : maximum number of ambiguous nucleotides allowed in the consensus built up from the two reads (i.e. including the reads and any overlap).

  • chartype : type of character for which mutations are counted; currently the only allowable value is the string “codon”.

  • counts : a dictionary for counting mutations to refseq of character type chartype in the format used by dms_tools.utils.WriteDMSCounts and dms_tools.utils.ReadDMSCounts.

  • use_cutils : do we use the fast C-implemenation of this function in dms_tools.cutils?

>>> counts = {'1':dict([('WT', 'ATG')] + [(codon, 0) for codon in dms_tools.codons]),
...           '2':dict([('WT', 'GGG')] + [(codon, 0) for codon in dms_tools.codons]),
...           '3':dict([('WT', 'AAA')] + [(codon, 0) for codon in dms_tools.codons])}
>>> returntup = AlignSubamplicon('ATGGGGAAA', 'GGGGAA', 'TTTCCC', 3, 9, 1, 1, 'codon', counts)
>>> returntup == (True, 0)
True
>>> all([counts['1'][codon] == 0 for codon in dms_tools.codons])
True
>>> counts['2']['GGG'] == 1
True
>>> all([counts['2'][codon] == 0 for codon in dms_tools.codons if codon != 'GGG'])
True
>>> all([counts['2'][codon] == 0 for codon in dms_tools.codons if codon != 'GGG'])
True
>>> counts['3']['AAA'] == 1
True
>>> all([counts['3'][codon] == 0 for codon in dms_tools.codons if codon != 'AAA'])
True
>>> returntup = AlignSubamplicon('ATGGGGAAA', 'GGGGAA', 'TTTCCC', 1, 9, 1, 1, 'codon', counts)
>>> returntup == False
True
>>> returntup = AlignSubamplicon('ATGGGGAAA', 'GGGGAT', 'TTTCCC', 3, 9, 1, 0, 'codon', counts)
>>> returntup == False
True
>>> returntup = AlignSubamplicon('ATGGGGAAA', 'GGGGAT', 'TTTCCC', 3, 9, 1, 1, 'codon', counts)
>>> returntup == (True, 0)
True
>>> counts['2']['GGG'] == 2
True
>>> counts['3']['AAA'] == 1
True
>>> returntup = AlignSubamplicon('ATGGGGAAA', 'GAGGAA', 'TTTCCT', 3, 9, 0, 1, 'codon', counts)
>>> returntup == False
True
>>> returntup = AlignSubamplicon('ATGGGGAAA', 'GAGGAA', 'TTTCCT', 3, 9, 1, 1, 'codon', counts)
>>> counts['2']['GGG'] == 2
True
>>> counts['2']['AGG'] == 1
True
utils.AvgMutRate(counts, chartype)

Returns average mutation rate.

counts is a dictionary of deep mutational scanning counts in the format returned by file_io.ReadDMSCounts.

chartype is the character type of these counts, with the same meaning as for file_io.ReadDMSCounts.

This function computes the average mutation rate over all counts for all sites for mutations with each possible number of nucleotide changes. Specifically, we define the average rate of mutations to characters with \(m\) nucleotide changes relative to wildtype as

\[\overline{\epsilon_m} = \frac{1}{L}\sum\limits_r \frac{1}{N_r} \sum\limits_x n_{r,x} \times \delta_{m,D_{x,\operatorname{wt}\left(r\right)}}\]

where \(r\) ranges over all \(L\) sites for which there are deep mutational scanning counts, \(N_r\) is the total number of counts at site \(r\), \(n_{r,x}\) is the number of counts that report character \(x\) at site \(r\), \(D_{x,\operatorname{wt}\left(r\right)}\) is the number of nucleotide differences between character \(x\) and the wildtype character \(\operatorname{wt}\left(r\right)\) at site \(r\), and \(\delta_{xy}\) is the Kronecker delta. Note that this definition implies \(1 = \sum_m \overline{\epsilon_m}\). For nucleotides, there are just two values of \(m\): a value of \(m = 0\) is no differences (the wildtype character) and a value of \(m = 1\) is the a mutant nucleotide. For codons, \(m\) can be 1, 2, or 3.

The return value is the dictionary mutrates which is keyed by all values of \(m\) for the chartype in question. The value of mutrates[m] is the mutation rate for characters with this number of differences.

>>> counts = {'1':{'WT':'A', 'A':7, 'C':1, 'G':1, 'T':1},
...           '2':{'WT':'C', 'A':2, 'C':8, 'G':0, 'T':0}}
>>> mutrates = AvgMutRate(counts, 'DNA')
>>> print "%.3f" % mutrates[0]
0.750
>>> print "%.3f" % mutrates[1]
0.250
utils.BuildReadConsensus(reads, minreadidentity, minreadconcurrence, maxreadtrim, use_cutils=True)

Builds consensus of reads if they are sufficiently identical.

reads : is a list of (r1, r2) read pairs, which are assumed to be upper case with N inserted for ambiguous or low-quality positions (i.e. the reads have passed through CheckReadQuality). reads must have at least two entries.

minreadidentity : all reads in reads are required to have >= this fraction of their nucleotides identical and non-ambiguous (not N) to the first read in reads. If this is not the case, return False.

minreadconcurrence : among reads that pass minreadidentity, return the consensus identity at a position if >= this fraction; otherwise make that position an N. Must be > 0.5 and <= 1.

maxreadtrim : all R1 reads in reads need to be same length. If they aren’t initially, trim up to this many nucleotides off the longer ones. If that doesn’t make them the same length, then return False. The same is done for R2 reads.

use_cutils specifies that we use the fast C implementation of this function provided by dms_tools.cutils.

The return variable will be False if reads differ in length after applying maxreadtrim or minreadidentity is failed; otherwise it will be a 2-tuple (r1_consensus, r2_consensus) with each element a string of the consensus with N at positions where the read concurrence does not satisfy minreadconcurrence.

>>> reads = [('ATGC', 'CGAT'), ('ATGC', 'CGAT')]
>>> BuildReadConsensus(reads, 0.75, 0.6, 1)
('ATGC', 'CGAT')
>>> reads = [('TTGC', 'CGAT'), ('ATGC', 'CGANA')]
>>> BuildReadConsensus(reads, 0.75, 0.6, 1)
('NTGC', 'CGAN')
utils.CheckReadQuality(r1, r2, q1, q2, minq, maxlowqfrac, barcodelength, use_cutils=True)

Checks quality of reads and replaces low Q nucleotides with N.

r1 and r2 are strings representing a pair of sequencing reads.

q1 and q2 are strings giving the Q-scores for the nucleotides in r1 and r2; these Q-scores are assumed to be encoded as the ASCII code for the Q-score + 33.

minq is the minimum allowed Q-score (nucleotides with Q < minq) are considered low quality.

maxlowqfrac is the maximum allowable number of N or low quality nucleotides allowed in each read.

barcodelength is the number of nucleotides at the beginning of each read that are required to be high-quality (Q > minq and not an N).

use_cutils specifies that we use the fast C implementation of this function provided by dms_tools.cutils.

This function examines the quality of read r1 and r2. It returns False if any of the following are true:

1) Either read has a low-quality nucleotide or an N in the first barcodelength nucleotides.

2) Either read has > maxlowqfrac of its nucleotides that are either low-quality or are N.

If neither of those conditions are true, then the return variable is (r1_new, r2_new), which are copies of r1 and r2 where all nucleotides are upper case and any low quality nucleotides have been replaced by N.

utils.ClassifyCodonCounts(codon_counts)

Classifies codon mutations.

CALLING VARIABLES:

codon_counts : dictionary as returned by dms_tools.file_io.ReadDMSCounts with chartype of codon.

RESULT OF CALLING THIS FUNCTION:

This function returns a new dictionary that is a copy of codon_counts but also has the following new keys for each site:

‘COUNTS’ : total number of counts of non-ambiguous codons.

‘N_WT’ : total number of called codons that match ‘WT’, the wildtpe codon at this position.

‘N_NS’ : total number of called codons that represent non-synonymous mutations from the wildtype codon. Mutations from stop codons to non-stop codons are classified as nonsynonymous. Mutations from non-stop codons to stop codons are not classified as nonsynonymous, instead look under ‘N_STOP’.

‘N_SYN’ : total number of called codons that represent synonymous mutations from the wildtype codon. Mutations from stop codons to other stop codons are classified as synonymous.

‘N_1MUT’, ‘N_2MUT’, ‘N_3MUT’ : the total number of called mutant codons that contain one, two, or three nucleotide mutations relative to the wildtype codon at the site.

‘N_STOP’ : total number of definitively called codons (all nucleotides upper case) that encode stop codons.

In addition, the following keys are added directly to codon_counts, and represent totals over all codon positions:

‘TOTAL_COUNTS’ : total number of counts for all codons.

‘TOTAL_NS’ : total number of ‘N_NS’ for all codons.

‘TOTAL_SYN’ : total number of ‘N_SYN’ for all codons.

‘TOTAL_STOP’ : total number of ‘N_STOP’ for all codons.

‘TOTAL_MUT’ : sum of ‘N_NS’ + ‘N_SYN’ + ‘N_STOP’, total number of mutations.

‘TOTAL_N_1MUT’, ‘TOTAL_N_2MUT’, ‘TOTAL_N_3MUT’ : total number of ‘N_1MUT’, N_2MUT’, and ‘N_3MUT’ for all codons.

‘TOTAL_1MUT_AtoC’, ‘TOTAL_1MUT_AtoG’, etc : total number of one-nucleotide codon mutations that change from A to C, etc.

utils.CodonMutsCumulFracs(codon_counts, maxcumul=500)

Fraction of codon mutations found >= some number of times.

codon_counts : a dictionary of the type returned by calling dms_tools.file_io.ReadDMSCounts with chartype of codon.

maxcumul : only compute cumulative fraction up to this many occurrences of mutation.

For all sites in codon_counts, this function iterates through all possible codon mutations and counts how many times they occur. These counts are then used to generate the following is the number of codons. It records the number of definitely called (all uppercase nucleotides) occurrences of each of these mutations. It then returns the following tuple: (all_cumulfracs, all_counts, syn_cumulfracs, syn_counts, multi_nt_all_cumulfracs, multi_nt_all_counts, multi_nt_syn_cumulfracs, multi_nt_syn_counts)

  • all_cumulfracs[i] is the fraction of all of the possible codon mutations that are found >= i times.
  • all_counts is the total number of unique codon mutations.
  • syn_cumulfracs[i] is the fraction of all of synonymous codon mutations that are found >= i times.
  • syn_counts is the total number of unique synonymous codon mutations.
  • The next four elements in the tuple (prefixed with multi_nt_) are like the four above but only for multi-nucleotide codon mutations.
utils.NaturalSort(xlist)

Performs a natural sort (alphanumeric sort) in a list of strings.

xlist is a list of strings. Sorts them, reading the integer portion as numbers. Currently works for integers, but not floats (anything including or after a decimal is handled as a string, not number).

Any entries that are integers or floats are converted to strings.

This function simply returns None, but xlist is sorted in place, with the integer/float to string conversion if relevant.

>>> xlist = ['1', '20', '100', '5A', '51A', '5']
>>> NaturalSort(xlist)
>>> print xlist
['1', '5', '5A', '20', '51A', '100']
>>> xlist = [1, '20', 100, '5A', '51A', '5']
>>> NaturalSort(xlist)
>>> print xlist
['1', '5', '5A', '20', '51A', '100']
utils.ParseNSMutFreqBySite(countsfile, chartype)

Parses a deep mutational scanning counts file and returns the nonsynonymous mutation frequency at each site.

countsfile is a string that gives the name of a file.

chartype is the type of character, with valid values consisting of:

  • codon : DNA codons.
  • aminoacids_nostop : amino acids not including stop codons.
  • aminoacids_withstop : amino acids including stop codons (*).

The returned item is a list of tuples of (site, nonsynonymous frequency).

utils.Pref_or_DiffPref(data, tol=0.001)

Determines if data represent preferences or differential preferences.

data is a dictionary keyed by site, with each entry a dictionary keyed by character giving the preference / differential preference for that character.

If values for each site are >= 0 and <= 1 and sum to one (within tol), then returns ‘preferences’.

If values are <= 1 and sum to zero (within tol), then returns ‘diffprefs’.

Otherwise returns ‘neither preferences nor diffprefs’

>>> data = {'1':{'A':0.5, 'T':0.1, 'C':0.2, 'G':0.2}}
>>> print Pref_or_DiffPref(data)
preferences
>>> data = {'1':{'A':-0.5, 'T':0.1, 'C':0.2, 'G':0.2}}
>>> print Pref_or_DiffPref(data)
diffprefs
>>> data = {'1':{'A':-0.5, 'T':0.1, 'C':0.2, 'G':0.1}}
>>> print Pref_or_DiffPref(data)
neither preferences nor diffprefs
utils.PrefsToEnrichments(wts, pi_means, include_wt)

Converts site-specific preferences to enrichment ratios.

wts and pi_means have the same meaning as the arguments by the same name returned by file_io.ReadPreferences.

If include_wt is True, then the returned dictionary includes ratios for the wildtype characters (these are always one). If it is False, then ratios for the wildtype characters are not included.

The return value is a dictionary enrichments, where enrichments[r][x] returns the enrichment ratio for character x at site r for all sites and characters in pi_means.

The enrichment ratio \(\phi_{r,x}\) of site \(r\) for character \(x\) is defined in terms of the preference \(\pi_{r,x}\) as

\[\phi_{r,x} = \frac{\pi_{r,x}}{\pi_{r,\operatorname{wt}\left(r\right)}}\]

where \(\operatorname{wt}\left(r\right)\) is the wildtype residue at site \(r\).

>>> wts = {'1':'A'}
>>> pi_means = {'1':{'A':0.4, 'C':0.45, 'G':0.1, 'T':0.05}}
>>> enrichments = PrefsToEnrichments(wts, pi_means, include_wt=True)
>>> print "%.3f" % enrichments['1']['A']
1.000
>>> print "%.3f" % enrichments['1']['C']
1.125
>>> print "%.3f" % enrichments['1']['G']
0.250
>>> print "%.3f" % enrichments['1']['T']
0.125
>>> enrichments = PrefsToEnrichments(wts, pi_means, include_wt=False)
>>> 'A' in enrichments
False
utils.RMS(xlist)

Computes root-mean-square value of entries in xlist.

>>> xlist = [-0.2, 0.1, 0.03, 0.5]
>>> print "%.5f" % RMS(xlist)
0.27427
utils.RemoveStopFromDiffPrefs(deltapi_means)

Removes stop codon and rescales differential preferences to sum to zero.

Stop codons are denoted by *.

Essentially, the preference assigned to the stop codon is removed and all other preferences are scaled by an additive factor so they sum to zero.

If there is not a stop codon present, then nothing is done.

>>> aas = dms_tools.aminoacids_withstop
>>> deltapi_means = {'1':dict([(aa, 1.0 / (len(aas) - 1)) for aa in aas if aa != '*'])}
>>> deltapi_means['1']['*'] = -1.0
>>> nostop_deltapi_means = RemoveStopFromDiffPrefs(deltapi_means)
>>> deltapi_means == nostop_deltapi_means
False
>>> all(["%.3f" % nostop_deltapi_means['1'][aa] == '0.000' for aa in aas if aa != '*'])
True
>>> '*' in nostop_deltapi_means['1']
False
>>> aas = dms_tools.aminoacids_withstop
>>> deltapi_means = {'1':dict([(aa, 1.0 / (len(aas) - 1)) for aa in aas[1 : ]])}
>>> deltapi_means['1'][aas[0]] = -1.0
>>> nostop_deltapi_means = RemoveStopFromDiffPrefs(deltapi_means)
>>> deltapi_means == nostop_deltapi_means
False
>>> '*' in nostop_deltapi_means['1']
False
>>> print '%.5f' % nostop_deltapi_means['1'][aas[0]]
-0.99750
>>> print '%.5f' % nostop_deltapi_means['1'][aas[1]]
0.05250
>>> aas = dms_tools.aminoacids_nostop
>>> deltapi_means = {'1':dict([(aa, 1.0 / (len(aas) - 1)) for aa in aas[1 : ]])}
>>> deltapi_means['1'][aas[0]] = -1.0
>>> nostop_deltapi_means = RemoveStopFromDiffPrefs(deltapi_means)
>>> deltapi_means == nostop_deltapi_means
True
utils.RemoveStopFromPreferences(pi_means)

Removes stop codon and renormalizes preference to sum to one.

Stop codons are denoted by *.

Essentially, the preference assigned to the stop codon is removed and all other preferences are scaled by a multiplicative factor \(\ge 1\) until they sum to one.

If there is not a stop codon present, then nothing is done.

>>> aas = dms_tools.aminoacids_withstop
>>> pi_means = {'1':dict([(aa, 1.0 / len(aas)) for aa in aas]), '2':dict([(aas[0], 0.9)] + [(aa, 0.1 / (len(aas) - 1)) for aa in aas[1 : ]])}
>>> nostop_pi_means = RemoveStopFromPreferences(pi_means)
>>> pi_means == nostop_pi_means
False
>>> all(["%.3f" % nostop_pi_means['1'][aa] == '0.050' for aa in aas if aa != '*'])
True
>>> '*' in nostop_pi_means['1']
False
>>> '*' in nostop_pi_means['2']
False
>>> print '%.4f' % nostop_pi_means['2'][aas[0]]
0.9045
>>> print '%.6f' % nostop_pi_means['2'][aas[1]]
0.005025
>>> aas = dms_tools.aminoacids_nostop
>>> pi_means = {'1':dict([(aa, 1.0 / len(aas)) for aa in aas]), '2':dict([(aas[0], 0.9)] + [(aa, 0.1 / (len(aas) - 1)) for aa in aas[1 : ]])}
>>> nostop_pi_means = RemoveStopFromPreferences(pi_means)
>>> pi_means == nostop_pi_means
True
utils.SiteEntropy(xlist, base=2.0)

Computes site entropy to log base base.

xlist should be a list of numbers \(\ge 0\) that sum to one.

>>> print "%.3f" % SiteEntropy([1.0, 0.0])
0.000
>>> print "%.3f" % SiteEntropy([0.5, 0.5])
1.000
utils.Subassemble(counts, minpersite, minconcurrence, refseq_chars)

Subassembles sequence if possible.

counts is keyed by all site numbers (1, 2, … numbering) in the sequence. counts[r][x] is the number of counts for character x at site r.

refseq_chars is keyed by all site numbers, and the value is the wildtype (reference sequence) character at that site.

A read is subassembled if there are >= minpersite counts at each site, and >= minconcurrence of these counts agree on the identity at that site. minconcurrence must be > 0.5 to ensure a unique subassembly. minpersite must be >= 1.

The return variable is the 3-tuple (subassembled, seq, failurestring). subassembled is True if the subassembly is successful at all sites, and False otherwise. seq is the subassembled sequence; at sites that cannot be successfully subassembled there is an N (for nucleotide characters) or a NNN (for codon characters). failurestring is a string given the reason subassembly failed (or ‘’) if subassembly succeeded.

>>> counts = {1:{'ATG':2}, 2:{'GGA':4, 'AGA':1}}
>>> refseq_chars = {1:'ATG', 2:'AGA'}
>>> Subassemble(counts, 2, 0.8, refseq_chars)
(True, 'ATGGGA', '')
>>> Subassemble(counts, 3, 0.8, refseq_chars)
(False, 'NNNGGA', 'insufficient counts at 1')
>>> Subassemble(counts, 2, 0.9, refseq_chars)
(False, 'ATGNNN', 'insufficient concurrence between mutant and wildtype identity at 2')
>>> refseq_chars = {1:'ATG', 2:'TGA'}
>>> Subassemble(counts, 2, 0.9, refseq_chars)
(False, 'ATGNNN', 'insufficient concurrence between two mutant identities at 2')
>>> counts = {1:{}, 2:{'GGA':4, 'AGA':1}}
>>> refseq_chars = {1:'ATG', 2:'AGA'}
>>> Subassemble(counts, 2, 0.8, refseq_chars)
(False, 'NNNGGA', 'no counts at 1')
>>> counts = {1:{'GGA':1}, 2:{'AGA':1}}
>>> refseq_chars = {1:'ATG', 2:'AGA'}
>>> Subassemble(counts, 1, 0.75, refseq_chars)
(True, 'GGAAGA', '')
utils.SumCodonToAA(codondict, includestop=True)

Sums all codon entries for each amino acid.

codondict is a dictionary keyed by all codons (dms_tools.codons), with values being counts.

Note that this is NOT a dictionary of the form returned by dms_tools.file_io.ReadDMSCounts(), which has sites as keys and codondicts as values.

includestop specifies whether we also sum values for stop codons to give an entry for * (the stop codon). Do this only if has default value of True.

The return value is aadict which is keyed by all amino acids and has values equal to the sum of the entries for the encoding codons in codondict.

>>> codondict = dict([(codon, 0) for codon in dms_tools.codons])
>>> codondict['GGG'] = 0.1
>>> codondict['GGA'] = 0.2
>>> codondict['TGA'] = 0.3
>>> codondict['CCC'] = 0.4
>>> aadict = SumCodonToAA(codondict)
>>> print "%.2f" % aadict['G']
0.30
>>> print "%.2f" % aadict['P']
0.40
>>> print "%.2f" % aadict['*']
0.30
>>> print "%.2f" % aadict['Y']
0.00
>>> aadict = SumCodonToAA(codondict, includestop=False)
>>> print "%.2f" % aadict['G']
0.30
>>> print "%.2f" % aadict['P']
0.40
>>> print "%.2f" % aadict['*']
Traceback (most recent call last):
    ...
KeyError: '*'
>>> print "%.2f" % aadict['Y']
0.00
utils.SumCounts(counts, chartype, normalize=False)

Sums counts at each site across two or more experimental samples.

counts : A list of dictionaries, where each dictionary contains counts in the format generated by dms_tools.file_io.ReadDMSCounts. Each counts dictionary should be keyed by the same set of sites and characters.

chartype is the type of character. Valid values are the following strings:

  • DNA : DNA nucleotides, those listed in the nts variable defined in this module.
  • codon : DNA codons, those listed in the codons variable defined in this module.
  • aminoacids_nostop : amino acids, not including stop codons.
  • aminoacids_withstop : amino acids, including stop codons (*).

normalize : If set to true, at each site across the counts dictionary , the counts are normalized to the minimum number of counts appearing at that site across the experimental samples.

This function takes a list of count dictionaries, and makes a new dictionary containing the sum of those counts. The characters at each site are summed with the corresponding sites and characters from the provided count dictionaries. If there are ambiguous nucleotides or codons (represented by a,t,c,g,n) in any counts dictionaries, these are excluded during the summing, and they are not included in the returned counts dictionary.

In addition, before summing the counts, the counts at each site may be normalized by the minimum number of counts at the site seen across the provided count dictionaries. As an example of normalization, if replicate 1 had 100 counts covering site 1 and replicate 2 had 200 counts covering site 1, all the character counts in site 1 replicate 2 would each be divided by 2. Then the counts at site 1 for replicate 1 and 2 would be added. This would continue for each site in the counts dictionary. Note, this function uses rounding to the nearest integer during the normalization, so the total counts at a site across the different count dictionaries after normalization may differ by <0.1%.

>>> count1 = {'1':{'A':10, 'C': 20, 'G': 30, 'T': 40, 'WT': 'A'}, '2':{'A':20, 'C': 40, 'G': 10, 'T': 30, 'WT': 'T'}}
>>> count2 = {'1':{'A':40, 'C': 20, 'G': 80, 'T': 60, 'WT': 'A'}, '2':{'A':80, 'C': 40, 'G': 60, 'T': 20, 'WT': 'T'}}
>>> counts = [count1, count2]
>>> SumCounts(counts, 'dna')
{'1': {'A': 50, 'C': 40, 'WT': 'A', 'T': 100, 'G': 110}, '2': {'A': 100, 'C': 80, 'WT': 'T', 'T': 50, 'G': 70}}
>>> SumCounts(counts, 'dna', normalize=True)
{'1': {'A': 30, 'C': 30, 'WT': 'A', 'T': 70, 'G': 70}, '2': {'A': 60, 'C': 60, 'WT': 'T', 'T': 40, 'G': 40}}
utils.SumPrefsDiffPrefs(datalist, minus=[])

Sums preferences and differential preferences.

datalist is a list of preferences or differential preferences, which must be keyed by the same sets of sites and characters.

minus is an optional list that is subtracted rather than added in the sum.

Returns a dictionary giving the sums.

>>> prefs = {'1':{'A':0.2, 'T':0.1, 'C':0.5, 'G':0.2}}
>>> diffprefs = {'1':{'A':-0.1, 'T':0.3, 'C':-0.2, 'G':0.0}}
>>> datasum = SumPrefsDiffPrefs([prefs, diffprefs])
>>> print '%.2f' % datasum['1']['A']
0.10
>>> print '%.2f' % datasum['1']['T']
0.40
>>> print '%.2f' % datasum['1']['C']
0.30
>>> print '%.2f' % datasum['1']['G']
0.20
>>> minusprefs = {'1':{'A':0.1, 'T':0.1, 'C':-0.1, 'G':-0.1}}
>>> datasum = SumPrefsDiffPrefs([prefs, diffprefs], minus=[minusprefs])
>>> print '%.2f' % datasum['1']['A']
0.00
>>> print '%.2f' % datasum['1']['T']
0.30
>>> print '%.2f' % datasum['1']['C']
0.40
>>> print '%.2f' % datasum['1']['G']
0.30
utils.TrimReads(r1, r2, q1, q2, R1trimlength=None, R2trimlength=None)

Trims a pair of reads to a specified length by removing nucleotides from the 3’ ends of the reads.

r1 and r2 are strings representing a pair of sequencing reads.

q1 and q2 are strings giving the Q-scores for the nucleotides in r1 and r2; these Q-scores are assumed to be encoded as the ASCII code for the Q-score + 33.

R1trimlength and R2trimlength are the lengths to which the r1 and r2 reads will be trimmed, as well as the corresponding Q-scores q1 and q2. If no lengths are specified, the un-modified r1, r2, q1, and q2 are returned.

This function trims the r1 and r2 reads in a pair of reads. Nucleotides are trimmed away from the 3’ end of each read to a specified length. The user-provided length includes the primer binding site and the barcode sequence at the 5’ end of each read. The trimming is done from the 3’ end of the read as Illumina Q-scores typically drop in this region resulting in less accurate sequence calls. This drop can be observed by running FastQC on the gzipped read files. In addition to trimming reads r1 and r2, the corresponding Q-scores q1 and q2 for those reads are trimmed in the same manner.

The return variable for this function is (r1trim, r2trim, q1trim, q2trim), which are the trimmed sequencing reads and corresponding Q-scores.

>>> r1 = 'GATTCTACATCCAAATGTGCACTGAACTTAAACTCAGTGATTATGAGGGGCGACTGATCCAGAACAGCTTAACAA'
>>> r2 = 'TGTTAAGCTGTTCTGGATCAGTCGCCCCTCATAATCACTGAGTTTAAGTTCAGTGCACATTTGGATGTAGAATCT'
>>> q1 = 'C9@ACCEFEGGG99-C9E<C9@E,,,CFFFC9CCC,,CA9EEEDD,,,,+66:FEECFFFCGCFE,@DFFFGGGG'
>>> q2 = '6,A@,,6CFFGGGGCA,EC<,CFFCCCFFFFG9<EF9EE,,,<CC,,;EEFC9<,;,;,;CE,,,C8EE99,BC,'
>>> TrimReads(r1, r2, q1, q2)
('GATTCTACATCCAAATGTGCACTGAACTTAAACTCAGTGATTATGAGGGGCGACTGATCCAGAACAGCTTAACAA', 'TGTTAAGCTGTTCTGGATCAGTCGCCCCTCATAATCACTGAGTTTAAGTTCAGTGCACATTTGGATGTAGAATCT', 'C9@ACCEFEGGG99-C9E<C9@E,,,CFFFC9CCC,,CA9EEEDD,,,,+66:FEECFFFCGCFE,@DFFFGGGG', '6,A@,,6CFFGGGGCA,EC<,CFFCCCFFFFG9<EF9EE,,,<CC,,;EEFC9<,;,;,;CE,,,C8EE99,BC,')
>>> (R1trimlength, R2trimlength) = (100, 100)
>>> TrimReads(r1, r2, q1, q2, R1trimlength, R2trimlength)
('GATTCTACATCCAAATGTGCACTGAACTTAAACTCAGTGATTATGAGGGGCGACTGATCCAGAACAGCTTAACAA', 'TGTTAAGCTGTTCTGGATCAGTCGCCCCTCATAATCACTGAGTTTAAGTTCAGTGCACATTTGGATGTAGAATCT', 'C9@ACCEFEGGG99-C9E<C9@E,,,CFFFC9CCC,,CA9EEEDD,,,,+66:FEECFFFCGCFE,@DFFFGGGG', '6,A@,,6CFFGGGGCA,EC<,CFFCCCFFFFG9<EF9EE,,,<CC,,;EEFC9<,;,;,;CE,,,C8EE99,BC,')
>>> (R1trimlength, R2trimlength) = (5, 10)
>>> TrimReads(r1, r2, q1, q2, R1trimlength, R2trimlength)
('GATTC', 'TGTTAAGCTG', 'C9@AC', '6,A@,,6CFF')

parsearguments module

Module that parses command-line arguments for the scripts in dms_tools.

Written by Jesse Bloom.

Defined in this module

  • ArgumentParserNoArgHelp : argument parser that prints help when no args
  • FloatGreaterThanZero : parses whether string is float greater than zero.
  • FloatBetweenZeroAndOne : parses whether is float between 0 and 1.
  • FloatBetweenHalfAndOne : parses whether is float > 0.5 and <= 1.
  • NonNegativeInt : parses whether string is non-negative integer.
  • IntGreaterEqual1 : parses whether string is integer >= 1.
  • IntGreaterEqual2 : parses whether string is integer >= 2.
  • ExistingFile : parses whether string gives an existing file.
  • PDF_File : parses whether string gives a PDF file name.
  • CommaSeparatedFASTQFiles : parses list of FASTQ files.
  • AlignSpecs : parses alignment specs for BarcodedSubampliconsParser
  • R2AlignSpecs : parses alignment specs for SubassembleParser
  • AlignprefixName : parses alignment prefixes/names for SummarizeAlignmentsParser
  • InferPrefsParser : parser for dms_inferprefs
  • InferDiffPrefsParser : parser for dms_inferdiffprefs
  • DiffSelectionParser : parser for dms_diffselection
  • MergeParser : parser for dms_merge
  • EditSitesParser : parsers for dms_editsites
  • CorrelateParser : parser for dms_correlate
  • LogoPlotParser : parser for dms_logoplot
  • BarcodedSubampliconsParser : parser for dms_barcodedsubamplicons
  • SubassembleParser : parser for dms_subassemble
  • MatchSubassembledBarcodesParser : parser for dms_matchsubassembledbarcodes
  • SummarizeAlignmentsParser : parser for dms_summarizealignments
  • SmartHelpFormatter : new class for formatting argument helps.

Detailed documentation

parsearguments.AlignSpecs(alignspecs)

Parses alignment specs for BarcodedSubampliconsParser.

>>> AlignSpecs('1,300,10,12')
(1, 300, 10, 12)
parsearguments.AlignprefixName(alignprefixname)

Parses prefix / name for SummarizeAlignmentsParser.

>>> AlignprefixName('replicate_1/replicate_1_mutDNA_,mutDNA')
('replicate_1/replicate_1_mutDNA_', 'mutDNA')
class parsearguments.ArgumentParserNoArgHelp(prog=None, usage=None, description=None, epilog=None, version=None, parents=[], formatter_class=<class 'argparse.HelpFormatter'>, prefix_chars='-', fromfile_prefix_chars=None, argument_default=None, conflict_handler='error', add_help=True)

Bases: argparse.ArgumentParser

Like argparse.ArgumentParser, but prints help when no arguments.

error(message)

Prints error message, then help.

parsearguments.BarcodedSubampliconsParser()

Returns argparse.ArgumentParser for dms_barcodedsubamplicons script.

parsearguments.CommaSeparatedFASTQFiles(files)

Returns name of comma-separated FASTQ files.

Checks that all files exist. Returns error if files cannot be found or if they are not .fastq or .fastq.gz Otherwise returns list of files.

parsearguments.CorrelateParser()

Returns argparse.ArgumentParser for dms_correlate script.

parsearguments.DiffSelectionParser()

Returns argparse.ArgumentParser for dms_diffselection script.

parsearguments.EditSitesParser()
parsearguments.ExistingFile(fname)

If fname is name of an existing file return it, otherwise an error.

It is also acceptable for fname to be the string “none”.

parsearguments.FloatBetweenHalfAndOne(x)

If x is float > 0.5 or <= 1, returns it, otherwise error.

parsearguments.FloatBetweenZeroAndOne(x)

If x is float >= 0 or <= 1, returns it, otherwise error.

parsearguments.FloatGreaterThanZero(x)

If x is string for float > 0, returns it, otherwise an error.

Designed based on this: http://stackoverflow.com/questions/12116685/how-can-i-require-my-python-scripts-argument-to-be-a-float-between-0-0-1-0-usin

>>> print "%.3f" % FloatGreaterThanZero('0.1')
0.100
>>> FloatGreaterThanZero('0.0')
Traceback (most recent call last):
    ...
ArgumentTypeError: 0.0 not a float greater than zero
>>> FloatGreaterThanZero('hi')
Traceback (most recent call last):
    ...
ValueError: could not convert string to float: hi
parsearguments.InferDiffPrefsParser()

Returns argparse.ArgumentParser for dms_inferdiffprefs script.

parsearguments.InferPrefsParser()

Returns argparse.ArgumentParser for dms_inferprefs script.

parsearguments.IntGreaterEqual1(n)

If n is integer >= 1 returns it, otherwise an error.

parsearguments.IntGreaterEqual2(n)

If n is integer >= 2 returns it, otherwise an error.

parsearguments.LogoPlotParser()

Returns argparse.ArgumentParser for dms_logoplot script.

parsearguments.MatchSubassembledBarcodesParser()

Returns argparse.ArgumentParser for dms_matchsubassembledbarcodes.

parsearguments.MergeParser()

Returns argparse.ArgumentParser for dms_merge script.

parsearguments.NonNegativeInt(n)

If n is non-negative integer returns it, otherwise an error.

>>> print "%d" % NonNegativeInt('8')
8
>>> NonNegativeInt('8.1')
Traceback (most recent call last):
   ...
ArgumentTypeError: 8.1 is not an integer
>>> print "%d" % NonNegativeInt('0')
0
>>> NonNegativeInt('-1')
Traceback (most recent call last):
   ...
ArgumentTypeError: -1 is not non-negative
parsearguments.PDF_File(fname)

If fname ends in PDF extension, returns it. Otherwise an error.

parsearguments.R2AlignSpecs(alignspecs)

Parses alignment specs for SubasssembleParser.

>>> R2AlignSpecs('1,300')
(1, 300)
class parsearguments.SmartHelpFormatter(prog, indent_increment=2, max_help_position=24, width=None)

Bases: argparse.ArgumentDefaultsHelpFormatter

Raw text for help beginning with R|, ArgumentDefaultsHelpFormatter otherwise.

Based on SmartFormatter described at http://stackoverflow.com/questions/3853722/python-argparse-how-to-insert-newline-in-the-help-text.

This option allows you to specify exact help format by preceding R|.

parsearguments.SubassembleParser()

Returns argparse.ArgumentParser for dms_subassemble.

parsearguments.SummarizeAlignmentsParser()

Returns argparse.ArgumentParser for dms_summarizealignments.

plot module

This module uses pylab and matplotlib to make plots. The pdf backend is used for matplotlib / pylab.

Written by Jesse Bloom.

Functions in this module

PlotCorrelation : plots correlation between two variables.

Base10Formatter : formats numbers in base 10

PlotPairedMutFracs : bar graph of mutation frequencies per sample

PlotMutCountFracs : plots fraction of mutations that occur >= a given number of times.

PlotDepth : plots per-site depth along primary sequence

PlotReadStarts : plots start of read distributions for subassembly.

PlotSampleBargraphs : plots bargraphs of categories for several samples.

Function documentation

plot.Base10Formatter(number, exp_cutoff, exp_decimal_digits, decimal_digits)

Converts a number into Latex formatting with scientific notation.

Takes a number and converts it to a string that can be shown in LaTex using math mode. It is converted to scientific notation if the criteria specified by exp_cutoff is met.

number : the number to be formatted, should be a float or integer. Currently only works for number >= 0.

exp_cutoff : convert to scientific notation if abs(math.log10(number)) >= exp_cutoff.

exp_decimal_digits : show this many digits after the decimal if number is converted to scientific notation.

decimal_digits : show this many digits after the decimal if number is NOT converted to scientific notation.

The returned value is a LaTex formatted string. If the number is zero, the returned string is simply ‘0’.

EXAMPLES:

>>> Base10Formatter(103, 3, 1, 1)
'103.0'
>>> Base10Formatter(103.0, 2, 1, 1)
'1.0 \\times 10^{2}'
>>> Base10Formatter(103.0, 2, 2, 1)
'1.03 \\times 10^{2}'
>>> Base10Formatter(2892.3, 3, 1, 1) 
'2.9 \\times 10^{3}'
>>> Base10Formatter(0.0, 3, 1, 1) 
'0'
>>> Base10Formatter(0.012, 2, 1, 1)
'1.2 \\times 10^{-2}'
>>> Base10Formatter(-0.1, 3, 1, 1)
Traceback (most recent call last):
    ...
ValueError: number must be >= 0
plot.PlotCorrelation(xs, ys, plotfile, xlabel, ylabel, logx=False, logy=False, corr=None, title=False, alpha=1.0, symmetrize=False, fixaxes=False, additionalxy=[], bigmargin=0.35, xsize=1.8, r2=False, marker_style='b.', additional_marker_style='r^', marker_size=4, additional_marker_size=3)

Plots the correlation between two variables as a scatter plot.

The data is plotted as a scatter plot. This function uses pylab / matplotlib. The calling variables use LaTex format for strings. So for example, ‘$10^5$’ will print the LaTex equivalent of this string. Similarly, certain raw text strings (such as those including underscores) will cause problems if you do not escape the LaTex format meaning. For instance, ‘x_label’ will cause a problem since underscore is not valid outside of math mode in LaTex, so you would need to use ‘x_label’ to escape the underscore.

CALLING VARIABLES:

  • xs and ys are lists of numbers, with the lists being of the same length. Entry xs[i] is plotted on the x-axis against entries ys[i] on the y-axis.
  • plotfile is a string giving the name of the plot PDF file that we create. It should end in the extension .pdf. If this plot already exists, it is overwritten.
  • xlabel is a string giving the label placed on the x-axis.
  • ylabel is a string giving the label placed on the y-axis.
  • logx specifies that we log transform the data in xs. This is False by default; set to True if you want to log transform (base 10 logarithms) the data.
  • logy is like logx, but for the data in ys.
  • corr specifies that we include a correlation coefficient on the plot. If it has its default value of None, then no correlation coefficient is included. Otherwise, corr = (r, p) where r is the correlation coefficient (displayed with the label R) and p is the P-value (displayed with the label P).
  • r2 : if True and using corr, then display \(R^2\) rather than \(R\).
  • title is a string giving the title placed above the plot. It can be False if no title is to be used. Otherwise, it should be the title string (using LaTex formatting, spaces are allowed). Is False by default.
  • alpha is the transparency of the plotted points. By default it is one, which means that there is no transparency. If you make the value closer to zero, then the points become partially transparent. The rationale is that if you have many overlapping points, making them partially transparent helps you better see the density. At any position on the plot, the intensity will be saturated when there are 1.0 / alpha points plotted. So a reasonable value of alpha might be something like 0.1.
  • symmetrize is an optional argument that is False by default. If True, we make the X and Y limits on the plot the same.
  • fixaxes is an optional argument that is False by default. If True, we fix both the X and Y axes to go from 0 to 1, with ticks at 0, 0.5, and 1. If you set this option to True, then you must set logx and logy to False.
  • additionalxy is an optional list. By default, it is an empty list. However, you can use it to plot additional data points with a different color. The main data (specified by xs and ys) is plotted with blue circles. For each addition set of data that you want to plot, include a 2-tuple of lists of the form (x2s, y2s) in additionalxy. Currently, only one such 2-tuple is allowed (so only one additional data set). These are plotted as red triangles, whereas the main data is plotted as blue circles. The same alpha value set by the alpha parameter applies to these points as well.
  • bigmargin is an optional argument that is 0.26 by default. It is the fraction of the plot width taken up by the larger margin, which is the bottom and left. Make larger if you need more space for axis labels.
  • xsize is an optional argument that is the width of the plot in inches. It is 1.8 by default.
  • marker_style and additional_marker_style are optional arguments to change the color/style of the marker for the main and additional data points, respectively. See the matplotlib documentation for a full explanation.
  • marker_size and additional_marker_size are optional arguments to set the size of the markers.
plot.PlotDepth(codon_counts, names, plotfile, mutdepth=False, y_axis_label=None, separatemuttypes=False)

Plots per-site depth along primary sequence.

codon_counts : a list of dictionaries giving the codon counts for each sample as read by dms_tools.file_io.ReadDMSCounts.

names : a list of strings giving the names of the samples corresponding to each entry in codon_counts.

plotfile : name of the output plot file created by this method (such as ‘plot.pdf’). The extension must be .pdf.

mutdepth : Boolean switch, if True then rather than plotting sequencing depth, we plot per-site mutation rate.

y_axis_label : a string, if specified, overrides the default y-axis label ‘reads’ or ‘mutation frequency’.

separatemuttypes : Boolean switch specifying that we plot a different line for each of nonsynonymous, synonymous, and stop codon mutations. Only can be used if mutdepth is True.

plot.PlotMutCountFracs(plotfile, title, names, all_cumulfracs, syn_cumulfracs, all_counts, syn_counts, legendloc, writecounts=True, nmax=None)

Plots fraction of mutations with >= a given number of counts.

Does this for both all mutations and synonymous mutations. The plots are placed side by side.

The plots show the fractions of mutations that are found >= n times for a range of values of n.

CALLING VARIABLES:

  • plotfile : string giving name of the created plot file. Must end in the extension .pdf.

  • title : string giving the title placed about the plot.

  • names : a list of strings giving the names of the samples to plot.

  • all_cumulfracs : a list of the same length as names giving the cumulative fractions for all mutations. Each entry should be a list, and all of these lists should be the same length. cumulfracs[i][n] gives the fraction of mutations to sample names[i] that are found >= n times. The x-axis of the created plot will go from 0 to len(all_cumulfracs[0]) - 1.

  • syn_cumulfracs : a list like all_cumulfracs except for synonymous mutations.

  • all_counts : integer counts of all mutations (the total number of mutations used for all_cumulfracs), used to create plot title.

  • syn_counts : like all_counts but for synonymous mutations.

  • legendloc : specifies the location of the legend. Should be a string. Valid values are:

    • bottom : put legend at the bottom of the plot.
    • right : put legend at the right of the plot.
  • writecounts is a Boolean switch specifying whether we include the counts of all mutations (specified by all_counts and syn_counts) in the plot title. We do this if writecounts is True, and do not if it is False.

nmax if specified, should be an integer > 1 giving the x-axis maximum.

plot.PlotPairedMutFracs(codon_counts, names, plotfile, ylabel='fraction')

Makes paired bar graph of mutation fractions per codon.

For each sample, calculates the fraction of codon counts that represent synonymous / nonsynoymous / stop codons, and one- / two- / three-nucleotide mutations. Makes paired stacked bar graphs showing these fractions. The fractions for all sites in the gene are aggregated together.

codon_counts : a list of dictionaries giving the codon counts for each sample as read by dms_tools.file_io.ReadDMSCounts.

names : a list of strings giving the names of the samples corresponding to each entry in codon_counts.

plotfile : name of the output plot file created by this method (such as ‘plot.pdf’). The extension must be .pdf.

ylabel : the text placed on the ylabel.

plot.PlotReadStarts(names, depthbystart, plotfile)

Plots distribution of read starts.

names : list of sample names.

depthbystart : dictionary keyed by each name in names. V Value is dictionary keyed by read start position, value is number of reads. All names must have same start positions.

plotfile : name of created PDF plot.

plot.PlotSampleBargraphs(names, categories, data, plotfile, ylabel, groupbyfirstword=True, ncolumns=3)

Plots bargraph of counts for different samples.

names is a list of the names of the different samples.

categories is a list of the different categories for each sample.

data is a dictionary keyed by each name in names, and that value is in turn a dictionary keyed by each category in categories.

plotfile is the name of the PDF file that we are creating.

ylabel is the label placed on the y-axis.

groupbyfirstword is a Boolean switch that specifies whether we group consecutive categories with the same first word in the string to have the same color.

ncolumns is the number of columns in each legend line.

weblogo module

Module for making sequence logos with the weblogolib package distributed with weblogo (http://weblogo.threeplusone.com/). This module interfaces with the weblogolib API, and so is only known to work with weblogolib version 3.4.

Written by Jesse Bloom.

Functions in this module

  • LogoPlot : plots a sequence logo of site-specific preferences or differential preferences.
  • LogoOverlay : makes logo plot overlays, called by LogoPlot.
  • KyteDoolittleColorMapping : maps amino-acid hydrophobicities to colors.
  • MWColorMapping : maps amino-acid molecular weights to colors.
  • ChargeColorMapping : maps amino-acid charge at neural pH to colors.
  • FunctionalGroupColorMapping : maps amino-acid functional groups (small, nucleophilic, hydrophobic, aromatic, acidic, amide, basic) to colors.

Function documentation

weblogo.ChargeColorMapping(maptype='jet', reverse=False)

Maps amino-acid charge at neutral pH to colors. Currently does not use the keyword arguments for maptype or reverse but accepts these arguments to be consistent with KyteDoolittleColorMapping and MWColorMapping for now.

weblogo.FunctionalGroupColorMapping(maptype='jet', reverse=False)

Maps amino-acid functional groups to colors. Currently does not use the keyword arguments for maptype or reverse but accepts these arguments to be consistent with the other mapping functions, which all get called with these arguments.

weblogo.KyteDoolittleColorMapping(maptype='jet', reverse=True)

Maps amino-acid hydrophobicities to colors.

Uses the Kyte-Doolittle hydrophobicity scale defined by:

J. Kyte & R. F. Doolittle: 
"A simple method for displaying the hydropathic character of a protein." 
J Mol Biol, 157, 105-132

More positive values indicate higher hydrophobicity, while more negative values indicate lower hydrophobicity.

The returned variable is the 3-tuple (cmap, mapping_d, mapper):

  • cmap is a pylab LinearSegmentedColorMap object.
  • mapping_d is a dictionary keyed by the one-letter amino-acid codes. The values are the colors in CSS2 format (e.g. #FF0000 for red) for that amino acid. The value for a stop codon (denoted by a * character) is black (#000000).
  • mapper is the actual pylab.cm.ScalarMappable object.

The optional calling argument maptype should specify a valid pylab color map.

The optional calling argument reverse specifies that we set up the color map so that the most hydrophobic residue comes first (in the Kyte-Doolittle scale the most hydrophobic comes last as it has the largest value). This option is True by default as it seems more intuitive to have charged residues red and hydrophobic ones blue.

weblogo.LogoOverlay(sites, overlayfile, overlay, nperline, sitewidth, rmargin, logoheight, barheight, barspacing, fix_limits={}, fixlongname=False, overlay_cmap=None)

Makes overlay for LogoPlot.

This function creates colored bars overlay bars showing up to two properties. The trick of this function is to create the bars the right size so they align when they overlay the logo plot.

CALLING VARIABLES:

  • sites : same as the variable of this name used by LogoPlot.
  • overlayfile is a string giving the name of created PDF file containing the overlay. It must end in the extension .pdf.
  • overlay : same as the variable of this name used by LogoPlot.
  • nperline : same as the variable of this name used by LogoPlot.
  • sitewidth is the width of each site in points.
  • rmargin is the right margin in points.
  • logoheight is the total height of each logo row in points.
  • barheight is the total height of each bar in points.
  • barspacing is the vertical spacing between bars in points.
  • fix_limits has the same meaning of the variable of this name used by LogoPlot.
  • fixlongname has the same meaning of the variable of this name used by LogoPlot.
  • overlay_cmap has the same meaning of the variable of this name used by LogoPlot.
weblogo.LogoPlot(sites, datatype, data, plotfile, nperline, numberevery=10, allowunsorted=False, ydatamax=1.01, overlay=None, fix_limits={}, fixlongname=False, overlay_cmap=None, ylimits=None, relativestackheight=1, custom_cmap='jet', map_metric='kd', noseparator=False)

Constructs a sequence logo showing amino-acid or nucleotide preferences.

The heights of each letter is equal to the preference of that site for that amino acid or nucleotide.

Note that stop codons may or may not be included in the logo depending on whether they are present in pi_d.

CALLING VARIABLES:

  • sites is a list of all of the sites that are being included in the logo, as strings. They must be in natural sort order (as is done by dms_tools.utils.NaturalSort) or an error will be raised unless allowunsorted is True. The sites in the plot are ordered in the same arrangement listed in sites. These should be strings, not integers.

  • datatype should be one of the following strings:

    • ‘prefs’ for preferences
    • ‘diffprefs’ for differential preferences
    • ‘diffsel’ for differential selection
  • data is a dictionary that has a key for every entry in sites. For every site r in sites, sites[r][x] is the value for character x. Preferences must sum to one; differential preferences to zero. All sites must have the same set of characters. The characters must be the set of nucleotides (dms_tools.nts) or the set of amino acids with or without stop codons (dms_tools.aminoacids_nostop or dms_tools.aminoacids_withstop).

  • plotfile is a string giving the name of the created PDF file of the logo plot. It must end in the extension .pdf.

  • nperline is the number of sites per line. Often 40 to 80 are good values.

  • numberevery is specifies how frequently we put labels for the sites on x-axis.

  • allowunsorted : if True then we allow the entries in sites to not be sorted. This means that the logo plot will not have sites in sorted order.

  • ydatamax : meaningful only if datatype is ‘diffprefs’. In this case, it gives the maximum that the logo stacks extend in the positive and negative directions. Cannot be smaller than the maximum extent of the differential preferences.

  • ylimits: is mandatory if datatype is ‘diffsel’, and meaningless otherwise. It is (ymin, ymax) where ymax > 0 > ymin, and gives extent of the data in the positive and negative directions. Must encompass the actual maximum and minimum of the data.

  • overlay : this argument allows you to make overlay bars that indicated other properties for the sites. By default, this option is None, meaning that no overlay is created. If you set it to something else, it must be a list giving either one or two properties. Each property is a tuple: (prop_d, shortname, longname) where:

    • prop_d is a dictionary keyed by site numbers that are in sites. For each r in sites, prop_d[r] gives the value of the property, or if there is no entry in prop_d for r, then the property is undefined and is colored white. Properties can either be:

      • continuous: in this case, all of the values should be numbers.
      • discrete : in this case, all of the values should be strings. While in practice, if you have more than a few discrete categories (different strings), the plot will be a mess.
    • shortname : short name for the property; will not format well if more than 4 or 5 characters.

    • longname : longer name for property used on axes label. Can be the same as shortname if you don’t need a different long name.

  • fix_limits is only meaningful if overlay is being used. In this case, for any shortname in overlay that also keys an entry in fix_limits, we use fix_limits[shortname] to set the limits for shortname. Specifically, fix_limits[shortname] should be the 2-tuple (ticks, ticknames). ticks should be a list of tick locations (numbers) and ticknames should be a list of the corresponding tick label for that tick.

  • If fixlongname is True, then we use the longname in overlay exactly as written; otherwise we add a parenthesis indicating the shortname for which this longname stands.

  • overlay_cmap can be the name of a valid matplotlib.colors.Colormap, such as the string jet or bwr. Otherwise, it can be None and a (hopefully) good choice will be made for you.

  • custom_cmap can be the name of a valid matplotlib.colors.Colormap which will be used to color amino-acid one-letter codes in the logoplot by the map_metric when either ‘kd’ or ‘mw’ is used as map_metric.

  • relativestackheight indicates how high the letter stack is relative to the default. The default is multiplied by this number, so make it > 1 for a higher letter stack.

  • map_metric specifies the amino-acid property metric used to map colors to amino-acid letters. Valid options are ‘kd’ (Kyte-Doolittle hydrophobicity scale, default), ‘mw’ (molecular weight), ‘functionalgroup’ (functional groups: small, nucleophilic, hydrophobic, aromatic, basic, acidic, and amide), and ‘charge’ (charge at neutral pH). If ‘charge’ is used, then the argument for custom_cmap will no longer be meaningful, since ‘charge’ uses its own blue/black/red colormapping. Similarly, ‘functionalgroup’ uses its own colormapping.

  • noseparator is only meaningful if datatype is ‘diffsel’ or ‘diffprefs’. If it set to True, then we do not print a black horizontal line to separate positive and negative values.

weblogo.MWColorMapping(maptype='jet', reverse=True)

Maps amino-acid molecular weights to colors. Otherwise, this function is identical to KyteDoolittleColorMapping

simulate module

Module for simulating test data for analysis with dms_tools.

Written by Jesse Bloom.

Functions in this module

  • SimulateLibraryCounts : simulates counts for testing inference of site-specific preferences.
  • SimulatedDiffPrefCounts : simulates counts for testing inference of differential preferences.

Function documentation

simulate.SimulateDiffPrefCounts(depths_outfiles, start_counts, diffprefs_file, avgepsilon, control_concentration=50.0, fracdiffer=0.25, diff_concentration=0.5, epsilon_concentration=300.0, seed=1)

Simulates counts data for testing differential preference inference.

Use this function if you want to simulate creation of a mutant library with known differential preferences and then test how accurately you can infer them.

This function assumes mutations made at codon level, and that the site-specific preferences are at the amino-acid level.

CALLING VARIABLES:

  • depths_outfiles : list of 2-tuples (depth, outfile).

    • depth is an integer giving the per-site sequencing depth.

    • outfile is a string giving the prefix for the created deep mutational scanning counts file. These files are deleted if they already exist, and then re-created. The suffixes are:

      • _start.txt starting library counts
      • _s1.txt control selection counts
      • _s2.txt alternate selection counts.
      • _err.txt error control counts, only created if avgepsilon is not zero.
  • start_counts : an existing deep mutational scanning counts file; the starting counts are proportional to these.

  • diffprefs_file : created file giving differential preferences used in simulation.

  • avgepsilon : average error rate in sequencing. This is the average rate at which each nucleotide is mis-read as some other nucleotide. Set to zero if there are no errors.

  • control_concentration : concentration for “control” preferences.

  • fracdiffer fraction of sites for which differential preferences are non-zero.

  • diff_concentration : concentration parameter for differential preferences.

  • epsilon_concentration : concentration parameter for errors.

  • seed : random number seed.

RESULT OF THIS FUNCTION:

The result of this function is creation of the files specified in depths_outfiles.

simulate.SimulateLibraryCounts(depths_outfiles, prefs_file, avgmu, avgepsilon, avgrho, mu_concentration=300.0, epsilon_concentration=300.0, rho_concentration=300.0, seed=1)

Simulates counts data for deep mutational scanning experiment.

This function can be used if you want to simulate the creation of a mutant library with a set of known site-specific preferences to then test how accurately you can infer these preferences.

This function assumes mutations made at codon level, and that the site-specific preferences are at the amino-acid level.

All random variables are drawn from Dirichlet distributions with the indicated concentration parameters.

CALLING VARIABLES:

  • depths_outfiles : list of 2-tuples (depth, outfile).

    • depth is an integer giving the per-site sequencing depth.

    • outfile is a string giving the prefix for the created deep mutational scanning counts file. These files are deleted if they already exist, and then re-created. The suffixes are:

      • _pre.txt for pre-selection counts
      • _post.txt for post-selection counts
      • _errpre.txt for pre-selection error-control counts, only created if avgepsilon is not zero.
      • _errpost.txt for post-selection error-control counts, only created if avgrho is not zero.
  • prefs_file : name of an existing file giving the actual site-specific amino-acid preferences, either with or without stop codons. Data are simulated using these preferences. If stop codons are not specified, the preferences for such codons are set to zero.

  • avgmu : average mutation rate at each site to any non-wildtype codon. The actual mutation rate to each possible codon is drawn from a Dirichlet.

  • avgepsilon : average error rate in pre-selection library. This is the average rate at which each nucleotide is mis-read as some other nucleotide. Set to zero if there are no errors.

  • avgrho : average error rate in post-selection library. This is the average rate at which each nucleotide is mis-read as some other nucleotide. Set to zero if there are no errors.

  • mu_concentration : mutation rates for any given codon drawn from Dirichlet with this concentration parameter.

  • epsilon_concentration : concentration parameter for pre-selection error rate.

  • rho_concentration : concentratiion parameter for post-selection error rate.

  • seed is seed for random number generators

RESULT OF THIS FUNCTION:

The result of this function is creation of the files specified in depths_outfiles.