Miscellaneous utilities for dms_tools2.

dms_tools2.utils.adjustErrorCounts(errcounts, counts, charlist, maxexcess)[source]

Adjust error counts to not greatly exceed counts of interest.

This function is useful when estimating preferences. Under the model, the error-control should not have a higher rate of error than the actual sample. However, this could happen if the experimental data don’t fully meet the assumptions. So this function scales down the error counts in that case.

errcounts (pandas.DataFrame)

Holds counts for error control.

counts (pandas.DataFrame)

Holds counts for which we are correcting errors.

charlist (list)

Characters for which we have counts.

maxexcess (int)

Only let error-control counts exceed actual by this much.


A copy of errcounts except for any non-wildtype character, the maximum frequency of that character is adjusted to be at most the number predicted by the frequency in counts plus maxexcess.

>>> counts = pandas.DataFrame({'site':[1], 'wildtype':['A'], 
...         'A':500, 'C':10, 'G':40, 'T':20})
>>> errcounts = pandas.DataFrame({'site':[1], 'wildtype':['A'],
...         'A':250, 'C':1, 'G':30, 'T':10})
>>> charlist = ['A', 'C', 'G', 'T']
>>> errcounts = errcounts[['site', 'wildtype'] + charlist]
>>> adj_errcounts = adjustErrorCounts(errcounts, counts, charlist, 1)
>>> set(adj_errcounts.columns) == set(errcounts.columns)
>>> all(adj_errcounts['site'] == errcounts['site'])
>>> all(adj_errcounts['wildtype'] == errcounts['wildtype'])
>>> (adj_errcounts[adj_errcounts['site'] == 1][charlist].values[0]
...         == numpy.array([250, 1, 21, 10])).all()
dms_tools2.utils.alignSubamplicon(refseq, r1, r2, refseqstart, refseqend, maxmuts, maxN, chartype, use_cutils=True)[source]

Try to align subamplicon to reference sequence at defined location.

Tries to align reads r1 and r2 to refseq at location specified by refseqstart and refseqend. Determines how many sites of type chartype have mutations, and if <= maxmuts conside the subamplicon to align if fraction of ambiguous nucleotides <= maxN. In r1 and r2, an N indicates a non-called ambiguous identity. If the reads disagree in a region of overlap that is set to N in the final subamplicon, but if one read has N and the other a called identity, then the called identity is used in the final subamplicon.

refseq (str)

Sequence to which we align. if chartype is ‘codon’, must be a valid coding (length multiple of 3).

r1 (str)

The forward sequence to align.

r2 (str)

The reverse sequence to align. When reverse complemented, should read backwards in refseq.

refseqstart (int)

The nucleotide in refseq (1, 2, … numbering) where the first nucleotide in r1 aligns.

refseqend (int)

The nucleotide in refseq (1, 2, … numbering) where the first nucleotide in r2 aligns (note that r2 then reads backwards towards the 5’ end of refseq).

maxmuts (int or float)

Maximum number of mutations of character chartype that are allowed in the aligned subamplicons from the two reads.

maxN (int or float)

Maximum number of nucleotides for which we allow ambiguous (N) identities in final subamplicon.

chartype (str)

Character type for which we count mutations. Currently, the only allowable value is ‘codon’.

use_cutils (bool)

Use the faster implementation in the _cutils module.


If reads align, return aligned subamplicon as string (of length refseqend - refseqstart + 1). Otherwise return None.

>>> refseq = 'ATGGGGAAA'
>>> s = alignSubamplicon(refseq, 'GGGGAA', 'TTTCCC', 3, 9, 1, 1, 'codon')
>>> s == 'GGGGAAA' 
>>> s = alignSubamplicon(refseq, 'GGGGAA', 'TTTCCC', 1, 9, 1, 1, 'codon')
>>> s == False
>>> s = alignSubamplicon(refseq, 'GGGGAT', 'TTTCCC', 3, 9, 1, 0, 'codon')
>>> s == False
>>> s = alignSubamplicon(refseq, 'GGGGAT', 'TTTCCC', 3, 9, 1, 1, 'codon')
>>> s == 'GGGGANA'
>>> s = alignSubamplicon(refseq, 'GGGGAT', 'TATCCC', 3, 9, 1, 0, 'codon')
>>> s == 'GGGGATA'
>>> s = alignSubamplicon(refseq, 'GGGGAT', 'TATCCC', 3, 9, 0, 0, 'codon')
>>> s == False
>>> s = alignSubamplicon(refseq, 'GGGNAA', 'TTTCCC', 3, 9, 0, 0, 'codon')
>>> s == 'GGGGAAA'
>>> s = alignSubamplicon(refseq, 'GGGNAA', 'TTNCCC', 3, 9, 0, 0, 'codon')
>>> s == 'GGGGAAA'
>>> s = alignSubamplicon(refseq, 'GTTTAA', 'TTTAAA', 3, 9, 1, 0, 'codon')
>>> s == 'GTTTAAA' 
>>> s = alignSubamplicon(refseq, 'GGGGTA', 'TTACCC', 3, 9, 1, 0, 'codon')
>>> s == 'GGGGTAA' 
>>> s = alignSubamplicon(refseq, 'GGGCTA', 'TTAGCC', 3, 9, 1, 0, 'codon')
>>> s == False 

Gets annotated pandas.DataFrame from codon counts.

Some of the programs (e.g., dms2_bcsubamplicons) create *_codoncounts.csv files when run with --chartype codon. These CSV files have columns indicating the site and wildtype codon, as well as a column for each codon giving the counts for that codon. This function reads that file (or a pandas.DataFrame read from it) to return a pandas.DataFrame where a variety of additional useful annotations have been added.

counts (str)

Name of existing codon counts CSV file, or pandas.DataFrame holding counts.

df (pandas.DataFrame)

The DataFrame with the information in counts plus the following added columns for each site:

ncounts : number of counts at site

mutfreq : mutation frequency at site

nstop : number of stop-codon mutations

nsyn : number of synonymous mutations

nnonsyn : number of nonsynonymous mutations

n1nt : number of 1-nucleotide codon mutations

n2nt : number of 2-nucleotide codon mutations

n3nt : number of 3-nucleotide codon mutations

AtoC, AtoG, etc : number of each nucleotide mutation type among codon mutations with one nucleotide change.

mutfreq1nt, mutfreq2nt, mutfreq3nt : frequency of 1-, 2-, and 3-nucleotide codon mutations at site.

>>> d = {'site':[1, 2], 'wildtype':['ATG', 'GGG'], 'ATG':[105, 1],
...         'GGG':[3, 117], 'GGA':[2, 20], 'TGA':[0, 1]}
>>> for codon in CODONS:
...     if codon not in d:
...         d[codon] = [0, 0]
>>> counts = pandas.DataFrame(d)
>>> with tempfile.NamedTemporaryFile(mode='w') as f:
...     counts.to_csv(f, index=False)
...     f.flush()
...     df = annotateCodonCounts(f.name)
>>> all([all(df[col] == counts[col]) for col in counts.columns])
>>> all(df['ncounts'] == [110, 139])
>>> all(df['mutfreq'] == [5 / 110., 22 / 139.])
>>> all(df['nstop'] == [0, 1])
>>> all(df['nsyn'] == [0, 20])
>>> all(df['nnonsyn'] == [5, 1])
>>> all(df['n1nt'] == [0, 20])
>>> all(df['n2nt'] == [3, 2])
>>> all(df['n3nt'] == [2, 0])
>>> all(df['GtoA'] == [0, 20])
>>> all(df['AtoC'] == [0, 0])
>>> all(df['mutfreq1nt'] == [0, 20 / 139.])
>>> all(df['mutfreq3nt'] == [2 / 110., 0])
dms_tools2.utils.barcodeInfoToCodonVariantTable(samples, geneseq, path=None)[source]

Convert barcode info files into a CodonVariantTable

Convert barcode info files output from dms2_bcsubamp into a CodonVariantTable. Barcode info files contain reads and barcodes from barcoded subamplicon sequencing, described here. This function takes consensus reads retained by dms2_bcsubamp, gives each unique sequence a numerical barcode (since the barcodes from dms2_bcsubamp could come from the same variant), and counts the number of retained consensus reads corresponding to each sequence. Then, a CodonVariantTable is made using the sequences and their numerical barcodes, and counts are added based on the number of retained consensus reads of those sequences. Therefore, the CodonVariantTable will only contain one ‘variant’ for each unique sequence with the total count for all the unbarcoded variants in the experiment which had the same sequence.

samples (dict):

Dictionary with libraries as keys and lists of info file prefixes (file names without the ‘_bcinfo.txt.gz’) for files corresponding to those libraries as values.

Example: {‘library-1’:[‘condition-1-library-1’],


geneseq (str):

The wildtype gene sequence

path (str)

Directory in which barcode info files are located


A dms_variants.codonvarianttable.CodonVariantTable with ‘counts’ generated from the barcode info files

dms_tools2.utils.buildReadConsensus(reads, minreads, minconcur, use_cutils=True)[source]

Builds consensus sequence of some reads.

You may want to pre-fill low-quality sites with N using lowQtoN. An N is considered a non-called identity.

reads (list)

List of reads as strings. If reads are not all same length, shorter ones are extended from 3’ end with N to match maximal length.

minreads (int)

Only call consensus at a site if at least this many reads have called identity.

minconcur (float)

Only call consensus at site if >= this fraction of called identities agree.

use_cutils (bool)

Use the faster implementation in the _cutils module.


A string giving the consensus sequence. Non-called sites are returned as N`.

>>> reads = ['ATGCAT',
...          'NTGNANA',
...          'ACGNNTAT',
...          'NTGNTA']
>>> buildReadConsensus(reads, 2, 0.75) == 'ATGNNNAN'
>>> reads.append('CTGCATAT')
>>> buildReadConsensus(reads, 2, 0.75) == 'NTGCATAT'

Accessibility of amino acids by nucleotide mutations.

seqs (str or list)

A single coding sequence or a list of such sequences.


A pandas DataFrame listing all sites in the sequence(s) numbered 1, 2, …, with columns giving the accessibility of each amino acid by single nucleotide mutations.

The accessibility of codon \(c\) to amino-acid \(a\) by single-nucleotide mutations is defined as the minimum number of nucleotide mutations needed to generate that amino-acid.

For a collection of sequences, we calculate the accessibility as the weighted average of the accessibilities of all codons observed at that site in the collection of sequences.

As an example, compute accessibility for one sequence:

>>> s = "ATGGGA"
>>> acc = codonEvolAccessibility(s)

The returned pandas DataFrame acc is has a column named site plus columns for all amino acids:

>>> all(acc.columns == ['site'] + AAS_WITHSTOP)

We look at entries for a few amino acids. At the first site, the wildtype entry in the sequence s is the codon for M (ATG). So at this site, the distance to M is 0. The distance to I (which has codon ATA as a codon) is 1, and the distance to W (which has only TGG as a codon) is 2.

>>> acc[['site', 'G', 'I', 'M', 'W']]
   site    G    I    M    W
0     1  2.0  1.0  0.0  2.0
1     2  0.0  2.0  3.0  2.0

If we pass the function a list of multiple sequences, then the accessibilities are averaged over the sequences:

>>> acc2 = codonEvolAccessibility(['ATGGGA', 'ATAGGA'])
>>> acc2[['site', 'G', 'I', 'M', 'W']]
   site    G    I    M    W
0     1  2.0  0.5  0.5  2.5
1     2  0.0  2.0  3.0  2.0

Makes amino-acid counts pandas.DataFrame from codon counts.

counts (pandas.DataFrame)

Columns are the string site wildtype and all codons in CODONS. Additional columns are allowed but ignored.

aacounts (pandas.DataFrame)

Columns are the string site and all amino acids in AAS_WITHSTOP with counts for each amino acid made by summing counts for encoding codons.

>>> d = {'site':[1, 2], 'othercol':[0, 0], 'ATG':[105, 1],
...         'GGG':[3, 117], 'GGA':[2, 20], 'TGA':[0, 1],
...         'wildtype':['ATG', 'GGG']}
>>> for codon in CODONS:
...     if codon not in d:
...         d[codon] = [0, 0]
>>> counts = pandas.DataFrame(d)
>>> aacounts = codonToAACounts(counts)
>>> 'othercol' in aacounts.columns
>>> all(aacounts['site'] == [1, 2])
>>> all(aacounts['wildtype'] == ['M', 'G'])
>>> all(aacounts['M'] == [105, 1])
>>> all(aacounts['G'] == [5, 137])
>>> all(aacounts['*'] == [0, 1])
>>> all(aacounts['V'] == [0, 0])

Convert codon counts file to nucleotide counts.

codoncounts (str or pandas.DataFrame)

Codon counts in format produced by dms2_bcsubamp, either as CSV file or data frame holding CSV.


pandas.DataFrame with nucleotide counts.


>>> with tempfile.NamedTemporaryFile('w') as f:
...     _ = f.write(textwrap.dedent('''
...             1,ATG,0,0,0,0,0,0,2,0,0,0,0,0,8,0,333985,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...             2,AAG,16,20,333132,41,13,12,27,14,8,6,67,8,9,13,29,9,10,11,12,8,10,15,15,11,6,9,3,7,8,10,17,4,3,7,49,7,9,14,9,4,10,7,7,7,9,11,11,5,14,14,11,6,13,16,15,14,9,9,15,8,9,11,8,15
...             3,GCA,2,3,8,3,34,11,7,6,7,6,9,8,4,3,5,0,6,14,10,12,6,8,7,10,5,11,7,6,6,1,3,12,19,6,11,9,333250,10,6,9,15,3,5,5,37,9,9,7,8,4,8,3,23,5,7,8,6,11,7,10,7,9,3,6 
...             '''.strip()))
...     f.flush()
...     nt_counts = codon_to_nt_counts(f.name)
>>> nt_counts
   site wildtype       A       C       G       T
0     1        A  334009       0       6       0
1     2        T       0       2       0  334013
2     3        G       8       0  333993      14
3     4        A  333424     156     169     187
4     5        A  333361     211     186     178
5     6        G     156     185  333427     168
6     7        G     116     124  333410     125
7     8        C     126  333407     121     121
8     9        A  333435     114     112     114
dms_tools2.utils.convertCountsFormat(oldfile, newfile, charlist)[source]

Convert counts file from dms_tools to dms_tools2 format.

oldfile (str)

Name of counts file in the old dms_tools format: http://jbloomlab.github.io/dms_tools/fileformats.html

newfile (str)

Name of created counts file in the dms_tools2 format: https://jbloomlab.github.io/dms_tools2/dms2_bcsubamp.html

charlist (list)

List of characters that we expect in the counts files. For instance, could be CODONS.

dms_tools2.utils.getSubstitutions(wildtype, mutant, amino_acid=False)[source]

Get space delimited string of substitutions

wildtype (str):

The wildtype sequence

mutant (str):

The mutant sequence

amino_acid (bool)

Specify whether the sequence is amino acid. Default is False


A space delimited string of substitutions present in the mutant sequence

>>> getSubstitutions('AGT', 'TGT')
>>> getSubstitutions('AAGTAACGA', 'ATCTAACGA')
'A2T G3C'
>>> getSubstitutions('TYARV', 'GYAGV', amino_acid=True)
'T1G R4G'
dms_tools2.utils.incrementCounts(refseqstart, subamplicon, chartype, counts)[source]

Increment counts dict based on an aligned subamplicon.

This is designed for keeping track of counts of different mutations / identities when aligning many subamplicons to a sequence.

Any positions where subamplicon has an N are ignored, and not added to counts.

refseqstart (int)

First nucleotide position in 1, 2, … numbering where subamplicon aligns.

subamplicon (str)

The subamplicon.

chartype (str)

Character type for which we are counting mutations. Currently, only allowable value is ‘codon’.

counts (dict)

Stores counts of identities, and is incremented by this function. Is a dict keyed by every possible character (e.g., codon), with values lists with element i holding the counts for position i in 0, 1, … numbering.


On completion, counts has been incremented.

>>> codonlen = 10
>>> counts = dict([(codon, [0] * codonlen) for codon 
...         in CODONS])
>>> subamplicon1 = 'ATGGACTTTC'
>>> incrementCounts(1, subamplicon1, 'codon', counts)
>>> subamplicon2 = 'GGTCTTTCCCGGN'
>>> incrementCounts(3, subamplicon2, 'codon', counts)
>>> counts['ATG'][0] == 1
>>> counts['GAC'][1] == 1
>>> counts['GTC'][1] == 1
>>> counts['TTT'][2] == 2
>>> counts['CCC'][3] == 1
>>> sum([sum(c) for c in counts.values()]) == 6
dms_tools2.utils.initLogger(logfile, prog, args)[source]

Initialize output logging for scripts.

logfile (str or sys.stdout)

Name of file to which log is written, or sys.stdout if you just want to write information to standard output.

prog (str)

Name of program for which we are logging.

args (dict)

Program arguments as arg / value pairs.


If logfile is a string giving a file name, returns an opened and initialized logging.Logger. If logfile is sys.stdout, then writes information to sys.stdout. In either case, basic information is written about the program and args.

dms_tools2.utils.iteratePairedFASTQ(r1files, r2files, r1trim=None, r2trim=None)[source]

Iterates over FASTQ files for single or paired-end sequencing.

r1files (list or str)

Name of R1 FASTQ file or list of such files. Can optionally be gzipped.

r2files (list or str or None)

Like r1files but for R2 files, or None if no R2.

r1trim (int or None)

If not None, trim r1 and q1 to be no longer than this.

r2trim (int or None)

Like r1trim but for R2.


Each iteration returns (name, r1, r2, q1, q2, fail) where:

  • name is a string giving the read name

  • r1 and r2 are strings giving the reads; r2 is None if no R2.

  • q1 and q2 are strings giving the PHRED Q scores; q2 is none if no R2.

  • fail is True if either read failed Illumina chastity filter, False if both passed, None if info not present.

We run a simple test by first writing an example FASTQ file and then testing on it.

>>> n1_1 = '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984 1:N:0:CGATGT'
>>> r1_1 = 'ATGCAATTG'
>>> q1_1 = 'GGGGGIIII'
>>> n2_1 = '@DH1DQQN1:933:HMLH5BCXY:1:1101:2165:1984 2:N:0:CGATGT'
>>> r2_1 = 'CATGCATA'
>>> q2_1 = 'G<GGGIII'
>>> tf = tempfile.NamedTemporaryFile
>>> with tf(mode='w') as r1file, tf(mode='w') as r2file:
...     _ = r1file.write('\n'.join([
...             n1_1, r1_1, '+', q1_1,
...             n1_1.replace(':N:', ':Y:'), r1_1, '+', q1_1,
...             n1_1.split()[0], r1_1, '+', q1_1,
...             ]))
...     r1file.flush()
...     _ = r2file.write('\n'.join([
...             n2_1, r2_1, '+', q2_1,
...             n2_1, r2_1, '+', q2_1,
...             n2_1, r2_1, '+', q2_1,
...             ]))
...     r2file.flush()
...     itr = iteratePairedFASTQ(r1file.name, r2file.name, r1trim=4, r2trim=5)
...     next(itr) == (n1_1.split()[0][1 : ], r1_1[ : 4],
...             r2_1[ : 5], q1_1[ : 4], q2_1[ : 5], False)
...     next(itr) == (n1_1.split()[0][1 : ], r1_1[ : 4],
...             r2_1[ : 5], q1_1[ : 4], q2_1[ : 5], True)
...     next(itr) == (n1_1.split()[0][1 : ], r1_1[ : 4],
...             r2_1[ : 5], q1_1[ : 4], q2_1[ : 5], None)

Now do the same test but for just R1:

>>> with tf(mode='w') as r1file:
...     _ = r1file.write('\n'.join([
...             n1_1, r1_1, '+', q1_1,
...             n1_1.replace(':N:', ':Y:'), r1_1, '+', q1_1,
...             n1_1.split()[0], r1_1, '+', q1_1,
...             ]))
...     r1file.flush()
...     itr_R1 = iteratePairedFASTQ(r1file.name, None, r1trim=4)
...     next(itr_R1) == (n1_1.split()[0][1 : ], r1_1[ : 4],
...             None, q1_1[ : 4], None, False)
...     next(itr_R1) == (n1_1.split()[0][1 : ], r1_1[ : 4],
...             None, q1_1[ : 4], None, True)
...     next(itr_R1) == (n1_1.split()[0][1 : ], r1_1[ : 4],
...             None, q1_1[ : 4], None, None)
dms_tools2.utils.lowQtoN(r, q, minq, use_cutils=True)[source]

Replaces low quality nucleotides with N characters.

r (str)

A string representing a sequencing read.

q (str)

String of same length as r holding Q scores in Sanger ASCII encoding.

minq (length-one string)

Replace all positions in r where q is < this.

use_cutils (bool)

Use the faster implementation in the _cutils module.


A version of r where all positions i where q[i] < minq have been replaced with N.

>>> r = 'ATGCAT'
>>> q = 'GB<.0+'
>>> minq = '0'
>>> lowQtoN(r, q, minq) == 'ATGNAN'
dms_tools2.utils.rarefactionCurve(barcodes, *, maxpoints=100000.0, logspace=True)[source]

Rarefaction curve from list of barcodes.

Uses the analytical formula for the rarefaction curve defined on Wikipedia.

barcodes (list or pandas Series)

Holds the list of unique barcodes for which we calculate the rarefaction curve. It is expected that some of these barcodes will be repeated multiple times in the list if the sampling is approaching saturation.

maxpoints (int)

Only calculate values at this many points. The benefit of this is that it can become very costly to calculate the curve at every point when there are many points.

logspace (True)

Logarithmically space the maxpoints points for the calculation. This will give better results if we are subsampling and the curve saturates. Only done if we have to subsample.


The 2-tuple (nreads, nbarcodes), where both nreads and nbarcodes are lists of the same length, and nbarcodes[i] is the expected number of barcodes observed when there are nreads[i] reads.

Here we take a very small list and show that the results given by the function are equivalent to those obtained by random subsampling:

>>> barcodes = ['A', 'A', 'A', 'A', 'G', 'G', 'C', 'T']
>>> (nreads, nbarcodes) = rarefactionCurve(barcodes)
>>> random.seed(1)
>>> nrand = 100000
>>> sim_equal_calc = []
>>> for n in range(1, len(barcodes) + 1):
...     nbarcodes_sim = sum([len(set(random.sample(barcodes, n)))
...             for _ in range(nrand)]) / nrand
...     sim_equal_calc.append(numpy.allclose(nbarcodes_sim,
...             nbarcodes[nreads.index(n)], atol=1e-2))
>>> all(sim_equal_calc)
dms_tools2.utils.renumberSites(renumbfile, infiles, missing='error', outfiles=None, outprefix=None, outdir=None)[source]

Renumber sites in CSV files.

Switch numbering scheme in files with a column named site.

You must specify exactly one of outfiles, outprefix, and outdir as something other than None.

renumbfile (str)

Name of existing CSV file with the re-numbering scheme. Should have columns with name original and new. Each entry in original should refer to a site in the input files, and each entry in new should be the new number for this site. If an entry in new is None or nan then it is dropped from the newly numbered files regardless of missing.

infiles (list)

List of existing CSV files that we are re-numbering. Each file must have an entry of site.

missing (str)
How to handle sites in infiles but not renumbfile.
  • error: raise an error

  • skip: skip renumbering, leave with original number

  • drop: drop any sites not in renumbfile

outfiles (list)

List of output files of the same length as infiles. The numbered version of infiles is named as the corresponding entry in outfiles.

outdir (str)

A directory name. The renumbered files have the same names as in infile, but are now placed in outdir.

outprefix (str)

The renumbered files have the same names and locations as infiles, but have the pre-pended filename extension outprefix.

dms_tools2.utils.reverseComplement(s, use_cutils=True)[source]

Gets reverse complement of DNA sequence s.

s (str)

Sequence to reverse complement.

use_cutils (bool)

Use the faster implementation in the _cutils module.


Reverse complement of s as a str.

>>> s = 'ATGCAAN'
>>> reverseComplement(s) == 'NTTGCAT'

Returns string with information about session / packages.

dms_tools2.utils.sigFigStr(x, nsig)[source]

Get str of x with nsig significant figures.

>>> sigFigStr(11190, 2)
>>> sigFigStr(117, 2)
>>> sigFigStr(6, 2)
>>> sigFigStr(0.213, 2)
>>> sigFigStr(0.007517, 3)