phydmslib.file_io module

Module for input / output from files.

phydmslib.file_io.ReadCodonAlignment(fastafile, checknewickvalid)

Reads codon alignment from file.

fastafile is the name of an existing FASTA file.

checknewickvalid : if True, we require that names are unique and do not contain spaces, commas, colons, semicolons, parentheses, square brackets, or single or double quotation marks. If any of these disallowed characters are present, raises an Exception.
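The name check described above can be sketched as follows. This is a minimal illustration, not the phydmslib implementation: the helper name check_newick_names and the DISALLOWED constant are hypothetical, and raising ValueError is an assumption (the text above only says an Exception is raised).

```python
# Hypothetical helper, not part of phydmslib: a sketch of the checks
# described for checknewickvalid=True.

# Characters the documentation lists as disallowed in sequence names:
# space, comma, colon, semicolon, parentheses, square brackets, quotes.
DISALLOWED = set(" ,:;()[]'\"")


def check_newick_names(names):
    """Return True if names are unique and free of disallowed characters.

    Raises ValueError otherwise (the exception type is an assumption).
    """
    if len(set(names)) != len(names):
        raise ValueError("sequence names are not unique")
    for name in names:
        bad = sorted(DISALLOWED.intersection(name))
        if bad:
            raise ValueError(
                "name {0!r} contains disallowed characters: {1}".format(name, bad)
            )
    return True
```

With this sketch, check_newick_names(['seq1', 'seq2']) passes, while a name such as 'seq 1' or 'seq(1)' would be rejected.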

Reads the alignment from fastafile and returns the aligned sequences as a list of 2-tuples of strings (header, sequence), where sequence is upper case.

If the terminal codon is a stop codon for all sequences, then this terminal codon is trimmed. Raises an exception if the sequences are not aligned codon sequences that are free of stop codons (with the exception of a shared terminal stop) and free of ambiguous nucleotides.
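The trimming rule can be illustrated with a short sketch. It covers only the shared-terminal-stop step, not the validation of premature stops or ambiguous nucleotides. The helper name trim_shared_terminal_stop is hypothetical, and the stop codons are assumed to be those of the standard genetic code (TAA, TAG, TGA).

```python
# Hypothetical helper, not part of phydmslib: trims the last codon only
# when it is a stop codon in every sequence.

# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def trim_shared_terminal_stop(seqs):
    """seqs is a list of (header, sequence) tuples of equal-length sequences.

    Returns a new list with upper-case sequences; the terminal codon is
    removed only if it is a stop codon in all sequences.
    """
    seqs = [(head, seq.upper()) for head, seq in seqs]
    if seqs and all(seq[-3:] in STOP_CODONS for _, seq in seqs):
        return [(head, seq[:-3]) for head, seq in seqs]
    return seqs
```

If only some sequences end in a stop codon, the sketch leaves the alignment untouched; per the description above, ReadCodonAlignment would instead raise an exception in that case.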

Read aligned sequences in this example:

>>> seqs = [('seq1', 'ATGGAA'), ('seq2', 'ATGAAA')]
>>> f = io.StringIO()
>>> n = f.write(u'\n'.join(['>{0}\n{1}'.format(*tup) for tup in seqs]))
>>> n = f.seek(0)
>>> a = ReadCodonAlignment(f, True)
>>> seqs == a
True

Trim stop codons from all sequences in this example:

>>> seqs = [('seq1', 'ATGTAA'), ('seq2', 'ATGTGA')]
>>> f = io.StringIO()
>>> n = f.write(u'\n'.join(['>{0}\n{1}'.format(*tup) for tup in seqs]))
>>> n = f.seek(0)
>>> a = ReadCodonAlignment(f, True)
>>> [(head, seq[ : -3]) for (head, seq) in seqs] == a
True

Read sequences with gap:

>>> seqs = [('seq1', 'ATG---'), ('seq2', 'ATGAGA')]
>>> f = io.StringIO()
>>> n = f.write(u'\n'.join(['>{0}\n{1}'.format(*tup) for tup in seqs]))
>>> n = f.seek(0)
>>> a = ReadCodonAlignment(f, True)
>>> [(head, seq) for (head, seq) in seqs] == a
True

Premature stop codon gives error:

>>> seqs = [('seq1', 'TGAATG'), ('seq2', 'ATGAGA')]
>>> f = io.StringIO()
>>> n = f.write(u'\n'.join(['>{0}\n{1}'.format(*tup) for tup in seqs]))
>>> n = f.seek(0)
>>> a = ReadCodonAlignment(f, True) 
Traceback (most recent call last):
ValueError:

phydmslib.file_io.Versions()

Returns a string with version information.

Call this function if you want a string giving detailed information on the version of phydms and the associated packages that it uses.

phydmslib.file_io.readDivPressure(fileName)

Reads in diversifying pressures from some file.

Scales the diversifying pressure values so that the largest absolute value is 1, unless all values are zero.

Args:
fileName (string or readable file-like object)

File holding diversifying pressure values. Can be a comma-, space-, or tab-separated file. The first column is the site (consecutively numbered, starting at 1) and the second column is the diversifying pressure value for that site.

Returns:
divPressure (dict keyed by ints)

divPressure[r] is the diversifying pressure value of site r.
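The scaling step can be sketched as below. This is an illustration only: the helper name scale_div_pressure is hypothetical, and it assumes the scaling divides every value by the largest absolute value, leaving the values unchanged when they are all zero.

```python
# Hypothetical helper, not part of phydmslib: rescales per-site values so
# that the largest absolute value becomes 1.


def scale_div_pressure(values):
    """values is a dict mapping site (int) to a diversifying pressure (float).

    Returns a new dict; if every value is zero (or the dict is empty),
    the values are returned unscaled.
    """
    scale = max((abs(v) for v in values.values()), default=0)
    if scale == 0:
        return dict(values)
    return {site: v / scale for site, v in values.items()}
```

For example, scale_div_pressure({1: 2.0, 2: -4.0, 3: 1.0}) gives {1: 0.5, 2: -1.0, 3: 0.25}: signs are preserved and the most extreme site ends up at -1.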

phydmslib.file_io.readPrefs(prefsfile, minpref=0, avgprefs=False, randprefs=False, seed=1, sites_as_strings=False)

Read preferences from file with some error checking.

Args:
prefsfile (string or readable file-like object)

File holding amino-acid preferences. Can be a comma-, space-, or tab-separated file with column headers of site and then all one-letter amino-acid codes, or can be in the more complex format written by dms_tools v1. Must be prefs for consecutively numbered sites starting at 1. Stop codon prefs can be present (stop codons are indicated by *); if so, they are removed and the prefs are re-normalized to sum to 1.

minpref (float >= 0)

Adjust all preferences to be >= this number.

avgprefs, randprefs (bool)

Mutually exclusive options specifying to average or randomize prefs across sites.

seed (int)

Seed for the random number generator used for randprefs.

sites_as_strings (bool)

By default, the site numbers are coerced to integers. If this option is True, then they are kept as strings.

Returns:
prefs (dict)

prefs[r][a] is the preference of site r for amino-acid a. r is an int unless sites_as_strings=True.
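The handling of stop codon preferences described for prefsfile can be sketched for a single site as follows. The helper name drop_stop_and_renormalize is hypothetical, and the sketch covers only the removal and re-normalization step; it does not reproduce the minpref adjustment or the averaging and randomization options.

```python
# Hypothetical helper, not part of phydmslib: removes the stop codon
# preference ('*') for one site and rescales the rest to sum to 1.


def drop_stop_and_renormalize(site_prefs):
    """site_prefs is a dict mapping one-letter codes (and optionally '*')
    to preferences for a single site. Returns a new dict without '*'
    whose values sum to 1.
    """
    kept = {aa: p for aa, p in site_prefs.items() if aa != "*"}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("amino-acid preferences must have a positive sum")
    return {aa: p / total for aa, p in kept.items()}
```

Applied to every site, this would yield a dictionary shaped like the documented return value, with prefs[r][a] giving the preference of site r for amino acid a.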

phydmslib.file_io.readPrefs_dms_tools_format(f)

Reads the amino-acid preferences written by dms_tools v1.

This is an exact copy of the code from dms_tools.file_io.ReadPreferences. It is copied because dms_tools v1 is currently only compatible with Python 2, and we needed something that also works with Python 3.

f is the name of an existing file or a readable file-like object. It should be in the format written by dms_tools v1.

The return value is the tuple: (sites, wts, pi_means, pi_95credint, h) where sites, wts, pi_means, and pi_95credint will all have the same values used to write the file with WritePreferences, and h is a dictionary with h[r] giving the site entropy (log base 2) for each r in sites.
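The site entropy h[r] mentioned above can be illustrated with a small sketch. It assumes the entropy is the Shannon entropy of the site's preferences, computed with log base 2; the helper name site_entropy_bits is hypothetical and is not taken from dms_tools.

```python
import math

# Hypothetical helper: Shannon entropy (in bits) of one site's preferences.
# Assumption: h[r] = -sum_a pi[a] * log2(pi[a]), with zero-probability
# entries contributing nothing.


def site_entropy_bits(pi):
    """pi is a dict mapping amino-acid codes to preferences summing to 1."""
    return -sum(p * math.log2(p) for p in pi.values() if p > 0)
```

Under this assumption, a site with equal preference for four amino acids has an entropy of 2 bits, and a site that tolerates a single amino acid has an entropy of 0.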