phydmslib.treelikelihood module

Tree likelihoods.

Nucleotides, amino acids, and codons are indexed by integers 0, 1, … using the indexing schemes defined in phydmslib.constants.

class phydmslib.treelikelihood.TreeLikelihood(tree, alignment, model, underflowfreq=3, dparamscurrent=True, dtcurrent=False, branchScale=None)

Bases: object

Uses alignment, model, and tree to calculate likelihoods.

See __init__ for how to initialize a TreeLikelihood.

Most commonly, you will initiate a tree likelihood object, and then maximize it using the maximizeLikelihood method.

A TreeLikelihood object is designed to only work for a fixed tree topology. In the internal workings, this fixed topology is transformed into the node indexing scheme defined by name_to_nodeindex.

After initialization, most attributes should not be changed directly. They should only be altered via the updateParams method or by altering the paramsarray property. Both of these methods of updating attributes correctly update all of the dependent properties; setting other attributes directly does not.

Attributes:
tree (instance of Bio.Phylo.BaseTree.Tree derived class)

Phylogenetic tree. Branch lengths are in units of codon substitutions per site for the current model.

model (instance of phydmslib.models.Model derived class)

Specifies the substitution model for nsites codon sites. This can either be a simple Model or a DistributionModel. Note that if a non-zero prior is defined by model, this is included in loglik and dloglik.

alignment (list of 2-tuples of strings, (head, seq))

Aligned protein-coding codon sequences. Headers match tip names in tree; sequences contain nsites codons.

paramsarray (numpy.ndarray of floats, 1-dimensional)

An array of the model free parameters. You can directly assign to paramsarray, and the correct parameters will be internally updated vi updateParams. paramsarray is actually a property rather than an attribute.

paramsarraybounds (tuple of 2-tuples)

paramsarraybounds[i] is the 2-tuple (minbound, maxbound) for parameter paramsarray[i]. Set minbound or maxbound are None if there is no lower or upper bound.

loglik (float)

Current log likelihood. Also includes model.logprior if this is non-zero.

siteloglik (numpy.ndarray of floats, length nsites)

siteloglik[r] is current log likelihood at site r.

dparamscurrent (bool)

Do we keep the derivatives of the likelihood with respect to model parameters current? Doing so involves computational costs, so only set to True if using these derivatives. Currently, only one of dparamscurrent and dtcurrent can be True simultaneously, as we optimize parameters and branch lengths separately.

dtcurrent (bool)

Do we keep the derivatives of the likelihood with respect to branch lengths current? Doing so involves computational costs, so only set to True if using these derivatives.

dloglik (dict)

For each param in model.freeparams, dloglik[param] is the derivative of loglik with respect to param. Only accessible if dparamscurrent is True. Also includes model.dlogprior if this is non-zero.

dloglikarray (numpy.ndarray of floats, 1-dimensional)

dloglikarray[i] is the derivative of loglik with respect to the parameter represented in paramsarray[i]. This is the same information as in dloglik, but in a different representation. Only accessible if dparamscurrent is True.

dloglik_dt (numpy.ndarray of floats, shape (nnodes - 1,).

dloglik_dt is the derivative of loglik with respect to the branch lengths t.

nsites (int)

Number of codon sites.

nseqs (int)

Number of sequences in alignment, and tip nodes in tree.

nnodes (int)

Total number of nodes in tree, counting tip and internal. Node are indexed 0, 1, …, nnodes - 1. This indexing is done such that the descendants of node n always have indices < n, and so that the tips are the first ntips indices and the internals are the last ninternal indices.

ninternal (int)

Total number of internal nodes in tree.

ntips (int)

Total number of tip nodes in tree.

tips (numpy.ndarray of int, shape (ntips, nsites))

tips[n][r] is the index of the codon at site r for tip node n (0 <= n < ntips). If tips[n][r] is a gap, then tips[n][r] is 0 and r is in gaps[n].

gaps (numpy.array, length ntips, entries numpy.array of int)

gaps[n] is an array containing all sites where the codon is a gap for tip node n (0 <= n < ntips).

name_to_nodeindex (dict)

For each sequence name (these are headers in alignment, tip names in tree), name_to_nodeindex[name] is the int index of that node in the internal indexing scheme.

rdescend and ldescend (lists of int of length ninternal)

For each internal node n, rdescend[n - ntips] is its right descendant and ldescend[n - ntips] is its left descendant.

descendants (list of list)

For each node n, descendants[n] is a list of all nodes that are direct or indirect descendants of n.

t (numpy.ndarray of float, shape (nnodes - 1,))

t[n] is the branch length leading node n. The branch leading to the root node (which has index nnodes - 1) is undefined and not used, which is why t is of length nnodes - 1 rather than nnodes.

L (dict)

L[n] is a numpy.ndarray of float of shape (nsites, N_CODON) for each node n. L[n][r][x] is the partial condition likelihood of codon `x at site r at node n. Note that these must be corrected by adding underflowlogscale. If model is a DistributionModel, then L[n] is instead of shape (model.ncats, nsites, N_CODON) with the first index ranging over the model categories. Entries in L are only defined and saved for nodes as needed in order to save memory.

dL (dict keyed by strings)

For each free model parameter param and node n, dL[param][n] is derivative of L[n] with respect to param. Only guaranteed to be valid if dparamscurrent is True, and entries are only defined and saved for nodes as needed in order to save memory.

dL_dt (dict of dict)

dL_dt[n][n2] is the derivative of L[n2] with respect to branch length t[n] where 0 <= n < nnodes - 1.

underflowfreq (int >= 1)

The frequency with which we rescale likelihoods to avoid numerical underflow

underflowlogscale (numpy.ndarray of float, shape (nsites,))

Corrections that must be added to L at the root node to get actual likelihoods when underflow correction is being performed.

__init__(tree, alignment, model, underflowfreq=3, dparamscurrent=True, dtcurrent=False, branchScale=None)

Initialize a TreeLikelihood object.

Args:
tree, alignment, model, underflowfreq

Attributes of same name described in class doc string. Note that we make copies of tree, model, and alignment’ so the calling objects are not modified during optimization.

dparamscurrent, dtcurrent

Current values of these flags indicating which derivatives to compute.

branchScale (None or float > 0)

The branch lengths in the input tree are assumed to be in units of substitutions per site with the scaling defined by model.branchScale. If the scaling should instead be defined by some other value of branchScale, indicate by setting to that value rather than None. This is useful if tree was inferred on models across many sites but you are now just analyzing an individual one. Note that this option applies only to the input tree, not the output one.

property dloglik

dloglik[param] is derivative of loglik by param.

property dloglik_dt

dloglik_dt is derivative of loglik with respect to t.

property dloglikarray

Derivative of loglik with respect to paramsarray.

property dparamscurrent

Are derivatives with respect to model parameters current?

property dtcurrent

Are derivatives with respect to branch lengths current?

maximizeLikelihood(optimize_brlen=False, approx_grad=False, logliktol=0.01, nparamsretry=1, printfunc=None)

Maximize the log likelihood.

Maximizes log likelihood with respect to model parameters and potentially branch lengths depending on optimize_brlen. If optimizing the branch lengths, iterates between optimizing the model parameters and branch lengths.

Uses the L-BFGS-B method implemented in scipy.optimize.

There is no return variable, but after call object attributes will correspond to maximimum likelihood values.

Args:
optimize_brlen (bool)

Do we optimize branch lengths?

approx_grad (bool)

If True, then we numerically approximate the gradient rather than using the analytical values.

logliktol (float)

When using optimize_brlen, keep iterating between optimization of parameters and branch lengths until change in log likelihood is less than logliktol.

nparamsretry (int >= 0)

Number of times to retry parameter optimization from different initial values if it fails the first time.

printfunc (None or a function)

If not None, then we print using printfunc the detailed results of the optimization at each step. For instance, printfunc might be sys.stderr.write or logger.info.

Returns:

A string giving a summary of the maximization.

property paramsarray

All free model parameters as 1-dimensional numpy.ndarray.

You are allowed to update model parameters by direct assignment of this property.

property paramsarraybounds

Bounds for parameters in paramsarray.

property t

Gets array of branch lengths.

property tree

Tree with branch lengths in codon substitutions per site.

The tree is a Bio.Phylo.BaseTree.Tree object.

This is the current tree after whatever optimizations have been performed so far.

updateParams(newvalues)

Update model parameters and re-compute likelihoods.

This method is the only acceptable way to update model parameters. The likelihood is re-computed as needed by this method.

Args:
newvalues (dict)

A dictionary keyed by param name and with value as new value to set. Each parameter name must either be a valid model parameter (in model.freeparams).