API Reference

EvoTen: Vectorized, differentiable computation of tree likelihoods supporting different computation backends.

This page provides a complete reference to all public classes, functions, and modules in EvoTen.

Core Module

The main evoten module exports the primary classes and functions.

Model

evoten.model.compute_ancestral_marginals(leaves, tree_handler, transition_probs, equilibrium_logits, leaf_names=None, leaves_are_probabilities=True, return_probabilities=False, return_upward_messages=False, return_downward_messages=False)[source]

Computes the marginal distributions at all internal (ancestral) nodes u given the leaf data and the tree. Formally, the method computes P(u | leaves, tree) for all u that are not leaves.

  • Dimensions marked with * in the shapes below support broadcasting.

Parameters:
  • leaves – Logits of all symbols at all leaves of shape (num_leaves, models*, L, d).

  • tree_handler (TreeHandler) – TreeHandler object

  • transition_probs – Probabilistic transition matrices of shape (num_nodes-1, models, L*, d, d).

  • equilibrium_logits – Equilibrium distribution logits of shape (models, d).

  • leaf_names – Names of the leaves (list-like of length num_leaves). Used to reorder correctly.

  • leaves_are_probabilities – If True, leaves are assumed to be probabilities or one-hot encoded.

  • return_probabilities – If True, return probabilities instead of logliks.

  • return_upward_messages – If True, also returns upward messages of shape (num_nodes-1, models).

  • return_downward_messages – If True, also returns downward messages of shape (num_nodes-1, models).

Returns:

Ancestral marginals of shape (num_ancestral_nodes, models, L, d).
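For the root node, the marginal reduces to a normalized product of the root distribution and the root's partial likelihoods P(leaves | root = x). The following is a minimal sketch of that special case only, using plain probabilities and assuming the root follows the equilibrium distribution. The function name `root_marginal` and the toy numbers are illustrative assumptions and not part of EvoTen; marginals at other internal nodes additionally need information from the rest of the tree.

```python
def root_marginal(root_partials, equilibrium):
    """Return P(root = x | leaves) for every state x.

    root_partials : list with P(leaves | root = x) for each state x
    equilibrium   : list with the prior probability of each state at the root
    """
    # Joint probability P(root = x, leaves) for each state x.
    joint = [p * l for p, l in zip(equilibrium, root_partials)]
    total = sum(joint)  # this is P(leaves)
    return [j / total for j in joint]


# Toy 2-state example with a uniform root distribution.
posterior = root_marginal([0.09, 0.16], [0.5, 0.5])
```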

evoten.model.compute_ancestral_probabilities(leaves, tree_handler, transition_probs, leaf_names=None, return_only_root=False, leaves_are_probabilities=True, return_probabilities=False)[source]

Computes all (partial) log-likelihoods at all internal (ancestral) nodes u in the given tree, that is P(leaves below u | u, tree) for all u that are not leaves. Supports multiple, parallel models. Uses a vectorized implementation of Felsenstein’s pruning algorithm that treats models, sequence positions and all nodes within a tree layer in parallel.

  • Dimensions marked with * in the shapes below support broadcasting.

Parameters:
  • leaves – Logits of all symbols at all leaves of shape (num_leaves, models*, L, d).

  • tree_handler (TreeHandler) – TreeHandler object

  • transition_probs – Probabilistic transition matrices of shape (num_nodes-1, models, L*, d, d) or (num_nodes-1, models, d, d).

  • leaf_names – Names of the leaves (list-like of length num_leaves). Used to reorder correctly.

  • return_only_root – If True, only the root node logliks are returned.

  • leaves_are_probabilities – If True, leaves are assumed to be probabilities or one-hot encoded.

  • return_probabilities – If True, return probabilities instead of logliks.

Returns:

Ancestral logliks of shape (models, L, d) if return_only_root else shape (num_ancestral_nodes, models, L, d)
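The pruning recursion can be illustrated without the library. The following is a minimal sketch for a single model and a single site that works with plain probabilities rather than log values and is not vectorized. The tree encoding (a `children` dict), the function name `partial_likelihoods`, and the toy transition matrix are assumptions made for illustration and are not part of EvoTen.

```python
def partial_likelihoods(children, leaf_probs, trans, root):
    """Return {internal node u: [P(leaves below u | u = x) for each state x]}.

    children   : dict mapping each internal node to a list of its children
    leaf_probs : dict mapping each leaf to its (one-hot or soft) state vector
    trans      : dict mapping each non-root node v to the d x d matrix of the
                 branch above it, with trans[v][x][y] = P(v = y | parent = x)
    """
    partials = dict(leaf_probs)  # a leaf's vector is its own partial likelihood
    d = len(next(iter(leaf_probs.values())))

    def visit(u):
        if u in partials:
            return partials[u]
        message = [1.0] * d
        for v in children[u]:
            below = visit(v)
            for x in range(d):
                # Sum out the child's state along the branch u -> v.
                message[x] *= sum(trans[v][x][y] * below[y] for y in range(d))
        partials[u] = message
        return message

    visit(root)
    return {u: partials[u] for u in children}


# Toy 2-state example: root R with two leaves A and B.
P = [[0.9, 0.1], [0.2, 0.8]]
partials = partial_likelihoods(
    children={"R": ["A", "B"]},
    leaf_probs={"A": [1.0, 0.0], "B": [0.0, 1.0]},
    trans={"A": P, "B": P},
    root="R",
)
```

A production implementation would typically work in log space to avoid underflow on large trees.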

evoten.model.compute_leaf_out_marginals(leaves, tree_handler, transition_probs, equilibrium_logits, leaf_names=None, leaves_are_probabilities=True, return_probabilities=False)[source]

Computes the marginal distributions of the leaves given all other leaves, the tree topology and the rate matrix. Formally, the method computes P(u | leaves_except_u, tree, rates) for all leaves u.

  • Dimensions marked with * in the shapes below support broadcasting.

Parameters:
  • leaves – Logits of all symbols at all leaves of shape (num_leaves, models*, L, d).

  • tree_handler (TreeHandler) – TreeHandler object

  • transition_probs – Probabilistic transition matrices of shape (num_nodes-1, models, L*, d, d).

  • equilibrium_logits – Equilibrium distribution logits of shape (models, d).

  • leaf_names – Names of the leaves (list-like of length num_leaves). Used to reorder correctly.

  • leaves_are_probabilities – If True, leaves are assumed to be probabilities or one-hot encoded.

  • return_probabilities – If True, return probabilities instead of logliks.

Returns:

Leaf-out marginals of shape (num_leaves, models, L, d).

evoten.model.loglik(leaves, tree_handler, transition_probs, equilibrium_logits, leaf_names=None, leaves_are_probabilities=True)[source]

Computes log P(leaves | tree, rate_matrix).

  • Dimensions marked with * in the shapes below support broadcasting.

Parameters:
  • leaves – Logits of all symbols at all leaves of shape (num_leaves, models*, L, d).

  • tree_handler (TreeHandler) – TreeHandler object

  • transition_probs – Probabilistic transition matrices of shape (num_nodes-1, models, L*, d, d) or (num_nodes-1, models, d, d).

  • equilibrium_logits – Equilibrium distribution logits of shape (models, L*, d).

  • leaf_names – Names of the leaves (list-like of length num_leaves). Used to reorder correctly.

  • leaves_are_probabilities – If True, leaves are assumed to be probabilities or one-hot encoded.

Returns:

Log-likelihoods of shape (models, L).
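For one model and one site, the log-likelihood is the log of the root's partial likelihoods averaged under the root distribution. The sketch below assumes that the equilibrium logits are converted to a distribution with a softmax; that conversion, the function name `root_loglik`, and the toy numbers are assumptions for illustration, not statements about EvoTen's internals.

```python
import math


def root_loglik(root_partials, equilibrium_logits):
    """Return log sum_x pi(x) * P(leaves | root = x)."""
    # Softmax (assumed here) turns the logits into a probability vector pi.
    m = max(equilibrium_logits)
    exps = [math.exp(z - m) for z in equilibrium_logits]
    total = sum(exps)
    pi = [e / total for e in exps]
    return math.log(sum(p * l for p, l in zip(pi, root_partials)))


# Equal logits give a uniform root distribution over the two states.
ll = root_loglik([0.09, 0.16], [0.0, 0.0])
```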

evoten.model.propagate(root, tree_handler, transition_probs)[source]

Propagates a root distribution along the tree topology. The method computes P(u | root, tree) for all nodes u in the tree.

  • Dimensions marked with * in the shapes below support broadcasting.

Parameters:
  • root – Probabilities of all symbols at the root node of shape (1, models*, L, d).

  • tree_handler (TreeHandler) – TreeHandler object

  • transition_probs – Probabilistic transition matrices of shape (num_nodes-1, models, L*, d, d).
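Propagation amounts to multiplying a parent's distribution by the transition matrix of each outgoing branch, from the root downwards. The following is a minimal, unvectorized sketch for one model and one site; the `children` dict, the function name `propagate_root`, and the toy matrix are hypothetical and not part of EvoTen.

```python
def propagate_root(root_dist, children, trans, root):
    """Return {node: distribution of the node's state given the root distribution}.

    children : dict mapping internal nodes to lists of child nodes
    trans    : dict mapping each non-root node v to the matrix of the branch
               above it, with trans[v][x][y] = P(v = y | parent = x)
    """
    dists = {root: list(root_dist)}
    stack = [root]
    while stack:
        u = stack.pop()
        d = len(dists[u])
        for v in children.get(u, []):
            # Row vector times matrix: sum over the parent's state x.
            dists[v] = [
                sum(dists[u][x] * trans[v][x][y] for x in range(d))
                for y in range(d)
            ]
            stack.append(v)
    return dists


# Toy 2-state tree: R -> A -> C and R -> B, all branches share one matrix.
P = [[0.9, 0.1], [0.2, 0.8]]
dists = propagate_root(
    [1.0, 0.0],
    children={"R": ["A", "B"], "A": ["C"]},
    trans={"A": P, "B": P, "C": P},
    root="R",
)
```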

Tree Handler

class evoten.tree_handler.NodeData(node: Bio.Phylo.BaseTree.Clade, parent: Bio.Phylo.BaseTree.Clade, height: int = -1, index: int = -1, finished: bool = False)[source]

Bases: object

Parameters:
  • node (Clade)

  • parent (Clade)

  • height (int)

  • index (int)

  • finished (bool)

finished: bool = False
height: int = -1
index: int = -1
node: Clade
parent: Clade
class evoten.tree_handler.TreeHandler(tree=None, root_name=None)[source]

Bases: object

Wraps a rooted tree and provides utility functions for height-wise processing of the tree.

Parameters:
  • tree (Bio.Phylo.BaseTree) – Bio.Phylo tree object that will be wrapped by this class. If None, a tree with only a root node will be created.

  • root_name – Name of the root node when creating a new tree.
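To illustrate what height-wise processing means, the sketch below computes the height of the subtree rooted at each node (leaves have height 0) and groups the nodes into layers by height. The dict-based tree and the function names `node_heights` and `layers_by_height` are hypothetical and independent of how TreeHandler stores its data.

```python
def node_heights(children, root):
    """Return {node: height of the subtree rooted at the node}."""
    heights = {}

    def visit(u):
        kids = children.get(u, [])
        # A leaf has height 0; an internal node is one above its tallest child.
        heights[u] = (1 + max(visit(v) for v in kids)) if kids else 0
        return heights[u]

    visit(root)
    return heights


def layers_by_height(heights):
    """Group node names by height so each layer can be processed at once."""
    layers = {}
    for node, h in heights.items():
        layers.setdefault(h, []).append(node)
    return layers


# Toy tree: R has children A and B; A has leaf children x and y; B is a leaf.
heights = node_heights({"R": ["A", "B"], "A": ["x", "y"]}, "R")
layers = layers_by_height(heights)
```

Processing the layers in increasing height guarantees that every child is handled before its parent.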

change_root(new_root_name)[source]
Rotates the tree such that a different node becomes the root.

Calls update() automatically.

Parameters:
  • new_root_name – Name of the new root node, can be any internal node in the tree.

collapse(node_name)[source]

Collapses a node in the tree. Call update() after all tree modifications are done.

classmethod copy(tree_handler)[source]

Copies another tree handler.

Parameters:

tree_handler – TreeHandler object to copy.

draw(no_labels=False, axes=None, do_show=True)[source]

Plots the tree.

classmethod from_newick(newick_str)[source]

Reads a tree from a newick string.

Parameters:
  • newick_str – Newick string representation of the tree.

get_branch_lengths_by_height(height)[source]

Retrieves a vector with the branch lengths for each node with the given height.

Parameters:

height – Height of the subtree rooted at a node.

get_index(node_name)[source]
get_indices(node_names)[source]

Get indices for a list of node names (strings).

get_internal_counts_by_height(height)[source]

Retrieves a vector with the number of child nodes that are internal for each node in the given layer.

Example: For the tree ROOT / | A B C |\ | /|D E F G H I | | | x y z and height=2, the function will return [1,2].

Parameters:

height – Height of the subtree rooted at a node.

get_leaf_counts_by_height(height)[source]

Retrieves a vector with the number of child nodes that are leaves for each node in the given layer.

Example: For the tree ROOT / | A B C |\ | /|D E F G H I | | | x y z and height=1, the function will return [1, 1, 1, 1].

Parameters:

height – Height of the subtree rooted at a node.

get_parent_indices_by_height(height)[source]

Retrieves a vector with the index of the parent for each node in a height layer.

get_values_by_height(kernel, height, leaves_included=True)[source]

Retrieves all values from the leftmost axis of a tensor corresponding to all nodes with a given height.

Parameters:
  • kernel – A tensor of shape (num_nodes-1, …) or (num_nodes, …) (root last; might be excluded), representing all branch lengths ordered by tree height and left-to-right (starting with leaves).

  • height – Height of the subtree rooted at a node.

  • leaves_included – If False, the method assumes a kernel of shape (num_nodes-num_leaves, …), and height=0 is invalid.

Returns:

Tensor of shape (layer_size, …) representing the branch lengths for the tree layer.

prune()[source]

Prunes the tree by removing all leaves, i.e. strips the lowest height layer.

Call update() after all tree modifications are done.

classmethod read(file, fmt='newick')[source]

Reads a tree from a file.

Parameters:
  • filename – handle or filepath

  • fmt – Format of the tree file. Supports all formats supported by Bio.Phylo.

reorder(tensor, node_names, axis=0)[source]
Reorders the tensor along the given axis to be sorted in a way compatible with the tree. This method is meant to be statically compiled in the compute graph (leaf order is always the same).

Parameters:
  • tensor – A tensor of shape (…, k, …).

  • node_names – List-like of k node names in the order as they appear in the tensor.

  • axis – The axis along which the tensor should be reordered.

set_branch_lengths(branch_lengths, update_phylo_tree=True)[source]

Sets the branch lengths of the tree.

Parameters:

branch_lengths – A tensor of shape (num_nodes-1, k) representing the branch length from each node to its parent, where k is the number of models.

setup_init_branch_lengths()[source]

Initializes the branch lengths of the tree.

split(node_name, n=2, branch_length=1.0, names=None)[source]

Generates n new descendants for a node. Call update() after all tree modifications are done.

to_newick(no_internal_names=True)[source]

Returns the newick string representation of the tree.

update(unnamed_node_keyword='evoten._node', force_reset_init_lengths=False)[source]

Initializes or updates utility datastructures for the tree.

Substitution Models

evoten.substitution_models.LG(alphabet='ARNDCQEGHILKMFPSTWYV', dtype=<class 'numpy.float32'>)[source]
Returns the exchangeabilities and equilibrium frequencies for the LG model (Si Quang Le and Olivier Gascuel, "An Improved General Amino Acid Replacement Matrix", 2008). Use for amino acids.

Parameters:
  • alphabet (str) – A string with the amino acids in the desired order.

  • dtype (type[floating])

Returns:

A symmetric d x d matrix of exchangeabilities and a length-d vector of equilibrium frequencies.

Return type:

tuple[ndarray, ndarray]

evoten.substitution_models.jukes_cantor(mue=1.3333333333333333, d=4, dtype=<class 'numpy.float32'>)[source]

Returns the exchangeabilities and equilibrium frequencies for the Jukes-Cantor model.

Parameters:
  • mue (float | Sequence[float]) – Scalar, list or 1D array.

  • d (int)

  • dtype (type[floating])

Returns:

A symmetric k x d x d tensor of exchangeabilities and a k x d matrix of equilibrium frequencies, where k is the length of mue or 1 if mue is a scalar.

Return type:

tuple[ndarray, ndarray]
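For orientation, the textbook Jukes-Cantor model has a closed-form transition matrix. The sketch below uses the standard parameterization in which one substitution is expected per unit of branch length; whether and how `mue` maps onto this parameterization is not specified on this page, so the function name `jc_transition_matrix` and its arguments are illustrative assumptions.

```python
import math


def jc_transition_matrix(t, d=4):
    """Textbook Jukes-Cantor transition probabilities for branch length t.

    All d states are equally frequent and all substitutions equally likely.
    The rate is normalized to one expected substitution per unit time.
    """
    decay = math.exp(-d * t / (d - 1))
    same = 1.0 / d + (d - 1) / d * decay  # probability of keeping the state
    diff = 1.0 / d - decay / d            # probability of one specific change
    return [[same if i == j else diff for j in range(d)] for i in range(d)]


P = jc_transition_matrix(0.1)
```

At t = 0 the matrix is the identity, and for large t every entry approaches 1/d.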

Utilities

evoten.util.data_path(filename)[source]

Yields a real filesystem path to a data file.

Parameters:

filename (str)

Return type:

Iterator[Path]

evoten.util.encode_one_hot(sequences, alphabet)[source]

One-hot encodes a list of strings over the given alphabet.

Parameters:
  • sequences (Sequence[str])

  • alphabet (str)

Return type:

ndarray

evoten.util.encode_tuple_alignment(ta, k=3, gap_symbols='-', gap_separate_state=0)[source]

One-hot encode a tuple alignment produced by tuple_alignment().

Each k-character entry is converted to a base-4 integer index (a=0, c=1, g=2, t=3), so ‘aaa’->0 and ‘cgt’->1*16+2*4+3=27. Gap entries and entries containing characters outside {a,c,g,t,A,C,G,T} are encoded as all-ones (unknown/missing) unless gap_separate_state >= 1, in which case gap entries are one-hot at index 4**k.

Parameters:
  • ta (List[str]) – Output of tuple_alignment(); R strings each of length L*k.

  • k (int) – Tuple length used to produce ta (default 3).

  • gap_symbols (str) – Characters treated as gaps (default ‘-’).

  • gap_separate_state (int) – Number of extra gap states appended to the alphabet (default 0). If >= 1, gap entries are one-hot at index 4**k instead of all-ones.

Returns:

Shape [R, L, 4**k + gap_separate_state], dtype float32.

Valid k-mer entries are one-hot. Ambiguous entries are all-ones. Gap entries are all-ones when gap_separate_state=0, or one-hot at index 4**k when gap_separate_state >= 1.

Return type:

np.ndarray
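The base-4 indexing described above can be checked with a few lines of plain Python. This is a minimal sketch of the encoding for gap_separate_state=0 only; the names `BASE_TO_INT`, `kmer_index`, and `one_hot_kmer` are hypothetical helpers, not EvoTen functions.

```python
BASE_TO_INT = {"a": 0, "c": 1, "g": 2, "t": 3}


def kmer_index(kmer):
    """Return the base-4 index of a k-mer, or None if any character is not a/c/g/t."""
    index = 0
    for ch in kmer.lower():
        if ch not in BASE_TO_INT:
            return None
        # Most-significant digit first: shift left by one base-4 digit.
        index = index * 4 + BASE_TO_INT[ch]
    return index


def one_hot_kmer(kmer):
    """One-hot vector of length 4**k; all-ones for gap or ambiguous entries."""
    size = 4 ** len(kmer)
    idx = kmer_index(kmer)
    if idx is None:
        return [1.0] * size
    vec = [0.0] * size
    vec[idx] = 1.0
    return vec
```

For example, `kmer_index('cgt')` evaluates 1*16 + 2*4 + 3 and yields 27, matching the value given above.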

evoten.util.parse_rate_model(path)[source]

Parses a rate model from a file. The first row is expected to contain the equilibrium frequencies, and the remaining rows are expected to contain the exchangeabilities (lower triangular matrix without diagonal). After the matrix, an optional scaling factor can be provided in a separate line.

The file can contain comments starting with ‘#’.

Parameters:

path (Path | str) – Path to the rate model file.

Returns:

A tuple (exchangeabilities, equilibrium frequencies, scaling factor) with shapes (n, n), (n,) and a scalar respectively. When the scaling factor is not provided, it defaults to 1.

Return type:

tuple[ndarray, ndarray, float]
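The file layout described above can be sketched as follows. The parser assumes whitespace-separated numbers, that row i of the lower triangle holds exactly i entries, and that the matrix is filled symmetrically; these details, the function name `parse_rate_model_text`, and the toy input are assumptions for illustration and may differ from what parse_rate_model accepts.

```python
def parse_rate_model_text(text):
    """Parse a rate model from a string; returns (exchangeabilities, equilibrium, scaling)."""
    rows = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            rows.append([float(x) for x in line.split()])

    equilibrium = rows[0]
    n = len(equilibrium)
    exch = [[0.0] * n for _ in range(n)]
    # Rows 1..n-1 hold the strict lower triangle: row i has i entries.
    for i in range(1, n):
        if len(rows[i]) != i:
            raise ValueError(f"expected {i} values in matrix row {i}")
        for j, value in enumerate(rows[i]):
            exch[i][j] = exch[j][i] = value
    # An optional extra line after the matrix holds the scaling factor.
    scaling = rows[n][0] if len(rows) > n else 1.0
    return exch, equilibrium, scaling


toy = """
# toy 3-state model
0.2 0.3 0.5   # equilibrium frequencies
1.0
2.0 3.0
0.5           # scaling factor
"""
exch, eq, scale = parse_rate_model_text(toy)
```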

evoten.util.permute_rate_model(exchangeabilities, equilibrium, alphabet, new_alphabet)[source]

Permutes a rate model to match a new alphabet. The new alphabet must be a permutation of the old alphabet.

Parameters:
  • exchangeabilities (ndarray) – Exchangeability matrix of shape (n, n).

  • equilibrium (ndarray) – Equilibrium frequencies of shape (n,).

  • alphabet (str) – Original alphabet.

  • new_alphabet (str) – New alphabet.

Returns:

A tuple (new_exchangeabilities, new_equilibrium) with the same shapes as the input exchangeabilities and equilibrium, but permuted to match the new alphabet.

Return type:

tuple[ndarray, ndarray]
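Permuting a rate model means reindexing rows and columns of the exchangeability matrix, and the entries of the equilibrium vector, by the position of each new symbol in the old alphabet. The following is a minimal sketch on nested lists; the function name `permute_rate_model_sketch` is hypothetical and the real function operates on ndarrays.

```python
def permute_rate_model_sketch(exchangeabilities, equilibrium, alphabet, new_alphabet):
    """Reorder a rate model from `alphabet` order to `new_alphabet` order."""
    if sorted(alphabet) != sorted(new_alphabet):
        raise ValueError("new_alphabet must be a permutation of alphabet")
    # perm[i] is the old index of the symbol that ends up at new index i.
    perm = [alphabet.index(ch) for ch in new_alphabet]
    new_exch = [[exchangeabilities[i][j] for j in perm] for i in perm]
    new_eq = [equilibrium[i] for i in perm]
    return new_exch, new_eq


R = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 3.0],
     [2.0, 3.0, 0.0]]
new_R, new_eq = permute_rate_model_sketch(R, [0.2, 0.3, 0.5], "ACG", "GAC")
```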

evoten.util.print_tuple_rate_matrix(Q, k, label=None, file=None)[source]

Pretty-print a rate matrix with k-mer row/column labels.

Parameters:
  • Q (ndarray) – 2-D array-like of shape (4**k, 4**k).

  • k (int) – Tuple size used to produce Q.

  • label (str) – Optional header line printed before the matrix.

  • file (IO[str] | None) – File object for output (default: sys.stdout).

Return type:

None

evoten.util.print_tuple_stationary(pi, k, label=None, file=None)[source]

Pretty-print a stationary distribution with k-mer labels.

Parameters:
  • pi (ndarray) – 1-D array-like of shape (4**k,).

  • k (int) – Tuple size.

  • label (str) – Optional header line printed before the values.

  • file (IO[str] | None) – File object for output (default: sys.stdout).

Return type:

None

evoten.util.tuple_alignment(sequences, k=3, gap_symbols='-')[source]

Construct a tuple alignment from a multiple sequence alignment (MSA).

For each row, k-mers of k consecutive non-gap characters are identified by their column-index tuple (c_0, …, c_{k-1}). Output columns are the equivalence classes — unique column-index tuples — that appear in at least two rows, sorted lexicographically. Each output entry is either the k characters at those columns or k gap characters.

For a gap-less MSA with L columns the output has L-k+1 columns.

Parameters:
  • sequences (List[str]) – MSA rows, all the same length.

  • k (int) – Tuple length; k=3 gives codon-level alignment.

  • gap_symbols (str) – Characters treated as gaps. The first character is used to fill missing entries in the output.

Returns:

One string per row; length = (number of output columns) * k.

Return type:

List[str]

Example

    S = ['ACGT', 'A-GT', 'ACG-']
    tuple_alignment(S, k=2)
    # => ['ACCGGT', '----GT', 'ACCG--']
    # Column-index tuples with count>=2: (0,1), (1,2), (2,3)
    # Row 1 lacks (0,1) and (1,2); row 2 lacks (2,3).
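The construction can be reproduced in a few lines of plain Python. The following sketch follows the description above (windows of k consecutive non-gap characters, keyed by their column indices, kept when they occur in at least two rows); the function name `tuple_alignment_sketch` is hypothetical and the code is not EvoTen's implementation.

```python
from collections import Counter


def tuple_alignment_sketch(sequences, k=3, gap_symbols="-"):
    """Build a tuple alignment from equally long MSA rows."""
    rows = []
    counts = Counter()
    for seq in sequences:
        # Column indices of the non-gap characters in this row.
        cols = [i for i, ch in enumerate(seq) if ch not in gap_symbols]
        # Each window of k consecutive non-gap characters, keyed by its columns.
        tuples = {
            tuple(cols[i:i + k]): "".join(seq[c] for c in cols[i:i + k])
            for i in range(len(cols) - k + 1)
        }
        rows.append(tuples)
        counts.update(tuples.keys())
    # Keep column-index tuples seen in at least two rows, sorted lexicographically.
    kept = sorted(key for key, n in counts.items() if n >= 2)
    gap = gap_symbols[0] * k  # filler for rows that lack a kept tuple
    return ["".join(row.get(key, gap) for key in kept) for row in rows]


result = tuple_alignment_sketch(["ACGT", "A-GT", "ACG-"], k=2)
```

In this input the tuple (0,2) from the second row occurs only once, so it is dropped from the output.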

evoten.util.tuple_array(sequences, k=3, gap_symbols='-', gap_separate_state=0)[source]

Directly compute the one-hot encoded tuple alignment array.

Combines tuple_alignment() and encode_tuple_alignment() in a single pass, avoiding intermediate string allocation. Equivalent to:

    encode_tuple_alignment(tuple_alignment(sequences, k, gap_symbols),
                           k, gap_symbols, gap_separate_state)

For each row, non-gap positions are found and k-tuples of consecutive positions are mapped to base-4 indices. Valid tuples (present in >=2 rows) are written as one-hot vectors into the output array using batch numpy indexing.

With gap_separate_state=0 (default): absent/gap entries are all-ones (neutral for Felsenstein’s pruning algorithm). With gap_separate_state>=1: absent entries are one-hot at index 4**k (explicit gap state); ambiguous entries remain all-ones.

Parameters:
  • sequences (List[str]) – MSA rows, all the same length.

  • k (int) – Tuple length; k=3 for codons.

  • gap_symbols (str) – Characters treated as gaps (default ‘-’).

  • gap_separate_state (int) – Extra gap states appended to the alphabet (default 0). If >= 1, absent tuple positions are encoded as one-hot at index 4**k rather than all-ones.

Returns:
  • result – np.ndarray, shape [R, L, 4**k + gap_separate_state], float32.

  • first_positions – np.ndarray, shape [L], dtype int64. For each output column j, the 0-based alignment column index of the first character of tuple j.

evoten.util.tuple_labels(k)[source]

Return the list of 4**k k-mer strings in base-4 index order.

Index i corresponds to the k-mer whose base-4 digits (most-significant first) give bases _INT_TO_BASE[digit]. E.g. for k=2: index 0 → ‘aa’, index 1 → ‘ac’, index 4 → ‘ca’, index 15 → ‘tt’.

Parameters:

k (int) – Tuple length.

Returns:

4**k strings of length k.

Return type:

List[str]
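The index order described above is the order in which `itertools.product` enumerates k-mers over 'acgt'. The following is a minimal sketch of that enumeration; the function name `tuple_labels_sketch` is hypothetical and the alphabet order a, c, g, t is taken from the examples above.

```python
from itertools import product


def tuple_labels_sketch(k):
    """Return all 4**k k-mers over 'acgt', last position varying fastest."""
    return ["".join(p) for p in product("acgt", repeat=k)]


labels = tuple_labels_sketch(2)
```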

evoten.util.write_rate_model(path, exchangeabilities, equilibrium, scaling_factor=1.0)[source]

Writes a rate model to a file. The first row contains the equilibrium frequencies, and the remaining rows contain the exchangeabilities (lower triangular matrix without diagonal). An optional scaling factor can be provided in a separate line after the matrix.

Parameters:
  • exchangeabilities (ndarray) – Exchangeability matrix of shape (n, n).

  • equilibrium (ndarray) – Equilibrium frequencies of shape (n,).

  • scaling_factor (float) – Optional scaling factor to write to the file.

  • path (Path | str)

Return type:

None

Backend: PyTorch

Backend: TensorFlow