Python API¶
This page provides detailed documentation for the tskit Python API.
Trees and tree sequences¶
The TreeSequence class represents a sequence of correlated evolutionary trees along a genome. The Tree class represents a single tree in this sequence. These classes are the interfaces used to interact with the trees and mutational information stored in a tree sequence, for example as returned from a simulation or inferred from a set of DNA sequences. This library also provides methods for loading stored tree sequences, for example using tskit.load().
The TreeSequence class¶
class tskit.TreeSequence[source]¶
A single tree sequence, as defined by the data model. A TreeSequence instance can be created from a set of tables using TableCollection.tree_sequence(), loaded from a set of text files using tskit.load_text(), or loaded from a native binary file using tskit.load().
TreeSequences are immutable. To change the data held in a particular tree sequence, first get the table information as a TableCollection instance (using dump_tables()), edit those tables using the tables API, and create a new tree sequence using TableCollection.tree_sequence().
The trees() method iterates over all trees in a tree sequence, and the variants() method iterates over all sites and their genotypes.
Methods
Fst(sample_sets[, indexes, windows, mode, …]): Computes "windowed" Fst between pairs of sets of nodes from sample_sets.
Tajimas_D([sample_sets, windows, mode]): Computes Tajima's D of sets of nodes from sample_sets in windows.
Y1(sample_sets[, windows, mode, span_normalise]): Computes the 'Y1' statistic within each of the sets of nodes given by sample_sets.
Y2(sample_sets[, indexes, windows, mode, …]): Computes the 'Y2' statistic between pairs of sets of nodes from sample_sets.
Y3(sample_sets[, indexes, windows, mode, …]): Computes the 'Y' statistic between triples of sets of nodes from sample_sets.
allele_frequency_spectrum([sample_sets, …]): Computes the allele frequency spectrum (AFS) in windows across the genome with respect to the specified sample_sets.
aslist(**kwargs): Returns the trees in this tree sequence as a list.
at(position, **kwargs): Returns the tree covering the specified genomic location.
at_index(index, **kwargs): Returns the tree at the specified index.
breakpoints([as_array]): Returns the breakpoints along the chromosome, including the two extreme points 0 and L.
coiterate(other, **kwargs): Returns an iterator over the pairs of trees for each distinct interval in the specified pair of tree sequences.
count_topologies([sample_sets]): Returns a generator that produces the same distribution of topologies as Tree.count_topologies() but sequentially for every tree in a tree sequence.
delete_intervals(intervals[, simplify, …]): Returns a copy of this tree sequence for which information in the specified list of genomic intervals has been deleted.
delete_sites(site_ids[, record_provenance]): Returns a copy of this tree sequence with the specified sites (and their associated mutations) entirely removed.
divergence(sample_sets[, indexes, windows, …]): Computes mean genetic divergence between (and within) pairs of sets of nodes from sample_sets.
diversity([sample_sets, windows, mode, …]): Computes mean genetic diversity (also known as "Tajima's pi") in each of the sets of nodes from sample_sets.
draw_svg([path, size, x_scale, …]): Return an SVG representation of a tree sequence.
dump(file_or_path): Writes the tree sequence to the specified path or file object.
dump_tables(): A copy of the tables defining this tree sequence.
dump_text([nodes, edges, sites, mutations, …]): Writes a text representation of the tables underlying the tree sequence to the specified connections.
edge(id_): Returns the edge in this tree sequence with the specified ID.
edge_diffs([include_terminal]): Returns an iterator over all the edges that are inserted and removed to build the trees as we move from left-to-right along the tree sequence.
edges(): Returns an iterable sequence of all the edges in this tree sequence.
equals(other, *[, ignore_metadata, …]): Returns True if self and other are equal.
f2(sample_sets[, indexes, windows, mode, …]): Computes Patterson's f2 statistic between two groups of nodes from sample_sets.
f3(sample_sets[, indexes, windows, mode, …]): Computes Patterson's f3 statistic between three groups of nodes from sample_sets.
f4(sample_sets[, indexes, windows, mode, …]): Computes Patterson's f4 statistic between four groups of nodes from sample_sets.
first(**kwargs): Returns the first tree in this TreeSequence.
genealogical_nearest_neighbours(focal, …): Return the genealogical nearest neighbours (GNN) proportions for the given focal nodes, with reference to two or more sets of interest, averaged over all trees in the tree sequence.
general_stat(W, f, output_dim[, windows, …]): Compute a windowed statistic from weights and a summary function.
genetic_relatedness(sample_sets[, indexes, …]): Computes genetic relatedness between (and within) pairs of sets of nodes from sample_sets.
genotype_matrix(*[, isolated_as_missing, …]): Returns an \(m \times n\) numpy array of the genotypes in this tree sequence, where \(m\) is the number of sites and \(n\) the number of samples.
haplotypes(*[, isolated_as_missing, …]): Returns an iterator over the strings of haplotypes that result from the trees and mutations in this tree sequence.
individual(id_): Returns the individual in this tree sequence with the specified ID.
individuals(): Returns an iterable sequence of all the individuals in this tree sequence.
kc_distance(other[, lambda_]): Returns the average Tree.kc_distance() between pairs of trees along the sequence whose intervals overlap.
keep_intervals(intervals[, simplify, …]): Returns a copy of this tree sequence which includes only information in the specified list of genomic intervals.
last(**kwargs): Returns the last tree in this TreeSequence.
ltrim([record_provenance]): Returns a copy of this tree sequence with a potentially changed coordinate system, such that empty regions (i.e. those not covered by any edge) at the start of the tree sequence are trimmed away.
mean_descendants(sample_sets): Computes for every node the mean number of samples in each of the sample_sets that descend from that node, averaged over the portions of the genome for which the node is ancestral to any sample.
migration(id_): Returns the migration in this tree sequence with the specified ID.
migrations(): Returns an iterable sequence of all the migrations in this tree sequence.
mutation(id_): Returns the mutation in this tree sequence with the specified ID.
mutations(): Returns an iterator over all the mutations in this tree sequence.
node(id_): Returns the node in this tree sequence with the specified ID.
nodes(): Returns an iterable sequence of all the nodes in this tree sequence.
pairwise_diversity([samples]): Returns the pairwise nucleotide site diversity, the average number of sites that differ between a randomly chosen pair of samples.
population(id_): Returns the population in this tree sequence with the specified ID.
populations(): Returns an iterable sequence of all the populations in this tree sequence.
provenances(): Returns an iterable sequence of all the provenances in this tree sequence.
rtrim([record_provenance]): Returns a copy of this tree sequence with the sequence_length property reset so that the sequence ends at the end of the rightmost edge.
sample_count_stat(sample_sets, f, output_dim): Compute a windowed statistic from sample counts and a summary function.
samples([population, population_id]): Returns an array of the sample node IDs in this tree sequence.
segregating_sites([sample_sets, windows, …]): Computes the density of segregating sites for each of the sets of nodes from sample_sets, and related quantities.
simplify([samples, map_nodes, …]): Returns a simplified tree sequence that retains only the history of the nodes given in the list samples.
site(id_): Returns the site in this tree sequence with the specified ID.
sites(): Returns an iterable sequence of all the sites in this tree sequence.
subset(nodes[, record_provenance]): Returns a tree sequence modified to contain only the entries referring to the provided list of nodes, with nodes reordered according to the order they appear in the nodes argument.
to_macs(): Return a macs encoding of this tree sequence.
to_nexus([precision]): Returns a nexus encoding of this tree sequence.
trait_correlation(W[, windows, mode, …]): Computes the mean squared correlations between each of the columns of W (the "phenotypes") and inheritance along the tree sequence.
trait_covariance(W[, windows, mode, …]): Computes the mean squared covariances between each of the columns of W (the "phenotypes") and inheritance along the tree sequence.
trait_linear_model(W[, Z, windows, mode, …]): Finds the relationship between trait and genotype after accounting for covariates.
trait_regression(*args, **kwargs): Deprecated synonym for trait_linear_model.
trees([tracked_samples, sample_lists, …]): Returns an iterator over the trees in this tree sequence.
trim([record_provenance]): Returns a copy of this tree sequence with any empty regions (i.e. those not covered by any edge) trimmed away.
union(other, node_mapping[, …]): Returns an expanded tree sequence which contains the node-wise union of self and other, obtained by adding the non-shared portions of other onto self.
variants(*[, as_bytes, samples, …]): Returns an iterator over the variants in this tree sequence.
write_vcf(output[, ploidy, contig_id, …]): Writes a VCF formatted file to the specified file-like object.
Attributes
max_root_time: Returns time of the oldest root in any of the trees in this tree sequence.
metadata: The decoded metadata for this TreeSequence.
metadata_schema: The tskit.MetadataSchema for this TreeSequence.
nbytes: Returns the total number of bytes required to store the data in this tree sequence.
num_edges: Returns the number of edges in this tree sequence.
num_individuals: Returns the number of individuals in this tree sequence.
num_migrations: Returns the number of migrations in this tree sequence.
num_mutations: Returns the number of mutations in this tree sequence.
num_nodes: Returns the number of nodes in this tree sequence.
num_populations: Returns the number of populations in this tree sequence.
num_provenances: Returns the number of provenances in this tree sequence.
num_samples: Returns the number of samples in this tree sequence.
num_sites: Returns the number of sites in this tree sequence.
num_trees: Returns the number of distinct trees in this tree sequence.
sequence_length: Returns the sequence length in this tree sequence.
table_metadata_schemas: The set of metadata schemas for the tables in this tree sequence.
tables: A copy of the tables underlying this tree sequence.
tables_dict: Returns a dictionary mapping names to tables in the underlying TableCollection.
-
Fst(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)[source]¶
Computes "windowed" Fst between pairs of sets of nodes from sample_sets. Operates on k = 2 sample sets at a time; please see the multi-way statistics section for details on how the sample_sets and indexes arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.
For sample sets X and Y, if d(X, Y) is the divergence between X and Y, and d(X) is the diversity of X, then what is computed is
Fst = 1 - 2 * (d(X) + d(Y)) / (d(X) + 2 * d(X, Y) + d(Y))
What is computed for diversity and divergence depends on mode; see those functions for more details.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
indexes (list) – A list of 2-tuples.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A ndarray with shape equal to (num windows, num statistics).
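For example (a minimal sketch, assuming ts is an existing TreeSequence with a reasonable number of samples; the split into two groups below is arbitrary):
>>> s = ts.samples()
>>> X, Y = s[: len(s) // 2], s[len(s) // 2 :]
>>> ts.Fst([X, Y])                                # single value for the pair (X, Y)
>>> L = ts.sequence_length
>>> ts.Fst([X, Y], windows=[0, L / 2, L])         # one value per window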
-
Tajimas_D(sample_sets=None, windows=None, mode='site')[source]¶
Computes Tajima's D of sets of nodes from sample_sets in windows. Please see the one-way statistics section for details on how the sample_sets argument is interpreted and how it interacts with the dimensions of the output array. See the statistics interface section for details on windows, mode, and return value. Operates on k = 1 sample sets at a time. For a sample set X of n nodes, if T is the mean number of pairwise differing sites in X and S is the number of sites segregating in X (computed with diversity and segregating sites, respectively, both not span normalised), then Tajima's D is
D = (T - S / h) / sqrt(a * S + (b / c) * S * (S - 1))
h = 1 + 1 / 2 + ... + 1 / (n - 1)
g = 1 + 1 / 2 ** 2 + ... + 1 / (n - 1) ** 2
a = (n + 1) / (3 * (n - 1) * h) - 1 / h ** 2
b = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1)) - (n + 2) / (h * n) + g / h ** 2
c = h ** 2 + g
What is computed for diversity and segregating sites depends on mode; see those functions for more details.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
- Returns
A ndarray with shape equal to (num windows, num statistics).
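For example (a minimal sketch, assuming ts is an existing TreeSequence):
>>> ts.Tajimas_D()                                # D for all samples over the whole sequence
>>> L = ts.sequence_length
>>> ts.Tajimas_D(windows=[0, L / 2, L])           # one value per window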
-
Y1(sample_sets, windows=None, mode='site', span_normalise=True)[source]¶
Computes the 'Y1' statistic within each of the sets of nodes given by sample_sets. Please see the one-way statistics section for details on how the sample_sets argument is interpreted and how it interacts with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value. Operates on k = 1 sample set at a time.
What is computed depends on mode. Each is computed exactly as Y3, except that the average is across randomly chosen trios of samples (a1, a2, a3), all chosen without replacement from the same sample set. See Y3 for more details.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A ndarray with shape equal to (num windows, num statistics).
-
Y2(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)[source]¶
Computes the 'Y2' statistic between pairs of sets of nodes from sample_sets. Operates on k = 2 sample sets at a time; please see the multi-way statistics section for details on how the sample_sets and indexes arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.
What is computed depends on mode. Each is computed exactly as Y3, except that the average is across randomly chosen trios of samples (a, b1, b2), where a is chosen from the first sample set, and b1, b2 are chosen (without replacement) from the second sample set. See Y3 for more details.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
indexes (list) – A list of 2-tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A ndarray with shape equal to (num windows, num statistics).
-
Y3(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)[source]¶
Computes the 'Y' statistic between triples of sets of nodes from sample_sets. Operates on k = 3 sample sets at a time; please see the multi-way statistics section for details on how the sample_sets and indexes arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.
What is computed depends on mode. Each is an average across randomly chosen trios of samples (a, b, c), one from each sample set:
- "site": The average density of sites at which a differs from b and c, per unit of chromosome length.
- "branch": The average length of all branches that separate a from b and c (in units of time).
- "node": For each node, the average proportion of the window on which a inherits from that node but b and c do not, or vice-versa.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
indexes (list) – A list of 3-tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A ndarray with shape equal to (num windows, num statistics).
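For example (a minimal sketch, assuming ts is an existing TreeSequence with at least six samples; the grouping below is arbitrary):
>>> s = ts.samples()
>>> A, B, C = s[:2], s[2:4], s[4:6]
>>> ts.Y3([A, B, C])                              # single value for the triple (A, B, C)
>>> ts.Y3([A, B, C], mode="branch")               # branch-length version of the same statistic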
-
allele_frequency_spectrum(sample_sets=None, windows=None, mode='site', span_normalise=True, polarised=False)[source]¶
Computes the allele frequency spectrum (AFS) in windows across the genome with respect to the specified sample_sets. See the statistics interface section for details on sample sets, windows, mode, span normalise, polarised, and return value, and see Allele frequency spectra for examples of how to use this method.
Similar to other windowed stats, the first dimension in the returned array corresponds to windows, such that result[i] is the AFS in the ith window. The AFS in each window is a k-dimensional numpy array, where k is the number of input sample sets, such that result[i, j0, j1, ...] is the value associated with frequency j0 in sample_sets[0], j1 in sample_sets[1], etc, in window i. From here, we will assume that afs corresponds to the result in a single window, i.e., afs = result[i].
If a single sample set is specified, the allele frequency spectrum within this set is returned, such that afs[j] is the value associated with frequency j. Thus, singletons are counted in afs[1], doubletons in afs[2], and so on. The zeroth entry counts alleles or branches not seen in the samples but that are polymorphic among the rest of the samples of the tree sequence; likewise, the last entry counts alleles fixed in the sample set but polymorphic in the entire set of samples. Please see the Zeroth and final entries in the AFS for an illustration.
Warning
Please note that singletons are not counted in the initial entry in each AFS array (i.e., afs[0]), but in afs[1].
If sample_sets is None (the default), the allele frequency spectrum for all samples in the tree sequence is returned.
If more than one sample set is specified, the joint allele frequency spectrum within windows is returned. For example, if we set sample_sets = [S0, S1], then afs[1, 2] counts the number of sites that are singletons within S0 and doubletons within S1. The dimensions of the output array will be [num_windows] + [1 + len(S) for S in sample_sets].
If polarised is False (the default) the AFS will be folded, so that the counts do not depend on knowing which allele is ancestral. If folded, the frequency spectrum for a single sample set S has afs[j] = 0 for all j > len(S) / 2, so that alleles at frequency j and len(S) - j both add to the same entry. If there is more than one sample set, the returned array is "lower triangular" in a similar way. For more details, especially about handling of multiallelic sites, see Allele frequency spectrum.
What is computed depends on mode:
- "site": The number of alleles at a given frequency within the specified sample sets for each window, per unit of sequence length. To obtain the total number of alleles, set span_normalise to False.
- "branch": The total length of branches in the trees subtended by subsets of the specified sample sets, per unit of sequence length. To obtain the total, set span_normalise to False.
- "node": Not supported for this method (raises a ValueError).
For example, suppose that S0 is a list of 5 sample IDs, and S1 is a list of 3 other sample IDs. Then afs = ts.allele_frequency_spectrum([S0, S1], mode="site", span_normalise=False) will be a 6x4 numpy array, and if there are six alleles that are present in only one sample of S0 but two samples of S1, then afs[1, 2] will be equal to 6. Similarly, branch_afs = ts.allele_frequency_spectrum([S0, S1], mode="branch", span_normalise=False) will also be a 6x4 array, and branch_afs[1, 2] will be the total area (i.e., length times span) of all branches that are above exactly one sample of S0 and two samples of S1.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of samples to compute the joint allele frequency spectrum for.
windows (list) – An increasing list of breakpoints between windows along the genome.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A (k + 1) dimensional numpy array, where k is the number of sample sets specified.
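For example (a minimal sketch, assuming ts is an existing TreeSequence with at least eight samples; the sample sets below are arbitrary):
>>> afs = ts.allele_frequency_spectrum(span_normalise=False)   # folded AFS for all samples
>>> afs.shape                                                  # (ts.num_samples + 1,)
>>> S0, S1 = ts.samples()[:5], ts.samples()[5:8]
>>> joint = ts.allele_frequency_spectrum([S0, S1], mode="site", span_normalise=False)
>>> joint.shape                                                # (6, 4): one dimension per sample set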
-
aslist(**kwargs)[source]¶
Returns the trees in this tree sequence as a list. Each tree is represented by a different instance of Tree. As such, this method is inefficient and may use a large amount of memory, and should not be used when performance is a consideration. The trees() method is the recommended way to efficiently iterate over the trees in a tree sequence.
-
at(position, **kwargs)[source]¶
Returns the tree covering the specified genomic location. The returned tree will have tree.interval.left <= position < tree.interval.right. See also Tree.seek().
- Parameters
- Returns
A new instance of Tree positioned to cover the specified genomic location.
- Return type
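For example (a minimal sketch, assuming ts is an existing TreeSequence):
>>> position = ts.sequence_length / 2
>>> tree = ts.at(position)                        # tree covering the genomic mid-point
>>> tree.interval.left <= position < tree.interval.right
True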
-
at_index(index, **kwargs)[source]¶
Returns the tree at the specified index. See also Tree.seek_index().
- Parameters
- Returns
A new instance of Tree positioned at the specified index.
- Return type
-
breakpoints(as_array=False)[source]¶
Returns the breakpoints along the chromosome, including the two extreme points 0 and L. This is equivalent to
>>> iter([0] + [t.interval.right for t in self.trees()])
By default we return an iterator over the breakpoints as Python float objects; if as_array is True we return them as a numpy array.
Note that the as_array form will be more efficient and convenient in most cases; the default iterator behaviour is mainly kept to ensure compatibility with existing code.
as_array (bool) – If True, return the breakpoints as a numpy array.
- Returns
The breakpoints defined by the tree intervals along the sequence.
- Return type
-
coiterate(other, **kwargs)[source]¶
Returns an iterator over the pairs of trees for each distinct interval in the specified pair of tree sequences.
- Parameters
other (TreeSequence) – The other tree sequence from which to take trees. The sequence length must be the same as the current tree sequence.
**kwargs – Further named arguments that will be passed to the trees() method when constructing the returned trees.
- Returns
An iterator returning successive tuples of the form (interval, tree_self, tree_other). For example, the first item returned will consist of a tuple of the initial interval, the first tree of the current tree sequence, and the first tree of the other tree sequence; the .left attribute of the initial interval will be 0 and the .right attribute will be the smallest non-zero breakpoint of the 2 tree sequences.
- Return type
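For example (a minimal sketch, assuming ts and other_ts are two TreeSequences with the same sequence length):
>>> for interval, tree_self, tree_other in ts.coiterate(other_ts):
...     print(interval.left, interval.right, tree_self.num_roots, tree_other.num_roots)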
-
count_topologies(sample_sets=None)[source]¶
Returns a generator that produces the same distribution of topologies as Tree.count_topologies() but sequentially for every tree in a tree sequence. For use on a tree sequence this method is much faster than computing the result independently per tree.
Warning
The interface for this method is preliminary and may be subject to backwards incompatible changes in the near future.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
- Return type
iter(tskit.TopologyCounter)
- Raises
ValueError – If nodes in sample_sets are invalid or are internal samples.
-
delete_intervals(intervals, simplify=True, record_provenance=True)[source]¶
Returns a copy of this tree sequence for which information in the specified list of genomic intervals has been deleted. Edges spanning these intervals are truncated or deleted, and sites and mutations falling within them are discarded. Note that it is the information in the intervals that is deleted, not the intervals themselves, so in particular, all samples will be isolated in the deleted intervals.
Note that node IDs may change as a result of this operation, as by default simplify() is called on the returned tree sequence to remove redundant nodes. If you wish to map node IDs onto the same nodes before and after this method has been called, specify simplify=False.
See also keep_intervals(), ltrim(), rtrim(), and missing data.
- Parameters
intervals (array_like) – A list of (start, end) pairs describing the genomic intervals to delete. Intervals must be non-overlapping and in increasing order. The list of intervals must be interpretable as a 2D numpy array with shape (N, 2), where N is the number of intervals.
simplify (bool) – If True, return a simplified tree sequence where nodes no longer used are discarded. (Default: True).
record_provenance (bool) – If True, add details of this operation to the provenance information of the returned tree sequence. (Default: True).
- Return type
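For example (a minimal sketch, assuming ts is an existing TreeSequence whose sequence_length is greater than 200):
>>> import numpy as np
>>> ts_deleted = ts.delete_intervals(np.array([[100, 200]]))   # remove information in [100, 200)
>>> ts_kept = ts.keep_intervals(np.array([[0, 100]]))          # keep only information in [0, 100)
>>> ts_kept.sequence_length == ts.sequence_length              # the coordinate system is unchanged
True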
-
delete_sites(site_ids, record_provenance=True)[source]¶
Returns a copy of this tree sequence with the specified sites (and their associated mutations) entirely removed. The site IDs do not need to be in any particular order, and specifying the same ID multiple times does not have any effect (i.e., calling tree_sequence.delete_sites([0, 1, 1]) has the same effect as calling tree_sequence.delete_sites([0, 1])).
-
divergence(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)[source]¶
Computes mean genetic divergence between (and within) pairs of sets of nodes from sample_sets. Operates on k = 2 sample sets at a time; please see the multi-way statistics section for details on how the sample_sets and indexes arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.
As a special case, an index (j, j) will compute the diversity of sample_set[j].
What is computed depends on mode:
- "site": Mean pairwise genetic divergence: the average across distinct, randomly chosen pairs of chromosomes (one from each sample set), of the density of sites at which the two carry different alleles, per unit of chromosome length.
- "branch": Mean distance in the tree: the average across distinct, randomly chosen pairs of chromosomes (one from each sample set) and locations in the window, of the mean distance in the tree between the two samples (in units of time).
- "node": For each node, the proportion of genome on which the node is an ancestor to only one of a random pair (one from each sample set), averaged over choices of pair.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
indexes (list) – A list of 2-tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A ndarray with shape equal to (num windows, num statistics).
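For example (a minimal sketch, assuming ts is an existing TreeSequence; the two sample sets are arbitrary halves of the samples):
>>> s = ts.samples()
>>> X, Y = s[: len(s) // 2], s[len(s) // 2 :]
>>> ts.divergence([X, Y])                              # divergence between X and Y
>>> ts.divergence([X, Y], indexes=[(0, 1), (0, 0)])    # divergence of (X, Y), then diversity of X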
-
diversity(sample_sets=None, windows=None, mode='site', span_normalise=True)[source]¶
Computes mean genetic diversity (also known as "Tajima's pi") in each of the sets of nodes from sample_sets. Please see the one-way statistics section for details on how the sample_sets argument is interpreted and how it interacts with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.
Note that this quantity can also be computed by the divergence method.
What is computed depends on mode:
- "site": Mean pairwise genetic diversity: the average across distinct, randomly chosen pairs of chromosomes, of the density of sites at which the two carry different alleles, per unit of chromosome length.
- "branch": Mean distance in the tree: the average across distinct, randomly chosen pairs of chromosomes and locations in the window, of the mean distance in the tree between the two samples (in units of time).
- "node": For each node, the proportion of genome on which the node is an ancestor to only one of a random pair from the sample set, averaged over choices of pair.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A numpy array.
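For example (a minimal sketch, assuming ts is an existing TreeSequence):
>>> ts.diversity()                                     # mean pairwise diversity for all samples
>>> L = ts.sequence_length
>>> ts.diversity(windows=[0, L / 2, L])                # one value per window
>>> ts.diversity(mode="branch")                        # branch-length analogue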
-
draw_svg(path=None, *, size=None, x_scale=None, tree_height_scale=None, node_labels=None, mutation_labels=None, root_svg_attributes=None, style=None, order=None, force_root_branch=None, **kwargs)[source]¶
Return an SVG representation of a tree sequence.
When working in a Jupyter notebook, use the IPython.display.SVG function to display the SVG output from this function inline in the notebook:
>>> SVG(ts.draw_svg())
The visual elements in the svg are grouped for easy styling and manipulation. The entire visualization with trees and X axis is contained within a group of class tree-sequence. Each tree in the displayed tree sequence is contained in a group of class tree, as described in Tree.draw_svg(), so that visual elements pertaining to one or more trees can be targeted as documented in that method. For instance, the following style will change the colour of all the edges of the initial tree in the sequence and hide the non-sample node labels in all the trees:
.tree.t0 .edge {stroke: blue} .tree .node:not(.sample) > text {visibility: hidden}
See Tree.draw_svg() for further details.
- Parameters
path (str) – The path to the file to write the output. If None, do not write to file.
size (tuple(int, int)) – A tuple of (width, height) giving the width and height of the produced SVG drawing in abstract user units (usually interpreted as pixels on display).
x_scale (str) – Control how the X axis is drawn. If “physical” (the default) the axis scales linearly with physical distance along the sequence, and background shading is used to indicate the position of the trees along the sequence. If “treewise”, each axis tick corresponds to a tree boundary, which are positioned evenly along the axis, so that the X axis is of variable scale and no background scaling is required.
tree_height_scale (str) – Control how height values for nodes are computed. If this is equal to "time", node heights are proportional to their time values (this is the default). If this is equal to "log_time", node heights are proportional to their log(time) values. If it is equal to "rank", node heights are spaced equally according to their ranked times.
node_labels (dict(int, str)) – If specified, show custom labels for the nodes (specified by ID) that are present in this map; any nodes not present will not have a label.
mutation_labels (dict(int, str)) – If specified, show custom labels for the mutations (specified by ID) that are present in the map; any mutations not present will not have a label.
root_svg_attributes (dict) – Additional attributes, such as an id, that will be embedded in the root <svg> tag of the generated drawing.
style (str) – A css string that will be included in the <style> tag of the generated svg.
order (str) – A string specifying the traversal type used to order the tips in each tree, as detailed in Tree.nodes(). If None (default), use the default order as described in that method.
force_root_branch (bool) – If True, plot a branch (edge) above every tree root in the tree sequence. If None (default), then only plot such root branches if any root in the tree sequence has a mutation above it.
- Returns
An SVG representation of a tree sequence.
- Return type
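For example (a minimal sketch, assuming ts is an existing TreeSequence; the output file name is a placeholder):
>>> svg_string = ts.draw_svg(size=(800, 200))          # SVG returned as a string
>>> ts.draw_svg(path="ts.svg", x_scale="treewise")     # also write the drawing to a file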
-
dump(file_or_path)[source]¶
Writes the tree sequence to the specified path or file object.
- Parameters
file_or_path (str) – The file object or path to write the TreeSequence to.
-
dump_tables()[source]¶
A copy of the tables defining this tree sequence.
- Returns
A TableCollection containing all tables underlying the tree sequence.
- Return type
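For example, the edit-and-rebuild workflow described at the top of this class (a minimal sketch, assuming ts is an existing TreeSequence):
>>> tables = ts.dump_tables()          # editable TableCollection copy of the data in ts
>>> # ... modify the tables here using the tables API ...
>>> tables.sort()                      # ensure the edge-ordering requirements are met
>>> new_ts = tables.tree_sequence()    # build a new, immutable TreeSequence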
-
dump_text(nodes=None, edges=None, sites=None, mutations=None, individuals=None, populations=None, provenances=None, precision=6, encoding='utf8', base64_metadata=True)[source]¶
Writes a text representation of the tables underlying the tree sequence to the specified connections.
If Base64 encoding is not used, then metadata will be saved directly, possibly resulting in errors reading the tables back in if metadata includes whitespace.
- Parameters
nodes (io.TextIOBase) – The file-like object (having a .write() method) to write the NodeTable to.
edges (io.TextIOBase) – The file-like object to write the EdgeTable to.
sites (io.TextIOBase) – The file-like object to write the SiteTable to.
mutations (io.TextIOBase) – The file-like object to write the MutationTable to.
individuals (io.TextIOBase) – The file-like object to write the IndividualTable to.
populations (io.TextIOBase) – The file-like object to write the PopulationTable to.
provenances (io.TextIOBase) – The file-like object to write the ProvenanceTable to.
precision (int) – The number of digits of precision.
encoding (str) – Encoding used for text representation.
base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text.
-
edge_diffs(include_terminal=False)[source]¶
Returns an iterator over all the edges that are inserted and removed to build the trees as we move from left-to-right along the tree sequence. The iterator yields a sequence of 3-tuples, (interval, edges_out, edges_in). The interval is a pair (left, right) representing the genomic interval (see Tree.interval). The edges_out value is a list of the edges that were just removed to create the tree covering the interval (hence edges_out will always be empty for the first tree). The edges_in value is a list of edges that were just inserted to construct the tree covering the current interval.
The edges returned within each edges_in list are ordered by ascending time of the parent node, then ascending parent id, then ascending child id. The edges within each edges_out list are in the reverse order (e.g. descending parent time, parent id, then child_id). This means that within each list, edges with the same parent appear consecutively.
- Parameters
include_terminal (bool) – If False (default), the iterator terminates after the final interval in the tree sequence (i.e. it does not report a final removal of all remaining edges), and the number of iterations will be equal to the number of trees in the tree sequence. If True, an additional iteration takes place, with the last edges_out value reporting all the edges contained in the final tree (with both left and right equal to the sequence length).
- Returns
An iterator over the (interval, edges_out, edges_in) tuples.
- Return type
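For example (a minimal sketch, assuming ts is an existing TreeSequence):
>>> for interval, edges_out, edges_in in ts.edge_diffs():
...     print(interval, len(edges_out), "edges removed,", len(edges_in), "edges inserted")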
-
edges()[source]¶
Returns an iterable sequence of all the edges in this tree sequence. Edges are returned in the order required for a valid tree sequence. So, edges are guaranteed to be ordered such that (a) all parents with a given ID are contiguous; (b) edges are returned in non-decreasing order of parent time ago; (c) within the edges for a given parent, edges are sorted first by child ID and then by left coordinate.
- Returns
An iterable sequence of all edges.
- Return type
Sequence(Edge)
-
equals(other, *, ignore_metadata=False, ignore_ts_metadata=False, ignore_provenance=False, ignore_timestamps=False)[source]¶
Returns True if self and other are equal. Uses the underlying table equality; see TableCollection.equals() for details and options.
-
f2(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)[source]¶
Computes Patterson's f2 statistic between two groups of nodes from sample_sets. Operates on k = 2 sample sets at a time; please see the multi-way statistics section for details on how the sample_sets and indexes arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.
What is computed depends on mode. Each works exactly as f4, except the average is across randomly chosen sets of four samples (a1, b1; a2, b2), with a1 and a2 both chosen (without replacement) from the first sample set and b1 and b2 chosen randomly without replacement from the second sample set. See f4 for more details.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
indexes (list) – A list of 2-tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A ndarray with shape equal to (num windows, num statistics).
-
f3(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)[source]¶
Computes Patterson's f3 statistic between three groups of nodes from sample_sets. Operates on k = 3 sample sets at a time; please see the multi-way statistics section for details on how the sample_sets and indexes arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.
What is computed depends on mode. Each works exactly as f4, except the average is across randomly chosen sets of four samples (a1, b; a2, c), with a1 and a2 both chosen (without replacement) from the first sample set. See f4 for more details.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
indexes (list) – A list of 3-tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A ndarray with shape equal to (num windows, num statistics).
-
f4(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True)[source]¶
Computes Patterson's f4 statistic between four groups of nodes from sample_sets. Operates on k = 4 sample sets at a time; please see the multi-way statistics section for details on how the sample_sets and indexes arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.
What is computed depends on mode. Each is an average across randomly chosen sets of four samples (a, b; c, d), one from each sample set:
- "site": The average density of sites at which a and c agree but differ from b and d, minus the average density of sites at which a and d agree but differ from b and c, per unit of chromosome length.
- "branch": The average length of all branches that separate a and c from b and d, minus the average length of all branches that separate a and d from b and c (in units of time).
- "node": For each node, the average proportion of the window on which a and c inherit from that node but b and d do not, or vice-versa, minus the average proportion of the window on which a and d inherit from that node but b and c do not, or vice-versa.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
indexes (list) – A list of 4-tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A ndarray with shape equal to (num windows, num statistics).
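For example (a minimal sketch, assuming ts is an existing TreeSequence with at least eight samples; the four groups are arbitrary):
>>> s = ts.samples()
>>> A, B, C, D = s[0::4], s[1::4], s[2::4], s[3::4]
>>> ts.f4([A, B, C, D])                                # single value for (A, B; C, D)
>>> ts.f2([A, B])                                      # Patterson's f2 for two of the groups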
-
first(**kwargs)[source]¶
Returns the first tree in this TreeSequence. To iterate over all trees in the sequence, use the trees() method.
-
genealogical_nearest_neighbours(focal, sample_sets, num_threads=0)[source]¶
Return the genealogical nearest neighbours (GNN) proportions for the given focal nodes, with reference to two or more sets of interest, averaged over all trees in the tree sequence.
The GNN proportions for a focal node in a single tree are given by first finding the most recent common ancestral node \(a\) between the focal node and any other node present in the reference sets. The GNN proportion for a specific reference set, \(S\) is the number of nodes in \(S\) that descend from \(a\), as a proportion of the total number of descendant nodes in any of the reference sets.
For example, consider a case with 2 sample sets, \(S_1\) and \(S_2\). For a given tree, \(a\) is the node that includes at least one descendant in \(S_1\) or \(S_2\) (not including the focal node). If the descendants of \(a\) include some nodes in \(S_1\) but no nodes in \(S_2\), then the GNN proportions for that tree will be 100% \(S_1\) and 0% \(S_2\), or \([1.0, 0.0]\).
For a given focal node, the GNN proportions returned by this function are an average of the GNNs for each tree, weighted by the genomic distance spanned by that tree.
For a precise mathematical definition of GNN, see https://doi.org/10.1101/458067
Note
The reference sets need not include all the samples, hence the most recent common ancestral node of the reference sets, \(a\), need not be the immediate ancestor of the focal node. If the reference sets only comprise sequences from relatively distant individuals, the GNN statistic may end up as a measure of comparatively distant ancestry, even for tree sequences that contain many closely related individuals.
Warning
The interface for this method is preliminary and may be subject to backwards incompatible changes in the near future. The long-term stable API for this method will be consistent with other Statistics.
- Parameters
- Returns
An \(n\) by \(m\) array of focal nodes by GNN proportions. Every focal node corresponds to a row. The numbers in each row correspond to the GNN proportion for each of the passed-in reference sets. Rows therefore sum to one.
- Return type
-
general_stat(W, f, output_dim, windows=None, polarised=False, mode=None, span_normalise=True, strict=True)[source]¶
Compute a windowed statistic from weights and a summary function. See the statistics interface section for details on windows, mode, span normalise, and return value. On each tree, this propagates the weights W up the tree, so that the "weight" of each node is the sum of the weights of all samples at or below the node. Then the summary function f is applied to the weights, giving a summary for each node in each tree. How this is then aggregated depends on mode:
- "site": Adds together the total summary value across all alleles in each window.
- "branch": Adds together the summary value for each node, multiplied by the length of the branch above the node and the span of the tree.
- "node": Returns each node's summary value added across trees and multiplied by the span of the tree.
Both the weights and the summary can be multidimensional: if W has k columns, and f takes a k-vector and returns an m-vector, then the output will be m-dimensional for each node or window (depending on "mode").
Note
The summary function f should return zero when given both 0 and the total weight (i.e., f(0) = 0 and f(np.sum(W, axis=0)) = 0), unless strict=False. This is necessary for the statistic to be unaffected by parts of the tree sequence ancestral to none or all of the samples, respectively.
- Parameters
W (numpy.ndarray) – An array of values with one row for each sample and one column for each weight.
f – A function that takes a one-dimensional array of length equal to the number of columns of W and returns a one-dimensional array.
output_dim (int) – The length of f's return value.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
polarised (bool) – Whether to leave the ancestral state out of computations: see Statistics for more details.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
strict (bool) – Whether to check that f(0) and f(total weight) are zero.
- Returns
A ndarray with shape equal to (num windows, num statistics).
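For example, nucleotide diversity can be recomputed through this interface (a minimal sketch, assuming ts is an existing TreeSequence with at least two samples; the summary function below is the standard diversity summary, written here for illustration):
>>> import numpy as np
>>> n = ts.num_samples
>>> W = np.ones((n, 1))                                # weight 1 for every sample
>>> def f(x):                                          # x[0] = total weight below a node
...     return np.array([x[0] * (n - x[0]) / (n * (n - 1))])
...
>>> sigma = ts.general_stat(W, f, output_dim=1, mode="site")
>>> np.allclose(sigma, ts.diversity())
True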
genetic_relatedness(sample_sets, indexes=None, windows=None, mode='site', span_normalise=True, proportion=True)[source]¶
Computes genetic relatedness between (and within) pairs of sets of nodes from sample_sets. Operates on k = 2 sample sets at a time; please see the multi-way statistics section for details on how the sample_sets and indexes arguments are interpreted and how they interact with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, polarised, and return value.
What is computed depends on mode:
- "site": Number of pairwise allelic matches in the window between two sample sets relative to the rest of the sample sets. To be precise, let m(u,v) denote the total number of alleles shared between nodes u and v, and let m(I,J) be the sum of m(u,v) over all nodes u in sample set I and v in sample set J. Let S and T be independently chosen sample sets. Then, for sample sets I and J, this computes E[m(I,J) - m(I,S) - m(J,T) + m(S,T)]. This can also be seen as the covariance of a quantitative trait determined by additive contributions from the genomes in each sample set. Let each allele be associated with an effect drawn from a N(0,1/2) distribution, and let the trait value of a sample set be the sum of its allele effects. Then, this computes the covariance between the trait values of two sample sets. For example, to compute covariance between the traits of diploid individuals, each sample set would be the pair of genomes of each individual; if proportion=True, this then corresponds to \(K_{c0}\) in Speed & Balding (2014).
- "branch": Total area of branches in the window ancestral to pairs of samples in two sample sets relative to the rest of the sample sets. To be precise, let B(u,v) denote the total area of all branches ancestral to nodes u and v, and let B(I,J) be the sum of B(u,v) over all nodes u in sample set I and v in sample set J. Let S and T be two independently chosen sample sets. Then for sample sets I and J, this computes E[B(I,J) - B(I,S) - B(J,T) + B(S,T)].
- "node": For each node, the proportion of the window over which pairs of samples in two sample sets are descendants, relative to the rest of the sample sets. To be precise, for each node n, let N(u,v) denote the proportion of the window over which samples u and v are descendants of n, and let N(I,J) be the sum of N(u,v) over all nodes u in sample set I and v in sample set J. Let S and T be two independently chosen sample sets. Then for sample sets I and J, this computes E[N(I,J) - N(I,S) - N(J,T) + N(S,T)].
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
indexes (list) – A list of 2-tuples, or None.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
proportion (bool) – Whether to divide the result by segregating_sites(), called with the same windows and mode (defaults to True). Note that this counts sites that are segregating between any of the samples of any of the sample sets (rather than segregating between all of the samples of the tree sequence).
- Returns
A ndarray with shape equal to (num windows, num statistics).
-
genotype_matrix(*, isolated_as_missing=None, alleles=None, impute_missing_data=None)[source]¶
Returns an \(m \times n\) numpy array of the genotypes in this tree sequence, where \(m\) is the number of sites and \(n\) the number of samples. The genotypes are the indexes into the array of alleles, as described for the Variant class.
If isolated samples are present at a given site without mutations above them, they will be interpreted as missing data and the genotypes array will contain a special value MISSING_DATA (-1) to identify these missing samples.
Such samples are treated as missing data by default, but if isolated_as_missing is set to False, they will not be treated as missing, and so will be assigned the ancestral state. This was the default behaviour in versions prior to 0.2.0. Prior to 0.3.0 the impute_missing_data argument controlled this behaviour.
Warning
This method can consume a very large amount of memory! If all genotypes are not needed at once, it is usually better to access them sequentially using the variants() iterator.
- Parameters
isolated_as_missing (bool) – If True, the allele assigned to missing samples (i.e., isolated samples without mutations) is the missing_data_character. If False, missing samples will be assigned the ancestral state. Default: True.
alleles (tuple) – A tuple of strings describing the encoding of alleles to genotype values. At least one allele must be provided. If duplicate alleles are provided, output genotypes will always be encoded as the first occurrence of the allele. If None (the default), the alleles are encoded as they are encountered during genotype generation.
impute_missing_data (bool) – Deprecated in 0.3.0. Use isolated_as_missing, but inverting the value. Will be removed in a future version.
- Returns
The full matrix of genotypes.
- Return type
numpy.ndarray (dtype=np.int8)
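For example (a minimal sketch, assuming ts is an existing TreeSequence small enough for the full matrix to fit in memory):
>>> G = ts.genotype_matrix()
>>> G.shape == (ts.num_sites, ts.num_samples)
True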
-
haplotypes(*, isolated_as_missing=None, missing_data_character='-', impute_missing_data=None)[source]¶
Returns an iterator over the strings of haplotypes that result from the trees and mutations in this tree sequence. Each haplotype string is guaranteed to be of the same length. A tree sequence with \(n\) samples and \(s\) sites will return a total of \(n\) strings of \(s\) alleles concatenated together, where an allele consists of a single ascii character (tree sequences that include alleles which are not a single character in length, or where the character is non-ascii, will raise an error). The first string returned is the haplotype for sample 0, and so on.
The alleles at each site must be represented by single byte characters (i.e. variants must be single nucleotide polymorphisms, or SNPs), hence the strings returned will all be of length \(s\), and for a haplotype h, the value of h[j] will be the observed allelic state at site j.
If isolated_as_missing is True (the default), isolated samples without mutations directly above them will be treated as missing data and will be represented in the string by the missing_data_character. If instead it is set to False, missing data will be assigned the ancestral state (unless they have mutations directly above them, in which case they will take the most recent derived mutational state for that node). This was the default behaviour in versions prior to 0.2.0. Prior to 0.3.0 the impute_missing_data argument controlled this behaviour.
See also the variants() iterator for site-centric access to sample genotypes.
Warning
For large datasets, this method can consume a very large amount of memory! To output all the sample data, it is more efficient to iterate over sites rather than over samples. If you have a large dataset but only want to output the haplotypes for a subset of samples, it may be worth calling simplify() to reduce the tree sequence down to the required samples before outputting haplotypes.
- Returns
An iterator over the haplotype strings for the samples in this tree sequence.
- Parameters
isolated_as_missing (bool) – If True, the allele assigned to missing samples (i.e., isolated samples without mutations) is the missing_data_character. If False, missing samples will be assigned the ancestral state. Default: True.
missing_data_character (str) – A single ascii character that will be used to represent missing data. If any normal allele contains this character, an error is raised. Default: '-'.
impute_missing_data (bool) – Deprecated in 0.3.0. Use isolated_as_missing, but inverting the value. Will be removed in a future version.
- Return type
- Raises
TypeError if the missing_data_character or any of the alleles at a site are not a single ascii character.
- Raises
ValueError if the missing_data_character exists in one of the alleles.
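For example (a minimal sketch, assuming ts is an existing TreeSequence with single-character alleles):
>>> for node_id, h in zip(ts.samples(), ts.haplotypes()):
...     print(node_id, h)              # one string of ts.num_sites characters per sample node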
-
individual(id_)[source]¶
Returns the individual in this tree sequence with the specified ID.
- Return type
-
individuals()[source]¶
Returns an iterable sequence of all the individuals in this tree sequence.
- Returns
An iterable sequence of all individuals.
- Return type
Sequence(Individual)
-
kc_distance(other, lambda_=0.0)[source]¶
Returns the average Tree.kc_distance() between pairs of trees along the sequence whose intervals overlap. The average is weighted by the fraction of the sequence on which each pair of trees overlap.
- Parameters
other (TreeSequence) – The other tree sequence to compare to.
lambda_ (float) – The KC metric lambda parameter determining the relative weight of topology and branch length.
- Returns
The computed KC distance between this tree sequence and other.
- Return type
-
keep_intervals(intervals, simplify=True, record_provenance=True)[source]¶
Returns a copy of this tree sequence which includes only information in the specified list of genomic intervals. Edges are truncated to lie within these intervals, and sites and mutations falling outside these intervals are discarded. Note that it is the information outside the intervals that is deleted, not the intervals themselves, so in particular, all samples will be isolated outside of the retained intervals.
Note that node IDs may change as a result of this operation, as by default simplify() is called on the returned tree sequence to remove redundant nodes. If you wish to map node IDs onto the same nodes before and after this method has been called, specify simplify=False.
See also delete_intervals(), ltrim(), rtrim(), and missing data.
- Parameters
intervals (array_like) – A list of (start, end) pairs describing the genomic intervals to keep. Intervals must be non-overlapping and in increasing order. The list of intervals must be interpretable as a 2D numpy array with shape (N, 2), where N is the number of intervals.
simplify (bool) – If True, return a simplified tree sequence where nodes no longer used are discarded. (Default: True).
record_provenance (bool) – If True, add details of this operation to the provenance information of the returned tree sequence. (Default: True).
- Return type
-
last(**kwargs)[source]¶
Returns the last tree in this TreeSequence. To iterate over all trees in the sequence, use the trees() method.
-
ltrim(record_provenance=True)[source]¶
Returns a copy of this tree sequence with a potentially changed coordinate system, such that empty regions (i.e. those not covered by any edge) at the start of the tree sequence are trimmed away, and the leftmost edge starts at position 0. This affects the reported position of sites and edges. Additionally, sites and their associated mutations to the left of the new zero point are thrown away.
- Parameters
record_provenance (bool) – If True, add details of this operation to the provenance information of the returned tree sequence. (Default: True).
-
property max_root_time¶
Returns the time of the oldest root in any of the trees in this tree sequence. This is usually equal to np.max(ts.tables.nodes.time) but may not be, since there can be nodes that are not present in any tree. Consistent with the definition of tree roots, if there are no edges in the tree sequence we return the time of the oldest sample.
- Returns
The maximum time of a root in this tree sequence.
- Return type
-
mean_descendants
(sample_sets)[source]¶ Computes for every node the mean number of samples in each of the sample_sets that descend from that node, averaged over the portions of the genome for which the node is ancestral to any sample. The output is an array, C[node, j], which reports the total span of all genomes in sample_sets[j] that inherit from node, divided by the total span of the genome on which node is an ancestor to any sample in the tree sequence.
Warning
The interface for this method is preliminary and may be subject to backwards incompatible changes in the near future. The long-term stable API for this method will be consistent with other Statistics. In particular, the normalization by proportion of the genome that node is an ancestor to anyone may not be the default behaviour in the future.
- Parameters
sample_sets (list) – A list of lists of node IDs.
- Returns
An array with dimensions (number of nodes in the tree sequence, number of reference sets)
-
property
metadata
¶ The decoded metadata for this TreeSequence.
-
property
metadata_schema
¶ The
tskit.MetadataSchema
for this TreeSequence.
-
migration
(id_)[source]¶ Returns the migration in this tree sequence with the specified ID.
- Return type
-
migrations
()[source]¶ Returns an iterable sequence of all the migrations in this tree sequence.
Migrations are returned in nondecreasing order of the
time
value.- Returns
An iterable sequence of all migrations.
- Return type
Sequence(
Migration
)
-
mutation
(id_)[source]¶ Returns the mutation in this tree sequence with the specified ID.
- Return type
-
mutations
()[source]¶ Returns an iterator over all the mutations in this tree sequence. Mutations are returned in order of nondecreasing site ID. See the
Mutation
class for details on the available fields for each mutation.The returned iterator is equivalent to iterating over all sites and all mutations in each site, i.e.:
>>> for site in tree_sequence.sites():
>>>     for mutation in site.mutations:
>>>         yield mutation
- Returns
An iterator over all mutations in this tree sequence.
- Return type
iter(
Mutation
)
-
property
nbytes
¶ Returns the total number of bytes required to store the data in this tree sequence. Note that this may not be equal to the actual memory footprint.
-
nodes
()[source]¶ Returns an iterable sequence of all the nodes in this tree sequence.
- Returns
An iterable sequence of all nodes.
- Return type
Sequence(
Node
)
-
property
num_edges
¶ Returns the number of edges in this tree sequence.
- Returns
The number of edges in this tree sequence.
- Return type
-
property
num_individuals
¶ Returns the number of individuals in this tree sequence.
- Returns
The number of individuals in this tree sequence.
- Return type
-
property
num_migrations
¶ Returns the number of migrations in this tree sequence.
- Returns
The number of migrations in this tree sequence.
- Return type
-
property
num_mutations
¶ Returns the number of mutations in this tree sequence.
- Returns
The number of mutations in this tree sequence.
- Return type
-
property
num_nodes
¶ Returns the number of nodes in this tree sequence.
- Returns
The number of nodes in this tree sequence.
- Return type
-
property
num_populations
¶ Returns the number of populations in this tree sequence.
- Returns
The number of populations in this tree sequence.
- Return type
-
property
num_provenances
¶ Returns the number of provenances in this tree sequence.
- Returns
The number of provenances in this tree sequence.
- Return type
-
property
num_samples
¶ Returns the number of samples in this tree sequence. This is the number of sample nodes in each tree.
- Returns
The number of sample nodes in this tree sequence.
- Return type
-
property
num_sites
¶ Returns the number of sites in this tree sequence.
- Returns
The number of sites in this tree sequence.
- Return type
-
property
num_trees
¶ Returns the number of distinct trees in this tree sequence. This is equal to the number of trees returned by the
trees()
method.- Returns
The number of trees in this tree sequence.
- Return type
-
pairwise_diversity
(samples=None)[source]¶ Returns the pairwise nucleotide site diversity, the average number of sites that differ between a randomly chosen pair of samples. If samples is specified, calculate the diversity within this set.
Deprecated since version 0.2.0: please use
diversity()
instead. Since version 0.2.0 the error semantics have also changed slightly: it is no longer an error when there is one sample, a tskit.LibraryError (rather than a ValueError) is raised when non-sample IDs are provided, and it is no longer an error to compute pairwise diversity at sites with multiple mutations.
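For example, a minimal sketch of the recommended replacement, assuming an existing tree sequence ts:

# Whole-genome diversity for all samples (replaces pairwise_diversity).
pi = ts.diversity()
# Diversity for an explicit sample set, computed in two equal windows.
L = ts.sequence_length
pi_windows = ts.diversity([ts.samples()], windows=[0, L / 2, L])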
-
population
(id_)[source]¶ Returns the population in this tree sequence with the specified ID.
- Return type
-
populations
()[source]¶ Returns an iterable sequence of all the populations in this tree sequence.
- Returns
An iterable sequence of all populations.
- Return type
Sequence(
Population
)
-
provenances
()[source]¶ Returns an iterable sequence of all the provenances in this tree sequence.
- Returns
An iterable sequence of all provenances.
- Return type
Sequence(
Provenance
)
-
rtrim
(record_provenance=True)[source]¶ Returns a copy of this tree sequence with the
sequence_length
property reset so that the sequence ends at the end of the rightmost edge. Additionally, sites and their associated mutations at positions greater than the newsequence_length
are thrown away.- Parameters
record_provenance (bool) – If True, add details of this operation to the provenance information of the returned tree sequence. (Default: True).
-
sample_count_stat
(sample_sets, f, output_dim, windows=None, polarised=False, mode=None, span_normalise=True, strict=True)[source]¶ Compute a windowed statistic from sample counts and a summary function. This is a wrapper around
general_stat()
for the common case in which the weights are all either 1 or 0, i.e., functions of the joint allele frequency spectrum. See the statistics interface section for details on sample sets, windows, mode, span normalise, and return value. Ifsample_sets
is a list ofk
sets of samples, thenf
should be a function that takes an argument of lengthk
and returns a one-dimensional array. Thej
-th element of the argument tof
will be the number of samples insample_sets[j]
that lie below the node thatf
is being evaluated for. Seegeneral_stat()
for more details.Here is a contrived example: suppose that
A
andB
are two sets of samples withnA
andnB
elements, respectively. Passing these as sample sets will givef
an argument of length two, giving the number of samples inA
andB
below the node in question. So, if we define

def f(x):
    pA = x[0] / nA
    pB = x[1] / nB
    return np.array([pA * pB])
then if all sites are biallelic,
ts.sample_count_stat([A, B], f, windows="site", polarised=False, mode="site")
would compute, for each site, the product of the derived allele frequencies in the two sample sets, in a (num sites, 1) array. If instead
f
returnsnp.array([pA, pB, pA * pB])
, then the output would be a (num sites, 3) array, with the first two columns giving the allele frequencies inA
andB
, respectively.Note
The summary function
f
should return zero when given both 0 and the sample size (i.e.,f(0) = 0
andf(np.array([len(x) for x in sample_sets])) = 0
). This is necessary for the statistic to be unaffected by parts of the tree sequence ancestral to none or all of the samples, respectively.- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
f – A function that takes a one-dimensional array of length equal to the number of sample sets and returns a one-dimensional array.
output_dim (int) – The length of
f
’s return value.windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
polarised (bool) – Whether to leave the ancestral state out of computations: see Statistics for more details.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
strict (bool) – Whether to check that f(0) and f(total weight) are zero.
- Returns
A ndarray with shape equal to (num windows, num statistics).
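Putting the pieces above together, a hypothetical end-to-end sketch (the split of the samples into A and B is arbitrary, and strict=False is passed because this particular f does not return zero at the full sample counts):

import numpy as np

samples = ts.samples()
A, B = samples[: len(samples) // 2], samples[len(samples) // 2 :]
nA, nB = len(A), len(B)

def f(x):
    # x[0] and x[1] are the numbers of samples from A and B below a node.
    return np.array([(x[0] / nA) * (x[1] / nB)])

result = ts.sample_count_stat([A, B], f, 1, mode="site", strict=False)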
-
samples
(population=None, population_id=None)[source]¶ Returns an array of the sample node IDs in this tree sequence. If the
population
parameter is specified, only return sample IDs from this population.- Parameters
- Returns
A numpy array of the node IDs for the samples of interest.
- Return type
numpy.ndarray (dtype=np.int32)
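For example (population 0 is assumed to exist in the tree sequence):

all_samples = ts.samples()               # every sample node ID
pop0_samples = ts.samples(population=0)  # only samples drawn from population 0
print(len(all_samples), len(pop0_samples))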
-
segregating_sites
(sample_sets=None, windows=None, mode='site', span_normalise=True)[source]¶ Computes the density of segregating sites for each of the sets of nodes from
sample_sets
, and related quantities. Please see the one-way statistics section for details on how thesample_sets
argument is interpreted and how it interacts with the dimensions of the output array. See the statistics interface section for details on windows, mode, span normalise, and return value.What is computed depends on
mode
. For a sample setA
, computes:- “site”
The sum over sites of the number of alleles found in
A
at each site minus one, per unit of chromosome length. If all sites have at most two alleles, this is the density of sites that are polymorphic inA
. To get the number of segregating minor alleles per window, passspan_normalise=False
.- “branch”
The total length of all branches in the tree subtended by the samples in
A
, averaged across the window.- “node”
The proportion of the window on which the node is ancestral to some, but not all, of the samples in
A
.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A ndarray with shape equal to (num windows, num statistics).
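A minimal sketch, assuming an existing tree sequence ts:

# Total number of segregating sites (a count, not a density) over the whole genome.
S = ts.segregating_sites(span_normalise=False)
# Per-window branch-mode statistic for a single explicit sample set.
S_branch = ts.segregating_sites(
    [ts.samples()], windows=[0, ts.sequence_length], mode="branch"
)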
-
property
sequence_length
¶ Returns the sequence length in this tree sequence. This defines the genomic scale over which tree coordinates are defined. Given a tree sequence with a sequence length \(L\), the constituent trees will be defined over the half-closed interval \([0, L)\). Each tree then covers some subset of this interval — see
tskit.Tree.interval
for details.- Returns
The length of the sequence in this tree sequence in bases.
- Return type
-
simplify
(samples=None, *, map_nodes=False, reduce_to_site_topology=False, filter_populations=True, filter_individuals=True, filter_sites=True, keep_unary=False, keep_input_roots=False, record_provenance=True, filter_zero_mutation_sites=None)[source]¶ Returns a simplified tree sequence that retains only the history of the nodes given in the list
samples
. Ifmap_nodes
is true, also return a numpy array whose u-th element is the ID of the node in the simplified tree sequence that corresponds to node u
in the original tree sequence, ortskit.NULL
(-1) ifu
is no longer present in the simplified tree sequence.In the returned tree sequence, the node with ID
0
corresponds tosamples[0]
, node1
corresponds tosamples[1]
, and so on. Besides the samples, node IDs in the returned tree sequence are then allocated sequentially in time order.If you wish to simplify a set of tables that do not satisfy all requirements for building a TreeSequence, then use
TableCollection.simplify()
.If the
reduce_to_site_topology
parameter is True, the returned tree sequence will contain only topological information that is necessary to represent the trees that contain sites. If there are zero sites in this tree sequence, this will result in an output tree sequence with zero edges. When the number of sites is greater than zero, every tree in the output tree sequence will contain at least one site. For a given site, the topology of the tree containing that site will be identical (up to node ID remapping) to the topology of the corresponding tree in the input tree sequence.If
filter_populations
,filter_individuals
orfilter_sites
is True, any of the corresponding objects that are not referenced elsewhere are filtered out. As this is the default behaviour, it is important to realise that IDs for these objects may change through simplification. By setting these parameters to False, however, the corresponding tables can be preserved without changes.
samples (list) – The list of nodes for which to retain information. This may be a numpy array (or array-like) object (dtype=np.int32).
map_nodes (bool) – If True, return a tuple containing the resulting tree sequence and a numpy array mapping node IDs in the current tree sequence to their corresponding node IDs in the returned tree sequence. If False (the default), return only the tree sequence object itself.
reduce_to_site_topology (bool) – Whether to reduce the topology down to the trees that are present at sites. (Default: False)
filter_populations (bool) – If True, remove any populations that are not referenced by nodes after simplification; new population IDs are allocated sequentially from zero. If False, the population table will not be altered in any way. (Default: True)
filter_individuals (bool) – If True, remove any individuals that are not referenced by nodes after simplification; new individual IDs are allocated sequentially from zero. If False, the individual table will not be altered in any way. (Default: True)
filter_sites (bool) – If True, remove any sites that are not referenced by mutations after simplification; new site IDs are allocated sequentially from zero. If False, the site table will not be altered in any way. (Default: True)
keep_unary (bool) – If True, any unary nodes (i.e. nodes with exactly one child) that exist on the path from samples to root will be preserved in the output. (Default: False)
keep_input_roots (bool) – If True, insert edges from the MRCAs of the samples to the roots in the input trees. If False, no topology older than the MRCAs of the samples will be included. (Default: False)
record_provenance (bool) – If True, record details of this call to simplify in the returned tree sequence’s provenance information (Default: True).
filter_zero_mutation_sites (bool) – Deprecated alias for
filter_sites
.
- Returns
The simplified tree sequence, or (if
map_nodes
is True) a tuple consisting of the simplified tree sequence and a numpy array mapping source node IDs to their corresponding IDs in the new tree sequence.- Return type
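For example, a sketch that simplifies down to the first three sample nodes and keeps the node ID mapping (the choice of samples is purely illustrative):

keep = ts.samples()[:3]
small_ts, node_map = ts.simplify(keep, map_nodes=True)
# node_map[u] is the new ID of node u, or tskit.NULL if u was removed.
print(node_map[keep])  # the retained samples become nodes 0, 1 and 2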
-
sites
()[source]¶ Returns an iterable sequence of all the sites in this tree sequence. Sites are returned in order of increasing ID (and also position). See the
Site
class for details on the available fields for each site.- Returns
An iterable sequence of all sites.
- Return type
Sequence(
Site
)
-
subset
(nodes, record_provenance=True)[source]¶ Returns a tree sequence modified to contain only the entries referring to the provided list of nodes, with nodes reordered according to the order they appear in the
nodes
argument. Specifically, this subsets and reorders each of the tables as follows:Nodes: if in the list of nodes, and in the order provided.
Individuals and Populations: if referred to by a retained node, and in the order first seen when traversing the list of retained nodes.
Edges: if both parent and child are retained nodes.
Mutations: if the mutation’s node is a retained node.
Sites: if any mutations remain at the site after removing mutations.
Retained edges, mutations, and sites appear in the same order as in the original tables.
If
nodes
is the entire list of nodes in the tables, then the resulting tables will be identical to the original tables, but with nodes (and individuals and populations) reordered.To instead subset the tables to a given portion of the genome, see
keep_intervals()
.Note: This is quite different from
simplify()
: the resulting tables contain only the nodes given, not ancestral ones as well, and does not simplify the relationships in any way.- Parameters
- Return type
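For instance, a minimal sketch that retains every node but reorders the node table so that the sample nodes come first:

import numpy as np

samples = ts.samples()
others = np.setdiff1d(np.arange(ts.num_nodes), samples)
reordered = ts.subset(np.concatenate([samples, others]))
assert reordered.num_nodes == ts.num_nodes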
-
property
table_metadata_schemas
¶ The set of metadata schemas for the tables in this tree sequence.
-
property
tables
¶ A copy of the tables underlying this tree sequence. See also
dump_tables()
.Warning
This property currently returns a copy of the tables underlying a tree sequence but it may return a read-only view in the future. Thus, if the tables will subsequently be updated, please use the
dump_tables()
method instead as this will always return a new copy of the TableCollection.- Returns
A
TableCollection
containing a copy of the tables underlying this tree sequence.- Return type
-
property
tables_dict
¶ Returns a dictionary mapping names to tables in the underlying
TableCollection
. Equivalent to callingts.tables.name_map
.
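For example, a quick way to inspect the size of each underlying table:

for name, table in ts.tables_dict.items():
    print(name, table.num_rows)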
-
to_macs
()[source]¶ Return a macs encoding of this tree sequence.
- Returns
The macs representation of this TreeSequence as a string.
- Return type
-
to_nexus
(precision=14)[source]¶ Returns a nexus encoding of this tree sequence. Trees along the sequence are listed sequentially in the TREES block. The tree spanning the interval \([x, y)\) is given the name “tree_x_y”. Spatial positions are written at the specified precision.
Nodes in the tree sequence are identified by the taxon labels of the form
f"tsk_{node.id}_{node.flags}"
, such that a node withid=5
andflags=1
will have the label"tsk_5_1"
(please see the data model section for details on the interpretation of node ID and flags values). These labels are listed for all nodes in the tree sequence in theTAXLABELS
block.
-
trait_correlation
(W, windows=None, mode='site', span_normalise=True)[source]¶ Computes the mean squared correlations between each of the columns of
W
(the “phenotypes”) and inheritance along the tree sequence. See the statistics interface section for details on windows, mode, span normalise, and return value. Operates on all samples in the tree sequence.This is computed as squared covariance in
trait_covariance
, but divided by \(p (1-p)\), where p is the proportion of samples inheriting from the allele, branch, or node in question.What is computed depends on
mode
:- “site”
The sum of squared correlations between presence/absence of each allele and phenotypes, divided by length of the window (if
span_normalise=True
). This is computed as thetrait_covariance
divided by the variance of the relevant column of W and by \(p (1 - p)\), where \(p\) is the allele frequency.- “branch”
The sum of squared correlations between the split induced by each branch and phenotypes, multiplied by branch length, averaged across trees in the window. This is computed as the
trait_covariance
, divided by the variance of the column of W and by \(p * (1 - p)\), where \(p\) is the proportion of the samples lying below the branch.- “node”
For each node, the squared correlation between the property of inheriting from this node and phenotypes, computed as in “branch”.
Note that above we divide by the sample variance, which for a vector x of length n is
np.var(x) * n / (n-1)
.- Parameters
W (numpy.ndarray) – An array of values with one row for each sample and one column for each “phenotype”. Each column must have positive standard deviation.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A ndarray with shape equal to (num windows, num statistics).
-
trait_covariance
(W, windows=None, mode='site', span_normalise=True)[source]¶ Computes the mean squared covariances between each of the columns of
W
(the “phenotypes”) and inheritance along the tree sequence. See the statistics interface section for details on windows, mode, span normalise, and return value. Operates on all samples in the tree sequence.Concretely, if g is a binary vector that indicates inheritance from an allele, branch, or node and w is a column of W, normalised to have mean zero, then the covariance of g and w is \(\sum_i g_i w_i\), the sum of the weights corresponding to entries of g that are 1. Since weights sum to zero, this is also equal to minus the sum of the weights whose entries of g are 0. So, \(cov(g,w)^2 = ((\sum_i g_i w_i)^2 + (\sum_i (1-g_i) w_i)^2)/2\).
What is computed depends on
mode
:- “site”
The sum of squared covariances between presence/absence of each allele and phenotypes, divided by length of the window (if
span_normalise=True
). This is computed as sum_a (sum(w[a])^2 / 2), where w is a column of W with the average subtracted off, and w[a] is the sum of all entries of w corresponding to samples carrying allele “a”, and the first sum is over all alleles.- “branch”
The sum of squared covariances between the split induced by each branch and phenotypes, multiplied by branch length, averaged across trees in the window. This is computed as above: a branch with total weight w[b] below b contributes (branch length) * w[b]^2 to the total value for a tree. (Since the sum of w is zero, the total weight below b and not below b are equal, canceling the factor of 2 above.)
- “node”
For each node, the squared covariance between the property of inheriting from this node and phenotypes, computed as in “branch”.
- Parameters
W (numpy.ndarray) – An array of values with one row for each sample and one column for each “phenotype”.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A ndarray with shape equal to (num windows, num statistics).
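A minimal sketch, assuming an existing tree sequence ts and using randomly generated phenotypes purely for illustration:

import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(ts.num_samples, 1))  # one "phenotype" column per sample
cov_site = ts.trait_covariance(W, mode="site")
cov_branch = ts.trait_covariance(W, mode="branch")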
-
trait_linear_model
(W, Z=None, windows=None, mode='site', span_normalise=True)[source]¶ Finds the relationship between trait and genotype after accounting for covariates. Concretely, for each trait w (i.e., each column of W), this does a least-squares fit of the linear model \(w \sim g + Z\), where \(g\) is inheritance in the tree sequence (e.g., genotype) and the columns of \(Z\) are covariates, and returns the squared coefficient of \(g\) in this linear model. See the statistics interface section for details on windows, mode, span normalise, and return value. Operates on all samples in the tree sequence.
To do this, if g is a binary vector that indicates inheritance from an allele, branch, or node and w is a column of W, there are \(k\) columns of \(Z\), and the \((k+2)\)-vector \(b\) minimises \(\sum_i (w_i - b_0 - b_1 g_i - b_2 z_{1,i} - \cdots - b_{k+1} z_{k,i})^2\), then this returns the number \(b_1^2\). If \(g\) lies in the linear span of the columns of \(Z\), then \(b_1\) is set to 0. To fit the linear model without covariates (only the intercept), set Z = None.
What is computed depends on
mode
:- “site”
Computes the sum of \(b_1^2/2\) for each allele in the window, as above with \(g\) indicating presence/absence of the allele, then divided by the length of the window if
span_normalise=True
. (For biallelic loci, this number is the same for both alleles, and so summing over each cancels the factor of two.)- “branch”
The squared coefficient \(b_1^2\), computed for the split induced by each branch (i.e., with \(g\) indicating inheritance from that branch), multiplied by branch length and tree span, summed over all trees in the window, and divided by the length of the window if
span_normalise=True
.- “node”
For each node, the squared coefficient \(b_1^2\), computed for the property of inheriting from this node, as in “branch”.
- Parameters
W (numpy.ndarray) – An array of values with one row for each sample and one column for each “phenotype”.
Z (numpy.ndarray) – An array of values with one row for each sample and one column for each “covariate”, or None. Columns of Z must be linearly independent.
windows (list) – An increasing list of breakpoints between the windows to compute the statistic in.
mode (str) – A string giving the “type” of the statistic to be computed (defaults to “site”).
span_normalise (bool) – Whether to divide the result by the span of the window (defaults to True).
- Returns
A ndarray with shape equal to (num windows, num statistics).
-
trait_regression
(*args, **kwargs)[source]¶ Deprecated synonym for
trait_linear_model
.
-
trees
(tracked_samples=None, *, sample_lists=False, root_threshold=1, sample_counts=None, tracked_leaves=None, leaf_counts=None, leaf_lists=None)[source]¶ Returns an iterator over the trees in this tree sequence. Each value returned in this iterator is an instance of
Tree
. Upon successful termination of the iterator, the tree will be in the “cleared” null state.The
sample_lists
andtracked_samples
parameters are passed to theTree
constructor, and control the options that are set in the returned tree instance.- Warning
Do not store the results of this iterator in a list! For performance reasons, the same underlying object is used for every tree returned which will most likely lead to unexpected behaviour. If you wish to obtain a list of trees in a tree sequence please use
ts.aslist()
instead.- Parameters
tracked_samples (list) – The list of samples to be tracked and counted using the
Tree.num_tracked_samples()
method.sample_lists (bool) – If True, provide more efficient access to the samples beneath a given node using the
Tree.samples()
method.root_threshold (int) – The minimum number of samples that a node must be ancestral to for it to be in the list of roots. By default this is 1, so that isolated samples (representing missing data) are roots. To efficiently restrict the roots of the tree to those subtending meaningful topology, set this to 2. This value is only relevant when trees have multiple roots.
sample_counts (bool) – Deprecated since 0.2.4.
- Returns
An iterator over the Trees in this tree sequence.
- Return type
collections.abc.Iterable,
Tree
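For example, the recommended pattern is to process each tree inside the loop rather than keeping references to the Tree objects:

for tree in ts.trees():
    print(tree.index, tree.interval, tree.num_roots)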
-
trim
(record_provenance=True)[source]¶ Returns a copy of this tree sequence with any empty regions (i.e. those not covered by any edge) on the right and left trimmed away. This may reset both the coordinate system and the
sequence_length
property. It is functionally equivalent tortrim()
followed byltrim()
. Sites and their associated mutations in the empty regions are thrown away.- Parameters
record_provenance (bool) – If True, add details of this operation to the provenance information of the returned tree sequence. (Default: True).
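A small sketch showing the relationship between the trimming methods (assuming ts has empty flanking regions):

trimmed = ts.trim()
# Functionally equivalent to rtrim() followed by ltrim():
also_trimmed = ts.rtrim().ltrim()
assert trimmed.sequence_length == also_trimmed.sequence_length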
-
union
(other, node_mapping, check_shared_equality=True, add_populations=True, record_provenance=True)[source]¶ Returns an expanded tree sequence which contains the node-wise union of
self
andother
, obtained by adding the non-shared portions ofother
ontoself
. The “shared” portions are specified using a map that specifies which nodes inother
are equivalent to those inself
: thenode_mapping
argument should be an array of length equal to the number of nodes inother
and whose entries are the ID of the matching node inself
, ortskit.NULL
if there is no matching node. Those nodes inother
that map totskit.NULL
will be added toself
, along with:Individuals whose nodes are new to
self
.Edges whose parent or child are new to
self
.Mutations whose nodes are new to
self
.Sites which were not present in
self
, if the site contains a newly added mutation.
By default, populations of newly added nodes are assumed to be new populations, and added to the population table as well.
Note that this operation also sorts the resulting tables, so the resulting tree sequence may not be equal to
self
even if nothing new was added (although it would differ only in ordering of the tables).- Parameters
other (TableCollection) – Another table collection.
node_mapping (list) – An array of node IDs that relate nodes in
other
to nodes inself
.check_shared_equality (bool) – If True, the shared portions of the tree sequences will be checked for equality. It does so by subsetting both
self
andother
on the equivalent nodes specified innode_mapping
, and then checking for equality of the subsets.add_populations (bool) – If True, nodes new to
self
will be assigned new population IDs.record_provenance (bool) – Whether to record a provenance entry in the provenance table for this operation.
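As an illustrative sketch, adding a completely disjoint tree sequence other_ts onto ts (no shared nodes, so every entry of the node mapping is tskit.NULL and the shared-equality check is skipped):

import numpy as np
import tskit

node_mapping = np.full(other_ts.num_nodes, tskit.NULL, dtype=np.int32)
combined = ts.union(other_ts, node_mapping, check_shared_equality=False)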
-
variants
(*, as_bytes=False, samples=None, isolated_as_missing=None, alleles=None, impute_missing_data=None)[source]¶ Returns an iterator over the variants in this tree sequence. See the
Variant
class for details on the fields of each returned object. Thegenotypes
for the variants are numpy arrays, corresponding to indexes into thealleles
attribute in theVariant
object. By default, thealleles
for each site are generated automatically, such that the ancestral state is at the zeroth index and subsequent alleles are listed in no particular order. This means that the encoding of alleles in terms of genotype values can vary from site-to-site, which is sometimes inconvenient. It is possible to specify a fixed mapping from allele strings to genotype values using thealleles
parameter. For example, if we setalleles=("A", "C", "G", "T")
, this will map allele “A” to 0, “C” to 1 and so on (theALLELES_ACGT
constant provides a shortcut for this common mapping).By default, genotypes are generated for all samples. The
samples
parameter allows us to specify the nodes for which genotypes are generated; output order of genotypes in the returned variants corresponds to the order of the samples in this list. It is also possible to provide non-sample nodes as an argument here, if you wish to generate genotypes for (e.g.) internal nodes. However,isolated_as_missing
must be False in this case, as it is not possible to detect missing data for non-sample nodes.If isolated samples are present at a given site without mutations above them, they will be interpreted as missing data: the genotypes array will contain a special value
MISSING_DATA
(-1) to identify these missing samples, and thealleles
tuple will end with the valueNone
(note that this is true whether we specify a fixed mapping using thealleles
parameter or not). See theVariant
class for more details on how missing data is reported.Such samples are treated as missing data by default, but if
isolated_as_missing
is set to False, they will not be treated as missing, and so assigned the ancestral state. This was the default behaviour in versions prior to 0.2.0. Prior to 0.3.0 the impute_missing_data argument controlled this behaviour.Note
The
as_bytes
parameter is kept as a compatibility option for older code. It is not the recommended way of accessing variant data, and will be deprecated in a later release.- Parameters
as_bytes (bool) – If True, the genotype values will be returned as a Python bytes object. Legacy use only.
samples (array_like) – An array of sample IDs for which to generate genotypes, or None for all samples. Default: None.
isolated_as_missing (bool) – If True, the allele assigned to missing samples (i.e., isolated samples without mutations) is the
missing_data_character
. If False, missing samples will be assigned the ancestral state. Default: True.alleles (tuple) – A tuple of strings defining the encoding of alleles as integer genotype values. At least one allele must be provided. If duplicate alleles are provided, output genotypes will always be encoded as the first occurance of the allele. If None (the default), the alleles are encoded as they are encountered during genotype generation.
impute_missing_data (bool) – Deprecated in 0.3.0. Use isolated_as_missing, but inverting the value. Will be removed in a future version.
- Returns
An iterator over all variants in this tree sequence.
- Return type
iter(
Variant
)
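For example, a minimal sketch printing the position, alleles and genotypes of the first few variants:

for variant in ts.variants():
    print(variant.site.position, variant.alleles, variant.genotypes)
    if variant.site.id >= 4:
        break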
-
write_vcf
(output, ploidy=None, contig_id='1', individuals=None, individual_names=None, position_transform=None)[source]¶ Writes a VCF formatted file to the specified file-like object. If there is individual information present in the tree sequence (see Individual Table), the values for sample nodes associated with these individuals are combined into phased multiploid individuals and output.
If there is no individual data present in the tree sequence, synthetic individuals are created by combining adjacent samples, and the number of samples combined is equal to the specified ploidy value (1 by default). For example, if we have a ploidy of 2 and a sample of size 6, then we will have 3 diploid samples in the output, consisting of the combined genotypes for samples [0, 1], [2, 3] and [4, 5]. If we had genotypes 011110 at a particular variant, then we would output the diploid genotypes 0|1, 1|1 and 1|0 in VCF.
Each individual in the output is identified by a string; these are the VCF “sample” names. By default, these are of the form
tsk_0
,tsk_1
etc, up to the number of individuals, but can be manually specified using theindividual_names
argument. We do not check for duplicates in this array, or perform any checks to ensure that the output VCF is well-formed.Note
Warning to
plink
users:As the default first individual name is
tsk_0
,plink
will throw this error when loading the VCF:Error: Sample ID ends with "_0", which induces an invalid IID of '0'.
This can be fixed by using the
individual_names
argument to set the names to anything where the first name doesn’t end with_0
. An example implementation for diploid individuals is:n_dip_indv = int(ts.num_samples / 2) indv_names = [f"tsk_{str(i)}indv" for i in range(n_dip_indv)] with open("output.vcf", "w") as vcf_file: ts.write_vcf(vcf_file, ploidy=2, individual_names=indv_names)
Adding a second
_
(eg:tsk_0_indv
) is not recommended asplink
uses_
as the default separator for separating family id and individual id, and two_
will throw an error.The REF value in the output VCF is the ancestral allele for a site and ALT values are the remaining alleles. It is important to note, therefore, that for real data this means that the REF value for a given site may not be equal to the reference allele. We also do not check that the alleles result in a valid VCF—for example, it is possible to use the tab character as an allele, leading to a broken VCF.
The
position_transform
argument provides a way to flexibly translate the genomic location of sites in tskit to the appropriate value in VCF. There are two fundamental differences in the way that tskit and VCF define genomic coordinates. The first is that tskit uses floating point values to encode positions, whereas VCF uses integers. Thus, if the tree sequence contains positions at non-integral locations there is an information loss incurred by translating to VCF. By default, we round the site positions to the nearest integer, such that there may be several sites with the same integer position in the output. The second difference between VCF and tskit is that VCF is defined to be a 1-based coordinate system, whereas tskit uses 0-based. However, how coordinates are transformed depends on the VCF parser, and so we do not account for this change in coordinate system by default.Example usage:
with open("output.vcf", "w") as vcf_file:
    tree_sequence.write_vcf(vcf_file, ploidy=2)
The VCF output can also be compressed using the
gzip
module, if you wish:import gzip with gzip.open("output.vcf.gz", "wt") as f: ts.write_vcf(f)
However, this gzipped VCF may not be fully compatible with downstream tools such as tabix, which may require the VCF use the specialised bgzip format. A general way to convert VCF data to various formats is to pipe the text produced by
tskit
intobcftools
, as done here:import os import subprocess read_fd, write_fd = os.pipe() write_pipe = os.fdopen(write_fd, "w") with open("output.bcf", "w") as bcf_file: proc = subprocess.Popen( ["bcftools", "view", "-O", "b"], stdin=read_fd, stdout=bcf_file ) ts.write_vcf(write_pipe) write_pipe.close() os.close(read_fd) proc.wait() if proc.returncode != 0: raise RuntimeError("bcftools failed with status:", proc.returncode)
This can also be achieved on the command line using the
tskit vcf
command, e.g.:$ tskit vcf example.trees | bcftools view -O b > example.bcf
- Parameters
output (io.IOBase) – The file-like object to write the VCF output.
ploidy (int) – The ploidy of the individuals to be written to VCF. The sample size must be evenly divisible by ploidy.
contig_id (str) – The value of the CHROM column in the output VCF.
individuals (list(int)) – A list containing the individual IDs to write out to VCF. Defaults to all individuals in the tree sequence.
individual_names (list(str)) – A list of string names to identify individual columns in the VCF. In VCF nomenclature, these are the sample IDs. If specified, this must be a list of strings of length equal to the number of individuals to be output. Note that we do not check the form of these strings in any way, so it is possible to output malformed VCF (for example, by embedding a tab character within one of the names). The default is to output
tsk_j
for the jth individual.position_transform – A callable that transforms the site position values into integer valued coordinates suitable for VCF. The function takes a single positional parameter x and must return an integer numpy array the same dimension as x. By default, this is set to
numpy.round()
which will round values to the nearest integer. If the string “legacy” is provided here, the pre-0.2.0 legacy behaviour is used: values are rounded to the nearest integer (starting from 1) and identical output positions are avoided by incrementing.
-
The Tree
class¶
Trees in a tree sequence can be obtained by iterating over
TreeSequence.trees()
and specific trees can be accessed using methods
such as TreeSequence.first()
, TreeSequence.at()
and
TreeSequence.at_index()
. Each tree is an instance of the following
class which provides methods, for example, to access information
about particular nodes in the tree.
-
class
tskit.
Tree
[source]¶ A single tree in a
TreeSequence
. Please see the Moving along a tree sequence section for information on how to efficiently access trees sequentially or obtain a list of individual trees in a tree sequence.The
sample_lists
parameter controls the features that are enabled for this tree. Ifsample_lists
is True a more efficient algorithm is used in theTree.samples()
method.The
tracked_samples
parameter can be used to efficiently count the number of samples in a given set that exist in a particular subtree using theTree.num_tracked_samples()
method.The
Tree
class is a state-machine which has a state corresponding to each of the trees in the parent tree sequence. We transition between these states by using the seek functions likeTree.first()
,Tree.last()
,Tree.seek()
andTree.seek_index()
. There is one more state, the so-called “null” or “cleared” state. This is the state that aTree
is in immediately after initialisation; it has an index of -1, and no edges. We can also enter the null state by callingTree.next()
on the last tree in a sequence, callingTree.prev()
on the first tree in a sequence, or calling theTree.clear()
method at any time.The high-level TreeSequence seeking and iteration methods (e.g.,
TreeSequence.trees()
) are built on these low-level state-machine seek operations. We recommend these higher level operations for most users.- Parameters
tree_sequence (TreeSequence) – The parent tree sequence.
tracked_samples (list) – The list of samples to be tracked and counted using the
Tree.num_tracked_samples()
method.sample_lists (bool) – If True, provide more efficient access to the samples beneath a given node using the
Tree.samples()
method.root_threshold (int) – The minimum number of samples that a node must be ancestral to for it to be in the list of roots. By default this is 1, so that isolated samples (representing missing data) are roots. To efficiently restrict the roots of the tree to those subtending meaningful topology, set this to 2. This value is only relevant when trees have multiple roots.
sample_counts (bool) – Deprecated since 0.2.4.
Methods
Convert tree to dict of dicts for conversion to a networkx graph.
Returns the length of the branch (in generations) joining the specified node to its parent.
children
(u)Returns the children of the specified node
u
as a tuple of integer node IDs.clear
()Resets this tree back to the initial null state.
copy
()Returns a deep copy of this tree.
count_topologies
([sample_sets])Calculates the distribution of embedded topologies for every combination of the sample sets in
sample_sets
.depth
(u)Returns the number of nodes on the path from
u
to a root, not includingu
.draw
([path, width, height, node_labels, …])Returns a drawing of this tree.
draw_svg
([path, size, tree_height_scale, …])Return an SVG representation of a single tree.
first
()Seeks to the first tree in the sequence.
generate_balanced
(num_leaves, *[, arity, …])Generate a Tree with the specified number of leaves that is maximally balanced.
generate_comb
(num_leaves, *[, span, …])Generate a Tree in which all internal nodes have two children and the left child is a leaf.
generate_random
(num_leaves, *[, arity, …])Generate a random Tree with \(n\) =
num_leaves
leaves with an equal probability of returning any valid topology.generate_star
(num_leaves, *[, span, …])Generate a Tree whose leaf nodes all have the same parent (i.e.
is_descendant
(u, v)Returns True if the specified node u is a descendant of node v and False otherwise.
is_internal
(u)Returns True if the specified node is not a leaf.
is_isolated
(u)Returns True if the specified node is isolated in this tree: that is it has no parents and no children.
is_leaf
(u)Returns True if the specified node is a leaf.
is_sample
(u)Returns True if the specified node is a sample.
kc_distance
(other[, lambda_])Returns the Kendall-Colijn distance between the specified pair of trees.
last
()Seeks to the last tree in the sequence.
leaves
([u])Returns an iterator over all the leaves in this tree that are underneath the specified node.
left_child
(u)Returns the leftmost child of the specified node.
left_sib
(u)Returns the sibling node to the left of u, or
tskit.NULL
if u does not have a left sibling.map_mutations
(genotypes, alleles)Given observations for the samples in this tree described by the specified set of genotypes and alleles, return a parsimonious set of state transitions explaining these observations.
mrca
(u, v)Returns the most recent common ancestor of the specified nodes.
Returns an iterator over all the mutations in this tree.
newick
([precision, root, node_labels, …])Returns a newick encoding of this tree.
next
()Seeks to the next tree in the sequence.
nodes
([root, order])Returns an iterator over the node IDs in this tree.
num_children
(u)Returns the number of children of the specified node (i.e.
num_samples
([u])Returns the number of samples in this tree underneath the specified node (including the node itself).
num_tracked_samples
([u])Returns the number of samples in the set specified in the
tracked_samples
parameter of theTreeSequence.trees()
method underneath the specified node.parent
(u)Returns the parent of the specified node.
population
(u)Returns the population associated with the specified node.
prev
()Seeks to the previous tree in the sequence.
rank
()Produce the rank of this tree in the enumeration of all leaf-labelled trees of n leaves.
right_child
(u)Returns the rightmost child of the specified node.
right_sib
(u)Returns the sibling node to the right of u, or
tskit.NULL
if u does not have a right sibling.samples
([u])Returns an iterator over all the samples in this tree that are underneath the specified node.
seek
(position)Sets the state to represent the tree that covers the specified position in the parent tree sequence.
seek_index
(index)Sets the state to represent the tree at the specified index in the parent tree sequence.
sites
()Returns an iterator over all the sites in this tree.
split_polytomies
(*[, epsilon, method, …])Return a new
Tree
where extra nodes and edges have been inserted so that any nodeu
with greater than 2 children — a multifurcation or “polytomy” — is resolved into successive bifurcations.time
(u)Returns the time of the specified node in generations.
tmrca
(u, v)Returns the time of the most recent common ancestor of the specified nodes.
unrank
(num_leaves, rank, *[, span, …])Reconstruct the tree of the given
rank
(seetskit.Tree.rank()
) withnum_leaves
leaves.Attributes
Returns the index this tree occupies in the parent tree sequence.
Returns the coordinates of the genomic interval that this tree represents the history of.
The leftmost root in this tree.
Returns the total number of mutations across all sites on this tree.
Returns the number of nodes in the
TreeSequence
this tree is in.The number of roots in this tree, as defined in the
roots
attribute.Returns the number of sites on this tree.
The root of this tree.
Returns the minimum number of samples that a node must be an ancestor of to be considered a potential root.
The list of roots in this tree.
Returns the sample size for this tree.
Returns the genomic distance that this tree spans.
Returns the sum of all the branch lengths in this tree (in units of generations).
Returns the tree sequence that this tree is from.
-
as_dict_of_dicts
()[source]¶ Convert tree to dict of dicts for conversion to a networkx graph.
For example:
>>> import networkx as nx
>>> nx.DiGraph(tree.as_dict_of_dicts())
>>> # undirected graphs work as well
>>> nx.Graph(tree.as_dict_of_dicts())
- Returns
Dictionary of dictionaries of dictionaries where the first key is the source, the second key is the target of an edge, and the third key is an edge annotation. At this point the only annotation is “branch_length”, the length of the branch (in generations).
-
branch_length
(u)[source]¶ Returns the length of the branch (in generations) joining the specified node to its parent. This is equivalent to
>>> tree.time(tree.parent(u)) - tree.time(u)
The branch length for a node that has no parent (e.g., a root) is defined as zero.
Note that this is not related to the property .length which is a deprecated alias for the genomic
span
covered by a tree.
-
children
(u)[source]¶ Returns the children of the specified node
u
as a tuple of integer node IDs. Ifu
is a leaf, return the empty tuple. The ordering of children is arbitrary and should not be depended on; see the data model section for details.
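For example, a small sketch listing the children of every internal node in the first tree of a tree sequence ts:

tree = ts.first()
for u in tree.nodes():
    children = tree.children(u)
    if children:  # leaves return the empty tuple
        print(u, children)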
-
clear
()[source]¶ Resets this tree back to the initial null state. Calling this method on a tree already in the null state has no effect.
-
copy
()[source]¶ Returns a deep copy of this tree. The returned tree will have identical state to this tree.
- Returns
A copy of this tree.
- Return type
-
count_topologies
(sample_sets=None)[source]¶ Calculates the distribution of embedded topologies for every combination of the sample sets in
sample_sets
.sample_sets
defaults to all samples in the tree grouped by population.sample_sets
need not include all samples but must be pairwise disjoint.The returned object is a
tskit.TopologyCounter
that contains counts of topologies per combination of sample sets. For example,

>>> topology_counter = tree.count_topologies()
>>> rank, count = topology_counter[0, 1, 2].most_common(1)[0]
produces the most common tree topology, with populations 0, 1 and 2 as its tips, according to the genealogies of those populations’ samples in this tree.
The counts for each topology in the
tskit.TopologyCounter
are absolute counts that we would get if we were to select all combinations of samples from the relevant sample sets. For sample sets \([s_0, s_1, ..., s_n]\), the total number of topologies for those sample sets is equal to \(|s_0| * |s_1| * ... * |s_n|\), so the counts in the countertopology_counter[0, 1, ..., n]
should sum to \(|s_0| * |s_1| * ... * |s_n|\).To convert the topology counts to probabilities, divide by the total possible number of sample combinations from the sample sets in question:
>>> set_sizes = [len(sample_set) for sample_set in sample_sets]
>>> p = count / (set_sizes[0] * set_sizes[1] * set_sizes[2])
Warning
The interface for this method is preliminary and may be subject to backwards incompatible changes in the near future.
- Parameters
sample_sets (list) – A list of lists of Node IDs, specifying the groups of nodes to compute the statistic with. Defaults to all samples grouped by population.
- Return type
- Raises
ValueError – If nodes in
sample_sets
are invalid or are internal samples.
-
depth
(u)[source]¶ Returns the number of nodes on the path from
u
to a root, not includingu
. Thus, the depth of a root is zero.
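For example, assuming a tree obtained from an existing tree sequence ts:

tree = ts.first()
for u in list(tree.samples())[:5]:
    print(u, tree.depth(u))  # number of hops from u up to its root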
-
draw
(path=None, width=None, height=None, node_labels=None, node_colours=None, mutation_labels=None, mutation_colours=None, format=None, edge_colours=None, tree_height_scale=None, max_tree_height=None, order=None)[source]¶ Returns a drawing of this tree.
When working in a Jupyter notebook, use the
IPython.display.SVG
function to display the SVG output from this function inline in the notebook:>>> SVG(tree.draw())
The unicode format uses unicode box drawing characters to render the tree. This allows rendered trees to be printed out to the terminal:
>>> print(tree.draw(format="unicode"))
  6
┏━┻━┓
┃   5
┃ ┏━┻┓
┃ ┃  4
┃ ┃ ┏┻┓
3 0 1 2
The
node_labels
argument allows the user to specify custom labels for nodes, or no labels at all:

>>> print(tree.draw(format="unicode", node_labels={}))
  ┃
┏━┻━┓
┃   ┃
┃ ┏━┻┓
┃ ┃  ┃
┃ ┃ ┏┻┓
┃ ┃ ┃ ┃
Note: in some environments such as Jupyter notebooks with Windows or Mac, users have observed that the Unicode box drawings can be misaligned. In these cases, we recommend using the SVG or ASCII display formats instead. If you have a strong preference for aligned Unicode, you can try out the solution documented here.
- Parameters
path (str) – The path to the file to write the output. If None, do not write to file.
width (int) – The width of the image in pixels. If not specified, either defaults to the minimum size required to depict the tree (text formats) or 200 pixels.
height (int) – The height of the image in pixels. If not specified, either defaults to the minimum size required to depict the tree (text formats) or 200 pixels.
node_labels (dict) – If specified, show custom labels for the nodes that are present in the map. Any nodes not specified in the map will not have a node label.
node_colours (dict) – If specified, show custom colours for the nodes given in the map. Any nodes not specified in the map will take the default colour; a value of
None
is treated as transparent and hence the node symbol is not plotted. (Only supported in the SVG format.)mutation_labels (dict) – If specified, show custom labels for the mutations (specified by ID) that are present in the map. Any mutations not in the map will not have a label. (Showing mutations is currently only supported in the SVG format)
mutation_colours (dict) – If specified, show custom colours for the mutations given in the map (specified by ID). As for
node_colours
, mutations not present in the map take the default colour, and those mapping toNone
are not drawn. (Only supported in the SVG format.)format (str) – The format of the returned image. Currently supported are ‘svg’, ‘ascii’ and ‘unicode’. Note that the
Tree.draw_svg()
method provides more comprehensive functionality for creating SVGs.edge_colours (dict) – If specified, show custom colours for the edge joining each node in the map to its parent. As for
node_colours
, unspecified edges take the default colour, andNone
values result in the edge being omitted. (Only supported in the SVG format.)tree_height_scale (str) – Control how height values for nodes are computed. If this is equal to
"time"
, node heights are proportional to their time values. If this is equal to"log_time"
, node heights are proportional to their log(time) values. If it is equal to"rank"
, node heights are spaced equally according to their ranked times. For SVG output the default is ‘time’-scale whereas for text output the default is ‘rank’-scale. Time scaling is not currently supported for text output.max_tree_height (str,float) – The maximum tree height value in the current scaling system (see
tree_height_scale
). Can be either a string or a numeric value. If equal to"tree"
, the maximum tree height is set to be that of the oldest root in the tree. If equal to"ts"
the maximum height is set to be the height of the oldest root in the tree sequence; this is useful when drawing trees from the same tree sequence as it ensures that node heights are consistent. If a numeric value, this is used as the maximum tree height by which to scale other nodes. This parameter is not currently supported for text output.order (str) – The left-to-right ordering of child nodes in the drawn tree. This can be either:
"minlex"
, which minimises the differences between adjacent trees (see also the"minlex_postorder"
traversal order for thenodes()
method); or"tree"
which draws trees in the left-to-right order defined by the quintuply linked tree structure. If not specified or None, this defaults to"minlex"
.
- Returns
A representation of this tree in the requested format.
- Return type
-
draw_svg
(path=None, *, size=None, tree_height_scale=None, max_tree_height=None, node_labels=None, mutation_labels=None, root_svg_attributes=None, style=None, order=None, force_root_branch=None, **kwargs)[source]¶ Return an SVG representation of a single tree.
When working in a Jupyter notebook, use the
IPython.display.SVG
function to display the SVG output from this function inline in the notebook:>>> SVG(tree.draw_svg())
The elements in the tree are grouped according to the structure of the tree, using SVG groups. This allows easy styling and manipulation of elements and subtrees. Elements in the SVG file are marked with SVG classes so that they can be targeted, allowing different components of the drawing to be hidden, styled, or otherwise manipulated. For example, when drawing (say) the first tree from a tree sequence, all the SVG components will be placed in a group of class
tree
. The group will have the additional classt0
, indicating that this tree has index 0 in the tree sequence. The general SVG structure is as follows:The tree is contained in a group of class
tree
. Additionally, this group has a classtN
where N is the tree index.Within the
tree
group there is a nested hierarchy of groups corresponding to the tree structure. Any particular node in the tree will have a corresponding group containing child groups (if any) followed by the edge above that node, a node symbol, and (potentially) text containing the node label. For example, a simple two tip tree, with tip node ids 0 and 1, and a root node id of 2 will have a structure similar to the following:

<g class="tree t0">
  <g class="node n2 root">
    <g class="node n1 a2 i1 p1 sample leaf">
      <path class="edge" ... />
      <circle class="sym" ... />
      <text class="lab" ...>Node 1</text>
    </g>
    <g class="node n0 a2 i2 p1 sample leaf">
      <path class="edge" ... />
      <circle class="sym" .../>
      <text class="lab" ...>Node 0</text>
    </g>
    <path class="edge" ... />
    <circle />
    <text class="lab">Root (Node 2)</text>
  </g>
</g>
The classes can be used to manipulate the element, e.g. by using stylesheets. Style strings can be embedded in the svg by using the
style
parameter, or added to html pages which contain the raw SVG (e.g. within a Jupyter notebook by using the IPythonHTML()
function). As a simple example, passing the following string as thestyle
parameter will hide all labels:.tree .lab {visibility: hidden}
You can also change the format of various items: in SVG2-compatible viewers, the following styles will rotate the leaf nodes labels by 90 degrees, colour the leaf node symbols blue, and hide the non-sample node labels. Note that SVG1.1 does not recognize the
transform
style, so in some SVG viewers, the labels will not appear rotated: a workaround is to convert the SVG to PDF first, using e.g. the programmable chromium engine:chromium --headless --print-to-pdf=out.pdf in.svg
).

.tree .node.leaf > .lab {transform: translateY(0.5em) rotate(90deg); text-anchor: start}
.tree .node.leaf > .sym {fill: blue}
.tree .node:not(.sample) > .lab {visibility: hidden}
Nodes contain classes that allow them to be targeted by node id (
nX
), ancestor (parent) id (aX
orroot
if this node has no parent), and (if defined) the id of the individual (iX
) and population (pX
) to which this node belongs. Hence the following style will display a large symbol for node 10, coloured red with a black border, and will also use thick red lines for all the edges that have it as a direct or indirect parent (note that, as with thetransform
style, changing the geometrical size of symbols is only possible in SVG2 and above and therefore not all SVG viewers will render such symbol size changes correctly).

.tree .node.n10 > .sym {fill: red; stroke: black; r: 8px}
.tree .node.a10 .edge {stroke: red; stroke-width: 2px}
Note
A feature of SVG style commands is that they apply not just to the contents within the <svg> container, but to the entire file. Thus if an SVG file is embedded in a larger document, such as an HTML file (e.g. when an SVG is displayed inline in a Jupyter notebook), the style will apply to all SVG drawings in the notebook. To avoid this, you can tag the SVG with a unique ID using
root_svg_attributes={'id':'MY_UID'}
, and prepend this to the style string, as in#MY_UID .tree .edges {stroke: gray}
.- Parameters
path (str) – The path to the file to write the output. If None, do not write to file.
size (tuple(int, int)) – A tuple of (width, height) giving the width and height of the produced SVG drawing in abstract user units (usually interpreted as pixels on initial display).
tree_height_scale (str) – Control how height values for nodes are computed. If this is equal to
"time"
(the default), node heights are proportional to their time values. If this is equal to"log_time"
, node heights are proportional to their log(time) values. If it is equal to"rank"
, node heights are spaced equally according to their ranked times.max_tree_height (str,float) – The maximum tree height value in the current scaling system (see
tree_height_scale
). Can be either a string or a numeric value. If equal to"tree"
(the default), the maximum tree height is set to be that of the oldest root in the tree. If equal to"ts"
the maximum height is set to be the height of the oldest root in the tree sequence; this is useful when drawing trees from the same tree sequence as it ensures that node heights are consistent. If a numeric value, this is used as the maximum tree height by which to scale other nodes.node_labels (dict(int, str)) – If specified, show custom labels for the nodes (specified by ID) that are present in this map; any nodes not present will not have a label.
mutation_labels (dict(int, str)) – If specified, show custom labels for the mutations (specified by ID) that are present in the map; any mutations not present will not have a label.
root_svg_attributes (dict) – Additional attributes, such as an id, that will be embedded in the root
<svg>
tag of the generated drawing.style (str) – A css style string that will be included in the
<style>
tag of the generated svg. Note that certain styles, in particular transformations and changes in geometrical properties of objects, will only be recognised by SVG2-compatible viewers.order (str) – A string specifying the traversal type used to order the tips in the tree, as detailed in
Tree.nodes()
. If None (default), use the default order as described in that method.force_root_branch (bool) – If
True
always plot a branch (edge) above the root(s) in the tree. IfNone
(default) then only plot such root branches if there is a mutation above a root of the tree.
- Returns
An SVG representation of a tree.
- Return type
-
first
()[source]¶ Seeks to the first tree in the sequence. This can be called whether the tree is in the null state or not.
-
static
generate_balanced
(num_leaves, *, arity=2, span=1, branch_length=1, record_provenance=True)[source]¶ Generate a Tree with the specified number of leaves that is maximally balanced. By default, the tree returned is binary, such that for each node that subtends \(n\) leaves, the left child will subtend \(\lfloor n / 2 \rfloor\) leaves and the right child the remainder. Balanced trees with higher arity can also be generated using the
arity
parameter, where the leaves subtending a node are distributed among its children analogously.In the returned tree, the leaf nodes are all at time 0, marked as samples, and labelled 0 to n from left-to-right. Internal node IDs are assigned sequentially from n in a postorder traversal, and the time of an internal node is the maximum time of its children plus the specified
branch_length
.- Parameters
num_leaves (int) – The number of leaf nodes in the returned tree (must be 2 or greater).
arity (int) – The maximum number of children a node can have in the returned tree.
span (float) – The span of the tree, and therefore the
sequence_length
of thetree_sequence
property of the returned Tree. branch_length (float) – The minimum length of a branch in the tree (see above for details on how internal node times are assigned).
- Returns
A balanced tree. Its corresponding
TreeSequence
is available via thetree_sequence
attribute.- Return type
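A minimal usage sketch:

import tskit

# A binary balanced tree with 8 leaves, and a ternary one with 9 leaves
tree = tskit.Tree.generate_balanced(8)
ternary = tskit.Tree.generate_balanced(9, arity=3)
print(tree.num_samples(), ternary.num_samples())   # 8 9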
-
static
generate_comb
(num_leaves, *, span=1, branch_length=1, record_provenance=True)[source]¶ Generate a Tree in which all internal nodes have two children and the left child is a leaf. This is a “comb”, “ladder” or “pectinate” phylogeny, and is also known as a caterpillar tree.
The leaf nodes are all at time 0, marked as samples, and labelled 0 to n from left-to-right. Internal node IDs are assigned sequentially from n as we ascend the tree, and the time of an internal node is the maximum time of its children plus the specified
branch_length
.- Parameters
num_leaves (int) – The number of leaf nodes in the returned tree (must be 2 or greater).
span (float) – The span of the tree, and therefore the
sequence_length
of thetree_sequence
property of the returned Tree. branch_length (float) – The length of every branch in the tree (equivalent to the time of the root node).
- Returns
A comb-shaped tree. Its corresponding
TreeSequence
is available via thetree_sequence
attribute.- Return type
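A minimal usage sketch:

import tskit

comb = tskit.Tree.generate_comb(5)
print(comb.num_samples())     # 5
print(comb.time(comb.root))   # 4.0 with the default branch_length of 1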
-
static
generate_random
(num_leaves, *, arity=2, span=1, branch_length=1, random_seed=None, record_provenance=True)[source]¶ Generate a random Tree with \(n\) =
num_leaves
leaves with an equal probability of returning any valid topology.The leaf nodes are marked as samples, labelled 0 to n, and placed at time 0. The root node is placed at time
(num_leaves - 1) * branch_length
, and the other non-leaf nodes placed at a time ofbranch_length
from each other and from the root.Note
The returned tree has not been created under any explicit model of evolution. In order to simulate such trees, additional software such as msprime (https://github.com/tskit-dev/msprime) is required.
- Parameters
num_leaves (int) – The number of leaf nodes in the returned tree (must be 2 or greater).
arity (int) – The number of children of each internal node. If this is 2 (the default) then a strictly bifurcating (binary) tree is generated, chosen at random from the \((2n - 3)! / (2^{n - 2} (n - 2)!)\) possible topologies.
span (float) – The span of the tree, and therefore the
sequence_length
of thetree_sequence
property of the returned Tree. branch_length (float) – The time separating successive non-leaf nodes from each other.
- Returns
A random tree. Its corresponding
TreeSequence
is available via thetree_sequence
attribute.- Return type
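A minimal usage sketch (the seed value is arbitrary):

import tskit

rand_tree = tskit.Tree.generate_random(6, random_seed=42)
print(rand_tree.num_samples())   # 6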
-
static
generate_star
(num_leaves, *, span=1, branch_length=1, record_provenance=True)[source]¶ Generate a Tree whose leaf nodes all have the same parent (i.e. a “star” tree). The leaf nodes are all at time 0 and are marked as sample nodes.
The tree produced by this method is identical to
tskit.Tree.unrank(n, (0, 0))
, but generated more efficiently for largen
.- Parameters
num_leaves (int) – The number of leaf nodes in the returned tree (must be 2 or greater).
span (float) – The span of the tree, and therefore the
sequence_length
of thetree_sequence
property of the returned Tree. branch_length (float) – The length of every branch in the tree (equivalent to the time of the root node).
- Returns
A star-shaped tree. Its corresponding
TreeSequence
is available via thetree_sequence
attribute.- Return type
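A minimal usage sketch:

import tskit

star = tskit.Tree.generate_star(10)
assert star.num_children(star.root) == 10   # every leaf attaches directly to the root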
-
property
index
¶ Returns the index this tree occupies in the parent tree sequence. This index is zero based, so the first tree in the sequence has index 0.
- Returns
The index of this tree.
- Return type
-
property
interval
¶ Returns the coordinates of the genomic interval that this tree represents the history of. The interval is returned as a tuple \((l, r)\) and is a half-open interval such that the left coordinate is inclusive and the right coordinate is exclusive. This tree therefore applies to all genomic locations \(x\) such that \(l \leq x < r\).
- Returns
A named tuple (l, r) representing the left-most (inclusive) and right-most (exclusive) coordinates of the genomic region covered by this tree. The coordinates can be accessed by index (
0
or1
) or equivalently by name (.left
or.right
)- Return type
-
is_descendant
(u, v)[source]¶ Returns True if the specified node u is a descendant of node v and False otherwise. A node \(u\) is a descendant of another node \(v\) if \(v\) is on the path from \(u\) to root. A node is considered to be a descendant of itself, so
tree.is_descendant(u, u)
will be True for any valid node.- Parameters
- Returns
True if u is a descendant of v.
- Return type
- Raises
ValueError – If u or v are not valid node IDs.
-
is_internal
(u)[source]¶ Returns True if the specified node is not a leaf. A node is internal if it has one or more children in the current tree.
-
is_isolated
(u)[source]¶ Returns True if the specified node is isolated in this tree: that is it has no parents and no children. Sample nodes that are isolated and have no mutations above them are used to represent missing data.
-
is_leaf
(u)[source]¶ Returns True if the specified node is a leaf. A node \(u\) is a leaf if it has zero children.
-
is_sample
(u)[source]¶ Returns True if the specified node is a sample. A node \(u\) is a sample if it has been marked as a sample in the parent tree sequence.
-
kc_distance
(other, lambda_=0.0)[source]¶ Returns the Kendall-Colijn distance between the specified pair of trees. The
lambda_
parameter determines the relative weight of topology vs branch lengths in calculating the distance. Iflambda_
is 0 (the default) we only consider topology, and if it is 1 we only consider branch lengths. See Kendall & Colijn (2016) for details. The two trees being compared must have identical lists of sample nodes (i.e., the same IDs in the same order). The metric operates on samples, not leaves, so internal samples are treated identically to sample tips. Subtrees with no samples do not contribute to the metric.
-
last
()[source]¶ Seeks to the last tree in the sequence. This can be called whether the tree is in the null state or not.
-
leaves
(u=None)[source]¶ Returns an iterator over all the leaves in this tree that are underneath the specified node. If u is not specified, return all leaves in the tree.
- Parameters
u (int) – The node of interest.
- Returns
An iterator over all leaves in the subtree rooted at u.
- Return type
-
left_child
(u)[source]¶ Returns the leftmost child of the specified node. Returns
tskit.NULL
if u is a leaf or is not a node in the current tree. The left-to-right ordering of children is arbitrary and should not be depended on; see the data model section for details.This is a low-level method giving access to the quintuply linked tree structure in memory; the
children()
method is a more convenient way to obtain the children of a given node.
-
property
left_root
¶ The leftmost root in this tree. If there are multiple roots in this tree, they are siblings of this node, and so we can use
right_sib()
to iterate over all roots:

u = tree.left_root
while u != tskit.NULL:
    print("Root:", u)
    u = tree.right_sib(u)
The left-to-right ordering of roots is arbitrary and should not be depended on; see the data model section for details.
This is a low-level method giving access to the quintuply linked tree structure in memory; the
roots
attribute is a more convenient way to obtain the roots of a tree. If you are assuming that there is a single root in the tree you should use theroot
property.Warning
Do not use this property if you are assuming that there is a single root in trees that are being processed. The
root
property should be used in this case, as it will raise an error when multiple roots exists.- Return type
-
left_sib
(u)[source]¶ Returns the sibling node to the left of u, or
tskit.NULL
if u does not have a left sibling. The left-to-right ordering of children is arbitrary and should not be depended on; see the data model section for details.
-
map_mutations
(genotypes, alleles)[source]¶ Given observations for the samples in this tree described by the specified set of genotypes and alleles, return a parsimonious set of state transitions explaining these observations. The genotypes array is interpreted as indexes into the alleles list in the same manner as described in the
TreeSequence.variants()
method. Thus, if samplej
carries the allele at indexk
, then we havegenotypes[j] = k
. Missing observations can be specified for a sample using the valuetskit.MISSING_DATA
(-1), in which case the state at this sample does not influence the ancestral state or the position of mutations returned. At least one non-missing observation must be provided. A maximum of 64 alleles are supported. The current implementation uses the Fitch parsimony algorithm to determine the minimum number of state transitions required to explain the data. In this model, transitions between any of the non-missing states are equally likely.
The returned values correspond directly to the data model for describing variation at sites using mutations. See the Site Table and Mutation Table definitions for details and background.
The state reconstruction is returned as a two-tuple,
(ancestral_state, mutations)
, whereancestral_state
is the allele assigned to the tree root(s) andmutations
is a list ofMutation
objects, ordered as required in a mutation table. For each mutation,node
is the tree node at the bottom of the branch on which the transition occurred, andderived_state
is the new state after this mutation. Theparent
property contains the index in the returned list of the previous mutation on the path to root, ortskit.NULL
if there are no previous mutations (see the Mutation Table for more information on the concept of mutation parents). All other attributes of theMutation
object are undefined and should not be used.Note
Sample states observed as missing in the input
genotypes
need not correspond to samples whose nodes are actually “missing” (i.e. isolated) in the tree. In this case, mapping the mutations returned by this method onto the tree will result in these missing observations being imputed to the most parsimonious state.See the Parsimony section in the tutorial for examples of how to use this method.
- Parameters
- Returns
The inferred ancestral state and list of mutations on this tree that encode the specified observations.
- Return type
(str, list(tskit.Mutation))
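A minimal sketch, assuming tree is a Tree with four sample nodes (IDs 0 to 3) and using hypothetical observations:

genotypes = [0, 0, 1, 1]          # indexes into the alleles list below
alleles = ["A", "T"]
ancestral_state, mutations = tree.map_mutations(genotypes, alleles)
print(ancestral_state)            # the allele assigned to the root(s)
for mut in mutations:
    print(mut.node, mut.derived_state, mut.parent)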
-
mutations
()[source]¶ Returns an iterator over all the mutations in this tree. Mutations are returned in their order in the mutations table, that is, by nondecreasing site ID, and within a site, by decreasing mutation time with parent mutations before their children. See the
Mutation
class for details on the available fields for each mutation.The returned iterator is equivalent to iterating over all sites and all mutations in each site, i.e.:
>>> for site in tree.sites():
>>>     for mutation in site.mutations:
>>>         yield mutation
-
newick
(precision=14, *, root=None, node_labels=None, include_branch_lengths=True)[source]¶ Returns a newick encoding of this tree. If the
root
argument is specified, return a representation of the specified subtree, otherwise the full tree is returned. If the tree has multiple roots then separate newick strings for each rooted subtree must be found (i.e., we do not attempt to concatenate the different trees).
node_labels
argument, which maps node IDs to the desired labels.Warning
Node labels are not Newick escaped, so care must be taken to provide labels that will not break the encoding.
- Parameters
precision (int) – The numerical precision with which branch lengths are printed.
root (int) – If specified, return the tree rooted at this node.
node_labels (dict) – If specified, show custom labels for the nodes that are present in the map. Any nodes not specified in the map will not have a node label.
include_branch_lengths – If True (default), output branch lengths in the Newick string. If False, only output the topology, without branch lengths.
- Returns
A newick representation of this tree.
- Return type
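A minimal sketch, assuming tree is a Tree with a single root:

newick_str = tree.newick(precision=3)
topology_only = tree.newick(include_branch_lengths=False)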
-
next
()[source]¶ Seeks to the next tree in the sequence. If the tree is in the initial null state we seek to the first tree (equivalent to calling
first()
). Callingnext
on the last tree in the sequence results in the tree being cleared back into the null initial state (equivalent to callingclear()
). The return value of the function indicates whether the tree is in a non-null state, and can be used to loop over the trees:

# Iterate over the trees from left-to-right
tree = tskit.Tree(tree_sequence)
while tree.next():
    # Do something with the tree.
    print(tree.index)
# tree is now back in the null state.
- Returns
True if the tree has been transformed into one of the trees in the sequence; False if the tree has been transformed into the null state.
- Return type
-
nodes
(root=None, order='preorder')[source]¶ Returns an iterator over the node IDs in this tree. If the root parameter is provided, iterate over the node IDs in the subtree rooted at this node. If this is None, iterate over all node IDs. If the order parameter is provided, iterate over the nodes in required tree traversal order.
Note
Unlike the
TreeSequence.nodes()
method, this iterator produces integer node IDs, notNode
objects.The currently implemented traversal orders are:
‘preorder’: starting at root, yield the current node, then recurse and do a preorder on each child of the current node. See also Wikipedia.
‘inorder’: starting at root, assuming binary trees, recurse and do an inorder on the first child, then yield the current node, then recurse and do an inorder on the second child. In the case of
n
child nodes (not necessarily 2), the firstn // 2
children are visited in the first stage, and the remainingn - n // 2
children are visited in the second stage. See also Wikipedia.‘postorder’: starting at root, recurse and do a postorder on each child of the current node, then yield the current node. See also Wikipedia.
‘levelorder’ (‘breadthfirst’): visit the nodes under root (including the root) in increasing order of their depth from root. See also Wikipedia.
‘timeasc’: visits the nodes in order of increasing time, falling back to increasing ID if times are equal.
‘timedesc’: visits the nodes in order of decreasing time, falling back to decreasing ID if times are equal.
‘minlex_postorder’: a usual postorder has ambiguity in the order in which children of a node are visited. We constrain this by outputting a postorder such that the leaves visited, when their IDs are listed out, have minimum lexicographic order out of all valid traversals. This traversal is useful for drawing multiple trees of a
TreeSequence
, as it leads to more consistency between adjacent trees. Note that internal non-leaf nodes are not counted in assessing the lexicographic order.
- Parameters
- Returns
An iterator over the node IDs in the tree in some traversal order.
- Return type
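As a sketch of how the traversal orders can be used: a postorder traversal guarantees that children are visited before their parents, which allows bottom-up calculations (tree is an assumed Tree object):

# Count the nodes in the subtree below each node, bottom-up
subtree_size = {}
for u in tree.nodes(order="postorder"):
    subtree_size[u] = 1 + sum(subtree_size[v] for v in tree.children(u))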
-
num_children
(u)[source]¶ Returns the number of children of the specified node (i.e.
len(tree.children(u))
)
-
property
num_mutations
¶ Returns the total number of mutations across all sites on this tree.
- Returns
The total number of mutations over all sites on this tree.
- Return type
-
property
num_nodes
¶ Returns the number of nodes in the
TreeSequence
this tree is in. Equivalent totree.tree_sequence.num_nodes
. To find the number of nodes that are reachable from all roots uselen(list(tree.nodes()))
.- Return type
-
property
num_roots
¶ The number of roots in this tree, as defined in the
roots
attribute.Requires O(number of roots) time.
- Return type
-
num_samples
(u=None)[source]¶ Returns the number of samples in this tree underneath the specified node (including the node itself). If u is not specified return the total number of samples in the tree.
This is a constant time operation.
-
property
num_sites
¶ Returns the number of sites on this tree.
- Returns
The number of sites on this tree.
- Return type
-
num_tracked_samples
(u=None)[source]¶ Returns the number of samples in the set specified in the
tracked_samples
parameter of theTreeSequence.trees()
method underneath the specified node. If the input node is not specified, return the total number of tracked samples in the tree.This is a constant time operation.
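A minimal sketch, assuming ts is a TreeSequence whose samples include nodes 0, 1 and 2:

for tree in ts.trees(tracked_samples=[0, 1, 2]):
    print(tree.interval, tree.num_tracked_samples())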
-
parent
(u)[source]¶ Returns the parent of the specified node. Returns
tskit.NULL
if u is a root or is not a node in the current tree.
-
population
(u)[source]¶ Returns the population associated with the specified node. Equivalent to
tree.tree_sequence.node(u).population
.
-
prev
()[source]¶ Seeks to the previous tree in the sequence. If the tree is in the initial null state we seek to the last tree (equivalent to calling
last()
). Callingprev
on the first tree in the sequence results in the tree being cleared back into the null initial state (equivalent to callingclear()
). The return value of the function indicates whether the tree is in a non-null state, and can be used to loop over the trees:

# Iterate over the trees from right-to-left
tree = tskit.Tree(tree_sequence)
while tree.prev():
    # Do something with the tree.
    print(tree.index)
# tree is now back in the null state.
- Returns
True if the tree has been transformed into one of the trees in the sequence; False if the tree has been transformed into the null state.
- Return type
-
rank
()[source]¶ Produce the rank of this tree in the enumeration of all leaf-labelled trees of n leaves. See the Interpreting Tree Ranks section for details on ranking and unranking trees.
- Return type
- Raises
ValueError – If the tree has multiple roots.
-
right_child
(u)[source]¶ Returns the rightmost child of the specified node. Returns
tskit.NULL
if u is a leaf or is not a node in the current tree. The left-to-right ordering of children is arbitrary and should not be depended on; see the data model section for details.This is a low-level method giving access to the quintuply linked tree structure in memory; the
children()
method is a more convenient way to obtain the children of a given node.
-
right_sib
(u)[source]¶ Returns the sibling node to the right of u, or
tskit.NULL
if u does not have a right sibling. The left-to-right ordering of children is arbitrary and should not be depended on; see the data model section for details.
-
property
root
¶ The root of this tree. If the tree contains multiple roots, a ValueError is raised indicating that the
roots
attribute should be used instead.- Returns
The root node.
- Return type
- Raises
ValueError
if this tree contains more than one root.
-
property
root_threshold
¶ Returns the minimum number of samples that a node must be an ancestor of to be considered a potential root.
- Returns
The root threshold.
- Return type
-
property
roots
¶ The list of roots in this tree. A root is defined as a unique endpoint of the paths starting at samples. We can define the set of roots as follows:
roots = set()
for u in tree_sequence.samples():
    while tree.parent(u) != tskit.NULL:
        u = tree.parent(u)
    roots.add(u)
# roots is now the set of all roots in this tree.
assert sorted(roots) == sorted(tree.roots)
The roots of the tree are returned in a list, in no particular order.
Requires O(number of roots) time.
- Returns
The list of roots in this tree.
- Return type
-
property
sample_size
¶ Returns the sample size for this tree. This is the number of sample nodes in the tree.
- Returns
The number of sample nodes in the tree.
- Return type
-
samples
(u=None)[source]¶ Returns an iterator over all the samples in this tree that are underneath the specified node. If u is a sample, it is included in the returned iterator. If u is not specified, return all samples in the tree.
If the
TreeSequence.trees()
method is called withsample_lists=True
, this method uses an efficient algorithm to find the samples. If not, a simple traversal based method is used.- Parameters
u (int) – The node of interest.
- Returns
An iterator over all samples in the subtree rooted at u.
- Return type
-
seek
(position)[source]¶ Sets the state to represent the tree that covers the specified position in the parent tree sequence. After a successful return of this method we have
tree.interval.left
<=position
<tree.interval.right
.- Parameters
position (float) – The position along the sequence length to seek to.
- Raises
ValueError – If position < 0 or position >=
TreeSequence.sequence_length
.
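A minimal sketch, assuming ts is an existing TreeSequence:

tree = tskit.Tree(ts)
tree.seek(ts.sequence_length / 2)   # tree now covers the midpoint
assert tree.interval.left <= ts.sequence_length / 2 < tree.interval.right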
-
seek_index
(index)[source]¶ Sets the state to represent the tree at the specified index in the parent tree sequence. Negative indexes following the standard Python conventions are allowed, i.e.,
index=-1
will seek to the last tree in the sequence.- Parameters
index (int) – The tree index to seek to.
- Raises
IndexError – If an index outside the acceptable range is provided.
-
sites
()[source]¶ Returns an iterator over all the sites in this tree. Sites are returned in order of increasing ID (and also position). See the
Site
class for details on the available fields for each site.- Returns
An iterator over all sites in this tree.
-
property
span
¶ Returns the genomic distance that this tree spans. This is defined as \(r - l\), where \((l, r)\) is the genomic interval returned by
interval
.- Returns
The genomic distance covered by this tree.
- Return type
-
split_polytomies
(*, epsilon=None, method=None, record_provenance=True, random_seed=None)[source]¶ Return a new
Tree
where extra nodes and edges have been inserted so that any node u with greater than 2 children — a multifurcation or “polytomy” — is resolved into successive bifurcations. New nodes are inserted at times fractionally less than the time of node u
(controlled by theepsilon
parameter).If the
method
is"random"
(currently the only option, and the default when no method is specified), then for a node with \(n\) children, one of the \((2n - 3)! / (2^{n - 2} (n - 2)!)\) possible binary trees is chosen with equal probability. The returned Tree will have the same genomic span as this tree, and node IDs will be conserved (that is, node
u
in this tree will be the same node in the returned tree). The returned tree is derived from a tree sequence that contains only one non-degenerate tree, that is, where edges cover only the interval spanned by this tree.Note
A tree sequence requires that parents be older than children and that mutations are younger than the parent of the edge on which they lie. If
epsilon
is not small enough, compared to the distance between a polytomy and its oldest child (or oldest child mutation) these requirements may not be met. In this case an error is raised, recommending a smaller epsilon value be used.- Parameters
epsilon – A small time period used to separate each newly inserted node. For a given polytomy of degree \(n\), the \(n-2\) extra nodes are inserted with the oldest at time
epsilon
less than the original parent,u
, and successive nodes at timeepsilon
from each other. Times are allocated to different levels of the tree, such that any newly inserted sibling nodes will have the same time. (Default \(1e-10\)).method (str) – The method used to break polytomies. Currently only “random” is supported, which can also be specified by
method=None
(Default:None
).record_provenance (bool) – If True, add details of this operation to the provenance information of the returned tree sequence. (Default: True).
random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32} - 1\).
- Returns
A new tree with polytomies split into random bifurcations.
- Return type
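A minimal sketch, resolving a star tree into a random binary tree:

import tskit

star = tskit.Tree.generate_star(6)
binary = star.split_polytomies(random_seed=1)
assert binary.num_children(binary.root) == 2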
-
time
(u)[source]¶ Returns the time of the specified node in generations. Equivalent to
tree.tree_sequence.node(u).time
.
-
tmrca
(u, v)[source]¶ Returns the time of the most recent common ancestor of the specified nodes. This is equivalent to:
>>> tree.time(tree.mrca(u, v))
-
property
total_branch_length
¶ Returns the sum of all the branch lengths in this tree (in units of generations). This is equivalent to
>>> sum(tree.branch_length(u) for u in tree.nodes())
Note that the branch lengths for root nodes are defined as zero.
As this is defined by a traversal of the tree, technically we return the sum of all branch lengths that are reachable from roots. Thus, this is the sum of all branches that are ancestral to at least one sample. This distinction is only important in tree sequences that contain ‘dead branches’, i.e., those that define topology not ancestral to any samples.
- Returns
The sum of lengths of branches in this tree.
- Return type
-
property
tree_sequence
¶ Returns the tree sequence that this tree is from.
- Returns
The parent tree sequence for this tree.
- Return type
-
static
unrank
(num_leaves, rank, *, span=1, branch_length=1)[source]¶ Reconstruct the tree of the given
rank
(seetskit.Tree.rank()
) withnum_leaves
leaves. The labels and times of internal nodes are assigned by a postorder traversal of the nodes, such that the time of each internal node is the maximum time of its children plus the specifiedbranch_length
. The time of each leaf is 0.See the Interpreting Tree Ranks section for details on ranking and unranking trees and what constitutes valid ranks.
- Parameters
num_leaves (int) – The number of leaves of the tree to generate.
span (float) – The genomic span of the returned tree. The tree will cover the interval \([0, \text{span})\) and the
tree_sequence
from which the tree is taken will have itssequence_length
equal tospan
.
branch_length (float) – The minimum length of a branch in this tree.
- Return type
- Raises
ValueError: If the given rank is out of bounds for trees with
num_leaves
leaves.
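A minimal sketch, using the fact noted under generate_star() that rank (0, 0) is the star topology:

import tskit

star = tskit.Tree.unrank(5, (0, 0))
assert star.num_children(star.root) == 5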
Constants¶
-
tskit.
NULL
= -1¶ Special reserved value representing a null ID.
-
tskit.
NODE_IS_SAMPLE
= 1¶ Node flag value indicating that it is a sample.
-
tskit.
MISSING_DATA
= -1¶ Special value representing missing data in a genotype array.
-
tskit.
FORWARD
= 1¶ Constant representing the forward direction of travel (i.e., increasing genomic coordinate values).
-
tskit.
REVERSE
= -1¶ Constant representing the reverse direction of travel (i.e., decreasing genomic coordinate values).
-
tskit.
ALLELES_ACGT
= ('A', 'C', 'G', 'T')¶ A tuple of the four DNA nucleotides, in alphabetical order, for convenient use as an allele list.
Simple container classes¶
These classes are simple shallow containers representing the entities defined
in the Definitions. These classes are not intended to be instantiated
directly, but are the return types for the various iterators provided by the
TreeSequence
and Tree
classes.
-
class
tskit.
Individual
[source]¶ An individual in a tree sequence. Since nodes correspond to genomes, individuals are associated with a collection of nodes (e.g., two nodes per diploid). See Nodes, Genomes, or Individuals? for more discussion of this distinction.
Modifying the attributes in this class will have no effect on the underlying tree sequence data.
- Variables
id (int) – The integer ID of this individual. Varies from 0 to
TreeSequence.num_individuals
- 1.flags (int) – The bitwise flags for this individual.
location (numpy.ndarray) – The spatial location of this individual as a numpy array. The location is an empty array if no spatial location is defined.
nodes – The IDs of the nodes that are associated with this individual as a numpy array (dtype=np.int32). If no nodes are associated with the individual this array will be empty.
metadata (object) – The decoded metadata for this individual.
-
class
tskit.
Node
[source]¶ A node in a tree sequence, corresponding to a single genome. The
time
andpopulation
are attributes of theNode
, rather than theIndividual
, as discussed in Nodes, Genomes, or Individuals?.Modifying the attributes in this class will have no effect on the underlying tree sequence data.
- Variables
id (int) – The integer ID of this node. Varies from 0 to
TreeSequence.num_nodes
- 1.flags (int) – The bitwise flags for this node.
time (float) – The birth time of this node.
population (int) – The integer ID of the population that this node was born in.
individual (int) – The integer ID of the individual that this node was a part of.
-
class
tskit.
Edge
[source]¶ An edge in a tree sequence.
Modifying the attributes in this class will have no effect on the underlying tree sequence data.
- Variables
left (float) – The left coordinate of this edge.
right (float) – The right coordinate of this edge.
parent (int) – The integer ID of the parent node for this edge. To obtain further information about a node with a given ID, use
TreeSequence.node()
.child (int) – The integer ID of the child node for this edge. To obtain further information about a node with a given ID, use
TreeSequence.node()
.id (int) – The integer ID of this edge. Varies from 0 to
TreeSequence.num_edges
- 1.
-
class
tskit.
Interval
[source]¶ A tuple of 2 numbers,
[left, right)
, defining an interval over the genome.- Variables
left (float) – The left hand end of the interval. By convention this value is included in the interval.
right (float) – The right hand end of the interval. By convention this value is not included in the interval, i.e. the interval is half-open.
span (float) – The span of the genome covered by this interval, simply
right-left
.
-
class
tskit.
Site
[source]¶ A site in a tree sequence.
Modifying the attributes in this class will have no effect on the underlying tree sequence data.
- Variables
id (int) – The integer ID of this site. Varies from 0 to
TreeSequence.num_sites
- 1.position (float) – The floating point location of this site in genome coordinates. Ranges from 0 (inclusive) to
TreeSequence.sequence_length
(exclusive).ancestral_state (str) – The ancestral state at this site (i.e., the state inherited by nodes, unless mutations occur).
mutations (list[
Mutation
]) – The list of mutations at this site. Mutations within a site are returned in the order they are specified in the underlyingMutationTable
.
-
class
tskit.
Mutation
[source]¶ A mutation in a tree sequence.
Modifying the attributes in this class will have no effect on the underlying tree sequence data.
- Variables
id (int) – The integer ID of this mutation. Varies from 0 to
TreeSequence.num_mutations
- 1.site (int) – The integer ID of the site that this mutation occurs at. To obtain further information about a site with a given ID use
TreeSequence.site()
.node (int) – The integer ID of the first node that inherits this mutation. To obtain further information about a node with a given ID, use
TreeSequence.node()
.time (float) – The occurrence time of this mutation.
derived_state (str) – The derived state for this mutation. This is the state inherited by nodes in the subtree rooted at this mutation’s node, unless another mutation occurs.
parent (int) – The integer ID of this mutation’s parent mutation. When multiple mutations occur at a site along a path in the tree, mutations must record the mutation that is immediately above them. If the mutation does not have a parent, this is equal to the
NULL
(-1). To obtain further information about a mutation with a given ID, useTreeSequence.mutation()
.
-
class
tskit.
Variant
[source]¶ A variant represents the observed variation among samples for a given site. A variant consists (a) of a reference to the
Site
instance in question; (b) the alleles that may be observed at the samples for this site; and (c) the genotypes mapping sample IDs to the observed alleles.Each element in the
alleles
tuple is a string, representing the actual observed state for a given sample. Thealleles
tuple is generated in one of two ways. The first (and default) way is fortskit
to generate the encoding on the fly as alleles are encountered while generating genotypes. In this case, the first element of this tuple is guaranteed to be the same as the site’sancestral_state
value and the list of alleles is also guaranteed not to contain any duplicates. Note that allelic values may be listed that are not referred to by any samples. For example, if we have a site that is fixed for the derived state (i.e., we have a mutation over the tree root), all genotypes will be 1, but the alleles list will be equal to('0', '1')
. Other than the ancestral state being the first allele, the alleles are listed in no particular order, and the ordering should not be relied upon (but see the notes on missing data below).The second way is for the user to define the mapping between genotype values and allelic state strings using the
alleles
parameter to theTreeSequence.variants()
method. In this case, there is no indication of which allele is the ancestral state, as the ordering is determined by the user.The
genotypes
represent the observed allelic states for each sample, such thatvar.alleles[var.genotypes[j]]
gives the string allele for sample IDj
. Thus, the elements of the genotypes array are indexes into thealleles
list. The genotypes are provided in this way via a numpy array to enable efficient calculations. When missing data is present at a given site, the boolean flag
has_missing_data
will be True, at least one element of thegenotypes
array will be equal totskit.MISSING_DATA
, and the last element of thealleles
array will beNone
. Note that in this casevariant.num_alleles
will not be equal tolen(variant.alleles)
. The rationale for addingNone
to the end of thealleles
list is to help code that does not handle missing data correctly fail early rather than introducing subtle and hard-to-find bugs. Astskit.MISSING_DATA
is equal to -1, code that decodes genotypes into allelic values without taking missing data into account would otherwise output the last allele in the list rather than missing data. Modifying the attributes in this class will have no effect on the underlying tree sequence data.
- Variables
site (
Site
) – The site object for this variant.alleles (tuple(str)) – A tuple of the allelic values that may be observed at the samples at the current site. The first element of this tuple is always the site’s ancestral state.
genotypes (numpy.ndarray) – An array of indexes into the list
alleles
, giving the state of each sample at the current site.has_missing_data (bool) – True if there is missing data for any of the samples at the current site.
num_alleles (int) – The number of distinct alleles at this site. Note that this may be greater than the number of distinct values in the genotypes array.
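A minimal sketch of how these fields are typically consumed, assuming ts is a TreeSequence with sites:

for var in ts.variants():
    print(var.site.position, var.alleles)
    print(var.genotypes)   # numpy array of indexes into var.alleles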
-
class
tskit.
Migration
[source]¶ A migration in a tree sequence.
Modifying the attributes in this class will have no effect on the underlying tree sequence data.
- Variables
left (float) – The left end of the genomic interval covered by this migration (inclusive).
right (float) – The right end of the genomic interval covered by this migration (exclusive).
node (int) – The integer ID of the node involved in this migration event. To obtain further information about a node with a given ID, use
TreeSequence.node()
.source (int) – The source population ID.
dest (int) – The destination population ID.
time (float) – The time at which this migration occurred.
metadata (object) – The decoded metadata for this migration.
-
class
tskit.
Population
[source]¶ A population in a tree sequence.
Modifying the attributes in this class will have no effect on the underlying tree sequence data.
- Variables
id (int) – The integer ID of this population. Varies from 0 to
TreeSequence.num_populations
- 1.metadata (object) – The decoded metadata for this population.
Loading data¶
There are several methods for loading data into a TreeSequence
instance. The simplest and most convenient is to use the tskit.load()
function to load a tree sequence file. For small
scale data and debugging, it is often convenient to use
tskit.load_text()
to read data in the text file format. The TableCollection.tree_sequence()
function
efficiently creates a TreeSequence
object from a set of tables
using the Tables API.
-
tskit.
load
(file)[source]¶ Loads a tree sequence from the specified file object or path. The file must be in the tree sequence file format produced by the
TreeSequence.dump()
method.- Parameters
file (str) – The file object or path of the
.trees
file containing the tree sequence we wish to load.- Returns
The tree sequence object containing the information stored in the specified file path.
- Return type
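A minimal sketch (the file name is hypothetical):

import tskit

ts = tskit.load("example.trees")
print(ts.num_trees, ts.num_samples, ts.sequence_length)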
-
tskit.
load_text
(nodes, edges, sites=None, mutations=None, individuals=None, populations=None, sequence_length=0, strict=True, encoding='utf8', base64_metadata=True)[source]¶ Parses the tree sequence data from the specified file-like objects, and returns the resulting
TreeSequence
object. The format for these files is documented in the Text file formats section, and is produced by theTreeSequence.dump_text()
method. Further properties required for an input tree sequence are described in the Valid tree sequence requirements section. This method is intended as a convenient interface for importing external data into tskit; the binary file format used by tskit.load()
is many times more efficient than this text format.The
nodes
andedges
parameters are mandatory and must be file-like objects containing text with whitespace delimited columns, parsable byparse_nodes()
andparse_edges()
, respectively.sites
,mutations
,individuals
andpopulations
are optional, and must be parsable byparse_sites()
,parse_individuals()
,parse_populations()
, andparse_mutations()
, respectively.The
sequence_length
parameter determines theTreeSequence.sequence_length
of the returned tree sequence. If it is 0 or not specified, the value is taken to be the maximum right coordinate of the input edges. This parameter is useful in degenerate situations (such as when there are zero edges), but can usually be ignored.The
strict
parameter controls the field delimiting algorithm that is used. Ifstrict
is True (the default), we require exactly one tab character separating each field. Ifstrict
is False, a more relaxed whitespace delimiting algorithm is used, such that any run of whitespace is regarded as a field separator. In most situations,strict=False
is more convenient, but it can lead to error in certain situations. For example, if a deletion is encoded in the mutation table this will not be parseable whenstrict=False
.After parsing the tables,
TableCollection.sort()
is called to ensure that the loaded tables satisfy the tree sequence ordering requirements. Note that this may result in the IDs of various entities changing from their positions in the input file.- Parameters
nodes (io.TextIOBase) – The file-like object containing text describing a
NodeTable
.edges (io.TextIOBase) – The file-like object containing text describing an
EdgeTable
.sites (io.TextIOBase) – The file-like object containing text describing a
SiteTable
.mutations (io.TextIOBase) – The file-like object containing text describing a
MutationTable
.individuals (io.TextIOBase) – The file-like object containing text describing a
IndividualTable
.populations (io.TextIOBase) – The file-like object containing text describing a
PopulationTable
.sequence_length (float) – The sequence length of the returned tree sequence. If not supplied or zero this will be inferred from the set of edges.
strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used.
encoding (str) – Encoding used for text representation.
base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text.
- Returns
The tree sequence object containing the information stored in the specified file paths.
- Return type
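A minimal sketch using in-memory file objects; the column layout follows the text file formats referred to above (a simple two-leaf tree with a root at time 1):

import io
import tskit

nodes = io.StringIO("""\
is_sample   time
1           0
1           0
0           1
""")
edges = io.StringIO("""\
left    right   parent  child
0       10      2       0
0       10      2       1
""")
ts = tskit.load_text(nodes=nodes, edges=edges, strict=False)
print(ts.num_trees, ts.sequence_length)   # 1 10.0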
Tables and Table Collections¶
The information required to construct a tree sequence is stored in a collection
of tables, each defining a different aspect of the structure of a tree
sequence. These tables are described individually in the next section. However, these are interrelated, and so many operations work
on the entire collection of tables, known as a TableCollection
.
The TableCollection
and TreeSequence
classes are
deeply related. A TreeSequence
instance is based on the information
encoded in a TableCollection
. Tree sequences are immutable, and
provide methods for obtaining trees from the sequence. A TableCollection
is mutable, and does not have any methods for obtaining trees.
The TableCollection
class thus allows dynamic creation and modification of
tree sequences.
The TableCollection
class¶
Many of the TreeSequence
methods that return a modified tree sequence
are in fact wrappers around a corresponding TableCollection
method
that modifies a copy of the original tree sequence’s table collection.
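As a minimal sketch of this edit workflow, assuming ts is an existing TreeSequence:

tables = ts.dump_tables()        # mutable copy of the underlying tables
tables.sites.clear()             # e.g. drop all sites ...
tables.mutations.clear()         # ... and their mutations
new_ts = tables.tree_sequence()  # rebuild an immutable tree sequence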
-
class
tskit.
TableCollection
(sequence_length=0)[source]¶ A collection of mutable tables defining a tree sequence. See the Data model section for definition on the various tables and how they together define a
TreeSequence
. Arbitrary data can be stored in a TableCollection, but there are certain requirements that must be satisfied for these tables to be interpreted as a tree sequence.To obtain an immutable
TreeSequence
instance corresponding to the current state of aTableCollection
, please use thetree_sequence()
method.- Variables
individuals (IndividualTable) – The individual table.
nodes (NodeTable) – The node table.
edges (EdgeTable) – The edge table.
migrations (MigrationTable) – The migration table.
sites (SiteTable) – The site table.
mutations (MutationTable) – The mutation table.
populations (PopulationTable) – The population table.
provenances (ProvenanceTable) – The provenance table.
index – The edge insertion and removal index.
sequence_length (float) – The sequence length defining the coordinate space.
file_uuid (str) – The UUID for the file this TableCollection is derived from, or None if not derived from a file.
Methods
asdict
()Returns a dictionary representation of this TableCollection.
Builds an index on this TableCollection.
clear
([clear_provenance, …])Remove all rows of the data tables, optionally remove provenance, metadata schemas and ts-level metadata.
Modifies the tables in place, computing the
parent
column of the mutation table.Modifies the tables in place, computing valid values for the
time
column of the mutation table.copy
()Returns a deep copy of this TableCollection.
Modifies the tables in place, removing entries in the site table with duplicate
position
(and keeping only the first entry for each site), and renumbering thesite
column of the mutation table appropriately.delete_intervals
(intervals[, simplify, …])Delete all information from this set of tables which lies within the specified list of genomic intervals.
delete_sites
(site_ids[, record_provenance])Remove the specified sites entirely from the sites and mutations tables in this collection.
Drops any indexes present on this table collection.
dump
(file_or_path)Writes the table collection to the specified path or file object.
equals
(other, *[, ignore_metadata, …])Returns True if self and other are equal.
Returns True if this TableCollection is indexed.
keep_intervals
(intervals[, simplify, …])Delete all information from this set of tables which lies outside the specified list of genomic intervals.
link_ancestors
(samples, ancestors)Returns an
EdgeTable
instance describing a subset of the genealogical relationships between the nodes insamples
andancestors
.ltrim
([record_provenance])Reset the coordinate system used in these tables, changing the left and right genomic positions in the edge table such that the leftmost edge now starts at position 0.
rtrim
([record_provenance])Reset the
sequence_length
property so that the sequence ends at the end of the last edge.simplify
([samples, reduce_to_site_topology, …])Simplifies the tables in place to retain only the information necessary to reconstruct the tree sequence describing the given
samples
.sort
([edge_start])Sorts the tables in place.
subset
(nodes[, record_provenance])Modifies the tables in place to contain only the entries referring to the provided list of nodes, with nodes reordered according to the order they appear in the list.
Returns a
TreeSequence
instance with the structure defined by the tables in thisTableCollection
.trim
([record_provenance])Trim away any empty regions on the right and left of the tree sequence encoded by these tables.
union
(other, node_mapping[, …])Modifies the table collection in place by adding the non-shared portions of
other
to itself.Attributes
The decoded metadata for this TableCollection.
The raw bytes of metadata for this TableCollection
The
tskit.MetadataSchema
for this TableCollection.Returns a dictionary mapping table names to the corresponding table instances.
Returns the total number of bytes required to store the data in this table collection.
-
asdict
()[source]¶ Returns a dictionary representation of this TableCollection.
Note: the semantics of this method changed at tskit 0.1.0. Previously a map of table names to the tables themselves was returned.
-
build_index
()[source]¶ Builds an index on this TableCollection. Any existing indexes are automatically dropped.
-
clear
(clear_provenance=False, clear_metadata_schemas=False, clear_ts_metadata_and_schema=False)[source]¶ Remove all rows of the data tables, optionally remove provenance, metadata schemas and ts-level metadata.
- Parameters
clear_provenance (bool) – If
True
, remove all rows of the provenance table. (Default:False
).clear_metadata_schemas (bool) – If
True
, clear the table metadata schemas. (Default:False
).clear_ts_metadata_and_schema (bool) – If
True
, clear the tree-sequence level metadata and schema (Default:False
).
-
compute_mutation_parents
()[source]¶ Modifies the tables in place, computing the
parent
column of the mutation table. For this to work, the node and edge tables must be valid, and the site and mutation tables must be sorted (seeTableCollection.sort()
). This will produce an error if mutations are not sorted (i.e., if a mutation appears before its mutation parent) unless the two mutations occur on the same branch, in which case there is no way to detect the error.The
parent
of a given mutation is the ID of the next mutation encountered traversing the tree upwards from that mutation, orNULL
if there is no such mutation.Note
This method does not check that all mutations result in a change of state, as required; see Mutation requirements.
-
compute_mutation_times
()[source]¶ Modifies the tables in place, computing valid values for the
time
column of the mutation table. For this to work, the node and edge tables must be valid, and the site and mutation tables must be sorted and indexed (see
andTableCollection.build_index()
).For a single mutation on an edge at a site, the
time
assigned to a mutation by this method is the mid-point between the times of the nodes above and below the mutation. In the case where there is more than one mutation on an edge for a site, the times are evenly spread along the edge. For mutations that are above a root node, the time of the root node is assigned.The mutation table will be sorted if the new times mean that the original order is no longer valid.
-
copy
()[source]¶ Returns a deep copy of this TableCollection.
- Returns
A deep copy of this TableCollection.
- Return type
-
deduplicate_sites
()[source]¶ Modifies the tables in place, removing entries in the site table with duplicate
position
(and keeping only the first entry for each site), and renumbering thesite
column of the mutation table appropriately. This requires the site table to be sorted by position.
-
delete_intervals
(intervals, simplify=True, record_provenance=True)[source]¶ Delete all information from this set of tables which lies within the specified list of genomic intervals. This is identical to
TreeSequence.delete_intervals()
but acts in place to alter the data in thisTableCollection
.- Parameters
intervals (array_like) – A list of (start, end) pairs describing the genomic intervals to delete. Intervals must be non-overlapping and in increasing order. The list of intervals must be interpretable as a 2D numpy array with shape (N, 2), where N is the number of intervals.
simplify (bool) – If True, run simplify on the tables so that nodes no longer used are discarded. (Default: True).
record_provenance (bool) – If
True
, add details of this operation to the provenance table in this TableCollection. (Default:True
).
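A minimal sketch, assuming ts is an existing TreeSequence (the coordinates are arbitrary):

tables = ts.dump_tables()
tables.delete_intervals([[0, 100]])   # remove all information in [0, 100)
ts_edited = tables.tree_sequence()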
-
delete_sites
(site_ids, record_provenance=True)[source]¶ Remove the specified sites entirely from the sites and mutations tables in this collection. This is identical to
TreeSequence.delete_sites()
but acts in place to alter the data in thisTableCollection
.
-
drop_index
()[source]¶ Drops any indexes present on this table collection. If the tables are not currently indexed this method has no effect.
-
dump
(file_or_path)[source]¶ Writes the table collection to the specified path or file object.
- Parameters
file_or_path (str) – The file object or path to write the TreeSequence to.
-
equals
(other, *, ignore_metadata=False, ignore_ts_metadata=False, ignore_provenance=False, ignore_timestamps=False)[source]¶ Returns True if self and other are equal. By default, two table collections are considered equal if their
sequence_length
properties are identical; top-level tree sequence metadata and metadata schemas are byte-wise identical;
constituent tables are byte-wise identical.
Some of the requirements in this definition can be relaxed using the parameters, which can be used to remove certain parts of the data model from the comparison.
Table indexes are not considered in the equality comparison.
- Parameters
other (TableCollection) – Another table collection.
ignore_metadata (bool) – If True all metadata and metadata schemas will be excluded from the comparison. This includes the top-level tree sequence and constituent table metadata (default=False).
ignore_ts_metadata (bool) – If True the top-level tree sequence metadata and metadata schemas will be excluded from the comparison. If
ignore_metadata
is True, this parameter has no effect.ignore_provenance (bool) – If True the provenance tables are not included in the comparison.
ignore_timestamps (bool) – If True the provenance timestamp column is ignored in the comparison. If
ignore_provenance
is True, this parameter has no effect.
- Returns
True if other is equal to this table collection; False otherwise.
- Return type
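A minimal sketch, assuming ts is an existing TreeSequence:

t1 = ts.dump_tables()
t2 = ts.dump_tables()
assert t1.equals(t2)
assert t1.equals(t2, ignore_provenance=True)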
-
keep_intervals
(intervals, simplify=True, record_provenance=True)[source]¶ Delete all information from this set of tables which lies outside the specified list of genomic intervals. This is identical to
TreeSequence.keep_intervals()
but acts in place to alter the data in thisTableCollection
.- Parameters
intervals (array_like) – A list of (start, end) pairs describing the genomic intervals to keep. Intervals must be non-overlapping and in increasing order. The list of intervals must be interpretable as a 2D numpy array with shape (N, 2), where N is the number of intervals.
simplify (bool) – If True, run simplify on the tables so that nodes no longer used are discarded. (Default: True).
record_provenance (bool) – If
True
, add details of this operation to the provenance table in this TableCollection. (Default:True
).
-
link_ancestors
(samples, ancestors)[source]¶ Returns an
EdgeTable
instance describing a subset of the genealogical relationships between the nodes in samples and ancestors.
Each row parent, child, left, right in the output table indicates that child has inherited the segment [left, right) from parent more recently than from any other node in these lists.
In particular, suppose samples is a list of nodes such that time is 0 for each node, and ancestors is a list of nodes such that time is greater than 0.0 for each node. Then each row of the output table will show an interval [left, right) over which a node in samples has inherited most recently from a node in ancestors, or an interval over which one of these ancestors has inherited most recently from another node in ancestors.
The following table shows which parent->child pairs will be shown in the output of link_ancestors. A node is a relevant descendant on a given interval if it also appears somewhere in the parent column of the output table.
Type of relationship      Shown in output of link_ancestors
ancestor->sample          Always
ancestor1->ancestor2      Only if ancestor2 has a relevant descendant
sample1->sample2          Always
sample->ancestor          Only if ancestor has a relevant descendant
The difference between samples and ancestors is that information about the ancestors of a node in ancestors will only be retained if it also has a relevant descendant, while information about the ancestors of a node in samples will always be retained. The node IDs in parent and child refer to the IDs in the node table of the input tree sequence.
The supplied nodes must be non-empty lists of the node IDs in the tree sequence: in particular, they do not have to be samples of the tree sequence. The lists of samples and ancestors may overlap, although adding a node from samples to ancestors will not change the output. So, setting samples and ancestors to the same list of nodes will find all genealogical relationships within this list.
If none of the nodes in ancestors or samples are ancestral to samples anywhere in the tree sequence, an empty table will be returned.
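For example, a minimal sketch, assuming ts is an existing TreeSequence in which nodes 0, 1 and 2 are samples at time 0 and nodes 7 and 8 are older internal nodes (all IDs here are illustrative):
>>> tables = ts.dump_tables()
>>> links = tables.link_ancestors(samples=[0, 1, 2], ancestors=[7, 8])
>>> for row in links:
...     print(row.parent, row.child, row.left, row.right)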
-
ltrim
(record_provenance=True)[source]¶ Reset the coordinate system used in these tables, changing the left and right genomic positions in the edge table such that the leftmost edge now starts at position 0. This is identical to
TreeSequence.ltrim()
but acts in place to alter the data in this TableCollection
.- Parameters
record_provenance (bool) – If
True
, add details of this operation to the provenance table in this TableCollection. (Default: True
).
-
property
metadata
¶ The decoded metadata for this TableCollection.
-
property
metadata_bytes
¶ The raw bytes of metadata for this TableCollection
-
property
metadata_schema
¶ The
tskit.MetadataSchema
for this TableCollection.
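For example, a minimal sketch of attaching a JSON schema and some top-level metadata to an existing TableCollection called tables (the metadata contents are illustrative):
>>> tables.metadata_schema = tskit.MetadataSchema({"codec": "json"})
>>> tables.metadata = {"mean_coverage": 30.0}
>>> tables.metadata
{'mean_coverage': 30.0}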
-
property
name_map
¶ Returns a dictionary mapping table names to the corresponding table instances. For example, the returned dictionary will contain the key “edges” that maps to an
EdgeTable
instance.
-
property
nbytes
¶ Returns the total number of bytes required to store the data in this table collection. Note that this may not be equal to the actual memory footprint.
-
rtrim
(record_provenance=True)[source]¶ Reset the
sequence_length
property so that the sequence ends at the end of the last edge. This is identical to TreeSequence.rtrim()
but acts in place to alter the data in this TableCollection
.- Parameters
record_provenance (bool) – If
True
, add details of this operation to the provenance table in this TableCollection. (Default: True
).
-
simplify
(samples=None, *, reduce_to_site_topology=False, filter_populations=True, filter_individuals=True, filter_sites=True, keep_unary=False, keep_input_roots=False, record_provenance=True, filter_zero_mutation_sites=None)[source]¶ Simplifies the tables in place to retain only the information necessary to reconstruct the tree sequence describing the given
samples
. This will change the ID of the nodes, so that the nodesamples[k]
will have IDk
in the result. The resulting NodeTable will have only the firstlen(samples)
individuals marked as samples. The mapping from node IDs in the current set of tables to their equivalent values in the simplified tables is also returned as a numpy array. If an arraya
is returned by this function andu
is the ID of a node in the input table, thena[u]
is the ID of this node in the output table. For any nodeu
that is not mapped into the output tables, this mapping will equal-1
.Tables operated on by this function must: be sorted (see
TableCollection.sort()
), have children be born strictly after their parents, and the intervals on which any individual is a child must be disjoint. Other than this, the tables need not satisfy the remaining requirements to specify a valid tree sequence (but the resulting tables will). This is identical to
TreeSequence.simplify()
but acts in place to alter the data in this TableCollection
. Please see the TreeSequence.simplify()
method for a description of the remaining parameters.- Parameters
samples (list[int]) – A list of node IDs to retain as samples. If not specified or None, use all nodes marked with the IS_SAMPLE flag.
reduce_to_site_topology (bool) – Whether to reduce the topology down to the trees that are present at sites. (default: False).
filter_populations (bool) – If True, remove any populations that are not referenced by nodes after simplification; new population IDs are allocated sequentially from zero. If False, the population table will not be altered in any way. (Default: True)
filter_individuals (bool) – If True, remove any individuals that are not referenced by nodes after simplification; new individual IDs are allocated sequentially from zero. If False, the individual table will not be altered in any way. (Default: True)
filter_sites (bool) – If True, remove any sites that are not referenced by mutations after simplification; new site IDs are allocated sequentially from zero. If False, the site table will not be altered in any way. (Default: True)
keep_unary (bool) – If True, any unary nodes (i.e. nodes with exactly one child) that exist on the path from samples to root will be preserved in the output. (Default: False)
keep_input_roots (bool) – If True, insert edges from the MRCAs of the samples to the roots in the input trees. If False, no topology older than the MRCAs of the samples will be included. (Default: False)
record_provenance (bool) – If True, record details of this call to simplify in the returned tree sequence’s provenance information (Default: True).
filter_zero_mutation_sites (bool) – Deprecated alias for
filter_sites
.
- Returns
A numpy array mapping node IDs in the input tables to their corresponding node IDs in the output tables.
- Return type
numpy.ndarray (dtype=np.int32)
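For example, a minimal sketch, assuming ts is an existing TreeSequence that contains nodes 0, 1 and 2 (the IDs are illustrative):
>>> tables = ts.dump_tables()
>>> node_map = tables.simplify(samples=[0, 1, 2])
>>> # node_map[u] gives the ID of input node u in the simplified tables, or -1 if it was removed
>>> simplified = tables.tree_sequence()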
-
sort
(edge_start=0)[source]¶ Sorts the tables in place. This ensures that all tree sequence ordering requirements listed in the Valid tree sequence requirements section are met, as long as each site has at most one mutation (see below).
If the
edge_start
parameter is provided, this specifies the index in the edge table where sorting should start. Only rows with index greater than or equal toedge_start
are sorted; rows before this index are not affected. This parameter is provided to allow for efficient sorting when the user knows that the edges up to a given index are already sorted.The individual, node, population and provenance tables are not affected by this method.
Edges are sorted as follows:
time of parent, then
parent node ID, then
child node ID, then
left endpoint.
Note that this sorting order exceeds the edge sorting requirements for a valid tree sequence. For a valid tree sequence, we require that all edges for a given parent ID are adjacent, but we do not require that they be listed in sorted order.
Sites are sorted by position, and sites with the same position retain their relative ordering.
Mutations are sorted by site ID, and within the same site are sorted by time. Those with equal or unknown time retain their relative ordering. This does not currently rearrange tables so that mutations occur after their mutation parents, which is a requirement for valid tree sequences.
- Parameters
edge_start (int) – The index in the edge table where sorting starts (default=0; must be <= len(edges)).
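For example, a minimal sketch of the common pattern of editing a table collection and then restoring the ordering requirements before rebuilding a tree sequence (ts is assumed to be an existing TreeSequence, and the edits are left as a placeholder):
>>> tables = ts.dump_tables()
>>> # ... edit the tables in a way that may violate the ordering requirements ...
>>> tables.sort()
>>> ts_sorted = tables.tree_sequence()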
-
subset
(nodes, record_provenance=True)[source]¶ Modifies the tables in place to contain only the entries referring to the provided list of nodes, with nodes reordered according to the order they appear in the list. See
TreeSequence.subset()
for a more detailed description.
-
tree_sequence
()[source]¶ Returns a
TreeSequence
instance with the structure defined by the tables in thisTableCollection
. If the table collection is not in canonical form (i.e., does not meet sorting requirements) or cannot be interpreted as a tree sequence an exception is raised. Thesort()
method may be used to ensure that input sorting requirements are met. If the table collection does not have indexes they will be built.- Returns
A
TreeSequence
instance reflecting the structures defined in this set of tables.- Return type
TreeSequence
-
trim
(record_provenance=True)[source]¶ Trim away any empty regions on the right and left of the tree sequence encoded by these tables. This is identical to
TreeSequence.trim()
but acts in place to alter the data in this TableCollection
.- Parameters
record_provenance (bool) – If
True
, add details of this operation to the provenance table in this TableCollection. (Default: True
).
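For example, a minimal sketch, assuming ts is an existing TreeSequence:
>>> tables = ts.dump_tables()
>>> tables.trim()       # remove empty flanking regions on both sides
>>> trimmed = tables.tree_sequence()
>>> trimmed.sequence_length <= ts.sequence_length
True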
-
union
(other, node_mapping, check_shared_equality=True, add_populations=True, record_provenance=True)[source]¶ Modifies the table collection in place by adding the non-shared portions of
other
to itself. To perform the node-wise union, the method relies on anode_mapping
array, that maps nodes inother
to its equivalent node inself
ortskit.NULL
if the node is exclusive toother
. SeeTreeSequence.union()
for a more detailed description.- Parameters
other (TableCollection) – Another table collection.
node_mapping (list) – An array of node IDs that relate nodes in
other
to nodes inself
: the k-th element ofnode_mapping
should be the index of the equivalent node inself
, ortskit.NULL
if the node is not present inself
(in which case it will be added to self).
check_shared_equality (bool) – If True, the shared portions of the table collections will be checked for equality.
add_populations (bool) – If True, nodes new to
self
will be assigned new population IDs.
record_provenance (bool) – Whether to record a provenance entry in the provenance table for this operation.
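For example, a minimal sketch that appends all of the material in one tree sequence to another, treating every node of other as new; ts1 and ts2 are assumed to be existing TreeSequences with the same sequence_length:
>>> import numpy as np
>>> tables = ts1.dump_tables()
>>> other = ts2.dump_tables()
>>> # no nodes are shared, so map every node of `other` to tskit.NULL
>>> node_mapping = np.full(len(other.nodes), tskit.NULL, dtype=np.int32)
>>> tables.union(other, node_mapping, check_shared_equality=False)
>>> tables.sort()       # harmless if the result is already sorted
>>> combined = tables.tree_sequence()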
Tables¶
The tables API provides an efficient way of working
with and interchanging tree sequence data. Each table
class (e.g., NodeTable
, EdgeTable
) has a specific set of
columns with fixed types, and a set of methods for setting and getting the data
in these columns. The number of rows in the table t
is given by len(t)
.
Each table supports accessing the data either by row or column. To access the
row j
in table t
simply use t[j]
. The value returned by such an
access is an instance of collections.namedtuple()
, and therefore supports
either positional or named attribute access. To access the data in
a column, we can use standard attribute access which will return a numpy array
of the data. For example:
>>> import tskit
>>> t = tskit.EdgeTable()
>>> t.add_row(left=0, right=1, parent=10, child=11)
0
>>> t.add_row(left=1, right=2, parent=9, child=11)
1
>>> print(t)
id left right parent child
0 0.00000000 1.00000000 10 11
1 1.00000000 2.00000000 9 11
>>> t[0]
EdgeTableRow(left=0.0, right=1.0, parent=10, child=11)
>>> t[-1]
EdgeTableRow(left=1.0, right=2.0, parent=9, child=11)
>>> t.left
array([ 0., 1.])
>>> t.parent
array([10, 9], dtype=int32)
>>> len(t)
2
>>>
Tables also support the pickle
protocol, and so can be easily
serialised and deserialised (for example, when performing parallel
computations using the multiprocessing
module).
>>> import pickle
>>> serialised = pickle.dumps(t)
>>> t2 = pickle.loads(serialised)
>>> print(t2)
id left right parent child
0 0.00000000 1.00000000 10 11
1 1.00000000 2.00000000 9 11
However, pickling will not be as efficient as storing tables in the native format.
Tables support the equality operator ==
based on the data
held in the columns:
>>> t == t2
True
>>> t is t2
False
>>> t2.add_row(0, 1, 2, 3)
2
>>> print(t2)
id left right parent child
0 0.00000000 1.00000000 10 11
1 1.00000000 2.00000000 9 11
2 0.00000000 1.00000000 2 3
>>> t == t2
False
Text columns¶
As described in Encoding ragged columns, working with variable-length columns is somewhat more involved. Columns encoding text data store the encoded bytes of the flattened strings, and the offsets into this column in two separate arrays.
Consider the following example:
>>> t = tskit.SiteTable()
>>> t.add_row(0, "A")
>>> t.add_row(1, "BB")
>>> t.add_row(2, "")
>>> t.add_row(3, "CCC")
>>> print(t)
id position ancestral_state metadata
0 0.00000000 A
1 1.00000000 BB
2 2.00000000
3 3.00000000 CCC
>>> t[0]
SiteTableRow(position=0.0, ancestral_state='A', metadata=b'')
>>> t[1]
SiteTableRow(position=1.0, ancestral_state='BB', metadata=b'')
>>> t[2]
SiteTableRow(position=2.0, ancestral_state='', metadata=b'')
>>> t[3]
SiteTableRow(position=3.0, ancestral_state='CCC', metadata=b'')
Here we create a SiteTable
and add four rows, each with a different
ancestral_state
. We can then access this information from each
row in a straightforward manner. Working with the data in the columns
is a little trickier, however:
>>> t.ancestral_state
array([65, 66, 66, 67, 67, 67], dtype=int8)
>>> t.ancestral_state_offset
array([0, 1, 3, 3, 6], dtype=uint32)
>>> tskit.unpack_strings(t.ancestral_state, t.ancestral_state_offset)
['A', 'BB', '', 'CCC']
Here, the ancestral_state
array is the UTF8 encoded bytes of the flattened
strings, and the ancestral_state_offset
is the offset into this array
for each row. The tskit.unpack_strings()
function, however, is a convenient
way to recover the original strings from this encoding. We can also use the
tskit.pack_strings()
function to insert data using this approach:
>>> a, off = tskit.pack_strings(["0", "12", ""])
>>> t.set_columns(position=[0, 1, 2], ancestral_state=a, ancestral_state_offset=off)
>>> print(t)
id position ancestral_state metadata
0 0.00000000 0
1 1.00000000 12
2 2.00000000
When inserting many rows with standard infinite sites mutations (i.e.,
ancestral state is “0”), it is more efficient to construct the
numpy arrays directly than to create a list of strings and use
pack_strings()
. When doing this, it is important to note that
it is the encoded byte values that are stored; by default, we
use UTF8 (which corresponds to ASCII for simple printable characters):
>>> import numpy as np
>>> t_s = tskit.SiteTable()
>>> m = 10
>>> a = ord("0") + np.zeros(m, dtype=np.int8)
>>> off = np.arange(m + 1, dtype=np.uint32)
>>> t_s.set_columns(position=np.arange(m), ancestral_state=a, ancestral_state_offset=off)
>>> print(t_s)
id position ancestral_state metadata
0 0.00000000 0
1 1.00000000 0
2 2.00000000 0
3 3.00000000 0
4 4.00000000 0
5 5.00000000 0
6 6.00000000 0
7 7.00000000 0
8 8.00000000 0
9 9.00000000 0
>>> t_s.ancestral_state
array([48, 48, 48, 48, 48, 48, 48, 48, 48, 48], dtype=int8)
>>> t_s.ancestral_state_offset
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=uint32)
Here we create 10 sites at regular positions, each with ancestral state equal to
“0”. Note that we use ord("0")
to get the ASCII code for “0” (48), and create
10 copies of this by adding it to an array of zeros. We have done this for
illustration purposes: it is equivalent (though slower for large examples) to do
a, off = tskit.pack_strings(["0"] * m)
.
Mutations can be handled similarly:
>>> t_m = tskit.MutationTable()
>>> site = np.arange(m, dtype=np.int32)
>>> d, off = tskit.pack_strings(["1"] * m)
>>> node = np.zeros(m, dtype=np.int32)
>>> t_m.set_columns(site=site, node=node, derived_state=d, derived_state_offset=off)
>>> print(t_m)
id site node derived_state parent metadata
0 0 0 1 -1
1 1 0 1 -1
2 2 0 1 -1
3 3 0 1 -1
4 4 0 1 -1
5 5 0 1 -1
6 6 0 1 -1
7 7 0 1 -1
8 8 0 1 -1
9 9 0 1 -1
>>>
Binary columns¶
Columns storing binary data take the same approach as
Text columns to encoding
variable length data.
The difference between the two is that only raw bytes
values are accepted: no
character encoding or decoding is done on the data. Consider the following example
where a table has no metadata_schema
such that arbitrary bytes can be stored and
no automatic encoding or decoding of objects is performed by the Python API, so we can
store and retrieve raw bytes
. (See Metadata for details):
>>> t = tskit.NodeTable()
>>> t.add_row(metadata=b"raw bytes")
>>> t.add_row(metadata=pickle.dumps({"x": 1.1}))
>>> t[0].metadata
b'raw bytes'
>>> t[1].metadata
b'\x80\x03}q\x00X\x01\x00\x00\x00xq\x01G?\xf1\x99\x99\x99\x99\x99\x9as.'
>>> pickle.loads(t[1].metadata)
{'x': 1.1}
>>> print(t)
id flags population time metadata
0 0 -1 0.00000000000000 cmF3IGJ5dGVz
1 0 -1 0.00000000000000 gAN9cQBYAQAAAHhxAUc/8ZmZmZmZmnMu
>>> t.metadata
array([ 114, 97, 119, 32, 98, 121, 116, 101, 115, -128, 3,
125, 113, 0, 88, 1, 0, 0, 0, 120, 113, 1,
71, 63, -15, -103, -103, -103, -103, -103, -102, 115, 46], dtype=int8)
>>> t.metadata_offset
array([ 0, 9, 33], dtype=uint32)
Here we add two rows to a NodeTable
, with different
metadata. The first row contains a simple
byte string, and the second contains a Python dictionary serialised using
pickle
. We then show several different (and seemingly incompatible!)
views on the same data.
When we access the data in a row (e.g., t[0].metadata
) we are returned
a Python bytes object containing precisely the bytes that were inserted.
The pickled dictionary is encoded in 24 bytes containing unprintable
characters, and when we unpickle it using pickle.loads()
, we obtain
the original dictionary.
When we print the table, however, we see some data which is seemingly unrelated to the original contents. This is because the binary data is base64 encoded to ensure that it is print-safe (and doesn’t break your terminal). See the Metadata section for more information on the use of base64 encoding.
Finally, when we print the metadata
column, we see the raw byte values
encoded as signed integers. As for Text columns,
the metadata_offset
column encodes the offsets into this array. So, we
see that the first metadata value is 9 bytes long and the second is 24.
The tskit.pack_bytes()
and tskit.unpack_bytes()
functions are
also useful for encoding data in these columns.
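For example, a short sketch of round-tripping a list of raw metadata values through these functions:
>>> packed, offset = tskit.pack_bytes([b"first", b"", b"third"])
>>> tskit.unpack_bytes(packed, offset)
[b'first', b'', b'third']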
Table classes¶
This section describes the methods and variables available for each table class. For description and definition of each table’s meaning and use, see the table definitions.
-
class
tskit.
IndividualTable
[source]¶ A table defining the individuals in a tree sequence. Note that although each Individual has associated nodes, reference to these is not stored in the individual table, but rather reference to the individual is stored for each node in the
NodeTable
. This is similar to the way in which the relationship between sites and mutations is modelled.- Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
- Variables
flags (numpy.ndarray, dtype=np.uint32) – The array of flags values.
location (numpy.ndarray, dtype=np.float64) – The flattened array of floating point location values. See Encoding ragged columns for more details.
location_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the location column. See Encoding ragged columns for more details.
metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details.
metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.
metadata_schema (tskit.MetadataSchema) – The metadata schema for this table’s metadata column
-
__getitem__
(index)¶ Return the specified row of this table, decoding metadata if it is present. Supports negative indexing, e.g.
table[-5]
.- Parameters
index (int) – the zero-index of the desired row
-
add_row
(flags=0, location=None, metadata=None)[source]¶ Adds a new row to this
IndividualTable
and returns the ID of the corresponding individual. Metadata, if specified, will be validated and encoded according to the table’smetadata_schema
.- Parameters
flags (int) – The bitwise flags for the new individual.
location (array-like) – A list of numeric values or one-dimensional numpy array describing the location of this individual. If not specified or None, a zero-dimensional location is stored.
metadata (object) – Any object that is valid metadata for the table’s schema.
- Returns
The ID of the newly added individual.
- Return type
int
-
append_columns
(flags=None, location=None, location_offset=None, metadata=None, metadata_offset=None)[source]¶ Appends the specified arrays to the end of the columns in this
IndividualTable
. This allows many new rows to be added at once.The
flags
array is mandatory and defines the number of extra individuals to add to the table. Thelocation
andlocation_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information and Metadata for bulk table methods for an example of how to prepare metadata.- Parameters
flags (numpy.ndarray, dtype=np.uint32) – The bitwise flags for each individual. Required.
location (numpy.ndarray, dtype=np.float64) – The flattened location array. Must be specified along with
location_offset
. If not specified or None, an empty location value is stored for each individual.location_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
location
array.metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each individual.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.
-
asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.
-
clear
()¶ Deletes all rows in this table.
-
copy
()¶ Returns a deep copy of this table
-
equals
(other, ignore_metadata=False)¶ Returns True if self and other are equal. By default, two tables are considered equal if their columns and metadata schemas are byte-for-byte identical.
-
property
metadata_schema
¶ The
tskit.MetadataSchema
for this table.
-
property
nbytes
¶ Returns the total number of bytes required to store the data in this table. Note that this may not be equal to the actual memory footprint.
-
packset_location
(locations)[source]¶ Packs the specified list of location values and updates the
location
andlocation_offset
columns. The length of the locations array must be equal to the number of rows in the table.- Parameters
locations (list) – A list of locations interpreted as numpy float64 arrays.
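For example, a minimal sketch (the location values are illustrative):
>>> t = tskit.IndividualTable()
>>> t.add_row()
0
>>> t.add_row()
1
>>> t.packset_location([[0.0, 1.0], [2.5]])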
-
packset_metadata
(metadatas)¶ Packs the specified list of metadata values and updates the
metadata
andmetadata_offset
columns. The length of the metadatas array must be equal to the number of rows in the table.- Parameters
metadatas (list) – A list of metadata bytes values.
-
set_columns
(flags=None, location=None, location_offset=None, metadata=None, metadata_offset=None, metadata_schema=None)[source]¶ Sets the values for each column in this
IndividualTable
using the values in the specified arrays. Overwrites any data currently stored in the table.The
flags
array is mandatory and defines the number of individuals the table will contain. Thelocation
andlocation_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information and Metadata for bulk table methods for an example of how to prepare metadata.- Parameters
flags (numpy.ndarray, dtype=np.uint32) – The bitwise flags for each individual. Required.
location (numpy.ndarray, dtype=np.float64) – The flattened location array. Must be specified along with
location_offset
. If not specified or None, an empty location value is stored for each individual.location_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
location
array.metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each individual.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.metadata_schema – The encoded metadata schema.
-
class
tskit.
NodeTable
[source]¶ A table defining the nodes in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for a node table to be a part of a valid tree sequence.
- Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
- Variables
time (numpy.ndarray, dtype=np.float64) – The array of time values.
flags (numpy.ndarray, dtype=np.uint32) – The array of flags values.
population (numpy.ndarray, dtype=np.int32) – The array of population IDs.
individual (numpy.ndarray, dtype=np.int32) – The array of individual IDs that each node belongs to.
metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details.
metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.
metadata_schema (tskit.MetadataSchema) – The metadata schema for this table’s metadata column
-
__getitem__
(index)¶ Return the specified row of this table, decoding metadata if it is present. Supports negative indexing, e.g.
table[-5]
.- Parameters
index (int) – the zero-index of the desired row
-
add_row
(flags=0, time=0, population=- 1, individual=- 1, metadata=None)[source]¶ Adds a new row to this
NodeTable
and returns the ID of the corresponding node. Metadata, if specified, will be validated and encoded according to the table’smetadata_schema
.- Parameters
flags (int) – The bitwise flags for the new node.
time (float) – The birth time for the new node.
population (int) – The ID of the population in which the new node was born. Defaults to
tskit.NULL
.individual (int) – The ID of the individual in which the new node was born. Defaults to
tskit.NULL
.metadata (object) – Any object that is valid metadata for the table’s schema.
- Returns
The ID of the newly added node.
- Return type
int
-
append_columns
(flags=None, time=None, population=None, individual=None, metadata=None, metadata_offset=None)[source]¶ Appends the specified arrays to the end of the columns in this
NodeTable
. This allows many new rows to be added at once.The
flags
,time
andpopulation
arrays must all be of the same length, which is equal to the number of nodes that will be added to the table. Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information and Metadata for bulk table methods for an example of how to prepare metadata.- Parameters
flags (numpy.ndarray, dtype=np.uint32) – The bitwise flags for each node. Required.
time (numpy.ndarray, dtype=np.float64) – The time values for each node. Required.
population (numpy.ndarray, dtype=np.int32) – The population values for each node. If not specified or None, the
tskit.NULL
value is stored for each node.individual (numpy.ndarray, dtype=np.int32) – The individual values for each node. If not specified or None, the
tskit.NULL
value is stored for each node.metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.
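For example, a minimal sketch that adds three sample nodes in bulk (flag value 1 marks a node as a sample):
>>> import numpy as np
>>> t = tskit.NodeTable()
>>> t.append_columns(
...     flags=np.ones(3, dtype=np.uint32),
...     time=np.array([0.0, 0.0, 1.0]),
... )
>>> len(t)
3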
-
asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.
-
clear
()¶ Deletes all rows in this table.
-
copy
()¶ Returns a deep copy of this table
-
equals
(other, ignore_metadata=False)¶ Returns True if self and other are equal. By default, two tables are considered equal if their columns and metadata schemas are byte-for-byte identical.
-
property
metadata_schema
¶ The
tskit.MetadataSchema
for this table.
-
property
nbytes
¶ Returns the total number of bytes required to store the data in this table. Note that this may not be equal to the actual memory footprint.
-
packset_metadata
(metadatas)¶ Packs the specified list of metadata values and updates the
metadata
andmetadata_offset
columns. The length of the metadatas array must be equal to the number of rows in the table.- Parameters
metadatas (list) – A list of metadata bytes values.
-
set_columns
(flags=None, time=None, population=None, individual=None, metadata=None, metadata_offset=None, metadata_schema=None)[source]¶ Sets the values for each column in this
NodeTable
using the values in the specified arrays. Overwrites any data currently stored in the table.The
flags
,time
andpopulation
arrays must all be of the same length, which is equal to the number of nodes the table will contain. Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information and Metadata for bulk table methods for an example of how to prepare metadata.- Parameters
flags (numpy.ndarray, dtype=np.uint32) – The bitwise flags for each node. Required.
time (numpy.ndarray, dtype=np.float64) – The time values for each node. Required.
population (numpy.ndarray, dtype=np.int32) – The population values for each node. If not specified or None, the
tskit.NULL
value is stored for each node.individual (numpy.ndarray, dtype=np.int32) – The individual values for each node. If not specified or None, the
tskit.NULL
value is stored for each node.metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.metadata_schema – The encoded metadata schema.
-
class
tskit.
EdgeTable
[source]¶ A table defining the edges in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for an edge table to be a part of a valid tree sequence.
- Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
- Variables
left (numpy.ndarray, dtype=np.float64) – The array of left coordinates.
right (numpy.ndarray, dtype=np.float64) – The array of right coordinates.
parent (numpy.ndarray, dtype=np.int32) – The array of parent node IDs.
child (numpy.ndarray, dtype=np.int32) – The array of child node IDs.
metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details.
metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.
metadata_schema (tskit.MetadataSchema) – The metadata schema for this table’s metadata column
-
__getitem__
(index)¶ Return the specified row of this table, decoding metadata if it is present. Supports negative indexing, e.g.
table[-5]
.- Parameters
index (int) – the zero-index of the desired row
-
add_row
(left, right, parent, child, metadata=None)[source]¶ Adds a new row to this
EdgeTable
and returns the ID of the corresponding edge. Metadata, if specified, will be validated and encoded according to the table’smetadata_schema
- Parameters
left (float) – The left coordinate (inclusive).
right (float) – The right coordinate (exclusive).
parent (int) – The ID of the parent node.
child (int) – The ID of the child node.
metadata (object) – Any object that is valid metadata for the table’s schema.
- Returns
The ID of the newly added edge.
- Return type
int
-
append_columns
(left, right, parent, child, metadata=None, metadata_offset=None)[source]¶ Appends the specified arrays to the end of the columns of this
EdgeTable
. This allows many new rows to be added at once.The
left
,right
,parent
andchild
parameters are mandatory, and must be numpy arrays of the same length (which is equal to the number of additional edges to add to the table). Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information and Metadata for bulk table methods for an example of how to prepare metadata.- Parameters
left (numpy.ndarray, dtype=np.float64) – The left coordinates (inclusive).
right (numpy.ndarray, dtype=np.float64) – The right coordinates (exclusive).
parent (numpy.ndarray, dtype=np.int32) – The parent node IDs.
child (numpy.ndarray, dtype=np.int32) – The child node IDs.
metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.
-
asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.
-
clear
()¶ Deletes all rows in this table.
-
copy
()¶ Returns a deep copy of this table
-
equals
(other, ignore_metadata=False)¶ Returns True if self and other are equal. By default, two tables are considered equal if their columns and metadata schemas are byte-for-byte identical.
-
property
metadata_schema
¶ The
tskit.MetadataSchema
for this table.
-
property
nbytes
¶ Returns the total number of bytes required to store the data in this table. Note that this may not be equal to the actual memory footprint.
-
packset_metadata
(metadatas)¶ Packs the specified list of metadata values and updates the
metadata
andmetadata_offset
columns. The length of the metadatas array must be equal to the number of rows in the table.- Parameters
metadatas (list) – A list of metadata bytes values.
-
set_columns
(left=None, right=None, parent=None, child=None, metadata=None, metadata_offset=None, metadata_schema=None)[source]¶ Sets the values for each column in this
EdgeTable
using the values in the specified arrays. Overwrites any data currently stored in the table.The
left
,right
,parent
andchild
parameters are mandatory, and must be numpy arrays of the same length (which is equal to the number of edges the table will contain). Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information and Metadata for bulk table methods for an example of how to prepare metadata.- Parameters
left (numpy.ndarray, dtype=np.float64) – The left coordinates (inclusive).
right (numpy.ndarray, dtype=np.float64) – The right coordinates (exclusive).
parent (numpy.ndarray, dtype=np.int32) – The parent node IDs.
child (numpy.ndarray, dtype=np.int32) – The child node IDs.
metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.metadata_schema – The encoded metadata schema.
-
squash
()[source]¶ Sorts, then condenses the table into the smallest possible number of rows by combining any adjacent edges. A pair of edges is said to be adjacent if they have the same parent and child nodes, and if the left coordinate of one of the edges is equal to the right coordinate of the other edge. The
squash
method modifies anEdgeTable
in place so that any set of adjacent edges is replaced by a single edge. The new edge will have the same parent and child node, a left coordinate equal to the smallest left coordinate in the set, and a right coordinate equal to the largest right coordinate in the set. The new edge table will be sorted in the canonical order (P, C, L, R).
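For example, a minimal sketch in which two abutting edges for the same parent and child are merged (the node IDs are illustrative):
>>> t = tskit.EdgeTable()
>>> t.add_row(left=0, right=1, parent=5, child=1)
0
>>> t.add_row(left=1, right=2, parent=5, child=1)
1
>>> t.squash()
>>> len(t)      # the two edges have been combined into one covering [0, 2)
1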
-
class
tskit.
MigrationTable
[source]¶ A table defining the migrations in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for a migration table to be a part of a valid tree sequence.
- Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
- Variables
left (numpy.ndarray, dtype=np.float64) – The array of left coordinates.
right (numpy.ndarray, dtype=np.float64) – The array of right coordinates.
node (numpy.ndarray, dtype=np.int32) – The array of node IDs.
source (numpy.ndarray, dtype=np.int32) – The array of source population IDs.
dest (numpy.ndarray, dtype=np.int32) – The array of destination population IDs.
time (numpy.ndarray, dtype=np.float64) – The array of time values.
metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details.
metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.
metadata_schema (tskit.MetadataSchema) – The metadata schema for this table’s metadata column
-
__getitem__
(index)¶ Return the specified row of this table, decoding metadata if it is present. Supports negative indexing, e.g.
table[-5]
.- Parameters
index (int) – the zero-index of the desired row
-
add_row
(left, right, node, source, dest, time, metadata=None)[source]¶ Adds a new row to this
MigrationTable
and returns the ID of the corresponding migration. Metadata, if specified, will be validated and encoded according to the table’smetadata_schema
.- Parameters
left (float) – The left coordinate (inclusive).
right (float) – The right coordinate (exclusive).
node (int) – The node ID.
source (int) – The ID of the source population.
dest (int) – The ID of the destination population.
time (float) – The time of the migration event.
metadata (object) – Any object that is valid metadata for the table’s schema.
- Returns
The ID of the newly added migration.
- Return type
int
-
append_columns
(left, right, node, source, dest, time, metadata=None, metadata_offset=None)[source]¶ Appends the specified arrays to the end of the columns of this
MigrationTable
. This allows many new rows to be added at once.All parameters except
metadata
andmetadata_offset
are mandatory, and must be numpy arrays of the same length (which is equal to the number of additional migrations to add to the table). The metadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information and Metadata for bulk table methods for an example of how to prepare metadata.- Parameters
left (numpy.ndarray, dtype=np.float64) – The left coordinates (inclusive).
right (numpy.ndarray, dtype=np.float64) – The right coordinates (exclusive).
node (numpy.ndarray, dtype=np.int32) – The node IDs.
source (numpy.ndarray, dtype=np.int32) – The source population IDs.
dest (numpy.ndarray, dtype=np.int32) – The destination population IDs.
time (numpy.ndarray, dtype=np.float64) – The time of each migration.
metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each migration.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.
-
asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.
-
clear
()¶ Deletes all rows in this table.
-
copy
()¶ Returns a deep copy of this table
-
equals
(other, ignore_metadata=False)¶ Returns True if self and other are equal. By default, two tables are considered equal if their columns and metadata schemas are byte-for-byte identical.
-
property
metadata_schema
¶ The
tskit.MetadataSchema
for this table.
-
property
nbytes
¶ Returns the total number of bytes required to store the data in this table. Note that this may not be equal to the actual memory footprint.
-
packset_metadata
(metadatas)¶ Packs the specified list of metadata values and updates the
metadata
andmetadata_offset
columns. The length of the metadatas array must be equal to the number of rows in the table.- Parameters
metadatas (list) – A list of metadata bytes values.
-
set_columns
(left=None, right=None, node=None, source=None, dest=None, time=None, metadata=None, metadata_offset=None, metadata_schema=None)[source]¶ Sets the values for each column in this
MigrationTable
using the values in the specified arrays. Overwrites any data currently stored in the table.All parameters except
metadata
andmetadata_offset
are mandatory, and must be numpy arrays of the same length (which is equal to the number of migrations the table will contain). The metadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information and Metadata for bulk table methods for an example of how to prepare metadata.- Parameters
left (numpy.ndarray, dtype=np.float64) – The left coordinates (inclusive).
right (numpy.ndarray, dtype=np.float64) – The right coordinates (exclusive).
node (numpy.ndarray, dtype=np.int32) – The node IDs.
source (numpy.ndarray, dtype=np.int32) – The source population IDs.
dest (numpy.ndarray, dtype=np.int32) – The destination population IDs.
time (numpy.ndarray, dtype=np.float64) – The time of each migration.
metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each migration.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.metadata_schema – The encoded metadata schema.
-
class
tskit.
SiteTable
[source]¶ A table defining the sites in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for a site table to be a part of a valid tree sequence.
- Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
- Variables
position (numpy.ndarray, dtype=np.float64) – The array of site position coordinates.
ancestral_state (numpy.ndarray, dtype=np.int8) – The flattened array of ancestral state strings. See Text columns for more details.
ancestral_state_offset (numpy.ndarray, dtype=np.uint32) – The offsets of rows in the ancestral_state array. See Text columns for more details.
metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details.
metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.
metadata_schema (tskit.MetadataSchema) – The metadata schema for this table’s metadata column
-
__getitem__
(index)¶ Return the specified row of this table, decoding metadata if it is present. Supports negative indexing, e.g.
table[-5]
.- Parameters
index (int) – the zero-index of the desired row
-
add_row
(position, ancestral_state, metadata=None)[source]¶ Adds a new row to this
SiteTable
and returns the ID of the corresponding site. Metadata, if specified, will be validated and encoded according to the table’smetadata_schema
.
-
append_columns
(position, ancestral_state, ancestral_state_offset, metadata=None, metadata_offset=None)[source]¶ Appends the specified arrays to the end of the columns of this
SiteTable
. This allows many new rows to be added at once.The
position
,ancestral_state
andancestral_state_offset
parameters are mandatory, and must be 1D numpy arrays. The length of theposition
array determines the number of additional rows to add to the table. The ancestral_state
andancestral_state_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Text columns for more information). Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Binary columns for more information) and Metadata for bulk table methods for an example of how to prepare metadata.- Parameters
position (numpy.ndarray, dtype=np.float64) – The position of each site in genome coordinates.
ancestral_state (numpy.ndarray, dtype=np.int8) – The flattened ancestral_state array. Required.
ancestral_state_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
ancestral_state
array.metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.
-
asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.
-
clear
()¶ Deletes all rows in this table.
-
copy
()¶ Returns a deep copy of this table
-
equals
(other, ignore_metadata=False)¶ Returns True if self and other are equal. By default, two tables are considered equal if their columns and metadata schemas are byte-for-byte identical.
-
property
metadata_schema
¶ The
tskit.MetadataSchema
for this table.
-
property
nbytes
¶ Returns the total number of bytes required to store the data in this table. Note that this may not be equal to the actual memory footprint.
-
packset_ancestral_state
(ancestral_states)[source]¶ Packs the specified list of ancestral_state values and updates the
ancestral_state
andancestral_state_offset
columns. The length of the ancestral_states array must be equal to the number of rows in the table.
-
packset_metadata
(metadatas)¶ Packs the specified list of metadata values and updates the
metadata
andmetadata_offset
columns. The length of the metadatas array must be equal to the number of rows in the table.- Parameters
metadatas (list) – A list of metadata bytes values.
-
set_columns
(position=None, ancestral_state=None, ancestral_state_offset=None, metadata=None, metadata_offset=None, metadata_schema=None)[source]¶ Sets the values for each column in this
SiteTable
using the values in the specified arrays. Overwrites any data currently stored in the table.The
position
,ancestral_state
andancestral_state_offset
parameters are mandatory, and must be 1D numpy arrays. The length of theposition
array determines the number of rows in the table. The ancestral_state
andancestral_state_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Text columns for more information). Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Binary columns for more information) and Metadata for bulk table methods for an example of how to prepare metadata.- Parameters
position (numpy.ndarray, dtype=np.float64) – The position of each site in genome coordinates.
ancestral_state (numpy.ndarray, dtype=np.int8) – The flattened ancestral_state array. Required.
ancestral_state_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
ancestral_state
array.metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.metadata_schema – The encoded metadata schema.
-
class
tskit.
MutationTable
[source]¶ A table defining the mutations in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for a mutation table to be a part of a valid tree sequence.
- Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
- Variables
site (numpy.ndarray, dtype=np.int32) – The array of site IDs.
node (numpy.ndarray, dtype=np.int32) – The array of node IDs.
time (numpy.ndarray, dtype=np.float64) – The array of time values.
derived_state (numpy.ndarray, dtype=np.int8) – The flattened array of derived state strings. See Text columns for more details.
derived_state_offset (numpy.ndarray, dtype=np.uint32) – The offsets of rows in the derived_state array. See Text columns for more details.
parent (numpy.ndarray, dtype=np.int32) – The array of parent mutation IDs.
metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details.
metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.
metadata_schema (tskit.MetadataSchema) – The metadata schema for this table’s metadata column
-
__getitem__
(index)¶ Return the specified row of this table, decoding metadata if it is present. Supports negative indexing, e.g.
table[-5]
.- Parameters
index (int) – the zero-index of the desired row
-
add_row
(site, node, derived_state, parent=- 1, metadata=None, time=None)[source]¶ Adds a new row to this
MutationTable
and returns the ID of the corresponding mutation. Metadata, if specified, will be validated and encoded according to the table’smetadata_schema
.- Parameters
site (int) – The ID of the site that this mutation occurs at.
node (int) – The ID of the first node inheriting this mutation.
derived_state (str) – The state of the site at this mutation’s node.
parent (int) – The ID of the parent mutation. If not specified, defaults to
NULL
.metadata (object) – Any object that is valid metadata for the table’s schema.
time (float) – The occurrence time for the new mutation. If not specified, defaults to
UNKNOWN_TIME
, indicating the time is unknown.
- Returns
The ID of the newly added mutation.
- Return type
int
-
append_columns
(site, node, derived_state, derived_state_offset, parent=None, time=None, metadata=None, metadata_offset=None)[source]¶ Appends the specified arrays to the end of the columns of this
MutationTable
. This allows many new rows to be added at once.The
site
,node
,derived_state
andderived_state_offset
parameters are mandatory, and must be 1D numpy arrays. Thesite
andnode
(alsotime
andparent
, if supplied) arrays must be of equal length, and determine the number of additional rows to add to the table. Thederived_state
andderived_state_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Text columns for more information). Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Binary columns for more information) and Metadata for bulk table methods for an example of how to prepare metadata.- Parameters
site (numpy.ndarray, dtype=np.int32) – The ID of the site each mutation occurs at.
node (numpy.ndarray, dtype=np.int32) – The ID of the node each mutation is associated with.
time (numpy.ndarray, dtype=np.float64) – The time values for each mutation.
derived_state (numpy.ndarray, dtype=np.int8) – The flattened derived_state array. Required.
derived_state_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
derived_state
array.parent (numpy.ndarray, dtype=np.int32) – The ID of the parent mutation for each mutation.
metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.
-
asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.
-
clear
()¶ Deletes all rows in this table.
-
copy
()¶ Returns a deep copy of this table
-
equals
(other, ignore_metadata=False)¶ Returns True if self and other are equal. By default, two tables are considered equal if their columns and metadata schemas are byte-for-byte identical.
-
property
metadata_schema
¶ The
tskit.MetadataSchema
for this table.
-
property
nbytes
¶ Returns the total number of bytes required to store the data in this table. Note that this may not be equal to the actual memory footprint.
-
packset_derived_state
(derived_states)[source]¶ Packs the specified list of derived_state values and updates the
derived_state
andderived_state_offset
columns. The length of the derived_states array must be equal to the number of rows in the table.
-
packset_metadata
(metadatas)¶ Packs the specified list of metadata values and updates the
metadata
andmetadata_offset
columns. The length of the metadatas array must be equal to the number of rows in the table.- Parameters
metadatas (list) – A list of metadata bytes values.
-
set_columns
(site=None, node=None, time=None, derived_state=None, derived_state_offset=None, parent=None, metadata=None, metadata_offset=None, metadata_schema=None)[source]¶ Sets the values for each column in this
MutationTable
using the values in the specified arrays. Overwrites any data currently stored in the table.The
site
,node
,derived_state
andderived_state_offset
parameters are mandatory, and must be 1D numpy arrays. Thesite
andnode
(alsoparent
andtime
, if supplied) arrays must be of equal length, and determine the number of rows in the table. Thederived_state
andderived_state_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Text columns for more information). Themetadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Binary columns for more information) and Metadata for bulk table methods for an example of how to prepare metadata.- Parameters
site (numpy.ndarray, dtype=np.int32) – The ID of the site each mutation occurs at.
node (numpy.ndarray, dtype=np.int32) – The ID of the node each mutation is associated with.
time (numpy.ndarray, dtype=np.float64) – The time values for each mutation.
derived_state (numpy.ndarray, dtype=np.int8) – The flattened derived_state array. Required.
derived_state_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
derived_state
array.parent (numpy.ndarray, dtype=np.int32) – The ID of the parent mutation for each mutation.
metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.metadata_schema – The encoded metadata schema.
-
class
tskit.
PopulationTable
[source]¶ A table defining the populations referred to in a tree sequence. The PopulationTable stores metadata for populations that may be referred to in the NodeTable and MigrationTable. Note that although nodes may be associated with populations, this association is stored in the
NodeTable
: only metadata on each population is stored in the population table.- Warning
The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays.
NOTE: this behaviour may change in future.
- Variables
metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details.
metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.
metadata_schema (tskit.MetadataSchema) – The metadata schema for this table’s metadata column
-
__getitem__
(index)¶ Return the specified row of this table, decoding metadata if it is present. Supports negative indexing, e.g.
table[-5]
.- Parameters
index (int) – the zero-index of the desired row
-
add_row
(metadata=None)[source]¶ Adds a new row to this
PopulationTable
and returns the ID of the corresponding population. Metadata, if specified, will be validated and encoded according to the table’smetadata_schema
.
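For example, a minimal sketch, assuming no metadata schema has been set so that raw bytes are stored (the metadata contents are illustrative):
>>> t = tskit.PopulationTable()
>>> t.add_row(metadata=b'{"name": "CEU"}')
0
>>> t[0].metadata
b'{"name": "CEU"}'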
-
append_columns
(metadata=None, metadata_offset=None)[source]¶ Appends the specified arrays to the end of the columns of this
PopulationTable
. This allows many new rows to be added at once.The
metadata
andmetadata_offset
parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Binary columns for more information) and Metadata for bulk table methods for an example of how to prepare metadata.- Parameters
metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with
metadata_offset
. If not specified or None, an empty metadata value is stored for each node.metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the
metadata
array.
-
asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.
-
clear
()¶ Deletes all rows in this table.
-
copy
()¶ Returns a deep copy of this table
-
equals
(other, ignore_metadata=False)¶ Returns True if self and other are equal. By default, two tables are considered equal if their columns and metadata schemas are byte-for-byte identical.
-
property
metadata_schema
¶ The
tskit.MetadataSchema
for this table.
-
property
nbytes
¶ Returns the total number of bytes required to store the data in this table. Note that this may not be equal to the actual memory footprint.
-
packset_metadata
(metadatas)¶ Packs the specified list of metadata values and updates the
metadata
and metadata_offset
columns. The length of the metadatas array must be equal to the number of rows in the table.
- Parameters
metadatas (list) – A list of metadata bytes values.
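A hedged sketch of bulk-setting raw metadata on an existing PopulationTable (the table contents and JSON payloads are illustrative; no metadata schema is set, so the stored values remain raw bytes):

    import json
    import tskit

    tables = tskit.TableCollection(sequence_length=1)
    for _ in range(3):
        tables.populations.add_row()

    # One encoded metadata value per existing row, packed in a single call.
    metadatas = [json.dumps({"name": f"pop_{j}"}).encode() for j in range(3)]
    tables.populations.packset_metadata(metadatas)
    print(tables.populations[0].metadata)  # b'{"name": "pop_0"}'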
-
set_columns
(metadata=None, metadata_offset=None, metadata_schema=None)[source]¶ Sets the values for each column in this
PopulationTable
using the values in the specified arrays. Overwrites any data currently stored in the table.
The
metadata
and
metadata_offset
parameters must be supplied together and must meet the requirements for Encoding ragged columns (see Binary columns for more information). See Metadata for bulk table methods for an example of how to prepare metadata.
- Parameters
metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with metadata_offset. If not specified or None, an empty metadata value is stored for each population.
metadata_offset (numpy.ndarray, dtype=np.uint32) – The offsets into the metadata array.
metadata_schema – The encoded metadata schema.
-
class
tskit.
ProvenanceTable
[source]¶ A table recording the provenance (i.e., history) of this table, so that the origin of the underlying data and sequence of subsequent operations can be traced. Each row contains a “record” string (recommended format: JSON) and a timestamp.
Todo
The format of the record field will be more precisely specified in the future.
- Variables
record (numpy.ndarray, dtype=np.int8) – The flattened array containing the record strings. See Text columns for more details.
record_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the record column. See Text columns for more details.
timestamp (numpy.ndarray, dtype=np.int8) – The flattened array containing the timestamp strings. See Text columns for more details.
timestamp_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the timestamp column. See Text columns for more details.
-
add_row
(record, timestamp=None)[source]¶ Adds a new row to this ProvenanceTable consisting of the specified record and timestamp. If timestamp is not specified, it is automatically generated from the current time.
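A minimal sketch of recording a provenance entry as a JSON string (the record contents here are purely illustrative):

    import json
    import tskit

    tables = tskit.TableCollection(sequence_length=1)
    record = json.dumps({"software": {"name": "my_tool", "version": "1.0"}})
    row_id = tables.provenances.add_row(record=record)
    print(tables.provenances[row_id].timestamp)  # automatically generated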
-
append_columns
(timestamp=None, timestamp_offset=None, record=None, record_offset=None)[source]¶ Appends the specified arrays to the end of the columns of this
ProvenanceTable
. This allows many new rows to be added at once.
The
timestamp
and
timestamp_offset
parameters must be supplied together and must meet the requirements for Encoding ragged columns (see Binary columns for more information). Likewise for the record and record_offset columns.
- Parameters
timestamp (numpy.ndarray, dtype=np.int8) – The flattened timestamp array. Must be specified along with timestamp_offset. If not specified or None, an empty timestamp value is stored for each row.
timestamp_offset (numpy.ndarray, dtype=np.uint32) – The offsets into the timestamp array.
record (numpy.ndarray, dtype=np.int8) – The flattened record array. Must be specified along with record_offset. If not specified or None, an empty record value is stored for each row.
record_offset (numpy.ndarray, dtype=np.uint32) – The offsets into the record array.
-
asdict
()¶ Returns a dictionary mapping the names of the columns in this table to the corresponding numpy arrays.
-
clear
()¶ Deletes all rows in this table.
-
copy
()¶ Returns a deep copy of this table.
-
equals
(other, ignore_timestamps=False)[source]¶ Returns True if self and other are equal. By default, two provenance tables are considered equal if their columns are byte-for-byte identical.
-
property
nbytes
¶ Returns the total number of bytes required to store the data in this table. Note that this may not be equal to the actual memory footprint.
-
packset_record
(records)[source]¶ Packs the specified list of record values and updates the
record
and record_offset
columns. The length of the records array must be equal to the number of rows in the table.
-
packset_timestamp
(timestamps)[source]¶ Packs the specified list of timestamp values and updates the
timestamp
and timestamp_offset
columns. The length of the timestamps array must be equal to the number of rows in the table.
-
set_columns
(timestamp=None, timestamp_offset=None, record=None, record_offset=None)[source]¶ Sets the values for each column in this
ProvenanceTable
using the values in the specified arrays. Overwrites any data currently stored in the table.
The
timestamp
and
timestamp_offset
parameters must be supplied together and must meet the requirements for Encoding ragged columns (see Binary columns for more information). Likewise for the record and record_offset columns.
- Parameters
timestamp (numpy.ndarray, dtype=np.int8) – The flattened timestamp array. Must be specified along with timestamp_offset. If not specified or None, an empty timestamp value is stored for each row.
timestamp_offset (numpy.ndarray, dtype=np.uint32) – The offsets into the timestamp array.
record (numpy.ndarray, dtype=np.int8) – The flattened record array. Must be specified along with record_offset. If not specified or None, an empty record value is stored for each row.
record_offset (numpy.ndarray, dtype=np.uint32) – The offsets into the record array.
Table functions¶
-
tskit.
parse_nodes
(source, strict=True, encoding='utf8', base64_metadata=True, table=None)[source]¶ Parses the specified file-like object containing a whitespace delimited description of a node table and returns the corresponding
NodeTable
instance. See the node text format section for the details of the required format and the node table definition section for the required properties of the contents. See
tskit.load_text()
for a detailed explanation of the strict parameter.
- Parameters
source (io.TextIOBase) – The file-like object containing the text.
strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used.
encoding (str) – Encoding used for text representation.
base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text.
table (NodeTable) – If specified, write into this table. If not, create a new
NodeTable
instance.
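A hedged sketch of parsing a small node table from text, assuming the standard whitespace-delimited node format with is_sample and time columns:

    import io
    import tskit

    nodes_text = io.StringIO("""\
    id  is_sample   time
    0   1           0.0
    1   1           0.0
    2   0           1.0
    """)
    node_table = tskit.parse_nodes(nodes_text, strict=False)
    print(node_table.num_rows)  # 3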
-
tskit.
parse_edges
(source, strict=True, table=None)[source]¶ Parses the specified file-like object containing a whitespace delimited description of an edge table and returns the corresponding
EdgeTable
instance. See the edge text format section for the details of the required format and the edge table definition section for the required properties of the contents. See
tskit.load_text()
for a detailed explanation of the strict parameter.
- Parameters
source (io.TextIOBase) – The file-like object containing the text.
strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used.
table (EdgeTable) – If specified, write the edges into this table. If not, create a new
EdgeTable
instance and return.
-
tskit.
parse_sites
(source, strict=True, encoding='utf8', base64_metadata=True, table=None)[source]¶ Parses the specified file-like object containing a whitespace delimited description of a site table and returns the corresponding
SiteTable
instance. See the site text format section for the details of the required format and the site table definition section for the required properties of the contents. See
tskit.load_text()
for a detailed explanation of the strict parameter.
- Parameters
source (io.TextIOBase) – The file-like object containing the text.
strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used.
encoding (str) – Encoding used for text representation.
base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text.
table (SiteTable) – If specified, write sites into this table. If not, create a new
SiteTable
instance.
-
tskit.
parse_mutations
(source, strict=True, encoding='utf8', base64_metadata=True, table=None)[source]¶ Parses the specified file-like object containing a whitespace delimited description of a mutation table and returns the corresponding
MutationTable
instance. See the mutation text format section for the details of the required format and the mutation table definition section for the required properties of the contents. Note that if the time column is missing, its entries are filled with UNKNOWN_TIME. See
tskit.load_text()
for a detailed explanation of the strict parameter.
- Parameters
source (io.TextIOBase) – The file-like object containing the text.
strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used.
encoding (str) – Encoding used for text representation.
base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text.
table (MutationTable) – If specified, write mutations into this table. If not, create a new
MutationTable
instance.
-
tskit.
parse_individuals
(source, strict=True, encoding='utf8', base64_metadata=True, table=None)[source]¶ Parses the specified file-like object containing a whitespace delimited description of an individual table and returns the corresponding
IndividualTable
instance. See the individual text format section for the details of the required format and the individual table definition section for the required properties of the contents. See
tskit.load_text()
for a detailed explanation of the strict parameter.
- Parameters
source (io.TextIOBase) – The file-like object containing the text.
strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used.
encoding (str) – Encoding used for text representation.
base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text.
table (IndividualTable) – If specified, write into this table. If not, create a new
IndividualTable
instance.
-
tskit.
parse_populations
(source, strict=True, encoding='utf8', base64_metadata=True, table=None)[source]¶ Parses the specified file-like object containing a whitespace delimited description of a population table and returns the corresponding
PopulationTable
instance. See the population text format section for the details of the required format and the population table definition section for the required properties of the contents. See
tskit.load_text()
for a detailed explanation of the strict parameter.
- Parameters
source (io.TextIOBase) – The file-like object containing the text.
strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used.
encoding (str) – Encoding used for text representation.
base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text.
table (PopulationTable) – If specified, write into this table. If not, create a new
PopulationTable
instance.
-
tskit.
pack_strings
(strings, encoding='utf8')[source]¶ Packs the specified list of strings into a flattened numpy array of 8-bit integers and corresponding offsets using the specified text encoding. See Encoding ragged columns for details of this encoding of columns of variable length data.
- Parameters
strings (list) – The list of strings to encode.
encoding (str) – The text encoding to use when converting strings to bytes. See the codecs module for information on available string encodings.
- Returns
The tuple (packed, offset) of numpy arrays representing the flattened input data and offsets.
- Return type
numpy.ndarray (dtype=np.int8), numpy.ndarray (dtype=np.uint32)
-
tskit.
unpack_strings
(packed, offset, encoding='utf8')[source]¶ Unpacks a list of strings from the specified numpy arrays of packed byte data and corresponding offsets using the specified text encoding. See Encoding ragged columns for details of this encoding of columns of variable length data.
- Parameters
packed (numpy.ndarray) – The flattened array of byte values.
offset (numpy.ndarray) – The array of offsets into the packed array.
encoding (str) – The text encoding to use when converting string data to bytes. See the codecs module for information on available string encodings.
- Returns
The list of strings unpacked from the parameter arrays.
- Return type
list of str
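A minimal round-trip sketch using pack_strings and unpack_strings together:

    import tskit

    strings = ["A", "ACGT", ""]
    # Flatten the strings into a byte column plus its offsets, then recover them.
    packed, offset = tskit.pack_strings(strings)
    assert tskit.unpack_strings(packed, offset) == strings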
-
tskit.
pack_bytes
(data)[source]¶ Packs the specified list of bytes into a flattened numpy array of 8-bit integers and corresponding offsets. See Encoding ragged columns for details of this encoding.
- Parameters
data (list) – The list of bytes values to encode.
- Returns
The tuple (packed, offset) of numpy arrays representing the flattened input data and offsets.
- Return type
numpy.ndarray (dtype=np.int8), numpy.ndarray (dtype=np.uint32)
-
tskit.
unpack_bytes
(packed, offset)[source]¶ Unpacks a list of bytes from the specified numpy arrays of packed byte data and corresponding offsets. See Encoding ragged columns for details of this encoding.
- Parameters
packed (numpy.ndarray) – The flattened array of byte values.
offset (numpy.ndarray) – The array of offsets into the
packed
array.
- Returns
The list of bytes values unpacked from the parameter arrays.
- Return type
list of bytes
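A minimal round-trip sketch using pack_bytes and unpack_bytes, for example to prepare a raw metadata column and its offsets:

    import tskit

    data = [b"\x00\x01", b"", b"raw metadata"]
    # Flatten the bytes values into a single column plus offsets, then recover them.
    packed, offset = tskit.pack_bytes(data)
    assert tskit.unpack_bytes(packed, offset) == data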
Metadata API¶
The metadata
module provides validation, encoding and decoding of metadata
using a schema. See Metadata, Python Metadata API Overview and
Working with Metadata.
-
class
tskit.
MetadataSchema
(schema: Optional[Mapping[str, Any]])[source]¶ Class for validating, encoding and decoding metadata.
- Parameters
schema (dict) – A dict containing a valid JSONSchema object.
-
decode_row
(row: bytes) → Any[source]¶ Decode an encoded row (bytes) of metadata, using the codec specified in the schema, and return a Python dict. Note that no validation of the metadata against the schema is performed.
-
encode_row
(row: Any) → bytes[source]¶ Encode a row (dict) of metadata to its binary representation (bytes) using the codec specified in the schema. Note that unlike
validate_and_encode_row()
, no validation against the schema is performed. Use this only for performance reasons, when a validation check is not needed.
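A hedged sketch of a simple JSON-codec schema used to encode and then decode one row of metadata (the schema properties shown are illustrative):

    import tskit

    schema = tskit.MetadataSchema({
        "codec": "json",
        "type": "object",
        "properties": {"name": {"type": "string"}},
    })
    # validate_and_encode_row checks the row against the schema before encoding.
    encoded = schema.validate_and_encode_row({"name": "sample_0"})
    assert schema.decode_row(encoded) == {"name": "sample_0"}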
-
tskit.
register_metadata_codec
(codec_cls: Type[tskit.metadata.AbstractMetadataCodec], codec_id: str) → None[source]¶ Register a metadata codec class. This function maintains a mapping from metadata codec identifiers used in schemas to codec classes. When a codec class is registered, it will replace any class previously registered under the same codec identifier, if present.
- Parameters
codec_id (str) – String to use to refer to the codec in the schema.
Combinatorics API¶
The combinatorics API deals with tree topologies, allowing them to be counted,
listed and generated: see Combinatorics for a detailed description. Briefly,
the position of a tree in the enumeration all_trees
can be obtained using the tree’s
rank()
method. Inversely, a Tree
can be constructed from a position
in the enumeration with Tree.unrank()
. Generated trees are associated with a new
tree sequence containing only that tree for the entire genome (i.e. with
num_trees
= 1 and a sequence_length
equal to
the span
of the tree).
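A hedged sketch of the rank/unrank round trip for three-leaf trees (assuming Tree.unrank accepts the number of leaves and a rank, as described in Combinatorics):

    import tskit

    for tree in tskit.all_trees(num_leaves=3):
        rank = tree.rank()
        # Reconstruct a single-tree tree sequence from the rank.
        reconstructed = tskit.Tree.unrank(3, rank)
        assert reconstructed.rank() == rank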
-
tskit.
all_trees
(num_leaves, span=1)[source]¶ Generates all unique leaf-labelled trees with
num_leaves
leaves. See Combinatorics for details of this enumeration. The leaf labels are selected from the set [0, num_leaves). The times and labels on internal nodes are chosen arbitrarily.
- Parameters
num_leaves (int) – The number of leaves of each returned tree.
span (float) – The genomic span of each returned tree.
- Return type
-
tskit.
all_tree_shapes
(num_leaves, span=1)[source]¶ Generates all unique shapes of trees with
num_leaves
leaves.
- Parameters
num_leaves (int) – The number of leaves of each returned tree.
span (float) – The genomic span of each returned tree.
- Return type
-
tskit.
all_tree_labellings
(tree, span=1)[source]¶ Generates all unique labellings of the leaves of a
tskit.Tree
. Leaves are labelled from the set [0, n), where n is the number of leaves of tree.
- Parameters
tree (tskit.Tree) – The tree used to generate labelled trees of the same shape.
span (float) – The genomic span of each returned tree.
- Return type
-
class
tskit.
TopologyCounter
[source]¶ Contains the distributions of embedded topologies for every combination of the sample sets used to generate the
TopologyCounter
. It is indexable by a combination of sample set indexes and returns a collections.Counter whose keys are topology ranks (see Interpreting Tree Ranks). See Tree.count_topologies()
for more detail on how this structure is used.
Linkage disequilibrium¶
Note
This API will soon be deprecated in favour of multi-site extensions to the Statistics API.
-
class
tskit.
LdCalculator
(tree_sequence)[source]¶ Class for calculating linkage disequilibrium coefficients between pairs of mutations in a
TreeSequence
. This class requires the numpy library.
This class supports multithreaded access using the Python
threading
module. Separate instances of
LdCalculator
referencing the same tree sequence can operate in parallel in multiple threads.
Note
This class does not currently support sites that have more than one mutation. Using it on such a tree sequence will raise a LibraryError with an “Unsupported operation” message.
- Parameters
tree_sequence (TreeSequence) – The tree sequence containing the mutations we are interested in.
-
r2
(a, b)[source]¶ Returns the value of the \(r^2\) statistic between the pair of mutations at the specified indexes. This is not an efficient method for computing large numbers of pairwise values; please use either
r2_array()
or r2_matrix()
for this purpose.
-
r2_array
(a, direction=1, max_mutations=None, max_distance=None)[source]¶ Returns the value of the \(r^2\) statistic between the focal mutation at index \(a\) and a set of other mutations. The method operates by starting at the focal mutation and iterating over adjacent mutations (in either the forward or backwards direction) until either a maximum number of other mutations have been considered (using the
max_mutations
parameter), a maximum distance in sequence coordinates has been reached (using the max_distance parameter) or the start/end of the sequence has been reached. For every mutation \(b\) considered, we then insert the value of \(r^2\) between \(a\) and \(b\) at the corresponding index in an array, and return the entire array. If the returned array is \(x\) and direction is tskit.FORWARD, then \(x[0]\) is the value of the statistic for \(a\) and \(a + 1\), \(x[1]\) the value for \(a\) and \(a + 2\), etc. Similarly, if direction is tskit.REVERSE, then \(x[0]\) is the value of the statistic for \(a\) and \(a - 1\), \(x[1]\) the value for \(a\) and \(a - 2\), etc.
- Parameters
a (int) – The index of the focal mutation.
direction (int) – The direction in which to travel when examining other mutations. Must be either tskit.FORWARD or tskit.REVERSE. Defaults to tskit.FORWARD.
max_mutations (int) – The maximum number of mutations to return \(r^2\) values for. Defaults to as many mutations as possible.
max_distance (float) – The maximum absolute distance between the focal mutation and those for which \(r^2\) values are returned.
- Returns
An array of double precision floating point values representing the \(r^2\) values for mutations in the specified direction.
- Return type
numpy.ndarray
- Warning
For efficiency reasons, the underlying memory used to store the returned array is shared between calls. Therefore, if you wish to store the results of a single call to
r2_array()
for later processing, you must take a copy of the array!
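A hedged sketch of computing r² values from a focal site and copying the result, assuming ts is an existing TreeSequence with single-mutation (infinite-sites) variants:

    import tskit

    def r2_near_focal(ts, focal, max_sites=10):
        ld_calc = tskit.LdCalculator(ts)
        # Copy the result, since the underlying buffer is reused between calls.
        return ld_calc.r2_array(focal, max_mutations=max_sites).copy()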
Provenance¶
We provide some preliminary support for validating JSON documents against the provenance schema. Programmatic access to provenance information is planned for future versions.
-
tskit.
validate_provenance
(provenance)[source]¶ Validates the specified dict-like object against the tskit provenance schema. If the input does not represent a valid instance of the schema, an exception is raised.
- Parameters
provenance (dict) – The dictionary representing a JSON document to be validated against the schema.
- Raises
ProvenanceValidationError – if the input does not represent a valid instance of the provenance schema.
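A hedged sketch of validating the provenance records already stored in a tree sequence (assumes ts is an existing TreeSequence):

    import json
    import tskit

    def check_provenances(ts):
        for provenance in ts.provenances():
            try:
                # Each record is a JSON document; decode it before validating.
                tskit.validate_provenance(json.loads(provenance.record))
            except tskit.ProvenanceValidationError as err:
                print(f"invalid provenance record: {err}")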
-
tskit.
ProvenanceValidationError
¶