Data model¶
The correlated genealogical trees that describe the shared ancestry of a set of
samples are stored concisely in tskit
as a collection of
easy-to-understand tables. These are output by coalescent simulation in
msprime
or can be read in from another source. This page documents
the structure of the tables, and the different methods of interchanging
genealogical data to and from the tskit API. We begin by defining
the basic concepts that we need and the structure of the tables in the
Data model section. We then describe the tabular text formats that can
be used as simple interchange mechanism for small amounts of data in the
Text file formats section. The Binary interchange section then describes
the efficient Python API for table interchange using numpy arrays. Finally,
we describe the binary format used by tskit to efficiently
store tree sequences on disk in the Tree sequence file format section.
Definitions¶
To begin, here are definitions of some key ideas encountered later.
- tree
A “gene tree”, i.e., the genealogical tree describing how a collection of genomes (usually at the tips of the tree) are related to each other at some chromosomal location. See Nodes, Genomes, or Individuals? for discussion of what a “genome” is.
- tree sequence
A “succinct tree sequence” (or tree sequence, for brevity) is an efficient encoding of a sequence of correlated trees, such as one encounters looking at the gene trees along a genome. A tree sequence efficiently captures the structure shared by adjacent trees, (essentially) storing only what differs between them.
- node
Each branching point in each tree is associated with a particular genome in a particular ancestor, called “nodes”. Since each node represents a specific genome it has a unique
time
, thought of as its birth time, which determines the height of any branching points it is associated with. See Nodes, Genomes, or Individuals? for discussion of what a “node” is.- individual
In certain situations we are interested in how nodes (representing individual homologous genomes) are grouped together into individuals (e.g. two nodes per diploid individual). For example, when we are working with polyploid samples it is useful to associate metadata with a specific individual rather than duplicate this information on the constituent nodes. See Nodes, Genomes, or Individuals? for more discussion on this point.
- sample
The focal nodes of a tree sequence, usually thought of as those that we have obtained data from. The specification of these affects various methods: (1)
TreeSequence.variants()
andTreeSequence.haplotypes()
will output the genotypes of the samples, andTree.roots
only return roots ancestral to at least one sample. (See the node table definitions for information on how the sample status a node is encoded in theflags
column.)- edge
The topology of a tree sequence is defined by a set of edges. Each edge is a tuple
(left, right, parent, child)
, which records a parent-child relationship among a pair of nodes on the on the half-open interval of chromosome[left, right)
.- site
Tree sequences can define the mutational state of nodes as well as their topological relationships. A site is thought of as some position along the genome at which variation occurs. Each site is associated with a unique position and ancestral state.
- mutation
A mutation records the change of state at a particular site ‘above’ a particular node (more precisely, along the branch between the node in question and its parent). Each mutation is associated with a specific site (which defines the position along the genome), a node (which defines where it occurs within the tree at this position), and a derived state (which defines the mutational state inherited by all nodes in the subtree rooted at the focal node). In more complex situations in which we have back or recurrent mutations, a mutation must also specify its ‘parent’ mutation.
- migration
An event at which a parent and child node were born in different populations.
- population
A grouping of nodes, e.g., by sampling location.
- provenance
An entry recording the origin and history of the data encoded in a tree sequence.
- ID
In the set of interconnected tables that we define here, we refer throughout to the IDs of particular entities. The ID of an entity (e.g., a node) is defined by the position of the corresponding row in the table. These positions are zero indexed. For example, if we refer to node with ID zero, this corresponds to the node defined by the first row in the node table.
- Sequence length
This value defines the coordinate space in which the edges and site positions are defined. This is most often assumed to be equal to the largest
right
coordinate in the edge table, but there are situations in which we might wish to specify the sequence length explicitly.
A tree sequence can be stored in a collection of eight tables: Node, Edge, Individual, Site, Mutation, Migration, Population, and Provenance. The Node and Edge tables store the genealogical relationships that define the trees, and the Individual table describes how multiple genomes are grouped within individuals; the Site and Mutation tables describe where mutations fall on the trees; the Migration table describes how lineages move across space; and the Provenance table contains information on where the data came from. Only Node and Edge tables are necessary to encode the genealogical trees; Sites and Mutations are optional but necessary to encode polymorphism (sequence) data; the remainder are optional. In the following sections we define these components of a tree sequence in more detail.
Nodes, Genomes, or Individuals?¶
The natural unit of biological analysis is (usually) the individual. However, many organisms we study are diploid, and so each individual contains two homologous copies of the entire genome, separately inherited from the two parental individuals. Since each monoploid copy of the genome is inherited separately, each diploid individual lies at the end of two distinct lineages, and so will be represented by two places in any given genealogical tree. This makes it difficult to precisely discuss tree sequences for diploids, as we have no simple way to refer to the bundle of chromosomes that make up the “copy of the genome inherited from one particular parent”. For this reason, in this documentation we use the non-descriptive term “node” to refer to this concept – and so, a diploid individual is composed of two nodes – although we use the term “genome” at times, for concreteness.
Several properties naturally associated with individuals are in fact assigned
to nodes in what follows: birth time and population. This is for two reasons:
First, since coalescent simulations naturally lack a notion of polyploidy, earlier
versions of tskit
lacked the notion of an individual. Second, ancestral
nodes are not naturally grouped together into individuals – we know they must have
existed, but have no way of inferring this grouping, so in fact many nodes in
an empirically-derived tree sequence will not be associated with individuals,
even though their birth times might be inferred.
Table definitions¶
Table types¶
Node Table¶
A node defines a monoploid set of chromosomes (a “genome”) of a specific
individual that was born at some time in the past: the set of
chromosomes inherited from a particular one of the individual’s parents.
(See Nodes, Genomes, or Individuals? for more discussion.)
Every vertex in the marginal trees of a tree sequence corresponds
to exactly one node, and a node may be present in many trees. The
node table contains five columns, of which flags
and time
are
mandatory:
Column |
Type |
Description |
---|---|---|
flags |
uint32 |
Bitwise flags. |
time |
double |
Birth time of node |
population |
int32 |
Birth population of node. |
individual |
int32 |
The individual the node belongs to. |
metadata |
binary |
Node Metadata |
The time
column records the birth time of the individual in question,
and is a floating point value. Similarly,
the population
column records the ID of the population where this
individual was born. If not provided, population
defaults to the
null ID (-1). Otherwise, the population ID must refer to a row in the
Population Table.
The individual
column records the ID of the
Individual
individual that this node belongs to. If specified, the ID must refer
to a valid individual. If not provided, individual
defaults to the null ID (-1).
The flags
column stores information about a particular node, and
is composed of 32 bitwise boolean values. Currently, the only flag defined
is NODE_IS_SAMPLE = 1
, which defines the sample status of nodes. Marking
a particular node as a “sample” means, for example, that the mutational state
of the node will be included in the genotypes produced by
TreeSequence.variants()
.
Bits 0-15 (inclusive) of the flags
column are reserved for internal use by
tskit
and should not be used by applications for anything other
than the purposes documented here. Bits 16-31 (inclusive) are free for applications
to use for any purpose and will not be altered or interpreteted by
tskit
.
See the Node requirements section for details on the properties required for a valid set of nodes.
For convenience, the text format for nodes
decomposes the flags
value into its separate values. Thus, in the
text format we have a column for is_sample
, which corresponds to the
flags
column in the underlying table. As more flags values are
defined, these will be added to the text file format.
The metadata
column provides a location for client code to store
information about each node. See the Metadata section for
more details on how metadata columns should be used.
Note
The distinction between flags
and metadata
is that flags
holds information about a node that the library understands, whereas
metadata holds information about a node that the library does not
understand. Metadata is for storing auxiliarly information that is
not necessary for the core tree sequence algorithms.
Individual Table¶
An individual defines how nodes (which can be seen
as representing single chromosomes) group together in a polyploid individual.
The individual table contains three columns, of which only flags
is mandatory.
Column |
Type |
Description |
---|---|---|
flags |
uint32 |
Bitwise flags. |
location |
double |
Location in arbitrary dimensions |
metadata |
binary |
Individual Metadata |
See the Individual requirements section for details on the properties required for a valid set of individuals.
The flags
column stores information about a particular individual, and
is composed of 32 bitwise boolean values. Currently, no flags are
defined.
Bits 0-15 (inclusive) of the flags
column are reserved for internal use by
tskit
and should not be used by applications for anything other
than the purposes documented here. Bits 16-31 (inclusive) are free for applications
to use for any purpose and will not be altered or interpreteted by
tskit
.
The location
column stores the location of an individual in arbitrary
dimensions. This column is ragged, and
so different individuals can have locations with different dimensions (i.e.,
one individual may have location []
and another [0, 1, 0]
. This could
therefore be used to store other quantities (e.g., phenotype).
The metadata
column provides a location for client code to store
information about each individual. See the Metadata section for
more details on how metadata columns should be used.
Note
The distinction between flags
and metadata
is that flags
holds information about a individual that the library understands, whereas
metadata holds information about a individual that the library does not
understand. Metadata is for storing auxiliarly information that is
not necessary for the core tree sequence algorithms.
Edge Table¶
An edge defines a parent-child relationship between a pair of nodes
over a specific sequence interval. The edge table contains five columns,
all of which are mandatory except metadata
:
Column |
Type |
Description |
---|---|---|
left |
double |
Left coordinate of the edge (inclusive). |
right |
double |
Right coordinate of the edge (exclusive). |
parent |
int32 |
Parent node ID. |
child |
int32 |
Child node ID. |
metadata |
binary |
Node Metadata |
Each row in an edge table describes a half-open genomic interval [left, right)
over which the child
inherited from the given parent
.
The left
and right
columns are defined using double precision
floating point values. The parent
and child
columns specify integer IDs in the associated Node Table.
The metadata
column provides a location for client code to store
information about each edge. See the Metadata section for
more details on how metadata columns should be used.
See the Edge requirements section for details on the properties required for a valid set of edges.
Site Table¶
A site defines a particular location along the genome in which
we are interested in observing the allelic state. The site table
contains three columns, of which position
and ancestral_state
are mandatory.
Column |
Type |
Description |
---|---|---|
position |
double |
Position of site in genome coordinates. |
ancestral_state |
text |
The state at the root of the tree. |
metadata |
binary |
Site Metadata. |
The position
column is a floating point value defining the location
of the site in question along the genome.
The ancestral_state
column specifies the allelic state at the root
of the tree, thus defining the state that nodes inherit if no mutations
intervene. The column stores text character data of arbitrary length.
The metadata
column provides a location for client code to store
information about each site. See the Metadata section for
more details on how metadata columns should be used.
See the Site requirements section for details on the properties required for a valid set of sites.
Mutation Table¶
A mutation defines a change of allelic state on a tree at a particular site.
The mutation table contains five columns, of which site
, node
and
derived_state
are mandatory.
Column |
Type |
Description |
---|---|---|
site |
int32 |
The ID of the site the mutation occurs at. |
node |
int32 |
The node this mutation occurs at. |
parent |
int32 |
The ID of the parent mutation. |
time |
double |
Time at which the mutation occurred. |
derived_state |
char |
The allelic state resulting from the mutation. |
metadata |
binary |
Mutation Metadata. |
The site
column is an integer value defining the ID of the
site at which this mutation occurred.
The node
column is an integer value defining the ID of the
first node in the tree below this mutation.
The time
column is a double precision floating point value recording how long ago
the mutation happened.
The derived_state
column specifies the allelic state resulting from the mutation,
thus defining the state that the node
and any descendant nodes in the
subtree inherit unless further mutations occur. The column stores text
character data of arbitrary length.
The parent
column is an integer value defining the ID of the mutation whose
allelic state this mutation replaced. If there is no mutation at the
site in question on the path back to root, then this field is set to the
null ID (-1). (The parent
column is only required in situations
where there are multiple mutations at a given site. For
“infinite sites” mutations, it can be ignored.)
The metadata
column provides a location for client code to store
information about each site. See the Metadata section for
more details on how metadata columns should be used.
See the Mutation requirements section for details on the properties required for a valid set of mutations.
Migration Table¶
In simulations, trees can be thought of as spread across space, and it is
helpful for inferring demographic history to record this history.
Migrations are performed by individual ancestors, but most likely not by an
individual whose genome is tracked as a node
(as in a discrete-deme model they are
unlikely to be both a migrant and a most recent common ancestor). So,
tskit
records when a segment of ancestry has moved between
populations. This table is not required, even if different nodes come from
different populations.
Column |
Type |
Description |
---|---|---|
left |
double |
Left coordinate of the migrating segment (inclusive). |
right |
double |
Right coordinate of the migrating segment (exclusive). |
node |
int32 |
Node ID. |
source |
int32 |
Source population ID. |
dest |
int32 |
Destination population ID. |
time |
double |
Time of migration event. |
metadata |
binary |
Migration Metadata |
The left
and right
columns are floating point values defining the
half-open segment of genome affected. The source
and dest
columns
record the IDs of the respective populations. The node
column records the
ID of the node that was associated with the ancestry segment in question
at the time of the migration event. The time
column is holds floating
point values recording the time of the event.
The metadata
column provides a location for client code to store
information about each migration. See the Metadata section for
more details on how metadata columns should be used.
See the Migration requirements section for details on the properties required for a valid set of mutations.
Population Table¶
A population defines a grouping of individuals that a node can be said to belong to.
The population table contains one column, metadata
.
Column |
Type |
Description |
---|---|---|
metadata |
binary |
Population Metadata. |
The metadata
column provides a location for client code to store
information about each population. See the Metadata section for
more details on how metadata columns should be used.
See the Population requirements section for details on the properties required for a valid set of populations.
Metadata¶
Each table (excluding provenance) has a metadata column for storing and passing along information that tskit does not use or interpret. See Metadata for details. The metadata columns are binary columns.
When using the Text file formats, to ensure that metadata can be safely interchanged, each row is base 64 encoded. Thus, binary information can be safely printed and exchanged, but may not be human readable.
The tree sequence itself also has metadata stored as a byte array.
Valid tree sequence requirements¶
Arbitrary data can be stored in tables using the classes in the
Tables and Table Collections. However, only a TableCollection
that fulfils a set of requirements represents
a valid TreeSequence
object which can be obtained
using the TableCollection.tree_sequence()
method. In this
section we list these requirements, and explain their rationale.
Violations of most of these requirements are detected when the
user attempts to load a tree sequence via tskit.load()
or
TableCollection.tree_sequence()
, raising an informative
error message. Some more complex requirements may not be detectable at load-time,
and errors may not occur until certain operations are attempted.
These are documented below.
We also provide tools that can transform a collection of tables into a valid
collection of tables, so long as they are logically consistent,
as described in Table transformation methods.
Individual requirements¶
Individuals are a basic type in a tree sequence and are not defined with respect to any other tables. Therefore, there are no requirements on individuals.
There are no requirements regarding the ordering of individuals.
Sorting a set of tables using TableCollection.sort()
has
no effect on the individuals.
Node requirements¶
Given a valid set of individuals and populations, the requirements for each node are:
population
must either be null (-1) or refer to a valid population ID;individual
must either be null (-1) or refer to a valid individual ID.
An ID refers to a zero-indexed row number in the relevant table, and so is “valid” if is between 0 and one less than the number of rows in the relevant table.
There are no requirements regarding the ordering of nodes with respect to time.
Sorting a set of tables using TableCollection.sort()
has no effect on nodes.
Edge requirements¶
Given a valid set of nodes and a sequence length \(L\), the simple requirements for each edge are:
We must have \(0 \leq\)
left
\(<\)right
\(\leq L\);parent
andchild
must be valid node IDs;time[parent]
>time[child]
;edges must be unique (i.e., no duplicate edges are allowed).
The first requirement simply ensures that the interval makes sense. The third requirement ensures that we cannot have loops, since time is always increasing as we ascend the tree.
To ensure a valid tree sequence there is one further requirement:
The set of intervals on which each node is a child must be disjoint.
This guarantees that we cannot have contradictory edges (i.e.,
where a node a
is a child of both b
and c
), and ensures that
at each point on the sequence we have a well-formed forest of trees.
Because this is a more complex semantic requirement, it is not detected
at load time. This error is detected during tree traversal, via, e.g.,
the TreeSequence.trees()
iterator.
In the interest of algorithmic efficiency, edges must have the following sortedness properties:
All edges for a given parent must be contiguous;
Edges must be listed in nondecreasing order of
parent
time;Within the edges for a given
parent
, edges must be sorted first bychild
ID and then byleft
coordinate.
Violations of these requirements are detected at load time.
The TableCollection.sort()
method will ensure that these sortedness
properties are fulfilled.
Site requirements¶
Given a valid set of nodes and a sequence length \(L\), the simple requirements for a valid set of sites are:
We must have \(0 \leq\)
position
\(< L\);position
values must be unique.
For simplicity and algorithmic efficiency, sites must also:
Be sorted in increasing order of
position
.
Violations of these requirements are detected at load time.
The TableCollection.sort()
method ensures that sites are sorted
according to these criteria.
Mutation requirements¶
Given a valid set of nodes, edges and sites, the requirements for a valid set of mutations are:
site
must refer to a valid site ID;node
must refer to a valid node ID;time
must either beUNKNOWN_TIME
(a NAN value which indicates the time is unknown) or be a finite value which is greater or equal to the mutationnode
’stime
, less than thenode
above the mutation’stime
and equal to or less than thetime
of theparent
mutation if this mutation has one. If one mutation on a site hasUNKNOWN_TIME
then all mutations at that site must, as a mixture of known and unknown is not valid.parent
must either be the null ID (-1) or a valid mutation ID within the current table
Furthermore,
If another mutation occurs on the tree above the mutation in question, its ID must be listed as the
parent
.
For simplicity and algorithmic efficiency, mutations must also:
be sorted by site ID;
when there are multiple mutations per site, mutations should be ordered by decreasing time, if known, and parent mutations must occur before their children (i.e. if a mutation with ID \(x\) has
parent
with ID \(y\), then we must have \(y < x\)).
Violations of these sorting requirements are detected at load time.
The TableCollection.sort()
method ensures that mutations are sorted
according site ID, but does not at present enforce that mutations occur
after their parent mutations.
Mutations also have the requirement that they must result in a
change of state. For example, if we have a site with ancestral state
of “A” and a single mutation with derived state “A”, then this
mutation does not result in any change of state. This error is
raised at run-time when we reconstruct sample genotypes, for example
in the TreeSequence.variants()
iterator.
Note
As tskit.UNKNOWN_TIME
is implemented as a NaN
value, tests for
equality will always fail. Use tskit.is_unknown_time
to detect unknown
values.
Migration requirements¶
Given a valid set of nodes and edges, the requirements for a value set of migrations are:
left
andright
must lie within the tree sequence coordinate space (i.e., from 0 tosequence_length
).time
must be strictly between the time of itsnode
and the time of any ancestral node from which that node inherits on the segment[left, right)
.The
population
of any such ancestor matchingsource
, if anothermigration
does not intervene.
To enable efficient processing, migrations must also be:
Sorted by nondecreasing
time
value.
Note in particular that there is no requirement that adjacent migration records
should be “squashed”. That is, we can have two records m1
and m2
such that m1.right
= m2.left
and with the node
, source
,
dest
and time
fields equal. This is because such records will usually
represent two independent ancestral segments migrating at the same time, and
as such squashing them into a single record would result in a loss of information.
Population requirements¶
There are no requirements on a population table.
Table transformation methods¶
In several cases it may be necessary to transform the data stored in a
TableCollection
. For example, an application may produce tables
which, while logically consistent, do not meet all the
requirements for a valid tree
sequence, which exist for algorithmic and efficiency reasons; table
transformation methods can make such a set of tables valid, and thus ready
to be loaded into a tree sequence.
In general, table methods operate in place on a TableCollection
,
directly altering the data stored within its constituent tables.
Some of the methods described in this section also have an equivalant
TreeSequence
version: unlike the methods described below,
TreeSequence
methods do not operate in place, but rather act in
a functional way, returning a new tree sequence while leaving the original
unchanged.
This section is best skipped unless you are writing a program that records tables directly.
Simplification¶
Simplifying a tree sequence is an operation commonly used to remove
redundant information and only retain the minimal tree sequence necessary
to describe the genealogical history of the samples
provided. In fact all
that the TreeSequence.simplify()
method does is to call the equivalent
table transformation method, TableCollection.simplify()
, on the
underlying tables and load them in a new tree sequence.
Removing information via TableCollection.simplify()
is done by
discarding rows from the underlying tables. Nevertheless, simplification is
guaranteed to preserve relative ordering of any retained rows in the Site
and Mutation tables.
The TableCollection.simplify()
method can be applied to a collection of
tables that does not have the mutations.parent
entries filled in, as long
as all other validity requirements are satisfied.
Sorting¶
The validity requirements for a set of tables to be loaded into a tree sequence
listed in Table definitions are of two sorts: logical consistency,
and sortedness. The TableCollection.sort()
method can be used to make
completely valid a set of tables that satisfies all requirements other than
sortedness.
This method can also be used on slightly more general collections of tables:
it is not required that site
positions be unique in the table collection to
be sorted. The method has two additional properties:
it preserves relative ordering between sites at the same position, and
it preserves relative ordering between mutations at the same site.
TableCollection.sort()
does not check the validity of the parent
property of the mutation table. However, because the method preserves mutation
order among mutations at the same site, if mutations are already sorted so that
each mutation comes after its parent (e.g., if they are ordered by time of
appearance), then this property is preserved, even if the parent properties
are not specified.
Table indexes¶
To efficiently iterate over the trees in a tree sequence, tskit
uses
indexes built on the edges. To create a tree sequence from a table collection
the tables must be indexed; the TableCollection.build_index()
method
can be used to create an index on a table collection if necessary.
Todo
Add more details on what the indexes actually are.
Removing duplicate sites¶
The TableCollection.deduplicate_sites()
method can be used to save a tree
sequence recording method the bother of checking to see if a given site already
exists in the site table. If there is more than one site with the same
position, all but the first is removed, and all mutations referring to the
removed sites are edited to refer to the first (and remaining) site. Order is
preserved.
Computing mutation parents¶
If each edge had at most only a single mutation, then the parent
property
of the mutation table would be easily inferred from the tree at that mutation’s
site. If mutations are entered into the mutation table ordered by time of
appearance, then this sortedness allows us to infer the parent of each mutation
even for mutations occurring on the same branch. The
TableCollection.compute_mutation_parents()
method will take advantage
of this fact to compute the parent
column of a mutation table, if all
other information is valid.
Computing mutation times¶
In the case where the method generating a tree sequence does not generate mutation
times, valid times can be provided by TableCollection.compute_mutation_parents()
.
If all other information is valid this method will assign times to the mutations by
placing them at evenly spaced intervals along their edge (for instance, a single
mutation on an edge between a node at time 1.0 and a node at time 4.0 would be given
time 2.5; while two mutations on that edge would be given times 2.0 and 3.0).
Recording tables in forwards time¶
The above methods enable the following scheme for recording site and mutation tables during a forwards-time simulation. Whenever a new mutation is encountered:
Add a new
site
to the site table at this position.Add a new
mutation
to the mutation table at the newly created site.
This is lazy and wrong, because:
There might have already been sites in the site table with the same position,
and/or a mutation (at the same position) that this mutation should record as its
parent
.
But, it’s all OK because here’s what we do:
Add rows to the mutation and site tables as described above.
Periodically,
sort
,deduplicate_sites
, andsimplify
, then return to (1.), except thatSometimes, to output the tables,
sort
,compute_mutation_parents
, (optionallysimplify
), and dump these out to a file.
Note: as things are going along we do not have to
compute_mutation_parents
, which is nice, because this is a nontrivial step
that requires construction all the trees along the sequence. Computing mutation
parents only has to happen before the final (output) step.
This is OK as long as the forwards-time simulation outputs things in order by when they occur, because these operations have the following properties:
Mutations appear in the mutation table ordered by time of appearance, so that a mutation will always appear after the one that it replaced (i.e., its parent).
Furthermore, if mutation B appears after mutation A, but at the same site, then mutation B’s site will appear after mutation A’s site in the site table.
sort
sorts sites by position, and then by ID, so that the relative ordering of sites at the same position is maintained, thus preserving property (2).sort
sorts mutations by site, and then by ID, thus preserving property (1); if the mutations are at separate sites (but the same position), this fact is thanks to property (2).simplify
also preserves ordering of any rows in the site and mutation tables that do not get discarded.deduplicate_sites
goes through and collapses all sites at the same position to only one site, maintaining order otherwise.compute_mutation_parents
fills in theparent
information by using property (1).
Tree structure¶
Quintuply linked trees¶
Tree structure in tskit
is encoded internally as a “quintuply
linked tree”, a generalisation of the triply linked tree encoding
used by Knuth and others. Nodes are represented by their integer
IDs, and their relationships to other nodes are recorded in the
parent
, left_child
, right_child
, left_sib
and
right_sib
arrays. For example, consider the following tree
and associated arrays:
node |
parent |
left_child |
right_child |
left_sib |
right_sib |
---|---|---|---|---|---|
0 |
5 |
-1 |
-1 |
-1 |
1 |
1 |
5 |
-1 |
-1 |
0 |
2 |
2 |
5 |
-1 |
-1 |
1 |
-1 |
3 |
6 |
-1 |
-1 |
-1 |
4 |
4 |
6 |
-1 |
-1 |
3 |
-1 |
5 |
7 |
0 |
2 |
-1 |
6 |
6 |
7 |
3 |
4 |
5 |
-1 |
7 |
-1 |
5 |
6 |
-1 |
-1 |
Each node in the tree sequence corresponds to a row in this table, and
the columns are the individual arrays recording the quintuply linked
structure. Thus, we can see that the parent of nodes 0
, 1
and 2
is 5
. Similarly, the left child of 5
is 0
and the
right child of 5
is 2
. The left_sib
and right_sib
arrays
then record each nodes sibling on its left or right, respectively;
hence the right sib of 0
is 1
, and the right sib of 1
is 2
.
Thus, sibling information allows us to efficiently support trees
with arbitrary numbers of children. In each of the five pointer arrays,
the null node (-1) is used to indicate the end of a path; thus,
for example, the parent of 7
and left sib of 0
are null.
Please see this example for details of how to use the quintuply linked structure in the C API.
Note
For many applications we do not need the quintuply linked trees,
and (for example) the left_sib
and right_child
arrays can be
ignored. The reason for using a quintuply instead of triply linked
encoding is that it is not possible to efficiently update the trees
as we move along the sequence without the quintuply linked structure.
Warning
The left-to-right ordering of nodes is determined by the order in which edges are inserted into the tree during iteration along the sequence. Thus, if we arrive at the same tree by iterating from different directions, the left-to-right ordering of nodes may be different! The specific ordering of the children of a node should therefore not be depended on.
Accessing roots¶
The roots of a tree are defined as the unique endpoints of upward paths
starting from sample nodes (if no path leads upward from a sample node,
that node is also a root). Thus, trees can have multiple roots in tskit
.
For example, if we delete the edge joining 6
and 7
in the previous
example, we get a tree with two roots:
node |
parent |
left_child |
right_child |
left_sib |
right_sib |
---|---|---|---|---|---|
0 |
5 |
-1 |
-1 |
-1 |
1 |
1 |
5 |
-1 |
-1 |
0 |
2 |
2 |
5 |
-1 |
-1 |
1 |
-1 |
3 |
6 |
-1 |
-1 |
-1 |
4 |
4 |
6 |
-1 |
-1 |
3 |
-1 |
5 |
7 |
0 |
2 |
-1 |
-1 |
6 |
-1 |
3 |
4 |
7 |
-1 |
7 |
-1 |
5 |
5 |
-1 |
6 |
To gain efficient access to the roots in the quintuply linked encoding we keep
one extra piece of information: the left_root
. In this example
the leftmost root is 7
. Roots are considered siblings, and so
once we have one root we can find all the other roots efficiently using
the left_sib
and right_sib
arrays. For example, we can see here
that the right sibling of 7
is 6
, and the left sibling of 6
is 7
.
Missing data¶
Missing data is encoded in tskit using the idea of isolated samples. A sample’s genotype is missing at a position if it is isolated and if it has no mutations directly above it at that position. An isolated sample is a sample node (see Definitions) that has no children and no parent, in a particular tree. This encodes the idea that we don’t know anything about that sample’s relationships over a specific interval. This definition covers the standard idea of missing data in genomics (where we do not know the sequence of a given contemporary sample at some site, for whatever reason), but also more generally the idea that we may not know anything about large sections of the genomes of ancestral samples. However, a mutation above an isolated node can be thought of as saying directly what the genotype is, and so renders the genotype at that position not missing.
Consider the following example:
In this tree, node 4 is isolated, and therefore for any sites that are
on this tree, the state that it is assigned is a special value
tskit.MISSING_DATA
, or -1
, as long as there are no mutations above
the node at that site. Note that, although isolated, because node 4
is a sample node it is still considered as being present in the
tree, meaning it will still returned by the Tree.nodes()
and
Tree.samples()
methods. The Tree.is_isolated()
method can be used to
identify nodes which are isolated samples:
>>> [u for u in tree.samples() if tree.is_isolated(u)] # isolated samples in this tree
[4]
>>> [u for u in tree.nodes() if not tree.is_isolated(u)] # topologically connected nodes
[0, 1, 2, 3, 5, 6, 7]
See the TreeSequence.variants()
method and Variant
class for
more information on how missing data is represented in variant data.
Text file formats¶
The tree sequence text file format is based on a simple whitespace
delimited approach. Each table corresponds to a single file, and is
composed of a number of whitespace delimited columns. The first
line of each file must be a header giving the names of each column.
Subsequent rows must contain data for each of these columns, following
the usual conventions. Each table has a set of mandatory and optional columns which are
described below. The columns can be provided in any order, and extra columns
can be included in the file. Note, in particular, that this means that
an id
column may be present in any of these files, but it will be
ignored (IDs are always determined by the position of the row in a table).
We present the text format below using the following very simple tree sequence, with four nodes, two trees, and three mutations at two sites, both on the first tree:
time ago
--------
3 3
┏━━┻━━┓
╋ ╋ 2
┃ ╋ ┏━━┻━━┓
0 0 1 0 1
position 0 7 10
A deletion from AT to A has occurred at position 2 on the branch leading to node 0, and two mutations have occurred at position 4 on the branch leading to node 1, first from A to T, then a back mutation to A. The genotypes of our two samples, nodes 0 and 1, are therefore AA and ATA.
Individual text format¶
The individual text format must contain a flags
column.
Optionally, there may also be a location
and
metadata
columns. See the individual table definitions for details on these columns.
Note that there are currently no globally defined flags
, but the column
is still required; a value of 0
means that there are no flags set.
The location
column should be a sequence of comma-separated numeric
values. They do not all have to be the same length.
An example individual table:
flags location
0 0.5,1.2
0 1.0,3.4
0
0 1.2
0 3.5,6.3
0 0.5,0.5
0 0.5
0 0.7,0.6,0.0
0 0.5,0.0
Node text format¶
The node text format must contain the columns is_sample
and
time
. Optionally, there may also be population
, individual
, and
metadata
columns. See the node table definitions for details on these columns.
Note that we do not have a flags
column in the text file format, but
instead use is_sample
(which may be 0 or 1). Currently, NODE_IS_SAMPLE
is
the only flag value defined for nodes, and as more flags are defined we will
allow for extra columns in the text format.
An example node table:
is_sample individual time
1 0 0.0
1 0 0.0
0 -1 1.0
0 -1 3.0
Edge text format¶
The edge text format must contain the columns left
,
right
, parent
and child
. Optionally, there may also be
a metadata
column.
See the edge table definitions
for details on these columns.
An example edge table:
left right parent child
0.0 7.0 2 0
0.0 7.0 2 1
7.0 10.0 3 0
7.0 10.0 3 1
Site text format¶
The site text format must contain the columns position
and
ancestral_state
. The metadata
column may also be optionally
present. See the
site table definitions
for details on these columns.
sites:
position ancestral_state
2.0 AT
4.0 A
Mutation text format¶
The mutation text format must contain the columns site
,
node
and derived_state
. The time
, parent
and metadata
columns
may also be optionally present (but parent
must be specified if
more than one mutation occurs at the same site). If time
is absent
UNKNOWN_TIME
will be used to fill the column. See the
mutation table definitions
for details on these columns.
mutations:
site node derived_state time parent
0 0 A 0 -1
1 0 T 0.5 -1
1 1 A 1 1
Migration text format¶
The migration text format must contain the columns left
,
right
, node
, source
, dest
and time
. The metadata
column
may also be optionally present. See the
migration table definitions
for details on these columns.
migrations:
left right node source dest time
0.0 0.7 5 2 3 1.0
0.8 0.9 8 3 4 3.0
Population text format¶
Population tables only have a metadata
column, so the text format for
a population table requires there to be a metadata
column. See the
population table definitions for
details.
An example population table:
id metadata
0 cG9wMQ==
1 cG9wMg==
The metadata
contains base64-encoded data (in this case, the strings
pop1
and pop1
).
Binary interchange¶
In this section we describe the high-level details of the API for interchanging table data via numpy arrays. Please see the Tables and Table Collections for detailed description of the functions and methods.
The tables API is based on columnar storage of the data. In memory, each
table is organised as a number of blocks of contiguous storage, one for
each column. There are many advantages to this approach, but the key
property for us is that allows for very efficient transfer of data
in and out of tables. Rather than inserting data into tables row-by-row
(which can be done using the add_row
methods), it is much more
efficient to add many rows at the same time by providing pointers to blocks of
contiguous memory. By taking
this approach, we can work with tables containing gigabytes of data very
efficiently.
We use the numpy Array API
to allow us to define and work with numeric arrays of the required types.
Node IDs, for example, are defined using 32 bit integers. Thus, the
parent
column of an Edge Table’s with n
rows
is a block 4n
bytes.
This approach is very straightforward for columns in which each row contains a fixed number of values. However, dealing with columns containing a variable number of values is more problematic.
Encoding ragged columns¶
A ragged column is a column in which the rows are not of a fixed length. For example, Metadata columns contain binary of data of arbitrary length. To encode such columns in the tables API, we store two columns: one contains the flattened array of data and another stores the offsets of each row into this flattened array. Consider an example:
>>> s = tskit.SiteTable()
>>> s.add_row(0, "A")
>>> s.add_row(0, "")
>>> s.add_row(0, "TTT")
>>> s.add_row(0, "G")
>>> print(s)
id position ancestral_state metadata
0 0.00000000 A
1 0.00000000
2 0.00000000 TTT
3 0.00000000 G
>>> s.ancestral_state
array([65, 84, 84, 84, 71], dtype=int8)
>>> s.ancestral_state.tobytes()
b'ATTTG'
>>> s.ancestral_state_offset
array([0, 1, 1, 4, 5], dtype=uint32)
>>> s.ancestral_state[s.ancestral_state_offset[2]: s.ancestral_state_offset[3]].tobytes()
b'TTT'
In this example we create a Site Table with four rows,
and then print out this table. We can see that the second row has the
empty string as its ancestral_state
, and the third row’s
ancestral_state
is TTT
. When we print out the tables ancestral_state
column, we see that its a numpy array of length 5: this is the
flattened array of ASCII encoded
values for these rows. When we decode these bytes using the
numpy tobytes
method, we get the string ‘ATTTG’. This flattened array
can now be transferred efficiently in memory like any other column. We
then use the ancestral_state_offset
column to allow us find the
individual rows. For a row j
:
ancestral_state[ancestral_state_offset[j]: ancestral_state_offset[j + 1]]
gives us the array of bytes for the ancestral state in that row.
For a table with n
rows, any offset column must have n + 1
values, the first of which is always 0
. The values in this column must be
nondecreasing, and cannot exceed the length of the ragged column in question.
Tree sequence file format¶
To make tree sequence data as efficient and easy as possible to use, we store the data on file in a columnar, binary format. The format is based on the kastore package, which is a simple key-value store for numerical data. There is a one-to-one correspondence between the tables described above and the arrays stored in these files.
By convention, these files are given the .trees
suffix (although this
is not enforced in any way), and we will sometimes refer to them as “.trees”
files. We also refer to them as “tree sequence files”.
Todo
Link to the documentation for kastore, and describe the arrays that are stored as well as the top-level metadata.
Legacy Versions¶
Tree sequence files written by older versions of tskit are not readable by
newer versions of tskit. For major releases of tskit, tskit upgrade
will convert older tree sequence files to the latest version.
File formats from version 11 onwards are based on kastore; previous to this, the file format was based on HDF5.
However many changes to the tree sequence format are not part of major releases. The table below gives these versions.
Version number |
Commit Short Hash |
---|---|
11.0 |
5646cd3 |
10.0 |
e4396a7 |
9.0 |
e504abd |
8.0 |
299ddc9 |
7.0 |
ca9c0c5 |
6.0 |
6310725 |
5.0 |
62659fb |
4.0 |
a586646 |
3.2 |
8f44bed |
3.1 |
d69c059 |
3.0 |
7befdcf |
2.1 |
a26a227 |
2.0 |
7c507f3 |
1.1 |
c143dd9 |
1.0 |
04722d8 |
0.3 |
f42215e |
0.1 |
34ac742 |