Analysing tree sequences#


This tutorial is a work in progress.

Computing statistics#

Tskit provides an extensive and flexible interface for computing population genetic statistics, which is documented in detail in the general statistics section of the offical documentation. This tutorial aims to give a quick overview of how the APIs work and how to use them effectively.

First, let’s load a tree sequence to work with which has roughly human parameters for 10 thousand samples and 10Mb chromosomes:

ts = tskit.load("data/computing_statistics.trees")
Tree Sequence
Sequence Length10000000.0
Time Unitsunknown
Sample Nodes10000
Total Size9.5 MiB
MetadataNo Metadata
Table Rows Size Has Metadata
Edges 157894 4.8 MiB
Individuals 0 24 Bytes
Migrations 0 8 Bytes
Mutations 38778 1.4 MiB
Nodes 44864 1.2 MiB
Populations 1 16 Bytes
Provenances 1 1.0 KiB
Sites 38778 946.7 KiB

This tree sequence has ~36.6 thousand trees & ~39 thousand segregating sites. We’d now like to compute some statistics on this dataset.

One-way statistics#

We refer to statistics that are defined with respect to a single set of samples as “one-way”. An example of such a statistic is diversity, which is computed using the TreeSequence.diversity() method:

d = ts.diversity()
print("Average diversity per unit sequence length = {d:.3G}")
Average diversity per unit sequence length = {d:.3G}

This tells the average diversity across the whole sequence and returns a single number. We’ll usually want to compute statistics in windows along the genome and we use the windows argument to do this:

windows = np.linspace(0, ts.sequence_length, num=5)
d = ts.diversity(windows=windows)
print(windows, d, sep="\n")
[       0.  2500000.  5000000.  7500000. 10000000.]
[0.00040717 0.00040891 0.00041644 0.00039899]

The windows argument takes a numpy array specifying the breakpoints along the genome. Here, we use numpy to create four equally spaced windows of size 2.5 megabases (the windows array contains k + 1 elements to define k windows). Because we have asked for values in windows, tskit now returns a numpy array rather than a single value. (See Output dimensions for a full description of how the output dimensions of statistics are determined by the windows argument.)

Suppose we wanted to compute diversity within a specific subset of samples. We can do this using the sample_sets argument:

A = ts.samples()[:100]
d = ts.diversity(sample_sets=A)

Here, we’ve computed the average diversity within the first hundred samples across the whole genome. As we’ve not specified any windows, this is again a single value.

We can also compute diversity in multiple sample sets at the same time by providing a list of sample sets as an argument:

A = ts.samples()[:100]
B = ts.samples()[100:200]
C = ts.samples()[200:300]
d = ts.diversity(sample_sets=[A, B, C])
[0.00041008 0.00040893 0.00040853]

Because we’ve computed multiple statistics concurrently, tskit returns a numpy array of these statistics. We have asked for diversity within three different sample sets, and tskit therefore returns an array with three values. (In general, the dimensions of the input determines the dimensions of the output: see Output dimensions for a detailed description of the rules.)

We can also compute multiple statistics in multiple windows:

d = ts.diversity(sample_sets=[A, B, C], windows=windows)
print("shape = ", d.shape, "\n", d)
shape =  (4, 3) 
 [[0.00041199 0.00040542 0.00040588]
 [0.00040889 0.00041216 0.00041419]
 [0.00041636 0.00041463 0.0004128 ]
 [0.00040307 0.0004035  0.00040126]]

We have computed diversity within three different sample sets across four genomic windows, and our output is therefore a 2D numpy array with four rows and three columns: each row contains the diversity values within A, B and C for a particular window.

Multi-way statistics#

Many population genetic statistics compare multiple sets of samples to each other. For example, the TreeSequence.divergence() method computes the divergence between two subsets of samples:

A = ts.samples()[:100]
B = ts.samples()[:100]
d = ts.divergence([A, B])

The divergence between two sets of samples A and B is a single number, and we we again return a single floating point value as the result. We can also compute this in windows along the genome, as before:

d = ts.divergence([A, B], windows=windows)
[0.00040787 0.00040481 0.00041219 0.00039904]

Again, as we have defined four genomic windows along the sequence, the result is numpy array with four values.

A powerful feature of tskit’s stats API is that we can compute the divergences between multiple sets of samples simultaneously using the indexes argument:

d = ts.divergence([A, B, C], indexes=[(0, 1), (0, 2)])
[0.00040598 0.00040943]

Here, we’ve specified three sample sets A, B and C and we’ve computed the divergences between A and B, and between A and C. The indexes argument is used to specify which pairs of sets we are interested in. In this example we’ve computed two different divergence values and the output is therefore a numpy array of length 2.

As before, we can combine computing multiple statistics in multiple windows to return a 2D numpy array:

windows = np.linspace(0, ts.sequence_length, num=5)
d = ts.divergence([A, B, C], indexes=[(0, 1), (0, 2)], windows=windows)
[[0.00040787 0.00040834]
 [0.00040481 0.00041207]
 [0.00041219 0.00041464]
 [0.00039904 0.00040267]]

Each row again corresponds to a window, which contains the average divergence values between the chosen sets.

If the indexes parameter is 1D array, we interpret this as specifying a single statistic and remove the empty outer dimension:

d = ts.divergence([A, B, C], indexes=(0, 1))

It’s important to note that we don’t have to remove empty dimensions: tskit will only do this if you explicitly ask it to. Here, for example, we can keep the output as an array with one value if we wish:

d = ts.divergence([A, B, C], indexes=[(0, 1)])

Please see Sample sets and indexes for a full description of the sample_sets and indexes arguments.

Allele frequency spectra#

The allele frequency spectrum is a fundamental tool in population genetics, and tskit provides a flexible and powerful approach to computing such spectra. Suppose we have simulated the following tree sequence:

from IPython.display import display
ts = tskit.load("data/afs.trees")
tree = ts.first()

Computing the allele frequency spectrum is then easy:

afs = ts.allele_frequency_spectrum(polarised=True, span_normalise=False)
[0. 2. 6. 1. 1. 0. 0.]

This tells us that we have two singletons, six doubletons and one 3-ton and one 4-ton. Note that the first element of the returned AFS array does not correspond to the singletons (see below for why). Because we have simulated these mutations, we know the ancestral and derived states we have set polarised to True. We can get the “folded” AFS by setting polarised to False. Because we want simple counts here and not averaged values, we set span_normalise=False: by default, windowed statistics are divided by the sequence length, so they are comparable between windows.

The returned value here is actually a 2D array, and this is because we can also perform these computations in windows along the genome:

afs = ts.allele_frequency_spectrum(windows=[0, 0.5, 1], span_normalise=False, polarised=True)
[[0. 1. 1. 1. 0. 0. 0.]
 [0. 1. 5. 0. 1. 0. 0.]]

This time, we’ve asked for the number of sites at each frequency in two equal windows. Now we can see that in the first half of the sequence we have three sites (compare with the site table above): one singleton, one doubleton and one tripleton.

Joint spectra#

We can also compute allele frequencies within multiple sets of samples, the joint allele frequency spectra.

node_colours = {0: "blue", 2: "blue", 3: "blue", 1: "green", 4: "green", 5: "green"}
styles = [f".n{k} > .sym {{fill: {v}}}" for k, v in node_colours.items()]
tree.draw_svg(style = "".join(styles))

Here we’ve marked the samples as either blue or green (we can imagine these belonging to different populations, for example). We can then compute the joint AFS based on these two sets:

afs = ts.allele_frequency_spectrum([[0, 2, 3], [1, 4, 5]], polarised=True)
[[0. 2. 0. 0.]
 [0. 6. 0. 0.]
 [0. 1. 1. 0.]
 [0. 0. 0. 0.]]

Now, each window in our AFS is a 2D numpy array, where each dimension corresponds to frequencies within the different sets. So, we see for example that there are six sites that are singletons in both sets, 1 site that is a doubleton in both sets, and 2 sites that singletons in \([1, 4, 5]\) and not present in the other sample set.

Branch length spectra#

Up to now we’ve used the allele_frequency_spectrum() method to summarise the number of sites that occur at different frequencies. We can also use this approach to compute the total branch lengths subtending a given number of samples by setting mode="branch":

afs = ts.allele_frequency_spectrum(mode="branch", polarised=True, span_normalise=False)
[0.         4.86089166 5.39638988 2.55239269 2.07444286 0.
 0.        ]

Thus, the total branch length over example one sample is 4.86, over two is 5.39, and so on.

Zeroth and final entries in the AFS#

The zeroth element of the AFS is significant when we are working with sample sets that are a subset of all samples in the tree sequence. For example, in the following we compute the AFS within the sample set [0, 1, 2]:

afs = ts.allele_frequency_spectrum([[0, 1, 2]], mode="branch", polarised=True)
[4.33184862 5.30022646 5.252042   0.        ]

Thus, the total branch length over 0, 1 and 2 is 5.3, and over pairs from this set is 5.25. What does the zeroth value of 4.33 signify? This is the total branch length over all samples that are not in this sample set. By including this value, we maintain the property that for each tree, the sum of the AFS for any sample set is always equal to the total branch length. For example, here we compute:

print("sum afs          = ", np.sum(afs))
print("total branch len = ", tree.total_branch_length)
sum afs          =  14.884117086717392
total branch len =  14.884117086717396

The final entry of the AFS is similar: it counts alleles (for mode=”site”) or branches (for mode=”branch”) that are ancestral to all of the given sample set, but are still polymorphic in the entire set of samples of the tree sequence. Note, however, that alleles fixed among all the samples, e.g., ones above the root of the tree, will not be included.