Tskit and R#
To interface with tskit
in R, we can use the reticulate R package, which lets you call Python functions within an R session. In this tutorial, we’ll go through a couple of examples to show you how to get started. If you haven’t done so already, you’ll need to install reticulate
in your R session via install.packages("reticulate")
.
We’ll begin by simulating a small tree sequence using msprime
.
msprime <- reticulate::import("msprime")
ts <- msprime$sim_ancestry(80, sequence_length=1e4, recombination_rate=1e-4, random_seed=42)
ts # See "Jupyter notebook tips", below for how to render this nicely
╔═══════════════════════════╗
║TreeSequence ║
╠═══════════════╤═══════════╣
║Trees │ 26║
╟───────────────┼───────────╢
║Sequence Length│ 10000║
╟───────────────┼───────────╢
║Time Units │generations║
╟───────────────┼───────────╢
║Sample Nodes │ 160║
╟───────────────┼───────────╢
║Total Size │ 29.1 KiB║
╚═══════════════╧═══════════╝
╔═══════════╤════╤══════════╤════════════╗
║Table │Rows│Size │Has Metadata║
╠═══════════╪════╪══════════╪════════════╣
║Edges │ 414│ 12.9 KiB│ No║
╟───────────┼────┼──────────┼────────────╢
║Individuals│ 80│ 2.2 KiB│ No║
╟───────────┼────┼──────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼────┼──────────┼────────────╢
║Mutations │ 0│ 16 Bytes│ No║
╟───────────┼────┼──────────┼────────────╢
║Nodes │ 344│ 9.4 KiB│ No║
╟───────────┼────┼──────────┼────────────╢
║Populations│ 1│ 224 Bytes│ Yes║
╟───────────┼────┼──────────┼────────────╢
║Provenances│ 1│1015 Bytes│ No║
╟───────────┼────┼──────────┼────────────╢
║Sites │ 0│ 16 Bytes│ No║
╚═══════════╧════╧══════════╧════════════╝
Attributes and methods#
reticulate
allows us to access a Python object’s attributes via
the $
operator. For example, we can access (and assign to a variable) the number of
samples in the tree sequence:
n <- ts$num_samples
n
The $
operator can also be used to call methods, for example, the
simplify()
method associated with the tree sequence.
The method parameters are given as native R objects
(but note that object IDs still use tskit’s 0-based indexing system).
reduced_ts <- ts$simplify(0:7) # only keep samples with ids 0, 1, 2, 3, 4, 5, 6, 7
reduced_ts <- reduced_ts$delete_intervals(list(c(6000, 10000))) # delete data after 6kb
reduced_ts <- reduced_ts$trim() # remove the deleted region
paste(
"Reduced from", ts$num_trees, "trees over", ts$sequence_length/1e3, "kb to",
reduced_ts$num_trees, "trees over", reduced_ts$sequence_length/1e3, "kb.")
IDs and indexes#
Note that if a bare digit is provided to one of these methods, it will be treated as a
floating point number. This is useful to know when calling tskit
methods that
require integers (e.g. object IDs). For example, the following will not work:
ts$node(0) # Will raise an error
In this case, to force the 0
to be passed as an integer, you can either coerce it
using as.integer
or simply prepend the letter L
:
ts$node(as.integer(0))
# or
ts$node(0L)
Node(id=0, flags=1, time=0.0, population=0, individual=0, metadata=b'')
Node(id=0, flags=1, time=0.0, population=0, individual=0, metadata=b'')
Coercing in this way is only necessary when passing parameters to those underlying
tskit
methods that expect integers. It is not needed e.g. to index into numeric arrays.
However, when using arrays, very careful attention must be paid to the fact that
tskit
IDs start at zero, whereas R indexes start at one:
root_id <- ts$first()$root
paste("Root time via tskit method:", ts$node(root_id)$time)
# When indexing into tskit arrays in R, add 1 to the ID
paste("Root time via array access:", ts$nodes_time[root_id + 1])
Analysis#
From within R we can use tskit
’s powerful
Statistics framework to efficiently
compute many different summary statistics from a tree sequence. To illustrate this,
we’ll first add some mutations to our tree sequence with the
msprime.sim_mutations()
function, and then compute the genetic diversity
for each of the tree sequence’s sample nodes:
ts_mut = msprime$sim_mutations(reduced_ts, rate=1e-4, random_seed=321)
paste(ts_mut$num_mutations, "mutations, genetic diversity is", ts_mut$diversity())
Numerical arrays and matrices work as expected. For instance, we can use the tree
sequence genotype_matrix()
method to return the genotypes of
the tree sequence as a matrix object in R.
G = ts_mut$genotype_matrix()
G
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
We can then use R functions directly on the genotype matrix:
allele_frequency = rowMeans(G)
allele_frequency
- 0.125
- 0.125
- 0.125
- 0.125
- 0.5
- 0.625
- 0.875
- 0.875
- 0.125
- 0.25
- 0.125
- 0.625
- 0.125
Jupyter notebook tips#
When running R within a Jupyter notebook, a few magic functions can be defined that allow tskit objects to be rendered within the notebook:
# Define some magic functions to allow objects to be displayed in R Jupyter notebooks
repr_html.tskit.trees.TreeSequence <- function(obj, ...){obj$`_repr_html_`()}
repr_html.tskit.trees.Tree <- function(obj, ...){obj$`_repr_html_`()}
repr_svg.tskit.drawing.SVGString <- function(obj, ...){obj$`__str__`()}
This leads to much nicer tabular summaries:
ts_mut
Tree Sequence | |
---|---|
Trees | 6 |
Sequence Length | 6000.0 |
Time Units | generations |
Sample Nodes | 8 |
Total Size | 6.1 KiB |
Metadata | No Metadata |
Table | Rows | Size | Has Metadata |
---|---|---|---|
Edges | 32 | 1.0 KiB | |
Individuals | 4 | 136 Bytes | |
Migrations | 0 | 8 Bytes | |
Mutations | 13 | 497 Bytes | |
Nodes | 20 | 568 Bytes | |
Populations | 1 | 224 Bytes | ✅ |
Provenances | 5 | 3.1 KiB | |
Sites | 13 | 341 Bytes |
It also allows trees and tree sequences to be plotted inline:
ts_mut$draw_svg(y_axis=TRUE, y_ticks=0:10)
Interaction with R libraries#
R has a number of libraries to deal with genomic data and trees. Below we focus on the
phylogenetic tree representation defined in the the popular
ape package, taking all the trees
exported in Nexus format
, or
individual trees exported in Newick format
:
file = tempfile()
ts_mut$write_nexus(file)
# Warning - ape trees are stored independently, so this will use much more memory than tskit
trees <- ape::read.nexus(file, force.multi = TRUE) # return a set of trees
# Or simply read in a single tree
tree <- ape::read.tree(text=ts_mut$first()$as_newick())
# Now we can plot the tree in tskit style, but using the ape library
plot(tree, direction="downward", srt=90, adj=0.5) # or equivalently use trees[[1]]
Note that nodes are labelled with the prefix n
, so that nodes 0
, 1
, 2
, …
become n0
, n1
, n2
… etc. This helps to avoid
confusion between the the zero-based counting system used natively
by tskit
, and the one-based counting system used in R
.
Further information#
Be sure to check out the reticulate documentation, in particular on Calling Python from R, which includes important information on how R data types are converted to their equivalent Python types.