Advanced forward simulations

Todo

Add further details on building a forward simulator (see issue #14)

In the previous tutorial, we developed a basic forward-time Wright-Fisher (WF) simulator (refer back to that tutorial for a detailed run-through of the hidden code):

import tskit
import numpy as np

random_seed = 6
random = np.random.default_rng(random_seed)  # A random number generator for general use

L = 50_000  # The sequence length: 50 Kb

def add_inheritance_paths(tables, parent_genomes, child_genome, recombination_rate):
    "Add paths from parent genomes to the child genome, with crossover recombination"
    L = tables.sequence_length
    num_recombinations = random.poisson(recombination_rate * L)
    # Draw breakpoints from 1..L-1: a breakpoint at 0 would create an empty edge (left == right)
    breakpoints = random.integers(1, L, size=num_recombinations)
    break_pos, counts = np.unique(breakpoints, return_counts=True)
    crossovers = break_pos[counts % 2 == 1]  # no crossover if e.g. 2 breaks at same pos
    left_positions = np.insert(crossovers, 0, 0)
    right_positions = np.append(crossovers, L)

    inherit_from = random.integers(2)
    for left, right in zip(left_positions, right_positions):
        tables.edges.add_row(
            left, right, parent_genomes[inherit_from], child_genome)
        inherit_from = 1 - inherit_from  # switch to other parent genome

def make_diploid(tables, time, parent_individuals=None):
    individual_id = tables.individuals.add_row(parents=parent_individuals)
    return individual_id, (
        tables.nodes.add_row(time=time, individual=individual_id),
        tables.nodes.add_row(time=time, individual=individual_id),
    )

def new_population(tables, time, prev_pop, recombination_rate):
    pop = {}
    prev_individuals = np.array([i for i in prev_pop.keys()], dtype=np.int32)
    for _ in range(len(prev_pop)):
        mother_and_father = random.choice(prev_individuals, 2, replace=True)
        child_id, child_genomes = make_diploid(tables, time, mother_and_father)
        pop[child_id] = child_genomes  # store the genome IDs
        for child_genome, parent_individual in zip(child_genomes, mother_and_father):
            parent_genomes = prev_pop[parent_individual]
            add_inheritance_paths(tables, parent_genomes, child_genome, recombination_rate)
    return pop

def initialise_population(tables, time, size) -> dict:
    return dict(make_diploid(tables, time) for _ in range(size))

The main simulation function, shown below, returned an unsimplified tree sequence, which we subsequently simplified:

def forward_WF(num_diploids, seq_len, generations, recombination_rate=0, random_seed=7):
    global random
    random = np.random.default_rng(random_seed) 
    tables = tskit.TableCollection(seq_len)
    tables.time_units = "generations"

    pop = initialise_population(tables, generations, num_diploids)
    while generations > 0:
        generations = generations - 1
        pop = new_population(tables, generations, pop, recombination_rate)

    tables.sort()
    return tables.tree_sequence()

Repeated simplification#

We can perform simplification directly on the tables within the forward_WF() function, using TableCollection.simplify(). More importantly, we can carry this out at repeated intervals. It is helpful to think of this as regular “garbage collection”, as what we’re really doing is getting rid of extinct lineages while also “trimming” extant lineages down to a minimal representation.

Caution

Regular garbage collection forces us to reckon with the fact that simplification changes the node IDs. We therefore need to remap any node (and individual) IDs that are used outside of tskit. In the implementation described here, those IDs are stored in the pop variable.

def simplify_tables(tables, samples, pop) -> dict[int, tuple[int, int]]:
    """
    Simplify the tables with respect to the given samples, returning a
    population dict in which individual and node IDs have been remapped to
    their new values
    """
    tables.sort()
    node_map = tables.simplify(samples, keep_input_roots=True)
    
    nodes_individual = tables.nodes.individual
    remapped_pop = {}
    for node1, node2 in pop.values():
        node1, node2 = node_map[[node1, node2]]  # remap
        assert nodes_individual[node1] == nodes_individual[node2]  # sanity check
        remapped_pop[nodes_individual[node1]] = (node1, node2)
    return remapped_pop


def forward_WF(
    num_diploids,
    seq_len,
    generations,
    recombination_rate=0,
    simplification_interval=None,  # default to simplifying only at end
    show=None,
    random_seed=7,
):
    global random
    random = np.random.default_rng(random_seed) 
    tables = tskit.TableCollection(seq_len)
    tables.time_units = "generations"  # optional, but helpful when plotting
    if simplification_interval is None:
        simplification_interval = generations
    simplify_mod = generations % simplification_interval

    pop = initialise_population(tables, generations, num_diploids)
    while generations > 0:
        generations = generations - 1
        pop = new_population(tables, generations, pop, recombination_rate)
        if generations > 0 and generations % simplification_interval == simplify_mod:
            current_nodes = [u for nodes in pop.values() for u in nodes]
            pop = simplify_tables(tables, current_nodes, pop)
            if show:
                print("Simplified", generations, "generations before end")

    pop = simplify_tables(tables, [u for nodes in pop.values() for u in nodes], pop)
    if show:
        print("Final simplification")
    return tables.tree_sequence()

ts = forward_WF(6, L, generations=100, simplification_interval=25, show=True)
ts.draw_svg(size=(800, 200))

Simplified 75 generations before end
Simplified 50 generations before end
Simplified 25 generations before end
Final simplification

[Figure: SVG drawing of the simplified tree sequence returned by forward_WF()]

Invariance to simplification interval#

A critical concept to keep in mind is that the simulation itself is the only random component. The simplification algorithm is deterministic given node and edge tables that satisfy tskit’s sorting requirements. Therefore, the results of our new forward_WF() function must be the same for all simplification intervals.

Note

This invariance property only holds in some cases. We discuss this in more detail below when we add in mutation.

ts = forward_WF(10, L, 500, simplification_interval=1, random_seed=42)

# Iterate over a range of odd and even simplification intervals.
print("Testing invariance to simplification interval")
test_intervals = list(range(2, 500, 33))
for i in test_intervals:
    # Make sure each new sim starts with same random seed!
    ts_test = forward_WF(10, L, 500, simplification_interval=i, show=False, random_seed=42)
    assert ts.equals(ts_test, ignore_provenance=True)
print(f"Intervals {test_intervals} passed")

Testing invariance to simplification interval

Intervals [2, 35, 68, 101, 134, 167, 200, 233, 266, 299, 332, 365, 398, 431, 464, 497] passed

Tip

Testing your own code using loops like the one above is a very good way to identify subtle bugs in book-keeping.

Summary#

Simplifying during a simulation changes IDs in the tree sequence tables, so we need to remap entities that store any of these IDs between generations.
Our code to carry out simplification gets called both during the simulation and at the end. It’s therefore worth encapsulating it into a class or function for easier code re-use and testing.

Technical notes#

We have found that it is possible to write a simulation whose results differ by simplification interval but appear correct in distribution. By this we mean that summaries such as the distribution of the number of mutations, their frequencies, etc., match predictions from analytical theory. However, our experience is that such simulations contain bugs and that the summaries used for testing are too crude to catch them. For example, a bug may affect the variance in a subtle way that would require millions of simulations to detect. Often what is going on is that parent/offspring relationships are not being properly recorded, resulting in lineages that persist either too long or not long enough. (In other words, the variance in offspring number per diploid is no longer what it should be, meaning you’ve changed the effective population size.) Thus, please make sure you get the same tskit tables out of a simulation for any simplification interval.
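
As a rough illustration of the offspring-number point, and assuming parents are drawn uniformly with replacement as in new_population() above: each of the \(N\) children picks two parents, so the number of times a given diploid is picked as a parent is Binomial(\(2N\), \(1/N\)), with mean 2 and variance \(2(1 - 1/N)\). The standalone NumPy sketch below (all names are illustrative) compares that prediction with simulated draws. It is only a coarse check and does not replace comparing the tables across simplification intervals.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10         # diploid population size (illustrative)
reps = 20_000  # number of independent generations to simulate

# Each generation, N children each pick 2 parents uniformly with replacement
parents = rng.integers(0, N, size=(reps, 2 * N))

# Count how often each diploid was picked as a parent in each generation
offsets = N * np.arange(reps)[:, np.newaxis]
counts = np.bincount((parents + offsets).ravel(), minlength=reps * N).reshape(reps, N)

observed_mean = counts.mean()
observed_var = counts.var()
expected_var = 2 * (1 - 1 / N)
print(f"mean={observed_mean:.3f}, variance={observed_var:.3f}, expected={expected_var:.3f}")
```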

Mutations#

In this section, we will add mutation to our simulation. Mutations will occur according to the infinitely-many-sites model, which means that a new mutation cannot arise at a position that is already mutated. The parameter \(\mu\) is the expected number of new mutations per gamete, per generation. The scaled mutation rate is \(\theta = 4N\mu\), where \(N\) is the number of diploids; it equals twice the expected number of new mutations arising in the whole population per generation (\(2N\mu\)). Mutation positions will be uniformly distributed along the genome.
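
The model described above can be sketched without any tables at all. The helper below is hypothetical (its name, its integer positions, and its rejection loop are assumptions made for illustration, not the tutorial's final mutation function): it draws a Poisson number of new mutations with mean \(2N\mu\), places them uniformly along the sequence, and rejects any position that is already mutated.

```python
import numpy as np

def draw_new_mutations(rng, num_diploids, mu, seq_len, used_positions):
    """
    Hypothetical helper: draw the new mutations for one generation.
    Returns (positions, genome_indexes); `used_positions` is updated in place.
    Assumes integer positions and far fewer mutations than sites.
    """
    num_genomes = 2 * num_diploids
    num_new = rng.poisson(num_genomes * mu)  # expected 2 * N * mu per generation
    if len(used_positions) + num_new > seq_len:
        raise ValueError("not enough unmutated positions left")
    positions = []
    while len(positions) < num_new:
        pos = int(rng.integers(0, seq_len))  # uniform on 0 .. seq_len - 1
        if pos not in used_positions:  # infinite sites: reject occupied positions
            used_positions.add(pos)
            positions.append(pos)
    # Each mutation lands on a uniformly chosen genome of the new generation
    genome_indexes = rng.integers(0, num_genomes, size=num_new)
    return positions, genome_indexes

rng = np.random.default_rng(42)
used = {5, 17}  # positions that already carry a mutation
N, mu = 10, 0.25
positions, genome_indexes = draw_new_mutations(rng, N, mu, 50_000, used)
theta = 4 * N * mu  # = 2 * (2 * N * mu), twice the expected count per generation
print(len(positions), "new mutations; theta =", theta)
```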

Adding mutations changes the complexity of the simulation quite a bit, because we must now add rows to the site and mutation tables and simplify them along with the nodes and edges. We might also want to add metadata to the sites or mutations, recording details such as the selection coefficient of a mutation, or the type of mutation (e.g., synonymous vs. non-synonymous).

We will write a mutation function here which we will re-use in future examples.

Note

We will be treating mutations as neutral. Doing so is odd, as one big selling point of tskit is the ability to skip the tracking of neutral mutations in forward simulations. However, tracking neutral mutations plus metadata is the same as tracking selected mutations and their metadata, and being able to do neat things like put your selected mutations onto a figure of the genealogy is one of several possible use cases.

Todo

The rest of this tutorial is still under construction, and needs porting from this workbook. This will primarily deal with sites and mutations (and mutational metadata). We could also include details on selection, if that seems sensible.

The section in that workbook on “Starting with a prior history” should be put in the Recapitating a forward simulation tutorial.