Changelogs¶

Python¶

[0.X.X] - 2020-XX-XX¶

Features

Expose the TreeSequence.coiterate() method to allow simultaneous iteration over two tree sequences, aiding comparison of their trees. (@jeromekelleher, @hyanwong, #1021, #1022)
tskit is now supported on, and has wheels for, Python 3.9. (@benjeffery, #982, #907)
Tree.newick() now has extra option include_branch_lengths to allow branch lengths to be omitted (@hyanwong, #931).
Added Tree.generate_star static method to create star-topologies (@hyanwong, #934).
Added Tree.generate_comb and Tree.generate_balanced methods to create example trees. (@jeromekelleher, #1026).
Added equals method to TreeSequence, TableCollection and each of the tables which provides more flexible equality comparisons, for example, allowing users to ignore metadata or provenance in the comparison. (@mufernando, @jeromekelleher, #896, #897, #913, #917).
Added __eq__ to TreeSequence. (@benjeffery, #1011, #1020)
ts.dump and tskit.load now support reading and writing file objects such as FIFOs and sockets. (@benjeffery, #657, #909)
Added tskit.write_ms for writing to MS format. (@saurabhbelsare, #727, #854)
Added TableCollection.indexes for access to the edge insertion/removal order indexes. (@benjeffery, #4, #916)
The dictionary representation of a TableCollection now contains its index. (@benjeffery, #870, #921)
Added TreeSequence._repr_html_ for use in Jupyter notebooks. (@benjeffery, #872, #923)
Added TreeSequence.__repr__ to display a summary for terminal usage. (@benjeffery, #938, #985)
Added TableCollection.dump and TableCollection.load. This allows table collections that are not valid tree sequences to be manipulated. (@benjeffery, #14, #986)
Added nbytes method to tables, TableCollection and TreeSequence which reports the size in bytes of those objects. (@jeromekelleher, @benjeffery, #54, #871)
Added TableCollection.clear to clear data table rows and optionally provenances, table schemas and tree-sequence level metadata and schema. (@benjeffery, #929, #1001)

Bugfixes

LightWeightTableCollection.asdict and TableCollection.asdict now return copies of arrays. (@benjeffery, #1025, #1029)

Breaking changes

The argument to ts.dump and tskit.load has been renamed from path to file.
All arguments to Tree.newick() except precision are now keyword-only.
Renamed ts.trait_regression to ts.trait_linear_model.

[0.3.2] - 2020-09-29¶

Breaking changes

The argument order of Tree.unrank and combinatorics.num_labellings now positions the number of leaves before the tree rank (@daniel-goldstein, #950, #978)
Change several methods (simplify(), trees(), Tree()) so most parameters are keyword only, not positional. This allows reordering of parameters, so that deprecated parameters can be moved, and the parameter order in similar functions, e.g. TableCollection.simplify and TreeSequence.simplify() can be made consistent (@hyanwong, #374, #846, #851)
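As a rough illustration of why keyword-only parameters make reordering safe, here is a minimal sketch in plain Python. The function `simplify_sketch` and its options are hypothetical stand-ins chosen for this example; they are not tskit's actual signatures.

```python
def simplify_sketch(samples=None, *, filter_sites=True, keep_unary=False):
    """Hypothetical stand-in for a method whose options are keyword-only."""
    # Everything after the bare * must be passed by name, so the order in
    # which these options are declared can change without breaking callers.
    return {
        "samples": samples,
        "filter_sites": filter_sites,
        "keep_unary": keep_unary,
    }


# Passing options by name works regardless of declaration order.
result = simplify_sketch([0, 1], keep_unary=True)

# Passing them positionally is rejected with a TypeError.
try:
    simplify_sketch([0, 1], False, True)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Code that already names its arguments is unaffected by this kind of change; only calls that relied on argument position need updating.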

Features

Add split_polytomies method to the Tree class (@hyanwong, @jeromekelleher, #809, #815)
Tree accessor functions (e.g. ts.first(), ts.at()) now pass extra parameters such as sample_indexes to the underlying Tree constructor; also root_threshold can be specified when calling ts.trees() (@hyanwong, #847, #848)
Genomic intervals returned by Python functions are now namedtuples, allowing .left, .right and .span usage (@hyanwong, #784, #786, #811)
Added include_terminal parameter to edge diffs iterator, to output the last edges at the end of a tree sequence (@hyanwong, #783, #787)
#832 - Add metadata_bytes method to allow access to raw TableCollection metadata (@benjeffery, #842)
New tree.is_isolated(u) method (@hyanwong, #443).
tskit.is_unknown_time can now check arrays. (@benjeffery, #857).
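The namedtuple-style intervals mentioned in the list above can be pictured with the following sketch. The `Interval` class here is a stand-in written for illustration and assumes a half-open interval; it is not copied from tskit.

```python
from collections import namedtuple


class Interval(namedtuple("Interval", ["left", "right"])):
    """Illustrative genomic interval, assumed half-open: [left, right)."""

    __slots__ = ()

    @property
    def span(self):
        # Length of the genome covered by the interval.
        return self.right - self.left


interval = Interval(2.0, 7.5)
left, right = interval  # still unpacks like a plain tuple
```

Because the object remains a tuple, code that unpacked `(left, right)` pairs should keep working, while new code can use the named attributes.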

[0.3.1] - 2020-09-04¶

Bugfixes

#823 - Fix mutation time error when using simplify(keep_input_roots=True) (@petrelharp, #823).
#821 - Fix mutation rows with unknown time never being equal (@petrelharp, #822).
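Equality problems of this kind usually arise because a NaN never compares equal to itself. The sketch below shows the effect and one way around it by comparing bit patterns. The sentinel bit pattern is an assumption made for this example and may differ from the value tskit uses; `bits_of` and `is_unknown` are hypothetical helpers.

```python
import struct

# Assumed bit pattern for an "unknown time" sentinel: a quiet NaN with a
# distinguishing payload. The exact value used by tskit may differ.
SENTINEL_BITS = 0x7FF8000000000001


def bits_of(value):
    """Return the 64-bit IEEE 754 pattern of a float as an integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def is_unknown(value):
    """True only for the sentinel NaN, not for ordinary numbers."""
    return bits_of(value) == SENTINEL_BITS


unknown = struct.unpack("<d", struct.pack("<Q", SENTINEL_BITS))[0]

# NaN never compares equal to itself, so a naive row comparison fails...
naive_equal = unknown == unknown
# ...whereas comparing bit patterns treats two unknown times as equal.
bitwise_equal = bits_of(unknown) == bits_of(unknown)
```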

[0.3.0] - 2020-08-27¶

Major feature release for metadata schemas, set-like operations, mutation times, SVG drawing improvements and many others.

Breaking changes

The default display order for tree visualisations has been changed to minlex (see below) to stabilise the node ordering and to make trees more readily comparable. The old behaviour is still available with order="tree".
File system operations such as dump/load now raise an appropriate OSError instead of tskit.FileFormatError. Loading from an empty file now raises an EOFError.
Bad tree topologies are detected earlier, so that it is no longer possible to create a TreeSequence object which contains a parent with contradictory children on an interval. Previously an error was thrown when some operation building the trees was attempted (@jeromekelleher, #709).
The TableCollection object no longer implements the iterator protocol. Previously list(tables) returned a sequence of (table_name, table_instance) tuples. This has been replaced with the more intuitive and future-proof TableCollection.name_map and TreeSequence.tables_dict attributes, which perform the same function (@jeromekelleher, #500, #694).
The arguments to TreeSequence.genotype_matrix, TreeSequence.haplotypes and TreeSequence.variants must now be keyword arguments, not positional. This is to support the change from impute_missing_data to isolated_as_missing in the arguments to these methods. (@benjeffery, #716, #794)

New features

New methods to perform set operations on TableCollections and TreeSequences. TableCollection.subset subsets and reorders table collections by nodes (@mufernando, @petrelharp, #663, #690). TableCollection.union forms the node-wise union of two table collections (@mufernando, @petrelharp, #381, #623).
Mutations now have an optional double-precision floating-point time column. If not specified, this defaults to a particular NaN value (tskit.UNKNOWN_TIME) indicating that the time is unknown. For a tree sequence to be considered valid it must meet new criteria for mutation times, see Mutation requirements. Also added function TableCollection.compute_mutation_times. Table sorting orders mutations by non-increasing time per-site, which is also a requirement for a valid tree sequence (@benjeffery, #672).
Add support for trees with internal samples for the Kendall-Colijn tree distance metric. (@daniel-goldstein, #610)
Add background shading to SVG tree sequences to reflect tree position along the sequence (@hyanwong, #563).
Tables with a metadata column now have a metadata_schema that is used to validate and encode metadata that is passed to add_row, and to decode metadata on calls to table[j] and e.g. tree_sequence.node(j). See Metadata (@benjeffery, #491, #542, #543, #601).
The tree-sequence now has top-level metadata with a schema (@benjeffery, #666, #644, #642).
Add classes to SVG drawings to allow easy adjustment and styling, and document the new tskit.Tree.draw_svg() and tskit.TreeSequence.draw_svg() methods. This also fixes #467 for duplicate SVG entity ids in Jupyter notebooks (@hyanwong, #555).
Add a to_nexus function that outputs a tree sequence in Nexus format (@saunack, #550).
Add extension of Kendall-Colijn tree distance metric for tree sequences computed by TreeSequence.kc_distance (@daniel-goldstein, #548).
Add an optional node traversal order in tskit.Tree that uses the minimum lexicographic order of leaf nodes visited. This ordering ("minlex_postorder") adds more determinism because it constrains the order in which children of a node are visited (@brianzhang01, #411).
Add an order argument to the tree visualisation functions which supports two node orderings: "tree" (the previous default) and "minlex" which stabilises the node ordering (making it easier to compare trees). The default node ordering is changed to "minlex" (@brianzhang01, @jeromekelleher, #389, #566).
Add _repr_html_ to tables, so that Jupyter notebooks render them as HTML tables (@benjeffery, #514).
Remove support for kc_distance on trees with unary nodes (@daniel-goldstein, #508).
Improve Kendall-Colijn tree distance algorithm to operate in O(n^2) time instead of O(n^2 * log(n)) where n is the number of samples (@daniel-goldstein, #490).
Add a metadata column to the migrations table. Works similarly to existing metadata columns on other tables (@benjeffery, #505).
Add a metadata column to the edges table. Works similarly to existing metadata columns on other tables (@benjeffery, #496).
Allow sites with missing data to be output by the haplotypes method, by default replacing with -. Errors are no longer raised for missing data with isolated_as_missing=True; the error types returned for bad alleles (e.g. multiletter or non-ascii) have also changed from _tskit.LibraryError to TypeError, or ValueError if the missing data character clashes (@hyanwong, #426).
Access the number of children of a node in a tree directly using tree.num_children(u) (@hyanwong, #436).
User specified allele mapping for genotypes in variants and genotype_matrix (@jeromekelleher, #430).
New root_threshold option for the Tree class, which allows us to efficiently iterate over ‘real’ roots when we have missing data (@jeromekelleher, #462).
Add pickle support for TreeSequence (@terhorst, #473).
Add tree.as_dict_of_dicts() function to enable use with networkx. See Traversals with networkx (@winni2k, #457).
Add tree_sequence.to_macs() function to convert tree sequence to MACS format (@winni2k, #727)
Add a keep_input_roots option to simplify which, if enabled, adds edges from the MRCAs of samples in the simplified tree sequence back to the roots in the input tree sequence (@jeromekelleher, #775, #782).
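The idea behind the "minlex" ordering in the list above can be sketched in a few lines of plain Python: at each node, child subtrees are visited in order of the smallest leaf ID they contain. This is a simplified illustration, not tskit's implementation; the function name and the `children` dictionary format are assumptions made for this example.

```python
def minlex_postorder_sketch(children, root):
    """Return nodes in postorder, visiting subtrees by their smallest leaf ID.

    `children` maps each internal node to a list of its child nodes
    (a hypothetical input format chosen for this sketch).
    """

    def visit(node):
        # Returns (minimum leaf ID below node, postorder of node's subtree).
        kids = children.get(node, [])
        if not kids:
            return node, [node]
        # Sibling subtrees have disjoint leaves, so their minima are unique.
        visited = sorted(visit(child) for child in kids)
        nodes = [n for _, subtree in visited for n in subtree]
        nodes.append(node)
        return visited[0][0], nodes

    return visit(root)[1]


# The same topology described with children listed in different orders.
tree_a = {6: [4, 5], 4: [0, 3], 5: [2, 1]}
tree_b = {6: [5, 4], 5: [1, 2], 4: [3, 0]}

order_a = minlex_postorder_sketch(tree_a, 6)
order_b = minlex_postorder_sketch(tree_b, 6)
```

Both inputs give the same traversal, which is what makes drawings of topologically identical trees easier to compare.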

Bugfixes

#453 - Fix LibraryError when tree.newick() is called with large node time values (@jeromekelleher, #637).
#777 - Mutations over isolated samples were incorrectly decoded as missing data. (@jeromekelleher, #778)
#776 - Fix a segfault when a partial list of samples was provided to the variants iterator. (@jeromekelleher, #778)

Deprecated

The sample_counts feature has been deprecated and is now ignored. Sample counts are now always computed.
For TreeSequence.genotype_matrix, TreeSequence.haplotypes and TreeSequence.variants the impute_missing_data argument is deprecated and replaced with isolated_as_missing. Note that to get the same behaviour impute_missing_data=True should be replaced with isolated_as_missing=False. (@benjeffery, #716, #794)

[0.2.3] - 2019-11-22¶

Minor feature release, providing a tree distance metric and various methods to manipulate tree sequence data.

New features

Kendall-Colijn tree distance metric computed by Tree.kc_distance (@awohns, #172).
New “timeasc” and “timedesc” orders for tree traversals (@benjeffery, #246, #399).
Up to 2X performance improvements to tree traversals (@benjeffery, #400).
Add trim, delete_sites, keep_intervals and delete_intervals methods to edit tree sequence data. (@hyanwong, #364, #372, #377, #390).
Initial online documentation for CLI (@hyanwong, #414).
Various documentation improvements (@hyanwong, @jeromekelleher, @petrelharp).
Rename the map_ancestors function to link_ancestors (@hyanwong, @gtsambos; #406, #262). The original function is retained as a deprecated alias.

Bugfixes

Fix height scaling issues with SVG tree drawing (@jeromekelleher, #407, #383, #378).
Do not reuse buffers in LdCalculator (@jeromekelleher). See #397 and #396.

[0.2.2] - 2019-09-01¶

Minor bugfix release.

Relaxes overly-strict input requirements on individual location data that caused some SLiM tree sequences to fail loading in version 0.2.1 (see #351).

New features

Add log_time height scaling option for drawing SVG trees (@marianne-aspbury). See #324 and #303.

Bugfixes

Allow 4G metadata columns (@jeromekelleher). See #342 and #341.

[0.2.1] - 2019-08-23¶

Major feature release, adding support for population genetic statistics, improved VCF output and many other features.

Note: Version 0.2.0 was skipped because of an error uploading to PyPI which could not be undone.

Breaking changes

Genotype arrays returned by TreeSequence.variants and TreeSequence.genotype_matrix have changed from unsigned 8 bit values to signed 8 bit values to accommodate missing data (see #144 for discussion). Specifically, the dtype of the genotypes arrays has changed from numpy uint8 to int8. This should not affect client code in any way unless it specifically depends on the type of the returned numpy array.
The VCF written by write_vcf is no longer compatible with previous versions, which had significant shortcomings. Position values are now rounded to the nearest integer by default, and REF and ALT values are derived from the actual allelic states (rather than always being A and T). Sample names are now of the form tsk_j for sample ID j. Most of the legacy behaviour can be recovered with new options, however.
The positional parameter reference_sets in genealogical_nearest_neighbours and mean_descendants TreeSequence methods has been renamed to sample_sets.
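The effect of the switch from unsigned to signed 8 bit genotypes can be seen with a small standard-library sketch. It assumes missing data is encoded as -1, and the byte values are made up for illustration.

```python
import array

MISSING = -1  # assumed sentinel for missing data in this sketch

# The same raw bytes interpreted as unsigned and as signed 8-bit integers.
raw = bytes([0, 1, 255, 2])
as_unsigned = array.array("B", raw).tolist()
as_signed = array.array("b", raw).tolist()

# With a signed type the byte 0xFF reads as -1, leaving 0..127 for alleles.
missing_positions = [j for j, g in enumerate(as_signed) if g == MISSING]
```

Code that only reads allele indexes is unaffected; code that assumed an unsigned type (for example when using genotypes as array offsets) would need to handle the negative sentinel.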

New features

Support for general windowed statistics. Implementations of diversity, divergence, segregating sites, Tajima’s D, Fst, Patterson’s F statistics, Y statistics, trait correlations and covariance, and k-dimensional allele frequency spectra (@petrelharp, @jeromekelleher, @molpopgen).
Add the keep_unary option to simplify (@gtsambos). See #1 and #143.
Add the map_ancestors method to TableCollection (@gtsambos). See #175.
Add the squash method to EdgeTable (@gtsambos). See #59 and #285.
Add support for individuals to VCF output, and fix major issues with output format (@jeromekelleher). Position values are transformed in a much more straightforward manner and output has been generalised substantially. Adds individual_names and position_transform arguments. See #286, and issues #2, #30 and #73.
Control height scale in SVG trees using ‘tree_height_scale’ and ‘max_tree_height’ (@hyanwong, @jeromekelleher). See #167, #168. Various other improvements to tree drawing (#235, #241, #242, #252, #259).
Add Tree.max_root_time property (@hyanwong, @jeromekelleher). See #170.
Improved input checking on various methods taking numpy arrays as parameters (@hyanwong). See #8 and #185.
Define the branch length over roots in trees to be zero (previously raised an error; @jeromekelleher). See #188 and #191.
Implementation of the genealogical nearest neighbours statistic (@hyanwong, @jeromekelleher).
New delete_intervals and keep_intervals methods for the TableCollection to allow slicing out of topology from specific intervals (@hyanwong, @andrewkern, @petrelharp, @jeromekelleher). See #225 and #261.
Support for missing data via a topological definition (@jeromekelleher). See #270 and #272.
Add ability to set columns directly in the Tables API (@jeromekelleher). See #12 and #307.
Various documentation improvements from @brianzhang01, @hyanwong, @petrelharp and @jeromekelleher.

Deprecated

Deprecate Tree.length in favour of Tree.span (@hyanwong). See #169.
Deprecate TreeSequence.pairwise_diversity in favour of the new diversity method. See #215, #312.

Bugfixes

Catch NaN and infinity values within tables (@hyanwong). See #293 and #294.

[0.1.5] - 2019-03-27¶

This release removes support for Python 2, adds more flexible tree access and a new tskit command line interface.

New features

Remove support for Python 2 (@hugovk). See #137 and #140.
More flexible tree API (#121). Adds TreeSequence.at and TreeSequence.at_index methods to find specific trees, and efficient support for backwards traversal using reversed(ts.trees()).
Add initial tskit CLI (#80)
Add tskit info CLI command (#66)
Enable drawing SVG trees with coloured edges (@hyanwong; #149).
Add Tree.is_descendant method (#120)
Add Tree.copy method (#122)

Bugfixes

Fixes to the low-level C API (#132 and #157)

[0.1.4] - 2019-02-01¶

Minor feature update. Using the C API 0.99.1.

New features

Add interface for setting TableCollection.sequence_length: https://github.com/tskit-dev/tskit/issues/107
Add support for building and dropping TableCollection indexes: https://github.com/tskit-dev/tskit/issues/108

[0.1.3] - 2019-01-14¶

Bugfix release.

Bugfixes

Fix missing provenance schema: https://github.com/tskit-dev/tskit/issues/81

[0.1.2] - 2019-01-14¶

Bugfix release.

Bugfixes

Fix memory leak in table collection. https://github.com/tskit-dev/tskit/issues/76

[0.1.1] - 2019-01-11¶

Fixes broken distribution tarball for 0.1.0.

[0.1.0] - 2019-01-11¶

Initial release after separation from msprime 0.6.2. Code that reads tree sequence files and processes them should be able to work without changes.

Breaking changes

Removal of the previously deprecated sort_tables, simplify_tables and load_tables functions. All code should change to using corresponding TableCollection methods.
Rename SparseTree class to Tree.

[1.1.0a1] - 2019-01-10¶

Initial alpha version posted to PyPI for bootstrapping.

[0.0.0] - 2019-01-10¶

Initial extraction of tskit code from msprime. Relicense to MIT.

Code copied at hash 29921408661d5fe0b1a82b1ca302a8b87510fd23

C API¶

[0.99.8] - 2020-XX-XX¶

Breaking changes

Added an options argument to tsk_table_collection_equals and table equality methods to allow for more flexible equality criteria (e.g., ignore top-level metadata and schema or provenance tables). Existing code should add an extra final parameter 0 to retain the current behaviour. (@mufernando, @jeromekelleher, #896, #897, #913, #917).
Changed default behaviour of tsk_table_collection_clear to not clear provenances and added options argument to optionally clear provenances and schemas. (@benjeffery, #929, #1001)
Exposed tsk_table_collection_set_indexes to the API. (@benjeffery, #870, #921)
Renamed ts.trait_regression to ts.trait_linear_model.
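The pattern of an options bitmask, where passing 0 retains the strict default, can be sketched as follows. This is written in Python for brevity rather than C; the flag names, their values and the `records_equal` helper are all hypothetical and do not correspond to the library's actual constants.

```python
from enum import IntFlag


class CompareOptions(IntFlag):
    """Hypothetical option flags; names and values are illustrative only."""

    IGNORE_METADATA = 1 << 0
    IGNORE_PROVENANCE = 1 << 1


def records_equal(a, b, options=0):
    """Compare two dict records, skipping fields selected by `options`."""
    skipped = set()
    if options & CompareOptions.IGNORE_METADATA:
        skipped.add("metadata")
    if options & CompareOptions.IGNORE_PROVENANCE:
        skipped.add("provenance")
    keys = (set(a) | set(b)) - skipped
    return all(a.get(k) == b.get(k) for k in keys)


x = {"edges": [(0, 1)], "metadata": b"run-1", "provenance": "v1"}
y = {"edges": [(0, 1)], "metadata": b"run-2", "provenance": "v2"}

strict = records_equal(x, y, 0)  # options of 0 keeps the strict comparison
relaxed = records_equal(
    x, y, CompareOptions.IGNORE_METADATA | CompareOptions.IGNORE_PROVENANCE
)
```

Flags combine with bitwise OR, so new comparison options can be added later without changing the function signature again.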

[0.99.7] - 2020-09-29¶

Added TSK_INCLUDE_TERMINAL option to tsk_diff_iter_init to output the last edges at the end of a tree sequence (@hyanwong, #783, #787)
Added tsk_bug_assert for assertions that should be compiled into release binaries (@benjeffery, #860)

[0.99.6] - 2020-09-04¶

Bugfixes

#823 - Fix mutation time error when using tsk_table_collection_simplify with TSK_KEEP_INPUT_ROOTS (@petrelharp, #823).

[0.99.5] - 2020-08-27¶

Breaking changes

The macro TSK_IMPUTE_MISSING_DATA is renamed to TSK_ISOLATED_NOT_MISSING (@benjeffery, #716, #794)

New features

Add a TSK_KEEP_INPUT_ROOTS option to simplify which, if enabled, adds edges from the MRCAs of samples in the simplified tree sequence back to the roots in the input tree sequence (@jeromekelleher, #775, #782).

Bugfixes

#777 - Mutations over isolated samples were incorrectly decoded as missing data. (@jeromekelleher, #778)
#776 - Fix a segfault when a partial list of samples was provided to the variants iterator. (@jeromekelleher, #778)

[0.99.4] - 2020-08-12¶

Note

The TSK_VERSION_PATCH macro was incorrectly set to 4 for 0.99.3, so both 0.99.4 and 0.99.3 have the same value.

Changes

Mutation times can be a mixture of known and unknown as long as for each individual site they are either all known or all unknown (@benjeffery, #761).
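The per-site rule can be expressed as a short check. The sketch below is plain Python, represents unknown times by any NaN (a simplification assumed here), and uses a hypothetical helper name.

```python
import math
from collections import defaultdict


def sites_with_mixed_times(mutation_sites, mutation_times):
    """Return the site IDs whose mutations mix known and unknown times.

    Unknown times are represented here by any NaN, a simplification
    assumed for this sketch.
    """
    unknown_flags = defaultdict(set)
    for site, time in zip(mutation_sites, mutation_times):
        unknown_flags[site].add(math.isnan(time))
    # A site is inconsistent when it has both known and unknown times.
    return sorted(
        site for site, flags in unknown_flags.items() if len(flags) == 2
    )


nan = float("nan")
# Site 0 is all known and site 1 is all unknown: acceptable.
ok = sites_with_mixed_times([0, 0, 1, 1], [2.0, 1.0, nan, nan])
# Site 0 mixes a known and an unknown time: not acceptable.
bad = sites_with_mixed_times([0, 0, 1, 1], [2.0, nan, nan, nan])
```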

Bugfixes

Fix for including core.h under C++ (@petrelharp, #755).

[0.99.3] - 2020-07-27¶

Breaking changes

tsk_mutation_table_add_row has an extra time argument. If the time is unknown TSK_UNKNOWN_TIME should be passed. (@benjeffery, #672)
Change genotypes from unsigned to signed to accommodate missing data (see #144 for discussion). This only affects users of the tsk_vargen_t class. Genotypes are now stored as int8_t and int16_t types rather than the former unsigned types. The field names in the genotypes union of the tsk_variant_t struct returned by tsk_vargen_next have been renamed to i8 and i16 accordingly; care should be taken when updating client code to ensure that types are correct. The number of distinct alleles supported by 8 bit genotypes has therefore dropped from 255 to 127, with a similar reduction for 16 bit genotypes.
Change the tsk_vargen_init method to take an extra parameter alleles. To keep the current behaviour, set this parameter to NULL.
Edges can now have metadata. Hence edge methods now take two extra arguments: metadata and metadata length. The file format has also changed to accommodate this, but is backwards compatible. Edge metadata can be disabled for a table collection with the TSK_NO_EDGE_METADATA flag. (@benjeffery, #496, #712)
Migrations can now have metadata. Hence migration methods now take two extra arguments: metadata and metadata length. The file format has also changed to accommodate this, but is backwards compatible. (@benjeffery, #505)
The text dump of tables with metadata now includes the metadata schema as a header. (@benjeffery, #493)
Bad tree topologies are detected earlier, so that it is no longer possible to create a tsk_treeseq_t object which contains a parent with contradictory children on an interval. Previously an error occurred when some operation building the trees was attempted (@jeromekelleher, #709).

New features

New methods to perform set operations on table collections. tsk_table_collection_subset subsets and reorders table collections by nodes (@mufernando, @petrelharp, #663, #690). tsk_table_collection_union forms the node-wise union of two table collections (@mufernando, @petrelharp, #381, #623).
Mutations now have an optional double-precision floating-point time column. If not specified, this defaults to a particular NaN value (TSK_UNKNOWN_TIME) indicating that the time is unknown. For a tree sequence to be considered valid it must meet new criteria for mutation times, see Mutation requirements. Add tsk_table_collection_compute_mutation_times and new flag to tsk_table_collection_check_integrity:TSK_CHECK_MUTATION_TIME. Table sorting orders mutations by non-increasing time per-site, which is also a requirement for a valid tree sequence. (@benjeffery, #672)
Add metadata and metadata_schema fields to table collection, with accessors on tree sequence. These store arbitrary bytes and are optional in the file format. (@benjeffery, #641)
Add the TSK_KEEP_UNARY option to simplify (@gtsambos). See #1 and #143.
Add a set_root_threshold option to tsk_tree_t which allows us to set the number of samples a node must be an ancestor of to be considered a root (#462).
Change the semantics of tsk_tree_t so that sample counts are always computed, and add a new TSK_NO_SAMPLE_COUNTS option to turn this off (#462).
Tables with metadata now have an optional metadata_schema field that can contain arbitrary bytes. (@benjeffery, #493)
Tables loaded from a file can now be edited in the same way as any other table collection (@jeromekelleher, #536, #530).
Support for reading/writing to arbitrary file streams with the loadf/dumpf variants for tree sequence and table collection load/dump (@jeromekelleher, @grahamgower, #565, #599).
Add low-level sorting API and TSK_NO_CHECK_INTEGRITY flag (@jeromekelleher, #627, #626).
Add extension of Kendall-Colijn tree distance metric for tree sequences computed by tsk_treeseq_kc_distance (@daniel-goldstein, #548)

Deprecated

The TSK_SAMPLE_COUNTS option is now ignored and will print out a warning if used (#462).

[0.99.2] - 2019-03-27¶

Bugfix release. Changes:

Fix incorrect errors on tbl_collection_dump (#132)
Catch table overflows (#157)

[0.99.1] - 2019-01-24¶

Refinements to the C API as we move towards 1.0.0. Changes:

Change the _tbl_ abbreviation to _table_ to improve readability. Hence, we now have, e.g., tsk_node_table_t etc.
Change tsk_tbl_size_t to tsk_size_t.
Standardise public API to use tsk_size_t and tsk_id_t as appropriate.
Add tsk_flags_t typedef and consistently use this as the type used to encode bitwise flags. To avoid confusion, functions now have an options parameter.
Rename tsk_table_collection_position_t to tsk_bookmark_t.
Rename tsk_table_collection_reset_position to tsk_table_collection_truncate and tsk_table_collection_record_position to tsk_table_collection_record_num_rows.
Generalise tsk_table_collection_sort to take a bookmark as start argument.
Relax restriction that nodes in the samples argument to simplify must currently be marked as samples. (https://github.com/tskit-dev/tskit/issues/72)
Allow tsk_table_collection_simplify to take a NULL samples argument to specify “all samples in the current tables”.
Add support for building as a meson subproject.

[0.99.0] - 2019-01-14¶

Initial alpha version of the tskit C API tagged. Version 0.99.x represents the series of releases leading to version 1.0.0 which will be the first stable release. After 1.0.0, semver rules regarding API/ABI breakage will apply; however, in the 0.99.x series arbitrary changes may happen.

[0.0.0] - 2019-01-10¶

Initial extraction of tskit code from msprime. Relicense to MIT. Code copied at hash 29921408661d5fe0b1a82b1ca302a8b87510fd23