trout: finding tandem repeat outliers in a cohort

Wouter De Coster — Fri, 26 Jun 2026 09:00:00 +0200

I recently wrote about aSTRonaut, a small tool to visualise the motif composition of short tandem repeats. Visualisation is great when you already know which locus to look at, but if you have a whole cohort of genomes and want to know which samples stand out, and where, you need something that screens every locus for you. That is what trout does.

trout (Tandem Repeat OUTliers) takes a cohort of per-sample STRdust VCFs and, at every repeat locus, flags the samples whose alleles are unusual: unusually long, unusually composed, or both. VCFs from other STR genotypers may work too, but that is untested so far, so let me know if you run into issues with them. trout is a single dependency-free Rust binary, and it runs over a cohort of more than a thousand genomes in a few seconds.

Why both length and composition

For tandem repeats, length is the obvious thing to look at: most pathogenic expansions are simply long. But length is not the whole story. At RFC1 a benign AAAAG allele and a disease-causing AAGGG allele can be the same size — what differs is the motif composition. A screen that only looks at allele length would miss exactly that kind of signal.

So trout describes each allele by two things at once: its length, and its k-mer composition (the frequencies of the canonical repeat motifs in the sequence). Every allele becomes a point in that combined feature space, and trout uses DBSCAN to find the dense cluster of “normal” alleles at each locus. Anything that falls outside the cluster is reported as an outlier, together with which axis it deviated on — length, or a specific motif.
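To make the idea concrete, here is a minimal Python sketch of the approach described above. It is not trout's actual code. The names (`canonical`, `features`, `dbscan`, `find_outliers`), the scaling of length by the cohort median, the `eps` and `min_pts` values, and the rule for naming the deviating axes are all assumptions chosen for illustration. Treating every DBSCAN cluster as "normal" is also a simplification.

```python
from collections import Counter
from math import dist
from statistics import median

def canonical(kmer):
    """Smallest rotation, so AAG, AGA and GAA count as the same motif."""
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))

def features(seq, k, motifs, length_scale):
    """[scaled length, frequency of each canonical motif] for one allele."""
    n = len(seq) - k + 1
    counts = Counter(canonical(seq[i:i + k]) for i in range(n))
    return [len(seq) / length_scale] + [counts[m] / max(n, 1) for m in motifs]

def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN: cluster id >= 0 per point, or -1 for noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # noise, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:   # core point: keep expanding
                queue.extend(nb)
    return labels

def find_outliers(alleles, k, motifs, eps=0.5, min_pts=3):
    """Map each noise sample to the axes on which it deviates by more than eps."""
    names = list(alleles)
    scale = median(len(s) for s in alleles.values())  # assumed scaling choice
    points = [features(alleles[n], k, motifs, scale) for n in names]
    labels = dbscan(points, eps, min_pts)
    normal = [p for p, l in zip(points, labels) if l != -1]
    centre = [median(col) for col in zip(*normal)]
    axes = ["length"] + motifs
    return {
        name: [a for a, x, c in zip(axes, p, centre) if abs(x - c) > eps]
        for name, p, l in zip(names, points, labels) if l == -1
    }

# Hypothetical RFC1-like cohort: five ordinary alleles, one long, one AAGGG.
cohort = {f"S{i}": "AAAAG" * n for i, n in enumerate([10, 11, 12, 13, 14], 1)}
cohort["S6"] = "AAAAG" * 60   # same motif, much longer
cohort["S7"] = "AAGGG" * 12   # ordinary length, different motif
outliers = find_outliers(cohort, k=5, motifs=["AAAAG", "AAGGG"])
print(outliers)
```

In this toy cohort S6 deviates only on the length axis, while S7 has an ordinary length and deviates only on the two motif axes. A length-only screen would miss S7.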

Some examples

Here is the FXN locus (the GAA repeat behind Friedreich ataxia) across a cohort of long-read genomes. The bulk of the cohort sits as a dense cloud of short alleles; two samples (in red) sit far out to the right at ~210 bp — clear length outliers, exactly the kind of expansion you are screening for.

And here is a composition example at DMPK. On the length axis these samples are unremarkable, but their alleles are unusually rich in the CGG motif (vertical axis), so they sit well above the cohort cloud. A length-only screen would never have flagged them.

Each plot focuses on one axis, and the colours encode how a sample relates to it. Red points are the outliers on the axis shown — the length axis in the first plot, the CGG axis in the second — and they are the ones labelled with the sample name. Orange points are samples that trout flagged as outliers on a different axis; they are drawn for context so you can see, for example, that a length outlier is not also a composition outlier (and vice versa). Everything else is the density-shaded normal cloud, where darker means more samples stack on that spot, so a single marker never hides hundreds of samples. The plots are written straight to SVG and are built to stay openable even for large cohorts.

Using it

The basic invocation is just a glob of VCFs:

trout cohort/*.vcf.gz > outliers.tsv

That writes one row per outlier sample per locus. The remaining options (filtering and tuning the sensitivity), the output columns, and the algorithm details are documented in the repository. The plots are rendered with kuva, and trout reads the same STRdust VCFs as aSTRonaut, so the two tools sit naturally side by side: trout finds the loci and samples worth a closer look, and aSTRonaut shows you what is going on at the sequence level.

trout can pick the k-mer length per locus on its own (-k auto), detecting the motif period from the sequence — so a trinucleotide locus is scored on trinucleotides and a hexanucleotide locus on hexamers, without you specifying anything. It also writes a per-sample QC summary with a robust z-score, which is handy for spotting samples that come up as outliers everywhere (usually a coverage or contamination problem rather than biology).
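For readers who want a feel for these two ideas, here is a small Python sketch. It is an illustration under my own assumptions, not how trout implements either feature. `detect_period` guesses the motif period by checking how often a base equals the base `p` positions further along, and takes the smallest period that scores close to the best. `robust_z` is the usual median/MAD z-score. The function names, the `max_k` and `tol` defaults, and the choice to score per-sample outlier counts are all hypothetical.

```python
from statistics import median

def detect_period(seq, max_k=6, tol=0.05):
    """Smallest period p whose self-match score is within tol of the best."""
    scores = {}
    for p in range(1, min(max_k, len(seq) - 1) + 1):
        pairs = len(seq) - p
        # Fraction of positions where the base repeats p positions later.
        scores[p] = sum(seq[i] == seq[i + p] for i in range(pairs)) / pairs
    if not scores:
        return 1  # sequence too short to say anything
    best = max(scores.values())
    # Multiples of the true period score well too, so prefer the smallest.
    return min(p for p, s in scores.items() if s >= best - tol)

def robust_z(values):
    """Median/MAD z-scores; 1.4826 makes MAD comparable to a standard deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]  # no spread to scale by in this sketch
    return [(v - med) / (1.4826 * mad) for v in values]

print(detect_period("GAA" * 20))                         # trinucleotide locus
print(detect_period("GGCCTG" * 10))                      # hexanucleotide locus
print(detect_period("GAA" * 10 + "GAG" + "GAA" * 10))    # one interruption

# Hypothetical number of outlier loci per sample; the last sample looks suspicious.
outlier_counts = [3, 4, 5, 4, 3, 40]
print(robust_z(outlier_counts))
```

The median and MAD are used instead of the mean and standard deviation because a single extreme sample would otherwise inflate the spread and hide itself.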

How this was built

In the interest of transparency: trout was not written by hand. The code was written and tested by Claude, Anthropic’s coding agent, working under my supervision — I directed the design, decided what the tool should do and which features to add, and reviewed the result.

trout is open source and available on GitHub. Feedback and feature requests are welcome.

aSTRonaut, now in Rust

Wouter De Coster — Thu, 25 Jun 2026 09:00:00 +0200

For tandem repeat expansions, the length of the repeat is only part of the story. The motif composition can be just as important, because at some loci a change in motif is what makes a repeat pathogenic. RFC1 is the classic example: a benign AAAAG allele and a disease-causing AAGGG expansion can be the same size, with dramatically different consequences.

To make that visible, pathSTR draws repeats as sequence-motif plots: one row per sample, with each nucleotide coloured by the motif it belongs to. It is a simple idea, visually appealing, and remarkably informative. Of course I am not the first to come up with this kind of visualisation, which is sometimes also called a waterfall plot. The standalone version of the pathSTR visualisation is a Python script called aSTRonaut, which is now also available as a Rust program for speed and simplicity: a single small binary with no dependencies that produces self-contained HTML files.
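As a rough illustration of what "coloured by the motif it belongs to" means, the Python sketch below labels every nucleotide in a row with a motif and then merges the labels into runs, which is the information a plot needs to paint coloured blocks. This is a simplified greedy scan of my own, not aSTRonaut's algorithm. The names `label_motifs` and `segments` are hypothetical, and the motif list is supplied by hand.

```python
from itertools import groupby

def label_motifs(seq, motifs):
    """Greedy left-to-right scan: one label per nucleotide, None if no motif fits."""
    labels, i = [], 0
    while i < len(seq):
        for m in motifs:
            if m and seq.startswith(m, i):
                labels.extend([m] * len(m))  # every base of the unit gets the motif
                i += len(m)
                break
        else:
            labels.append(None)              # interruption: no known motif starts here
            i += 1
    return labels

def segments(labels):
    """Collapse per-base labels into (motif, run length in bp) blocks."""
    return [(m, len(list(run))) for m, run in groupby(labels)]

# Toy row: two AAAAG units, a one-base interruption, then one AAGGG unit.
row = "AAAAG" * 2 + "T" + "AAGGG"
print(segments(label_motifs(row, ["AAAAG", "AAGGG"])))
```

Each block then maps to one colour, and bases labelled None can be drawn in a neutral shade so that interruptions stay visible.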

A note on how it was made: I did not write the Rust by hand. The port was carried out by Claude, Anthropic’s coding agent, working under my supervision. I directed the design, made the calls on what the tool should do and which features to implement, and reviewed the result, while Claude wrote and tested the code. It was a genuinely productive way to work, and I think it is worth being transparent about it.

What it looks like

Here is RFC1 across VCFs from pathSTR, with the rows clustered by motif composition. The benign AAAAG alleles (light blue) sit together at the bottom, the rare AAGGG expansions (dark green) stand out at the top, and a handful of other motif variants fall in between.

Similarly, here is the HTT CAG repeat behind Huntington’s disease: the polyglutamine CAG tract (blue) grows in length from top to bottom, followed by the CCG-rich tail (pink), with the occasional interruption breaking up the pattern.

A couple of new tricks

The Rust version also comes with a few new features. It can guess the repeat motif length for each locus on its own (-k auto), so you do not have to specify it. And it can collapse identical alleles into a haplotype-frequency view, where each unique sequence becomes a single row next to a bar showing how many people carry it.
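The haplotype-frequency view boils down to counting identical sequences. A minimal Python sketch of that step, with a hypothetical function name and a text bar standing in for the real plot, could look like this. Note that it counts alleles rather than people, which is a simplification on my part.

```python
from collections import Counter

def collapse_haplotypes(alleles):
    """Unique sequences with their counts, most common first (ties sorted by sequence)."""
    counts = Counter(alleles)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy CAG alleles: three copies of 20 units, two of 17, one of 42.
alleles = ["CAG" * 20, "CAG" * 17, "CAG" * 20, "CAG" * 20, "CAG" * 17, "CAG" * 42]
for seq, n in collapse_haplotypes(alleles):
    # One row per unique sequence, with a bar showing how often it occurs.
    print(f"{len(seq):>4} bp  {'#' * n}  ({n})")
```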

The plots are rendered with kuva, a lovely new Rust scientific-plotting library. The example plots above use data from the pathSTR database.

aSTRonaut is open source and available on GitHub, with documentation. Feedback and feature requests are welcome.

Tandem Repeats on Gigabase or gigabyte