<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Tandem Repeats on Gigabase or gigabyte</title><link>https://wdecoster.github.io/gigabaseorgigabyte/tags/tandem-repeats/</link><description>Recent content in Tandem Repeats on Gigabase or gigabyte</description><generator>Hugo -- 0.163.3</generator><language>en-us</language><copyright>© Wouter De Coster</copyright><lastBuildDate>Fri, 26 Jun 2026 09:00:00 +0200</lastBuildDate><atom:link href="https://wdecoster.github.io/gigabaseorgigabyte/tags/tandem-repeats/index.xml" rel="self" type="application/rss+xml"/><item><title>trout: finding tandem repeat outliers in a cohort</title><link>https://wdecoster.github.io/gigabaseorgigabyte/posts/2026-06-26-trout-str-outliers/</link><pubDate>Fri, 26 Jun 2026 09:00:00 +0200</pubDate><author>Wouter De Coster</author><guid>https://wdecoster.github.io/gigabaseorgigabyte/posts/2026-06-26-trout-str-outliers/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>I recently wrote about <a href="/gigabaseorgigabyte/posts/astronaut-rust-port/">aSTRonaut</a>, a small tool to <em>visualise</em> the motif composition of short tandem repeats. Visualisation is great when you already know which locus to look at, but if you have a whole cohort of genomes and want to know <em>which samples stand out, and where</em>, you need something that screens every locus for you. That is what <strong>trout</strong> does.</p>
<p>trout (Tandem Repeat OUTliers) takes a cohort of per-sample <a href="https://github.com/wdecoster/STRdust">STRdust</a> VCFs and, at every repeat locus, flags the samples whose alleles are unusual: unusually long, unusually composed, or both. VCFs from other STR genotypers may work too, but that is untested so far, so let me know if you run into issues with them. It is a single dependency-free Rust binary, and it runs over a cohort of more than a thousand genomes in a few seconds.</p>
<h2 id="why-both-length-and-composition">Why both length and composition<a href="#why-both-length-and-composition" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>For tandem repeats, length is the obvious thing to look at: most pathogenic expansions are simply <em>long</em>. But length is not the whole story. At RFC1 a benign <code>AAAAG</code> allele and a disease-causing <code>AAGGG</code> allele can be the same size — what differs is the motif composition. A screen that only looks at allele length would miss exactly that kind of signal.</p>
<p>So trout describes each allele by two things at once: its length, and its k-mer composition (the frequencies of the canonical repeat motifs in the sequence). Every allele becomes a point in that combined feature space, and trout uses <a href="https://en.wikipedia.org/wiki/DBSCAN">DBSCAN</a> to find the dense cluster of &ldquo;normal&rdquo; alleles at each locus. Anything that falls outside the cluster is reported as an outlier, together with <em>which axis</em> it deviated on — length, or a specific motif.</p>
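<p>To make the idea concrete, here is a minimal Python sketch of the approach described above. It is not trout&rsquo;s actual code: the function names, the <code>length_scale</code> parameter, the choice of &ldquo;smallest rotation&rdquo; as the canonical form of a motif, and the <code>eps</code>/<code>min_pts</code> values are all assumptions made for illustration.</p>

```python
from collections import Counter
from math import dist


def canonical(motif: str) -> str:
    """Smallest rotation of a motif, so AAAAG, AAAGA, ... count as one motif.

    Assumption: trout's own definition of "canonical" may differ.
    """
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def features(seq: str, motifs: list[str], length_scale: float) -> list[float]:
    """Scaled allele length followed by the fraction of k-mers per motif.

    length_scale is a hypothetical knob that puts length on a scale
    comparable to the motif fractions (which lie between 0 and 1).
    """
    k = len(motifs[0])
    kmers = Counter(canonical(seq[i:i + k]) for i in range(len(seq) - k + 1))
    total = sum(kmers.values()) or 1
    return [len(seq) / length_scale] + [kmers[canonical(m)] / total for m in motifs]


def dbscan_noise(points: list[list[float]], eps: float, min_pts: int) -> list[int]:
    """Indices of the points DBSCAN labels as noise (neither core nor border)."""
    neighbours = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps] for p in points
    ]
    # A core point has at least min_pts points (itself included) within eps.
    core = {i for i, n in enumerate(neighbours) if len(n) >= min_pts}
    return [
        i for i, n in enumerate(neighbours)
        if i not in core and not any(j in core for j in n)
    ]


# Toy cohort: five ordinary AAAAG alleles, one AAGGG allele of the same size,
# and one long AAAAG expansion.
alleles = ["AAAAG" * n for n in (10, 11, 10, 9, 10)] + ["AAGGG" * 10, "AAAAG" * 40]
points = [features(a, ["AAAAG", "AAGGG"], length_scale=50.0) for a in alleles]
outliers = dbscan_noise(points, eps=0.3, min_pts=3)
print(outliers)  # indices of the alleles outside the dense cluster
```

<p>In this toy cohort the same-size <code>AAGGG</code> allele falls outside the cluster only because of its composition, and the long <code>AAAAG</code> allele only because of its length. Comparing an outlier&rsquo;s coordinates with the cluster then tells you which axis it deviated on.</p>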
<h2 id="some-examples">Some examples<a href="#some-examples" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Here is the <em>FXN</em> locus (the GAA repeat behind Friedreich ataxia) across a cohort of long-read genomes. The bulk of the cohort sits as a dense cloud of short alleles; two samples (in red) sit far out to the right at ~210 bp — clear length outliers, exactly the kind of expansion you are screening for.</p>
<p><img src="/gigabaseorgigabyte/images/2026-06_trout_fxn.png" alt="FXN GAA length outliers"></p>
<p>And here is a composition example at <em>DMPK</em>. On the length axis these samples are unremarkable, but their alleles are unusually rich in the <code>CGG</code> motif (vertical axis), so they sit well above the cohort cloud. A length-only screen would never have flagged them.</p>
<p><img src="/gigabaseorgigabyte/images/2026-06_trout_dmpk.png" alt="DMPK CGG composition outliers"></p>
<p>Each plot focuses on one axis, and the colours encode how a sample relates to it. Red points are the outliers <em>on the axis shown</em> — the length axis in the first plot, the <code>CGG</code> axis in the second — and they are the ones labelled with the sample name. Orange points are samples that trout flagged as outliers on a <em>different</em> axis; they are drawn for context so you can see, for example, that a length outlier is not also a composition outlier (and vice versa). Everything else is the density-shaded normal cloud, where darker means more samples stack on that spot, so a single marker never hides hundreds of samples. The plots are written straight to SVG and are built to stay openable even for large cohorts.</p>
<h2 id="using-it">Using it<a href="#using-it" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>The basic invocation is just a glob of VCFs:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">trout cohort/*.vcf.gz &gt; outliers.tsv
</span></span></code></pre></div><p>That writes one row per outlier sample per locus. The remaining options, such as filtering and sensitivity tuning, as well as the output columns and the algorithm details, are documented in the <a href="https://github.com/wdecoster/trout">repository</a>. The plots are rendered with <a href="https://psy-fer.github.io/kuva/">kuva</a>, and trout reads the same STRdust VCFs as aSTRonaut, so the two tools sit naturally side by side: trout finds the loci and samples worth a closer look, aSTRonaut shows you what is going on at the sequence level.</p>
<p>trout can pick the k-mer length per locus on its own (<code>-k auto</code>), detecting the motif period from the sequence — so a trinucleotide locus is scored on trinucleotides and a hexanucleotide locus on hexamers, without you specifying anything. It also writes a per-sample QC summary with a robust z-score, which is handy for spotting samples that come up as outliers everywhere (usually a coverage or contamination problem rather than biology).</p>
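<p>Both ideas in the paragraph above are easy to sketch. The Python below is only an illustration under my own assumptions, not how trout implements them: <code>detect_period</code> picks the smallest shift at which a sequence best matches itself, and <code>robust_z</code> is the usual median/MAD z-score. The function names, the <code>max_k</code> limit, and the toy counts are hypothetical.</p>

```python
from statistics import median


def detect_period(seq: str, max_k: int = 6) -> int:
    """Smallest shift k at which the sequence best matches itself."""
    best_k, best_score = 1, -1.0
    for k in range(1, min(max_k, len(seq) - 1) + 1):
        matches = sum(a == b for a, b in zip(seq, seq[k:]))
        score = matches / (len(seq) - k)
        if score > best_score:  # strict, so 3 wins over 6 for a pure CAG tract
            best_k, best_score = k, score
    return best_k


def robust_z(values: list[float]) -> list[float]:
    """Z-scores based on the median and the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Assumption: with no spread to scale by, report no deviation at all.
        return [0.0 for _ in values]
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    return [(v - med) / (1.4826 * mad) for v in values]


print(detect_period("CAG" * 10))     # trinucleotide repeat
print(detect_period("GGCCCC" * 8))   # hexanucleotide repeat

# Hypothetical number of outlier calls per sample across all loci.
outlier_calls = {"s1": 3, "s2": 4, "s3": 2, "s4": 5, "s5": 3, "s6": 4, "s7": 60}
z = dict(zip(outlier_calls, robust_z(list(outlier_calls.values()))))
print(z["s7"])  # far above the rest of the cohort
```

<p>Because the median and MAD barely move when one sample misbehaves, the problematic sample cannot hide its own deviation the way it could with a mean and standard deviation.</p>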
<h2 id="how-this-was-built">How this was built<a href="#how-this-was-built" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>In the interest of transparency: trout was not written by hand. The code was written and tested by <a href="https://claude.com/claude-code">Claude</a>, Anthropic&rsquo;s coding agent, working under my supervision — I directed the design, decided what the tool should do and which features to add, and reviewed the result.</p>
<p>trout is open source and available on <a href="https://github.com/wdecoster/trout">GitHub</a>. Feedback and feature requests are welcome.</p>
]]></content></item><item><title>aSTRonaut, now in Rust</title><link>https://wdecoster.github.io/gigabaseorgigabyte/posts/2026-06-25-astronaut-rust-port/</link><pubDate>Thu, 25 Jun 2026 09:00:00 +0200</pubDate><author>Wouter De Coster</author><guid>https://wdecoster.github.io/gigabaseorgigabyte/posts/2026-06-25-astronaut-rust-port/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>For tandem repeat expansions, the length of the repeat is only part of the story. The motif composition can be just as important, because at some loci a change in motif composition is what makes a repeat pathogenic. RFC1 is the classic example: a benign <code>AAAAG</code> allele and a disease-causing <code>AAGGG</code> expansion can be the same size, with dramatically different consequences.</p>
<p>To make that visible, <a href="https://pathstr.bioinf.be/">pathSTR</a> draws repeats as sequence-motif plots: one row per sample, with each nucleotide coloured by the motif it belongs to. It is a simple idea, visually appealing, and remarkably informative. Of course I am not the first to come up with such a visualisation, which is sometimes also called a <em>waterfall</em> plot. The standalone version of that pathSTR visualisation is a Python script called aSTRonaut, which is now also available as a Rust program for speed and simplicity: a single small binary with no dependencies that produces self-contained HTML files.</p>
<p>A note on how it was made: I did not write the Rust by hand. The port was carried out by <a href="https://claude.com/claude-code">Claude</a>, Anthropic&rsquo;s coding agent, working under my supervision. I directed the design, made the calls on what the tool should do and which features to implement, and reviewed the result, while Claude wrote and tested the code. It was a genuinely productive way to work, and I think it is worth being transparent about it.</p>
<h2 id="what-it-looks-like">What it looks like<a href="#what-it-looks-like" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Here is RFC1 across VCFs from <a href="https://pathstr.bioinf.be/">pathSTR</a>, with the rows clustered by motif composition. The benign <code>AAAAG</code> alleles (light blue) sit together at the bottom, the rare <code>AAGGG</code> expansions (dark green) stand out at the top, and a handful of other motif variants fall in between.</p>
<p><img src="/gigabaseorgigabyte/images/2026-06_astronaut_rfc1.png" alt="RFC1 motif composition across a cohort"></p>
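<p>The colouring behind such a plot can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions (a greedy left-to-right scan, motifs of equal length, rotations of a motif treated as the same motif), not the code aSTRonaut uses, and <code>label_bases</code> is a hypothetical name.</p>

```python
def canonical(motif: str) -> str:
    """Smallest rotation of a motif, so rotations share one label."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def label_bases(seq: str, motifs: list[str]) -> list[str]:
    """Assign every nucleotide the motif of the repeat unit it falls in.

    Assumes all motifs have the same length. Bases that are not part of a
    recognised unit are labelled "other" (an interruption).
    """
    k = len(motifs[0])
    known = {canonical(m): m for m in motifs}
    labels = []
    i = 0
    while i < len(seq):
        unit = seq[i:i + k]
        motif = known.get(canonical(unit)) if len(unit) == k else None
        if motif is not None:
            labels.extend([motif] * k)  # colour the whole unit at once
            i += k
        else:
            labels.append("other")      # single unassigned base
            i += 1
    return labels


# Two AAAAG units, a one-base interruption, then two AAGGG units.
row = label_bases("AAAAG" * 2 + "T" + "AAGGG" * 2, ["AAAAG", "AAGGG"])
print(row)
```

<p>Mapping each label to a colour and stacking one such row per sample gives the kind of picture shown above.</p>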
<p>Similarly, here is the <em>HTT</em> CAG repeat behind Huntington&rsquo;s disease: the polyglutamine <code>CAG</code> tract (blue) grows in length from
top to bottom, followed by the <code>CCG</code>-rich tail (pink), with the occasional interruption breaking up the pattern.</p>
<p><img src="/gigabaseorgigabyte/images/2026-06_astronaut_htt.png" alt="HTT CAG repeat"></p>
<h2 id="a-couple-of-new-tricks">A couple of new tricks<a href="#a-couple-of-new-tricks" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>The Rust version also comes with a few new features. It can guess the repeat motif length for each locus on its own (<code>-k auto</code>), so you do not have to specify it. And it can <strong>collapse</strong> identical alleles into a haplotype-frequency view, where each unique sequence becomes a single row next to a bar showing how many people carry it.</p>
<p><img src="/gigabaseorgigabyte/images/2026-06_astronaut_rfc1_collapsed.png" alt="Collapsed haplotype-frequency view of RFC1"></p>
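<p>Conceptually, collapsing amounts to grouping identical sequences and counting their carriers. A minimal Python sketch, with hypothetical sample names and a hypothetical <code>collapse</code> function rather than aSTRonaut&rsquo;s own implementation:</p>

```python
from collections import defaultdict


def collapse(alleles: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Group identical allele sequences and count the samples carrying each.

    Input is a list of (sample, sequence) pairs. A sample that carries the
    same sequence on both haplotypes is counted once.
    """
    carriers = defaultdict(set)
    for sample, seq in alleles:
        carriers[seq].add(sample)
    # Most common haplotype first, ties broken by sequence for stable output.
    return sorted(
        ((seq, len(samples)) for seq, samples in carriers.items()),
        key=lambda row: (-row[1], row[0]),
    )


alleles = [
    ("s1", "AAAAG" * 10), ("s1", "AAAAG" * 10),
    ("s2", "AAAAG" * 10), ("s2", "AAGGG" * 10),
    ("s3", "AAAAG" * 11), ("s3", "AAAAG" * 10),
]
for seq, n in collapse(alleles):
    print(n, seq)
```

<p>Each returned row corresponds to one line in the collapsed view, and the count is what the bar next to it would show.</p>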
<p>The plots are rendered with <a href="https://psy-fer.github.io/kuva/">kuva</a>, a lovely new Rust scientific-plotting library. The example
plots above use data from the <a href="https://pathstr.bioinf.be/">pathSTR</a> database.</p>
<p>aSTRonaut is open source and available on <a href="https://github.com/wdecoster/aSTRonaut">GitHub</a>, with documentation. Feedback and feature requests are welcome.</p>
]]></content></item></channel></rss>