<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Outliers on Gigabase or gigabyte</title><link>https://wdecoster.github.io/gigabaseorgigabyte/tags/outliers/</link><description>Recent content in Outliers on Gigabase or gigabyte</description><generator>Hugo -- 0.163.3</generator><language>en-us</language><copyright>© Wouter De Coster</copyright><lastBuildDate>Fri, 26 Jun 2026 09:00:00 +0200</lastBuildDate><atom:link href="https://wdecoster.github.io/gigabaseorgigabyte/tags/outliers/index.xml" rel="self" type="application/rss+xml"/><item><title>trout: finding tandem repeat outliers in a cohort</title><link>https://wdecoster.github.io/gigabaseorgigabyte/posts/2026-06-26-trout-str-outliers/</link><pubDate>Fri, 26 Jun 2026 09:00:00 +0200</pubDate><author>Wouter De Coster</author><guid>https://wdecoster.github.io/gigabaseorgigabyte/posts/2026-06-26-trout-str-outliers/</guid><description>&amp;lt;no value&amp;gt;</description><content type="text/html" mode="escaped"><![CDATA[<p>A little while ago I wrote about <a href="/gigabaseorgigabyte/posts/astronaut-rust-port/">aSTRonaut</a>, a small tool to <em>visualise</em> the motif composition of short tandem repeats. Visualisation is great when you already know which locus to look at, but if you have a whole cohort of genomes and want to know <em>which samples stand out, and where</em>, you need something that screens every locus for you. That is what <strong>trout</strong> does.</p>
<p>trout (Tandem Repeat OUTliers) takes a cohort of per-sample <a href="https://github.com/wdecoster/STRdust">STRdust</a> VCFs and, at every repeat locus, flags the samples whose alleles are unusual: unusually long, unusually composed, or both. VCFs from other STR genotypers may work too, but that is untested so far, so let me know if you run into issues with them. trout is a single dependency-free Rust binary, and it runs over a cohort of more than a thousand genomes in a few seconds.</p>
<h2 id="why-both-length-and-composition">Why both length and composition<a href="#why-both-length-and-composition" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>For tandem repeats, length is the obvious thing to look at: most pathogenic expansions are simply <em>long</em>. But length is not the whole story. At RFC1 a benign <code>AAAAG</code> allele and a disease-causing <code>AAGGG</code> allele can be the same size — what differs is the motif composition. A screen that only looks at allele length would miss exactly that kind of signal.</p>
<p>So trout describes each allele by two things at once: its length, and its k-mer composition (the frequencies of the canonical repeat motifs in the sequence). Every allele becomes a point in that combined feature space, and trout uses <a href="https://en.wikipedia.org/wiki/DBSCAN">DBSCAN</a> to find the dense cluster of &ldquo;normal&rdquo; alleles at each locus. Anything that falls outside the cluster is reported as an outlier, together with <em>which axis</em> it deviated on — length, or a specific motif.</p>
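<p>To make the idea concrete, here is a minimal Python sketch of length-plus-composition outlier detection. It is not trout's actual code (trout is written in Rust): the function names, the <code>length_scale</code> divisor that puts length and motif fractions on comparable scales, the <code>eps</code> and <code>min_pts</code> values, and the toy sequences are all assumptions chosen for illustration, and strand is ignored when canonicalising motifs.</p>

```python
from collections import Counter
from math import dist


def canonical(kmer: str) -> str:
    """Smallest rotation, so GAA, AAG and AGA count as the same motif."""
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))


def features(seq: str, motifs: list[str], length_scale: float = 30.0) -> list[float]:
    """Scaled allele length followed by the fraction of k-mers per motif."""
    k = len(motifs[0])
    counts = Counter(canonical(seq[i:i + k]) for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)
    return [len(seq) / length_scale] + [counts[m] / total for m in motifs]


def dbscan_noise(points: list[list[float]], eps: float, min_pts: int) -> list[int]:
    """Indices of DBSCAN noise points: neither core nor within eps of a core point."""
    neighbours = [
        [j for j, q in enumerate(points) if eps >= dist(p, q)]
        for p in points
    ]
    # A core point has at least min_pts points (itself included) within eps.
    core = {i for i, n in enumerate(neighbours) if len(n) >= min_pts}
    return [
        i for i, n in enumerate(neighbours)
        if i not in core and not any(j in core for j in n)
    ]


# Toy cohort at a GAA locus (hypothetical sample names and sequences).
alleles = {
    "s1": "GAA" * 9,
    "s2": "GAA" * 9,
    "s3": "GAA" * 10,
    "s4": "GAA" * 8,
    "s5": "GAA" * 9,
    "s6": "GAA" * 70,             # long allele
    "s7": "GAA" * 4 + "GGA" * 5,  # normal length, different composition
}
motifs = [canonical("GAA"), canonical("GGA")]
names = list(alleles)
points = [features(seq, motifs) for seq in alleles.values()]
outliers = [names[i] for i in dbscan_noise(points, eps=0.3, min_pts=3)]
print(outliers)
```

<p>On this toy cohort the sketch reports <code>s6</code> (a 210 bp allele) and <code>s7</code> (normal length, but rich in a different motif), while the five ordinary alleles form the dense cluster.</p>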
<h2 id="some-examples">Some examples<a href="#some-examples" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>Here is the <em>FXN</em> locus (the GAA repeat behind Friedreich ataxia) across a cohort of long-read genomes. The bulk of the cohort sits as a dense cloud of short alleles; two samples (in red) sit far out to the right at ~210 bp — clear length outliers, exactly the kind of expansion you are screening for.</p>
<p><img src="/gigabaseorgigabyte/images/2026-06_trout_fxn.png" alt="FXN GAA length outliers"></p>
<p>And here is a composition example at <em>DMPK</em>. On the length axis these samples are unremarkable, but their alleles are unusually rich in the <code>CGG</code> motif (vertical axis), so they sit well above the cohort cloud. A length-only screen would never have flagged them.</p>
<p><img src="/gigabaseorgigabyte/images/2026-06_trout_dmpk.png" alt="DMPK CGG composition outliers"></p>
<p>Each plot focuses on one axis, and the colours encode how a sample relates to it. Red points are the outliers <em>on the axis shown</em> — the length axis in the first plot, the <code>CGG</code> axis in the second — and they are the ones labelled with the sample name. Orange points are samples that trout flagged as outliers on a <em>different</em> axis; they are drawn for context so you can see, for example, that a length outlier is not also a composition outlier (and vice versa). Everything else is the density-shaded normal cloud, where darker means more samples stack on that spot, so a single marker never hides hundreds of samples. The plots are written straight to SVG and are built to stay openable even for large cohorts.</p>
<h2 id="using-it">Using it<a href="#using-it" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>The basic invocation is just a glob of VCFs:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">trout cohort/*.vcf.gz &gt; outliers.tsv
</span></span></code></pre></div><p>That writes one row per outlier sample per locus. The remaining options (such as filtering and sensitivity tuning), the output columns, and the algorithm details are all documented in the <a href="https://github.com/wdecoster/trout">repository</a>. The plots are rendered with <a href="https://psy-fer.github.io/kuva/">kuva</a>, and trout reads the same STRdust VCFs as aSTRonaut, so the two tools sit naturally side by side: trout finds the loci and samples worth a closer look, aSTRonaut shows you what is going on at the sequence level.</p>
<p>trout can pick the k-mer length per locus on its own (<code>-k auto</code>) by detecting the motif period from the sequence, so a trinucleotide locus is scored on trinucleotides and a hexanucleotide locus on hexamers without you having to specify anything. It also writes a per-sample QC summary with a robust z-score, which is handy for spotting samples that come up as outliers everywhere (usually a coverage or contamination problem rather than biology).</p>
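<p>Both ideas are easy to sketch in Python. The snippet below is an illustration under my own assumptions, not trout's implementation: <code>motif_period</code> guesses the period as the smallest shift at which a sequence best matches itself, and <code>robust_z</code> uses the common median/MAD form of a robust z-score. The function names, the <code>max_period</code> limit, and the per-sample outlier counts are hypothetical.</p>

```python
from statistics import median


def motif_period(seq: str, max_period: int = 6) -> int:
    """Smallest shift p at which seq agrees best with itself shifted by p."""
    best_period, best_score = 1, -1.0
    for p in range(1, min(max_period, len(seq) - 1) + 1):
        matches = sum(a == b for a, b in zip(seq, seq[p:]))
        score = matches / (len(seq) - p)
        # Strictly greater, so ties (e.g. 3 and 6 for GAA) keep the smaller period.
        if score > best_score:
            best_period, best_score = p, score
    return best_period


def robust_z(values: list[float]) -> list[float]:
    """(value - median) / (1.4826 * MAD); the constant matches the SD for normal data."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = (1.4826 * mad) or 1.0  # fall back to 1.0 when all deviations are zero
    return [(v - med) / scale for v in values]


print(motif_period("GAA" * 10))     # trinucleotide locus
print(motif_period("GGCCTG" * 8))   # hexanucleotide locus

# Hypothetical number of loci at which each sample was flagged.
outlier_counts = {"s1": 4, "s2": 5, "s3": 3, "s4": 4, "s5": 60}
z_scores = dict(zip(outlier_counts, robust_z(list(outlier_counts.values()))))
suspicious = [name for name, z in z_scores.items() if z > 5]
print(suspicious)
```

<p>Because the median and MAD barely move when one sample is extreme, the sample flagged at 60 loci stands out clearly here, whereas an ordinary mean-and-standard-deviation z-score would be pulled towards it.</p>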
<h2 id="how-this-was-built">How this was built<a href="#how-this-was-built" class="anchor" aria-hidden="true"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
      stroke-linecap="round" stroke-linejoin="round">
      <path d="M15 7h3a5 5 0 0 1 5 5 5 5 0 0 1-5 5h-3m-6 0H6a5 5 0 0 1-5-5 5 5 0 0 1 5-5h3"></path>
      <line x1="8" y1="12" x2="16" y2="12"></line>
   </svg></a></h2>
<p>In the interest of transparency: trout was not written by hand. The code was written and tested by <a href="https://claude.com/claude-code">Claude</a>, Anthropic&rsquo;s coding agent, working under my supervision — I directed the design, decided what the tool should do and which features to add, and reviewed the result.</p>
<p>trout is open source and available on <a href="https://github.com/wdecoster/trout">GitHub</a>. Feedback and feature requests are welcome.</p>
]]></content></item></channel></rss>