A very interesting false positive variant
Wouter De Coster
2019-05-09 14:20
Today I was, just like any other day, looking at structural variants (SVs) from Oxford Nanopore PromethION data, aligned using minimap2 and called with Sniffles. I came across a 100 Mb pericentromeric inversion of chr12, called in multiple individuals and visible in the alignments of all my samples. The region is not crazily repetitive (just a bit), and the breakpoints are not in a segmental duplication. Interesting, but altogether very unlikely. I asked my colleagues on the Biostars moderator Slack:

Now, what’s going on here… Below you see an IGV screenshot of one of the ‘breakpoints’. Notice the large insertions of about 3 kb, because that’s going to be the punchline.

I used a Python script to get the sequence of those large insertions, by parsing the CIGAR string:
https://gist.github.com/wdecoster/5427c7a87f1fb6d25608f8f90ea79303 get_large_insertions.py
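For readers who do not want to open the gist, the idea of walking a CIGAR string to pull out large insertions can be sketched roughly as follows. This is not the script from the gist, just a minimal pure-Python illustration: the function name `large_insertions`, its parameters, and the 0-based `ref_start` convention are assumptions made for this example, and a real script would more likely read the CIGAR and read sequence from a BAM file with a library such as pysam.

```python
import re

# One CIGAR operation: a length followed by an operation character (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # operations that consume bases of the read
REF_OPS = set("MDN=X")    # operations that consume bases of the reference


def large_insertions(cigar, read_seq, ref_start, min_len=1000):
    """Yield (reference_position, inserted_sequence) for insertions >= min_len.

    ref_start is assumed to be the 0-based reference position of the first
    aligned base; the reported position is the reference base that follows
    the insertion.
    """
    query_pos = 0
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I" and length >= min_len:
            yield ref_pos, read_seq[query_pos:query_pos + length]
        if op in QUERY_OPS:
            query_pos += length
        if op in REF_OPS:
            ref_pos += length


if __name__ == "__main__":
    # Toy read: 2 soft-clipped bases, 2 matches, a 6 bp insertion, 4 matches.
    toy_read = "ACGT" + "TTTTTT" + "GGCC"
    for pos, seq in large_insertions("2S2M6I4M", toy_read, ref_start=100, min_len=5):
        print(pos, seq)
```

The sequences yielded this way can then be written to a FASTA file and searched against the reference genome.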
After I got the insertions (see an example sequence at the bottom of the post) I could BLAT them, and lo and behold: a match is found at the location of the other ‘breakpoint’. So, it turns out there is a copy of this 3 kb sequence at both ‘breakpoints’, but it is only part of the reference genome at one of them.
The BLAT search also reveals where this sequence is coming from, as there is a second hit:

So… the inserted sequence matches the mRNA sequence of the TDG gene nearly perfectly! The Entrez description of this gene contains the following sentence:
This gene may have a pseudogene in the p arm of chromosome 12.
I can confirm it does! So this inserted sequence is a processed pseudogene that is integrated at two locations, only one of which made it into the reference genome. So no Nature paper, but a very interesting false positive variant.
example insertionTAGTGATCTGGACAATGTGCTTTTGGGAGAAGGCTCAAACTAGAGAAATCACAAGGGTATGCACCTCAACTTGTTTCCATGAGGTTCATTTGGTTATTAACAACAACAATGGTGATAGAGATCTTTAAAATTGTTGGCTTACTGGATTTACCGCTAGAAGAAATTCTGTGTAACATTTTGGTATCATGTACACTTAGCACAGTTTACGATGAGTTATCAGGTATCTACAAAGCACACTTGCTGTAAGTTATACCTTTTTAAATGAAGTCAAAGCCTAGTGGTGCAGTGTGCAATGAATTAAGTTACCTTTATCTGAAGTAACCTGGACAGATGTAAATCTTGCATTTGGTTTAGTCGAGTTCAACAAGCAAGCGGGAGGACTTACTCCACAAGTAAGCAAGCGGATAAGTAAAATACTAAGCAAGCATGGTTTTACACAACCTCCATTCCTCACTGGTGAAGCAAGAGGTCTGAGTGTGCATAAACAGAGACCCTAGTTCATAAACTAGTACCACTAAAAACTAACAAATGAGACTCCACTGAAACGCTTCCTCTAAAATCCAGCTTCTTTTTCTTGTAATTGGTAATTATTAGCATACCCTTCTTAGGTATAAGGGATATGGTATATCTAGGTTTATGCAATTGAATCAACTATAAATATTCAGACTTTATTTTAAAATTACAATTAAAGTGACTCCCAAAGAAGATAATTTCTTCATAAATGAGGAAAATATAGTATATAAAAGCTGGAATGATAATTAATTCAAATACACAAAGCTAGTATACAAAGCCCTACTTTCTTCTGCTTCTTAGGTTCATACGGTTCACTACCACTGGCTGTCTCATTACAATTACCACTGGGAAGTGGAAATACTAATAAAACACTTCAGTTAGCAAACTGAGGTTCTGTTGTTGACAACTATTCAAAATGCAGCATTTAAGCAGGCTGAGAAGCATTCTTAAGCATGAACTTCTTCTTCATTCTTGTGTTCCAGTGAGATTACTAAAGGAAGGAATTTGGTCTGCATATATAAACAGCTGGGTCATCCACTGCCATTAGGAATACCTGAAAGCCTGATTCTCCTCTTATCCACACGCTCTCAATTAGCCCATTTGAAGAAGCCACAAGGTTCACTGCTGCATGGATTTCTCCGTAAGCACCATATATACTGCCTCTGCTTCTGGATCATATTTTTGCTAATAACAGCCATCTTCTTTTGCATCCTCTTACTGGGCAAGCTGTAGGTCAAATGTATATTGCACCTCTTGAACGTCCATATCCGTTCAATGCCTTTCAACCGCTGATCTCTCTAGCCAATAGCTGAACTTTGTCTTGTTGGCTCAGGAAACTGGGCACATCTTGCACTGGATGATGGCATAACATAGCAGAGAGTTTCTGTGTCTGGAATCTTATGGGGGCGACCCAAATTCCAAGTTCTAGCAACCTTTTACTCCAAAAACTTCTTTACTAAAAATTTCATAAATACATTTCCATTAAAATGCTTGCTATTCGTGGCTGATAACTGTAATTTCTGTACTGGAATACGTCCTCCTTCACCACGAAATTCTTTACTGGAGAGATCACGCTGCGGGCGTGGTCACCATGTTGGTAAATCAATACCATACTTCCCTGGTAAGTGTGATCATCATATATGGTTCAACTTCGGACTTCACTGAGCCTGACGCAAACAAACACTTCCAAAATGAACTTGGTCAGGGGTAATGATGCCTTTGTAAGCAGCCATTGATCCCGGGTTTATACCAATAATGACAATGTCAGATTAGGTCAAAACCTTGGGGAGAGTGTACGAGTCGAAGTTCGGCTTCTGAAACACCATTAAAACAGTCTACTTTTCTTTTACTTTAAATGTGTCTGTAATTTTTTTCTTGTTTTCTTGATTTTTGAGACTTGCACTTAGTTTTTTTTGACTCAACAGGTTTTGGGTTCACTGGTTGTTTGGTTCTGTTGTTCTGGGATTTTCT
TTTTCTTCCTTTTTGAACTCTTGCACTGGTTCCTGGAAGCAGGAGCTGGGGCTGGAACTTCTTCTGGCATTTGCTGTTCATTCACCTTTGCCATATTGGGAGCTTCAGCCATCAGTTGTTGAAATGGAAACGTATAAAAGCTTGAGCTTGCTGAAGGGAATGGCGCGCGTTCTCATGCATTCCCGAGGCGCTGGTGCGGGCGGCTCCTGTACCAGGCACACAGGCAGCCAGTCTCTGTATTACGCGTGACTCGCGGTAAAACCTTCAGACCTGGGAAGCCCGATGGCTGGCAGTACTGGGCTGATGGCGGACTCAGCTCCTCCTCCAGGAAACTGCGGGTCCCCCGCAGGACCCCTAGTGTAGATGATTTTTTGGCTATGAGAGGTCTGGGTAGTGGGAAGGGACTGACTATCATAGGAGCCTGGGGCGGCTGTTCCGGAAGAGAGAGGCGGGAAATTGCAGAGGCAATGTCCTGGCAGAGCAGTGCCAGATGTCCACTAGACACAAGGTAGGCACGGTGATGTAAGCGAGAGGATCAGAGCAGTGAGAAGGACCACAAACTGTACTACAGGCCACCATGAGAACTTGGACTGCTCTGTAGATGGGTAGGAGCCATGGGAGGTTTGGGAGCAGGAGGCGGAGGCCTGATGTGGCTCGCCTAGGTCCCCGATAAATAAGTACATCTGCCTAACCACAGCGCCTTTGGCCAGCCTCCTCCTTACGCCTGCCACGACCACCTACACTCTTGTTCCTGGCAGGGAGTGGTGCACGAGAGGGGCTGGAAGCAGGAGTAGAGCTCGATCAAAGAGATACTTGCGCCTTCAGGACAGGGCAACATTGCATCGATCCTTATCATCATCAGCCTGGCAGAGACTCCGCCTGTGACTGCGGGCAAG TGAAACACCTCTCTTGTGCCTCAGTTCTCCTCAGATGGAGCTGATGGTAGTAC