January 11, 2019

Briefly evaluating SVIM

Wouter De Coster

Although it had already been available for some time, the long read structural variant (SV) caller SVIM was published as a preprint in December. As I have published on tools for long read sequencing before, I decided to take a look at SVIM as well.

I started from reads of our NA19240 PromethION genome, aligned to GRCh38 using minimap2, and compared the calls with the variants shared by Chaisson. I didn’t time it, but SV calling using SVIM was reasonably quick. Of note, SVIM currently does not output genotypes: it only reports the presence of a variant, without the zygosity at that position. The length distribution of the SVs is not entirely as expected, as it lacks the characteristic peaks at 300 and 6000 bp, which correspond to SVs involving Alu and L1 elements, respectively.
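To make the length check concrete, here is a minimal sketch of how SV lengths could be tallied from a VCF to look for the peaks around 300 and 6000 bp. It is not the code used for this post: it assumes each record carries an `SVLEN` entry in the INFO column (the conventional VCF key), the function names `sv_lengths` and `length_histogram` are made up for illustration, and the records are toy data.

```python
import re
from collections import Counter

# Matches an SVLEN entry in the INFO column; deletions usually have a negative length.
SVLEN_RE = re.compile(r"(?:^|;)SVLEN=(-?\d+)")


def sv_lengths(vcf_lines):
    """Yield absolute SV lengths from VCF records that carry an SVLEN INFO entry."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        match = SVLEN_RE.search(fields[7])
        if match:
            yield abs(int(match.group(1)))


def length_histogram(lengths, bin_width=100):
    """Count lengths per fixed-width bin, keyed by the lower edge of the bin."""
    return Counter((length // bin_width) * bin_width for length in lengths)


# Toy records (hypothetical, not real calls).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tN\t<DEL>\t12\tPASS\tSVTYPE=DEL;END=1310;SVLEN=-310",
    "chr1\t5000\t.\tN\t<INS>\t30\tPASS\tSVTYPE=INS;SVLEN=305",
    "chr2\t9000\t.\tN\t<INS>\t8\tPASS\tSVTYPE=INS;SVLEN=6020",
    "chr2\t9500\t.\tN\t<DEL>\t3\tPASS\tSVTYPE=DEL;END=9520;SVLEN=-20",
]

# Keep only variants of at least 50 bp, then bin them.
lengths = [length for length in sv_lengths(toy_vcf) if length >= 50]
hist = length_histogram(lengths)
```

On a real call set, an Alu peak would show up as an excess of counts in the bins around 300 bp, and an L1 peak in the bins around 6000 bp.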

The tool identified 3.7 million variants, of which 3.2 million are > 50 bp (the commonly used definition of a structural variant). This is of course far more than we would expect for a human genome (about 25000-30000 variants). However, SVIM also adds quality scores to its calls, so I used these to filter with increasing stringency and create a precision-recall graph. The code for this procedure is available here and mainly uses SURVIVOR, pandas, cyvcf2, and matplotlib. Without filtering, the tool reaches 80% recall but almost zero precision. Filtering on quality scores increases precision with a modest loss in recall. At 60% recall the precision is about 70%, a performance roughly similar to other structural variant callers in my published comparison. The author's decision to output all variants and let the user sort them out makes sense, although it would probably be more realistic not to write variants with a truly terrible quality score. Then again, setting cutoffs is difficult and invariably a tradeoff between sensitivity and specificity.
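The idea of the threshold sweep can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the linked code: it assumes every call has already been labelled as matching the truth set or not (for example after merging call and truth sets with SURVIVOR), that each true call matches exactly one truth variant, and that the size of the truth set is known. The function name `precision_recall_curve` and the toy numbers are hypothetical.

```python
def precision_recall_curve(calls, n_truth, thresholds):
    """Compute (threshold, precision, recall) for each quality threshold.

    calls: list of (quality, is_true_positive) tuples, one per variant call.
    n_truth: number of variants in the truth set.
    """
    curve = []
    for threshold in thresholds:
        # Keep only calls at or above the current quality threshold.
        kept = [is_tp for quality, is_tp in calls if quality >= threshold]
        n_tp = sum(kept)
        # Precision is undefined when no calls survive the filter.
        precision = n_tp / len(kept) if kept else None
        recall = n_tp / n_truth
        curve.append((threshold, precision, recall))
    return curve


# Toy data: (quality score, call matches a truth variant).
toy_calls = [
    (1, False), (2, False), (5, True), (8, False),
    (12, True), (20, True), (30, True),
]
curve = precision_recall_curve(toy_calls, n_truth=5, thresholds=[0, 5, 10])

for threshold, precision, recall in curve:
    print(f"q >= {threshold}: precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold discards low-quality calls first, so precision goes up while recall can only stay the same or drop, which is the tradeoff described above. The resulting (recall, precision) pairs are what would be passed to matplotlib for plotting.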

So, another structural variant caller for long reads. It is clear we haven’t seen the last of these, and it will take a while before the field has found and refined an optimal, mature tool.