When 50 is better than 50,000….
Last week Equinome – now badged as a division of PlusVital – announced the latest iteration of their "Elite Performance Test", version 3.0. One of their most prominent claims is that they are now "using" a platform of 670,000 genetic markers, which considers over 48,000 genetic markers related to elite racing and breeding performance. V3.0, which uses the 670K Equine SNP chip, was of course preceded by V2.0 that used the 75K Equine SNP chip and V1.0 that used the initial 50K Equine SNP chip that was launched back in 2011.
When compared to previous iterations, 670,000 genetic markers sounds impressive, right? Well, the truth, of course, is a little less impressive.
Earlier this year we used the same 670,000 SNPs that are on the 670K Equine SNP chip that Plusvital/Equinome are now using to conduct a study on the aptitudinal extremes of the Thoroughbred population – Sprinters and Long Distance horses – to take a closer look at what genes, and what variations within these genes, explain the difference between elite and non-elite horses. There were some interesting finds, including some new SNPs in a gene called EPAS1, which is involved in the response to variations in oxygen level (i.e. hypoxia). We now use some of these new SNPs in our own Elite Performance Test.
In the course of that study we did, however, learn a few things about the 670K SNP chip that explain why Plusvital/Equinome's claim of "using" 670,000 SNPs really means just genotyping 670,000 SNPs, and why it is far less impressive than it seems.
Firstly, over half of the SNPs on the 670K SNP chip are monomorphic in Thoroughbreds, that is, every Thoroughbred carries the same genotype at those positions, so they have zero use in genetic studies of the Thoroughbred. The original 50K and 70K SNP chips had a lot of Thoroughbred SNPs on them that were useful in explaining the difference between groups of Thoroughbreds, but with the 670K SNP chip, the consortium behind its development added a significant number of SNPs from other breeds like the Quarter Horse, Arabian, Warmblood, Paint, etc. to the chip so that it could be used for genetic studies in those breeds. As a result, well over half of the 670,000 SNPs that are on the SNP chip have no relevance to the Thoroughbred at all.
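To make the idea of a monomorphic SNP concrete, here is a minimal sketch of how such markers can be filtered out. It assumes genotypes are coded as 0, 1 or 2 copies of the alternate allele (with None for a failed call); the function names and the toy data are hypothetical and are not taken from any real pipeline.

```python
def minor_allele_frequency(genotypes):
    """Return the minor allele frequency for one SNP.

    genotypes: list of 0/1/2 alternate-allele counts, None for missing calls
    (an assumed encoding, chosen only for this illustration).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def drop_monomorphic(snp_table, min_maf=0.0):
    """Keep only SNPs whose minor allele frequency is above min_maf."""
    return {
        snp: genotypes
        for snp, genotypes in snp_table.items()
        if minor_allele_frequency(genotypes) > min_maf
    }


# Toy data: four horses, three hypothetical SNPs.
toy = {
    "snp_a": [0, 0, 0, 0],      # every horse identical: monomorphic
    "snp_b": [0, 1, 2, 1],      # varies between horses: informative
    "snp_c": [2, 2, None, 2],   # fixed for the other allele: also monomorphic
}
kept = drop_monomorphic(toy)
print(sorted(kept))
```

In practice a small positive min_maf is often used instead of zero, so that SNPs that vary in only a handful of horses are dropped as well.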
Secondly, because the Thoroughbred has considerable inbreeding, the SNPs on the chip are often in high Linkage Disequilibrium with one another. Linkage Disequilibrium (LD) is the non-random association of variations in adjacent areas of the genome. That is, if a horse is A:A at one position on the genome, it will almost always be C:C at another nearby position, so knowing the first genotype tells you the second. When this occurs and the statistical link is high, using both genetic markers isn't necessary, and indeed it confuses many prediction models such as a Random Forest, so eliminating one of them and keeping the more relevant SNP is desirable. When we take this into account while studying racehorse elite performance as a trait, approximately another 200,000 SNPs become irrelevant because they are effectively copies of other SNPs, or more of the same.
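A rough sketch of LD pruning looks something like the following. It is an illustration only, not our production method: it assumes the same 0/1/2 genotype coding, uses the squared correlation between genotype counts as a stand-in for the LD statistic r², and applies an arbitrary cut-off of 0.8. Real tools usually compare SNPs only within a sliding window along the chromosome, whereas this toy version compares against every SNP already kept.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_x * var_y)


def ld_prune(snp_table, ranked_snps, max_r2=0.8):
    """Greedy pruning: walk the SNPs from most to least relevant and keep
    a SNP only if it is not in high LD with any SNP already kept."""
    kept = []
    for snp in ranked_snps:
        if all(r_squared(snp_table[snp], snp_table[k]) < max_r2 for k in kept):
            kept.append(snp)
    return kept


# Toy data: six horses, three hypothetical SNPs.
toy = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [2, 1, 0, 2, 1, 0],  # mirror image of snp_a, so r^2 = 1
    "snp_c": [0, 0, 1, 2, 2, 1],  # uncorrelated with snp_a
}
pruned = ld_prune(toy, ranked_snps=["snp_a", "snp_b", "snp_c"])
print(pruned)
```

Because the list is walked in order of relevance, the SNP that survives from each correlated group is the one that ranked highest.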
Thirdly, when you look at building a prediction model, the models that are built will be specifically interested in using the SNPs that explain the greatest difference between fast and slow horses, and pruning, or eliminating, those that don’t explain much. If you look at the 50,000 or so SNPs that have relevance to racing performance, once you get past the top 1,000 SNPs, the difference each SNP makes is really, really small: each SNP explains such a small amount of the difference between fast and slow horses that they have little additive use in a prediction model.
So, after really analyzing the data, these three factors bring us down to a little over 1,000 SNPs that actually have significant relevance to the measurement of the trait at hand. That is a long way away from the "over 670,000" that Plusvital/Equinome are touting, but it is the actual reality of using the 670K SNP chip to examine racehorse performance as a trait.
This claim of using 670,000 markers also speaks to something that is being fundamentally misunderstood by the Thoroughbred industry at large.
In prediction of outcomes your goal is clear – to gather the variables that explain the greatest difference for the trait at hand and create the best model possible to predict future outcomes using these variables. Keep that in mind....variables, or numbers, that explain the largest difference....
With this in mind, if we are looking at racetrack performance as a trait, we can take a population of elite horses and a population of non-elite horses, genotype them on the 670K SNP chip, and then rank all 50,000 or so relevant SNPs by how much of the difference between elite and non-elite horses each one explains: rank 1 explains the largest difference relative to the trait and the other variables, rank 2 the next largest, and so on. From that ranking we arrive at our list of 1,000 or so SNPs that can be used in a model.
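As a simple illustration of that ranking step, the sketch below scores each SNP by the absolute difference in allele frequency between the elite and non-elite groups and keeps the top few. The scoring statistic is an assumption made for the example (a real study would use a proper association test), and the function names and data are hypothetical.

```python
def alt_allele_freq(genotypes):
    """Frequency of the alternate allele, with genotypes coded 0/1/2."""
    return sum(genotypes) / (2 * len(genotypes))


def rank_snps(elite, non_elite, top_n=1000):
    """Rank SNPs by the gap in allele frequency between the two groups
    and return the names of the top_n SNPs, largest gap first."""
    scores = {
        snp: abs(alt_allele_freq(elite[snp]) - alt_allele_freq(non_elite[snp]))
        for snp in elite
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


# Toy data: four elite and four non-elite horses, three hypothetical SNPs.
elite = {
    "snp_a": [2, 2, 1, 2],
    "snp_b": [1, 0, 1, 1],
    "snp_c": [0, 1, 0, 1],
}
non_elite = {
    "snp_a": [0, 0, 1, 0],
    "snp_b": [1, 1, 0, 1],
    "snp_c": [1, 1, 0, 1],
}
top = rank_snps(elite, non_elite, top_n=2)
print(top)
```

Here snp_a separates the two groups strongly, snp_c only slightly, and snp_b not at all, so snp_b is the one that gets pruned.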
But if we are modeling performance then any variable, not just SNPs, could be used to explain the difference in ability. If we at Performance Genetics take all the data we collect on a yearling, run a permutation importance test, and then rank the factors in terms of their explanation of the difference between fast and slow horses, this is what we find....
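For readers unfamiliar with permutation importance, the idea is to shuffle one variable at a time and see how much worse the model's predictions get: the bigger the drop, the more the model relies on that variable. The sketch below is a bare-bones version under stated assumptions. The "model" is a made-up one-line rule standing in for a real trained model, and the feature names and numbers are invented for the example.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of horses the model classifies correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Average drop in accuracy when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild the table with only this one column scrambled.
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical stand-in model: predicts elite (1) if the horse carries
    at least one copy of the favourable allele; ignores the second column."""
    return 1 if row[0] >= 1 else 0


feature_names = ["snp_genotype", "unrelated_measurement"]
rows = [[2, 3], [1, 1], [2, 4], [1, 1], [0, 5], [0, 9], [0, 2], [0, 6]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

importances = permutation_importance(toy_model, rows, labels)
ranking = sorted(zip(feature_names, importances), key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(name, round(score, 3))
```

The same procedure works regardless of whether a column holds a SNP genotype or a physical measurement, which is why it is useful for ranking very different kinds of variables against one another.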
A genetic marker does indeed appear as the most important factor, and genetic markers are also the second and third most important, but the fourth most important is "Left Ventricular Mass by Body Weight," and the fifth is the variable "Z-Score of Bodyweight by Sex". These phenotypic measurements far outweigh most of the genetic markers that are being used by PlusVital/Equinome and help significantly to explain the largest difference between elite and non-elite runners. Additionally, by measuring the trait itself rather than the DNA that potentially codes for it, they eliminate the need to consider many of the SNPs.
The reality is that if you are trying to predict whether or not a horse is going to be an elite performer, you only need to find the most significant 50 or so variables, be they SNPs, cardiovascular and splenic measurements, or biomechanical measurements. This is where the best 50 variables are far more important than 670,000....