Selecting Champion Racehorses
There is one simple key to being able to predict future outcomes - find the variables that explain the greatest difference in what you are trying to predict.
This concept is easily misunderstood, especially as it pertains to racehorses. Having a lot of data, whether through a large number of samples, a large number of variables, or both, is useless if it doesn't help describe the difference between a fast racehorse and a slow one. Ideally you should only measure what matters.
Equally, once you have gathered variables that individually distinguish fast horses from slow ones, many of these variables can be highly correlated with one another, in which case keeping both adds little. In data science, choosing which variables to keep is known as "feature selection": selecting the variables that best explain what you are trying to predict while discarding those that don't.
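To make the idea concrete, here is a minimal sketch of correlation-based feature selection. It is an illustration only, not the procedure used in our studies: the feature names, the toy values and both thresholds (`min_relevance`, `max_redundancy`) are assumptions chosen for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column cannot explain any difference
    return cov / sqrt(var_x * var_y)

def select_features(features, target, min_relevance=0.2, max_redundancy=0.9):
    """Keep features that track the outcome, dropping near-duplicates.

    features: dict of feature name -> list of values (one per horse)
    target:   list of outcomes (1 = elite, 0 = non-elite)
    The two thresholds are illustrative assumptions, not tuned values.
    """
    # Rank features by the strength of their correlation with the outcome.
    relevance = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    ranked = sorted(relevance, key=relevance.get, reverse=True)

    kept = []
    for name in ranked:
        if relevance[name] < min_relevance:
            continue  # does not separate fast horses from slow ones
        # Skip a feature that is almost a copy of one we already kept.
        if any(abs(pearson(features[name], features[k])) > max_redundancy for k in kept):
            continue
        kept.append(name)
    return kept

# Hypothetical toy data for six horses (values invented for illustration).
horses = {
    "lv_mass": [3.1, 3.4, 3.9, 4.2, 4.4, 4.8],
    "body_weight": [410, 425, 450, 440, 470, 480],  # closely tracks lv_mass
    "coat_shade": [4, 2, 3, 2, 4, 3],               # unrelated to outcome
}
elite = [0, 0, 0, 1, 1, 1]

print(select_features(horses, elite))
```

In this toy data `body_weight` is dropped because it is almost a copy of `lv_mass`, and `coat_shade` is dropped because it has no relationship with the outcome.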
It is with this in mind that we can explain how we went about developing our selection technique, and why a combination of variables - data, genetic, cardiovascular and biomechanical - has provided us with the most accurate prediction model.
What Data Do We Collect?
When it comes to evaluating a yearling or two-year-old, we gather data from three sources: genetics, cardiovascular measurements and biomechanics.
We have completed two separate Genome Wide Association Studies comparing elite and non-elite racehorses. The first of these used 54,000 SNPs (DNA markers) and found many significant genetic markers that separate elite from non-elite racehorses, as well as markers for the optimal racing distance of a racehorse. A second Genome Wide Association Study used 670,000 SNPs to again examine elite and non-elite racehorses. This second study was structured more specifically to compare the very best horses with high-priced yearlings and two-year-olds that failed to win a race or won only mediocre races. After eliminating SNPs that were not informative, these two studies provided us with around 200 SNPs that explained the difference between fast and slow horses. These 200 SNPs were then included in the larger study that formulated our sales selection process.
The Thoroughbred racehorse has evolved to allow it to consume more oxygen per kilogram than most other large mammals. The superiority of the Thoroughbred cardiovascular system rests in a proportionately larger heart and spleen per unit body mass than other large mammals.
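As a rough illustration of what "eliminating SNPs that were not informative" can mean, the sketch below runs a simple chi-square test on allele counts for each marker and keeps those that pass a Bonferroni-corrected threshold. This is a simplified stand-in, not the analysis used in our studies: real association studies also correct for relatedness and population structure, and the SNP names and counts here are invented.

```python
from math import erfc, sqrt

def allele_chi_square(elite_alt, elite_ref, other_alt, other_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns (chi2 statistic, p-value).
    """
    table = [[elite_alt, elite_ref], [other_alt, other_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                return 0.0, 1.0  # an empty row or column gives no evidence
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

def informative_snps(counts, alpha=0.05):
    """Return SNP names whose p-value passes a Bonferroni-corrected alpha."""
    threshold = alpha / len(counts)
    kept = []
    for name, (e_alt, e_ref, o_alt, o_ref) in counts.items():
        _, p = allele_chi_square(e_alt, e_ref, o_alt, o_ref)
        if p < threshold:
            kept.append(name)
    return kept

# Hypothetical allele counts: (elite alt, elite ref, non-elite alt, non-elite ref)
counts = {
    "snp_a": (80, 20, 40, 60),  # allele frequency differs sharply
    "snp_b": (50, 50, 48, 52),  # nearly identical in both groups
}

print(informative_snps(counts))
```

With hundreds of thousands of markers the corrected threshold becomes very strict, which is one reason so few SNPs survive.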
Correlations have been found between measures of racing performance and echocardiographic measures by means of a Pearson correlation coefficient. It has also been suggested that maximal oxygen consumption and heart size are more important predictors of performance for horses that run longer distances because their energy consumption is mainly aerobic.
The trend for echocardiographic dimensions to increase with body weight was found by Slater and Herrtage (1995). Earlier studies emphasized a tight correlation between body weight (BW), body surface area (BSA) and Left Ventricular Mass measured by a means of guided M-Mode echocardiography. Other investigations have also demonstrated a genetically determined breed relationship between body weight and heart weight.
A higher correlation coefficient with performance was also found when horses were grouped on the basis of age at time of examination, sex and estimated body weight. Using this grouping, LV Mass and LVIDd were significantly correlated with measures of racing performance on first examination, and horses that were currently racing also showed correlations with mean wall thickness (MWT) in addition to those two measurements.
We record a 10-second M-Mode video of the left ventricle of a yearling or two-year-old. With over 4,000 such videos with known racing outcomes, we are then able to train a neural network using Google's AutoML Video Intelligence module to 'learn' what an elite (1) and a non-elite (0) cardio looks like and to provide a probability, a score between 0 and 1, for any new video. This probability forms a variable in our final Sales Selection algorithm.
As mentioned above, the size of the cardio is very dependent on the size of the horse. Equally, other studies have shown that variation in biomechanical parameters is an important discriminator of elite and non-elite racehorses. Size and biomechanics are, of course, dependent on what the horse is required to do, and the ideal varies with the country and surface that the horse is to race on.
Using DeepLabCut, we are able to extract biomechanical measurements from the video of a horse as it walks from left to right. These biomechanical features are used to create a video of the horse. As with the cardio video above, we are then able to train a neural network to learn the difference between elite and non-elite racehorses based on the video of their walk, and we use the probability provided by the neural network as a variable in our final Sales Selection algorithm. (See our contribution to DeepLabCut here.)
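Because heart dimensions rise with body weight, a raw LV Mass figure says little on its own. One simple way to account for size, sketched below, is to fit a straight line of LV Mass against body weight and look at how far each horse sits above or below that line. This is an assumption-laden illustration rather than the grouping method used in the studies cited, and the numbers are invented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def weight_adjusted(body_weights, lv_masses):
    """Residual LV mass: the measured value minus the value expected for that weight."""
    slope, intercept = fit_line(body_weights, lv_masses)
    return [m - (intercept + slope * w) for w, m in zip(body_weights, lv_masses)]

# Hypothetical measurements for five horses (weights in kg, LV mass in arbitrary units).
weights = [400, 420, 440, 460, 480]
lv_mass = [2.0, 2.3, 2.2, 2.5, 2.5]

residuals = weight_adjusted(weights, lv_mass)
for w, r in zip(weights, residuals):
    print(f"{w} kg: {r:+.2f}")
```

A positive residual marks a horse whose heart is larger than its body weight alone would predict, which is the comparison that matters when horses of different sizes are assessed side by side.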
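DeepLabCut reports, for each video frame, an x and y position and a likelihood for every tracked body part. The sketch below shows one kind of measurement that could be derived from such output: the angle at a joint and its range of motion over a walk. The body-part names, the likelihood cut-off and the choice of measurement are assumptions for illustration; they are not the specific features used in our model.

```python
from math import acos, degrees, hypot

def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = hypot(*v1)
    n2 = hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("joint angle is undefined for coincident points")
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against rounding just outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return degrees(acos(cos_theta))

def range_of_motion(frames, joint, min_likelihood=0.6):
    """Largest minus smallest joint angle across frames with confident tracking.

    frames: list of dicts mapping body-part name -> (x, y, likelihood)
    joint:  three body-part names; the angle is measured at the middle one
    """
    angles = []
    for frame in frames:
        points = [frame[name] for name in joint]
        if any(p[2] < min_likelihood for p in points):
            continue  # skip frames where any keypoint is poorly tracked
        angles.append(joint_angle(*(p[:2] for p in points)))
    if not angles:
        return None
    return max(angles) - min(angles)

# Hypothetical keypoints for three frames (pixel coordinates, likelihood).
frames = [
    {"shoulder": (1, 0, 0.99), "elbow": (0, 0, 0.98), "knee": (0, 1, 0.97)},
    {"shoulder": (-1, 0, 0.99), "elbow": (0, 0, 0.98), "knee": (1, 0, 0.97)},
    {"shoulder": (5, 5, 0.10), "elbow": (0, 0, 0.98), "knee": (1, 0, 0.97)},
]

print(range_of_motion(frames, ("shoulder", "elbow", "knee")))
```

The third frame is ignored because the shoulder likelihood is below the cut-off, so a single tracking error does not distort the measurement.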
The Sales Selection Algorithm
Our final Sales Selection algorithm uses Google's AutoML Tables to create the best possible model for predicting future outcomes. AutoML Tables has the advantage of carrying out feature engineering on the data (normalizing, etc.) and creating new features from it. AutoML Tables then searches through Google’s model zoo for structured data to find the best model for the data, ranging from linear/logistic regression models for simpler datasets to advanced deep, ensemble, and architecture-search methods for larger, more complex ones.
The data that we provide to create our final predictive algorithm includes not only the probability variables from the cardio and biomechanics videos but also the genetic markers associated with sprinting or distance racing, so that the model can determine whether the size, biomechanics and cardiovascular parameters fit the genotype. Often we see horses that are mismatches, or 'all dressed up with nowhere to go', where the cardiovascular type is totally different to the genotype (e.g. a sprinter genotype with a staying cardio), and the algorithm will score these types of horses poorly.
The Sales Selection algorithm is robust enough to consider many different types of horses. That said, its predictions are still heavily affected by the 'trainer effect', which is a large component of the outcome of every horse.
The Sales Selection algorithm can be used on a horse once it reaches 12 months of age, but the later the horse is measured, the more accurate the prediction.
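The sketch below shows how one row of input for a tabular model might be assembled from these pieces, including an explicit mismatch value. It is purely illustrative: the column names, the 0-to-1 sprinter/stayer scale and the mismatch formula are all assumptions, and a tool such as AutoML Tables may learn this kind of interaction without it being supplied.

```python
def build_feature_row(cardio_prob, gait_prob, genotype_stamina, cardio_stamina):
    """Assemble one horse's inputs for a tabular model.

    cardio_prob, gait_prob: probabilities (0..1) from the two video models
    genotype_stamina, cardio_stamina: hypothetical scale where 0 means a
        sprinter type and 1 means a staying type
    """
    values = {
        "cardio_prob": cardio_prob,
        "gait_prob": gait_prob,
        "genotype_stamina": genotype_stamina,
        "cardio_stamina": cardio_stamina,
    }
    for name, value in values.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    row = dict(values)
    # 0 = genotype and cardio agree, 1 = complete mismatch.
    row["type_mismatch"] = abs(genotype_stamina - cardio_stamina)
    return row

# A hypothetical sprinter genotype paired with a staying cardio.
row = build_feature_row(cardio_prob=0.8, gait_prob=0.7,
                        genotype_stamina=0.0, cardio_stamina=1.0)
print(row["type_mismatch"])
```

A horse like this can score well on each video model individually and still be a poor prospect, which is why the inputs are considered together rather than one at a time.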