Selecting Champion Racehorses
There is one simple key to predicting future outcomes: find the variables that explain the greatest difference in what you are trying to predict.
This concept is easily misunderstood when predicting outcomes, especially for racehorses. Having a lot of data, whether through many samples, many variables, or both, is useless if it doesn't help describe what separates a fast racehorse from a slow one. Ideally you should only measure what matters.
Equally, once you have gathered variables that individually distinguish fast horses from slow ones, many of them can be highly correlated with one another, in which case keeping both adds little. In data science, choosing the variables that best explain what you are trying to predict while discarding those that don't is known as "feature selection".
It is with this in mind that we can explain how we went about developing our selection technique, and why a combination of variables - data, genetic, cardiovascular and biomechanical - has provided us with the most accurate prediction model.
What Data Do We Collect?
When evaluating a yearling or two-year-old, we gather data from three sources: genetic markers, cardiovascular measurements and biomechanical measurements.
We have completed two separate Genome Wide Association Studies comparing elite and non-elite racehorses. The first used 54,000 SNPs (DNA markers) and found many significant genetic markers that separate elite from non-elite racehorses, as well as markers associated with a horse's optimal racing distance. A second Genome Wide Association Study used 670,000 SNPs to again examine elite and non-elite racehorses. This second study was structured more specifically to compare the very best horses with high-priced yearlings and two-year-olds that either failed to win a race or won only mediocre races. After eliminating SNPs that were not informative, these two studies provided us with around 200 SNPs that explained the difference between fast and slow horses. These 200 SNPs were then included in the larger study from which our sales selection process was formulated.
The Thoroughbred racehorse has evolved to consume more oxygen per kilogram than most other large mammals. The superiority of the Thoroughbred cardiovascular system rests in a heart and spleen that are proportionately larger, per unit of body mass, than those of other large mammals.
Correlations, measured with the Pearson correlation coefficient, have been found between echocardiographic measures and measures of racing performance. It has also been suggested that maximal oxygen consumption and heart size are more important predictors of performance for horses that run longer distances, because their energy consumption is mainly aerobic.
The trend for echocardiographic dimensions to increase with body weight was reported by Slater and Herrtage (1995). Earlier studies emphasized a tight correlation between body weight (BW), body surface area (BSA) and LV Mass measured by means of guided M-mode echocardiography. Other investigations have also demonstrated a genetically determined breed relationship between body weight and heart weight.
A higher correlation coefficient with performance was also found when horses were grouped on the basis of age at time of examination, sex and estimated body weight. Using this grouping, LV Mass and left ventricular internal diameter in diastole (LVIDd) were significantly correlated with measures of racing performance on first examination. In horses that were currently racing, mean wall thickness (MWT) was also correlated with performance, in addition to those two measurements.
We take an M-mode measurement of the left ventricle of each yearling or two-year-old to gather relevant data on the size and shape of its heart. These data points formed part of the larger study from which our sales selection process was developed.
As mentioned above, the size of the heart is very dependent on the size of the horse. Equally, other studies have shown that variations in biomechanical parameters are important discriminators between elite and non-elite racehorses. The ideal size and biomechanics depend, of course, on what the horse is required to do, and they vary with the country and surface on which the horse is to race.
Using advanced photogrammetry and image recognition technologies to refine the age-old art of hand-measuring horses, we are able to gather over 100 body measurements and indices on each horse.
The Sales Selection Algorithm
Our final Sales Selection algorithm uses machine learning techniques to weigh all of the variables that we have collected in each of the datasets above. The final algorithm, which is a random forest, uses the data in various ways so that prediction is optimized as far as possible for each type of racehorse.
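For readers unfamiliar with the technique, the following is a toy random forest written from scratch: each tree is grown on a bootstrap sample of horses, considers only a random subset of variables at each split, and the forest predicts by majority vote. It is a sketch of the general method only, not our production model. The three variables, their values and the "elite" / "non-elite" labels are invented, and in practice one would normally use an established library such as scikit-learn rather than hand-written trees.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Return (feature, threshold) giving the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, n_try, rng, depth=0, max_depth=4):
    """Grow one tree; a leaf is a label, a node is (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    # Only a random subset of variables is considered at each split.
    feature_ids = rng.sample(range(len(rows[0])), n_try)
    split = best_split(rows, labels, feature_ids)
    if split is None:
        return majority
    f, t = split
    l_rows, l_labels = zip(*[(r, y) for r, y in zip(rows, labels) if r[f] <= t])
    r_rows, r_labels = zip(*[(r, y) for r, y in zip(rows, labels) if r[f] > t])
    return (f, t,
            grow_tree(l_rows, l_labels, n_try, rng, depth + 1, max_depth),
            grow_tree(r_rows, r_labels, n_try, rng, depth + 1, max_depth))


def predict_tree(node, row):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw horses with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], n_try, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote across all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


# Hypothetical columns: (marker score, heart size index, stride index).
rows = [(3.2, 3.5, 3.1), (3.8, 3.0, 3.6), (3.4, 3.9, 3.3), (3.1, 3.3, 3.8),
        (1.2, 1.8, 1.5), (1.9, 1.1, 1.3), (1.5, 1.6, 1.9), (1.1, 1.4, 1.2)]
labels = ["elite"] * 4 + ["non-elite"] * 4

forest = train_forest(rows, labels)
print(predict_forest(forest, (3.6, 3.4, 3.5)))
print(predict_forest(forest, (1.0, 1.0, 1.0)))
```

Averaging many trees, each built on a different sample and a different subset of variables, is what makes the combined vote more stable than any single tree.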
As an example, the algorithm will consider the genetic markers associated with sprinting or distance racing and determine whether the size, biomechanics and cardiovascular parameters fit. Often we see horses that are mismatches, or "all dressed up with nowhere to go", where the cardiovascular type is totally different to the genotype (i.e. a sprinter genotype with a stayer's heart), and the algorithm will score these types of horses poorly.
The Sales Selection algorithm has been trained on the outcomes of nearly 10,000 racehorses, so it is robust and able to consider many different types of horses. That said, the result still depends heavily on the "trainer effect", which is a large component of the outcome of every horse.
The Sales Selection algorithm can be used on a horse once it reaches 12 months of age, but the later the horse is measured, the more accurate the prediction.