Byron Rogers
Using Vertex AI machine learning models to predict racehorse outcomes
Updated: Aug 14
This is the third blog post in a four-part series on how we used machine learning and artificial intelligence tools to find yearlings that subsequently become elite racehorses. The four blog posts are:
Building the predictive tabular models in Vertex AI
The selection process and future improvements

In the first two posts of this series we covered how we develop datasets and also how we build the base video and image models that form the basis of our predictive models. As we discussed in the last post, we use a process called model stacking to build the final models that we use at sales. Here is a recap on how that works.
Training Base Models: Several different models, known as base models, are trained on the training data. These can be different algorithms or different configurations of the same algorithm. If you recall in our previous post, for our biomechanics meta-model we trained 5 base models which were a variation of video models and image models.
Making Predictions: The base models make predictions on the data set. These predictions are then used as input features for a new meta-model. In our process, we are using the probabilities from the base models.
Building a Meta-Model: A new model, called a meta-model, is trained using the predictions from the base models as features. So, the meta-model is not trained on any of the raw data we input, rather it is trained on the probabilities from the base models.
Combining Predictions: For new, unseen data, the base models firstly make predictions, which are then fed into the meta-model. The meta-model's prediction is the final prediction of the model and the probability from that model is the one we use for selecting racehorses.
Stacking often helps in improving the predictive performance by combining the strengths of different models. As you can see in the last post, we have some of the base models that have higher accuracy than the others but importantly they have different ways of looking at the data we input. The stacking can capture complex patterns that may not be captured by individual models.
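The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than our production pipeline: the data is synthetic, there are three stand-in base models instead of five, and the meta-model is a hand-rolled logistic regression (AutoML picks its own model type). The function names are made up for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def clamp(p, lo=0.01, hi=0.99):
    return min(max(p, lo), hi)

def fake_base_probs(is_elite, rng):
    """Synthetic stand-in for base-model outputs: one strong and two weak learners."""
    strong_centre = 0.7 if is_elite else 0.3
    weak_centre = 0.55 if is_elite else 0.45
    return [
        clamp(rng.gauss(strong_centre, 0.10)),
        clamp(rng.gauss(weak_centre, 0.15)),
        clamp(rng.gauss(weak_centre, 0.15)),
    ]

def train_meta_model(rows, labels, epochs=500, lr=0.5):
    """Fit a logistic regression whose only features are base-model probabilities."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict_meta(weights, bias, base_prob_row):
    """Final stacked probability for one horse, given its base-model probabilities."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, base_prob_row)) + bias)

rng = random.Random(0)
labels = [1] * 100 + [0] * 100          # balanced: elite = 1, non-elite = 0
rows = [fake_base_probs(y == 1, rng) for y in labels]
weights, bias = train_meta_model(rows, labels)

high = predict_meta(weights, bias, [0.9, 0.7, 0.7])
low = predict_meta(weights, bias, [0.1, 0.3, 0.3])
print(f"stacked probability, strong signals: {high:.3f}")
print(f"stacked probability, weak signals:   {low:.3f}")
```

Note that the meta-model never sees a video or an image, only the three probabilities per horse.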
So let's have a look at the 5 meta-models that we use in racehorse prediction, namely:
The Biomechanics Model
The Conformation Model
The Cardio+Biomechanics Model
The DNA+Cardio+Biomechanics Model
The Optimal Distance Model
The Biomechanics Model
As we showed in the last post, the dataset that the Biomechanics model uses is the output of the base models. It looks like this:

The target variable for the Biomechanics model is the column "elite" which is either "Yes" or "No". We include Race Rating in the dataset as a weight column. The model doesn't use this as a variable to learn from, rather it is just a column to tell the model when it is training to put more importance on the higher Race Ratings for the elite horses (so it takes more notice of the data from a G1 winner over a Listed winner) and the inverse with the non-elite. The use of a weighting does have a 'tuning' effect on the final model.
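To make the idea of a weight column concrete, here is a small sketch of a weighted log loss, where each row's error is scaled by its weight. The weights below are hypothetical, and treating the weight as a multiplier on each row's loss is an assumption about how such a column is typically used, not a description of AutoML's internals.

```python
import math

def weighted_log_loss(labels, probs, weights):
    """Log loss where each row's contribution is scaled by its weight."""
    total = 0.0
    for y, p, w in zip(labels, probs, weights):
        p = min(max(p, 1e-15), 1 - 1e-15)   # avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)

# Two elite horses: the model is confident about the Listed winner (0.9)
# but lukewarm about the G1 winner (0.4).
labels = [1, 1]
probs = [0.9, 0.4]

unweighted = weighted_log_loss(labels, probs, [1.0, 1.0])
# Hypothetical weights: the G1 winner counts three times as much.
weighted = weighted_log_loss(labels, probs, [1.0, 3.0])

print(f"unweighted loss: {unweighted:.4f}")
print(f"weighted loss:   {weighted:.4f}")
```

With the weighting, the poor score on the G1 winner costs more, so training is pushed harder to get that horse right.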
In terms of training a model, we could do this all ourselves and go through the data science process completely, but to be honest, with the automated machine learning within the Vertex AI platform, and the simplicity of our dataset, it makes more sense to use the plug-and-play AutoML process. No need to try to be a hero given that it is much easier to implement AutoML within the pipeline and get it into production. AutoML does the following:
Data Input and Preprocessing: For the model build, users feed datasets into AutoML and it handles preprocessing such as feature normalization and data splitting (an 80/10/10 split, where it trains on 80% of the data, validates on 10% and holds out 10% to test on). Normalizing the features is important as we include both continuous and categorical features in some of the models.
Model Selection and Training: AutoML leverages a search algorithm over a number of different machine learning models, exploring various algorithms, architectures, and hyperparameters. It might include deep learning models, ensemble models, or more traditional ML models like decision trees. I have found that it tends to use the highly successful XGBoost most of the time.
Hyperparameter Tuning: It employs intelligent optimization techniques, like Bayesian Optimization, to find the optimal hyperparameters for the selected models. This process is done iteratively to enhance the model's performance.
Model Evaluation: The system evaluates models using a separate validation set, based on the penalization metrics that we set for the task.
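AutoML performs the 80/10/10 split for you, but the idea is simple enough to sketch. The function name and the fixed seed are assumptions made for the illustration.

```python
import random

def split_80_10_10(rows, seed=42):
    """Shuffle rows and split them into train / validation / test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10
    n_val = n // 10
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains is held out
    return train, validation, test

train, validation, test = split_80_10_10(range(1000))
print(len(train), len(validation), len(test))
```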
In terms of the penalization metrics we use to select and train the AutoML model, as we have a balanced dataset, we have two choices: maximize-au-roc or minimize-log-loss. Maximizing the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) and minimizing Log Loss (also known as Logarithmic Loss) are both common objectives in binary classification, but they optimize for different things.
Maximize AUC-ROC: If we are primarily interested in the model's ability to rank instances and discriminate between elite and non-elite, and we are not concerned with the absolute probability values (so we don't care if the best horse in the dataset ranks in the middle of all horses predicted to be elite) we use AUC-ROC.
Minimize Log Loss: If the predicted probability scores themselves are significant, and we want the model's confidence in predictions to align with actual outcomes, minimizing Log Loss is preferable.
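The difference between the two objectives shows up clearly on a toy example. Below, two hypothetical models rank six horses in exactly the same order, so their AUC-ROC is identical, but one of them is much more timid with its probabilities and is punished for it by Log Loss. A constant 0.5 predictor is included as a reference point. All numbers are invented for the illustration.

```python
import math

def log_loss(labels, probs):
    """Mean negative log-likelihood of the true labels."""
    eps = 1e-15
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

def roc_auc(labels, probs):
    """Chance that a random elite horse is scored above a random non-elite horse."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
confident = [0.90, 0.80, 0.70, 0.30, 0.20, 0.10]
timid = [0.56, 0.55, 0.54, 0.46, 0.45, 0.44]    # same ranking, squashed towards 0.5
coin_flip = [0.5] * 6

for name, probs in [("confident", confident), ("timid", timid), ("coin flip", coin_flip)]:
    print(f"{name:10s} AUC={roc_auc(labels, probs):.2f}  log loss={log_loss(labels, probs):.4f}")
```

AUC-ROC cannot tell the confident and timid models apart. Log Loss can, which is why it is the natural choice when the probability itself is going to be used.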
For us, it is more important that the probabilities align with the actual outcomes, so we are using minimize-log-loss. Thus, when AutoML is iterating through models and tuning the hyperparameters for each one it tries, it is looking for the model/hyperparameter combination with the lowest log-loss score on held-out data (the 10% validation split, with the 10% test split kept back for the reported metrics). Here is the metric data on our current Biomechanics AutoML model (as of July 30 2023).

At a threshold of 0.5 (remember all records get a score between 0 and 1, and if it is at or beyond 0.5 the horse is predicted as elite), the biomechanics model has very solid metrics. With a binary classification problem and a balanced dataset, a Log Loss value of 0.69 can be obtained by any model that always predicts a 50% probability for both classes, so a Log Loss of 0.4 is good. The Log Loss will improve over time as more horses have their measurements age into the dataset and as we create further datasets and train new AutoML models in the months to come. Here is the confusion matrix for the Biomechanics model:

From above you can see that it correctly predicts Elite='Yes' for 89% of the elite horses, and only misclassifies an elite horse as non-elite 11% of the time. This is important because, at the end of the day, people don't like missing elite horses; we all know how rare they actually are. It doesn't identify every good horse, no model can do that, but it finds a high proportion of them and has a low proportion of false positives (the 16% where True Label = No but Predicted Label = Yes). Conversely, the Biomechanics model identifies the non-elite horse correctly 84% of the time and only misclassifies it as an elite horse 16% of the time.
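For anyone who wants to see how those percentages fall out of a set of scored records, here is a small sketch. The 200 records are invented so that the rates match the figures quoted above; they are not our holdout data.

```python
def confusion_rates(labels, probs, threshold=0.5):
    """Row-normalised confusion matrix: for each true label, the share predicted Yes/No."""
    counts = {"Yes": {"Yes": 0, "No": 0}, "No": {"Yes": 0, "No": 0}}
    for true_label, p in zip(labels, probs):
        predicted = "Yes" if p >= threshold else "No"
        counts[true_label][predicted] += 1
    rates = {}
    for true_label, row in counts.items():
        total = row["Yes"] + row["No"]
        rates[true_label] = {k: (v / total if total else 0.0) for k, v in row.items()}
    return rates

# Illustrative balanced set: 100 elite and 100 non-elite records.
labels = ["Yes"] * 100 + ["No"] * 100
probs = [0.8] * 89 + [0.3] * 11 + [0.2] * 84 + [0.6] * 16

rates = confusion_rates(labels, probs)
print("True Yes ->", rates["Yes"])   # recall on elite horses and the misses
print("True No  ->", rates["No"])    # false positives and correct rejections
```

Each row sums to 100%, which is why the false positive rate has to be the complement of the 84% correct non-elite rate.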
The Vertex AI platform also provides the importance of each of the features that we used in the final model it creates. Here is the weighting of importance for that model:

As we can see, the probability that the raw biomechanics video model provides is the most predictive (strong learner) and has the highest feature attribution, but each of the base models has some effect on the performance of the biomechanics meta-model. That is the strength of stacking: the ability to use some weak learners in different situations to augment a strong learner.
So what happens when we score all the records we have, and see how their scores relate to actual performance?
Well, at this point we had to use a bit of "marketing" in order to make things a little bit more palatable for the end user. If I take a look at the distributions of all the probabilities that the final Biomechanics model has, it looks like this:

So what you can see is that there are a lot of horses with predicted probabilities between 0.2 and 0.4, and fewer at 0.8 to 1. We felt that it might be a little unpalatable for anyone that we were looking at horses for if we just provided a probability score out of the model and they saw a value between 0.0 and 0.4 (my yearling is hopeless!). Equally, the probabilities produced by the Biomechanics model can get very high (if of interest, the highest rating on this model is 0.9809992909 for what is now the Coolmore Australia stallion Best of Bordeaux). A really high probability value could give the user an inflated sense of how likely the horse is to be an elite horse. With that in mind, we decided to bin the probabilities in somewhat even bins and create a score of A+, A, B+, B, C and D.
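Binning a probability into a letter grade is a simple lookup. In the sketch below, only the A+ lower bound (0.7196485996, quoted later in this post) is a real value; every other cut point is a hypothetical placeholder, as the actual bin edges are not published here.

```python
from bisect import bisect_right

GRADES = ["D", "C", "B", "B+", "A", "A+"]

# Lower bound of each grade above D. Only the last value (A+) comes from the post;
# the other four are placeholders chosen for illustration.
LOWER_BOUNDS = [0.20, 0.35, 0.50, 0.60, 0.7196485996]

def grade(probability):
    """Map a model probability in [0, 1] to a letter grade."""
    return GRADES[bisect_right(LOWER_BOUNDS, probability)]

for p in (0.05, 0.30, 0.55, 0.7196485996, 0.9809992909):
    print(f"{p:.4f} -> {grade(p)}")
```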
If we look at the binned biomechanics probabilities for yearling colts, here is the outcome.

If we take a look at the above, we can see how the binned probabilities for the biomechanics model fit. There are some noteworthy things to discuss:
The yearling colt dataset has 7.93% Elite horses to all horses. This is slightly higher than the larger dataset which sits at ~6%, but it is because this yearling dataset is from major commercial yearling sales and also horses that have been pre-selected by us or others that have used Velox so it has a naturally higher rate of elite horses within it.
The model isn't perfect - All the categories have elite horses in them, even the "D" rated measurements. This model isn't a magic bullet or 'the answer', it is just a co-pilot to help you gravitate towards the elite horse as it's able to learn things we can't.
Horses that are rated "A+" are significantly better in terms of opportunity to become an elite runner. The A+ rated horses are only 13% of all horses but 60% of all elite horses and 35% of the horses that are rated A+ are elite runners. That strike rate of 35% is far better than any bloodstock agent could produce at a sale.
To get into the A+ rated band, the bottom bound of the probability is 0.7196485996. This is well above the 0.5 cutoff so while the goal would be to find the A+ rated horse, an A rated horse would also work. Your overall strike rate/precision would drop, as there is only 10% elite in the A rated group, but it would still be a significant improvement if you considered both groups.
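The relationship between the A+ numbers can be checked with a little arithmetic. The sketch below uses the rounded figures quoted above (7.93% base rate, A+ being 13% of horses, 35% strike rate); the function names are just for the example.

```python
def share_of_elite_captured(population_share, strike_rate, base_rate):
    """Fraction of all elite horses that sit inside a grade band."""
    return population_share * strike_rate / base_rate

def lift(strike_rate, base_rate):
    """How many times better the band's strike rate is than picking at random."""
    return strike_rate / base_rate

base_rate = 0.0793      # elite share of the yearling colt dataset
a_plus_share = 0.13     # A+ horses as a share of all horses
a_plus_strike = 0.35    # share of A+ horses that are elite

captured = share_of_elite_captured(a_plus_share, a_plus_strike, base_rate)
a_plus_lift = lift(a_plus_strike, base_rate)
print(f"share of elite horses in A+: {captured:.1%}")
print(f"lift over the base rate:     {a_plus_lift:.1f}x")
```

With these rounded inputs the A+ band holds roughly 57% of the elite horses, in line with the figure of about 60% quoted above, and its strike rate is a little over four times the base rate.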
The Cardio+Biomechanics Model
So now that you understand how the biomechanics model works, the next two models work exactly the same way, just with more variables. With the Cardio+Biomechanics model we add to the dataset three cardio variables generated from the cardio base model. Importantly, the data is only included when we have an m-mode cardiovascular measurement and a biomechanics measurement on the same date. Here are the metrics on the current model:

You can see that the log loss value is slightly better than the biomechanics model (closer to zero is better), but some of the other metrics aren't quite as good as the biomechanics model. This is because of sample size. The biomechanics model has a lot more data and so it is able to capture a lot more of the variation within the dataset. The cardio+biomechanics dataset is a bit smaller in size, so some of the variation isn't there yet, but it will come as more horses age with their measurements into the dataset.
Like the biomechanics model we can see what variables are the most important in the trained model.

So you can see that the two measures with the highest feature attribution are Cardio values. How is that possible, when in the previous post we said that the Cardio isn't that important?
Well, this is where you have to know your data and look at its strengths and weaknesses to see why a model is behaving in a certain way. Let's look at the probability data from just the cardio scan itself. Here is the yearling colts data.

Over all yearling colts, the dataset has 12.57% elite to all horses. Again, this is higher than the general population, but these colts have been scanned from someone's shortlist, so they don't represent a general population which would be closer to 6%.
Now that you know this number, you can see that horses that are rated A+, A, B+ and B are all in bins that have a percentage of elite horses above that 12.57%. That sort of tells you what the problem is: after training on the cardio video of elite and non-elite horses, a cutting edge video recognition model can't produce a distribution of probabilities that sufficiently separates out the elite from the non-elite. The top 66% of all probabilities (the groups A+, A, B+ and B) have similar percentages of elite horses in their bin. As we have said in a previous post, there are more 'good hearts' than there are good horses.
So when is the data from a cardio scan useful? Generally speaking, it is most predictive and most useful when you are looking at horses that want to run over distance, and not really that useful when you are talking about sprinters. A generalization, yes, but borne out when we look closer at the data: it's really good with colts who want to run further than 6 furlongs. Take a look at the Cardio+Biomechanics data when we just look at yearling colts.

So, as we look at the data you can quickly see:
The A+ rated yearling colts are a really small percentage of the population at just 6% of all measurements, but 93% of those end up being elite runners!
A+ rated horses have 41% of all elite horses in that group, so the cardio data for colts clearly creates a big effect. The A+ rated colts in this model are rare, but really good. Here are the top 10 (as of the date of this blog):

Legends of War (underbidder on him as a yearling) and Ubettabelieveit (passed on him as he was a weaver/box walker and we were buying for a breeze up client) are the two turf sprinters in the group that had really good scores. The rest are colts that generally want a mile.
Between them, the A+ and A rated horses are just 20% of the population, but they are 93% of all elite horses. If you really want to find an elite colt, the A and A+ rated horses on the cardio+biomechanics model are a really strong but specific group.
We cover the best use cases for the technology that we have in the next post, but if you were looking at buying a colt that wanted to get a mile or more, especially on dirt, the cardio+biomechanics model is particularly compelling.
The DNA+Cardio+Biomechanics Model
The DNA+Cardio+Biomechanics model is an extension of the Cardio+Biomechanics model but we have added some selected DNA markers to the dataset. Firstly let's take a look at the confusion matrix:

As you can see this model is deadly when it comes to predicting if a horse will be an elite runner or not, with 91% of the holdout set that were elite runners being predicted as such. It also correctly predicts the non-elite runner 74% of the time in the balanced dataset. This model is very good at confirming the elite horse.
It is interesting however when you take a look at the feature importance from the model and see how it ranks the features that are being used.

The first thing you will notice is how small an influence the DNA markers - distance1, distance2, size1, class1, class2 and class3 had on the model.
The six markers were selected from two genome wide association studies that we did back in 2011 (64,000 markers) and 2015 (620,000 markers) across a few thousand horses. The 3 class based markers were the markers that were found to be the most important in an array of over 200 markers that were found in those two studies to be statistically significant to racetrack performance. The two distance markers were also class related, but associated more with optimal distance, while the size marker had a relationship to overall horse size as well.
This model, and specifically the feature importance graph shows two things:
You only need to measure what matters. It doesn't matter what the measurement is, but if we introduce a measurement that explains the variation between fast and slow horses, better than the existing measurements, it reduces or replaces the value of the existing measurements.
The variation in elite and non-elite horses can be better explained by measurements like an m-mode measurement of the cardio and a walking video of the horse than by DNA markers. We initially fed those 200 markers into a model, along with the cardio and biomechanics data, and the model kicked out most of the DNA markers as giving very little new information relative to the information that it already had. There is some utility in looking at DNA markers, we are not saying there is none, but what you see in front of your eyes is the physical manifestation of that DNA, and what you measure in a heart scan is also a manifestation of that DNA. These represent the actual outcome of the expression of the DNA better than the DNA itself.
The Conformation Model
We started looking at developing a model to measure conformation of the horse as we were interested to understand the dimensions, bone lengths and angles that a yearling has that help determine its athletic potential.
One of the first issues that you have to overcome is that the distance from the camera to the horse itself isn't a known distance, so therefore you cannot reasonably determine the size of the horse itself. In the end, to overcome this we created a measurement tool that uses Procrustes Distance Analysis to scale the distances/lengths of bones of a horse to itself.
If you take a look at the short video above, the process to get the conformation data is:
It starts by cropping the image. To do this we used YOLOv8, a computer vision model that can find items in images. We trained a YOLOv8 model to identify the horse and crop around it into a 500x400 image. The horse is landscape in shape, slightly longer than it is tall, so the 500x400 crop generally fits the horse into the bounding box. The fact that the images are all the same size is important as we feed that image into a custom image recognition model (as per the other base models), and the probability that the custom model generates becomes an input to the conformation model.
We trained another neural network to place markers on the horse. We set out what markers we wanted to have placed on the horse and used Arthur Porto's morphometrics package Ml-Morph to place the markers on the horse. The error rate of the trained model is 2.1 pixels, so from time to time we have to make some minor adjustments when it doesn't quite place the marker where we want it. That model also retrains once a month, so it gets a little more accurate each time.
The Procrustes Distance is calculated from the x,y positions of the markers, and all the bone lengths and angles are created. It is important to understand that as we don't have a physical measured distance on the horse, the distances are relative to the horse itself, so we are unable to know if the horse is a small horse with long legs (relative to itself) or a tall horse. If you want a better description on how Procrustes Analysis works with morphometric measurements, have a look here.
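To show why scaling the horse "to itself" removes the camera-distance problem, here is a minimal sketch of the translation and scaling steps used in Procrustes analysis: centre the landmarks on their centroid and divide by the centroid size. A full Procrustes superimposition also rotates each shape onto a reference, which is left out here. The landmark coordinates and function names are hypothetical.

```python
import math

def normalize_landmarks(points):
    """Centre landmarks on their centroid and scale them to unit centroid size."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centred = [(x - cx, y - cy) for x, y in points]
    centroid_size = math.sqrt(sum(x * x + y * y for x, y in centred))
    return [(x / centroid_size, y / centroid_size) for x, y in centred]

def segment_length(points, i, j):
    """Distance between landmark i and landmark j."""
    (x1, y1), (x2, y2) = points[i], points[j]
    return math.hypot(x2 - x1, y2 - y1)

# Hypothetical pixel landmarks for one horse photographed close up...
near = [(100.0, 80.0), (180.0, 90.0), (300.0, 85.0), (320.0, 200.0), (110.0, 210.0)]
# ...and the same horse photographed from twice as far away, shifted in the frame.
far = [(x * 0.5 + 40.0, y * 0.5 + 10.0) for x, y in near]

near_rel = segment_length(normalize_landmarks(near), 2, 3)
far_rel = segment_length(normalize_landmarks(far), 2, 3)
print(f"relative length, near photo: {near_rel:.6f}")
print(f"relative length, far photo:  {far_rel:.6f}")
```

The two relative lengths agree even though the raw pixel lengths differ by a factor of two, which is also why the method cannot tell a small horse from a tall one.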
Once we had all the images and data in our database, and had over 10,000 horses with known outcomes, we did two things:
Firstly, we ran all the measurement data through a model to see what features were actually important. It turns out that a lot of the data we gathered wasn't at all related to performance (that is the way it works!). These were the conformation variables that were important (by rank):
hind_cannon - the length of the hind cannon is important to performance. There is data elsewhere to show that those with a slightly longer hind cannon relative to their hind leg have a 'kick' when racing.
hind_length_distance - this is the total length of all bones in the hindleg so it includes the hind_cannon, but also the length of the tibia, femur and ilium that make up the hind leg.
shoulder - the distance/length of the shoulder blade. I suspect because we didn't produce a distance from the withers to the girth that it is using the shoulder distance as a proxy for girth. As we have previously mentioned in an old post, girth is related to lung capacity (not a big heart as some suggest).
body_length - relative body length is calculated from the midpoint of the shoulder to the ischium (point of the buttock). Horses need to have enough body length.
neck - the length of the neck. Short necks are a negative.
forelimb - the length of the forelimb relative to the size of the horse. Those with longer forelegs relative to their size tend to be better horses.
proportion - This is a derived measurement that is the height of the horse from the point of the fetlock to the point of the withers relative to the body length of the horse. Horses that are slightly longer than they are tall, tend to be better, but there is a point where they are too long relative to their height.

scope - this is the measurement of the distance from the point of the wither to the hip relative to the distance from the point of the elbow to the stifle. Basically horses that have little scope, aren't much good.

femur - the length of the femur
humerus - the length of the humerus
hip_angle - this is the angle created by the ilium (the bone that runs from the point of the hip to the tail) and the femur. Horses with flat hip bones and forward placed femurs tend to be more gifted.
fetlock_angle - the angle of the front fetlock is important. There was a high correlation between this angle and the shoulder angle (as there should be) so the model picked one over the other.
imbalance - this is a derived measurement that is looking at the relative lengths of the ilium (hip bone) and humerus. Horses where the distances of these two bones are close, tend to be better horses.
thrust - this is a derived measurement that looks at the length of the ilium and humerus, relative to the length of the femur
forelimb_length - this is the total length of the forelimb from the fetlock through to the leg, including the humerus (from the shoulder to the elbow) and shoulder blade length. Again, longer, relative to the overall size of the horse, the better.
body_hind_ratio - this is a derived measurement that is the total length of the hind limb (fetlock, hind cannon, tibia, femur and ilium) relative to the body length of the horse.
As you can see, a lot of the data is related to the hind leg of the horse. This is where all the propulsive power comes from in the racehorse so it is unsurprising that this is the case.
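Measurements like these come straight out of the marker positions. The sketch below shows how a bone length, a joint angle and a ratio could be derived from x,y landmarks. The landmark names, coordinates and the exact form of the ratio are assumptions made for illustration, not our actual definitions.

```python
import math

def length(a, b):
    """Distance between two landmarks."""
    return math.hypot(b[0] - a[0], b[1] - a[1])

def angle_at(vertex, p, q):
    """Angle in degrees at `vertex` between the lines running to p and to q."""
    v1 = (p[0] - vertex[0], p[1] - vertex[1])
    v2 = (q[0] - vertex[0], q[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos_theta = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

# Hypothetical landmarks (already scaled relative to the horse).
landmarks = {
    "point_of_hip": (0.0, 0.0),
    "hip_joint": (3.0, 4.0),
    "stifle": (3.0, 9.0),
    "point_of_shoulder": (10.0, 4.0),
    "elbow": (10.0, 8.0),
}

ilium = length(landmarks["point_of_hip"], landmarks["hip_joint"])
femur = length(landmarks["hip_joint"], landmarks["stifle"])
humerus = length(landmarks["point_of_shoulder"], landmarks["elbow"])

# Angle between the ilium and the femur, measured at the hip joint.
hip_angle = angle_at(landmarks["hip_joint"], landmarks["point_of_hip"], landmarks["stifle"])
# One possible "imbalance"-style feature: ilium length relative to humerus length.
ilium_to_humerus = ilium / humerus

print(f"ilium={ilium:.2f} femur={femur:.2f} humerus={humerus:.2f}")
print(f"hip_angle={hip_angle:.1f} degrees, ilium/humerus={ilium_to_humerus:.2f}")
```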
After we had worked out the data that was important to conformation, we created a new model that took the probability from a trained image model on the conformation image as well as the data as described above. We built the dataset like we have done in the previous posts and ran an AutoML model to see what the outcome was.

The model is "somewhat useful". It is nowhere near as precise as the biomechanics model (there is a big difference between a ROC AUC of 0.77 and 0.91). This is understandable and probably relates to what we do as humans. We (probably subconsciously) get a lot more information about a horse watching it move and walk than we do as it stands still. From a machine learning model viewpoint, the biomechanics model gets a 10 second video shot at 30 frames a second, so it is getting 300 images of the horse on which to make its decision. The conformation model is getting one image and the measurements derived from it.

The conformation model, as you can see, is quite good at telling you if the horse is non-elite (remember the datasets for the models are balanced, so there are as many elite as non-elite) but it is only moderate at telling you if the horse is elite. This indicates to us at least that the conformation model isn't a perfect standalone model to use to select horses. It misses too many of the elite horses. It has some utility though, and in the next post we will show you how.
The Optimal Distance Model
The final model we developed was a multi-class model that predicts optimal distance. This model solely uses DNA markers to predict outcomes. We looked at adding in the biomechanics and cardio measurements but we found that it was too hard to disentangle racing class from actual distance preference, so we went back to just looking at DNA markers to build the model. Here is the confusion matrix for that model.

You can see from the data that in terms of the edges of the population, it is good at predicting outcomes. For the 5-7f horses it predicts their distance correctly 89% of the time. For the 10f+ horses it gets them right 67% of the time and when it misclassifies them, it does so in the 8-10f category. The model isn't quite able (yet) to correctly categorize the 6-8f types, placing most in the 8-10f category, while the 8-10f types are also misplaced a little too highly (mostly as 5-7f types).
The discordance in the model with the 6-8f and 8-10f types seems to be more down to absolute biomechanical measurements than to the DNA markers, which are related to muscle fiber type. We see a lot of horses that are heterozygous on the markers we use for distance prediction (so they should be 8-10f types) that are sprinting. This might be more related to their stride mechanics than to muscle fiber type.
In the above we have discussed the meta-models that we use for racehorse selection. As the data we have ages, we are seeing more and more horses be considered for inclusion in the datasets we use to retrain the models each month which in turn makes the models more stable (ratings don't move around on a horse) and the models become more and more useful.
Using the models in the best way is the subject of the next blog post. Let's dig in...
Next Post: The selection process and future improvements.