Finding the Elite Thoroughbred Racehorse using Artificial Intelligence
This is the first blog post in a four-part series on how we used machine learning and artificial intelligence tools to find yearlings that subsequently become elite racehorses. The four blog posts are:
Understanding the problem, MLOps, and building the dataset
Building the base video and image models in Vertex AI
Building the predictive tabular models in Vertex AI
The selection process and future improvements
An Artificial Intelligence Co-Pilot for Racehorse Selection
I have been working with my colleagues at Performance Genetics for nearly a decade now. We've found many elite racehorses, faced some bitter disappointments, and learned valuable lessons from the data we have collected. During that time we've explored pedigree data, biomechanics and kinematics, cardiovascular parameters, and DNA markers, continually refining our predictive approach based on the data that we have collected and the outcomes on the racetrack. The ground truth from that data is this - Most horses are predictably slow, but the truly exceptional ones stand out. Our journey has brought us closer to understanding what variables are most critical in predicting racehorse performance.
The Challenge of Selecting Yearlings
When we started to look at how to use data to better select yearlings, it took me, at least, a little while to understand what we were trying to do, or more specifically what we were trying to take advantage of. It turns out that we are trying to exploit the learning weakness that affects every person who goes to a yearling sale hoping to buy what will subsequently become an elite racehorse. Consider this:
About 3 out of every 10 yearlings anyone looks at in a sale don't get to the racetrack. Those 3 don't give you any meaningful information to form your selection process on (other than that they might have had a conformation defect that meant racing wasn't possible). As humans, we are unaware of this outcome when we observe a horse, so how can we discard the input from those unraced horses as useless? Answer - we can't.
A further 1 out of 10 yearlings don't make more than 3 lifetime starts, so they also fail to yield sufficient information on which to base a reasonable assumption of their potential ability.
So only 6 of 10 yearlings at sales will give you information to learn from. How do you know which 6 do and which 4 don’t? (the answer is, you don't know)
In addition to the imperfect samples to learn from at a sale we also have three further issues which make things even more difficult for us:
Retention of Visual Information: It's nearly impossible to remember exactly what a yearling looked like and match that to an outcome two years later, especially when evaluating thousands of horses annually. You might remember the odd one or two good ones each year, but the retention of all information is poor, especially of the examples that help you the most.
Bias in Memory: We tend to remember those true positives (horses we liked that turned out good) and discount the difference between true positives and false positives (horses we thought would be elite but were in fact slow). We also fail to remember false negatives (horses we thought would be slow but were fast). The false positives and false negatives are the ones we learn the most from but we don't recall them well enough.
Overwhelmed by Negative Cases: The truly elite racehorses are only 3-6% of all horses at a sale. Our brains are overwhelmed by the negative cases, making identification of the elite horse more difficult. Don't worry, unless they are specifically trained to find it (like credit card fraud detection) it is the same problem most machine learning models have if we have that type of imbalance in the dataset.
Given the above (and setting aside the trainer effect on outcomes) it is no wonder that those who are considered the best judges on the planet are striking at 12% accuracy!
The Role of AI in Racehorse Selection
Given these challenges, we realized that what we were building would act like a co-pilot at the yearling sales. By knowing all the subsequent racetrack outcomes as the data matured and building predictive models iteratively, the models could, if fed data consistently, overcome human limitations, namely:
The computer doesn't "forget" what any horse looked like. It has the data for both good and bad horses.
It wouldn't be confused by horses that provide no data.
It would learn from both false negatives (horses it thought were slow but were fast) and false positives (horses it thought were fast but were slow).
Building an AI/Machine Learning Application
In mid-2019, I was approached by Google to beta test what was to become Vertex AI. Ultimately, I wanted to develop an application that could:
Easily scale to handle tens of thousands of records.
Be fairly "hands-off" without requiring much manual editing.
Improve its predictive power over time as more horses that had their data collected as yearlings aged into the dataset.
I had been using Google's custom video recognition models for the year prior to that, which allowed me to train a model that predicted how "good" a cardio was (you can see this blog post back here to look at that), but the Vertex AI platform was more what I was looking for - something that could be used as the backbone for a completely managed end-to-end application.
Google subsequently launched Vertex AI in May 2021 and starting in June 2021 (thanks Covid) I did a complete re-write of the Performance Genetics platform. I started from the basics - getting the data in, training models, refining models based on which features/variables were found to be important, and expanding the dataset to be as large as I could make it.
It has been a two-year project to get what is now internally known as Velox (Latin for "swift" or "rapid") built and running at the scale and with the operation that I was after, but it is now battle ready. Much of what we discuss in these posts will encapsulate the "secret sauce" on how we developed Velox and how we now go about racehorse selection at Performance Genetics. So let's begin:
Understanding the Problem - how do we try to define and predict elite horses
To begin the process of trying to create a set of models that helped us predict elite racehorses, we had to start by defining how we wanted to approach the problem, and how we planned to overcome some of the issues that come up along the way.
In data science terms there were two options available:
Binary classification task - we could set the task up as a binary classification where the goal is to classify instances into one of two possible classes (elite or non-elite racehorses).
Regression task - we could set it up as a regression task where the goal is to predict a continuous numerical value. So, we could take a rating of a horse (Timeform, Equibase, etc) and try to build models that predict the rating that a horse could achieve.
It's crucial to identify whether the problem is a classification or regression task at the outset, as this will guide our choice of algorithms, evaluation metrics, and overall approach and outcomes. We went around this first problem a few times and each of the approaches has its positives and negatives, but in the end we settled on a binary classification. I will explain why:
We did firstly try it as a regression problem and tried to predict the exact rating of a horse (we used Timeform ratings to start with as we could get ratings in Europe and the US) but found that it was really difficult from a data science viewpoint. We ran into two major issues.
In a general commercial population of yearlings and their subsequent racetrack performance, the distribution of Timeform ratings is heavily skewed. Most of the ratings are clustered around 50-70 with only a few yearlings getting to 100+ ratings. While we tried methodologies to address the imbalance within the dataset, even after doing this the models did not perform well for those outliers because they seem to optimize for the majority distribution.
It also suffered from what is known as heteroscedasticity. The models consistently predicted more accurately (less error) for lower rated horses and less accurately for higher rated horses. I think that this comes down to the fact that when you look at a large population of horses, in performance/effort terms there isn't that much difference between a 90 rated horse that is a good allowance/handicapper, and a 100 rated horse that is a Listed level horse, but there is a big difference between that 90 rated horse and one rated 110. Also, those that are rated 110+ are different from one another (leaders/backmarkers, dirt/turf, peak as 2YO/3YO/Older) and are a small portion of the dataset, so while there are some core variables that separate out elite and non-elite horses, getting a good number of them to represent each possibility of elite performance is almost impossible.
Said more plainly, the regression models were really good at predicting the average horse correctly, but couldn't predict the elite horses as accurately as they should. I think this is a major issue that competitors in our space that are trying to predict racehorse outcomes and are using regression models have failed to overcome.
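One simple way to see the pattern described above is to group prediction errors by rating band. The sketch below is only an illustration: the function name, the 20-point band width and the toy ratings are all assumptions made for the example, not Performance Genetics data.

```python
from collections import defaultdict

def mean_abs_error_by_band(actual, predicted, band_width=20):
    """Average absolute prediction error within each band of actual ratings."""
    errors = defaultdict(list)
    for a, p in zip(actual, predicted):
        band_start = int(a // band_width) * band_width
        errors[band_start].append(abs(a - p))
    return {band: sum(v) / len(v) for band, v in sorted(errors.items())}

# Made-up ratings in which the error grows as the true rating rises.
actual = [52, 58, 65, 70, 95, 104, 112, 121]
predicted = [54, 57, 63, 68, 84, 90, 93, 97]

result = mean_abs_error_by_band(actual, predicted)
for band, mae in result.items():
    print(f"ratings {band}-{band + 19}: mean abs error {mae:.1f}")
```

If the error climbs steadily from the lowest band to the highest, as it does in this toy data, the model is fitting the bulk of the population at the expense of the horses you most care about.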
Given the above, we then developed our solution as a binary classification task, but to understand that properly, we also need to firstly discuss our own worldwide rating, how we built it and how we determined who was elite (1) and non-elite (0) in our datasets.
Creating a worldwide rating - the challenge of an international dataset.
One of the other challenges we had was finding a rating that we could use to accurately describe performance. As I said earlier we initially used Timeform ratings. The reason for this is that, generally speaking, I have found they are an accurate measure of performance and they are a metric that can be found in both Europe and the US and have a similar basis of comparison (so European Timeform ratings are roughly calibrated with US Timeform). After using Timeform for a few years it became logistically very unwieldy to use. Not because the rating was wrong, rather:
We had a large portion of the dataset racing in Australia/New Zealand/Asia that Timeform didn't cover.
As the dataset expanded it was very difficult to manually look up and insert ratings into the database. I know, I should have got an API call from Timeform to automatically update it but it was two companies (the US is different to the UK) and the issue above with lack of data in all racing jurisdictions was the primary concern.
So, eventually we had to go about creating our own rating. To do this we partnered with a database that gets worldwide racing results and started to think about how to create a rating that reflected performance in every country that races around the world.
How did we do that?
It would seem simple to use the pattern/graded stakes system as it is a worldwide structure with some boundaries to the annotations given to each race, so G1 races should be roughly comparable worldwide, but we also needed a way to rate the majority of horses that didn't race in group races. That part was more difficult as it is very hard to compare allowance level races in the US, for example, with a benchmark 64 in Australia.
While some might say it is imperfect, we settled on a prizemoney based rating as the best that was available. Generally speaking prizemoney is quite well distributed relative to racing class in each country, with better races having higher prizemoney values and lesser races having less. There are some problems with races like the Everest in Australia that skew a little, but they are small and can be overcome.
What we firstly did was to normalize prizemoney per race. So, we took the full value in prizemoney for a race and then proportionally distributed it across all runners in the race, so even if a horse was finishing 10th of 10 in a race, it got some prizemoney from that race assigned to it for that performance. This smoothed out the distribution of money somewhat across all horses that competed in races, but didn't diminish the achievement of winning the race.
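To make the idea concrete, here is a minimal sketch of spreading a purse over every runner. The post does not say what proportions were actually used, so the linear scheme below (share proportional to field size minus finishing position plus one) is purely an assumption for illustration, as are the function name and the numbers.

```python
def distribute_purse(total_purse, finishing_order):
    """Split a race's total prizemoney across all runners.

    ASSUMPTION: each runner's share is proportional to
    (field size - finishing position + 1). The real proportions
    used for the Race Rating are not given in the post.
    """
    n = len(finishing_order)
    weights = [n - i for i in range(n)]  # winner gets n, last home gets 1
    total_weight = sum(weights)
    return {
        horse: total_purse * w / total_weight
        for horse, w in zip(finishing_order, weights)
    }

# Hypothetical four-runner race worth 110,000 in total.
shares = distribute_purse(110_000, ["A", "B", "C", "D"])
for horse, amount in shares.items():
    print(horse, round(amount))
```

Under this assumed scheme the last horse home still receives something, while the winner keeps the largest share, which is the behaviour the paragraph above describes.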
After that, we then created an index. We did this by comparing:
Horses of the same sex (i.e. only males compared to males and females compared to females)
Horses of the same year of birth
Horses racing the same year
Horses racing in the same countries
So for example, if Horse A is a filly born in 2019 that raced in 2023 in England, then we get all the prizemoney earned by the same horses as Horse A (filly; born 2019; raced 2023; England) and find what the average prizemoney earned for that subset is. That value is given an index of 1.0. If Horse A has earned 3 times the average it will have an index of 3.0. If she has earned 50% of the average it will have an index of 0.5.
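The index calculation in the example above can be sketched in a few lines. The record layout and field names here are hypothetical, and the earnings are invented so that Horse A lands on 3.0 and a cohort-mate on 0.5, matching the worked example.

```python
from collections import defaultdict

def cohort_key(record):
    """Horses are compared only within the same sex, birth year, race year and country."""
    return (record["sex"], record["born"], record["raced"], record["country"])

def earnings_index(records):
    """Return {horse: earnings divided by the mean earnings of its cohort}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        totals[cohort_key(r)] += r["earnings"]
        counts[cohort_key(r)] += 1
    index = {}
    for r in records:
        mean = totals[cohort_key(r)] / counts[cohort_key(r)]
        index[r["horse"]] = r["earnings"] / mean if mean > 0 else 0.0
    return index

filly = {"sex": "F", "born": 2019, "raced": 2023, "country": "ENG"}
records = [
    {"horse": "Horse A", "earnings": 30_000, **filly},
    {"horse": "Horse B", "earnings": 5_000, **filly},
    {"horse": "Horse C", "earnings": 5_000, **filly},
    {"horse": "Horse D", "earnings": 5_000, **filly},
    {"horse": "Horse E", "earnings": 5_000, **filly},
    # A colt sits in a different cohort, so he is never compared with the fillies.
    {"horse": "Horse F", "earnings": 8_000, "sex": "M", "born": 2019,
     "raced": 2023, "country": "ENG"},
]

index = earnings_index(records)
print(index["Horse A"], index["Horse B"], index["Horse F"])
```

The cohort average here is 10,000, so Horse A indexes at 3.0 and Horse B at 0.5. The lone colt is the only member of his cohort and therefore indexes at 1.0.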
Once we have the value for each horse, we then had to weight it by the country that the horse performed in. We had to do this because, if you left it unweighted, you would see some absolutely silly results in small countries where a high class horse has astronomically high numbers compared to others, but has been beaten out of sight when it has competed in other racing jurisdictions. So a horse with a high unweighted race rating in say Hungary, starts off with a similar value to one in France, but we weight the country so France is higher than Hungary and the adjusted rating more accurately reflects the actual level of performance.
Getting this weighting right took some time as we had to study horses that had raced in multiple countries and look at their relative performance and we used some pairwise ranking techniques to make sure it was right (there is a good R package here for those that are interested). For example, when you consider all horses racing in all jurisdictions the data would suggest that the performances of horses in Argentina are roughly 60% of the value of a performance in Australia. This doesn't mean that the best horses bred in Argentina aren't good, it means that the average horse there isn't as good as the average horse in Australia. A lot of the smaller countries got a weight of zero (0) which meant that no matter how good the horse was in that country, its adjusted race rating was zero. That might seem unfair, but there was enough evidence to show that even the best horses coming out of lower tier countries to compete in other countries were getting well beaten. There were some issues with countries like South Africa/Zimbabwe and South Korea where their performances are isolated from outside competition (e.g. there is not much cross pollination of horses in other racing jurisdictions) but after producing all the data, looking at what weights were assigned to each country, looking at the distributions of adjusted race ratings and consulting people like Alan Porter, I believe we have it broadly right.
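As a rough sketch of how a country weight turns a raw index into an adjusted Race Rating: the only relationship taken from the text is that Argentina sits at roughly 60% of Australia and that some small jurisdictions are weighted at zero. Treating Australia as the 1.0 reference, the country codes and the placeholder "XX" are assumptions for illustration.

```python
# ASSUMPTION: Australia is used as the 1.0 reference point. "XX" stands in
# for any small jurisdiction that was given a weight of zero.
COUNTRY_WEIGHTS = {"AUS": 1.0, "ARG": 0.6, "XX": 0.0}

def adjusted_rating(raw_index, country, weights=COUNTRY_WEIGHTS):
    """Scale the unweighted index by the weight of the country raced in."""
    return raw_index * weights[country]

for country in ("AUS", "ARG", "XX"):
    print(country, adjusted_rating(5.0, country))
```

The same raw index of 5.0 is worth 5.0 in Australia, about 3.0 in Argentina and nothing at all in a zero-weighted country, which is why a dominant horse in a weak jurisdiction no longer floats to the top of the ratings.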
The result of all this is a Race Rating index that goes from as low as 0.01 to a little over 300 (for what it is worth, the highest rating we have at the moment for a horse we have data on is the Argentine-bred but US raced Candy Ride at 307.30) which can be found and applied for any horse racing in any country. It allows us to automate the process of assigning class to a horse without worrying where the horse ends up racing after we get data on it as a yearling.
Once we had this worldwide rating, we then went back and matched the Race Rating against Timeform Ratings and the Pattern/Graded system to verify its worth. What we found was:
A Race Rating of 3.0 on our rating is roughly equivalent to a horse that runs 100 Timeform rating - a Listed level horse.
About 4.5% of all horses sold will run a Timeform of 100 or greater. About 6.5% of all horses by the top 1% of stallions will win any stakes race (listed or better). Horses we measured as yearlings that end up with a Race Rating of 3.0 or greater are 6% of our database so it roughly aligns with Timeform (4.5%) and Foals/SW for the top 1% of all stallions (6.5%)
For horses with yearling measurements, the top 2.5% of Race Ratings in our database is a rating of 4.7 or better. That rating equates to a Group/Graded stakes winner in our database. 2.3% of offspring by the top 1% of stallions win a Graded/Group race, so if you look at the truly elite horses (the top 2.5% of all race ratings) the database is broadly reflective of a commercial population.
Building the Dataset - getting the balance right.
Once we had the rating sorted out, what we also did was create an API (a programmable interface that can ping their database for new information on command) to the database that created the rating for us so that every month, every horse in our database is firstly checked to see if it is 1250 days old (so it is 3 ½ years old) and if so it is updated with any changes to:
Its raw Race Rating (which we then convert with the country weight)
Number of Starts
Date of Last start
Average Distance Raced
Country it raced in (to lookup the country weight and convert the Race Rating)
We also use the average distance raced to form some basic distance categories (5-7f, 6-8f, 8-10f, 10f+) so we can predict optimal distance in addition to racing class. This process allows us to build a new dataset each month for any model that we are training, with an ever expanding number of horses, to retrain the models on. Depending on the time of year, in a given month we can have over 500 new records that age into the dataset by turning 1250 days old. Additionally in that month, about 10% of our database will have their starts updated to where they have at least 3 lifetime starts, so can be used in a model dataset, and a further 60% of the records will have changes to the number of starts and their Race Rating. This process starts to fulfill one desire: to have an automated system that is more 'hands off'.
The datasets change each month a fair bit giving the models more information to learn from. We build the model datasets by:
Finding records where the horse is at least 1250 days old
Finding records where the horse has had 3 or more starts
This means that any dataset that the models are asked to learn from contains only horses that have had the opportunity to have at least 3 lifetime starts and are 3 ½ years of age or older. On average it takes a horse 3 starts to break its maiden, and 7 starts to win a stakes race. In our database the average number of starts by a horse that is included in any dataset for the models to learn from is 15.21, with one record having 121 starts!
The API keeps updating the records every month until a record hasn't had a change in its Date of Last start for 1000 days (so once the horse hasn't raced for close to 3 years), at which point it stops requesting the update for that record as we presume the horse is retired (e.g. the API pinged Candy Ride, updated his records and then the next time it skipped him as it was more than 1000 days since he last started).
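The monthly rules described above boil down to a few date comparisons. The sketch below uses the three thresholds from the text (1250 days, 3 starts, 1000 days); the function names are hypothetical, and how a horse with no recorded start is handled is an assumption, since the post does not say.

```python
from datetime import date

MIN_AGE_DAYS = 1250        # roughly 3 1/2 years old
MIN_STARTS = 3             # minimum lifetime starts to enter a model dataset
RETIRED_AFTER_DAYS = 1000  # no new start for this long: presumed retired

def is_old_enough(foaled, today):
    return (today - foaled).days >= MIN_AGE_DAYS

def needs_update(foaled, last_start, today):
    """Should the monthly job request fresh race data for this record?"""
    if not is_old_enough(foaled, today):
        return False
    if last_start is None:
        # ASSUMPTION: a horse with no recorded start keeps being checked.
        return True
    return (today - last_start).days <= RETIRED_AFTER_DAYS

def in_model_dataset(foaled, starts, today):
    """Old enough and raced often enough to be a training sample."""
    return is_old_enough(foaled, today) and starts >= MIN_STARTS

today = date(2024, 1, 1)
print(needs_update(date(2020, 3, 1), date(2023, 11, 1), today))  # active horse
print(needs_update(date(2015, 1, 1), date(2020, 1, 1), today))   # long retired
print(in_model_dataset(date(2020, 3, 1), 2, today))              # too few starts
```

Keeping the thresholds as named constants makes it easy to see, and change, the rules that decide which records the models ever get to learn from.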
Once we have all the ratings for all the horses that have three or more starts, we then get down to the more important part of how we construct the dataset for the models to learn from. There were quite a few considerations to take when we construct the datasets and to be honest, a lot of this was about trial and error where we created the datasets, trained a model and then tested the datasets on unseen samples to examine its performance. There was a lot of cost involved in this part but we found there were a few things to consider:
When building a dataset, the size of the sample set plays a crucial role. Too few samples can lead to underfitting, where the model might not learn the underlying patterns of the data. Conversely, too many samples without adding new information might lead to overfitting, where the model becomes too specific to the training data and performs poorly on new, unseen data. Intuitively, a larger dataset generally provides a more comprehensive representation of the underlying distribution of the data. However, the law of diminishing returns applies; after a certain point, adding more data may not lead to significant improvements in model performance.
Binary classification problems involve predicting one of two possible classes, in our case "elite" or "non-elite". The distribution of these classes in the database can significantly influence the performance of the model. In our real-world database of all records, the prediction target - elite - is imbalanced and significantly underrepresented. If we take all data as our dataset, 95% will be "non-elite" samples and 5% "elite" samples, so it is quite easy for a model to achieve 95% accuracy by merely predicting "non-elite" all the time. It's like a Bloodstock Agent or trainer going to the sales and saying "this is a slow horse" about every horse in the sale: they are going to be right 95% of the time, but they aren't any use for buying a fast one!
We looked at different ways to tackle the two problems above and built a lot of different dataset sizes. We looked at splits where it was 60:40 non-elite:elite, 70:30 and other percentages. We also looked at oversampling the minority class by resampling the elite samples, and at synthetically creating elite samples. In the end we found the following:
As we found with the regression modeling we tried earlier, using samples of horses whose performance rating is too close to what is defined as elite is a mistake. The binary models can't find information in the shape of the data to discriminate between those that are elite, and those that are nearly elite.
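The "always say slow" strategy is easy to put into numbers. This small sketch, using a made-up sale of 100 horses with the 95:5 split mentioned above, shows why accuracy on its own is misleading and why recall on the elite class matters.

```python
def accuracy_and_recall(y_true, y_pred):
    """Overall accuracy, and recall on the positive (elite = 1) class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    accuracy = correct / len(y_true)
    recall = true_positives / positives if positives else 0.0
    return accuracy, recall

# 100 hypothetical yearlings: 5 elite, 95 non-elite.
y_true = [1] * 5 + [0] * 95
always_slow = [0] * 100  # the judge who calls every horse slow

accuracy, recall = accuracy_and_recall(y_true, always_slow)
print(f"accuracy = {accuracy:.2f}, elite recall = {recall:.2f}")
```

The judge scores 0.95 on accuracy while finding none of the five elite horses, so a model trained on the raw imbalance can look excellent on paper and still be useless at a sale.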
To capture the variation of data in the samples in the dataset that reflect the samples that are not included in the dataset the model is trained on, you need at least 750, preferably 1,000 samples. The reward for more samples caps out at about 1,500 samples as beyond that the models start to overfit the data.
It is best to get a 50:50 balance between elite and non-elite samples. The balanced dataset of the two provides a far more effective dataset to learn from and operates well on those samples not found in the dataset. So, if the max number of samples you need per class is 1,500 a dataset of 3,000 total is enough to build really good models on.
While 3,000 seems a small number, given that elite horses, those that rate 3.0 or better, are just 6% of our database, and we need at least 750 elite samples (preferably 1,500), we had to get at least 12,500 samples overall. This isn't just 12,500 samples of horses, it is 12,500 samples of horses that have had 3 or more starts, so you actually need to get ~20,000 samples in the database to have at least 750 elite samples to use in a model. There is a difference between what data you have and what data you need to collect to get the right number of samples for a dataset.
As the models never learn off the full dataset, rather they learn off the elite horses compared to the most non-elite, we use the Race Rating itself to weight each row of data so that the model treats the highest scoring elite horse a little more importantly than the lowest scoring elite horse and the data for the worst horse in the dataset is more important to learn from than those that are less bad. This weighting of the rows is important.
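The sample-size arithmetic in the list above can be checked quickly. The 6% elite rate and the minimum of 750 elite samples come from the text. The assumption that about 6 in 10 yearlings go on to give usable data is borrowed from the earlier discussion of sale populations and is used here only to show how a figure of around 20,000 can arise.

```python
import math

ELITE_PERCENT = 6        # elite horses as a share of qualifying records
MIN_ELITE_SAMPLES = 750  # minimum elite samples wanted for a dataset
USABLE_PER_TEN = 6       # ASSUMPTION: about 6 in 10 yearlings yield usable data

# Qualifying records (3+ starts) needed to contain 750 elite horses.
qualifying_needed = math.ceil(MIN_ELITE_SAMPLES * 100 / ELITE_PERCENT)

# Yearlings that must be measured to end up with that many qualifying records.
yearlings_needed = math.ceil(qualifying_needed * 10 / USABLE_PER_TEN)

print(qualifying_needed, yearlings_needed)
```

This gives 12,500 qualifying records and a little under 21,000 yearlings measured, in line with the rough figure of 20,000 quoted above.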
Building the datasets and getting the balance right took a lot of time, effort and money to get right but as you will see in later posts, it is worth it.
Machine Learning Operations - the end-to-end lifecycle
As I outlined earlier, I wanted to build something that was able to easily scale, didn't require me to do a lot of manual editing or entry, and the models automatically "improved" over time. This is where the process of MLOps comes in.
Machine Learning Operations, or MLOps, is a way to standardize and streamline the end-to-end machine learning lifecycle, from data collection and model training to model deployment, scoring current and new records, and monitoring the models to see if they are drifting or performing less optimally. The goal is to increase automation and reproducibility and minimize manual interventions.
MLOps can be thought of in stages that flow from one to another and repeat.
Stage 1 - Data entry and Validation checks
The first stage is getting the data into the database. All the data had to have a date of measure attached to it so that the models know what data was to be used in the model. Getting over 20,000 videos and images into the database and having it run the initial analysis as you will see in later posts, took a lot of time.
At this stage we also added some data monitoring to ensure that the incoming data quality remains consistent and flagged any anomalies where any of the data taken wasn't suitable for use so they were not included in any datasets and couldn't be scored by any models.
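A validation step of this kind can be as simple as a function that returns a list of problems for each incoming record. The field names ("date_of_measure", "media_ok") and the specific checks are hypothetical; the post only says that every record needs a date of measure and that unsuitable data is flagged and excluded.

```python
from datetime import date

def validate_record(record, today):
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    measured = record.get("date_of_measure")
    if measured is None:
        problems.append("missing date of measure")
    elif measured > today:
        problems.append("date of measure is in the future")
    if not record.get("media_ok", False):
        problems.append("video or image flagged as unsuitable")
    return problems

today = date(2024, 1, 1)
good = {"date_of_measure": date(2023, 9, 10), "media_ok": True}
bad = {"media_ok": False}

print(validate_record(good, today))
print(validate_record(bad, today))
```

Records that come back with any problems would be held out of every dataset and not scored until the issue is fixed.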
Stage 2 - Data Monitoring and Dataset Updating
Once the data is in there, it is monitored and updated once a month. We have set up the API call for the external database of race results to operate each month so that it picks up the record in our database and updates it with any changes that have been made from race outcomes. As a horse has more starts it is included in the datasets that we use for the models.
For right now we have it set up to do this each month but eventually we will track the performance of models in real-time, identifying drift or degradation and kick off a dataset to be created and a model retrained with triggers for when the models have degraded.
Stage 3 - Dataset creation
In later posts we will discuss the different types of models we have and the frequency at which they are retrained, but for this MLOps process, it relates to the tabular predictive models. Following the updating of the database, we run an SQL Query that creates the dataset for each of the models. The dataset selects up to 1,500 elite horses (those with a Race Rating of 3.0 or greater) and matches these with the bottom 1,500 horses by Race Rating.
Vertex AI also allows you to assign weights to each of the rows, so while the model is tasked with predicting Elite status (either 1 for elite or 0 for non-elite) the Race Rating is also included in the dataset as a weight variable that the models use to make the higher ratings for the records where they are elite more important, and the lower Race Ratings for the records that are non-elite more important.
This dataset changes every month as more horses have data that can be used. For example, once we get to 1,500 elite samples, the worst horse in that 1,500 will have a rating of 3.0, but as we get more horses that are elite, the rating of the worst elite horse will rise. In the dataset for the Biomechanics model right now, the lowest rating of an elite horse is 3.51 and over time that will get higher and higher as more horses fit into the top 1,500. The reverse also applies for the non-elite, as the "best" of the 1,500 non-elite horses in the Biomechanics dataset has a Race Rating of 0.2 (thus there is a big gap between 0.2 and 3.51). This change in the "bar" that it takes to be considered elite has a strong positive impact on the models' predictive power.
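The selection described in this stage can be sketched in plain Python rather than SQL. The 3.0 threshold and the cap of 1,500 per class come from the text. The exact weighting formula is not given in the post, so the linear rank-based weights below (2.0 for the most extreme horse in each class down to 1.0) are an assumption, as is matching the non-elite count to the elite count when fewer than 1,500 elite horses exist.

```python
ELITE_THRESHOLD = 3.0  # Race Rating that defines elite in the text

def rank_weights(n):
    """ASSUMPTION: linear weights from 2.0 (first row) down to 1.0 (last row)."""
    if n <= 1:
        return [2.0] * n
    return [2.0 - i / (n - 1) for i in range(n)]

def build_dataset(horses, max_per_class=1500):
    """horses: (name, race_rating) pairs that already meet the age and starts rules."""
    best_first = sorted(horses, key=lambda h: h[1], reverse=True)
    elite = [h for h in best_first if h[1] >= ELITE_THRESHOLD][:max_per_class]
    worst_first = [h for h in reversed(best_first) if h[1] < ELITE_THRESHOLD]
    non_elite = worst_first[:len(elite)]  # keep the classes balanced 50:50

    rows = []
    for label, group in ((1, elite), (0, non_elite)):
        for (name, rating), weight in zip(group, rank_weights(len(group))):
            rows.append({"horse": name, "race_rating": rating,
                         "elite": label, "weight": weight})
    return rows

# Toy example with a cap of 2 per class instead of 1,500.
horses = [("A", 12.0), ("B", 4.7), ("C", 3.0), ("D", 2.8),
          ("E", 0.9), ("F", 0.2), ("G", 0.05)]
rows = build_dataset(horses, max_per_class=2)
for row in rows:
    print(row)
```

In the toy run, horse C (rated exactly 3.0) drops out once two better elite horses exist, and the "nearly elite" horse D is never selected, which mirrors the rising bar and the gap between the two classes described above.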
Stage 4 - Automated Model Retraining with Versioning
This stage is where we retrain the models, but in a particular way which is known as Versioning. Vertex AI has a great process within what they call Pipelines that allows this stage and the next to occur seamlessly.
Once we have enough samples, and the datasets we have are big enough, we use the process of Versioning to iteratively learn. We do this by starting with the model from the previous month as it knows the weights to put on each of the variables given it was trained on data from the previous month. With the current month, we generate a new dataset and the previous model then starts from the knowledge it already has about the shape of the data, and makes adjustments based on the new dataset. These are generally small adjustments to changes in the shape of the dataset but it means that we tend not to have massive swings in the predicted outcomes of horses.
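The idea of starting from last month's model can be illustrated with a tiny logistic regression written from scratch. This is not Vertex AI code and not how Velox is implemented; it is a stand-in technique (warm-started gradient descent) with made-up data, meant only to show a model beginning from its previous weights and making small adjustments on a new dataset.

```python
import math

def predict(weights, features):
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, rows):
    total = 0.0
    for features, label in rows:
        p = min(max(predict(weights, features), 1e-12), 1 - 1e-12)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(rows)

def train(rows, start_weights, steps=50, learning_rate=0.1):
    """Plain gradient descent that begins from start_weights."""
    weights = list(start_weights)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for features, label in rows:
            error = predict(weights, features) - label
            for j, x in enumerate(features):
                gradient[j] += error * x
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights

# Rows are ([bias, feature], elite_label); all values are invented.
last_month = [([1.0, 0.2], 0), ([1.0, 0.4], 0), ([1.0, 1.5], 1), ([1.0, 1.8], 1)]
this_month = last_month + [([1.0, 0.9], 0), ([1.0, 1.1], 1)]

previous_model = train(last_month, [0.0, 0.0], steps=500)
updated_model = train(this_month, previous_model, steps=50)

print("loss on new data, previous model:", log_loss(previous_model, this_month))
print("loss on new data, updated model: ", log_loss(updated_model, this_month))
```

Because the update starts from weights that already describe most of the data, a short run is enough to absorb the new records.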
At this stage we also have some failsafes to make sure that the model we have just trained isn't worse than the one we had the previous month.
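A failsafe of this kind is often written as a simple comparison between the new model and the one it would replace. The sketch below assumes a single evaluation score where higher is better, measured on the same held-out records for both models; the function names, the tolerance and the example scores are all hypothetical.

```python
def should_promote(new_score, previous_score, tolerance=0.0):
    """Promote the retrained model only if it is not worse than last month's.

    ASSUMPTION: both scores use the same metric (higher is better) and
    were measured on the same held-out records.
    """
    return new_score >= previous_score - tolerance

def choose_model(new_score, previous_score, tolerance=0.0):
    """Name the model version that should serve predictions this month."""
    if should_promote(new_score, previous_score, tolerance):
        return "new"
    return "previous"

print(choose_model(new_score=0.81, previous_score=0.79))  # improved
print(choose_model(new_score=0.70, previous_score=0.79))  # regressed
```

A small tolerance can be used so that noise-level dips do not block an otherwise healthy retrain, while a real regression keeps last month's model in production.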
Stage 5 - Deployment and Rescoring
Once the models are retrained, the Pipeline automatically deploys the new models to production and rescores all the records in the database, including those that are not in the dataset. This then gives us a good guide as to the model performance as we can look at the records in the database that were not included in the dataset like those with ratings of 2.5 to 2.9 (the "almost elite") and those that have ratings close to the non-elite. These "edge cases" give us a good guide to the overall performance of the model.
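Reviewing the "almost elite" records after rescoring can be done with a short summary function. The 2.5 to 2.9 band comes from the text; the 0.5 probability cut-off, the function name and the scores below are assumptions for illustration.

```python
def edge_case_report(scored, low=2.5, high=2.9, threshold=0.5):
    """Summarise model scores for records rated just below elite.

    scored: (race_rating, predicted_elite_probability) pairs for
    records that were NOT part of the training dataset.
    """
    near_elite = [p for rating, p in scored if low <= rating <= high]
    if not near_elite:
        return None
    flagged = sum(p >= threshold for p in near_elite)
    return {
        "count": len(near_elite),
        "mean_probability": sum(near_elite) / len(near_elite),
        "share_predicted_elite": flagged / len(near_elite),
    }

# Invented scores for four out-of-dataset records.
scored = [(2.5, 0.6), (2.9, 0.4), (1.0, 0.1), (2.7, 0.8)]
report = edge_case_report(scored)
print(report)
```

Tracking these numbers month to month gives a quick read on whether the model is placing horses near the boundary sensibly, without those horses ever having been used in training.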
The newly deployed models are then also available to score any new records of yearlings we collect at the sales, and the process returns to Stage 1 above and re-iterates.
As we have now set this up we are closer to my goal of a yearling sales co-pilot that has the following features:
Faster Iteration - We have automated as many of the stages of the ML lifecycle as possible which means that there are quicker deployments and feedback loops.
Scalability - there is now easy scaling of ML workflows, accommodating larger datasets and more complex models each month.
Reproducibility & Reliability - because we are using versioning and getting the models to retrain on new data each month, this ensures that while the models may find different ways to look at the data, the results are consistent and can be replicated even as new data comes in or underlying patterns change.
So now you have the broad picture of how we went about creating the MLOps framework and the process, let's dig in a little closer and see exactly how the models were built.