Building Computer Vision Models for Racehorse Selection
Updated: Aug 14
This is the second blog post in a four-part series on how we used machine learning and artificial intelligence tools to find yearlings that subsequently become elite racehorses. The four blog posts are:
Building the base video and image models in Vertex AI
Building the predictive tabular models in Vertex AI
The selection process and future improvements
In this post we are going to get into the detail of how we used computer vision and image recognition models to create our predictive models. Before we do, however, you need to understand the concept of Model Stacking in machine learning.
Model Stacking is a machine learning technique that combines the predictions of multiple models to make a better prediction.
Imagine you have a group of bloodstock agents who are all trying to predict if a yearling is potentially an elite racehorse. Each agent will have their own unique perspective and will make their own prediction (let's say a score from 1 to 100). If you took the average of the independent predictions of all the agents, you would get a better prediction than any individual expert could make.
Model stacking works in a similar way. We train multiple models on the same dataset - some based on video recognition and some on image recognition. Each model learns different patterns in the data and makes its own prediction. Some models are very precise (they have a high strike rate for the ones they like, but miss a lot of good horses) while others have higher recall (they like more of the good horses, but like a lot more horses in general). Having variation - predictions that are different enough from each other, but useful overall - is what builds a strong model stack.
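As a toy illustration of the idea (the numbers here are made up, not from our data), the simplest possible stack just averages the probabilities coming out of a few base models:

```python
import numpy as np

# Hypothetical probabilities from three base "agents" (models) for four horses.
# Each row is a horse; each column is one model's probability of "elite".
base_probs = np.array([
    [0.80, 0.55, 0.70],   # horse 1
    [0.20, 0.35, 0.10],   # horse 2
    [0.60, 0.75, 0.40],   # horse 3
    [0.45, 0.30, 0.65],   # horse 4
])

# The simplest stack: average the base predictions per horse.
# (In our pipeline a second-stage tabular model is trained on these columns
# instead of a plain average, which is what the next post covers.)
stacked = base_probs.mean(axis=1)
print(stacked.round(3))
```

A straight average already beats most single columns; training a second-stage model on the columns lets the stack learn which base model to trust for which kind of horse.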
The diagram shows the five models that we have as the base models for our tabular Biomechanics model. Each base model is trained on the same dataset. We then score all the records in the database and create a second dataset which is just the probabilities from each of the base models. That second dataset is what the overall tabular prediction model is trained on, but let's rewind a little and I will explain the elements of the base models and how they are used.
The Biomechanics Model
The tabular Biomechanics model is our most predictive model, mainly because it is trained on the largest dataset and it uses five different base models to generate the data to make a prediction. Let's walk (no pun intended) through the process:
We start by clipping a ~10 second raw video of the horse walking left to right. It is important to get a good representation of the horse so we want to see it walk as best we can. From time to time the weather gods don't play well with us but generally speaking we work on the basis of "garbage in, garbage out" so we try to get as good a representation as possible. Often we will get multiple measures of the one horse at the one time, just to make sure.
Once we have the video, the first process that is undertaken when it is loaded into the application is what is known as Embedding Clustering. In machine learning, embeddings are learned feature representations that encode data as vectors.
To explain, let's say we have a dataset of images of flowers. We can use a deep learning model to learn the embeddings for these images. Once we have the embeddings, we can then use k-means clustering to group the images together so that all the roses are in one cluster while all the pansies are in another. It's not saying one flower is better than another, it is just getting like images into the same group. The k-means clustering algorithm finds a number of clusters such that the images within each cluster are similar to each other, but different from the images in the other clusters.
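A minimal sketch of that idea, using scikit-learn's KMeans on made-up stand-in "embeddings" (in the real pipeline the vectors come from a deep model run over the walk videos, not random numbers):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Fake embeddings: 30 eight-dimensional vectors drawn from three well-separated
# blobs, standing in for "roses", "pansies" and a third flower.
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(10, 8)),
    rng.normal(loc=3.0, scale=0.3, size=(10, 8)),
    rng.normal(loc=-3.0, scale=0.3, size=(10, 8)),
])

# k-means groups similar vectors (similar images/videos) into the same cluster.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)
```

Each blob of ten ends up with one cluster label; k-means never says which cluster is "better", only which vectors belong together.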
We use Embedding Clustering of the walk videos we input as the first process to deal with two things:
Variation of distance of the horse from the camera
Variation in light/contrast in the video
So the clustering algorithm puts each video into a cluster, depending on its similarity to other videos. At this stage a small percentage of the videos taken will be rejected. This is a quality control issue: we have identified a cluster where it is difficult for the video models to ascertain the difference between elite and non-elite horses, so we have to take the video of the horse again. If a video passes the clustering process it is sorted into one of 10 different clusters. Each cluster then has its own video and image models trained. What this means is that within each cluster we are training models on like videos, looking at the fine-grained differences between them. It is significantly more accurate to train the base models on a per-cluster basis than it is to have one general base model.
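The routing step above can be sketched like this (the cluster number chosen as the "reject" cluster and the video IDs are hypothetical, purely for illustration):

```python
# Illustrative QC/routing step: videos landing in the known "bad" cluster are
# rejected for re-filming; the rest are sent to that cluster's own base models.
REJECT_CLUSTER = 4  # hypothetical: the cluster our QC process flagged

def route(video_id: str, cluster: int) -> str:
    """Decide what happens to a walk video after embedding clustering."""
    if cluster == REJECT_CLUSTER:
        return f"{video_id}: rejected - re-film the walk"
    return f"{video_id}: score with cluster-{cluster} models"

print(route("colt_0142", 4))
print(route("filly_0077", 7))
```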
The Custom Raw Image Model
A custom video model in Vertex AI is a machine learning model that is trained on your own data to perform a specific task, such as classifying videos or detecting objects in videos. Vertex AI uses a convolutional neural network in its AutoML video model, which automatically optimizes for the task at hand and provides the best predictive model possible. There are now other vision models available to use on videos within Vertex AI, but I have found the standard CNN within AutoML sufficient for what we are doing.
For our purposes we are looking at using the custom models for a variety of tasks, but specifically their ability to separate elite and non-elite horses from walk video. Effectively we are trying to train a model to judge a 'way of going' at a yearling sale and see if it is predictive of performance later. It's no different to what anyone does at a yearling sale, watching a horse walk from left to right, but the model has the benefit of knowing what turned out to be bad as well as good.
The first probability that is scored with any new video is from the Raw video model. As I said, each cluster has its own model, so their predictive power is slightly different depending on the cluster. That said, the Raw video models for all clusters are invariably the ones with the highest predictive power.
Above is the confusion matrix from the Raw Video model for Cluster 7. As you can see, if the horse is elite (True Label = Yes), then in the holdout data it is predicted as elite (probability at 0.5 or better) 68% of the time. For the ones that are not elite (True Label = No) it is predicting them as not elite 74% of the time. So this is the first probability that we get out when a new video is put into the application. It is a value from 0 to 1 (so like 0.7544) for every video.
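To make the reading of the matrix concrete, here is a small sketch. The counts are illustrative (the post only reports the percentages), but they reproduce the 68%/74% per-class rates described above:

```python
import numpy as np

# Illustrative counts matching the reported rates for Cluster 7's raw model:
# 68% of elite predicted elite, 74% of non-elite predicted non-elite.
#                 predicted: Yes   No
cm = np.array([[68, 32],    # true label: Yes (elite)
               [26, 74]])   # true label: No  (non-elite)

elite_recall = cm[0, 0] / cm[0].sum()        # fraction of elite horses caught
non_elite_recall = cm[1, 1] / cm[1].sum()    # fraction of non-elite caught
print(elite_recall, non_elite_recall)        # 0.68 0.74
```

Reading each row as "of the horses with this true label, where did the model put them" is all that is needed to follow the matrices for the rest of the models in this post.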
The Keypoint Probability model
Back in 2019 I did a paper with the team at the Mathis Lab using their open-source package DeepLabCut. DeepLabCut is an open source toolbox that uses deep neural networks to track the posture of animals in different settings.
The toolbox is based on an animal pose estimation algorithm and can match human labeling accuracy. DeepLabCut was primarily built for neuroscience studies to understand how the brain controls movement or how animals interact socially, but for our purpose, we wanted to quantify the variation in bone lengths and angles on a horse at the walk as accurately as possible.
I did a blog post on it a few years ago which you can read here, but the basics of DeepLabCut are that it is a markerless biomechanics program that allows you to train a neural network to place landmarks on a horse. We did this and trained a model with a pixel error of about 6 pixels, meaning it was quite accurate at placing the markers. The DeepLabCut package also allows you to generate a Keypoint video output of the markers as the horse walks, like this:
Those Keypoint videos, which are representations of the markers of joint positions on the horse, can be used to train a second custom video model that produces another probability for the tabular models.
Generally speaking the Keypoint video models aren't nearly as predictive as the raw video models. That is understandable as we have stripped away a lot of the informative data in each image of the video. Here is the confusion matrix for the same Cluster 7 as the raw video above.
From the above you can see that the distribution of probabilities is totally different. At first look you would be happy to see that it is correctly putting all the non-elite (True Label = No) as predicted to be non-elite (so all the non-elite horses have probabilities less than 0.5), but you can see from the Yes group that it struggles to find the elite horses, incorrectly predicting 76% of all elite horses as non-elite.
The Keypoint models are what are known as 'weak learners' for the tabular models. Sort of like a bloodstock agent who goes to a sale and doesn't like many of the horses, misses lots of good ones, but those he does like turn out to be good, so he's somewhat useful.
The Joint Recurrence Plot model
In addition to being able to produce the Keypoint style videos, Deeplabcut also provides output of the time series data for the markers that it places.
If you can imagine that a 10 second video shot at 30 fps is just 300 images, one after another, the time series provided is the x,y position on the image of each bodypart, for every frame of the video. You could do a lot with this data (DeepLabCut has a Kinematics package that I will probably do something with over winter this year), but one of the thoughts that I had was to convert this data into an image.
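The natural shape of that data is a (frames, bodyparts, 2) array. A small sketch with made-up tracks (the bodypart list is illustrative, not our full marker set):

```python
import numpy as np

# A 10-second walk at 30 fps is ~300 frames. For each frame DeepLabCut gives
# an (x, y) pixel position per tracked bodypart, so the time series is
# naturally a (frames, bodyparts, 2) array. Bodypart names are illustrative.
bodyparts = ["nose", "eye", "withers", "elbow", "knee", "fetlock"]
n_frames = 300

rng = np.random.default_rng(1)
keypoints = rng.uniform(0, 1, size=(n_frames, len(bodyparts), 2))  # fake tracks

# Slicing out one bodypart/axis gives a univariate series, one per marker,
# which is exactly what the time-series imaging methods below consume.
nose_x = keypoints[:, bodyparts.index("nose"), 0]
print(keypoints.shape, nose_x.shape)
```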
The reason I thought of this is that at the time I felt image recognition was a more mature technology than other methods of analyzing time series data. It's a non-trivial process to change time series data into an image, but thankfully there is a package known as pyts: a Python package for time series classification.
pyts allows you to take time series data and transform it into an image, one option being a Joint Recurrence Plot. A recurrence plot is an image representing the distances between trajectories extracted from the original time series data. There is a paper here on the basis of how it all works, but they look something like this:
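To show the mechanics without pulling in the full pyts API (which provides a ready-made `JointRecurrencePlot` transformer), here is a minimal numpy sketch of the idea on toy series: a recurrence plot marks every pair of time points whose values are close, and a joint recurrence plot is the element-wise product of the per-variable plots.

```python
import numpy as np

def recurrence_plot(x: np.ndarray, eps: float) -> np.ndarray:
    """Binary matrix: 1 where two points of the series are within eps."""
    dist = np.abs(x[:, None] - x[None, :])   # all pairwise distances
    return (dist < eps).astype(int)

# Toy stand-ins for two bodypart trajectories over 120 frames.
t = np.linspace(0, 4 * np.pi, 120)
series_a = np.sin(t)
series_b = np.cos(t)

# Joint recurrence plot: both variables must recur at the same pair of times.
jrp = recurrence_plot(series_a, 0.1) * recurrence_plot(series_b, 0.1)
print(jrp.shape)  # a (120, 120) binary image, ready for an image model
```

The real pipeline feeds these plots, built from the DeepLabCut marker series, to a per-cluster Vertex AI image model.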
Similar to the Raw Video and the Keypoint Video models, we train a separate model for each cluster using the Joint Recurrence Plot. The plot is a single image, so it is quicker to train, but it is not as good a predictive model as the Raw video model. Here is the confusion matrix for Cluster 7, the same cluster as the Raw and Keypoint video models.
You can see from the above that the JRP image model is quite good at predicting if the time series data from DeepLabCut comes from an elite horse, getting it right 68% of the time, but when it comes to predicting the non-elite, it is a coin-flip, so not that useful. Again, similar to the Keypoint videos, it is a 'weak learner' that is useful, but not definitive.
The Gramian Angular Field model
The other way of looking at the time series data as an image is to create it as a Gramian Angular Field. Gramian Angular Field (GAF) imaging turns out to be one of the most popular time series imaging algorithms. First proposed by a team of faculty and students from the Department of Mathematics and Computer Science of the University of Cagliari in Italy, it was expanded in a 2015 paper entitled "Encoding Time Series as Images for Visual Inspection and Classification Using Tiled Convolutional Neural Networks."
Similar to the Joint Recurrence Plot above, the approach transforms time series into images and uses a convolutional neural network, or image recognition model, to identify visual patterns for prediction. What we are trying to do is look at the way a horse moves in a totally different way to just looking at the video. For all our bodyparts (nose, eye, withers, etc) as they are plotted in the time series data over the walk of the horse, their transformation into a Gramian Angular Field results in something like this:
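For the curious, the "summation" variant of the transform is short enough to sketch in numpy (pyts offers it as `GramianAngularField`; the toy series below stands in for a real bodypart track): rescale the series to [-1, 1], treat each value as an angle via arccos, then take the cosine of every pairwise angle sum.

```python
import numpy as np

def gramian_angular_field(x: np.ndarray) -> np.ndarray:
    """Summation GAF: rescale to [-1, 1], encode as angles, cos(phi_i + phi_j)."""
    x_scaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1   # into [-1, 1]
    phi = np.arccos(np.clip(x_scaled, -1, 1))                # polar encoding
    return np.cos(phi[:, None] + phi[None, :])               # Gramian matrix

# Toy stand-in for one DeepLabCut bodypart series over 120 frames.
t = np.linspace(0, 4 * np.pi, 120)
gaf = gramian_angular_field(np.sin(t))
print(gaf.shape)  # one (120, 120) image per bodypart series
```

Unlike the binary recurrence plot, the GAF is a dense image of values in [-1, 1], which is why the two representations can pick up different things from the same walk.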
Similar to the Joint Recurrence Plot above, we train a separate model for each cluster using the Gramian Angular Field (GAF) image as the input to train on. Here is the confusion matrix for the GAF model for Cluster 7, the same cluster as the JRP image, Raw and Keypoint video models.
What is interesting here is that the GAF model is somewhat the inverse of the JRP image model. Where the JRP was able to correctly identify the elite horses 68% of the time, the GAF model can only do so 37% of the time. But where the JRP model could only identify the non-elite 50% of the time, the GAF model can do it 61% of the time. This discordance is actually quite important, as it gets back to what we talked about earlier: if you have one agent that predicts a type of horse well, and another agent that predicts another type well, a combination of their opinions is better than either alone.
So after all this we end up with 5 variables from the models above. One variable is the Cluster value (a number from 1 to 10) while the others are the probabilities (a range from 0 to 1, with 1 meaning the model is sure the horse is elite) from the four other models detailed above. It is these 5 variables that we use as the basis of the tabular models, which we will discuss in the next blog post.
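Put together, one new walk video boils down to a single row of tabular data. The values below are hypothetical, just to show the shape of what the second-stage model sees:

```python
# Hypothetical output of the base models for one new walk video.
# These five values become one row of the tabular training data.
base_outputs = {
    "cluster": 7,          # 1-10, from embedding clustering
    "raw_video": 0.7544,   # P(elite) from the raw video model
    "keypoint": 0.3100,    # keypoint video model
    "jrp": 0.6200,         # joint recurrence plot image model
    "gaf": 0.4100,         # Gramian angular field image model
}

feature_row = [base_outputs[k]
               for k in ("cluster", "raw_video", "keypoint", "jrp", "gaf")]
print(feature_row)  # [7, 0.7544, 0.31, 0.62, 0.41]
```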
We have a further two models that we also use that are worth discussing.
The Conformation Model
In addition to the biomechanics model, we have a conformation model that uses a conformation pose of the horse.
There is a bit of a process undertaken here, which we will discuss in more depth in the next post, but one of the variables that the tabular model uses is a prediction from a trained image model. The single pose photo is cropped and resized in the application to a uniform 500x400 (horses are longer than they are tall), which means that the computer vision model is always seeing data of the same size. It's sort of the same idea as putting the videos in clusters. From there the image model is asked to learn the difference between elite and non-elite horses just by looking at conformation as the horse stands in an image.
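A rough sketch of that standardisation step, using a plain numpy centre-crop and nearest-neighbour resample as a stand-in for whatever the application actually uses (the input dimensions are hypothetical):

```python
import numpy as np

def crop_and_resize(img: np.ndarray, out_h: int = 400, out_w: int = 500) -> np.ndarray:
    """Centre-crop to the 5:4 target aspect, then nearest-neighbour resize."""
    h, w = img.shape[:2]
    # centre-crop to the target aspect ratio (width:height = 5:4)
    crop_w = min(w, int(h * out_w / out_h))
    crop_h = min(h, int(w * out_h / out_w))
    y0, x0 = (h - crop_h) // 2, (w - crop_w) // 2
    cropped = img[y0:y0 + crop_h, x0:x0 + crop_w]
    # nearest-neighbour resample to the fixed model input size
    rows = np.arange(out_h) * cropped.shape[0] // out_h
    cols = np.arange(out_w) * cropped.shape[1] // out_w
    return cropped[rows][:, cols]

photo = np.zeros((1080, 1920, 3), dtype=np.uint8)   # stand-in for a sale photo
print(crop_and_resize(photo).shape)  # (400, 500, 3)
```

The point is only that every photo reaches the model as the same 500x400 canvas, so the model learns conformation differences rather than framing differences.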
We use a larger dataset here with 3,000 images, 1,500 that are elite, and 1,500 that are non-elite, but as you can imagine, it is not a straightforward task to predict if a horse is an elite horse just by looking at a photo! That said, here is the confusion matrix:
You can see from the above that by just using an image we have a somewhat useful model. It can correctly predict the elite horse 60% of the time and correctly predict the non-elite horse 70% of the time. Similar to the biomechanics model above, we use the probability from this model in a larger tabular model which we will discuss in the next blog post.
The Cardio Model
The cardiovascular video model works the same way as the biomechanics model. I did a blog post on this a few years ago but briefly, once the video is fed into the application, the first thing it does is apply a clustering algorithm to it. This clustering algorithm works the same way as the biomechanics clustering algorithm in that, while the cardiovascular video is black and white, it is still able to separate the videos into distinct groups. Interestingly, when we looked at the race outcomes of the horses in each cluster, it became apparent that it was quite easy to label each of the clusters with a group name, which is what we did. Here is the breakdown of the names and the count of records.
As you can see, some cardio scans fail quality control because of the quality of the scan; we take those cardio scans again. Unsurprisingly there is a reasonable spread of different types of cardios. What is also interesting, as we will see later, is that the cardio type can often be a mismatch with the biomechanics and other measures we take, resulting in a poor overall score for a horse.
As per the biomechanics models, each of the clusters has their own custom video model. Here is the confusion matrix for the Bullseye Dirt Miler cluster.
As you can see from the above, the cardio models are really good at predicting if a cardio is an elite cardio, getting that right 74% of the time, but they find it difficult to predict the non-elite, getting that right only 39% of the time. I forget who said it, but this speaks to what is generally true about cardiovascular measurements in racehorses: "there are more good hearts in horses than there are good horses".
We will talk about this more in the next blog post when we discuss the tabular models, but what we will say at this point is that in our experience with cardiovascular scans, they are only useful under certain circumstances!