Over the past year or so I have been working away at improving and refining the algorithms I use to predict outcomes from yearling/two-year-old measurements. It is an ever-evolving process of incremental improvement, as it should be, but it is starting to head towards four factors that are bringing the models towards some level of completion and focus:
Trying to model every horse is a fruitless task. There are just too many ways to be a good racehorse, the positive case (fast horses) is small in number, and you only have to beat who turns up on a given day. It is much more profitable to find one type of horse and model it well.
What I can do in the field (that is, at public sales) limits the data that I can readily collect. If I were going to be completely thorough, I'd have every yearling up on a treadmill measuring muscle oxygenation, breathing, heart rate, miRNAs and the like, but none of that is possible at a sale, so I can only collect a limited number of variables to model.
In a similar vein to point 2 above, often what you collect in terms of data is useless or has very low predictive power, but you don't find that out until much later, when the horses have had a chance to truly prove themselves one way or another. Certainly some variables are really important, but at best a single variable (cardio, biomechanics, DNA) is predictive of ~20% of performance. Combining them all into a single predictive model is no doubt the best way to go about it, but even with the best machine learning methods, it comes back to point 1 above.
The small positive case, with somewhere around 3-5% of all horses being fast, makes it tough, but not impossible, to model outcomes. It also means that missing a good horse is seen by others as a complete failure of the model, because they recognize that the elite horse is a rare event. The models I am now building are far more ruthless in determining whether they like a horse or not, as they penalize false positives (horses the model liked that turned out to be slow) far more than in the past. This reduces the overall commerciality of what I do in some ways, as it means you need to test a lot more horses to find the good ones, which matches the ground truth (3-5% of horses are good).
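The idea of penalizing false positives more heavily can be sketched with class weights in an ordinary classifier. This is a minimal illustration, not my actual model: the features and outcomes below are random placeholder data, and the specific weights are assumptions for demonstration only.

```python
# Sketch: making a positive ("liked") call costlier on imbalanced data.
# Weighting the negative class higher means that mislabeling a slow horse
# as fast is penalized more, so the model only flags strong candidates.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))                # stand-in for measured variables
y = (rng.random(n) < 0.04).astype(int)     # ~4% positive case, as in the text

clf = LogisticRegression(class_weight={0: 5.0, 1: 1.0})  # hypothetical weights
clf.fit(X, y)
probs = clf.predict_proba(X)[:, 1]         # probability each horse is "fast"
```

Raising the weight on class 0 pushes predicted probabilities for marginal cases down, which is exactly the "more ruthless" behaviour described above.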
I previously posted about how I managed to train a neural network on the 10-second video loop that I take of a horse to identify elite racehorses, but as I said in that post, at that point in time there wasn't a commercial API/GUI available for training a neural network on video, so I had to create one myself. That all changed last year when Google released their AutoML Video Intelligence algorithm, first to early adopters (I was one of them) and then into Beta.
What Google has done is allow anyone to use their own dataset of videos and, as long as there are at least 100 samples of each class (in my case, at least 100 cardio videos of yearlings that turned out to be fast and 100 that turned out to be slow), train a neural network to predict the probability that any new video belongs to one of those labels. The trick, however, is to understand your dataset, and only supply data (in our case, videos) that resembles what the model will most likely see 'in the wild', so as to not bias it in any particular way. In our case this meant two things:
While I had a lot of cardio video of horses that could be labeled as 'non-elite', using it all would effectively overwhelm the other case of 'elite' horses, which was the one we wanted to predict. Through a series of trial-and-error testing, I settled on a 4 to 1 ratio between the two classes. That is, for every 100 samples of video for 'elite' horses I limited the 'non-elite' case to 400. A rate of 20% elite runners to non-elite runners obviously isn't what we would find in real life (most major yearling sales run at about 6-8% and the general population at 3%), but I found that dropping it below 10%, a 1 to 9 ratio, made for a poor predictive model, at least in the case of cardio videos. Equally, setting the ratio at 1:1 or 1:2 also resulted in poor models, as the model didn't see enough variation in the samples it was given compared to what it was finding in all the other samples I had (it wasn't testing well out-of-bag, as they say).
There is a big difference between video taken of a yearling and video taken of a 2YO or older horse. There is quite a bit of variation in the fitness of yearlings at sales, between each sale and each consignor, and this manifests itself more in the fat around the heart than anything else, but there is also a significant difference in the cardiac hypertrophy (muscle strength) that occurs when a horse goes into training. Using videos taken of horses after they have raced and retired, and expecting them to be predictive of videos taken at yearling stage, is a trap. They are significantly different.
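The 4:1 down-sampling described in the first point can be sketched as a simple sampling step before the videos are uploaded for training. The file names and pool sizes below are hypothetical, purely to show the mechanics.

```python
# Sketch: down-sampling the abundant 'non-elite' videos to a 4:1 ratio
# against the 'elite' class. All positives are kept; negatives are sampled.
import random

random.seed(42)
elite = [f"elite_{i:03d}.mp4" for i in range(100)]           # hypothetical files
non_elite = [f"non_elite_{i:04d}.mp4" for i in range(3000)]  # large pool

RATIO = 4  # 4 non-elite videos per elite video, per the text
sampled_non_elite = random.sample(non_elite, RATIO * len(elite))

dataset = [(v, "elite") for v in elite] + \
          [(v, "non-elite") for v in sampled_non_elite]
random.shuffle(dataset)   # 500 videos total, 20% elite
```

The un-sampled negatives aren't wasted: they can still be used later to sanity-check the trained model against the real-world class balance.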
What this meant is that I have effectively created separate models for horses at different ages. Thankfully I had enough video data to do this. Another great thing about Google: even in Beta testing as it is now, they do a lot of the grunt work for you. Once you feed them the videos, they create a training set (80% of the data) for the model to train/learn on, a validation set (10%) for it to refine itself on, and a testing/holdout set (10%) of data it doesn't see during training or validation, which can be used to test the trained model. From my yearling video model, here is the output in terms of the holdout set and how the trained neural network performed.
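The 80/10/10 split that AutoML performs automatically is equivalent to something like the following, shown by hand on a hypothetical list of video IDs:

```python
# Sketch of an 80/10/10 train/validation/holdout split.
import random

random.seed(7)
videos = [f"video_{i:04d}" for i in range(500)]  # hypothetical IDs
random.shuffle(videos)

n = len(videos)
train = videos[: int(0.8 * n)]                       # 80%: model learns on these
validation = videos[int(0.8 * n): int(0.9 * n)]      # 10%: tuning/refinement
holdout = videos[int(0.9 * n):]                      # 10%: never seen in training
```

The key property is that the holdout videos are completely unseen, so performance on them is an honest estimate of how the model will behave on new horses.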
So what does this mean?
Firstly, a model with an average precision above 0.8 is a reasonably good model. A model with an average precision of 0.5 would be a coin-flip, so essentially useless, while a model with 1.0 would perfectly predict the test dataset (and be a unicorn, as it doesn't exist!). If you go back to my previous post on my hand-crafted network, it achieved an average precision of 0.742, so it is clear that Google has it over what I could come up with (in my defense, that was a year and a half ago and a lot has changed since then!).
For every prediction, Google supplies a JSON file with the model's probability for the video, ranging from 0 (predicted as non-elite) to 1 (predicted as elite), so the value falls somewhere in between. The image below is that of the "2017Retraceable", who I bid on as a yearling and who is now known as Another Miracle - Stakes winner and 3rd in the Breeders' Cup Juvenile Turf Sprint. The probability of his cardio video being that of an elite horse was 0.627 - closer to 1 than 0.
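Pulling that probability out of a prediction payload is a one-liner. Note that the field names below are illustrative; the real AutoML Video response schema may differ from this sketch.

```python
# Sketch: reading a classification probability out of a prediction payload.
# The "annotations"/"label"/"confidence" structure is an assumption,
# not the documented AutoML Video response format.
import json

payload = json.loads("""
{
  "annotations": [
    {"label": "elite", "confidence": 0.627},
    {"label": "non-elite", "confidence": 0.373}
  ]
}
""")

prob_elite = next(a["confidence"] for a in payload["annotations"]
                  if a["label"] == "elite")
```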
The confusion matrix generated from the samples in the holdout set gives you more of the 'nuts and bolts' of where the prediction model is strong and where it isn't. In summary, the trained Google video model is very sure about the horses whose videos were labeled as non-elite; it's less sure about the ones labeled as elite.
In the holdout set it correctly predicts a non-elite video 96% of the time, yet it correctly predicts an elite video only 28.5% of the time. Why would that be? Two reasons:
Sample size, or more specifically sample distribution. The weighting of elite vs non-elite samples affects how easy it is for the model to recognize elite samples. There is still some effect of the smaller number of elite samples being overwhelmed by the non-elite samples but, as I said earlier, this is a necessary evil to give the model enough samples overall to understand the variation that occurs across all cardios.
The inescapable truth - cardios are only mildly/moderately predictive. Paddy Miller from EQB once stated in a BloodHorse article that there "are more good hearts than good horses", which captures what the data above is showing us. What it tells me is that there is often little difference between the size/shape of a cardio found equally in fast and slow horses, but there is a sub-set of hearts that is almost exclusively found in good horses (which is why the model is finding that heart in only 4% of slow horses).
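The two per-class figures quoted above are just recall computed per row of the confusion matrix. The counts below are illustrative, chosen only so that the rates match the ones in the text (96% of non-elite correct, ~28.5% of elite correct); they are not the actual holdout counts.

```python
# Sketch: per-class recall from a confusion matrix.
# Keys are (actual class, predicted class); values are hypothetical counts.
confusion = {
    ("non-elite", "non-elite"): 48,
    ("non-elite", "elite"): 2,      # the rare "elite-looking" slow horses
    ("elite", "elite"): 2,
    ("elite", "non-elite"): 5,      # missed elite horses
}

def recall(cls):
    correct = confusion[(cls, cls)]
    total = sum(v for (actual, _), v in confusion.items() if actual == cls)
    return correct / total
```

With these counts, `recall("non-elite")` is 0.96 and `recall("elite")` is about 0.286, mirroring the asymmetry described above.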
It seems very intuitive to let computers, which are 'agnostic' with data, tell us what they see, rather than relying on humans, who are inherently biased and see things that don't stand up to scrutiny. After all, if we set the data up correctly, we overcome the two major issues that face all agents/trainers/owners:
We have a small positive case, so as humans we have to see a lot of horses to see enough of the good ones; but if we have collected enough data, computers are far better at balancing the observations than humans are.
We have a long lead time - it's three years from the time you see a horse as a yearling until you know for sure whether it is good. For humans, that time period can blur learning (we tend to remember the true positives - horses we liked as yearlings that turned out to be good - and forget the false positives - horses we liked that turned out to be slow), whereas a computer's learning isn't filtered by time and carries no such bias.
There is some danger to be found in this model, of course. Buyers hate it if you get a horse 'wrong' that turns out to be a good runner, mainly because they know how rare such horses are. So the fact that the best model I could create still got 70% of the elite horses in the holdout set 'wrong' tells me that relying on this model as the single arbiter of performance potential isn't the way to go. Using the probability supplied by this model as a single variable and integrating it with other data - biomechanics and DNA markers - into a more descriptive model will provide better predictive power.
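That integration step can be sketched as a simple second-stage model that takes the video probability as one feature among several. Everything here is placeholder: the feature names, the random data, and the choice of logistic regression are assumptions for illustration, not the actual combined model.

```python
# Sketch: using the video model's probability as one input variable
# alongside other (hypothetical) measurements, rather than treating
# the video score as the sole arbiter.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
video_prob = rng.random(n)                 # output of the cardio-video model
biomech = rng.normal(size=n)               # stand-in biomechanics score
dna = rng.integers(0, 3, size=n)           # stand-in DNA marker count
X = np.column_stack([video_prob, biomech, dna])
y = (rng.random(n) < 0.05).astype(int)     # placeholder outcomes

combined = LogisticRegression().fit(X, y)  # second-stage model over all inputs
```

The appeal of this design is that the second-stage model learns how much weight the video score deserves relative to the other variables, instead of that weight being decided by hand.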
On a side note: if you are speaking to anyone who says they are able to predict racetrack outcomes for yearlings using data, and they cannot show you something like this, they are probably telling you a load of rubbish.