Comedians, Death and racehorse selection
On Netflix there’s a really good episode of Comedians in Cars Getting Coffee where Jerry Seinfeld talks with Eddie Murphy. Overall it’s a brilliant episode (those two are my favorite comedians), but one part of their conversation really resonated with me.
When talking about comedians in general and how he got to where he is, Eddie comments to Jerry that to get really good at anything you have to put time into it, to make it worthwhile. Jerry then refined that thought with the comment that a lack of specific focus is why we don’t find greatness, to which Eddie agreed, adding that finding a niche, really focusing on it, and making it yours is actually an art in itself.
That’s sort of the view I’ve now come to in terms of selecting yearlings. It's impossible to select all types of horses with any real accuracy. Anyone who is seriously collecting good data on yearlings and tracking their subsequent outcomes understands that the variation among horses that excel on different surfaces, over different distances and at different stages of their careers is too great to model so well that all of the 'good horses' are captured. As the comedians suggest, you really must specialize in one particular type of horse and focus on getting really good at making finding them a repeatable event. I like turf sprinters. Horses like Dayjur, Lochsong, Schillaci and Placid Ark captivated me more than others as a younger man (many moons ago!), so that's the niche that I want to excel at (each to their own, as they say).
Making the selection of these types of horses a repeatable event at yearling sales, and getting any prediction model to work, brings me to my second analogy, which is about death.
Let's say you walk into the doctor's office and she says to you, "I've got bad news. You have to have an operation. It is life or death, so if you don't have the surgery you will die." That's not something you want to hear. She then goes on: "It's a difficult surgery. Your options are to have the surgeon do it, who studied for 5 years, has been doing this surgery for 10 years and has a 60% success rate, or you can choose the robot, which was trained on 10 years' worth of data to learn how to do the surgery, has been doing the surgery for 5 years and has a 90% success rate."
Most people choose the robot. It's the correct answer, but the question that needs to be asked before the robot operates on you is this: "Has it seen enough patients and data from people that are like me?"
That is the problem with all machine learning and A.I. It is built on generalizations and relies on the dataset that the model is trained on being reflective of the dataset that it will be asked to make predictions on. You are not going to let the robot do the surgery if it's never seen a person remotely similar to you, as it won't have seen data like yours before and won't react accurately to it. Equally, building a dataset full of horses that you are not going to see again in the future, or that aren't the group you are trying to predict, isn't that helpful, as the predictions it produces won't be relevant to the population you are making future predictions on.
This is exacerbated by two other factors that come into play when selecting yearlings. Firstly, from a data science perspective, fast horses are what we want, but the positive class is really rare: about three percent of all horses are fast, and if we’re interested in just a subset of these good horses, say top class turf sprinters like I am, that could be about half a percent of all horses born. This means that at a Keeneland September sale, for example (a good broad commercial population), with some 4000 yearlings, there are only about 20 yearlings (0.5% of the total) that will be what I am looking for in terms of an elite turf sprinter. There will be other good horses in the sale, but just 20 of them will be good turf sprinters. That’s an awfully small number to find, and more importantly an awfully small number to learn from.
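The arithmetic behind those numbers can be sketched in a few lines of Python. The rates are the rough estimates quoted above rather than measured values, and the function and variable names are made up for illustration.

```python
def expected_positives(population: int, base_rate: float) -> float:
    """Expected number of positive cases in a population at a given base rate."""
    if not 0.0 <= base_rate <= 1.0:
        raise ValueError("base_rate must be between 0 and 1")
    return population * base_rate


# Figures quoted in the text; they are rough estimates, not measured values.
SALE_SIZE = 4000                # yearlings at the sale
FAST_RATE = 0.03                # about three percent of horses are fast
ELITE_TURF_SPRINT_RATE = 0.005  # about half a percent are elite turf sprinters

fast_horses = expected_positives(SALE_SIZE, FAST_RATE)
elite_turf_sprinters = expected_positives(SALE_SIZE, ELITE_TURF_SPRINT_RATE)

print(f"Expected fast horses: {fast_horses:.0f}")
print(f"Expected elite turf sprinters: {elite_turf_sprinters:.0f}")
print(f"Yearlings that are not the target: {SALE_SIZE - elite_turf_sprinters:.0f}")
```

Put this way, the sale holds roughly 199 horses that are not the target for every one that is, which is the imbalance any model has to learn through.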
There are ways around this first point when it comes to data. Oversampling and other data techniques can produce a large enough sample of the positive case (fast horses) to allow models to be built, but it does highlight why measuring what matters and building a good database of information is so important. It is also why specialization within that data is important. In a certain way it does help to have a lot of different data in there: elite horses share a lot of common features, and a broad dataset also shows what sets each type apart. Knowing what a high class ten-furlong horse is also tells you what a high class sprinter is not. However, when a lot of the variables are effectively the same across the group (say they are all C:C sprinters genetically), it is more important to know the weight of the other variables that separate the slow sprinter from the elite.
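As a rough illustration of the oversampling idea, here is a minimal sketch of random oversampling using only Python's standard library. It is one simple variant of the technique, not necessarily the one used in practice here. The row layout, the "fast" label key and the function name are all assumptions made for the example.

```python
import random


def oversample_positives(rows, label_key="fast", seed=0):
    """Return a class-balanced copy of rows by resampling positives with replacement."""
    rng = random.Random(seed)
    positives = [row for row in rows if row[label_key]]
    negatives = [row for row in rows if not row[label_key]]
    if not positives or len(positives) >= len(negatives):
        return list(rows)  # nothing to resample, or positives are not the minority
    # Draw duplicate positives until both classes are the same size.
    extra = rng.choices(positives, k=len(negatives) - len(positives))
    balanced = list(rows) + extra
    rng.shuffle(balanced)
    return balanced


# Hypothetical training set: 1000 horses, 5 of them (0.5%) labeled fast.
training_rows = [{"horse_id": i, "fast": i < 5} for i in range(1000)]
balanced_rows = oversample_positives(training_rows)

n_fast = sum(row["fast"] for row in balanced_rows)
n_slow = len(balanced_rows) - n_fast
print(f"fast rows: {n_fast}, slow rows: {n_slow}")
```

Oversampling only repeats the positives that already exist, so it adds no new information about fast horses. It should also be applied to the training split only, so that duplicated horses never end up in the data used to evaluate the model.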
Which gets to the second point. It’s two years from the time someone looks at a yearling to when that same person learns if it was any good or not. As good as we are, the human mind cannot retain and learn from the exact mechanics of every single horse we look at, and we certainly don’t know the outcomes on the track of all the horses we look at. So as humans we generalize from just the true positives, horses that we thought were fast that turned out to be fast, and try to learn from that. When we generalize we make errors, which is why a trainer or agent who finds that ten percent of the horses he or she likes turn out to be good horses is considered successful. From a data science perspective, ten percent precision or “strike rate” is awful. While a well-built model will also generalize, it is at a massive advantage: the data was collected years prior, and if the outcomes are properly labeled it doesn't 'forget' what the slow horse looked like. Rather, it weighs all the variables that determine the difference between fast and slow horses.
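For anyone unfamiliar with the term, precision is simply the share of selected horses that turn out to be good. A minimal sketch follows, with hypothetical counts chosen only to reproduce the ten percent figure.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of selected horses that turned out to be good ("strike rate")."""
    selected = true_positives + false_positives
    if selected == 0:
        raise ValueError("no horses were selected")
    return true_positives / selected


# Hypothetical counts: an agent likes 50 yearlings and 5 turn out to be good.
liked = 50
turned_out_good = 5

strike_rate = precision(turned_out_good, liked - turned_out_good)
print(f"Strike rate: {strike_rate:.0%}")
print(f"Liked horses that did not work out: {liked - turned_out_good}")
```

Note that precision says nothing about the good horses that were passed over. Counting those requires knowing the outcome of every horse that was looked at, which is exactly the record a human selector rarely keeps.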
Generally speaking, individual racehorses are at their core unsolvable riddles, but in aggregate they become more of a mathematical certainty. I can't foretell with complete precision exactly what one horse that I buy will become, and nobody can, but with specialization and good data I can say what a group of them will be capable of on average. The individuals might vary, but with good data and a well-trained model that has seen horses like the ones I am predicting on, the percentages over a population remain stable and predictable.
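The point about individuals versus aggregates can be illustrated with a small simulation. This is a sketch under two assumptions that are not data from this post: every selected horse has the same independent chance of being good, and that chance is a hypothetical 20%.

```python
import math
import random


def hit_rate_standard_error(p: float, n: int) -> float:
    """Standard error of the observed hit rate over n independent selections."""
    return math.sqrt(p * (1.0 - p) / n)


def simulate_hit_rate(p: float, n: int, seed: int = 0) -> float:
    """Observed share of good horses among n simulated selections."""
    rng = random.Random(seed)
    hits = sum(rng.random() < p for _ in range(n))
    return hits / n


P_GOOD = 0.20  # hypothetical per-horse probability, not a figure from the text

for n in (1, 10, 100, 1000, 10000):
    observed = simulate_hit_rate(P_GOOD, n)
    spread = hit_rate_standard_error(P_GOOD, n)
    print(f"n={n:>5}  observed hit rate={observed:.3f}  standard error={spread:.3f}")
```

With one horse the outcome is all or nothing, while over hundreds of selections the observed rate settles close to the underlying probability. That only holds if the model's probabilities are trustworthy for the horses being bought, which is the robot surgeon question again.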