A client asked me a question the other day that really struck at the heart of what we are about here at Performance Genetics. After I explained how we look at horses and assess their chances of becoming a graded stakes winner, the client commented, "ohh, so you are really about talent identification rather than classification..."
I stopped for a minute and thought about what he was saying. Are we more interested in correctly classifying each horse, or are we more interested in identifying the talented ones?
I am pretty sure we are not interested in perfectly classifying every horse. We have seen enough Belmont Stakes winners whose genotype suggests they should be sprinters to know that, while we try to suggest where a horse's best area of performance lies, there is no black-and-white answer here. Trainers heavily influence the outcome. Some can make horses go further than they are genetically made for, and others can take a horse with the genotype of a mile-and-a-half runner and make him a come-from-behind sprinter. Equally, while we use the best tools available to us, we are in the business of prediction, and the variables we cannot control (trainers, vets, jockeys, racing conditions, etc.) can turn horses into things they don't look genetically capable of being.
In statistical terms there are two concepts to consider. Sensitivity measures the proportion of actual positives that are correctly identified as such (e.g. the percentage of elite horses who are correctly identified as elite out of all the elite horses in the group). Specificity measures the proportion of actual negatives that are correctly identified (e.g. the percentage of non-elite horses who are correctly identified as non-elite out of all the non-elite horses in the group).
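To make the two definitions concrete, here is a minimal sketch in Python. The counts are made up purely for illustration and are not our results; the function and variable names are my own.

```python
def sensitivity(true_pos, false_neg):
    """Share of actually elite horses that were flagged as elite."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of actually non-elite horses that were flagged as non-elite."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for a group of 100 horses (illustration only):
# 10 turned out elite, 90 did not.
true_pos, false_neg = 8, 2    # elite horses we flagged / missed
true_neg, false_pos = 70, 20  # non-elite horses we passed on / wrongly flagged

print(f"sensitivity: {sensitivity(true_pos, false_neg):.2f}")  # 0.80
print(f"specificity: {specificity(true_neg, false_pos):.2f}")  # 0.78
```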
In theory it is possible to score almost 100% on both measures, but anyone who works with prediction knows that there is a trade-off between the two. For us, sensitivity is the most important. We judge our performance on the number of elite horses that we correctly identify as elite racehorses. Are we going to miss some? Absolutely. But in the end, when it comes to talent identification at yearling and two-year-old sales, it comes down to sinking shots on goal. For example, if you are going to buy 5 horses at the two-year-old sales next year, sensitivity is what you are really after with any prediction service. From those 5 selections, you are looking for a high proportion to be correctly identified as elite. You are not that interested in horses that we correctly identify as non-elite, as you are not going to buy them. What you want to know is, of the horses that we select as 'potentially elite', how many become elite. (Strictly speaking, statisticians call that last figure the positive predictive value, or precision; sensitivity is the share of all the elite horses that we manage to flag.)
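The trade-off can be sketched with a simple score cut-off. Everything below is hypothetical: the ten scores and outcomes are invented for illustration, and `rates` is just a helper name I have made up. It also reports precision, the share of selected horses that turn out elite, since that is the figure a buyer of a handful of horses feels most directly.

```python
# (score, became_elite) pairs: hypothetical data, for illustration only.
horses = [(95, True), (90, True), (85, False), (80, True), (70, False),
          (60, True), (50, False), (40, False), (30, False), (20, False)]


def rates(threshold):
    """Select every horse scoring at or above the threshold and measure the result."""
    tp = sum(1 for score, elite in horses if score >= threshold and elite)
    fp = sum(1 for score, elite in horses if score >= threshold and not elite)
    fn = sum(1 for score, elite in horses if score < threshold and elite)
    tn = sum(1 for score, elite in horses if score < threshold and not elite)
    return {
        "sensitivity": tp / (tp + fn),  # elite horses found / all elite horses
        "specificity": tn / (tn + fp),  # non-elite rejected / all non-elite
        "precision": tp / (tp + fp),    # elite among the horses we selected
    }


# A low bar catches every elite horse but lets more ordinary ones through;
# a high bar does the opposite.
for threshold in (55, 88):
    print(threshold, rates(threshold))
```

With these invented numbers, the cut-off of 55 finds all four elite horses but also selects two ordinary ones, while the cut-off of 88 selects only elite horses but misses half of them. Note that the helper would divide by zero if the cut-off were so high that no horse was selected.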
As with any prediction model, some portions of ours are more predictive than others. We are now getting towards 1000 horses sequenced (take a look at a sample report here) and we are starting to see three rather interesting patterns emerge.
1) Colts are slightly easier to predict than fillies - this is an interesting observation. Ideally there would be no sex bias in the predictive model, but it seems that the model we have developed makes it slightly (and ever so slightly) easier to predict superior colts than superior fillies. That is not to say we can't rate fillies; one of the highest-rated horses we have is a G1-winning filly. It just means that the superior colts tend to have good copies of the variants we test for more frequently than the superior fillies do. What it also means is that the fillies we see score really well are usually really good!
2) Sprinters are slightly easier to predict than distance horses - if you have taken a look at the model you will have seen that we predict elite sprinters to appear in the bottom right corner. Horses that appear in this bottom right corner are slightly more predictable than horses that appear in the top right. What is interesting is that some horses in this bottom right corner are capable of being more than just sprinters. Australian G1 winner God's Own scored a 6.1 on distance (1 being absolute sprinter, 100 being absolute router) and a 94.6 on class (1 being slow, 100 being elite). He should have been a 5f speed machine like Dayjur, who has a similar score. Instead he got a mile and won the G1 Caulfield Guineas. As I said above, trainers, in this case the legendary Bart Cummings, can make a difference.
3) A subset of class/distance combinations is the most predictable - there is a subset of the prediction model, defined by a certain range of scores on distance and class, in which horses are highly predictable. The hit rate of this group (i.e. the percentage of horses in the group that turn out to be elite, which statisticians call precision) is very high. If we were in a position to buy 20 horses that fell into this group, we would be very confident that at least half of them would become graded stakes winners. The problem... it is not as if we can tell just by looking at the horse. We would literally need to test every horse at the sale to see where each one fell. To give you an idea of how hard this subset is to reach, we have tested over 240 yearlings this year (born 2010) at the September Yearling Sales, primarily Books 1 and 2, and on the farm of an Eclipse Award-winning breeder, and only 10 of these (4%) fell into the subset. There might only be 50 that would fall into this subset across the entire September Sale catalog.
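To put rough numbers on that, here is a back-of-envelope sketch. It uses only the figures quoted above (10 of roughly 240 tested yearlings in the subset, and the idea of buying 20 from it) and assumes, purely for illustration, that the same rate would hold for other horses we test.

```python
import math

tested = 240     # yearlings tested this year (the text says "over 240")
in_subset = 10   # of those, how many fell into the predictable subset

# Observed share of tested yearlings that landed in the subset.
subset_rate = in_subset / tested
print(f"subset rate: {subset_rate:.1%}")  # 4.2%

# Assumption: the same rate applies to any further horses tested.
# How many would we expect to test to find 20 horses in the subset?
target = 20
tests_needed = math.ceil(target * tested / in_subset)
print(f"tests needed to find {target}: about {tests_needed}")  # about 480

# "At least half" of 20 purchases becoming graded stakes winners.
min_winners = target // 2
print(f"graded stakes winners hoped for: at least {min_winners}")  # at least 10
```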
While the last observation is, to put it mildly, needle-in-a-haystack stuff, it is this type of hit rate that our clients are demanding. They don't want to go and buy hundreds of yearlings or two-year-olds; they just want to buy a few. They want their shots on goal to count.