Using Machine Learning to "Short List" - Part Deux
It's been a while between posts, but I have been busy working on our database, cleaning up some new data and fine-tuning the algorithms we run in Azure ML that form the basis of our predictive models.
We have some interesting developments coming out over the next few months, some of which really push the boundaries in terms of the technology available, especially in the field of image and video recognition using deep learning neural networks.
Before we get to that point, however, we have to tackle the issue of scale (you can't test every horse at a sale, especially Keeneland September!). We are therefore cognizant of the need to develop an effective short-listing tool that allows us to get around the sales efficiently and use these other tools to drive us towards the end result of consistently selecting stakes winners.
Let's look at some numbers. In 2015 there were 8,538 yearlings offered for sale across North America. The reality is that you would want to own only about 340 of these (about 4%). The rest you wouldn't want to be paying to feed or train. Hard as that may sound, it is the reality that agents and owners seem to forget when they get to a sale and create a short list. Agents and owners also seem to forget that while breeders and sales consignors do a reasonable job of grading horses (the better horses tend to find their way to Saratoga or Book 1 at Keeneland), they are not infallible: the next Big Brown could easily be selling at the October Sale, and they may not even turn up for it. In that respect, there is no first or last horse or first or last sale, and each horse should be considered as an independent event.
I have worked, and still do, with agents that inspect every horse at a sale and come away with a short list that they work from. One of the best 'short-listers' I have seen can reduce 1000 horses (with 40 elite horses in that group) down to 100 with 15 elite horses still in the mix. If you look at the data of most leading agents and their purchase outcomes, that is about as good as a human can do at it when left to their own devices.
Two standard metrics for measuring the performance of a binary model (which is what we have here, since a horse is either elite or it isn't) are Precision and Recall. Precision (P) is the proportion of predicted positive cases that were correct. In the case of our 'short-lister' above, Precision is 0.15, as 15 of the 100 predicted positive cases were in fact positive (elite). Recall, or true positive rate (TPR), is the proportion of actual positive cases that were correctly identified. In our case Recall is 0.38 (15/40 = 0.375), as 15 of the 40 possible positive (elite) cases were identified by the 'short-lister'.
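To make the arithmetic concrete, here is a minimal Python sketch that recomputes both metrics from the 'short-lister' figures quoted above. The function name and its arguments are mine, chosen for illustration only.

```python
def precision_recall(true_positives, predicted_positives, actual_positives):
    """Return (precision, recall) for a binary short list."""
    precision = true_positives / predicted_positives  # correct picks / all picks
    recall = true_positives / actual_positives        # correct picks / all elite
    return precision, recall


# Figures from the 'short-lister' example: 1000 horses inspected, 40 of them
# elite, 100 short-listed, 15 of the short-listed horses elite.
precision, recall = precision_recall(
    true_positives=15, predicted_positives=100, actual_positives=40
)
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
```

Note that the 900 horses left off the list never enter the Precision calculation, and the 85 non-elite horses on the list never enter Recall, which is why the two numbers can move independently.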
In terms of data science, this would be considered a very average (actually poor!) model. The goal of a short list is to eliminate as many true negative cases as possible while preserving as many of the positive cases as possible. There is often a trade-off between Precision (getting as many good horses as you can relative to overall selections) and Recall (getting as many good horses as you can relative to all good horses available), but the error rate or low positive selection of a 'short-lister' doesn't stem from this trade-off.
Error comes from a few sources. Firstly, they're human: they get tired and tend not to judge a horse early in the day the same way they judge one late in the day. Secondly, they let the catalog page, which is after all a sales document, sway their opinion of the horse in front of them, and so incorrectly weigh a factor that is only moderately relevant (it's the 9th foal out of the mare!). Thirdly, they don't consistently treat all sales, and all horses within each sale, as independent events, so they are often harder on horses at, say, the Fasig Tipton July Sale in the belief that what they might see in September will represent a better opportunity.
Almost two years ago we started to look at various data points as they related to yearling sales and to examine the possibility of using data to 'short list' at a yearling sale. Since that time we have been working on other projects, but also coming back to the data, adding in new fields, chucking out others (feature selection) and trying to come up with some data models that can put us on the path towards objective selection and more consistently selecting superior horses. We are now ready to tackle it in a more serious fashion.
As I said earlier, there is a bit of a trade-off between Precision and Recall. From our perspective, when we are developing models for testing horses for other agents and buyers at yearling sales, we generally tune the model to drive Recall up so we capture as many of the elite horses as possible. We do this because people generally don't like it when you tell them a horse isn't going to be elite and it turns out that it is. The trade-off is that we get a lot more false positives (horses that we say are good but turn out to be slow) than we would ideally want if the choice were our own.
Developing a selection model for ourselves, as we are doing here, we are more interested in driving Precision up. After all, realistically speaking we may only buy 15-20 horses in a year, so driving Precision up, and making sure that we have a high percentage of elite runners relative to our overall positive selections, is more important. We want to get as many good horses as we can in the 15-20 that we ultimately recommend to buy. The fact that we miss some good horses isn't relevant.
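One common way to steer a model towards Recall or towards Precision is to move the score threshold above which a horse is labelled positive. The sketch below illustrates that idea only; the ten scores and labels are invented for the example and are not from our data, and it is not a description of how the Azure ML model is actually tuned.

```python
# Hypothetical (score, label) pairs: score is the model's estimated
# probability of being elite, label is 1 for elite and 0 otherwise.
scored = [
    (0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
    (0.50, 1), (0.40, 0), (0.30, 0), (0.20, 1), (0.10, 0),
]


def precision_recall_at(threshold, scored):
    """Precision and recall when every score >= threshold is called elite."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A low threshold favours Recall; a high threshold favours Precision.
for threshold in (0.15, 0.45, 0.85):
    p, r = precision_recall_at(threshold, scored)
    print(f"threshold {threshold:.2f}: precision {p:.2f}, recall {r:.2f}")
```

On this toy data the lowest threshold keeps every elite horse but fills the list with false positives, while the highest threshold produces a short list with no false positives at the cost of missing most of the elite horses.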
So we gathered data for a pilot study of 500 horses, did some processing of the data (normalizing, binning, etc.) and trained a Boosted Decision Tree model in Azure ML. The variables within the model are all based on data that is available before a yearling sale (many of which you can find here) and included some new variables that we have been testing over the past few years that we believe are related to performance outcomes (a lot of the new ones are related to soundness and precocity). After training the Boosted Decision Tree model on 400 horses, we then tested the model on a holdout set of 100 horses that the model hadn't seen. Here are the results below.
So what is the image above telling us?
From the 100 holdout horses that it has never seen before, the model gives 33 horses a positive label (14 True Positives, 19 False Positives), suggesting that these horses are potentially elite. From a practical viewpoint that is actually a good number in terms of it being possible to then apply the further tests that we are developing. If you extrapolate that to the Keeneland September Sale of 4,000 yearlings, then it is going to shortlist about 1,300 of them, a manageable number to go on with and do further testing.
Importantly, Precision is relatively high at 0.42, so 42% of the horses that the model thinks are good racehorses actually are, which is solid. Now you will notice that this is an unbalanced data set, in that there are 27 elite horses in a subset of 100 horses, which is significantly more than you'd see in a normal population (at the roughly 4% rate mentioned earlier there would be about 4 elite horses). We did this on purpose, as this was a pilot study using a lot more variables than we had used before, and we wanted to measure the effect of these new variables and see which ones to keep and which ones to discard in a future model that we are working on now.
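For anyone who wants to check the holdout numbers, the sketch below rebuilds the full confusion matrix from the counts quoted above (100 horses, 27 elite, 14 True Positives, 19 False Positives). The False Negative and True Negative counts, Recall, and the ratio of Precision to the elite rate in the holdout are not quoted in the results; they are derived here by simple arithmetic, and the variable names are mine.

```python
# Counts taken from the holdout results described in the text.
holdout_size = 100
actual_elite = 27
true_positives = 14
false_positives = 19

# The remaining two cells follow by subtraction.
false_negatives = actual_elite - true_positives
true_negatives = holdout_size - actual_elite - false_positives

predicted_positive = true_positives + false_positives
precision = true_positives / predicted_positive
recall = true_positives / actual_elite

# How much richer in elite horses the short list is than the holdout overall.
base_rate = actual_elite / holdout_size
lift = precision / base_rate

# Same selection rate applied to a 4,000-horse sale (integer arithmetic).
shortlist_at_4000 = predicted_positive * 4000 // holdout_size

print(f"TP={true_positives} FP={false_positives} "
      f"FN={false_negatives} TN={true_negatives}")
print(f"precision={precision:.3f} recall={recall:.3f} lift={lift:.2f}")
print(f"short list from 4,000 yearlings: {shortlist_at_4000}")
```

Comparing Precision with the 27% elite rate in this deliberately enriched holdout, rather than with the roughly 4% rate in the general population, is the fair yardstick for this pilot: the model's short list is about one and a half times as rich in elite horses as the set it was drawn from.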
The larger study that we are now undertaking looks at all foals that have had at least 3 lifetime starts (so they've had a chance to be a winner at least) from the 2009 and 2010 Keeneland September, OBS August, Tattersalls October (Bk1 & Bk2), Inglis Easter, Magic Millions and Karaka yearling sales. It is obviously a much larger data set (over 10,000 examples) and will be more balanced towards actual outcomes, with elite horses forming somewhere around 7% of the population (it is higher than normal because it's a subset of horses that have raced, rather than all horses).
This will present some challenges in terms of the positive cases being such a small percentage of the overall dataset, but there are plenty of tools available, such as synthetic oversampling, that we can use to overcome this imbalance. We are hoping that in a short while we will be training on this data, with the goal being that as we come out of spring and hit the Northern Hemisphere yearling sales we will have a pre-sale selection model that can be consistently used across every sale and start us looking in the right place for a stakes winner.
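As a rough illustration of what synthetic oversampling does, here is a simplified SMOTE-style sketch in plain Python. It is an assumption on my part that something along these lines would be used; the function, its parameters and the two-feature example rows are all hypothetical, and a real project would more likely use the equivalent module in Azure ML or an existing library than hand-rolled code.

```python
import random


def synthetic_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows.

    Each new row lies on the line segment between a randomly chosen
    minority row and one of its k nearest minority neighbours
    (a simplified SMOTE-style procedure). Needs at least two rows.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to base.
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical elite horses described by two normalised features.
elite_rows = [(0.80, 0.60), (0.75, 0.70), (0.90, 0.65), (0.85, 0.55)]
new_rows = synthetic_oversample(elite_rows, n_new=6)
print(len(new_rows), "synthetic elite rows created")
```

The synthetic rows should only ever be added to the training data. The holdout set has to keep its real proportion of elite horses, otherwise the Precision and Recall measured on it won't reflect what happens at an actual sale.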