Today I am pleased to announce the launch of BreezeUpIQ.com.
This project is the culmination of years of work gathering data at major two-year-old in training sales across North America and Europe and using racetrack outcomes to develop a predictive algorithm for those sales. In all, I have data on just over 6,000 horses that subsequently made three or more starts, which makes them valid records on which to build a prediction algorithm.
BreezeUpIQ.com uses the latest in machine learning algorithms, specifically XGBoost, to develop a predictive algorithm to select elite horses from two-year-old in training sales. More than half of the winning solutions in machine learning challenges hosted at data science platform Kaggle adopt XGBoost as their winning algorithm.
I am happy with the soundness of the data science behind BreezeUpIQ's predictive algorithm. Some data suppliers offer breeze/gallop out times, and some offer stride length analysis. What I have developed here is a predictive algorithm that considers these data points together with derived data points such as velocity, decay, stride length, strides per second and other factors. All of these are normalized to reflect horses of the same sex breezing over the same distance, and the result is a probability of a horse being an elite runner.
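To make the idea of derived and normalized data points concrete, here is a minimal sketch in Python. It is not the BreezeUpIQ code: the field names (`distance_furlongs`, `breeze_seconds`, `stride_count`), the use of a z-score as the normalization, and the sample records are all assumptions made for illustration, and decay is left out because it needs split times.

```python
from collections import defaultdict
from statistics import mean, pstdev

FURLONG_METRES = 201.168  # one furlong is 660 feet


def derive_features(horse):
    """Add velocity, stride length and stride rate to one breeze record."""
    metres = horse["distance_furlongs"] * FURLONG_METRES
    return {
        **horse,
        "velocity_mps": metres / horse["breeze_seconds"],
        "stride_length_m": metres / horse["stride_count"],
        "strides_per_second": horse["stride_count"] / horse["breeze_seconds"],
    }


def normalize_within_groups(horses, feature):
    """Z-score a feature against horses of the same sex and breeze distance."""
    groups = defaultdict(list)
    for h in horses:
        groups[(h["sex"], h["distance_furlongs"])].append(h[feature])
    # Population mean and standard deviation per (sex, distance) group.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}
    normalized = []
    for h in horses:
        mu, sigma = stats[(h["sex"], h["distance_furlongs"])]
        z = (h[feature] - mu) / sigma if sigma > 0 else 0.0
        normalized.append({**h, feature + "_z": z})
    return normalized


# Made-up records, for illustration only (hypothetical horses and numbers).
breezes = [
    {"name": "A", "sex": "colt", "distance_furlongs": 1, "breeze_seconds": 10.0, "stride_count": 28},
    {"name": "B", "sex": "colt", "distance_furlongs": 1, "breeze_seconds": 10.4, "stride_count": 29},
    {"name": "C", "sex": "filly", "distance_furlongs": 1, "breeze_seconds": 10.2, "stride_count": 29},
    {"name": "D", "sex": "filly", "distance_furlongs": 1, "breeze_seconds": 10.6, "stride_count": 30},
]
scored = normalize_within_groups([derive_features(h) for h in breezes], "velocity_mps")
```

The point of grouping before scaling is that a filly is only compared with other fillies that breezed the same distance, so a positive `velocity_mps_z` means faster than her own peer group rather than faster than the whole sale.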
It's not a perfect model with 100% precision and 100% recall (that is, one that predicts every future stakes winner at a sale), which is an impossible claim that some others make, but it does provide a significant advantage at two-year-old in training sales. In a randomized holdout set of 1,000 two-year-olds with racing outcomes that the algorithm had not seen before, the XGBoost model produced the following results:
Fraction Above 0.9 Threshold - 6.8% (that is, the model selects 6.8% of all horses in the sale).
Precision (Strike Rate) - 23.1% (or roughly one out of every 4.3 horses selected subsequently becoming a stakes winner).
Recall - 33.5% (the model finds a third of all possible stakes winners in the sale).
You can see from the above that the model doesn't find every stakes winner, as some stakes winners will generate breeze data that is no different from that of a lot of slow horses, but it finds a third of all stakes winners in a sale and delivers an outstanding strike rate for selecting elite runners at two-year-old in training sales.
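For readers who want to see how figures like these are calculated, the sketch below computes the fraction selected, precision and recall from a list of model probabilities and known outcomes. The function names and the toy data are my own illustration, not the BreezeUpIQ pipeline, and "above the threshold" is assumed to mean strictly greater than 0.9. The second function is simple arithmetic on the three published figures, taken at face value.

```python
def threshold_metrics(probabilities, is_stakes_winner, threshold=0.9):
    """Fraction selected, precision and recall for scores above a threshold."""
    selected = [p > threshold for p in probabilities]
    n_selected = sum(selected)
    n_winners = sum(is_stakes_winner)
    # A hit is a selected horse that went on to win a stakes race.
    hits = sum(1 for s, y in zip(selected, is_stakes_winner) if s and y)
    return {
        "fraction_selected": n_selected / len(probabilities),
        "precision": hits / n_selected if n_selected else 0.0,
        "recall": hits / n_winners if n_winners else 0.0,
    }


def implied_base_rate(fraction_selected, precision, recall):
    """Share of stakes winners in the whole set implied by the three figures."""
    return fraction_selected * precision / recall


# Toy data for illustration only: ten hypothetical horses.
toy_probs = [0.95, 0.92, 0.50, 0.30, 0.91, 0.10, 0.20, 0.85, 0.40, 0.05]
toy_labels = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
toy_metrics = threshold_metrics(toy_probs, toy_labels)

# Arithmetic on the published holdout figures (6.8%, 23.1%, 33.5%).
base_rate = implied_base_rate(0.068, 0.231, 0.335)
lift = 0.231 / base_rate
```

Allowing for rounding in the published percentages, this arithmetic implies that roughly 4.7% of the holdout horses became stakes winners, so a 23.1% strike rate is close to five times what picking at random would be expected to deliver.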
You can take a look at a sample of the output by clicking here.
Given the proprietary nature of the machine learning algorithm and its ability to select a high proportion of stakes winners relative to the number of selections it makes, I won't be offering a report at two-year-old in training sales. If I sold a report, given the highly selective nature of the algorithm, it wouldn't take long before the clients who purchased it were all bidding on the same horses.
What I will be doing is using the algorithm as the start of a process to select racehorses. The algorithm provides a pool of horses on which I will run cardio and DNA testing, and I will then use additional predictive algorithms to refine that pool down to a select group of horses with a high percentage chance of becoming elite runners.
I will be heading down to the OBS March 2YO Sale and will cover the main two-year-old sales in North America, and I also plan to offer the service at European two-year-old sales.