I've been working on some new data models for the yearling sales season. The reasoning is this: at the average sale about 3% of all horses turn out to be Elite racehorses, while the average bloodstock agent gets about 10% Elite racehorses from their purchases. If a data algorithm can get you started at each sale with a shortlist that strikes at a little north of that latter figure, you are putting yourself in a position for more consistent success.
In analyzing the dataset, which is substantial, it is interesting to see which features rank as more important, or indeed describe a phenomenon in a way you wouldn't expect. One "variable" is the attempt to explain the production class of a sire. That's obvious, right? We use something like the percentage of Graded/Group Stakes winners to runners, an Average Earnings Index (AEI), or an APEX rating.
It turns out that there is something more predictive of a stallion's performance, and it revolves around the process of assortative mating.
Assortative mating is a mating pattern and a form of sexual selection in which individuals with similar phenotypes mate with one another more frequently than would be expected under a random mating pattern. That is, it is a non-random event.
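One way to put a number on how assortative a set of matings is would be to correlate the class of the sire with the class of the dam. The sketch below is only an illustration, not the method used in my models: the `CLASS_RANK` ordering, the equal spacing between classes, and the function names are all assumptions made for the example. A score near +1 means like is being bred to like, and a score near 0 means the matings look random with respect to class.

```python
from math import sqrt

# Hypothetical ordinal ranks for racing class. The equal spacing between
# classes is an assumption made purely for illustration.
CLASS_RANK = {"Unplaced": 0, "Winner": 1, "G3": 2, "G2": 3, "G1": 4}


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side never varies")
    return cov / sqrt(var_x * var_y)


def assortative_mating_score(matings):
    """Correlate sire class with dam class over (sire_class, dam_class) pairs."""
    sire_ranks = [CLASS_RANK[sire] for sire, _ in matings]
    dam_ranks = [CLASS_RANK[dam] for _, dam in matings]
    return pearson(sire_ranks, dam_ranks)


if __name__ == "__main__":
    # Made-up matings, not real data.
    sample = [("G1", "G1"), ("G1", "G2"), ("Winner", "Winner"), ("Unplaced", "Winner")]
    print(round(assortative_mating_score(sample), 3))
```

A rank-based measure such as Spearman's correlation would avoid the equal-spacing assumption; Pearson is used here only because it fits in a few lines.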
The Thoroughbred industry works by assortative mating, and it is this very pattern that is in fact a stronger indication of a stallion's success. If a stallion is old and has proven himself to be less than useful, he is either rarely bred to or, if he is, the racing class of the mares he is bred to is poor. Alternatively, if an older stallion is thoroughly proven, like, say, Galileo, he has Group One winning mares visit his court regularly (especially young mares).
Thus, when modeling the performance of sires in prediction models, the generational interval of the foal, ((HorseYOB-SireYOB)+(HorseYOB-DamYOB))/2, that is, the average age of the sire and dam when the foal was born, and a label for the racing class of the sire and dam (G1, G2, G3, Winner, Unplaced, etc.) are enough to represent the production class of the sire in an overall prediction algorithm.
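As a rough sketch of how those two inputs could be turned into model features, the Python below computes the generational interval and one-hot encodes the class labels. It reads the generational interval as the mean of the sire's and dam's ages in the foal's birth year, which is my assumption about how the formula is meant to be grouped. The `Foal` record, the function names, and the `CLASS_LABELS` tuple are hypothetical, and the label list only covers the classes named above, not whatever the "etc." stands for.

```python
from dataclasses import dataclass

# Assumed label set: only the classes named in the text. A real model
# would likely carry more categories.
CLASS_LABELS = ("G1", "G2", "G3", "Winner", "Unplaced")


@dataclass
class Foal:
    horse_yob: int   # year of birth of the foal
    sire_yob: int    # year of birth of the sire
    dam_yob: int     # year of birth of the dam
    sire_class: str  # racing class label of the sire
    dam_class: str   # racing class label of the dam


def generational_interval(foal):
    """Mean of the sire's and dam's ages in the foal's birth year."""
    sire_age = foal.horse_yob - foal.sire_yob
    dam_age = foal.horse_yob - foal.dam_yob
    return (sire_age + dam_age) / 2


def sire_features(foal):
    """Build a flat feature dict: the interval plus one-hot class labels."""
    for label in (foal.sire_class, foal.dam_class):
        if label not in CLASS_LABELS:
            raise ValueError(f"unknown class label: {label!r}")
    features = {"generational_interval": generational_interval(foal)}
    for label in CLASS_LABELS:
        features[f"sire_class_{label}"] = int(foal.sire_class == label)
        features[f"dam_class_{label}"] = int(foal.dam_class == label)
    return features


if __name__ == "__main__":
    # Made-up example: a 22-year-old sire covering an 8-year-old G1 mare.
    foal = Foal(horse_yob=2020, sire_yob=1998, dam_yob=2012,
                sire_class="G1", dam_class="G1")
    print(sire_features(foal))
```

An old sire paired with a young, high-class mare shows up here as a mid-range interval alongside G1 flags on both sides, which is the kind of combination the paragraph above describes for a thoroughly proven stallion.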