
Ranking the Importance of Ancestors Using PageRank in a Thoroughbred Pedigree Database.


In the world of thoroughbred horse racing, understanding the pedigree of racehorses is crucial. In terms of selection, the industry across the world lives by the motto "what we value, survives". This means that we (the industry) are fairly ruthless in making sure that if a stallion is capable of reproducing himself and, more importantly, has more than one son capable of doing the same, his genetics tend to survive. Equally, even if a horse was outstanding on the track, if it transpires that his genetics were unable to create that impact, we remove him from the commercial population (Smarty Jones, King of Kings, etc.) and you rarely see those names in pedigrees, at least in the commercial world.


The same obviously applies to mares, but in a slightly different way. Because of the way the industry (incorrectly, mind you) considers the catalogue page, we tend to value families that have produced black type horses. Branches of pedigrees that have gotten a little "thin" in terms of black type are pruned from the genetic tree (we don't breed from the females). Additionally, certain females have from time to time been assigned a kind of mythical status, namely "blue-hen", which promotes the idea that breeders should "get into the family", and branches from that mare tend to flourish as they are given an outsized opportunity through the quality of stallions they are bred to.


To delve deeper into this, and to establish what genetic material we truly value, I embarked on an intriguing project: applying Google's PageRank algorithm, typically used for ranking web pages, to a thoroughbred pedigree database. My aim was to rank the importance of horses in a pedigree, providing unique insights into the pedigree structure and dynamics of the modern breed.


The Dataset


I have a database comprising 497,127 horses and 984,018 parent-child relationships. To better establish how the relationships work, and to use algorithms like PageRank to understand the pedigree structure, I first migrated it into a graph database (Neo4j).


A graph database is a type of database designed to store and navigate complex relationships between data points. Unlike traditional relational databases that store data in rows and tables, graph databases use nodes (to represent entities) and edges (to represent relationships), making them particularly adept at handling interconnected data.



In the context of a pedigree database, a graph database allows us to elegantly map and explore the intricacies of the thoroughbred breed. Each horse is a node, and the familial ties—parent to offspring—are the edges. This structure not only facilitates efficient queries about the relationships of specific horses (it can calculate inbreeding coefficients in milliseconds) but also enables the application of advanced algorithms like PageRank to analyze the influence and significance of individual horses within the entire network of the breed. Such a setup is invaluable in understanding breeding patterns, genetic traits, and the propagation of certain characteristics across generations, providing us with deeper insights into exactly how the thoroughbred breed has been constructed over time.
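To make the node-and-edge idea concrete, here is a minimal Python sketch rather than the Neo4j model itself. The handful of parent-child pairs are a hand-picked slice of well-known pedigree, and the helper names `build_parent_index` and `ancestors` are my own, purely for illustration.

```python
from collections import defaultdict

# (parent, child) edges: each horse is a node, each pair is one edge.
PARENT_CHILD = [
    ("NEARCO", "NEARCTIC"),
    ("NEARCTIC", "NORTHERN DANCER"),
    ("NATIVE DANCER", "NATALMA"),
    ("NATALMA", "NORTHERN DANCER"),
    ("NORTHERN DANCER", "DANZIG"),
    ("DANZIG", "DANEHILL"),
]


def build_parent_index(edges):
    """Map each child to the set of its known parents."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return parents


def ancestors(horse, parents):
    """Walk the edges upwards and collect every known ancestor."""
    seen = set()
    stack = [horse]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


parents = build_parent_index(PARENT_CHILD)
print(sorted(ancestors("DANZIG", parents)))
```

A graph database answers the same kind of traversal question natively, without the index having to be rebuilt in application code.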


My database is in no way a complete database of all racehorses over all years. It was built from the merging of two databases:

  1. The pedigree database that is used in Bloodlines.net

  2. A private database of stakes results (first to last) in stakes races from 2000 to 2022.


While it wasn't a complete record of all horses and race participants (with some stakes races, especially in North America, missing), it represented a substantial and commercially significant dataset, tracing back to the founders of the breed.


What is PageRank?


PageRank is what is known as a Centrality Algorithm. A centrality algorithm is a method used in network analysis to identify the most important or influential nodes (horse names in our case) within a graph (a network of names). These algorithms are fundamental in understanding the structure and dynamics of complex networks, be they social networks, transport networks, biological networks, or any system represented as a graph. The concept of centrality helps to pinpoint key elements within the network, which are crucial for its functioning and connectivity.



PageRank, the renowned algorithm developed by Google founders Larry Page and Sergey Brin, is traditionally used for ranking websites based on their link structure. But why limit it to web pages? My idea was to apply this algorithm to a pedigree context, treating each horse as a 'node' and each parent-child relationship as a 'link'.


A critical aspect of this analysis was considering the directionality of these relationships. In pedigrees, information flows from parent to child. This flow is pivotal in understanding pedigree and influence. By applying PageRank with consideration to the directional flow of information, or in our case genetic material from parent to child, I could quantify the influence or importance of a horse in the entire network of pedigrees based on how many descendants it has, the prominence of those descendants, and so on.


It is not a ranking based on how many times the sire shows up in the network; rather, it is a ranking of its importance as its children and their children (and so on) have survived in the network as time has gone on.
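For readers who want to see the mechanics, the following is a textbook power-iteration sketch of PageRank in plain Python. It is not Neo4j's implementation, and the toy pedigree is a hand-picked slice rather than my database. The one assumption worth flagging is that each link is pointed from child to parent, so that a horse collects score from its descendants; the function name `pagerank` and the unnormalised update rule are my choices for illustration.

```python
# Toy pedigree as (parent, child) pairs; illustrative only.
PARENT_CHILD = [
    ("NEARCO", "NEARCTIC"),
    ("NEARCTIC", "NORTHERN DANCER"),
    ("NATALMA", "NORTHERN DANCER"),
    ("NORTHERN DANCER", "DANZIG"),
    ("NORTHERN DANCER", "SADLER'S WELLS"),
    ("DANZIG", "DANEHILL"),
]


def pagerank(links, damping=0.85, iterations=40):
    """Unnormalised PageRank; score flows from link source to link target."""
    nodes = sorted({horse for link in links for horse in link})
    out_degree = {horse: 0 for horse in nodes}
    for source, _ in links:
        out_degree[source] += 1
    scores = {horse: 1.0 for horse in nodes}
    for _ in range(iterations):
        received = {horse: 0.0 for horse in nodes}
        for source, target in links:
            # A horse splits its score evenly between its known parents.
            received[target] += scores[source] / out_degree[source]
        scores = {h: (1 - damping) + damping * received[h] for h in nodes}
    return scores


# Point every link from child to parent so ancestors accumulate score.
child_to_parent = [(child, parent) for parent, child in PARENT_CHILD]
scores = pagerank(child_to_parent)
for horse, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{horse:16s} {score:.4f}")
```

On this toy slice Northern Dancer comes out on top even though he appears no more often than anyone else, because two sons and a grandson all feed score back to him.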


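To apply PageRank to our database, I first projected an in-memory graph. The REVERSE orientation flips each Father_Of and Mother_Of relationship so that it points from child to parent (assuming, as the names suggest, that they are stored parent to child), which lets the PageRank score accumulate on ancestors: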

CALL gds.graph.project(
  'horses',
  'Horse',
  {
    Father_Of: {
      type: 'Father_Of',
      orientation: 'REVERSE'
    },
    Mother_Of: {
      type: 'Mother_Of',
      orientation: 'REVERSE'
    }
  }
)
YIELD graphName, nodeCount, relationshipCount

Once I had the graph, I applied the PageRank algorithm:

CALL gds.pageRank.stream('horses', {
  maxIterations: 40,
  dampingFactor: 0.85
})
YIELD nodeId, score

// Match Horse nodes and join with PageRank scores
MATCH (h:Horse) WHERE id(h) = nodeId
RETURN h.Name AS horse, score
ORDER BY score DESC

For the technically minded, I ran the algorithm with several values of maxIterations (20, 30, 40 and 50) to see how sensitive the output was, and it barely changed. That is, the names that were most important stayed that way under different iteration values.
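The same kind of stability check can be sketched outside the database. The snippet below is a hypothetical illustration on a toy pedigree, not the run I did in Neo4j: it recomputes a simple PageRank at each iteration count and compares the resulting orderings. The function names `pagerank` and `ranking` and the toy links are assumptions of mine.

```python
# Toy links pointing from child to parent; illustrative only.
LINKS = [
    ("NEARCTIC", "NEARCO"),
    ("NORTHERN DANCER", "NEARCTIC"),
    ("NORTHERN DANCER", "NATALMA"),
    ("DANZIG", "NORTHERN DANCER"),
    ("DANEHILL", "DANZIG"),
]


def pagerank(links, damping=0.85, iterations=40):
    """Simple unnormalised PageRank by repeated score passing."""
    nodes = sorted({horse for link in links for horse in link})
    out_degree = {horse: 0 for horse in nodes}
    for source, _ in links:
        out_degree[source] += 1
    scores = {horse: 1.0 for horse in nodes}
    for _ in range(iterations):
        received = {horse: 0.0 for horse in nodes}
        for source, target in links:
            received[target] += scores[source] / out_degree[source]
        scores = {h: (1 - damping) + damping * received[h] for h in nodes}
    return scores


def ranking(links, iterations):
    """Names ordered by descending score, ties broken alphabetically."""
    scores = pagerank(links, iterations=iterations)
    return sorted(scores, key=lambda horse: (-scores[horse], horse))


rankings = {n: ranking(LINKS, n) for n in (20, 30, 40, 50)}
stable = all(order == rankings[20] for order in rankings.values())
print("Ranking identical across iteration counts:", stable)
```

On a real database the comparison would more sensibly be restricted to the top 20 or top 50 names, since small reshuffles far down the list are of little practical interest.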


Insights and Findings


The analysis yielded fascinating results. I ranked horses based on their PageRank scores, revealing which horses held the most 'influence' in the entire network of all pedigrees in the database. Here are the top 20:


Rank  Name             PageRank
1     NORTHERN DANCER  2355.000315
2     MR. PROSPECTOR   1638.400582
3     NEARCO           1586.637367
4     NASRULLAH        1395.032029
5     ST. SIMON        1388.981924
6     NATIVE DANCER    1143.612113
7     NEARCTIC         1140.796903
8     GALOPIN          1130.526789
9     RAISE A NATIVE   1098.99274
10    NATALMA          1080.666792
11    HYPERION         1069.372728
12    PHAROS           968.2484486
13    DANZIG           946.776179
14    PHALARIS         933.753855
15    GAINSBOROUGH     830.9146146
16    STOCKWELL        805.3530843
17    TEDDY            793.9423159
18    BOLD RULER       773.6204061
19    HEROD            751.8420376
20    BLENHEIM II      739.69578


It is unsurprising to see Northern Dancer at the top of the tree. His influence in shaping the breed is undisputed and his PageRank score is a long way ahead of the next most influential horse, Mr. Prospector. The latter, born in 1970, rode the wave of the commercialization of the breed in the 1980s. The other modern name to get into the top 20 is Mr. Prospector's barn-mate Danzig, with the two standing at the famed Claiborne Farm. Obviously Danzig's influence is primarily through his breed-shaping son Danehill (who ranks 28th), but with his PageRank significantly higher than that of his best son, the score also reflects his influence outside of Danehill.


Interestingly, the older names of St. Simon, Stockwell and Herod line up with a 2018 paper, "Founder-specific inbreeding depression affects racing performance in Thoroughbred horses". This paper used a different technique to calculate a list of influential ancestors (a process that created marginal contribution values), but the names of Herod, St. Simon and Stockwell all appeared in its top 10.



They found that a higher contribution of Herod in a horse's pedigree had a positive influence on cumulative earnings, earnings per start and career length. For those interested, the complete list of the top 50 horses can be found by clicking here.


Conclusion


So, while the PageRank algorithm has provided us with a good snapshot of what ancestors are important in the breed, and created a ranking of such, where do I take this from here?


Firstly, I would like to run a community detection algorithm such as Louvain on the data to detect which names in pedigrees are generally associated with one another, that is, communities of pedigree names within the entire network of all names. The idea is that there are "metabolic blocks" of pedigree data that tend to be inherited together. Biological networks like pedigree data are generally highly modular and contain a number of clusters or groups, which are often associated with a specific function, so it should follow that there are certain ancestors that go together in a pedigree over generations and that have an impact on outcomes.


Secondly, there is the ability to apply weights within the PageRank algorithm. With sufficient data, I could apply a weight to the child node based on its racing ability. Re-running PageRank with a known race rating as the weight would then give further importance to the relationships where the children (and their children) have better race performances than others. While my database is primarily horses that have competed in stakes races since 2000, it also holds the parents of these horses, so it will include a lot of horses with limited race performance. If I apply a weighting for the race performance of all the names in the pedigree, it is possible that some of the ancestors will be considered more or less important.
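One thing to watch for is that a rating attached to the child would put the same weight on its sire link and its dam link. If the weights are normalised per source node, as in the usual weighted PageRank, that weight simply cancels out. A possible workaround, and it is only my assumption of how this could be done, is to let the rating scale the base score each horse starts with. The sketch below shows that idea in plain Python; the horses, the ratings and the function name `rated_pagerank` are all hypothetical.

```python
def rated_pagerank(parent_child, ratings, damping=0.85, iterations=40):
    """PageRank where each horse's base score is scaled by its race rating.

    Horses without a rating get the average of the known ratings, so with
    no ratings at all this reduces to the plain unnormalised PageRank.
    """
    links = [(child, parent) for parent, child in parent_child]
    nodes = sorted({horse for link in links for horse in link})
    out_degree = {horse: 0 for horse in nodes}
    for source, _ in links:
        out_degree[source] += 1
    default = sum(ratings.values()) / len(ratings) if ratings else 1.0
    rating = {horse: ratings.get(horse, default) for horse in nodes}
    mean_rating = sum(rating.values()) / len(nodes)
    base = {h: (1 - damping) * rating[h] / mean_rating for h in nodes}
    scores = dict(base)
    for _ in range(iterations):
        received = {horse: 0.0 for horse in nodes}
        for source, target in links:
            received[target] += scores[source] / out_degree[source]
        scores = {h: base[h] + damping * received[h] for h in nodes}
    return scores


# Hypothetical horses and ratings, purely for illustration.
PARENT_CHILD = [("SIRE_X", "GOOD_RUNNER"), ("SIRE_Y", "MODEST_RUNNER")]
RATINGS = {"GOOD_RUNNER": 120.0, "MODEST_RUNNER": 80.0}

plain = rated_pagerank(PARENT_CHILD, {})
rated = rated_pagerank(PARENT_CHILD, RATINGS)
print("plain:", round(plain["SIRE_X"], 4), round(plain["SIRE_Y"], 4))
print("rated:", round(rated["SIRE_X"], 4), round(rated["SIRE_Y"], 4))
```

Without ratings the two sires are indistinguishable, since each has a single runner. With ratings, the sire of the better runner pulls ahead, which is the effect I am after.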


Finally, once I have done the above, I would like to run some type of Monte Carlo simulation on the pedigree data to come up with a pedigree rating for the ancestry. If I know which ancestors are important by PageRank, and which ancestors tend to "stick together" via Louvain, it makes sense to simulate whether those select names that really matter are inherited more frequently in top-class horses or not. It would just be one part of the overall judgement of a pedigree (the stud and production performance of the immediate ancestry will have a greater impact), but it gets me working towards some type of final pedigree rating.
