PageRank

This is an old revision of this page, as edited by The Anome (talk | contribs) at 17:39, 1 February 2003 (indent and modified adjacency matrix). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

PageRank is a concept of assigning numerical values to all Web pages listed in a search engine. The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). A page that is linked by many pages with high rank receives a high rank itself.

The PageRank system is used by the Google search engine and was developed by Google's founders Larry Page and Sergey Brin while at Stanford University in 1998. The name PageRank™ is a trademark of Google. Whether or not the pun on the name Larry Page and the word "page" was intentional or accidental remains an open question.

Are there patents involved?

The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the frequency of hits on that page by the random surfer. It can be understood as a Markovian process in which the states are pages, and the transitions are all equally probable and are the links between pages. If a page has no links to another pages, it becames a sink and therefore makes this whole think unusable, because the sink pages will trap the random visitors forever. However, the solution is quite simple. If the random surfer arrives to a sink page, it picks another URL at random and continues surfing again.

To be fair with pages that are not sinks, these random transitions are added to all nodes in the Web, with a residual probability of usually q=0.15, estimated from the frequency that an average surfer uses his or her browser's bookmark feature.

So, the equation is as follows:

PageRank(i) = (q/N) + (1-q) Sum(j={pages that point to i}; PageRank(j))

It's worth noticing, and that's why the PageRank metric is so appealing in terms of elegance, that the PageRank values are the eigenvalues of the modified adjacency matrix. The values are fast to calculate (only a few iterations are needed) and in practice it gives good results.

The main disadvantage is that it favors older pages, because a new page, even a very good one, will not have many links unless it is part of an existing site (a site being a densely connected set of pages). That's why PageRank should be combined with textual analysis or other ranking methods.

External references