A 30-billion-dimensional eigenvector - the Google algorithm
I am working on a project about Google's PageRank algorithm, and I wanted to share some of the things I found out about Google. Maybe you can even use it yourself.
Introduction
For the following tutorial, there are a few requirements you should meet. First of all, you need some basic knowledge about matrices, especially about eigenvectors and eigenvalues. (I don't want to turn this into a tutorial about matrices.)
Secondly, you should have a high tolerance for grammatical issues and weird use of the English language.
The idea
If you had a "library" of 100 websites and someone searched it with just one keyword, how would you give him the best-matching website?
This is exactly what Larry Page and Sergey Brin asked themselves. They developed an algorithm to compute the PageRank, a score used to rank the websites matching a given keyword.
The idea is pretty simple: the more incoming links a website has, the more important it is.
To be more precise, this is the formula:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
These are the variables used:
- PR(A) is the PageRank of website A
- PR(Ti) is the PageRank of a website Ti that links to A
- C(Ti) is the number of outgoing links on website Ti
- d is a damping factor, 0 < d < 1 (usually it is set to 0.85; you will see later why)
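To make the formula concrete, here is a direct translation into code. This is just a sketch of the formula above, not Google's implementation; the function name and the argument format are my own:

```python
def page_rank(incoming, d=0.85):
    """Evaluate the PageRank formula for one page A.

    incoming: list of (pr_ti, c_ti) pairs -- the current PageRank of each
    page Ti linking to A, and that page's number of outgoing links C(Ti).
    d: damping factor, 0 < d < 1.
    """
    return (1 - d) + d * sum(pr_ti / c_ti for pr_ti, c_ti in incoming)

# A page linked from two pages, both with PageRank 1.0, which have
# 2 and 1 outgoing links respectively:
print(page_rank([(1.0, 2), (1.0, 1)]))  # about 1.425
```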
As you can see, there is no "absolute" solution to this problem, because every website is related to some others. If you want to calculate the PageRank of one website, you first have to know the PageRanks of the websites linking to it (here the PR(Ti)). Therefore, this becomes a system of equations with billions of unknown variables. We will come to this soon enough.
Larry Page and Sergey Brin used a "random surfer" to develop this algorithm.
Just imagine a guy surfing the Internet through links. He randomly clicks on links on various websites. There is a higher probability of landing on a website that is linked a lot across the Internet, so the random surfer will end up on those "highly linked" sites over and over again. (Just like all roads lead to Rome.)
If you let him surf the Internet forever (i.e. for infinite time), he will spend more time on some websites than on others. The ratio of the time spent on one website to the total time spent (here: infinite) gives you the importance of that website. This is exactly what the PageRank formula above expresses.
The damping factor adds the possibility of the surfer stopping and starting somewhere else. (When the random surfer gets tired or hungry.)
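You can actually test the random-surfer picture with a small simulation. The 3-site link structure below is made up purely for illustration; with damping factor d, the surfer follows a random link with probability d and otherwise restarts on a random page:

```python
import random

# Hypothetical 3-site Internet for illustration: A links to B and C,
# B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def simulate(links, d=0.85, steps=100_000, seed=1):
    """Count what fraction of his time a random surfer spends on each page.

    With probability d the surfer follows a random link on the current
    page; otherwise he 'gets tired' and restarts on a random page.
    """
    random.seed(seed)
    pages = list(links)
    counts = dict.fromkeys(pages, 0)
    page = random.choice(pages)
    for _ in range(steps):
        counts[page] += 1
        if random.random() < d and links[page]:
            page = random.choice(links[page])  # follow a link
        else:
            page = random.choice(pages)        # restart somewhere random
    return {p: counts[p] / steps for p in pages}

print(simulate(links))  # B gets the smallest share of visits
```

Note that the simulation returns fractions that sum to 1, while the formula gives unnormalized scores; the ranking of the pages is the same either way.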
This was the theory. Don't worry if you didn't get the whole idea yet; let's try it out with an example.
Example 1
Let's start with a fictional Internet of 3 websites: A, B and C.
A has a link to both B and C, B has a link to C, and C has one link back to A.
We set d = 0.5, so it is easier to work with.
If you fill in the values of PR() and C() for each of the websites, you get 3 equations:
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
If you solve this system of equations, you get these results:
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
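You can verify these values yourself. Here is one way to do it in Python, using exact fractions after substituting the equations into each other by hand (the substituted coefficients in the comments are derived from the three equations with d = 0.5):

```python
from fractions import Fraction

# Substituting PR(B) into PR(C), and PR(C) into PR(A), gives:
#   PR(B) = 1/2 + (1/4) PR(A)
#   PR(C) = 3/4 + (3/8) PR(A)
#   PR(A) = 7/8 + (3/16) PR(A)   ->   PR(A) = (7/8) / (1 - 3/16)
pr_a = Fraction(7, 8) / (1 - Fraction(3, 16))
pr_b = Fraction(1, 2) + pr_a / 4
pr_c = Fraction(3, 4) + pr_a * Fraction(3, 8)

print(pr_a, pr_b, pr_c)  # 14/13 10/13 15/13
```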
As you can see, the most important/best-matching website is C. There are 2 links leading to C, so the probability of landing on C is the highest. If our random surfer surfs this Internet, he will mostly end up on C. That is the main idea behind the PageRank algorithm.
There are of course other factors influencing the rank of a website, like the font type, the font size, the placement of links and so on. But the weighting of each of those factors is secret, because that is what makes the difference between Google and other search engines.
Implementing those factors is very complicated, so I won't go deeper into it. (Also because nobody outside Google knows all the factors and algorithms Google uses.)
Well, if you have such a system of equations for an index with billions of websites, you have billions of unknown variables. Do you have an idea how to solve such a system exactly? If you know how to solve systems of equations like the one above, you know it is a pain in the ass.
There is a simple but effective way to approximate the solution: set the PageRank of each website to 1 (as a start value) and evaluate the equations. Then put the results back into the equations and evaluate them again. Repeated over and over, the results converge to the "true" solution.
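For the 3-site example, this repeated evaluation can be sketched like so (plain Python, with my own variable names):

```python
def iterate_pagerank(links, d=0.5, iterations=50):
    """Approximate PageRanks by repeatedly evaluating the equations.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # start value: PR = 1 for every page
    for _ in range(iterations):
        # Evaluate every equation using the PageRanks of the previous round.
        pr = {p: (1 - d) + d * sum(pr[q] / len(links[q])
                                   for q in pages if p in links[q])
              for p in pages}
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(iterate_pagerank(links))  # approaches A=14/13, B=10/13, C=15/13
```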
Phew! Now we have the main idea and the mathematical problem we need to solve. That wasn't so hard, was it?
You probably know the chicken-and-egg problem. It is a philosophical dilemma, and I could write a whole "tutorial" on it alone, because there are several points of view.
It is the same kind of problem as with our equations: which came first, the egg or the chicken? We can't get a solution for a given website, because we first need the PageRanks of the website(s) linking to it. But to get those, we need to calculate the PageRanks of yet other websites.
Here you can see that all the PageRanks are linked together. If you disturb the system (for example, by adding or removing links on one website), a lot of other PageRanks will change too.
The eigenvalue problem
So, now we have the problem: we want to solve a system of equations with billions of variables. There are several ways to do this, but many people like to use matrices. This is the way Google does it, and that's why I want to show you this method.
First of all, you need to know matrices, or at least what matrices are. You can give it a try and google them yourself to get some basic knowledge. If you wish, I can make a basic tutorial about matrices. Just post a suggestion.
A very(!) basic summary of matrices:
You have probably already seen some matrices. They are like tables in which you can store information. A system of equations can be turned into a matrix (that is essentially what a matrix is). You can also understand a matrix as an operator that transforms vectors, which is why matrices are used in computer games and computer graphics. And the vectors can have far more than just 3 components (like our vector of PageRanks).
You can see that matrices are actually very cool and worth learning.
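To connect matrices back to PageRank: the whole system of equations can be written as one matrix equation, and the PageRank vector turns out to be an eigenvector (with eigenvalue 1) of the damped link matrix. A sketch for the 3-site example, using numpy as an assumed helper library:

```python
import numpy as np

# Link matrix M for the example: M[i, j] = 1/C(j) if page j links to
# page i, else 0. Order of pages: A, B, C.
M = np.array([[0.0, 0.0, 1.0],   # A is linked from C (C has 1 link)
              [0.5, 0.0, 0.0],   # B is linked from A (A has 2 links)
              [0.5, 1.0, 0.0]])  # C is linked from A and B

d = 0.5
pr = np.ones(3)                  # start value: PR = 1 everywhere
for _ in range(50):
    pr = (1 - d) + d * (M @ pr)  # the same iteration as before, in matrix form
print(pr)  # approaches [14/13, 10/13, 15/13]
```

At convergence pr no longer changes, so pr = (1-d) + d*(M @ pr): applying the damped link matrix (plus the constant term) maps the vector onto itself. That fixed point is what makes this an eigenvalue problem.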
Here are some matrices and their effects on vectors/objects:
[to be continued...]
If you find any mistakes, I would be glad if you could tell me. (I don't reread the whole thing every time I add something.)
Sources:
http://pr.efactory.de/d-index.shtml
http://www.wikipedia.com