Page 1 of 2 12 LastLast
Results 1 to 25 of 33

Thread: The 30 billion eigenvector - Google algorithm

  1. #1
    Join Date
    Feb 2007
    Location
    Switzerland
    Posts
    583
    Mentioned
    1 Post(s)
    Quoted
    50 Post(s)

    Default The 30 billion eigenvector - Google algorithm

    A 30 billion eigenvector - the Google algorithm

    I am working on a project about Google's algorithm of PageRank. I wanted to share some things I found out about Google, maybe you can even use it yourself.


    Introduction

    For the following tutorial, there are several requirements you should come up to. First of all, you need to have some basic knowledge about matrices especially about eigenvectors and eigenvalues. (As I don't want to make a tutorial about matrices)
    Secondly, you should have a higher tolerance to grammatical issues or weird use of the English language.

    The idea

    If you have 100 websites and someone searches your "library" of websites with only one keyword, how would you give him the best matching website?
    This is exactly what Larry Page and Sergey Brin asked themselves. They developed a algorithm for the PageRank, a list of websites matching to a given keyword.

    The idea is pretty easy: The more links are incoming, the more important is the website.


    To be more accurate, this is the pseudo-code of it:

    PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

    These are the variables used:
    • PR(A) is the PageRank of a website A
    • PR(Ti) is the PageRank of a website Ti
    • C(Ti) number of links on a website Ti
    • d is a damping factor, 0 < d <1 (mostly it is exactly 0.85, you will see later why)


    As you can see, there is no "absolute" solution of this problem, because every website is related to some others. If you want to calculate the PageRank of a website, you have to know the PageRank of another one (here its PR(Ti)). Therefore, this will be a system of equations with billions of unknown variables. We will come to this soon enough

    Larry Page and Sergey Brin used a "random surfer" to develop this algorithm.
    Just imagine a guy surfing the Internet through links. He will randomly click on links on several websites. There is a higher possibility to get on a website, that is linked a lot in the Internet. Therefore, the random surfer will get on those "highly linked" sites over and over again. (Just like all roads lead to Rome)
    If you let him surf the Internet forever (-> infinite time), he will spend more or less time on some websites. The ratio of the time spent on one website to the time spent overall (here it is unending), you will get the importance of the websites. This is exactly how the PageRank code above works.
    There is also a damping factor. It adds the possibility of stopping surfing. (When the random surfer gets tired or hungry )

    This was the theory. Don't worry if you didn't get the idea. Let's try it out with an example.

    Example 1

    Let's start with a fictional Internet with 3 websites.



    Here we have 3 websites A, B and C.
    There is a link on A to both websites B and C. B is linked to C and C has one link to A.
    We can set 0.5 for d, so it is easier to work with it.
    If you fill in the values PR() and C() for each of the websites, you will get 3 equations:

    PR(A) = 0.5 + 0.5 PR(C)

    PR(B) = 0.5 + 0.5 (PR(A) / 2)

    PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

    If you solve this system of equations you will get those results:

    PR(A) = 14/13 = 1.07692308
    
PR(B) = 10/13 = 0.76923077
    
PR(C) = 15/13 = 1.15384615

    As you see, the most important/matching website is C. There are 2 links leading to C, thus the possibility to get to C is the highest one. If our random surfer surfs this Internet, he will get mostly on C. That is the main idea behind the algorithm of PageRank.

    There are of course other factors influencing the PageRank of a website like the font type, the font size, the placement of links and so on. But the weighting of each of those factors are secret, because that is what makes the difference between Google and other search engines.
    To implement those factors is very complicated, so I won't go deeper into it. (Also because one can't know all the factors and algorithm Google is using)

    Well, if you have such a system equations for an index with billions of websites, you will have billions of unknown variables. Do you have an idea how to solve such a system exactly? If you know how to solve system of equations like that one above, you know it is a pain in the ass.
    There is a simple but effective way to approximate the solution: You can set the PageRank of each website to 1 (to get a start value), and solve the equations. You put the result into the equations and solve them again. The result will converge to the "true" result. This is an example of such a recursive iteration:




    Phew! Now we got the main idea and the mathematic problem we need to solve. That wasn't that hard, was it?
    Here is a nice picture that really matches our problem:





    Probably you know the chicken-egg-problem. It is a philosophical dilemma and I could make an own "tutorial" for the problem, because there are several point of views.
    However, it is the same problem as with our equations, as I mentioned above: Who was there at first: The egg or the chicken? We can't get a solution for a given website, because we need to get the PageRank of another one(s). But to get the PageRank of this one, we need to calculate the PageRank of another website(s).
    Here we can see, all PageRanks are linked up. If you disturb the system (like add or remove links on a website), a lot of PageRanks will change, too.

    The eigenvalue problem

    So, we have the problem now. We want to solve a system of equations with billions of variables. There are several ways to do this, but many people like to use matrices. This is the way Google does it and that's why I want to show you this method.

    First of all you need to know matrices. Or you have to know at least what matrices are. You can give it a try and Google it by yourselves to get some basic knowledge about matrices. If you wish, I can make a basic tutorial about matrices. Just post a suggestion

    Very(!) basic summary of matrices:

    Probably you saw already some matrices. They are like tables where you can store information. You can transform system of equations into matrices (what they are actually). You can understand matrices as operators that you can use on vectors to transform them. Therefore, they are used in computer games or computer graphics. You can also transform vectors with more than just 3 components (like the vector of PageRank).
    You see that matrices are actually very cool and worth to learn.

    Here are some matrices and their effects on vectors/objects:



    [to be continued...]

    If you find some faults, I would be glad if you could tell me. (I don't reread it every time I add something)

    Sources:
    http://pr.efactory.de/d-index.shtml
    http://www.wikipedia.com
    Last edited by Gala; 04-14-2012 at 08:41 PM.

  2. #2
    Join Date
    Mar 2012
    Posts
    208
    Mentioned
    0 Post(s)
    Quoted
    0 Post(s)

    Default

    Lost me at eigenvector

  3. #3
    Join Date
    Feb 2012
    Location
    Wonderland
    Posts
    1,988
    Mentioned
    41 Post(s)
    Quoted
    272 Post(s)

    Default

    Looks like a lot of hard work

    Source:
    http://www.rose-hulman.edu/~bryan/go...rsionFixed.pdf

  4. #4
    Join Date
    Feb 2007
    Location
    Switzerland
    Posts
    583
    Mentioned
    1 Post(s)
    Quoted
    50 Post(s)

    Default

    @Cerylin
    it is not that hard to understand. It is just the tool to solve the system of equations, you solve it by "hand" or with many iterations. (approximation)
    The tut hasn't even started, so let's see if you get it with a few examples

    @Le Jingle
    Yes, I must get the idea. That is one of the sourced I used for the project.

  5. #5
    Join Date
    Feb 2012
    Posts
    168
    Mentioned
    0 Post(s)
    Quoted
    0 Post(s)

    Default

    There's a problem with your way of solving systems. Iteration like that only really works when the numbers will converge. For example, take this system:

    100x + y = 10
    x + 100y = 18
    Putting it into a format so your method would work, you can do it one of two different ways:

    1). This method will converge at the right answers:
    x = (-y + 10)/100
    y = -x + 18)/100
    2). This method will not converge:
    y = 10-100x
    x = 18-100y

  6. #6
    Join Date
    Jan 2012
    Location
    In A Farm
    Posts
    3,301
    Mentioned
    30 Post(s)
    Quoted
    444 Post(s)

    Default

    Is that the evil chicken from runescape?

  7. #7
    Join Date
    Feb 2012
    Location
    Wonderland
    Posts
    1,988
    Mentioned
    41 Post(s)
    Quoted
    272 Post(s)

    Default

    So based on the Example 1 you listed, would the billions of variables consist of varying converging equations to get a characteristic of page rank?

    What I mean is, if your random surfer person is most likely to click C (as shown in Example 1), then is this just 1 of the billion of variables incorporated to obtain page rank?

    Can't wait for this thread to progress, it's probably my first favorite that I've seen here on villavu!

  8. #8
    Join Date
    Feb 2007
    Location
    Switzerland
    Posts
    583
    Mentioned
    1 Post(s)
    Quoted
    50 Post(s)

    Default

    @Imagine
    I will come soon enough to the point if it doesn't converge, be patient

    @Ucantseemebot
    It's the good twin chicken!

    @Le Jingle
    Thx. No, with three sites, you get three variables.
    The number of variables is equal to the numbers of pages in the index of Google.
    I'll explain this point way better

  9. #9
    Join Date
    Feb 2012
    Posts
    168
    Mentioned
    0 Post(s)
    Quoted
    0 Post(s)

    Default

    Quote Originally Posted by Gala View Post
    @Imagine
    I will come soon enough to the point if it doesn't converge, be patient
    Not all equations converge...
    I wrote a program in Matlab just to test this, and after like 10 iterations, the script had the x and y values like well above the millions.

  10. #10
    Join Date
    Feb 2007
    Location
    Switzerland
    Posts
    583
    Mentioned
    1 Post(s)
    Quoted
    50 Post(s)

    Default

    Quote Originally Posted by Imagine View Post
    Not all equations converge...
    I wrote a program in Matlab just to test this, and after like 10 iterations, the script had the x and y values like well above the millions.
    That is what i said, there is a trick to make them convergent.

  11. #11
    Join Date
    Feb 2012
    Posts
    168
    Mentioned
    0 Post(s)
    Quoted
    0 Post(s)

    Default

    Quote Originally Posted by Gala View Post
    That is what i said, there is a trick to make them convergent.
    You can't make everything converge... There are systems of equations with many unknowns (complex differentials, simulations, etc) which currently cost a lot of money to get someone to solve them. If there was a simple trick to make everything converge, then anybody could solve them.

  12. #12
    Join Date
    Sep 2006
    Posts
    89
    Mentioned
    0 Post(s)
    Quoted
    0 Post(s)

    Default

    tl;dr...hahaha, just playing!

    Nice read! I'll have to ponder on it and get back to you!

  13. #13
    Join Date
    Feb 2007
    Location
    Switzerland
    Posts
    583
    Mentioned
    1 Post(s)
    Quoted
    50 Post(s)

    Default

    Quote Originally Posted by Imagine View Post
    You can't make everything converge... There are systems of equations with many unknowns (complex differentials, simulations, etc) which currently cost a lot of money to get someone to solve them. If there was a simple trick to make everything converge, then anybody could solve them.
    I wish you just listened to me.
    The system of equations of google will always converge, unless there isnt a site without any links. This one is called dangling node, but as I said, there is a trick to make it convergent.
    (There are as many equations as variables)

    Edit: hoopla, lol, thx?

  14. #14
    Join Date
    Feb 2012
    Posts
    168
    Mentioned
    0 Post(s)
    Quoted
    0 Post(s)

    Default

    Take this system:

    Page A has 100 links from Page B and 1 from Page C, which has 100 links from Page C and 1 from Page A, which has 100 links from Page A and 1 from Page B.

    You get:
    A = 100B + C
    B = 100C + A
    C = 100A + B
    Iteration 1:

    A = 1, B = 1, C = 1.
    Iteration 2:
    A = 101, B = 102, C = 10202
    Etc, you can see how these will not converge.

    Yes, the same system, written like this:
    B = (A - C)/100
    C = (B - A)/100
    A = (C - B)/100
    would converge, but writing systems so they converge becomes increasingly more difficult as the number of equations increases.

  15. #15
    Join Date
    Feb 2007
    Location
    Switzerland
    Posts
    583
    Mentioned
    1 Post(s)
    Quoted
    50 Post(s)

    Default

    Dude, I will continue as soon I have time, then you will see that the google matrix always converges.
    I dont think you understood the algorithm right, cause the only reason a system of equations wouldnt converge is when they arent any links on a website.

  16. #16
    Join Date
    Apr 2012
    Posts
    2
    Mentioned
    0 Post(s)
    Quoted
    0 Post(s)

    Default

    Thanks mate, this was rather informative, since you seem quite apt in mathematics, do you know of any resources that pertain specifically to this topic in the context of 3D Projection/2D representation?.

    Thanks.

  17. #17
    Join Date
    Feb 2012
    Location
    SRL Jail
    Posts
    1,319
    Mentioned
    0 Post(s)
    Quoted
    0 Post(s)

    Default

    This is very interesting! We need more posts like this on the forums to add variety.

  18. #18
    Join Date
    Feb 2012
    Posts
    168
    Mentioned
    0 Post(s)
    Quoted
    0 Post(s)

    Default

    Quote Originally Posted by Gala View Post
    Dude, I will continue as soon I have time, then you will see that the google matrix always converges.
    I dont think you understood the algorithm right, cause the only reason a system of equations wouldnt converge is when they arent any links on a website.
    I just gave you an example where a system would not converge.
    Of course, it is possible to re-arrange equations so that they do converge, but written the way I wrote them, it is a possible system, and one that would not converge.

  19. #19
    Join Date
    Feb 2007
    Location
    Switzerland
    Posts
    583
    Mentioned
    1 Post(s)
    Quoted
    50 Post(s)

    Default

    This thread isnt about system of equations in general is it?
    I think it is pretty obbious that not all system of equation will converge or have real solution.
    I will complete the tutorial this weekend as i have enough time. Then I will come to the point of non-convergent equations made with the pagerank algorithm.

    Matessim: if you want i could make a tut about it. When I am at home I will look at my bookmarks.

    Joebot: Thank you

  20. #20
    Join Date
    Dec 2011
    Location
    Belgium
    Posts
    623
    Mentioned
    0 Post(s)
    Quoted
    3 Post(s)

    Default

    very nice guide!

    Made by P1ng

  21. #21
    Join Date
    Feb 2012
    Posts
    168
    Mentioned
    0 Post(s)
    Quoted
    0 Post(s)

    Default

    Quote Originally Posted by Gala View Post
    This thread isnt about system of equations in general is it?
    I think it is pretty obbious that not all system of equation will converge or have real solution.
    I will complete the tutorial this weekend as i have enough time. Then I will come to the point of non-convergent equations made with the pagerank algorithm.

    Matessim: if you want i could make a tut about it. When I am at home I will look at my bookmarks.

    Joebot: Thank you
    Fair enough, okay

  22. #22
    Join Date
    Mar 2012
    Location
    127.0.0.1
    Posts
    1,199
    Mentioned
    0 Post(s)
    Quoted
    26 Post(s)

    Default

    The eigenvalue problem

    So, we have the problem now. We want to solve a system of equations with billions of variables. There are several ways to do this, but many people like to use matrices. This is the way how Google does it and that's why I want to show you this method.

    First of all you need to know matrices. Or you have to know at least what matrices are. You can give it a try and Google it by yourselves to get some basic knowledge about matrices. If you wish, I can make a basic tutorial about matrices. Just post a suggestion
    Small grammatical error in here; you wrote, "This is the way how Google does it and that's why I want to show you this method."

  23. #23
    Join Date
    Feb 2007
    Location
    Switzerland
    Posts
    583
    Mentioned
    1 Post(s)
    Quoted
    50 Post(s)

    Default

    @ Hazzah

    fixed, thank you!

  24. #24
    Join Date
    Mar 2012
    Location
    127.0.0.1
    Posts
    1,199
    Mentioned
    0 Post(s)
    Quoted
    26 Post(s)

    Default

    I keep coming back to this thread waiting for another portion of it to be released!

  25. #25
    Join Date
    Feb 2007
    Location
    Switzerland
    Posts
    583
    Mentioned
    1 Post(s)
    Quoted
    50 Post(s)

    Default

    I spend a lot of my free time in scripting now, but you will see the updates if you subscribed the thread

Page 1 of 2 12 LastLast

Thread Information

Users Browsing this Thread

There are currently 2 users browsing this thread. (0 members and 2 guests)

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •