Tag Archives: Bing

A little something about search engines.

Imagine this, you are sitting at your computer and type in a search engine "Microsoft", and the first result is filled with website's that people created who worked at Microsoft, instead of the homepage of Microsoft. Or when you type in "Egypt", but instead of information about the country, the first page is filled with people who talked about Egypt on their website?
When this would happen, the search engine would not be really popular, as you would never find what you needed in an easy way. When you type the above mentioned examples, google and bing, or other search engines, do return you links that you expect. All of this is relying on a concept introduced back in 1998 by google, called PageRank.
(PageRank looks like a play on word with the name of google co-founder Larry Page, yet in their historical publication it's used with the meaning I'm using here)

Let's think for a moment of how this could work.
One of the ideas could be, ranking a page by importance depending on how many other website's refer to it. However, this is not a very good idea. When a website like www.cnn.be refers to, for example, www.gumclan.org, it is way more important than when, let's say, 10 people their own little websites that they put up for university refer to it.

A better idea would be to give websites a rank. Depending on their rank, their referral to another website will be more important. Thus giving the website they link to a higher rank as well, and so forth.
Yet using this later definition, we encounter a first problem. This causes self-referential. To know the position of a page, we need to know all the pages that refer to this page. But to know how much influence the referring page has, we need to know the rank of that page, which we know by the pages that refer to it, who's rank we …. etc.
If however, this function would be linear, this problem would be an easy one to solve with linear algebra.

A simple model
To put our idea to work, we first have to define some variables. A page p we will call Np the amount of links on this page. R1(p) will be the rank of p, and with \displaystyle\sum\limits_{p->q} . we mean the sum over all pages q that contain a link to p. (note that Nq \neq 0 for q).

A simple example is the following situation, with just three pages. Named respectively, p, q and r. The arrows indicate the links between pages.

The equation that belong to these are
R1(p) = \frac{1}{2} R1(q) + R1(r)
R1(q) = R1(p)
R1(r) = \frac{1}{2} R1(q)

If we add one more condition to this, namely that the total rank has to be equal to 1, so with other words R1(p) + R1(q) + R1(r) = 1, we find that R1(p) = R1(q) = 0.4 and R1(r) = 0.2.
Another way to look at these results is by putting them in a table, and showing each step of the summation.
Following this way, we get the following table..

A second model.

In our former explenation, the definition of R1 still has some "bugs". One of them has an easy solution. The bug we have is that some pages can essentially make ranks dissapear. We can see how this happens when a page does not have any links on it. Or, in other terms, there are no arrows leaving the webpage.

R1(p) = \frac{1}{2}R1(q)
R1(q) = \frac{1}{2}R1(p)
R1(r) = \frac{1}{2}R1(p) + \frac{1}{2}R1(q))

If we solve this system, we get R1(p) = R1(q) = R1(r) = 0. Which is obviously rather useless.
Though, we can solve this problem quite easily. What we have to do, is define a new rank. R2.

 c * R2(p) = \sum\limits_{q-p} \frac{R2(q)}{Nq}

With the constant c \in R+

The real pagerank!

Even our second model still has some bugs in it. And they aren't far-fetched. If you look ath the following image

We can see that pages r and s keep stocking up rank. We call this a "rank sink" problem.

It's not hard to check that c = 1, R2(p) = R2(q) = 0 and R2(r) = R2(s) = 0.5 is a possible PageRank according to equation c * R2(p) = \sum\limits_{q-p} \frac{R2(q)}{Nq}

This is deffinently a problem. The solution to this rank sink problem is what they call a rank source.
By creating a rank source, we give each page a initial value. E(p), which makes the equation look like:

c * R(p) = \sum\limits_{q-p} \frac{R(q)}{Nq} + E(p)

This deals with our former stated problem.

The general idea of a search engine is as I've explained it in this blog. But an actual search engine adds a lot of extra parameters to the pagerank system. For example searches in a certain language can get priority, also your location can affect the search. Anyway, I hope this atleast gives you an idea of how the linear algebra that led up to the creation of google looks like.