Link Analysis Algorithms (PageRank & HITS) (Lesson)

0. Introduction: The Evolution of Web Search

In the early days of the internet, search engines (like AltaVista or Lycos) relied primarily on Term Frequency: if you searched for "movie," they looked for pages where the word "movie" appeared the most.

The Problem: Term Spam

Spammers quickly learned to exploit this by:

  • Adding the word "movie" thousands of times in white text on a white background.
  • Copying high-quality content from other pages just to rank higher.

This rendered early search engines almost useless, as the "most relevant" results were often just spam.

The Solution: Link Analysis

Google's innovation (PageRank) changed the game. Instead of just looking at what a page says about itself, search engines began looking at what the rest of the web says about that page via links.


1. The Intuition: Why Link Analysis Works

The "Massive Datasets" literature suggests three main reasons why link-based importance is a reliable signal:

  1. Voting with Feet: A link from Page A to Page B is a signal that the creator of A thinks B is useful. It is a "vote of confidence."
  2. The Random Surfer Model: Imagine a user browsing the web by clicking links at random. Pages where the surfer spends the most time are naturally more important.
  3. Hard to Fool: While it's easy to add 1,000 keywords to your own page, it's very difficult to convince 1,000 other reputable people to link to you.

2. The PageRank Algorithm (The Google Way)

2.1 The Concept of "Taxation"

As we discussed, a simple link-counting system can be gamed by Spider Traps (groups of pages whose links all point within the group, trapping the surfer's rank inside) or Dead Ends (pages with no outgoing links, through which rank leaks out of the system).

To solve this, we introduce Taxation (Teleportation), controlled by a damping factor d (typically 0.85):

  • With probability d (85% of the time), the surfer follows a link.
  • With probability 1 - d (15% of the time), the surfer gets bored and teleports to a random page.
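The taxed surfer can be simulated directly. Below is a minimal sketch; the toy graph, function name, and step count are my own choices for illustration:

```python
import random

def simulate_surfer(links, steps=100_000, d=0.85, seed=0):
    """Estimate page importance by counting visits of a random surfer
    who follows an out-link with probability d and teleports otherwise."""
    rng = random.Random(seed)
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < d and links[page]:
            page = rng.choice(links[page])  # follow a random out-link
        else:
            page = rng.choice(pages)        # bored: teleport anywhere
    return {p: v / steps for p, v in visits.items()}

# Hypothetical 4-page web:
web = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A", "B"], "D": ["C", "A"]}
freq = simulate_surfer(web)
```

The visit frequencies approximate PageRank: pages with more (and better-connected) in-links accumulate more visits.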

2.2 Mathematical Formula

PR(u) = \frac{1-d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}

where d is the damping factor, N is the total number of pages, B_u is the set of pages linking to u, and L(v) is the number of out-links of page v.
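Read as code, the formula becomes a power-iteration loop. A minimal sketch, with function and variable names of my own:

```python
def pagerank(links, d=0.85, iters=50):
    """Iterate PR(u) = (1-d)/N + d * sum over v in B_u of PR(v)/L(v),
    where B_u = pages linking to u and L(v) = out-degree of v."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    # Precompute B_u, the set of in-links for each page.
    in_links = {p: [v for v in pages if p in links[v]] for p in pages}
    for _ in range(iters):
        # Synchronous update: every new score uses the previous scores.
        pr = {u: (1 - d) / n
                 + d * sum(pr[v] / len(links[v]) for v in in_links[u])
              for u in pages}
    return pr

web = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A", "B"], "D": ["C", "A"]}
ranks = pagerank(web)
```

Because every page here has out-links (no dead ends), the scores keep summing to 1 at every iteration.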


3. Numerical Example: PageRank Step-by-Step

The Scenario: A small 4-page web. Graph: A → B, C | B → C, D | C → A, B | D → C, A. Params: N = 4, d = 0.8. Initial PR = 0.25 for all.

Calculation (Iteration 1): The teleport term is (1-d)/N = 0.2/4 = 0.05 for every page.

  1. For A: Receives from C (outdegree 2) and D (outdegree 2).
    • PR(A) = 0.05 + 0.8(0.25/2 + 0.25/2) = 0.25
  2. For B: Receives from A (outdegree 2) and C (outdegree 2).
    • PR(B) = 0.05 + 0.8(0.25/2 + 0.25/2) = 0.25
  3. For C: Receives from A (2), B (2), and D (2).
    • PR(C) = 0.05 + 0.8(0.25/2 + 0.25/2 + 0.25/2) = 0.35
  4. For D: Receives from B (2).
    • PR(D) = 0.05 + 0.8(0.25/2) = 0.15
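The iteration-1 numbers can be checked in a few lines (the graph is the one from the example; variable names are mine):

```python
d, n = 0.8, 4
links = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A", "B"], "D": ["C", "A"]}
pr = {p: 0.25 for p in links}  # initial PageRank for all pages

# One synchronous update: every page's new score uses the old scores.
new_pr = {u: (1 - d) / n
             + d * sum(pr[v] / len(links[v]) for v in links if u in links[v])
          for u in links}
```

This reproduces the hand calculation: A = 0.25, B = 0.25, C = 0.35, D = 0.15.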

4. The HITS Algorithm (Hubs & Authorities)

HITS (by Jon Kleinberg) argues that importance comes in two flavors.

  • Authorities: Pages we want to read (e.g., Wikipedia).
  • Hubs: Pages that tell us where to find authorities (e.g., a curated list of links).

4.1 Numerical Example: HITS Step-by-Step

Graph remains the same. Initial scores a = 1, h = 1 for every page. Update rules: a(p) is the sum of the hub scores of pages linking to p; h(p) is the sum of the authority scores of pages that p links to.

Iteration 1: Authorities (a)

  • a(A) = h(C) + h(D) = 1 + 1 = 2
  • a(B) = h(A) + h(C) = 1 + 1 = 2
  • a(C) = h(A) + h(B) + h(D) = 1 + 1 + 1 = 3
  • a(D) = h(B) = 1
  • Result vector a: [2, 2, 3, 1]

Iteration 1: Hubs (h) (using the new values of a)

  • h(A) = a(B) + a(C) = 2 + 3 = 5
  • h(B) = a(C) + a(D) = 3 + 1 = 4
  • h(C) = a(A) + a(B) = 2 + 2 = 4
  • h(D) = a(C) + a(A) = 3 + 2 = 5
  • Result vector h: [5, 4, 4, 5]
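The two sweeps can be reproduced in a few lines. To match the hand calculation, hubs are computed from the freshly updated authority scores, and no normalization is applied (a full HITS implementation would normalize each vector between iterations):

```python
links = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A", "B"], "D": ["C", "A"]}
h = {p: 1 for p in links}  # initial hub scores

# Authority sweep: a(p) = sum of h(v) over pages v that link to p.
a = {p: sum(h[v] for v in links if p in links[v]) for p in links}
# Hub sweep: h(p) = sum of a(u) over pages u that p links to.
h = {p: sum(a[u] for u in links[p]) for p in links}
```

This yields a = [2, 2, 3, 1] and h = [5, 4, 4, 5] over pages A through D, as in the worked example.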

5. Global Web Structure: The "Bowtie" Model

Research shows the actual web isn't one uniformly connected mass; its large-scale structure looks like a bowtie.

  • SCC (Strongly Connected Component): The core, where every page can reach every other page by following links.
  • In-Component: Pages that can reach the core but cannot be reached from it (e.g., new pages linking in).
  • Out-Component: Pages reachable from the core that do not link back to it.
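Given the full link graph and one known core page, the three regions can be separated with two reachability searches, one over the links and one over their reverse. A sketch (the graph and function names here are hypothetical):

```python
from collections import deque

def reachable(links, start):
    """All nodes reachable from start by BFS over out-links."""
    seen, queue = {start}, deque([start])
    while queue:
        for v in links.get(queue.popleft(), []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bowtie(links, core_page):
    """Split pages into SCC / In / Out relative to core_page's component."""
    reverse = {p: [] for p in links}
    for u, outs in links.items():
        for v in outs:
            reverse.setdefault(v, []).append(u)
    fwd = reachable(links, core_page)     # pages the core can reach
    bwd = reachable(reverse, core_page)   # pages that can reach the core
    scc = fwd & bwd                       # mutual reachability = the core SCC
    return {"SCC": scc, "In": bwd - scc, "Out": fwd - scc}

# Tiny example: I links into the core {A, B}; O is only linked out to.
web = {"I": ["A"], "A": ["B", "O"], "B": ["A"], "O": []}
regions = bowtie(web, "A")
```

Pages in neither `fwd` nor `bwd` would fall into the model's remaining regions (tendrils and disconnected pieces), which this sketch leaves out.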

6. Fighting Back: Link Spam & Spam Farms

Spammers build Spam Farms: large collections of artificial pages that all link to a single target page to inflate its PageRank.

Combat Techniques:

  1. TrustRank: Run PageRank with teleportation restricted to a seed set of trusted pages (e.g., .edu and .gov sites).
  2. Spam Mass: Measure the gap between a page's PageRank and its TrustRank; pages whose rank mostly disappears under TrustRank are flagged as suspicious.
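Once both score vectors exist, spam mass is simple arithmetic. A sketch with made-up page names and scores (the 0.9 threshold is illustrative, not from the lesson):

```python
def spam_mass(pagerank, trustrank):
    """Relative spam mass: the fraction of a page's PageRank that is
    NOT explained by its TrustRank (rank flowing from trusted seeds).
    Values near 1 suggest rank earned mostly from untrusted links."""
    return {p: (pagerank[p] - trustrank[p]) / pagerank[p] for p in pagerank}

# Hypothetical scores for two pages:
mass = spam_mass({"good.edu": 0.30, "shady.biz": 0.25},
                 {"good.edu": 0.28, "shady.biz": 0.02})
```

Here `shady.biz` keeps almost none of its rank under TrustRank, so nearly all of its PageRank is unaccounted for by trusted links, which is exactly the signature of a spam farm target.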