Ch3 Boolean And Vector Space Retrieval Model (Lesson)
2. Text Processing and Indexing [5 Hours]
Source Reference: Text Processing and Indexing – Sunil Regmi, Lecturer, DoAI, Kathmandu University, Kavre, Nepal.
2.1 Text Preprocessing
(Tokenization, Stemming, Lemmatization)
Traditional IR Perspective
Traditional Information Retrieval (IR):
- Adopts index terms to index and retrieve documents.
- Assumes semantics of documents and user information needs can be expressed through sets of index terms.
- However, much of the documents' semantics is lost in this simplification.
- A central problem:
Predicting which documents are relevant and which are not.
- This prediction usually depends on a ranking algorithm, where:
- Documents are simply ordered.
- Documents on top are more likely to be relevant.
Different premises lead to different IR models.
The adopted IR model determines how relevance is predicted.
The User Task in IR
In IR:
- Users express their information needs using queries with specific keywords.
- Data retrieval systems use constraint-based expressions (e.g., regular expressions).
- Users search for useful information by executing a retrieval task.
If the user's interest is broad (e.g., "car racing"):
- They may browse instead of searching precisely.
- Browsing is still IR.
- Focus may shift during interaction (e.g., from car racing → tourism).
Logical View of Documents
Documents can be represented as:
- A full set of keywords/index terms
- Full text
For very large collections, we must reduce the set of index terms.
This reduction is achieved through text processing.
Text Processing
Text processing in IR transforms and analyzes textual data to retrieve relevant information.
It is critical for:
- Search engines
- Recommendation systems
- Document retrieval applications
Before indexing or searching, raw text must be cleaned and standardized.
Timeline of Developments (After Luhn, 1958)
- Luhn’s Method (1958)
- Statistical Methods
- TF-IDF (1972)
- LSA (1990)
- Graph-Based Methods
- TextRank (2004)
- LexRank (2004)
- Neural Networks
- RNNs (1986)
- Seq2Seq (2014)
- Attention (2017)
- Transformers
- BERT (2018)
- GPT (2018)
- T5 (2019)
- BART (2020)
- Advanced Transformers
- Longformer (2020)
- BigBird (2020)
- Machine Learning Models
- SVM (1995)
- LDA (2003)
- Multimodal Systems (2022 onward)
Core Preprocessing Steps
- Tokenization
- Lowercasing
- Stopword Removal
- Stemming / Lemmatization
- Punctuation & Special Character Removal
1. Tokenization
Tokenization = splitting text into smaller units.
Types:
- Word tokenization
- Sentence tokenization
- Character tokenization
Word Tokenization Example
text = "Text preprocessing includes tasks like tokenization, stemming, and lemmatization."
Tokens (after lowercasing):
['text', 'preprocessing', 'includes', 'tasks', 'like',
'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']
Sentence Tokenization
Input:
"Tokenization is an important NLP task. It helps break down text into smaller units."
Output:
[
"Tokenization is an important NLP task.",
"It helps break down text into smaller units."
]
Character Tokenization
Input:
"Tokenization"
Output:
["T","o","k","e","n","i","z","a","t","i","o","n"]
Tokenization is essential because inverted indexes rely on tokens.
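The three tokenization levels above can be sketched with Python's standard library alone (a minimal regex-based sketch; production systems typically use libraries such as NLTK or spaCy):

```python
import re

def word_tokenize(text):
    # Lowercase, then keep runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def sent_tokenize(text):
    # Naive sentence splitter: break after ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s]

text = "Tokenization is an important NLP task. It helps break down text into smaller units."
print(sent_tokenize(text))          # sentence tokenization
print(word_tokenize("NLP task."))   # word tokenization
print(list("Tokenization"))         # character tokenization
```

The naive sentence splitter mishandles abbreviations like "e.g."; real tokenizers use trained models for that reason.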
2. Stopword Removal
Stopwords = frequent words with low semantic value.
Examples:
- and
- is
- the
Original tokens:
['text','preprocessing','includes','tasks','like',
'tokenization',',','stemming',',','and','lemmatization','.']
After stopword removal:
['text','preprocessing','includes','tasks','like',
'tokenization',',','stemming',',','lemmatization','.']
This reduces noise and index size.
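The stopword-removal step above is a simple set-membership filter (the stopword list here is a tiny illustrative one; real systems use curated lists of a few hundred words):

```python
STOPWORDS = {"and", "is", "the", "a", "an", "of"}  # tiny illustrative list

tokens = ['text', 'preprocessing', 'includes', 'tasks', 'like',
          'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']

# Keep every token that is not in the stopword set.
filtered = [t for t in tokens if t not in STOPWORDS]
print(filtered)  # only 'and' is removed from this particular list
```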
3. Stemming
Stemming reduces words to their root form by chopping off suffixes.
Example:
running → run
includes → includ
preprocessing → preprocess
Example result:
['text','preprocess','includ','task','like',
'token',',','stem',',','lemmat','.']
Stemming is crude and may produce non-dictionary words.
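A crude suffix-stripping stemmer can be sketched in a few lines (an illustration only; real systems use the Porter or Snowball algorithms, which also handle cases like the consonant doubling in running → run):

```python
def crude_stem(word):
    # Strip the first matching suffix, longest suffixes first,
    # but only if a reasonably long stem remains.
    for suffix in ("ization", "isation", "ing", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ("tokenization", "includes", "preprocessing"):
    print(w, "->", crude_stem(w))
# tokenization -> token, includes -> includ, preprocessing -> preprocess
```

Note that the outputs (e.g., "includ") need not be dictionary words, which is exactly the crudeness the text mentions.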
4. Lemmatization
Lemmatization uses a vocabulary and morphological analysis to return the dictionary form (lemma) of a word.
Example:
better → good
running → run
includes → include
Result:
['text','preprocessing','include','task','like',
'tokenization',',','stemming',',','lemmatization','.']
More accurate than stemming.
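Conceptually, lemmatization is a vocabulary lookup informed by morphology. The hand-built dictionary below is an assumption for illustration; real lemmatizers (e.g., NLTK's WordNetLemmatizer) consult a full lexicon plus part-of-speech tags:

```python
# Tiny illustrative lemma dictionary; a real lemmatizer uses a full
# vocabulary and part-of-speech information instead of a hard-coded map.
LEMMAS = {"better": "good", "running": "run", "includes": "include"}

def lemmatize(word):
    # Fall back to the word itself when no lemma is known.
    return LEMMAS.get(word, word)

print([lemmatize(w) for w in ["better", "running", "includes", "task"]])
# ['good', 'run', 'include', 'task']
```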
2.2 Inverted Index and Other Indexing Structures
Inverted Index
Most common indexing structure (used in Google-like systems).
Maps:
Term → List of documents
Structure:
- Vocabulary
- Posting List
Example Collection
Doc1:
"Information retrieval and indexing are essential."
Doc2:
"Indexing techniques improve retrieval performance."
| Term | Posting List |
|---|---|
| information | [Doc1] |
| retrieval | [Doc1, Doc2] |
| indexing | [Doc1, Doc2] |
| essential | [Doc1] |
| techniques | [Doc2] |
| improve | [Doc2] |
| performance | [Doc2] |
(Stopwords such as "and" and "are" are omitted from the vocabulary.)
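Building a record-level inverted index for this two-document collection takes only a few lines (a sketch; the tiny stopword set is an assumption for illustration):

```python
import re
from collections import defaultdict

docs = {
    "Doc1": "Information retrieval and indexing are essential.",
    "Doc2": "Indexing techniques improve retrieval performance.",
}
STOPWORDS = {"and", "are"}  # tiny illustrative stopword list

index = defaultdict(list)
for doc_id, text in docs.items():
    # dict.fromkeys de-duplicates terms while preserving first-seen order.
    for term in dict.fromkeys(re.findall(r"\w+", text.lower())):
        if term not in STOPWORDS:
            index[term].append(doc_id)

print(index["retrieval"])  # ['Doc1', 'Doc2']
print(index["indexing"])   # ['Doc1', 'Doc2']
print(index["essential"])  # ['Doc1']
```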
Types of Inverted Index
1. Record-Level Inverted Index
Stores:
term → [Doc1, Doc2, Doc3]
2. Word-Level (Positional) Inverted Index
Stores:
term → (DocID : frequency : [positions])
Example:
1:2:[2,6]
Meaning:
- Document 1
- Term frequency = 2
- Positions = 2 and 6
More powerful but requires more space.
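A word-level index can be sketched as a nested mapping; the toy document below is an assumption chosen so that the term "be" reproduces the 1:2:[2,6] pattern above (document 1, frequency 2, positions 2 and 6):

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    # term -> {doc_id: [positions]}; term frequency is len(positions).
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(re.findall(r"\w+", text.lower()), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

idx = build_positional_index({1: "to be or not to be"})
print(idx["be"])  # {1: [2, 6]} -> doc 1, frequency 2, positions 2 and 6
```

Storing positions is what enables phrase and proximity queries, at the cost of the extra space noted above.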
Query Execution Process
- Tokenize query
- Lookup each term
- Retrieve posting lists
- Merge lists
- Compute relevance score
Relevance may consider:
- Term Frequency (TF)
- Document Length
- Proximity
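The merge step in the process above can be sketched as the classic two-pointer intersection of sorted posting lists (for an AND query):

```python
def intersect(p1, p2):
    # Walk both sorted posting lists in lockstep; O(len(p1) + len(p2)).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31], [2, 31, 54]))  # [2, 31]
```

Sorted document IDs are what make this linear-time merge possible, which is also why posting lists are kept sorted for the gap encoding discussed later.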
Time and Space Complexity
Search Time:
- Best case: O(1)
- Worst case: O(N)
Space Complexity:
O(T + D + P)
Where:
- T = number of unique terms
- D = number of documents
- P = number of postings
Other Indexing Structures
1. Suffix Tree
Used for substring search.
Time Complexity:
O(m)
m = pattern length
Space Complexity:
O(n)
n = length of input string
Very fast search, but space heavy.
2. Suffix Array
Sorted array of all suffixes.
More space-efficient than suffix trees.
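A suffix array can be built naively by sorting suffix start positions (a sketch; practical constructions use O(n log n) or linear-time algorithms):

```python
def suffix_array(s):
    # Sort all suffix start indices by the suffix they begin.
    # Naive construction: O(n^2 log n) because of string comparisons.
    return sorted(range(len(s)), key=lambda i: s[i:])

sa = suffix_array("banana")
print(sa)  # [5, 3, 1, 0, 4, 2]
# i.e., "a" < "ana" < "anana" < "banana" < "na" < "nana"
```

Substring search then becomes binary search over the sorted suffixes, which is where the space savings over suffix trees come from.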
3. B-Trees
Balanced tree structure.
Used in database indexing.
Search time:
O(log n)
Efficient for range queries.
2.3 Compression Techniques for IR
Why compression?
Inverted indexes are huge.
We want to:
- Reduce storage
- Improve cache performance
- Speed up retrieval
1. Dictionary Compression
Techniques:
- Blocking
- Front Coding
Front Coding Example
Sorted terms:
compression
compute
computer
computing
Common prefix: "comp"
Compressed representation:
comp|ression
|ute
|uter
|uting
This reduces redundancy.
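A common variant of front coding stores, for each term, the length of the prefix it shares with the previous term plus the remaining suffix (slightly more general than the single shared block prefix shown above):

```python
def front_code(sorted_terms):
    # Encode each term as (shared-prefix length, suffix) relative to
    # the previous term in sorted order.
    out, prev = [], ""
    for term in sorted_terms:
        k = 0
        while k < min(len(prev), len(term)) and prev[k] == term[k]:
            k += 1
        out.append((k, term[k:]))
        prev = term
    return out

print(front_code(["compression", "compute", "computer", "computing"]))
# [(0, 'compression'), (4, 'ute'), (7, 'r'), (6, 'ing')]
```

Because sorted dictionaries place similar terms next to each other, most entries shrink to a short suffix.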
2. Posting List Compression
Posting lists contain sorted document IDs:
Example:
[3, 10, 15, 21, 30]
Instead of storing raw numbers, we store gaps.
Gap Encoding
Original:
3, 10, 15, 21, 30
Gaps:
3,
10 - 3 = 7,
15 - 10 = 5,
21 - 15 = 6,
30 - 21 = 9
Compressed form:
[3, 7, 5, 6, 9]
Smaller numbers → fewer bits required.
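Gap encoding and its inverse are one-liners over the sorted ID list:

```python
def gap_encode(doc_ids):
    # First ID stays as-is; every later entry becomes a delta.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps):
    # Running sum restores the original document IDs.
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

print(gap_encode([3, 10, 15, 21, 30]))  # [3, 7, 5, 6, 9]
print(gap_decode([3, 7, 5, 6, 9]))      # [3, 10, 15, 21, 30]
```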
3. Variable Byte Encoding
Each number encoded in variable-length bytes.
Example:
Suppose we encode gap 7.
Binary:
7 = 00000111
Add continuation bit.
If single byte, prefix with 1:
10000111
For larger numbers:
Example: 130
Binary:
130 = 10000010
Split into 7-bit chunks:
0000001 0000010
Add continuation bits (0 = more bytes follow, 1 = last byte):
0000001 → 00000001
0000010 → 10000010
So 130 is encoded as the two bytes 00000001 10000010.
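The byte layout above (7 payload bits per byte, high bit set only on the final byte of each number) can be sketched as:

```python
def vb_encode(n):
    # Split n into base-128 digits, most significant first.
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # set the high bit on the LAST byte only
    return out

def vb_decode(stream):
    nums, n = [], 0
    for b in stream:
        if b < 128:          # continuation byte: accumulate
            n = n * 128 + b
        else:                # final byte: emit the number
            nums.append(n * 128 + (b - 128))
            n = 0
    return nums

print([f"{b:08b}" for b in vb_encode(7)])    # ['10000111']
print([f"{b:08b}" for b in vb_encode(130)])  # ['00000001', '10000010']
print(vb_decode(vb_encode(7) + vb_encode(130)))  # [7, 130]
```

Because each byte is self-delimiting, a decoder can walk a stream of concatenated codes without any separators.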
4. Gamma Encoding (Elias Gamma Code)
Used for positive integers.
Steps:
- Write number in binary
- Let length = L
- Write L-1 zeros
- Append binary representation
Example: Encode 9
Binary:
9 = 1001
Length L = 4
Step 1: L-1 zeros
000
Step 2: Append binary
1001
Gamma code:
0001001
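The steps above map directly to code (representing the bit string as a Python string for readability; a real implementation would pack bits):

```python
def gamma_encode(n):
    # Elias gamma: (L - 1) zeros, then the L-bit binary form of n.
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def gamma_decode(code):
    # Count leading zeros to learn L - 1, then read the next L bits.
    zeros = len(code) - len(code.lstrip("0"))
    return int(code[zeros:2 * zeros + 1], 2)

print(gamma_encode(9))          # 0001001
print(gamma_decode("0001001"))  # 9
```

Like variable byte codes, gamma codes are self-delimiting, so consecutive gaps can be concatenated into one bit stream.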
Storage Comparison Example
Suppose we store document IDs:
[1000, 1005, 1010]
Raw 32-bit storage:
3 × 32 = 96 bits
Gap encoding:
[1000, 5, 5]
Binary size:
1000 → 10 bits
5 → 3 bits
5 → 3 bits
Total:
10 + 3 + 3 = 16 bits
Massive reduction.
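The arithmetic above can be checked with Python's `int.bit_length` (note the 16-bit figure counts raw binary digits only; a self-delimiting code such as gamma would add some overhead on top):

```python
ids = [1000, 1005, 1010]
gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

raw_bits = 32 * len(ids)                      # fixed 32-bit integers
gap_bits = sum(g.bit_length() for g in gaps)  # minimal binary digits per gap

print(gaps)                # [1000, 5, 5]
print(raw_bits, gap_bits)  # 96 16
```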
Summary
Text Processing:
- Tokenization
- Stopword removal
- Stemming
- Lemmatization
Indexing:
- Inverted Index
- Record-level vs Word-level
- Suffix trees
- B-trees
Compression:
- Gap encoding
- Variable byte encoding
- Gamma encoding
- Front coding
Without preprocessing and compression, large-scale IR would collapse under its own weight.
Efficient indexing and compression transform chaos into searchable structure.