Ch3 Boolean And Vector Space Retrieval Model (Lesson)
2. Text Processing and Indexing [5 Hours]
Source Reference: Text Processing and Indexing – Sunil Regmi, Lecturer, DoAI, Kathmandu University, Kavre, Nepal.
2.1 Text Preprocessing
(Tokenization, Stemming, Lemmatization)
Traditional IR Perspective
Traditional Information Retrieval (IR):
- Adopts index terms to index and retrieve documents.
- Assumes semantics of documents and user information needs can be expressed through sets of index terms.
- However, much of the documents' semantics is lost in this simplification.
- A central problem:
Predicting which documents are relevant and which are not.
- This prediction usually depends on a ranking algorithm, where:
- Documents are simply ordered.
- Documents on top are more likely to be relevant.
Different premises lead to different IR models.
The adopted IR model determines how relevance is predicted.
The User Task in IR
In IR:
- Users express their information needs using queries with specific keywords.
- Data retrieval systems use constraint-based expressions (e.g., regular expressions).
- Users search for useful information by executing a retrieval task.
If the user's interest is broad (e.g., "car racing"):
- They may browse instead of searching precisely.
- Browsing is still IR.
- Focus may shift during interaction (e.g., from car racing → tourism).
Logical View of Documents
Documents can be represented as:
- A full set of keywords/index terms
- Full text
For very large collections, we must reduce the set of index terms.
This reduction is achieved through text processing.
Text Processing
Text processing in IR transforms and analyzes textual data to retrieve relevant information.
It is critical for:
- Search engines
- Recommendation systems
- Document retrieval applications
Before indexing or searching, raw text must be cleaned and standardized.
Timeline of Developments (After Luhn, 1958)
- Luhn’s Method (1958)
- Statistical Methods
- TF-IDF (1972)
- LSA (1990)
- Graph-Based Methods
- TextRank (2004)
- LexRank (2004)
- Neural Networks
- RNNs (1986)
- Seq2Seq (2014)
- Attention (2017)
- Transformers
- BERT (2018)
- GPT (2018)
- T5 (2019)
- BART (2020)
- Advanced Transformers
- Longformer (2020)
- BigBird (2020)
- Machine Learning Models
- SVM (1995)
- LDA (2003)
- Multimodal Systems (2022 onward)
Core Preprocessing Steps
- Tokenization
- Lowercasing
- Stopword Removal
- Stemming / Lemmatization
- Punctuation & Special Character Removal
1. Tokenization
Tokenization = splitting text into smaller units.
Types:
- Word tokenization
- Sentence tokenization
- Character tokenization
Word Tokenization Example
text = "Text preprocessing includes tasks like tokenization, stemming, and lemmatization."
Tokens (after lowercasing):
['text', 'preprocessing', 'includes', 'tasks', 'like',
'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']
Sentence Tokenization
Input:
"Tokenization is an important NLP task. It helps break down text into smaller units."
Output:
[
"Tokenization is an important NLP task.",
"It helps break down text into smaller units."
]
Character Tokenization
Input:
"Tokenization"
Output:
["T","o","k","e","n","i","z","a","t","i","o","n"]
Tokenization is essential because inverted indexes rely on tokens.
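The three tokenization levels above can be sketched with Python's standard library alone (a minimal regex-based sketch; production systems typically use libraries such as NLTK or spaCy):

```python
import re

def word_tokenize(text):
    # Lowercase, then keep runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def sent_tokenize(text):
    # Naive sentence splitter: break after ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s]

text = "Tokenization is an important NLP task. It helps break down text into smaller units."
print(sent_tokenize(text))          # sentence tokenization
print(word_tokenize("NLP task."))   # word tokenization
print(list("Tokenization"))         # character tokenization
```

The naive sentence splitter mishandles abbreviations like "e.g."; real tokenizers use trained models for that reason.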
2. Stopword Removal
Stopwords = frequent words with low semantic value.
Examples:
- and
- is
- the
Original tokens:
['text','preprocessing','includes','tasks','like',
'tokenization',',','stemming',',','and','lemmatization','.']
After stopword removal:
['text','preprocessing','includes','tasks','like',
'tokenization',',','stemming',',','lemmatization','.']
This reduces noise and index size.
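The stopword-removal step above is a simple set-membership filter (the stopword list here is a tiny illustrative one; real systems use curated lists of a few hundred words):

```python
STOPWORDS = {"and", "is", "the", "a", "an", "of"}  # tiny illustrative list

tokens = ['text', 'preprocessing', 'includes', 'tasks', 'like',
          'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']

# Keep every token that is not in the stopword set.
filtered = [t for t in tokens if t not in STOPWORDS]
print(filtered)  # only 'and' is removed from this particular list
```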
3. Stemming
Stemming reduces words to their root form by chopping off suffixes.
Example:
running → run
includes → includ
preprocessing → preprocess
Example result:
['text','preprocess','includ','task','like',
'token',',','stem',',','lemmat','.']
Stemming is crude and may produce non-dictionary words.
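A crude suffix-stripping stemmer can be sketched in a few lines (an illustration only; real systems use the Porter or Snowball algorithms, which also handle cases like the consonant doubling in running → run):

```python
def crude_stem(word):
    # Strip the first matching suffix, longest suffixes first,
    # but only if a reasonably long stem remains.
    for suffix in ("ization", "isation", "ing", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ("tokenization", "includes", "preprocessing"):
    print(w, "->", crude_stem(w))
# tokenization -> token, includes -> includ, preprocessing -> preprocess
```

Note that the outputs (e.g., "includ") need not be dictionary words, which is exactly the crudeness the text mentions.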
4. Lemmatization
Lemmatization uses a vocabulary and morphological analysis to return the dictionary form (lemma) of a word.
Example:
better → good
running → run
includes → include
Result:
['text','preprocessing','include','task','like',
'tokenization',',','stemming',',','lemmatization','.']
More accurate than stemming.
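Conceptually, lemmatization is a vocabulary lookup informed by morphology. The hand-built dictionary below is an assumption for illustration; real lemmatizers (e.g., NLTK's WordNetLemmatizer) consult a full lexicon plus part-of-speech tags:

```python
# Tiny illustrative lemma dictionary; a real lemmatizer uses a full
# vocabulary and part-of-speech information instead of a hard-coded map.
LEMMAS = {"better": "good", "running": "run", "includes": "include"}

def lemmatize(word):
    # Fall back to the word itself when no lemma is known.
    return LEMMAS.get(word, word)

print([lemmatize(w) for w in ["better", "running", "includes", "task"]])
# ['good', 'run', 'include', 'task']
```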
2.2 Inverted Index and Other Indexing Structures
Inverted Index
Most common indexing structure (used in Google-like systems).
Maps:
Term → List of documents
Structure:
- Vocabulary
- Posting List
Example Collection
Doc1:
"Information retrieval and indexing are essential."
Doc2:
"Indexing techniques improve retrieval performance."
| Term | Posting List |
|---|---|
| information | [Doc1] |
| retrieval | [Doc1, Doc2] |
| indexing | [Doc1, Doc2] |
| essential | [Doc1] |
| techniques | [Doc2] |
| improve | [Doc2] |
| performance | [Doc2] |
(Stopwords such as "and" and "are" are omitted from the vocabulary.)
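Building a record-level inverted index for this two-document collection takes only a few lines (a sketch; the tiny stopword set is an assumption for illustration):

```python
import re
from collections import defaultdict

docs = {
    "Doc1": "Information retrieval and indexing are essential.",
    "Doc2": "Indexing techniques improve retrieval performance.",
}
STOPWORDS = {"and", "are"}  # tiny illustrative stopword list

index = defaultdict(list)
for doc_id, text in docs.items():
    # dict.fromkeys de-duplicates terms while preserving first-seen order.
    for term in dict.fromkeys(re.findall(r"\w+", text.lower())):
        if term not in STOPWORDS:
            index[term].append(doc_id)

print(index["retrieval"])  # ['Doc1', 'Doc2']
print(index["indexing"])   # ['Doc1', 'Doc2']
print(index["essential"])  # ['Doc1']
```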
Types of Inverted Index
1. Record-Level Inverted Index
Stores:
term → [Doc1, Doc2, Doc3]
2. Word-Level (Positional) Inverted Index
Stores:
term → (DocID : frequency : [positions])
Example:
1:2:[2,6]
Meaning:
- Document 1
- Term frequency = 2
- Positions = 2 and 6
More powerful but requires more space.
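A word-level index can be sketched as a nested mapping; the toy document below is an assumption chosen so that the term "be" reproduces the 1:2:[2,6] pattern above (document 1, frequency 2, positions 2 and 6):

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    # term -> {doc_id: [positions]}; term frequency is len(positions).
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(re.findall(r"\w+", text.lower()), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

idx = build_positional_index({1: "to be or not to be"})
print(idx["be"])  # {1: [2, 6]} -> doc 1, frequency 2, positions 2 and 6
```

Storing positions is what enables phrase and proximity queries, at the cost of the extra space noted above.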
Query Execution Process
- Tokenize query
- Lookup each term
- Retrieve posting lists
- Merge lists
- Compute relevance score
Relevance may consider:
- Term Frequency (TF)
- Document Length
- Proximity
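The merge step in the process above can be sketched as the classic two-pointer intersection of sorted posting lists (for an AND query):

```python
def intersect(p1, p2):
    # Walk both sorted posting lists in lockstep; O(len(p1) + len(p2)).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31], [2, 31, 54]))  # [2, 31]
```

Sorted document IDs are what make this linear-time merge possible, which is also why posting lists are kept sorted for the gap encoding discussed later.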
Time and Space Complexity
Search Time:
- Best case: O(1)
- Worst case: O(N)
Space Complexity:
O(T + D + P)
Where:
- T = number of unique terms
- D = number of documents
- P = number of postings
Other Indexing Structures
1. Suffix Tree
Used for substring search.
Time Complexity:
O(m)
m = pattern length
Space Complexity:
O(n)
n = length of input string
Very fast search, but space heavy.
2. Suffix Array
Sorted array of all suffixes.
More space-efficient than suffix trees.
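A suffix array can be built naively by sorting suffix start positions (a sketch; practical constructions use O(n log n) or linear-time algorithms):

```python
def suffix_array(s):
    # Sort all suffix start indices by the suffix they begin.
    # Naive construction: O(n^2 log n) because of string comparisons.
    return sorted(range(len(s)), key=lambda i: s[i:])

sa = suffix_array("banana")
print(sa)  # [5, 3, 1, 0, 4, 2]
# i.e., "a" < "ana" < "anana" < "banana" < "na" < "nana"
```

Substring search then becomes binary search over the sorted suffixes, which is where the space savings over suffix trees come from.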
3. B-Trees
Balanced tree structure.
Used in database indexing.
Search time:
O(log n)
Efficient for range queries.
2.3 Compression Techniques for IR
Why compression?
Inverted indexes are huge.
We want to:
- Reduce storage
- Improve cache performance
- Speed up retrieval
1. Dictionary Compression
Techniques:
- Blocking
- Front Coding
Front Coding Example
Sorted terms:
compression
compute
computer
computing
Common prefix: "comp"
Compressed representation:
comp|ression
|ute
|uter
|uting
This reduces redundancy.
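A common variant of front coding stores, for each term, the length of the prefix it shares with the previous term plus the remaining suffix (slightly more general than the single shared block prefix shown above):

```python
def front_code(sorted_terms):
    # Encode each term as (shared-prefix length, suffix) relative to
    # the previous term in sorted order.
    out, prev = [], ""
    for term in sorted_terms:
        k = 0
        while k < min(len(prev), len(term)) and prev[k] == term[k]:
            k += 1
        out.append((k, term[k:]))
        prev = term
    return out

print(front_code(["compression", "compute", "computer", "computing"]))
# [(0, 'compression'), (4, 'ute'), (7, 'r'), (6, 'ing')]
```

Because sorted dictionaries place similar terms next to each other, most entries shrink to a short suffix.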
2. Posting List Compression
Posting lists contain sorted document IDs:
Example:
[3, 10, 15, 21, 30]
Instead of storing raw numbers, we store gaps.
Gap Encoding
Original:
3, 10, 15, 21, 30
Gaps:
3,
10 - 3 = 7,
15 - 10 = 5,
21 - 15 = 6,
30 - 21 = 9
Compressed form:
[3, 7, 5, 6, 9]
Smaller numbers → fewer bits required.
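Gap encoding and its inverse are one-liners over the sorted ID list:

```python
def gap_encode(doc_ids):
    # First ID stays as-is; every later entry becomes a delta.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps):
    # Running sum restores the original document IDs.
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

print(gap_encode([3, 10, 15, 21, 30]))  # [3, 7, 5, 6, 9]
print(gap_decode([3, 7, 5, 6, 9]))      # [3, 10, 15, 21, 30]
```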
3. Variable Byte Encoding
Each number encoded in variable-length bytes.
Example:
Suppose we encode gap 7.
Binary:
7 = 00000111
Add continuation bit.
If single byte, prefix with 1:
10000111
For larger numbers:
Example: 130
Binary:
130 = 10000010
Split into 7-bit chunks:
0000001 0000010
Add continuation bits (0 = more bytes follow, 1 = last byte):
0000001 → 00000001
0000010 → 10000010
So 130 is encoded as the two bytes 00000001 10000010.
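The byte layout above (7 payload bits per byte, high bit set only on the final byte of each number) can be sketched as:

```python
def vb_encode(n):
    # Split n into base-128 digits, most significant first.
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # set the high bit on the LAST byte only
    return out

def vb_decode(stream):
    nums, n = [], 0
    for b in stream:
        if b < 128:          # continuation byte: accumulate
            n = n * 128 + b
        else:                # final byte: emit the number
            nums.append(n * 128 + (b - 128))
            n = 0
    return nums

print([f"{b:08b}" for b in vb_encode(7)])    # ['10000111']
print([f"{b:08b}" for b in vb_encode(130)])  # ['00000001', '10000010']
print(vb_decode(vb_encode(7) + vb_encode(130)))  # [7, 130]
```

Because each byte is self-delimiting, a decoder can walk a stream of concatenated codes without any separators.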
4. Gamma Encoding (Elias Gamma Code)
Used for positive integers.
Steps:
- Write number in binary
- Let length = L
- Write L-1 zeros
- Append binary representation
Example: Encode 9
Binary:
9 = 1001
Length L = 4
Step 1: L-1 zeros
000
Step 2: Append binary
1001
Gamma code:
0001001
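The steps above map directly to code (representing the bit string as a Python string for readability; a real implementation would pack bits):

```python
def gamma_encode(n):
    # Elias gamma: (L - 1) zeros, then the L-bit binary form of n.
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def gamma_decode(code):
    # Count leading zeros to learn L - 1, then read the next L bits.
    zeros = len(code) - len(code.lstrip("0"))
    return int(code[zeros:2 * zeros + 1], 2)

print(gamma_encode(9))          # 0001001
print(gamma_decode("0001001"))  # 9
```

Like variable byte codes, gamma codes are self-delimiting, so consecutive gaps can be concatenated into one bit stream.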
Storage Comparison Example
Suppose we store document IDs:
[1000, 1005, 1010]
Raw 32-bit storage:
3 × 32 = 96 bits
Gap encoding:
[1000, 5, 5]
Binary size:
1000 → 10 bits
5 → 3 bits
5 → 3 bits
Total:
10 + 3 + 3 = 16 bits
Massive reduction.
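The arithmetic above can be checked with Python's `int.bit_length` (note the 16-bit figure counts raw binary digits only; a self-delimiting code such as gamma would add some overhead on top):

```python
ids = [1000, 1005, 1010]
gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

raw_bits = 32 * len(ids)                      # fixed 32-bit integers
gap_bits = sum(g.bit_length() for g in gaps)  # minimal binary digits per gap

print(gaps)                # [1000, 5, 5]
print(raw_bits, gap_bits)  # 96 16
```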
Summary
Text Processing:
- Tokenization
- Stopword removal
- Stemming
- Lemmatization
Indexing:
- Inverted Index
- Record-level vs Word-level
- Suffix trees
- B-trees
Compression:
- Gap encoding
- Variable byte encoding
- Gamma encoding
- Front coding
Without preprocessing and compression, large-scale IR would collapse under its own weight.
Efficient indexing and compression transform chaos into searchable structure.