Advanced Retrieval and Web Mining


Course Syllabus

Suggested Textbooks:

MG = Managing Gigabytes, by Witten, Moffat, and Bell.
MIR = Modern Information Retrieval, by Baeza-Yates and Ribeiro-Neto.
FOA = Finding Out About, by Belew.
FSNLP = Foundations of Statistical Natural Language Processing, by Manning and Schütze.
None of the textbooks is mandatory.

Schedule:

Date Topics Notes Who Readings
23 Aug
Basic inverted indexes:
Boolean query processing
[powerpoint]
[pdf]
PR
MG Ch. 3.2; MIR Ch. 8.2
Shakespeare plays
WestLaw
23 Aug
Finish basic indexing
Query processing – more tricks
Proximity/phrase queries
[powerpoint]
[pdf]
PR
MG Ch. 4.0-4.3, 4.5; MIR Ch. 3
Porter's stemmer
More Porter from the author
Lovins stemmer
Fast Phrase Querying with Combined Indexes, from http://www.seg.rmit.edu.au/research/research.php?author=4
24 Aug
Postings pointer storage
Dictionary storage
Compression
Wild-card queries
[powerpoint]
[pdf]
PR
MG Ch. 3.3, 3.4, 4.2
24 Aug
Query expansion
Index construction
[powerpoint]
[pdf]
PR
MG Ch. 5
25 Aug
Parametric and field searches
Scoring documents: zone weighting
tf-idf and vector spaces
[powerpoint]
[pdf]
PR
MG Ch. 4.4
New Retrieval Approaches Using SMART: TREC 4
Gerard Salton and Chris Buckley. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990.
25 Aug
Vector space scoring
Nearest neighbors and approximations
[powerpoint]
[pdf]
PR
MG Ch. 4.4-4.6; MIR Ch. 2.5, 2.7.2; FSNLP Ch. 15.4
Random projection theorem
Faster random projection
http://lsi.argreenhouse.com/lsi/LSIpapers.html
http://lsa.colorado.edu/
http://www.cs.utk.edu/~lsi/
26 Aug
Evaluating a search engine
Precision and recall
[powerpoint]
[pdf]
PR
MG Ch. 4.5
26 Aug
Web search
Link-based ranking in web search engines I
[powerpoint]
[pdf]
PR
MIR Ch. 13
Bibliography from Bharat/Broder/Hawking/Raghavan Tutorial at ACM SIGIR 2002 [pdf | html]
Anatomy of a large-scale hypertextual web search engine
26 Aug
Afternoon session: crawling and course project introduction
MS+PR
Tools
27 Aug
Link-based ranking in web search engines II
[powerpoint]
[pdf]
PR
FOA Ch. 6.1
Authoritative sources in a hyperlinked environment
Hypersearching the Web
Dubhashi resource collection covering recent topics
The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank
Topic-Sensitive PageRank
The Structure of Information Networks (CS 685 at Cornell)
Stable algorithms for link analysis
27 Aug
Behavior-based ranking; crawling; duplicate detection; search engine infrastructure.
[powerpoint]
[pdf]
PR
FOA Ch. 6.2
Supplemental notes on min-wise hashing [ppt | pdf].
30 Aug
XML RETRIEVAL [powerpoint]

HS
XML tutorial
XML full text requirements
Other approaches: XRank, Result Ranking for Structured Queries against XML Documents
XML classification
30 Aug
CLUSTERING 1. Introduction to the problem. Agglomerative and k-means clustering. Clustering versus classification. [powerpoint]
HS
Scatter/Gather
Data Clustering Review
Single-Link and Complete-Link Clustering
31 Aug
CLUSTERING 2. Clustering terms using documents, labelling clusters, evaluating clustering, link-based clustering, trawling [powerpoint]
HS
FSNLP Ch. 14
Mining Association Rules Between Sets of Items in Large Databases
Clustering Hypertext with Applications to Web Searching
Trawling Emerging Cyber-communities Automatically
Projections for Efficient Document Clustering
31 Aug
LATENT SEMANTIC INDEXING, USER INTERFACES. Browsing and Visualization models. Evaluation of IR interfaces. [powerpoint]
HS
MIR Ch. 10.0-10.7, FOA Ch. 4.3, MIR Ch. 10.8-10.10
Probabilistic Latent Semantic Analysis
Latent semantic indexing
Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness
Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results
Cat-a-Cone: An Interactive Interface for Specifying Searches and Viewing Retrieval Results using a Large Category Hierarchy
A Case for Interaction: A Study of Interactive Information Retrieval Behavior and Effectiveness
OLIVE: On-line Library of Information Visualization Environments
Overview of the Third Text REtrieval Conference (TREC-3)
Overview of the Fourth Text REtrieval Conference (TREC-4)
TREC-6 Interactive Track Report
1 Sep
CLASSIFICATION 1. Naive Bayes methods [powerpoint]
HS
Machine Learning in Automated Text Categorization
A Comparison of Event Models for Naive Bayes Text Classification
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
A Re-examination of Text Categorization Methods
1 Sep
CLASSIFICATION 2. Evaluation, vector space classification, k nearest neighbors, decision trees [powerpoint]
HS
FSNLP Ch. 16
Evaluating and Optimizing Autonomous Text Classification Systems
Dumais, Platt, Heckerman, and Sahami. Inductive learning algorithms and representations for text categorization. CIKM 1998.
Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.
Reuters dataset
A Comparative Study on Feature Selection in Text Categorization
2 Sep
CLASSIFICATION 3. Logistic regression, support vector machines [powerpoint]
HS
Support Vector Machine Tutorial
Dumais. Using SVMs for Text Categorization. IEEE Intelligent Systems 13(4) Jul-Aug 1998.
Text Categorization Based on Regularized Linear Classification Methods
Why the logistic often is a good estimator of class probability. Tutorial.
2 Sep
INFORMATION EXTRACTION AND MINING. Rapier, hidden Markov models [powerpoint]
HS
Fast Effective Rule Induction (Cohen 1995)
Berkeley HMM Tutorial
Information Extraction Using Hidden Markov Models
Learning Hidden Markov Model Structure for Information Extraction
Information Extraction with HMM Structures Learned by Stochastic Optimization
HMM parameter estimation
Introduction to Information Extraction Technology, IJCAI 1999
Learning Information Extraction Rules for Semi-Structured and Free Text
3 Sep (1st slot)
BIOINFORMATICS. Special constraints in bioinformatics, combining textual and non-textual data [pdf]
HS
Gene Ontology
Jeff Chang's BioNLP server
Biological literature improves homology search
3 Sep (2nd slot)
Compression techniques for the Web Graph [pdf]
PB

3 Sep
Presentation of Projects
HS+PB+MS