Faculty of Mathematics and Physics

*"Information retrieval is the task of searching a body of information for objects that statisfied an information need."*

This course is offered at the Faculty of Mathematics and Physics to graduate students interested in information retrieval, web search, document classification, and related areas. It is based on the book *Introduction to Information Retrieval* by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. The course covers both the foundations of information retrieval and some more advanced topics.

SIS code: NPFL103

Semester: winter

E-credits: 5

Examination: 2/2 C+Ex

Lecturer: Pavel Pecina, pecina@ufal.mff.cuni.cz

Language: The course is taught in **English**. All the materials are in English; the homework assignments and the exam can be completed in either English or Czech.

- Mondays, 15:40-17:10 (18:50) Zoom/S8 Malostranské nám. 25
- Consultations upon request.

Oct 5

- The course in 2020/21 will be taught via Zoom (the invitation will be sent via email to all enrolled students in SIS).
- The first class will take place on Oct 5 at 15:40.

1. Introduction, Boolean retrieval, Inverted index, Text preprocessing Slides Questions

2. Dictionaries, Tolerant retrieval, Spelling correction Slides Questions

3. Index construction and index compression Slides Questions

4. Ranked retrieval, Term weighting, Vector space model Slides Questions

5. Ranking, Complete search system, Evaluation, Benchmarks Slides 1. Vector space models Questions

6. Result summaries, Relevance Feedback, Query Expansion Slides Questions

7. Probabilistic information retrieval Slides Questions

8. Language models, Text classification Slides Questions

9. Vector space classification Slides 2. Retrieval frameworks Questions

10. Document clustering Slides Questions

11. Latent Semantic Indexing Slides Questions

12. Web search, Crawling, Duplicate detection, Spam detection Slides Questions

No formal prerequisites are required. Students should have substantial programming experience and be familiar with basic algorithms, data structures, and statistical/probabilistic concepts.

To pass the course, students need to complete two homework assignments and a written test. See grading for more details.

**Note:** The slides available on this page might get updated during the semester. For each lecture, any updates will be published before the lecture starts.

- definition of Information Retrieval
- Boolean model, Boolean queries, Boolean search
- inverted index, dictionary, postings, index construction, postings list intersection
- text processing, tokenization, term normalization, diacritics, case folding, stop words, lemmatization, stemming, Porter algorithm
- phrase queries, biword index, positional index, proximity search
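The postings-list intersection above amounts to a linear merge of two sorted docID lists; a minimal Python sketch (function and variable names are illustrative):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists of docIDs in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID present in both lists -> keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer at the smaller docID
            i += 1
        else:
            j += 1
    return answer
```

For an AND query, intersecting the shortest postings lists first keeps the intermediate results small.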

- dictionary data structures, hashes, trees
- wildcard queries, permuterm index, k-gram index
- spelling correction, correcting documents, correcting queries, spellchecking dictionaries
- edit distance, Levenshtein distance, weighted edit distance, query spelling correction
- Soundex
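Levenshtein distance, for instance, is computed by the standard dynamic program over string prefixes; a minimal sketch with unit costs (a weighted variant would replace the `+ 1` terms with per-operation costs):

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))      # distances from a[:0] to each b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]                       # distance from a[:i] to the empty prefix
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = cur
    return prev[-1]
```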

- Reuters RCV1 collection
- sort-based index construction, Blocked sort-based indexing, Single-pass in-memory indexing
- distributed indexing, Map Reduce, dynamic indexing
- index compression, Heaps' law, Zipf's law, dictionary compression, postings compression
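Postings compression typically stores gaps between consecutive docIDs rather than the docIDs themselves and encodes the small gaps compactly. A minimal sketch of one common scheme, variable byte encoding (7 data bits per byte, with the high bit marking the last byte of a number):

```python
def vb_encode(n):
    """Variable byte code for a single non-negative integer."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128          # set the high bit on the terminating byte
    return out

def vb_decode(stream):
    """Decode a concatenated stream of VB-coded numbers."""
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte                    # continuation byte
        else:
            numbers.append(128 * n + byte - 128)  # terminating byte
            n = 0
    return numbers
```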

- ranked retrieval vs. Boolean retrieval, query-document scoring, Jaccard coefficient
- Bag-of-words model, term-frequency weighting, document-frequency weighting, tf-idf, collection frequency vs. document frequency
- vector space model, measuring similarity, distance vs. angle, cosine similarity
- length normalization, cosine normalization, pivot normalization
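A minimal sketch of tf-idf weighting with cosine similarity on sparse vectors, using the `1 + log10 tf` variant of term-frequency weighting (names are illustrative):

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, N):
    """Sparse tf-idf vector: weight = (1 + log10 tf) * log10(N / df)."""
    return {t: (1 + math.log10(tf)) * math.log10(N / df[t])
            for t, tf in Counter(tokens).items() if t in df}

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Dividing by the vector norms is exactly cosine length normalization: two documents differing only in length get similarity 1.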

Nov 2 Slides 1. Vector space models Questions

- ranking, motivation and implementation, document-at-a-time vs. term-at-a-time processing
- tiered indexes
- query processing, query parser
- evaluation, precision, recall, F-measure, confusion matrix, precision-recall trade-off, precision-recall curves, average precision, mean average precision, averaged precision-recall graph, A/B testing
- existing benchmarks

- static summaries, dynamic summaries
- relevance feedback, Rocchio algorithm, pseudo-relevance feedback
- query expansion, thesaurus construction
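The Rocchio algorithm moves the query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant ones. A minimal sketch on sparse vectors; the coefficient values below are common illustrative defaults:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant).
    Vectors are dicts term -> weight; negative weights are dropped."""
    new = {t: alpha * w for t, w in query.items()}
    for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        if not docs:
            continue
        for d in docs:                       # accumulate coeff * centroid
            for t, w in d.items():
                new[t] = new.get(t, 0.0) + coeff * w / len(docs)
    return {t: w for t, w in new.items() if w > 0}
```

For pseudo-relevance feedback, `relevant` would simply be the top-k documents of the initial ranking.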

- probabilistic approaches to IR
- Probabilistic Ranking Principle
- Binary Independence Models
- Okapi BM25
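Okapi BM25 combines an idf component with a saturating, length-normalized term frequency. A minimal sketch using a simplified idf (the 0.5 corrections of the full RSJ idf are omitted); `k1` and `b` are the usual tuning parameters:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avgdl, df, N, k1=1.5, b=0.75):
    """BM25 score of one document for a bag-of-words query.
    doc_tf/df: dicts term -> frequency; N: collection size; avgdl: avg doc length."""
    score = 0.0
    for t in query_terms:
        if t not in doc_tf or t not in df:
            continue
        idf = math.log(N / df[t])                       # simplified idf variant
        tf = doc_tf[t]
        norm = 1 - b + b * doc_len / avgdl              # length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score
```

As `tf` grows the term's contribution saturates toward `idf * (k1 + 1)`, unlike plain tf-idf.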

- language models for IR
- smoothing (Jelinek-Mercer, Dirichlet, Add-one)
- text classification
- Naive Bayes classifier
- evaluation of text classification, micro averaging, macro averaging
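Query-likelihood retrieval with Jelinek-Mercer smoothing interpolates the document language model with the collection language model; a minimal sketch (the interpolation weight `lam` is an illustrative choice):

```python
import math

def ql_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """Log query likelihood with Jelinek-Mercer smoothing:
    P(t|d) = lam * tf(t,d)/|d| + (1 - lam) * cf(t)/|collection|."""
    score = 0.0
    for t in query:
        p = (lam * doc_tf.get(t, 0) / doc_len
             + (1 - lam) * coll_tf.get(t, 0) / coll_len)
        if p == 0.0:                 # term unseen even in the collection
            return float("-inf")
        score += math.log(p)
    return score
```

Dirichlet smoothing would instead use a document-length-dependent weight, `|d| / (|d| + mu)`, in place of the fixed `lam`.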

Nov 30 Slides 2. Retrieval frameworks Questions

- vector space classification
- k Nearest Neighbors
- linear classifiers
- Support Vector Machines
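k Nearest Neighbors classification in vector space assigns the majority label of the k most similar training documents; a minimal sketch using dot-product similarity on sparse dict vectors (names are illustrative):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (sparse_vector, label) pairs; returns the majority label
    among the k training vectors most similar to the query."""
    def dot(u, v):
        return sum(w * v.get(t, 0.0) for t, w in u.items())
    neighbors = sorted(train, key=lambda item: dot(item[0], query),
                       reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

With length-normalized vectors the dot product equals cosine similarity, so this is cosine-based kNN.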

- document clustering in IR
- vector space clustering
- K-means clustering
- setting number of clusters
- evaluation of clustering
- hierarchical clustering, dendrogram, cluster similarity measures
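K-means alternates an assignment step (each point joins its nearest centroid) with an update step (each centroid becomes the mean of its cluster). A minimal sketch on dense tuples; the iteration count and seeding strategy are illustrative:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm; points are tuples of floats. Returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)             # init: k distinct random points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment step
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):    # update step
            if members:                           # keep old centroid if cluster empty
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, clusters
```

In practice one stops when assignments no longer change, and restarts with several seeds since the result depends on initialization.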

- Singular Value Decomposition
- dimensionality reduction
- LSI in information retrieval
- LSI as soft clustering
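LSI truncates the singular value decomposition of the term-document matrix to a rank-k approximation; a minimal sketch with NumPy, assuming terms in rows and documents in columns:

```python
import numpy as np

def lsi(A, k):
    """Rank-k LSI of a term-document matrix A via the SVD A = U S V^T."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    docs_k = np.diag(S[:k]) @ Vt[:k, :]   # document vectors in the k-dim latent space
    approx = U[:, :k] @ docs_k            # best rank-k approximation of A
    return docs_k, approx
```

Queries are folded into the same latent space and compared to the columns of `docs_k` by cosine similarity.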


**Note:** Detailed specification of the assignments (with a link to data download) will be distributed via email.

Deadline: Nov 29, 2020 23:59 100 points

Design, develop and evaluate your own retrieval system based on vector space model.

Deadline: Jan 3, 2021, 23:59 100 points

Design, develop and evaluate a state-of-the-art retrieval system using an off-the-shelf retrieval framework of your choice.

- There are two homework assignments during the semester with a fixed deadline announced on the webpage.
- The assignments are to be worked on independently and require a substantial amount of programming, experimentation, and reporting to complete.
- Students will present their solutions during the practicals in 10-minute presentations.
- Each assignment is awarded 0-100 points.
- Late submissions received up to 2 weeks after the deadline will be penalized by a 50% point reduction.
- Submissions received later than 2 weeks after the deadline will be awarded 0 points.

- There will be an exam in the form of a written test at the end of the semester.
- The test includes approximately 20 short-answer questions covering the topics discussed in the lectures.
- The maximum duration of the test is 90 minutes.
- The test will be graded on a 0-100 point scale.

- Completion of both the homework assignments and exam is required to pass the course.
- Students need to earn at least 50 points for each assignment (before the late-submission penalty) and at least 50 points for the test.
- The points received for the assignments and the test will be available in SIS.
- The final grade will be based on the average of the results of the exam and the two homework assignments, all three weighted equally:
- ≥ 90%: grade 1 (excellent)
- ≥ 70%: grade 2 (very good)
- ≥ 50%: grade 3 (good)
- < 50%: grade 4 (fail)

- No plagiarism will be tolerated.
- All cases of plagiarism will be reported to the Student Office.

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: *Introduction to Information Retrieval*
*Cambridge University Press*, 2008, ISBN 978-0521865715.

Available online.

David A. Grossman and Ophir Frieder: *Information Retrieval: Algorithms and Heuristics*
*Springer*, 2004, ISBN 978-1402030048.

Ricardo Baeza-Yates and Berthier Ribeiro-Neto: *Modern Information Retrieval*
*Addison Wesley*, 1999, ISBN 978-0201398298.