Lecture (2V+1Ü, 4 ECTS-LP) **"Information Retrieval and Data Mining"** (Module Description), Course Number INF-24-52-V-7

- Level: Master
- Language: English

**Time and Location**

**Lecture:**

- KIS entry
- Monday, 11:45-13:15.
- Room 42-110
- Begin: 24.04.2017

**Exercise:**

- KIS entry
- Wednesday, 13:45-15:15.
- Room 46-110
- Begin: 03.05.2017

**News**

| Date | News |
|---|---|
| 25.04.2017 | All news will be posted in OLAT from now on. |
| 23.03.2017 | Regulations for qualification to the final exam are posted. Please read carefully. |
| 22.03.2017 | Room for exercise changed to 46-110 and also time slot slightly moved. |
| 03.03.2017 | Website is online. |

### **Regulations**

Please read carefully.

**Students need to successfully participate in the exercise sessions, according to the regulations below, in order to get admitted to the final exam.**

- There will be 6 exercise sheets.
- The teaching assistant presents the solutions and answers questions.
- Attendance at the exercise sessions is not mandatory; still, we would be happy to see lively participation.
- Each sheet consists of 3 assignments, which makes 18 assignments in total. Each assignment is worth one point.
- A student needs to reach a total of at least 13 points throughout the semester to qualify for the final exam.
- Solutions to exercise sheets have to be submitted in OLAT.
- Students can work alone or in groups of at most two; groups are fixed with the first submission. Each group member uploads the same solution individually in OLAT, with the names of both members on every sheet.
- Students need to mark in OLAT which individual assignments they have managed to solve correctly.
- Students can only mark an assignment as solved if they have managed to complete more than ⅔ of the assignment.
- If the solution of an assignment is not correct to an extent of ⅔ or more, the point for that assignment will not be given. That is, if you did not work on more than ⅔, do not set the mark at all.
- If it is obvious that the mark has been placed in a dishonest attempt to obtain a point without proper engagement with the assignment, the entire sheet is assessed with zero points. This applies, for instance, if the marked assignment is not done at all or is clearly less than ⅔ solved.
- Copying solutions from other groups or taking solutions from previously published solution sheets, if clearly identifiable, will result in the immediate disqualification of all involved groups from the course, regardless of the number of points they have legitimately earned.

**People**

**Contents (tentative)**

- Boolean Information Retrieval (IR), TF-IDF, IR evaluation
- Probabilistic IR, BM25
- Hypothesis testing
- Statistical language models, latent topic models
- Relevance feedback, novelty & diversity
- PageRank, HITS
- Spam detection, social networks
- Inverted lists
- Index compression, top-k query processing
- Frequent itemsets & association rules
- Hierarchical, density-based, and co-clustering
- Decision trees and Naive Bayes
- Support vector machines

**Slides**

- **Lecture 1**: Motivation, Regulations, Ranking Models, Relevance Assessment pdf
- **Lecture 2**: Probabilistic Retrieval Models pdf
- **Lecture 3**: Language Models, Smoothing, Novelty & Diversity, Latent Topic Models pdf
- **Lecture 4**: Latent Topic Models pdf
- **Lecture 5**: Link Analysis: PageRank, HITS pdf
- **Lecture 6**: Indexing, Compression pdf
- **Lecture 7**: Compression, Query Processing pdf
- **Lecture 8**: Data Mining Intro, K-Means Clustering, Hierarchical Clustering, DBSCAN pdf

**Exercise Sheets**

If you are using LaTeX to prepare your submission, you might find the LaTeX sources of the sheets useful. Do not expect them to compile as they are, though; just copy the parts you need.

- **Sheet 1** pdf (latex source), Solution 1 (pdf)
- **Sheet 2** pdf (latex source), Solution 2 (pdf)
- **Sheet 3** pdf (latex source, benchmark.tar.gz), Solution 3 (pdf)
- **Sheet 4** pdf (latex source), Solution 4 (pdf)
- **Sheet 5** pdf (latex source, data.txt)
- **Sheet 6** pdf (latex source)

**Literature**

- Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval, Cambridge University Press, 2008.
- Larry Wasserman. All of Statistics, Springer, 2004.
- Stefan Büttcher, Charles L. A. Clarke, Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines.
- Anand Rajaraman and Jeffrey D. Ullman. Mining of Massive Datasets, Cambridge University Press, 2011.
- Supplementary literature references will be given in the lecture.

**Acknowledgements**

The course material is to a large extent based on material by Klaus Berberich, Martin Theobald, Pauli Miettinen and Gerhard Weikum, MPI Informatik, Saarbrücken.