IRWS-Homework

Repository containing all programming assignments from the course Information Retrieval and Web Search at the University of Mannheim, Spring Term 2017.

Homework 1: Minimum Edit Distance

Homework 1 implements the (Damerau-)Levenshtein distance in two ways: with dynamic programming and with recursion.

Several flags customize the behavior at runtime:

original: left side of the comparison
compare: right side of the comparison
recursive: true if a recursive version shall be used
damerau: true if the Damerau-Levenshtein distance shall be used
weigths: true if custom weights for transposition/replacement shall be used

The implementation is written in Go. A few tests show sample output, and benchmark tests compare the runtime of the recursive and dynamic programming versions.
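
As an illustration of the dynamic programming approach, here is a minimal sketch in Go. It is not the code from this repository: the function name `EditDistance` and its signature are hypothetical, all edits have unit cost (no custom weights), and the Damerau case uses the restricted "optimal string alignment" variant, which is an assumption about which variant is meant.

```go
package main

import "fmt"

// EditDistance returns the Levenshtein distance between a and b using
// dynamic programming. If damerau is true, swapping two adjacent characters
// also counts as a single edit (optimal string alignment variant).
// Hypothetical helper for illustration; all edit costs are 1.
func EditDistance(a, b string, damerau bool) int {
	ra, rb := []rune(a), []rune(b)

	// d[i][j] is the distance between the first i runes of a
	// and the first j runes of b.
	d := make([][]int, len(ra)+1)
	for i := range d {
		d[i] = make([]int, len(rb)+1)
		d[i][0] = i // delete all i runes
	}
	for j := 0; j <= len(rb); j++ {
		d[0][j] = j // insert all j runes
	}

	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			best := d[i-1][j] + 1 // deletion
			if v := d[i][j-1] + 1; v < best { // insertion
				best = v
			}
			if v := d[i-1][j-1] + cost; v < best { // match or replacement
				best = v
			}
			// Transposition of two adjacent runes.
			if damerau && i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
				if v := d[i-2][j-2] + 1; v < best {
					best = v
				}
			}
			d[i][j] = best
		}
	}
	return d[len(ra)][len(rb)]
}

func main() {
	fmt.Println(EditDistance("kitten", "sitting", false)) // 3
	fmt.Println(EditDistance("ca", "ac", false))          // 2
	fmt.Println(EditDistance("ca", "ac", true))           // 1
}
```

The table has (len(a)+1) x (len(b)+1) cells and each is filled once, which is why this version avoids the repeated subproblems that a naive recursive version recomputes.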

Homework 2: Vector Space and Probabilistic Retrieval

Term weighting: Compute TF-IDF for a toy document collection using different definitions of TF and IDF, then rank the documents for a given query by cosine similarity.
Distance/similarity metrics: Rank documents for a given query using 'raw Euclidean distance', 'normalized Euclidean distance', and 'cosine similarity'.
Optimizing the vector space model: Given a toy collection of TF-IDF vectors, perform random projections to reduce computation cost. Pre-cluster the documents using a given set of leader vectors. Finally, retrieve the top 5 documents for a query vector using the random projection vectors and the leader vectors with their clusters.
Classic probabilistic retrieval: Given a query, rank documents with the 'Binary independence model', the 'Two-Poisson model', and 'BM25'.
Unigram Likelihood Model for Information Retrieval: The programming task was to build a query likelihood model, based on a unigram likelihood model, for the 20 News corpus. The model takes ad-hoc queries and ranks the documents by relevance. This part is implemented in Scala with the Spark API.
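
To make the query likelihood idea concrete, here is a small sketch in Go. It is not the repository's Scala/Spark implementation. The names `Scored` and `RankByQueryLikelihood` are hypothetical, tokenization is plain whitespace splitting, and Jelinek-Mercer smoothing is an assumption, since the description above does not say which smoothing (if any) was used.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// Scored pairs a document index with its query log-likelihood.
type Scored struct {
	Doc   int
	Score float64
}

// RankByQueryLikelihood ranks docs by log P(query | doc) under a unigram
// model. Assumption: Jelinek-Mercer smoothing, i.e. each term probability is
// lambda*P(t|doc) + (1-lambda)*P(t|collection).
func RankByQueryLikelihood(docs []string, query string, lambda float64) []Scored {
	collection := map[string]int{} // term counts over all documents
	total := 0                     // number of tokens in the collection
	docCounts := make([]map[string]int, len(docs))
	docLens := make([]int, len(docs))
	for i, d := range docs {
		docCounts[i] = map[string]int{}
		for _, t := range strings.Fields(strings.ToLower(d)) {
			docCounts[i][t]++
			collection[t]++
			docLens[i]++
			total++
		}
	}

	terms := strings.Fields(strings.ToLower(query))
	ranked := make([]Scored, len(docs))
	for i := range docs {
		score := 0.0
		for _, t := range terms {
			pDoc := 0.0
			if docLens[i] > 0 {
				pDoc = float64(docCounts[i][t]) / float64(docLens[i])
			}
			pColl := float64(collection[t]) / float64(total)
			// A term that occurs nowhere in the collection yields -Inf.
			score += math.Log(lambda*pDoc + (1-lambda)*pColl)
		}
		ranked[i] = Scored{Doc: i, Score: score}
	}
	// Highest log-likelihood first.
	sort.SliceStable(ranked, func(a, b int) bool { return ranked[a].Score > ranked[b].Score })
	return ranked
}

func main() {
	docs := []string{
		"the cat sat on the mat",
		"the dog chased the cat",
		"stock markets fell sharply",
	}
	for _, s := range RankByQueryLikelihood(docs, "dog cat", 0.5) {
		fmt.Printf("doc %d: %.4f\n", s.Doc, s.Score)
	}
}
```

With these toy documents the second document (index 1) ranks first, because it is the only one containing both query terms. Smoothing keeps the other documents from scoring negative infinity when a query term is missing from them.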

Homework 3: Semantic Retrieval, Text Clustering, and IR Evaluation

Latent Semantic Indexing: Computing the similarity of latent vectors for a toy collection of documents and a query
Text Clustering: Using 'K-Means' and 'Single Pass Clustering' to cluster a toy collection of TF-IDF vectors
IR Evaluation: Calculating precision, recall, F1, P@k, R-precision, average precision, and mean average precision for a toy collection of retrievals and their relevance ratings
Semantic Retrieval with Word Embeddings: Implementation of a simple retrieval engine based on aggregating pretrained 'GloVe' word embeddings, using a random subsample of 500 documents from the '20 News Groups dataset'
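
Several of the evaluation measures listed under IR Evaluation can be sketched in a few lines of Go. This is an illustrative example, not code from the repository: the function names are hypothetical, relevance is assumed to be binary, and a ranking is represented as a slice of booleans where entry i says whether the result at rank i+1 is relevant.

```go
package main

import "fmt"

// PrecisionAtK returns the fraction of the top k results that are relevant.
// Ranks beyond the end of the list count as non-relevant.
func PrecisionAtK(rel []bool, k int) float64 {
	if k <= 0 {
		return 0
	}
	hits := 0
	for i := 0; i < k && i < len(rel); i++ {
		if rel[i] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

// AveragePrecision averages the precision at each rank where a relevant
// result appears, divided by the total number of relevant documents.
func AveragePrecision(rel []bool, totalRelevant int) float64 {
	if totalRelevant <= 0 {
		return 0
	}
	hits, sum := 0, 0.0
	for i, r := range rel {
		if r {
			hits++
			sum += float64(hits) / float64(i+1)
		}
	}
	return sum / float64(totalRelevant)
}

// MeanAveragePrecision is the mean of AveragePrecision over several queries.
// totals[i] is the number of relevant documents for query i.
func MeanAveragePrecision(runs [][]bool, totals []int) float64 {
	if len(runs) == 0 {
		return 0
	}
	sum := 0.0
	for i, rel := range runs {
		sum += AveragePrecision(rel, totals[i])
	}
	return sum / float64(len(runs))
}

// F1 is the harmonic mean of precision p and recall r.
func F1(p, r float64) float64 {
	if p+r == 0 {
		return 0
	}
	return 2 * p * r / (p + r)
}

func main() {
	// Five retrieved results; 3 relevant documents exist in total.
	rel := []bool{true, false, true, false, false}
	p := PrecisionAtK(rel, len(rel)) // precision over the full list
	r := 2.0 / 3.0                   // 2 of the 3 relevant documents retrieved

	fmt.Printf("P@3 = %.4f\n", PrecisionAtK(rel, 3))       // 0.6667
	fmt.Printf("AP  = %.4f\n", AveragePrecision(rel, 3))   // 0.5556
	fmt.Printf("F1  = %.4f\n", F1(p, r))                   // 0.5000
	runs := [][]bool{rel, {false, true}}
	fmt.Printf("MAP = %.4f\n", MeanAveragePrecision(runs, []int{3, 1})) // 0.5278
}
```

R-precision falls out of the same helper: it is `PrecisionAtK(rel, R)` with R set to the number of relevant documents for the query, which for the example above is P@3.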

IvoGoman/IRWS-Homework