gvsandeep2647/Vector_Space_Model

Vector Space Model

Course Number : CS F469

Contributors :

  • G V Sandeep
  • Kushagra Agrawal
  • Snehal Wadhwani

Aim : To implement a retrieval system based on the vector space model on the given dataset

Language : Python v2.7.12

Working :
  1. The entire corpus is run through main.py, which extracts the required fields and stores them in a normalized form.
  2. For example, the date is stored as a UNIX timestamp so that comparison is easier. Outlinks, inlinks and comments are recalculated as 1 + log(num), where num is the original count. All the words in the posts and titles are tokenized and stemmed using Porter's stemming algorithm. The words are also stripped of spaces and made case-insensitive.
  3. These values are then passed to new_inverted.py, which constructs a dictionary and builds an inverted index for each of these attributes.
  4. These dictionaries are then used by tfidf.py, which calculates the tf-idf (term frequency - inverse document frequency) scores.
  5. The idf of every word is calculated as log(N/df), where N is the size of the corpus (the number of documents) and df is the document frequency of the word. The tf weight of a word in a document is calculated as 1 + log(tf), where tf is the frequency of the word in that document. The weighting scheme used for document-query similarity is lnc.ltc (in ddd.qqq notation). The document vector (which contains only the tf weights) is normalized to a unit vector.
  6. Query processing in GUI.py includes tokenization, normalization, tf-idf calculation and normalization of the query vector. The cosine similarity is computed against the document vectors built in tfidf.py. Scores are assigned according to the query's similarity with each document's title, blogger and post. If two or more documents have the same score, the tie is resolved by taking into account the number of inlinks, outlinks and comments.
  7. For query search, the query is broken down into pairs of terms. The corpus is then searched for documents in which the two words of a pair occur within the given distance of each other (for phrase queries the distance is one). The returned list is then checked against the other pairs of normalized query terms to finally return the documents that best match the query.
  8. Python's Tkinter GUI toolkit is used to give the system a more professional look and make it user friendly.
  9. The user has the option to narrow down the results by selecting a particular date range and the category of results wanted.
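
The weighting and ranking described in the steps above can be sketched roughly as follows. This is a minimal illustration and not the code in tfidf.py or GUI.py: the function names, the toy documents, the base-10 logarithm and the single `popularity` number used for tie-breaking are all assumptions made for the example, and the tokens are assumed to be already stemmed and normalized.

```python
from __future__ import division  # keeps N/df a true division on Python 2.7

import math
from collections import Counter


def damp(count):
    """Return 1 + log10(count) for positive counts, 0 otherwise (base 10 is assumed)."""
    return 1.0 + math.log10(count) if count > 0 else 0.0


def unit(vector):
    """Scale a {term: weight} dict to unit length (cosine normalization)."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    if norm == 0:
        return dict(vector)
    return {t: w / norm for t, w in vector.items()}


def build_idf(docs):
    """idf = log10(N / df), computed over a {doc_id: [tokens]} corpus."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    return {t: math.log10(n / d) for t, d in df.items()}


def build_doc_vectors(docs):
    """lnc: logarithmic tf, no idf, cosine normalization."""
    return {doc_id: unit({t: damp(c) for t, c in Counter(tokens).items()})
            for doc_id, tokens in docs.items()}


def query_vector(tokens, idf):
    """ltc: logarithmic tf, idf, cosine normalization. Unknown terms are dropped."""
    counts = Counter(tokens)
    return unit({t: damp(c) * idf[t] for t, c in counts.items() if t in idf})


def rank(query_tokens, doc_vectors, idf, popularity=None):
    """Rank documents by cosine similarity.

    Ties are broken by a hypothetical per-document popularity value
    (for instance, some combination of damped inlinks, outlinks and comments).
    """
    popularity = popularity or {}
    q = query_vector(query_tokens, idf)
    scores = {}
    for doc_id, d in doc_vectors.items():
        score = sum(w * d.get(t, 0.0) for t, w in q.items())
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(),
                  key=lambda kv: (-kv[1], -popularity.get(kv[0], 0.0), kv[0]))


# Toy corpus with made-up, already stemmed tokens.
docs = {
    "d1": ["vector", "space", "model"],
    "d2": ["vector", "vector", "retriev"],
    "d3": ["blog", "post", "comment"],
}
idf = build_idf(docs)
doc_vectors = build_doc_vectors(docs)
results = rank(["vector", "model"], doc_vectors, idf)
```

Because both the document and query vectors are unit length, the dot product in `rank` is already the cosine similarity, so no further division is needed at query time. In this toy corpus `d1` ranks above `d2` because it contains the rarer term "model", while `d3` shares no term with the query and is left out.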
Setting it up:
  1. Extract the folder and then run GUI.py. That's it :D
  2. Note that GUI.py takes around 14 minutes to pre-process the data.

Screenshots

  • Normal Query
  • Phrase Query
  • Phrase Query with topic slicing

About

An Information Retrieval System Implementation
