This project is a sample IR system built on the CACM document collection. The Boost library is used to serialize the data structure (a map) that holds the document information (the inverted index).
PageRank scores are calculated while the index is being created and are used to rank the results after the user submits a query.
The project is split into three programs:
- invert : Creates the inverted index for the document collection. Its output files are needed by the other two programs.
- search : Provides a command line interface to search for documents within the collection. This program uses the output files produced by invert.
- eval : Evaluates the results of the search algorithm using the "query.text" and "qrels.text" files found in the CACM zip.
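To make the idea of an inverted index concrete, here is a minimal sketch using only the standard library. The type names (Postings, InvertedIndex) and the helper functions are hypothetical and are not taken from the project's source; the actual invert program may tokenize and store postings differently.

```cpp
#include <cctype>
#include <map>
#include <sstream>
#include <string>

// Hypothetical layout: term -> (document id -> term frequency).
using Postings = std::map<int, int>;
using InvertedIndex = std::map<std::string, Postings>;

// Lowercase a token and drop every non-alphanumeric character.
std::string normalizeToken(const std::string& raw) {
    std::string out;
    for (unsigned char c : raw) {
        if (std::isalnum(c)) {
            out.push_back(static_cast<char>(std::tolower(c)));
        }
    }
    return out;
}

// Add every whitespace-separated token of one document to the index.
void addDocument(InvertedIndex& index, int docId, const std::string& text) {
    std::istringstream stream(text);
    std::string raw;
    while (stream >> raw) {
        const std::string term = normalizeToken(raw);
        if (!term.empty()) {
            ++index[term][docId];  // creates the entry on first use
        }
    }
}
```

Looking up a term in the map then yields every document that contains it, together with how often it occurs there.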
The invert program generates all the files needed by both search and eval. To run it, compile it and run the resulting .exe
- Required Files (Must be in the same directory):
- cacm.all : required
- stopwords.txt : optional
- Command Line Options:
- --help : get a list of available command line options
- --stopwords : enable the removal of stopwords
- --ps : enable porter stemming
- --decay number : specify the decay value for the random surfer model where number is a decimal number between 0 and 1
- --iterations number : specify the number of iterations for the PageRank step where number is the number of iterations
- --norandomsurfer : disable the random surfer model (is on by default)
- --nonormalize : turn off normalization (this option is off by default)
Example: invert.exe --nonormalize --decay 0.93 --iterations 10
- Output Files:
- lookup_table.dat : a helper file needed for eval and search
- pagerank_scores.dat : binary formatted pagerank scores needed for eval and search
- postings.dat : a binary formatted postings file needed for eval and search
- pagerank_scores.txt : a readable text file with all the pagerank scores. Not needed by any other programs
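The --decay and --iterations options correspond to the two usual parameters of an iterative PageRank computation. The sketch below is one common formulation, not the project's actual code: it assumes that the decay value is the probability of following a link (the damping factor), that the link graph is given as adjacency lists, and that documents without outgoing links spread their score evenly. The function name pageRank and its signature are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// links[i] lists the documents that document i points to.
// Every index stored in links must be smaller than links.size().
std::vector<double> pageRank(const std::vector<std::vector<std::size_t>>& links,
                             double decay, int iterations) {
    const std::size_t n = links.size();
    if (n == 0) {
        return {};
    }
    // Start from a uniform distribution over all documents.
    std::vector<double> rank(n, 1.0 / static_cast<double>(n));
    for (int it = 0; it < iterations; ++it) {
        // Random-surfer term: with probability (1 - decay) jump anywhere.
        std::vector<double> next(n, (1.0 - decay) / static_cast<double>(n));
        for (std::size_t i = 0; i < n; ++i) {
            if (links[i].empty()) {
                // Dangling document: spread its score over all documents.
                for (std::size_t j = 0; j < n; ++j) {
                    next[j] += decay * rank[i] / static_cast<double>(n);
                }
            } else {
                // Split the score evenly among the outgoing links.
                const double share =
                    decay * rank[i] / static_cast<double>(links[i].size());
                for (std::size_t j : links[i]) {
                    next[j] += share;
                }
            }
        }
        rank = next;
    }
    return rank;
}
```

With this formulation the scores always sum to 1, so a higher --iterations value only changes how close the result is to convergence.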
The eval program goes through the query.text file and executes the search function on its queries. To run it, simply run the .exe
- Required Files (Must be in the same directory):
- lookup_table.dat : required
- pagerank_scores.dat : required
- postings.dat : required
- qrels.text : required
- query.text : required
- stopwords.txt : optional
- Command Line Options:
- --help : get a list of available command line options
- --ps : enable porter stemming
- --w1 number : specify the w1 value where number is the value
- --w2 number : specify the w2 value where number is the value
Example: eval3.exe --w1 0.7 --w2 0
- Output Files: None
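The README does not say which measures eval reports. As an illustration of how ranked results can be compared against the relevance judgments in qrels.text, the following sketch computes average precision for a single query. The function name and its inputs are hypothetical, and the ranked list is assumed to contain no duplicate document ids.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// ranked:   document ids in the order the search returned them.
// relevant: document ids judged relevant for the query.
double averagePrecision(const std::vector<int>& ranked,
                        const std::set<int>& relevant) {
    if (relevant.empty()) {
        return 0.0;
    }
    double sum = 0.0;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < ranked.size(); ++i) {
        if (relevant.count(ranked[i]) > 0) {
            ++hits;
            // Precision at this rank: relevant documents seen so far / rank.
            sum += static_cast<double>(hits) / static_cast<double>(i + 1);
        }
    }
    // Relevant documents that were never retrieved contribute zero.
    return sum / static_cast<double>(relevant.size());
}
```

Averaging this value over all queries in query.text gives mean average precision (MAP).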
The search program takes the user's search query and returns the relevant results. To run it, simply run the .exe
- Required Files (Must be in the same directory):
- lookup_table.dat : required
- pagerank_scores.dat : required
- postings.dat : required
- stopwords.txt : optional
- Command Line Options:
- --help : get a list of available command line options
- --ps : enable porter stemming
- --w1 number : specify the w1 value where number is the value
- --w2 number : specify the w2 value where number is the value
Example: search.exe --w1 0.7 --w2 0
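The README does not define what w1 and w2 weight. One plausible reading, given that PageRank scores are used for ranking, is that w1 scales a query-document similarity score and w2 scales the PageRank score. The sketch below shows that linear combination under this assumption; the struct, field, and function names are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-document scores for one query.
struct ScoredDoc {
    int docId;
    double similarity;  // assumed: text similarity to the query
    double pageRank;    // assumed: score read from pagerank_scores.dat
};

// Assumed combination: weighted sum of the two scores.
double combinedScore(const ScoredDoc& doc, double w1, double w2) {
    return w1 * doc.similarity + w2 * doc.pageRank;
}

// Return document ids ordered from highest to lowest combined score.
std::vector<int> rankDocuments(std::vector<ScoredDoc> docs, double w1, double w2) {
    std::stable_sort(docs.begin(), docs.end(),
                     [w1, w2](const ScoredDoc& a, const ScoredDoc& b) {
                         return combinedScore(a, w1, w2) > combinedScore(b, w1, w2);
                     });
    std::vector<int> ids;
    ids.reserve(docs.size());
    for (const ScoredDoc& doc : docs) {
        ids.push_back(doc.docId);
    }
    return ids;
}
```

Under this reading, the example --w1 0.7 --w2 0 would rank purely by similarity and ignore PageRank.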