Cranfield Collection Lucene SE

A search engine built on the Cranfield collection for the module "CS7IS3 Information Retrieval and Web Search". A report accompanies the project in the repository.

Similarities compared:

tfidf
boolean
bm25

Each similarity was run with both the StandardAnalyzer and a CustomAnalyzer.
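As a rough illustration of how the bm25 option differs from a plain tf-idf weight, here is a minimal sketch of the textbook BM25 term score in plain Java. It is not the project's code and not Lucene's implementation; the class name Bm25Sketch, the parameter values k1 = 1.2 and b = 0.75, and the example numbers are assumptions chosen for illustration.

```java
// Illustrative only: a textbook BM25 term score, not the project's code.
final class Bm25Sketch {
    // Assumed parameters; 1.2 and 0.75 are commonly used defaults.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Inverse document frequency: rarer terms receive larger weights. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Score contribution of one query term for one document. */
    static double score(double termFreq, double docLen, double avgDocLen,
                        long docCount, long docFreq) {
        // Documents longer than average get a larger normaliser, hence a lower score.
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        // The term-frequency part saturates: it can never exceed K1 + 1.
        double tfPart = termFreq * (K1 + 1.0) / (termFreq + K1 * lengthNorm);
        return idf(docCount, docFreq) * tfPart;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 1400 documents, the term occurs in 50 of them.
        double shortDoc = score(3, 80, 160, 1400, 50);
        double longDoc = score(3, 320, 160, 1400, 50);
        System.out.printf("short=%.4f long=%.4f%n", shortDoc, longDoc);
    }
}
```

With the same term frequency, the shorter document scores higher, and repeating a term many times yields diminishing returns. Those two properties are what separate BM25 from an unbounded tf weight.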

Running Project

Grant execute permission to the bash script trecEval.sh, which automatically unzips trec_eval.zip, builds the Java Lucene project, and runs trec_eval on the results.

git clone https://github.com/QUzair/LuceneSE.git

cd LuceneSE

chmod u+x trecEval.sh

./trecEval.sh

Files In Project

Main Classes:

CranFileParser.java

Parses the Cranfield documents file and indexes it with the specified analyzer.

CranfieldQueries

Parses the Cranfield queries file and creates the DocRank results for the queries.

CranfieldModel

Basic model for the fields of a Cranfield document (id, title, author, biblio, content).

PersonalQueries

Class for running custom queries against the created index.

Main

Main class, which indexes and searches with the different analyzers and similarity classes.
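The five fields listed for CranfieldModel correspond to the markers used in the Cranfield file format: .I (id), .T (title), .A (author), .B (bibliography) and .W (content). The following is a minimal sketch of splitting such text into per-document field maps. It is not the project's CranFileParser; the class name CranParseSketch, the map keys, and the sample text in main are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a plain-Java reader for Cranfield-style markers.
final class CranParseSketch {

    /** Splits Cranfield-formatted text into one field map per document. */
    static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> docs = new ArrayList<>();
        Map<String, String> current = null;
        String field = null;
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (line.startsWith(".I ")) {
                // A new document starts; the id follows the marker on the same line.
                current = new LinkedHashMap<>();
                current.put("id", line.substring(3).trim());
                docs.add(current);
                field = null;
            } else if (current != null && fieldFor(trimmed) != null) {
                // A field marker on its own line switches the active field.
                field = fieldFor(trimmed);
                current.put(field, "");
            } else if (current != null && field != null) {
                // Any other line is appended to the active field.
                String soFar = current.get(field);
                current.put(field, soFar.isEmpty() ? trimmed : soFar + " " + trimmed);
            }
        }
        return docs;
    }

    /** Maps a marker line to a field name, or null if the line is not a marker. */
    static String fieldFor(String marker) {
        switch (marker) {
            case ".T": return "title";
            case ".A": return "author";
            case ".B": return "biblio";
            case ".W": return "content";
            default: return null;
        }
    }

    public static void main(String[] args) {
        // Made-up sample in the same layout as the collection file.
        String sample = ".I 1\n.T\nan example title\n.A\nsmith,j.\n.B\nsome journal, 1958.\n.W\nexample body text\n";
        System.out.println(parse(sample));
    }
}
```

Each resulting map could then be turned into one index document, with one indexed field per key.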

Within cran folder:

cran.all.1400

Contains the 1400 documents of the Cranfield Collection.

cran.qry

Queries used to test our implementation of the search engine with trec_eval.

QRelsCorrectedforTRECeval

Relevance judgments (RelDocs) used to evaluate our own search results.

Output/Other files:

similarityFiles

Holds the 'DocRanks' result files produced by our scoring with bm25, boolean and tfidf (the folder is spelled simiarlityFiles in the repository).

trecEval.sh

Bash script that unzips trec_eval.zip and builds trec_eval with make, builds the Java Lucene project, and runs trec_eval on the generated similarityFiles (which contain the 'DocRanks') against QRelsCorrectedforTRECeval.

stopWords.txt

List of stopwords taken from https://www.ranks.nl/stopwords
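For trec_eval to read the 'DocRanks' files, each line of a results file has to follow the six-column layout: query id, the literal Q0, document id, rank, score, and a run tag. A minimal sketch of producing one such line is shown here; the class name TrecRunLineSketch, the six-decimal score format, and the run tag value are assumptions, not taken from the project.

```java
import java.util.Locale;

// Illustrative only: formats ranked hits in the layout trec_eval reads.
final class TrecRunLineSketch {

    /** Formats one ranked hit as: queryId Q0 docId rank score runTag. */
    static String format(int queryId, String docId, int rank, double score, String runTag) {
        // Locale.ROOT keeps the decimal separator a dot on every machine.
        return String.format(Locale.ROOT, "%d Q0 %s %d %.6f %s",
                queryId, docId, rank, score, runTag);
    }

    public static void main(String[] args) {
        // Hypothetical hit: query 1, document 184, ranked first by a bm25 run.
        System.out.println(format(1, "184", 1, 12.5, "bm25"));
    }
}
```

The relevance judgments file uses a related four-column layout (query id, iteration, document id, relevance), which is why the two files can be compared directly by trec_eval.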

Custom Analyzer

A basic custom analyzer that lower-cases tokens, strips possessives, applies stemming filters, and then removes the stopwords taken from https://www.ranks.nl/stopwords

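// Start the chain by lower-casing every token produced by the tokenizer (source).
TokenStream tokenStream = new LowerCaseFilter(source);

// Strip possessives, then apply the three stemming filters in sequence.
tokenStream = new EnglishPossessiveFilter(tokenStream);
tokenStream = new PorterStemFilter(tokenStream);
tokenStream = new EnglishMinimalStemFilter(tokenStream);
tokenStream = new KStemFilter(tokenStream);

// Load the ranks.nl stop words. Fall back to an empty set if the file
// cannot be read, so that StopFilter is never handed a null set.
CharArraySet newStopSet = CharArraySet.EMPTY_SET;
try {
    newStopSet = StopFilter.makeStopSet(getStopWords());
} catch (IOException e) {
    e.printStackTrace();
}

// Remove stop words last, then hand the finished chain back to Lucene.
tokenStream = new StopFilter(tokenStream, newStopSet);
return new TokenStreamComponents(source, tokenStream);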
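To make the effect of the filter chain easier to picture without a Lucene dependency, here is a simplified plain-Java sketch of three of its steps: lower-casing, possessive stripping, and stop-word removal. It deliberately leaves out the stemming filters and assumes whitespace tokenisation; the class name AnalyzerOrderSketch and the sample inputs are hypothetical, and the behaviour only approximates the Lucene filters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a rough stand-in for part of the analyzer chain.
final class AnalyzerOrderSketch {

    /** Lower-cases, strips a trailing possessive 's, then drops stop words. */
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            // Lower-case first, so later comparisons are case-insensitive.
            String token = raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9']", "");
            // Possessive stripping: "wing's" becomes "wing".
            if (token.endsWith("'s")) {
                token = token.substring(0, token.length() - 2);
            }
            // Stop-word removal runs last, on the already normalised token.
            if (!token.isEmpty() && !stopWords.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "of", "a"));
        System.out.println(analyze("The wing's boundary layer", stops));
    }
}
```

Because stop-word removal comes last in the chain, the stop list is matched against tokens that have already been normalised. In the real analyzer that includes stemming, so stop words whose stemmed form differs from their entry in the list may pass through.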

Results

Similarity	StandardAnalyzer	CustomAnalyzer
tfidf	0.1557	0.2796
boolean	0.1782	0.2781
bm25	0.2864	0.3375

bm25 combined with the CustomAnalyzer gives the best result (0.3375), and the CustomAnalyzer improves on the StandardAnalyzer for all three similarities.
