A simple text search engine in python that uses vector space model.

nilayjain/text-search-engine

This search engine uses the vector space model. The corpus consists of around 1550 documents and is attached with the assignment. From the directory that this readme file is in, run the following command:
	
	python ir.py
	
The script takes a few seconds to run.

Type in your query when prompted on the console. The ranked results are displayed with the docid and weight of each document. Entering an empty query (i.e. just pressing Enter) exits the program.
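The prompt-and-exit behaviour described above can be sketched as follows. This is an illustration only and is not taken from ir.py: the function name run_console, the prompt text, and the search callback are all hypothetical, and Python 3 is assumed. Only the output line format is copied from the test cases in this readme.

```python
def run_console(search, read=input, write=print):
    """Read queries until an empty one is entered (hypothetical helper).

    search: callable taking a query string and returning (docid, weight)
            pairs already sorted in the order they should be printed.
    read / write: injectable so the loop can be exercised without a terminal.
    Returns the number of non-empty queries that were answered.
    """
    answered = 0
    while True:
        query = read("Enter your query: ").strip()
        if not query:
            # An empty query (just pressing Enter) ends the session.
            break
        for docid, weight in search(query):
            write(f"The docid is {docid} and the weight is {weight}")
        answered += 1
    return answered
```

Passing read and write as parameters is a design choice made here for testability; calling run_console(some_search_function) with the defaults uses the real console.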

Source of the dataset: http://qwone.com/~jason/20Newsgroups/
(20 Newsgroups sorted by date; duplicates and some headers removed.) Our corpus consists of 1500 documents.

You can also run the code on other documents by pasting them into the corpus folder and renaming them as doc****, where **** is a 4-digit number.
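
Renaming many documents by hand is tedious, so here is a small sketch that copies files into a corpus folder under the doc**** naming scheme. It is not part of this repository: the function name copy_into_corpus and its parameters are hypothetical, Python 3 is assumed, and the sketch assumes that the names carry no file extension. Choose a start number that does not collide with documents already in the corpus.

```python
import shutil
from pathlib import Path

def copy_into_corpus(src_dir, corpus_dir, start=1):
    """Copy every file in src_dir into corpus_dir as docNNNN (hypothetical helper).

    Files are taken in sorted name order and numbered from `start`.
    Returns the list of new file names that were written.
    """
    corpus = Path(corpus_dir)
    corpus.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    written = []
    for number, path in enumerate(files, start=start):
        if number > 9999:
            raise ValueError("doc**** only has room for a 4-digit number")
        new_name = f"doc{number:04d}"  # zero-padded, e.g. doc0075
        target = corpus / new_name
        if target.exists():
            raise FileExistsError(f"{new_name} already exists in the corpus")
        shutil.copy(path, target)
        written.append(new_name)
    return written
```

The sketch copies rather than moves, so the original files stay untouched if something goes wrong partway through.
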
NOTE: 1. Because we have used the vector space model, which produces a ranked list of documents rather than a fixed set of matches, we do not report precision and recall.
      2. The output is printed in ascending order of cosine score (ascending order of relevance), so the most relevant result is the last line printed and can be seen on that page itself. (You won't have to scroll up to get to the first result.)
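
To make the ranking described in the notes above concrete, here is a minimal sketch of cosine scoring in the vector space model with results returned in ascending order. It is not the code in ir.py: the function names, the whitespace tokenizer, and the log-tf times idf weighting are assumptions chosen for illustration (one common variant among several), and Python 3 is assumed.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and split on whitespace; no stemming or stop words.
    return text.lower().split()

def weigh(term_counts, idf):
    # Log-scaled term frequency times inverse document frequency.
    return {t: (1 + math.log(c)) * idf[t] for t, c in term_counts.items()}

def norm(vec):
    return math.sqrt(sum(w * w for w in vec.values()))

def build_index(docs):
    """docs maps docid -> text. Returns (idf, unit-length document vectors)."""
    counts = {docid: Counter(tokenize(text)) for docid, text in docs.items()}
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())  # each document counts once per term
    idf = {t: math.log(len(docs) / n) for t, n in df.items()}
    vectors = {}
    for docid, term_counts in counts.items():
        vec = weigh(term_counts, idf)
        length = norm(vec)
        vectors[docid] = {t: w / length for t, w in vec.items()} if length else {}
    return idf, vectors

def rank(query, idf, vectors):
    """Return (docid, cosine score) pairs, least relevant first."""
    known = Counter(t for t in tokenize(query) if t in idf)
    q_vec = weigh(known, idf)
    q_len = norm(q_vec)
    if not q_len:
        return []
    scored = []
    for docid, vec in vectors.items():
        score = sum(w * vec.get(t, 0.0) for t, w in q_vec.items()) / q_len
        if score > 0:
            scored.append((docid, score))
    # Ascending sort puts the best match on the last line of the output.
    return sorted(scored, key=lambda pair: pair[1])
```

With this sketch, rank(query, idf, vectors)[-1] is the most relevant document, which mirrors the "last line is the best result" convention used in the test cases below.
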


TEST CASES:

1. query : hello world

Last 10 lines of Output: 

The docid is 840 and the weight is 0.0123868243244
The docid is 414 and the weight is 0.0125227040535
The docid is 795 and the weight is 0.0126194983527
The docid is 075 and the weight is 0.0127447172068
The docid is 446 and the weight is 0.0128253567852
The docid is 442 and the weight is 0.0131724993109
The docid is 336 and the weight is 0.0131857066489
The docid is 339 and the weight is 0.0133365372412
The docid is 293 and the weight is 0.0137057664574
The docid is 828 and the weight is 0.0171555644517
____________________________________

2. query : help me

Last 10 lines of Output:

The docid is 222 and the weight is 0.064405360291
The docid is 714 and the weight is 0.0645867135002
The docid is 717 and the weight is 0.0654263948886
The docid is 152 and the weight is 0.0670786638134
The docid is 583 and the weight is 0.0719265769142
The docid is 858 and the weight is 0.0886557609352
The docid is 431 and the weight is 0.092779776323
The docid is 197 and the weight is 0.0941716886973
The docid is 313 and the weight is 0.115243508121
The docid is 322 and the weight is 0.129980295508
_______________________________________

3. query : please give us full marks

Last 10 lines of Output:

The docid is 359 and the weight is 0.0304982690928
The docid is 903 and the weight is 0.0342245838741
The docid is 908 and the weight is 0.0370779530946
The docid is 029 and the weight is 0.0377193952148
The docid is 379 and the weight is 0.0389999670468
The docid is 079 and the weight is 0.0443838393164
The docid is 1219 and the weight is 0.0458483706654
The docid is 319 and the weight is 0.0515440732627
The docid is 330 and the weight is 0.0539118537392
The docid is 1501 and the weight is 0.0554254846328

                                                                   THANK YOU
