Index Search Query

Inverted Index, Query Formulation and Ranking from Scratch in Python.

Part of Information Retrieval Lab (Autumn 2017-18)

Part 1: The Inverted Index

Dataset

The dataset is taken from the FIRE 2011 corpus and can be downloaded from here. It contains articles from two different magazines. The methods for handling these files are in the magazine_index package.

Usage

To index all the files under a directory recursively, use the following command -

$ python lab_1.py path/to/files

This creates an inverted index and saves it to a file called index.bin. If that file already exists, you can load it directly by running the script without any argument -

$ python lab_1.py
Loading index from "index.bin"
<Index documents=392577 words=105314026>
...
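
The indexing step described above can be sketched roughly as follows. This is a minimal illustration, not the code in the index package: the function names `build_index`, `save_index` and `load_index`, the regex tokenizer, and the use of pickle for index.bin are all assumptions made for the example.

```python
import os
import pickle
import re
from collections import Counter, defaultdict

# Assumption: words are runs of word characters, compared case-insensitively.
TOKEN_RE = re.compile(r"\w+")


def build_index(root):
    """Map each word to {file name: occurrence count} for every file under root."""
    index = defaultdict(dict)
    # os.walk visits root and all of its subdirectories.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                counts = Counter(TOKEN_RE.findall(handle.read().lower()))
            # Documents are keyed by bare file name, so names must be unique.
            for word, count in counts.items():
                index[word][name] = count
    return dict(index)


def save_index(index, path):
    """Serialize the index to disk (hypothetical format: a pickled dict)."""
    with open(path, "wb") as handle:
        pickle.dump(index, handle)


def load_index(path):
    """Load an index previously written by save_index."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

With these helpers, `save_index(build_index("path/to/files"), "index.bin")` would mirror the first command, and `load_index("index.bin")` the second.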

Using a pre-built index

Since indexing documents can take a long time, here are some pre-built index files which can be renamed to index.bin and used directly -

| Name | Link | Size | Comments |
| --- | --- | --- | --- |
| `index.bin` | LINK | 478 MB | Full index, 392k documents |
| `index.bin.bak1` | LINK | 374 MB | 303k documents |
| `index.bin.bak` | LINK | 36 MB | 25.8k documents |

Example

$ python lab_1.py
Loading index from "index.bin"
<Index documents=303290 words=83225120>
Please start entering words to get top 5 documents containing them (CTRL+C to exit) -
Enter word: market
[('1100110_calcutta_story_11965855.utf8', 58), ('1070603_calcutta_story_7858507.utf8', 31), ('1100326_opinion_story_12251777.utf8', 30), ('1050912_frontpage_story_5227346.utf8', 30), ('1040406_opinion_story_2948544.utf8', 29)]
Enter word: delhi
[('1080422_sports_ipl.utf8', 30), ('1031223_opinion_story_2710457.utf8', 28), ('1090225_sports_story_10587273.utf8', 22), ('1090812_sports_story_11351508.utf8', 21), ('1100223_sports_story_12140507.utf8', 21)]
Enter word: messi
[('1100612_sports_story_12557276.utf8', 27), ('1100527_sports_story_12492679.utf8', 17), ('1100619_sports_story_12582889.utf8', 17), ('1090529_calcutta_story_11031479.utf8', 16), ('1100613_frontpage_story_12560387.utf8', 12)]
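
The session above prints (file name, count) pairs in descending order of count. A lookup of that shape could be sketched as below. The index layout (word mapped to a dict of per-file counts) and the name `top_documents` are assumptions for illustration; the actual lab_1.py may differ.

```python
import heapq


def top_documents(index, word, k=5):
    """Return up to k (file name, count) pairs for word, highest count first."""
    postings = index.get(word.lower(), {})
    # nlargest avoids sorting the full postings list when only k items are needed.
    return heapq.nlargest(k, postings.items(), key=lambda item: item[1])


# Toy index with made-up file names and counts.
toy_index = {
    "market": {"a.utf8": 3, "b.utf8": 7, "c.utf8": 1},
}
print(top_documents(toy_index, "market", k=2))
# → [('b.utf8', 7), ('a.utf8', 3)]
```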

Part 2: Ranking of Documents

Usage

You can use the pre-built index here.

$ python lab_2.py index.bin
Loading index from index.bin
Enter query: programming
en.15.66.21.2008.5.9 : 97.3490637742472
en.3.347.409.2010.2.2 : 79.64923399711134
en.3.373.142.2007.6.8 : 53.09948933140756
en.3.406.410.2007.11.18 : 48.6745318871236
en.2.296.350.2010.1.20 : 48.6745318871236
en.3.393.372.2007.9.23 : 44.24957444283963
en.3.321.344.2009.7.31 : 44.24957444283963
en.3.373.299.2007.6.11 : 44.24957444283963
en.15.109.486.2009.4.1 : 44.24957444283963
en.3.393.75.2007.9.24 : 44.24957444283963
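
This README does not state which scoring formula the ranking package uses. As one common possibility, the sketch below scores documents with a plain TF-IDF sum over the query terms. The function name `rank`, the index layout, and the natural-log IDF are assumptions, not the project's confirmed method.

```python
import math
from collections import defaultdict


def rank(index, total_documents, query, k=10):
    """Score documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # Terms missing from the index contribute nothing.
        # idf grows as the term appears in fewer documents.
        idf = math.log(total_documents / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy index with made-up document ids and term frequencies.
toy_index = {
    "programming": {"doc1": 4, "doc2": 1},
    "python": {"doc2": 3},
}
for doc_id, score in rank(toy_index, total_documents=4, query="python programming"):
    print(doc_id, ":", score)
```

Here doc2 ranks above doc1 because it matches the rarer term "python", which carries a higher IDF weight.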

Development

This project uses pipenv. To install it -

$ sudo -H pip install pipenv

To install dependencies, run -

$ pipenv install

To enter a virtualenv shell -

$ pipenv shell

This will spawn a new shell where all dependencies will be present.
