indexer.py

indexer.py uses a SAX parser to parse the XML data.
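
As a rough illustration of SAX-style parsing, the sketch below uses Python's standard `xml.sax` module to collect titles and text from a dump. The class name `PageHandler`, the helper `parse_pages`, and the element names `page`, `title`, and `text` are assumptions made for this example; they are not taken from indexer.py.

```python
import xml.sax


class PageHandler(xml.sax.ContentHandler):
    """Collects (title, text) pairs from <page> elements (element names assumed)."""

    def __init__(self):
        super().__init__()
        self.pages = []
        self._tag = None
        self._title = []
        self._text = []

    def startElement(self, name, attrs):
        self._tag = name
        if name == "page":
            # A new page starts: reset the per-page buffers.
            self._title, self._text = [], []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self._tag == "title":
            self._title.append(content)
        elif self._tag == "text":
            self._text.append(content)

    def endElement(self, name):
        if name == "page":
            title = "".join(self._title).strip()
            text = "".join(self._text).strip()
            self.pages.append((title, text))
        self._tag = None


def parse_pages(xml_string):
    """Parse an XML string and return a list of (title, text) tuples."""
    handler = PageHandler()
    xml.sax.parseString(xml_string.encode("utf-8"), handler)
    return handler.pages
```

Because SAX streams the document as events instead of building a full tree, memory use stays small even for large input files.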

How to run --> python3 indexer.py

-> All the XML data to be parsed must be inside a folder named Folder.
-> The files are fed into the XML parser, and then the text is preprocessed:

1. Tokenization
2. Stop Words Removal
3. Stemming (the stemming steps have been commented in the code)

After preprocessing, the Links, Body, Info, Categories, References, and Title fields are extracted using the appropriate regular expressions.
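
The preprocessing and extraction steps above could look roughly like the following sketch. The stop-word list, the function names, and the category pattern are all hypothetical and are not the expressions used in indexer.py. Stemming is left out, and the regex is applied to the raw text before tokenization, which is an ordering chosen for this example only.

```python
import re

# A tiny illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

# Assumed pattern for category links such as [[Category:Physics]].
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def extract_categories(text):
    """Return preprocessed tokens from every category link in the text."""
    tokens = []
    for name in CATEGORY_RE.findall(text):
        tokens.extend(remove_stop_words(tokenize(name)))
    return tokens
```

The other fields (Links, Body, Info, References, Title) would follow the same shape, each with its own pattern.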

-> All files created by running indexer.py are placed inside the files folder.

-> Files Produced:

    title.txt : Consists of the id-title mapping.
    titleOffset.txt : Offsets for title.txt.
    vocab.txt : Contains every word, the file number in which the word can be found, and its document frequency.
    offset.txt : Offsets for vocab.txt.
    supu.txt : Offsets for the various field files.
    inverted_index(file_Number).txt : Temporary inverted index files, one created for every input file.
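
To show why a data file is paired with an offset file, here is a minimal sketch that records the byte offset at which each line starts, then uses that offset to jump straight to a line without scanning the file. The helper names and the "id title" line layout are assumptions for this example; the actual file formats written by indexer.py may differ.

```python
import os
import tempfile


def write_with_offsets(lines, data_path, offset_path):
    """Write each line to data_path and its starting byte offset to offset_path."""
    with open(data_path, "wb") as data, open(offset_path, "w") as offsets:
        for line in lines:
            offsets.write(f"{data.tell()}\n")
            data.write((line + "\n").encode("utf-8"))


def read_line_at(index, data_path, offset_path):
    """Fetch the line with the given index by seeking straight to its offset."""
    with open(offset_path) as offsets:
        positions = [int(p) for p in offsets]
    with open(data_path, "rb") as data:
        data.seek(positions[index])
        return data.readline().decode("utf-8").rstrip("\n")


if __name__ == "__main__":
    # Demo with throwaway files; the line layout "id title" is assumed.
    folder = tempfile.mkdtemp()
    data_path = os.path.join(folder, "title.txt")
    offset_path = os.path.join(folder, "titleOffset.txt")
    write_with_offsets(["0 Alpha", "1 Beta", "2 Gamma"], data_path, offset_path)
    print(read_line_at(1, data_path, offset_path))
```

Offsets are measured in bytes, so the data file is opened in binary mode; this keeps `seek` correct when titles contain non-ASCII characters.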

One can learn more about the regular expressions used in the code from the YouTube link below: https://www.youtube.com/watch?v=K8L6KVGG-7o

swapnil-satpathy/Wikipedia-Search-Engine

Folders and files

Latest commit

History

Repository files navigation

indexer.py

About

Topics

Resources

Stars

Watchers

Forks

Languages