A simple search engine for UCI's ICS web pages.
This search engine was a group assignment for a class at UCI in Winter 2021. We were given a large corpus (roughly 56000 web pages).
- Uses an inverted index containing tf-idf scores
- No databases! The inverted index is not loaded in memory. It is kept in a text file.
- Query results are retrieved in under 300 ms
- Python 3.6+
- Libraries - Beautiful Soup 4, nltk
- Flask
- HTML
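As an illustration of the tf-idf inverted index mentioned in the feature list, here is a minimal sketch. It is not the project's actual indexer: the `build_index` function, the whitespace tokenizer, and the log-scaled weighting are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build {term: {doc_id: tf-idf score}} from {doc_id: text}.

    Hypothetical helper for illustration only.
    """
    # Count terms per document (whitespace tokenization is a simplification).
    term_counts = {
        doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()
    }

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        for term, count in counts.items():
            tf = 1 + math.log10(count)                 # log-scaled term frequency
            idf = math.log10(n_docs / doc_freq[term])  # inverse document frequency
            index[term][doc_id] = tf * idf
    return dict(index)


# Toy corpus standing in for the real web pages.
docs = {
    1: "informatics research at uci",
    2: "computer science research",
    3: "uci computer science department",
}
index = build_index(docs)
```

Each term maps to the documents that contain it, so a query only needs the postings for its own terms. Rarer terms such as "informatics" receive a higher score than terms that appear in most documents.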
To create a local copy and run the program, follow these steps on Windows.
You first need to obtain the course's corpus file and extract it. There should be fewer than 56000 files after extraction, totaling about 3 GB of disk space.
You may also need to install the libraries listed above (Beautiful Soup 4, nltk, and Flask) if you have not used them before. Follow each library's own installation instructions.
- Clone the repo
git clone https://github.com/lilwon/ICS_Search_Engine.git
- Run the indexer in PowerShell
py -3 inverted_index.py
- Wait for the indexer to finish creating the inverted index. This takes about 20 minutes.
- Run the search retrieval in PowerShell
py -3 search_component.py
- (Optional) You can also use the search retrieval in a web browser
py -3 webgui.py
- (Optional) While webgui.py is running, open a browser and paste the following URL into your address bar: http://127.0.0.1:5000/
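The search step reads from an index kept in a text file rather than in memory. One common way to do this is to record the byte offset of each term's line and seek straight to it at query time. The sketch below assumes that approach; the line format, the `write_index` and `lookup` helpers, and the file name are hypothetical and may differ from what search_component.py actually does.

```python
import json
import os
import tempfile


def write_index(index, path):
    """Write one 'term<TAB>JSON postings' line per term.

    Returns {term: byte offset of its line}. Hypothetical format.
    """
    offsets = {}
    with open(path, "wb") as f:
        for term in sorted(index):
            offsets[term] = f.tell()  # where this term's line starts
            line = f"{term}\t{json.dumps(index[term])}\n"
            f.write(line.encode("utf-8"))
    return offsets


def lookup(term, path, offsets):
    """Read only the line for `term` by seeking to its recorded offset."""
    if term not in offsets:
        return {}
    with open(path, "rb") as f:
        f.seek(offsets[term])
        line = f.readline().decode("utf-8")
    _, postings = line.rstrip("\n").split("\t", 1)
    return json.loads(postings)


# Toy postings {term: {doc_id: score}}; the scores are made up for illustration.
toy_index = {"research": {"1": 0.18, "2": 0.18}, "uci": {"1": 0.18, "3": 0.18}}
index_path = os.path.join(tempfile.mkdtemp(), "index.txt")
offsets = write_index(toy_index, index_path)
result = lookup("uci", index_path, offsets)
```

Only the small offsets table stays in memory, and each query term costs one seek and one line read, which is one way to keep retrieval fast without a database.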
See the contributors section on the side of this GitHub page.