TTDS coursework 3:
- Data
Link to data: ...
Link to glove embeddings: http://nlp.stanford.edu/data/glove.6B.zip
URL format to papers: https://arxiv.org/abs/{ID}
e.g. https://arxiv.org/abs/1901.00001
- Please use the 'preprocessed' version of the data. Each file is an ordered dict in JSON format; a data sample looks like this:
# format:
ID: {title (str), authors (list(str)), abstract (str), primary subject (dict), subjects (dict)}
# example:
{
"1901.00001": {"title": "some text ...",
"authors": ["author1", ...],
"abs": "some text ...",
"1st_subj": {"cs.CV": "Computer Vision and Pattern Recognition"},
"subjs": {"cs.CV": "Computer Vision and Pattern Recognition", "cs.LG": "Machine Learning", "stat.ML": "Machine Learning"}},
...
}
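A minimal sketch of reading one preprocessed file while keeping the papers in file order, and of building the paper URL from an ID. The helper names load_papers and paper_url, the file name sample.json, and the placeholder record are assumptions made for illustration; only the field names and the URL format come from the description above.

```python
import json
import os
import tempfile
from collections import OrderedDict


def load_papers(path):
    """Load one preprocessed file, keeping the papers in file order."""
    with open(path, encoding="utf-8") as f:
        return json.load(f, object_pairs_hook=OrderedDict)


def paper_url(paper_id):
    """Build the arXiv abstract URL for a paper ID."""
    return "https://arxiv.org/abs/{}".format(paper_id)


# Tiny stand-in file in the format shown above (placeholder content).
sample = {
    "1901.00001": {
        "title": "some text",
        "authors": ["author1"],
        "abs": "some text",
        "1st_subj": {"cs.CV": "Computer Vision and Pattern Recognition"},
        "subjs": {"cs.CV": "Computer Vision and Pattern Recognition",
                  "cs.LG": "Machine Learning"},
    }
}
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(sample_path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

papers = load_papers(sample_path)
for paper_id, paper in papers.items():
    # subjs is a dict of subject code -> subject name
    print(paper_url(paper_id), paper["title"], list(paper["subjs"]))
```

Since Python 3.7 a plain dict also preserves insertion order, so object_pairs_hook is only needed if the rest of the code relies on OrderedDict specifically.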
- Import Our Own Modules
# example to use the Normaliser class in normalise.py in the folder preprocess
import sys
sys.path.append('..') # append the main directory path
from preprocess.normalise import Normaliser
To use the Normaliser, simply:
norm = Normaliser()
...
# get tokens from the raw text
clean_text = norm.normalise_text(text)
- Testing
- To test the application, create a folder called data in the same directory that "search-engine" is in. Place two files in the data folder: test.json and glove.6B.50d.txt
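As a rough sketch of how the GloVe file might be read once it is in place: each line of the GloVe text format is a word followed by its space-separated vector values. The function name load_glove and the small demo file are assumptions for illustration; the demo uses made-up 3-dimensional vectors, whereas glove.6B.50d.txt has 50 values per line.

```python
import os
import tempfile


def load_glove(path):
    """Read GloVe text vectors into a dict: word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Stand-in file with made-up values, written to a temporary folder.
demo_path = os.path.join(tempfile.mkdtemp(), "glove.demo.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("search 0.1 0.2 0.3\nengine 0.4 0.5 0.6\n")

glove = load_glove(demo_path)
print(len(glove), glove["search"])
```

For the real file, the same function would be called with the path to glove.6B.50d.txt inside the data folder.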
- Search Types
- General Search (title + author + abs + subjs)
- Title Search (title)
- Author Search (author)
- Abstract Search (abs + subjs)
- Fuzzy Search
- Language Model for Similarity Search
- Spelling Correction
- ...
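One possible way to express the four search types is a table from search type to the fields it covers. This is only a sketch: the dictionary keys, the helper searchable_text, and the placeholder paper are assumptions, and the field names follow the data sample ("authors", "abs", "subjs").

```python
# Fields searched by each search type, as listed above.
SEARCH_FIELDS = {
    "general": ["title", "authors", "abs", "subjs"],
    "title": ["title"],
    "author": ["authors"],
    "abstract": ["abs", "subjs"],
}


def searchable_text(paper, search_type):
    """Concatenate the text of the fields used by the given search type."""
    chunks = []
    for field in SEARCH_FIELDS[search_type]:
        value = paper.get(field, "")
        if isinstance(value, dict):      # subjs: code -> subject name
            chunks.extend(value.values())
        elif isinstance(value, list):    # authors: list of names
            chunks.extend(value)
        else:                            # title, abs: plain strings
            chunks.append(value)
    return " ".join(chunks)


# Placeholder paper, not real data.
paper = {
    "title": "a title",
    "authors": ["author1", "author2"],
    "abs": "an abstract",
    "subjs": {"cs.LG": "Machine Learning"},
}
print(searchable_text(paper, "abstract"))
```

Keeping the mapping in one place means a new search type only needs a new entry in SEARCH_FIELDS.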
You are not allowed to use search engine libraries that do the indexing or searching. However, you are allowed to use libraries for secondary features, such as autocomplete or spelling correction.
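Because indexing and searching have to be written by hand, a minimal positional inverted index using only the standard library could look like the sketch below. It is an illustration, not the project's actual index: the tokenise function is a stand-in for the Normaliser, the function names are hypothetical, and the documents (including the IDs after the first) are made up.

```python
import re
from collections import defaultdict


def tokenise(text):
    """Stand-in tokeniser: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenise(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def search_all_terms(index, query):
    """Return the sorted IDs of documents containing every query term."""
    terms = tokenise(query)
    if not terms:
        return []
    result = set(index.get(terms[0], {}))
    for term in terms[1:]:
        result &= set(index.get(term, {}))
    return sorted(result)


# Made-up documents for illustration.
docs = {
    "1901.00001": "Deep learning for image retrieval",
    "1901.00002": "Image segmentation with graph cuts",
    "1901.00003": "Learning to rank for retrieval",
}
index = build_index(docs)
print(search_all_terms(index, "image retrieval"))
```

Storing positions alongside document IDs costs extra memory, but it leaves room for phrase or proximity matching later without rebuilding the index.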
- Do not use DICE! (you don't have root access)
- pip install firebase_admin
- Download the json public key file shared in the groupchat
- Configure your local MongoDB: https://www.runoob.com/mongodb/mongodb-osx-install.html
- Use Python to operate on your local MongoDB: https://api.mongodb.com/python/current/tutorial.html
Link to data: https://drive.google.com/open?id=1szTszClFYDPWeUX9P_NizAKVXhG7UffM
Group ID: 2
Number of students: 5
Positive:
- Nice and clear report
- The idea is very interesting
- Having evaluation section
Negative:
Comments:
Excellent work.
There are other features that could be added such as author suggestion in the query.
It might be useful to consider a graphical explanation of the implemented system (i.e. to show the different modules and how they interact with each other).
Well done!
Overall group mark (15): 13