nyu-dl/dl4ir-searchQA


SearchQA

Associated paper:
https://arxiv.org/abs/1704.05179

The raw, split, and processed files are available here: https://drive.google.com/drive/u/2/folders/1kBkQGooNyG0h8waaOJpgdGtOnlb1S649


The original JSON files can be collected through web search using the scripts in qacrawler. Refer to the README in that folder for details on how to use the scraper; the files in the test folder can be used to try it out. The link above also contains the original JSON files that were collected using the Jeopardy! dataset.

There are also stat files that give the number of snippets found for the question associated with each filename. This number can range from 0 to 100. For some questions the crawler was set to collect the first 50 snippets, and for others the first 100. When the search does not return enough results to reach this limit, the available ones are collected. During training we ignored all files that contain 40 or fewer snippets to eliminate possible trivial cases. The training data also ignores snippets from the 51st onward.
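
The snippet filtering described above can be sketched roughly as follows. This is only an illustration, not the script used to build the released training files: the function name `filter_questions`, the constants, and the in-memory layout (a dict mapping a question id to its list of snippets) are assumptions made for the example, and the actual on-disk format of the JSON and stat files is not reproduced here.

```python
from typing import Dict, List

# Assumed constants, taken from the description above.
MIN_SNIPPETS = 41  # questions with 40 or fewer snippets are dropped
MAX_SNIPPETS = 50  # snippets from the 51st onward are ignored


def filter_questions(
    snippets_by_question: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Drop questions with too few snippets and truncate the rest.

    `snippets_by_question` is a hypothetical in-memory representation:
    question id -> snippets in the order the search returned them.
    """
    kept = {}
    for question_id, snippets in snippets_by_question.items():
        if len(snippets) < MIN_SNIPPETS:
            continue  # 40 or fewer snippets: treated as a possibly trivial case
        kept[question_id] = snippets[:MAX_SNIPPETS]  # keep only the first 50
    return kept
```

With this sketch, a question that has 100 crawled snippets would contribute only its first 50, and a question with exactly 40 snippets would be excluded.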

And here is the link for the Jeopardy! files themselves:
https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

NOTE: We will release the script that converts these to the training files above with appropriate restrictions.


Some requirements:
nltk==3.2.1
pandas==0.18.1
selenium==2.53.6
pytest==3.0.2
pytorch==0.1.11
