SetExpan: Corpus-based Set Expansion Framework

Update (2018-09-19)

  1. Added the apr dataset to the Google Drive link below and moved the wiki queries and ground-truth sets into that dataset.

Update (2018-09-07)

  1. We added the original EgoSet dataset under the "./data/" folder for reference.
  2. A new (slightly different) version of SetExpan, used in HiExpan, is available at: https://github.com/mickeystroller/HiExpan/tree/master/src/SetExpan-new, together with an easier-to-use data preprocessing pipeline.

Introduction

This is the source code for the SetExpan framework, developed for corpus-based set expansion (i.e., finding the "complete" set of entities belonging to the same semantic class, based on a given corpus and a tiny set of seeds).
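To make the task concrete, here is a toy sketch of corpus-based set expansion. It is not the SetExpan algorithm: it simply ranks candidate entities by how many one-word left/right contexts they share with the seeds. The corpus, the entity list, and the function names `context_features` and `expand` are all made up for illustration.

```python
from collections import defaultdict


def context_features(corpus, entities):
    """Map each entity to the set of (left word, placeholder, right word) contexts it occurs in."""
    features = defaultdict(set)
    for sentence in corpus:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token in entities:
                left = tokens[i - 1] if i > 0 else "<s>"
                right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
                features[token].add((left, "__", right))
    return features


def expand(corpus, entities, seeds, top_k=2):
    """Return the top_k non-seed entities that share the most contexts with the seeds."""
    features = context_features(corpus, entities)
    seed_contexts = set()
    for seed in seeds:
        seed_contexts |= features[seed]
    scores = {
        entity: len(features[entity] & seed_contexts)
        for entity in entities
        if entity not in seeds
    }
    # Sort by descending score, breaking ties alphabetically for a stable order.
    return sorted(scores, key=lambda entity: (-scores[entity], entity))[:top_k]


# Hypothetical toy data, not taken from the SetExpan datasets.
toy_corpus = [
    "I visited Texas last year",
    "I visited Ohio last year",
    "I visited Utah last summer",
    "She ate pizza last night",
]
toy_entities = {"Texas", "Ohio", "Utah", "pizza"}

print(expand(toy_corpus, toy_entities, seeds={"Texas"}))  # ['Ohio', 'Utah']
```

"Ohio" and "Utah" rank above "pizza" because they occur in the same "visited __ last" context as the seed.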

Usage

We provide the data preprocessing code and a Python implementation of SetExpan. To use our data preprocessing code, download the following two packages and put them in the "/src/tools/" folder:

  • AutoPhrase: used to extract quality phrases from raw input data.
  • Stanford CoreNLP 3.8.0: used for POS tagging and for selecting quality noun phrases from the phrase list generated by AutoPhrase. Each quality noun phrase is treated as an "entity".

Otherwise, you can download our preprocessed data directly from Google Drive, unzip it, and put the dataset under the "./data/" folder.
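If you prefer to script the unzip step, a minimal sketch using only the Python standard library could look like the following. The helper name `unpack_dataset` and the archive name in the usage comment are assumptions; use whatever file name the Google Drive download actually gives you.

```python
import zipfile
from pathlib import Path


def unpack_dataset(archive_path, data_dir="./data/"):
    """Extract a downloaded dataset archive into the data folder.

    Returns the sorted list of archive members that were extracted.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(data_dir)
        return sorted(archive.namelist())


# Example usage (the archive name is hypothetical):
# unpack_dataset("wiki.zip", data_dir="./data/")
```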

Files in the folder

  • /data/, the input folder of SetExpan;
  • /result/, the output folder of SetExpan;
  • /src/corpusProcessing/, the first step of data preprocessing: converts raw text to sentences.json;
  • /src/dataProcessing/, the second step of data preprocessing: generates all SetExpan input files from sentences.json;
  • /src/tools/, tools used in data preprocessing;
  • /src/SetExpan/, the Python implementation of the SetExpan algorithms:
    • /src/SetExpan/set_expan_main.py: the main entry point of SetExpan, which loads data, forms queries, and runs the algorithm.
    • /src/SetExpan/set_expan.py: the main implementation of SetExpan. You can change model hyper-parameters in this file.
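As a quick sanity check before running anything, the layout listed above can be verified with a short script. This is only a sketch: the helper name `missing_paths` is hypothetical, and it assumes the paths above are relative to the repository root.

```python
from pathlib import Path

# Folders and files named in the list above, relative to the repository root.
EXPECTED_DIRS = [
    "data",
    "result",
    "src/corpusProcessing",
    "src/dataProcessing",
    "src/tools",
    "src/SetExpan",
]
EXPECTED_FILES = [
    "src/SetExpan/set_expan_main.py",
    "src/SetExpan/set_expan.py",
]


def missing_paths(repo_root="."):
    """Return the expected folders and files that are not present under repo_root."""
    root = Path(repo_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# Example usage from the repository root:
# print(missing_paths("."))  # an empty list means the layout is complete
```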

To Run

cd src/SetExpan/ 
python3 ./set_expan_main.py

Results are saved in the same folder, in a file named "setexpan_result.txt".
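To take a quick look at the output after a run, a small helper like the one below prints the first few lines of the result file. It is a sketch that makes no assumption about the file's internal format; the helper name `preview_results` is hypothetical, and the default path assumes "the same folder" means "src/SetExpan/" relative to the repository root.

```python
from pathlib import Path


def preview_results(path="src/SetExpan/setexpan_result.txt", n=10):
    """Return the first n lines of the result file, or an empty list if it does not exist."""
    result_file = Path(path)
    if not result_file.is_file():
        return []
    with result_file.open(encoding="utf-8") as handle:
        # zip stops after n lines without reading the rest of the file.
        return [line.rstrip("\n") for _, line in zip(range(n), handle)]


# Example usage from the repository root:
# for line in preview_results():
#     print(line)
```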

Publications

Please cite the following paper if you are using this code. Thanks!