mickeysjm/SetExpan

SetExpan: Corpus-based Set Expansion Framework

Update (2018-09-19)

  1. Added the apr dataset to the Google Drive link below and moved the wiki queries and ground-truth sets into the dataset.

Update (2018-09-07)

  1. We added the original EgoSet dataset under the "./data/" folder for reference.
  2. A new (but slightly different) version of SetExpan (used in HiExpan) is available at: https://github.com/mickeystroller/HiExpan/tree/master/src/SetExpan-new, together with an easier-to-use data preprocessing pipeline.

Introduction

This is the source code for the SetExpan framework, developed for corpus-based set expansion (i.e., finding the "complete" set of entities belonging to the same semantic class, given a corpus and a tiny set of seeds).
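To make the task concrete, here is a minimal toy sketch of set expansion. It is not the SetExpan algorithm: it simply ranks candidate entities by how many context patterns they share with the seeds. The function name `expand_seed_set`, the toy corpus, and the context strings are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def expand_seed_set(entity_contexts, seeds, top_k=3):
    """Rank non-seed entities by context overlap with the seeds (toy illustration only)."""
    # Count how often each context pattern occurs across the seed entities.
    seed_profile = Counter()
    for seed in seeds:
        seed_profile.update(entity_contexts.get(seed, ()))

    scores = {}
    for entity, contexts in entity_contexts.items():
        if entity in seeds:
            continue
        # A candidate scores higher when it shares contexts that many seeds have.
        scores[entity] = sum(seed_profile[c] for c in set(contexts))

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    return ranked[:top_k]


# Hypothetical entity -> context-pattern map, standing in for a real corpus.
toy_corpus = {
    "Illinois": {"state of __", "governor of __", "lives in __"},
    "Texas": {"state of __", "governor of __"},
    "California": {"state of __", "governor of __", "lives in __"},
    "Chicago": {"city of __", "lives in __", "mayor of __"},
    "banana": {"eats a __"},
}

expanded = expand_seed_set(toy_corpus, seeds={"Illinois", "Texas"}, top_k=2)
print(expanded)
```

With the seeds "Illinois" and "Texas", the entity that shares the most seed contexts ("California") is ranked first.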

Usage

We provide the data preprocessing code and a Python implementation of SetExpan. To use our data preprocessing code, download the following two packages and put them in the "/src/tools/" folder:

  • AutoPhrase: used to extract quality phrases from raw input data.
  • Stanford CoreNLP 3.8.0: used for POS tagging and for selecting quality noun phrases from the phrase list generated by AutoPhrase. Each quality noun phrase is treated as an "entity".

Otherwise, you can download our preprocessed data directly from Google Drive, unzip it, and put the dataset under the "./data/" folder.
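The unzip step can be sketched in Python as follows, assuming the download is a standard zip archive. The helper name `unpack_dataset` is hypothetical, and the archive file name depends on what you actually download.

```python
import zipfile
from pathlib import Path


def unpack_dataset(archive_path, data_dir="./data"):
    """Extract a downloaded dataset archive into the data folder.

    Returns the sorted names of the top-level entries now in data_dir.
    """
    target = Path(data_dir)
    # Create ./data/ if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return sorted(p.name for p in target.iterdir())
```

For example, running `unpack_dataset("downloaded_dataset.zip")` from the repository root (the archive name here is a placeholder) extracts the archive into "./data/" and returns the names of the entries found there.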

Files in the folder

  • /data/, the input folder of SetExpan;
  • /result/, the output folder of SetExpan;
  • /src/corpusProcessing/, the first step of data preprocessing, which converts raw text to sentences.json;
  • /src/dataProcessing/, the second step of data preprocessing, which generates all SetExpan input files from sentences.json;
  • /src/tools/, tools used in the data processing;
  • /src/SetExpan/, the Python implementation of the SetExpan algorithms:
    • /src/SetExpan/set_expan_main.py: the main entry point of SetExpan, which loads the data, forms the queries, and runs the algorithm.
    • /src/SetExpan/set_expan.py: the main implementation of SetExpan. You can change model hyper-parameters in this file.
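As a quick sanity check of the layout listed above, a small sketch like the following reports which of the expected folders are missing from a checkout. The helper name `missing_dirs` is hypothetical; the folder names are taken from the list above.

```python
from pathlib import Path

# Folder names taken from the list above, relative to the repository root.
EXPECTED_DIRS = [
    "data",
    "result",
    "src/corpusProcessing",
    "src/dataProcessing",
    "src/tools",
    "src/SetExpan",
]


def missing_dirs(repo_root="."):
    """Return the expected folders that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

An empty list from `missing_dirs()` means every folder is in place. Note that "data" may legitimately be missing until the dataset has been downloaded.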

To Run

cd src/SetExpan/ 
python3 ./set_expan_main.py

Results are saved in the same folder, in a file named "setexpan_result.txt".
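If you prefer to launch the run from another script, the two commands above can be wrapped as in this sketch. The helper name `run_setexpan` is hypothetical, and it uses the current Python interpreter rather than whichever `python3` is on the PATH. It only starts the script and checks that the result file exists; it makes no assumption about the file's contents.

```python
import subprocess
import sys
from pathlib import Path


def run_setexpan(repo_root="."):
    """Run set_expan_main.py from src/SetExpan/ and return the result file path."""
    work_dir = Path(repo_root) / "src" / "SetExpan"
    # Run from inside src/SetExpan/, mirroring the `cd` in the commands above.
    subprocess.run(
        [sys.executable, "./set_expan_main.py"],
        cwd=work_dir,
        check=True,  # raise CalledProcessError if the script exits with an error
    )
    result_file = work_dir / "setexpan_result.txt"
    if not result_file.is_file():
        raise FileNotFoundError(f"expected output not found: {result_file}")
    return result_file
```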

Publications

Please cite the following paper if you are using this code. Thanks!

About

The source code for SetExpan framework, published in ECML-PKDD 2017
