princeton-nlp/semsup-xc


SemSup-XC: Semantic Supervision for Extreme Classification

Website | Paper

Pranjal Aggarwal, Ameet Deshpande, Karthik Narasimhan

Abstract

Extreme classification (XC) involves classifying over large numbers of classes (thousands to millions), with real-world applications like news article classification and e-commerce product tagging. The zero-shot version of this task requires generalization to novel classes without additional supervision, like a new class "fidget spinner" for e-commerce product tagging. In this paper, we develop SemSup-XC, a model that achieves state-of-the-art zero-shot (ZS) and few-shot (FS) performance on three XC datasets spanning the domains of law, e-commerce, and Wikipedia. SemSup-XC uses automatically collected semantic class descriptions to represent classes ("fidget spinner" can be described as "A spinning toy for stress relief") and enables better generalization through our proposed hybrid matching module (Relaxed-COIL), which matches input instances to class descriptions using both (1) semantic similarity and (2) lexical similarity over contextual representations of similar tokens. Trained with contrastive learning, SemSup-XC significantly outperforms baselines and establishes state-of-the-art performance on all three datasets, by 5-12 precision@1 points on zero-shot and >10 precision@1 points on few-shot (K = 1), with similar gains for recall@10. Our ablation studies highlight the relative importance of our hybrid matching module (up to 2 P@1 improvement on AmazonCat) and automatically collected class descriptions (up to 5 P@1 improvement on AmazonCat).
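The abstract reports results in precision@1 (P@1) and recall@10. As a rough illustration of what these ranking metrics measure, here is a minimal sketch in plain Python. It is not the repository's evaluation code: the function names and the toy data are made up for this example, and the exact conventions used in the paper (for instance, how instances with no relevant classes are handled) may differ.

```python
def top_k_indices(scores, k):
    """Return the indices of the k highest-scoring classes for one instance."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def precision_at_k(score_rows, label_rows, k):
    """Average, over instances, of the fraction of top-k classes that are relevant."""
    total = 0.0
    for scores, labels in zip(score_rows, label_rows):
        hits = sum(labels[j] for j in top_k_indices(scores, k))
        total += hits / k
    return total / len(score_rows)


def recall_at_k(score_rows, label_rows, k):
    """Average, over instances, of the fraction of relevant classes found in the top k.

    Assumption: instances without any relevant class are skipped.
    """
    values = []
    for scores, labels in zip(score_rows, label_rows):
        n_relevant = sum(labels)
        if n_relevant == 0:
            continue
        hits = sum(labels[j] for j in top_k_indices(scores, k))
        values.append(hits / n_relevant)
    return sum(values) / len(values)


# Toy data (hypothetical): 2 instances, 4 classes, binary relevance labels.
scores = [[0.9, 0.1, 0.5, 0.2],
          [0.2, 0.8, 0.1, 0.7]]
labels = [[1, 0, 1, 0],
          [0, 0, 1, 1]]

p_at_1 = precision_at_k(scores, labels, k=1)
r_at_2 = recall_at_k(scores, labels, k=2)
print(p_at_1, r_at_2)
```

In the toy data, the top-ranked class is relevant for the first instance only, so P@1 is 0.5. The top two classes recover both relevant classes of the first instance and one of two for the second, so recall@2 is 0.75.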

Setup

First clone the repository and install dependencies:

git clone https://github.com/princeton-nlp/semsup-xc.git
cd semsup-xc
pip install -r requirements.txt

Inside the semsup-xc folder, download the pre-processed datasets and scraped class descriptions from here. Unzip the downloaded file into the datasets folder.

Running

Training

Run python main.py <config_file> <output_dir> to train a zero-shot or few-shot model on any of the datasets. See the configs folder for the list of all relevant config files.
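If you want to launch several training runs one after another, a small wrapper around the command above can help. The following is a sketch only: the config file names and output directories are placeholders invented for this example (check the configs folder for the real names), and the script assumes it is run from the repository root. By default it only prints the commands it would run.

```python
import subprocess
import sys


def build_train_command(config_file, output_dir):
    """Build the `python main.py <config_file> <output_dir>` command as an argument list."""
    return [sys.executable, "main.py", config_file, output_dir]


def train_all(runs, dry_run=True):
    """Run (or, with dry_run=True, just print) one training command per (config, output) pair."""
    commands = [build_train_command(cfg, out) for cfg, out in runs]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop if a training run exits with an error.
            subprocess.run(cmd, check=True)
    return commands


# Placeholder paths (hypothetical): replace with real files from the configs folder.
runs = [
    ("configs/EXAMPLE_ZERO_SHOT_CONFIG", "outputs/example_zero_shot"),
    ("configs/EXAMPLE_FEW_SHOT_CONFIG", "outputs/example_few_shot"),
]

commands = train_all(runs, dry_run=True)
```

Pass dry_run=False once the placeholder paths have been replaced.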

Evaluation

For all datasets, you can run the main.py script directly for evaluation: update the config file by changing the pretrained_model parameter and setting do_train to False. You can also adjust the random_sample parameter to change the number of samples to evaluate on. To ensemble results with TF-IDF, use the Evaluator.ipynb notebook. The approach above is slow and memory-hungry. For faster inference on the Amazon and Wikipedia datasets, use: bash scripts/fastEval{DSET}.sh <config_file> <checkpoint_path>.
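The config changes for evaluation can be sketched as a small helper that turns a training config into an evaluation config. This is an illustration under assumptions: it treats the config as a plain dictionary with pretrained_model, do_train, and random_sample as top-level keys, and it does not read or write the actual config files, whose format is not described here. The helper name, the example values, and the checkpoint path are hypothetical.

```python
import copy


def make_eval_config(train_config, checkpoint_path, random_sample=None):
    """Return a copy of the config set up for evaluation only.

    Assumption: the three parameters are top-level keys of the config.
    """
    cfg = copy.deepcopy(train_config)
    cfg["pretrained_model"] = checkpoint_path  # point at the trained checkpoint
    cfg["do_train"] = False                    # skip training, evaluate only
    if random_sample is not None:
        cfg["random_sample"] = random_sample   # number of samples to evaluate on
    return cfg


# Hypothetical training config with placeholder values.
train_config = {"pretrained_model": "EXAMPLE_BASE_MODEL", "do_train": True}

eval_config = make_eval_config(
    train_config,
    checkpoint_path="path/to/checkpoint",  # placeholder path
    random_sample=1000,                    # placeholder value
)
print(eval_config)
```

The original dictionary is left unchanged, so the same training config can be reused for several checkpoints.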

Trained Models

Pre-trained models can be downloaded from here.

Citing SemSup-XC

@article{aggarwal2023semsupxc,
  title   = {SemSup-XC: Semantic Supervision for Zero and Few-shot Extreme Classification},
  author  = {Pranjal Aggarwal and Ameet Deshpande and Karthik Narasimhan},
  year    = {2023},
  journal = {arXiv preprint arXiv:2301.11309}
}

LICENSE

SemSup-XC is MIT licensed, as found in the LICENSE file.
