
ELIAS

[paper]        [poster]        [slides]        [video]

Learnable graph-based search index for large output spaces

ELIAS: End-to-end Learning to Index and Search in Large Output Spaces
Nilesh Gupta, Patrick H. Chen, Hsiang-Fu Yu, Cho-Jui Hsieh, Inderjit S. Dhillon
NeurIPS 2022

Highlights

  • Fully learnable graph-based search index for classification in large output spaces
  • Scales to label spaces of size $\mathcal{O}(10M)$ on a single A100 GPU
  • Achieves state-of-the-art results on multiple large-scale extreme classification benchmarks

Preparing Data

The codebase assumes the following data structure:

Datasets/
└── amazon-670k # Dataset name
    ├── raw
    │   ├── trn_X.txt # train input file; the i-th line is the text input for the i-th data point
    │   └── tst_X.txt # test input file; the i-th line is the text input for the i-th data point
    ├── X.trn.npz # train bow input features (needed to generate initial clustering)
    ├── Y.trn.npz # train relevance matrix (stored in scipy sparse npz format), num_train x num_labels
    └── Y.tst.npz # test relevance matrix (stored in scipy sparse npz format), num_test x num_labels
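
For reference, the sketch below shows one way the relevance matrices (Y.trn.npz / Y.tst.npz) could be built from per-example label-id lists. It is illustrative only; the helper name save_relevance_matrix and the toy numbers are assumptions, not part of the codebase.

# Illustrative sketch (not part of the repo): build a relevance matrix in the
# expected scipy sparse npz format from per-example label-id lists.
import numpy as np
import scipy.sparse as sp

def save_relevance_matrix(label_lists, num_labels, out_path):
    # label_lists[i] is the list of relevant label ids for the i-th data point
    rows = np.concatenate([[i] * len(lbls) for i, lbls in enumerate(label_lists)])
    cols = np.concatenate([np.asarray(lbls, dtype=np.int64) for lbls in label_lists])
    data = np.ones(len(rows), dtype=np.float32)
    Y = sp.csr_matrix((data, (rows, cols)), shape=(len(label_lists), num_labels))
    sp.save_npz(out_path, Y)

# Toy example: two data points with relevant labels {0, 5} and {3}, out of 10 labels
save_relevance_matrix([[0, 5], [3]], num_labels=10, out_path='Y.toy.npz')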

Before running training or testing, the default pipeline expects the raw input text to be converted into BERT's (or any text transformer's) tokenized input indices. You can do this by running:

dataset="amazon-670k"
tf_max_len="128" # use 32 for short-text datasets
tf_token_type="bert-base-uncased" # any Hugging Face pre-trained tokenizer can be used
./prepare.sh ${dataset} ${tf_max_len} ${tf_token_type}
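
Under the hood, this step tokenizes each line of trn_X.txt / tst_X.txt with a Hugging Face tokenizer and stores the resulting input-id matrix. The sketch below illustrates the same idea; the output file name and .npy format here are assumptions and need not match what prepare.sh actually writes.

# Illustrative sketch of the tokenization step (output path/format are assumed).
import numpy as np
from transformers import AutoTokenizer

def tokenize_file(txt_path, out_path, tokenizer_name='bert-base-uncased', max_len=128):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    with open(txt_path) as f:
        texts = [line.rstrip('\n') for line in f]
    enc = tokenizer(texts, max_length=max_len, truncation=True,
                    padding='max_length', return_tensors='np')
    np.save(out_path, enc['input_ids'].astype(np.int32))  # num_points x max_len

tokenize_file('Datasets/amazon-670k/raw/trn_X.txt', 'trn_input_ids.npy')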

Evaluating ELIAS

# Single GPU
python eval.py ${config_dir}/config.yaml

# Multi GPU
accelerate launch --config_file configs/accelerate.yaml --num_processes ${num_gpus} eval.py Results/ELIAS/${dataset}/${expname}/config.yaml

Training ELIAS

Sample end-to-end script: run_benchmark.sh (e.g. ./run_benchmark.sh amazon-670k)

Generate initial clustering matrix

python elias_utils.py gen_cluster_A configs/${dataset}/elias-1.yaml --no_model true
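
Conceptually, the initial clustering matrix partitions labels into clusters using the train BOW features (X.trn.npz) and the relevance matrix (Y.trn.npz), e.g. by clustering label centroids. The sketch below shows that general idea with plain k-means; the number of clusters is an assumption, and this is not the actual gen_cluster_A implementation.

# Illustrative sketch: cluster labels via centroids of their training points'
# BOW features, and store the cluster-to-label assignment as a sparse matrix.
# Plain KMeans is used for brevity; gen_cluster_A has its own scheme.
import numpy as np
import scipy.sparse as sp
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

X = sp.load_npz('Datasets/amazon-670k/X.trn.npz')  # num_train x num_features
Y = sp.load_npz('Datasets/amazon-670k/Y.trn.npz')  # num_train x num_labels

label_centroids = normalize(Y.T @ X)               # num_labels x num_features
num_clusters = 8192                                # assumed; set per dataset/config
cluster_ids = KMeans(n_clusters=num_clusters).fit_predict(label_centroids)

# A[c, l] = 1 if label l is assigned to cluster c
num_labels = label_centroids.shape[0]
A = sp.csr_matrix((np.ones(num_labels, dtype=np.float32),
                   (cluster_ids, np.arange(num_labels))),
                  shape=(num_clusters, num_labels))
sp.save_npz('cluster_A_init.npz', A)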

Train Stage 1

# Single GPU
python train.py configs/${dataset}/elias-1.yaml

# Multi GPU
accelerate launch --config_file configs/accelerate.yaml --num_processes ${num_gpus} train.py configs/${dataset}/elias-1.yaml

Generate sparse approximate adjacency graph matrix

# Single GPU
python elias_utils.py gen_approx_A configs/${dataset}/elias-1.yaml

# Multi GPU
accelerate launch --config_file configs/accelerate.yaml --num_processes ${num_gpus} elias_utils.py gen_approx_A configs/${dataset}/elias-1.yaml
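
The resulting matrix is a sparsified cluster-to-label graph in which each cluster keeps only its highest-scoring label edges. The sketch below illustrates such a top-k sparsification; the dense scores matrix and the value of k are placeholders, whereas in the actual pipeline the edge scores come from the Stage 1 model.

# Illustrative sketch: keep the top-k highest-scoring labels per cluster and
# store the result as a sparse adjacency matrix (dense `scores` is a placeholder).
import numpy as np
import scipy.sparse as sp

def topk_sparsify(scores, k):
    num_clusters, num_labels = scores.shape
    topk_idx = np.argpartition(-scores, kth=k - 1, axis=1)[:, :k]
    rows = np.repeat(np.arange(num_clusters), k)
    cols = topk_idx.ravel()
    data = scores[rows, cols]
    return sp.csr_matrix((data, (rows, cols)), shape=(num_clusters, num_labels))

approx_A = topk_sparsify(np.random.rand(4, 10), k=3)  # toy cluster-to-label scores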

Train Stage 2

# Single GPU
python train.py configs/${dataset}/elias-2.yaml

# Multi GPU
accelerate launch --config_file configs/accelerate.yaml --num_processes ${num_gpus} train.py configs/${dataset}/elias-2.yaml

Download pretrained models

Coming soon...

Notebook Demo

Coming soon...

Cite

@InProceedings{ELIAS,
  author    = "Gupta, N. and Chen, P.H. and Yu, H-F. and Hsieh, C-J. and Dhillon, I.",
  title     = "ELIAS: End-to-end Learning to Index and Search in Large Output Spaces",
  booktitle = "Neural Information Processing Systems",
  month     = "November",
  year      = "2022"
}
