GlasXC
is a PyTorch implementation of the paper Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces (NeurIPS '19):
- GlasXC: solves extreme classification using a simple deep learning architecture plus the GLaS regularizer
- Python 2.7 or 3.5
- Requirements for the project are listed in requirements.txt. In addition to these, PyTorch 0.4.1 or higher is necessary. The requirements can be installed using pip:
$ pip install -r requirements.txt
or using conda:
$ conda install --file requirements.txt
- Clone the repository:
$ git clone https://github.com/pyschedelicsid/GlasXC
$ cd GlasXC
- Install:
$ python setup.py install
Use train_GlasXC.py. This script trains (and optionally evaluates) a model on a given dataset using the GlasXC algorithm. A description of the available options can be obtained with:
$ python train_GlasXC.py --help
To run GlasXC in the configuration used in the report, use:
$ ./train_GlasXC_with_args.sh
To run the baseline model, use:
$ python baseline.py
Links for downloading each dataset used can be found here, and the project report can be found here. The configuration files (described below) used for each dataset can be found here.
The input data must be in the LIBSVM format. An example of such a dataset is the Bibtex dataset found here.
The first row in the LIBSVM format specifies the dataset size and the input and output dimensions. This row must be removed, and the same information must instead be provided through the configuration files, as explained below.
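The header-stripping step can be sketched as follows. This is a hypothetical helper, not part of the repository; the name and the exact header layout ("num_data_points input_dims output_dims" on one line) are assumptions based on the description above.

```python
# Hypothetical helper (not part of the repo): removes the LIBSVM header row
# and returns the metadata it contained, for use in the dataset_info file.
def strip_libsvm_header(in_path, out_path):
    with open(in_path) as f:
        # Assumed first line: "num_data_points input_dims output_dims"
        num_points, input_dims, output_dims = (int(x) for x in f.readline().split())
        body = f.read()
    with open(out_path, 'w') as f:
        f.write(body)
    return num_points, input_dims, output_dims
```

The returned triple is exactly what the dataset_info YAML file (described below) expects.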
To use GlasXC through GlasXC.py, you need valid neural network configurations, in the YAML format, for the input encoder, the label encoder in the latent space, and the regressor. An example configuration file is:
- name: Linear
  kwargs:
    in_features: 500
    out_features: 1152
- name: LeakyReLU
  kwargs:
    negative_slope: 0.2
    inplace: True
- name: Linear
  kwargs:
    in_features: 1152
    out_features: 1836
- name: Sigmoid
Please note that the name and kwargs attributes must match the corresponding class and argument names in PyTorch (torch.nn).
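A minimal sketch of how such a layer list can be turned into a PyTorch model. The function name `build_network` is an assumption for illustration, not the repository's actual API; it simply looks each `name` up in torch.nn and forwards the `kwargs`.

```python
import yaml
import torch.nn as nn

# Hypothetical helper (not the repo's actual API): maps the YAML layer list
# above onto torch.nn modules stacked in a Sequential container.
def build_network(config_path):
    with open(config_path) as f:
        layers = yaml.safe_load(f)
    modules = []
    for layer in layers:
        cls = getattr(nn, layer['name'])           # e.g. nn.Linear, nn.Sigmoid
        modules.append(cls(**layer.get('kwargs', {})))
    return nn.Sequential(*modules)
```

This is why the names must match torch.nn exactly: they are resolved by attribute lookup, so a misspelled layer name raises an AttributeError.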
Optimizer configurations are very similar to the neural network configurations. Here too you must retain the same naming as PyTorch for optimizer names and their parameters, for example lr for the learning rate. Below is a sample:
name: SGD
args:
  lr: 0.01
  momentum: 0.9
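As with the network layers, the optimizer name can be resolved directly against torch.optim. The helper below is an illustrative sketch (the name `build_optimizer` is assumed, not taken from the repo):

```python
import yaml
import torch.optim as optim

# Hypothetical helper (not the repo's actual API): looks the optimizer class
# up by its torch.optim name and forwards the args from the YAML file.
def build_optimizer(config_path, parameters):
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    cls = getattr(optim, cfg['name'])              # e.g. optim.SGD, optim.Adam
    return cls(parameters, **cfg.get('args', {}))
```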
In both scripts, you are required to specify a data root (data_root) and a dataset information file (dataset_info). data_root is the folder containing the datasets. dataset_info is a YAML file in the following format:
train_filename:
train_opts:
  num_data_points:
  input_dims:
  output_dims:
test_filename:
test_opts:
  num_data_points:
  input_dims:
  output_dims:
If there is no test dataset, remove the test_filename and test_opts fields. An example for the Bibtex dataset would be:
train_filename: bibtex_train.txt
train_opts:
  num_data_points: 4880
  input_dims: 1836
  output_dims: 159
test_filename: bibtex_test.txt
test_opts:
  num_data_points: 2515
  input_dims: 1836
  output_dims: 159
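Putting the two files together, a split can be loaded roughly as follows. This sketch is an assumption about usage, not the repository's loader; it uses scikit-learn's multilabel LIBSVM reader on the headerless files described above, and the name `load_split` is hypothetical.

```python
import os
import yaml
from sklearn.datasets import load_svmlight_file

# Hypothetical helper (not the repo's actual loader): combines data_root and
# the dataset_info YAML to load one headerless LIBSVM split.
def load_split(data_root, dataset_info_path, split='train'):
    with open(dataset_info_path) as f:
        info = yaml.safe_load(f)
    opts = info[split + '_opts']                   # num_data_points, input_dims, output_dims
    path = os.path.join(data_root, info[split + '_filename'])
    X, y = load_svmlight_file(path, n_features=opts['input_dims'], multilabel=True)
    return X, y
```

Passing n_features from dataset_info matters because a sparse split may never touch the last feature columns, in which case the loader cannot infer the true input dimensionality on its own.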
Written By: Siddhant Katyan