cuNVSM

⚠️ You need a CUDA-compatible GPU (compute capability 5.2+) to use this software.

cuNVSM is a C++/CUDA implementation of the state-of-the-art NVSM and LSE representation learning algorithms. It also supports injecting a priori knowledge of document/document similarity, which was the main subject of study in the CIKM 2018 paper on product substitutability.

It integrates with the Indri search engine: model parameters are estimated directly from indexes created by Indri and are stored in the open HDF5 format. A lightweight Python module, nvsm, provided as part of this toolkit, allows querying the models and more.

For more information, see Section 3.3 of the 2018 TOIS paper "Neural Vector Spaces for Unsupervised Information Retrieval".

Requirements

To build the cuNVSM training binary and manage dependencies, we use CMake (version 3.8 or higher). In addition, we rely on the following libraries for the cuNVSM training binary:

The cnmem library is used for memory management. The tests are implemented using the googletest and googlemock frameworks. CMake will fetch and compile these libraries automatically as part of the build pipeline. Finally, you need a CUDA-compatible GPU in order to perform any computations.
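The CMake version requirement can be checked before configuring the build. The following is a minimal sketch, not part of the toolkit: the helper name version_ge is hypothetical, and the comparison assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when version $1 is greater
# than or equal to version $2. Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# The first line of `cmake --version` has the form "cmake version X.Y.Z",
# so the third field is the version number.
if command -v cmake >/dev/null 2>&1; then
    cmake_version=$(cmake --version | head -n 1 | awk '{print $3}')
    if version_ge "$cmake_version" 3.8; then
        echo "CMake $cmake_version is recent enough"
    else
        echo "CMake $cmake_version is older than 3.8" >&2
    fi
else
    echo "cmake not found on PATH" >&2
fi
```

The check only covers CMake itself; missing libraries are still reported by the CMake configure step.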

Dependencies for the nvsm Python (>= 3.5) library used for loading and querying trained models can be installed as follows:

pip install -r requirements.txt

Note that the Python library depends on pyndri, which in turn also depends on Indri.

Installation

The following instructions should get you started with installing cuNVSM. Note that the installation will fail if dependencies cannot be found.

git clone https://github.com/cvangysel/cuNVSM
cd cuNVSM
mkdir build
cd build
cmake ..
make
make install

Please refer to the CMake documentation for advanced options.
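As an illustration of such options, the sketch below installs into a user-writable location instead of the system default. It relies only on the standard CMake variables CMAKE_BUILD_TYPE and CMAKE_INSTALL_PREFIX; the function name build_cunvsm, the PREFIX variable, and its default value are assumptions made for this example, and whether the project's build scripts honour the build type has not been verified here.

```shell
#!/bin/sh
# Hypothetical wrapper around the build steps shown above. The function body
# runs in a subshell (note the parentheses), so the caller's working
# directory is left unchanged.
build_cunvsm() (
    # PREFIX and its default are illustrative, not project defaults.
    prefix=${PREFIX:-"$HOME/.local"}
    mkdir -p build &&
    cd build &&
    cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX="$prefix" .. &&
    make &&
    make install
)
```

Call build_cunvsm from the root of the cloned repository, for example as PREFIX=/opt/cunvsm build_cunvsm. Each step runs only if the previous one succeeded.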

cuNVSM also comes with a rich test harness to verify its implementation; see TESTS for more information.

Examples

See TUTORIAL for examples.

Frequently Asked Questions

How do I run NVSM or LSE?

Different models can be trained/queried by passing the appropriate flags to the cuNVSMTrainModel and cuNVSMQuery executables.

  • For LSE, pass --batch_size 4096, --nonlinearity tanh and --bias_negative_samples to cuNVSMTrainModel.
  • For NVSM, pass --batch_size 51200, --nonlinearity hard_tanh and --batch_normalization to cuNVSMTrainModel and pass --linear to cuNVSMQuery.

For more information, see the train_nvsm function in scripts/functions.sh and the invocation of cuNVSMQuery in rank-cranfield-collection.sh.
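The flag combinations listed above can be collected in a small helper so that scripts select them by model name. This is only a sketch: the function names train_flags and query_flags are hypothetical, and the remaining arguments that the executables need (such as input and output locations) are deliberately left out because they are not covered in this section.

```shell
#!/bin/sh
# Hypothetical helper: prints the cuNVSMTrainModel flags listed in the FAQ
# for the given model name ("lse" or "nvsm").
train_flags() {
    case "$1" in
        lse)  echo "--batch_size 4096 --nonlinearity tanh --bias_negative_samples" ;;
        nvsm) echo "--batch_size 51200 --nonlinearity hard_tanh --batch_normalization" ;;
        *)    echo "unknown model: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: prints the extra cuNVSMQuery flag listed in the FAQ.
# Nothing is printed for LSE, since the FAQ names no query flag for it.
query_flags() {
    case "$1" in
        nvsm) echo "--linear" ;;
        lse)  ;;
        *)    echo "unknown model: $1" >&2; return 1 ;;
    esac
}

echo "NVSM training flags: $(train_flags nvsm)"
echo "NVSM query flags: $(query_flags nvsm)"
```

The output of either function can be appended to an existing command line; the train_nvsm function in scripts/functions.sh remains the authoritative reference for a complete invocation.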

Citation

If you use cuNVSM to produce results for your scientific publication, please refer to our TOIS and CIKM 2018 papers:

@article{VanGysel2018nvsm,
  title={Neural Vector Spaces for Unsupervised Information Retrieval},
  author={Van Gysel, Christophe and de Rijke, Maarten and Kanoulas, Evangelos},
  publisher={ACM},
  journal={TOIS},
  year={2018},
}

@inproceedings{VanGysel2018substitutability,
  title={Mix ’n Match: Integrating Text Matching and Product Substitutability within Product Search},
  author={Van Gysel, Christophe and de Rijke, Maarten and Kanoulas, Evangelos},
  booktitle={CIKM},
  volume={2018},
  year={2018},
  organization={ACM}
}

The validate/test splits used in the 2018 TOIS paper can be found here. The test collections for the 2018 CIKM paper can be found here.

The toolkit also contains an implementation of the LSE model described in the following CIKM paper:

@inproceedings{VanGysel2016lse,
  title={Learning Latent Vector Spaces for Product Search},
  author={Van Gysel, Christophe and de Rijke, Maarten and Kanoulas, Evangelos},
  booktitle={CIKM},
  volume={2016},
  pages={165--174},
  year={2016},
  organization={ACM}
}

License

cuNVSM is licensed under the MIT license. CUDA is a licensed trademark of NVIDIA. Please note that CUDA and Indri are licensed separately. Some of the CMake scripts in the third_party directory are licensed under BSD-3.

If you modify cuNVSM in any way, please link back to this repository.