Extweme Wabbit

This fork implements Probabilistic Label Trees (PLTs) in Vowpal Wabbit for extreme multi-label classification. It has since been merged into the main Vowpal Wabbit repository.

Our other PLT implementations are available here:

References

PLTs were introduced and extended in the articles listed below. Please cite the relevant article if you use PLTs in your research.

  • Marek Wydmuch, Kalina Jasinska-Kobus, Rohit Babbar, Krzysztof Dembczyński: Propensity-Scored Probabilistic Label Trees. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021
@inproceedings{Wydmuch_at_el_2021,
  author =    {Wydmuch, Marek and Jasinska-Kobus, Kalina and Babbar, Rohit and Dembczynski, Krzysztof},
  title =     {Propensity-Scored Probabilistic Label Trees},
  year =      {2021},
  isbn =      {9781450380379},
  publisher = {Association for Computing Machinery},
  address =   {New York, NY, USA},
  url =       {https://doi.org/10.1145/3404835.3463084},
  doi =       {10.1145/3404835.3463084},
  booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages =     {2252–2256},
  numpages =  {5},
  keywords =  {label trees, recommendation, multi-label classification, missing labels, tagging, propensity model, supervised learning, extreme classification, ranking},
  location =  {Virtual Event, Canada},
  series =    {SIGIR '21}
}
  • Kalina Jasinska-Kobus, Marek Wydmuch, Devanathan Thiruvenkatachari, Krzysztof Dembczyński: Online Probabilistic Label Trees. Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, Volume 130, 2021
@inproceedings{Jasinska-Kobus_Wydmuch_at_el_2021,
  title =     {Online probabilistic label trees},
  author =    {Jasinska-Kobus, Kalina and Wydmuch, Marek and Thiruvenkatachari, Devanathan and Dembczynski, Krzysztof},
  booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages =     {1801--1809},
  year =      {2021},
  editor =    {Banerjee, Arindam and Fukumizu, Kenji},
  volume =    {130},
  series =    {Proceedings of Machine Learning Research},
  month =     {13--15 Apr},
  publisher = {PMLR},
  pdf =       {http://proceedings.mlr.press/v130/jasinska-kobus21a/jasinska-kobus21a.pdf},
  url =       {http://proceedings.mlr.press/v130/jasinska-kobus21a.html},
}
  • Kalina Jasinska-Kobus, Marek Wydmuch, Krzysztof Dembczyński, Mikhail Kuznetsov, Robert Busa-Fekete: Probabilistic Label Trees for Extreme Multi-label Classification. arXiv preprint arXiv:2009.11218, 2020
@misc{Jasinska-Kobus_at_el_2020,
  title=          {Probabilistic Label Trees for Extreme Multi-label Classification},
  author=         {Kalina Jasinska-Kobus and Marek Wydmuch and Krzysztof Dembczynski and Mikhail Kuznetsov and Robert Busa-Fekete},
  year=           {2020},
  eprint=         {2009.11218},
  archivePrefix = {arXiv},
  primaryClass =  {cs.LG}
}  
  • Marek Wydmuch, Kalina Jasinska, Mikhail Kuznetsov, Róbert Busa-Fekete, Krzysztof Dembczyński: A No-Regret Generalization of Hierarchical Softmax to Extreme Multi-label Classification. Advances in Neural Information Processing Systems 31, 2018
@incollection{Wydmuch_at_el_2018b,
  title =     {A no-regret generalization of hierarchical softmax to extreme multi-label classification},
  author =    {Wydmuch, Marek and Jasinska, Kalina and Kuznetsov, Mikhail and Busa-Fekete, R\'{o}bert and Dembczynski, Krzysztof},
  booktitle = {Advances in Neural Information Processing Systems},
  volume =    {31},
  editor =    {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
  pages =     {6358--6368},
  year =      {2018},
  publisher = {Curran Associates, Inc.},
  url =       {http://papers.nips.cc/paper/7872-a-no-regret-generalization-of-hierarchical-softmax-to-extreme-multi-label-classification.pdf}
}
  • Kalina Jasinska, Krzysztof Dembczyński, Robert Busa-Fekete, Karlson Pfannschmidt, Timo Klerx, Eyke Hüllermeier: Extreme F-measure Maximization using Sparse Probability Estimates. Proceedings of The 33rd International Conference on Machine Learning, PMLR, Volume 48, 2016
@inproceedings{Jasinska_et_al_2016,
  title =     {Extreme F-measure Maximization using Sparse Probability Estimates},
  author =    {Kalina Jasinska and Krzysztof Dembczynski and Robert Busa-Fekete and Karlson Pfannschmidt and Timo Klerx and Eyke Hullermeier},
  booktitle = {Proceedings of The 33rd International Conference on Machine Learning},
  pages =     {1435--1444},
  year =      {2016},
  editor =    {Maria Florina Balcan and Kilian Q. Weinberger},
  volume =    {48},
  series =    {Proceedings of Machine Learning Research},
  address =   {New York, New York, USA},
  publisher = {PMLR},
}

PLT options

--plt arg               Use PLT for multi-label learning with arg labels
--kary_tree arg (=2)    Use an arg-ary tree. By default the tree is binary
--top_k arg (=1)        Predict arg top labels
--threshold arg         Predict labels with marginal probabilities greater than arg

We recommend using --sgd with --plt for the fastest training and the best memory efficiency.
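The arity set by --kary_tree trades tree depth for per-node fan-out: a k-ary tree over L labels has depth ⌈log_k L⌉, which bounds the number of node classifiers evaluated along the path to any single label. A minimal sketch of this relationship (k and L are illustrative values, not defaults of this implementation):

```shell
# Depth of a k-ary label tree over L labels:
# the smallest d such that k^d >= L.
k=16
L=670091
d=0
n=1
while [ "$n" -lt "$L" ]; do
  n=$((n * k))
  d=$((d + 1))
done
echo "depth=$d"    # prints depth=5
```

Larger arity gives a shallower tree (fewer nodes visited per label) but a heavier multi-way decision at each node, so the best setting depends on the dataset.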

Example of usage

# To train:
vw --plt <num labels> <train dataset> -f <output model> --sgd -l <learning rate> --kary_tree <tree arity> --passes <num epochs> -b <number of bits in the feature table> -c

# To test:
vw -t -i <model file> <test dataset> --top_k <number of top labels> -p <prediction file>
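The train and test datasets are expected in VW's multi-label input format: comma-separated label ids before the bar, features after it. A toy two-example file (the feature names here are made up for illustration):

```shell
# Write a toy training file in VW multi-label format:
# comma-separated label ids, a bar, then the features.
printf '1,3 | size:2.3 color_red\n2 | size:1.1 color_blue\n' > plt_train.vw
cat plt_train.vw
```

The first example carries labels 1 and 3; the second carries only label 2. Numeric feature values follow a colon, and bare tokens are treated as features with value 1.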

More examples and scripts to replicate results on datasets from The Extreme Classification Repository can be found in the xml_experiments directory.


Original Vowpal Wabbit README.md

/*
Copyright (c) by respective owners including Yahoo!, Microsoft, and
individual contributors. All rights reserved.  Released under a BSD (revised)
license as described in the file LICENSE.
 */

Vowpal Wabbit


This is the vowpal wabbit fast online learning code. For Windows, look at README.windows.txt

Prerequisite software

These prerequisites are usually pre-installed on many platforms. However, you may need to consult your favorite package manager (yum, apt, MacPorts, brew, ...) to install missing software.

  • Boost library, with the Boost::Program_Options library option enabled.
  • The zlib compression library + headers. In linux distros: package zlib-devel (Red Hat/CentOS), or zlib1g-dev (Ubuntu/Debian)
  • lsb-release (RedHat/CentOS: redhat-lsb-core, Debian: lsb-release, Ubuntu: you're all set, OSX: not required)
  • GNU autotools: autoconf, automake, libtool, autoheader, et al. This is not a strict prerequisite. On many systems (notably Ubuntu with libboost-program-options-dev installed), the provided Makefile works fine.
  • (optional) git if you want to check out the latest version of vowpal wabbit, work on the code, or even contribute code to the main project.

Getting the code

You can download the latest version from here. The very latest version is always available via 'github' by invoking one of the following:

## For the traditional ssh-based Git interaction:
$ git clone git://github.com/JohnLangford/vowpal_wabbit.git

## For HTTP-based Git interaction
$ git clone https://github.com/JohnLangford/vowpal_wabbit.git

Compiling

You should be able to build the vowpal wabbit on most systems with:

$ make
$ make test    # (optional)

If that fails, try:

$ ./autogen.sh
$ make
$ make test    # (optional)
$ make install

Note that ./autogen.sh requires automake (see the prerequisites above).

./autogen.sh's command line arguments are passed directly to configure as if they were configure arguments and flags.

Note that ./autogen.sh will overwrite the supplied Makefile, including the Makefiles in sub-directories, so keeping a copy of the Makefiles may be a good idea before running autogen.sh. If your original Makefiles were overwritten by autogen.sh calling automake, you may always get the originals back from git using:

git checkout Makefile */Makefile

Be sure to read the wiki: https://github.com/JohnLangford/vowpal_wabbit/wiki for the tutorial, command line options, etc.

The 'cluster' directory has its own documentation for cluster parallel use, and the examples at the end of test/Runtests give some example flags.

C++ Optimization

The default C++ compiler optimization flags are very aggressive. If you run into a problem, consider rerunning configure with the --enable-debug option, e.g.:

$ ./configure --enable-debug

or passing your own compiler flags via the OPTIM_FLAGS make variable:

$ make OPTIM_FLAGS="-O0 -g"

Ubuntu/Debian specific info

On Ubuntu/Debian/Mint and similar the following sequence should work for building the latest from github:

# -- Get libboost program-options and zlib:
apt-get install libboost-program-options-dev zlib1g-dev

# -- Get the python libboost bindings (python subdir) - optional:
apt-get install libboost-python-dev

# -- Get the vw source:
git clone git://github.com/JohnLangford/vowpal_wabbit.git

# -- Build:
cd vowpal_wabbit
make
make test       # (optional)
make install

Ubuntu advanced build options (clang and static)

If you prefer building with clang instead of gcc (much faster build and a slightly faster executable), install clang and change the make step slightly:

apt-get install clang

make CXX=clang++

A statically linked vw executable that is not sensitive to boost version upgrades and can be safely copied between different Linux versions (e.g. even from Ubuntu to Red-Hat) can be built and tested with:

make CXX='clang++ -static' clean vw test     # ignore warnings

Mac OS X-specific info

OSX requires glibtools, which is available via the brew or MacPorts package managers.

Complete brew install of 8.0

brew install vowpal-wabbit

The homebrew formula for VW is located on github.

Manual install of Vowpal Wabbit

OSX Dependencies (if using Brew):

brew install libtool
brew install autoconf
brew install automake
brew install boost
brew install boost-python

OSX Dependencies (if using MacPorts):

## Install glibtool and other GNU autotool friends:
$ port install libtool autoconf automake

## Build Boost for Mac OS X 10.8 and below
$ port install boost +no_single +no_static +openmpi +python27 configure.cxx_stdlib=libc++ configure.cxx=clang++

## Build Boost for Mac OS X 10.9 and above
$ port install boost +no_single +no_static +openmpi +python27

OSX Manual compile:

Mac OS X 10.8 and below: configure.cxx_stdlib=libc++ and configure.cxx=clang++ ensure that clang++ uses the correct C++11 functionality while building Boost. Ordinarily, clang++ relies on the older GNU g++ 4.2 series header files and stdc++ library; libc++ is the clang replacement that provides newer C++11 functionality. If these flags aren't present, you will likely encounter compilation errors when compiling vowpalwabbit/cbify.cc. These error messages generally contain complaints about std::to_string and std::unique_ptr types missing.

To compile:

$ sh autogen.sh --enable-libc++
$ make
$ make test    # (optional)

OSX Python Binding installation with Anaconda

When using Anaconda as the source for Python the default Boost libraries used in the Makefile need to be adjusted. Below are the steps needed to install the Python bindings for VW. This should work for Python 2 and 3. Adjust the directories to match where anaconda is installed.

# create anaconda environment with boost
conda create --name vw boost
source activate vw
git clone https://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit
# edit Makefile
# change BOOST_INCLUDE to use anaconda env dir: /anaconda/envs/vw/include
# change BOOST_LIBRARY to use anaconda lib dir: /anaconda/envs/vw/lib
cd python
python setup.py install

Code Documentation

To browse the code more easily, do

make doc

and then point your browser to doc/html/index.html.
