
Sketching Linear Classifiers over Data Streams

[Figure: Overview of the Weight-Median Sketch]

The code in this repository supplements the SIGMOD 2018 paper Sketching Linear Classifiers over Data Streams by Kai Sheng Tai, Vatsal Sharan, Peter Bailis, and Gregory Valiant.

The Weight-Median Sketch (WM-Sketch) and Active-Set Weight-Median Sketch (AWM-Sketch) are two algorithms for learning compressed, memory-budgeted linear classifiers on streaming data. In addition to supporting fast binary classification, these sketches also support the approximate retrieval of high-magnitude model weights in the original, uncompressed feature space -- this is analogous to the estimation of frequent items using the Count-Sketch or the Count-Min Sketch.

The WM-Sketch and AWM-Sketch can be seen as improved variants of feature hashing, a commonly used technique for handling high-dimensional feature spaces (e.g., in scikit-learn and Vowpal Wabbit).
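
For intuition, here is a minimal, self-contained C++ sketch of the core idea (a simplified demo, not the library's API -- all names and parameters below are illustrative): features are hashed with random signs into a small table, online logistic regression runs entirely in the compressed space, and individual weights are recovered Count-Sketch-style by taking a median across rows.

// wm_sketch_demo.cpp -- simplified illustration of the WM-Sketch idea.
// Hypothetical demo code, not the wmsketch library API. Compile with -std=c++17.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <utility>
#include <vector>

struct WMSketchDemo {
    int depth, width;                        // rows and buckets per row
    std::vector<std::vector<double>> z;      // compressed weights, depth x width
    std::vector<uint64_t> seeds;
    double lr;

    WMSketchDemo(int d, int w, double lr_, uint64_t seed)
        : depth(d), width(w), z(d, std::vector<double>(w, 0.0)), lr(lr_) {
        std::mt19937_64 gen(seed);
        for (int j = 0; j < d; ++j) seeds.push_back(gen());
    }

    // Cheap mixing hash; one independent hash and sign per row.
    uint64_t hash(int row, uint64_t key) const {
        uint64_t h = key * 0x9E3779B97F4A7C15ULL + seeds[row];
        h ^= h >> 31; h *= 0xBF58476D1CE4E5B9ULL; h ^= h >> 29;
        return h;
    }
    int bucket(int row, int i) const { return (int)(hash(row, (uint64_t)i) % (uint64_t)width); }
    double sign(int row, int i) const { return (hash(row, (uint64_t)i ^ 0xABCDULL) & 1) ? 1.0 : -1.0; }

    // Margin of the compressed classifier on a sparse example of (index, value) pairs.
    double margin(const std::vector<std::pair<int, double>>& x) const {
        double m = 0.0;
        for (const auto& [i, v] : x)
            for (int j = 0; j < depth; ++j)
                m += sign(j, i) * v * z[j][bucket(j, i)];
        return m / std::sqrt((double)depth);  // scale of the random projection
    }

    // One step of online logistic regression, applied in the compressed space.
    void update(const std::vector<std::pair<int, double>>& x, int label /* +1 or -1 */) {
        double g = -label / (1.0 + std::exp(label * margin(x)));  // dloss/dmargin
        for (const auto& [i, v] : x)
            for (int j = 0; j < depth; ++j)
                z[j][bucket(j, i)] -= lr * g * sign(j, i) * v / std::sqrt((double)depth);
    }

    // Count-Sketch-style recovery of a single weight in the original space.
    double estimate(int i) const {
        std::vector<double> est;
        for (int j = 0; j < depth; ++j)
            est.push_back(sign(j, i) * z[j][bucket(j, i)] * std::sqrt((double)depth));
        std::nth_element(est.begin(), est.begin() + est.size() / 2, est.end());
        return est[est.size() / 2];  // median across rows
    }
};

int main() {
    WMSketchDemo sk(/*depth=*/5, /*width=*/128, /*lr=*/0.1, /*seed=*/42);
    std::mt19937 gen(0);
    // Toy stream: feature 7 is perfectly predictive; the second feature is noise.
    for (int t = 0; t < 20000; ++t) {
        int label = (gen() & 1) ? 1 : -1;
        std::vector<std::pair<int, double>> x = {{7, (double)label}, {(int)(gen() % 1000), 1.0}};
        sk.update(x, label);
    }
    std::cout << "estimated weight for feature 7: " << sk.estimate(7) << "\n";
}

Despite never storing the full 1000-dimensional weight vector, the demo recovers a large positive weight for feature 7 -- the kind of high-magnitude weight retrieval the sketches above are designed to support.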

Build

The code can be built using CMake. From the root project directory, run:

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

This builds the library libwmsketch and the binaries wmsketch_classification and wmsketch_pmi. wmsketch_classification trains binary linear classifiers using the WM-Sketch and other baseline methods described in the paper. wmsketch_pmi is an application of the WM-Sketch to streaming pointwise mutual information estimation -- this is described in more detail below.

Binary Classification

The wmsketch_classification binary trains logistic regression classifiers using the WM-Sketch, the AWM-Sketch, and a number of other baseline methods described in the paper. Training and testing data should be provided in LIBSVM format -- many standard classification datasets are available in LIBSVM format from the LIBSVM datasets page (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).

For example, the RCV1 binary classification dataset can be downloaded from that page (we use the split marked "test" for training, since it is larger than the "train" split).

Run:

wmsketch_classification --train <dataset_dir>/rcv1_test.binary

This outputs the following in JSON format:

{
    "params": {
        "consv_update": false,
        "count_smooth": 1.0,
        "depth": 1,
        "feature_dim": 47237,
        ...
    },
    "results": {
       ...
       "train_count": 677399,
       "train_err_count": 34049,
       "train_err_rate": 0.050264319846944,
       "train_ms": 13078
    }
}

The train_err_rate field indicates the online error rate achieved by the classifier: the fraction of examples that were misclassified before the online update was applied (here, 34049 / 677399 ≈ 0.0503). This value can differ from run to run due to the randomness used in defining the sketch -- for deterministic results, the random seed can be set using the --seed option.

The top_indices and top_weights fields indicate the estimates for the highest-magnitude model weights. These can be compared to the indices and weights returned when training a full, uncompressed classifier (using the option --method logistic) to evaluate the accuracy of top-k weight estimation achieved by a memory-budgeted method.
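
One simple way to quantify that comparison is the recall of the sketched top-k index set against the exact one. A small helper along these lines (hypothetical, not part of the repository's tooling):

// topk_recall.cpp -- fraction of the exact top-k indices recovered by the sketch.
#include <iostream>
#include <unordered_set>
#include <vector>

double topk_recall(const std::vector<int>& exact_topk, const std::vector<int>& sketch_topk) {
    std::unordered_set<int> sketched(sketch_topk.begin(), sketch_topk.end());
    int hits = 0;
    for (int i : exact_topk) hits += (int)sketched.count(i);
    return exact_topk.empty() ? 0.0 : (double)hits / (double)exact_topk.size();
}

int main() {
    std::vector<int> exact    = {3, 14, 159, 26, 5};
    std::vector<int> sketched = {14, 3, 7, 26, 999};
    std::cout << topk_recall(exact, sketched) << "\n";  // prints 0.6
}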

By default, the executable uses the AWM-Sketch to learn the classifier. The full list of methods is:

  • logistic: Uncompressed logistic regression
  • logistic_sketch: WM-Sketch
  • activeset_logistic: AWM-Sketch
  • truncated_logistic: Simple truncation to k highest-magnitude features
  • probtruncated_logistic: Probabilistic truncation to k highest-magnitude features (similar to weighted reservoir sampling)
  • countmin_logistic: Track most frequent features as estimated by a Count-Min sketch
  • spacesaving_logistic: Track most frequent features as estimated by the Space Saving algorithm

For the feature hashing baseline, run with the options --method logistic_sketch --depth 1. See the paper for full details on the baseline methods.
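
For reference, the frequency-estimation structure behind the countmin_logistic baseline listed above works as follows: each key hashes to one counter per row, increments go to all of those counters, and the estimate is the minimum over rows, which can overestimate but never underestimates the true count. A minimal illustrative implementation (not the repository's):

// countmin_demo.cpp -- minimal Count-Min sketch for frequency estimation.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <vector>

struct CountMin {
    int depth, width;
    std::vector<std::vector<uint64_t>> table;  // depth rows of width counters

    CountMin(int d, int w) : depth(d), width(w), table(d, std::vector<uint64_t>(w, 0)) {}

    int bucket(int row, uint64_t key) const {
        uint64_t h = (key + 0x9E3779B97F4A7C15ULL * (uint64_t)(row + 1)) * 0xBF58476D1CE4E5B9ULL;
        h ^= h >> 31;
        return (int)(h % (uint64_t)width);
    }
    void add(uint64_t key, uint64_t count = 1) {
        for (int j = 0; j < depth; ++j) table[j][bucket(j, key)] += count;
    }
    uint64_t estimate(uint64_t key) const {  // minimum over rows
        uint64_t m = std::numeric_limits<uint64_t>::max();
        for (int j = 0; j < depth; ++j) m = std::min(m, table[j][bucket(j, key)]);
        return m;
    }
};

int main() {
    CountMin cm(/*depth=*/4, /*width=*/256);
    for (int t = 0; t < 1000; ++t) cm.add(42);
    cm.add(7);
    std::cout << cm.estimate(42) << " " << cm.estimate(7) << "\n";  // ~1000 and >= 1
}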

For a full list of options, run:

wmsketch_classification --help

Application: Streaming Pointwise Mutual Information Estimation

Pointwise mutual information (PMI) is an information-theoretic measure of association between random events. In natural language processing, PMI is often used as a measure of the degree to which pairs of words tend to co-occur in text -- bigrams with high PMI are more "phrase-like", since the occurrence of one word is indicative of the co-occurrence of the other. In fact, the popular word2vec skip-gram algorithm for learning word embeddings can be interpreted as an implicit factorization of a matrix of PMI values.
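
Concretely, for a pair of events x and y, PMI(x, y) = log( p(x, y) / ( p(x) p(y) ) ): the value is positive when x and y co-occur more often than they would under independence, zero when they are independent, and negative when they tend to avoid each other.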

As an application of the WM-Sketch, we show how pairs of words with high PMI values can be estimated online from a stream of text data while using a small memory budget of less than 2MB.
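
For reference, the quantity being approximated can be computed exactly (with no memory budget) by keeping full unigram and bigram counts. A minimal sketch of that uncompressed baseline, for comparison only -- this is not the memory-budgeted method implemented by wmsketch_pmi:

// empirical_pmi.cpp -- exact empirical PMI from full unigram/bigram counts.
#include <cmath>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

double empirical_pmi(const std::string& a, const std::string& b,
                     const std::map<std::string, long>& unigrams,
                     const std::map<std::pair<std::string, std::string>, long>& bigrams,
                     long num_tokens, long num_bigrams) {
    double p_ab = (double)bigrams.at({a, b}) / (double)num_bigrams;
    double p_a = (double)unigrams.at(a) / (double)num_tokens;
    double p_b = (double)unigrams.at(b) / (double)num_tokens;
    return std::log(p_ab / (p_a * p_b));
}

int main() {
    std::vector<std::string> toks = {"new", "york", "is", "in", "new", "york", "state"};
    std::map<std::string, long> uni;
    std::map<std::pair<std::string, std::string>, long> bi;
    for (size_t t = 0; t < toks.size(); ++t) {
        ++uni[toks[t]];
        if (t + 1 < toks.size()) ++bi[{toks[t], toks[t + 1]}];
    }
    std::cout << empirical_pmi("new", "york", uni, bi,
                               (long)toks.size(), (long)toks.size() - 1) << "\n";
}

The catch is that these count tables grow with the vocabulary and the number of distinct bigrams; the WM-Sketch instead uses a fixed-size structure while still recovering the highest-PMI pairs.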

Data

In the paper, we ran this application on the first 20 files of the Google Billion-Word Language Modeling Benchmark, which was derived from a newswire corpus. The dataset can be downloaded from the benchmark's project page (https://www.statmt.org/lm-benchmark/).

Run

Following the --data flag, the executable accepts a list of whitespace-delimited paths to data files. We assume that each line in each file is a separate sentence and that each sentence has already been tokenized.

For example, to run on the first two files of the dataset:

wmsketch_pmi --data <dataset_dir>/news.en-00001-of-00100 <dataset_dir>/news.en-00002-of-00100

This outputs a list of bigrams along with their estimated PMI values in JSON format:

{
    ...
    "results": {
        "num_tokens": 7769690,
        "tokens": [
          [
            "!",
            "!"
          ],
          [
            "don",
            "'t"
          ],
          [
            "per",
            "cent"
          ],
          [
            "united",
            "states"
          ],
          [
            "$",
            "million"
          ],
          ...
        ],
        ...
    }
}

For a full list of options, run:

wmsketch_pmi --help
