Logistic Regression

This tool is a distributed implementation of Logistic Regression with (Asynchronous) Stochastic Gradient Descent and the FTRL-Proximal algorithm, built on top of Multiverso.

Performance

We tested the tool on a Bing Ads click-prediction dataset at Microsoft. The dataset is about 4 TB and contains more than 5 billion samples. The experiment ran on a cluster of 24 machines, each with 20 physical cores and 256 GB of RAM, connected by InfiniBand. Training one epoch finishes in about 18 minutes.

Follow the build guide to download and install first.

How to run

For single-machine training, run:

LogisticRegression config_file_path

Here is a simple example of training on the MNIST dataset on a single machine without the parameter server.

To run in a distributed environment, launch with MPI:

mpirun -m $machine_file LogisticRegression config_file_path
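Here $machine_file is a text file listing the hosts to use, one per line. A minimal sketch (the host names and config file name below are placeholders, not taken from the source):

node-01
node-02

mpirun -m machine.list LogisticRegression ftrl.config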

Configuration file

The configuration file specifies the training settings. It is a text file in which each line has the form key=value. Below is a simple example showing the format. Suppose we are going to train a linear model on 100000-dimensional features with FTRL.

input_size=100000
output_size=2
objective_type=ftrl
train_epoch=1
sparse=true
use_ps=true
pipeline=true
minibatch_size=20
sync_frequency=5
train_file=D:/ftrl/part-1;D:/ftrl/part-2
test_file=D:/ftrl/test.data
reader_type=bsparse
output_file=D:/LogReg/ftrl.out

Basic model configuration

  • regular_type, the regularization type. Defaults to no regularization; can also be [L1 / L2].
  • objective_type, the training objective. [default / sigmoid / softmax / ftrl]
  • updater_type, the updater used when the parameter server is not used (see the dense-data example after this list). [default / sgd / ftrl]
  • input_size, the dimension of the features. Used when training on dense data.
  • output_size, the dimension of the output.
  • sparse, [true] for sparse data, [false] for dense data.
  • train_epoch, the number of training epochs.
  • minibatch_size, LogReg uses mini-batch SGD for optimization; this is the mini-batch size.
  • use_ps, whether to use the parameter server. [true] will use the DMTK framework.
  • learning_rate, the initial learning rate for the sgd updater.
  • learning_rate_coef, the learning rate is updated as max(1e-3, initial - (update_count - learning_rate_coef * minibatch_size)).
  • regular_coef, the coefficient of the regularization term.
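For comparison, here is a hypothetical configuration for dense data trained on a single machine (no parameter server) with the sgd updater. The paths and values are illustrative only, loosely following the MNIST example above:

input_size=784
output_size=10
objective_type=softmax
updater_type=sgd
sparse=false
use_ps=false
train_epoch=5
minibatch_size=20
learning_rate=0.1
regular_type=L2
regular_coef=0.001
train_file=D:/mnist/train.data
test_file=D:/mnist/test.data
output_file=D:/mnist/result.out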

Parameter server configuration

  • pipeline, whether to pipeline computation and communication (see the example after this list).
  • sync_frequency, if pipelining is disabled, the worker pulls the model from the server after every sync_frequency mini-batches.
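For example, to run with the parameter server but without pipelining, synchronizing every 10 mini-batches (values illustrative):

use_ps=true
pipeline=false
sync_frequency=10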

FTRL model configuration

These are the standard hyperparameters of the FTRL-Proximal algorithm (McMahan et al., 2013); see the reference formula after this list.

  • alpha, controls the per-coordinate learning rate schedule
  • beta, smooths the per-coordinate learning rate schedule
  • lambda1, the L1 regularization strength
  • lambda2, the L2 regularization strength
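As a reference, the textbook FTRL-Proximal per-coordinate learning rate (the standard formulation, not necessarily the exact expression in this tool's source) is

\eta_{t,i} = \frac{\alpha}{\beta + \sqrt{\sum_{s \le t} g_{s,i}^2}}

where g_{s,i} is the gradient for coordinate i at step s, and lambda1 / lambda2 weight the L1 and L2 penalty terms of the objective.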

File configuration

  • init_model_file, when provided, the model is initialized from this file
  • output_model_file, the path to save the binary model data
  • train_file, the training data
  • test_file, the testing data; if provided, the tool prints the test error after every epoch
  • output_file, the path to save the test results

train_file and test_file can use semicolons to separate multiple files.

Reader configuration

Input files can be in different formats. Use reader_type to specify the reader type.

  • default, for text files. Each line is formatted as:
# for sparse data, use the `libsvm` data format:
label key:value key:value ...
# for dense data:
label value value value ...
  • weight, for text files. Some datasets attach a weight (double) to each sample; each line is formatted as:
# for sparse data, use the `libsvm` data format:
label:weight key:value key:value ...
# for dense data:
label:weight value value value ...
  • bsparse, for binary files, sparse data only (see the sketch after this list). Each sample is laid out as:
count(size_t) label(int) weight(double) key(size_t) key(size_t) ...
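For illustration, a minimal Python sketch that writes samples in this layout, assuming the reader runs on a 64-bit little-endian platform (so size_t is 8 bytes, int is 4, double is 8); the field widths are assumptions, not taken from the tool's source:

import struct

def write_bsparse_sample(f, label, weight, keys):
    # One sample: count(size_t) label(int) weight(double) key(size_t)...
    # '<Qid' = little-endian, 8-byte count, 4-byte label, 8-byte weight.
    f.write(struct.pack('<Qid', len(keys), label, weight))
    # The sparse feature indices, each as an 8-byte key.
    f.write(struct.pack('<%dQ' % len(keys), *keys))

# Write two hypothetical samples with feature indices as keys.
with open('train.bsparse', 'wb') as f:
    write_bsparse_sample(f, label=1, weight=1.0, keys=[3, 17, 42])
    write_bsparse_sample(f, label=0, weight=1.0, keys=[5, 99])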

Other configuration

  • read_buffer_size, the number of samples the reader preloads. It should be larger than minibatch_size * sync_frequency (e.g., larger than 20 * 5 = 100 for the example configuration above).
  • show_time_per_sample, print timing statistics (computation and communication) after every show_time_per_sample samples processed.