Logistic Regression

This tool is a distributed implementation of Logistic Regression with (Asynchronous) Stochastic Gradient Descent and the FTRL-Proximal algorithm, built on top of Multiverso.

Performance

We tested the tool on a Bing Ads click-prediction dataset at Microsoft. The dataset is about 4 TB and contains more than 5 billion samples. The experiment ran on a cluster of 24 machines, each with 20 physical cores and 256 GB of RAM, connected by InfiniBand. Training one epoch finishes in about 18 minutes.

Follow the build guide to download and install first.

How to run

For single-machine training, run:

LogisticRegression config_file_path

Here is a simple example of training on the MNIST dataset on a single machine without the parameter server.

To run in a distributed environment, launch with MPI:

mpirun -m $machine_file LogisticRegression config_file_path
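Here $machine_file is a text file listing the hosts to use, one per line. A minimal sketch (the host names and config file name below are placeholders, not taken from the source):

node-01
node-02

mpirun -m machine.list LogisticRegression ftrl.config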

Configuration file

The configuration file specifies the training settings. It is a text file in which each line has the form key=value. Below is a simple example showing the format. Suppose we are going to train a linear model on 100000-dimensional features with FTRL.

input_size=100000
output_size=2
objective_type=ftrl
train_epoch=1
sparse=true
use_ps=true
pipeline=true
minibatch_size=20
sync_frequency=5
train_file=D:/ftrl/part-1;D:/ftrl/part-2
test_file=D:/ftrl/test.data
reader_type=bsparse
output_file=D:/LogReg/ftrl.out

Basic model configuration

  • regular_type, the regularization type. Defaults to no regularization; can also be [L1 / L2].
  • objective_type, the training objective. [default / sigmoid / softmax / ftrl]
  • updater_type, the updater used when the parameter server is not used (see the dense-data example after this list). [default / sgd / ftrl]
  • input_size, the dimension of the features. Used when training on dense data.
  • output_size, the dimension of the output.
  • sparse, [true] for sparse data, [false] for dense data.
  • train_epoch, the number of training epochs.
  • minibatch_size, LogReg uses mini-batch SGD for optimization; this is the mini-batch size.
  • use_ps, whether to use the parameter server. [true] will use the DMTK framework.
  • learning_rate, the initial learning rate for the sgd updater.
  • learning_rate_coef, the learning rate is updated as max(1e-3, initial - (update_count - learning_rate_coef * minibatch_size)).
  • regular_coef, the coefficient of the regularization term.
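For comparison, here is a hypothetical configuration for dense data trained on a single machine (no parameter server) with the sgd updater. The paths and values are illustrative only, loosely following the MNIST example above:

input_size=784
output_size=10
objective_type=softmax
updater_type=sgd
sparse=false
use_ps=false
train_epoch=5
minibatch_size=20
learning_rate=0.1
regular_type=L2
regular_coef=0.001
train_file=D:/mnist/train.data
test_file=D:/mnist/test.data
output_file=D:/mnist/result.out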

Parameter server configuration

  • pipeline, whether to pipeline computation and communication (see the example after this list).
  • sync_frequency, if pipelining is disabled, the worker pulls the model from the server after every sync_frequency mini-batches.
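For example, to run with the parameter server but without pipelining, synchronizing every 10 mini-batches (values illustrative):

use_ps=true
pipeline=false
sync_frequency=10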

FTRL model configuration

These are the standard hyperparameters of the FTRL-Proximal algorithm (McMahan et al., 2013); see the reference formula after this list.

  • alpha, controls the per-coordinate learning rate schedule
  • beta, smooths the per-coordinate learning rate schedule
  • lambda1, the L1 regularization strength
  • lambda2, the L2 regularization strength
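As a reference, the textbook FTRL-Proximal per-coordinate learning rate (the standard formulation, not necessarily the exact expression in this tool's source) is

\eta_{t,i} = \frac{\alpha}{\beta + \sqrt{\sum_{s \le t} g_{s,i}^2}}

where g_{s,i} is the gradient for coordinate i at step s, and lambda1 / lambda2 weight the L1 and L2 penalty terms of the objective.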

File configuration

  • init_model_file, when provided, the model is initialized from this file
  • output_model_file, the path to save the binary model data
  • train_file, the training data
  • test_file, the testing data; if provided, the tool prints the test error after every epoch
  • output_file, the path to save the test results

train_file and test_file can use semicolons to separate multiple files.

Reader configuration

Input files can be in different formats. Use reader_type to specify the reader type.

  • default, for text files. Each line is formatted as:
# for sparse data, use the `libsvm` data format:
label key:value key:value ...
# for dense data:
label value value value ...
  • weight, for text files. Some datasets attach a weight (double) to each sample; each line is formatted as:
# for sparse data, use the `libsvm` data format:
label:weight key:value key:value ...
# for dense data:
label:weight value value value ...
  • bsparse, for binary files, sparse data only (see the sketch after this list). Each sample is laid out as:
count(size_t) label(int) weight(double) key(size_t) key(size_t) ...
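For illustration, a minimal Python sketch that writes samples in this layout, assuming the reader runs on a 64-bit little-endian platform (so size_t is 8 bytes, int is 4, double is 8); the field widths are assumptions, not taken from the tool's source:

import struct

def write_bsparse_sample(f, label, weight, keys):
    # One sample: count(size_t) label(int) weight(double) key(size_t)...
    # '<Qid' = little-endian, 8-byte count, 4-byte label, 8-byte weight.
    f.write(struct.pack('<Qid', len(keys), label, weight))
    # The sparse feature indices, each as an 8-byte key.
    f.write(struct.pack('<%dQ' % len(keys), *keys))

# Write two hypothetical samples with feature indices as keys.
with open('train.bsparse', 'wb') as f:
    write_bsparse_sample(f, label=1, weight=1.0, keys=[3, 17, 42])
    write_bsparse_sample(f, label=0, weight=1.0, keys=[5, 99])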

Other configuration

  • read_buffer_size, the number of samples the reader preloads. It should be larger than minibatch_size * sync_frequency (e.g., larger than 20 * 5 = 100 for the example configuration above).
  • show_time_per_sample, print timing statistics (computation and communication) after every show_time_per_sample samples processed.