(DDTI) Distributed Decision Tree Induction

Train a model with a decision tree, then measures its predictive accuracy. Ddti does this in a distributed manner, using MPI. The master node will induct the decision tree, using slave nodes to execute tasks such as computing entropies. The algorithm building the decision tree is based on Quilan's C4.5. The algorithm only handles categorical attributes. The algorithm only handles categorical attributes. The provided datasets should not contain headers.

Written in C++14.

Usage

Being an MPI software, Ddti should be executed using mpirun/mpiexec. Here is an example:

mpirun -n 4 ./ddti --training_set ../datasets/careval.csv --attributes buying maint doors persons lug_boot safety class -o careval_model.txt

Required input options:

--training_set (-i) [string]
Path to the training dataset. The accepted formats are
CSV (.csv or .txt), ASCII (.txt), Armadillo ASCII (.txt), PGM (.pgm),
PPM (.ppm), Raw binary (.bin), Armadillo binary (.bin).

Optional input options:

--attributes (-a)
List of attribute names, separated by spaces (e.g. -a name lastname age).

--debug (-d)
Prints some debug information such as Information Gain Ratio values.

--help (-h)
Default help info.

--info [string]
Get help on a specific module or option.

--labels_column (-l) [int]
Index of the column containing the labels to predict (must be between 0 and
N-1). If unspecified, the algorithm will use the last column of the dataset.

--min_leaf_size (-m) [int]
Minimum number of instances per leaf. Default value 2.

--test_set (-t) [string]
The dataset used to test the predictive accuracy of the generated model.
If none is provided, the training set will be used. The accepted formats
are CSV (.csv or .txt), ASCII (.txt), Armadillo ASCII (.txt), PGM (.pgm),
PPM (.ppm), Raw binary (.bin), Armadillo binary (.bin).

--verbose (-v)
Display informational messages and the full list of parameters and timers at
the end of execution.

--version (-V)
Display the version of mlpack.

Optional output options:

--model_file (-o) [string]
The path of the file where the model will be dumped. The only supported
output format at the moment is .txt. If unset, the model will be printed on
the standard output.

Deployment

Make sure to install all the dependencies, then execute the following command:

git clone https://github.com/alexandretea/ddti.git && mkdir ddti/build && cd ddti/build && cmake .. && make

Dependencies

C++14
An MPI implementation
Armadillo (for matrices)
boost_program_options
MLpack (for Load(), CLI, and Logger)
CMake

Design

The processor with the rank 0 hosts the MasterNode. All the other processors host a SlaveNode. The MasterNode distributes computations -to the slaves- to build the decision tree.

List of the currently distributed tasks:

Computation of contingency tables (instance-based distribution)
- Instances are evenly scattered to every slaves
- A stream of contingency tables (one per attribute) is reduced using a sum function
Computation of conditional entropies and split entropies of an attribute
- The counts of each attribute values are scattered (so each row of a CT)
- The conditional entropies and split entropies are reduced using a sum function

NOTE: The MasterNode also receives and executes tasks (collective operations).

Future improvements

Bufferise the loading and scattering of matrices
The number of contingency table computations can be reduced (specify list of attributes)
The entropy computation of the label dimension should be distributed
Shadow master nodes to improve reliability?

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
datasets		datasets
srcs		srcs
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
run_ddti.sh		run_ddti.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

datasets

datasets

srcs

srcs

.gitignore

.gitignore

CMakeLists.txt

CMakeLists.txt

LICENSE

LICENSE

README.md

README.md

run_ddti.sh

run_ddti.sh

Repository files navigation

(DDTI) Distributed Decision Tree Induction

Table of Contents

Usage

Deployment

Dependencies

Design

Future improvements

About

Releases

Packages

Languages

License

alexandretea/ddti

Folders and files

Latest commit

History

Repository files navigation

(DDTI) Distributed Decision Tree Induction

Table of Contents

Usage

Deployment

Dependencies

Design

Future improvements

About

Topics

Resources

License

Stars

Watchers

Forks

Languages