AutomaticBenchmarking
Student: Marcus Edel
E-Mail: marcus.edel@fu-berlin.de
Project Overview: This page contains notes regarding the automatic benchmarking system for GSoC 2013. The project entails writing support scripts which run mlpack methods on a variety of datasets and produce runtime numbers. The benchmarking scripts also run the same machine learning methods from other machine learning libraries and produce runtime graphs.
The config file is used by the benchmark script to identify the available methods to be run. The benchmark script is modular: for each library, some lines in the config file need to be written. The lines in the config file specify:
- Where the particular script/method is.
- The datasets to benchmark the method with.
- Supported formats.
The benchmark script runs the scripts on the basis of these lines.
I've picked YAML as the configuration file format for the project, because YAML has a clean syntax and was designed from the start to be a data serialization language that is both powerful and human readable.
PyYAML is a YAML parser and emitter for Python. The core of the module is written in pure Python, but as of version 3.0.4, it also supports binding to the high-speed LibYAML implementation written in C. YAML is widely used in all sorts of places, such as the configuration settings for Google's AppEngine.
PyYAML requires Python 2.5 or higher.
wget http://pyyaml.org/download/pyyaml/PyYAML-3.10.tar.gz
tar xfvz PyYAML-3.10.tar.gz
cd PyYAML-3.10
python setup.py install
or
sudo pip install pyyaml
If you want to install PyYAML system-wide on Linux, you can also use a package manager.
sudo apt-get install python-yaml
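Once PyYAML is installed, loading a configuration is a one-liner. A minimal sketch (the keys shown mirror the configuration format described below):

```python
import yaml

# Minimal sketch: parse a small benchmark configuration with PyYAML.
# yaml.safe_load turns the YAML document into plain Python dicts and lists.
config = yaml.safe_load("""
library: mlpack
methods:
  PCA:
    script: methods/mlpack/pca.py
    run: true
""")

print(config["library"])                # -> mlpack
print(config["methods"]["PCA"]["run"])  # -> True
```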
# General datasets.
Datasets: &pca_datasets
  - 'wine.csv'
  - 'iris.csv'

# MLPACK:
# A Scalable C++ Machine Learning Library
library: mlpack
methods:
  PCA:
    script: methods/mlpack/pca.py
    format: [csv, txt]
    run: true
    plot: ['cities.csv']
    datasets:
      - files: [*pca_datasets, 'cities.csv']
        options: '-d 2'
      - files: ['cities.csv']
        options: '-d 6'
  NMF:
    script: methods/mlpack/nmf.py
    format: [csv, txt]
    datasets:
      - files: ['piano.csv']
        options: '-r 6 -s 42 -u multdist'
This sample document defines an associative array with two top-level keys: library and methods. The methods entity has two block mappings related to it, PCA and NMF. Each block mapping contains a list, each element of which is itself an associative array with differing keys. To avoid repetition in the config file, it's possible to reuse mappings with the * (alias) operator, as shown above. Notice that strings do not require enclosure in quotation marks.
A nice feature in YAML is the concept of documents. A document is not just a separate file; you can have multiple documents in a single YAML stream, if each one is separated by ---, like this:
# MLPACK:
# A Scalable C++ Machine Learning Library
library: mlpack
methods:
  PCA:
    ...
---
# Weka:
# Data Mining Software in Java
library: weka
methods:
  PCA:
    ...
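A multi-document stream like the one above can be read with PyYAML's safe_load_all. A small sketch:

```python
import yaml

# Sketch: a stream with two documents separated by '---', as in the
# configuration above; safe_load_all yields one dictionary per document.
stream = """\
library: mlpack
---
library: weka
"""

for document in yaml.safe_load_all(stream):
    print(document["library"])
# -> mlpack
# -> weka
```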
||||= '''library''' =||
|| '''description''' || A name to identify the library. The name is also used for the output; for this reason, avoid names longer than 23 characters. ||
|| '''Syntax''' || library: '''name''' ||
|| '''Required''' || Yes ||
|||| ||
||||= '''script''' =||
|| '''description''' || Path to the method script to be benchmarked. You can use a path relative to the benchmark root folder, an absolute path, or a symlink. ||
|| '''Syntax''' || script: '''name''' ||
|| '''Required''' || Yes ||
|||| ||
||||= '''files''' =||
|| '''description''' || An array of datasets for this method. You can use a path relative to the benchmark root folder, an absolute path, or a symlink. If a method requires more than one data set, add the data sets as a nested list. ||
|| '''Syntax''' || files: '''[...]''' or '''[ [...] ]''' ||
|| '''Required''' || Yes ||
|||| ||
||||= '''run''' =||
|| '''description''' || A flag that indicates whether the benchmark will be executed. ||
|| '''Syntax''' || run: '''True | False''' ||
|| '''Default''' || '''True''' ||
|| '''Required''' || No ||
|||| ||
||||= '''iterations''' =||
|| '''description''' || The number of executions for this method. It is recommended to set the value higher than one in order to obtain meaningful results. ||
|| '''Syntax''' || iterations: '''number''' ||
|| '''Default''' || '''3''' ||
|| '''Required''' || No ||
|||| ||
||||= '''formats''' =||
|| '''description''' || An array of file formats supported by this method. If a data set isn't available in one of these formats, the benchmark script tries to convert it. ||
|| '''Syntax''' || formats: '''[...]''' ||
|| '''Required''' || No ||
|||| ||
||||= '''options''' =||
|| '''description''' || A string of options for this method. The string is passed to the script when it is started. ||
|| '''Syntax''' || options: '''String''' ||
|| '''Default''' || '''None''' ||
|| '''Required''' || No ||
|||| ||
The configuration described here is the smallest possible configuration; it contains only the required options to benchmark a method.
# MLPACK:
# A Scalable C++ Machine Learning Library
library: mlpack
methods:
  PCA:
    script: methods/mlpack/pca.py
    format: [csv, txt, hdf5, bin]
    datasets:
      - files: ['isolet.csv']
In this case we benchmark the pca method located in methods/mlpack/pca.py with the isolet dataset. The pca method supports the formats txt, csv, hdf5, and bin. The benchmark script uses the default values for all non-specified options.
Combining all the elements discussed above results in the following configuration, which is typically saved as config.yaml.
# mlpack:
# A Scalable C++ Machine Learning Library
library: mlpack
methods:
  PCA:
    script: methods/mlpack/pca.py
    format: [csv, txt, hdf5, bin]
    run: true
    iterations: 2
    datasets:
      - files: ['isolet.csv', 'cities']
        options: '-s'
In this case we benchmark the pca method located in methods/mlpack/pca.py with the isolet and the cities dataset. The -s option makes the method scale the data before running PCA. The benchmark is performed twice for each dataset. Additionally, the pca.py script supports the file formats {{{txt, csv, hdf5 and bin}}}; if a data set isn't available in one of these formats, the benchmark script will convert it.
To test the configuration file, use the following command:
make test config.yaml
The command checks the configuration for correct syntax and then tries to open the files referred to in the configuration.
- Python 3.2+
- Python-yaml (Complete YAML 1.1 parser and emitter for Python.)
The main benchmark script is written in Python and uses YAML as the configuration file format.
- Valgrind (Suite of tools for debugging and profiling.)
The benchmark script uses the Massif tool, a heap profiler from the Valgrind suite, to measure how much heap memory the mlpack method uses. By default, the benchmark script doesn't use Massif to profile the heap; for this reason, it isn't necessary to install Valgrind.
- matplotlib (2D plotting library for python.)
The benchmark script uses the matplotlib Python library to create the plots. By default, the benchmark script doesn't create any plots; for this reason, it isn't necessary to install matplotlib.
The benchmark package already comes with predefined scripts to benchmark the different machine learning libraries:
- mlpack (ALLKFN, ALLKNN, ALLKRANN, DET, FASTMKS, GMM, HMM Generate, HMM Loglik, HMM Train, HMM Viterbi, ICA, KPCA, K-Means, LARS, Linear Regression, Local Coordinate Coding, LSH, NBC, NCA, NMF, PCA, Range Search, Sparse Coding)
- WEKA (ALLKNN, K-Means, Linear Regression, NBC, PCA)
- MATLAB (ALLKNN, HMM Generate, HMM Viterbi, K-Means, Linear Regression, NBC, NMF, PCA, Range Search)
- Shogun (ALLKNN, GMM, KPCA, K-Means, LARS, Linear Regression, NBC, PCA)
- Scikit (ALLKNN, GMM, ICA, KPCA, K-Means, LARS, Linear Regression, NBC, NMF, PCA, Sparse Coding)
- MLPy (ALLKNN, KPCA, K-Means, LARS, Linear Regression, PCA)
In order to run one of the predefined benchmark scripts, you need to install the corresponding library yourself.
The script specifies how to benchmark a given method. It has to provide a Python class with two functions: a {{{Constructor}}} and a {{{RunMethod()}}} function.
An example based on the MLPACK principal component analysis (PCA) method:
#!python
class PCA:
    def __init__(self, dataset):
        # Code here.
        pass

    def RunMethod(self, options):
        # Code here.
        # return time
        pass
In this case we define a class with the name PCA. The name of the class is important, because it must match the name listed in the configuration file to benchmark the script.
The first method, __init__(), is special and is often called the constructor. It is automatically invoked for the newly-created class instance; in our case it is invoked at the beginning of the benchmark. One parameter is handed over when the main benchmark script invokes this function: the dataset parameter, which can contain the path to a data set or a list of data sets. The constructor should be used to initialize values or to load data sets, i.e. for things you only have to do once.
In the second method, RunMethod(), the benchmark should be performed. One parameter is handed over when the main benchmark script invokes this function: the options parameter, which can contain additional parameters that are important for the method, e.g. the desired dimensionality of the output data set. At the end of this method the benchmark time should be returned.
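As a sketch, a trivial script following this interface might look as follows (the body and the returned time are placeholders, not mlpack's actual PCA code):

```python
class PCA:
    def __init__(self, dataset):
        # Store the data set path (or list of paths) for later use.
        self.dataset = dataset

    def RunMethod(self, options):
        # A real script would run and time the method here; the constant
        # return value below is a placeholder for the measured time.
        return 0.5

# The main benchmark script would do something equivalent to:
instance = PCA("datasets/cities.csv")
print(instance.RunMethod("-d 2"))  # -> 0.5
```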
Note that the user also has the ability to write code in a language other than Python. To achieve this, RunMethod() can call a function in a different language, or can invoke e.g. a bash script and return the result. To call a bash script you can use the following code sample:
#!python
import shlex
import subprocess
cmd = shlex.split("ls -l")
s = subprocess.check_output(cmd, shell=False)
To run the benchmarks, follow these steps:
- Check out the current sources from Subversion. This may take some time because some data sets in the datasets folder are 100MB+ large.
$ svn co http://svn.cc.gatech.edu/fastlab/mlpack/conf/jenkins-conf/benchmark/
- Edit the configuration file config.yaml and set the run variable from False to True for the desired method.
- Set the correct path for the environment variables depending on which library you would like to benchmark. There are two possibilities: you can edit the Makefile and set the correct path for the environment variables, or you can pass the correct path when starting the benchmark.
Edit the Makefile:
export MLPACK_BIN=/path/to/the/mlpack/bin/
Or pass the correct path when starting the benchmark:
$ make MLPACK_BIN=/path/to/the/mlpack/bin/ run
- This step is optional: if you want to benchmark one of the predefined Weka scripts or the Shogun K-Means script, you have to build the source files with this command:
$ make scripts
To benchmark a method run make with one of the following extensions from the root folder:
```
$ make run # Perform the benchmark with the given config. Default config.yaml.
$ make memory # Get memory profiling information with the given config. Default config.yaml.
```
It is also possible to benchmark only specific libraries. You can specify the libraries with the BLOCK parameter. The following command benchmarks only the mlpack and the weka library.
```
$ make BLOCK=mlpack,weka run
```
Notes: If necessary, you have to set PYTHONPATH and LD_LIBRARY_PATH to start the benchmark script.
To benchmark the mlpack methods, the scripts use the mlpack executables. The mlpack methods already have a built-in timer, so there is no need to provide a new timing function. To get the runtime information, the script starts the executables with the right arguments and parses the timing data.
Here we run the PCA method with the cities data set; the verbose option displays the necessary information at the end of the execution.
$ pca -i cities.csv -o output.csv -v
[INFO ] Loading 'cities.csv' as CSV data. Size is 9 x 329.
[INFO ] Performing PCA on dataset...
[INFO ] Saving CSV data to 'output.csv'.
[INFO ]
[INFO ] Execution parameters:
[INFO ] help: false
[INFO ] info: ""
[INFO ] input_file: cities.csv
[INFO ] new_dimensionality: 0
[INFO ] output_file: output.csv
[INFO ] scale: false
[INFO ] verbose: true
[INFO ]
[INFO ] Program timers:
[INFO ] loading_data: 0.013257s
[INFO ] saving_data: 0.004750s
[INFO ] total_time: 0.343844s
To get the runtime information, we just parse the three program timers and calculate the elapsed time. This has the advantage that we get the real execution time.
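A sketch of how such timer lines could be parsed (the regular expression is an assumption; the actual parser in the benchmark scripts may differ):

```python
import re

# Parse the mlpack "[INFO ] name: 0.123456s" timer lines shown above.
output = """
[INFO ]   loading_data: 0.013257s
[INFO ]   saving_data: 0.004750s
[INFO ]   total_time: 0.343844s
"""

timers = {name: float(value)
          for name, value in re.findall(r"\[INFO \]\s+(\w+): ([\d.]+)s", output)}

# Real execution time: total time minus the time spent loading and saving data.
elapsed = timers["total_time"] - timers["loading_data"] - timers["saving_data"]
print(round(elapsed, 6))  # -> 0.325837
```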
The scripts for the WEKA library are written in Java and use the built-in Java timer to measure the time. To get the runtime information, the script starts the executables with the right arguments and parses the timing data.
Here we run the PCA method with the cities data set. The necessary information is shown at the end of the execution.
$ java -classpath ".:/path/to/weka/weka.jar" PCA -i cities.arff
[INFO ] total_time: 0.83215s
- The WEKA methods only support files with a header. The benchmark script can convert {{{csv}}} and {{{txt}}} files without a header into the {{{arff}}} format with header information.
- You can use the provided timer class to measure the elapsed execution time.
- You have to build the Java source code before benchmarking. You can use the command make scripts from the benchmark root folder to build all Java files in the {{{methods/weka/src}}} folder. Afterwards, the byte code is located in the {{{methods/weka/}}} folder.
The scripts for the MATLAB library are written in MATLAB and use the built-in MATLAB timer to measure the time. To get the runtime information, the script starts the executables with the right arguments and parses the timing data.
Here we run the PCA method with the cities data set. The necessary information is shown at the end of the execution.
$ matlab -nodisplay -nosplash -r "try, PCA('-i cities.csv'), catch, exit(1), end, exit(0)"
[INFO ] total_time: 1.21523s
- The script must be on your MATLAB path. By default the benchmark script adds the {{{methods/matlab/}}} folder to the MATLAB path.
- The predefined MATLAB scripts only support files in the {{{csv}}} and {{{txt}}} format.
The scripts for the Shogun library are written in Python and C++. The scripts written in Python use the built-in Python timer to measure the time and the Shogun Python interface to invoke the functions. The script written in C++ defines its own timer code to measure the time.
Here we run the K-Means method with the iris data set and set initial centroids. The necessary information is shown at the end of the execution.
$ ./KMEANS -i iris.csv -I centroids_iris.csv
[INFO ] total_time: 0.012535s
- You have to build the K-Means source code before benchmarking. You can use the command make scripts from the benchmark root folder to build the K-Means method.
The scripts for the Scikit library are written in Python and use the built-in Python timer to measure the time and the Scikit Python interface to invoke the functions. To measure the elapsed execution time we don't have to parse any runtime information.
The scripts for the MLPy library are written in Python and use the built-in Python timer to measure the time and the MLPy Python interface to invoke the functions.
- CONFIG [string] - The path to the configuration file to perform the benchmark on. Default 'config.yaml'.
- BLOCK [string] - Run only the specified blocks defined in the configuration file. By default, all blocks are run.
- LOG [boolean] - If set, the reports will be saved in the database. Default 'False'.
- UPDATE [boolean] - If set, the latest reports in the database are updated. Default 'False'.
- METHODBLOCK [string] - Run only the specified methods defined in the configuration file. By default, all methods are run.
- test [parameters] - Test the configuration file: check for correct syntax and then try to open the files referred to in the configuration file.
- run [parameters] - Perform the benchmark with the given config.
- memory [parameters] - Get memory profiling information with the given config.
- scripts - Compile the Java files for the weka methods.
- reports [parameters] - Create the reports.
To save the results we use Python's built-in SQLite support. SQLite is a C library that provides a disk-based database and doesn't require a separate server. To store the results for a method, it isn't necessary to specify a new function: all data is collected by the main benchmark script and stored in the database.
||||||= '''BUILDS''' =||
|| '''id''' || '''INTEGER PRIMARY KEY AUTOINCREMENT''' || Continuous number to identify the build. This number is the reference for the other tables. ||
|| '''build''' || '''TIMESTAMP NOT NULL''' || Timestamp to identify the build. The timestamp is mainly used to sort the builds and to determine when the build was made. ||
|| '''libary_id''' || '''INTEGER NOT NULL, FOREIGN KEY(libary_id) REFERENCES libraries(id)''' || Each build is for a single library; this number is the reference for the associated library. ||
|||| ||
||||||= '''LIBRARIES''' =||
|| '''id''' || '''INTEGER PRIMARY KEY AUTOINCREMENT''' || Continuous number to identify the libraries. This number is the reference for the other tables. ||
|| '''name''' || '''TEXT NOT NULL''' || A name to identify the library. The name is taken from the config file. ||
|||| ||
||||||= '''DATASETS''' =||
|| '''id''' || '''INTEGER PRIMARY KEY AUTOINCREMENT''' || Continuous number to identify the data set. This number is the reference for the other tables. ||
|| '''name''' || '''TEXT NOT NULL UNIQUE''' || A name to identify the data set. The name is taken from the config file. ||
|| '''size''' || '''INTEGER NOT NULL''' || The size of the data set, specified in megabytes. ||
|| '''attributes''' || '''INTEGER NOT NULL''' || The number of attributes of the data set. ||
|| '''instances''' || '''INTEGER NOT NULL''' || The number of instances of the data set. ||
|| '''type''' || '''TEXT NOT NULL''' || The type of the data set, e.g. "Real". ||
|||| ||
||||||= '''METHODS''' =||
|| '''id''' || '''INTEGER PRIMARY KEY AUTOINCREMENT''' || Continuous number to identify the method. This number is the reference for the other tables. ||
|| '''name''' || '''TEXT NOT NULL''' || A name to identify the method. The name is taken from the config file. ||
|| '''parameters''' || '''TEXT NOT NULL''' || The specified parameters/options for the given method. The parameters/options are taken from the config file. ||
|||| ||
||||||= '''RESULTS''' =||
|| '''id''' || '''INTEGER PRIMARY KEY AUTOINCREMENT''' || Continuous number to identify the results. ||
|| '''build_id''' || '''INTEGER NOT NULL, FOREIGN KEY(build_id) REFERENCES builds(id)''' || This number is a reference to the id from the builds table. ||
|| '''libary_id''' || '''INTEGER NOT NULL, FOREIGN KEY(libary_id) REFERENCES libraries(id)''' || This number is a reference to the id from the libraries table. ||
|| '''time''' || '''REAL NOT NULL''' || This value contains the measured time of the specified method. ||
|| '''var''' || '''REAL NOT NULL''' || This value contains the measured variance of the specified method. ||
|| '''dataset_id''' || '''INTEGER NOT NULL, FOREIGN KEY(dataset_id) REFERENCES datasets(id)''' || This number is a reference to the id from the datasets table. ||
|| '''method_id''' || '''INTEGER NOT NULL, FOREIGN KEY(method_id) REFERENCES methods(id)''' || This number is a reference to the id from the methods table. ||
|||| ||
||||||= '''MEMORY''' =||
|| '''id''' || '''INTEGER PRIMARY KEY AUTOINCREMENT''' || Continuous number to identify the memory results. ||
|| '''build_id''' || '''INTEGER NOT NULL, FOREIGN KEY(build_id) REFERENCES builds(id)''' || This number is a reference to the id from the builds table. ||
|| '''libary_id''' || '''INTEGER NOT NULL, FOREIGN KEY(libary_id) REFERENCES libraries(id)''' || This number is a reference to the id from the libraries table. ||
|| '''dataset_id''' || '''INTEGER NOT NULL, FOREIGN KEY(dataset_id) REFERENCES datasets(id)''' || This number is a reference to the id from the datasets table. ||
|| '''method_id''' || '''INTEGER NOT NULL, FOREIGN KEY(method_id) REFERENCES methods(id)''' || This number is a reference to the id from the methods table. ||
|| '''memory_info''' || '''INTEGER NOT NULL, FOREIGN KEY(method_id) REFERENCES methods(id)''' || This field contains the path of the massif logfile. ||
|||| ||
||||||= '''METHOD_INFO''' =||
|| '''id''' || '''INTEGER PRIMARY KEY AUTOINCREMENT''' || Continuous number to identify the method info results. ||
|| '''method_id''' || '''INTEGER NOT NULL, FOREIGN KEY(method_id) REFERENCES methods(id)''' || This number is a reference to the id from the methods table. ||
|| '''info''' || '''TEXT NOT NULL''' || This value contains the info for the specified method. ||
|||| ||
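As an illustration, the BUILDS and LIBRARIES tables above could be created with Python's built-in sqlite3 module roughly like this (a sketch only; the benchmark script creates the actual schema itself):

```python
import sqlite3

# Create the two tables in an in-memory database, using the column
# definitions from the tables above (including the 'libary_id' spelling).
connection = sqlite3.connect(":memory:")
connection.executescript("""
CREATE TABLE libraries (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT NOT NULL
);
CREATE TABLE builds (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  build TIMESTAMP NOT NULL,
  libary_id INTEGER NOT NULL,
  FOREIGN KEY(libary_id) REFERENCES libraries(id)
);
""")

connection.execute("INSERT INTO libraries (name) VALUES (?)", ("mlpack",))
connection.execute(
    "INSERT INTO builds (build, libary_id) VALUES (CURRENT_TIMESTAMP, 1)")
name, = connection.execute("SELECT name FROM libraries WHERE id = 1").fetchone()
print(name)  # -> mlpack
```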
I've picked the matplotlib Python library to create the plots for the project, because with matplotlib we have full control over plot properties like styles and fonts.
The data for the graphs is loaded from the database; for this reason, it isn't necessary to specify a new function for a method. All data is collected by the main benchmark script and stored in the database for later processing.
To create the reports, the values must be written to the database. To store the measured values in the database, run the following command from the benchmark root folder:
make run LOG=True
As with the normal benchmark, you can add parameters to specify the methods and libraries. See the parameter section for more details.
To create the reports page run make with the following extensions from the root folder:
make reports
With this command the reports are saved in the reports folder. To view the reports, open index.html in a browser. For the design of the reports we use the Twitter Bootstrap framework, which contains HTML- and CSS-based design templates. Most of the templates are designed to be backward compatible, so the reports are viewable on almost all devices and browsers.
The top plot shows the development of mlpack over time (all time values are summed), giving an initial impression of mlpack's development.
Notice that the plot is highly sensitive to adding or removing a method or a dataset.
The progress bar shows the percentage of data sets on which mlpack performs best.
The bar chart shows the timing data for the different methods and data sets. The bars are grouped by the specified library taken from the configuration file.
The line chart shows the development of a given mlpack method (all time values for the specified method are summed), giving an initial impression of whether changes over time have caused speedups or slowdowns.
The memory chart shows how much heap memory the method uses with the given dataset. It is also possible to look more closely into the Massif log; for this, the logs are attached under the memory chart.
This section is a step-by-step guide which shows how to write a new script.
- Use the following template.
#!python
class ScriptName(object):
    def __init__(self, dataset, timeout=0, verbose=True):
        # Code here.
        pass

    def RunMethod(self, options):
        # Code here.
        # return time
        pass
- Open the template and edit the required sections.
editor script_template.py
2.1 Edit the __init__() function.
The method __init__() is special and is often called the constructor. It is automatically invoked for the newly-created class instance; in our case it is invoked at the beginning of the benchmark. Two parameters are handed over when the main benchmark script invokes this function: the dataset parameter, which can contain the path to a data set or a list of data sets, and the timeout parameter, which contains the timeout in seconds.
In this example we do nothing in the __init__() function. However, we want to use the parameters in another function, so we have to add the following lines to make them available there:
self.dataset = dataset
self.timeout = timeout
2.2 Edit the RunMethod() function.
The RunMethod() function is automatically invoked after the __init__() function. One parameter is handed over when the main benchmark script invokes this function: the options parameter, which can contain additional parameters. The RunMethod() function is the place to put the code that benchmarks a method.
In this example we benchmark the following simple CPU-bound command.
$ echo '2^2^20' | time bc > /dev/null
To achieve that, we use the Python subprocess module. First we import it:
import subprocess
To avoid dealing with lexical analysis problems ourselves, we also import the Python shlex module:
import shlex
Now we can pass the command to the shlex.split function, which does all the lexical work, and pass the result to subprocess. subprocess has a nice benefit: it provides a built-in timeout option, so we can use it for our script.
cmd = shlex.split("echo '2^2^20' | time bc > /dev/null")
s = subprocess.check_output(cmd, shell=False, timeout=self.timeout)
Note: "Executing shell commands that incorporate unsanitized input from an untrusted source makes a program vulnerable to shell injection, a serious security flaw which can result in arbitrary command execution. For this reason, the use of shell=True is strongly discouraged. The parameter shell=False disables all shell based features, but does not suffer from this vulnerability".
2.3 Measure the time.
To measure the time we can use the provided timer function from the util folder. To use this timer function we have to add the util folder to the import path and import the timer module.
To add the util folder to the import path, add the following lines:
import os
import sys
import inspect
cmd_subfolder = os.path.realpath(os.path.abspath(os.path.join(
os.path.split(inspect.getfile(inspect.currentframe()))[0], "path/to/the/util/folder")))
if cmd_subfolder not in sys.path:
sys.path.insert(0, cmd_subfolder)
To import the timer module add the following line:
from timer import *
To measure the time with the timer module we have to create a timer object and wrap the code we would like to benchmark with the following command:
totalTimer = Timer()
with totalTimer:
# code here
To return the time to the benchmark script we use the following command:
return totalTimer.ElapsedTime()
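The Timer class itself lives in the util folder and isn't shown here; a minimal stand-in with the same interface (an assumption based on the usage above, measuring wall-clock time) could look like this:

```python
import time

# Minimal stand-in for the util folder's Timer class: a context manager
# that records wall-clock time and reports it via ElapsedTime().
class Timer(object):
    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.end = time.time()

    def ElapsedTime(self):
        return self.end - self.start

totalTimer = Timer()
with totalTimer:
    sum(x * x for x in range(100000))  # some work to measure
print(totalTimer.ElapsedTime() >= 0)  # -> True
```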
If you follow the steps, the script should look like:
import os
import sys
import inspect
import shlex
import subprocess

cmd_subfolder = os.path.realpath(os.path.abspath(os.path.join(
    os.path.split(inspect.getfile(inspect.currentframe()))[0], "path/to/the/util/folder")))
if cmd_subfolder not in sys.path:
    sys.path.insert(0, cmd_subfolder)

from timer import *

class ScriptName(object):
    def __init__(self, dataset, timeout=0, verbose=True):
        self.dataset = dataset
        self.timeout = timeout

    def RunMethod(self, options):
        totalTimer = Timer()
        with totalTimer:
            cmd = shlex.split("echo '2^2^20' | time bc > /dev/null")
            s = subprocess.check_output(cmd, shell=False, timeout=self.timeout)
        return totalTimer.ElapsedTime()
- Add the new script to the configuration file located in the benchmark root folder.
To benchmark the new script we have to specify the run-time parameters in the {{{config.yaml}}} file. You can use the following lines to achieve this:
library: newLibrary
methods:
  ScriptName:
    run: true
    script: path/to/the/new/script/new_script.py
    format: ['']
    datasets:
      - files: ['']