MLBenchmarks.jl

This repo provides Julia-based benchmarks of ML algorithms on tabular data. It was developed to support both the NeuroTreeModels.jl and EvoTrees.jl projects.

Methodology

For each dataset and algorithm, the following methodology is applied (a minimal sketch of the loop is shown after the list):

  • Data is split into three parts: train, eval, and test.
  • A random grid of 16 hyper-parameter configurations is generated.
  • For each configuration, a model is trained on the train data until the evaluation metric, tracked on the eval data, stops improving (early stopping).
  • The trained model is then evaluated on the test data.
  • The metrics reported below are those obtained on the test data by the model that achieved the best eval metric.
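
The procedure above amounts to a simple random-search loop with early stopping. The sketch below illustrates it; `sample_config`, `fit_with_early_stopping`, and `test_score` are hypothetical placeholders for the model-specific training and evaluation code, and a lower-is-better eval metric is assumed.

```julia
# Sketch of the benchmark loop for one dataset/algorithm pair.
# `sample_config`, `fit_with_early_stopping` and `test_score` are hypothetical
# placeholders; they are not functions exported by this repo.
function run_benchmark(dtrain, deval, dtest; ngrid = 16)
    best_eval, best_model = Inf, nothing
    for _ in 1:ngrid
        config = sample_config()                         # one random hyper-parameter configuration
        model, eval_metric = fit_with_early_stopping(config, dtrain, deval)
        if eval_metric < best_eval                       # keep the model with the best eval metric
            best_eval, best_model = eval_metric, model
        end
    end
    return test_score(best_model, dtest)                 # metric reported in the tables below
end
```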

Datasets

The following selection of common tabular datasets is covered (the loss functions are sketched after the list):

  • Year: regression with mean squared error (MSE)
  • MSRank: ranking problem, trained as regression with mean squared error
  • YahooRank: ranking problem, trained as regression with mean squared error
  • Higgs: binary classification with logistic loss
  • Boston Housing: regression with mean squared error (MSE)
  • Titanic: binary classification with logistic loss
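
The losses named above follow their standard definitions; the sketch below assumes plain, unweighted forms (the repo's actual metric implementations may differ, e.g. in sample weighting).

```julia
using Statistics

# Mean squared error between targets y and predictions p.
mse(y, p) = mean((y .- p) .^ 2)

# Binary logistic loss (logloss); p is the predicted probability of the positive class.
logloss(y, p) = mean(@. -y * log(p) - (1 - y) * log(1 - p))
```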

Algorithms

The comparison covers the following algorithms, considered state of the art on tabular data tasks: NeuroTreeModels, EvoTrees, XGBoost, LightGBM, and CatBoost. Results for each dataset are reported below; training times are in seconds.

Boston

| model_type | train_time (s) | mse  | gini  |
|------------|----------------|------|-------|
| neurotrees | 12.8           | 18.9 | 0.947 |
| evotrees   | 0.206          | 19.7 | 0.927 |
| xgboost    | 0.0648         | 19.4 | 0.935 |
| lightgbm   | 0.865          | 25.4 | 0.926 |
| catboost   | 0.0511         | 13.9 | 0.946 |

Titanic

| model_type | train_time (s) | logloss | accuracy |
|------------|----------------|---------|----------|
| neurotrees | 7.58           | 0.407   | 0.828    |
| evotrees   | 0.673          | 0.382   | 0.828    |
| xgboost    | 0.0379         | 0.375   | 0.821    |
| lightgbm   | 0.615          | 0.390   | 0.836    |
| catboost   | 0.0326         | 0.388   | 0.836    |

Year

| model_type | train_time (s) | mse  | gini  |
|------------|----------------|------|-------|
| neurotrees | 280.0          | 76.4 | 0.652 |
| evotrees   | 18.6           | 80.1 | 0.627 |
| xgboost    | 17.2           | 80.2 | 0.626 |
| lightgbm   | 8.11           | 80.3 | 0.624 |
| catboost   | 80.0           | 79.2 | 0.635 |

MSRank

| model_type | train_time (s) | mse   | ndcg  |
|------------|----------------|-------|-------|
| neurotrees | 39.1           | 0.578 | 0.462 |
| evotrees   | 37.0           | 0.554 | 0.504 |
| xgboost    | 12.5           | 0.554 | 0.503 |
| lightgbm   | 37.5           | 0.553 | 0.503 |
| catboost   | 15.1           | 0.558 | 0.497 |

Yahoo

| model_type | train_time (s) | mse   | ndcg  |
|------------|----------------|-------|-------|
| neurotrees | 417.0          | 0.584 | 0.781 |
| evotrees   | 687.0          | 0.545 | 0.797 |
| xgboost    | 120.0          | 0.547 | 0.798 |
| lightgbm   | 244.0          | 0.540 | 0.796 |
| catboost   | 161.0          | 0.561 | 0.794 |

Higgs

| model_type | train_time (s) | logloss | accuracy |
|------------|----------------|---------|----------|
| neurotrees | 12300.0        | 0.452   | 0.781    |
| evotrees   | 2620.0         | 0.464   | 0.776    |
| xgboost    | 1390.0         | 0.462   | 0.776    |
| lightgbm   | 1330.0         | 0.461   | 0.779    |
| catboost   | 7180.0         | 0.464   | 0.775    |
