This repository has been archived by the owner on Jul 31, 2021. It is now read-only.

Sparse Matrix Support #10

Open
ThomasWolf0701 opened this issue Sep 2, 2020 · 7 comments

Comments

@ThomasWolf0701
ThomasWolf0701 commented Sep 2, 2020

The current code converts all data to a dense matrix with as.matrix():

    private$dtrain = lightgbm::lgb.Dataset(
      data = as.matrix(data[, task$feature_names, with = FALSE]),
      label = label,
      free_raw_data = FALSE
    )

But both mlr3 and the lightgbm R package support sparse matrices. From the lightgbm documentation:

https://lightgbm.readthedocs.io/en/latest/R/reference/lgb.Dataset.html

data: a matrix object, a dgCMatrix object or a character representing a filename

And for mlr3:
https://mlr3.mlr-org.com/reference/DataBackendMatrix.html

It would be really great if sparse matrices (dgCMatrix) were supported, maybe via as(data, "sparseMatrix") or similar.
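
As a rough illustration of the request above, the following sketch converts a small numeric feature table to a sparse matrix and passes it to lgb.Dataset() in place of the dense one. The toy data and variable names are assumptions made for this example, and Matrix::Matrix(..., sparse = TRUE) merely stands in for whatever conversion the learner would use; this is not the package's implementation.

```r
library(Matrix)

# Toy numeric features standing in for data[, task$feature_names, with = FALSE]
features <- data.frame(
  x1 = c(0, 0, 3, 0),
  x2 = c(1, 0, 0, 0),
  x3 = c(0, 2, 0, 5)
)
label <- c(0, 1, 0, 1)

# Convert to a sparse matrix; for this non-square numeric input
# Matrix() returns a column-compressed dgCMatrix
sparse <- Matrix::Matrix(as.matrix(features), sparse = TRUE)

# Only build the dataset when lightgbm is installed
if (requireNamespace("lightgbm", quietly = TRUE)) {
  dtrain <- lightgbm::lgb.Dataset(
    data = sparse,
    label = label,
    free_raw_data = FALSE
  )
}
```

Compared with the original call, only the object passed as data changes; label and free_raw_data stay as they were.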

@ThomasWolf0701
Author

From my understanding, when the mlr3 data.table backend is used, your code runs some preprocessing steps that transform the data.table into a data.frame, then into a numerical format usable by lightgbm, and finally into a matrix with as.matrix(), which is passed to the lgb.Dataset function?

If the user uses mlr3 with a DataBackendMatrix, this matrix could be passed directly to the lgb.Dataset function without as.matrix(). The sparsity would then be preserved, using the canonical mlr3 way.
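
To make the point about preserved sparsity concrete, here is a small sketch comparing the memory footprint of a sparse matrix with its as.matrix() expansion. The dimensions and density are arbitrary assumptions chosen for illustration, not values from the package or from any real task.

```r
library(Matrix)

set.seed(1)
# 1000 x 200 matrix in which roughly 1% of the entries are non-zero
sparse <- rsparsematrix(nrow = 1000, ncol = 200, density = 0.01)

# as.matrix() stores every zero as an explicit double
dense <- as.matrix(sparse)

size_sparse <- as.numeric(object.size(sparse))
size_dense <- as.numeric(object.size(dense))
cat("sparse:", size_sparse, "bytes; dense:", size_dense, "bytes\n")
```

The dense copy needs 8 bytes per cell regardless of content, so the gap grows with the share of zeros in the data.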

@statist-bhfz
Contributor

I'm not sure if it is worth setting dgCMatrix as the default format. Maybe additional parameterization is required? That would also mean one more if-else statement in https://github.com/mlr3learners/mlr3learners.lightgbm/blob/development/R/backend_preprocessing.R, where all preprocessing steps should be moved.

@kapsner
Member

kapsner commented Sep 7, 2020

I am currently on it! @statist-bhfz, good idea to move everything to backend_preprocessing. However, I need to figure out how best to do it, since we currently seem to need the as.matrix() function for passing data.tables to lgb.Dataset.

@statist-bhfz
Contributor

statist-bhfz commented Sep 9, 2020

@kapsner as.matrix() is not mandatory; lgb.Dataset() also supports dgCMatrix objects: https://lightgbm.readthedocs.io/en/latest/R/reference/lgb.Dataset.html
mlr3 has the DataBackendMatrix backend, which stores the data in sparse format, but this format is Matrix::sparseMatrix(). I think the most convenient solution is to use data.table as the only backend for the lightgbm learner and to allow switching between matrix and dgCMatrix in the learner's parameter list.
Some time ago I wrote a simple lightgbm wrapper for my tiny ML framework, and some parts of the code look very similar to your current implementation.
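
The switch between matrix and dgCMatrix suggested above could look roughly like the sketch below. The helper name to_lgb_input and its sparse argument are hypothetical and exist neither in mlr3learners.lightgbm nor in lightgbm; the sketch also assumes that all feature columns are already numeric.

```r
library(Matrix)

# Hypothetical helper (not part of mlr3learners.lightgbm): turn a table of
# numeric features into either a dense matrix or a sparse Matrix object
to_lgb_input <- function(features, sparse = FALSE) {
  dense <- as.matrix(features)
  if (!is.numeric(dense)) {
    stop("all feature columns must be numeric before conversion")
  }
  if (sparse) {
    Matrix::Matrix(dense, sparse = TRUE)
  } else {
    dense
  }
}

features <- data.frame(a = c(0, 0, 1), b = c(2, 0, 0))
dense_input <- to_lgb_input(features)                  # base R matrix
sparse_input <- to_lgb_input(features, sparse = TRUE)  # sparse Matrix object
```

Note that Matrix::Matrix() may choose a more specialised sparse class for square symmetric or triangular input, so a real implementation might want an explicit coercion to dgCMatrix.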

@kapsner
Member

kapsner commented Sep 9, 2020

Indeed, that's correct.

The problem is, that this

    data = task$data(
      cols = task$feature_names,
      data_format = "Matrix"
    )

does not work with data.table backends (there is no internal transformation) and I need to figure out a different solution for allowing both data backends.

(https://github.com/mlr3learners/mlr3learners.lightgbm/blob/master/R/LearnerClassifLightGBM.R#L651)

@statist-bhfz
Contributor

statist-bhfz commented Sep 9, 2020

I could be wrong, but it's necessary to specify the matrix backend instead of data.table during task construction to get this to work:

    data = task$data(
      cols = task$feature_names,
      data_format = "Matrix"
    )

Possibly DataBackendMatrix support is not the most requested option? I don't see any advantage of https://mlr3.mlr-org.com/reference/DataBackendMatrix.html compared to staying with the data.table backend followed by a (sparse) matrix transformation inside the learner.

DataBackend for Matrix. Data is split into a (numerical) sparse part and an optional dense part. These parts are automatically merged to a sparse format during $data(). Note that merging both parts potentially comes with a data loss, as all dense columns are converted to numeric columns.

Potential data loss is quite a serious contraindication!
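
The kind of loss the documentation warns about can be illustrated with base R alone. The sketch below uses data.matrix() on a made-up data frame; it is an analogy under that assumption and does not reproduce how DataBackendMatrix merges its dense and sparse parts internally.

```r
# Made-up mixed-type data: one numeric column and one factor column
mixed <- data.frame(
  age = c(31, 45, 28),
  city = factor(c("Berlin", "Kyiv", "Berlin"))
)

# data.matrix() replaces each factor value by its internal integer code
numeric_version <- data.matrix(mixed)
city_codes <- unname(numeric_version[, "city"])

# The matrix is purely numeric; the level labels are no longer stored in it
print(numeric_version)
print(levels(mixed$city))
```

After the conversion, the matrix only holds the codes, so the mapping back to "Berlin" and "Kyiv" has to be kept somewhere else if it is still needed.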

@ThomasWolf0701
Author

> I could be wrong, but it's necessary to specify the matrix backend instead of data.table during task construction to get this to work:
>
>     data = task$data(
>       cols = task$feature_names,
>       data_format = "Matrix"
>     )
>
> Possibly DataBackendMatrix support is not the most requested option? I don't see any advantage of https://mlr3.mlr-org.com/reference/DataBackendMatrix.html compared to staying with the data.table backend followed by a (sparse) matrix transformation inside the learner.
>
> DataBackend for Matrix. Data is split into a (numerical) sparse part and an optional dense part. These parts are automatically merged to a sparse format during $data(). Note that merging both parts potentially comes with a data loss, as all dense columns are converted to numeric columns.
>
> Potential data loss is quite a serious contraindication!

This is how tidymodels seems to handle this issue, but to my understanding it would not be consistent with how mlr3 was designed. If the user has already prepared the data as a numeric matrix, the data loss should not occur. For factors, the data.table backend would be used anyway.

3 participants