Task splits as parquet files #1162

sebffischer · 2022-08-30T10:25:50Z

Are there plans to also provide the task splits as parquet files in the future?
This would allow us to remove the arff dependencies (once all the datasets are successfully migrated).

As an example of the difference in storage size, here is a file-size comparison for the NYC taxi dataset in Parquet and ARFF.

library(mlr3oml)
library(duckdb)
#> Loading required package: DBI

otask = OMLTask$new(359943)
task_splits = otask$task_splits
#> INFO  [12:21:06.213] Retrieving JSON {url: `https://www.openml.org/api/v1/json/task/359943`, authenticated: `TRUE`}
#> INFO  [12:21:06.955] Retrieving ARFF {url: `https://api.openml.org//api_splits/get/359943/Task_359943_splits.arff`, authenticated: `TRUE`}

file_arff = tempfile(fileext = ".arff")
file_parquet = tempfile(fileext = ".parquet")

con = DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "tbl", task_splits, row.names = FALSE)
DBI::dbExecute(con, sprintf("COPY tbl TO '%s' (FORMAT 'PARQUET', CODEC 'ZSTD') ", file_parquet))
#> [1] 5818350
mlr3oml::write_arff(task_splits, file_arff)

file.size(file_parquet) / file.size(file_arff)
#> [1] 0.1619774

Created on 2022-08-30 by the reprex package (v2.0.1)

joaquinvanschoren · 2022-09-03T19:16:38Z

Yes, that is the plan :).
