write datasets in a JLD2 or Arrow format for faster read #125

CarloLucibello · 2022-05-06T06:38:48Z

We could have a "processed" folder in each dataset folder, where we write the dataset object the first time we create it. On subsequent constructions, e.g. d = MNIST(), we just load the JLD2 file.

Example:

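using FileIO  # JLD2 must also be installed for FileIO to handle .jld2 files

function MNIST(...)
    dataset_dir = ...
    processed_file = joinpath(dataset_dir, "processed", "dataset.jld2")
    # Fast path: reuse the cached dataset if it was already written.
    if isfile(processed_file)
        return FileIO.load(processed_file, "dataset")
    end

    # Slow path: build the dataset from the raw files, then cache it.
    mnist = ...
    mkpath(dirname(processed_file))  # make sure the "processed" folder exists
    FileIO.save(processed_file, Dict("dataset" => mnist))
    return mnist
end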


lorenzoh · 2022-05-06T07:04:51Z

I have done this for large vision datasets like COCO, whose annotations are stored in JSON that can be slow to parse. One thing to keep in mind is the size of the JLD2 files, though of course it shouldn't be a problem for MNIST. Arrow.jl can also be a good format, with built-in compression, when the samples are made up of primitive types and arrays.
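To illustrate the Arrow option mentioned above, here is a minimal sketch, not code from this thread. It assumes Arrow.jl is installed; the toy data, the column names `features` and `targets`, and the temporary path are all hypothetical. Each image is flattened into a vector before writing, and reshaped after reading.

```julia
using Arrow

# Hypothetical toy data: four 2x2 "images" with integer labels.
images = [rand(UInt8, 2, 2) for _ in 1:4]
labels = Int64[0, 1, 2, 3]

# A NamedTuple of vectors is a valid Tables.jl column table.
# Each image is flattened so that a column entry is a plain vector.
table = (features = [vec(img) for img in images], targets = labels)

# Write with Arrow's built-in LZ4 compression (`:zstd` is also accepted).
path = joinpath(mktempdir(), "dataset.arrow")
Arrow.write(path, table; compress = :lz4)

# Read back and restore the original image shape.
loaded = Arrow.Table(path)
restored = [reshape(collect(v), 2, 2) for v in loaded.features]
```

The image shape is hard-coded here; a real dataset would need to store it alongside the data, for example in an extra column.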

CarloLucibello · 2022-05-06T07:09:59Z

What's to be expected from the JLD2 sizes? Hopefully not larger than the size of the original data, right?

lorenzoh · 2022-05-06T08:31:57Z

Depends. If you have a large dataset of .jpg images and store them as arrays (hence losslessly), the size can be several times that of the original files.

zsz00 · 2022-05-07T03:58:44Z

I agree that Arrow.jl is a good format:

- built-in compression
- cross-language dataset processing

CarloLucibello · 2022-05-14T07:30:21Z

HuggingFace's datasets library also uses Arrow: https://huggingface.co/docs/datasets/about_arrow

CarloLucibello · 2023-02-11T22:25:54Z

Some code showing how to read and write color arrays from/to Arrow tables:
https://gist.github.com/CarloLucibello/51d713ec4a1612b46e6c90e53c0f88e8

CarloLucibello changed the title ~~write datasets in a JLD2 format for faster read~~ write datasets in a JLD2 or Arrow format for faster read May 8, 2022

CarloLucibello added gsoc and removed gsoc labels May 20, 2022
