write datasets in a JLD2 or Arrow format for faster read #125

Open
CarloLucibello opened this issue May 6, 2022 · 6 comments
@CarloLucibello
Member

CarloLucibello commented May 6, 2022

We could have a "processed" folder in each dataset folder where we write the dataset object the first time we create it. On subsequent creations, e.g. d = MNIST(), we just load the JLD2 file instead of rebuilding the dataset.

Example:

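function MNIST(...)
    dataset_dir = ...
    processed_file = joinpath(dataset_dir, "processed", "dataset.jld2")
    # Fast path: reuse the processed dataset if it was already written.
    if isfile(processed_file)
        return FileIO.load(processed_file, "dataset")
    end

    mnist = ...
    # First creation: make sure the "processed" folder exists, then cache the dataset.
    mkpath(dirname(processed_file))
    FileIO.save(processed_file, Dict("dataset" => mnist))
    return mnist
end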
@lorenzoh
Contributor

lorenzoh commented May 6, 2022

I have done this for large vision datasets like COCO, whose JSON annotations can be slow to parse. One thing to keep in mind is the size of the JLD2 files, though of course that shouldn't be a problem for MNIST. Arrow.jl can also be a good format, with built-in compression, when the samples are made up of primitive types and arrays.
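
For illustration, a minimal sketch of writing and reading a small table with Arrow.jl and compression enabled. The toy data and the column names `features` and `targets` are made up for this example and are not taken from any dataset in the package.

```julia
using Arrow

# Hypothetical toy dataset: a column table (NamedTuple of vectors).
# The column names `features` and `targets` are assumptions for this sketch.
table = (
    features = [Float32[1, 2, 3], Float32[4, 5, 6], Float32[7, 8, 9]],
    targets = [0, 1, 0],
)

# Write into a temporary directory so the sketch leaves nothing behind.
path = joinpath(mktempdir(), "dataset.arrow")

# Write with LZ4 compression; `:zstd` is the other codec Arrow.jl accepts here.
Arrow.write(path, table; compress = :lz4)

# Read the file back; columns are accessible by name.
loaded = Arrow.Table(path)
println(loaded.targets)
println(collect(loaded.features[2]))
```

Whether compression pays off depends on the data, so it is worth comparing file sizes with and without `compress` on a real dataset.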

@CarloLucibello
Member Author

What should we expect from the JLD2 file sizes? Hopefully not larger than the original data, right?

@lorenzoh
Contributor

lorenzoh commented May 6, 2022

It depends. If you have a large dataset of .jpg images and store them as arrays (hence losslessly), the size can be several times that of the original files.
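
A rough back-of-the-envelope sketch of why that happens. The image dimensions and the JPEG file size below are assumed values chosen only for illustration, not measurements.

```julia
# Hypothetical image dimensions (assumption for this sketch).
height, width, channels = 224, 224, 3

# Size of the decoded image when stored as a raw array.
raw_bytes_uint8 = height * width * channels * sizeof(UInt8)
raw_bytes_float32 = height * width * channels * sizeof(Float32)

# Assumed size of the same image as a .jpg file (illustrative only).
jpeg_bytes = 15_000

println("UInt8 array:   ", raw_bytes_uint8, " bytes, ",
        round(raw_bytes_uint8 / jpeg_bytes; digits = 1), "x the assumed JPEG")
println("Float32 array: ", raw_bytes_float32, " bytes, ",
        round(raw_bytes_float32 / jpeg_bytes; digits = 1), "x the assumed JPEG")
```

The actual ratio depends on the JPEG quality and the image content, and on whether the storage format compresses the arrays.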

@zsz00

zsz00 commented May 7, 2022

I agree, Arrow.jl is a good format:

  1. Built-in compression
  2. Cross-language dataset processing

@CarloLucibello changed the title from "write datasets in a JLD2 format for faster read" to "write datasets in a JLD2 or Arrow format for faster read" on May 8, 2022
@CarloLucibello
Member Author

HuggingFace's datasets library also uses Arrow: https://huggingface.co/docs/datasets/about_arrow

@CarloLucibello CarloLucibello added gsoc and removed gsoc labels May 20, 2022
@CarloLucibello
Member Author

CarloLucibello commented Feb 11, 2023

Some code showing how to read and write color arrays from and to Arrow tables:
https://gist.github.com/CarloLucibello/51d713ec4a1612b46e6c90e53c0f88e8

3 participants