
Significant Change in Epoch Time with Dataset Size #208

Open
dzeego opened this issue Nov 2, 2023 · 3 comments
Labels
enhancement New feature or request

Comments


dzeego commented Nov 2, 2023

Hello,

I recently started using nnDetection and have noticed that my training epoch time increases significantly when the size of my training dataset increases.

To be more specific, I ran the nnDetection preprocessing on a large dataset of ~2k CT volumes and then trained a model using the generated splits_final.pkl file. One epoch with this configuration took 3 hours.
However, with the exact same preprocessing and training configuration, and only the splits_final.pkl file modified to include a random subset (~200 CT volumes) of the original training dataset (~2k CT volumes), the epoch time dropped to 12 minutes per epoch!
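
For reference, this is roughly how I subsampled the splits (a sketch only, assuming the nnU-Net-style layout where splits_final.pkl holds a list of folds, each a dict with "train" and "val" case lists; the path and subset size are illustrative):

```python
# Rough sketch: subsample the training cases in splits_final.pkl.
# Assumes a list of folds, each a dict with "train"/"val" lists of
# case identifiers -- adjust if your splits file is laid out differently.
import pickle
import random

SPLITS_PATH = "splits_final.pkl"  # illustrative path to the preprocessed splits
SUBSET_SIZE = 200                 # number of training cases to keep

with open(SPLITS_PATH, "rb") as f:
    splits = pickle.load(f)

for fold in splits:
    n = min(SUBSET_SIZE, len(fold["train"]))
    fold["train"] = random.sample(list(fold["train"]), n)

with open(SPLITS_PATH, "wb") as f:
    pickle.dump(splits, f)
```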

Is there an explanation for this behavior?

Many thanks in advance.


mibaumgartner commented Nov 8, 2023

Dear @dzeego ,

That sounds rather surprising. Thank you for reporting the issue, and sorry for getting back to you rather late; I was on vacation. Is it possible to reproduce the issue with the toy dataset so I can have a look locally as well?

Theoretically, training time should remain independent of the dataset size, since the same number of batches/samples is drawn in each epoch. 12 minutes per epoch also sounds extremely fast; epoch times usually range somewhere between 20-40 minutes (sometimes slightly longer), depending on the configured strides of the network and the available GPU (assuming no other bottlenecks are present).
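
To illustrate why the dataset size should not matter (this is a schematic sketch, not nnDetection's actual training loop or configuration names): each epoch draws a fixed number of random batches, so adding more cases does not add more iterations.

```python
# Illustrative only: a fixed number of batches is sampled per epoch,
# so epoch time should not grow with the number of training cases.
# All names here are placeholders, not nnDetection's actual API.
import random

NUM_BATCHES_PER_EPOCH = 2500
BATCH_SIZE = 4

def run_epoch(case_ids, load_batch, train_step):
    for _ in range(NUM_BATCHES_PER_EPOCH):
        cases = random.choices(case_ids, k=BATCH_SIZE)  # sample with replacement
        batch = load_batch(cases)   # IO / augmentation
        train_step(batch)           # forward + backward on the GPU
```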

Best,
Michael

Edit: the only case I can think of is an IO bottleneck: by reducing the number of samples, the OS can cache the inputs, which alleviates the bottleneck. Even then, 12 minutes per epoch sounds quite quick and would depend heavily on the input to the network (e.g. 3D data with a rather small resolution).
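
One rough way to confirm such an IO bottleneck is to time how long the loop waits for each batch versus how long the training step itself takes; a sketch, with "train_loader" and "train_step" as placeholders for whatever feeds and consumes batches in the actual pipeline:

```python
# Rough check for a data-loading bottleneck: compare time spent waiting
# for batches against time spent in the training step.
import time

def profile_epoch(train_loader, train_step, num_batches=50):
    wait, compute = [], []
    t0 = time.perf_counter()
    for i, batch in enumerate(train_loader):
        t1 = time.perf_counter()
        wait.append(t1 - t0)          # time spent waiting for data
        train_step(batch)
        t0 = time.perf_counter()
        compute.append(t0 - t1)       # time spent in the training step
        if i + 1 >= num_batches:
            break
    print(f"mean data wait:  {sum(wait) / len(wait):.3f} s/batch")
    print(f"mean train step: {sum(compute) / len(compute):.3f} s/batch")
```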


dzeego commented Dec 4, 2023

Hi @mibaumgartner,

Indeed, the bottleneck was the data IO; as you said, reducing the number of samples allowed the OS to cache the inputs.
However, we managed to alleviate the problem by changing the type of the saved arrays from numpy memmap objects to Zarr objects (a chunked, compressed successor to HDF5). Loading Zarr arrays, the code runs approximately 3 times faster on large datasets (~2k CT scans) than with the original numpy configuration, and the bottleneck is once again the computation on the GPU.
I would highly suggest looking into this for the data IO.
https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/
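
As a minimal sketch of the kind of swap we made (paths, shapes, and chunk sizes are illustrative; nnDetection's actual preprocessed file layout will differ, and it assumes a 3D volume stored as a .npy file):

```python
# Sketch of converting a memmapped numpy volume to a chunked,
# compressed Zarr array, then reading only a patch at training time.
import numpy as np
import zarr

vol = np.load("case_000.npy", mmap_mode="r")      # existing memmapped volume

z = zarr.open(
    "case_000.zarr",
    mode="w",
    shape=vol.shape,
    chunks=(96, 96, 96),                          # chunk roughly at patch size
    dtype=vol.dtype,
)
z[:] = vol[:]                                     # one-off conversion

# At training time, only the requested chunks are read and decompressed:
patch = zarr.open("case_000.zarr", mode="r")[:96, :96, :96]
```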

Best regards,
Dzeego

mibaumgartner commented

Dear @dzeego ,

thank you for the suggestion, I'll definitely look into it!

Best,
Michael

@mibaumgartner mibaumgartner added the enhancement New feature or request label Jan 2, 2024