Memory leak with h5py from pip and conversion to torch.Tensor #215

Open
Breeze-Zero opened this issue Feb 15, 2022 · 33 comments
Labels
bug Something isn't working

Comments

@Breeze-Zero

Breeze-Zero commented Feb 15, 2022

I recently tried to run some experiments with my model on the multi-coil fastMRI brain data. Because I need flexibility (and also don't have the extra time to learn PyTorch Lightning), I didn't use PyTorch Lightning directly. Instead, I used plain PyTorch, but during iteration, even with only num_workers=2, my memory footprint was quite large at the beginning. As the number of iterations increased, an error occurred:
RuntimeError: DataLoader worker (PID 522908) is killed by signal: Killed.
I checked the other parts of the training code but found no obvious memory accumulation. Therefore, I thought there was most likely a problem in SliceDataset. I simply iterated over the DataLoader loop with "pass" and found that the memory usage kept rising.

@mmuckley changed the title from "About the memory leak of SliceDataset in normal Pytorch Dataloader" to "Potential memory leak in SliceDataset" on Feb 15, 2022
@mmuckley
Contributor

Hello @834799106, thanks for putting an issue here.

Based on your error I doubt the SliceDataset class is the issue. For one, your program is not being terminated due to memory; it is being terminated because some process killed the overall program. Also, if you look at the __getitem__ function, you can see that there are no side effects. Everything the function creates should be returned to the calling function or destroyed.

In order to verify a memory leak we will need you to give us a reproducible example for your case since you're not using the PyTorch Lightning modules. Also, please let us know what version of PyTorch you are using and any information you have on the memory usage throughout an epoch. Note: high memory at the start might be expected, as you have your model in memory. There is also some metadata about the dataset that is precomputed and stored in memory.
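For the memory numbers, here is a minimal sketch (not fastMRI code; it assumes psutil is installed) of how you could log resident memory per batch, including the DataLoader worker processes:

import psutil

def total_rss_gb() -> float:
    """Resident memory of this process plus its children (e.g. DataLoader workers)."""
    proc = psutil.Process()
    rss = proc.memory_info().rss
    for child in proc.children(recursive=True):
        try:
            rss += child.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # a worker exited between listing and query
    return rss / 1e9

def log_memory_during_epoch(dataloader, every=100):
    for i, _batch in enumerate(dataloader):
        if i % every == 0:
            print(f"batch {i}: total resident memory {total_rss_gb():.2f} GB")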

@soumickmj

soumickmj commented Feb 16, 2022

Hi @mmuckley, I was about to file an issue for a memory leak. I'm not sure about the issue of @834799106, though.
I have created a small piece of code to reproduce it.

from fastmri.data.transforms import UnetDataTransform
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm

mask_func = create_mask_for_mask_type(
    mask_type_str="random", center_fractions=[0.08], accelerations=[8]
)

root_gt = "/data/project/fastMRI/Brain/multicoil_train"

sd = SliceDataset(
    root=root_gt,
    challenge="multicoil",
    transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    use_dataset_cache=True,
    dataset_cache_file=f"{os.path.dirname(root_gt)}/dataset_cache_{os.path.basename(root_gt)}.pkl",
)
dl = DataLoader(sd, batch_size=1, shuffle=False, num_workers=10)

for e in tqdm(dl):
    del e
    pass
I'm currently using the latest git pull of fastMRI.

While running this code, I was monitoring the memory usage; even though I'm deleting the variable, memory usage still increases constantly. Originally, this was part of my other pipeline where I'm only using the SliceDataset and not the whole Lightning module. If you would like to have a look, this is the code: https://github.com/soumickmj/NCC1701/blob/main/Engineering/datasets/fastMRI.py

I was originally thinking maybe my code was creating the leak, but the other dataset modes (different code for reading other datasets) of my NCC1701 pipeline did not create the leak.
Then I wrote that small script to see if the leak is still there when my pipeline is not involved.

@soumickmj

soumickmj commented Feb 16, 2022

I also got a similar behaviour while using the Data Module.

from fastmri.data.transforms import UnetDataTransform
from fastmri.pl_modules import FastMriDataModule
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm

mask_func = create_mask_for_mask_type(
    mask_type_str="random", center_fractions=[0.08], accelerations=[8]
)

root_gt = "/data/project/fastMRI/Brain"

data_module = FastMriDataModule(
    data_path=root_gt,
    challenge="multicoil",
    train_transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    val_transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    test_transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    batch_size=1,
    num_workers=10,
)
dl = data_module.train_dataloader()

for e in tqdm(dl):
    del e
    pass

@mmuckley
Contributor

mmuckley commented Feb 16, 2022

Hello @soumickmj, I ran your script on the knee validation data with memory-profiler, and memory usage peaked pretty early at a little less than 5 GB (see attached), staying flat for the rest of the entire dataset afterwards (which does not suggest a leak).
[Screenshot: memory profile plot]

Perhaps you could try running on your system to verify with PyTorch 1.10?

This is the code:

from fastmri.data.transforms import UnetDataTransform
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm


@profile
def main():
    val_path = "/path/multicoil_val"
    mask_func = create_mask_for_mask_type(
        mask_type_str="random", center_fractions=[0.08], accelerations=[8]
    )

    sd = SliceDataset(
        root=val_path,
        challenge="multicoil",
        transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    )
    dl = DataLoader(sd, batch_size=1, shuffle=False, num_workers=10)

    for e in tqdm(dl):
        del e
        pass


if __name__ == "__main__":
    main()

You can run with mprof run --include-children file.py.
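If you prefer to collect the numbers from inside Python rather than via the mprof CLI, memory-profiler also exposes a memory_usage() helper; a rough sketch, reusing the main() defined above (drop the @profile decorator if you are not running under mprof):

from memory_profiler import memory_usage

# Samples total memory (main process plus DataLoader workers) once per second
# while main() runs; returns the samples as a list of MiB values.
samples = memory_usage((main, (), {}), interval=1.0, include_children=True)
print(f"peak memory: {max(samples):.0f} MiB across {len(samples)} samples")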

@Breeze-Zero
Author

Hi @mmuckley, I copied the above code to my machine but modified the batch size. The figure below is the result of mprof run --include-children file.py:
[Screenshot: mprof plot]
I didn't even finish an epoch before it broke off.
[Screenshot]

This is the code:

from fastmri.data.transforms import UnetDataTransform
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm


@profile
def main():
    val_path = "/data2/fastmri/mnt/multicoil_val"
    mask_func = create_mask_for_mask_type(
        mask_type_str="random", center_fractions=[0.08], accelerations=[8]
    )

    sd = SliceDataset(
        root=val_path,
        challenge="multicoil",
        transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    )
    dl = DataLoader(sd, batch_size=4, shuffle=False, num_workers=10)

    for e in tqdm(sd):
        del e
        pass


if __name__ == "__main__":
    main()

Maybe it's the PyTorch version that's causing the problem. My PyTorch version is 1.8.1+cu111.

@soumickmj

soumickmj commented Feb 16, 2022

Sorry @mmuckley, I also got the same problem after running the memory profiler.
I used two different versions of PyTorch.
In contrast to @834799106, I am using more recent versions of PyTorch.

With PyTorch 1.10.2 py3.9_cuda11.3_cudnn8.2.0_0 I got:

[Screenshot: memory profile]

With PyTorch 1.11.0.dev20220129 py3.9_cuda11.3_cudnn8.2.0_0 (pytorch-nightly), which I usually need for my work due to the features it offers:

[Screenshot: memory profile]

I did not run it till the very end, as memory was continuously increasing and would have crashed the server (which has 250 GB of RAM) again. So I don't think it's related to the PyTorch version.

Just to let you know: the OS is Ubuntu 20.04.3 LTS and the Python version is 3.9.7.

@soumickmj

sd = SliceDataset(
    root=val_path,
    challenge="multicoil",
    transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
)
dl = DataLoader(sd, batch_size=1, shuffle=False, num_workers=10)

for e in tqdm(sd):
    del e
    pass

Hi @mmuckley,
In your code, I just noticed that you are looping over the dataset directly without the DataLoader, whereas in my code I'm looping over the DataLoader. Can you please try that as well?
In my case, both show similar behaviour. I also tested with 0, 1, and 3 workers - all the same.
I created another conda env with Python 3.7.11 and torch 1.8.2 and got similar behaviour as well.

Am I doing something wrong somewhere?

@mmuckley
Contributor

Hello @soumickmj, I copied the wrong code. The paste I showed was with the DataLoader, not the dataset. This is what I get with the dataset. You can see the max is about 720 MB.

[Screenshot: memory profile]

@soumickmj

Hello @soumickmj, I copied the wrong code. The paste I showed was with the DataLoader, not the dataset. This is what I get with the dataset. You can see the max is about 720 MB.

[Screenshot: memory profile]

Ah okay, no problem!
But still, in my case, I'm getting this constant increase in memory usage, as you can see from the plots.
Any suggestions?

@mmuckley
Contributor

mmuckley commented Feb 16, 2022

One thing I notice is that you are both using Python 3.9. I could try Python 3.9 and check with that perhaps.

EDIT: Sorry, I see you also tried Python 3.7. Not sure what to do then...

@mmuckley
Contributor

Okay, I tested on Python 3.9 and I'm still getting the same behavior with batch_size=4.

[Screenshot: memory profile]

To help a bit more with this I'm including the complete conda environments that I used for my tests. I'm using a custom Linux distribution based on 5.4.0-81-generic x86_64 on an Intel Xeon E5-2698. @soumickmj @834799106, if either of you could try one of my conda environments, maybe we could figure out if it's one of the packages.

Python 3.8 environment
Python 3.9 environment

@soumickmj

Thanks @mmuckley
I have tested with your 3.9 conda env and it worked without a problem.
Your yml was missing fastMRI, so I installed it using pip install git+https://github.com/facebookresearch/fastMRI.git

[Screenshot: memory profile]

Then I created a "bare minimum" env without using your env, and this resulted in the old issue again.
For this env, I just installed PyTorch 1.10.2 with CUDA 11.3 and then installed fastMRI directly from the git repo, as with the other env.
This conda environment doesn't contain anything that isn't required.

[Screenshot: memory profile]

I compared the versions. Initially, the version of numpy was different (1.21.2), so I switched to the one you have, 1.20.3.
Apart from all the extra packages in your env, I couldn't see any difference.
Here is the yml file, zipped as GitHub wasn't letting me upload a yml here:
fMMem.zip

Do you have any idea what the reason might be?
Do I need some additional package that it is not complaining about, but that is still required and whose absence is causing this issue?

@Breeze-Zero
Author

I was in a similar situation. The problem was solved by installing your Py3.9 environment directly with conda. However, when creating a Py3.9 environment normally with conda and then running pip install git+https://github.com/facebookresearch/fastMRI.git plus pandas (it's not in the fastMRI package), the problem remains.

@mmuckley
Contributor

So my install process is as follows in a few bash commands:

conda create -n memory_test_py39 python=3.9
conda activate memory_test_py39
conda install anaconda
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
pip install -e .

Where the pip install -e . is in the fastMRI folder.

In that case I will try to reproduce now with your minimal environments.

@mmuckley
Contributor

@soumickmj @834799106 I can now reproduce this with the minimal install environment.

[Screenshot: memory profile]

Reproduction environment here: https://gist.github.com/mmuckley/838289a388bc65a7adb23d67908635c9.

@soumickmj

@mmuckley This is really strange!
That bare-minimum environment had almost nothing in it.
While running the code, fastMRI did not throw any errors about missing packages.
Still, we all got this weird behaviour.
Do you have any hunch?

@mmuckley
Contributor

I do think it is related to SliceDataset itself as I see similar characteristics with VarNetDataTransform.

@mmuckley
Contributor

Actually I have to take that back. With the minimal environment and no transforms, I see no issue.

[Screenshot: memory profile]

@soumickmj

Ahahaha, yes, I can confirm that too.
I tried with my old work environment, running PyTorch nightly (1.11dev2), and saw the same behaviour.

Then the problem is with the data transforms and not with the SliceDataset!

@mmuckley changed the title from "Potential memory leak in SliceDataset" to "Potential memory leak in UNetDataTransform and VarNetDataTransform" on Feb 16, 2022
@soumickmj

soumickmj commented Feb 16, 2022

I might have found the source:
the conversion from numpy to a PyTorch tensor.

I did not test using your transforms, but my own transform showed a similar behaviour.

EDIT: Here's my code.
For testing purposes, I returned directly after line 87 in both cases.

@Breeze-Zero
Author

Breeze-Zero commented Feb 16, 2022

I tried adding return kspace_torch after each line of UnetDataTransform and found that kspace_torch = to_tensor(kspace) doesn't have a memory leak. After mask_func I started having problems, but when I set mask_func=None the problems disappeared. Then I added a return after image = fastmri.ifft2c(masked_kspace), and the problem arose again. Therefore, this may not be a problem with a single statement. Due to the time zone difference, I have to rest and cannot investigate further for the time being.
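In case it helps others repeat this kind of bisection, it can be written generically as a transform that stops after a given number of steps (the step list here is a hypothetical stand-in for the operations inside UnetDataTransform, not fastMRI internals):

# Hypothetical sketch of the early-return bisection described above.
def truncated_transform(kspace, steps, stop_after):
    """Apply the (name, fn) pairs in `steps` in order, stopping after `stop_after` of them."""
    out = kspace
    for i, (name, fn) in enumerate(steps):
        out = fn(out)
        if i + 1 >= stop_after:
            print(f"stopping after step {i}: {name}")
            break
    return out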

@mmuckley
Contributor

mmuckley commented Feb 16, 2022

Okay I think I found the issue: it is due to the h5py from pip. It's a relatively recent issue that has been documented here:

https://forum.hdfgroup.org/t/h5py-memory-leak-in-combination-with-pytorch-dataset-with-multiple-workers/9114

If you install the minimal environment, but use the h5py from conda instead of pip then memory stays stable.
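For anyone who wants to check their own environment without the fastMRI data, here is a rough, self-contained sketch of the pattern that forum thread describes (h5py reads inside __getitem__, several DataLoader workers, numpy-to-torch.Tensor conversion). The file name, shapes, and loop counts are arbitrary placeholders, and whether it actually leaks should depend on which HDF5 your h5py wheel was built against:

import os

import h5py
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

FNAME = "leak_test.h5"  # placeholder file created below


def make_dummy_file(n_slices=100, shape=(4, 256, 256)):
    if not os.path.exists(FNAME):
        data = np.random.randn(n_slices, *shape).astype(np.complex64)
        with h5py.File(FNAME, "w") as hf:
            hf.create_dataset("kspace", data=data)


class H5SliceDataset(Dataset):
    def __init__(self, fname=FNAME):
        self.fname = fname
        with h5py.File(fname, "r") as hf:
            self.length = hf["kspace"].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        with h5py.File(self.fname, "r") as hf:
            kspace = hf["kspace"][i]  # numpy array read through h5py
        # numpy -> torch conversion, stacking real/imag like a k-space transform would
        return torch.from_numpy(np.stack((kspace.real, kspace.imag), axis=-1))


if __name__ == "__main__":
    make_dummy_file()
    dl = DataLoader(H5SliceDataset(), batch_size=1, num_workers=4)
    for epoch in range(20):  # watch RSS with mprof or psutil while this runs
        for _batch in dl:
            pass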

@mmuckley changed the title from "Potential memory leak in UNetDataTransform and VarNetDataTransform" to "Memory leak with h5py from pip and conversion to torch.Tensor" on Feb 16, 2022
@soumickmj

Thanks @mmuckley
I also stumbled upon the same root cause of our problem.
For me, it got solved by building and installing h5py from the git repo.
I will check out your conda remedy now!

@mmuckley
Contributor

mmuckley commented Feb 16, 2022

With the h5py from conda, if I print(h5py.version.info) I get the following:

Summary of the h5py configuration
---------------------------------

h5py    3.6.0
HDF5    1.10.6
Python  3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.21.2
cython (built with) 0.29.24
numpy (built against) 1.16.6
HDF5 (built against) 1.10.6

So the conda h5py was built against HDF5 1.10.6, which predates the version where the issue begins.
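Based on that, here is a small check one could drop at the top of a training script; the version threshold is only an assumption drawn from this thread, not an official h5py statement:

import warnings

import h5py

print(h5py.version.info)
# Assumption from this thread: the conda build against HDF5 1.10.6 is fine,
# while pip wheels linked against newer HDF5 showed the DataLoader leak.
if h5py.version.hdf5_version_tuple >= (1, 12, 0):
    warnings.warn(
        f"h5py is linked against HDF5 {h5py.version.hdf5_version}; "
        "this configuration showed a memory leak with multi-worker DataLoaders."
    )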

@soumickmj

Thanks @mmuckley
It works perfectly with conda.
I tried it with my old env with PyTorch nightly!

Thanks for all your help!

PS: Maybe you can put a notice on the fastMRI homepage so people know about this, as the impact can be significant.

@mmuckley
Contributor

Okay I opened #217 to do this. Feel free to propose any changes.

For what it's worth, I did some small tests on adding extra copy commands to SliceDataset to get around the leak, but nothing I tried worked, so we may just wait for this to be fixed upstream.

@soumickmj

Thanks!
I will also explore possible ideas!
If I find some fix, I will let you know :)

@mmuckley added the "bug" label on Mar 7, 2022
@soumickmj

soumickmj commented Mar 9, 2022

Hi @mmuckley, the issue resurfaced after the conda version of h5py got updated as well.
This time (I don't know why or what mismatched!) I also had a problem with the git version.
One possible workaround would be to use ".copy()" after every h5py operation.

So basically, inside the __getitem__ function of mri_data.py, we need something like this:

with h5py.File(fname, "r") as hf:
    kspace = hf["kspace"][dataslice].copy()
    mask = np.asarray(hf["mask"].copy()) if "mask" in hf else None
    target = hf[self.recons_key][dataslice].copy() if self.recons_key in hf else None

(Pull request #227)

Maybe it's dirty (I'm not sure if it will have some other implications, say in terms of speed), but for me it's working so far.
Can you please have a look and let me know your thoughts? :)
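For reference, here is the same idea as a standalone helper (a sketch only, not the actual fastMRI code; the key names are just examples), using np.array to force the copies:

import h5py
import numpy as np


def read_slice(fname, dataslice, recons_key="reconstruction_rss"):
    """Read one slice, copying every array so nothing returned references h5py-owned memory."""
    with h5py.File(fname, "r") as hf:
        kspace = np.array(hf["kspace"][dataslice])  # np.array makes a copy
        mask = np.array(hf["mask"]) if "mask" in hf else None
        target = np.array(hf[recons_key][dataslice]) if recons_key in hf else None
        attrs = dict(hf.attrs)
    return kspace, mask, target, attrs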

@mmuckley
Contributor

Hello @soumickmj, I do not observe HDF5 being updated on conda, at least for Python 3.8 or 3.9.

[Screenshot]

@Sarah-2021-scu

Sarah-2021-scu commented May 5, 2023

Hi @mmuckley

I am running the VarNet demo on a small set of brain MRI data, and I am getting the following error after 3-4 iterations of the first epoch:

RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 11.91 GiB total capacity; 10.40 GiB already allocated; 74.81 MiB free; 10.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Following is my h5py version:

h5py 3.7.0
HDF5 1.10.6
Python 3.9.16 (main, Mar 8 2023, 14:00:05)
[GCC 11.2.0]
sys.platform linux
sys.maxsize 9223372036854775807
numpy 1.23.5
cython (built with) 0.29.30
numpy (built against) 1.16.6
HDF5 (built against) 1.10.6

I am also attaching my memory profiler plot.

[Screenshot: memory profiler plot]

Please tell me where I am going wrong.
Thank you for your help!

@mmuckley
Contributor

mmuckley commented May 5, 2023

Hello @Sarah-2021-scu, your memory usage is good. The error in your case is on the GPU, which is not related to this particular issue.

It looks like your GPU may just be too small. You could try running the model with a lower cascade count.

@Sarah-2021-scu

Thank you @mmuckley for your response. I am using 2 GPUs with 12 GB of memory each. I will lower the cascade count as well. The other options I have are:

  1. 1 GPU with 32 GB memory.
  2. 4 GPUs with 12 GB memory each.

Which would be the best option of the two?

@hujb48

hujb48 commented Oct 30, 2023

Okay I opened #217 to do this. Feel free to propose any changes.

For what it's worth, I did some small tests on adding extra copy commands to SliceDataset to get around the leak, but nothing I tried worked, so we may just wait for this to be fixed upstream.

I got the same situation while training the U-Net baseline model, and I followed issue #217: with pip uninstall h5py and conda install h5py==3.6.0 it works without any problem. My env is Python 3.8.18 with torch 1.13.0 + CUDA 11.7.
