Memory leak with h5py from pip and conversion to torch.Tensor #215

Open
Breeze-Zero opened this issue Feb 15, 2022 · 33 comments
Labels
bug Something isn't working

Comments

@Breeze-Zero

Breeze-Zero commented Feb 15, 2022

I recently tried to run some experiments with my model on the multi-coil fastMRI brain data. Because I need flexibility (and also don't have the extra time to learn PyTorch Lightning), I didn't use PyTorch Lightning directly. Instead, I used plain PyTorch, but during iteration, even with only num_workers=2, my memory footprint was quite large at the beginning. As the number of iterations increased, an error occurred:
RuntimeError: DataLoader worker (PID 522908) is killed by signal: Killed.
I checked the other parts of the training code but found no obvious memory accumulation. Therefore, I thought there was most likely a problem in SliceDataset. I simply iterated over the DataLoader loop with "pass" and found that the memory usage kept rising.

@mmuckley changed the title from "About the memory leak of SliceDataset in normal Pytorch Dataloader" to "Potential memory leak in SliceDataset" on Feb 15, 2022
@mmuckley
Contributor

Hello @834799106, thanks for putting an issue here.

Based on your error I doubt the SliceDataset class is the issue. For one, your program is not being terminated due to memory; it is being terminated because some process killed the overall program. Also, if you look at the __getitem__ function, you can see that there are no side effects. Everything the function creates should be returned to the calling function or destroyed.

In order to verify a memory leak we will need you to give us a reproducible example for your case since you're not using the PyTorch Lightning modules. Also, please let us know what version of PyTorch you are using and any information you have on the memory usage throughout an epoch. Note: high memory at the start might be expected, as you have your model in memory. There is also some metadata about the dataset that is precomputed and stored in memory.
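For the memory numbers, here is a minimal sketch (not fastMRI code; it assumes psutil is installed) of how you could log resident memory per batch, including the DataLoader worker processes:

import psutil

def total_rss_gb() -> float:
    """Resident memory of this process plus its children (e.g. DataLoader workers)."""
    proc = psutil.Process()
    rss = proc.memory_info().rss
    for child in proc.children(recursive=True):
        try:
            rss += child.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # a worker exited between listing and query
    return rss / 1e9

def log_memory_during_epoch(dataloader, every=100):
    for i, _batch in enumerate(dataloader):
        if i % every == 0:
            print(f"batch {i}: total resident memory {total_rss_gb():.2f} GB")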

@soumickmj

soumickmj commented Feb 16, 2022

Hi @mmuckley, I was about to file an issue for a memory leak. I'm not sure about the issue of @834799106, though.
I have created a small piece of code to reproduce it.

from fastmri.data.transforms import UnetDataTransform
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm

mask_func = create_mask_for_mask_type(
    mask_type_str="random", center_fractions=[0.08], accelerations=[8]
)

root_gt = "/data/project/fastMRI/Brain/multicoil_train"

sd = SliceDataset(
    root=root_gt,
    challenge="multicoil",
    transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    use_dataset_cache=True,
    dataset_cache_file=f"{os.path.dirname(root_gt)}/dataset_cache_{os.path.basename(root_gt)}.pkl",
)
dl = DataLoader(sd, batch_size=1, shuffle=False, num_workers=10)

for e in tqdm(dl):
    del e
    pass
I'm currently using the latest git pull of fastMRI.

While running this code, I was monitoring the memory usage; even though I'm deleting the variable, memory usage still increases constantly. Originally, this was part of my other pipeline where I'm only using the SliceDataset and not the whole Lightning module. If you would like to have a look, this is the code: https://github.com/soumickmj/NCC1701/blob/main/Engineering/datasets/fastMRI.py

I was originally thinking maybe my code was creating the leak, but the other dataset modes (different code for reading other datasets) of my NCC1701 pipeline did not create the leak.
Then I wrote that small script to see if the leak is still there when my pipeline is not involved.

@soumickmj

soumickmj commented Feb 16, 2022

I also got a similar behaviour while using the Data Module.

from fastmri.data.transforms import UnetDataTransform
from fastmri.pl_modules import FastMriDataModule
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm

mask_func = create_mask_for_mask_type(
    mask_type_str="random", center_fractions=[0.08], accelerations=[8]
)

root_gt = "/data/project/fastMRI/Brain"

data_module = FastMriDataModule(
    data_path=root_gt,
    challenge="multicoil",
    train_transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    val_transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    test_transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    batch_size=1,
    num_workers=10,
)
dl = data_module.train_dataloader()

for e in tqdm(dl):
    del e
    pass

@mmuckley
Contributor

mmuckley commented Feb 16, 2022

Hello @soumickmj, I ran your script on the knee validation data with memory-profiler, and memory usage peaked pretty early at a little less than 5 GB (see attached), staying flat for the rest of the entire dataset afterwards (which does not suggest a leak).
[Screenshot: memory profile plot]

Perhaps you could try running on your system to verify with PyTorch 1.10?

This is the code:

from fastmri.data.transforms import UnetDataTransform
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm


@profile
def main():
    val_path = "/path/multicoil_val"
    mask_func = create_mask_for_mask_type(
        mask_type_str="random", center_fractions=[0.08], accelerations=[8]
    )

    sd = SliceDataset(
        root=val_path,
        challenge="multicoil",
        transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    )
    dl = DataLoader(sd, batch_size=1, shuffle=False, num_workers=10)

    for e in tqdm(dl):
        del e
        pass


if __name__ == "__main__":
    main()

You can run with mprof run --include-children file.py.
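If you prefer to collect the numbers from inside Python rather than via the mprof CLI, memory-profiler also exposes a memory_usage() helper; a rough sketch, reusing the main() defined above (drop the @profile decorator if you are not running under mprof):

from memory_profiler import memory_usage

# Samples total memory (main process plus DataLoader workers) once per second
# while main() runs; returns the samples as a list of MiB values.
samples = memory_usage((main, (), {}), interval=1.0, include_children=True)
print(f"peak memory: {max(samples):.0f} MiB across {len(samples)} samples")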

@Breeze-Zero
Author

Hi @mmuckley, I copied the above code to my machine but modified the batch size. The figure below is the result of mprof run --include-children file.py:
[Screenshot: mprof plot]
I didn't even finish an epoch before it broke off.
[Screenshot]

This is the code:

from fastmri.data.transforms import UnetDataTransform
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm


@profile
def main():
    val_path = "/data2/fastmri/mnt/multicoil_val"
    mask_func = create_mask_for_mask_type(
        mask_type_str="random", center_fractions=[0.08], accelerations=[8]
    )

    sd = SliceDataset(
        root=val_path,
        challenge="multicoil",
        transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    )
    dl = DataLoader(sd, batch_size=4, shuffle=False, num_workers=10)

    for e in tqdm(sd):
        del e
        pass


if __name__ == "__main__":
    main()

Maybe it's the PyTorch version that's causing the problem. My PyTorch version is 1.8.1+cu111.

@soumickmj

soumickmj commented Feb 16, 2022

Sorry @mmuckley, I also got the same problem after running the memory profiler.
I used two different versions of PyTorch.
In contrast to @834799106, I am using more recent versions of PyTorch.

With PyTorch 1.10.2 py3.9_cuda11.3_cudnn8.2.0_0 I got:

[Screenshot: memory profile]

With PyTorch 1.11.0.dev20220129 py3.9_cuda11.3_cudnn8.2.0_0 (pytorch-nightly), which I usually need for my work due to the features it offers:

[Screenshot: memory profile]

I did not run it till the very end, as memory was continuously increasing and would have crashed the server (which has 250 GB of RAM) again. So I don't think it's related to the PyTorch version.

Just to let you know: the OS is Ubuntu 20.04.3 LTS and the Python version is 3.9.7.

@soumickmj

sd = SliceDataset(
    root=val_path,
    challenge="multicoil",
    transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
)
dl = DataLoader(sd, batch_size=1, shuffle=False, num_workers=10)

for e in tqdm(sd):
    del e
    pass

Hi @mmuckley,
In your code, I just noticed that you are looping over the dataset directly without the DataLoader, whereas in my code I'm looping over the DataLoader. Can you please try that as well?
In my case, both show similar behaviour. I also tested with 0, 1, and 3 workers - all the same.
I created another conda env with Python 3.7.11 and torch 1.8.2 and got similar behaviour as well.

Am I doing something wrong somewhere?

@mmuckley
Contributor

Hello @soumickmj, I copied the wrong code. The paste I showed was with the DataLoader, not the dataset. This is what I get with the dataset. You can see the max is about 720 MB.

[Screenshot: memory profile]

@soumickmj

Hello @soumickmj, I copied the wrong code. The paste I showed was with the DataLoader, not the dataset. This is what I get with the dataset. You can see the max is about 720 MB.

[Screenshot: memory profile]

Ah okay, no problem!
But still, in my case, I'm getting this constant increase in memory usage, as you can see from the plots.
Any suggestions?

@mmuckley
Contributor

mmuckley commented Feb 16, 2022

One thing I notice is that you are both using Python 3.9. I could try Python 3.9 and check with that perhaps.

EDIT: Sorry, I see you also tried Python 3.7. Not sure what to do then...

@mmuckley
Contributor

Okay, I tested on Python 3.9 and I'm still getting the same behavior with batch_size=4.

[Screenshot: memory profile]

To help a bit more with this I'm including the complete conda environments that I used for my tests. I'm using a custom Linux distribution based on 5.4.0-81-generic x86_64 on an Intel Xeon E5-2698. @soumickmj @834799106, if either of you could try one of my conda environments, maybe we could figure out if it's one of the packages.

Python 3.8 environment
Python 3.9 environment

@soumickmj

Thanks @mmuckley
I have tested with your 3.9 conda env and it worked without a problem.
Your yml was missing fastMRI, so I installed it using pip install git+https://github.com/facebookresearch/fastMRI.git

[Screenshot: memory profile]

Then I created a "bare minimum" env without using your env, and this resulted in the old issue again.
For this env, I just installed PyTorch 1.10.2 with CUDA 11.3 and then installed fastMRI directly from the git repo, as with the other env.
This conda environment doesn't contain anything that isn't required.

[Screenshot: memory profile]

I compared the versions. Initially, the version of numpy was different (1.21.2), so I switched to the one you have, 1.20.3.
Apart from all the extra packages in your env, I couldn't see any difference.
Here is the yml file, zipped as GitHub wasn't letting me upload a yml here:
fMMem.zip

Do you have any idea what the reason might be?
Do I need some additional package that it is not complaining about, but that is still required and whose absence is causing this issue?

@Breeze-Zero
Author

I was in a similar situation. The problem was solved by installing your Py3.9 environment directly with conda. However, when creating a Py3.9 environment normally with conda and then running pip install git+https://github.com/facebookresearch/fastMRI.git plus pandas (it's not in the fastMRI package), the problem remains.

@mmuckley
Contributor

So my install process is as follows in a few bash commands:

conda create -n memory_test_py39 python=3.9
conda activate memory_test_py39
conda install anaconda
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
pip install -e .

Where the pip install -e . is in the fastMRI folder.

In that case I will try to reproduce now with your minimal environments.

@mmuckley
Contributor

@soumickmj @834799106 I can now reproduce this with the minimal install environment.

[Screenshot: memory profile]

Reproduction environment here: https://gist.github.com/mmuckley/838289a388bc65a7adb23d67908635c9.

@soumickmj

@mmuckley This is really strange!
That bare-minimum environment had almost nothing in it.
While running the code, fastMRI did not throw any errors about missing packages.
Still, we all got this weird behaviour.
Do you have any hunch?

@mmuckley
Contributor

I do think it is related to SliceDataset itself as I see similar characteristics with VarNetDataTransform.

@mmuckley
Contributor

Actually I have to take that back. With the minimal environment and no transforms, I see no issue.

[Screenshot: memory profile]

@soumickmj

Ahahaha, yes, I can confirm that too.
I tried with my old work environment, running PyTorch nightly (1.11dev2), and saw the same behaviour.

Then the problem is with the data transforms and not with the SliceDataset!

@mmuckley changed the title from "Potential memory leak in SliceDataset" to "Potential memory leak in UNetDataTransform and VarNetDataTransform" on Feb 16, 2022
@soumickmj

soumickmj commented Feb 16, 2022

I might have found the source:
the conversion from numpy to a PyTorch tensor.

I did not test using your transforms, but my own transform showed a similar behaviour.

EDIT: Here's my code.
For testing purposes, I returned directly after line 87 in both cases.

@Breeze-Zero
Author

Breeze-Zero commented Feb 16, 2022

I tried adding return kspace_torch after each line of UnetDataTransform and found that kspace_torch = to_tensor(kspace) doesn't have a memory leak. After mask_func I started having problems, but when I set mask_func=None the problems disappeared. Then I added a return after image = fastmri.ifft2c(masked_kspace), and the problem arose again. Therefore, this may not be a problem with a single statement. Due to the time zone difference, I have to rest and cannot investigate further for the time being.
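In case it helps others repeat this kind of bisection, it can be written generically as a transform that stops after a given number of steps (the step list here is a hypothetical stand-in for the operations inside UnetDataTransform, not fastMRI internals):

# Hypothetical sketch of the early-return bisection described above.
def truncated_transform(kspace, steps, stop_after):
    """Apply the (name, fn) pairs in `steps` in order, stopping after `stop_after` of them."""
    out = kspace
    for i, (name, fn) in enumerate(steps):
        out = fn(out)
        if i + 1 >= stop_after:
            print(f"stopping after step {i}: {name}")
            break
    return out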

@mmuckley
Contributor

mmuckley commented Feb 16, 2022

Okay I think I found the issue: it is due to the h5py from pip. It's a relatively recent issue that has been documented here:

https://forum.hdfgroup.org/t/h5py-memory-leak-in-combination-with-pytorch-dataset-with-multiple-workers/9114

If you install the minimal environment, but use the h5py from conda instead of pip then memory stays stable.
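For anyone who wants to check their own environment without the fastMRI data, here is a rough, self-contained sketch of the pattern that forum thread describes (h5py reads inside __getitem__, several DataLoader workers, numpy-to-torch.Tensor conversion). The file name, shapes, and loop counts are arbitrary placeholders, and whether it actually leaks should depend on which HDF5 your h5py wheel was built against:

import os

import h5py
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

FNAME = "leak_test.h5"  # placeholder file created below


def make_dummy_file(n_slices=100, shape=(4, 256, 256)):
    if not os.path.exists(FNAME):
        data = np.random.randn(n_slices, *shape).astype(np.complex64)
        with h5py.File(FNAME, "w") as hf:
            hf.create_dataset("kspace", data=data)


class H5SliceDataset(Dataset):
    def __init__(self, fname=FNAME):
        self.fname = fname
        with h5py.File(fname, "r") as hf:
            self.length = hf["kspace"].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        with h5py.File(self.fname, "r") as hf:
            kspace = hf["kspace"][i]  # numpy array read through h5py
        # numpy -> torch conversion, stacking real/imag like a k-space transform would
        return torch.from_numpy(np.stack((kspace.real, kspace.imag), axis=-1))


if __name__ == "__main__":
    make_dummy_file()
    dl = DataLoader(H5SliceDataset(), batch_size=1, num_workers=4)
    for epoch in range(20):  # watch RSS with mprof or psutil while this runs
        for _batch in dl:
            pass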

@mmuckley changed the title from "Potential memory leak in UNetDataTransform and VarNetDataTransform" to "Memory leak with h5py from pip and conversion to torch.Tensor" on Feb 16, 2022
@soumickmj

Thanks @mmuckley
I also stumbled upon the same root cause of our problem.
For me, it got solved by building and installing h5py from the git repo.
I will check out your conda remedy now!

@mmuckley
Contributor

mmuckley commented Feb 16, 2022

With the h5py from conda, if I print(h5py.version.info) I get the following:

Summary of the h5py configuration
---------------------------------

h5py    3.6.0
HDF5    1.10.6
Python  3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.21.2
cython (built with) 0.29.24
numpy (built against) 1.16.6
HDF5 (built against) 1.10.6

So the conda h5py was built against HDF5 1.10.6, which predates the version where the issue begins.
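Based on that, here is a small check one could drop at the top of a training script; the version threshold is only an assumption drawn from this thread, not an official h5py statement:

import warnings

import h5py

print(h5py.version.info)
# Assumption from this thread: the conda build against HDF5 1.10.6 is fine,
# while pip wheels linked against newer HDF5 showed the DataLoader leak.
if h5py.version.hdf5_version_tuple >= (1, 12, 0):
    warnings.warn(
        f"h5py is linked against HDF5 {h5py.version.hdf5_version}; "
        "this configuration showed a memory leak with multi-worker DataLoaders."
    )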

@soumickmj

Thanks @mmuckley
It works perfectly with conda.
I tried it with my old env with PyTorch nightly!

Thanks for all your help!

PS: Maybe you can put a notice on the fastMRI homepage so people know about this, as the impact can be significant.

@mmuckley
Contributor

Okay I opened #217 to do this. Feel free to propose any changes.

For what it's worth, I did some small tests on adding extra copy commands to SliceDataset to get around the leak, but nothing I tried worked, so we may just wait for this to be fixed upstream.

@soumickmj

Thanks!
I will also explore possible ideas!
If I find some fix, I will let you know :)

@mmuckley added the "bug" label on Mar 7, 2022
@soumickmj

soumickmj commented Mar 9, 2022

Hi @mmuckley, the issue resurfaced after the conda version of h5py got updated as well.
This time (I don't know why or what mismatched!) I also had a problem with the git version.
One possible workaround would be to use ".copy()" after every h5py operation.

So basically, inside the __getitem__ function of mri_data.py, we need something like this:

with h5py.File(fname, "r") as hf:
    kspace = hf["kspace"][dataslice].copy()
    mask = np.asarray(hf["mask"].copy()) if "mask" in hf else None
    target = hf[self.recons_key][dataslice].copy() if self.recons_key in hf else None

(Pull request #227)

Maybe it's dirty (I'm not sure if it will have some other implications, say in terms of speed), but for me it's working so far.
Can you please have a look and let me know your thoughts? :)
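For reference, here is the same idea as a standalone helper (a sketch only, not the actual fastMRI code; the key names are just examples), using np.array to force the copies:

import h5py
import numpy as np


def read_slice(fname, dataslice, recons_key="reconstruction_rss"):
    """Read one slice, copying every array so nothing returned references h5py-owned memory."""
    with h5py.File(fname, "r") as hf:
        kspace = np.array(hf["kspace"][dataslice])  # np.array makes a copy
        mask = np.array(hf["mask"]) if "mask" in hf else None
        target = np.array(hf[recons_key][dataslice]) if recons_key in hf else None
        attrs = dict(hf.attrs)
    return kspace, mask, target, attrs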

@mmuckley
Contributor

Hello @soumickmj, I do not observe HDF5 being updated on conda, at least for Python 3.8 or 3.9.

[Screenshot]

@Sarah-2021-scu

Sarah-2021-scu commented May 5, 2023

Hi @mmuckley

I am running the VarNet demo on a small set of brain MRI data, and I am getting the following error after 3-4 iterations of the first epoch:

RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 11.91 GiB total capacity; 10.40 GiB already allocated; 74.81 MiB free; 10.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Following is my h5py version:

h5py 3.7.0
HDF5 1.10.6
Python 3.9.16 (main, Mar 8 2023, 14:00:05)
[GCC 11.2.0]
sys.platform linux
sys.maxsize 9223372036854775807
numpy 1.23.5
cython (built with) 0.29.30
numpy (built against) 1.16.6
HDF5 (built against) 1.10.6

I am also attaching my memory profiler plot.

[Screenshot: memory profiler plot]

Please tell me where I am going wrong.
Thank you for your help!

@mmuckley
Contributor

mmuckley commented May 5, 2023

Hello @Sarah-2021-scu, your memory usage is good. The error in your case is on the GPU, which is not related to this particular issue.

It looks like your GPU may just be too small. You could try running the model with a lower cascade count.

@Sarah-2021-scu

Thank you @mmuckley for your response. I am using 2 GPUs with 12 GB of memory each. I will lower the cascade count as well. The other options I have are:

  1. 1 GPU with 32 GB memory.
  2. 4 GPUs with 12 GB memory each.

Which would be the best option of the two?

@hujb48

hujb48 commented Oct 30, 2023

Okay I opened #217 to do this. Feel free to propose any changes.

For what it's worth, I did some small tests on adding extra copy commands to SliceDataset to get around the leak, but nothing I tried worked, so we may just wait for this to be fixed upstream.

I got the same situation while training the U-Net baseline model, and I followed issue #217: with pip uninstall h5py and conda install h5py==3.6.0 it works without any problem. My env is Python 3.8.18 with torch 1.13.0 + CUDA 11.7.
