DataLoader num_workers > 0 causes CPU memory from parent process to be replicated in all worker processes #13246
Comments
Do you see memory usage increasing when iterating, or before you even start to iterate? |
@ssnl During the iteration only. |
When we fix #13243 we should check if this one gets fixed too. |
I've been experiencing something similar, where memory usage continuously climbs until an OOM is triggered when using a custom batch sampler with num_workers > 0.
To Reproduce
```python
import math

from torch.utils.data import DataLoader


class Sampler:
    def __init__(self, n=100000, batch_size=32):
        self.n = n
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(float(self.n) / self.batch_size)

    def __iter__(self):
        batch = []
        for i in range(self.n):
            batch.append(i)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:
            yield batch


N = 100000000
train_data = list(range(N))


def ok():
    # num_workers=0: batches are produced in the main process, RAM stays flat.
    train_sampler = Sampler(len(train_data))
    train_loader = DataLoader(train_data,
                              num_workers=0,
                              batch_sampler=train_sampler)
    for i, item in enumerate(train_loader):
        if i % 10000 == 0:
            print(i)


def leaky():
    # Identical except num_workers=8: memory keeps climbing during iteration.
    train_sampler = Sampler(len(train_data))
    train_loader = DataLoader(train_data,
                              num_workers=8,
                              batch_sampler=train_sampler)
    for i, item in enumerate(train_loader):
        if i % 10000 == 0:
            print(i)


print('Starting ok')
ok()
print('ok done, starting leaky()')
leaky()
print('leaky done')
```
Environment
|
After some more investigation, I have found an exact scenario in which the leak occurs. Consider the code example below:
```python
from torch.utils.data import Dataset, DataLoader
import numpy as np
import torch


class DataIter(Dataset):
    def __init__(self):
        # The same 24M integers stored two ways: one numpy array and one Python list.
        self.data_np = np.array([x for x in range(24000000)])
        self.data = [x for x in range(24000000)]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data = self.data[idx]
        data = np.array([data], dtype=np.int64)
        return torch.tensor(data)


train_data = DataIter()
train_loader = DataLoader(train_data, batch_size=300,
                          shuffle=True,
                          drop_last=True,
                          pin_memory=False,
                          num_workers=18)

for i, item in enumerate(train_loader):
    if i % 1000 == 0:
        print(i)
```
If we use the numpy array (self.data_np) instead of the Python list (self.data) inside __getitem__, memory usage stays constant; with the list, RAM usage keeps growing. |
I face a similar issue, but in my case it occurs with a numpy array too. I am using Python 3.7 and a PyTorch nightly release. |
I don't know how multiprocessing really works under the hood of PyTorch, but we have extensively discussed this "memory leak" issue (which probably isn't a memory leak!) on the fast.ai forums (https://forums.fast.ai/t/runtimeerror-dataloader-worker-is-killed-by-signal/31277/55?u=marcmuc). Preliminary findings, which hopefully add some insight here (if this does NOT apply, please comment!):
Python multiprocessing: There is no way of storing arbitrary Python objects (even simple lists) in shared memory in Python without triggering copy-on-write behaviour due to the addition of refcounts every time something reads from these objects. The refcounts are added memory-page by memory-page, which is why the consumption grows slowly. The processes (workers) will end up having all/most of the memory copied over bit by bit, which is why we get the memory overflow problem. The best description of this behavior is here (SO). Possible Solution:
I am not familiar with the torch.multiprocessing drop-in replacement that I understand PyTorch uses, but I would assume it will also not be able to remove the core Python refcount issue. |
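Editor note: as a rough illustration of the copy-on-write effect described above (a sketch assuming Linux, the fork start method, and a readable /proc; not from the original thread), a forked child that only reads a large list of Python objects ends up owning private copies of the pages holding those objects, because every read bumps per-object refcounts.
```python
# Sketch (assumes Linux and the 'fork' start method).
# The child only *reads* the list, yet its private memory grows, because each
# read updates per-object refcounts and copy-on-write duplicates the pages.
import multiprocessing as mp

data = [str(i) for i in range(5_000_000)]  # millions of small Python objects

def private_mb():
    # Private (non-shared) memory of the current process in MB, from smaps_rollup.
    total_kb = 0
    with open("/proc/self/smaps_rollup") as f:
        for line in f:
            if line.startswith(("Private_Clean:", "Private_Dirty:")):
                total_kb += int(line.split()[1])
    return total_kb / 1024

def worker():
    print(f"child private memory before reading: {private_mb():.0f} MB")
    n_chars = sum(len(s) for s in data)  # read-only traversal
    print(f"child private memory after reading:  {private_mb():.0f} MB ({n_chars} chars)")

if __name__ == "__main__":
    p = mp.get_context("fork").Process(target=worker)
    p.start()
    p.join()
```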
@mprostock torch.multiprocessing is simply Python multiprocessing, with a custom pickler. The custom pickler, whenever it encounters a torch.Tensor, moves the tensor's underlying storage into shared memory, so tensors are shared between worker processes rather than copied; it does nothing special for plain Python objects and their refcounts. |
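Editor note: a minimal illustration of the shared-memory behaviour described above (the tensor size is arbitrary); share_memory_() performs explicitly the same move-to-shared-memory step the custom pickler applies to CPU tensors.
```python
import torch

t = torch.arange(10_000_000)  # one storage behind one Python wrapper object
print(t.is_shared())          # False: ordinary private memory
t.share_memory_()             # move the underlying storage into shared memory
print(t.is_shared())          # True: worker processes can map the same pages
```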
Thanks for the explanation! I have experimented with @bfreskura's reproduction example and I think I can now pinpoint the problem:
The reproduction example by bfreskura above showed the difference between a regular Python list and a numpy array. But the problem is not (only) the Python list itself; the same happens with a numpy array of type object. Python lists store only references to the objects; the objects are kept separately in memory. Every object has a refcount, therefore every item in the list has a refcount. Numpy arrays (of standard np types) are stored as contiguous blocks in memory and are only ONE object with one refcount. This changes if you make the numpy array explicitly of type object, which makes it start behaving like a regular Python list (only storing references to (string) objects). The same "problems" with memory consumption now appear.
This would explain why, with regular lists (or numpy arrays of type object), we see the "memory leak", which actually is the copy-on-access problem of forked Python processes due to changing refcounts, not a memory leak. So the problem probably (often) has got nothing to do with tensors or actual torch objects, but rather with the lists of filenames and dicts of labels that are generally used within dataloaders/datasets. I have created a notebook gist, if someone wants to quickly try it. |
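Editor note: a tiny sketch of the distinction made above (not the linked gist). A numeric or fixed-width numpy array is a single refcounted object over one buffer, while a dtype=object array stores references to separate Python objects, exactly like a list:
```python
import sys
import numpy as np

strings = [str(i) for i in range(5)]

as_objects = np.array(strings, dtype=object)  # holds references to the str objects
as_bytes = np.array(strings, dtype="S8")      # copies the data into one fixed-width buffer

s = strings[0]
print(sys.getrefcount(s))   # the list *and* the object array both hold a reference
print(as_objects[0] is s)   # True  -> same Python object; access touches its refcount
print(as_bytes[0] is s)     # False -> the element is built from the raw buffer on access
```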
I am facing the same issue. It fills up my RAM very fast if num_workers > 0. |
Switching from dict to pandas and from lists to numpy arrays helps me
|
Thanks for the reply. I will try that and hopefully, it works. |
May I ask what the solution to this issue is? I tried @samgd's code on the latest nightly build of PyTorch, and it was still leaking. |
@Godricly See @mprostock's and @soumith's comments above. This is not really a leak, but an unfortunate behavior of using a native Python list. Using either a torch tensor or an np array will solve this memory problem. |
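Editor note: a minimal sketch of that workaround (the class and field names are made up, not from the thread): keep per-sample metadata in fixed-width numpy arrays or tensors instead of Python lists/dicts, so the dataset state is a few large buffers rather than millions of refcounted Python objects.
```python
import numpy as np
from torch.utils.data import Dataset

class MetadataDataset(Dataset):
    """Hypothetical dataset whose state is two arrays instead of a list and a dict."""
    def __init__(self, paths, labels):
        # Fixed-width unicode dtype: one contiguous buffer with a single refcount.
        # (dtype=object would re-introduce per-element Python objects.)
        self.paths = np.array(paths, dtype="U256")
        self.labels = np.array(labels, dtype=np.int64)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Only the requested element is materialized as a Python object here.
        return str(self.paths[idx]), int(self.labels[idx])
```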
@mprostock Do you mean that it is the copies created by copy-on-access that use up the memory, not something else? And isn't the copy released after it has been used? |
Someone needs to step up and write a proper augmentation op, for image datasets at least. The whole reason for all of these multiprocessing shenanigans is that vision datasets have to decode and crop images on multiple cores. If there were an op that took care of decoding and geometric image transforms (resize, crop, flip, shear, affine) and produced batch tensors directly, there would be no need to use multiprocessing at all; further, non-geometric augmentation steps (colors, whitening/normalization, noise) could use intra-op parallelism to rip through the entire tensor. Care needs to be taken when designing such an op to expose the transform parameters for each sample in the batch to the outside, in order to enable parallel transformation of annotations (bounding boxes, masks, keypoints, etc.). |
@mprostock thank you for the great explanation! However, no solution has been proposed yet. Storing lists of filenames in a Dataset object seems a fair use case, so how can one use them? Did anyone figure it out? |
* Uses `ray` to make it faster
* Data is stored as tensors because of pytorch/pytorch#13246 (comment)
What if I want to load my dataset completely as an attribute of the dataset object and then use that attribute inside the __getitem__ method? |
For all people who stumbled on this recently, please consider upvoting proposal #101699, in the part about introducing a tensor-backed array of strings into core (at least a read-only one). |
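Editor note: until something like that proposal lands, a hand-rolled version is often used; the sketch below (UTF-8 paths assumed, helper names invented) packs all strings into one uint8 tensor plus an offsets tensor, so workers share two tensors instead of millions of str objects.
```python
import torch

def pack_strings(strings):
    """Pack a list of str into a flat uint8 tensor plus an int64 offsets tensor."""
    encoded = [s.encode("utf-8") for s in strings]
    offsets = torch.zeros(len(encoded) + 1, dtype=torch.int64)
    offsets[1:] = torch.tensor([len(b) for b in encoded]).cumsum(0)
    flat = torch.tensor(list(b"".join(encoded)), dtype=torch.uint8)
    return flat, offsets

def unpack_string(flat, offsets, idx):
    return bytes(flat[offsets[idx]:offsets[idx + 1]].tolist()).decode("utf-8")

paths = [f"/data/img_{i}.jpg" for i in range(5)]  # stand-in for a huge path list
flat, offsets = pack_strings(paths)
print(unpack_string(flat, offsets, 3))            # -> /data/img_3.jpg
```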
Has upstream CPython improved anything on this topic? |
Why not use pandas to store paths directly? |
|
So, should using a pandas dataframe solve this problem? I read csv files into a pandas dataframe and use iloc to index the data in __getitem__, but the leaking problem still exists. Here is my code.
The problem even exists when I just read a line from the dataframe but do not use it.
|
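Editor note: the code itself didn't make it into the thread, so the sketch below is only a guess at the pattern described (column names invented). Note that string columns in a pandas DataFrame are object-dtype arrays, i.e., per-row Python objects, so by the copy-on-write explanation above they behave just like a list.
```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class CsvDataset(Dataset):
    """Rough reconstruction of the described pattern, not the commenter's actual code."""
    def __init__(self, csv_path):
        self.df = pd.read_csv(csv_path)  # string columns end up as object dtype

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # iloc touches the row's underlying Python objects (their refcounts change),
        # which is enough to trigger copy-on-write in a forked worker.
        row = self.df.iloc[idx]
        return row["path"], torch.tensor(int(row["label"]))
```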
FWIW we've (seemingly so far) worked around this issue by storing dataset state in a polars DataFrame. |
Hi! To avoid memory increase using |
|
Editor note: There is a known workaround further down on this issue, which is to NOT use Python lists, but instead use something else, e.g., torch.tensor directly. See #13246 (comment). You can use a numpy array, but it only fixes the issue for the fork start method. See #13246 (comment) for more details.
🐛 Bug
CPU memory will leak if the DataLoader num_workers > 0.
To Reproduce
Run the following snippet:
Expected behavior
CPU memory will gradually start increasing, eventually filling up the whole RAM. E.g., the process starts with around 15GB and fills up the whole 128GB available on the system.
When num_workers=0, RAM usage is constant.
Environment
Additional info
There are around 24 million images in the dataset and all image paths are loaded into a single list as presented in the above code snippet.
I have also tried multiple PyTorch versions (0.4.0 and 0.4.1) and the effect is the same.
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @ssnl @VitalyFedyunin @ejguan