High memory consumption upon instantiation of DataLoader with GridSampler #892
-
Hi, I am running into an issue with my current data loader. I am implementing a patch-based data pipeline for training on medical images, and I want to make it future-proof for the subsequent datasets I will obtain throughout my project. Therefore, I use the following in both training and validation:

```python
for subject_ in subjects_dataset:
    dataset = GridSampler(
        subject=subject_,
        patch_size=self.args.patch_size,
        patch_overlap=self.args.patch_overlap,
    )
    # one sampler, two aggregators and one loader are created and stored per subject
    getattr(self, f"subjects_{str(stage)}").aggregator['pred'].append(GridAggregator(dataset))
    getattr(self, f"subjects_{str(stage)}").aggregator['target'].append(GridAggregator(dataset))
    dataloader.append(DataLoader(
        dataset=dataset,
        batch_size=self.args.batch_size,
        num_workers=self.args.num_workers,
        pin_memory=self.args.pin_memory,
    ))
```

Running this code, data loading takes around 8 GB of RAM for 20 subjects, which will exceed the memory available on our machine once I scale up to the 100 samples I intend to use for testing. I tried modifying several of the loader parameters, but none of these alterations reduced the memory consumption significantly. Subsequently, I tried to implement a SequentialLoader class that builds the sampler and loader lazily, one subject at a time:
```python
class SequentialLoader:
    def __init__(self, subjects_dataset, patch_size, patch_overlap,
                 batch_size, num_workers, pin_memory):
        self.subjects_dataset = subjects_dataset
        self.patch_size = patch_size
        self.patch_overlap = patch_overlap
        self.batch_size = batch_size
        self.pin_memory = pin_memory
        self.num_workers = num_workers
        self.num_patches = None

    def __len__(self):
        return self.num_patches

    def __iter__(self):
        for subj in self.subjects_dataset:
            dataset = GridSampler(
                subject=subj,
                patch_size=self.patch_size,
                patch_overlap=self.patch_overlap,
            )
            dataloader = DataLoader(
                dataset=dataset,
                batch_size=self.batch_size,
                num_workers=self.num_workers,
                pin_memory=self.pin_memory,
            )
            # only known once the first sampler has been created
            self.num_patches = len(dataset) * len(self.subjects_dataset)
            yield from dataloader
```

but the initial estimation of the number of patches fails: `__len__` returns `num_patches`, which is only set once iteration has started.

My final question is: is there a better way to load the samples into RAM only when I need them (lazy loading), to reduce the memory footprint, i.e. when accessing

```python
inputs = batch['image'][DATA]
targets = batch['label'][DATA]
```

as described in the TorchIO documentation?
Or am I doing something severely wrong? Thank you in advance for your help; hopefully I have included enough information about my problem. Cheers,
-
Hi, @nicoloesch. This comes terribly late, apologies. I'm trying to go through all unanswered questions.
I don't understand why you'd want a data loader per subject. I think this might be what's causing the issue.
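For patch-based inference, the pattern in the TorchIO documentation creates the GridSampler, DataLoader, and GridAggregator for a single subject at a time, inside the loop, so only one subject's data is ever resident in memory. A minimal sketch, where `model`, `patch_size`, `patch_overlap`, and `batch_size` are placeholders:

```python
import torch
import torchio as tio

for subject in subjects_dataset:
    # sampler, loader and aggregator live only for the current subject
    sampler = tio.GridSampler(subject, patch_size, patch_overlap)
    loader = torch.utils.data.DataLoader(sampler, batch_size=batch_size)
    aggregator = tio.inference.GridAggregator(sampler)
    with torch.no_grad():
        for batch in loader:
            inputs = batch['image'][tio.DATA]
            outputs = model(inputs)
            aggregator.add_batch(outputs, batch[tio.LOCATION])
    prediction = aggregator.get_output_tensor()
    # use `prediction` here; the sampler, loader and aggregator then go
    # out of scope and their memory can be reclaimed
```

For training, `tio.Queue` is the intended tool: it extracts patches from a bounded number of subjects at a time instead of keeping all of them in RAM. Another sketch, with `max_length` and `samples_per_volume` chosen arbitrarily:

```python
queue = tio.Queue(
    subjects_dataset,
    max_length=100,         # upper bound on patches held in RAM
    samples_per_volume=10,  # patches extracted from each loaded subject
    sampler=tio.UniformSampler(patch_size),
    num_workers=num_workers,
)
# the Queue handles multiprocessing itself, so the loader keeps the
# default num_workers=0
patches_loader = torch.utils.data.DataLoader(queue, batch_size=batch_size)
```

With either setup, per-subject loaders and aggregators never accumulate in lists, which should keep the memory footprint roughly constant in the number of subjects.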