

[BUG] Distributed Training With (NVTabular + Pytorch DDP), I got this error: RuntimeError: parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #1876

Closed
SunnyGhj opened this issue Apr 16, 2024 · 1 comment
Labels
bug Something isn't working

SunnyGhj commented Apr 16, 2024

The full error traceback:

for _iter, (features, labels) in enumerate(data):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 678, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
    data = next(self.dataset_iter)
  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/torch.py", line 64, in __next__
    converted_batch = self.convert_batch(super().__next__())
  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 261, in __next__
    return self._get_next_batch()
  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 328, in _get_next_batch
    self._fetch_chunk()
  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 277, in _fetch_chunk
    raise chunks
  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 791, in load_chunks
    self.chunk_logic(itr)
  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 767, in chunk_logic
    chunks.reset_index(drop=True, inplace=True)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/dataframe.py", line 3119, in reset_index
    *self._reset_index(
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/indexed_frame.py", line 3224, in _reset_index
    ) = self._index._split_columns_by_levels(level)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/_base_index.py", line 2120, in _split_columns_by_levels
    [self._data[self.name]],
  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/index.py", line 326, in _data
    {self.name: self._values}
  File "/usr/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/index.py", line 289, in _values
    return column.as_column(self._range, dtype=self.dtype)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 1849, in as_column
    column = libcudf.filling.sequence(
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "filling.pyx", line 97, in cudf._lib.filling.sequence
RuntimeError: parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

When I use PyTorch DDP to drive 2 GPUs on a single node for distributed training, launching the processes with torchrun, the above error occurs in one of the worker processes. With only one GPU, the error does not occur. Do NVTabular and Merlin not support distributed training? My goal is multi-GPU training on a single node with PyTorch.

The data loader:

def get_loader(self, world_size, rank):
    # Round-robin shard: each rank reads every world_size-th file.
    data_file = [
        f for i, f in enumerate(self.config.data_file)
        if i % world_size == rank % world_size
    ]
    logging.info(f'GPU:{rank} read {data_file}')
    source = merlin.io.Dataset(
        data_file, part_size='128M', cpu=False,
        schema=Schema([
            ColumnSchema('keeps', dtype=string, is_list=True),
            ColumnSchema('features', dtype=int32, is_list=True),
            ColumnSchema('labels', dtype=float32, is_list=True),
        ]),
    ).to_ddf(columns=['features', 'labels'])
    logging.info(f'GPU:{rank} source end')
    train_dataset = TorchAsyncItr(
        merlin.io.Dataset(source, schema=Schema([
            # ColumnSchema('keeps', dtype=string, is_list=True),
            ColumnSchema('features', dtype=int32, is_list=True),
            ColumnSchema('labels', dtype=float32, is_list=True),
        ])),
        batch_size=self.config.batch_size,
        global_size=world_size,
        global_rank=rank,
        drop_last=True,
        device=rank,
    )
    logging.info(f'GPU:{rank} train_dataset end')

    data_loader = DLDataLoader(
        train_dataset,
        batch_size=None,
        collate_fn=self.read_and_decode_torch,
        pin_memory=False,
        num_workers=0,
    )
    return data_loader
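For reference, the round-robin file assignment at the top of get_loader (each rank takes every world_size-th file) can be isolated as a small pure function. A minimal sketch; shard_files and the part_*.parquet names are hypothetical, not part of the original code:

```python
def shard_files(files, world_size, rank):
    """Round-robin assignment of data files to ranks, matching the
    i % world_size == rank % world_size logic in get_loader above."""
    return [f for i, f in enumerate(files) if i % world_size == rank % world_size]

files = [f"part_{i}.parquet" for i in range(5)]
print(shard_files(files, 2, 0))  # rank 0 takes parts 0, 2, 4
print(shard_files(files, 2, 1))  # rank 1 takes parts 1, 3
```

Note that with this scheme each rank must end up with at least one file, and ranks may receive unequal file counts when len(files) is not a multiple of world_size.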
SunnyGhj (Author) commented:

The illegal memory access happens because cuDF allocates on GPU 0 by default. When running multiple GPUs on a single node, each worker therefore needs to pin its device explicitly:

import cupy
cupy.cuda.runtime.setDevice(local_rank)
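To make this robust under torchrun, the device can be pinned from the LOCAL_RANK environment variable (which torchrun sets for each worker) before any merlin.io.Dataset or dataloader is created. A minimal sketch; pin_devices is a hypothetical helper name, not part of Merlin's API:

```python
import os

def pin_devices(local_rank: int) -> None:
    """Pin both PyTorch and the CuPy/cuDF stack to this rank's GPU.

    Call this at the very top of each worker process, before any
    merlin.io.Dataset or dataloader is built; otherwise cuDF
    allocations default to GPU 0 on every rank, which leads to the
    cudaErrorIllegalAddress above.
    """
    import cupy
    import torch

    torch.cuda.set_device(local_rank)        # PyTorch-side allocations
    cupy.cuda.runtime.setDevice(local_rank)  # CuPy/cuDF-side allocations

# torchrun exports LOCAL_RANK for every worker it spawns.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
```

In the worker entry point this would run once, before get_loader, so that both frameworks agree on the device for that process.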
