

[BUG] Distributed Training With (NVTabular + Pytorch DDP), I got this error: RuntimeError: parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #1876

Closed
SunnyGhj opened this issue Apr 16, 2024 · 1 comment
Labels
bug Something isn't working

SunnyGhj commented Apr 16, 2024

The full error traceback:

for _iter, (features, labels) in enumerate(data):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 678, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
    data = next(self.dataset_iter)
  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/torch.py", line 64, in __next__
    converted_batch = self.convert_batch(super().__next__())
  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 261, in __next__
    return self._get_next_batch()
  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 328, in _get_next_batch
    self._fetch_chunk()
  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 277, in _fetch_chunk
    raise chunks
  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 791, in load_chunks
    self.chunk_logic(itr)
  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 767, in chunk_logic
    chunks.reset_index(drop=True, inplace=True)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/dataframe.py", line 3119, in reset_index
    *self._reset_index(
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/indexed_frame.py", line 3224, in _reset_index
    ) = self._index._split_columns_by_levels(level)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/_base_index.py", line 2120, in _split_columns_by_levels
    [self._data[self.name]],
  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/index.py", line 326, in _data
    {self.name: self._values}
  File "/usr/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/index.py", line 289, in _values
    return column.as_column(self._range, dtype=self.dtype)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 1849, in as_column
    column = libcudf.filling.sequence(
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "filling.pyx", line 97, in cudf._lib.filling.sequence
RuntimeError: parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

When I use PyTorch DDP to drive 2 GPUs on a single node for distributed training, launching the processes with torchrun, the above error occurs in one of the worker processes. With only one GPU, the error does not occur. Do NVTabular and Merlin not support distributed training? My goal is multi-GPU training on a single node with PyTorch.

The data loader:

def get_loader(self, world_size, rank):
    # Round-robin shard: each rank reads every world_size-th file.
    data_file = [
        f for i, f in enumerate(self.config.data_file)
        if i % world_size == rank % world_size
    ]
    logging.info(f'GPU:{rank} read {data_file}')
    source = merlin.io.Dataset(
        data_file, part_size='128M', cpu=False,
        schema=Schema([
            ColumnSchema('keeps', dtype=string, is_list=True),
            ColumnSchema('features', dtype=int32, is_list=True),
            ColumnSchema('labels', dtype=float32, is_list=True),
        ]),
    ).to_ddf(columns=['features', 'labels'])
    logging.info(f'GPU:{rank} source end')
    train_dataset = TorchAsyncItr(
        merlin.io.Dataset(source, schema=Schema([
            # ColumnSchema('keeps', dtype=string, is_list=True),
            ColumnSchema('features', dtype=int32, is_list=True),
            ColumnSchema('labels', dtype=float32, is_list=True),
        ])),
        batch_size=self.config.batch_size,
        global_size=world_size,
        global_rank=rank,
        drop_last=True,
        device=rank,
    )
    logging.info(f'GPU:{rank} train_dataset end')

    data_loader = DLDataLoader(
        train_dataset,
        batch_size=None,
        collate_fn=self.read_and_decode_torch,
        pin_memory=False,
        num_workers=0,
    )
    return data_loader
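For reference, the round-robin file assignment at the top of get_loader (each rank takes every world_size-th file) can be isolated as a small pure function. A minimal sketch; shard_files and the part_*.parquet names are hypothetical, not part of the original code:

```python
def shard_files(files, world_size, rank):
    """Round-robin assignment of data files to ranks, matching the
    i % world_size == rank % world_size logic in get_loader above."""
    return [f for i, f in enumerate(files) if i % world_size == rank % world_size]

files = [f"part_{i}.parquet" for i in range(5)]
print(shard_files(files, 2, 0))  # rank 0 takes parts 0, 2, 4
print(shard_files(files, 2, 1))  # rank 1 takes parts 1, 3
```

Note that with this scheme each rank must end up with at least one file, and ranks may receive unequal file counts when len(files) is not a multiple of world_size.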
SunnyGhj (Author) commented:

The illegal memory access happens because cuDF allocates on GPU 0 by default. When running multiple GPUs on a single node, each worker therefore needs to pin its device explicitly:

import cupy
cupy.cuda.runtime.setDevice(local_rank)
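To make this robust under torchrun, the device can be pinned from the LOCAL_RANK environment variable (which torchrun sets for each worker) before any merlin.io.Dataset or dataloader is created. A minimal sketch; pin_devices is a hypothetical helper name, not part of Merlin's API:

```python
import os

def pin_devices(local_rank: int) -> None:
    """Pin both PyTorch and the CuPy/cuDF stack to this rank's GPU.

    Call this at the very top of each worker process, before any
    merlin.io.Dataset or dataloader is built; otherwise cuDF
    allocations default to GPU 0 on every rank, which leads to the
    cudaErrorIllegalAddress above.
    """
    import cupy
    import torch

    torch.cuda.set_device(local_rank)        # PyTorch-side allocations
    cupy.cuda.runtime.setDevice(local_rank)  # CuPy/cuDF-side allocations

# torchrun exports LOCAL_RANK for every worker it spawns.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
```

In the worker entry point this would run once, before get_loader, so that both frameworks agree on the device for that process.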
