[BUG] Data corruption and strange CUDA memory address errors at the same row index, despite manipulating data, when using `.stack()` on large, wide dataset #15759

taureandyernv · 2024-05-15T10:22:32Z

Describe the bug
Whenever I try to use cudf `.stack()` on this large, wide dataframe, the data gets corrupted at around the same index location as you stack past that index, until it eventually fails to run (or it simply fails to run outright). It happens at index 1159550. One index before 1159550, everything is fine. One or two after, you start to see issues or it fails. Even if you change the data around a bit, it still eventually fails. When it fails, it returns `RuntimeError: parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered`.

Happens on both an A100 80GB and H100 running 24.04. Completes successfully on pandas. Falls back to pandas and successfully completes on cudf.pandas.
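
One arithmetic observation that may help triage; it is not a confirmed diagnosis. The split frame is 463 columns wide (see `Length: 463` in the outputs), so stacking about 1.16 million rows produces roughly 2**29 elements. The sketch below assumes a hypothetical 4 bytes of bookkeeping per stacked element (`BYTES_PER_ELEMENT` is an illustration, not a documented cuDF detail) and shows that such a buffer would pass the signed 32-bit limit within a row or two of the reported boundary.

```python
# Arithmetic check only; this does not touch cuDF or the GPU.
N_COLS = 463            # width of the split frame, from "Length: 463" in the output
INT32_MAX = 2**31 - 1   # largest signed 32-bit value
BYTES_PER_ELEMENT = 4   # ASSUMPTION: one 4-byte entry per stacked element


def stacked_elements(n_rows, n_cols=N_COLS):
    """Rows times columns: the number of values a full interleave would hold."""
    return n_rows * n_cols


for n_rows in (1159548, 1159549, 1159550):
    elements = stacked_elements(n_rows)
    over = elements * BYTES_PER_ELEMENT > INT32_MAX
    print(n_rows, elements, "over 2**31 - 1 bytes" if over else "within 2**31 - 1 bytes")
```

Under that assumption, 1159548 rows stay within the limit while 1159549 rows exceed it, which is close to (though not exactly at) the slice where corruption first shows up in `head()`.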

Steps/Code to reproduce bug
This requires a dataset download, handled in the min repro, and a 32GB GPU or larger to test.

You can actually see the data getting corrupted in the incrementing runs at the end of the min repro, before it finally fails.

!if [ ! -f "job_skills.csv" ]; then curl https://storage.googleapis.com/rapidsai/colab-data/job_skills.csv.gz -o job_skills.csv.gz; gunzip job_skills.csv.gz; else echo "unzipped job data found"; fi
import cudf
skills = cudf.read_csv("job_skills.csv")

b = skills["job_skills"].str.split(",", expand=True)
# in case you want to see what is at that index
print(b.iloc[1159550])
b2 = b[:1159549]
# b2 = b[:1159550] # Uncommenting this, it will fail
stacked_skills = b2.stack()
print(stacked_skills.head())

# this will also fail
# stacked_skills = b.stack().dropna()

# even if you change the dataframe a bit by shifting the indexes incrementally, it does not really change where it fails, and you can start to see the data glitch
print(skills.count())
skills = skills.dropna()
print(skills.count())
b = skills["job_skills"].str.split(",", expand=True)
print(b.iloc[1159550]) # in case you wanted to see what was on that index
b2 = b[:1159549]
stacked_skills = b2.stack()
print(1159549)
print(stacked_skills.head())
b2 = b[:1159550]
stacked_skills = b2.stack()
print(1159550)
print(stacked_skills.head()) # you can start to see data corruption or it just fails
b2 = b[:1159551]
stacked_skills = b2.stack()
print(1159551)
print(stacked_skills.head())
b2 = b[:1159552]
stacked_skills = b2.stack()
print(1159552)
print(stacked_skills.head())
b2 = b[:1159553]
stacked_skills = b2.stack()
print(1159553)
print(stacked_skills.head())
b2 = b[:1159554]
stacked_skills = b2.stack()
print(1159554)
print(stacked_skills.head()) # by here it should fail

Outputs:

0                         Anesthesiology
1                        Medical license
2                      BLS certification
3                       DEA registration
4       Controlled Substance Certificate
                     ...                
458                                 <NA>
459                                 <NA>
460                                 <NA>
461                                 <NA>
462                                 <NA>
Name: 1159550, Length: 463, dtype: object
0  0    Building Custodial Services
   1                       Cleaning
   2            Janitorial Services
   3             Materials Handling
   4                   Housekeeping
dtype: object
job_link      1296381
job_skills    1294346
dtype: int64
job_link      1294346
job_skills    1294346
dtype: int64
0      Project Management
1           Communication
2           Collaboration
3              Leadership
4          ProblemSolving
              ...        
458                  <NA>
459                  <NA>
460                  <NA>
461                  <NA>
462                  <NA>
Name: 1161237, Length: 463, dtype: object
1159549
0  0    Building Custodial Services
   1                       Cleaning
   2            Janitorial Services
   3             Materials Handling
   4                   Housekeeping
dtype: object
1159550
0  0     PCUeel Nurseendek Services
   1                       Cleaning
   2            Janitorial Services
   3             Materials Handling
   4                   Housekeeping
dtype: object
1159551
0  0     PCUeel Nursenndek Services
   1                       Cleaning
   2            Janitorial Services
   3             Materials Handling
   4                   Housekeeping
dtype: object
1159552
0  0     FoUd Safetyeg certificatio
   1                      nCleaning
   2            Janitorial Services
   3             Materials Handling
   4                   Housekeeping
dtype: object
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 40
     38 print(stacked_skills.head())
     39 b2 = b[:1159553]
---> 40 stacked_skills = b2.stack()
     41 print(1159553)
     42 print(stacked_skills.head())

File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    113 @wraps(func)
    114 def inner(*args, **kwargs):
    115     libnvtx_push_range(self.attributes, self.domain.handle)
--> 116     result = func(*args, **kwargs)
    117     libnvtx_pop_range(self.domain.handle)
    118     return result

File /opt/conda/lib/python3.10/site-packages/cudf/core/dataframe.py:7079, in DataFrame.stack(self, level, dropna, future_stack)
   7073     # homogenize the dtypes of the columns
   7074     homogenized = [
   7075         col.astype(common_type) if col is not None else all_nulls()
   7076         for col in columns
   7077     ]
-> 7079     stacked.append(libcudf.reshape.interleave_columns(homogenized))
   7081 # Construct the resulting dataframe / series
   7082 if not has_unnamed_levels:

File /opt/conda/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File reshape.pyx:26, in cudf._lib.reshape.interleave_columns()

RuntimeError: parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
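
For readers unfamiliar with the call at the bottom of the traceback: `stack()` hands the homogenized columns to `interleave_columns`, which conceptually emits the values row by row across all columns. The pure-Python sketch below only illustrates that ordering; `interleave_columns_sketch` is a hypothetical helper written for this note and is not the libcudf implementation.

```python
def interleave_columns_sketch(columns):
    """Interleave equal-length columns row by row: [a0, b0, c0, a1, b1, c1, ...]."""
    if not columns:
        return []
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all columns must have the same length")
    # Outer loop over rows, inner loop over columns gives the stacked order.
    return [col[row] for row in range(n_rows) for col in columns]


# Two rows of a split frame, stored column-wise; None stands in for <NA>.
cols = [
    ["Building Custodial Services", "Anesthesiology"],
    ["Cleaning", "Medical license"],
    ["Janitorial Services", None],
]
print(interleave_columns_sketch(cols))
```

The output length is always rows times columns, so a wide frame with many rows produces a very large intermediate column even when most of its entries are null.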

Expected behavior
This should just work, as it does in pandas, without any data corruption.

!if [ ! -f "job_skills.csv" ]; then curl https://storage.googleapis.com/rapidsai/colab-data/job_skills.csv.gz -o job_skills.csv.gz; gunzip job_skills.csv.gz; else echo "unzipped job data found"; fi
import pandas as pd
skills = pd.read_csv("job_skills.csv")

b = skills["job_skills"].str.split(",", expand=True)
print(b.iloc[1159550])
b2 = b # just to keep the copying similar.  it doesn't matter.
stacked_skills = b2.stack()
print(stacked_skills.head())

Outputs:

0                         Anesthesiology
1                        Medical license
2                      BLS certification
3                       DEA registration
4       Controlled Substance Certificate
                     ...                
458                                 None
459                                 None
460                                 None
461                                 None
462                                 None
Name: 1159550, Length: 463, dtype: object
0  0    Building Custodial Services
   1                       Cleaning
   2            Janitorial Services
   3             Materials Handling
   4                   Housekeeping
dtype: object

Environment overview (please complete the following information)

Environment location: Docker
Method of cuDF install: Docker
- If method of install is [Docker], docker run --user root --gpus all --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 9888:8888 -p 9787:8787 -p 9786:8786 -p 9999:9999 rapidsai/notebooks:24.04-cuda11.8-py3.10 jupyter-lab --notebook-dir=/home/rapids/notebooks --ip=0.0.0.0 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --allow-root

Environment details
RAPIDS 24.04 cuda 11.8, py 3.9 and 3.10 Docker on ARM SBSA machines

Additional context
When running with cudf.pandas, this succeeds, but at the cost of taking nearly 30-40% longer than pandas alone. If and when it succeeds on cuDF (by reducing the input to the last row where it succeeds), it would be 50x+ faster. I have not done a data integrity test yet to see whether the corruption happens earlier.
@vyasr fyi.
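
A minimal sketch of what such a data integrity test could look like, assuming the stacked values from both libraries can be brought to the host as plain Python sequences. `first_mismatch` is a hypothetical helper, and the sample values are the first rows from the outputs above.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel so that a length difference also counts as a mismatch


def first_mismatch(expected, actual):
    """Return (position, expected_value, actual_value) for the first difference, else None."""
    pairs = zip_longest(expected, actual, fillvalue=_MISSING)
    for pos, (exp, act) in enumerate(pairs):
        if exp is _MISSING or act is _MISSING or exp != act:
            return (
                pos,
                None if exp is _MISSING else exp,
                None if act is _MISSING else act,
            )
    return None


# Values taken from the pandas output and the corrupted cuDF output in this report.
expected = ["Building Custodial Services", "Cleaning", "Janitorial Services"]
actual = ["PCUeel Nurseendek Services", "Cleaning", "Janitorial Services"]
print(first_mismatch(expected, actual))
```

On the real data, the two sequences could come from the pandas result and from the cuDF result converted with `.to_pandas()`, compared slice by slice so the host copy stays manageable. Whether a corrupted column converts cleanly is untested here.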

taureandyernv added the bug Something isn't working label May 15, 2024
