
[BUG] Data corruption and strange CUDA memory address errors at the same row index, despite manipulating data, when using .stack() on large, wide dataset #15759

Open
taureandyernv opened this issue May 15, 2024 · 0 comments
Labels
bug Something isn't working

Describe the bug
Whenever I try to use cuDF's `.stack()` on this large, wide dataframe, the data gets corrupted around the same index location as you stack past that index, until it fails to run outright. It happens at index 1159550: go one index before 1159550 and everything is fine; one or two after, and you start to see issues or it fails. Even if you change the data around a bit, it still eventually fails. When it fails, it returns `RuntimeError: parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered`.

Happens on both an A100 80GB and an H100 running 24.04. Completes successfully in pandas, and falls back to pandas and completes successfully under cudf.pandas.
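For scale, a back-of-the-envelope check (my own speculation about a possible cause, not a confirmed diagnosis; the constants below are just the slice length and split width from the repro): the stacked/interleaved result at the failing slice holds over half a billion string cells, so a 32-bit character-offset column would overflow at an average of only ~4 characters per cell.

```python
# Scale check for the failing slice: 1159550 rows x 463 split columns.
# This is speculation about a possible overflow, not a confirmed root cause.
INT32_MAX = 2**31 - 1

rows, cols = 1159550, 463
elements = rows * cols            # cells in the interleaved/stacked result
print(elements)                   # 536871650

# The element count itself still fits in a signed 32-bit integer...
print(elements <= INT32_MAX)      # True

# ...but a 32-bit character-offset column overflows once total characters
# exceed INT32_MAX, i.e. at an average of only ~4 characters per cell here.
print(round(INT32_MAX / elements, 2))  # 4.0
```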

Steps/Code to reproduce bug
This requires a dataset download (handled in the min repro) and a 32GB or larger GPU to test.

You can actually see the data getting corrupted in the incrementing runs at the end of the min repro, before it finally fails.

```python
!if [ ! -f "job_skills.csv" ]; then curl https://storage.googleapis.com/rapidsai/colab-data/job_skills.csv.gz -o job_skills.csv.gz; gunzip job_skills.csv.gz; else echo "unzipped job data found"; fi
import cudf
skills = cudf.read_csv("job_skills.csv")

b = skills["job_skills"].str.split(",", expand=True)
print(b.iloc[1159550])  # in case you wanted to see what is on that index
b2 = b[:1159549]
# b2 = b[:1159550]  # uncommenting this, it will fail
stacked_skills = b2.stack()
print(stacked_skills.head())

# this will also fail:
# stacked_skills = b.stack().dropna()

# even if you change the dataframe a bit by moving the indexes up incrementally,
# it will not really change where it fails, and you can start to see the data glitch
print(skills.count())
skills = skills.dropna()
print(skills.count())
b = skills["job_skills"].str.split(",", expand=True)
print(b.iloc[1159550])  # in case you wanted to see what is on that index
b2 = b[:1159549]
stacked_skills = b2.stack()
print(1159549)
print(stacked_skills.head())
b2 = b[:1159550]
stacked_skills = b2.stack()
print(1159550)
print(stacked_skills.head())  # you can start to see data corruption, or it just fails
b2 = b[:1159551]
stacked_skills = b2.stack()
print(1159551)
print(stacked_skills.head())
b2 = b[:1159552]
stacked_skills = b2.stack()
print(1159552)
print(stacked_skills.head())
b2 = b[:1159553]
stacked_skills = b2.stack()
print(1159553)
print(stacked_skills.head())
b2 = b[:1159554]
stacked_skills = b2.stack()
print(1159554)
print(stacked_skills.head())  # by here it should fail
```

Outputs:

```
0                         Anesthesiology
1                        Medical license
2                      BLS certification
3                       DEA registration
4       Controlled Substance Certificate
                     ...                
458                                 <NA>
459                                 <NA>
460                                 <NA>
461                                 <NA>
462                                 <NA>
Name: 1159550, Length: 463, dtype: object
0  0    Building Custodial Services
   1                       Cleaning
   2            Janitorial Services
   3             Materials Handling
   4                   Housekeeping
dtype: object
job_link      1296381
job_skills    1294346
dtype: int64
job_link      1294346
job_skills    1294346
dtype: int64
0      Project Management
1           Communication
2           Collaboration
3              Leadership
4          ProblemSolving
              ...        
458                  <NA>
459                  <NA>
460                  <NA>
461                  <NA>
462                  <NA>
Name: 1161237, Length: 463, dtype: object
1159549
0  0    Building Custodial Services
   1                       Cleaning
   2            Janitorial Services
   3             Materials Handling
   4                   Housekeeping
dtype: object
1159550
0  0     PCUeel Nurseendek Services
   1                       Cleaning
   2            Janitorial Services
   3             Materials Handling
   4                   Housekeeping
dtype: object
1159551
0  0     PCUeel Nursenndek Services
   1                       Cleaning
   2            Janitorial Services
   3             Materials Handling
   4                   Housekeeping
dtype: object
1159552
0  0     FoUd Safetyeg certificatio
   1                      nCleaning
   2            Janitorial Services
   3             Materials Handling
   4                   Housekeeping
dtype: object
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 40
     38 print(stacked_skills.head())
     39 b2 = b[:1159553]
---> 40 stacked_skills = b2.stack()
     41 print(1159553)
     42 print(stacked_skills.head())

File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    113 @wraps(func)
    114 def inner(*args, **kwargs):
    115     libnvtx_push_range(self.attributes, self.domain.handle)
--> 116     result = func(*args, **kwargs)
    117     libnvtx_pop_range(self.domain.handle)
    118     return result

File /opt/conda/lib/python3.10/site-packages/cudf/core/dataframe.py:7079, in DataFrame.stack(self, level, dropna, future_stack)
   7073     # homogenize the dtypes of the columns
   7074     homogenized = [
   7075         col.astype(common_type) if col is not None else all_nulls()
   7076         for col in columns
   7077     ]
-> 7079     stacked.append(libcudf.reshape.interleave_columns(homogenized))
   7081 # Construct the resulting dataframe / series
   7082 if not has_unnamed_levels:

File /opt/conda/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File reshape.pyx:26, in cudf._lib.reshape.interleave_columns()

RuntimeError: parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
```

Expected behavior
This should just work, as it does in pandas, without any data corruption.

```python
!if [ ! -f "job_skills.csv" ]; then curl https://storage.googleapis.com/rapidsai/colab-data/job_skills.csv.gz -o job_skills.csv.gz; gunzip job_skills.csv.gz; else echo "unzipped job data found"; fi
import pandas as pd
skills = pd.read_csv("job_skills.csv")

b = skills["job_skills"].str.split(",", expand=True)
print(b.iloc[1159550])
b2 = b  # just to keep the copying similar; it doesn't matter here
stacked_skills = b2.stack()
print(stacked_skills.head())
```

Outputs:

```
0                         Anesthesiology
1                        Medical license
2                      BLS certification
3                       DEA registration
4       Controlled Substance Certificate
                     ...                
458                                 None
459                                 None
460                                 None
461                                 None
462                                 None
Name: 1159550, Length: 463, dtype: object
0  0    Building Custodial Services
   1                       Cleaning
   2            Janitorial Services
   3             Materials Handling
   4                   Housekeeping
dtype: object
```

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: Docker
    • If method of install is [Docker]: `docker run --user root --gpus all --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 9888:8888 -p 9787:8787 -p 9786:8786 -p 9999:9999 rapidsai/notebooks:24.04-cuda11.8-py3.10 jupyter-lab --notebook-dir=/home/rapids/notebooks --ip=0.0.0.0 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --allow-root`

Environment details
RAPIDS 24.04, CUDA 11.8, Python 3.9 and 3.10 Docker images on ARM SBSA machines

Additional context
When running cudf.pandas, this succeeds, but at the cost of taking nearly 30-40% longer than pandas alone. If and when pure cuDF succeeds (by reducing the slice to the last row where it still works), it is 50x+ faster. I have not done a data integrity test yet to see whether the corruption happens earlier.
@vyasr fyi.
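As a possible interim workaround (a sketch only; I have not verified that it dodges this exact failure, and the chunk size below is arbitrary), stacking in row slices and concatenating keeps any single interleave smaller. Shown here with pandas for illustration; cuDF exposes the same `stack`/`concat` API, so the pattern ports directly.

```python
import pandas as pd

# Workaround sketch (untested against the GPU failure; chunk size is an
# arbitrary illustration): stack the wide frame in row slices and
# concatenate, so no single interleave covers all rows at once.
def stack_in_chunks(df, chunk_rows):
    parts = [df.iloc[i:i + chunk_rows].stack() for i in range(0, len(df), chunk_rows)]
    return pd.concat(parts)

wide = pd.DataFrame({"a": ["x", "y", "z"], "b": ["p", None, "q"]})
out = stack_in_chunks(wide, chunk_rows=2)
# Matches a single .stack() over the whole frame:
print(out.equals(wide.stack()))  # True
```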
