[BUG] Data corruption and strange CUDA memory address errors at the same row index, despite manipulating data, when using .stack() on large, wide dataset
#15759
Labels
bug
Describe the bug
Whenever I try to use cudf's .stack() on this large, wide dataframe, the data gets corrupted once I stack past roughly the same index location, until it eventually fails to run, or it simply fails to run outright. It happens at index 1159550. One index before 1159550, everything is fine. One or two indices after, you start to see issues or it fails. Even if you change the data around a bit, it still fails eventually. When it fails, it returns
RuntimeError: parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Happens on both an A100 80GB and an H100 running 24.04. Completes successfully on pandas. Falls back to pandas and successfully completes on cudf.pandas.
Steps/Code to reproduce bug
This requires a dataset download, handled in the min repro, and a 32GB GPU or larger to test.
You can actually see the data getting corrupted across the incrementing runs at the end of the min repro, before it finally fails.
Outputs:
Expected behavior
This should just work, as it does in pandas, without any data corruption.
Outputs:
Environment overview (please complete the following information)
Environment details
RAPIDS 24.04, CUDA 11.8, Python 3.9 and 3.10, Docker on ARM SBSA machines
Additional context
When running cudf.pandas, this will succeed, but at the cost of taking nearly 30-40% longer than pandas alone. If and when it succeeds on the GPU (by reducing the input to the last row where it succeeds), it would be 50x+ faster. I have not done a data integrity test yet to see whether the corruption happens earlier.
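A data integrity test like the one mentioned above could be sketched roughly as follows. This is not the min repro from this issue, only an illustration: the helper names `stack_matches` and `first_bad_row_count` are hypothetical, the values are assumed to be numeric, and the search assumes that once a row prefix fails, every longer prefix fails too (which matches what I see around index 1159550, but is not proven).

```python
import numpy as np
import pandas as pd


def stack_matches(df, candidate_stack, n_rows):
    """Return True if candidate_stack agrees with pandas on the first n_rows."""
    head = df.iloc[:n_rows]
    expected = head.stack()  # pandas result is treated as the reference
    try:
        actual = candidate_stack(head)
    except RuntimeError:
        # The illegal-memory-access failure in this report surfaces as RuntimeError.
        return False
    # Compare lengths first, then values (numeric data assumed; NaNs compare equal).
    return len(actual) == len(expected) and np.array_equal(
        np.asarray(actual), np.asarray(expected), equal_nan=True
    )


def first_bad_row_count(df, candidate_stack):
    """Binary-search the smallest row count at which the outputs diverge.

    Assumes failures are monotonic in the row count.
    Returns None if the full frame matches.
    """
    if stack_matches(df, candidate_stack, len(df)):
        return None
    lo, hi = 0, len(df)  # lo rows are known good, hi rows are known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if stack_matches(df, candidate_stack, mid):
            lo = mid
        else:
            hi = mid
    return hi
```

With cudf installed, the candidate could be something like `lambda d: cudf.from_pandas(d).stack().to_pandas()`. Note that each probe copies a row prefix to the GPU, so on a frame this size the search is slow but only needs about 20 probes.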
@vyasr fyi.