[NO-OOM] Exploration of cudf-spilling + UVM #1504

Draft
wants to merge 3 commits into base: branch-24.06

Conversation

@madsbk (Member) commented Mar 21, 2024

In this PR, we try to combine cudf-spilling and UVM (CUDA managed memory) in order to avoid out-of-memory errors.

Goals

  • Do not affect performance when no spilling is needed.
  • Preserve the existing performance of cudf-spilling.
  • When cudf-spilling falls short, fall back to UVM to avoid OOM failures.

The approach

  • Set up cudf to use a memory pool backed by managed memory (UVM).
  • Register a callback function that gets called every time the memory pool needs to expand.
  • When the memory pool wants to expand beyond the available device memory, the callback function triggers cudf-spilling instead. Only if cudf cannot find any buffers to spill do we expand the memory pool (a configuration sketch follows this list).
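
For reference, a minimal sketch of the baseline configuration this PR builds on: an RMM pool backed by managed memory (UVM) with cudf-spilling enabled. The pool-expansion callback itself lives in this PR's C++ changes and has no released Python API, so it is not shown; the pool size below is illustrative.

```python
import rmm
import cudf

# RMM pool whose upstream allocations are managed memory (UVM).
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.ManagedMemoryResource(),
    initial_pool_size=2**31,  # 2 GiB initial pool, illustrative
)
rmm.mr.set_current_device_resource(pool)

# Enable cudf-spilling (equivalent to setting CUDF_SPILL=on).
cudf.set_option("spill", True)
```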

Preliminary results

Running an extended version of @wence-'s join benchmark (https://gist.github.com/madsbk/9589dedc45dbcfc828f7274ce3bdabc6) on a single GPU of a DGX-1 (32 GiB of device memory).

Running with three configs:

  • cudf-spilling: --use-pool --base-memory-resource cuda --use-spilling
  • cudf-spilling+UVM: --use-pool --base-memory-resource managed --use-spilling
  • UVM-only: --use-pool --base-memory-resource managed --no-use-spilling

TL;DR

When spilling isn’t needed, we see some overhead spikes when going from cudf-spilling to cudf-spilling+UVM or UVM-only, but overall the performance is very much on par. It might be possible to avoid these spikes using cudaMemAdvise() (a hedged sketch follows at the end of this section).

When spilling is needed, the performance of cudf-spilling and cudf-spilling+UVM is very similar, but again we see some overhead spikes.

Finally, when the peak memory usage exceeds device memory, cudf-spilling alone fails with an OOM error; in this case we need UVM. The performance of cudf-spilling+UVM and UVM-only is similar, but cudf-spilling+UVM is more variable.
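
As an illustration of the cudaMemAdvise() idea mentioned above, here is a hedged sketch using the cuda-python bindings: advising the driver that a managed allocation should preferably reside on the GPU may reduce the first-touch page-fault spikes. This is not code from the PR; the size and device id are assumptions.

```python
import rmm
from cuda import cudart

device_id = 0  # the GPU the benchmark runs on

# cudaMemAdvise() only applies to managed (UVM) allocations.
rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
buf = rmm.DeviceBuffer(size=2**30)  # 1 GiB managed allocation, illustrative

# Ask the driver to keep these pages resident on the GPU whenever possible.
(err,) = cudart.cudaMemAdvise(
    buf.ptr,
    buf.size,
    cudart.cudaMemoryAdvise.cudaMemAdviseSetPreferredLocation,
    device_id,
)
assert err == cudart.cudaError_t.cudaSuccess
```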

Raw Numbers

Everything fits in device memory, no spilling

cudf-spilling
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 400_000_000 --base-memory-resource cuda --use-pool --use-spilling
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f253aa31df0>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 400000000 rows and is 7.45 GiB
Right table is 400000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.34s
medium inner on int repeat=2: 0.34s
medium inner on int repeat=3: 0.34s
medium inner on int repeat=4: 0.34s
medium inner on int repeat=5: 0.34s
medium outer on int repeat=1: 0.34s
medium outer on int repeat=2: 0.34s
medium outer on int repeat=3: 0.34s
medium outer on int repeat=4: 0.34s
medium outer on int repeat=5: 0.34s
medium inner on factor repeat=1: 1.12s
medium inner on factor repeat=2: 1.12s
medium inner on factor repeat=3: 1.12s
medium inner on factor repeat=4: 1.12s
medium inner on factor repeat=5: 1.12s
cudf-spilling+UVM
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 400_000_000 --base-memory-resource managed --use-pool --use-spilling
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f274a8bd3f0>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 400000000 rows and is 7.45 GiB
Right table is 400000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.57s
medium inner on int repeat=2: 0.34s
medium inner on int repeat=3: 0.34s
medium inner on int repeat=4: 0.34s
medium inner on int repeat=5: 0.34s
medium outer on int repeat=1: 0.52s
medium outer on int repeat=2: 0.34s
medium outer on int repeat=3: 0.34s
medium outer on int repeat=4: 0.34s
medium outer on int repeat=5: 0.34s
medium inner on factor repeat=1: 1.17s
medium inner on factor repeat=2: 1.12s
medium inner on factor repeat=3: 1.12s
medium inner on factor repeat=4: 1.12s
medium inner on factor repeat=5: 1.12s
UVM-only
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 400_000_000 --base-memory-resource managed --use-pool
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7fab56ecd170>
use_pool=True
use_spilling=False
string_categoricals=False
Left table has 400000000 rows and is 7.45 GiB
Right table is 400000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.57s
medium inner on int repeat=2: 0.34s
medium inner on int repeat=3: 0.34s
medium inner on int repeat=4: 0.34s
medium inner on int repeat=5: 0.34s
medium outer on int repeat=1: 0.52s
medium outer on int repeat=2: 0.34s
medium outer on int repeat=3: 0.34s
medium outer on int repeat=4: 0.34s
medium outer on int repeat=5: 0.34s
medium inner on factor repeat=1: 1.17s
medium inner on factor repeat=2: 1.12s
medium inner on factor repeat=3: 1.12s
medium inner on factor repeat=4: 1.12s
medium inner on factor repeat=5: 1.12s

Spilling is required

cudf-spilling
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 600_000_000 --base-memory-resource cuda --use-pool --use-spilling
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7efd995998f0>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 600000000 rows and is 11.18 GiB
Right table is 600000 rows and is 0.01 GiB
medium inner on int repeat=1: 1.95s
medium inner on int repeat=2: 3.31s
medium inner on int repeat=3: 3.59s
medium inner on int repeat=4: 3.58s
medium inner on int repeat=5: 3.58s
medium outer on int repeat=1: 4.50s
medium outer on int repeat=2: 4.78s
medium outer on int repeat=3: 4.79s
medium outer on int repeat=4: 4.78s
medium outer on int repeat=5: 4.79s
medium inner on factor repeat=1: 6.67s
medium inner on factor repeat=2: 5.79s
medium inner on factor repeat=3: 5.77s
medium inner on factor repeat=4: 5.78s
medium inner on factor repeat=5: 5.77s
Spill Statistics (level=3):
  Spilling (level >= 1):
    gpu => cpu: 105.16GiB in 44.452s
    cpu => gpu: 88.36GiB in 10.987s
cudf-spilling+UVM
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 600_000_000 --base-memory-resource managed --use-pool --use-spilling
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f6283162160>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 600000000 rows and is 11.18 GiB
Right table is 600000 rows and is 0.01 GiB
medium inner on int repeat=1: 2.01s
medium inner on int repeat=2: 3.20s
medium inner on int repeat=3: 3.49s
medium inner on int repeat=4: 3.50s
medium inner on int repeat=5: 3.53s
medium outer on int repeat=1: 4.76s
medium outer on int repeat=2: 5.40s
medium outer on int repeat=3: 5.34s
medium outer on int repeat=4: 4.69s
medium outer on int repeat=5: 4.73s
medium inner on factor repeat=1: 6.57s
medium inner on factor repeat=2: 5.65s
medium inner on factor repeat=3: 5.66s
medium inner on factor repeat=4: 5.67s
medium inner on factor repeat=5: 5.67s
Spill Statistics (level=3):
  Spilling (level >= 1):
    gpu => cpu: 105.16GiB in 43.841s
    cpu => gpu: 88.36GiB in 12.214s
UVM-only
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 600_000_000 --base-memory-resource managed --use-pool
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f2743ff92b0>
use_pool=True
use_spilling=False
string_categoricals=False
Left table has 600000000 rows and is 11.18 GiB
Right table is 600000 rows and is 0.01 GiB
medium inner on int repeat=1: 1.55s
medium inner on int repeat=2: 4.07s
medium inner on int repeat=3: 4.23s
medium inner on int repeat=4: 1.90s
medium inner on int repeat=5: 2.06s
medium outer on int repeat=1: 4.67s
medium outer on int repeat=2: 7.65s
medium outer on int repeat=3: 4.59s
medium outer on int repeat=4: 4.47s
medium outer on int repeat=5: 5.40s
medium inner on factor repeat=1: 8.77s
medium inner on factor repeat=2: 9.29s
medium inner on factor repeat=3: 8.14s
medium inner on factor repeat=4: 13.85s
medium inner on factor repeat=5: 14.92s

UVM is required

cudf-spilling
OOM (the run fails with an out-of-memory error)
cudf-spilling+UVM
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 700_000_000 --base-memory-resource managed --use-pool --use-spilling
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f1845c8b6f0>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 700000000 rows and is 13.04 GiB
Right table is 700000 rows and is 0.01 GiB
medium inner on int repeat=1: 3.11s
medium inner on int repeat=2: 10.05s
medium inner on int repeat=3: 5.95s
medium inner on int repeat=4: 8.77s
medium inner on int repeat=5: 5.42s
medium outer on int repeat=1: 10.37s
medium outer on int repeat=2: 4.88s
medium outer on int repeat=3: 11.70s
medium outer on int repeat=4: 4.95s
medium outer on int repeat=5: 9.82s
medium inner on factor repeat=1: 16.28s
medium inner on factor repeat=2: 18.24s
medium inner on factor repeat=3: 17.66s
medium inner on factor repeat=4: 17.84s
medium inner on factor repeat=5: 17.99s
Spill Statistics (level=3):
  Spilling (level >= 1):
    gpu => cpu: 3.92GiB in 1.626s
    cpu => gpu: 3.92GiB in 0.492s
UVM-only
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py 700_000_000 --base-memory-resource managed --use-pool 
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f6423872520>
use_pool=True
use_spilling=False
string_categoricals=False
Left table has 700000000 rows and is 13.04 GiB
Right table is 700000 rows and is 0.01 GiB
medium inner on int repeat=1: 7.50s
medium inner on int repeat=2: 5.94s
medium inner on int repeat=3: 7.18s
medium inner on int repeat=4: 7.70s
medium inner on int repeat=5: 6.72s
medium outer on int repeat=1: 8.30s
medium outer on int repeat=2: 7.28s
medium outer on int repeat=3: 9.88s
medium outer on int repeat=4: 6.36s
medium outer on int repeat=5: 9.28s
medium inner on factor repeat=1: 16.43s
medium inner on factor repeat=2: 19.07s
medium inner on factor repeat=3: 18.61s
medium inner on factor repeat=4: 17.92s
medium inner on factor repeat=5: 18.42s

NB

The current code is hacky. If this approach turns out to be viable, we should implement the functionality as a standalone RMM memory resource (a rough sketch follows below).
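
One possible shape for such a standalone resource, sketched here with RMM's existing FailureCallbackResourceAdaptor: the callback fires when an allocation fails (rather than on every pool expansion, as in this PR), asks cudf to spill, and retries. The cudf spill-manager calls are internal APIs and are assumptions on my part, not something this PR provides.

```python
import rmm
import cudf
from cudf.core.buffer.spill_manager import get_global_manager  # internal API

cudf.set_option("spill", True)

def spill_on_alloc_failure(nbytes: int) -> bool:
    """Try to free `nbytes` by spilling cudf buffers; return True to retry."""
    manager = get_global_manager()
    if manager is None:
        return False
    freed = manager.spill_device_memory(nbytes=nbytes)  # assumed signature
    return freed > 0

pool = rmm.mr.PoolMemoryResource(rmm.mr.ManagedMemoryResource())
mr = rmm.mr.FailureCallbackResourceAdaptor(pool, spill_on_alloc_failure)
rmm.mr.set_current_device_resource(mr)
```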

The github-actions bot added the labels “Python (Related to RMM Python API)” and “cpp (Pertains to C++ code)” on Mar 21, 2024.
@harrism (Member) commented Mar 21, 2024

@madsbk thank you for this! Would you be able to extract and summarize the important performance numbers? Perhaps in a graph or table to make clear the performance landscape?

@madsbk (Member, Author) commented Mar 22, 2024

Some more results. I am now calling cudaMemPrefetchAsync() when initializing the memory pool in order to stabilize the performance.

I am measuring the total time of all the joins (5 repeats each). The timings of individual repeats vary quite a bit when using UVM; this is expected, since we do not reset the memory page locations between repeats. A prefetch sketch follows the table below.

Rows   Backend               Total time (sec)
400k   cudf-spilling-only      9.00
400k   cudf-spilling+UVM       9.00
400k   UVM-only                8.99
600k   cudf-spilling-only     67.28
600k   cudf-spilling+UVM      67.56
600k   UVM-only              121.02
700k   cudf-spilling-only    out-of-memory
700k   cudf-spilling+UVM     162.75
700k   UVM-only              166.02
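
Below is a rough Python approximation of the prefetching described above (the PR itself does this inside the pool's C++ do_allocate()), using the cuda-python bindings; the pool size and device id are illustrative assumptions.

```python
import rmm
from cuda import cudart

device_id = 0
pool_size = 2**33  # 8 GiB initial pool, illustrative

pool = rmm.mr.PoolMemoryResource(
    rmm.mr.ManagedMemoryResource(), initial_pool_size=pool_size
)
rmm.mr.set_current_device_resource(pool)

# Carve an allocation out of the managed pool and migrate its pages to the
# GPU up front, so the first kernels do not pay for on-demand page faults.
buf = rmm.DeviceBuffer(size=pool_size // 2)
err, stream = cudart.cudaStreamCreate()
assert err == cudart.cudaError_t.cudaSuccess
(err,) = cudart.cudaMemPrefetchAsync(buf.ptr, buf.size, device_id, stream)
assert err == cudart.cudaError_t.cudaSuccess
(err,) = cudart.cudaStreamSynchronize(stream)
assert err == cudart.cudaError_t.cudaSuccess
```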
Raw data
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 400_000_000 --base-memory-resource cuda --use-spilling  
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f9de448b0b0>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 400000000 rows and is 7.45 GiB
Right table is 400000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.34s
medium inner on int repeat=2: 0.34s
medium inner on int repeat=3: 0.34s
medium inner on int repeat=4: 0.34s
medium inner on int repeat=5: 0.34s
medium outer on int repeat=1: 0.34s
medium outer on int repeat=2: 0.34s
medium outer on int repeat=3: 0.34s
medium outer on int repeat=4: 0.34s
medium outer on int repeat=5: 0.34s
medium inner on factor repeat=1: 1.12s
medium inner on factor repeat=2: 1.12s
medium inner on factor repeat=3: 1.12s
medium inner on factor repeat=4: 1.12s
medium inner on factor repeat=5: 1.12s
Total time: 9.00
Spill Statistics (level=3):
  Spilling (level >= 1): None
  Exposed buffers (level >= 2): None
Exception ignored in: <function RandomState.__del__ at 0x7f9e020fd440>
Traceback (most recent call last):
  File "/datasets/mkristensen/miniforge3/envs/cudf-0319/lib/python3.11/site-packages/cupy/random/_generator.py", line 65, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 400_000_000 --base-memory-resource managed --use-spilling  
do_allocate(managed) - prefetched to device bytes: 32212254720
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f9741697060>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 400000000 rows and is 7.45 GiB
Right table is 400000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.34s
medium inner on int repeat=2: 0.34s
medium inner on int repeat=3: 0.34s
medium inner on int repeat=4: 0.34s
medium inner on int repeat=5: 0.34s
medium outer on int repeat=1: 0.34s
medium outer on int repeat=2: 0.34s
medium outer on int repeat=3: 0.34s
medium outer on int repeat=4: 0.34s
medium outer on int repeat=5: 0.34s
medium inner on factor repeat=1: 1.12s
medium inner on factor repeat=2: 1.12s
medium inner on factor repeat=3: 1.12s
medium inner on factor repeat=4: 1.12s
medium inner on factor repeat=5: 1.12s
Total time: 9.00
Spill Statistics (level=3):
  Spilling (level >= 1): None
  Exposed buffers (level >= 2): None
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 400_000_000 --base-memory-resource managed  
do_allocate(managed) - prefetched to device bytes: 32212254720
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7fe06666f420>
use_pool=True
use_spilling=False
string_categoricals=False
Left table has 400000000 rows and is 7.45 GiB
Right table is 400000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.34s
medium inner on int repeat=2: 0.34s
medium inner on int repeat=3: 0.34s
medium inner on int repeat=4: 0.34s
medium inner on int repeat=5: 0.34s
medium outer on int repeat=1: 0.34s
medium outer on int repeat=2: 0.34s
medium outer on int repeat=3: 0.34s
medium outer on int repeat=4: 0.34s
medium outer on int repeat=5: 0.34s
medium inner on factor repeat=1: 1.12s
medium inner on factor repeat=2: 1.12s
medium inner on factor repeat=3: 1.12s
medium inner on factor repeat=4: 1.12s
medium inner on factor repeat=5: 1.12s
Total time: 8.99
Exception ignored in: <function RandomState.__del__ at 0x7fe095fa5440>
Traceback (most recent call last):
  File "/datasets/mkristensen/miniforge3/envs/cudf-0319/lib/python3.11/site-packages/cupy/random/_generator.py", line 65, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 600_000_000 --base-memory-resource cuda --use-spilling  
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7fb388b3ab60>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 600000000 rows and is 11.18 GiB
Right table is 600000 rows and is 0.01 GiB
medium inner on int repeat=1: 1.91s
medium inner on int repeat=2: 3.18s
medium inner on int repeat=3: 3.47s
medium inner on int repeat=4: 3.47s
medium inner on int repeat=5: 3.45s
medium outer on int repeat=1: 4.36s
medium outer on int repeat=2: 4.62s
medium outer on int repeat=3: 4.63s
medium outer on int repeat=4: 4.62s
medium outer on int repeat=5: 4.62s
medium inner on factor repeat=1: 6.48s
medium inner on factor repeat=2: 5.61s
medium inner on factor repeat=3: 5.62s
medium inner on factor repeat=4: 5.62s
medium inner on factor repeat=5: 5.62s
Total time: 67.28
Spill Statistics (level=3):
  Spilling (level >= 1):
    gpu => cpu: 105.16GiB in 42.732s
    cpu => gpu: 88.36GiB in 10.561s
  Exposed buffers (level >= 2): None
Exception ignored in: <function RandomState.__del__ at 0x7fb3b8469440>
Traceback (most recent call last):
  File "/datasets/mkristensen/miniforge3/envs/cudf-0319/lib/python3.11/site-packages/cupy/random/_generator.py", line 65, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 600_000_000 --base-memory-resource managed --use-spilling  
do_allocate(managed) - prefetched to device bytes: 32212254720
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7fa22586d440>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 600000000 rows and is 11.18 GiB
Right table is 600000 rows and is 0.01 GiB
medium inner on int repeat=1: 1.90s
medium inner on int repeat=2: 3.20s
medium inner on int repeat=3: 3.49s
medium inner on int repeat=4: 3.47s
medium inner on int repeat=5: 3.48s
medium outer on int repeat=1: 4.41s
medium outer on int repeat=2: 4.64s
medium outer on int repeat=3: 4.65s
medium outer on int repeat=4: 4.64s
medium outer on int repeat=5: 4.64s
medium inner on factor repeat=1: 6.50s
medium inner on factor repeat=2: 5.64s
medium inner on factor repeat=3: 5.63s
medium inner on factor repeat=4: 5.64s
medium inner on factor repeat=5: 5.64s
Total time: 67.56
Spill Statistics (level=3):
  Spilling (level >= 1):
    gpu => cpu: 105.16GiB in 43.151s
    cpu => gpu: 88.36GiB in 10.759s
  Exposed buffers (level >= 2): None
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 600_000_000 --base-memory-resource managed
do_allocate(managed) - prefetched to device bytes: 32212254720
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7fbdb85b09f0>
use_pool=True
use_spilling=False
string_categoricals=False
Left table has 600000000 rows and is 11.18 GiB
Right table is 600000 rows and is 0.01 GiB
medium inner on int repeat=1: 4.30s
medium inner on int repeat=2: 6.42s
medium inner on int repeat=3: 5.75s
medium inner on int repeat=4: 8.27s
medium inner on int repeat=5: 3.92s
medium outer on int repeat=1: 4.50s
medium outer on int repeat=2: 5.38s
medium outer on int repeat=3: 4.46s
medium outer on int repeat=4: 4.49s
medium outer on int repeat=5: 5.30s
medium inner on factor repeat=1: 8.50s
medium inner on factor repeat=2: 15.25s
medium inner on factor repeat=3: 14.66s
medium inner on factor repeat=4: 14.97s
medium inner on factor repeat=5: 14.86s
Total time: 121.02
Exception ignored in: <function RandomState.__del__ at 0x7fbddc43d440>
Traceback (most recent call last):
  File "/datasets/mkristensen/miniforge3/envs/cudf-0319/lib/python3.11/site-packages/cupy/random/_generator.py", line 65, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 700_000_000 --base-memory-resource managed --use-spilling  
do_allocate(managed) - prefetched to device bytes: 32212254720
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7fad40a5d030>
use_pool=True
use_spilling=True
string_categoricals=False
Left table has 700000000 rows and is 13.04 GiB
Right table is 700000 rows and is 0.01 GiB
medium inner on int repeat=1: 3.33s
medium inner on int repeat=2: 9.67s
medium inner on int repeat=3: 5.67s
medium inner on int repeat=4: 8.68s
medium inner on int repeat=5: 5.60s
medium outer on int repeat=1: 10.25s
medium outer on int repeat=2: 4.99s
medium outer on int repeat=3: 11.59s
medium outer on int repeat=4: 5.17s
medium outer on int repeat=5: 9.68s
medium inner on factor repeat=1: 16.23s
medium inner on factor repeat=2: 18.07s
medium inner on factor repeat=3: 17.79s
medium inner on factor repeat=4: 17.93s
medium inner on factor repeat=5: 18.09s
Total time: 162.75
Spill Statistics (level=3):
  Spilling (level >= 1):
    gpu => cpu: 3.92GiB in 1.575s
    cpu => gpu: 3.92GiB in 0.480s
  Exposed buffers (level >= 2): None
(cudf-0319) mkristensen@dgx15:~/repos/cudf$ CUDF_SPILL_STATS=3 CUDA_VISIBLE_DEVICES=1 python join-benchmark.py --use-pool 700_000_000 --base-memory-resource managed
do_allocate(managed) - prefetched to device bytes: 32212254720
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f7df52d5850>
use_pool=True
use_spilling=False
string_categoricals=False
Left table has 700000000 rows and is 13.04 GiB
Right table is 700000 rows and is 0.01 GiB
medium inner on int repeat=1: 9.01s
medium inner on int repeat=2: 5.57s
medium inner on int repeat=3: 9.24s
medium inner on int repeat=4: 7.02s
medium inner on int repeat=5: 7.45s
medium outer on int repeat=1: 7.36s
medium outer on int repeat=2: 8.85s
medium outer on int repeat=3: 8.60s
medium outer on int repeat=4: 7.80s
medium outer on int repeat=5: 7.39s
medium inner on factor repeat=1: 16.01s
medium inner on factor repeat=2: 18.34s
medium inner on factor repeat=3: 17.33s
medium inner on factor repeat=4: 17.81s
medium inner on factor repeat=5: 18.24s
Total time: 166.02
Exception ignored in: <function RandomState.__del__ at 0x7f7e16491440>
Traceback (most recent call last):
  File "/datasets/mkristensen/miniforge3/envs/cudf-0319/lib/python3.11/site-packages/cupy/random/_generator.py", line 65, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down

The results show that, at least in this case, combining cudf-spilling with UVM clearly outperforms UVM-only without any real downside.

I think the next step is to implement this in a less intrusive way and to test more workflows and hardware setups, e.g., how does cudf-spilling+UVM perform when running on multiple GPUs using UCX?

@@ -203,7 +211,7 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_

   if (size <= 0) { return nullptr; }

-  lock_guard lock(mtx_);
+  // lock_guard lock(mtx_);
A reviewer (Member) commented:

question: why comment out the lock?

madsbk (Member, Author) replied:

That is because of a deadlock that would otherwise be triggered when the allocation results in cudf-spilling: spilling finds another buffer to spill and deallocates its memory, and that deallocation also requires the lock.

I haven't given this much thought, but I think it could be handled with a reentrant lock (see the sketch below).
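
To illustrate the reentrancy issue (in Python rather than RMM's C++; names and structure are purely illustrative): the allocate path holds the resource lock, triggers spilling, and spilling deallocates another buffer on the same thread, which re-enters the same lock. A reentrant lock lets the same thread reacquire it; a plain lock would deadlock.

```python
import threading

lock = threading.RLock()  # with threading.Lock() the second acquire would hang

def deallocate(name):
    with lock:  # re-acquired by the same thread while allocate() holds it
        print(f"freed {name}")

def allocate(nbytes):
    with lock:
        # Pool is "full": spill another buffer before growing. With a
        # non-reentrant lock this nested call would deadlock.
        deallocate("spilled-buffer")
        return bytearray(nbytes)

allocate(1024)
```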

The github-actions bot added the “ci” label on Apr 9, 2024.
madsbk changed the base branch from branch-24.04 to branch-24.06 on Apr 9, 2024, 06:06.