
Python / NVRTC performance (CUDA 12.2+) #1118

Open
ptheywood opened this issue Oct 3, 2023 · 4 comments
@ptheywood (Member)

Recent runs of the Python test suite (CUDA 12.0 wheel, driver 535.104.05, Python 3.12) took a significant length of time to run under Linux:

650 passed, 11 skipped, 69 warnings in 3080.89s (0:51:20)

A second run, which used the jitify cache / Python caches, was significantly faster (~965x):

650 passed, 11 skipped, 69 warnings in 3.19s 

This was a manylinux based wheel, so SEATBELTS=ON, GLM=OFF.

We should probably investigate this if we are going to push the Python side more thoroughly; 50 minutes of jitting for 3 s of total runtime is bad. The test suite is more or less a worst case of compilation vs model runtime, but it's still pretty bad.


Best guess is that NVRTC has become slower with CUDA 12.x, which compounds into a very long time, but this would need investigating to know for certain (profile the test suite / compare different CUDA versions).
Just running a Python example with -t -v might be enough for a quick confirmation of whether it is RTC time or not (with different CUDA versions).
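A minimal sketch of such a timing check: wrap the example invocation in a wall-clock timer and compare cold (cache purged) vs warm runs. The placeholder command below is hypothetical; substitute the actual example invocation (e.g. `boids_spatial3D.py -t -v -s 1`) and purge the jitify disk cache between runs for a cold-start number.

```python
import subprocess
import sys
import time

def time_command(cmd):
    """Run a command to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Placeholder command for illustration; substitute e.g.
    #   [sys.executable, "boids_spatial3D.py", "-t", "-v", "-s", "1"]
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"elapsed: {elapsed:.3f}s")
```

Running this once cold and once warm should make it obvious whether the bulk of the time is RTC compilation (which the warm jitify cache skips) or something else.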

@ptheywood (Member, Author)

This looks like an nvrtc perf regression within CUDA 12.2.

Using python_rtc/boids_spatial3D_bounded/boids_spatial3D.py with -t -v -s 1, purging the jitify cache between runs:

Wheel CUDA    Loaded CUDA (.so's)    RTC time (s)
12.0          12.2                   33.501999
12.0          12.1                    3.763000
12.0          12.0                    3.800000
11.2          12.2                   34.901001
11.2          12.1                    3.987000
11.2          12.0                    4.092000
11.2          11.8                    4.060000
11.2          11.2                    2.218000

It's not impacted by the CUDA 12.2 change to lazy loading (I didn't think it would be relevant, but tested via CUDA_MODULE_LOADING=EAGER just in case).

For now we can probably just use CUDA 12.1, but we might want to try and narrow this down further (test a jitify example / a native NVRTC example) and report it to NVIDIA.
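For reference, the slowdown factor can be computed directly from the CUDA 12.0 wheel rows of the table above (a small sketch using only the measured values):

```python
# RTC times (s) measured above for the CUDA 12.0 wheel,
# keyed by the CUDA version loaded at runtime.
rtc_time = {"12.0": 3.800000, "12.1": 3.763000, "12.2": 33.501999}

baseline = rtc_time["12.1"]
for version, t in sorted(rtc_time.items()):
    print(f"CUDA {version}: {t:.3f}s ({t / baseline:.1f}x vs 12.1)")
# CUDA 12.2 works out to ~8.9x the 12.1 time.
```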

ptheywood changed the title from "Python / NVRTC performance" to "Python / NVRTC performance (CUDA 12.2)" on Oct 5, 2023
@ptheywood (Member, Author)

A CUDA 12.3 build with 12.3 at runtime had an RTC processing time of 20.773 s, with driver 545.23.06, so it's still painful but not quite as bad.

With driver 545.23.06 and Python 3.10:

Wheel CUDA    Loaded CUDA (.so's)    RTC time (s)
12.3          12.3                   20.773001
12.0          12.3                   23.533001
12.0          12.2                   23.684000
12.0          12.1                    3.815000

So the driver update / different Python seems to have helped, but performance is still bad.

ptheywood changed the title from "Python / NVRTC performance (CUDA 12.2)" to "Python / NVRTC performance (CUDA 12.2+)" on Oct 27, 2023
@ptheywood (Member, Author)

ptheywood commented Nov 14, 2023

Confirmed this is not hardware-specific, running on a Titan V, compiled with CUDA 12.0 and driver 545.23.06:

module load CUDA/12.0
cmake .. -DCMAKE_CUDA_ARCHITECTURES="70" -DFLAMEGPU_RTC_DISK_CACHE=OFF 
cmake --build . --target rtc_boids_spatial3D -j 8

Executed using CUDA 12.0+; only a single run of each, so not perfect, but the difference is clear.

module load CUDA/12.0
./bin/Release/rtc_boids_spatial3D -s 1 -t 
CUDA    RTC time (s)
12.3    33.048
12.2    37.532
12.1     5.634
12.0     5.746

@ptheywood (Member, Author)

Google Colab has now updated to CUDA 12.2, which makes this issue more prominent for potential FLAME GPU 2 users, with the run_simulation cell now taking ~3-5 minutes for the first run, and ~5 seconds for the second run.

RTC compilation previously would have been ~80s for 16 agent functions.
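As rough arithmetic (using only the figures quoted in this thread, so treat the result as a ballpark), the Colab first run works out to roughly a 2-4x slowdown over the previous ~80 s baseline:

```python
# Figures quoted above: ~80 s of RTC compilation for 16 agent functions
# previously, vs ~3-5 minutes (180-300 s) for the first Colab run now.
prev_total = 80.0
n_functions = 16
print(f"previously: {prev_total / n_functions:.1f}s per agent function")
for now in (180.0, 300.0):
    print(f"{now:.0f}s first run -> {now / prev_total:.2f}x slower")
# prints 5.0s per agent function previously, and a 2.25x-3.75x slowdown.
```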
