
Python / NVRTC performance (CUDA 12.2+) #1118

Open
ptheywood opened this issue Oct 3, 2023 · 4 comments
@ptheywood (Member)

Recent runs of the Python test suite (CUDA 12.0 wheel, driver 535.104.05, Python 3.12) took a significant length of time to run under Linux:

650 passed, 11 skipped, 69 warnings in 3080.89s (0:51:20)

A second run, which used the jitify cache / Python caches, was significantly faster (~965x):

650 passed, 11 skipped, 69 warnings in 3.19s 

This was a manylinux based wheel, so SEATBELTS=ON, GLM=OFF.

We should probably investigate this if we are going to push the Python side more thoroughly; 50 minutes of jitting for 3 s of total runtime is bad. The test suite is more or less a worst case of compilation vs model runtime, but it's still pretty bad.


Best guess is that NVRTC has become slower with CUDA 12.x, which compounds into a very long time, but this would need investigating to know for certain (profile the test suite / compare different CUDA versions).
Just running a Python example with -t -v might be enough for a quick confirmation of whether it is RTC time or not (with different CUDA versions).
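A minimal sketch of such a timing check: wrap the example invocation in a wall-clock timer and compare cold (cache purged) vs warm runs. The placeholder command below is hypothetical; substitute the actual example invocation (e.g. `boids_spatial3D.py -t -v -s 1`) and purge the jitify disk cache between runs for a cold-start number.

```python
import subprocess
import sys
import time

def time_command(cmd):
    """Run a command to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Placeholder command for illustration; substitute e.g.
    #   [sys.executable, "boids_spatial3D.py", "-t", "-v", "-s", "1"]
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"elapsed: {elapsed:.3f}s")
```

Running this once cold and once warm should make it obvious whether the bulk of the time is RTC compilation (which the warm jitify cache skips) or something else.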

@ptheywood (Member, Author)

This looks like an nvrtc perf regression within CUDA 12.2.

Using python_rtc/boids_spatial3D_bounded/boids_spatial3D.py with -t -v -s 1, purging the jitify cache between runs:

Wheel CUDA    Loaded CUDA (.so's)    RTC time (s)
12.0          12.2                   33.501999
12.0          12.1                    3.763000
12.0          12.0                    3.800000
11.2          12.2                   34.901001
11.2          12.1                    3.987000
11.2          12.0                    4.092000
11.2          11.8                    4.060000
11.2          11.2                    2.218000

It's not impacted by the CUDA 12.2 change to lazy loading (I didn't think it would be relevant, but tested via CUDA_MODULE_LOADING=EAGER just in case).

For now we can probably just use CUDA 12.1, but we might want to try and narrow this down further (test a jitify example / a native NVRTC example) and report it to NVIDIA.
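For reference, the slowdown factor can be computed directly from the CUDA 12.0 wheel rows of the table above (a small sketch using only the measured values):

```python
# RTC times (s) measured above for the CUDA 12.0 wheel,
# keyed by the CUDA version loaded at runtime.
rtc_time = {"12.0": 3.800000, "12.1": 3.763000, "12.2": 33.501999}

baseline = rtc_time["12.1"]
for version, t in sorted(rtc_time.items()):
    print(f"CUDA {version}: {t:.3f}s ({t / baseline:.1f}x vs 12.1)")
# CUDA 12.2 works out to ~8.9x the 12.1 time.
```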

ptheywood changed the title from "Python / NVRTC performance" to "Python / NVRTC performance (CUDA 12.2)" on Oct 5, 2023
@ptheywood (Member, Author)

A CUDA 12.3 build with 12.3 at runtime had an RTC processing time of 20.773 s, with driver 545.23.06, so it's still painful but not quite as bad.

With driver 545.23.06 and Python 3.10:

Wheel CUDA    Loaded CUDA (.so's)    RTC time (s)
12.3          12.3                   20.773001
12.0          12.3                   23.533001
12.0          12.2                   23.684000
12.0          12.1                    3.815000

So the driver update / different Python seems to have helped, but performance is still bad.

ptheywood changed the title from "Python / NVRTC performance (CUDA 12.2)" to "Python / NVRTC performance (CUDA 12.2+)" on Oct 27, 2023
@ptheywood (Member, Author)

ptheywood commented Nov 14, 2023

Confirmed this is not hardware-specific, running on a Titan V, compiled with CUDA 12.0 and driver 545.23.06:

module load CUDA/12.0
cmake .. -DCMAKE_CUDA_ARCHITECTURES="70" -DFLAMEGPU_RTC_DISK_CACHE=OFF 
cmake --build . --target rtc_boids_spatial3D -j 8

Executed using CUDA 12.0+; only a single run of each, so not perfect, but the difference is clear.

module load CUDA/12.0
./bin/Release/rtc_boids_spatial3D -s 1 -t 
CUDA    RTC time (s)
12.3    33.048
12.2    37.532
12.1     5.634
12.0     5.746

@ptheywood (Member, Author)

Google Colab has now updated to CUDA 12.2, which makes this issue more prominent for potential FLAME GPU 2 users, with the run_simulation cell now taking ~3-5 minutes for the first run, and ~5 seconds for the second run.

RTC compilation previously would have been ~80s for 16 agent functions.
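As rough arithmetic (using only the figures quoted in this thread, so treat the result as a ballpark), the Colab first run works out to roughly a 2-4x slowdown over the previous ~80 s baseline:

```python
# Figures quoted above: ~80 s of RTC compilation for 16 agent functions
# previously, vs ~3-5 minutes (180-300 s) for the first Colab run now.
prev_total = 80.0
n_functions = 16
print(f"previously: {prev_total / n_functions:.1f}s per agent function")
for now in (180.0, 300.0):
    print(f"{now:.0f}s first run -> {now / prev_total:.2f}x slower")
# prints 5.0s per agent function previously, and a 2.25x-3.75x slowdown.
```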
