Python / NVRTC performance (CUDA 12.2+) #1118
This looks like an NVRTC performance regression within CUDA 12.2.
It's not impacted by the CUDA 12.2 change to lazy loading (didn't think it would be relevant, but tested it anyway). For now, we can probably just use CUDA 12.1, but we might want to try and narrow this down further (test a Jitify example / native NVRTC example) and report this to NVIDIA.
A CUDA 12.3 build with 12.3 at runtime had an RTC processing time of 20.773s, with driver 545.23.06 and Python 3.10, so it's still painful but not quite as bad.
So the driver update / different Python version seems to have helped, but performance is still bad.
Confirmed this is not hardware specific, running on a Titan V, compiled with CUDA 12.0 and driver 545.23.06:

```
module load CUDA/12.0
cmake .. -DCMAKE_CUDA_ARCHITECTURES="70" -DFLAMEGPU_RTC_DISK_CACHE=OFF
cmake --build . --target rtc_boids_spatial3D -j 8
```

Executed using CUDA 12.0+; only a single run, so not perfect, but the difference is clear:

```
module load CUDA/12.0
./bin/Release/rtc_boids_spatial3D -s 1 -t
```
Google Colab has now updated to CUDA 12.2, which makes this issue more prominent to potential FLAME GPU 2 users; RTC compilation of 16 agent functions would previously have taken ~80s.
Recent runs of the Python test suite (CUDA 12.0, driver 535.104.05, Python 3.12) took a significant length of time to run under Linux.
A second run, which uses the jitify cache / Python caches, was significantly faster (965x).
This was a manylinux-based wheel, so SEATBELTS=ON, GLM=OFF.
We should probably investigate this if we are going to push the Python side more thoroughly; 50 minutes of jitting for 3s of total runtime is bad (the test suite is more or less worst-case compilation vs model runtime, but it's still pretty bad).
Best guess is that NVRTC has got slower with CUDA 12.x, which compounds into a very long total time, but we would need to investigate to know for certain (profile the test suite / compare different CUDA versions).
Just running a Python example with `-t -v` might be enough for a quick confirmation of whether it's RTC time or not (with different CUDA versions).
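A minimal sketch of that kind of check, assuming any of the example binaries above: timing a cold run (RTC disk cache disabled or empty) against an immediately repeated warm run isolates the compile cost, since everything else should be roughly identical. The `time_command` helper is hypothetical, not part of FLAME GPU 2.

```python
import subprocess
import time

def time_command(cmd):
    """Run a command to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Hypothetical usage: the first run pays the NVRTC/jitify cost, the second
# run hits the jitify cache, so the difference approximates RTC compile time.
# cold = time_command(["./bin/Release/rtc_boids_spatial3D", "-s", "1", "-t"])
# warm = time_command(["./bin/Release/rtc_boids_spatial3D", "-s", "1", "-t"])
# print(f"cold {cold:.1f}s, warm {warm:.1f}s, ~RTC cost {cold - warm:.1f}s")
```

This is cruder than parsing the `-t -v` output, but it works unchanged across CUDA versions, which is what matters for bisecting the regression.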