Migrate to Jitify2 #1150

Robadob · 2023-11-15T16:58:51Z

RTC and Execution (Works with CUDA12.0, Windows/Linux)
Serialisation/Deserialisation (jitify2 pre-processed serialised objects are 2-3x larger)
CUDA 12.3 support
- Works with Jitify2 misc fixes NVIDIA/jitify#128, hence wait for merge.
- Same branch deprecates the launch() method (used in CUDASimulation), says to replace with launch_raw().
Investigate access violation from cuda().ModuleUnload() during sim shutdown, when CUDAAgent map is cleared by CUDASimulation destructor.
- Has occurred on Windows & Linux, but it's inconsistent.
- gdb log: https://gist.github.com/Robadob/3cb0e93014d56f05587f4ea1ea581203 (with jitify2-misc-fixes2 branch)
Reimplement jitify1 demangle used in curve_rtc.cpp
Optimise serialisation load time
- Jitify2 serialises pre-NVRTC; this is 50x slower and produces a 2x larger serialised blob
- Waiting for clarity: (jitify2) Serialisation post-NVRTC NVIDIA/jitify#133

Optimise compile time
- Rework of old header hack? (would require loading headers from file)
  - Preloading fgpu headers only cuts agent fn time from 6.8s to 4.1s.
  - Loading CUDA headers too makes a big difference, but these may not be particularly stable between versions
- Offline pre-process FLAMEGPU2 include hierarchy into a single header file? (using jitify tools?)
- Wait for this PR to be merged?

Visual Studio 2019 support (we may be able to drop this)
ManyLinux2014 support

Robadob · 2023-11-17T11:21:12Z

.github/scripts/install_cuda_centos.sh

-    if [[ ${package} == *devel* ]] && version_lt "$CUDA_VERSION_MAJOR_MINOR" "11.0" ; then
-        package="${package//devel/dev}"
+    # libnvjitlink not required prior to CUDA 12.0
+    if [[ ${package} == libnvjitlink-dev* ]] && version_lt "$CUDA_VERSION_MAJOR_MINOR" "12.0" ;then


Not sure if this is as narrow as it could be, but I wanted to make minimal changes to get it working.

The wildcard on libnvjitlink-dev can probably be removed.

Same issue in the ubuntu script
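
For illustration, a minimal sketch of how a guard like the one in the diff behaves. The `version_lt` below is a stand-in written for this example (it assumes GNU `sort -V` is available) and is not necessarily the helper the CI scripts define. `filter_packages` and the package names are hypothetical, and skipping the package is an assumed reading of what the guarded branch does.

```shell
# Stand-in helper (assumption): succeeds when version $1 is strictly lower than $2.
# Uses GNU sort's -V (version sort) to order the two strings.
version_lt() {
    [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Hypothetical filter: print every package except libnvjitlink-dev* ones
# when the requested CUDA version is older than 12.0.
filter_packages() {
    cuda_version="$1"
    shift
    for package in "$@"; do
        case "${package}" in
            libnvjitlink-dev*)
                # libnvjitlink not required prior to CUDA 12.0
                if version_lt "${cuda_version}" "12.0"; then
                    continue
                fi
                ;;
        esac
        echo "${package}"
    done
}

# Prints cuda-nvcc-11-8 and cuda-nvrtc-devel-11-8, one per line.
filter_packages "11.8" cuda-nvcc-11-8 libnvjitlink-devel-11-8 cuda-nvrtc-devel-11-8
```

The `libnvjitlink-dev*` pattern matches both a `-dev` and a `-devel` style name, which is why the same guard could be shared between the CentOS and Ubuntu scripts.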

Robadob · 2023-11-20T10:25:53Z

Update regarding header pre-loading with Jitify2/CUDA 12.3

Windows/CUDA 12.0

No preload
Millis: 6822.000000
Millis: 6853.000000

Preloading FLAMEGPU headers
Millis: 4045.000000
Millis: 4277.000000

Preload FLAMEGPU + CUDA headers
Millis: 1296.000000
Millis: 1667.000000

Linux/CUDA 12.3

Jitify 2 from scratch (Waimu)
Millis: 25318.000000
Millis: 24143.000000

Preload FLAMEGPU + CUDA headers
Millis: 1376.000000
Millis: 2218.000000

CUDA 12.0 has ~30 CUDA headers to preload.
CUDA 12.3 has ~257 CUDA headers to preload. (List contains some dupes)

It is not clear whether we would want to generalise this code to better handle different CUDA versions, because we could potentially need to update it with each CUDA release.

Edit: Removed from-cache times, latest commit has these matching Jitify1.
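
To put the timings above in relative terms, here is a small sketch that averages each pair of runs and prints the speedup over the matching no-preload baseline. The `mean_speedup` helper is hypothetical and written only for this comparison. The inputs are the millisecond values reported above, and two runs per configuration is a small sample, so the ratios are indicative only.

```shell
# Hypothetical helper: mean of two baseline runs divided by mean of two preloaded runs.
# $1 $2 = baseline timings (ms), $3 $4 = preloaded timings (ms).
mean_speedup() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" \
        'BEGIN { printf "%.1f\n", ((a + b) / 2) / ((c + d) / 2) }'
}

mean_speedup 6822 6853 4045 4277     # Windows/CUDA 12.0, FLAMEGPU headers only: prints 1.6
mean_speedup 6822 6853 1296 1667     # Windows/CUDA 12.0, FLAMEGPU + CUDA headers: prints 4.6
mean_speedup 25318 24143 1376 2218   # Linux/CUDA 12.3, FLAMEGPU + CUDA headers: prints 13.8
```

On these numbers, preloading the CUDA headers as well accounts for most of the gain on both platforms.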

Robadob · 2023-11-21T16:28:14Z

The current issue holding back the Jitify2 preprocessor branch is that it expects our flamegpu headers to be included as system headers (`<>`) rather than with quotes (`""`). Waiting to hear back from the dev (Ben) before I try to correct that on our side.

Robadob · 2023-11-23T10:27:05Z

Did three full test runs last night and all passed. However, in those runs the CMake jitify dependency was pointing at the preprocess branch. That branch is not currently used here, as it causes all Windows CI to fail with Werror.

Linux/CUDA12.3/Seatbelts ON/GLM ON/Release
Linux/CUDA12.3/Seatbelts OFF/GLM ON/Release
Windows/CUDA12.0/Seatbelts ON/GLM OFF/Debug

In Release builds, each kernel takes ~1 second to compile. As Jitify is now doing the pre-processing, this is closer to 2.5 seconds under Debug builds.

Robadob added the RTC label Nov 15, 2023

Robadob self-assigned this Nov 15, 2023

Robadob mentioned this pull request Nov 20, 2023

Expose JitifyCache via Python interface. #1151

Merged

Robadob and others added 24 commits November 21, 2023 16:17

wip

e91956a

Having issues on windows, will try Linux

waimu fixes

544f0e6

This now works Windows + CUDA 12.0

3835087

Slow (as we haven't got our pre-header hack) and lacks serialization.

lint fix

6dfb338

Attempt to fix CUDA 12.0 libnvjitlink

a3070d4

Same fix for manylinux

cfe64b5

Temp fix CI warning.

8edebcc

Windows CUDA 12 CI fix?

c95f4ee

Temp redirect jitify2 dependency to my fork, see if Windows CI is fix…

c42e790

…ed (cuda11 may still be dead)

Fix Centos Package install

8d9f1c2

Empty change to force all CI to rerun

cd4088e

Reenable serialisation

9bc3380

Triggering compile of the preprocessed source after deserialisation is still fast.

Borrow Jitify1 demangle for gcc

3cdf925

fixup

ada42ca

Prevent jitify2::KernelData from being destroyed after context.

d2af9a7

Preload FLAMEGPU headers

49b89a3

This only reduces time from 6.8s to 4.1s (Windows/CUDA 12.0) and can't easily extend it to system headers.

Split up program load so we can serialise linked program.

b1dfdce

Quick windows test shows it to be much faster to deserialize.

Temp switch back to preprocess

df2635c

fix jitify2 preprocess deprecation

656302b

fix windows CI werror

4c5c6e3

Specify -D_FILE_OFFSET_BITS=64 for old linux kernels when using jitify2

c928b58

Dont merge, alswyas set the define to test manylinux while kernel che…

fe96919

…cking doesnt work in docker

Detect if _FILE_OFFSET_BITS=64 is required at configuration time via …

87fcceb

…try_compile

Remove preinclude cuda.h, nolonger required.

dd3d725

Robadob force-pushed the jitify2 branch from 3430f62 to dd3d725 Compare November 21, 2023 16:21

Robadob added 6 commits November 22, 2023 14:30

Header tweaks required by jitify2-preprocess

afdc48c

Jitify2 preprocess branch works with these changes.

6016f73

bugfix, test suite was failing

bdbba58

fixup

5ea1ce2

Fix RTC agent function conditions.

9fee20a

Same fix earlier applied to agent functions.

Remove getKnownHeaders()

fd788a3

Negligible impact with preprocess branch also lint fixes that should have been commit earlier.

This was referenced Dec 10, 2023

[Bug]:The results simulated by using Python Agent API in Tutorial Code are different from those using c++ Agent API. #1158

Closed

Changelog for 2.0.0-rc1 #1159

Merged

ptheywood added this to the 2.0.0-rc2 milestone Jan 12, 2024

ptheywood mentioned this pull request Mar 7, 2024

CUDA 12.4+ NVRTC -minimal #1187

Open

This was referenced Apr 2, 2024

[Bug]: Colab tutorial failing due to CUDA version conflict #1191

Closed

Remove CUDA 12.2+ performance warning when possible FLAMEGPU/FLAMEGPU2-tutorial-python#27

Open

Robadob added the blocked label Apr 2, 2024
