-
Hi, I am trying to use EnzymeAD differentiate a volume renderer. I installed Enzyme according to https://github.com/wsmoses/Enzyme-GPU-Tests and managed to get the program work according to the https://enzyme.mit.edu/getting_started/CUDAGuide/. I profile the program and find that the forward kernel takes 5ms however the backward kernel takes 950ms, which is unexpected slow. I am not sure whether I missed something. The CUDA code of the forward and backward kernel is list below. It will be much appreciated if someone can help check whether I use the Enzyme here correctly. Thanks! Forward kernel: ///////////// Enzyme /////////////
int __device__ enzyme_dup;
int __device__ enzyme_out;
int __device__ enzyme_const;
__device__ void volume_renderer(
const float * __restrict__ sigmas,
const float * __restrict__ rgbs,
const float * __restrict__ deltas,
float * weights,
float * weights_sum,
float * depth,
float * image,
const uint32_t M, const uint32_t N, const int * __restrict__ rays){
/// parallel per ray
const uint32_t n = threadIdx.x + blockIdx.x * blockDim.x;
if (n >= N) return;
// locate
uint32_t index = rays[n * 3];
uint32_t offset = rays[n * 3 + 1];
uint32_t num_steps = rays[n * 3 + 2];
// empty ray, or ray that exceed max step count.
if (num_steps == 0 || offset + num_steps > M) {
weights_sum[index] = 0;
depth[index] = 0;
image[index * 3] = 0;
image[index * 3 + 1] = 0;
image[index * 3 + 2] = 0;
return;
}
sigmas += offset;
rgbs += offset * 3;
deltas += offset * 2;
// accumulate
uint32_t step = 0;
float T = 1.0f;
float T_thresh = 0.0001f;
float r = 0, g = 0, b = 0, ws = 0, t = 0, d = 0;
while (step < num_steps) {
const float alpha = 1.0f - __expf(- sigmas[0] * deltas[0]);
const float weight = alpha * T;
r += weight * rgbs[0];
g += weight * rgbs[1];
b += weight * rgbs[2];
t += deltas[1]; // real delta
d += weight * t;
ws += weight;
T *= 1.0f - alpha;
// minimal remained transmittence
if (T < T_thresh) break;
//printf("[n=%d] num_steps=%d, alpha=%f, w=%f, T=%f, sum_dt=%f, d=%f\n", n, step, alpha, weight, T, sum_delta, d);
// locate
sigmas++;
rgbs += 3;
deltas += 2;
step++;
}
//printf("[n=%d] rgb=(%f, %f, %f), d=%f\n", n, r, g, b, d);
// write
weights_sum[index] = ws; // weights_sum
depth[index] = d;
image[index * 3] = r;
image[index * 3 + 1] = g;
image[index * 3 + 2] = b;
}
void __global__ volume_renderer_wrapper(
const float * __restrict__ sigmas,
const float * __restrict__ rgbs,
const float * __restrict__ deltas,
float * weights,
float * weights_sum,
float * depth,
float * image,
const uint32_t M, const uint32_t N, const int * __restrict__ rays){
volume_renderer(sigmas, rgbs, deltas, weights, weights_sum, depth, image, M, N, rays);
} Backward kernel:
|
Beta Was this translation helpful? Give feedback.
Replies: 10 comments 2 replies
-
Haven't used the cuda side myself, but I guess that is expected based on your compilation command. ''' The reason is that we disable some optimizations before running Enzyme and on Top of that emit mostly unoptimized code that is intended to be easily optimizable. The second optimization run is however missing here. I am currently on my phone so I can't build you the full command, but maybe have a look at the pure C++ example with lld and LLDEnzyme or opt and LLVMEnzyme and try to add the corresponding cuda flags there along the way. |
Beta Was this translation helpful? Give feedback.
-
Hi @ZuseZ4 , thanks for the timely reply! As for the compilation command. I mainly follow the Makefile in this example https://github.com/wsmoses/Enzyme-GPU-Tests/blob/main/LBM/Makefile . It will be great if you can share some sample command please when you are convenient. Thanks. |
Beta Was this translation helpful? Give feedback.
-
Again unrelated to CUDA. Do you need all the primals for |
Beta Was this translation helpful? Give feedback.
-
Can you post a zip file containing the code/makefile/etc? The Makefile you link to has many different options to be able to run an ablation analysis to test how effective optimizations are. |
Beta Was this translation helpful? Give feedback.
-
Hi @wsmoses , here is the zip containing code, makefile, test data and how to repro. It will be great if you can help take a look. Thanks! |
Beta Was this translation helpful? Give feedback.
-
Thanks for the suggestions @tgymnich. I was a beginner in EnzymeAD and not familiar with the conventions so I just set everything to |
Beta Was this translation helpful? Give feedback.
-
If you add
|
Beta Was this translation helpful? Give feedback.
-
From these perf warnings, it looks like LLVM optimizations aren't happy about your indexing, changing it to something like the following improves it: while (step < num_steps) {
const float alpha = 1.0f - __expf(- sigmas[step] * deltas[2*step]);
const float weight = alpha * T;
r += weight * rgbs[3*step+0];
g += weight * rgbs[3*step+1];
b += weight * rgbs[3*step+2];
t += deltas[2*step+1]; // real delta
d += weight * t;
ws += weight;
T *= 1.0f - alpha;
// minimal remained transmittence
if (T < T_thresh) break;
//printf("[n=%d] num_steps=%d, alpha=%f, w=%f, T=%f, sum_dt=%f, d=%f\n", n, step, alpha, weight, T, sum_delta, d);
// locate
// sigmas++;
// rgbs += 3;
// deltas += 2;
step++;
} The performance warnings now become the following, which are far fewer, but still requires caching for the T and related variable:
|
Beta Was this translation helpful? Give feedback.
-
You can add @michel2323 @sriharikrishna this looks like a good example of why we should move some more of the checkpointing work into Enzyme proper. cc @vchuravy |
Beta Was this translation helpful? Give feedback.
-
Hi @wsmoses , thanks for your help! After I modified the source code and compilation flags according to your suggestions, the time cost of the backward kernel reduce to ~6 ms, which is much more sensible to me. There are still two points which confused me.
Thanks! |
Beta Was this translation helpful? Give feedback.
You can add
-mllvm -enzyme-max-cache
which changes the allocations from dynamic reallocs to static malloc, which should partially improve things -- though those two remaining allocations still persist and can cause some performance issues.@michel2323 @sriharikrishna this looks like a good example of why we should move some more of the checkpointing work into Enzyme proper. cc @vchuravy