
Hardcoded block_size in kernels #261

Open
azret opened this issue Apr 26, 2024 · 14 comments

Comments

@azret
Contributor

azret commented Apr 26, 2024

Can we make the block_size in the kernels more adaptive or parameterized? For example, 1024 is pretty big for my GPU, which has 12 GB of memory.

I have to run with block_size = 32:

void fused_classifier3(float* logits, float* losses,
                      const float* dlosses, const int* targets,
                      int B, int T, int V, int P) {
    const int block_size = 1024;
    const int N = B * T;
    const int grid_size = N;
    fused_classifier_kernel3<<<grid_size, block_size>>>(logits, losses, NULL, dlosses, targets, B, T, V, P);
    cudaCheck(cudaGetLastError());
}
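
A minimal sketch of what a parameterized launcher could look like (the trailing block_size argument is hypothetical, not the current signature in the repo):

// Hypothetical variant: the caller picks the block size instead of hardcoding 1024.
// Assumes fused_classifier_kernel3 and cudaCheck are defined as in train_gpt2_fp32.cu.
void fused_classifier3(float* logits, float* losses,
                      const float* dlosses, const int* targets,
                      int B, int T, int V, int P, int block_size) {
    const int N = B * T;
    const int grid_size = N;  // one block per (b, t) position
    fused_classifier_kernel3<<<grid_size, block_size>>>(logits, losses, NULL, dlosses, targets, B, T, V, P);
    cudaCheck(cudaGetLastError());
}
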
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.78                 Driver Version: 551.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   42C    P3             18W /   73W |       0MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
@karpathy
Owner

Yeah I think we should do that.
I have a TODO to look more into the NVIDIA Occupancy Calculator, which I think might be helpful here.

@ademeure
Contributor

Not sure I understand why a larger block_size doesn’t work for you; it shouldn’t have anything to do with VRAM, and 12 GB is plenty anyway!

What’s the exact error you are getting? As a quick experiment, can you try

fused_classifier_kernel3<<<grid_size, block_size, 8192>>>(logits, losses, NULL, dlosses, targets, B, T, V, P);

i.e. force the kernel to ask for 8192 bytes of shared memory (it does use shared memory, but we don’t explicitly declare it as such, so there’s a small chance that’s related).

@azret
Contributor Author

azret commented Apr 27, 2024

Same error with 8192

[CUDA ERROR] at file D:\SRC\llm.c\train_gpt2_fp32.cu:1079:
too many resources requested for launch

void fused_classifier3(float* logits, float* losses,
                      const float* dlosses, const int* targets,
                      int B, int T, int V, int P) {
    const int block_size = 1024;
    const int N = B * T;
    const int grid_size = N;
    fused_classifier_kernel3<<<grid_size, block_size, 8192>>>(logits, losses, NULL, dlosses, targets, B, T, V, P);
    cudaCheck(cudaGetLastError());
}
| Name | Value | Type      |
|------|-------|-----------|
| N    | 256   | const int |
| B    | 4     | int       |
| T    | 64    | int       |

Considering the maxGridSize, maxThreadsDim, etc. below from the device properties, what limits could I be running into?

(screenshot: devProps)

@ademeure
Contributor

Typically this happens when the compiled kernel uses more registers per thread than the maximum that still allows that many warps to run on an SM…

There’s a 256 KiB register file per SM, so with 1024 threads and 4 bytes per register, that’s a maximum of 64 registers per thread. I’m not sure why/how the compiler ends up using that many registers for you; maybe too aggressive unrolling…

Can you try adding the following to the kernel declaration:

__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)

so it becomes:

template <typename Type>
__global__ void __launch_bounds__(1024, 1) fused_classifier_kernel3(Type* logits, Type* losses, Type* probs, const Type* dlosses, const int* targets, int B, int T, int V, int P) {
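
For reference, a small sketch (standard CUDA runtime API, not code from the repo) that prints the register budget being described here; 65536 32-bit registers per SM is 256 KiB, so a single resident 1024-thread block leaves at most 64 registers per thread:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // 32-bit registers available per SM (65536 on recent GPUs, i.e. 256 KiB)
    printf("regsPerMultiprocessor: %d\n", prop.regsPerMultiprocessor);
    printf("maxThreadsPerBlock:    %d\n", prop.maxThreadsPerBlock);
    // with a single 1024-thread block resident on an SM, the per-thread cap is:
    printf("max regs per thread at 1024 threads: %d\n", prop.regsPerMultiprocessor / 1024);
    return 0;
}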

@azret
Contributor Author

azret commented Apr 27, 2024

The __launch_bounds__(1024, 1) bypassed the "too many resources" error. However, all the losses go kaboom with block_size = 1024, while everything passes with block_size = 32:

// With block_size=1024
allocated 474 MiB for AdamW optimizer state v
step 0: loss -inf (took 40.586900 ms)
step 1: loss -inf (took 61.810200 ms)
step 2: loss -nan(ind) (took 96.569500 ms)
step 3: loss -nan(ind) (took 99.150600 ms)
step 4: loss -nan(ind) (took 97.418500 ms)
step 5: loss -nan(ind) (took 99.267500 ms)
step 6: loss -nan(ind) (took 94.665500 ms)
step 7: loss -inf (took 92.860700 ms)
step 8: loss -inf (took 94.221800 ms)
step 9: loss -inf (took 98.063400 ms)
LOSS - NOT OK w/ tol = 2.000000e-04, faults = 5, maxdiff=inf
LOSS[0] -inf8 5.2700078, diff=inf
LOSS[1] -inf8 4.0597078, diff=inf
LOSS[7] -inf8 0.9991478, diff=inf
LOSS[8] -inf8 0.6240808, diff=inf
LOSS[9] -inf8 0.3765118, diff=inf
overall okay: NOT OK

I also noticed we don't do CEIL_DIV for fused_classifier3. Is it designed that way?

@azret
Contributor Author

azret commented Apr 27, 2024

I'm running fp32 btw, with B = 4, T = 64, so a pretty small batch.

@ademeure
Contributor

Is that at the same time as the previous change setting shared memory to 8192? If so, I’m at a loss :(

I have an RTX 4090, so same generation but slightly older drivers; I might try upgrading next week and see if I hit the same issue, but…

I just realised your output says WDDM; is this on Windows or WSL? I don’t think that typically matters, but some things like timeouts and shared-memory allocation restrictions might be different…

@azret
Contributor Author

azret commented Apr 27, 2024

It's Windows.

@rosslwheeler
Contributor

I'm seeing the same error on Windows too. I do NOT see it on WSL2.

Same spot as above, in:

fused_classifier_kernel3<<<grid_size, block_size>>>(logits, losses, (Type*)NULL, dlosses, targets, B, T, V, P);

 	train_gpt2_cu.exe!cudaCheck(cudaError error, const char * file, int line) Line 132	C++
>	train_gpt2_cu.exe!fused_classifier3<__nv_bfloat16>(__nv_bfloat16 * logits, __nv_bfloat16 * losses, const __nv_bfloat16 * dlosses, const int * targets, int B, int T, int V, int P) Line 1247	C++
 	train_gpt2_cu.exe!gpt2_forward(GPT2 * model, int * inputs, int * targets, unsigned __int64 B, unsigned __int64 T) Line 1682	C++
 	train_gpt2_cu.exe!main(int argc, char * * argv) Line 2278	C++


@rosslwheeler
Contributor

Also had to make these changes to get the file to compile with nvcc:

(screenshot of the changes)

@rosslwheeler
Contributor

@azret - fixed it! I was missing the PFLAGS in the build - can you check to see if you are building with those new flags?

@PeterZhizhin
Contributor

For block sizes, maybe we should look into cudaOccupancyMaxPotentialBlockSize.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__HIGHLEVEL.html#group__CUDART__HIGHLEVEL_1gee5334618ed4bb0871e4559a77643fc1

This runs the occupancy calculator for a given function.
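
A rough sketch of how that could be wired into the launcher (illustrative only; the bare kernel name assumes the fp32 build, and a templated kernel would need an explicit instantiation):

// Sketch: let the runtime pick a block size that fits the kernel's register and
// shared-memory footprint on the current device, instead of hardcoding 1024.
int min_grid_size = 0, block_size = 0;
cudaCheck(cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                             fused_classifier_kernel3, /*dynamicSMemSize=*/0));
// block_size is now the largest block size that still reaches full occupancy for
// this kernel; round it down to a multiple of 32 if the kernel assumes whole warps.
fused_classifier_kernel3<<<B * T, block_size>>>(logits, losses, NULL, dlosses, targets, B, T, V, P);
cudaCheck(cudaGetLastError());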

@rosslwheeler
Contributor

@PeterZhizhin - I bumped into the -G debug issue too: "[CUDA ERROR] at file train_gpt2.cu:1410:
too many resources requested for launch". CUDA debug (-G) was enabled by default in the debug configuration for CUDA on Windows.

@PeterZhizhin
Contributor

@rosslwheeler yes, I had exactly the same issue. Seems like the kernel uses too many registers? Reducing block size to 512 in debug mode makes the code work.
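
One way to confirm that (a sketch, not code from the repo; the bare kernel name is illustrative) is to ask the runtime how many registers the compiled kernel actually uses and what block size it will accept:

// Sketch: query the compiled kernel's resource usage. Under -G (device debug),
// numRegs typically grows and maxThreadsPerBlock can drop below 1024, which makes
// a 1024-thread launch fail with "too many resources requested for launch".
cudaFuncAttributes attr;
cudaCheck(cudaFuncGetAttributes(&attr, fused_classifier_kernel3));
printf("numRegs: %d, maxThreadsPerBlock: %d\n", attr.numRegs, attr.maxThreadsPerBlock);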
