Bug in tensor core programming #251

blueWatermelonFri · 2024-01-31T06:33:10Z

I encountered a strange bug while programming tensor cores with the WMMA API on an A800.
I tried to print the size of an element in the fragment. Normally sizeof(fp16) is 2, and the following code also outputs 2 for each element.

    wmma::load_matrix_sync(a_frag, a , lda);
    wmma::load_matrix_sync(b_frag, b , ldb);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    printf("%d\n", sizeof(a_frag.x[0]));
    printf("%d\n", sizeof(a_frag.x[1]));

However, when I changed the code to the following form, the print statement output 2 and 0. Even when I swapped the order of a_frag.x[0] and a_frag.x[1], the output was still 2 and 0. Does anyone know why?

    wmma::load_matrix_sync(a_frag, a , lda);
    wmma::load_matrix_sync(b_frag, b , ldb);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    printf("%d %d\n", sizeof(a_frag.x[0]), sizeof(a_frag.x[1]));

My code runs in the following environment. The OS is Ubuntu 20.04.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800 80GB PCIe          Off | 00000000:03:00.0 Off |                    0 |
| N/A   43C    P0              68W / 300W |   2768MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

My compile command is:

    nvcc tensor_core.cu -std=c++11 -lcublas -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -res-usage -lcudart  -lineinfo -Xcompiler -fopenmp

yuantailing · 2024-05-30T16:58:02Z

Hello @blueWatermelonFri,

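The format specifier %d expects an argument of type int, but sizeof yields a size_t. The correct version of the code is

    printf("%d %d\n", (int)sizeof(a_frag.x[0]), (int)sizeof(a_frag.x[1]));

I guess the reason for the output of 2 and 0 is that size_t takes 8 bytes here, so the first argument is stored as 0x0000000000000002. The first %d then consumes its low 4 bytes (0x00000002) and the second %d consumes its high 4 bytes (0x00000000). Since both sizeof expressions evaluate to 2, this would also explain why swapping them does not change the output.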
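To make that guess concrete, here is a small host-side C++ sketch of the byte reinterpretation. It is only an illustration and not how CUDA's device printf is actually implemented: `read_as_two_ints` is a hypothetical helper name, and the result assumes a little-endian machine (such as x86-64) with an 8-byte size_t.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>

// Hypothetical helper (not part of CUDA or WMMA): copy the 8 bytes of one
// 64-bit argument into two 32-bit ints, which is what a reader expecting
// two consecutive %d arguments would see in the same memory.
std::pair<std::int32_t, std::int32_t> read_as_two_ints(std::uint64_t value) {
    std::int32_t parts[2];
    static_assert(sizeof(parts) == sizeof(value), "two ints must cover one 64-bit value");
    std::memcpy(parts, &value, sizeof(value));
    // Assumption: little-endian layout, so parts[0] holds the low 4 bytes
    // and parts[1] holds the high 4 bytes.
    return {parts[0], parts[1]};
}
```

Under those assumptions, `read_as_two_ints(2)` gives the pair (2, 0), which matches the observed output. Casting each sizeof result to int before passing it to printf avoids the mismatch altogether.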

Additionally, you may have received a warning like:

    warning #181-D: argument is incompatible with corresponding format string conversion (expected type "int" but argument has type "unsigned long long")
