CUDA Tensor Core Integration #2684
Unanswered · vdesai2014 asked this question in Q&A
I've been working on the tensor core bounty for CUDA and am a bit stuck. For context, the CUDA WMMA API has three relevant calls: `wmma::load_matrix_sync`, `wmma::mma_sync`, and `wmma::store_matrix_sync`. The actual matmul call (`mma_sync`) takes opaque 'fragment' objects as inputs, which are normally filled by the warp-wide `load_matrix_sync` call. The mapping of threads to matrix fragments is left opaque in the CUDA/PTX docs, but by experimenting in CUDA I found that each thread group (four threads working together) loads a contiguous 16-element chunk. The distribution of elements within each thread group is a bit funky; see the example below with a 16x16 row-major matrix.
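For reference, here is a minimal sketch of how the three WMMA calls fit together for a single 16x16x16 half-precision tile (the kernel name and the row-major layouts are illustrative choices, not anything from the bounty code):

```cuda
#include <mma.h>
using namespace nvcuda;

// Minimal sketch: one warp computes a single 16x16x16 half-precision tile.
// All three WMMA calls are warp-wide; the fragment objects are opaque.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);
  wmma::load_matrix_sync(a_frag, a, 16);  // element->thread distribution is not documented
  wmma::load_matrix_sync(b_frag, b, 16);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```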
You can map `threadIdx` to matrix element by using a stride of 0/128 for each of the two rows a thread group is responsible for, a stride of 0/8 for the two pairwise elements within a row, and an offset within each row of `threadIdx//4`. I haven't gotten up to speed yet with the refactored linearizer, but I'm not sure the current logic for generating load micro-ops could support a pattern like this (happy to be corrected here). Any ideas on how to proceed?
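One plausible reading of that mapping, written out as a sketch: each thread's eight fragment elements come from a per-thread base plus the 0/128 row stride and 0/8 pairwise stride described above. The base formula here (`group*16 + lane*2`) is my assumption, filled in so that the 32 threads tile the whole 16x16 matrix; it is consistent with the layouts NVIDIA publishes for `mma.sync`, but the original post does not state it, so treat it as a guess to be checked against hardware.

```python
def fragment_offsets(tid: int) -> list[int]:
    """Flat row-major offsets (in a 16x16 matrix) held by warp thread `tid`.

    Assumed layout: thread groups of 4; group g covers rows g and g+8
    (row stride 0/128 in flat index), each thread holds two adjacent
    elements at columns lane*2 and lane*2 + 8 (pairwise stride 0/8).
    """
    group = tid // 4          # thread-group index, 0..7
    lane = tid % 4            # position within the 4-thread group
    base = group * 16 + lane * 2  # assumed per-thread starting element
    return [base + row + col + e
            for row in (0, 128)   # the two rows a thread group covers
            for col in (0, 8)     # the two pairwise element pairs in a row
            for e in (0, 1)]      # two adjacent elements per pair

# Sanity check: the 32 threads together cover all 256 elements exactly once.
covered = sorted(o for t in range(32) for o in fragment_offsets(t))
assert covered == list(range(256))
```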
We could also put each 16x16 chunk in shared memory and call `load_matrix_sync` on that (Steven implemented this in #2512), but that wastes memory movement.
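For clarity, the staging idea looks roughly like this (my simplification, not the actual #2512 implementation); the extra global-to-shared copy is the wasted movement:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp stages a 16x16 half tile through shared memory, then issues the
// warp-wide fragment load from there. The global->shared copy is the extra
// memory traffic the staging approach pays for.
__global__ void staged_fragment_load(const half *a_global) {
  __shared__ half tile[16 * 16];
  for (int i = threadIdx.x; i < 16 * 16; i += 32)  // 32 threads, 8 elements each
    tile[i] = a_global[i];
  __syncwarp();

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::load_matrix_sync(a_frag, tile, 16);  // ldm = 16 for a packed 16x16 tile
  // ... mma_sync / store_matrix_sync would follow
}
```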