CUDA Tensor Core Integration #2684
Unanswered · vdesai2014 asked this question in Q&A
I've been working on the tensor core bounty for CUDA and am a bit stuck. For context, the CUDA WMMA API has three relevant calls: `wmma::load_matrix_sync`, `wmma::mma_sync`, and `wmma::store_matrix_sync`. The actual matmul call (`mma_sync`) takes opaque 'fragment' objects as inputs, which are normally filled by the warp-wide `load_matrix_sync` call. The mapping of threads to matrix fragments is left opaque in the CUDA/PTX docs, but by experimenting in CUDA I found that each thread group (four threads working together) loads a contiguous 16-element chunk. The distribution of elements within each thread group is a bit funky; see the example below with a 16x16 row-major matrix.
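For reference, here is a minimal sketch of how the three WMMA calls fit together for a single 16x16x16 half-precision tile (the kernel name and the row-major layouts are illustrative choices, not anything from the bounty code):

```cuda
#include <mma.h>
using namespace nvcuda;

// Minimal sketch: one warp computes a single 16x16x16 half-precision tile.
// All three WMMA calls are warp-wide; the fragment objects are opaque.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);
  wmma::load_matrix_sync(a_frag, a, 16);  // element->thread distribution is not documented
  wmma::load_matrix_sync(b_frag, b, 16);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```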
You can map `threadIdx` to matrix element by using a stride of 0/128 for each of the two rows a thread group is responsible for, a stride of 0/8 for the two pairwise elements within a row, and an offset within each row of `threadIdx//4`. I haven't gotten up to speed yet with the refactored linearizer, but I'm not sure the current logic for generating load micro-ops could support a pattern like this (happy to be corrected here). Any ideas on how to proceed?
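One plausible reading of that mapping, written out as a sketch: each thread's eight fragment elements come from a per-thread base plus the 0/128 row stride and 0/8 pairwise stride described above. The base formula here (`group*16 + lane*2`) is my assumption, filled in so that the 32 threads tile the whole 16x16 matrix; it is consistent with the layouts NVIDIA publishes for `mma.sync`, but the original post does not state it, so treat it as a guess to be checked against hardware.

```python
def fragment_offsets(tid: int) -> list[int]:
    """Flat row-major offsets (in a 16x16 matrix) held by warp thread `tid`.

    Assumed layout: thread groups of 4; group g covers rows g and g+8
    (row stride 0/128 in flat index), each thread holds two adjacent
    elements at columns lane*2 and lane*2 + 8 (pairwise stride 0/8).
    """
    group = tid // 4          # thread-group index, 0..7
    lane = tid % 4            # position within the 4-thread group
    base = group * 16 + lane * 2  # assumed per-thread starting element
    return [base + row + col + e
            for row in (0, 128)   # the two rows a thread group covers
            for col in (0, 8)     # the two pairwise element pairs in a row
            for e in (0, 1)]      # two adjacent elements per pair

# Sanity check: the 32 threads together cover all 256 elements exactly once.
covered = sorted(o for t in range(32) for o in fragment_offsets(t))
assert covered == list(range(256))
```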
We could also put each 16x16 chunk in shared memory and call `load_matrix_sync` on that (Steven implemented this in #2512), but that wastes memory movement.
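For clarity, the staging idea looks roughly like this (my simplification, not the actual #2512 implementation); the extra global-to-shared copy is the wasted movement:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp stages a 16x16 half tile through shared memory, then issues the
// warp-wide fragment load from there. The global->shared copy is the extra
// memory traffic the staging approach pays for.
__global__ void staged_fragment_load(const half *a_global) {
  __shared__ half tile[16 * 16];
  for (int i = threadIdx.x; i < 16 * 16; i += 32)  // 32 threads, 8 elements each
    tile[i] = a_global[i];
  __syncwarp();

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::load_matrix_sync(a_frag, tile, 16);  // ldm = 16 for a packed 16x16 tile
  // ... mma_sync / store_matrix_sync would follow
}
```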