Textures being created with incorrect sizeInBytes #210

Open

torrance opened this issue Jun 9, 2023 · 6 comments

@torrance commented Jun 9, 2023

The test_linalg/test_matmul_aa_correlator_kernel_small test (quoted below) adds a misalign parameter to test linalg_kernels.cu's handling of memory that is not aligned to 512-byte boundaries:

bifrost/test/test_linalg.py, lines 224 to 231 in 30fd268:

def test_matmul_aa_correlator_kernel_small(self):
    for nchan in range(1, 1+5):
        for ntime in [1, 2, 3, 4, 8, 12]:
            for nstand in range(1, 1+65):
                for misalign in range(0, min(2 * (nstand - 1), 3), 2):
                    self.run_test_matmul_aa_correlator_kernel(
                        ntime=ntime, nstand=nstand, nchan=nchan,
                        misalign=misalign)

For example, the third iteration of these loops passes in the values (nchan = 1, ntime = 1, nstand = 3, misalign = 2). This produces an array of shape (1, 1, 6), which is then truncated (using the misalign value) down to (1, 1, 4).
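A rough sketch of what that truncation looks like (assuming the test helper builds a ci8 array and slices the last axis by misalign, as the snippets later in this thread suggest):

from bifrost import ndarray

nstand, misalign = 3, 2
a = ndarray(shape=(1, 1, 2 * nstand), dtype='ci8')  # shape (1, 1, 6)
a = a[..., misalign:]                               # truncated view, shape (1, 1, 4)
print(a.shape)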

After calling linalg.matmul, this eventually reaches the function bf_cherk_N(), where the total number of elements in the array needs to be calculated:

size_t A_nelement_total =
    std::max(A_stride * K, A_batchstride * nbatch) + A_offset;

For this particular test, A_nelement_total should be 2 (the size of the truncated array) + 1 (the offset) = 3; however, it is calculated as 3 + 1 = 4.

When this incorrect, larger size is later used to calculate the sizeInBytes of the texture, the texture believes it has access to more memory than it really does. On CUDA backends this isn't actually validated, but on ROCm backends texture creation fails, since ROCm actually checks the backing memory.

I've spent some time with this but haven't been able to fix it, as I'm not clear what all of N, K, A_stride, nbatch, etc. are, or where they are meant to be coming from.

@jaycedowell (Collaborator)

I'll try to find some time this week to look into this.

@jaycedowell (Collaborator)

I spent some time looking into this and I think I know what is happening for this particular case. I think the underlying problem is how Bifrost computes array sizes. By default it uses the product of the first array stride length with the size of the first array dimension to get the overall array size. In Python that would be:

from bifrost import ndarray
a = ndarray(shape=(1,1,6), dtype='ci8')
print(a.size*a.itemsize, 'vs', a.strides[0]*a.shape[0])

which looks ok. When misalign is non-zero we end up with:

b = a[...,2:]
print(b.size*b.itemsize, 'vs', b.strides[0]*b.shape[0])

which doesn't match. The misalign=2 slice causes the start address of the underlying memory and the shape to be updated, but not the strides:

print(a.ctypes.data, 'vs', b.ctypes.data)
print(a.shape, 'vs', b.shape)
print(a.strides, 'vs', b.strides)

In this case the correct array size would come from something like numpy.prod(b.shape)*b.itemsize since both of those should be available in the BFarray class/struct.
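A quick sketch of that comparison, reusing the arrays from the snippets above (numpy is only used here for the shape product):

import numpy as np
from bifrost import ndarray

a = ndarray(shape=(1, 1, 6), dtype='ci8')
b = a[..., 2:]  # the misaligned view

# current estimate: first stride times first dimension (overshoots for b)
stride_based = b.strides[0] * b.shape[0]
# proposed estimate: product of the shape times the item size
shape_based = int(np.prod(b.shape)) * b.itemsize
print(stride_based, 'vs', shape_based)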

I'm less sure what to do about it. In this particular case changing the calculation of A_nelement_total to use the shape (somehow) would seem like the thing to do. However, I doubt that is a general solution, since A_nelement_total may only be relevant for the batched size of the operation. Thinking about what is happening here, we are trying to do c = α·a·aᴴ + β·c. Here a is a matrix with shape¹ (nbatch, N, K) and c has shape (nbatch, N, N). That would make nbatch the number of independent a·aᴴ products to run, N the size of the last two dimensions of c, and K the number of elements summed over for each element of c in the matrix multiplication. I think that makes A_nelement_total the maximum of "the distance between rows of a" and "the distance between batched sections of a", added to "the offset between the start of a and the texture alignment"?
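For what it's worth, here is a plain NumPy sketch of how I am reading that batched operation (the sizes are illustrative only, and this is just the maths, not Bifrost's cherk path):

import numpy as np

nbatch, N, K = 2, 4, 3                     # illustrative sizes only
alpha, beta = 1.0, 0.0

rng = np.random.default_rng(0)
a = rng.standard_normal((nbatch, N, K)) + 1j * rng.standard_normal((nbatch, N, K))
c = np.zeros((nbatch, N, N), dtype=complex)

aH = a.conj().transpose(0, 2, 1)           # (nbatch, K, N), the Hermitian conjugate
c = alpha * np.matmul(a, aH) + beta * c    # one independent N x N product per batch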

Footnotes

  1. The shape is technically (nbatch, K, N) because the original LinAlg.matmul() call puts a in the location of b, which means what we are passing in is the Hermitian conjugate of a.

@torrance (Author)

Thank you for looking at this, @jaycedowell. I'll re-read your comment a few times and attempt to have another look at it next week.

@torrance (Author) commented Jun 21, 2023

Hi @jaycedowell

I think the following gives the correct value for A_nelement_total:

size_t A_nelement_total =
    (A_offset + N)                  // the elements in the first row of first batch
    + (K - 1) * A_stride            // the elements in the rest of the first batch
    + (nbatch - 1) * A_batchstride; // the elements for the remaining batches

The tests pass and I no longer get memory violations in ROCm.

Do you think the logic works now for all use cases?
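For a concrete comparison of the two formulas (purely illustrative values, not the actual parameters of the failing test; they model a single batch with one row of 4 remaining elements whose stride still spans the original 6-element row):

# illustrative values only
N, K, nbatch = 4, 1, 1
A_stride, A_batchstride, A_offset = 6, 6, 1

old = max(A_stride * K, A_batchstride * nbatch) + A_offset                # 7
new = (A_offset + N) + (K - 1) * A_stride + (nbatch - 1) * A_batchstride  # 5
print(old, 'vs', new)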

@jaycedowell (Collaborator)

I'm not sure that A_nelement_total should include elements from other batches (that last term in the sum). For the batched execution you are only really concerned with those elements that are used for that particular batch. Each batch should handle its own texture binding.

@torrance (Author) commented Jul 3, 2023

I think it needs to be sized to store all the batches. The texture isn't created once per batch; it's shared amongst all of them, from my reading of the code.

In any case, if I delete the last term in the sum the tests will fail again.

dentalfloss1 added a commit that referenced this issue Dec 7, 2023