Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary operations with Arrays of different memory layout #681

Open
geggo opened this issue Apr 14, 2023 · 2 comments
Open

Binary operations with Arrays of different memory layout #681

geggo opened this issue Apr 14, 2023 · 2 comments

Comments

@geggo
Copy link
Contributor

geggo commented Apr 14, 2023

For pyopencl.array.Array I get wrong results for binary operations such as adding if the memory layout differs for both arguments. Say, "A" is a C-contiguous, and "AF" is F-contiguous, then adding A+AF gives unexpected results. See this notebook for an example (tested with PyOpenCL 2022.3.1)

import numpy as np
import pyopencl as cl
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
import pyopencl.array as cla
# create simple numpy array 'a'
a = np.zeros((2,2))
a[0,1] = 1
a
array([[0., 1.],
       [0., 0.]])
# with different memory layout
af = np.asarray(a, order='F')
af
array([[0., 1.],
       [0., 0.]])
a + af
array([[0., 2.],
       [0., 0.]])
A = cla.to_device(queue, a)
AF = cla.to_device(queue, af)
A, AF
(cl.Array([[0., 1.],
        [0., 0.]]),
 cl.Array([[0., 1.],
        [0., 0.]]))
# bug: A + AF gives different result than a + aF
B = A + AF
B
cl.Array([[0., 1.],
       [1., 0.]])

Is this intended to work? Looking quickly at the source code, I could not see that the strides are taken into account. It seems that the underlying arrays are added in memory order.

Thanks
Gregor

@inducer
Copy link
Owner

inducer commented Apr 14, 2023

Yep, this is a trade-off that I made at the outset, for two reasons:

  • First, figuring out an appropriate launch config for strided accesses would add considerable overhead to an already too-hot code path leading up to arithmetic. (cf. numarray)
  • Second, the existing code generation infrastructure won't easily stretch to true n-dimensional arrays. (https://github.com/inducer/loopy was intended, among other things, to address that use case)

It's obviously embarrassing that adding different-stride arrays silently gives wrong results, and I've been meaning to at least add a (C-based, to avoid a big impact on said hot code path) check to raise an error. PRs are definitely welcome along those lines.

That said, even if this issue were fixed, eagerly evaluating array calculations is still not a winning move, so I'm not super likely to devote much more attention than adding that check, cf. https://github.com/inducer/pytato/ for a possible remedy.

@geggo
Copy link
Contributor Author

geggo commented Apr 18, 2023

Thanks for the clarification. It is clear to me that the array arithmetic is not optimal in regard with performance - but convenient to use. I consider moving my code to pytorch or JAX for this purpose.

To add a check that throws an error if memory layout is different, I ended up patching PyOpenCL

_pyopencl_array_get_broadcasted_binary_op_result_SAVED = None
def patch_pyopencl_array():
    import pyopencl.array
    global _pyopencl_array_get_broadcasted_binary_op_result_SAVED
    if _pyopencl_array_get_broadcasted_binary_op_result_SAVED is None:
        _pyopencl_array_get_broadcasted_binary_op_result_SAVED = pyopencl.array._get_broadcasted_binary_op_result

    def _get_broadcasted_binary_op_result_PATCHED(obj1, obj2, cq, dtype_getter=pyopencl.array._get_common_dtype):
        if obj1.strides != obj2.strides:
            raise NotImplementedError('Memory layouts (strides) of arrays are not the same')
        else:
            return _pyopencl_array_get_broadcasted_binary_op_result_SAVED(
                obj1, obj2, cq, dtype_getter
            )

    pyopencl.array._get_broadcasted_binary_op_result = _get_broadcasted_binary_op_result_PATCHED

Essentially, it adds a check that the strides are equal. This check is performed for add, mul and the other binary arithmetic operators, but not for the inplace operators. The inplace operators have all a explicit tests for matching shape, a little bit of refactoring would be required to move this into a common method that incorporates the additional check for same strides. Will such a PR be acceptable? Compared to the code already in place I thinks the overhead is acceptable to avoid silent failures. I am wondering if it sensible to make the check more general, similar to strides_equal

Gregor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants