
batch_cast has a regression from v7 to v8 #678

Open
astrohawk opened this issue Feb 27, 2022 · 2 comments

Comments

@astrohawk
Contributor

Given that batch_cast is designed to convert every value in a batch from one type to another as if by static_cast, it made sense that, in v7, batch_cast was templated on the sizes of the input and output batches. With the change in v8, where batch is now templated on the architecture, batch_cast currently only works if the input and output have the same architecture, which in practice means they need to have the same underlying register width (at least on x86). This is a regression from v7.
For example, the following used to work fine on v7:

```cpp
batch<int32_t, 8> input = ...;
batch<int16_t, 8> output = batch_cast<int16_t>(input);
```
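As a point of reference for what "works" means here: batch_cast is specified to behave like an element-wise static_cast over the lanes. A minimal scalar sketch of that contract (plain C++, no xsimd dependency; scalar_batch_cast is a hypothetical name, not part of the library):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the batch_cast contract: every lane of
// the input is converted with static_cast, and the lane count N is
// preserved regardless of the element types involved.
template <class T_out, class T_in, std::size_t N>
std::array<T_out, N> scalar_batch_cast(const std::array<T_in, N>& in)
{
    std::array<T_out, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = static_cast<T_out>(in[i]);
    return out;
}
```

Any v8 implementation of the cross-architecture cast would need to match this lane-wise behaviour, even when the input and output registers have different widths.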

To express the same operation in v8, I would expect the following to behave identically:

```cpp
make_sized_batch_t<int32_t, 8> input = ...;
make_sized_batch_t<int16_t, 8> output = batch_cast<int16_t>(input);
```

Unfortunately this doesn't currently work as the input and output architectures are different.
In order to make this work, I think we should modify batch_cast to look something like

```cpp
template <class T_out, class T_in, class A>
inline make_sized_batch_t<T_out, batch<T_in, A>::size> batch_cast(batch<T_in, A> const& x) noexcept
{
    using batch_out = make_sized_batch_t<T_out, batch<T_in, A>::size>;
    return kernel::batch_cast(x, batch_out {}, A {}, typename batch_out::arch {});
}
```

This change will require significant changes to all the overloads of kernel::batch_cast, kernel::detail::batch_cast, kernel::detail::conversion_type, and kernel::detail::fast_cast (among others) to add support for both an input and output architecture. For example, the generic overload of kernel::batch_cast would have to be modified to look something like

```cpp
template <class A_out, class A_in, class T_out, class T_in>
inline batch<T_out, A_out> batch_cast(batch<T_in, A_in> const& self, batch<T_out, A_out> const& out, requires_arch<generic>, requires_arch<generic>) noexcept
{
    return detail::batch_cast(self, out, A_in {}, A_out {}, detail::conversion_type<A_in, T_in, A_out, T_out> {});
}
```
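To make the dispatch change concrete, here is a minimal, self-contained sketch (all names hypothetical, loosely modelled on xsimd's tag-dispatch convention) of selecting a kernel overload on both an input and an output architecture tag, with inheritance providing the generic fallback:

```cpp
#include <string>

// Hypothetical architecture tags; the inheritance chain lets overload
// resolution fall back from a specific pair to the generic kernel.
struct generic {};
struct sse2 : generic {};
struct avx2 : sse2 {};

// Generic fallback: fires when no more specific (A_in, A_out) overload matches.
inline std::string cast_kernel(generic, generic) { return "generic"; }

// Specialised cross-architecture path: avx2-sized input, sse2-sized output.
inline std::string cast_kernel(avx2, sse2) { return "avx2->sse2"; }
```

Here cast_kernel(avx2{}, sse2{}) picks the specialised overload, while cast_kernel(sse2{}, sse2{}) falls back to the generic one; the real kernel::batch_cast overloads would dispatch the same way, but on requires_arch parameters and with the actual conversion logic in the body.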

Given that this is my use case, I'm happy to do the implementation work, but I thought I'd touch base first, since this breaks some of the conventions around handling architectures (i.e. we'd now have to deal with both an input and an output arch). A few questions:

  1. Are there any opinions on the code posted above?
  2. Is there a preference on the ordering of the template arguments?
  3. How should I handle the overloads of fast_cast? In the batch<int32_t, 8> to batch<int16_t, 8> example above, the input is an __m256i and the output is an __m128i, but the underlying intrinsic (_mm256_cvtepi32_epi16) was only introduced with AVX-512 (it requires AVX512F + AVX512VL). Which header should that overload go in, and what would its signature look like, given that the implementation arch is unrelated to both the input and output arch?

Any guidance would be much appreciated.

@serge-sans-paille
Contributor

Hey, and thanks for the detailed issue. There are plenty of difficult choices here :-) It's worth trying to split them into independent issues:

  1. How do we support intrinsics that are inherently cross-architecture (like an AVX builtin that operates on __m128 registers)? We currently have no generic support for this :-/ The closest construct we have is fma5<sse4.2>, fma5<avx>, etc. We could envision an architecture named xsimd::ext::avx2<avx512f>, which would mean avx2 with AVX-512 extensions.
  2. How do we handle cross-architecture casts? I like your approach here; I'd just order the template arguments as T_out, Arch, T_in, so that one can force the architecture of the output batch without specifying the input batch type.
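A compile-time sketch of that ordering (hypothetical, simplified stand-ins for the xsimd types): T_out comes first, the output architecture second with a sentinel default, and T_in/A_in are deduced from the argument, so a caller can force the output architecture without naming the input batch type.

```cpp
#include <type_traits>

// Stand-ins for xsimd types (hypothetical, for illustration only).
template <class T, class A>
struct batch {};
struct sse2 {};
struct avx2 {};
struct deduced {};  // sentinel: "derive the output arch from the input"

// Ordering T_out, A_out, then the deduced T_in / A_in.
template <class T_out, class A_out = deduced, class T_in, class A_in>
auto cast(batch<T_in, A_in> const&)
{
    using out_arch = typename std::conditional<
        std::is_same<A_out, deduced>::value, A_in, A_out>::type;
    return batch<T_out, out_arch>{};
}
```

With this shape, cast<short>(x) keeps the input's architecture, while cast<short, avx2>(x) forces avx2 for the output batch.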

I'd be happy to read @JohanMabille on that issue too ;-)

@astrohawk
Contributor Author

Hey @serge-sans-paille, sorry I've taken so long to respond. I guess I've been busier than I anticipated. I think adding both parts might be too much for me to take on, especially part 1 which I think will require some high-level architectural work to get right. I'm not sure I'm best placed to make some of the necessary design decisions, though I'm happy to look into implementing part 2 if you can come up with a concrete approach to part 1.
