
Feature request : Populate library with c++ STL like algorithm #5

Open
ThomasRetornaz opened this issue Jul 11, 2019 · 15 comments
Labels
enhancement New feature or request

Comments

@ThomasRetornaz

Hi, I'm migrating from boost::simd to nsimd.
It seems that nsimd doesn't provide high-level STL-like algorithms (transform, reduce, etc.).
I could try to implement them on my side.
Questions:
Where should they go?
Where should the tests go?

@gquintin gquintin added the enhancement New feature or request label Jul 12, 2019
@gquintin
Contributor

Hi Thomas,

The main goal of the library is to wrap intrinsics and provide C and C++ APIs. Thus STL-like algorithms, which are C++-only and higher level than intrinsics wrapping, are not meant to be in include/nsimd/nsimd.h. But a lot of C++-specific functionality makes sense to provide alongside the library, and it should be placed in include/nsimd/modules (to be mkdir'ed).

The policy is to provide those as separate modules so that the end-user, in your case, will have to do

#include <nsimd/modules/stl-algorithms.hpp>

The organization inside nsimd/modules is rather simple. If only one file is sufficient then put all in one header file. Otherwise, make this one file include other files: feel free to create a subdirectory stl-algorithms and then you are free to organize files inside it as needed.

As for STL-like algorithms, if you take code from Boost.SIMD, please be aware of the following:

  • There is a silent bug in the code that splits a range into three parts: the head, the body where SIMD is applied, and the tail. If memory serves, the bug appears when sizeof(T) > alignof(T).
  • A key feature was missing from Boost.SIMD: the unroll factor. It is important to be able to choose the unroll factor, a constexpr constant, when calling reduce, transform, etc. So I think two APIs must be provided:
    • The "classic" one, as found in the standard C++ STL.
    • A more SIMD-specific one that allows one to choose the unroll factor.
  • I do not know by heart which algorithms are C++ >= 11 only, but if one is available in C++98 then make the NSIMD one C++98 as well; a non-negligible number of developers are stuck with GCC 4.4 and we want them to be able to use NSIMD.
  • Try to write simple C++ code. Again, if memory serves, Boost.SIMD's ranges are only used in the implementations of the STL algorithms but were meant to be generic, general-purpose SIMD ranges. This led to complicated code that hides bugs such as the one above.
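A minimal scalar sketch of the two proposed APIs (hypothetical name `my_transform`; no NSIMD calls, the body is a plain loop so the snippet stands alone):

```cpp
#include <cstddef>

// Classic STL-style signature.
template <typename T, typename UnaryOp>
T *my_transform(T const *first, T const *last, T *out, UnaryOp f) {
  for (; first != last; ++first, ++out)
    *out = f(*first);
  return out;
}

// SIMD-specific overload: the unroll factor is a compile-time constant.
template <int Unroll, typename T, typename UnaryOp>
T *my_transform(T const *first, T const *last, T *out, UnaryOp f) {
  std::ptrdiff_t n = last - first, i = 0;
  for (; i + Unroll <= n; i += Unroll)
    for (int k = 0; k < Unroll; ++k) // inner body the compiler can unroll
      out[i + k] = f(first[i + k]);
  for (; i < n; ++i) // scalar epilogue for the remaining elements
    out[i] = f(first[i]);
  return out + n;
}
```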

@ThomasRetornaz
Author

ThomasRetornaz commented Aug 7, 2019

Hi, thanks for this useful explanation of the potential pitfalls.
So there are two possibilities:

  • Don't use ranges.

  • Use SFINAE to allow the SIMD loop only for arithmetic types, for which we are sure that sizeof(T) == alignof(T).

What do you think?

For loop unrolling, OK for two functions.

Thanks
Regards
TR

@gquintin
Contributor

gquintin commented Aug 30, 2019

Hi Thomas,

Sorry for the late answer.

I am not sure why you need SFINAE; I mean, just take the arguments as "standard" template parameters. I think the reason is that you want to check that the given types are what the function expects. From my point of view this is the role of concepts, which are not yet part of the standard (only a TS, as of 2019) and will therefore not be broadly available in the industry anytime soon. "Standard" templates are not meant to do this kind of checking, plus it makes compilation times longer, so I suggest not using SFINAE for this.

As for sizeof(T) == alignof(T) for arithmetic types, I think this is not guaranteed by the standard, and you can "force" the compiler to produce a type that does not respect it:

#include <iostream>

typedef double foo __attribute__ ((aligned (64)));
alignas(64) double bar;
double baz __attribute__ ((aligned (64)));

int main(int argc, char *argv[]) {
    std::cout << "foo sizeof: " << sizeof(foo) << " alignof: " << alignof(foo) << "\n";
    std::cout << "bar sizeof: " << sizeof(bar) << " alignof: " << alignof(decltype(bar)) << "\n";
    std::cout << "baz sizeof: " << sizeof(baz) << " alignof: " << alignof(decltype(baz)) << "\n";
}

The above snippet is from https://stackoverflow.com/questions/46457449/is-it-always-the-case-that-sizeoft-alignoft-for-all-object-types-t.

So I suggest taking all three cases into account:

  • sizeof(T) < alignof(T)
  • sizeof(T) == alignof(T)
  • sizeof(T) > alignof(T)
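A short illustration of the three cases. For standard complete object types, sizeof is always a multiple of alignof, so sizeof(T) < alignof(T) is only reachable via a compiler extension such as GCC's aligned attribute on a typedef (the same trick as in the snippet above); this sketch assumes GCC or Clang:

```cpp
#include <cstddef>

// sizeof(T) > alignof(T): a struct of chars has alignment 1.
struct Big { char c[48]; };

// sizeof(T) < alignof(T): GCC extension, sizeof stays 8 while the
// alignment requirement becomes 64.
typedef double overaligned_double __attribute__((aligned(64)));

static_assert(sizeof(Big) > alignof(Big), "case sizeof > alignof");
static_assert(sizeof(int) == alignof(int), "typical scalar case");
static_assert(sizeof(overaligned_double) < alignof(overaligned_double),
              "extension-only case");
```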

@ThomasRetornaz
Author

ThomasRetornaz commented Apr 27, 2021

Hello, I'm restarting on this issue because I want to switch some code using another SIMD lib over to nsimd.
Before I dive into the alignof issue (see the comment above), I started with a simple std::transform-like function which checks whether both pointers are aligned or not. I don't use a prologue and so on for now.

I'm stuck on how to deal with tests for this kind of function (i.e. for STL-like algorithms in general).
I could for the moment write a simple main, but I think you want to cover all archs and types.
So do you have any clues on that? Should I try to add a skeleton in egg/gen_tests.py?

and yes of course thanks for the impressive work on the library 👍

Regards

@gquintin
Contributor

Hi ThomasRetornaz,

(Thank you for the impressive work)
I will first try to answer your question as best I can, then I will make a comment on STL-like algorithms.

My answer

You are right, we test on all types and SIMD extensions. Something like egg/modules/tet1d/hatch.py should do, more precisely the gen_tests_for function contained in the latter. For std::transform, as there are two "versions", one for unary operators and one for binary operators, you can maybe write tests for all operators in NSIMD this way in the file (egg/modules/stl/hatch.py):

def gen_test(opts):
    for op_name, operator in operators.operators.items():
        if len(operator.params) == 2: # return type + (1 arguments type)
            # generate C++:
            # std::transform(a.begin(), a.end(), b.begin(), [](arg){
            #     return op_name(arg);
            # });
            # for (int i = 0 to n) { nsimd_scalar_op_name(arg); }
            # Check that the for-loop and the std::transform compute the same things
        if len(operator.params) == 3: # return type + (2 arguments types)
            # generate C++:
            # std::transform(a.begin(), a.end(), b.begin(), c.begin(), [](arg1, arg2){
            #     return op_name(arg1, arg2);
            # });
            # for (int i = 0 to n) { nsimd_scalar_op_name(arg1, arg2); }
            # Check that the for-loop and the std::transform compute the same things

If it works the way I think it does, the NSIMD operators that are eligible for the STL algorithms are the same ones that are eligible to be used in expression templates and SPMD kernels. One has to check:

for op_name, operator in operators.operators.items():
    if not operator.has_scalar_impl:
        continue

Note also that the other two NSIMD modules do not split the interval to use aligned loads and stores. You can write a first version of std::transform that does not do that either and directly uses unaligned loads and stores. It would be interesting to see whether using unaligned loads/stores on aligned data results in a loss of performance. I hope this gives you all the answers you need. Also, do not hesitate to consult the bottom of this page on how to add a new module so that your module and its documentation can be easily and automatically integrated into NSIMD.
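The check the generated tests would perform can be sketched in plain C++: run the STL algorithm and a plain scalar loop on the same input and compare element by element (simple negation stands in for an NSIMD operator here; the function names mirror the generator pseudocode above):

```cpp
#include <algorithm>
#include <cstddef>

// Reference computation: the scalar for-loop.
void compute_result(float *dst, float const *src, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    dst[i] = -src[i]; // stand-in for nsimd_scalar_<op>_<type>
}

// Computation under test: the STL algorithm.
void compute_output(float *dst, float const *src, std::size_t n) {
  std::transform(src, src + n, dst, [](float x) { return -x; });
}
```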

My comment

In general, performance-wise, STL-like algorithms are not optimized properly by compilers. I am not talking about a single transform but about the case when several algorithms are chained. For example, consider the following snippet of code:

#include <algorithm>
#include <cmath>

void foo(float *a, int n) {
    std::transform(a, a + n, a, [](float item) {
        return std::abs(item);
    });
    std::transform(a, a + n, a, [](float item) {
        return std::sqrt(item);
    });
}

#include <nsimd/modules/tet1d.hpp>

void bar(float *a, int n) {
    tet1d::out(a) = tet1d::sqrt(tet1d::abs(tet1d::in(a, n)));
}

The foo function makes use of "standard" C++ but is verbose and will be slower than the bar function. You can compare the assembly in Godbolt: https://godbolt.org/z/xYeebMosM.

You will see that the two std::transform calls are not merged into one. Moreover, notice that the second std::transform will not be autovectorized. Why? The standard C/C++ (std::)sqrt sets errno for negative values. As this behavior cannot be reproduced using only the intrinsics _mm_sqrt_ps and _mm_sqrt_ss, the compiler always calls "libm.sqrtf". If you want the second transform to be vectorized (which is, I think, what you want most of the time) you have to pass the compiler the -fno-math-errno flag.

I advise you to write an old-style C for-loop or use an expression template mechanism, not a C++ iterator-based for-loop, because std::vector does not play well with intrinsics (https://godbolt.org/z/xbrqxWMGr):

void foo(std::vector<float> &v) {
    for (int i = 0; i < v.size(); i += 4) {
        _mm_storeu_ps(&v[i],
            _mm_sqrt_ps(_mm_loadu_ps(&v[i]))
        );
    }
}

gives

.L3:
        lea     rax, [rcx+rdx*4]
        add     rdx, 4
        movups  xmm1, XMMWORD PTR [rax]
        sqrtps  xmm0, xmm1
        movups  XMMWORD PTR [rax], xmm0
        mov     rcx, QWORD PTR [rdi]
        mov     rax, QWORD PTR [rdi+8]
        sub     rax, rcx
        sar     rax, 2
        cmp     rax, rdx
        ja      .L3

while the "C" version:

void bar(std::vector<float> &v_) {
    float *v = v_.data();
    int n = v_.size();
    for (int i = 0; i < n; i += 4) {
        _mm_storeu_ps(&v[i],
            _mm_sqrt_ps(_mm_loadu_ps(&v[i]))
        );
    }
}

gives

.L8:
        movups  xmm1, XMMWORD PTR [rax]
        add     rax, 16
        sqrtps  xmm0, xmm1
        movups  XMMWORD PTR [rax-16], xmm0
        cmp     rax, rdx
        jne     .L8

@ThomasRetornaz
Author

Thanks for the detailed answer 👍 and the self-explanatory examples.
I'm not familiar with spmd and tet1d (yet) but it seems that I could use tet1d for my use case (i.e. a "generic" image processing library in "modern" C++ where I bind low-level operators subs/adds/min/max/shift etc. if allowed by the "type" of the image, through template specialization).
So is there an interest (for the community) in providing STL-like algorithms (not just transform and reduce) even with the drawbacks you mention?
Anyway, I think I will do the exercise in order to better understand nsimd.

@gquintin
Contributor

I believe there is an interest in providing STL-like algorithms to let people port their code, especially if it supports GPUs. It would be really interesting to do that, in fact. Again, the fact that kernels won't be merged will imply a loss in performance, but it would be a really good first step for C++ people who need to port their algorithms.

@ThomasRetornaz
Author

ThomasRetornaz commented May 18, 2021

Hi, I'm stuck finding the canonical way to compute the step size as a function of the pack.
E.g.:

template <NSIMD_CONCEPT_VALUE_TYPE L, NSIMD_CONCEPT_VALUE_TYPE T,
          typename UnaryOp>
NSIMD_REQUIRES(sizeof_v<L> == sizeof_v<T>)
T *transform(L const *first, L const *last, T *out, UnaryOp f) {

  typedef typename nsimd::pack<L> pack_L;
  typedef typename nsimd::pack<T> pack_T;

  //const std::size_t alignment = NSIMD_MAX_ALIGNMENT;

  // Define loop counter and range
  const int step_simd = ???????
  const std::ptrdiff_t range_size = std::distance(first, last);

  const std::ptrdiff_t size_simd_loop =
      (range_size > step_simd) ? ((std::ptrdiff_t)step_simd *
                                  ((range_size) / (std::ptrdiff_t)step_simd))
                               : (std::ptrdiff_t)0;

  std::ptrdiff_t i = 0;

  //---main simd loop
  for (; i < size_simd_loop; i += step_simd) {
    pack_L element, res;
    element = nsimd::loadu<pack_L>(
        first); //[Q] if we check alignment we could use aligned
                // load/store, right?
    res = f(element);
    nsimd::storeu(out, res);
    first += step_simd;
    out += step_simd;
  }

  //---epilogue
  for (; i < range_size; ++i) {
    *out++ = f(*first++);
  }
  return out;
}

After that,
what prevents me from checking is_aligned(first) and is_aligned(out) and doing aligned loads?
How do you want to deal with an unroll factor?
Is this code suitable for GPU cases?

Thanks
Regards TR

@gquintin
Contributor

Hi Thomas, I think what you are looking for is const int step_simd = nsimd::len(nsimd::pack<L>());

You are right, you can check whether the pointers are aligned or not and use the proper loads/stores. For convenience we added templated versions of loads and stores parameterized by the alignment so that you have to write your kernel only once. To take GPUs into account (though I think that for CUDA it will be impossible without some help from the end-user), I think you should separate the kernels from the implementation of std::transform:

// WARNING: this is more pseudo-code than C++

template <NSIMD_CONCEPT_VALUE_TYPE L, NSIMD_CONCEPT_VALUE_TYPE T,
          NSIMD_CONCEPT_ALIGNMENT InputAlignment,
          NSIMD_CONCEPT_ALIGNMENT OutputAlignment,
          typename UnaryOp>
NSIMD_REQUIRES((sizeof_v<L> == sizeof_v<T>))
T *transform_impl(L const *first, L const *last, T *out, UnaryOp f) {
  typedef typename nsimd::pack<L> pack_L;

  const int step_simd = nsimd::len(pack_L());
  const std::ptrdiff_t n = std::distance(first, last);
  nsimd::nat i = 0;

  for (; i + step_simd <= n; i += step_simd) {
    nsimd::store<OutputAlignment>(out + i, f(
      nsimd::load<InputAlignment, pack_L>(first + i)));
  }

  for (; i < n; ++i) {
    out[i] = f(first[i]);
  }
  return out;
}

template <NSIMD_CONCEPT_VALUE_TYPE L, NSIMD_CONCEPT_VALUE_TYPE T,
          typename UnaryOp>
NSIMD_REQUIRES((sizeof_v<L> == sizeof_v<T>))
T *transform(L const *first, L const *last, T *out, UnaryOp f) {
  if (first and out are aligned) {
    return transform_impl<L, T, nsimd::aligned, nsimd::aligned, UnaryOp>(first, last, out, f);
  } else if (only first is aligned) {
    return transform_impl<L, T, nsimd::aligned, nsimd::unaligned, UnaryOp>(first, last, out, f);
  } else if (only out is aligned) {
    return transform_impl<L, T, nsimd::unaligned, nsimd::aligned, UnaryOp>(first, last, out, f);
  } else {
    return transform_impl<L, T, nsimd::unaligned, nsimd::unaligned, UnaryOp>(first, last, out, f);
  }
}
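The alignment tests in the dispatcher above could be written with a small helper (hypothetical name; NSIMD_MAX_ALIGNMENT would be the real bound, 64 is a stand-in here):

```cpp
#include <cstddef>
#include <cstdint>

// True when p is a multiple of the given alignment (a power of two).
inline bool is_aligned(void const *p, std::size_t alignment = 64) {
  return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```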

For GPUs, the problem is CUDA. For HIP and oneAPI there is no need to annotate (__device__) a function for it to be compiled as device code. For CUDA I believe (to be checked) that this is mandatory. But this would result in non-portable code from the user's point of view: they would have to annotate their lambdas with __device__. Maybe we can propose a macro that could be called NSIMD_DEVICE:

#ifdef NSIMD_CUDA
#define NSIMD_DEVICE __device__
#else
#define NSIMD_DEVICE
#endif

in which case the transform_impl would become something like this:

// WARNING: this is more pseudo-code than C++

#ifdef NSIMD_CUDA

template <NSIMD_CONCEPT_VALUE_TYPE L, NSIMD_CONCEPT_VALUE_TYPE T,
          typename UnaryOp>
NSIMD_REQUIRES((sizeof_v<L> == sizeof_v<T>))
__global__ void transform_impl(L const *first, L const *last, T *out, UnaryOp f) {
  int i = blockIdx.x ...;
  if (i < last - first) {
    out[i] = f(first[i]);
  }
}

#else

template <NSIMD_CONCEPT_VALUE_TYPE L, NSIMD_CONCEPT_VALUE_TYPE T,
          NSIMD_CONCEPT_ALIGNMENT InputAlignment,
          NSIMD_CONCEPT_ALIGNMENT OutputAlignment,
          typename UnaryOp>
NSIMD_REQUIRES((sizeof_v<L> == sizeof_v<T>))
T *transform_impl(L const *first, L const *last, T *out, UnaryOp f) {
  typedef typename nsimd::pack<L> pack_L;

  const int step_simd = nsimd::len(pack_L());
  const std::ptrdiff_t n = std::distance(first, last);
  nsimd::nat i = 0;

  for (; i + step_simd <= n; i += step_simd) {
    nsimd::store<OutputAlignment>(out + i, f(
      nsimd::load<InputAlignment, pack_L>(first + i)));
  }

  for (; i < n; ++i) {
    out[i] = f(first[i]);
  }
  return out;
}

#endif

template <NSIMD_CONCEPT_VALUE_TYPE L, NSIMD_CONCEPT_VALUE_TYPE T,
          typename UnaryOp>
NSIMD_REQUIRES((sizeof_v<L> == sizeof_v<T>))
T *transform(L const *first, L const *last, T *out, UnaryOp f) {
#ifdef NSIMD_CUDA
  // we suppose that pointers are GPU pointers
  // It is not the job of transform nor NSIMD to
  // handle memory copies between host and
  // devices
  transform_impl<<< ... >>>(first, last, out, f);
  return out;
#else
  if (first and out are aligned) {
    return transform_impl<L, T, nsimd::aligned, nsimd::aligned, UnaryOp>(first, last, out, f);
  } else if (only first is aligned) {
    return transform_impl<L, T, nsimd::aligned, nsimd::unaligned, UnaryOp>(first, last, out, f);
  } else if (only out is aligned) {
    return transform_impl<L, T, nsimd::unaligned, nsimd::aligned, UnaryOp>(first, last, out, f);
  } else {
    return transform_impl<L, T, nsimd::unaligned, nsimd::unaligned, UnaryOp>(first, last, out, f);
  }
#endif
}

Hope it helps. This just got out of my head, so there are surely errors in there. I did not take the time to check all my assumptions, especially on CUDA.

@ThomasRetornaz
Author

Thanks a lot for this inspirational answer, I will follow your tips and advice.
For the CUDA case, I will try to understand how it works and validate a POC.

Do you want such a "details" namespace for the impl?

Regards
TR

@gquintin
Contributor

Hi Thomas, I think we already use a detail or details namespace, so sure, you can do that.

@ThomasRetornaz
Author

Sorry for the delay.
I'm currently trying to implement the test suite and I have some problems with "supported" operators.
For instance in stl/hatch/gentest I try something like:

                        template<typename T>
                        struct UnaryOP
                        {{
                            UnaryOP(){{}}
                            T operator()(T const &a0) const
                            {{
                                return std::{op_name}(a0); 
                            }}
                            template<typename U> 
                            U operator()(U const &a0) const 
                            {{ 
                                return nsimd::{op_name}(a0, {typ}());
                            }}
                        }};

                        void compute_result({typ} *dst, {typ} const *tab0,
                                            unsigned int n) {{
                            UnaryOP<{typ}> op;                       
                            std::transform(tab0, tab0+n, dst, op);
                        }}

                        void compute_output({typ} *dst, {typ} const *tab0,
                                            unsigned int n) {{
                            UnaryOP<{typ}> op;                       
                            nsimd::stl_algorithms::transform(tab0, tab0+n, dst, op);
                        }}

But not all "nsimd" operators have an STL counterpart.
So I could:

  1. list a sub-list of operators in hatch.py::gen_test
  2. change egg/operators.py to add a has_stl_counterpart property

Or maybe I'm missing something trivial.
Regards
TR

@ThomasRetornaz
Author

My bad, I simply didn't think it through:

template<typename T>
struct UnaryOP
{{
    UnaryOP(){{}}
    T operator()(T const &a0) const
    {{
        return nsimd_scalar_{op_name}_{typ}(a0);
    }}
    template<typename U> 
    U operator()(U const &a0) const 
    {{ 
        return nsimd::{op_name}(a0);
    }}
}};

void compute_result({typ} *dst, {typ} const *tab0,
                    unsigned int n) {{
    UnaryOP<{typ}> op;                       
    std::transform(tab0, tab0+n, dst, op);
}}

void compute_output({typ} *dst, {typ} const *tab0,
                    unsigned int n) {{
    UnaryOP<{typ}> op;                       
    nsimd::stl_algorithms::transform(tab0, tab0+n, dst, op);
}}

does the job.
I will try to finish the transform-like functions; I may need some help on the GPU part.
Regards
TR

@gquintin
Contributor

Hi Thomas, I saw your 2 posts, I am really busy right now. As soon as I have some time I will answer you properly.

@gquintin
Contributor

Hi Thomas, I finally took the time to read your posts. You found the solution yourself by using the scalar versions of the operators. As for GPUs, NSIMD also provides functions to use inside kernels (everything is in include/nsimd/scalar_utilities.h):

void kernel() {
    // CPU scalar version
    nsimd::scalar_OPERATOR_TYPE(arg1, ..., argN);

    // GPU scalar version
    nsimd::gpu_OPERATOR_TYPE(arg1, ..., argN);
}
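The NSIMD_DEVICE idea from earlier in the thread can be exercised in plain C++; in this sketch the nsimd scalar call is replaced by a hypothetical local op so the snippet is self-contained, and under CUDA the macro would expand to __device__:

```cpp
#ifndef NSIMD_DEVICE
#define NSIMD_DEVICE // empty in plain C++; __device__ under NSIMD_CUDA
#endif

// The same function body compiles as host code here and as device
// code when NSIMD_DEVICE expands to __device__.
NSIMD_DEVICE inline float scalar_abs(float a) { return a < 0.0f ? -a : a; }

// A functor usable with transform on both CPU and GPU.
struct AbsOp {
  NSIMD_DEVICE float operator()(float a) const { return scalar_abs(a); }
};
```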
