
Dev #114

Open · wants to merge 24 commits into base: dev
Conversation

ThomasRetornaz

First pull request around issue #107

  • Add all_of, any_of, copy, copy_n, count, count_if, equal, fill, find, find_if, find_if_not, lexicographical_compare, max, max_element, min, min_element, none_of, reduce, replace, replace_if, transform, transform_reduce "STL"-like algorithms

  • Provide non-regression tests (validated on Visual Studio 2017 and GCC) and documentation

  • Please pay attention to the workaround I made around mask types; I may have missed something and a better approach may exist

  • I think a preliminary refactoring could be to move some useful unary/binary predicates into a dedicated header (see the sketch after this list)

Other fixes and/or proposals

  • fix the TestData& operator=(const TestData& other) assignment operator

  • reduce warnings (of course, the last commit could be dropped)
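
A rough sketch of what such a predicate header could contain (names, location and the exact SIMD overload are not final; `load_splat` and `cmp_eq` are existing simdpp operations, and the C++14 `auto` return is just for brevity):

```cpp
// simdpp/algorithm/predicates.h -- proposed location, sketch only
#include <simdpp/simd.h>

namespace simdpp {
namespace SIMDPP_ARCH_NAMESPACE {

// Equality-with-constant predicate: the scalar overload serves the
// prologue/epilogue loops, the vector overload serves the SIMD body.
template<typename T>
struct UnaryPredicateEqualValue
{
    explicit UnaryPredicateEqualValue(T value) : value_(value) {}

    bool operator()(T x) const { return x == value_; }

    template<typename V>                 // V: a simdpp vector whose lanes hold T
    auto operator()(const V& x) const    // returns the lane-wise comparison mask
    {
        V v = load_splat(&value_);       // broadcast value_ to every lane
        return cmp_eq(x, v);
    }

private:
    T value_;
};

} // namespace SIMDPP_ARCH_NAMESPACE
} // namespace simdpp
```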

Owner

@p12tic p12tic left a comment

Many thanks for the PR! I really like it :-)

I raised a number of comments, but it seems that there's nothing serious.

For most of the algorithms I think we could rewrite them to not use non-SIMD operations in the prologue and epilogue at all. We could do a single unaligned load that overlaps with the main aligned SIMD body, do the computations and then do unaligned store that also overlaps with the main aligned SIMD body. This would be faster in most cases, as the scalar code is multiple times slower than SIMD.
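
For a unary transform the shape would be roughly this (an untested sketch, not code from this PR; `scale_by_two` is only an illustration, and the real body would use aligned load/store starting at the next alignment boundary):

```cpp
#include <simdpp/simd.h>
#include <cstddef>

// out[i] = 2 * in[i] with no scalar prologue/epilogue: the first and last
// vectors are unaligned and may overlap the body, so a few elements are
// simply computed and written twice, which is harmless for a transform.
inline void scale_by_two(const float* in, std::size_t n, float* out)
{
    using vec = simdpp::float32<4>;                  // 4 lanes, for illustration
    const std::size_t N = vec::length;

    if (n < N) {                                     // smaller than one vector:
        for (std::size_t i = 0; i < n; ++i)          // the scalar loop is still needed
            out[i] = in[i] * 2.0f;
        return;
    }

    vec two = simdpp::make_float(2.0f);

    // Head: one unaligned vector covering in[0..N).
    vec head = simdpp::load_u(in);
    vec r0 = simdpp::mul(head, two);
    simdpp::store_u(out, r0);

    // Body: full vectors (the real code would align this part).
    std::size_t i = N;
    for (; i + N <= n; i += N) {
        vec v = simdpp::load_u(in + i);
        vec r = simdpp::mul(v, two);
        simdpp::store_u(out + i, r);
    }

    // Tail: one last unaligned vector ending exactly at in[n); it may overlap
    // the body instead of falling back to a scalar epilogue.
    vec tail = simdpp::load_u(in + (n - N));
    vec r1 = simdpp::mul(tail, two);
    simdpp::store_u(out + (n - N), r1);
}
```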

@@ -0,0 +1,69 @@
/* Copyright (C) 2018 Povilas Kanapickas <povilas@radix.lt>
Owner

Don't be shy about using your name :-) You wrote the code, so the copyright belongs to you.

Author

Thanks! I think we could share the copyright. Anyway, it's your lib :) and I just added new capabilities.

#include <simdpp/algorithm/helper_input_range.h>

namespace simdpp {
namespace SIMDPP_ARCH_NAMESPACE {
Owner

Indentation: please don't indent the namespace blocks. Also, the library uses 4-space indents.

Author

Fixed. I switch between different computers (Linux, Windows) and the default indentation is not the same across my IDEs.

template<typename T, typename UnaryPredicate>
bool all_of(T const* first, T const* last, UnaryPredicate pred)
{
#ifndef NDEBUG //precondition debug mode
Owner

I think we could add something like SIMDPP_DEBUG and use it throughout the library.
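
Something along these lines, for example (hypothetical, just to show the shape; the real definition would live in a config header):

```cpp
#include <stdexcept>

// Default to the usual NDEBUG convention, but let the build configuration
// override SIMDPP_DEBUG explicitly.
#ifndef SIMDPP_DEBUG
    #ifdef NDEBUG
        #define SIMDPP_DEBUG 0
    #else
        #define SIMDPP_DEBUG 1
    #endif
#endif

// An algorithm's precondition check would then look like:
template<typename T>
void check_range(const T* first, const T* last)
{
#if SIMDPP_DEBUG
    if (!first || !last || first > last)
        throw std::runtime_error("invalid range");
#else
    (void)first; (void)last;   // checks compiled out in release builds
#endif
}
```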

Author

Fixed: switched from NDEBUG to SIMDPP_DEBUG. Now we have to decide when/where we activate this flag depending on the input configuration.

throw std::runtime_error("all_of - null ptr last.");
#endif

using simd_type_T = typename typetraits<T>::simd_type;
Owner

I'd prefer type_traits. Is there by chance any conflict that prevents using this name?

Author

OK, I could switch to type_traits, or to simd_type_traits to avoid any conflicts.
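
For example, the renamed traits could look like this (a sketch only; simd_type_traits is a placeholder name, SIMDPP_FAST_*_SIZE are the library's existing native-width macros, and the alignment value is a simplification):

```cpp
#include <cstdint>
#include <simdpp/simd.h>

namespace simdpp {
namespace SIMDPP_ARCH_NAMESPACE {

// Map a scalar element type onto the "fast" vector width for the current arch.
template<typename T> struct simd_type_traits;

template<> struct simd_type_traits<float>
{
    using simd_type = float32<SIMDPP_FAST_FLOAT32_SIZE>;
    static const unsigned alignment = SIMDPP_FAST_FLOAT32_SIZE * sizeof(float);
};

template<> struct simd_type_traits<std::int32_t>
{
    using simd_type = int32<SIMDPP_FAST_INT32_SIZE>;
    static const unsigned alignment = SIMDPP_FAST_INT32_SIZE * sizeof(std::int32_t);
};

// ... the remaining element types follow the same pattern.

} // namespace SIMDPP_ARCH_NAMESPACE
} // namespace simdpp
```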


//prologue
auto lastprologue = first + size_prologue_loop;
if(!std::all_of(first, lastprologue, pred)) return false;
Owner

Space after if, return on next line.

Author

Fixed, and also fixed for any_of.

const auto predEqualTen = UnaryPredicateEqualValue<T>((T)10);
const auto predEqualFive = UnaryPredicateEqualValue<T>((T)5);
{ //test prologue
vector_t ivect = { (T)10,(T)10 };
Owner

I think for higher level algorithms we need more exhaustive testing, e.g. all small array lengths, plus all combinations of alignments of both the begin and end pointers of the sequence. It probably makes sense to create some kind of generic generator and use it in all tests.
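
Something along these lines, for example (a sketch; the names and the commented-out usage are placeholders, not existing test code):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// For every small length and every begin offset inside an over-allocated
// buffer, hand the sub-range to a checker so each algorithm is exercised on
// all prologue/epilogue shapes.
template<typename T>
void for_each_small_range(std::size_t max_len, std::size_t max_offset,
                          const std::function<void(const T*, const T*)>& check)
{
    std::vector<T> buffer(max_offset + max_len);
    for (std::size_t i = 0; i < buffer.size(); ++i)
        buffer[i] = static_cast<T>(i);                  // deterministic fill

    for (std::size_t len = 0; len <= max_len; ++len)
        for (std::size_t off = 0; off < max_offset; ++off)
            check(buffer.data() + off, buffer.data() + off + len);
}

// Usage idea: run both the std:: algorithm and the SIMD one on every shape
// and compare the results, e.g.
// for_each_small_range<int>(64, 16, [](const int* f, const int* l) {
//     /* compare std::count(f, l, 3) against the SIMD count(f, l, 3) here */
// });
```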

Author

OK, I will think about this and provide something ASAP.

//---prologue
for (; i < size_prologue_loop; ++i)
{
init = binary_op(init,unary_op(*first++));
Owner

I think it makes sense to do possibly overlapping SIMD operations for the prologue and epilogue and then mask out the extra lanes.

Author

Sorry, I don't understand this point. Could you give an example?

auto i = 0u;

//---prologue
for (; i < size_prologue_loop; ++i)
Owner

I think it makes sense to do possibly overlapping SIMD operations for the prologue and epilogue. Some elements will be computed twice, but the code will be much faster this way.

Author

Same as above.

@@ -84,10 +84,10 @@ inline Arch get_arch_string_list(const char* const strings[], int count, const c
return res;
#endif

int prefixlen = std::strlen(prefix);
for (int i = 0; i < count; ++i) {
auto prefixlen = std::strlen(prefix);
Owner

I'd prefer the explicit type for the simple numeric types. Auto makes the code less readable in these cases.

Author

OK, I'll use size_t instead.

@@ -90,6 +90,7 @@ class TestData {
TestData& operator=(const TestData& other)
{
data_ = other.data_;
return (*this);
Owner

Could we drop the parentheses? I think they're not needed in this case.

Author

Fixed.

@ThomasRetornaz
Author

Many thanks for the PR! I really like it :-)

Thanks!

For most of the algorithms I think we could rewrite them to not use non-SIMD operations in the prologue and epilogue at all. We could do a single unaligned load that overlaps with the main aligned SIMD body, do the computations and then do unaligned store that also overlaps with the main aligned SIMD body. This would be faster in most cases, as the scalar code is multiple times slower than SIMD.

I may be missing something, but as I noted above, we still need a prologue if the data length is too small to fit in the SIMD registers (e.g. 7 uints); that way we can use the simdpp:: functions transparently everywhere.
For the epilogue I think I understand the overlap concept, but maybe you could give me some hints on how to achieve it. Anyway, I will try on my side.

@p12tic
Owner

p12tic commented Mar 31, 2018

I may be missing something, but as I noted above, we still need a prologue if the data length is too small to fit in the SIMD registers (e.g. 7 uints); that way we can use the simdpp:: functions transparently everywhere.
For the epilogue I think I understand the overlap concept, but maybe you could give me some hints on how to achieve it. Anyway, I will try on my side.

Yes, if the total length is less than the width of the register, then the scalar part is needed. My point was that if you have, say, a range of 15 uint16 elements to process, it's faster to just process two overlapping 8-element waves instead of one full wave plus a 7-element scalar prologue/epilogue.
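
For an all_of-style check over those 15 elements the shape would be roughly this (an untested sketch; for a reduction you would additionally mask the overlapped lane with the identity element so it isn't accumulated twice):

```cpp
#include <simdpp/simd.h>
#include <cstdint>

// data points to exactly 15 uint16 elements; returns true iff all equal 10.
// Two overlapping 8-lane waves, no scalar prologue/epilogue: element 7 is
// simply tested twice, which is harmless for a predicate.
inline bool all_equal_10_15(const std::uint16_t* data)
{
    using vec = simdpp::uint16<8>;
    vec ten = simdpp::make_uint(10);

    vec wave1 = simdpp::load_u(data);        // elements [0, 8)
    vec wave2 = simdpp::load_u(data + 7);    // elements [7, 15)

    vec d1 = simdpp::bit_xor(wave1, ten);    // a lane is zero iff it equals 10
    vec d2 = simdpp::bit_xor(wave2, ten);
    vec diff = simdpp::bit_or(d1, d2);
    return simdpp::reduce_or(diff) == 0;     // true iff no lane differs anywhere
}
```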

ThomasRetornaz and others added 3 commits April 9, 2018 17:53
Follow review
* fix indent
* add "fuzzing" tests for all algorithms
* add TEST_EQUAL_COLLECTIONS
* add NRT helpers for generating data (to be moved elsewhere?)
* try to fix Visual 2013/2015 compilation issues
* enforce const/inline and noexcept for predicates
@Cazadorro

What is the purpose of having SIMDPP_NOEXECPT and changing inline to a custom macro?

@ThomasRetornaz
Author

What is the purpose of having SIMDPP_NOEXECPT and changing inline to a custom macro?

Regards
TR

ThomasRetornaz and others added 2 commits July 16, 2018 03:41
* Add google benchmark as ExternalProject
* Add three bench suites: transform "unary", reduce "unary", load/store

Todo:
* strange behavior in the transform bench suite: std seems faster than SIMD on MSVC 2017 <--- to be checked on gcc > 5
* add other cases
* add a binary flavor of the transform-reduce bench
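
A sketch of what the std:: side of one such bench case could look like (hypothetical names, not the PR's actual bench code; the SIMD flavor would mirror the same body with the library's transform so the two series are directly comparable):

```cpp
#include <benchmark/benchmark.h>
#include <algorithm>
#include <vector>

// Baseline: std::transform doubling a float array, driven over a range of sizes.
static void BM_unary_transform_std(benchmark::State& state)
{
    std::vector<float> in(state.range(0), 1.5f), out(state.range(0));
    for (auto _ : state) {
        std::transform(in.begin(), in.end(), out.begin(),
                       [](float x) { return x * 2.0f; });
        benchmark::DoNotOptimize(out.data());
        benchmark::ClobberMemory();
    }
    state.SetItemsProcessed(state.iterations() * state.range(0));
}
BENCHMARK(BM_unary_transform_std)->Range(8, 1 << 20);

BENCHMARK_MAIN();
```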