
Complex calculation using AVX2 is 3 times slower than normal #880

Open
YggSky opened this issue Jan 7, 2023 · 4 comments
@YggSky

YggSky commented Jan 7, 2023

I used your example and only modified the operation. The std::vector<double, xsimd::aligned_allocator> size is 1e8.

xsimd::sqrt((xsimd::cos(ba) + xsimd::sin(bb)) / 2) takes 12 s, which is slower than
std::sqrt((std::cos(a[i]) + std::sin(b[i])) / 2) at 4.4 s.

@serge-sans-paille
Contributor

Sorry, I cannot reproduce your timings. Here is the test program I've been using

```cpp
#include <cmath>    // std::sqrt, std::sin, std::cos
#include <cstdlib>  // std::atoi
#include <iostream>
#include <vector>
#include <xsimd/xsimd.hpp>

int main(int argc, char** argv)
{
  unsigned n = std::atoi(argv[1]);
  unsigned p = std::atoi(argv[2]);
  std::vector<double> x(n);
  std::vector<double> y(n);
  std::vector<double> out(n);
  for(unsigned i = 0; i < n; ++i) {
    x[i] = .00002 * i;
    y[i] = .00003 * i;
  }
  for(unsigned j = 0; j < p; ++j) {
#ifdef USE_XSIMD
    for(unsigned i = 0; i < n; i += xsimd::batch<double, xsimd::avx2>::size) {
      auto vout = xsimd::load_unaligned(&out[i]);
      auto vx = xsimd::load_unaligned(&x[i]);
      auto vy = xsimd::load_unaligned(&y[i]);
      vout += xsimd::sqrt((xsimd::cos(vx) + xsimd::sin(vy)) / 2.);
      vout.store_unaligned(&out[i]);
    }
#else
    for(unsigned i = 0; i < n; ++i) {
      out[i] += std::sqrt((std::cos(x[i]) + std::sin(y[i])) / 2);
    }
#endif
  }
  std::cout << out[n / p] << "\n";
  return 0;
}
```

compiled with

```sh
g++ test.cpp -O2 -mavx2 -o r -UUSE_XSIMD -DNDEBUG -Iinclude && time ./r 1000000 100
```

or

```sh
g++ test.cpp -O2 -mavx2 -o r -DUSE_XSIMD -DNDEBUG -Iinclude && time ./r 1000000 100
```

I get a consistent x2.5 speedup with xsimd on... same for -O3 and with clang.

@YggSky
Author

YggSky commented Jan 11, 2023

> Sorry, I cannot reproduce your timings. Here is the test program I've been using […]
>
> I get a consistent x2.5 speedup with xsimd on... same for -O3 and with clang.

I used your code and got the same result, even when I changed to n=1e8, p=1. But when I replaced the data with my input:

```cpp
for (unsigned i = 0; i < n; ++i) {
    // x[i] = .00002 * i;
    // y[i] = .00003 * i;
    x[i] = i;
    y[i] = std::sin(i);
}
```

the time is very different: with n=1e8, p=1 it is 12 s (xsimd) vs 4 s (scalar), while with n=1e6, p=1e2 it is 1.1 s (xsimd/avx2) vs 2.4 s (scalar). I see two possible reasons: 1. the input data, 2. the data size; either could affect the time cost.

@amyspark
Contributor

Here's a godbolt with the alternatives: https://godbolt.org/z/TosvEr9fz Both show the 2.5x speedup.

@YggSky
Author

YggSky commented Jan 15, 2023

> Here's a godbolt with the alternatives: https://godbolt.org/z/TosvEr9fz Both show the 2.5x speedup.

The code you linked uses n=1e6, p=1e2. With those parameters the speedup is real; I get the same result as you. But with n=1e8, p=1 the timings diverge, as I described above. When I change your code to

```cpp
// unsigned n = std::atoi("1000000");
// unsigned p = std::atoi("100");

unsigned n = std::atoi("100000000");
unsigned p = std::atoi("1");
```

it unfortunately can't execute on godbolt. With n = std::atoi("1000000") and p = std::atoi("100") the time certainly matches yours. So if you make n large enough, you will find that xsimd costs more; debugging shows the time is spent inside xsimd::sin and xsimd::cos. With small data, such as n=1e6, those functions are not the bottleneck, but at n=1e8, std::sqrt((std::cos(x[i]) + std::sin(y[i])) / 2) takes less time than std::sqrt((xsimd::cos(x[i]) + xsimd::sin(y[i])) / 2).

That greatly impacts xsimd::cos and xsimd::sin: with small data such as your n=1e6 there is not much difference, but at n=1e8 the results differ a lot.

Labels: None yet
Projects: None yet
Development: No branches or pull requests

3 participants