added vectorization to generate_n #6215

Open: Johan511 wants to merge 3 commits into master
Conversation

@Johan511 (Contributor) commented Mar 29, 2023

added vectorization to generate_n

hpx::generate_n calls std::generate_n; in parallel mode it splits the work into chunks and calls generate_n on each chunk.
Previously no execution policy was specified for the underlying std::generate_n call (so it defaulted to sequential execution). This PR changes that: the sequential kernel now uses seq or unseq depending on the hpx::execution policy requested by the user.

par_unseq: [scale benchmark plot]

par: [add benchmark plot]
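For illustration, the dispatch described above can be pictured roughly as follows. This is a minimal sketch, not the actual HPX code: the helper name is made up, std::true_type/std::false_type stand in for HPX's own execution-policy traits, and std::execution::unseq (C++20) stands in for whatever vectorization mechanism the library actually uses.

```cpp
#include <algorithm>
#include <execution>
#include <type_traits>
#include <utility>

// Illustrative helper only: the sequential kernel that each chunk runs
// forwards to the unsequenced std::generate_n overload when an unsequenced
// HPX policy was requested, and to the plain overload otherwise.
template <typename Iter, typename Size, typename F>
Iter sequential_generate_n(Iter first, Size count, F&& f, std::false_type)
{
    // sequenced fallback: plain std::generate_n
    return std::generate_n(first, count, std::forward<F>(f));
}

template <typename Iter, typename Size, typename F>
Iter sequential_generate_n(Iter first, Size count, F&& f, std::true_type)
{
    // unsequenced path: allow the standard library/compiler to vectorize the fill
    return std::generate_n(std::execution::unseq, first, count, std::forward<F>(f));
}
```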

@hkaiser (Member) commented Mar 31, 2023

@Johan511 could you create graphs that use the same y-axis limits, please?

@hkaiser (Member) commented Mar 31, 2023

The code you proposed touches on sequential operations only. Could you measure the sequential speedup as well?

@Johan511 (Contributor, author) commented Apr 1, 2023

The change works for par_unseq too, as the parallel version of generate_n works by calling the sequential generate_n on chunks. I will post speedups for unseq soon.

@Johan511 (Contributor, author) commented Apr 8, 2023

Discarded runs which took more than 0.4 ms.

unseq (mean: 0.15 ms): [scale benchmark plot]

seq (mean: 0.2 ms): [add benchmark plot]

@Johan511 (Contributor, author) commented Apr 9, 2023

Please note that the performance gains are actually not very significant.
The reason is that std::generate_n has very similar performance when compiled with the -O3 flag. Google Benchmark results are attached.

Without -O3 flag:

Benchmark            Time          CPU           Iterations
BM_gen_n_par         6889328 ns    6888372 ns            76
BM_gen_n_par_unseq   2665548 ns    2665325 ns           257

With -O3 flag:

Benchmark            Time          CPU           Iterations
BM_gen_n_par          124210 ns     124210 ns          4038
BM_gen_n_par_unseq    159027 ns     159020 ns          4949
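For reference, a rough sketch of the kind of Google Benchmark harness behind numbers like these (the buffer size and generator are guesses, and the main()/runtime setup is omitted; the real benchmark additionally has to start the HPX runtime, e.g. via hpx::start or hpx_main, before the parallel policies can be used):

```cpp
#include <benchmark/benchmark.h>
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <vector>

// Fill a buffer with a constant via hpx::generate_n under different policies.
static void BM_gen_n_par(benchmark::State& state)
{
    std::vector<int> v(1 << 22);
    for (auto _ : state)
        hpx::generate_n(hpx::execution::par, v.begin(), v.size(), [] { return 42; });
}
BENCHMARK(BM_gen_n_par);

static void BM_gen_n_par_unseq(benchmark::State& state)
{
    std::vector<int> v(1 << 22);
    for (auto _ : state)
        hpx::generate_n(hpx::execution::par_unseq, v.begin(), v.size(), [] { return 42; });
}
BENCHMARK(BM_gen_n_par_unseq);
```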

@hkaiser (Member) commented Apr 9, 2023

You should always enable all optimizations for performance measurements.

@Johan511 (Contributor, author) commented Apr 9, 2023

The -O3 flag already tries to vectorize most loops. Should I try compiling HPX with the -O2 flag and compare the performance of the vectorized and non-vectorized versions?

The gains from explicit vectorization often appear minimal because -O3 already vectorizes the loops on its own.

@hkaiser (Member) commented May 3, 2023

@Johan511 could you please rebase this onto master, now that the release is out?

@StellarBot

Performance test report

HPX Performance

Comparison

Benchmark   FORK_JOIN_EXECUTOR   PARALLEL_EXECUTOR   SCHEDULER_EXECUTOR
For Each    -                    ??                  -

Info

Property       Before                                     After
HPX Datetime   2023-05-10T12:07:53+00:00                  2023-05-16T21:41:46+00:00
HPX Commit     dcb5415                                    3f93250
Clustername    rostam                                     rostam
Datetime       2023-05-10T14:50:18.616050-05:00           2023-05-16T17:00:01.775607-05:00
Compiler       /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1   /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1
Hostname       medusa08.rostam.cct.lsu.edu                medusa08.rostam.cct.lsu.edu
Envfile

Comparison

Benchmark                                              NO-EXECUTOR
Future Overhead - Create Thread Hierarchical - Latch   -

Info

Property       Before                                     After
HPX Datetime   2023-05-10T12:07:53+00:00                  2023-05-16T21:41:46+00:00
HPX Commit     dcb5415                                    3f93250
Clustername    rostam                                     rostam
Datetime       2023-05-10T14:52:35.047119-05:00           2023-05-16T17:02:24.950778-05:00
Compiler       /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1   /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1
Hostname       medusa08.rostam.cct.lsu.edu                medusa08.rostam.cct.lsu.edu
Envfile

Comparison

Benchmark (columns: FORK_JOIN_EXECUTOR_DEFAULT_FORK_JOIN_POLICY_ALLOCATOR, PARALLEL_EXECUTOR_DEFAULT_PARALLEL_POLICY_ALLOCATOR, SCHEDULER_EXECUTOR_DEFAULT_SCHEDULER_EXECUTOR_ALLOCATOR)
Stream Benchmark - Add     (=)   (=)   (=)
Stream Benchmark - Scale   (=)   (=)   (=)
Stream Benchmark - Triad   (=)   (=)   (=)
Stream Benchmark - Copy    (=)   (=)   (=)

Info

Property       Before                                     After
HPX Datetime   2023-05-10T12:07:53+00:00                  2023-05-16T21:41:46+00:00
HPX Commit     dcb5415                                    3f93250
Clustername    rostam                                     rostam
Datetime       2023-05-10T14:52:52.237641-05:00           2023-05-16T17:02:44.582158-05:00
Compiler       /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1   /opt/apps/llvm/13.0.1/bin/clang++ 13.0.1
Hostname       medusa08.rostam.cct.lsu.edu                medusa08.rostam.cct.lsu.edu
Envfile

Explanation of Symbols

Symbol     Meaning
=          No performance change (confidence interval within ±1%)
(=)        Probably no performance change (confidence interval within ±2%)
(+)/(-)    Very small performance improvement/degradation (≤1%)
+/-        Small performance improvement/degradation (≤5%)
++/--      Large performance improvement/degradation (≤10%)
+++/---    Very large performance improvement/degradation (>10%)
?          Probably no change, but quite large uncertainty (confidence interval within ±5%)
??         Unclear result, very large uncertainty (±10%)
???        Something unexpected…

@hkaiser (Member) commented Jul 23, 2023

inspect was reporting:

/libs/core/algorithms/include/hpx/parallel/algorithms/detail/generate.hpp

*I* missing #include (type_traits) for symbol std::true_type on line 57

Please rebase one more time to pull in all changes from master.
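Presumably the fix for the inspect warning is just adding the missing header near the top of the flagged file, along these lines:

```cpp
// libs/core/algorithms/include/hpx/parallel/algorithms/detail/generate.hpp
#include <type_traits>    // for std::true_type / std::false_type
```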

Signed-off-by: Johan511 <harihara.sn@gmail.com>
@srinivasyadav18 (Contributor) commented

@hkaiser I have rebased and added unit tests to ensure everything is working with the generate_n algorithm under the unseq and par_unseq execution policies.

Regarding performance: seq and unseq generate almost the same assembly in Release mode (which uses -O3). Because the compiler is able to vectorize the loops in seq mode as well, there seem to be no extra gains from unseq.
However, compared against a build with -O3 -fno-tree-vectorize, which disables auto-vectorization, there is a 3-5x speedup.
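Not the actual tests added in this PR, but a minimal sketch of the kind of check meant here (HPX runtime setup via hpx_main/hpx::init is omitted; sizes and values are arbitrary):

```cpp
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <algorithm>
#include <cassert>
#include <vector>

void test_generate_n_unseq()
{
    std::vector<int> v(10007, 0);

    // unsequenced (single-threaded, vectorized) policy
    hpx::generate_n(hpx::execution::unseq, v.begin(), v.size(), [] { return 7; });
    assert(std::all_of(v.begin(), v.end(), [](int x) { return x == 7; }));

    // parallel + unsequenced policy
    hpx::generate_n(hpx::execution::par_unseq, v.begin(), v.size(), [] { return 9; });
    assert(std::all_of(v.begin(), v.end(), [](int x) { return x == 9; }));
}
```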

@hkaiser (Member) commented Oct 23, 2023

retest lsu

@hkaiser (Member) commented Oct 23, 2023

Can we construct test cases where the compiler is not able to vectorize things on its own?

@hkaiser hkaiser removed this from the 1.10.0 milestone May 3, 2024
@hkaiser hkaiser added this to the 1.11.0 milestone May 3, 2024