Add a SSE2 fast path for AMD GPU #827

WickedShell · 2022-08-23T00:50:52Z

This does a couple of things:

Rearranges the core structure to be a structure of arrays rather than an array of structures. This improves cache hits when summarizing the results, saves 400 bytes of stack space due to better alignment, and is a speedup by itself.
Moves to storing the result directly in the GPU structure and memcpying it. This saves us from some handling of fields that aren't actually exported, and is a bit less future maintenance.
Adds support for using SSE2 to summarize the results. There's a bit more that could be made faster, particularly if we raised the minimum target above SSE2, but SSE2 is guaranteed on any 64-bit x86 build, which seemed like a reasonable minimum. I've done some loose benchmarking on my machine that shows this is faster. I still need to formalize the results now that I've pushed an actual coherent branch rather than just the experiments.
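
To illustrate the kind of SSE2 summarizing loop described above, here is a minimal sketch. It is not the code in this branch: the helper names `sum_u16` and `average_u16` are hypothetical, and it assumes the samples are 16-bit counters stored contiguously (as they would be in a structure-of-arrays layout). It falls back to a scalar loop when `__SSE2__` is not defined.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 integer intrinsics
#endif

// Hypothetical helper: sums `count` 16-bit samples into a 32-bit total.
// The 32-bit accumulators are assumed to be wide enough for the sample count.
inline uint32_t sum_u16(const uint16_t* samples, std::size_t count) {
    uint32_t total = 0;
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();
    // Process 8 samples per iteration, widening each to 32 bits before adding.
    for (; i + 8 <= count; i += 8) {
        const __m128i v =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(samples + i));
        acc = _mm_add_epi32(acc, _mm_unpacklo_epi16(v, zero));  // low 4 samples
        acc = _mm_add_epi32(acc, _mm_unpackhi_epi16(v, zero));  // high 4 samples
    }
    // Horizontal sum of the four 32-bit lanes.
    alignas(16) uint32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    // Scalar tail (or the whole array when SSE2 is unavailable).
    for (; i < count; ++i) total += samples[i];
    return total;
}

// Hypothetical helper: truncating integer average of the samples.
inline uint16_t average_u16(const uint16_t* samples, std::size_t count) {
    return count ? static_cast<uint16_t>(sum_u16(samples, count) / count) : 0;
}
```

Keeping the scalar tail outside the `#if` means the same function serves as the non-SSE2 fallback, so only one code path needs to be maintained for correctness checks.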

Open questions with this work:

Do we want any manually written SIMD loops included in the build? It makes readability a bit worse, but since it is hidden in the macro it might not be too bad.
Verify the timings to justify it. (Ideally on something with an APU, such as a Steam Deck.)
I've never worked with Meson before. The SIMD detection appears to be working, but I think what I've currently presented doesn't actually allow you to disable SSE2 if the build machine supports it. Is there a better way I should be probing for SSE2 support?
Remove the debug timing commits.

This saves memory because of the difference in structure padding. As a side effect, storage of unused fields has been removed, which saves further time and effort.
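
The padding effect can be seen in a small sketch. The field names, field types, and sample count below are assumptions chosen for illustration, not the actual metrics structure from this branch. With mixed 16-bit and 32-bit fields, the array-of-structures form pays for padding in every sample, while the structure-of-arrays form packs each field tightly.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed sample count, for illustration only.
constexpr std::size_t kSamples = 20;

// Array-of-structures: on ABIs where float is 4-byte aligned, each sample
// carries padding after both 16-bit fields.
struct SampleAoS {
    uint16_t load_percent;  // 2 bytes, typically followed by 2 bytes of padding
    float    power_w;       // 4 bytes
    uint16_t temp_c;        // 2 bytes, typically followed by 2 bytes of tail padding
};

// Structure-of-arrays: each field is stored contiguously, widest type first,
// so no per-sample padding is needed.
struct SamplesSoA {
    float    power_w[kSamples];
    uint16_t load_percent[kSamples];
    uint16_t temp_c[kSamples];
};

constexpr std::size_t aos_bytes = sizeof(SampleAoS) * kSamples;
constexpr std::size_t soa_bytes = sizeof(SamplesSoA);

static_assert(soa_bytes <= aos_bytes,
              "SoA layout should never be larger than the AoS layout here");
```

The exact savings depend on the real field mix and the target ABI, so the numbers from this toy layout should not be read as the 400 bytes quoted for the PR.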

Joshua-Ashton · 2022-08-23T02:49:01Z

I can see moving from AOS to SOA making a difference, but.

Does the SSE2 stuff actually make any difference? I am guessing not? Compilers are good at vectorizing in $CURRENT_YEAR

(pls provide numbers + compile flags you used)
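
One way to gather such numbers is a small timing harness like the sketch below. It is an assumption about how the comparison could be run, not something from this branch: `time_ns_per_call` is a hypothetical helper, and the callable passed in would wrap either the scalar or the SSE2 summarizing routine, each built with the same optimization flags.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical harness: returns the average wall-clock nanoseconds per call.
// `fn` must return a value convertible to uint64_t; accumulating it into a
// volatile sink keeps the compiler from optimizing the calls away.
template <typename Fn>
double time_ns_per_call(Fn&& fn, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    volatile uint64_t sink = 0;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        sink = sink + static_cast<uint64_t>(fn());
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return iterations ? elapsed.count() / static_cast<double>(iterations) : 0.0;
}
```

Reporting the result for both variants alongside the compiler version and flags would show whether the hand-written path beats what the auto-vectorizer already produces.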

mupuf · 2022-08-24T06:16:55Z

src/amdgpu.cpp

-	uint16_t soc_temp_c;
-	uint16_t gpu_temp_c;
-	uint16_t apu_cpu_temp_c;
+#ifdef AMG_GPU_TEMP_MONITORING


AMG? Did you mean AMD?

stephanlachnit · 2022-10-11T15:58:06Z

meson.build

+# Check for SSE2
+if cc.compiles('''#include <emmintrin.h>
+                  int main() {
+                    __m128 v1 = _mm_set1_ps(-1.0f);
+                    __m128 v2 = _mm_set1_ps(1.0f);
+                    v1 = _mm_add_ps(v1, v2);
+                    float sum[4];
+                    _mm_store_ps(sum, v1);
+                    return (int)sum[0];
+                  }''',
+               name : 'SSE2 support',
+               args : '-msse2')
+  pre_args += '-DUSE_SSE2'
+  pre_args += '-msse2'
+endif


Maybe use the SIMD module?

WickedShell added 3 commits August 13, 2022 16:54

Swap to an array form of the metrics buffer

52dbfd4

This saves memory because of the difference in structure padding. As a side effect, storage of unused fields has been removed, which saves further time and effort.

Add SIMD support for AMD GPU handling

7b9b361

Add some timing functions

db7ce4f

mupuf reviewed Aug 24, 2022

stephanlachnit reviewed Oct 11, 2022

jackun force-pushed the master branch from 35f027e to 328df38 Compare April 7, 2023 11:29
