Improving Multi-processor Performance #959

Open
seanlaw opened this issue Feb 13, 2024 · 3 comments
Labels
enhancement · help wanted

Comments


seanlaw commented Feb 13, 2024

This new paper, titled "Exploring Multiprocessor Approaches to Time Series Analysis", claims to significantly improve the performance of matrix profile calculations. We should consider looking into this.

Additionally, we should consider whether it may be practical to refactor some parts of our code along the lines of cache-oblivious algorithms.

@seanlaw seanlaw added the enhancement and help wanted labels Feb 13, 2024

JaKasb commented Feb 13, 2024

I have two open questions:

  • Does Numba support transactional memory and locking/synchronization primitives/instructions?
  • Are the speedups transferable to a "normal" CPU/GPU?
    They use a 4-socket × 16-core system, whereas modern CPUs/GPUs have a uniform-ish architecture of 100+ cores.
    I assume that those locking and synchronization methods are very beneficial when the core-to-core latency is high.
    I bet that a single CPU/GPU sees less speedup from the proposed locking mechanisms than a 4-socket system does.

If the answer to both questions is "no", I assume that this paper is not useful for stumpy.


seanlaw commented Feb 13, 2024

@JaKasb These are good questions. I don't think numba has transactional memory (TM) support. However (I only came across this paper yesterday and have only skimmed it), with some creativity it might be possible to mimic hardware TM with software TM, though this seems to be an ongoing and active area of research. What I am particularly interested in is understanding how a tiling scheme might help speed up the GPU matrix profile computation (and possibly the CPU computation). We've looked at this in the past, but it got messy. Even if none of this is usable, at least we've captured the research, considered it, and discussed it.
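For what it's worth, since numba lacks TM and locking primitives, the usual workaround is to avoid synchronization entirely: give each thread its own private matrix profile buffer and merge them with an element-wise min afterwards. A minimal sketch of that merge step (the function name and shapes are hypothetical, not stumpy's actual internals):

```python
import numpy as np

def merge_thread_local_profiles(local_profiles, local_indices):
    """Element-wise min-reduce per-thread matrix profiles into one.

    Each thread writes only to its own (P, I) pair during the main
    computation, so no locks or transactions are needed; the reduction
    below runs once at the end.
    """
    P = local_profiles[0].copy()
    I = local_indices[0].copy()
    for p, i in zip(local_profiles[1:], local_indices[1:]):
        mask = p < P  # positions where this thread found a smaller distance
        P[mask] = p[mask]
        I[mask] = i[mask]
    return P, I
```

The trade-off is extra memory (one buffer per thread), which is presumably what the paper's locking schemes are trying to avoid.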

Certainly, I'm not looking for a 3x speedup, but even a modest 20% speedup might be worth exploring if, say, switching over to tiles doesn't add too much complexity. @JaKasb Do you see anything else we can do to improve the speed of our CPU or GPU matrix profile calculations? Any low-hanging fruit?


JaKasb commented Feb 13, 2024

All that advanced tiling stuff goes over my head.
After the STOMP paper, the papers on speedups and optimizations became hard for me to understand.
I fully agree with you that such tiling optimizations get messy and add a lot of code complexity.

Readable code also has intrinsic worth.

In my opinion, the tradeoff is not worth the effort.
Sure, you can improve the loops for cache locality and whatnot, but it would be necessary to modify all the loops and nested variables; ultimately, one would practically have to rewrite stumpy from scratch.
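To make the cache-locality point concrete, here is a toy sketch of loop tiling (cache blocking) on a plain distance-like matrix; this is an illustration of the general technique, not stumpy's actual code:

```python
import numpy as np

def tiled_min_rows(D, tile=64):
    """Row-wise min of a matrix, computed tile by tile.

    Instead of streaming across each full row, process the matrix in
    `tile` x `tile` blocks so a block's operands stay cache-resident
    while they are reused. The result is identical to D.min(axis=1);
    only the memory access pattern changes.
    """
    n, m = D.shape
    out = np.full(n, np.inf)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            block = D[i0:i0 + tile, j0:j0 + tile]
            np.minimum(out[i0:i0 + tile], block.min(axis=1),
                       out=out[i0:i0 + tile])
    return out
```

Even in this tiny example, every loop bound and index has to be rewritten in terms of block offsets, which hints at why applying it across all of stumpy's kernels would be invasive.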

Furthermore, for higher speed one can use approximate matrix profile algorithms or multiple GPUs.
For my use cases, stumpy is already fast enough.
