Improving Multi-processor Performance #959

Open
seanlaw opened this issue Feb 13, 2024 · 3 comments
Labels
enhancement · help wanted

Comments


seanlaw commented Feb 13, 2024

This new paper, titled "Exploring Multiprocessor Approaches to Time Series Analysis", claims to significantly improve the performance of matrix profile calculations. We should consider looking into this.

Additionally, we should consider whether it may be practical to refactor some parts of our code along the lines of cache-oblivious algorithms.

@seanlaw seanlaw added the enhancement and help wanted labels Feb 13, 2024

JaKasb commented Feb 13, 2024

I have two open questions:

  • Does Numba support transactional memory and locking/synchronization primitives/instructions?
  • Are the speedups transferable to a "normal" CPU/GPU?
    They use a 4-socket × 16-core system, whereas modern CPUs/GPUs have a uniform-ish architecture of 100+ cores.
    I assume that those locking and synchronization methods are very beneficial when the core-to-core latency is high.
    I bet that a single CPU/GPU sees less speedup from the proposed locking mechanisms than a 4-socket system does.

If the answer to both questions is "no", I assume that this paper is not useful for stumpy.


seanlaw commented Feb 13, 2024

@JaKasb These are good questions. I don't think numba has transactional memory (TM) support. However (I only came across this paper yesterday and have only skimmed it), with some creativity it might be possible to mimic hardware TM with software TM, though this seems to be an ongoing and active area of research. What I am particularly interested in is understanding how a tiling scheme might help speed up the GPU matrix profile computation (and possibly the CPU computation). We've looked at this in the past, but it got messy. Even if none of this is usable, at least we've captured the research, considered it, and discussed it.
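For what it's worth, since numba lacks TM and locking primitives, the usual workaround is to avoid synchronization entirely: give each thread its own private matrix profile buffer and merge them with an element-wise min afterwards. A minimal sketch of that merge step (the function name and shapes are hypothetical, not stumpy's actual internals):

```python
import numpy as np

def merge_thread_local_profiles(local_profiles, local_indices):
    """Element-wise min-reduce per-thread matrix profiles into one.

    Each thread writes only to its own (P, I) pair during the main
    computation, so no locks or transactions are needed; the reduction
    below runs once at the end.
    """
    P = local_profiles[0].copy()
    I = local_indices[0].copy()
    for p, i in zip(local_profiles[1:], local_indices[1:]):
        mask = p < P  # positions where this thread found a smaller distance
        P[mask] = p[mask]
        I[mask] = i[mask]
    return P, I
```

The trade-off is extra memory (one buffer per thread), which is presumably what the paper's locking schemes are trying to avoid.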

Certainly, I'm not looking for a 3x speedup, but even a modest 20% speedup might be worth exploring if, say, switching over to tiles doesn't add too much complexity. @JaKasb Do you see anything else we can do to improve the speed of our CPU or GPU matrix profile calculations? Any low-hanging fruit?


JaKasb commented Feb 13, 2024

All that advanced tiling stuff goes over my head.
After the STOMP paper, the papers on speedups and optimizations became hard for me to understand.
I fully agree with you that such tiling optimizations get messy and add a lot of code complexity.

Readable code also has intrinsic worth.

In my opinion, the tradeoff is not worth the effort.
Sure, you can improve the loops for cache locality and whatnot, but it would be necessary to modify all the loops and nested variables; ultimately, one would practically have to rewrite stumpy from scratch.
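To make the cache-locality point concrete, here is a toy sketch of loop tiling (cache blocking) on a plain distance-like matrix; this is an illustration of the general technique, not stumpy's actual code:

```python
import numpy as np

def tiled_min_rows(D, tile=64):
    """Row-wise min of a matrix, computed tile by tile.

    Instead of streaming across each full row, process the matrix in
    `tile` x `tile` blocks so a block's operands stay cache-resident
    while they are reused. The result is identical to D.min(axis=1);
    only the memory access pattern changes.
    """
    n, m = D.shape
    out = np.full(n, np.inf)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            block = D[i0:i0 + tile, j0:j0 + tile]
            np.minimum(out[i0:i0 + tile], block.min(axis=1),
                       out=out[i0:i0 + tile])
    return out
```

Even in this tiny example, every loop bound and index has to be rewritten in terms of block offsets, which hints at why applying it across all of stumpy's kernels would be invasive.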

Furthermore, for higher speed one can use approximate matrix profile algorithms or multiple GPUs.
For my use cases, stumpy is already fast enough.
