
Use thread pool #400

Open · wants to merge 9 commits into master
Conversation

@goerch (Contributor) commented Jul 20, 2023

Inspired by this discussion I tried to understand the work scheduling. The patch uses a plain pthreads emulation for Windows and a thread pool implementation from the same site, with what I hope are corrections.
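For context, a minimal shim of the pthreads subset needed here on top of Win32 primitives could look roughly like the sketch below. This is my illustration of the general approach only, not the emulation actually used by the patch; all names are invented.

```c
// Sketch of a minimal pthreads shim over Win32 (illustrative only, not the
// emulation used by this patch).
#include <windows.h>
#include <stdlib.h>

typedef HANDLE             pthread_t;
typedef CRITICAL_SECTION   pthread_mutex_t;
typedef CONDITION_VARIABLE pthread_cond_t;

typedef struct { void *(*fn)(void *); void *arg; } thread_start_t;

static DWORD WINAPI thread_trampoline(LPVOID p) {
    thread_start_t s = *(thread_start_t *) p;   // copy, then free the heap wrapper
    free(p);
    s.fn(s.arg);
    return 0;
}

static int pthread_create(pthread_t *t, const void *attr, void *(*fn)(void *), void *arg) {
    (void) attr;
    thread_start_t *s = malloc(sizeof *s);
    s->fn = fn; s->arg = arg;
    *t = CreateThread(NULL, 0, thread_trampoline, s, 0, NULL);
    return *t != NULL ? 0 : -1;
}

static int pthread_join(pthread_t t, void **retval) {
    (void) retval;
    WaitForSingleObject(t, INFINITE);
    CloseHandle(t);
    return 0;
}

static int pthread_mutex_init  (pthread_mutex_t *m, const void *a) { (void) a; InitializeCriticalSection(m); return 0; }
static int pthread_mutex_lock  (pthread_mutex_t *m) { EnterCriticalSection(m); return 0; }
static int pthread_mutex_unlock(pthread_mutex_t *m) { LeaveCriticalSection(m); return 0; }

static int pthread_cond_init     (pthread_cond_t *c, const void *a) { (void) a; InitializeConditionVariable(c); return 0; }
static int pthread_cond_wait     (pthread_cond_t *c, pthread_mutex_t *m) { return SleepConditionVariableCS(c, m, INFINITE) ? 0 : -1; }
static int pthread_cond_signal   (pthread_cond_t *c) { WakeConditionVariable(c);    return 0; }
static int pthread_cond_broadcast(pthread_cond_t *c) { WakeAllConditionVariable(c); return 0; }
```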

The patch is only tested on Windows. When testing I noticed that test_opt seems to need quite a bit more run time than earlier. Is this a regression or due to a bug fix? (This was due to different test sizes.)
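For readers following the scheduling discussion below, the general shape of a pthreads thread pool of the kind referenced is roughly as follows. This is only a sketch with invented names (task_t, pool_t, pool_submit), not the code added by this PR, and pool creation/teardown is omitted for brevity.

```c
// Minimal pthreads thread pool sketch (illustrative only, not the pool added by this PR).
#include <pthread.h>
#include <stdlib.h>

typedef struct task {
    void (*fn)(void *);          // work function
    void *arg;                   // argument passed to it
    struct task *next;
} task_t;

typedef struct {
    pthread_mutex_t lock;        // protects the queue and the shutdown flag
    pthread_cond_t  notify;      // signalled when work arrives or on shutdown
    task_t         *head, *tail; // FIFO of pending tasks
    int             shutdown;
} pool_t;

static void *worker(void *p) {
    pool_t *pool = (pool_t *) p;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (pool->head == NULL && !pool->shutdown) {
            pthread_cond_wait(&pool->notify, &pool->lock);   // sleep until there is work
        }
        if (pool->head == NULL && pool->shutdown) {
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        task_t *t = pool->head;
        pool->head = t->next;
        if (pool->head == NULL) pool->tail = NULL;
        pthread_mutex_unlock(&pool->lock);

        t->fn(t->arg);           // run the task outside the lock
        free(t);
    }
}

static void pool_submit(pool_t *pool, void (*fn)(void *), void *arg) {
    task_t *t = malloc(sizeof *t);
    t->fn = fn; t->arg = arg; t->next = NULL;
    pthread_mutex_lock(&pool->lock);
    if (pool->tail) pool->tail->next = t; else pool->head = t;
    pool->tail = t;
    pthread_cond_signal(&pool->notify);   // wake one idle worker
    pthread_mutex_unlock(&pool->lock);
}
```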

Here are the results of a quick check with

./bin/release/starcoder --threads 12 -m models/gpt_bigcode-santacoder/ggml-q4_0.bin -p "def test(" --temp 0

Before:

main: mem per token =   330184 bytes
main:     load time =  1162.09 ms
main:   sample time =    11.67 ms
main:  predict time = 18299.45 ms / 90.59 ms per token
main:    total time = 19597.86 ms

After:

main: mem per token =   330184 bytes
main:     load time =  1168.33 ms
main:   sample time =    13.62 ms
main:  predict time = 11272.42 ms / 55.80 ms per token
main:    total time = 12591.08 ms

Here the numbers for

./bin/release/starcoder --threads 4 -m models/gpt_bigcode-santacoder/ggml-q4_0.bin -p "def test(" --temp 0

Before:

main: mem per token =   330056 bytes
main:     load time =  1189.20 ms
main:   sample time =    11.53 ms
main:  predict time =  9613.42 ms / 47.59 ms per token
main:    total time = 10968.48 ms

After:

main: mem per token =   330056 bytes
main:     load time =  1203.09 ms
main:   sample time =    12.97 ms
main:  predict time =  9725.99 ms / 48.15 ms per token
main:    total time = 11100.05 ms

Run on a Core i7 with 6 cores, built with clang-cl and without BLAS support.

@slaren (Collaborator) commented Jul 20, 2023

Tagging @JohannesGaessler since I think he was also planning on doing something similar.

@goerch (Contributor, Author) commented Jul 20, 2023

I'm now going to look into @ggerganov 's PR. Back again: I need guidance on how to merge the different approaches.

@JohannesGaessler (Contributor) left a comment:

  • I did not yet test the code.
  • If code is taken verbatim from an external source I think we should document this.
  • The code in question is licensed as MIT so there should be no legal issues.

}

params.type = GGML_TASK_COMPUTE;
ggml_compute_forward(&params, node);
@JohannesGaessler (Contributor) commented:

If I understand the code correctly the main thread managing the other threads still does part of the computation itself. A design where the main thread queues work for all worker threads and then waits for them to be done would maybe be better since this would allow you to parallelize the computations with administrative overhead (and maybe computations from multiple tensors, but I think the overhaul by @slaren also does something like that). But this can be done in another PR.

@goerch (Contributor, Author) commented:

The work scheduling is intentionally kept identical to current master to ease comparison (modulo errors in my understanding of it).

@JohannesGaessler (Contributor) commented:

As far as I can tell this PR does not change the granularity with which work is distributed: all threads still receive an even share of the work. With reduced threading overhead it may be a good idea to split the work into smaller chunks, since on some Intel architectures the cores run at uneven speeds and smaller chunks would allow the faster cores to work on more chunks than the slower ones. But this can be done in another PR.

Another reviewer replied:

The work chunks are already short enough as-is; it is up to the OS scheduler to decide how threads are rotated across heterogeneous CPU cores running at different speeds in order to even out execution time.

If you do wish to implement a smaller work quantum, I highly recommend making it a parameterizable option, as the benefits and drawbacks will depend heavily on the OS/CPU target.
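If such a parameterizable work quantum were added, one common shape for it is an atomic chunk counter that lets faster cores claim more chunks than slower ones. The following is only a sketch; the names and the chunk_size parameter are hypothetical and nothing like this is part of this PR.

```c
#include <stdatomic.h>

// Shared between all threads working on one tensor.
typedef struct {
    atomic_int next_chunk;   // index of the next unclaimed chunk
    int        n_chunks;     // total number of chunks
    int        chunk_size;   // rows per chunk: smaller => better load balancing, more overhead
} chunked_work_t;

// Each thread keeps claiming chunks until none are left, so faster cores
// naturally end up processing more chunks than slower ones.
static void compute_chunked(chunked_work_t *w, int n_rows,
                            void (*compute_rows)(int first, int last)) {
    for (;;) {
        int chunk = atomic_fetch_add(&w->next_chunk, 1);
        if (chunk >= w->n_chunks) {
            break;                        // all chunks claimed
        }
        int first = chunk * w->chunk_size;
        int last  = first + w->chunk_size;
        if (last > n_rows) {
            last = n_rows;                // final chunk may be short
        }
        compute_rows(first, last);
    }
}
```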

@goerch (Contributor, Author) commented Jul 21, 2023

  • If code is taken verbatim from an external source I think we should document this.

Absolutely. Some assorted thoughts:

  • I'd love to use some kind of standard pthreads thread pool, but this is the 'best' I found
  • The presentation on the website is expository
  • I had to apply fixes to make the code compile
  • I had to apply fixes to make the code run
  • The code looks to me a bit like a textbook implementation
  • The code in question is licensed as MIT so there should be no legal issues.

Maybe we should try to contact the author and ask what to do?

Thanks for looking into this!

@goerch (Contributor, Author) commented Jul 21, 2023

There are a couple more questions I'd like to discuss:

  • The pool is currently located in ggml_state. Should we manage it in ggml_context instead?
  • Can we improve error handling?
  • How would you like the condition variable to be signaled: while the lock is held, or after releasing it?
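For reference, the two signaling orders from the last question look roughly like this. This is a generic sketch, not code from the patch; lock, cond, and ready are illustrative names.

```c
#include <pthread.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
static int             ready = 0;

// Variant A: signal while the mutex is still held.
static void notify_locked(void) {
    pthread_mutex_lock(&lock);
    ready = 1;
    pthread_cond_signal(&cond);   // waiter may wake and immediately block on the mutex
    pthread_mutex_unlock(&lock);  // (unless the implementation does wait morphing)
}

// Variant B: signal after releasing the mutex.
static void notify_unlocked(void) {
    pthread_mutex_lock(&lock);
    ready = 1;
    pthread_mutex_unlock(&lock);
    pthread_cond_signal(&cond);   // avoids waking a thread straight into a held mutex
}

// Either variant is correct as long as the predicate is set under the mutex
// and the waiter re-checks it in a loop:
static void wait_ready(void) {
    pthread_mutex_lock(&lock);
    while (!ready) {
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}
```

Which variant performs better tends to depend on the pthreads implementation, which is presumably why the question is worth measuring rather than deciding up front.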

@JohannesGaessler (Contributor) commented Jul 21, 2023

I had to apply fixes to make the code compile
I had to apply fixes to make the code run

Please document this as well.

@JohannesGaessler (Contributor) commented Jul 22, 2023

On my Linux system with a Ryzen 3700X, 3200 MHz dual channel memory, and an RTX 3090 this PR has worse performance than master:

| GPU      | -ngl | Test  | ms/t master | ms/t PR | Speedup |
|----------|------|-------|-------------|---------|---------|
| RTX 3090 |    0 | tg128 |      402.21 |  426.87 |    0.94 |
| RTX 3090 |   20 | tg128 |      224.68 |  238.41 |    0.94 |
| RTX 3090 |   40 | tg128 |       42.62 |   51.55 |    0.83 |

Edit: the numbers are also for starcoder-mmap.

@goerch (Contributor, Author) commented Jul 23, 2023

Thanks for testing!

I changed the order of unlocking/signaling and retested with a bigger model

./bin/release/mpt -m models\mpt-7b-storywriter\ggml-model-f16.bin --threads n --temp 0 --prompt "..."

for n = 4/6, with Clang and with ICX/MKL. Results for the two compilers are similar, so I'm showing the Clang numbers.

Master (4 threads, then 6 threads):

main: sampled tokens =      200
main:  mem per token =   365968 bytes
main:      load time = 16946.88 ms
main:    sample time =    11.75 ms / 0.06 ms per token
main:      eval time = 108975.37 ms / 544.88 ms per token
main:     total time = 128000.77 ms
main: sampled tokens =      200
main:  mem per token =   366000 bytes
main:      load time = 17137.89 ms
main:    sample time =    12.15 ms / 0.06 ms per token
main:      eval time = 98924.89 ms / 494.62 ms per token
main:     total time = 117962.98 ms

PR (4 threads, then 6 threads):
main: sampled tokens =      200
main:  mem per token =   365968 bytes
main:      load time = 17143.71 ms
main:    sample time =    14.17 ms / 0.07 ms per token
main:      eval time = 105632.52 ms / 528.16 ms per token
main:     total time = 124864.12 ms
main: sampled tokens =      200
main:  mem per token =   366000 bytes
main:      load time = 16956.28 ms
main:    sample time =    14.93 ms / 0.07 ms per token
main:      eval time = 92799.80 ms / 464.00 ms per token
main:     total time = 111578.88 ms

I'm not sure how to proceed.

@slaren (Collaborator) commented Jul 23, 2023

Have you tested with llama? If I understand correctly, this will cause the threads to wait on a mutex for each node in the graph. Under normal conditions a spin lock should have a lower acquire overhead than a mutex (excluding edge cases such as using more threads than are physically available), and since that overhead is paid for every node, it will scale with the number of nodes in the graph.
For what it's worth, what I was planning to do for ggerganov/llama.cpp#2239 was to keep the spin locks while evaluating a graph, but keep the threads alive waiting on a mutex between calls to ggml_graph_compute, to avoid having to re-launch them for every graph.
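A rough sketch of that hybrid idea, as I read it: workers park on a condition variable between graphs and spin while a graph is being evaluated, so the per-node handoff stays as cheap as a spin lock. This is only my illustration, not code from llama.cpp or from #2239; the sched_state_t layout and names are invented.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;          // guards wake-ups between graphs
    pthread_cond_t  wake;
    atomic_bool     graph_active;  // set by the main thread around graph evaluation
    atomic_bool     stop;
    atomic_int      current_node;  // advanced by the main thread during evaluation
} sched_state_t;

static void *worker_loop(void *p) {
    sched_state_t *s = (sched_state_t *) p;
    for (;;) {
        // Between graphs: sleep on the condition variable so idle threads cost
        // (almost) nothing. The main thread must set graph_active/stop and
        // signal while holding `lock`, otherwise wake-ups can be lost.
        pthread_mutex_lock(&s->lock);
        while (!atomic_load(&s->graph_active) && !atomic_load(&s->stop)) {
            pthread_cond_wait(&s->wake, &s->lock);
        }
        pthread_mutex_unlock(&s->lock);
        if (atomic_load(&s->stop)) {
            return NULL;
        }

        // Within a graph: spin on an atomic node counter, one node at a time,
        // keeping the per-node synchronization cost comparable to a spin lock.
        int last_node = -1;
        while (atomic_load(&s->graph_active)) {
            int node = atomic_load(&s->current_node);
            if (node != last_node) {
                // ... compute this thread's share of graph node `node` ...
                last_node = node;
            }
        }
    }
}
```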

@JohannesGaessler (Contributor) commented:

If the performance is not universally better, I think this should be added as a compile option so that both implementations remain available. To get a better grasp of which should be the default, I think we would need more data.

The pool is currently located in ggml_state. Should we manage it in ggml_context instead?

@ggerganov is the authority to ask regarding ggml architecture.

@JohannesGaessler (Contributor) commented:

To get back to what I said earlier: with the current design the threads need to wait and be notified for each tensor. With a design that queues all of the work ahead of time, you could instead have the threads wait on a barrier (though I'm not sure whether that would actually be faster).
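For illustration, a barrier-based variant along those lines might look roughly like the sketch below. It is not a concrete proposal from this PR; also note that pthread_barrier_t is an optional part of POSIX and is missing from some platforms and pthreads emulations, which is a practical caveat.

```c
#include <pthread.h>

// Queue-everything-then-wait sketch: the main thread publishes all work items
// for a graph, then every thread (workers and main thread alike) meets at two
// barriers per graph.
typedef struct {
    pthread_barrier_t start;   // initialized with pthread_barrier_init(&start, NULL, n_threads)
    pthread_barrier_t done;
    // ... shared, pre-built list of work items for the current graph ...
} graph_sync_t;

static void worker_iteration(graph_sync_t *sync) {
    pthread_barrier_wait(&sync->start);   // wait until all work has been queued
    // ... pull work items from the shared list and compute them ...
    pthread_barrier_wait(&sync->done);    // everyone is finished; the main thread continues
}
```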

Maybe we should wait for slaren's backend refactor? From my perspective the biggest issue with threading is that you manually have to set the number of threads to 1 to get optimal performance with llama.cpp CUDA, which is an inconvenience for the user and suboptimal for performance if you offload only part of the network.

@slaren (Collaborator) commented Jul 23, 2023

Keep in mind that the backend refactor is still going to take a while (several weeks), and I wouldn't want it to cause a delay to any possible improvements that could be made in the meanwhile. I understand that it is a bit of an uncomfortable situation because the changes needed to fix that issue for CUDA may be irrelevant after the backends refactor, so ultimately it is up to you.
