
Use thread pool #400

Open · wants to merge 9 commits into master
Conversation

@goerch (Contributor) commented Jul 20, 2023

Inspired by this discussion I tried to understand the work scheduling. The patch uses a plain pthreads emulation for Windows and a thread pool implementation from the same site, with what I hope are corrections.
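For context, a minimal shim of the pthreads subset needed here on top of Win32 primitives could look roughly like the sketch below. This is my illustration of the general approach only, not the emulation actually used by the patch; all names are invented.

```c
// Sketch of a minimal pthreads shim over Win32 (illustrative only, not the
// emulation used by this patch).
#include <windows.h>
#include <stdlib.h>

typedef HANDLE             pthread_t;
typedef CRITICAL_SECTION   pthread_mutex_t;
typedef CONDITION_VARIABLE pthread_cond_t;

typedef struct { void *(*fn)(void *); void *arg; } thread_start_t;

static DWORD WINAPI thread_trampoline(LPVOID p) {
    thread_start_t s = *(thread_start_t *) p;   // copy, then free the heap wrapper
    free(p);
    s.fn(s.arg);
    return 0;
}

static int pthread_create(pthread_t *t, const void *attr, void *(*fn)(void *), void *arg) {
    (void) attr;
    thread_start_t *s = malloc(sizeof *s);
    s->fn = fn; s->arg = arg;
    *t = CreateThread(NULL, 0, thread_trampoline, s, 0, NULL);
    return *t != NULL ? 0 : -1;
}

static int pthread_join(pthread_t t, void **retval) {
    (void) retval;
    WaitForSingleObject(t, INFINITE);
    CloseHandle(t);
    return 0;
}

static int pthread_mutex_init  (pthread_mutex_t *m, const void *a) { (void) a; InitializeCriticalSection(m); return 0; }
static int pthread_mutex_lock  (pthread_mutex_t *m) { EnterCriticalSection(m); return 0; }
static int pthread_mutex_unlock(pthread_mutex_t *m) { LeaveCriticalSection(m); return 0; }

static int pthread_cond_init     (pthread_cond_t *c, const void *a) { (void) a; InitializeConditionVariable(c); return 0; }
static int pthread_cond_wait     (pthread_cond_t *c, pthread_mutex_t *m) { return SleepConditionVariableCS(c, m, INFINITE) ? 0 : -1; }
static int pthread_cond_signal   (pthread_cond_t *c) { WakeConditionVariable(c);    return 0; }
static int pthread_cond_broadcast(pthread_cond_t *c) { WakeAllConditionVariable(c); return 0; }
```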

The patch is only tested on Windows. When testing I noticed that test_opt seems to need quite a bit more run time than earlier. Is this a regression or due to a bug fix? (This was due to different test sizes.)
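For readers following the scheduling discussion below, the general shape of a pthreads thread pool of the kind referenced is roughly as follows. This is only a sketch with invented names (task_t, pool_t, pool_submit), not the code added by this PR, and pool creation/teardown is omitted for brevity.

```c
// Minimal pthreads thread pool sketch (illustrative only, not the pool added by this PR).
#include <pthread.h>
#include <stdlib.h>

typedef struct task {
    void (*fn)(void *);          // work function
    void *arg;                   // argument passed to it
    struct task *next;
} task_t;

typedef struct {
    pthread_mutex_t lock;        // protects the queue and the shutdown flag
    pthread_cond_t  notify;      // signalled when work arrives or on shutdown
    task_t         *head, *tail; // FIFO of pending tasks
    int             shutdown;
} pool_t;

static void *worker(void *p) {
    pool_t *pool = (pool_t *) p;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (pool->head == NULL && !pool->shutdown) {
            pthread_cond_wait(&pool->notify, &pool->lock);   // sleep until there is work
        }
        if (pool->head == NULL && pool->shutdown) {
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        task_t *t = pool->head;
        pool->head = t->next;
        if (pool->head == NULL) pool->tail = NULL;
        pthread_mutex_unlock(&pool->lock);

        t->fn(t->arg);           // run the task outside the lock
        free(t);
    }
}

static void pool_submit(pool_t *pool, void (*fn)(void *), void *arg) {
    task_t *t = malloc(sizeof *t);
    t->fn = fn; t->arg = arg; t->next = NULL;
    pthread_mutex_lock(&pool->lock);
    if (pool->tail) pool->tail->next = t; else pool->head = t;
    pool->tail = t;
    pthread_cond_signal(&pool->notify);   // wake one idle worker
    pthread_mutex_unlock(&pool->lock);
}
```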

Here are the results of a quick check with

./bin/release/starcoder --threads 12 -m models/gpt_bigcode-santacoder/ggml-q4_0.bin -p "def test(" --temp 0

Before:

main: mem per token =   330184 bytes
main:     load time =  1162.09 ms
main:   sample time =    11.67 ms
main:  predict time = 18299.45 ms / 90.59 ms per token
main:    total time = 19597.86 ms

After:

main: mem per token =   330184 bytes
main:     load time =  1168.33 ms
main:   sample time =    13.62 ms
main:  predict time = 11272.42 ms / 55.80 ms per token
main:    total time = 12591.08 ms

Here the numbers for

./bin/release/starcoder --threads 4 -m models/gpt_bigcode-santacoder/ggml-q4_0.bin -p "def test(" --temp 0

Before:

main: mem per token =   330056 bytes
main:     load time =  1189.20 ms
main:   sample time =    11.53 ms
main:  predict time =  9613.42 ms / 47.59 ms per token
main:    total time = 10968.48 ms

After:

main: mem per token =   330056 bytes
main:     load time =  1203.09 ms
main:   sample time =    12.97 ms
main:  predict time =  9725.99 ms / 48.15 ms per token
main:    total time = 11100.05 ms

Run on a Core i7 with 6 cores, built with clang-cl and without BLAS support.

@slaren (Collaborator) commented Jul 20, 2023

Tagging @JohannesGaessler since I think he was also planning on doing something similar.

@goerch (Contributor, Author) commented Jul 20, 2023

I'm now going to look into @ggerganov 's PR. Back again: I need guidance on how to merge the different approaches.

@JohannesGaessler (Contributor) left a comment:

  • I did not yet test the code.
  • If code is taken verbatim from an external source I think we should document this.
  • The code in question is licensed as MIT so there should be no legal issues.

}

params.type = GGML_TASK_COMPUTE;
ggml_compute_forward(&params, node);
@JohannesGaessler (Contributor) commented:

If I understand the code correctly the main thread managing the other threads still does part of the computation itself. A design where the main thread queues work for all worker threads and then waits for them to be done would maybe be better since this would allow you to parallelize the computations with administrative overhead (and maybe computations from multiple tensors, but I think the overhaul by @slaren also does something like that). But this can be done in another PR.

@goerch (Contributor, Author) commented:

The work scheduling is intentionally kept identical to current master to ease comparison (modulo errors in my understanding of it).

@JohannesGaessler (Contributor) commented:

As far as I can tell this PR does not change the granularity with which work is distributed: all threads still receive an even share of the work. With reduced threading overhead it may be a good idea to split the work into smaller chunks, since on some Intel architectures the cores run at uneven speeds and smaller chunks would allow the faster cores to work on more chunks than the slower ones. But this can be done in another PR.

Another reviewer replied:

The work chunks are already short enough as-is; it is up to the OS scheduler to decide how threads are rotated across heterogeneous CPU cores running at different speeds in order to even out execution time.

If you do wish to implement a smaller work quantum, I highly recommend making it a parameterizable option, as the benefits and drawbacks will depend heavily on the OS/CPU target.
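If such a parameterizable work quantum were added, one common shape for it is an atomic chunk counter that lets faster cores claim more chunks than slower ones. The following is only a sketch; the names and the chunk_size parameter are hypothetical and nothing like this is part of this PR.

```c
#include <stdatomic.h>

// Shared between all threads working on one tensor.
typedef struct {
    atomic_int next_chunk;   // index of the next unclaimed chunk
    int        n_chunks;     // total number of chunks
    int        chunk_size;   // rows per chunk: smaller => better load balancing, more overhead
} chunked_work_t;

// Each thread keeps claiming chunks until none are left, so faster cores
// naturally end up processing more chunks than slower ones.
static void compute_chunked(chunked_work_t *w, int n_rows,
                            void (*compute_rows)(int first, int last)) {
    for (;;) {
        int chunk = atomic_fetch_add(&w->next_chunk, 1);
        if (chunk >= w->n_chunks) {
            break;                        // all chunks claimed
        }
        int first = chunk * w->chunk_size;
        int last  = first + w->chunk_size;
        if (last > n_rows) {
            last = n_rows;                // final chunk may be short
        }
        compute_rows(first, last);
    }
}
```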

@goerch (Contributor, Author) commented Jul 21, 2023

  • If code is taken verbatim from an external source I think we should document this.

Absolutely. Some assorted thoughts:

  • I'd love to use some kind of standard pthreads thread pool, but this is the 'best' I found
  • The presentation on the website is expository
  • I had to apply fixes to make the code compile
  • I had to apply fixes to make the code run
  • The code looks to me a bit like a textbook implementation
  • The code in question is licensed as MIT so there should be no legal issues.

Maybe we should try to contact the author and ask what to do?

Thanks for looking into this!

@goerch (Contributor, Author) commented Jul 21, 2023

There are a couple more questions I'd like to discuss:

  • The pool is currently located in ggml_state. Should we manage it in ggml_context instead?
  • Can we improve error handling?
  • How would you like the condition variable to be signaled: while the lock is held, or after releasing it?
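For reference, the two signaling orders from the last question look roughly like this. This is a generic sketch, not code from the patch; lock, cond, and ready are illustrative names.

```c
#include <pthread.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
static int             ready = 0;

// Variant A: signal while the mutex is still held.
static void notify_locked(void) {
    pthread_mutex_lock(&lock);
    ready = 1;
    pthread_cond_signal(&cond);   // waiter may wake and immediately block on the mutex
    pthread_mutex_unlock(&lock);  // (unless the implementation does wait morphing)
}

// Variant B: signal after releasing the mutex.
static void notify_unlocked(void) {
    pthread_mutex_lock(&lock);
    ready = 1;
    pthread_mutex_unlock(&lock);
    pthread_cond_signal(&cond);   // avoids waking a thread straight into a held mutex
}

// Either variant is correct as long as the predicate is set under the mutex
// and the waiter re-checks it in a loop:
static void wait_ready(void) {
    pthread_mutex_lock(&lock);
    while (!ready) {
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}
```

Which variant performs better tends to depend on the pthreads implementation, which is presumably why the question is worth measuring rather than deciding up front.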

@JohannesGaessler (Contributor) commented Jul 21, 2023

I had to apply fixes to make the code compile
I had to apply fixes to make the code run

Please document this as well.

@JohannesGaessler (Contributor) commented Jul 22, 2023

On my Linux system with a Ryzen 3700X, 3200 MHz dual channel memory, and an RTX 3090 this PR has worse performance than master:

| GPU      | -ngl | Test  | ms/t master | ms/t PR | Speedup |
|----------|------|-------|-------------|---------|---------|
| RTX 3090 |    0 | tg128 |      402.21 |  426.87 |    0.94 |
| RTX 3090 |   20 | tg128 |      224.68 |  238.41 |    0.94 |
| RTX 3090 |   40 | tg128 |       42.62 |   51.55 |    0.83 |

Edit: the numbers are also for starcoder-mmap.

@goerch (Contributor, Author) commented Jul 23, 2023

Thanks for testing!

I changed the order of unlocking/signaling and retested with a bigger model

./bin/release/mpt -m models\mpt-7b-storywriter\ggml-model-f16.bin --threads n --temp 0 --prompt "..."

for n = 4/6, with Clang and with ICX/MKL. Results for the two compilers are similar, so I'm showing the Clang numbers.

Master (4 threads, then 6 threads):

main: sampled tokens =      200
main:  mem per token =   365968 bytes
main:      load time = 16946.88 ms
main:    sample time =    11.75 ms / 0.06 ms per token
main:      eval time = 108975.37 ms / 544.88 ms per token
main:     total time = 128000.77 ms
main: sampled tokens =      200
main:  mem per token =   366000 bytes
main:      load time = 17137.89 ms
main:    sample time =    12.15 ms / 0.06 ms per token
main:      eval time = 98924.89 ms / 494.62 ms per token
main:     total time = 117962.98 ms

PR (4 threads, then 6 threads):
main: sampled tokens =      200
main:  mem per token =   365968 bytes
main:      load time = 17143.71 ms
main:    sample time =    14.17 ms / 0.07 ms per token
main:      eval time = 105632.52 ms / 528.16 ms per token
main:     total time = 124864.12 ms
main: sampled tokens =      200
main:  mem per token =   366000 bytes
main:      load time = 16956.28 ms
main:    sample time =    14.93 ms / 0.07 ms per token
main:      eval time = 92799.80 ms / 464.00 ms per token
main:     total time = 111578.88 ms

I'm not sure how to proceed.

@slaren (Collaborator) commented Jul 23, 2023

Have you tested with llama? If I understand correctly, this will cause the threads to wait on a mutex for each node in the graph. Under normal conditions a spin lock should have a lower acquire overhead than a mutex (excluding edge cases such as using more threads than are physically available), and since that overhead is paid for every node, it will scale with the number of nodes in the graph.
For what it's worth, what I was planning to do for ggerganov/llama.cpp#2239 was to keep the spin locks while evaluating a graph, but keep the threads alive waiting on a mutex between calls to ggml_graph_compute, to avoid having to re-launch them for every graph.
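A rough sketch of that hybrid idea, as I read it: workers park on a condition variable between graphs and spin while a graph is being evaluated, so the per-node handoff stays as cheap as a spin lock. This is only my illustration, not code from llama.cpp or from #2239; the sched_state_t layout and names are invented.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;          // guards wake-ups between graphs
    pthread_cond_t  wake;
    atomic_bool     graph_active;  // set by the main thread around graph evaluation
    atomic_bool     stop;
    atomic_int      current_node;  // advanced by the main thread during evaluation
} sched_state_t;

static void *worker_loop(void *p) {
    sched_state_t *s = (sched_state_t *) p;
    for (;;) {
        // Between graphs: sleep on the condition variable so idle threads cost
        // (almost) nothing. The main thread must set graph_active/stop and
        // signal while holding `lock`, otherwise wake-ups can be lost.
        pthread_mutex_lock(&s->lock);
        while (!atomic_load(&s->graph_active) && !atomic_load(&s->stop)) {
            pthread_cond_wait(&s->wake, &s->lock);
        }
        pthread_mutex_unlock(&s->lock);
        if (atomic_load(&s->stop)) {
            return NULL;
        }

        // Within a graph: spin on an atomic node counter, one node at a time,
        // keeping the per-node synchronization cost comparable to a spin lock.
        int last_node = -1;
        while (atomic_load(&s->graph_active)) {
            int node = atomic_load(&s->current_node);
            if (node != last_node) {
                // ... compute this thread's share of graph node `node` ...
                last_node = node;
            }
        }
    }
}
```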

@JohannesGaessler (Contributor) commented:

If the performance is not universally better, I think this should be added as a compile option so that both implementations remain available. To get a better grasp of which should be the default, I think we would need more data.

The pool is currently located in ggml_state. Should we manage it in ggml_context instead?

@ggerganov is the authority to ask regarding ggml architecture.

@JohannesGaessler (Contributor) commented:

To get back to what I said earlier: with the current design the threads need to wait and be notified for each tensor. With a design that queues all of the work ahead of time, you could instead have the threads wait on a barrier (though I'm not sure whether that would actually be faster).
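For illustration, a barrier-based variant along those lines might look roughly like the sketch below. It is not a concrete proposal from this PR; also note that pthread_barrier_t is an optional part of POSIX and is missing from some platforms and pthreads emulations, which is a practical caveat.

```c
#include <pthread.h>

// Queue-everything-then-wait sketch: the main thread publishes all work items
// for a graph, then every thread (workers and main thread alike) meets at two
// barriers per graph.
typedef struct {
    pthread_barrier_t start;   // initialized with pthread_barrier_init(&start, NULL, n_threads)
    pthread_barrier_t done;
    // ... shared, pre-built list of work items for the current graph ...
} graph_sync_t;

static void worker_iteration(graph_sync_t *sync) {
    pthread_barrier_wait(&sync->start);   // wait until all work has been queued
    // ... pull work items from the shared list and compute them ...
    pthread_barrier_wait(&sync->done);    // everyone is finished; the main thread continues
}
```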

Maybe we should wait for slaren's backend refactor? From my perspective the biggest issue with threading is that you manually have to set the number of threads to 1 to get optimal performance with llama.cpp CUDA, which is an inconvenience for the user and suboptimal for performance if you offload only part of the network.

@slaren (Collaborator) commented Jul 23, 2023

Keep in mind that the backend refactor is still going to take a while (several weeks), and I wouldn't want it to cause a delay to any possible improvements that could be made in the meanwhile. I understand that it is a bit of an uncomfortable situation because the changes needed to fix that issue for CUDA may be irrelevant after the backends refactor, so ultimately it is up to you.
