[Open-to-community] Benchmark bloomz.cpp on different hardware #4

Vaibhavs10 opened this issue Mar 15, 2023 · 6 comments

@Vaibhavs10 (Collaborator) commented Mar 15, 2023

Hey hey,

We are working hard to help you unlock the full potential of open-source LLMs. To build better tools and support the widest range of hardware, we need your help running benchmarks with bloomz.cpp 🤗

We are looking for the following information:

  1. Hardware information (CPU / RAM / GPU / threads)
  2. Inference time (time per token)
  3. Memory use

You can do so by following the quickstart steps in the project's README. 💯
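
For reference, the end-to-end pipeline looks roughly like this. This is a sketch based on the README: the conversion script name and the quantize type argument are assumptions, so check the repo for the canonical steps.

git clone https://github.com/NouamaneTazi/bloomz.cpp
cd bloomz.cpp && make

# convert the Hugging Face checkpoint to ggml f16 (script name assumed from the README)
python3 convert-hf-to-ggml.py bigscience/bloomz-7b1 ./models

# quantize f16 -> 4-bit (the trailing 2 presumably selects q4_0; assumption from the README)
./quantize ./models/ggml-model-bloomz-7b1-f16.bin ./models/ggml-model-bloomz-7b1-f16-q4_0.bin 2

# run a benchmark prompt and note the "ms per token" line in the output
./main -m ./models/ggml-model-bloomz-7b1-f16-q4_0.bin -p 'Translate "Hi, how are you?" in French:' -t 8 -n 256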

Ping @NouamaneTazi and @Vaibhavs10 if you have any questions! <3

Happy benchmarking! 🚀

@eschaffn

Is it possible to run this on Windows?

@NouamaneTazi (Owner)

Good point!
It should be possible with the latest changes in llama.cpp; we still need to pull those into this repo.

Feel free to open a PR for that if you'd like @eschaffn 🚀

@lapo-luchini (Contributor)

I didn't expect conversion to need 22 GiB of RAM (running on Win64 with native Python 3.11).
I just barely managed. 😅

Quantization used more or less 10 GiB of RAM running on WSL Ubuntu / gcc 9.4.0.

bloom_model_quantize: model size  = 30886.16 MB
bloom_model_quantize: quant size  =  4831.16 MB
bloom_model_quantize: hist: 0.000 0.022 0.018 0.031 0.050 0.075 0.102 0.129 0.152 0.128 0.102 0.074 0.049 0.031 0.018 0.021

main: quantize time = 203633.06 ms
main:    total time = 203633.06 ms

It ran at around 5.5 tokens/s (1000 / 180.86 ms per token) on an AMD Ryzen 5 3600:

% make && ./main -m models/ggml-model-bloomz-7b1-f16-q4_0.bin -p 'Translate "Hi, how are you?" in French:' -t 8 -n 256
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

make: Nothing to be done for 'default'.
main: seed = 1678990037
bloom_model_load: loading model from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 4096
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 32
bloom_model_load: n_layer = 30
bloom_model_load: f16     = 2
bloom_model_load: n_ff    = 16384
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 5312.64 MB
bloom_model_load: memory_size =   480.00 MB, n_mem = 15360
bloom_model_load: loading model part 1/1 from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin'
bloom_model_load: ............................................. done
bloom_model_load: model size =  4831.16 MB / num tensors = 366

main: prompt: 'Translate "Hi, how are you?" in French:'
main: number of tokens in prompt = 11
153772 -> 'Translate'
 17959 -> ' "H'
    76 -> 'i'
 98257 -> ', '
 20263 -> 'how'
  1306 -> ' are'
  1152 -> ' you'
  2040 -> '?'
     5 -> '"'
   361 -> ' in'
196427 -> ' French:'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Translate "Hi, how are you?" in French: Bonjour, comment vas-tu?</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time =  8464.89 ms
main:   sample time =   180.49 ms
main:  predict time =  3074.70 ms / 180.86 ms per token
main:    total time = 12601.97 ms

@lapo-luchini (Contributor)

FreeBSD 13 on an Intel i7-3770 CPU @ 3.40 GHz:
(I had to remove some parameters, or it would just crash.)

% gmake && ./main -m models/ggml-model-bloomz-7b1-f16-q4_0.bin -p 'Translate "Hi, how are you?" in French:' -t 8 -n 256
I llama.cpp build info:
I UNAME_S:  FreeBSD
I UNAME_P:  amd64
I UNAME_M:  amd64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:
I CC:       FreeBSD clang version 13.0.0 (git@github.com:llvm/llvm-project.git llvmorg-13.0.0-0-gd7b669b3a303)
I CXX:      FreeBSD clang version 13.0.0 (git@github.com:llvm/llvm-project.git llvmorg-13.0.0-0-gd7b669b3a303)

gmake: Nothing to be done for 'default'.
main: seed = 1678993867
bloom_model_load: loading model from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 4096
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 32
bloom_model_load: n_layer = 30
bloom_model_load: f16     = 2
bloom_model_load: n_ff    = 16384
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 5312.64 MB
bloom_model_load: memory_size =   480.00 MB, n_mem = 15360
bloom_model_load: loading model part 1/1 from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin'
bloom_model_load: ............................................. done
bloom_model_load: model size =  4831.16 MB / num tensors = 366

main: prompt: 'Translate "Hi, how are you?" in French:'
main: number of tokens in prompt = 11
153772 -> 'Translate'
 17959 -> ' "H'
    76 -> 'i'
 98257 -> ', '
 20263 -> 'how'
  1306 -> ' are'
  1152 -> ' you'
  2040 -> '?'
     5 -> '"'
   361 -> ' in'
196427 -> ' French:'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Translate "Hi, how are you?" in French: comment vas-tu? "</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time = 25791.55 ms
main:   sample time =   204.70 ms
main:  predict time = 50346.14 ms / 3146.63 ms per token
main:    total time = 89922.15 ms

@eschaffn commented Mar 17, 2023

Intel i9-13900KS
Nvidia RTX 4090
I gave WSL 28 GB of RAM and 50% of the disk as swap.

Running on Windows 10 with WSL 2 Ubuntu, with CUDA.
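
For anyone reproducing this: WSL 2 resource limits live in a .wslconfig file in your Windows user profile. A minimal sketch, assuming the 28 GB figure above (the swap size is an illustrative stand-in for the "50% of the disk" setting):

# %UserProfile%\.wslconfig (Windows side); run `wsl --shutdown` afterwards to apply
[wsl2]
memory=28GB
swap=500GB    # illustrative: set this to roughly half your disk size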

main: seed = 1679015450
bloom_model_load: loading model from './models/ggml-model-bloomz-7b1-f16.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 4096
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 32
bloom_model_load: n_layer = 30
bloom_model_load: f16     = 1
bloom_model_load: n_ff    = 16384
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 15927.64 MB
bloom_model_load: memory_size =   480.00 MB, n_mem = 15360
bloom_model_load: loading model part 1/1 from './models/ggml-model-bloomz-7b1-f16.bin'
bloom_model_load: ............................................. done
bloom_model_load: model size = 15446.16 MB / num tensors = 366

main: prompt: 'Je vais'
main: number of tokens in prompt = 2
  5830 -> 'Je'
 17935 -> ' vais'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Je vais maintenant discuter de quelques des propriétés digitales</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time = 237397.50 ms
main:   sample time =   112.54 ms
main:  predict time =  1832.55 ms / 203.62 ms per token
main:    total time = 240065.61 ms

After quantization:

main: seed = 1679016169
bloom_model_load: loading model from './models/ggml-model-bloomz-7b1-f16-q4_0.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 4096
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 32
bloom_model_load: n_layer = 30
bloom_model_load: f16     = 2
bloom_model_load: n_ff    = 16384
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 5312.64 MB
bloom_model_load: memory_size =   480.00 MB, n_mem = 15360
bloom_model_load: loading model part 1/1 from './models/ggml-model-bloomz-7b1-f16-q4_0.bin'
bloom_model_load: ............................................. done
bloom_model_load: model size =  4831.16 MB / num tensors = 366

main: prompt: 'Je vais'
main: number of tokens in prompt = 2
  5830 -> 'Je'
 17935 -> ' vais'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Je vais supposer que je veux poser a few queries on the server.</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time =  1892.14 ms
main:   sample time =   178.52 ms
main:  predict time =  1956.12 ms / 139.72 ms per token
main:    total time =  4616.26 ms

@itakafu commented Mar 18, 2023

Tried on my 14-inch MacBook Pro (M2 Max, 96 GB memory) running macOS Ventura 13.2.1!

(ml) ~/W/bloomz.cpp ❯❯❯ make && ./main -m models/ggml-model-bloomz-7b1-f16.bin  -p 'Translate "Hi, how are you?" in French:' -t 8 -n 256
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

make: Nothing to be done for `default'.
main: seed = 1679097305
bloom_model_load: loading model from 'models/ggml-model-bloomz-7b1-f16.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 4096
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 32
bloom_model_load: n_layer = 30
bloom_model_load: f16     = 1
bloom_model_load: n_ff    = 16384
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 15927.64 MB
bloom_model_load: memory_size =   480.00 MB, n_mem = 15360
bloom_model_load: loading model part 1/1 from 'models/ggml-model-bloomz-7b1-f16.bin'
bloom_model_load: ............................................. done
bloom_model_load: model size = 15446.16 MB / num tensors = 366

main: prompt: 'Translate "Hi, how are you?" in French:'
main: number of tokens in prompt = 11
153772 -> 'Translate'
 17959 -> ' "H'
    76 -> 'i'
 98257 -> ', '
 20263 -> 'how'
  1306 -> ' are'
  1152 -> ' you'
  2040 -> '?'
     5 -> '"'
   361 -> ' in'
196427 -> ' French:'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Translate "Hi, how are you?" in French: Bonjour!</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time =  4650.54 ms
main:   sample time =    66.74 ms
main:  predict time =   730.79 ms / 56.21 ms per token
main:    total time =  5695.71 ms

After quantization:

(ml) ~/W/bloomz.cpp ❯❯❯ make && ./main -m models/ggml-model-bloomz-7b1-f16-q4_0.bin  -p 'Translate "Hi, how are you?" in French:' -t 8 -n 256
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

make: Nothing to be done for `default'.
main: seed = 1679097027
bloom_model_load: loading model from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 4096
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 32
bloom_model_load: n_layer = 30
bloom_model_load: f16     = 2
bloom_model_load: n_ff    = 16384
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 5312.64 MB
bloom_model_load: memory_size =   480.00 MB, n_mem = 15360
bloom_model_load: loading model part 1/1 from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin'
bloom_model_load: ............................................. done
bloom_model_load: model size =  4831.16 MB / num tensors = 366

main: prompt: 'Translate "Hi, how are you?" in French:'
main: number of tokens in prompt = 11
153772 -> 'Translate'
 17959 -> ' "H'
    76 -> 'i'
 98257 -> ', '
 20263 -> 'how'
  1306 -> ' are'
  1152 -> ' you'
  2040 -> '?'
     5 -> '"'
   361 -> ' in'
196427 -> ' French:'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Translate "Hi, how are you?" in French: Comment vas-tu?</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time =  1545.36 ms
main:   sample time =   117.36 ms
main:  predict time =   738.26 ms / 49.22 ms per token
main:    total time =  2709.12 ms
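
Across all the runs in this thread, tokens/s is just 1000 divided by the reported "ms per token" figure (e.g. 56.21 ms ≈ 17.8 tokens/s for f16 on the M2 Max, and 49.22 ms ≈ 20.3 tokens/s after quantization). A one-liner sketch to pull that out of a captured run, where run.log is a hypothetical file holding ./main's output:

# prints e.g. "17.8 tokens/s"; assumes the log format shown in the runs above
awk '/ms per token/ { printf "%.1f tokens/s\n", 1000 / $(NF-3) }' run.log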
