
Releases: ggerganov/whisper.cpp

v1.6.0

15 May 07:13
08981d1

Overview

  • Can optionally enable Flash Attention for faster processing on CUDA and Metal devices (#2152) - see the example below
  • Faster ppc64 performance (40aeeee) (not tested)
  • Fix main slowdown bug (#2070)
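
For reference, here is a minimal way to try it from the command line - a sketch assuming the -fa (--flash-attn) flag introduced in #2152 and the usual sample model/audio paths:

# enable Flash Attention for a regular transcription run (flag from #2152)
./main -m models/ggml-base.en.bin -f samples/jfk.wav -fa

# the same flag should also apply to the bench tool used for the tables below
./bench -m models/ggml-base.en.bin -fa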

Shoutout to @JohannesGaessler for contributing efficient FA CUDA kernels

Some performance numbers for this release:

M1 Pro

CPU Config Model Th FA Enc. Dec. Bch5 PP Commit
M1 Pro METAL tiny 1 0 39.21 1.74 0.61 0.04 22c96b4
M1 Pro METAL base 1 0 70.76 2.60 0.93 0.06 22c96b4
M1 Pro METAL small 1 0 217.28 6.42 2.14 0.17 22c96b4
M1 Pro METAL medium 1 0 596.74 14.43 4.75 0.45 22c96b4
CPU Config Model Th FA Enc. Dec. Bch5 PP Commit
M1 Pro METAL tiny 1 1 30.77 1.59 0.54 0.03 22c96b4
M1 Pro METAL base 1 1 60.42 2.29 0.81 0.05 22c96b4
M1 Pro METAL small 1 1 183.82 5.12 1.81 0.14 22c96b4
M1 Pro METAL medium 1 1 517.92 11.60 4.01 0.38 22c96b4

M2 Ultra

CPU Config Model Th FA Enc. Dec. Bch5 PP Commit
M2 ULTRA METAL tiny 1 0 12.32 1.35 0.49 0.01 22c96b4
M2 ULTRA METAL tiny-q5_0 1 0 11.65 1.30 0.51 0.01 22c96b4
M2 ULTRA METAL tiny-q5_1 1 0 12.08 1.30 0.51 0.01 22c96b4
M2 ULTRA METAL base 1 0 17.58 1.90 0.76 0.02 22c96b4
M2 ULTRA METAL base-q5_0 1 0 18.89 1.86 0.79 0.02 22c96b4
M2 ULTRA METAL base-q5_1 1 0 20.69 1.88 0.79 0.02 22c96b4
M2 ULTRA METAL small 1 0 49.32 3.85 1.71 0.05 22c96b4
M2 ULTRA METAL small-q5_0 1 0 54.91 3.81 1.82 0.06 22c96b4
M2 ULTRA METAL small-q5_1 1 0 54.92 3.81 1.79 0.06 22c96b4
M2 ULTRA METAL medium 1 0 134.34 8.04 3.82 0.13 22c96b4
M2 ULTRA METAL medium-q5_0 1 0 151.68 7.59 4.07 0.14 22c96b4
M2 ULTRA METAL medium-q5_1 1 0 151.58 7.67 4.07 0.14 22c96b4
M2 ULTRA METAL medium-dis 1 0 120.82 1.07 0.41 0.02 22c96b4
M2 ULTRA METAL large-v2 1 0 235.63 12.27 5.85 0.22 22c96b4
M2 ULTRA METAL large-v2-q5_0 1 0 273.38 11.17 6.40 0.26 22c96b4
M2 ULTRA METAL large-v2-q5_1 1 0 272.44 11.32 6.29 0.26 22c96b4
M2 ULTRA METAL large-v2-dis 1 0 212.51 1.20 0.47 0.02 22c96b4
CPU Config Model Th FA Enc. Dec. Bch5 PP Commit
M2 ULTRA METAL tiny 1 1 9.07 1.33 0.45 0.01 22c96b4
M2 ULTRA METAL tiny-q5_0 1 1 9.74 1.33 0.47 0.01 22c96b4
M2 ULTRA METAL tiny-q5_1 1 1 8.93 1.31 0.46 0.01 22c96b4
M2 ULTRA METAL base 1 1 15.75 1.87 0.71 0.02 22c96b4
M2 ULTRA METAL base-q5_0 1 1 17.04 1.83 0.74 0.02 22c96b4
M2 ULTRA METAL base-q5_1 1 1 17.17 1.83 0.74 0.02 22c96b4
M2 ULTRA METAL small 1 1 42.33 3.64 1.60 0.05 22c96b4
M2 ULTRA METAL small-q5_0 1 1 47.61 3.63 1.70 0.05 22c96b4
M2 ULTRA METAL small-q5_1 1 1 47.70 3.66 1.68 0.05 22c96b4
M2 ULTRA METAL medium 1 1 114.42 7.53 3.55 0.11 22c96b4
M2 ULTRA METAL medium-q5_0 1 1 132.63 7.02 3.77 0.13 22c96b4
M2 ULTRA METAL medium-q5_1 1 1 132.28 7.10 3.76 0.13 22c96b4
M2 ULTRA METAL medium-dis 1 1 102.34 1.01 0.42 0.01 22c96b4
M2 ULTRA METAL large-v2 1 1 203.01 11.03 5.45 0.20 22c96b4
M2 ULTRA METAL large-v2-q5_0 1 1 240.05 10.18 5.98 0.23 22c96b4
M2 ULTRA METAL large-v2-q5_1 1 1 239.22 10.23 5.87 0.23 22c96b4
M2 ULTRA METAL large-v2-dis 1 1 181.14 1.14 0.48 0.02 22c96b4

Ryzen 9 5950X + RTX 2060

CPU Config Model Th FA Enc. Dec. Bch5 PP Commit
Ryzen 9 5950X AVX2 tiny 8 0 195.29 1.57 0.51 0.26 22c96b4
Ryzen 9 5950X AVX2 tiny-q5_0 8 0 213.33 1.10 0.50 0.30 22c96b4
Ryzen 9 5950X AVX2 tiny-q5_1 8 0 219.38 1.18 0.53 0.32 22c96b4
Ryzen 9 5950X AVX2 base 8 0 424.85 3.71 1.03 0.46 22c96b4
Ryzen 9 5950X AVX2 base-q5_0 8 0 473.61 1.81 0.82 0.52 22c96b4
Ryzen 9 5950X AVX2 base-q5_1 8 0 484.14 1.92 0.85 0.56 22c96b4
Ryzen 9 5950X AVX2 small 8 0 1458.32 12.66 3.09 1.26 22c96b4
Ryzen 9 5950X AVX2 small-q5_0 8 0 1673.22 6.42 2.18 1.45 22c96b4
Ryzen 9 5950X AVX2 small-q5_1 8 0 1724.78 6.72 2.32 1.52 22c96b4
Ryzen 9 5950X AVX2 medium 8 0 4333.87 36.80 8.56 3.37 22c96b4
Ryzen 9 5950X AVX2 medium-q5_0 8 0 5194.09 19.21 5.71 3.97 22c96b4
Ryzen 9 5950X AVX2 medium-q5_1 8 0 5450.39 20.01 5.99 4.17 22c96b4
Ryzen 9 5950X AVX2 medium-dis 8 0 3995.19 5.08 1.21 0.55 22c96b4
Ryzen 9 5950X AVX2 large-v2 8 0 8056.16 69.74 16.11 6.13 22c96b4
Ryzen 9 5950X AVX2 large-v2-q5_0 8 0 9799.58 35.16 10.49 7.28 22c96b4
Ryzen 9 5950X AVX2 large-v2-q5_1 8 0 ms 36.74 11.02 7.65 22c96b4
Ryzen 9 5950X AVX2 large-v2-dis 8 0 7490.03 7.40 1.70 0.72 22c96b4
GPU Config Model Th FA Enc. Dec. Bch5 PP Commit
RTX 2060 AVX2 CUDA tiny 8 0 12.54 0.93 0.29 0.02 22c96b4
RTX 2060 AVX2 CUDA tiny-q5_0 8 0 12.73 0.98 0.24 0.02 22c96b4
RTX 2060 AVX2 CUDA tiny-q5_1 8 0 12.72 0.99 0.24 0.02 22c96b4
RTX 2060 AVX2 CUDA base 8 0 24.14 1.28 0.41 0.03 22c96b4
RTX 2060 AVX2 CUDA base-q5_0 8 0 24.58 1.38 0.35 0.03 22c96b4
RTX 2060 AVX2 CUDA base-q5_1 8 0 24.58 1.37 0.35 0.03 22c96b4
RTX 2060 AVX2 CUDA small 8 0 74.70 2.91 0.84 0.07 22c96b4
RTX 2060 AVX2 CUDA small-q5_0 8 0 76.12 2.84 0.77 0.08 22c96b4
RTX 2060 AVX2 CUDA small-q5_1 8 0 76.14 2.84 0.76 0.08 22c96b4
RTX 2060 AVX2 CUDA medium 8 0 200.69 6.46 1.83 0.17 22c96b4
RTX 2060 AVX2 CUDA medium-q5_0 8 0 204.80 5.90 1.65 0.19 22c96b4
RTX 2060 AVX2 CUDA medium-q5_1 8 0 205.61 5.85 1.61 0.19 22c96b4
RTX 2060 AVX2 CUDA medium-dis 8 0 186.17 0.86 0.24 0.02 22c96b4
RTX 2060 AVX2 CUDA large-v2 8 0 347.22 10.36 2.82 0.29 22c96b4
RTX 2060 AVX2 CUDA large-v2-q5_0 8 0 357.06 8.81 2.58 0.34 22c96b4
RTX 2060 AVX2 CUDA large-v2-q5_1 8 0 356.97 8.62 2.49 0.33 22c96b4
RTX 2060 AVX2 CUDA large-v2-dis 8 0 318.05 1.03 0.34 0.04 22c96b4
GPU Config Model Th FA Enc. Dec. Bch5 PP Commit
RTX 2060 AVX2 CUDA tiny 8 1 7.21 0.76 0.29 0.02 22c96b4
RTX 2060 AVX2 CUDA tiny-q5_0 8 1 7.42 0.82 0.18 0.02 22c96b4
RTX 2060 AVX2 CUDA tiny-q5_1 8 1 7.38 0.82 0.18 0.02 22c96b4
RTX 2060 AVX2 CUDA ...

v1.5.5

16 Apr 11:14
7395c70

Overview

Many small incremental updates + token-level timestamps with DTW by @denersc in #1485
Feedback is welcome!
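
A quick sketch of enabling the new token-level timestamps from the command line - the --dtw flag and its preset argument are assumptions based on #1485, so verify against ./main --help:

# hypothetical invocation; the preset name should match the model being used
./main -m models/ggml-base.en.bin -f samples/jfk.wav --dtw base.en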

Full Changelog: v1.5.4...v1.5.5


v1.5.4

05 Jan 15:20
0b9af32

Overview

  • Faster Core ML ANE models (#1716)
  • CUDA bugfix causing random errors in the transcription
  • Fix SwiftUI example build

Full Changelog: v1.5.3...v1.5.4

v1.5.3

03 Jan 17:39
9962371

Overview

Minor maintenance release:

  • Fix CUDA issues where the transcription produces garbage
  • Fix quantized models to work with the CUDA backend
  • Allow whisper.cpp and llama.cpp to be used together in SwiftUI projects


Full Changelog: v1.5.2...v1.5.3

v1.5.2

14 Dec 16:06
88112c8

Overview

Minor maintenance release:

  • Re-enable CPU BLAS processing after fixing a regression (#1583)

Add new example: wchess

wchess-0.mp4

Shoutout to @fraxy-v (implementation) and @ejones (grammar) for making it work!


Full Changelog: v1.5.1...v1.5.2

v1.5.1

24 Nov 10:45
9d6ebd8

Overview

Minor update:

  • With Metal, automatically fall back to the CPU if the device does not support the Apple7 GPU family
  • Add server example (see the usage sketch below)
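
A minimal usage sketch for the server example - the default address and the /inference endpoint/field names are assumptions based on the example's README, so double-check them there:

# start the HTTP transcription server (listens on 127.0.0.1:8080 by default)
./server -m models/ggml-base.en.bin

# send an audio file for transcription from another terminal
curl 127.0.0.1:8080/inference -F file="@samples/jfk.wav"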


Full Changelog: v1.5.0...v1.5.1

v1.5.0

15 Nov 21:06
d38af15

Overview

This major release includes the following changes:

  • Full GPU processing of the Encoder and the Decoder with CUDA and Metal is now supported
  • Efficient beam-search implementation via batched decoding and unified KV cache
  • Full quantization support of all available ggml quantization types
  • Support for grammar constrained sampling
  • Support for Distil Whisper models
  • Support for Whisper Large-v3

and more

Full GPU support

On Apple Silicon, GPU support has been available to a large extent since 15 Sep. However, part of the Encoder was still being executed on the CPU due to the lack of MSL kernels for the convolution operations. These kernels are now available, resulting in an additional speed-up of the Encoder in this release:


Encoder performance on Apple M1 Max - before and after (plot by @dreness)

For NVIDIA hardware, the entire computation can now be offloaded to the GPU, which results in a significant performance boost. For a detailed performance breakdown, check out the Benchmarks section below.

The GPU processing on Apple Silicon is enabled by default, while for NVIDIA you need to build with WHISPER_CUBLAS=1:

# Apple Silicon
make

# NVIDIA
WHISPER_CUBLAS=1 make

Implementation: #1472

Special credits to: @FSSRepo, @slaren

Batched decoding + efficient Beam Search

At last, whisper.cpp now supports efficient Beam Search decoding. The missing piece was the implementation of batched decoding, which now closely follows the unified KV cache idea from llama.cpp. On modern NVIDIA hardware, the performance with 5 beams is the same as with 1 beam thanks to the large amount of computing power available. With Metal, 5 beams is a bit slower than 1 beam, but it is significantly faster than the roughly 5x single-beam time observed with the old naive implementation.

Beam Search is now enabled by default in whisper.cpp to match the OG implementation of OpenAI Whisper. For more performance details, check out the Benchmarks section below.
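
For example, the beam size can be adjusted from the command line - a sketch assuming the -bs / --beam-size option of the main example:

# beam search is on by default; set an explicit beam size of 5
./main -m models/ggml-base.en.bin -f samples/jfk.wav -bs 5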

Implementation: #1486

Quantization support

All ggml quantization types are now supported, and quantization mixtures for the Whisper model can be implemented. It is still unclear how the quality is affected by quantization - this is an interesting area that can be explored in the future.
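
As an illustration, producing and using a quantized model with the bundled quantize tool - a sketch assuming the usual model paths, with q5_0 standing in for any supported ggml type:

# quantize the f16 ggml model to q5_0
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

# the quantized model is a drop-in replacement
./main -m models/ggml-base.en-q5_0.bin -f samples/jfk.wav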

Grammar sampling

The decoder output can now be constrained with a GBNF grammar. This can be a useful technique for further improving the transcription quality in situations where the set of possible phrases is limited.

whisper-chess.mp4
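
To give an idea of the format, here is a tiny illustrative GBNF grammar (not taken from the repository) that would constrain the output to simple chess-like voice commands:

root   ::= piece " to " square
piece  ::= "king" | "queen" | "rook" | "bishop" | "knight" | "pawn"
square ::= [a-h] [1-8]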

Implementation: #1229

Special credits to @ejones

Distil Whisper

Recently, Distil Whisper models have been released: https://huggingface.co/distil-whisper

whisper.cpp offers support for these models, although it still lacks full implementation of the proposed chunking strategy. Performance details for distilled models are included in the Benchmarks section below.

Implementation: #1424

Whisper Large-v3

Recently, OpenAI released version 3 of the Large model: openai/whisper#1761

Implementation: #1444

Benchmarks

Below is a breakdown of the performance of whisper.cpp on Apple Silicon, NVIDIA and CPU. The tables show the Encoder and Decoder speed in ms/tok. The Dec. column corresponds to batch size 1. The Bch5 column corresponds to batch size 5. The PP column corresponds to batch size 128.

For optimal Beam Search performance, the Bch5 number should be 5 times smaller than Dec.
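
These numbers are produced with the bench example (the repository also contains helper scripts that sweep over models); a minimal sketch of timing a single model - verify the flags with ./bench --help:

# measure encoder/decoder performance for one model using 1 thread
./bench -m models/ggml-tiny.bin -t 1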

Hw Config Model Th Enc. Dec. Bch5 PP Commit
M2 Ultra METAL tiny 1 11.14 1.40 0.49 0.01 ccc85b4
M2 Ultra METAL tiny-q5_0 1 11.51 1.41 0.52 0.01 ccc85b4
M2 Ultra METAL tiny-q5_1 1 12.21 1.41 0.52 0.01 ccc85b4
M2 Ultra METAL base 1 20.21 2.05 0.77 0.02 ccc85b4
M2 Ultra METAL base-q5_0 1 19.89 1.96 0.81 0.02 ccc85b4
M2 Ultra METAL base-q5_1 1 20.14 2.02 0.81 0.02 ccc85b4
M2 Ultra METAL small 1 51.01 3.97 1.74 0.05 ccc85b4
M2 Ultra METAL small-q5_0 1 56.86 4.09 1.85 0.06 ccc85b4
M2 Ultra METAL small-q5_1 1 56.81 4.14 1.85 0.06 ccc85b4
M2 Ultra METAL medium 1 141.21 8.47 3.98 0.13 ccc85b4
M2 Ultra METAL medium-q5_0 1 160.56 8.27 4.18 0.14 ccc85b4
M2 Ultra METAL medium-q5_1 1 160.52 8.40 4.15 0.14 ccc85b4
M2 Ultra METAL medium-dis 1 128.14 1.13 0.43 0.02 ccc85b4
M2 Ultra METAL large-v2 1 248.73 11.96 6.08 0.22 ccc85b4
M2 Ultra METAL large-v2-q5_0 1 286.31 11.99 6.60 0.26 ccc85b4
M2 Ultra METAL large-v2-q5_1 1 284.56 12.42 6.47 0.26 ccc85b4
M2 Ultra METAL large-v2-dis 1 224.31 1.26 0.49 0.02 ccc85b4
Hw Config Model Th Enc. Dec. Bch5 PP Commit
M2 Ultra COREML METAL tiny 1 7.60 1.41 0.50 0.01 ccc85b4
M2 Ultra COREML METAL base 1 11.90 2.07 0.78 0.02 ccc85b4
M2 Ultra COREML METAL small 1 32.19 4.10 1.78 0.05 ccc85b4
M2 Ultra COREML METAL medium 1 94.43 8.40 3.89 0.12 ccc85b4
M2 Ultra COREML METAL large-v2 1 179.78 12.12 6.07 0.22 ccc85b4
Hw Config Model Th Enc. Dec. Bch5 PP Commit
NVIDIA V100 BLAS CUDA tiny 1 8.84 1.62 0.33 0.02 ccc85b4
NVIDIA V100 BLAS CUDA tiny-q5_0 1 8.43 1.19 0.31 0.02 ccc85b4
NVIDIA V100 BLAS CUDA tiny-q5_1 1 8.41 1.19 0.29 0.02 ccc85b4
NVIDIA V100 BLAS CUDA base 1 14.79 2.31 0.46 0.03 ccc85b4
NVIDIA V100 BLAS CUDA base-q5_0 1 15.05 1.66 0.44 0.03 ccc85b4
NVIDIA V100 BLAS CUDA base-q5_1 1 15.01 1.68 0.46 0.03 ccc85b4
NVIDIA V100 BLAS CUDA small 1 40.30 4.37 0.88 0.05 ccc85b4
NVIDIA V100 BLAS CUDA small-q5_0 1 41.17 3.11 0.94 0.05 ccc85b4
NVIDIA V100 BLAS CUDA small-q5_1 1 41.12 3.11 0.82 0.05 ccc85b4
NVIDIA V100 BLAS CUDA medium 1 104.93 10.06 1.77 0.11 ccc85b4
NVIDIA V100 BLAS CUDA medium-q5_0 1 107.11 6.13 2.07 0.12 ccc85b4
NVIDIA V100 BLAS CUDA medium-q5_1 1 107.91 6.21 1.77 0.12 ccc85b4
NVIDIA V100 BLAS CUDA medium-dis 1 103.45 1.11 0.24 0.02 ccc85b4
NVIDIA V100 BLAS CUDA large-v2 1 171.55 15.76 2.62 0.17 ccc85b4
NVIDIA V100 BLAS CUDA large-v2-q5_0 1 176.27 8.61 3.17 0.19 ccc85b4
NVIDIA V100 BLAS CUDA large-v2-q5_1 1 176.23 8.67 2.59 0.19 ccc85b4
Hw Config Model Th Enc. Dec. Bch5 PP Commit
AMD Ryzen 9 5950X AVX2 tiny 8 197.47 1.22 0.44 0.25 ccc85b4
AMD Ryzen 9 5950X AVX2 tiny-q5_0 8 222.92 0.87 0.45 0.30 ccc85b4
AMD Ryzen 9 5950X AVX2 tiny-q5_1 8 221.25 0.89 0.45 0.30 ccc85b4
AMD Ryzen 9 5950X AVX2 base 8 427.14 3.11 0.88 0.43 ccc85b4
AMD Ryzen 9 5950X AVX2 base-q5_0 8 474.96 1.41 0.72 0.51 ccc85b4
AMD Ryzen 9 5950X AVX2 base-q5_1 8 485.05 1.48 0.73 0.52 ccc85b4
AMD Ryzen 9 5950X AVX2 small 8 1470.51 11.70 2.89 1.21 ccc85b4
AMD Ryzen 9 5950X AVX2 small-q5_0 8 1700.43 5.48 1.98 1.41 ccc85b4
AMD Ryzen 9 5950X AVX2 small-q5_1 8 1719.03 5.79 2.02 1.42 ccc85b4
AMD Ryzen 9 5950X AVX2 medium 8 4417.70 35.13 8.14...

v1.4.3

07 Nov 14:29
6a5d195
Pre-release

This is a minor release; the main reason for it is that there hasn't been an official release for a few months and some small things have accumulated on the master branch that would be nice to get into a release. I am planning a major v1.5.0 release with some new and long-awaited functionality soon:

  • Full CUDA offloading
  • Efficient Beam-Search implementation
  • Grammar support

The current version v1.4.3 should be considered to be in beta, as I haven't worked intensively on whisper.cpp recently and there might be some issues that made their way into the code. I'll try to polish things up in the coming days and prepare a stable v1.5.0 release. In the meantime, any feedback will be highly appreciated.

Detailed API changes, features and new contributor recognitions will be included in the v1.5.0 release.

v1.4.0

30 Apr 16:56
fa8dbdc

Overview

This is a new major release adding integer quantization and partial GPU (NVIDIA) support.

Integer quantization

This allows the ggml Whisper models to be converted from the default 16-bit floating point weights to 4-, 5- or 8-bit integer weights.
The resulting quantized models are smaller on disk and in memory and can be processed faster on some architectures. The transcription quality is degraded to some extent - this has not been quantified at the moment.

  • Supported quantization modes: Q4_0, Q4_1, Q4_2, Q5_0, Q5_1, Q8_0
  • Implementation details: #540
  • Usage instructions: README
  • All WASM examples now support Q5 quantized models: https://whisper.ggerganov.com

Here is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results can give an impression of the expected quality, size and speed of quantized Whisper models:

LLaMA quantization (measured on M1 Pro)

Model Measure F16 Q4_0 Q4_1 Q4_2 Q5_0 Q5_1 Q8_0
7B perplexity 5.9565 6.2103 6.1286 6.1698 6.0139 5.9934 5.9571
7B file size 13.0G 4.0G 4.8G 4.0G 4.4G 4.8G 7.1G
7B ms/tok @ 4th 128 56 61 84 91 95 75
7B ms/tok @ 8th 128 47 55 48 53 59 75
7B bits/weight 16.0 5.0 6.0 5.0 5.5 6.0 9.0
13B perplexity 5.2455 5.3748 5.3471 5.3433 5.2768 5.2582 5.2458
13B file size 25.0G 7.6G 9.1G 7.6G 8.4G 9.1G 14G
13B ms/tok @ 4th 239 104 113 160 176 185 141
13B ms/tok @ 8th 240 85 99 97 108 117 147
13B bits/weight 16.0 5.0 6.0 5.0 5.5 6.0 9.0

ref: https://github.com/ggerganov/llama.cpp#quantization

RWKV quantization

Format Perplexity (169M) Latency, ms (1.5B) File size, GB (1.5B)
Q4_0 17.507 76 1.53
Q4_1 17.187 72 1.68
Q4_2 17.060 85 1.53
Q5_0 16.194 78 1.60
Q5_1 15.851 81 1.68
Q8_0 15.652 89 2.13
FP16 15.623 117 2.82
FP32 15.623 198 5.64

ref: ggerganov/ggml#89 (comment)

This feature is possible thanks to the many contributions in the llama.cpp project: https://github.com/users/ggerganov/projects/2

GPU support via cuBLAS

Using cuBLAS results mainly in improved Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation with modern NVIDIA GPU cards compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.

  • Implementation details: #834
  • Usage instructions: README

This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together.


This release remains in "beta" stage as I haven't verified that everything works as expected.


Full Changelog: v1.3.0...v1.4.0

v1.3.0

15 Apr 14:41
c23588c

Overview

This release should be considered to be in beta stage, since I haven't done a lot of testing and I am not sure whether something has been broken.
But overall, I believe both the performance and the quality are improved.

  • Added Core ML support #566 (see the build sketch below)
  • Restored decoding fallbacks with default size of 2 instead of 5 (f19e23f)
  • Pad the audio with zeros instead of the spectrogram (5108b30)
  • Added talk-llama example
  • Added whisper_state which allows parallel transcriptions with a single model in memory (#523)
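
A short sketch of trying the new Core ML path - commands as I recall them from the README's Core ML section, so treat the exact script and variable names as assumptions:

# generate the Core ML encoder model (requires the Python coremltools dependencies)
./models/generate-coreml-model.sh base.en

# rebuild with Core ML enabled and run as usual; the generated .mlmodelc is picked up automatically
make clean
WHISPER_COREML=1 make -j
./main -m models/ggml-base.en.bin -f samples/jfk.wav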

The C-style API has been extended significantly to support the new whisper_state, but in general it should be backwards compatible.
The only breaking change is in the callback signatures.
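
A minimal C sketch of the new whisper_state API - function names as they appear in whisper.h at the time of writing (audio loading and error checking omitted), so verify against the header:

#include "whisper.h"

// pcm0/pcm1 are two independent 16 kHz mono float PCM buffers
void transcribe_two_streams(const float * pcm0, int n0, const float * pcm1, int n1) {
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-base.en.bin");

    // each state owns the per-transcription buffers (KV cache, etc.),
    // so multiple states can share the same model weights
    struct whisper_state * st0 = whisper_init_state(ctx);
    struct whisper_state * st1 = whisper_init_state(ctx);

    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    // these calls can run concurrently, e.g. from two different threads
    whisper_full_with_state(ctx, st0, params, pcm0, n0);
    whisper_full_with_state(ctx, st1, params, pcm1, n1);

    whisper_free_state(st0);
    whisper_free_state(st1);
    whisper_free(ctx);
}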

Please provide feedback in the discussion if you observe any issues.

The next release, v1.4.0, will follow relatively soon and will provide 4-bit integer quantization support.


Full Changelog: v1.2.1...v1.3.0