
Releases: ggerganov/whisper.cpp

v1.6.0

15 May 07:13
08981d1

Overview

  • Can optionally enable Flash Attention for faster processing on CUDA and Metal devices (#2152) - see the example below
  • Faster ppc64 performance (40aeeee) (not tested)
  • Fix main slowdown bug (#2070)
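
For reference, here is a minimal way to try it from the command line - a sketch assuming the -fa (--flash-attn) flag introduced in #2152 and the usual sample model/audio paths:

# enable Flash Attention for a regular transcription run (flag from #2152)
./main -m models/ggml-base.en.bin -f samples/jfk.wav -fa

# the same flag should also apply to the bench tool used for the tables below
./bench -m models/ggml-base.en.bin -fa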

Shoutout to @JohannesGaessler for contributing efficient FA CUDA kernels

Some performance numbers for this release:

M1 Pro

CPU Config Model Th FA Enc. Dec. Bch5 PP Commit
M1 Pro METAL tiny 1 0 39.21 1.74 0.61 0.04 22c96b4
M1 Pro METAL base 1 0 70.76 2.60 0.93 0.06 22c96b4
M1 Pro METAL small 1 0 217.28 6.42 2.14 0.17 22c96b4
M1 Pro METAL medium 1 0 596.74 14.43 4.75 0.45 22c96b4
CPU Config Model Th FA Enc. Dec. Bch5 PP Commit
M1 Pro METAL tiny 1 1 30.77 1.59 0.54 0.03 22c96b4
M1 Pro METAL base 1 1 60.42 2.29 0.81 0.05 22c96b4
M1 Pro METAL small 1 1 183.82 5.12 1.81 0.14 22c96b4
M1 Pro METAL medium 1 1 517.92 11.60 4.01 0.38 22c96b4

M2 Ultra

CPU Config Model Th FA Enc. Dec. Bch5 PP Commit
M2 ULTRA METAL tiny 1 0 12.32 1.35 0.49 0.01 22c96b4
M2 ULTRA METAL tiny-q5_0 1 0 11.65 1.30 0.51 0.01 22c96b4
M2 ULTRA METAL tiny-q5_1 1 0 12.08 1.30 0.51 0.01 22c96b4
M2 ULTRA METAL base 1 0 17.58 1.90 0.76 0.02 22c96b4
M2 ULTRA METAL base-q5_0 1 0 18.89 1.86 0.79 0.02 22c96b4
M2 ULTRA METAL base-q5_1 1 0 20.69 1.88 0.79 0.02 22c96b4
M2 ULTRA METAL small 1 0 49.32 3.85 1.71 0.05 22c96b4
M2 ULTRA METAL small-q5_0 1 0 54.91 3.81 1.82 0.06 22c96b4
M2 ULTRA METAL small-q5_1 1 0 54.92 3.81 1.79 0.06 22c96b4
M2 ULTRA METAL medium 1 0 134.34 8.04 3.82 0.13 22c96b4
M2 ULTRA METAL medium-q5_0 1 0 151.68 7.59 4.07 0.14 22c96b4
M2 ULTRA METAL medium-q5_1 1 0 151.58 7.67 4.07 0.14 22c96b4
M2 ULTRA METAL medium-dis 1 0 120.82 1.07 0.41 0.02 22c96b4
M2 ULTRA METAL large-v2 1 0 235.63 12.27 5.85 0.22 22c96b4
M2 ULTRA METAL large-v2-q5_0 1 0 273.38 11.17 6.40 0.26 22c96b4
M2 ULTRA METAL large-v2-q5_1 1 0 272.44 11.32 6.29 0.26 22c96b4
M2 ULTRA METAL large-v2-dis 1 0 212.51 1.20 0.47 0.02 22c96b4
CPU Config Model Th FA Enc. Dec. Bch5 PP Commit
M2 ULTRA METAL tiny 1 1 9.07 1.33 0.45 0.01 22c96b4
M2 ULTRA METAL tiny-q5_0 1 1 9.74 1.33 0.47 0.01 22c96b4
M2 ULTRA METAL tiny-q5_1 1 1 8.93 1.31 0.46 0.01 22c96b4
M2 ULTRA METAL base 1 1 15.75 1.87 0.71 0.02 22c96b4
M2 ULTRA METAL base-q5_0 1 1 17.04 1.83 0.74 0.02 22c96b4
M2 ULTRA METAL base-q5_1 1 1 17.17 1.83 0.74 0.02 22c96b4
M2 ULTRA METAL small 1 1 42.33 3.64 1.60 0.05 22c96b4
M2 ULTRA METAL small-q5_0 1 1 47.61 3.63 1.70 0.05 22c96b4
M2 ULTRA METAL small-q5_1 1 1 47.70 3.66 1.68 0.05 22c96b4
M2 ULTRA METAL medium 1 1 114.42 7.53 3.55 0.11 22c96b4
M2 ULTRA METAL medium-q5_0 1 1 132.63 7.02 3.77 0.13 22c96b4
M2 ULTRA METAL medium-q5_1 1 1 132.28 7.10 3.76 0.13 22c96b4
M2 ULTRA METAL medium-dis 1 1 102.34 1.01 0.42 0.01 22c96b4
M2 ULTRA METAL large-v2 1 1 203.01 11.03 5.45 0.20 22c96b4
M2 ULTRA METAL large-v2-q5_0 1 1 240.05 10.18 5.98 0.23 22c96b4
M2 ULTRA METAL large-v2-q5_1 1 1 239.22 10.23 5.87 0.23 22c96b4
M2 ULTRA METAL large-v2-dis 1 1 181.14 1.14 0.48 0.02 22c96b4

Ryzen 9 5950X + RTX 2060

CPU Config Model Th FA Enc. Dec. Bch5 PP Commit
Ryzen 9 5950X AVX2 tiny 8 0 195.29 1.57 0.51 0.26 22c96b4
Ryzen 9 5950X AVX2 tiny-q5_0 8 0 213.33 1.10 0.50 0.30 22c96b4
Ryzen 9 5950X AVX2 tiny-q5_1 8 0 219.38 1.18 0.53 0.32 22c96b4
Ryzen 9 5950X AVX2 base 8 0 424.85 3.71 1.03 0.46 22c96b4
Ryzen 9 5950X AVX2 base-q5_0 8 0 473.61 1.81 0.82 0.52 22c96b4
Ryzen 9 5950X AVX2 base-q5_1 8 0 484.14 1.92 0.85 0.56 22c96b4
Ryzen 9 5950X AVX2 small 8 0 1458.32 12.66 3.09 1.26 22c96b4
Ryzen 9 5950X AVX2 small-q5_0 8 0 1673.22 6.42 2.18 1.45 22c96b4
Ryzen 9 5950X AVX2 small-q5_1 8 0 1724.78 6.72 2.32 1.52 22c96b4
Ryzen 9 5950X AVX2 medium 8 0 4333.87 36.80 8.56 3.37 22c96b4
Ryzen 9 5950X AVX2 medium-q5_0 8 0 5194.09 19.21 5.71 3.97 22c96b4
Ryzen 9 5950X AVX2 medium-q5_1 8 0 5450.39 20.01 5.99 4.17 22c96b4
Ryzen 9 5950X AVX2 medium-dis 8 0 3995.19 5.08 1.21 0.55 22c96b4
Ryzen 9 5950X AVX2 large-v2 8 0 8056.16 69.74 16.11 6.13 22c96b4
Ryzen 9 5950X AVX2 large-v2-q5_0 8 0 9799.58 35.16 10.49 7.28 22c96b4
Ryzen 9 5950X AVX2 large-v2-q5_1 8 0 ms 36.74 11.02 7.65 22c96b4
Ryzen 9 5950X AVX2 large-v2-dis 8 0 7490.03 7.40 1.70 0.72 22c96b4
GPU Config Model Th FA Enc. Dec. Bch5 PP Commit
RTX 2060 AVX2 CUDA tiny 8 0 12.54 0.93 0.29 0.02 22c96b4
RTX 2060 AVX2 CUDA tiny-q5_0 8 0 12.73 0.98 0.24 0.02 22c96b4
RTX 2060 AVX2 CUDA tiny-q5_1 8 0 12.72 0.99 0.24 0.02 22c96b4
RTX 2060 AVX2 CUDA base 8 0 24.14 1.28 0.41 0.03 22c96b4
RTX 2060 AVX2 CUDA base-q5_0 8 0 24.58 1.38 0.35 0.03 22c96b4
RTX 2060 AVX2 CUDA base-q5_1 8 0 24.58 1.37 0.35 0.03 22c96b4
RTX 2060 AVX2 CUDA small 8 0 74.70 2.91 0.84 0.07 22c96b4
RTX 2060 AVX2 CUDA small-q5_0 8 0 76.12 2.84 0.77 0.08 22c96b4
RTX 2060 AVX2 CUDA small-q5_1 8 0 76.14 2.84 0.76 0.08 22c96b4
RTX 2060 AVX2 CUDA medium 8 0 200.69 6.46 1.83 0.17 22c96b4
RTX 2060 AVX2 CUDA medium-q5_0 8 0 204.80 5.90 1.65 0.19 22c96b4
RTX 2060 AVX2 CUDA medium-q5_1 8 0 205.61 5.85 1.61 0.19 22c96b4
RTX 2060 AVX2 CUDA medium-dis 8 0 186.17 0.86 0.24 0.02 22c96b4
RTX 2060 AVX2 CUDA large-v2 8 0 347.22 10.36 2.82 0.29 22c96b4
RTX 2060 AVX2 CUDA large-v2-q5_0 8 0 357.06 8.81 2.58 0.34 22c96b4
RTX 2060 AVX2 CUDA large-v2-q5_1 8 0 356.97 8.62 2.49 0.33 22c96b4
RTX 2060 AVX2 CUDA large-v2-dis 8 0 318.05 1.03 0.34 0.04 22c96b4
GPU Config Model Th FA Enc. Dec. Bch5 PP Commit
RTX 2060 AVX2 CUDA tiny 8 1 7.21 0.76 0.29 0.02 22c96b4
RTX 2060 AVX2 CUDA tiny-q5_0 8 1 7.42 0.82 0.18 0.02 22c96b4
RTX 2060 AVX2 CUDA tiny-q5_1 8 1 7.38 0.82 0.18 0.02 22c96b4
RTX 2060 AVX2 CUDA ...

v1.5.5

16 Apr 11:14
7395c70

Overview

Many small incremental updates + token-level timestamps with DTW by @denersc in #1485
Feedback is welcome!
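
A quick sketch of enabling the new token-level timestamps from the command line - the --dtw flag and its preset argument are assumptions based on #1485, so verify against ./main --help:

# hypothetical invocation; the preset name should match the model being used
./main -m models/ggml-base.en.bin -f samples/jfk.wav --dtw base.en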

Full Changelog: v1.5.4...v1.5.5


v1.5.4

05 Jan 15:20
0b9af32

Overview

  • Faster Core ML ANE models (#1716)
  • CUDA bugfix causing random errors in the transcription
  • Fix SwiftUI example build

Full Changelog: v1.5.3...v1.5.4

v1.5.3

03 Jan 17:39
9962371

Overview

Minor maintenance release:

  • Fix CUDA issues where the transcription produces garbage
  • Fix quantized models to work with the CUDA backend
  • Allow whisper.cpp and llama.cpp to be used together in SwiftUI projects


Full Changelog: v1.5.2...v1.5.3

v1.5.2

14 Dec 16:06
88112c8

Overview

Minor maintenance release:

  • Re-enable CPU BLAS processing after fixing a regression (#1583)

Add new example: wchess

wchess-0.mp4

Shoutout to @fraxy-v (implementation) and @ejones (grammar) for making it work!


Full Changelog: v1.5.1...v1.5.2

v1.5.1

24 Nov 10:45
9d6ebd8

Overview

Minor update:

  • With Metal, automatically fall back to the CPU if the device does not support the Apple7 GPU family
  • Add server example (see the usage sketch below)
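
A minimal usage sketch for the server example - the default address and the /inference endpoint/field names are assumptions based on the example's README, so double-check them there:

# start the HTTP transcription server (listens on 127.0.0.1:8080 by default)
./server -m models/ggml-base.en.bin

# send an audio file for transcription from another terminal
curl 127.0.0.1:8080/inference -F file="@samples/jfk.wav"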


Full Changelog: v1.5.0...v1.5.1

v1.5.0

15 Nov 21:06
d38af15

Overview

This major release includes the following changes:

  • Full GPU processing of the Encoder and the Decoder with CUDA and Metal is now supported
  • Efficient beam-search implementation via batched decoding and unified KV cache
  • Full quantization support of all available ggml quantization types
  • Support for grammar constrained sampling
  • Support for Distil Whisper models
  • Support for Whisper Large-v3

and more

Full GPU support

On Apple Silicon, GPU support has been available to a large extent since 15 Sep. However, part of the Encoder was still being executed on the CPU due to the lack of MSL kernels for the convolution operations. These kernels are now available, resulting in an additional speed-up of the Encoder in this release:


Encoder performance on Apple M1 Max - before and after (plot by @dreness)

For NVIDIA hardware, the entire computation can now be offloaded to the GPU, which results in a significant performance boost. For a detailed performance breakdown, check out the Benchmarks section below.

The GPU processing on Apple Silicon is enabled by default, while for NVIDIA you need to build with WHISPER_CUBLAS=1:

# Apple Silicon
make

# NVIDIA
WHISPER_CUBLAS=1 make

Implementation: #1472

Special credits to: @FSSRepo, @slaren

Batched decoding + efficient Beam Search

At last, whisper.cpp now supports efficient Beam Search decoding. The missing piece was the implementation of batched decoding, which now closely follows the unified KV cache idea from llama.cpp. On modern NVIDIA hardware, the performance with 5 beams is the same as with 1 beam thanks to the large amount of computing power available. With Metal, 5 beams is a bit slower than 1 beam, but it is significantly faster than the roughly 5x single-beam time observed with the old naive implementation.

Beam Search is now enabled by default in whisper.cpp to match the OG implementation of OpenAI Whisper. For more performance details, check out the Benchmarks section below.
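
For example, the beam size can be adjusted from the command line - a sketch assuming the -bs / --beam-size option of the main example:

# beam search is on by default; set an explicit beam size of 5
./main -m models/ggml-base.en.bin -f samples/jfk.wav -bs 5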

Implementation: #1486

Quantization support

All ggml quantization types are now supported, and quantization mixtures for the Whisper model can be implemented. It is still unclear how the quality is affected by quantization - this is an interesting area that can be explored in the future.
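
As an illustration, producing and using a quantized model with the bundled quantize tool - a sketch assuming the usual model paths, with q5_0 standing in for any supported ggml type:

# quantize the f16 ggml model to q5_0
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

# the quantized model is a drop-in replacement
./main -m models/ggml-base.en-q5_0.bin -f samples/jfk.wav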

Grammar sampling

The decoder output can now be constrained with a GBNF grammar. This can be a useful technique for further improving the transcription quality in situations where the set of possible phrases is limited.

whisper-chess.mp4
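
To give an idea of the format, here is a tiny illustrative GBNF grammar (not taken from the repository) that would constrain the output to simple chess-like voice commands:

root   ::= piece " to " square
piece  ::= "king" | "queen" | "rook" | "bishop" | "knight" | "pawn"
square ::= [a-h] [1-8]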

Implementation: #1229

Special credits to @ejones

Distil Whisper

Recently, Distil Whisper models have been released: https://huggingface.co/distil-whisper

whisper.cpp offers support for these models, although it still lacks full implementation of the proposed chunking strategy. Performance details for distilled models are included in the Benchmarks section below.

Implementation: #1424

Whisper Large-v3

Recently, OpenAI released version 3 of the Large model: openai/whisper#1761

Implementation: #1444

Benchmarks

Below is a breakdown of the performance of whisper.cpp on Apple Silicon, NVIDIA and CPU. The tables show the Encoder and Decoder speed in ms/tok. The Dec. column corresponds to batch size 1. The Bch5 column corresponds to batch size 5. The PP column corresponds to batch size 128.

For optimal Beam Search performance, the Bch5 number should be 5 times smaller than Dec.
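
These numbers are produced with the bench example (the repository also contains helper scripts that sweep over models); a minimal sketch of timing a single model - verify the flags with ./bench --help:

# measure encoder/decoder performance for one model using 1 thread
./bench -m models/ggml-tiny.bin -t 1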

Hw Config Model Th Enc. Dec. Bch5 PP Commit
M2 Ultra METAL tiny 1 11.14 1.40 0.49 0.01 ccc85b4
M2 Ultra METAL tiny-q5_0 1 11.51 1.41 0.52 0.01 ccc85b4
M2 Ultra METAL tiny-q5_1 1 12.21 1.41 0.52 0.01 ccc85b4
M2 Ultra METAL base 1 20.21 2.05 0.77 0.02 ccc85b4
M2 Ultra METAL base-q5_0 1 19.89 1.96 0.81 0.02 ccc85b4
M2 Ultra METAL base-q5_1 1 20.14 2.02 0.81 0.02 ccc85b4
M2 Ultra METAL small 1 51.01 3.97 1.74 0.05 ccc85b4
M2 Ultra METAL small-q5_0 1 56.86 4.09 1.85 0.06 ccc85b4
M2 Ultra METAL small-q5_1 1 56.81 4.14 1.85 0.06 ccc85b4
M2 Ultra METAL medium 1 141.21 8.47 3.98 0.13 ccc85b4
M2 Ultra METAL medium-q5_0 1 160.56 8.27 4.18 0.14 ccc85b4
M2 Ultra METAL medium-q5_1 1 160.52 8.40 4.15 0.14 ccc85b4
M2 Ultra METAL medium-dis 1 128.14 1.13 0.43 0.02 ccc85b4
M2 Ultra METAL large-v2 1 248.73 11.96 6.08 0.22 ccc85b4
M2 Ultra METAL large-v2-q5_0 1 286.31 11.99 6.60 0.26 ccc85b4
M2 Ultra METAL large-v2-q5_1 1 284.56 12.42 6.47 0.26 ccc85b4
M2 Ultra METAL large-v2-dis 1 224.31 1.26 0.49 0.02 ccc85b4
Hw Config Model Th Enc. Dec. Bch5 PP Commit
M2 Ultra COREML METAL tiny 1 7.60 1.41 0.50 0.01 ccc85b4
M2 Ultra COREML METAL base 1 11.90 2.07 0.78 0.02 ccc85b4
M2 Ultra COREML METAL small 1 32.19 4.10 1.78 0.05 ccc85b4
M2 Ultra COREML METAL medium 1 94.43 8.40 3.89 0.12 ccc85b4
M2 Ultra COREML METAL large-v2 1 179.78 12.12 6.07 0.22 ccc85b4
Hw Config Model Th Enc. Dec. Bch5 PP Commit
NVIDIA V100 BLAS CUDA tiny 1 8.84 1.62 0.33 0.02 ccc85b4
NVIDIA V100 BLAS CUDA tiny-q5_0 1 8.43 1.19 0.31 0.02 ccc85b4
NVIDIA V100 BLAS CUDA tiny-q5_1 1 8.41 1.19 0.29 0.02 ccc85b4
NVIDIA V100 BLAS CUDA base 1 14.79 2.31 0.46 0.03 ccc85b4
NVIDIA V100 BLAS CUDA base-q5_0 1 15.05 1.66 0.44 0.03 ccc85b4
NVIDIA V100 BLAS CUDA base-q5_1 1 15.01 1.68 0.46 0.03 ccc85b4
NVIDIA V100 BLAS CUDA small 1 40.30 4.37 0.88 0.05 ccc85b4
NVIDIA V100 BLAS CUDA small-q5_0 1 41.17 3.11 0.94 0.05 ccc85b4
NVIDIA V100 BLAS CUDA small-q5_1 1 41.12 3.11 0.82 0.05 ccc85b4
NVIDIA V100 BLAS CUDA medium 1 104.93 10.06 1.77 0.11 ccc85b4
NVIDIA V100 BLAS CUDA medium-q5_0 1 107.11 6.13 2.07 0.12 ccc85b4
NVIDIA V100 BLAS CUDA medium-q5_1 1 107.91 6.21 1.77 0.12 ccc85b4
NVIDIA V100 BLAS CUDA medium-dis 1 103.45 1.11 0.24 0.02 ccc85b4
NVIDIA V100 BLAS CUDA large-v2 1 171.55 15.76 2.62 0.17 ccc85b4
NVIDIA V100 BLAS CUDA large-v2-q5_0 1 176.27 8.61 3.17 0.19 ccc85b4
NVIDIA V100 BLAS CUDA large-v2-q5_1 1 176.23 8.67 2.59 0.19 ccc85b4
Hw Config Model Th Enc. Dec. Bch5 PP Commit
AMD Ryzen 9 5950X AVX2 tiny 8 197.47 1.22 0.44 0.25 ccc85b4
AMD Ryzen 9 5950X AVX2 tiny-q5_0 8 222.92 0.87 0.45 0.30 ccc85b4
AMD Ryzen 9 5950X AVX2 tiny-q5_1 8 221.25 0.89 0.45 0.30 ccc85b4
AMD Ryzen 9 5950X AVX2 base 8 427.14 3.11 0.88 0.43 ccc85b4
AMD Ryzen 9 5950X AVX2 base-q5_0 8 474.96 1.41 0.72 0.51 ccc85b4
AMD Ryzen 9 5950X AVX2 base-q5_1 8 485.05 1.48 0.73 0.52 ccc85b4
AMD Ryzen 9 5950X AVX2 small 8 1470.51 11.70 2.89 1.21 ccc85b4
AMD Ryzen 9 5950X AVX2 small-q5_0 8 1700.43 5.48 1.98 1.41 ccc85b4
AMD Ryzen 9 5950X AVX2 small-q5_1 8 1719.03 5.79 2.02 1.42 ccc85b4
AMD Ryzen 9 5950X AVX2 medium 8 4417.70 35.13 8.14...

v1.4.3

07 Nov 14:29
6a5d195
Pre-release

This is a minor release; the main reason for it is that there hasn't been an official release for a few months and some small things have accumulated on the master branch that would be nice to get into a release. I am planning a major v1.5.0 release with some new and long-awaited functionality soon:

  • Full CUDA offloading
  • Efficient Beam-Search implementation
  • Grammar support

The current version v1.4.3 should be considered to be in beta, as I haven't worked intensively on whisper.cpp recently and there might be some issues that made their way into the code. I'll try to polish things up in the coming days and prepare a stable v1.5.0 release. In the meantime, any feedback will be highly appreciated.

Detailed API changes, features and new contributor recognitions will be included in the v1.5.0 release.

v1.4.0

30 Apr 16:56
fa8dbdc

Overview

This is a new major release adding integer quantization and partial GPU (NVIDIA) support.

Integer quantization

This allows the ggml Whisper models to be converted from the default 16-bit floating point weights to 4-, 5- or 8-bit integer weights.
The resulting quantized models are smaller on disk and in memory and can be processed faster on some architectures. The transcription quality is degraded to some extent - this has not been quantified at the moment.

  • Supported quantization modes: Q4_0, Q4_1, Q4_2, Q5_0, Q5_1, Q8_0
  • Implementation details: #540
  • Usage instructions: README
  • All WASM examples now support Q5 quantized models: https://whisper.ggerganov.com

Here is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results can give an impression of the expected quality, size and speed of quantized Whisper models:

LLaMA quantization (measured on M1 Pro)

Model Measure F16 Q4_0 Q4_1 Q4_2 Q5_0 Q5_1 Q8_0
7B perplexity 5.9565 6.2103 6.1286 6.1698 6.0139 5.9934 5.9571
7B file size 13.0G 4.0G 4.8G 4.0G 4.4G 4.8G 7.1G
7B ms/tok @ 4th 128 56 61 84 91 95 75
7B ms/tok @ 8th 128 47 55 48 53 59 75
7B bits/weight 16.0 5.0 6.0 5.0 5.5 6.0 9.0
13B perplexity 5.2455 5.3748 5.3471 5.3433 5.2768 5.2582 5.2458
13B file size 25.0G 7.6G 9.1G 7.6G 8.4G 9.1G 14G
13B ms/tok @ 4th 239 104 113 160 176 185 141
13B ms/tok @ 8th 240 85 99 97 108 117 147
13B bits/weight 16.0 5.0 6.0 5.0 5.5 6.0 9.0

ref: https://github.com/ggerganov/llama.cpp#quantization

RWKV quantization

Format Perplexity (169M) Latency, ms (1.5B) File size, GB (1.5B)
Q4_0 17.507 76 1.53
Q4_1 17.187 72 1.68
Q4_2 17.060 85 1.53
Q5_0 16.194 78 1.60
Q5_1 15.851 81 1.68
Q8_0 15.652 89 2.13
FP16 15.623 117 2.82
FP32 15.623 198 5.64

ref: ggerganov/ggml#89 (comment)

This feature is possible thanks to the many contributions in the llama.cpp project: https://github.com/users/ggerganov/projects/2

GPU support via cuBLAS

Using cuBLAS results mainly in improved Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation with modern NVIDIA GPU cards compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.

  • Implementation details: #834
  • Usage instructions: README

This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together.


This release remains in "beta" stage as I haven't verified that everything works as expected.


Full Changelog: v1.3.0...v1.4.0

v1.3.0

15 Apr 14:41
c23588c

Overview

This release should be considered to be in beta stage, since I haven't done a lot of testing and I am not sure whether something has been broken.
But overall, I believe both the performance and the quality are improved.

  • Added Core ML support #566 (see the build sketch below)
  • Restored decoding fallbacks with default size of 2 instead of 5 (f19e23f)
  • Pad the audio with zeros instead of the spectrogram (5108b30)
  • Added talk-llama example
  • Added whisper_state which allows parallel transcriptions with a single model in memory (#523)
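
A short sketch of trying the new Core ML path - commands as I recall them from the README's Core ML section, so treat the exact script and variable names as assumptions:

# generate the Core ML encoder model (requires the Python coremltools dependencies)
./models/generate-coreml-model.sh base.en

# rebuild with Core ML enabled and run as usual; the generated .mlmodelc is picked up automatically
make clean
WHISPER_COREML=1 make -j
./main -m models/ggml-base.en.bin -f samples/jfk.wav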

The C-style API has been extended significantly to support the new whisper_state, but in general it should be backwards compatible.
The only breaking change is in the callback signatures.
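
A minimal C sketch of the new whisper_state API - function names as they appear in whisper.h at the time of writing (audio loading and error checking omitted), so verify against the header:

#include "whisper.h"

// pcm0/pcm1 are two independent 16 kHz mono float PCM buffers
void transcribe_two_streams(const float * pcm0, int n0, const float * pcm1, int n1) {
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-base.en.bin");

    // each state owns the per-transcription buffers (KV cache, etc.),
    // so multiple states can share the same model weights
    struct whisper_state * st0 = whisper_init_state(ctx);
    struct whisper_state * st1 = whisper_init_state(ctx);

    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    // these calls can run concurrently, e.g. from two different threads
    whisper_full_with_state(ctx, st0, params, pcm0, n0);
    whisper_full_with_state(ctx, st1, params, pcm1, n1);

    whisper_free_state(st0);
    whisper_free_state(st1);
    whisper_free(ctx);
}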

Please provide feedback in the discussion if you observe any issues.

The next release, v1.4.0, will follow relatively soon and will provide 4-bit integer quantization support.


Full Changelog: v1.2.1...v1.3.0