Releases: ggerganov/whisper.cpp

v1.4.0

30 Apr 16:56
fa8dbdc

Overview

This is a new major release adding integer quantization and partial GPU (NVIDIA) support.

Integer quantization

This allows the ggml Whisper models to be converted from the default 16-bit floating point weights to 4-, 5- or 8-bit integer weights.
The resulting quantized models are smaller in disk size and memory usage and can be processed faster on some architectures. The transcription quality is degraded to some extent - not quantified at the moment.

  • Supported quantization modes: Q4_0, Q4_1, Q4_2, Q5_0, Q5_1, Q8_0
  • Implementation details: #540
  • Usage instructions: README
  • All WASM examples now support Q5 quantized models: https://whisper.ggerganov.com
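For convenience, the typical workflow from the README looks like this (model and audio paths are illustrative):

    make quantize
    ./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

    # run the examples as usual, specifying the quantized model file
    ./main -m models/ggml-base.en-q5_0.bin ./samples/gb0.wav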

Here is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results give an impression of the expected quality, size and speed of quantized Whisper models:

LLaMA quantization (measured on M1 Pro)

Model  Measure             F16     Q4_0    Q4_1    Q4_2    Q5_0    Q5_1    Q8_0
7B     perplexity          5.9565  6.2103  6.1286  6.1698  6.0139  5.9934  5.9571
7B     file size           13.0G   4.0G    4.8G    4.0G    4.4G    4.8G    7.1G
7B     ms/tok (4 threads)  128     56      61      84      91      95      75
7B     ms/tok (8 threads)  128     47      55      48      53      59      75
7B     bits/weight         16.0    5.0     6.0     5.0     5.5     6.0     9.0
13B    perplexity          5.2455  5.3748  5.3471  5.3433  5.2768  5.2582  5.2458
13B    file size           25.0G   7.6G    9.1G    7.6G    8.4G    9.1G    14G
13B    ms/tok (4 threads)  239     104     113     160     176     185     141
13B    ms/tok (8 threads)  240     85      99      97      108     117     147
13B    bits/weight         16.0    5.0     6.0     5.0     5.5     6.0     9.0

ref: https://github.com/ggerganov/llama.cpp#quantization

RWKV quantization

Format  Perplexity (169M)  Latency, ms (1.5B)  File size, GB (1.5B)
Q4_0    17.507             76                  1.53
Q4_1    17.187             72                  1.68
Q4_2    17.060             85                  1.53
Q5_0    16.194             78                  1.60
Q5_1    15.851             81                  1.68
Q8_0    15.652             89                  2.13
FP16    15.623             117                 2.82
FP32    15.623             198                 5.64

ref: ggerganov/ggml#89 (comment)

This feature is possible thanks to the many contributions in the llama.cpp project: https://github.com/users/ggerganov/projects/2

GPU support via cuBLAS

Using cuBLAS mainly improves the Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation with a modern NVIDIA GPU compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.

  • Implementation details: #834
  • Usage instructions: README
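As a quick sketch, building with cuBLAS enabled should look like this (assuming the CUDA toolkit is installed; see the README for the authoritative instructions):

    make clean
    WHISPER_CUBLAS=1 make -j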

This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together.


This release remains in "beta" stage as I haven't verified that everything works as expected.

What's Changed

New Contributors

Full Changelog: v1.3.0...v1.4.0

v1.3.0

15 Apr 14:41
c23588c

Overview

This release should be considered to be in beta stage, since I haven't done a lot of testing and cannot rule out that something got broken.
But overall, I believe both the performance and the quality are improved.

  • Added Core ML support #566 (see the build sketch after this list)
  • Restored decoding fallbacks with default size of 2 instead of 5 (f19e23f)
  • Pad the audio with zeros instead of padding the spectrogram (5108b30)
  • Added talk-llama example
  • Added whisper_state which allows parallel transcriptions with a single model in memory (#523)
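For the Core ML support, the rough workflow is to generate a Core ML model from an existing ggml model and rebuild with Core ML enabled. A sketch, assuming the helper script and build flag from the README (model name is illustrative):

    # generate a Core ML model (macOS only; requires the Python
    # dependencies listed in the README)
    ./models/generate-coreml-model.sh base.en

    # rebuild whisper.cpp with Core ML enabled
    make clean
    WHISPER_COREML=1 make -j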

The C-style API has been extended significantly to support the new whisper_state, but in general it should remain backwards compatible.
The only breaking change is in the callback signatures.
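Here is a minimal sketch of the new state API (error handling omitted; pcm / n_samples stand for 16 kHz mono float samples you have already loaded):

    #include <stdio.h>
    #include "whisper.h"

    // sketch: transcribe with an explicit whisper_state, so several threads
    // can share a single model in memory, each owning its own state
    void transcribe(struct whisper_context * ctx, const float * pcm, int n_samples) {
        struct whisper_state * state = whisper_init_state(ctx);

        struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

        if (whisper_full_with_state(ctx, state, wparams, pcm, n_samples) == 0) {
            const int n_segments = whisper_full_n_segments_from_state(state);
            for (int i = 0; i < n_segments; ++i) {
                printf("%s\n", whisper_full_get_segment_text_from_state(state, i));
            }
        }

        whisper_free_state(state); // the shared context is freed separately with whisper_free()
    }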

Please provide feedback in the discussion if you observe any issues.

The next release, v1.4.0, will follow relatively soon and will provide 4-bit integer quantization support.

What's Changed

New Contributors

Full Changelog: v1.2.1...v1.3.0

v1.2.1

28 Feb 20:30
ad13890

Overview

This is a minor release. The main reason for it is a fix for a critical bug that caused the software to crash randomly when the language auto-detect option (i.e. whisper_lang_auto_detect()) was used.
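For reference, a minimal sketch of calling the auto-detect API directly (assuming ctx is a loaded context on which whisper_pcm_to_mel() has already been run for the current audio; the thread count is arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include "whisper.h"

    // sketch: detect the spoken language and print the top candidate
    void print_detected_language(struct whisper_context * ctx) {
        const int n_lang = whisper_lang_max_id() + 1;
        float * probs = malloc(n_lang * sizeof(float));

        // offset 0 ms into the audio, 4 threads
        const int lang_id = whisper_lang_auto_detect(ctx, 0, 4, probs);
        if (lang_id >= 0) {
            printf("detected: %s (p = %.3f)\n", whisper_lang_str(lang_id), probs[lang_id]);
        }

        free(probs);
    }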

Other than that, the release includes a refactoring of the examples, Ruby bindings and some minor changes to the C API.

You can provide feedback in the existing v1.2.0 discussion.

What's Changed

Core ggml / whisper

  • whisper : add "split_on_word" flag when using the "max_len" option by @mightymatth in #455 and @boolemancer in #476
  • whisper : add whisper_full_lang_id() for getting the context lang by @kamranjon in #461
  • whisper : fixed Beam Search Strategy and exposed whisper_pcm_to_mel_phase_vocoder by @sandrohanea in #474
  • whisper : suppress non-speech-related token outputs by @shibukazu in #473
  • cmake : install whisper.h header by @aviks in #485
  • whisper : fix signedness compiler warning by @shikokuchuo in #506
  • whisper : disable non-speech token suppression by default #473
  • whisper : add API for applying custom logits filters during decoding 0d22916
  • whisper : fix uninitialized exp_n_audio_ctx by @finnvoor in #520

Bindings

  • bindings : add Ruby by @taf2 in #500
  • readme : add .NET repos (#303)
  • readme : add cython bindings (#9)
  • readme : add pybind11 bindings by @aarnphm in #538

Examples

  • ci : add node addon test and optimize compilation configuration by @chenqianhe in #468
  • yt-wsp.sh : add unique filename generation by @genevera in #495
  • examples : refactor in order to reuse code and reduce duplication by @ggerganov in #482
  • main : fix stdin pipe stream by @conradg in #503
  • make : add "-mcpu=native" when building for aarch64 (#532)

C-style API

  • Add whisper_pcm_to_mel_phase_vocoder()
  • Add *(whisper_logits_filter_callback)()
  • Change struct whisper_full_params
  • Add whisper_full_lang_id() (usage sketch below)
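A small usage sketch for the last of these (assuming whisper_full() has already completed on ctx, e.g. with params.language = "auto"):

    // query which language was used/detected during the last whisper_full() run
    const int lang_id = whisper_full_lang_id(ctx);
    printf("language: %s\n", whisper_lang_str(lang_id));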

New Contributors

Full Changelog: v1.2.0...v1.2.1

Highlights

Recently, I have been making progress on adding integer quantisation support to the ggml tensor library. This will eventually allow using quantised models, which require less memory and will hopefully run faster. I think the next major release v1.3.0 will officially add quantisation support. For now, you can keep track of the progress in #540.


v1.2.0

04 Feb 08:55
b2083c5

Overview

In this release we significantly reduce the memory usage during inference by introducing "scratch" buffers to ggml.

The new memory requirements per model are as follows:

Model   Disk    Mem (Old)  Mem (New)
tiny    75 MB   ~390 MB    ~125 MB
base    142 MB  ~500 MB    ~210 MB
small   466 MB  ~1.0 GB    ~600 MB
medium  1.5 GB  ~2.6 GB    ~1.7 GB
large   2.9 GB  ~4.7 GB    ~3.3 GB

The idea is simple: instead of creating a new memory buffer for each new tensor in the computation, we reuse the memory of old tensors that are no longer needed. The implementation is in PR #431. It's not very clean - I think there is a better way to do this, but for now it works.
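To illustrate the idea, here is a rough sketch of the pattern (the buffer sizes and the use_buf() helper are illustrative, not the exact code from #431):

    #include <stdint.h>
    #include "ggml.h"

    // two fixed scratch areas; intermediate tensors ping-pong between them,
    // so the memory of results that are no longer needed gets reused
    static uint8_t g_scratch[2][16*1024*1024];

    // route subsequent tensor allocations into scratch buffer i,
    // or back into the default context memory for i == -1
    static void use_buf(struct ggml_context * ctx, int i) {
        struct ggml_scratch s = { 0, 0, NULL };
        if (i >= 0) {
            s.size = sizeof(g_scratch[i]);
            s.data = g_scratch[i];
        }
        ggml_set_scratch(ctx, s);
    }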

Additionally, there might be some inference speed improvements on Apple Silicon in the Decoder part of the transformer. I haven't done proper benchmarks, but it seems there is about a 30% performance boost. The results are identical to v1.1.1.

What's Changed

Core ggml / whisper

  • whisper : PPC64 big-endian support by @fitzsim in #398
  • whisper : condition sampled timestamp tokens to be monotonically increasing by @ggerganov in #425
  • wasm : fix typo in helper.js by @bhbs in #459
  • ggml/whisper : reduce memory usage during inference by @ggerganov in #431

Bindings

  • ci : run workflows on pull requests + bindings depend on .h by @ggerganov in #446
  • go : added wrappers to reset and print timings by @glaslos in #436
  • go : add WhisperLangAutoDetect method to go binding by @RobinXL in #451
  • go : add wrapper for system info by @glaslos in #456
  • go : support "auto" as an option when set language by @polarmoon in #462

Examples

  • whisper.wasm : add labels for easier radio selection by @kokes in #435
  • livestream.sh : run main with model arg instead of default by @EricTendian in #453
  • main : CSV format export trimmed spaces fix by @alex-bacart in #444
  • addon.node : using whisper as a Node.js addon by @chenqianhe in #443

New Contributors

Full Changelog: v1.1.1...v1.2.0

Highlights

I'll use these release notes to write some random thoughts about the project - sort of a short blog post.

I'm really happy with how whisper.cpp has turned out so far. There is a very positive reception in the ML community - most people seem to be excited by the simplicity of the implementation and the fact that it is quite self-contained. I receive a lot of questions about the project and about various ideas that it can be applied to. I really enjoy it and I try to respond to everyone!

I also find it very satisfying that there are so many contributions already happening by so many people. To me this illustrates the power of open-source collaboration. The contributions not only improve the functionality and the quality of the code, but also help to generate various new ideas and approaches to explore.

Another interesting thing is that the project keeps on giving. Every time I start to think that now is a good time to put it in the background for a while and focus on other stuff, some new cool idea pops up and I can't help but start working on it. Having this custom implementation allows me to interact with the model on a lower level which opens some interesting ways to explore it.

So far the development has been focused on improving the performance, expanding the platform coverage and having robust decoding strategies with a variety of examples. During this time, several ideas have accumulated which I find interesting to explore (diarization, token-level timestamps, improved timestamp accuracy, etc.). I think I'll try to focus more on these in the future and see if I can achieve something interesting.



  • "The New Yorker" article featuring whisper.cpp

v1.1.1

23 Jan 18:41
2c3f50a

Overview

Since the v1.1.0 pre-release there have been several reports of improved transcription quality.
Together with my observations, I think we can declare version v1.1.1 as "stable".

There were actually a couple of bug-fixes implemented since v1.1.0, so make sure to update to v1.1.1 for optimal results.

Another update is that the prototype for v1.2.0 is almost ready: #431
Initial results indicate that the memory usage can be reduced by a factor of 2-3 for the smaller models.

You can provide feedback in the existing v1.1.0 discussion.

What's Changed

Core ggml / whisper

  • whisper : perform entropy check only when we have at least 32 tokens 1a91c19
  • whisper : fix condition for providing past prompt (critical) 78f1661

Bindings

  • go : remove sample_best and sample_timestamp bindings by @Trojan295 in #409

Examples

  • main : re-enable temperature fallback f583e2d
  • main : add an option to accept optional output filenames by @garychia in #424
  • whisper.android : use AssetManager for Android by @Digipom in #415
  • whisper.wasm : add small and small.en models 206fc93
  • bench : add memcpy and ggml_mul_mat benchmarks (experimental) 1290fc6

New Contributors

Full Changelog: v1.1.0...v1.1.1

v1.1.0

15 Jan 12:00
8738427
Pre-release

Overview

The major change in this pre-release is the improved decoding implementation in whisper.cpp:

  • Support for average logprob and entropy based criteria for fallback
  • Support for temperature T > 0
  • Improved Greedy decoder via best_of parameter for T > 0
  • Add beam search decoding (a.k.a. beam_size)

More information about the decoding changes can be found in #291.
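These options map to fields of struct whisper_full_params; a configuration sketch (the values are illustrative, not recommendations):

    // sketch: beam search decoding with logprob/entropy-based fallbacks
    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);

    wparams.beam_search.beam_size = 5;     // beam width
    wparams.greedy.best_of        = 5;     // candidates sampled when T > 0
    wparams.temperature_inc       = 0.2f;  // temperature step on each fallback
    wparams.entropy_thold         = 2.4f;  // fall back when token entropy is too low
    wparams.logprob_thold         = -1.0f; // fall back when avg. logprob is too low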
Additionally, there are a few performance improvements for Apple Silicon, WASM and non-F16C platforms.
Support for POWER9 architectures has been added.

The reason that this is a pre-release and not an official release is that the new implementation has not been sufficiently tested yet, and the existing bindings for other languages have not been updated to support the API changes. The official 1.1.x release will be created when there is enough feedback about the new decoding implementation and when the bindings have been updated, so make sure to send your feedback in the discussion created for this pre-release. For now, the 1.0.4 release should be considered more stable.

What's Changed

Core ggml / whisper

  • ggml : POWER9 support by @fitzsim in #320, #349, #369
  • ggml : simplify the SIMD code by @ggerganov in #324
  • ggml : add SSE3 and fp16 conversion lookup table by @abitofevrything in #368
  • ggml : utilise Accelerate's vDSP for some computations d51fc3e
  • ggml : speed-up softmax compute via Accelerate and loop unrolling d61d55c
  • ggml : do not start extra threads when using BLAS d347a59
  • whisper : do sample_to_timestamp calculation with 64 bit precision to avoid overflow by @boolemancer in #388
  • whisper : various code clean-up and improvements by @asmaloney in #317 #318 #319 #322 etc
  • whisper : improve decoding by @ggerganov in #291
  • whisper : account for speed_up flag for short audio #405

C-style API

  • Add loader class to allow loading from buffer and others by @prsyahmi in #353
  • Add whisper_token_data::plog
  • Add whisper_init_from_file()
  • Add whisper_init_from_buffer()
  • Change whisper_init()
  • Remove whisper_sample_best()
  • Remove whisper_sample_timestamp()
  • Add whisper_n_audio_ctx()
  • Add whisper_get_logits()
  • Remove whisper_get_probs()
  • Change struct whisper_full_params

Bindings

Examples

  • whisper.android : remove android ABI constraint by @Digipom in #301
  • whisper.swiftui : SwiftUI example by @Digipom in #308
  • main : add -ocsv, aka --output-csv for writing CSV file containing millisecond timestamps by @NielsMayer in #340
  • command : refactor to split command list & general transcription modes by @asmaloney in #331
  • command : always-prompt mode by @dnhkng in #383
  • stream : fix data race on bool + avoid division-by-zero a466c34
  • stream : fix a bug that inserted a lot of empty audio at the start a6dbd91
  • bench.wasm : print system info fafd789

New Contributors

Full Changelog: v1.0.4...v1.1.0


v1.0.4

17 Dec 18:34
1d716d6

What's Changed

Core ggml / whisper

  • Make ggml compatible with C99 9955fa4 | 0f11759
  • Fix UB causing asserts in Debug when reading the model vocabulary 124c718
  • Minor improvements in the Greedy decoding strategy 6a7c825
  • Add Windows build without OpenBLAS by @ggerganov in #282
  • Add whisper_tokenize() - basic text tokenization bf69b66
  • Language auto-detect option by @ggerganov in #286
  • Add AVX,AVX2 support for ggml_vec_scale_f32 by @katsu560 in #285
  • Implement extra cases for ggml_compute_forward_dup_f16() a7047b2
  • Added Roadmap and updated F.A.Q. discussion #126

C-style API

  • Add whisper_tokenize() (usage sketch below)
  • Add whisper_lang_max_id()
  • Add whisper_lang_str()
  • Add whisper_lang_auto_detect()
  • Add whisper_token_lang()
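A small sketch of the new tokenization API (the buffer size and prompt are arbitrary; ctx is a loaded context):

    #include <stdio.h>
    #include "whisper.h"

    // sketch: convert text into whisper tokens and print them
    void dump_tokens(struct whisper_context * ctx, const char * text) {
        whisper_token tokens[128]; // arbitrary upper bound for this sketch

        const int n = whisper_tokenize(ctx, text, tokens, 128);
        for (int i = 0; i < n; ++i) {
            printf("%d -> '%s'\n", tokens[i], whisper_token_to_str(ctx, tokens[i]));
        }
    }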

Examples

  • Improve prompting in "talk" example a613f16
  • Add "sliding window" mode to "stream" example b0f8013
  • Add Android sample by @Digipom in #277
  • Guided mode for the "command" example by @ggerganov in #271
  • Example "main" supports --prompt option b8065d9
  • Example "main" supports --print-progress option 32fbc8c
  • Example "main" supports --lang auto option fba10a4

New Contributors

Full Changelog: 1.0.3...1.0.4

Highlights


  • General-purpose, short voice command detection on Raspberry Pi 4 using examples/command:

    command-guided-0.mp4