Releases: ggerganov/whisper.cpp

v1.4.0

30 Apr 16:56
fa8dbdc

Overview

This is a new major release adding integer quantization and partial GPU (NVIDIA) support.

Integer quantization

This allows the ggml Whisper models to be converted from the default 16-bit floating point weights to 4-, 5- or 8-bit integer weights.
The resulting quantized models are smaller in disk size and memory usage and can be processed faster on some architectures. The transcription quality is degraded to some extent - not quantified at the moment.

  • Supported quantization modes: Q4_0, Q4_1, Q4_2, Q5_0, Q5_1, Q8_0
  • Implementation details: #540
  • Usage instructions: README
  • All WASM examples now support Q5 quantized models: https://whisper.ggerganov.com
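For convenience, the typical workflow from the README looks like this (model and audio paths are illustrative):

    make quantize
    ./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

    # run the examples as usual, specifying the quantized model file
    ./main -m models/ggml-base.en-q5_0.bin ./samples/gb0.wav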

Here is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results give an impression of the expected quality, size and speed of quantized Whisper models:

LLaMA quantization (measured on M1 Pro)

Model  Measure             F16     Q4_0    Q4_1    Q4_2    Q5_0    Q5_1    Q8_0
7B     perplexity          5.9565  6.2103  6.1286  6.1698  6.0139  5.9934  5.9571
7B     file size           13.0G   4.0G    4.8G    4.0G    4.4G    4.8G    7.1G
7B     ms/tok (4 threads)  128     56      61      84      91      95      75
7B     ms/tok (8 threads)  128     47      55      48      53      59      75
7B     bits/weight         16.0    5.0     6.0     5.0     5.5     6.0     9.0
13B    perplexity          5.2455  5.3748  5.3471  5.3433  5.2768  5.2582  5.2458
13B    file size           25.0G   7.6G    9.1G    7.6G    8.4G    9.1G    14G
13B    ms/tok (4 threads)  239     104     113     160     176     185     141
13B    ms/tok (8 threads)  240     85      99      97      108     117     147
13B    bits/weight         16.0    5.0     6.0     5.0     5.5     6.0     9.0

ref: https://github.com/ggerganov/llama.cpp#quantization

RWKV quantization

Format  Perplexity (169M)  Latency, ms (1.5B)  File size, GB (1.5B)
Q4_0    17.507             76                  1.53
Q4_1    17.187             72                  1.68
Q4_2    17.060             85                  1.53
Q5_0    16.194             78                  1.60
Q5_1    15.851             81                  1.68
Q8_0    15.652             89                  2.13
FP16    15.623             117                 2.82
FP32    15.623             198                 5.64

ref: ggerganov/ggml#89 (comment)

This feature is possible thanks to the many contributions in the llama.cpp project: https://github.com/users/ggerganov/projects/2

GPU support via cuBLAS

Using cuBLAS mainly improves the Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation with a modern NVIDIA GPU compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.

  • Implementation details: #834
  • Usage instructions: README
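As a quick sketch, building with cuBLAS enabled should look like this (assuming the CUDA toolkit is installed; see the README for the authoritative instructions):

    make clean
    WHISPER_CUBLAS=1 make -j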

This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together.


This release remains in "beta" stage as I haven't verified that everything works as expected.

What's Changed

New Contributors

Full Changelog: v1.3.0...v1.4.0

v1.3.0

15 Apr 14:41
c23588c

Overview

This release should be considered to be in beta stage, since I haven't done a lot of testing and cannot rule out that something got broken.
But overall, I believe both the performance and the quality are improved.

  • Added Core ML support #566 (see the build sketch after this list)
  • Restored decoding fallbacks with default size of 2 instead of 5 (f19e23f)
  • Pad the audio with zeros instead of padding the spectrogram (5108b30)
  • Added talk-llama example
  • Added whisper_state which allows parallel transcriptions with a single model in memory (#523)
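For the Core ML support, the rough workflow is to generate a Core ML model from an existing ggml model and rebuild with Core ML enabled. A sketch, assuming the helper script and build flag from the README (model name is illustrative):

    # generate a Core ML model (macOS only; requires the Python
    # dependencies listed in the README)
    ./models/generate-coreml-model.sh base.en

    # rebuild whisper.cpp with Core ML enabled
    make clean
    WHISPER_COREML=1 make -j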

The C-style API has been extended significantly to support the new whisper_state, but in general it should remain backwards compatible.
The only breaking change is in the callback signatures.
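Here is a minimal sketch of the new state API (error handling omitted; pcm / n_samples stand for 16 kHz mono float samples you have already loaded):

    #include <stdio.h>
    #include "whisper.h"

    // sketch: transcribe with an explicit whisper_state, so several threads
    // can share a single model in memory, each owning its own state
    void transcribe(struct whisper_context * ctx, const float * pcm, int n_samples) {
        struct whisper_state * state = whisper_init_state(ctx);

        struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

        if (whisper_full_with_state(ctx, state, wparams, pcm, n_samples) == 0) {
            const int n_segments = whisper_full_n_segments_from_state(state);
            for (int i = 0; i < n_segments; ++i) {
                printf("%s\n", whisper_full_get_segment_text_from_state(state, i));
            }
        }

        whisper_free_state(state); // the shared context is freed separately with whisper_free()
    }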

Please provide feedback in the discussion if you observe any issues.

The next release, v1.4.0, will follow relatively soon and will provide 4-bit integer quantization support.

What's Changed

New Contributors

Full Changelog: v1.2.1...v1.3.0

v1.2.1

28 Feb 20:30
ad13890

Overview

This is a minor release. The main reason for it is a fix for a critical bug that caused the software to crash randomly when the language auto-detect option (i.e. whisper_lang_auto_detect()) was used.
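For reference, a minimal sketch of calling the auto-detect API directly (assuming ctx is a loaded context on which whisper_pcm_to_mel() has already been run for the current audio; the thread count is arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include "whisper.h"

    // sketch: detect the spoken language and print the top candidate
    void print_detected_language(struct whisper_context * ctx) {
        const int n_lang = whisper_lang_max_id() + 1;
        float * probs = malloc(n_lang * sizeof(float));

        // offset 0 ms into the audio, 4 threads
        const int lang_id = whisper_lang_auto_detect(ctx, 0, 4, probs);
        if (lang_id >= 0) {
            printf("detected: %s (p = %.3f)\n", whisper_lang_str(lang_id), probs[lang_id]);
        }

        free(probs);
    }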

Other than that, the release includes a refactoring of the examples, Ruby bindings and some minor changes to the C API.

You can provide feedback in the existing v1.2.0 discussion.

What's Changed

Core ggml / whisper

  • whisper : add "split_on_word" flag when using the "max_len" option by @mightymatth in #455 and @boolemancer in #476
  • whisper : add whisper_full_lang_id() for getting the context lang by @kamranjon in #461
  • whisper : fixed Beam Search Strategy and exposed whisper_pcm_to_mel_phase_vocoder by @sandrohanea in #474
  • whisper : suppress non-speech-related token outputs by @shibukazu in #473
  • cmake : install whisper.h header by @aviks in #485
  • whisper : fix signedness compiler warning by @shikokuchuo in #506
  • whisper : disable non-speech token suppression by default #473
  • whisper : add API for applying custom logits filters during decoding 0d22916
  • whisper : fix uninitialized exp_n_audio_ctx by @finnvoor in #520

Bindings

  • bindings : add Ruby by @taf2 in #500
  • readme : add .NET repos (#303)
  • readme : add cython bindings (#9)
  • readme : add pybind11 bindings by @aarnphm in #538

Examples

  • ci : add node addon test and optimize compilation configuration by @chenqianhe in #468
  • yt-wsp.sh : add unique filename generation by @genevera in #495
  • examples : refactor in order to reuse code and reduce duplication by @ggerganov in #482
  • main : fix stdin pipe stream by @conradg in #503
  • make : add "-mcpu=native" when building for aarch64 (#532)

C-style API

  • Add whisper_pcm_to_mel_phase_vocoder()
  • Add *(whisper_logits_filter_callback)()
  • Change struct whisper_full_params
  • Add whisper_full_lang_id() (usage sketch below)
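A small usage sketch for the last of these (assuming whisper_full() has already completed on ctx, e.g. with params.language = "auto"):

    // query which language was used/detected during the last whisper_full() run
    const int lang_id = whisper_full_lang_id(ctx);
    printf("language: %s\n", whisper_lang_str(lang_id));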

New Contributors

Full Changelog: v1.2.0...v1.2.1

Highlights

Recently, I have been making progress on adding integer quantisation support to the ggml tensor library. This will eventually allow using quantised models, which require less memory and will hopefully run faster. I think the next major release v1.3.0 will officially add quantisation support. For now, you can keep track of the progress in #540.


v1.2.0

04 Feb 08:55
b2083c5

Overview

In this release we significantly reduce the memory usage during inference by introducing "scratch" buffers to ggml.

The new memory requirements per model are as follows:

Model   Disk    Mem (Old)  Mem (New)
tiny    75 MB   ~390 MB    ~125 MB
base    142 MB  ~500 MB    ~210 MB
small   466 MB  ~1.0 GB    ~600 MB
medium  1.5 GB  ~2.6 GB    ~1.7 GB
large   2.9 GB  ~4.7 GB    ~3.3 GB

The idea is simple: instead of creating a new memory buffer for each new tensor in the computation, we reuse the memory of old tensors that are no longer needed. The implementation is in PR #431. It's not very clean - I think there is a better way to do this, but for now it works.
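To illustrate the idea, here is a rough sketch of the pattern (the buffer sizes and the use_buf() helper are illustrative, not the exact code from #431):

    #include <stdint.h>
    #include "ggml.h"

    // two fixed scratch areas; intermediate tensors ping-pong between them,
    // so the memory of results that are no longer needed gets reused
    static uint8_t g_scratch[2][16*1024*1024];

    // route subsequent tensor allocations into scratch buffer i,
    // or back into the default context memory for i == -1
    static void use_buf(struct ggml_context * ctx, int i) {
        struct ggml_scratch s = { 0, 0, NULL };
        if (i >= 0) {
            s.size = sizeof(g_scratch[i]);
            s.data = g_scratch[i];
        }
        ggml_set_scratch(ctx, s);
    }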

Additionally, there might be some inference speed improvements on Apple Silicon in the Decoder part of the transformer. I haven't done proper benchmarks, but it seems there is about a 30% performance boost. The results are identical to v1.1.1.

What's Changed

Core ggml / whisper

  • whisper : PPC64 big-endian support by @fitzsim in #398
  • whisper : condition sampled timestamp tokens to be monotonically increasing by @ggerganov in #425
  • wasm : fix typo in helper.js by @bhbs in #459
  • ggml/whisper : reduce memory usage during inference by @ggerganov in #431

Bindings

  • ci : run workflows on pull requests + bindings depend on .h by @ggerganov in #446
  • go : added wrappers to reset and print timings by @glaslos in #436
  • go : add WhisperLangAutoDetect method to go binding by @RobinXL in #451
  • go : add wrapper for system info by @glaslos in #456
  • go : support "auto" as an option when set language by @polarmoon in #462

Examples

  • whisper.wasm : add labels for easier radio selection by @kokes in #435
  • livestream.sh : run main with model arg instead of default by @EricTendian in #453
  • main : CSV format export trimmed spaces fix by @alex-bacart in #444
  • addon.node : using whisper as a Node.js addon by @chenqianhe in #443

New Contributors

Full Changelog: v1.1.1...v1.2.0

Highlights

I'll use these release notes to write some random thoughts about the project - sort of a short blog post.

I'm really happy with how whisper.cpp has turned out so far. There is a very positive reception in the ML community - most people seem to be excited by the simplicity of the implementation and the fact that it is quite self-contained. I receive a lot of questions about the project and about various ideas that it can be applied to. I really enjoy it and I try to respond to everyone!

I also find it very satisfying that there are so many contributions already happening by so many people. To me this illustrates the power of open-source collaboration. The contributions not only improve the functionality and the quality of the code, but also help to generate various new ideas and approaches to explore.

Another interesting thing is that the project keeps on giving. Every time I start to think that now is a good time to put it in the background for a while and focus on other stuff, some new cool idea pops up and I can't help but start working on it. Having this custom implementation allows me to interact with the model on a lower level which opens some interesting ways to explore it.

So far the development has been focused on improving the performance, expanding the platform coverage and having robust decoding strategies with a variety of examples. During this time, several ideas have accumulated which I find interesting to explore (diarization, token-level timestamps, improved timestamp accuracy, etc.). I think I'll try to focus more on these in the future and see if I can achieve something interesting.



  • "The New Yorker" article featuring whisper.cpp

v1.1.1

23 Jan 18:41
2c3f50a

Overview

Since the v1.1.0 pre-release there have been several reports of improved transcription quality.
Together with my observations, I think we can declare version v1.1.1 as "stable".

There were actually a couple of bug-fixes implemented since v1.1.0, so make sure to update to v1.1.1 for optimal results.

Another update is that the prototype for v1.2.0 is almost ready: #431
Initial results indicate that the memory usage can be reduced by a factor of 2-3 for the smaller models.

You can provide feedback in the existing v1.1.0 discussion.

What's Changed

Core ggml / whisper

  • whisper : perform entropy check only when we have at least 32 tokens 1a91c19
  • whisper : fix condition for providing past prompt (critical) 78f1661

Bindings

  • go : remove sample_best and sample_timestamp bindings by @Trojan295 in #409

Examples

  • main : re-enable temperature fallback f583e2d
  • main : add an option to accept optional output filenames by @garychia in #424
  • whisper.android : use AssetManager for Android by @Digipom in #415
  • whisper.wasm : add small and small.en models 206fc93
  • bench : add memcpy and ggml_mul_mat benchmarks (experimental) 1290fc6

New Contributors

Full Changelog: v1.1.0...v1.1.1

v1.1.0

15 Jan 12:00
8738427
Pre-release

Overview

The major change in this pre-release is the improved decoding implementation in whisper.cpp:

  • Support for average logprob and entropy based criteria for fallback
  • Support for temperature T > 0
  • Improved Greedy decoder via best_of parameter for T > 0
  • Add beam search decoding (a.k.a. beam_size)

More information about the decoding changes can be found in #291.
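These options map to fields of struct whisper_full_params; a configuration sketch (the values are illustrative, not recommendations):

    // sketch: beam search decoding with logprob/entropy-based fallbacks
    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);

    wparams.beam_search.beam_size = 5;     // beam width
    wparams.greedy.best_of        = 5;     // candidates sampled when T > 0
    wparams.temperature_inc       = 0.2f;  // temperature step on each fallback
    wparams.entropy_thold         = 2.4f;  // fall back when token entropy is too low
    wparams.logprob_thold         = -1.0f; // fall back when avg. logprob is too low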
Additionally, there are a few performance improvements for Apple Silicon, WASM and non-F16C platforms.
Support for POWER9 architectures has been added.

The reason that this is a pre-release and not an official release is that the new implementation has not been sufficiently tested yet, and the existing bindings for other languages have not been updated to support the API changes. The official 1.1.x release will be created when there is enough feedback about the new decoding implementation and when the bindings have been updated, so make sure to send your feedback in the discussion created for this pre-release. For now, the 1.0.4 release should be considered more stable.

What's Changed

Core ggml / whisper

  • ggml : POWER9 support by @fitzsim in #320, #349, #369
  • ggml : simplify the SIMD code by @ggerganov in #324
  • ggml : add SSE3 and fp16 conversion lookup table by @abitofevrything in #368
  • ggml : utilise Accelerate's vDSP for some computations d51fc3e
  • ggml : speed-up softmax compute via Accelerate and loop unrolling d61d55c
  • ggml : do not start extra threads when using BLAS d347a59
  • whisper : do sample_to_timestamp calculation with 64 bit precision to avoid overflow by @boolemancer in #388
  • whisper : various code clean-up and improvements by @asmaloney in #317 #318 #319 #322 etc
  • whisper : improve decoding by @ggerganov in #291
  • whisper : account for speed_up flag for short audio #405

C-style API

  • Add loader class to allow loading from buffer and others by @prsyahmi in #353
  • Add whisper_token_data::plog
  • Add whisper_init_from_file()
  • Add whisper_init_from_buffer()
  • Change whisper_init()
  • Remove whisper_sample_best()
  • Remove whisper_sample_timestamp()
  • Add whisper_n_audio_ctx()
  • Add whisper_get_logits()
  • Remove whisper_get_probs()
  • Change struct whisper_full_params

Bindings

Examples

  • whisper.android : remove android ABI constraint by @Digipom in #301
  • whisper.swiftui : SwiftUI example by @Digipom in #308
  • main : add -ocsv, aka --output-csv for writing CSV file containing millisecond timestamps by @NielsMayer in #340
  • command : refactor to split command list & general transcription modes by @asmaloney in #331
  • command : always-prompt mode by @dnhkng in #383
  • stream : fix data race on bool + avoid division-by-zero a466c34
  • stream : fix a bug that inserted a lot of empty audio at the start a6dbd91
  • bench.wasm : print system info fafd789

New Contributors

Full Changelog: v1.0.4...v1.1.0


v1.0.4

17 Dec 18:34
1d716d6

What's Changed

Core ggml / whisper

  • Make ggml compatible with C99 9955fa4 | 0f11759
  • Fix UB causing asserts in Debug when reading the model vocabulary 124c718
  • Minor improvements in the Greedy decoding strategy 6a7c825
  • Add Windows build without OpenBLAS by @ggerganov in #282
  • Add whisper_tokenize() - basic text tokenization bf69b66
  • Language auto-detect option by @ggerganov in #286
  • Add AVX,AVX2 support for ggml_vec_scale_f32 by @katsu560 in #285
  • Implement extra cases for ggml_compute_forward_dup_f16() a7047b2
  • Added Roadmap and updated F.A.Q. discussion #126

C-style API

  • Add whisper_tokenize() (usage sketch below)
  • Add whisper_lang_max_id()
  • Add whisper_lang_str()
  • Add whisper_lang_auto_detect()
  • Add whisper_token_lang()
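A small sketch of the new tokenization API (the buffer size and prompt are arbitrary; ctx is a loaded context):

    #include <stdio.h>
    #include "whisper.h"

    // sketch: convert text into whisper tokens and print them
    void dump_tokens(struct whisper_context * ctx, const char * text) {
        whisper_token tokens[128]; // arbitrary upper bound for this sketch

        const int n = whisper_tokenize(ctx, text, tokens, 128);
        for (int i = 0; i < n; ++i) {
            printf("%d -> '%s'\n", tokens[i], whisper_token_to_str(ctx, tokens[i]));
        }
    }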

Examples

  • Improve prompting in "talk" example a613f16
  • Add "sliding window" mode to "stream" example b0f8013
  • Add Android sample by @Digipom in #277
  • Guided mode for the "command" example by @ggerganov in #271
  • Example "main" supports --prompt option b8065d9
  • Example "main" supports --print-progress option 32fbc8c
  • Example "main" supports --lang auto option fba10a4

New Contributors

Full Changelog: 1.0.3...1.0.4

Highlights


  • General-purpose, short voice command detection on Raspberry Pi 4 using examples/command:

    command-guided-0.mp4