bamert/stm32_speech_commands

Efficient Keyword Spotting for Embedded Systems

This repository demonstrates an efficient keyword spotting system tailored for STM32L4 microcontrollers, balancing accuracy and speed for real-time audio processing in resource-constrained embedded systems.

Deployed on an STM32L475 (128KB SRAM, 80MHz), it recognizes 35 different keywords and achieves a post-rejection accuracy of 96% at an inference latency of about 190ms, which is fast enough for streaming applications.

Demo: For reference, the model can be tested in the browser here.

Model Specifications

  • Utilizes a modified M5 model, processing raw waveforms (no spectrogram).
  • Dataset: Recognizes 35 keywords from the Speech Commands dataset.
  • Audio sampling rate: 8kHz, 1 sec frames.
  • Inference Time (Cortex M4): ~ 190ms at 80MHz.
  • Inference Time (Browser): ~ 1-5ms depending on device.
  • Memory Usage (Cortex M4): Consumes about 60KB RAM.

Repository Structure

  • model_training: Contains PyTorch Lightning training code.
  • browser_inference: Includes browser-based demo inference code. Try it here.
  • stm32_inference: Features an STM32-specific inference engine with firmware for the B-L475-IOT01A board.

Getting Started

  • The Python requirements are managed with Poetry. Install them with: cd model_training && poetry install
  • The STM32 code requires the Arm GCC toolchain (arm-none-eabi-gcc). Build the code with: cd stm32_inference && make
    • A firmware binary is available at stm32_inference/build/speechmodel_code.bin.
  • A no-frills browser inference engine is included in browser_inference/browser_demo_inference.html.

Model accuracy / inference time tradeoff

Model        val acc. [%]   pr val acc. [%] (% rejected)   STM32 inference time [ms]   MFLOP   kParams
M5-c32-k80   86.6           96.9 (23.1)                    603                         3.8     166
M5-c16-k80   81.7           96.3 (37.4)                    -                           -       -
M5-c32-k40   87.6           97.2 (23.0)                    595                         2.4     99
M5-c32-k20   86.2           96.6 (23.8)                    246                         1.8     98
M5-c32-k10   84.5           96.5 (28.4)                    180                         1.6     97

The above table shows some of the model configurations that were tried. The first row shows the original configuration of the M5 model by Dai et al.

The STM32 inference engine acquires overlapping audio frames of 1 second length (8kHz; 8000 samples) and runs inference on a new frame every 250ms. The overlap ensures that longer keywords ("visual", "marvin", ...) are more likely to be fully contained in one of the frames rather than being cut in half. To sustain 4 inferences per second, the inference time of the model has to stay under 250ms.
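The framing scheme described above can be illustrated with a small Python sketch. It is not the firmware's implementation (that lives in stm32_inference); the names overlapping_frames, FRAME_SAMPLES and HOP_SAMPLES are made up for this example, and the only values taken from the text are the 8kHz sample rate, the 1 second frame and the 250ms hop.

```python
from collections import deque
from typing import Iterable, Iterator, List

SAMPLE_RATE_HZ = 8000                 # from the text: 8kHz audio
FRAME_SAMPLES = SAMPLE_RATE_HZ        # 1 second frame -> 8000 samples
HOP_SAMPLES = SAMPLE_RATE_HZ // 4     # new frame every 250ms -> 2000 samples


def overlapping_frames(
    samples: Iterable[int],
    frame_len: int = FRAME_SAMPLES,
    hop: int = HOP_SAMPLES,
) -> Iterator[List[int]]:
    """Yield the most recent `frame_len` samples every `hop` samples.

    Hypothetical helper for illustration only. A deque with a fixed
    maximum length acts as a sliding window over the sample stream.
    """
    window = deque(maxlen=frame_len)
    since_last_frame = 0
    for sample in samples:
        window.append(sample)
        since_last_frame += 1
        # Emit a frame once the window is full and a hop has elapsed.
        if len(window) == frame_len and since_last_frame >= hop:
            since_last_frame = 0
            yield list(window)


# Two seconds of dummy audio: each sample value equals its own index.
frames = list(overlapping_frames(range(2 * SAMPLE_RATE_HZ)))
overlap = FRAME_SAMPLES - HOP_SAMPLES  # samples shared by consecutive frames
```

With these numbers, consecutive frames share 6000 of their 8000 samples, so a keyword shorter than 750ms always fits entirely inside at least one frame. The model then has one hop period (250ms) to finish each inference before the next frame is due.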

Experiments with a smaller kernel length for the initial 1D convolution showed that reasonable performance can also be reached with a much smaller kernel of k=10. The accuracy on the validation split with this model is 84.5%.

For keyword spotting applications, missing an unclear keyword is more acceptable than making a false positive prediction. For this reason we use the distance between the highest and second-highest class probabilities as a proxy for the confidence of the prediction, and only make a prediction if this distance exceeds 75%. With this additional criterion to avoid false positives, all models reach a post-rejection accuracy in excess of 96% on the non-rejected validation samples (pr val acc).
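The rejection criterion can be sketched in a few lines of Python. This is an illustration rather than the repository's code: the function names are hypothetical, and it assumes the class probabilities are softmax outputs that sum to 1. Only the 75% margin threshold comes from the text.

```python
from typing import List, Optional, Sequence, Tuple

MARGIN_THRESHOLD = 0.75  # from the text: top-1 minus top-2 must exceed 75%


def predict_or_reject(
    probs: Sequence[float], threshold: float = MARGIN_THRESHOLD
) -> Optional[int]:
    """Return the top class index, or None if the prediction is rejected.

    Hypothetical helper; `probs` is assumed to hold one probability
    per class for a single audio frame.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top, second = ranked[0], ranked[1]
    if probs[top] - probs[second] > threshold:
        return top
    return None


def post_rejection_accuracy(
    prob_rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    threshold: float = MARGIN_THRESHOLD,
) -> Tuple[Optional[float], float]:
    """Return (accuracy on accepted samples, fraction of samples rejected).

    The accuracy is None when every sample is rejected.
    """
    accepted: List[bool] = []
    for probs, label in zip(prob_rows, labels):
        prediction = predict_or_reject(probs, threshold)
        if prediction is not None:
            accepted.append(prediction == label)
    rejected_fraction = 1.0 - len(accepted) / len(labels)
    accuracy = sum(accepted) / len(accepted) if accepted else None
    return accuracy, rejected_fraction


# Made-up probabilities for three frames of a 3-class toy problem.
toy_probs = [[0.90, 0.05, 0.05], [0.60, 0.30, 0.10], [0.02, 0.95, 0.03]]
toy_labels = [0, 0, 2]
toy_accuracy, toy_rejected = post_rejection_accuracy(toy_probs, toy_labels)
```

In the toy data the second frame is rejected because its margin (0.30) is below the threshold, and accuracy is computed only over the two frames that remain. This is the same trade-off the table reports: a higher rejection rate in exchange for higher accuracy on the accepted samples.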

The model used in the STM32 and browser inference engines above is M5-c32-k10.
