Audio Overview

Recording and playback of audio are common operations in situated interactive applications. For instance, audio input may be required for speech recognition, or to generate acoustic features for use with acoustic models. Applications may also need to generate and produce audio output to communicate with users. The Microsoft.Psi.Audio namespace provides components and operators for capturing, processing and rendering audio, as well as for generating a range of acoustic features.

Please note: audio capture and playback are supported on both Windows and Linux. Audio resampling is currently only available on Windows.

Basic Components

Basic audio capabilities are provided by the following components in the Microsoft.Psi.Audio namespace:

  • AudioCapture - Captures audio from an audio recording device.
  • AudioPlayer - Plays back audio on an audio playback device.
  • AudioResampler - Resamples an audio stream (Windows only).
  • WaveFileAudioSource - Reads audio from a wave file.
  • WaveStreamSampleSource - Reads audio in WAVE format from a System.IO.Stream and emits it when triggered by a boolean input signal (see the sketch after this list).
  • WaveFileWriter - Writes an audio stream to a wave file.
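
The WaveStreamSampleSource component can be used, for instance, to play a pre-loaded sound in response to an event. The following is a minimal sketch: the chime.wav file name and the five-second trigger interval are illustrative, and the component is assumed to emit its buffered audio each time it receives a true value on its input.

var player = new AudioPlayer(pipeline);
var sampleSource = new WaveStreamSampleSource(pipeline, File.OpenRead("chime.wav"));

// Emit true every 5 seconds, triggering playback of the buffered sample
var trigger = Generators.Repeat(pipeline, true, TimeSpan.FromSeconds(5));
trigger.PipeTo(sampleSource);
sampleSource.PipeTo(player);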

In Platform for Situated Intelligence, audio is generally handled and passed between components via streams of type AudioBuffer. An AudioBuffer contains a single buffer of raw audio data along with its associated format information in a WaveFormat or WaveFormatEx object.
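
For example, a stream of AudioBuffer messages can be inspected with standard \psi stream operators. A minimal sketch, assuming the raw bytes are exposed via the buffer's Data property and the format via its Format property:

var source = new AudioCapture(pipeline);

// Log the size (in bytes) and sampling rate of each incoming audio buffer
source.Do(buffer => Console.WriteLine($"{buffer.Data.Length} bytes at {buffer.Format.SamplesPerSec} Hz"));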

Common Patterns of Usage

The following are some examples of how to use the basic audio components.

Capturing and playing back audio

The following code will capture audio from the default audio recording device on Windows and echo it to the default audio playback device.

using (var pipeline = Pipeline.Create())
{
    var source = new AudioCapture(pipeline);
    var player = new AudioPlayer(pipeline);
    source.PipeTo(player);
    pipeline.Run();
}

Individual audio devices for capture and playback may be specified in an AudioCaptureConfiguration or AudioPlayerConfiguration object, which may optionally be supplied when constructing an AudioCapture or AudioPlayer as shown in the following code:

var source = new AudioCapture(
    pipeline,
    new AudioCaptureConfiguration()
    {
        DeviceName = "Headset Microphone (USB)"
    });

var player = new AudioPlayer(
    pipeline,
    new AudioPlayerConfiguration()
    {
        DeviceName = "Remote Audio"
    });

The previous examples assume that the default capture and playback formats (sampling rate, number of channels, etc.) are identical. The audio format may be explicitly specified by supplying a WaveFormat value in the configuration object, as shown in the following example:

var format = WaveFormat.Create16kHz1Channel16BitPcm();
var source = new AudioCapture(
    pipeline,
    new AudioCaptureConfiguration()
    { 
        OutputFormat = format
    });

var player = new AudioPlayer(
    pipeline, 
    new AudioPlayerConfiguration() 
    { 
        InputFormat = format 
    });

Capturing audio from a file and resampling

The WaveFileAudioSource component enables audio from a Wave file to be consumed as a \psi stream. In the following example, a Wave file is used to generate an audio stream, which is then resampled to a different format using the AudioResampler component. Resampling is necessary in situations where the original audio format is not compatible with the format required by a downstream component that consumes the audio (for example, a speech recognizer). In the example, the resampled audio is simply sent to an AudioPlayer component for playback.

var source = new WaveFileAudioSource(pipeline, "recording.wav");
var player = new AudioPlayer(pipeline);
var resampler = new AudioResampler(
    pipeline,
    new AudioResamplerConfiguration()
    {
        OutputFormat = WaveFormat.Create16BitPcm(8000, 1)
    });

source.PipeTo(resampler);
resampler.PipeTo(player);
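
To persist the resampled audio to disk instead of (or in addition to) playing it back, the stream may be piped to a WaveFileWriter. A minimal sketch, assuming a constructor that takes the output file name (output.wav here is illustrative):

var writer = new WaveFileWriter(pipeline, "output.wav");
resampler.PipeTo(writer);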

Acoustic Feature Operators

The following operators are provided to manipulate raw audio samples and to compute acoustic features.

  • Dither - Applies a random dither to the samples in an audio frame.
  • FFT - Computes the Fast Fourier Transform of an audio frame.
  • FFTPower - Computes the power spectral density from the FFT.
  • FrameShift - Segments a stream of audio samples into (potentially overlapping) fixed-length frames.
  • FrequencyDomainEnergy - Computes the energy within a frequency band.
  • HanningWindow - Applies a Hanning window to an audio frame.
  • LogEnergy - Computes the log energy of an audio frame.
  • SpectralEntropy - Computes the spectral entropy within a frequency band.
  • ToFloat - Converts raw audio samples to floating-point sample values.
  • ZeroCrossingRate - Computes the zero-crossing rate of an audio frame.

While any one of these operators may be used individually, they are usually produced collectively using the AcousticFeaturesExtractor component, which aggregates a set of commonly used acoustic feature streams into a single component. Configuration parameters specified in the AcousticFeaturesExtractorConfiguration object determine which features to generate. Note that the audio input to this component is assumed to be 1-channel, 16-bit PCM audio. Ensure that this format is specified in the AudioCaptureConfiguration when using the AudioCapture component to capture live audio as input, or use the AudioResampler component to convert the audio stream to the required format.
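
The following sketch wires live audio capture to the extractor and consumes one of its feature streams. The default configuration and the LogEnergy emitter name are assumptions modeled on the operator names above; consult the component's API for the exact surface:

var source = new AudioCapture(
    pipeline,
    new AudioCaptureConfiguration()
    {
        OutputFormat = WaveFormat.Create16kHz1Channel16BitPcm()
    });

var features = new AcousticFeaturesExtractor(pipeline);
source.PipeTo(features);

// Consume one of the generated acoustic feature streams
features.LogEnergy.Do(e => Console.WriteLine($"log energy: {e}"));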

Troubleshooting Audio on Linux

The Linux AudioCapture and AudioPlayer \psi components are built on the Advanced Linux Sound Architecture (ALSA) library APIs and depend on the asound shared object. This comes pre-installed with many Linux distributions, but if you receive an error such as "Unable to load shared library 'asound'", then you may need to install it:

apt install libasound2-dev

To test your audio hardware outside of \psi, you may record and play back audio with the arecord and aplay command-line utilities. For example, to record and play back a 10-second test clip:

arecord -f S16_LE -d 10 -r 16000 -D hw:1,0 test.wav
aplay test.wav

You can list available capture (arecord -L) and playback (aplay -L) hardware to determine device names. You may want to experiment with sample rates and formats to ensure that your settings are correct.

Once in \psi, the AudioCapture and AudioPlayer components each take configuration details at construction time, including the DeviceName (default "plughw:0,0") and Format (default 16 kHz, 1-channel, 16-bit PCM).

var audioInput = new AudioCapture(
    pipeline,
    new AudioCaptureConfiguration()
    {
        DeviceName = "plughw:0,0",
        Format = WaveFormat.Create16kHz1Channel16BitPcm(),
    });

var audioOutput = new AudioPlayer(
    pipeline,
    new AudioPlayerConfiguration()
    {
        DeviceName = "plughw:0,0",
        Format = WaveFormat.Create16kHz1Channel16BitPcm(),
    });
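
As in the Windows example above, these components can then be connected to echo captured audio back to the playback device:

audioInput.PipeTo(audioOutput);
pipeline.Run();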