
Speech and Language Overview


Situated interactive applications often need to process speech when communicating with users through natural language. The Microsoft.Psi.Speech and Microsoft.Psi.Language namespaces provide components and data types for basic speech recognition, voice activity detection and text-to-speech synthesis (Windows only).

Speech Recognition Components

The SystemSpeechRecognizer component

The SystemSpeechRecognizer component performs continuous recognition on an audio stream. Recognition results are of type SpeechRecognitionResult and implement the IStreamingSpeechRecognitionResult interface. In general, this pattern allows for results from speech recognition components based on different underlying technologies to conform to a common interface for consumption by downstream components. Final speech recognition results are posted on the Out stream while partial recognition results are posted on the PartialRecognitionResults stream. Partial results contain partial hypotheses while speech is in progress and are useful for displaying hypothesized text as feedback to the user. The final result is emitted once the recognizer has determined that speech has ended, and will contain the top hypothesis for the utterance.

The following example shows how to perform speech recognition on an audio stream. Note that by default the speech recognizer expects a 16 kHz, 1-channel, 16-bit PCM audio stream. If the format of the audio source is different, either specify the correct format in the AudioCaptureConfiguration.OutputFormat configuration parameter (as shown), apply resampling to the audio stream using the Resample audio operator, or set the SystemSpeechRecognizerConfiguration.InputFormat configuration parameter to match the audio source format. However, not all input audio formats are supported.

using (var pipeline = Pipeline.Create())
{
    // Capture audio from the default recording device in the correct format
    var audio = new AudioCapture(
        pipeline, 
        new AudioCaptureConfiguration()
        {
            OutputFormat = WaveFormat.Create16kHz1Channel16BitPcm()
        });

    // Create a new speech recognizer component
    var recognizer = new SystemSpeechRecognizer(pipeline);

    // Send the audio to the recognizer
    audio.PipeTo(recognizer);

    // Print partial recognition results - use the Do operator to print
    var partialResults = recognizer.PartialRecognitionResults
        .Do(result => Console.WriteLine(result.Text));

    // Print final recognition results
    var finalResults = recognizer.Out
        .Do(result => Console.WriteLine(result.Text));

    // Run the pipeline
    pipeline.Run();
}
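
If the audio source cannot be configured to produce this format directly, the Resample audio operator mentioned above may be used instead. The following is a minimal sketch, assuming a Resample overload that accepts a target WaveFormat:

// Sketch: resample audio captured in a different format to the 16 kHz,
// 1-channel, 16-bit PCM format expected by the recognizer
// (assumes a Resample overload taking a WaveFormat)
var resampled = audio.Resample(WaveFormat.Create16kHz1Channel16BitPcm());
resampled.PipeTo(recognizer);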

Speech recognition results

The speech recognizer components generate recognition results which are represented by a SpeechRecognitionResult object. Results may be either partial or final (as indicated by the IsFinal property). Each result object contains one or more Alternates, each representing a single hypothesis. The top hypothesis may be accessed directly via the Text property. In addition, the raw audio associated with the recognition result is stored in the Audio property.
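For example, a downstream consumer may inspect these properties directly. The following minimal sketch (assuming the recognizer from the earlier example, and assuming each alternate exposes its own Text property) prints the top hypothesis along with its alternates:

// Sketch: inspect the properties of each final recognition result
recognizer.Out.Do(result =>
{
    Console.WriteLine($"IsFinal: {result.IsFinal}");
    Console.WriteLine($"Top hypothesis: {result.Text}");

    foreach (var alternate in result.Alternates)
    {
        // The Text property on each alternate is assumed here
        Console.WriteLine($"  Alternate: {alternate.Text}");
    }

    // The raw audio for this result is available via the Audio property
    var utteranceAudio = result.Audio;
});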

NOTE: Because the partial and final recognition result event times are estimated from the audio stream position of the underlying recognition engine, the originating times of messages from the SystemSpeechRecognizer component may not reflect the exact times of the corresponding utterances in the input audio stream. See this issue for more details.

Grammars

By default, the recognizer uses a free text dictation grammar, but can also be configured to work with custom grammar files. These are XML files that conform to the W3C SRGS Specification. An example grammar file would be:

<?xml version="1.0" encoding="UTF-8" ?>
<grammar version="1.0" xml:lang="en-US"
            xmlns="http://www.w3.org/2001/06/grammar"
            tag-format="semantics/1.0" root="Main">
    <!--
    Defines an SRGS grammar for requesting a flight. This grammar includes
    a Cities rule that lists the cities that can be used for departures
    and destinations.
    -->
    <rule id="Main">
        <item>
            I would like to fly from <ruleref uri="#Cities"/>
            to <ruleref uri="#Cities"/>
        </item>
    </rule>

    <rule id="Cities" scope="public">
        <one-of>
            <item>Seattle</item>
            <item>Los Angeles</item>
            <item>New York</item>
            <item>Miami</item>
        </one-of>
    </rule>
</grammar>

To use one or more grammar files in the SystemSpeechRecognizer, set the Grammars configuration parameter to a list of GrammarInfo objects, each of which is a name-value pair of Name and FileName. Each Name is used as a key for the supplied grammar and should be unique.

// Example of instantiating a recognizer with a set of grammar files.
var recognizer = new SystemSpeechRecognizer(pipeline,
    new SystemSpeechRecognizerConfiguration()
    {
        Grammars = new GrammarInfo[]
        {
            new GrammarInfo() { Name = "Hi", FileName = "Hi.grxml" },
            new GrammarInfo() { Name = "Bye", FileName = "Bye.grxml" },
            new GrammarInfo() { Name = "Yes", FileName = "Yes.grxml" },
            new GrammarInfo() { Name = "No", FileName = "No.grxml" },
            new GrammarInfo() { Name = "ThankYou", FileName = "ThankYou.grxml" }
        }
    });

Intents

Grammar rules may be annotated with semantic tags to augment them with semantic interpretation. When used with the SystemSpeechRecognizer component, any semantic interpretation of recognized text is translated into intents that are output on the IntentData stream. The IntentData class contains a list of detected intents (with associated confidence scores) and a list of detected entities, each represented as a key-value pair of entity name and value. The following is an example of how an intent and entity may be specified using semantic tags in a grammar.

<?xml version="1.0" encoding="UTF-8" ?>
<grammar version="1.0" xml:lang="en-US"
            xmlns="http://www.w3.org/2001/06/grammar"
            tag-format="semantics/1.0" root="Main">

    <rule id="Main">
        <tag>$.RepeatReminder = {}</tag>
        <ruleref special="GARBAGE"/>
        <one-of>
            <item>remind me</item>
            <item>remind me again</item>
            <item>snooze for</item>
        </one-of>
        <ruleref special="GARBAGE"/>
        <one-of>
            <item>one minute<tag>$.RepeatReminder.Snooze = 1</tag></item>
            <item>a minute<tag>$.RepeatReminder.Snooze = 1</tag></item>
            <item>five minutes<tag>$.RepeatReminder.Snooze = 5</tag></item>
        </one-of>
        <ruleref special="GARBAGE"/>
    </rule>
</grammar>

With the above grammar configured, a recognized phrase of "remind me in a minute" would result in an IntentData message containing a single intent of RepeatReminder and an entity of Snooze with a value of 1.
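A minimal sketch for consuming this stream is shown below; the Value and Type property names on the individual intents and entities are assumptions and may differ from the actual types.

// Sketch: print the intents and entities produced by the grammar
// (the Value and Type property names below are assumptions)
recognizer.IntentData.Do(intentData =>
{
    foreach (var intent in intentData.Intents)
    {
        Console.WriteLine($"Intent: {intent.Value}");
    }

    foreach (var entity in intentData.Entities)
    {
        Console.WriteLine($"Entity: {entity.Type} = {entity.Value}");
    }
});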

The AzureSpeechRecognizer component

The AzureSpeechRecognizer component uses the Cognitive Services Speech to Text API. In contrast to the SystemSpeechRecognizer, it requires as input a joint audio and voice activity signal, represented as a ValueTuple<AudioBuffer, bool>. The second item is a flag that indicates whether the AudioBuffer contains speech (or more specifically, voice activity). To construct such an input signal from a stream of raw audio, the SimpleVoiceActivityDetector or SystemVoiceActivityDetector (Windows only) components may be used in conjunction with the Join operator, as in the following example.

using (var pipeline = Pipeline.Create())
{
    // Create recognizer component. SubscriptionKey is required.
    var recognizer = new AzureSpeechRecognizer(
        pipeline,
        new AzureSpeechRecognizerConfiguration()
        {
            SubscriptionKey = "...", // replace with your own subscription key
            Region = "..." // replace with your service region (e.g. "WestUS")
        });

    var audio = new AudioCapture(
        pipeline, 
        new AudioCaptureConfiguration() 
        {
            OutputFormat = WaveFormat.Create16kHz1Channel16BitPcm()
        });

    // AzureSpeechRecognizer requires VAD signal as input
    var vad = new SystemVoiceActivityDetector(pipeline);

    // Send the audio to the VAD to detect voice activity
    audio.PipeTo(vad);
    
    // Use the Join operator to combine the audio and VAD signals into a voice-annotated audio signal
    var voice = audio.Join(vad);

    // Send the voice-annotated audio to the recognizer
    voice.PipeTo(recognizer);

    // Print the final recognized text - use Select operator to select the Text property
    var output = recognizer.Select(result => result.Text).Do(text => Console.WriteLine(text));

    // Run the pipeline
    pipeline.Run();
}

The component posts final and partial recognition results represented by the SpeechRecognitionResult object on the Out and PartialRecognitionResults streams respectively.

Note that the SubscriptionKey and Region configuration parameters are required in order to use the speech recognition service. A subscription key may be obtained by creating a Speech resource in the Azure portal.

Voice activity detection

The SystemVoiceActivityDetector component

The SystemVoiceActivityDetector uses the System.Speech.Recognition.SpeechRecognitionEngine as a simple way to detect the start and end of voice activity in the audio stream.

The output of this component is a stream of boolean messages, where a value of true indicates that voice activity was present at the originating time of the message, and a value of false indicates that no voice activity (silence) was detected.

// Create the VAD component
var vad = new Microsoft.Psi.Speech.SystemVoiceActivityDetector(pipeline);

// Send the audio to the VAD
audio.PipeTo(vad);

// Print VAD results
var output = vad.Out.Do(speech => Console.WriteLine(speech ? "Speech" : "Silence"));

Internally, the component feeds received audio into the underlying System.Speech.Recognition.SpeechRecognitionEngine which will detect whenever the audio state changes (e.g. from silence to speech). An output message will be posted for each input audio message indicating whether voice activity was present in the audio buffer. Because detection of the audio state change may have some inherent delay, the VoiceActivityStartOffsetMs and VoiceActivityEndOffsetMs configuration parameters are provided for fine-tuning.
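As a sketch, the component might be instantiated with tuned offsets as follows; the configuration class name follows the naming pattern of the other components and is an assumption, and the offset values are purely illustrative.

// Sketch: instantiate the VAD with illustrative start/end offsets (in milliseconds)
// (the configuration class name is assumed from the naming pattern of other components)
var vad = new Microsoft.Psi.Speech.SystemVoiceActivityDetector(
    pipeline,
    new Microsoft.Psi.Speech.SystemVoiceActivityDetectorConfiguration()
    {
        VoiceActivityStartOffsetMs = -150,
        VoiceActivityEndOffsetMs = 150
    });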

Note that the SystemVoiceActivityDetector is only available on Windows Desktop platforms due to its use of the Windows Desktop Speech API.

The SimpleVoiceActivityDetector component

The SimpleVoiceActivityDetector uses a simple log energy heuristic to determine the presence or absence of sound that could possibly contain voice activity within an audio signal. While not as robust as the SystemVoiceActivityDetector, it is available cross-platform.

Like the SystemVoiceActivityDetector, the output of this component is a stream of boolean messages representing the result of the detection.

Internally, the component relies primarily on a log energy threshold to detect the presence of a signal. You may tune this value via the LogEnergyThreshold property of the SimpleVoiceActivityDetectorConfiguration object that is passed to the component on instantiation. The log energy is computed over a fixed frame size and frame rate applied to the input audio signal; these parameters may also be modified in the SimpleVoiceActivityDetectorConfiguration object. Finally, the detection and silence windows (i.e. the continuous length of time over which the component must detect sound or silence before it triggers a state change) are also configurable for fine-tuning the performance of the component.
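As a sketch, the component might be instantiated with a custom threshold as follows; the threshold value shown is purely illustrative.

// Sketch: instantiate the cross-platform VAD with an illustrative log energy threshold
var simpleVad = new SimpleVoiceActivityDetector(
    pipeline,
    new SimpleVoiceActivityDetectorConfiguration()
    {
        LogEnergyThreshold = 7
    });

// Send the audio to the VAD
audio.PipeTo(simpleVad);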

Speech synthesis

The SystemSpeechSynthesizer component performs text-to-speech conversion using the .NET System.Speech.Synthesis.SpeechSynthesizer class. The following is an example of how this component might be used to do text-to-speech. Note the use of the Voice configuration parameter to select the synthesis voice; the name must match one of the text-to-speech voices installed on the system.

var synthesizer = new Microsoft.Psi.Speech.SystemSpeechSynthesizer(
    pipeline,
    new Microsoft.Psi.Speech.SystemSpeechSynthesizerConfiguration
    {
        Voice = "Microsoft Zira Desktop"
    });

var player = new Microsoft.Psi.Audio.AudioPlayer(
    pipeline,
    new Microsoft.Psi.Audio.AudioPlayerConfiguration()
    {
        InputFormat = Microsoft.Psi.Audio.WaveFormat.Create16kHz1Channel16BitPcm()
    });

// Synthesize the recognized text and play that back through an audio player component
recognizer.Select(r => r.Text).PipeTo(synthesizer);
synthesizer.PipeTo(player);

The output of the SystemSpeechSynthesizer component is a stream of raw audio representing the synthesized speech.

Language understanding

Language understanding is the process of inferring semantic intent from the recognized text. It is typically performed as the next stage following speech recognition, although it may be applied to any stream of text. The LUISIntentDetector uses the cloud-based LUIS service in conjunction with custom or pre-built apps developed using LUIS. See http://www.luis.ai/ for more details.

The output of the LUISIntentDetector component is a stream of IntentData which wraps a list of Intents and a list of Entities that were understood in the input phrase, as determined by the LUIS service. The LUIS service returns a JSON object that is then deserialized into an IntentData object that is posted on the Out stream.
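A minimal sketch of wiring recognized text into the component might look as follows; the configuration property names shown are assumptions and may differ from the actual API.

// Sketch: detect intents in recognized text using the LUIS service
// (the configuration property names below are assumptions)
var intentDetector = new LUISIntentDetector(
    pipeline,
    new LUISIntentDetectorConfiguration()
    {
        ApplicationId = "...", // replace with your LUIS application id
        SubscriptionKey = "..." // replace with your LUIS subscription key
    });

// Feed the recognized text into the intent detector and print the detected intents
recognizer.Select(result => result.Text).PipeTo(intentDetector);
intentDetector.Do(intentData =>
{
    foreach (var intent in intentData.Intents)
    {
        Console.WriteLine(intent);
    }
});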
