
Certain voice models emit incorrect word boundary events when processing special characters #2359

Open
GJStevenson opened this issue Apr 30, 2024 · 4 comments

GJStevenson commented Apr 30, 2024

Describe the bug

A subset of the voice models appear to have difficulty processing three special characters (<, >, and &), even when they are escaped in entity format (https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-structure#special-characters). Once a special character appears in the script, the WordBoundary events begin to report incorrect word boundaries.
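Per the linked SSML documentation, these three characters must be escaped to their entity forms before being embedded in SSML. A minimal sketch of that escaping in Python (to_ssml_text is a hypothetical helper for illustration, not part of the Speech SDK):

```python
import html

def to_ssml_text(text: str) -> str:
    """Escape the three characters that are significant in SSML/XML."""
    # & -> &amp;, < -> &lt;, > -> &gt; (quote=False leaves quotes alone)
    return html.escape(text, quote=False)

print(to_ssml_text("Testing AT&T to see if it works"))
# Testing AT&amp;T to see if it works
```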

A non-exhaustive list of voice models that appear to be exhibiting this behavior are:

en-US-AndrewNeural
en-US-BrianNeural
en-US-EmmaNeural
en-US-JennyMultilingualNeural
en-US-RyanMultilingualNeural

I've experienced this issue with the JavaScript SDK as well as the Python SDK. Sample code, based on the Python sample project, is here: https://gist.github.com/GJStevenson/ed2b0ca00691109dfd99ad3ef177b1a3

To Reproduce

  1. Pull down sample code in gist: https://gist.github.com/GJStevenson/ed2b0ca00691109dfd99ad3ef177b1a3
  2. Install dependencies listed in environment.yml
    • If using conda, run: conda env create -f environment.yml and then activate the environment.
  3. Set speech_key and service_region
  4. Choose the voice model to use in speech_synthesis_word_boundary_event.
  5. Run python speech_synthesis_sample.py and enter your sample text.

NOTE: speak_text_async appears to convert the special characters to HTML entities automatically.

  6. View the results in the console and in the emitted log files (./out/)

    • For example, the sample text Testing AT&T to see if it works will emit these word boundary events:
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=500000, duration=0:00:00.437500, text_offset=0, word_length=7), audio offset in ms: 50.0ms. Text: Testing
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=5000000, duration=0:00:00.962500, text_offset=-1, word_length=4), audio offset in ms: 500.0ms. Text: AT&a
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=14750000, duration=0:00:00.087500, text_offset=-1, word_length=3), audio offset in ms: 1475.0ms. Text: mp;
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=15750000, duration=0:00:00.200000, text_offset=11, word_length=4), audio offset in ms: 1575.0ms. Text: T to
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=17875000, duration=0:00:00.112500, text_offset=16, word_length=2), audio offset in ms: 1787.5ms. Text: se
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=19125000, duration=0:00:00.087500, text_offset=18, word_length=3), audio offset in ms: 1912.5ms. Text: e i
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=20125000, duration=0:00:00.575000, text_offset=21, word_length=10), audio offset in ms: 2012.5ms. Text: f it works
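The audio_offset values in these events are reported in 100-nanosecond ticks, which is how the sample derives the millisecond values shown above. A minimal sketch of that conversion:

```python
def ticks_to_ms(audio_offset: int) -> float:
    # audio_offset is in 100-nanosecond ticks; 1 ms = 10,000 ticks
    return audio_offset / 10_000

print(ticks_to_ms(500_000))    # 50.0  (matches the first event above)
print(ticks_to_ms(5_000_000))  # 500.0 (matches the second event)
```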
    

After the &amp; is encountered, the word boundary events start reporting incorrect boundaries (AT&a, mp;, T to, etc.). The same issue occurs with the other two special characters, < and >.
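For what it's worth, the garbled tokens line up exactly with contiguous slices of the HTML-escaped input, which suggests the service may be segmenting words against the escaped text. A small check (pure Python, no SDK required):

```python
import html

raw = "Testing AT&T to see if it works"
escaped = html.escape(raw, quote=False)  # "Testing AT&amp;T to see if it works"

# The garbled tokens from the log are contiguous slices of the escaped string:
print(escaped[8:12])   # AT&a
print(escaped[12:15])  # mp;
print(escaped[15:19])  # T to
```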

Attached are some logs from running the input string Testing AT&T to see if it works against the voice models en-US-AndrewNeural and en-US-AriaNeural.

Expected behavior

Word boundaries are reported correctly regardless of whether special characters are present.
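As a point of comparison, here is a rough sketch of the offsets one would expect against the raw input string (expected_word_boundaries is a hypothetical helper using naive whitespace tokenization, so the service's real word segmentation may differ):

```python
import re

def expected_word_boundaries(text: str):
    # Offsets and lengths measured against the raw (unescaped) input,
    # which is what the WordBoundary events should report.
    return [(m.start(), len(m.group()), m.group())
            for m in re.finditer(r"\S+", text)]

for offset, length, word in expected_word_boundaries("Testing AT&T to see if it works"):
    print(offset, length, word)
# 0 7 Testing
# 8 4 AT&T
# 13 2 to
# ...
```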

Version of the Cognitive Services Speech SDK

Python 1.37.0
JavaScript 1.31.0

Platform, Operating System, and Programming Language

  • OS: macOS Ventura 13.6.6
  • Hardware: MacBook Pro (Apple M1 Max, ARM)
  • Programming Language: Python, JavaScript
  • Browser: Chrome (JavaScript SDK used in Electron 27 on the renderer process)

Additional context

en-US-AndrewNeural Logs: speech_synthesis_en-US-AndrewNeural_20240430_163926.log

en-US-AriaNeural Logs: speech_synthesis_en-US-AriaNeural_20240430_165107.log

@meetakshay99

The same happens with the Java SDK too.

@BeastBlood1885

With Edge's Read Aloud, whether or not I'm using the multilingual versions of Andrew and Brian available there (confusingly, the ones without "multilingual" in their names still act as such), it skips to the next sentence or passage every time it encounters those characters. This happens with Remy too.

@pankopon
Contributor

@yulin-li Please check - a service side issue / voice model specific?

@Kerry-LinZhang

@yanchang-gyc to follow up
