
Certain voice models emit incorrect word boundary events when processing special characters #2359

Open
GJStevenson opened this issue Apr 30, 2024 · 4 comments

GJStevenson commented Apr 30, 2024

Describe the bug

A subset of the voice models appear to have difficulty processing three special characters (<, >, and &), even when they are escaped in entity format (https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-structure#special-characters). Once a special character appears in the script, the WordBoundary events begin to report incorrect word boundaries.
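Per the linked SSML documentation, these three characters must be escaped to their entity forms before being embedded in SSML. A minimal sketch of that escaping in Python (to_ssml_text is a hypothetical helper for illustration, not part of the Speech SDK):

```python
import html

def to_ssml_text(text: str) -> str:
    """Escape the three characters that are significant in SSML/XML."""
    # & -> &amp;, < -> &lt;, > -> &gt; (quote=False leaves quotes alone)
    return html.escape(text, quote=False)

print(to_ssml_text("Testing AT&T to see if it works"))
# Testing AT&amp;T to see if it works
```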

A non-exhaustive list of voice models that appear to be exhibiting this behavior are:

en-US-AndrewNeural
en-US-BrianNeural
en-US-EmmaNeural
en-US-JennyMultilingualNeural
en-US-RyanMultilingualNeural

I've experienced this issue with the JavaScript SDK as well as the Python SDK. Sample code, based on the Python sample project, is here: https://gist.github.com/GJStevenson/ed2b0ca00691109dfd99ad3ef177b1a3

To Reproduce

  1. Pull down sample code in gist: https://gist.github.com/GJStevenson/ed2b0ca00691109dfd99ad3ef177b1a3
  2. Install dependencies listed in environment.yml
    • If using conda, run: conda env create -f environment.yml and then activate the environment.
  3. Set speech_key and service_region
  4. Choose the voice model to use in speech_synthesis_word_boundary_event.
  5. Run python speech_synthesis_sample.py and enter your sample text.

NOTE: speak_text_async appears to convert the special characters to HTML entities automatically.

  6. View the results in the console and in the emitted log files (./out/)

    • For example, the sample text Testing AT&T to see if it works will emit these word boundary events:
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=500000, duration=0:00:00.437500, text_offset=0, word_length=7), audio offset in ms: 50.0ms. Text: Testing
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=5000000, duration=0:00:00.962500, text_offset=-1, word_length=4), audio offset in ms: 500.0ms. Text: AT&a
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=14750000, duration=0:00:00.087500, text_offset=-1, word_length=3), audio offset in ms: 1475.0ms. Text: mp;
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=15750000, duration=0:00:00.200000, text_offset=11, word_length=4), audio offset in ms: 1575.0ms. Text: T to
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=17875000, duration=0:00:00.112500, text_offset=16, word_length=2), audio offset in ms: 1787.5ms. Text: se
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=19125000, duration=0:00:00.087500, text_offset=18, word_length=3), audio offset in ms: 1912.5ms. Text: e i
    Word boundary event received: SpeechSynthesisWordBoundaryEventArgs(audio_offset=20125000, duration=0:00:00.575000, text_offset=21, word_length=10), audio offset in ms: 2012.5ms. Text: f it works
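The audio_offset values in these events are reported in 100-nanosecond ticks, which is how the sample derives the millisecond values shown above. A minimal sketch of that conversion:

```python
def ticks_to_ms(audio_offset: int) -> float:
    # audio_offset is in 100-nanosecond ticks; 1 ms = 10,000 ticks
    return audio_offset / 10_000

print(ticks_to_ms(500_000))    # 50.0  (matches the first event above)
print(ticks_to_ms(5_000_000))  # 500.0 (matches the second event)
```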
    

After the &amp; is encountered, the word boundary events start reporting incorrect boundaries (AT&a, mp;, T to, etc.). The same issue occurs with the other two special characters, < and >.
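For what it's worth, the garbled tokens line up exactly with contiguous slices of the HTML-escaped input, which suggests the service may be segmenting words against the escaped text. A small check (pure Python, no SDK required):

```python
import html

raw = "Testing AT&T to see if it works"
escaped = html.escape(raw, quote=False)  # "Testing AT&amp;T to see if it works"

# The garbled tokens from the log are contiguous slices of the escaped string:
print(escaped[8:12])   # AT&a
print(escaped[12:15])  # mp;
print(escaped[15:19])  # T to
```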

Attached are some logs from running the input string Testing AT&T to see if it works against the voice models en-US-AndrewNeural and en-US-AriaNeural.

Expected behavior

Word boundaries are reported correctly regardless of whether special characters are present.
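As a point of comparison, here is a rough sketch of the offsets one would expect against the raw input string (expected_word_boundaries is a hypothetical helper using naive whitespace tokenization, so the service's real word segmentation may differ):

```python
import re

def expected_word_boundaries(text: str):
    # Offsets and lengths measured against the raw (unescaped) input,
    # which is what the WordBoundary events should report.
    return [(m.start(), len(m.group()), m.group())
            for m in re.finditer(r"\S+", text)]

for offset, length, word in expected_word_boundaries("Testing AT&T to see if it works"):
    print(offset, length, word)
# 0 7 Testing
# 8 4 AT&T
# 13 2 to
# ...
```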

Version of the Cognitive Services Speech SDK

Python 1.37.0
JavaScript 1.31.0

Platform, Operating System, and Programming Language

  • OS: macOS Ventura 13.6.6
  • Hardware: MacBook Pro (Apple M1 Max, ARM)
  • Programming Language: Python, JavaScript
  • Browser: Chrome (JavaScript SDK used in Electron 27 on the renderer process)

Additional context

en-US-AndrewNeural Logs: speech_synthesis_en-US-AndrewNeural_20240430_163926.log

en-US-AriaNeural Logs: speech_synthesis_en-US-AriaNeural_20240430_165107.log

@meetakshay99

The same happens with the Java SDK too.

@BeastBlood1885

With Edge's Read Aloud, whether or not I'm using the multilingual versions of Andrew and Brian available there (confusingly, the ones without "multilingual" in their names still act as such), it skips to the next sentence or passage every time it encounters those characters. This happens with Remy too.

@pankopon
Contributor

@yulin-li Please check - a service side issue / voice model specific?

@Kerry-LinZhang

@yanchang-gyc to follow up
