Enabling word-level timestamps for all W2L Decoders #5403

Open
wants to merge 5 commits into
base: main

abarcovschi

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos, doc improvements)
  • Did you read the contributor guideline?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #3371 and extends #3627 so that every wav2letter decoder class, not just W2lKenLMDecoder, can return the frame numbers of all non-blank characters of a hypothesis. A get_symbols() method was also added to the parent class of all the decoders (W2lDecoder) so that the non-blank characters of the hypothesis can be returned as a list of natural-language characters rather than only as token ids. This makes it easier to find the word-boundary tokens later, when calculating word-level timestamps with the following formula:

timestamp = frame_num * (audio_len / (num_frames * sample_rate))

where:

  • frame_num = the timestep of the symbol, as returned in the 'timesteps' field of the W2lDecoder.decode() outputs.
  • audio_len = the number of samples in the loaded audio file corresponding to the transcript (with batched w2v2 acoustic model inference, the audio is zero-padded to the length of the longest loaded audio file in the batch).
  • num_frames = the number of frames in the emission matrix returned by the w2v2 acoustic model inference for that audio file (with batched inference, every audio file has the same number of frames, because all loaded audio files are padded to the length of the longest one in the batch).
  • sample_rate = the sample rate of the loaded audio files (usually 16000 Hz).
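
For illustration, the formula above could be applied roughly as follows. This is a minimal sketch rather than code from this PR: the helper names frame_to_seconds and word_timestamps are hypothetical, and it assumes that the symbol list (as returned by get_symbols()) marks word boundaries with "|", which should be checked against the dictionary actually in use.

```python
from typing import List, Tuple


def frame_to_seconds(frame_num: int, audio_len: int, num_frames: int,
                     sample_rate: int = 16000) -> float:
    """timestamp = frame_num * (audio_len / (num_frames * sample_rate))"""
    return frame_num * (audio_len / (num_frames * sample_rate))


def word_timestamps(symbols: List[str], timesteps: List[int],
                    audio_len: int, num_frames: int,
                    sample_rate: int = 16000,
                    boundary: str = "|") -> List[Tuple[str, float, float]]:
    """Group per-character timesteps into (word, start_sec, end_sec) tuples.

    `boundary` is an assumed word-boundary symbol; adjust it to the
    dictionary used by the acoustic model.
    """
    words: List[Tuple[str, float, float]] = []
    chars: List[str] = []
    start = end = 0.0
    for sym, step in zip(symbols, timesteps):
        if sym == boundary:
            # A boundary closes the word collected so far, if there is one.
            if chars:
                words.append(("".join(chars), start, end))
                chars = []
            continue
        t = frame_to_seconds(step, audio_len, num_frames, sample_rate)
        if not chars:
            start = t  # first character of a new word
        chars.append(sym)
        end = t  # frame of the most recent character in the word
    if chars:
        # Flush the final word when the hypothesis has no trailing boundary.
        words.append(("".join(chars), start, end))
    return words
```

Note that in this sketch a word's end time is the frame at which its last character was emitted, not the point where the spoken word actually ends, so downstream code may want to extend it toward the next word's start.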

PR review

@alexeib

Contributor

@alexeib alexeib left a comment

lgtm bar comments. also i am no longer at meta so can't merge PRs into this repo

examples/speech_recognition/w2l_decoder.py (outdated, resolved)
examples/speech_recognition/w2l_decoder.py (outdated, resolved)

@alexeib alexeib left a comment

thanks! hopefully someone from meta will merge!

Successfully merging this pull request may close these issues.

[Question] wav2vec 2.0 timestamp words
3 participants