Timestamps are broken for whisper large with WhisperForConditionalGeneration #30433

Closed
kamilakesbi opened this issue Apr 23, 2024 · 2 comments · Fixed by #30812
@kamilakesbi
Contributor

kamilakesbi commented Apr 23, 2024

System Info

  • transformers version: 4.40.0.dev0
  • Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.2
  • Accelerate version: 0.29.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2+cu121 (True)
  • Tensorflow version (GPU?): 2.13.1 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
  • Jax version: 0.4.13
  • JaxLib version: 0.4.13
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help

@kamilakesbi @sanchit-gandhi

Reproduction

Timestamps are broken for whisper-large-v3 when used with WhisperForConditionalGeneration:

Note: This issue is related to #30224

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the large-v3 checkpoint and its processor
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
model.to(device)

# Take the first four samples of the dummy LibriSpeech validation split
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
speech_samples = ds.sort("id").select(range(4))[:4]["audio"]

# Convert the raw waveforms to log-mel input features
input_speech = [x["array"] for x in speech_samples]
features = processor.feature_extractor(raw_speech=input_speech, return_tensors="pt")

input_features = features.input_features.to(device)
generate_kwargs = {}

# Request both segment-level and token-level timestamps
generate_outputs = model.generate(
    input_features, return_timestamps=True, return_token_timestamps=True, **generate_kwargs
)
print(generate_outputs.token_timestamps)

We get:

tensor([[ 0.0000,  0.0000, 29.3000, 29.3000, 29.9800, 29.9800, 29.9800, 29.9800,
         29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800,
         29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800,
         29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800,
         29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800,
         29.9800],
        [ 0.0000,  0.0000, 29.3000, 29.3000, 29.9800, 29.9800, 29.9800, 29.9800,
         29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800,
         29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800,
         29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800,
         29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800, 29.9800,
         29.9800],...

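After the first four tokens, every token is assigned 29.98 s, i.e. the end of the 30-second input window, indicating that the timestamps are broken.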
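
A quick way to flag this failure mode is to measure how many token timestamps in a row sit at the end of the window. The sketch below is only an illustration: `saturated_fraction` is a hypothetical helper (not part of transformers), the 29.98 default is taken from the output above, and the row is assumed to have been converted to a plain list, e.g. with `generate_outputs.token_timestamps[0].tolist()`.

```python
from typing import Sequence


def saturated_fraction(
    timestamps: Sequence[float], window_end: float = 29.98, tol: float = 1e-3
) -> float:
    """Return the fraction of timestamps that sit at the end of the input window.

    Hypothetical helper: `window_end` defaults to the value seen in the
    output above; it is an assumption, not a constant read from the model.
    """
    if not timestamps:
        return 0.0
    hits = sum(1 for t in timestamps if abs(t - window_end) <= tol)
    return hits / len(timestamps)


# First row of the tensor printed above, written out as a plain list:
# 4 leading values followed by 37 repetitions of 29.98 (41 tokens in total).
row = [0.0, 0.0, 29.3, 29.3] + [29.98] * 37
print(f"{saturated_fraction(row):.2f}")
```

A value close to 1.0 for every row in the batch, as here, suggests the timestamps are degenerate rather than that the audio genuinely ends at 30 s.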

@sanchit-gandhi sanchit-gandhi added Audio Good Second Issue Issues that are more difficult to do than "Good First" issues - give it a try if you want! labels Apr 23, 2024
@sanchit-gandhi
Contributor

sanchit-gandhi commented Apr 23, 2024

This notebook should be useful: https://github.com/sanchit-gandhi/codesnippets/blob/main/whisper-word-level.ipynb

While we fix this issue, we can also consider how to simplify the API for users, since it currently requires some post-processing outside the model + processor API.
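
To give a rough idea of the kind of post-processing meant here, the sketch below groups per-token timestamps into word-level spans. It is not the notebook's method and not a transformers API: `group_tokens_into_words` is a hypothetical helper, and it assumes that the tokens have already been decoded to strings one by one, that a leading space marks the start of a new word, and that each timestamp is a single time point for its token.

```python
from typing import List, Tuple


def group_tokens_into_words(
    tokens: List[str], timestamps: List[float]
) -> List[Tuple[str, float, float]]:
    """Merge decoded tokens into (word, first_time, last_time) tuples.

    Hypothetical helper: a token starting with a space opens a new word;
    any other token is appended to the current word.
    """
    if len(tokens) != len(timestamps):
        raise ValueError("tokens and timestamps must have the same length")

    words: List[list] = []
    for token, ts in zip(tokens, timestamps):
        if token.startswith(" ") or not words:
            # Start a new word; its span begins and ends at this token.
            words.append([token.strip(), ts, ts])
        else:
            # Continue the current word and extend its span.
            words[-1][0] += token
            words[-1][2] = ts
    return [(text, start, end) for text, start, end in words]


# Illustrative tokens and timestamps (made up, not model output).
example = group_tokens_into_words(
    [" Mr", ".", " Quil", "ter"], [0.5, 0.7, 0.9, 1.1]
)
print(example)
```

With the broken timestamps reported above, such a grouping would place nearly every word at 29.98 s, which is why the underlying values need fixing before any API simplification helps.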

@nakranivaibhav
Contributor

@sanchit-gandhi is this issue open for taking?
