davecampbell/whisper_pyannote_fusion

 
 


Whisper PyAnnote Fusion

This library contains techniques for fusing Whisper ASR output with PyAnnote output.

Whisper is a model for automatic speech recognition (speech-to-text transcription). PyAnnote is a toolkit for speaker diarization, i.e. working out who spoke when.

It has the following features:

  • A single command that runs both Whisper and PyAnnote and returns the ASR and diarization results
  • Some basic metrics for evaluating the performance of the fusion
  • A small dataset for testing the fusion
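To make the idea of fusion concrete, here is a minimal sketch of one common approach: give each transcribed segment the speaker whose diarization turn overlaps it the most. This is an illustration under stated assumptions, not necessarily how this library does it. The tuple layouts and the names `overlap` and `assign_speakers` are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical layouts, chosen for illustration only:
# (start_sec, end_sec, text) for an ASR segment
AsrSegment = Tuple[float, float, str]
# (start_sec, end_sec, speaker_label) for a diarization turn
SpeakerTurn = Tuple[float, float, str]


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    asr: List[AsrSegment], turns: List[SpeakerTurn]
) -> List[Tuple[Optional[str], str]]:
    """Label each ASR segment with the speaker that overlaps it most.

    Segments that overlap no diarization turn get None as speaker.
    """
    fused = []
    for start, end, text in asr:
        best_speaker = None
        best_overlap = 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(start, end, t_start, t_end)
            if shared > best_overlap:
                best_overlap = shared
                best_speaker = speaker
        fused.append((best_speaker, text))
    return fused


if __name__ == "__main__":
    asr = [(0.0, 2.0, "hello"), (2.0, 5.0, "hi there")]
    turns = [(0.0, 2.2, "SPEAKER_00"), (2.2, 6.0, "SPEAKER_01")]
    for speaker, text in assign_speakers(asr, turns):
        print(speaker, text)
```

Maximum overlap is simple, but it breaks down when a single ASR segment spans a speaker change. Word-level timestamps reduce that problem.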

Installation

  • Get a Hugging Face API key (access token). The pyannote models require it. You also have to open the model cards on Hugging Face and accept their terms and conditions.
  • Install pyannote.audio from the GitHub repo (https://github.com/pyannote/pyannote-audio) using the following command: pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/develop.zip (substitute pipenv for pip if you prefer pipenv)
  • Install whisperx from its GitHub repo with the command pip install git+https://github.com/m-bain/whisperx.git#egg=whisperx (substitute pipenv for pip if you prefer pipenv)
  • Install the spaCy model for English: python -m spacy download en_core_web_sm
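After working through the steps above, a quick check can confirm that the packages are importable. The sketch below uses only the standard library. The helper name `missing_modules` is hypothetical, and the import names listed are the ones these packages normally install under.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # find_spec raises if a parent package (e.g. "pyannote") is absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    required = ["pyannote.audio", "whisperx", "spacy", "en_core_web_sm"]
    not_found = missing_modules(required)
    if not_found:
        print("Missing:", ", ".join(not_found))
    else:
        print("All required modules were found.")
```

This only confirms that the modules can be located. It does not verify the Hugging Face key or that the model terms were accepted.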

Troubleshooting installation

PyTorch Lightning Cloud IO

I had to comment out the line from pytorch_lightning.utilities.cloud_io import load as pl_load, since cloud_io seems to have moved to lightning_fabric. If the models haven't been downloaded yet, commenting it out will probably break model loading. This should be fixed upstream in the future; I'm noting it here in case it affects someone else.

Changing the line to from lightning_fabric.utilities.cloud_io import _load as pl_load might work.

Pyannote needs several things to be set up correctly before it works. Be sure to accept the terms and conditions and get the API key on Hugging Face, and make sure the right key is actually being passed through.
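Instead of editing the import by hand, you can try both locations in order. In place, that is just a try/except ImportError around the two import statements. The sketch below wraps the same idea in a small helper. The name `import_first` is hypothetical, and whether either module path exists depends on the installed Lightning version.

```python
import importlib
from typing import Any, Iterable, Tuple


def import_first(candidates: Iterable[Tuple[str, str]]) -> Any:
    """Return the first attribute importable from (module_name, attribute) pairs."""
    errors = []
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_name}.{attr}: {exc}")
    raise ImportError(
        "none of the candidates could be imported:\n" + "\n".join(errors)
    )


# The two locations mentioned above, old one first.
PL_LOAD_CANDIDATES = [
    ("pytorch_lightning.utilities.cloud_io", "load"),
    ("lightning_fabric.utilities.cloud_io", "_load"),
]

# Usage (left commented out because it needs Lightning installed):
# pl_load = import_first(PL_LOAD_CANDIDATES)
```

Note that `_load` has a leading underscore, which marks it as private, so it may move again in later releases.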
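One way to make sure the right key is passed through is to read it from a single place and fail early when it is missing. A minimal sketch follows. The environment variable name `HF_TOKEN` and the helper `get_hf_token` are assumptions for illustration, not part of this library.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the Hugging Face key from the environment, failing loudly if absent."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is not set. Create an access token "
            "on Hugging Face and accept the pyannote model terms first."
        )
    return token


# Usage (commented out: needs pyannote.audio, network access, and accepted terms).
# The keyword argument name depends on the pyannote.audio version installed.
# from pyannote.audio import Pipeline
# pipeline = Pipeline.from_pretrained(
#     "pyannote/speaker-diarization", use_auth_token=get_hf_token()
# )
```

Failing at this point gives a clearer message than the download error that would otherwise appear deep inside model loading.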

Problem with MP3 files

If you use MP3 files, you may get errors like:

Sizes of tensors must match except in dimension 0. Expected size 80000 but got size ...

You'll have to apply the fix outlined in pyannote/pyannote-audio#1324.

In audio/pipelines/speaker_diarization.py, add the new code below the comment # waveform: (1, num_samples) torch.Tensor, so that the block reads:

# chunk: Segment(t, t + duration)
# masks: (num_frames, local_num_speakers) np.ndarray
waveform, _ = self._audio.crop(
    file,
    chunk,
    duration=duration,
    mode="pad",
)
# waveform: (1, num_samples) torch.Tensor
if waveform.shape[1] < num_samples:
    pad_num = int(num_samples - waveform.shape[1])
    waveform = torch.nn.functional.pad(waveform, (0, pad_num), "constant", 0)
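The patch works because a chunk that decodes to fewer samples than expected is right-padded with zeros until it matches the others, so the chunks can be stacked into one batch. The sketch below shows the same logic on plain Python lists instead of torch tensors. The name `pad_to_length` is hypothetical, and the value 80000 is taken from the error message above.

```python
from typing import List, Sequence


def pad_to_length(samples: Sequence[float], num_samples: int) -> List[float]:
    """Right-pad a chunk with zeros (silence) until it has num_samples entries.

    Chunks that are already long enough are returned unchanged, which
    mirrors the `if waveform.shape[1] < num_samples` guard in the patch.
    """
    missing = num_samples - len(samples)
    if missing <= 0:
        return list(samples)
    return list(samples) + [0.0] * missing


if __name__ == "__main__":
    expected = 80000                # size from the error message
    short_chunk = [0.1] * 79000     # a chunk that decoded short
    padded = pad_to_length(short_chunk, expected)
    print(len(padded))
```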

Data Source

Most of the testing and training was done on one episode of TWIML. This episode is https://twimlai.com/podcast/twimlai/are-llms-overhyped-or-under-appreciated/ and the mp3 file can be downloaded from https://chrt.fm/track/4D4ED/traffic.megaphone.fm/MLN5101605789.mp3?updated=1682369086

