
Issue with Testing a Fine-Tuned Pyannote Audio Model for Speaker Diarization #1648

Winchester37 opened this issue Feb 9, 2024 · 1 comment

@Winchester37

Tested versions

pyannote.audio 3.1.1

System information

Windows 11 - pyannote.audio 3.1.1

Issue description

I have successfully fine-tuned a Pyannote Audio model for speaker diarization on a custom dataset, but I'm now having difficulty testing the fine-tuned model. Despite following the documentation and adjusting the paths to the model checkpoint and configuration file, I get an error when running the model on a new audio file.

Here's the training code snippet I used for fine-tuning:

```python
import os
import torch

os.environ["PYANNOTE_DATABASE_CONFIG"] = "/yedek/pyannote/gsmDatasets202/datasets.yaml"

from pyannote.database import registry, FileFinder

registry.load_database("/yedek/pyannote/gsmDatasets202/datasets.yaml")
dataset = registry.get_protocol("DATATEST.SpeakerDiarization.main", {"audio": FileFinder()})


from pyannote.audio.tasks import SpeakerDiarization
from pyannote.audio.models.segmentation import PyanNet

task = SpeakerDiarization(
    dataset,
    duration=5.0,
    max_speakers_per_chunk=2,
    max_speakers_per_frame=2,
    batch_size=128,
    num_workers=8,
    loss="bce"
)

model = PyanNet(task=task)

# this takes approximately 15min to run on Google Colab GPU
torch.set_float32_matmul_precision('high')
from types import MethodType
from torch.optim import Adam
from pytorch_lightning.callbacks import (
    EarlyStopping,
    ModelCheckpoint,
    RichProgressBar,
)

# we use Adam optimizer with 1e-4 learning rate
def configure_optimizers(self):
    return Adam(self.parameters(), lr=1e-4)

model.configure_optimizers = MethodType(configure_optimizers, model)

# we monitor diarization error rate on the validation set
# and use to keep the best checkpoint and stop early
monitor, direction = task.val_monitor
checkpoint = ModelCheckpoint(
    monitor=monitor,
    mode=direction,
    save_top_k=1,
    every_n_epochs=1,
    save_last=False,
    save_weights_only=False,
    filename="{epoch}",
    verbose=False,
)
early_stopping = EarlyStopping(
    monitor=monitor,
    mode=direction,
    min_delta=0.0,
    patience=10,
    strict=True,
    verbose=False,
)

callbacks = [RichProgressBar(), checkpoint, early_stopping]

# we train for at most 200 epochs (might be shorter in case of early stopping)
from pytorch_lightning import Trainer
trainer = Trainer(accelerator="gpu",
                  callbacks=callbacks,
                  max_epochs=200,
                  gradient_clip_val=0.5)
trainer.fit(model)

finetuned_model = checkpoint.best_model_path

print(finetuned_model)
```
And this is the testing code that produces the error:

```python
from pyannote.audio import Model
import json

# paths to the model checkpoint and configuration file
MODEL_PATH = "lightning_logs/version_24/checkpoints/epoch=57.ckpt"
CONFIG_PATH = "lightning_logs/version_9/hparams.yaml"
AUDIO_FILE_PATH = "wav2/20240123_112622.mp3"  # audio file to test

# load the pipeline for speaker diarization
pipeline = Model.from_pretrained(MODEL_PATH)

# run diarization on the audio file
diarization = pipeline(AUDIO_FILE_PATH)

# collect the diarization results
output = []
for segment, _, speaker in diarization.itertracks(yield_label=True):
    start = round(segment.start, 2)  # time the speech segment starts (in seconds)
    end = round(segment.end, 2)  # time the speech segment ends (in seconds)
    output.append({"speaker": speaker, "start": start, "end": end})

# print the results as JSON
print(json.dumps(output, indent=4))
```

```
Traceback (most recent call last):
  File "C:\Users\serca\PycharmProjects\pyannote\nemoo.py", line 14, in <module>
    diarization = pipeline(AUDIO_FILE_PATH)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\pyannote\audio\models\segmentation\PyanNet.py", line 172, in forward
    outputs = self.sincnet(waveforms)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\pyannote\audio\models\blocks\sincnet.py", line 81, in forward
    outputs = self.wav_norm1d(waveforms)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\instancenorm.py", line 71, in forward
    self._check_input_dim(input)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\instancenorm.py", line 161, in _check_input_dim
    if input.dim() not in (2, 3):
AttributeError: 'str' object has no attribute 'dim'
```


I'm looking for guidance on how to properly test the fine-tuned Pyannote Audio model, or a pointer to any specific step I might be missing. Any help towards resolving this issue would be greatly appreciated.

Thank you in advance for your assistance.

### Minimal reproduction example (MRE)

https://colab.research.google.com/github/pyannote/pyannote-audio/blob/develop/tutorials/MRE_template.ipynb#scrollTo=gVrDtBcusDbK
@FrenchKrab
Contributor

I think you are confusing pyannote's "models" (`pyannote.audio.models...`) with pyannote's "pipelines" (`pyannote.audio.pipelines...`).
The model you fine-tune/train is the 'segmentation' model: it performs the speaker diarization task on `duration=5.0`-second windows.
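That is also why the traceback above ends inside SincNet: a bare `Model`'s `forward()` expects a waveform tensor, not a file path string. Here is a minimal sketch of calling the raw segmentation model directly (the checkpoint path is taken from your script; the 16 kHz mono input shape is an assumption for illustration):

```python
import torch
from pyannote.audio import Model

# Model.from_pretrained returns the raw segmentation network; its forward()
# expects a (batch, channel, sample) waveform tensor, not a path string --
# hence the "'str' object has no attribute 'dim'" error above.
model = Model.from_pretrained("lightning_logs/version_24/checkpoints/epoch=57.ckpt")
model.eval()

# one 5-second mono chunk at 16 kHz (shape assumed for illustration)
waveform = torch.randn(1, 1, 16000 * 5)
with torch.inference_mode():
    activations = model(waveform)  # frame-level speaker activations for this chunk only
```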

To obtain the final diarization output on a whole audio file, you need to aggregate multiple outputs of this local segmentation model; see the paper "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe" for more details.

There may be examples in a pyannote tutorial notebook, but I can't remember which one, so here is a pretty complete notebook about training a model and testing its pipeline (in particular the "Adapted pipeline output" section). The rough shape of it is sketched below.
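In pyannote.audio 3.1, plugging the fine-tuned checkpoint into the full pipeline looks roughly like this. This is a minimal sketch, not the exact recipe from the notebook: the embedding model and the `instantiate` hyperparameter values are assumptions (loosely based on pretrained pipeline defaults) and should be tuned on your development set:

```python
from pyannote.audio import Model
from pyannote.audio.pipelines import SpeakerDiarization as SpeakerDiarizationPipeline

# load the fine-tuned *segmentation* model from its Lightning checkpoint
segmentation_model = Model.from_pretrained(
    "lightning_logs/version_24/checkpoints/epoch=57.ckpt"
)

# build the full diarization pipeline around it; the embedding model
# is an assumption, not something your checkpoint provides
pipeline = SpeakerDiarizationPipeline(
    segmentation=segmentation_model,
    embedding="speechbrain/spkrec-ecapa-voxceleb",
    clustering="AgglomerativeClustering",
)

# illustrative hyperparameter values -- tune these on a dev set
pipeline.instantiate({
    "segmentation": {"min_duration_off": 0.0},
    "clustering": {"method": "centroid", "min_cluster_size": 12, "threshold": 0.7},
})

# the pipeline (unlike the bare model) accepts an audio file path
diarization = pipeline("wav2/20240123_112622.mp3")
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {segment.start:.2f}s -> {segment.end:.2f}s")
```

The pipeline handles the sliding-window inference, aggregation, embedding extraction and clustering described above, which the bare segmentation model does not.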
