Releases: pyannote/pyannote-audio
Version 3.1.1
TL;DR
Providing num_speakers to pyannote/speaker-diarization-3.1 now works as expected.
Full changelog
Fixes
- fix(pipeline): fix support for setting num_speakers in pyannote/speaker-diarization-3.1 pipeline
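As a minimal sketch of what this fix enables, the helper below runs the 3.1 pipeline with a known number of speakers. It assumes pyannote.audio 3.1.1 is installed and that you have accepted the pipeline's user conditions on Hugging Face. The helper name diarize_with_known_speakers and its arguments are illustrative, not part of the library.

```python
def diarize_with_known_speakers(audio_path, num_speakers, auth_token=None):
    """Run pyannote/speaker-diarization-3.1 with a fixed speaker count (sketch)."""
    # Imported lazily so that defining this helper does not itself
    # require pyannote.audio to be installed.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=auth_token,  # Hugging Face access token (assumed to be needed)
    )

    # num_speakers is the option that this release fixes.
    diarization = pipeline(audio_path, num_speakers=num_speakers)

    # Collect (start, end, speaker) triples from the resulting annotation.
    return [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]
```

A call would look like diarize_with_known_speakers("audio.wav", num_speakers=2, auth_token="..."), where the file name and token are placeholders.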
Version 3.1.0
TL;DR
pyannote/speaker-diarization-3.1 no longer requires unpopular ONNX runtime
Full changelog
New features
- feat(model): add WeSpeaker embedding wrapper based on PyTorch
- feat(model): add support for multi-speaker statistics pooling
- feat(pipeline): add TimingHook for profiling processing time
- feat(pipeline): add ArtifactHook for saving internal steps
- feat(pipeline): add support for list of hooks with Hooks
- feat(utils): add "soft" option to Powerset.to_multilabel
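To illustrate what a "soft" powerset-to-multilabel conversion can mean, here is a pure-Python sketch for a single frame. It is not the library's implementation: build_mapping and to_multilabel are hypothetical stand-ins, and the sketch assumes that the hard variant keeps only the most likely powerset class while the soft variant weights every class by its probability.

```python
from itertools import combinations
from math import exp, log


def build_mapping(num_classes, max_set_size):
    """One row per powerset class, one column per speaker (1.0 if active)."""
    mapping = []
    for set_size in range(max_set_size + 1):
        for subset in combinations(range(num_classes), set_size):
            mapping.append([1.0 if k in subset else 0.0 for k in range(num_classes)])
    return mapping


def to_multilabel(log_probs, mapping, soft=False):
    """Convert one frame of powerset log-probabilities to per-speaker scores."""
    if soft:
        # Assumption: keep the full distribution over powerset classes.
        weights = [exp(lp) for lp in log_probs]
    else:
        # Hard decision: one-hot on the most likely powerset class.
        best = max(range(len(log_probs)), key=lambda i: log_probs[i])
        weights = [1.0 if i == best else 0.0 for i in range(len(log_probs))]
    num_classes = len(mapping[0])
    return [
        sum(w * row[k] for w, row in zip(weights, mapping))
        for k in range(num_classes)
    ]


# Two speakers, up to two active at once: classes {}, {0}, {1}, {0, 1}.
mapping = build_mapping(num_classes=2, max_set_size=2)
frame_log_probs = [log(p) for p in (0.1, 0.6, 0.1, 0.2)]

hard_scores = to_multilabel(frame_log_probs, mapping)
soft_scores = to_multilabel(frame_log_probs, mapping, soft=True)
```

With the hard conversion, only speaker 0 is active. With the soft one, each speaker's score is the total probability of the classes that contain that speaker.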
Fixes
- fix(pipeline): add missing "embedding" hook call in SpeakerDiarization
- fix(pipeline): fix AgglomerativeClustering to honor num_clusters when provided
- fix(pipeline): fix frame-wise speaker count exceeding max_speakers or detected num_speakers in SpeakerDiarization pipeline
Improvements
- improve(pipeline): compute fbank on GPU when requested
Breaking changes
- BREAKING(pipeline): rename WeSpeakerPretrainedSpeakerEmbedding to ONNXWeSpeakerPretrainedSpeakerEmbedding
- BREAKING(setup): remove onnxruntime dependency.
  You can still use ONNX hbredin/wespeaker-voxceleb-resnet34-LM but you will have to install onnxruntime yourself.
- BREAKING(pipeline): remove logging_hook (use ArtifactHook instead)
- BREAKING(pipeline): remove onset and offset parameter in SpeakerDiarizationMixin.speaker_count
  You should now binarize segmentations before passing them to speaker_count
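The last breaking change means that thresholding is now the caller's job. The sketch below shows the idea on plain Python lists. It is a simplified stand-in, not pyannote.audio code: binarize and count_speakers are hypothetical helpers, a single threshold replaces the onset/offset pair, and no aggregation over overlapping chunks is performed.

```python
def binarize(scores, threshold=0.5):
    """Turn per-frame, per-speaker scores into 0/1 activity decisions."""
    return [[1 if score > threshold else 0 for score in frame] for frame in scores]


def count_speakers(binarized):
    """Number of active speakers in each frame of a binarized segmentation."""
    return [sum(frame) for frame in binarized]


# Three frames, two speakers (made-up scores for illustration).
scores = [
    [0.9, 0.1],
    [0.7, 0.6],
    [0.2, 0.3],
]

binarized = binarize(scores)        # binarize first...
counts = count_speakers(binarized)  # ...then count
```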
Version 3.0.1
TL;DR
pyannote/speaker-diarization-3.0 is now much faster when sent to GPU.
import torch
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.0")
pipeline.to(torch.device("cuda"))
Full changelog
Fixes and improvements
- fix: fix WeSpeakerPretrainedSpeakerEmbedding GPU support
Dependencies update
- setup: switch from onnxruntime to onnxruntime-gpu
Version 3.0.0
TL;DR
Better pretrained pipeline and model
- Much better overlapping speech detection with powerset pyannote/segmentation-3.0
- Much better speaker diarization performance with pyannote/speaker-diarization-3.0
Benchmark (DER %) | v2.1 | v3.0 |
---|---|---|
AISHELL-4 | 14.1 | 12.3 |
AliMeeting (channel 1) | 27.4 | 24.3 |
AMI (IHM) | 18.9 | 19.0 |
AMI (SDM) | 27.1 | 22.2 |
AVA-AVD | - | 49.1 |
DIHARD 3 (full) | 26.9 | 21.7 |
MSDWild | - | 24.6 |
REPERE (phase2) | 8.2 | 7.8 |
VoxConverse (v0.3) | 11.2 | 11.3 |
Major breaking changes
- BREAKING: pipelines now run on CPU by default
  Use pipeline.to(torch.device('cuda')) to use GPU
- BREAKING: removed SpeakerSegmentation pipeline
  Use SpeakerDiarization pipeline instead
- BREAKING: removed support for prodi.gy recipes
Full changelog
Features and improvements
- feat(pipeline): send pipeline to device with pipeline.to(device)
- feat(pipeline): add return_embeddings option to SpeakerDiarization pipeline
- feat(pipeline): make segmentation_batch_size and embedding_batch_size mutable in SpeakerDiarization pipeline (they now default to 1)
- feat(pipeline): add progress hook to pipelines
- feat(task): add powerset support to SpeakerDiarization task
- feat(task): add support for multi-task models
- feat(task): add support for label scope in speaker diarization task
- feat(task): add support for missing classes in multi-label segmentation task
- feat(model): add segmentation model based on torchaudio self-supervised representation
- feat(pipeline): check version compatibility at load time
- improve(task): load metadata as tensors rather than pyannote.core instances
- improve(task): improve error message on missing specifications
Breaking changes
- BREAKING(task): rename Segmentation task to SpeakerDiarization
- BREAKING(pipeline): pipeline defaults to CPU (use pipeline.to(device))
- BREAKING(pipeline): remove SpeakerSegmentation pipeline (use SpeakerDiarization pipeline)
- BREAKING(pipeline): remove segmentation_duration parameter from SpeakerDiarization pipeline (defaults to duration of segmentation model)
- BREAKING(task): remove support for variable chunk duration for segmentation tasks
- BREAKING(pipeline): remove support for FINCHClustering and HiddenMarkovModelClustering
- BREAKING(setup): drop support for Python 3.7
- BREAKING(io): channels are now 0-indexed (used to be 1-indexed)
- BREAKING(io): multi-channel audio is no longer downmixed to mono by default.
  You should update how pyannote.audio.core.io.Audio is instantiated:
  - replace Audio() by Audio(mono="downmix");
  - replace Audio(mono=True) by Audio(mono="downmix");
  - replace Audio(mono=False) by Audio().
- BREAKING(model): get rid of (flaky) Model.introspection
  If, for some weird reason, you wrote some custom code based on that, you should instead rely on Model.example_output.
- BREAKING(interactive): remove support for Prodigy recipes
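The three Audio replacement rules listed above can be expressed as a small keyword-argument translation. The sketch below only manipulates dictionaries and does not import pyannote.audio. The helper name migrate_mono_kwargs is hypothetical, and it assumes that the pre-3.0 default was mono=True, as the first rule implies.

```python
def migrate_mono_kwargs(old_kwargs):
    """Translate pre-3.0 Audio(...) keyword arguments to their 3.0 form."""
    new_kwargs = dict(old_kwargs)

    # Assumption: before 3.0, omitting `mono` meant downmixing to mono.
    old_mono = new_kwargs.pop("mono", True)

    if old_mono is True:
        # Audio() and Audio(mono=True) both become Audio(mono="downmix").
        new_kwargs["mono"] = "downmix"
    elif old_mono is not False:
        raise ValueError(f"unexpected pre-3.0 value for mono: {old_mono!r}")
    # Audio(mono=False) becomes Audio(): channels are kept by default.

    return new_kwargs


migrated_default = migrate_mono_kwargs({})
migrated_true = migrate_mono_kwargs({"mono": True})
migrated_false = migrate_mono_kwargs({"sample_rate": 16000, "mono": False})
```

The result can then be passed on as Audio(**new_kwargs) in code that targets version 3.0 or later.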
Fixes and improvements
- fix(pipeline): fix reproducibility issue with Ampere CUDA devices
- fix(pipeline): fix support for IOBase audio
- fix(pipeline): fix corner case with no speaker
- fix(train): prevent metadata preparation from happening twice
- fix(task): fix support for "balance" option
- improve(task): shorten and improve structure of Tensorboard tags
Dependencies update
- setup: switch to torch 2.0+, torchaudio 2.0+, soundfile 0.12+, lightning 2.0+, torchmetrics 0.11+
- setup: switch to pyannote.core 5.0+, pyannote.database 5.0+, and pyannote.pipeline 3.0+
- setup: switch to speechbrain 0.5.14+
Version 2.1.1
Version 2.1.x introduces a major overhaul of the pyannote.audio default speaker diarization pipeline, made of three main stages:
- neural speaker segmentation applied to a short sliding window;
- neural speaker embedding of each (local) speaker;
- (global) agglomerative clustering.
More details in the attached technical report.
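To make the third stage more concrete, here is a toy agglomerative clustering in pure Python. It is only a sketch under stated assumptions: the first two stages are replaced by hand-written 2-D "embeddings", centroid linkage and the distance threshold are arbitrary choices for the illustration, and the function names are hypothetical rather than taken from pyannote.audio.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def agglomerative_clustering(embeddings, threshold):
    """Merge the two closest clusters (by centroid distance) until no pair
    of centroids is within `threshold`. Returns one label per embedding."""
    clusters = [[i] for i in range(len(embeddings))]

    while len(clusters) > 1:
        centroids = [centroid([embeddings[i] for i in c]) for c in clusters]
        distance, a, b = min(
            (dist(centroids[a], centroids[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if distance > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Stand-ins for embeddings of (local) speakers found in different windows.
local_embeddings = [(0.0, 0.1), (5.0, 5.0), (0.1, 0.0), (5.1, 4.9)]
global_labels = agglomerative_clustering(local_embeddings, threshold=1.0)
```

Local speakers 0 and 2 end up with one global label, and local speakers 1 and 3 with another.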
Version 1.1.1
chore: do not update to pyannote.pipeline >= 2.0