Tested versions
Appears in 3.1.0
System information
Ubuntu 22, Lenovo P1 Gen 5 workstation (NVIDIA RTX A4500)
Issue description
I stumbled upon something I found very strange: I used the pyannote 3.1.0 speaker-diarization pipeline to diarize a 15-minute sample audio file, and it took about 24 seconds.
For my use case I need speaker embeddings, so a few months ago I implemented my own method for extracting a feature embedding per speaker from audio cropped with pyannote's `audio.crop()`. Extracting the embeddings with the WeSpeaker ResNet293 model takes about 8 seconds. So far so good.
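As a rough sketch of what that per-speaker extraction looks like (with plain NumPy slicing standing in for `audio.crop()`, and a caller-supplied `embed` function as a placeholder for the real ResNet293 forward pass — both hypothetical simplifications, not the actual pipeline code):

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate of the input file


def crop(waveform: np.ndarray, start: float, end: float) -> np.ndarray:
    """Cut one (start, end) speech segment out of the waveform,
    mirroring what pyannote's audio.crop() returns for a Segment."""
    return waveform[int(start * SAMPLE_RATE):int(end * SAMPLE_RATE)]


def speaker_embedding(waveform, segments, embed):
    """Aggregate all VAD segments of one speaker, merge them back
    together, and run the embedding model once on the merged audio."""
    merged = np.concatenate([crop(waveform, s, e) for s, e in segments])
    return embed(merged)
```

The point is that the embedding model runs once per speaker on the merged audio, rather than once per window, which is why this basic approach is fast even without batching.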
I noticed that you now provide the option to return `speaker_embeddings`, which is great and works fine with the small ResNet34 model you use. However, when I switched the embedding model to ResNet293 using your conversion script (I just replaced the numbers and pointed the path at the matching PyTorch model from WeSpeaker), diarizing the same file took 240 seconds.

I wondered how this is possible, given that I only need 8 seconds to extract the speaker embeddings manually with the same model. My approach is the same as yours, just much more basic and without batching: I aggregate all VAD timestamps for a speaker, cut the audio accordingly, merge the pieces back together, and extract the embedding with the model. The cosine similarity between your vectors and mine was ~0.99. If there happens to be a bug in your code (which I actually don't think there is), fixing it could be a good performance boost.
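For reference, the ~0.99 figure above is a plain cosine similarity between the two embedding vectors, which can be checked with a few lines of NumPy:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors:
    dot(a, b) / (|a| * |b|), in [-1, 1] for real vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A value this close to 1.0 suggests both extraction paths produce essentially the same embedding, so the 30x runtime gap is unlikely to come from doing different work on the audio itself.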
Maybe you have a clue whether the problem lies in the model-conversion script or somewhere else. I can provide a converted model file if you want, or the code snippets showing how I extract the embeddings, but as I said, they are pretty basic.
Minimal reproduction example (MRE)
Unfortunately this is hard to package into a reproducer, since it involves all the model files.