Recognising the same speaker across multiple audio clips? #1205
Replies: 2 comments 2 replies
-
Hi @hbredin, I now see that this might be somewhat of a duplicate of issue #1085. From your answer there: “Not out of the box, no. One straightforward solution would be to concatenate audio1.wav and audio2.wav into a longer audio1+2.wav file -- but that won't scale well with more audio files. A more involved solution would be to run diarization on each file separately”. It seems like I was on the right track. If it’s just a few files, maybe I can concatenate; otherwise the embedding + clustering route seems the best option. What would be the easiest way to get the embeddings out once I have run the separate files? Thanks!
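For the extraction step, here is a minimal sketch of pulling one averaged embedding per diarized speaker. It assumes pyannote.audio 2.x, where `Inference("pyannote/embedding", window="whole")` exposes a `crop()` method; the function name, the 1-second minimum-duration cutoff, and the averaging strategy are my own choices, not a documented recipe:

```python
import numpy as np


def embeddings_per_speaker(diarization, inference, audio_file):
    """Return one averaged, L2-normalised embedding per diarized speaker.

    `diarization` is the pipeline's output Annotation,
    `inference` is pyannote.audio.Inference("pyannote/embedding", window="whole"),
    `audio_file` is the path to the recording.
    """
    per_speaker = {}
    for segment, _, label in diarization.itertracks(yield_label=True):
        # Skip very short turns: embeddings from <1 s of speech are unreliable.
        if segment.end - segment.start < 1.0:
            continue
        emb = np.asarray(inference.crop(audio_file, segment)).reshape(-1)
        per_speaker.setdefault(label, []).append(emb)

    averaged = {}
    for label, embs in per_speaker.items():
        mean = np.mean(embs, axis=0)
        averaged[label] = mean / np.linalg.norm(mean)
    return averaged
```

You would call this once per file with that file's diarization output, then compare or cluster the resulting per-speaker vectors across files.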
-
Hi @PhantomSpike, I am on the same track, trying to identify the same interviewer across multiple interview files. My first thought was also to separate some chunks of audio containing the speaker I want to recognize and get an embedding from them. Then, after running the speaker diarization pipeline, I can get embeddings from CHUNKS of every detected speaker, compare them using cosine distance, and decide. Regarding embeddings, I could extract them using https://huggingface.co/pyannote/embedding
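The cosine-distance comparison described above can be sketched with SciPy. The toy vectors and the 0.5 threshold below are placeholders, not values from pyannote; the cutoff should be tuned on known same/different-speaker pairs from your own recordings:

```python
import numpy as np
from scipy.spatial.distance import cosine


def same_speaker(emb_a, emb_b, threshold=0.5):
    """True when two embeddings are within `threshold` cosine distance.

    The 0.5 default is arbitrary: the best cutoff depends on the
    embedding model and recording conditions.
    """
    return bool(cosine(emb_a, emb_b) < threshold)


# Toy vectors standing in for real pyannote/embedding outputs:
interviewer_clip1 = np.array([0.9, 0.1, 0.2])
interviewer_clip2 = np.array([0.8, 0.15, 0.25])
other_voice = np.array([0.1, 0.9, 0.1])

print(same_speaker(interviewer_clip1, interviewer_clip2))  # True
print(same_speaker(interviewer_clip1, other_voice))        # False
```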
-
Hi everyone,
Thank you to the developers for this wonderful package, really love it! <3
Quick Q:
Is it possible to recognize the same speaker across multiple sound clips/recordings?
If so, should I concatenate all the recordings together and feed them in as one file, or can I give them separately to pyannote.audio?
If not possible, what would you recommend? I was thinking that I could run the pipeline on each sound clip, and then, once I have the embeddings for each speaker, do some sort of clustering to find the same speakers, because presumably in the embedding space they would be quite similar to one another even across different recordings.
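The clustering idea above can be sketched with SciPy's hierarchical clustering. The embeddings below are toy stand-ins for real per-speaker embeddings from two recordings, and the 0.3 distance cutoff is an arbitrary value you would tune:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy per-speaker embeddings gathered from two recordings.
# Rows 0 and 2 are meant to be the same person heard in both files.
embeddings = np.array([
    [0.90, 0.10, 0.10],  # recording 1, speaker X
    [0.10, 0.90, 0.10],  # recording 1, speaker Y
    [0.85, 0.12, 0.10],  # recording 2, speaker X again
    [0.10, 0.10, 0.90],  # recording 2, speaker Z
])

# Average-linkage agglomerative clustering in cosine distance,
# cutting the dendrogram at distance 0.3.
Z = linkage(embeddings, method="average", metric="cosine")
labels = fcluster(Z, t=0.3, criterion="distance")
print(labels)  # rows 0 and 2 share a cluster id
```

Any two rows that end up with the same cluster id would be treated as the same person across recordings; the per-file diarization still tells you when they spoke in each file.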
Just to give a bit more context, my particular application will be meeting calls and telephone calls, and we would not have a database of speakers, but I would want to just be able to say that the same person was present in different recordings and when they spoke (diarization).
Any advice would be greatly appreciated! Thank you!