Recognising the same speaker across multiple audio clips? #1205
Replies: 2 comments 2 replies
-
Hi @hbredin, I now see that this might be somewhat of a duplicate of issue #1085. From your answer there: “Not out of the box, no. One straightforward solution would be to concatenate audio1.wav and audio2.wav into a longer audio1+2.wav file -- but that won't scale well with more audio files. A more involved solution would be to run diarization on each file separately”. It seems like I was on the right track. If it’s just a few files, maybe I can concatenate; otherwise the embedding + clustering route seems the best option. What would be the easiest way to get the embeddings out once I have run the separate files? Thanks!
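For the extraction step, here is a minimal sketch of pulling one averaged embedding per diarized speaker. It assumes pyannote.audio 2.x, where `Inference("pyannote/embedding", window="whole")` exposes a `crop()` method; the function name, the 1-second minimum-duration cutoff, and the averaging strategy are my own choices, not a documented recipe:

```python
import numpy as np


def embeddings_per_speaker(diarization, inference, audio_file):
    """Return one averaged, L2-normalised embedding per diarized speaker.

    `diarization` is the pipeline's output Annotation,
    `inference` is pyannote.audio.Inference("pyannote/embedding", window="whole"),
    `audio_file` is the path to the recording.
    """
    per_speaker = {}
    for segment, _, label in diarization.itertracks(yield_label=True):
        # Skip very short turns: embeddings from <1 s of speech are unreliable.
        if segment.end - segment.start < 1.0:
            continue
        emb = np.asarray(inference.crop(audio_file, segment)).reshape(-1)
        per_speaker.setdefault(label, []).append(emb)

    averaged = {}
    for label, embs in per_speaker.items():
        mean = np.mean(embs, axis=0)
        averaged[label] = mean / np.linalg.norm(mean)
    return averaged
```

You would call this once per file with that file's diarization output, then compare or cluster the resulting per-speaker vectors across files.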
-
Hi @PhantomSpike, I am on the same track, trying to identify the same interviewer across multiple interview files. My first thought was also to separate some chunks of audio containing the speaker I want to recognize and get an embedding from them. Then, after running the speaker diarization pipeline, I can get embeddings from CHUNKS of every detected speaker, compare them using cosine distance, and decide. Regarding embeddings, I could extract them using https://huggingface.co/pyannote/embedding
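The cosine-distance comparison described above can be sketched with SciPy. The toy vectors and the 0.5 threshold below are placeholders, not values from pyannote; the cutoff should be tuned on known same/different-speaker pairs from your own recordings:

```python
import numpy as np
from scipy.spatial.distance import cosine


def same_speaker(emb_a, emb_b, threshold=0.5):
    """True when two embeddings are within `threshold` cosine distance.

    The 0.5 default is arbitrary: the best cutoff depends on the
    embedding model and recording conditions.
    """
    return bool(cosine(emb_a, emb_b) < threshold)


# Toy vectors standing in for real pyannote/embedding outputs:
interviewer_clip1 = np.array([0.9, 0.1, 0.2])
interviewer_clip2 = np.array([0.8, 0.15, 0.25])
other_voice = np.array([0.1, 0.9, 0.1])

print(same_speaker(interviewer_clip1, interviewer_clip2))  # True
print(same_speaker(interviewer_clip1, other_voice))        # False
```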
-
Hi everyone,
Thank you to the developers for this wonderful package, really love it! <3
Quick Q:
Is it possible to recognize the same speaker across multiple sound clips/recordings?
If so, should I concatenate all the recordings together and feed them in as one file, or can I give them separately to pyannote.audio?
If not possible, what would you recommend? I was thinking that I could run the pipeline on each sound clip, and then, once I have the embeddings for each speaker, do some sort of clustering to find the same speakers, because presumably in the embedding space they would be quite similar to one another even across different recordings.
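The clustering idea above can be sketched with SciPy's hierarchical clustering. The embeddings below are toy stand-ins for real per-speaker embeddings from two recordings, and the 0.3 distance cutoff is an arbitrary value you would tune:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy per-speaker embeddings gathered from two recordings.
# Rows 0 and 2 are meant to be the same person heard in both files.
embeddings = np.array([
    [0.90, 0.10, 0.10],  # recording 1, speaker X
    [0.10, 0.90, 0.10],  # recording 1, speaker Y
    [0.85, 0.12, 0.10],  # recording 2, speaker X again
    [0.10, 0.10, 0.90],  # recording 2, speaker Z
])

# Average-linkage agglomerative clustering in cosine distance,
# cutting the dendrogram at distance 0.3.
Z = linkage(embeddings, method="average", metric="cosine")
labels = fcluster(Z, t=0.3, criterion="distance")
print(labels)  # rows 0 and 2 share a cluster id
```

Any two rows that end up with the same cluster id would be treated as the same person across recordings; the per-file diarization still tells you when they spoke in each file.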
Just to give a bit more context, my particular application will be meeting calls and telephone calls, and we would not have a database of speakers, but I would want to just be able to say that the same person was present in different recordings and when they spoke (diarization).
Any advice would be greatly appreciated! Thank you!