
how to use it for speaker verification #34

Closed
wwyl2000 opened this issue Mar 21, 2024 · 6 comments

@wwyl2000

Hi JungJee,

After training the models, I want to use them for speaker verification. I have a test set of, say, N speakers, each with M enrollment utterances and L test utterances.
Should I just enroll the N speakers using the M utterances, and then, for each speaker's utterances in the L-test set, calculate scores against the N enrolled speakers and select the speaker whose score is highest?

Thanks,
Willy

@Jungjee
Owner

Jungjee commented Mar 21, 2024

Hi Willy,

I think the task you are mentioning would be open-set speaker identification.

Assuming you're using a neural network to do it, there are two approaches.

  1. Train a neural net with N speakers as output classes. In the test phase, you feed the input to the same network and select the class with the highest posterior probability. This would be closed-set speaker identification, because you cannot classify newly emerging speakers.
  2. Use a pre-trained speaker embedding extractor to enroll N speakers. Then, for each input, compare the input utterance's speaker embedding with the N speakers' embeddings and select the speaker with the highest similarity. This would be open-set speaker identification. I believe this coincides with your description.
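Approach 2 can be sketched roughly as follows. This is only an illustration, not code from this repository: `embed_fn` stands in for whatever pre-trained extractor you use, and the 0.5 acceptance threshold is an arbitrary placeholder you would tune on a development set.

```python
import numpy as np

def enroll(embed_fn, utts_per_speaker):
    """Average each speaker's utterance embeddings into one L2-normalized vector."""
    enrolled = {}
    for spk, utts in utts_per_speaker.items():
        embs = np.stack([embed_fn(u) for u in utts])
        mean = embs.mean(axis=0)
        enrolled[spk] = mean / np.linalg.norm(mean)
    return enrolled

def identify(embed_fn, test_utt, enrolled, threshold=0.5):
    """Open-set ID: best-matching speaker, or None if no score clears the threshold."""
    emb = embed_fn(test_utt)
    emb = emb / np.linalg.norm(emb)
    scores = {spk: float(np.dot(emb, v)) for spk, v in enrolled.items()}  # cosine
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores[best]  # unknown speaker: the "open-set" outcome
    return best, scores[best]
```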

On the other hand, in speaker verification you normally have a claimed (or target) speaker and an input utterance. You compare the two and make a binary decision about whether they come from the same speaker or not.
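Verification, by contrast, reduces to a single comparison and a threshold. A minimal sketch, where cosine similarity is one common scoring choice (others such as PLDA exist) and the 0.6 threshold is just a placeholder to tune:

```python
import numpy as np

def verify(enrolled_emb, test_emb, threshold=0.6):
    """Binary decision: does test_emb match the claimed speaker's enrollment?"""
    score = float(np.dot(enrolled_emb, test_emb)
                  / (np.linalg.norm(enrolled_emb) * np.linalg.norm(test_emb)))
    return score >= threshold, score
```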

@wwyl2000
Author

Thanks JungJee!

You are right, my task is the open-set case, which is what you describe in item 2.

One more question:
When I train the speaker embedding extractor, I use X speakers.
In the verification case, we have N speakers (including the target speaker T), each with M utterances, so I should have N·M embeddings. For a new input utterance from target speaker T (not included in speaker T's enrollment set), I calculate that utterance's embedding as well, then compare it with the N·M embeddings and make a binary decision.

  1. I think X >>> N. For a given N, say N=10, what value of X would be suitable for acceptable performance? Does the size of X influence the model size much?
  2. None of the N enrollment speakers is included in X. To get better performance, should the N speakers' utterances be used in training the model, e.g. via fine-tuning?
  3. Can M=1? I think a bigger M will give better accuracy.

Could you please share your opinion on these questions?
Thanks,
Willy

@Jungjee
Owner

Jungjee commented Apr 22, 2024

Maybe I'm the one who's confused. In my understanding, if what you want is not "open-set speaker identification" but "speaker verification", you can simply compare the speaker embedding of the enrollment with that of the test utterance. If there are multiple enrollment utterances (from your explanation, I think this is M), you can average their speaker embeddings to derive one speaker embedding representing one speaker.
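The averaging step described here is just a mean over the M enrollment embeddings. A minimal sketch; re-normalizing afterwards is a common convention when cosine scoring is used, not something mandated by the thread:

```python
import numpy as np

def average_enrollment(embeddings):
    """Collapse M enrollment embeddings (shape M x K) into one K-dim speaker embedding."""
    mean = np.asarray(embeddings, dtype=float).mean(axis=0)
    return mean / np.linalg.norm(mean)  # renormalize for cosine scoring
```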

@wwyl2000
Author

Hi Jungjee,

Thanks for your reply. In my understanding, "open-set" means the model can be used to recognize any speaker, and the speakers will certainly not be included in the model's training set.
I am doing speaker verification. In the system there will be N speakers in total. For speaker n, M utterances are used to enroll, so I get M embeddings for speaker n. Can I average the M embeddings into one? For example, if the embeddings have K dimensions, can I just average the M vectors and save one vector?
Or can I save the M embeddings and use them all? When testing an utterance, I get a score against each of the M embeddings, then average the scores and use that as the final score to decide whether to accept or reject the test utterance. The problem is that scoring against M embeddings is much slower than scoring against only one.
Please let me know your opinion.

Thanks,
Willy

@Jungjee
Owner

Jungjee commented May 21, 2024

Hi @wwyl2000 ,

Can I average the M embeddings into one? For example, if the embeddings have K dimensions, can I just average the M vectors and save one vector?

Yes, you can, and that is what many of us do.

Can I save the M embeddings and use them all? When testing an utterance, I get a score against each of the M embeddings, then average the scores and use that as the final score to decide whether to accept or reject the test utterance.

Also yes, for this one. The former is averaging at the embedding level, and the latter (this one) is averaging at the score level. Both work and are employed in various research papers.
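The two strategies can be written side by side. A sketch with cosine scoring; note that they generally give different scores, since the cosine of a mean is not the mean of cosines:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_embedding_level(enroll_embs, test_emb):
    """Average the M enrollment embeddings first, then score once (fast)."""
    return cosine(np.mean(enroll_embs, axis=0), test_emb)

def score_score_level(enroll_embs, test_emb):
    """Score against each of the M embeddings, then average the M scores (slower)."""
    return float(np.mean([cosine(e, test_emb) for e in enroll_embs]))
```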

@wwyl2000
Copy link
Author

Thanks Jungjee!
