Hi, thanks for this work. I am using the output of dvector_create.py as input to uis-rnn, and diarization works. But I have a small confusion about the number of d-vector embeddings created.
dvector_create.py created 24 embeddings for a 9.7-second audio file and 21 embeddings for an 8.9-second one.
In the first case, if I assume each embedding corresponds to 240 milliseconds of audio (just a guess) and add them up, the total does not match the full audio duration:
24 × 240 ms = 5760 ms (5.76 seconds), but my audio file is 9.7 seconds long.
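One possible explanation for the mismatch: if the embeddings come from a sliding window, n windows of width w at hop h span w + (n − 1) · h seconds, not n · w. A minimal sketch of that arithmetic, where the 0.4 s window width is purely an assumption (the real value would have to be read from dvector_create.py):

```python
# Sketch: infer the hop size from the embedding count and audio duration,
# assuming embeddings come from a sliding window of known width.
# window_s = 0.4 is an assumption; read the real value from dvector_create.py.

def infer_hop(duration_s: float, n_embeddings: int, window_s: float) -> float:
    """n windows of width w at hop h span w + (n - 1) * h seconds."""
    return (duration_s - window_s) / (n_embeddings - 1)

print(infer_hop(9.7, 24, 0.4))  # ~0.404 s hop for 24 embeddings over 9.7 s
print(infer_hop(8.9, 21, 0.4))  # ~0.425 s hop for 21 embeddings over 8.9 s
```

Both files giving a hop of roughly 0.4 s would be consistent with windows laid out more or less back to back, which is why 24 × 240 ms undershoots the true duration.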
I want to understand this because I need to split the audio after diarization. The idea is: if the diarization result says the first 10 embeddings belong to speaker 1, and I know each embedding covers X ms, then speaker 1 spoke for the first 10 · X ms (10X/1000 seconds), so I can split the audio there, and so on. Without knowing from which millisecond to which millisecond speaker 1 spoke, and likewise for speaker 2, I cannot split the audio. A sketch of that mapping follows below.
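Assuming the window width and hop are known, embedding i maps to the span [i · h, i · h + w], and runs of identical speaker labels can be merged into segments and cut with pydub. A sketch under those assumptions; the file name, label array, and the 0.4 s / 0.404 s values are hypothetical placeholders, not values from the repo:

```python
# Sketch: turn per-embedding speaker labels into time segments and cut the audio.
# window_s and hop_s are assumptions; read the real values from dvector_create.py.
from pydub import AudioSegment

def labels_to_segments(labels, window_s, hop_s):
    """Merge consecutive identical labels into (speaker, start_s, end_s) spans."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            start_s = start * hop_s                  # first window in the run
            end_s = (i - 1) * hop_s + window_s       # last window in the run
            segments.append((labels[start], start_s, end_s))
            start = i
    return segments

audio = AudioSegment.from_wav("input.wav")   # hypothetical file name
labels = [0] * 10 + [1] * 14                 # e.g. uis-rnn output for 24 embeddings
for k, (spk, s, e) in enumerate(labels_to_segments(labels, 0.4, 0.404)):
    # pydub slices by milliseconds, so convert seconds -> ms before cutting
    audio[int(s * 1000):int(e * 1000)].export(f"speaker{spk}_{k}.wav", format="wav")
```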
Please help me understand this. Also, is there any other way you would suggest to split the audio?