❓ Same .wav file but got different timestamps #415

Open
DarrenChengdu opened this issue Jan 18, 2024 · 1 comment
@DarrenChengdu

❓ Questions and Help

The speech timestamps I get from the PyTorch pipeline and from silero-vad-onnx.cpp are somewhat different. The input file is 'en_example.wav', downloaded via torch.hub.
The speech timestamps from PyTorch are as follows (the result is the same with USE_ONNX = True or False):
[{'end': 31200, 'start': 1568},
{'end': 73696, 'start': 42528},
{'end': 108512, 'start': 79392},
{'end': 163808, 'start': 149024},
{'end': 181728, 'start': 166944},
{'end': 211936, 'start': 183328},
{'end': 227808, 'start': 216608},
{'end': 241120, 'start': 229920},
{'end': 252896, 'start': 245280},
{'end': 285664, 'start': 260640},
{'end': 301024, 'start': 294432},
{'end': 311776, 'start': 303648},
{'end': 420320, 'start': 325664},
{'end': 455136, 'start': 422432},
{'end': 490976, 'start': 458784},
{'end': 520160, 'start': 493088},
{'end': 566752, 'start': 523808},
{'end': 601056, 'start': 572448},
{'end': 621024, 'start': 607264},
{'end': 669152, 'start': 638496},
{'end': 691680, 'start': 671776},
{'end': 712672, 'start': 697888},
{'end': 748512, 'start': 720928},
{'end': 798688, 'start': 781856},
{'end': 853984, 'start': 817696},
{'end': 865248, 'start': 856608},
{'end': 903648, 'start': 871968},
{'end': 916960, 'start': 906272},
{'end': 952288, 'start': 920096}]

This list contains 29 segments.
The speech timestamps from silero-vad-onnx.cpp are as follows:
{start:00002048,end:00031744}
{start:00043008,end:00074752}
{start:00079872,end:00108544}
{start:00149504,end:00164864}
{start:00166912,end:00182272}
{start:00183296,end:00195584}
{start:00195584,end:00212992}
{start:00217088,end:00228352}
{start:00230400,end:00241664}
{start:00245760,end:00252928}
{start:00261120,end:00286720}
{start:00294912,end:00302080}
{start:00304128,end:00312320}
{start:00325632,end:00352256}
{start:00352256,end:00373760}
{start:00373760,end:00419840}
{start:00422912,end:00455680}
{start:00458752,end:00491520}
{start:00493568,end:00521216}
{start:00524288,end:00555008}
{start:00555008,end:00567296}
{start:00572416,end:00602112}
{start:00607232,end:00621568}
{start:00638976,end:00669696}
{start:00671744,end:00680960}
{start:00680960,end:00692224}
{start:00698368,end:00713728}
{start:00720896,end:00739328}
{start:00739328,end:00744448}
{start:00745472,end:00749568}
{start:00782336,end:00798720}
{start:00818176,end:00854016}
{start:00857088,end:00866304}
{start:00872448,end:00904192}
{start:00906240,end:00917504}
{start:00920576,end:00941056}
{start:00941056,end:00949248}
{start:00949248,end:00952320}
{start:00958464,end:00960000}

This list contains 39 segments.
Are these differences tolerable and acceptable?
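
One way to put a number on "tolerable" is to measure how much of the audio both outputs agree on. The sketch below is only an illustration, not part of silero-vad: the helper names (`parse_cpp_segments`, `total_samples`, `shared_samples`, `iou`) are made up for this example, and it uses just the first three segments of each list above. It assumes each list is sorted and has no internal overlaps.

```python
import re


def parse_cpp_segments(text):
    """Parse lines like '{start:00002048,end:00031744}' into (start, end) tuples."""
    pairs = re.findall(r"\{start:(\d+),end:(\d+)\}", text)
    return [(int(start), int(end)) for start, end in pairs]


def total_samples(segments):
    """Total number of samples marked as speech."""
    return sum(end - start for start, end in segments)


def shared_samples(a, b):
    """Number of samples marked as speech in both lists (two-pointer sweep)."""
    i = j = 0
    shared = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if end > start:
            shared += end - start
        # Advance whichever segment finishes first.
        if a[i][1] <= b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def iou(a, b):
    """Intersection over union of the two speech masks, in [0, 1]."""
    shared = shared_samples(a, b)
    union = total_samples(a) + total_samples(b) - shared
    return shared / union if union else 1.0


# First three segments of each list above.
py_segments = [(1568, 31200), (42528, 73696), (79392, 108512)]
cpp_segments = parse_cpp_segments("""
{start:00002048,end:00031744}
{start:00043008,end:00074752}
{start:00079872,end:00108544}
""")

print(shared_samples(py_segments, cpp_segments))  # 88480
print(iou(py_segments, cpp_segments))
```

A value close to 1.0 means the two outputs mark nearly the same samples as speech, even when the segment counts differ.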

DarrenChengdu added the "help wanted" label Jan 18, 2024
@Simon-chai

VAD works by applying a voting mechanism to the model output to decide where a speech segment starts and ends, so the result depends on both the model output and the implementation of the voting mechanism. For example, there is a control parameter, min_silence_duration_ms; its default value may differ between the Python and C++ implementations, and it can affect the number of segments in the final output (e.g. a segment containing 80 ms of silence will be treated as a single segment by the Python voting, but may be split into two segments by the C++ voting). Several other parameters of this kind can affect the number of segments or their start and end values. Of course, it is also possible that the C++ voting implementation itself differs from the Python version; I am not sure, because I am not familiar with C++. As I see it, though, the differences are tolerable and acceptable: most segments are valid speech.
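
To illustrate how such parameters change the segment count, here is a minimal sketch of a post-processing step that merges segments separated by a short silence and drops very short segments. It is not the code of either implementation: the function name `postprocess`, the 16 kHz sampling rate, and the 100 ms / 250 ms thresholds are all assumptions chosen for this example. The input is the last four segments of the silero-vad-onnx.cpp output above.

```python
SAMPLE_RATE = 16000  # assumed sampling rate of the input file


def ms_to_samples(ms, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a number of samples."""
    return sample_rate * ms // 1000


def postprocess(segments, min_silence_samples, min_speech_samples):
    """Merge segments whose gap is shorter than min_silence_samples,
    then drop segments shorter than min_speech_samples.

    `segments` is a list of (start, end) tuples sorted by start.
    """
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < min_silence_samples:
            # Gap is too short to count as silence: extend the previous segment.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_speech_samples]


# Last four segments from the silero-vad-onnx.cpp output above.
cpp_tail = [
    (920576, 941056),
    (941056, 949248),
    (949248, 952320),
    (958464, 960000),
]

result = postprocess(
    cpp_tail,
    min_silence_samples=ms_to_samples(100),  # hypothetical threshold
    min_speech_samples=ms_to_samples(250),   # hypothetical threshold
)
print(result)  # [(920576, 952320)]
```

With these assumed thresholds the four segments collapse into one: the first three are contiguous and get merged, and the last one (1536 samples) is shorter than the minimum speech length and is dropped. Different thresholds give a different count from the same model output.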
