[Information Request] Best way to approach the inaccuracy of the matched result time? #209
What use case are you trying to solve that requires a time-location accuracy of more than 500 ms?
* develop: Version bump to v8.24.0 to accommodate the fix in SoundFingerprinting.Emy described in issue #209.
I am chasing the (admittedly somewhat unreasonable) dream of removing ads / intros / outros / recaps / undesirable repetitive content completely, while simultaneously retaining all desired content, automatically/programmatically, with no user input beyond a folder full of similar content and 2 files to fingerprint. It would be wonderful if this could be done, and I don't really understand why it can't. Assuming this is a "brick wall", I have already implemented a manual user interface to scan fwd/back through the video/audio frames (around the time suggested by the query result match) so the user can specify the precise points where the undesirable content begins and ends - but it would be much preferred if that were not necessary.
This is for YouTube, right? Yeah, that was pretty much what I was using it for. I ended up just using image recognition along with this library to get it "more accurate". Sure, it uses more resources, but it gets the job done. As for accuracy, I think it's very complicated to get it "pixel perfect" (like near 0 ms), because I'd have to assume that the sound waves are always "similar" in some sense, so the algorithm has to make sure it's a match. This library isn't the only one with this issue either; there are a few others, and I don't think there's a simple solution to getting it always accurate. For example, it was able to get the audio for me at 0 ms with my tricks, by giving it a slightly earlier sound, and it matched it with at least a 50 ms delay instead of 500-ish. But of course, one small change in the sound (even if it's so tiny only a program can see it) and the delay goes back up. If you want a real foolproof solution, then I suggest you use this library in conjunction with something else to ensure it is accurate.
My hope is that it might work for most anything that has repeating content to be removed.
I was attempting to do that with this library (as it does video fingerprinting as well), but I didn't get very far (it didn't recognize the common content across the two sub-clips, AND it appeared to be fingerprinting the entire video instead of just the segment specified by startsAtSecond and secondsToProcess). The real trouble is that without first knowing, precisely, where the first frame of the content to remove is, image recognition doesn't really help. My hope is for all this analysis and comparison to be done programmatically, requiring no user input. Out of curiosity, what did you end up using for the image recognition? And was it able to compare two different video streams and find the common frames between them (ideally with frame-perfect accuracy)?
I've noticed! I've only tried a few others so far, but they have the same inaccuracy issue.
I'm doing something similar, and often the delay isn't so bad - but I want it to be 0. If it isn't the hashing algorithm itself, I think the random stride may be involved in the inconsistent results - but I am hoping to understand the problem better in any case.
Thanks for the tips! I'm open to any suggestions you might have regarding the "something else"!
FWIW, I try to use this software for timing accuracy in order to sync external systems with media playback (using real-time matching). The more accurate the timing, the better, in my case.
@jack4455667788 issue #207 had a broader effect, and in case you were using these parameters in conjunction with
Intuitively, you can think of it as a discretization problem: the challenge of transforming a signal (audio in this case) into a set of discrete fingerprints that approximate it. There is a resolution that defines a fingerprint (i.e., 128x32), which approximates about 1.48 seconds of audio signal. These fingerprints are generated using a certain stride, a step between consecutive fingerprints. By default, the stride is 512 samples during fingerprinting (92 ms) and a random value between [256, 512] during query (46 - 92 milliseconds) (values defined in the default configuration). You can decrease the stride between consecutive fingerprints during fingerprinting (say, to an extreme case of 1 sample) to increase the chances of having a perfectly aligned fingerprint during query time, but this will substantially increase the footprint of your model service that stores these fingerprints (generating 512x more fingerprints):

```csharp
FingerprintCommandBuilder.Instance
    .BuildFingerprintCommand()
    .From(pathToFile)
    .WithFingerprintConfig(cfg =>
    {
        // specifying a stride of 1, meaning we will create new fingerprints with a step of one sample (~0.18 ms)
        cfg.Audio.Stride = new IncrementalStaticStride(incrementBy: 1);
        return cfg;
    })
    .UsingServices(audioService)
    .Hash();
```

This is still not a good solution, because even with perfectly aligned signals you can have distortions generated by encoding/aliasing that will prevent perfect matches. The default values have been empirically defined to maximize recall and precision while minimizing the audio signal's footprint.

Now to the problem of cutting the ads to the precise frame. How it works: once you identify a match, you can run a second analysis over the video looking for edges (i.e., black frames and scene changes) around the area where you expect the content to have started/ended. This implies you need access to the matched content (for example, if you are matching over streaming content, you need to generate a file from the streaming match that covers the area where the match happened).

```csharp
var StartEndEdgeSearchLocationDelta = 3;

// audio object of type QueryResult
var optimalLength = audio?.BestMatch.Track.Length;

// this file has to cover the area of the audio.BestMatch
// it is also recommended to extend the area of the match by StartEndEdgeSearchLocationDelta:
// as an example, if your match happened at 09:30:00 till 09:30:30 (hh:mm:ss), extend the
// analyzed content by 3 seconds at the start/end locations, i.e., 09:29:57 till 09:30:33
// (extending the match by 6 seconds in total)
var extendedMediaFile = "path to streaming content that matched";

var edgeSearchStrategy = new EdgeSearchStrategy(new NLogLoggerFactory());
var edgeSearchConfig = new EdgeSearchConfig(
    new BlackFramesFilterConfiguration { Threshold = 32, Amount = 94 },
    SceneChangeThreshold: 0.4,
    OptimalLength: optimalLength,
    StartsAtHint: StartEndEdgeSearchLocationDelta,
    EndsAtHint: StartEndEdgeSearchLocationDelta + optimalLength);

var mediaSegment = edgeSearchStrategy.FindMediaSegmentClosestToOptimalLength(extendedMediaFile, edgeSearchConfig);
if (mediaSegment != null)
{
    // better edges have been found
}
```

Keep in mind this is an experimental API, and you need FFmpeg installed to use it: https://github.com/AddictedCS/soundfingerprinting/wiki/Audio-Services. Let me know if any of the above helped.
Hey @jack4455667788, did anything from the above message help in solving your issue?
Closing due to inactivity. |
Related: #196, "Sound fingerprint match always a few seconds/milliseconds too early the majority of the time".
In the above issue, you responded
Is the algorithm created/tuned to prioritize efficient/accurate matching at the expense of time range accuracy?
I guess I'm hoping you might be able to help me understand the non-trivial reasons so that I might be able to decrease (or ideally remove) that misalignment - and/or any other possible approaches to the problem.
I don't mind if it is hideously inefficient/slow, but I need/want better precision than this. I have tried most everything I can think of to tune/configure this problem away, but the somewhat random inaccuracy persists.
In my case the sounds may be so similar that I might be able to do raw bitstream compares on them (I'm thinking to shift them around the given match timerange until they match up more or less exactly)... Can I somehow "shift" the fingerprints/fingerprinting to do something analogous using soundfingerprinting?
Thanks in any case!
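The "shift them around the given match timerange" idea from the question can be prototyped outside the library: take the raw samples around the coarse match and locate the peak of their cross-correlation. This is a hypothetical sketch using NumPy (not a soundfingerprinting API); `refine_offset` and the synthetic signals are illustrative only:

```python
import numpy as np


def refine_offset(reference: np.ndarray, candidate: np.ndarray) -> int:
    """Estimate how many samples `candidate` lags behind `reference`
    by locating the peak of their full cross-correlation."""
    corr = np.correlate(candidate, reference, mode="full")
    # re-center the peak index so that 0 means perfectly aligned
    return int(np.argmax(corr)) - (len(reference) - 1)


rng = np.random.default_rng(42)
signal = rng.standard_normal(5512)  # ~1 s of noise standing in for audio at 5512 Hz
shift = 137                         # a deliberate misalignment (~25 ms at 5512 Hz)
delayed = np.concatenate([np.zeros(shift), signal])[: len(signal)]

print(refine_offset(signal, delayed))  # recovers the 137-sample shift
```

Applied around a coarse soundfingerprinting match, the recovered sample shift could then be used to correct the reported match time down to single-sample resolution, at the cost of decoding and comparing raw audio for the candidate region.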