Improve bandwidth estimation and adaptive switching #4825
Conversation
It is great that you are trying to tackle the bandwidth estimation issues! While it makes sense to use request latency as part of the effective bandwidth calculation, this averaging will fail with #3988. With preload hint fetching, the TTFB is not correlated with the latency, but with the request start time relative to the latest index update and part duration, which can both change from part to part. As such, I would prefer an alternative approach that doesn't require this averaging. FYI, it's possible to get more accurate transfer timing for a single request using the Resource Timing API, though it is a bit cumbersome to extract and requires server opt-in using the Timing-Allow-Origin header.
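A minimal sketch of the Resource Timing extraction mentioned above. The field names are the standard PerformanceResourceTiming attributes; the entry here is a mocked object for illustration, and note that real cross-origin entries only expose non-zero timings when the server sends Timing-Allow-Origin.

```typescript
// Sketch: derive per-request transfer timing from Resource Timing fields.
// `ResourceTimingLike` mirrors a subset of PerformanceResourceTiming.
interface ResourceTimingLike {
  requestStart: number;  // ms
  responseStart: number; // ms (first byte)
  responseEnd: number;   // ms (last byte)
  transferSize: number;  // bytes, including headers
}

function transferStats(entry: ResourceTimingLike) {
  const ttfbMs = entry.responseStart - entry.requestStart;
  const transferMs = entry.responseEnd - entry.responseStart;
  return {
    ttfbMs,
    transferMs,
    // bits per ms === kbit/s
    kbps: (entry.transferSize * 8) / transferMs,
  };
}

// Mocked entry; in a browser you would read these off
// performance.getEntriesByType('resource').
const stats = transferStats({
  requestStart: 10,
  responseStart: 110,
  responseEnd: 1110,
  transferSize: 250_000,
});
```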
Force-pushed f4090fa to bb5b3e7
src/controller/abr-controller.ts (outdated)
(duration * levelNextBitrate) / (8 * 0.8 * loadRate);
fragLevelNextLoadedDelay = loadRate
  ? (duration * levelNextBitrate) / (8 * 0.8 * loadRate)
  : duration / bwEstimate + ttfbEstimate;
For the else branch, are we missing (duration * levelNextBitrate) and by mistake have only used duration?
Thanks for the find.
Fixed in af0843a
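The corrected calculation can be sketched as below. Units assumed: duration in seconds, bitrates and bwEstimate in bits/s, loadRate in bytes/s, ttfbEstimate in seconds. The names mirror the PR, but the function itself is illustrative, not the actual hls.js implementation.

```typescript
// Sketch of the fixed next-level delay estimate: with a measured load rate,
// use it (with a 0.8 safety factor); with no bytes loaded yet, fall back to
// the bandwidth estimate plus one round trip.
function fragLevelNextLoadedDelay(
  duration: number,        // seconds
  levelNextBitrate: number, // bits/s
  loadRate: number,        // bytes/s (0 when no bytes loaded yet)
  bwEstimate: number,      // bits/s
  ttfbEstimate: number     // seconds
): number {
  return loadRate
    ? (duration * levelNextBitrate) / (8 * 0.8 * loadRate)
    : (duration * levelNextBitrate) / bwEstimate + ttfbEstimate; // fixed else branch
}

// 4s segment at 2 Mbps over a measured 125 kB/s (1 Mbps) load rate:
const withRate = fragLevelNextLoadedDelay(4, 2_000_000, 125_000, 1_000_000, 0.1);
// No bytes yet: transfer time from the estimate plus one round trip:
const withoutRate = fragLevelNextLoadedDelay(4, 2_000_000, 0, 1_000_000, 0.1);
```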
const bwEstimate: number = this.bwEstimator.getEstimate();
logger.warn(`Fragment ${frag.sn}${
// if estimated load time of new segment is completely unreasonable, ignore and do not emergency switch down
if (fragLevelNextLoadedDelay > duration * 10) {
I'm not able to imagine how we would run into this in general, and moreover it is smaller than fragLoadedDelay.
We can end up with very low loadRate calculations. When this happens it's usually because of a very problematic request or a huge drop. If emergency down-switching is not going to prevent a stall, then it's not worth aborting here. Wait for more data and/or let the normal ABR cycle recover.
If we decide to keep fragLevelNextLoadedDelay in ms, we should convert the RHS to ms. I think duration is in seconds, right?
duration is in seconds
const processingMs =
  stats.parsing.end -
  stats.loading.start -
  Math.min(
Yeah, this change is a good improvement. Does it work well for Parts as well?
It appears to work well with the sample LL-HLS test stream which uses two 1s parts per segment.
Force-pushed 93cb148 to af0843a
);
} else {
  // If there has been no loading progress, sample TTFB
  this.bwEstimator.sampleTTFB(timeLoading);
Why sample TTFB when no bytes have been delivered? Won't this undershoot?
At this point timeLoading is expected to be greater than the TTFB estimate (we should have exited above if it is not). The idea is to increase the estimate so that the penalty is not just a down-switch but also impacts future estimates.
src/controller/abr-controller.ts (outdated)
// reset forced auto level value so that next level will be selected
this._nextAutoLevel = -1;

this.bwEstimator.sampleTTFB(stats.loading.first - stats.loading.start);
Maybe move this up before the this.ignoreFragment() gating? TTFB should still be relevant for e.g. an init segment.
Done fe62282. Keeping it restricted to the main variant only. We would not want alt-audio or subtitles to impact the behavior here, which will remain main-variant-only for this change-set.
Force-pushed 1aeeead to fe62282
hls.nextLoadLevel = nextLoadLevel;
if (loadedFirstByte) {
  // If there has been loading progress, sample bandwidth using loading time offset by minimum TTFB time
  this.bwEstimator.sample(
We can sample TTFB here too?
if (loadedFirstByte) {
  // If there has been loading progress, sample bandwidth using loading time offset by minimum TTFB time
  this.bwEstimator.sample(
    timeLoading - Math.min(ttfbEstimate, ttfb),
Suggestion: timeLoading - ttfb
This might be more accurate for BWE. Agreed that high TTFB is not healthy, but it doesn't reflect the user's true network condition. Also, high TTFB values are largely outliers, but those outliers can really bring fastEstimate down, and catching up again to the true BW can take a while (because of the small duration, the weight will be smaller), resulting in the user playing at a low resolution unnecessarily. Moreover, how can a lower BWE help with high TTFB, even if it only lasts for some duration? That high TTFB will hold true for lower-resolution segments as well (in theory), so a track switch will not help.
stats.loading.start -
Math.min(
  stats.loading.first - stats.loading.start,
  this.bwEstimator.getEstimateTTFB()
Same comment as for abandonRules: too conservative a BWE.
const processingMs =
  stats.parsing.end -
  stats.loading.start -
  Math.min(
Hello,
I encountered the bandwidth estimation problem with LL-HLS streams myself, but was never able to find a reliable solution. So thank you, I appreciate your work on this PR and I'm happy if I can help.
After playing around a little bit, I felt that this solution is working pretty well already. When I wanted to compare the most recent release of hls.js with this PR, I noticed a problem with the bandwidth graph on the demo page.
It appears to me that the demo graph calculates the bandwidth values separately, and therefore also needs to be updated.
demo/main.js, lines 553-556:
Math.round((8 * data.stats.total) / (data.stats.buffering.end - data.stats.loading.start)),
Any thoughts on that?
Thanks, Thomas
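One possible direction for the demo-graph adjustment can be sketched as below. This is my guess at the shape of a fix, not necessarily what was merged in the linked PR: measure from first byte rather than request start, mirroring the controller's TTFB-exclusive sampling. The LoadStats shape is inferred from the snippet quoted above.

```typescript
// Hypothetical TTFB-aware variant of the demo's bandwidth calculation.
interface LoadStats {
  total: number;                              // bytes
  loading: { start: number; first: number };  // ms timestamps
  buffering: { end: number };                 // ms timestamp
}

function demoBandwidthKbps(stats: LoadStats): number {
  // 8 * bytes / ms === kbit/s; denominator starts at first byte,
  // excluding the request round trip.
  return Math.round(
    (8 * stats.total) / (stats.buffering.end - stats.loading.first)
  );
}

const kbps = demoBandwidthKbps({
  total: 1_000_000,
  loading: { start: 0, first: 100 },
  buffering: { end: 2100 },
});
```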
@robwalch would it help if I create a PR to fix the demo?
FYI #4904
@silltho can you rebase your PR now that this is merged? Thank you!
There were a few unit conversion bugs along the way; it might be worth creating a few sanity unit tests to make sure this works as expected.
@@ -157,13 +157,17 @@ class AbrController implements ComponentAPI {
const expectedLen =
  stats.total ||
  Math.max(stats.loaded, Math.round((duration * level.maxBitrate) / 8));
let timeStreaming = timeLoading - ttfb;
if (timeStreaming < 1 && loadedFirstByte) {
  timeStreaming = Math.min(timeLoading, (stats.loaded * 8) / bwEstimate);
If we're at the point that the network is so slow (only the first buffer was emitted by the network stack, and we're in the abandonRulesCheck), then probably something happened to the network. Why should we ignore this effect and rely on bwEstimate?
Suggestion:
timeStreaming = timeLoading;
Suggestion: timeStreaming = timeLoading;
We're trying to remove roundtrip latency (timeLoading - ttfb) from the equation, so that is not what we want.
In this case, almost all of the time was spent waiting for the roundtrip (timeStreaming < 1). It hasn't been long enough to know if throughput is affected yet, so we can continue assuming that only roundtrip latency was impacted. As long as it's been more than one millisecond since we received bytes, we can calculate a load rate based on the actual timeStreaming.
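The fallback logic being discussed can be sketched as follows. Times are in ms, loaded in bytes, bwEstimate in bits/s; the `* 1000` conversion is my addition to keep units consistent within this sketch (the thread notes unit conversions were a source of bugs in the actual PR), so this is illustrative rather than a copy of the hls.js code.

```typescript
// Sketch: when nearly all elapsed time was round trip (timeStreaming < 1 ms),
// assume throughput is unaffected and derive a plausible streaming time from
// the current bandwidth estimate instead of dividing by ~0.
function effectiveTimeStreaming(
  timeLoading: number,     // ms
  ttfb: number,            // ms
  loaded: number,          // bytes
  bwEstimate: number,      // bits/s
  loadedFirstByte: boolean
): number {
  let timeStreaming = timeLoading - ttfb;
  if (timeStreaming < 1 && loadedFirstByte) {
    // Expected transfer time at the current estimate, capped by total time.
    timeStreaming = Math.min(timeLoading, (loaded * 8 * 1000) / bwEstimate);
  }
  return timeStreaming;
}

// Round trip consumed the full 200 ms, so fall back to the estimate-derived
// 100 ms rather than a near-zero denominator:
const fallback = effectiveTimeStreaming(200, 200, 50_000, 4_000_000, true);
// Normal case: simple subtraction.
const normal = effectiveTimeStreaming(1000, 100, 50_000, 4_000_000, true);
```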
// (longer times have less weight with expected input under 1 second)
const seconds = ttfb / 1000;
const weight = Math.sqrt(2) * Math.exp(-Math.pow(seconds, 2) / 2);
this.ttfb_.sample(weight, Math.max(ttfb, 5));
Is it on purpose that the weight is a function of seconds, while the sampled value's unit is milliseconds?
weight is not measured in units of time. The weight curve's input is in seconds because we expect that the majority of round trip time will be < 1s and these samples have greater impact on our weighted average. If we have an outlier where the server could not respond for a couple of seconds, its weight and impact on this estimate will be lower.
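The weight curve described above can be evaluated directly: a Gaussian-shaped falloff over TTFB expressed in seconds, so sub-second round trips dominate the weighted average while multi-second outliers are heavily discounted. The function below restates the formula from the diff for illustration.

```typescript
// Weight curve from the snippet above: sqrt(2) * exp(-s^2 / 2), s = TTFB in s.
const ttfbWeight = (ttfbMs: number): number => {
  const seconds = ttfbMs / 1000;
  return Math.sqrt(2) * Math.exp(-Math.pow(seconds, 2) / 2);
};

// ttfbWeight(100)  ≈ 1.407  (typical request, near-full weight)
// ttfbWeight(500)  ≈ 1.248
// ttfbWeight(2000) ≈ 0.191  (outlier, heavily discounted)
```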
// (longer times have less weight with expected input under 1 second)
const seconds = ttfb / 1000;
const weight = Math.sqrt(2) * Math.exp(-Math.pow(seconds, 2) / 2);
this.ttfb_.sample(weight, Math.max(ttfb, 5));
Out of curiosity, why a min sample value of 5 ms?
This should probably be addressed by not sampling small values in abandoned incomplete request cases. I chose this magic number because, based on testing with real traffic, it seemed unlikely we would get lower values that were realistic or reliable.
src/controller/abr-controller.ts (outdated)
@@ -183,7 +187,7 @@ class AbrController implements ComponentAPI {
  const levelNextBitrate = levels[nextLoadLevel].maxBitrate;
  fragLevelNextLoadedDelay = loadRate
    ? (duration * levelNextBitrate) / (8 * 0.8 * loadRate)
-   : duration / bwEstimate + ttfbEstimate;
+   : (duration * levelNextBitrate) / bwEstimate + ttfbEstimate;
Why are we conservative with loadRate, while not conservative with bwEstimate?
loadRate is falsey when we didn't load the first bytes; I'd argue bwEstimate is over-promising in this case.
Why are we conservative with loadRate, while not conservative with bwEstimate?
Changed this to always add ttfbEstimate rather than augment the load rate or BWE. There will always be a round-trip cost when switching down to load a new segment, so this should provide a better estimate in either case. 846f618
loadRate is falsey when we didn't load the first bytes; I'd argue bwEstimate is over-promising in this case.
bwEstimate is the best estimate we have. It is more likely that the ttfbEstimate has not been met if a fragment request needs to be aborted before any data has been received.
…first byte loaded (video-dev#4917) Subset of video-dev#4825
Force-pushed fe62282 to b4b3bfd
Force-pushed 846f618 to 6f19423
Force-pushed 6f19423 to 1672028
* Improve bandwidth estimation and adaptive switching with smaller segments and higher TTFB. Fixes #3578 (special thanks to @Oleksandr0xB for submitting #4283). Fixes #3563 and Closes #3595 (special thanks to @kanongil)
* Load rate and loaded delay calculation fixes
* Convert ttfbEstimate to seconds
* Include main variant init segments in TTFB sampling
* Use ttfb estimate in abandon rules down-switch timing
This PR will...
Improve bandwidth estimation and adaptive switching with smaller segments and higher TTFB
Why is this Pull Request needed?
Estimating time-to-first-byte separately from the time used to estimate bandwidth allows us to more accurately predict the time it takes to load segments, especially those with shorter durations closer to the average round-trip time of a request.
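The rationale can be shown with simple arithmetic (the numbers below are illustrative, not taken from the PR): with TTFB held out of the bandwidth estimate, the predicted load time is one round trip plus transfer time, and the fixed round-trip cost dominates for short segments.

```typescript
// Predicted load time = TTFB estimate + transfer time at the BW estimate.
const bwEstimateBps = 5_000_000; // bits/s (illustrative)
const ttfbEstimateS = 0.25;      // seconds (illustrative)

const predictedLoadTime = (durationS: number, bitrateBps: number): number =>
  ttfbEstimateS + (durationS * bitrateBps) / bwEstimateBps;

// A 1s part at 5 Mbps: 1.25 s total, 20% of which is round trip.
const shortSeg = predictedLoadTime(1, 5_000_000);
// A 6s segment at 5 Mbps: 6.25 s total, only 4% round trip.
const longSeg = predictedLoadTime(6, 5_000_000);
```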
There were also several issues with loading stats, combined vs. main video buffer observation, and calculation of in-flight BW performed before bytes were loaded, all of which contributed to bad emergency down-switches and BWE corruption in _abandonRulesCheck. Thanks to @Pri12890 for pointing many of these out.
Are there any points in the code the reviewer needs to double check?
There are two new public methods on the player instance that I have left undocumented:
get mainForwardBufferInfo(): BufferInfo | null
Allows the ABR controller to access the stream controller's media buffer. Since it only deals with main variant fragment traffic, this allows it to compare that activity to the buffer it appends to, rather than the combined buffer, which could be stalled if alt-audio does not keep ahead.
get ttfbEstimate(): number
Similar to hls.bandwidthEstimate, hls.ttfbEstimate provides the latest time-to-first-byte estimate.
The TTFB sampling only uses one EWMA instance with the same slow half-life as that used for bandwidth. The default estimate is 100 ms and is not configurable. It is weighted on a curve that favors shorter values so that the occasional request RTT hiccup does not have as much impact on the estimate.
These changes will remain up for at least 1-2 minor releases and will not be released in v1.3. This is to give contributors time to review and test these changes and provide feedback.
Resolves issues:
Fixes #3578 (special thanks to @Oleksandr0xB for submitting #4283)
Fixes #3563 and Closes #3595 (special thanks to @kanongil for early testing and feedback on the Low-Latency HLS implementation)
Fixes #5094
Closes #4291 (_abandonRulesCheck governs whether fragment loading is completed or aborted based on timeouts, not active throughput)
Checklist