Large keyframes can result in consistent freezes on every keyframe #101

Open
haydenmc opened this issue Mar 7, 2021 · 18 comments
Labels
bug Something isn't working

Comments

@haydenmc
Member

haydenmc commented Mar 7, 2021

Some users have noticed that under some circumstances, the stream will freeze briefly on every keyframe.

Based on some observations, it looks like the problem is most reproducible in scenes where keyframes are of substantial size. Given this, and the fact that the issue reproduced on Mixer's server implementation as well as ours (which is a complete rewrite), it may be due to how keyframes are fragmented by the client-side ftl-sdk.

Relevant code is here:
https://github.com/microsoft/ftl-sdk/blob/d0c8469f66806b5ea738d607f7d2b000af8b1129/libftl/media.c#L983

Some ideas to help investigate the root cause:

  • Send raw RTP stream of source media known to reproduce the issue directly to our FTL ingest and see if the issue persists
  • Send large keyframe NALUs through ftl-sdk's fragmentation algorithm and check to see that they produce the expected RTP packets
  • Profile ftl-sdk's fragmentation algorithm and check to see if it introduces any sort of delay in RTP packet transmission
@haydenmc haydenmc added the bug Something isn't working label Mar 7, 2021
@haydenmc
Member Author

haydenmc commented Mar 14, 2021

Was watching a Jackbox stream on Glimesh and noticed this issue reproducing. What's interesting is that beyond the initial freeze, you get some weird jumbled frames briefly in-between that are really hard to notice (kudos to @danstiner for pointing this out 😁)

Timer is of format seconds:centiseconds

[video: Jackbox timer demonstrating the jumbled frames]

@danstiner
Collaborator

danstiner commented Mar 14, 2021

Looked deeper into this, haven't solved the issue completely but I'm convinced part of it is borked congestion handling in ftl-sdk. Building OBS with some extra logging shows the following:

11:15:20.315: ==== Streaming Start ===============================================
11:15:20.417: [3] Audio is ready and we have the first iframe, starting stream. (dropped 0 frames)
11:15:20.417: 
11:15:20.417: [3] Sender report base ntp time set to 14757395255531667552 us
11:15:20.417: 
11:15:20.417: [3] Stream (96) base dts set to 0 
11:15:20.417: 
11:15:20.417: [3] Video kbps changed to 6000, bytes_per_ms: 750, flow control: 1
11:15:20.417: 
11:15:20.417: [3] Throttled, sleeping for 2 ms because transmit_level:0 <= 0
11:15:20.420: [3] Throttled, sleeping for 2 ms because transmit_level:-550 <= 0
11:15:20.423: [3] Throttled, sleeping for 2 ms because transmit_level:-1084 <= 0
11:15:20.426: [3] Throttled, sleeping for 2 ms because transmit_level:-226 <= 0
11:15:20.429: [3] Throttled, sleeping for 2 ms because transmit_level:-760 <= 0
11:15:20.432: [3] Throttled, sleeping for 2 ms because transmit_level:-1294 <= 0
...
11:18:02.434: Last log entry repeated for 7593 more lines
11:18:02.434: Queue an average of 58.80 fps (1183.5 kbps), sent an average of 58.80 fps (1179.3 kbps), queue fullness 3, max frame size 279380

I think this is because the SDK is trying to be clever and do congestion control in a way that seems reasonable (a leaky bucket based on peak bandwidth), but for several reasons the approach is broken. There are at least three issues:

  1. The leaky bucket interval is only 100 ms, so at 6000 kbps sending more than ~75 KB (600 kbit) in one interval triggers rate limiting (see the sketch after this list).
  2. Peak bandwidth is set to 1.2 times the target encoder bitrate, completely ignoring the speed test that was done.
  3. The peak is ignored anyway; in practice the leaky bucket fill rate is set to exactly the encoder bitrate.
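
For a sense of scale, here's a minimal sketch of the bucket math using the numbers from the log above (this is purely an illustration, not ftl-sdk code):

  #include <stdio.h>

  int main(void) {
    /* From the log: 6000 kbps => bytes_per_ms = 750, bucket window = 100 ms. */
    int bytes_per_ms = 750;
    int max_xmit_level_in_ms = 100;
    int bucket_max = max_xmit_level_in_ms * bytes_per_ms;   /* 75,000 bytes */
    int keyframe_bytes = 279380;                            /* "max frame size" from the log */

    /* The bucket refills at exactly the encoder bitrate, so anything larger than
       the bucket can only drain out at bitrate speed. */
    printf("bucket capacity: %d bytes\n", bucket_max);
    printf("keyframe drain time: %.0f ms\n", (double)keyframe_bytes / bytes_per_ms);  /* ~373 ms of throttling */
    return 0;
  }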

I instrumented a bit so you can see the pattern:

15:05:19.280: [3] Sleeping for 2 ms; transmit_level:0
15:05:19.283: [3] _update_xmit_level bytes_added:2250, maxed:0, transmit_level:2250
15:05:19.283: [3] _update_xmit_level bytes_added:0, maxed:0, transmit_level:2234
15:05:19.283: [3] _update_xmit_level bytes_added:0, maxed:0, transmit_level:842
15:05:19.283: [3] _update_xmit_level bytes_added:0, maxed:0, transmit_level:-550
15:05:19.283: [3] Sleeping for 2 ms; transmit_level:-550
15:05:19.286: [3] _update_xmit_level bytes_added:2250, maxed:0, transmit_level:1700
15:05:19.286: [3] _update_xmit_level bytes_added:0, maxed:0, transmit_level:308
15:05:19.286: [3] _update_xmit_level bytes_added:0, maxed:0, transmit_level:-1084
15:05:19.286: [3] Sleeping for 2 ms; transmit_level:-1084
15:05:19.289: [3] _update_xmit_level bytes_added:2250, maxed:0, transmit_level:1166
15:05:19.289: [3] _update_xmit_level bytes_added:0, maxed:0, transmit_level:-226
15:05:19.289: [3] Sleeping for 2 ms; transmit_level:-226
15:05:19.291: [3] _update_xmit_level bytes_added:2250, maxed:0, transmit_level:2024
15:05:19.291: [3] _update_xmit_level bytes_added:0, maxed:0, transmit_level:632
15:05:19.291: [3] _update_xmit_level bytes_added:0, maxed:0, transmit_level:-760
15:05:19.291: [3] Sleeping for 2 ms; transmit_level:-760
15:05:19.294: [3] _update_xmit_level bytes_added:2250, maxed:0, transmit_level:1490
15:05:19.294: [3] _update_xmit_level bytes_added:0, maxed:0, transmit_level:98
15:05:19.294: [3] _update_xmit_level bytes_added:0, maxed:0, transmit_level:-1294
15:05:19.294: [3] Sleeping for 2 ms; transmit_level:-1294
15:05:19.297: [3] Stream (97) base dts set to 0 

A test fix for issue 3 is simple:

diff --git a/libftl/ftl_private.h b/libftl/ftl_private.h
index 0606052..4e26bb3 100644
--- a/libftl/ftl_private.h
+++ b/libftl/ftl_private.h
@@ -69,7 +69,7 @@
 #define NACK_RTT_AVG_SECONDS 5
 #define MAX_STATUS_MESSAGE_QUEUED 10
 #define MAX_FRAME_SIZE_ELEMENTS 64 //must be a minimum of 3
-#define MAX_XMIT_LEVEL_IN_MS 100 //allows a maximum burst size of 100ms at the target bitrate  
+#define MAX_XMIT_LEVEL_IN_MS 1000 //allows a maximum burst size of 1000ms at the target bitrate
 #define VIDEO_RTP_TS_CLOCK_HZ 90000
 #define AUDIO_SAMPLE_RATE 48000
 #define AUDIO_PACKET_DURATION_MS 20
diff --git a/libftl/media.c b/libftl/media.c
index 8bfe0f1..6033d8c 100644
--- a/libftl/media.c
+++ b/libftl/media.c
@@ -1257,6 +1257,7 @@ OS_THREAD_ROUTINE video_send_thread(void *data)

         if (transmit_level <= 0) {
           ftl->video.media_component.stats.bw_throttling_count++;
+          FTL_LOG(ftl, FTL_LOG_INFO, "Sleeping for %d ms; transmit_level:%d", MAX_MTU / bytes_per_ms + 1, transmit_level);
           sleep_ms(MAX_MTU / bytes_per_ms + 1);
           _update_xmit_level(ftl, &transmit_level, &start_tv, bytes_per_ms);
         }
@@ -1308,13 +1309,17 @@ static void _update_xmit_level(ftl_stream_configuration_private_t *ftl, int *tra

   gettimeofday(&stop_tv, NULL);

-  *transmit_level += (int)timeval_subtract_to_ms(&stop_tv, start_tv) * bytes_per_ms;
+  int64_t elapsed_ms = timeval_subtract_to_ms(&stop_tv, start_tv);

-  if (*transmit_level > (MAX_XMIT_LEVEL_IN_MS * bytes_per_ms)) {
-    *transmit_level = MAX_XMIT_LEVEL_IN_MS * bytes_per_ms;
-  }
+  if (elapsed_ms > 0) {
+    *transmit_level += (int)elapsed_ms * bytes_per_ms;
+
+    if (*transmit_level > (MAX_XMIT_LEVEL_IN_MS * bytes_per_ms)) {
+      *transmit_level = MAX_XMIT_LEVEL_IN_MS * bytes_per_ms;
+    }

-  *start_tv = stop_tv;
+    *start_tv = stop_tv;
+  }
 }

 static int _update_stats(ftl_stream_configuration_private_t *ftl) {

After applying that I no longer get sleeps, but I still see picture loss. I did some additional testing, and as the ratio of keyframe size to average bitrate increases I start seeing more and more NACKs from the ingest, mostly for keyframes.

So I'm semi-hopeful the remaining issue is somehow server side, but it's been really hard to pin down what's happening.

15:36:16.908: Avg packet send per second 264.0, total nack requests 1452
15:36:17.843: [3] [8] resent sn 25239, request delay was 103 ms, was part of iframe? 1
15:36:17.843: [3] [8] resent sn 25240, request delay was 103 ms, was part of iframe? 1
15:36:17.843: [3] [8] resent sn 25241, request delay was 103 ms, was part of iframe? 1
15:36:17.843: [3] [8] resent sn 25242, request delay was 103 ms, was part of iframe? 1
15:36:17.843: [3] [8] resent sn 25243, request delay was 103 ms, was part of iframe? 1
15:36:17.843: [3] [8] resent sn 25244, request delay was 103 ms, was part of iframe? 1
15:36:21.912: Avg packet send per second 264.0, total nack requests 1458
15:36:22.854: [3] [8] resent sn 26535, request delay was 115 ms, was part of iframe? 1
15:36:22.854: [3] [8] resent sn 26536, request delay was 115 ms, was part of iframe? 1
15:36:22.854: [3] [8] resent sn 26537, request delay was 115 ms, was part of iframe? 1
15:36:22.854: [3] [8] resent sn 26538, request delay was 115 ms, was part of iframe? 1
15:36:22.854: [3] [8] resent sn 26539, request delay was 115 ms, was part of iframe? 1
15:36:22.854: [3] [8] resent sn 26540, request delay was 115 ms, was part of iframe? 1
15:36:22.854: [3] [8] resent sn 26541, request delay was 115 ms, was part of iframe? 1

I'd like to get a change like the above into OBS. It may not be enough to fix all the issues, but it's a nice small change that fixes at least one of the problems in ftl-sdk at low risk.

Alternatively, we could tear out congestion control entirely; it doesn't seem to actually feed back to OBS to lower its bitrate, so at least for me it's not really helping anything. (I also tested removing it entirely and got the same behavior as the fixed implementation.)

But in any case we should probably fully solve the freezing and be very confident in the change before opening a PR to the obs folks.

@danstiner
Collaborator

I think the original commit was microsoft/ftl-sdk@1a9e529

@danstiner
Collaborator

Pulled logs from Chrome's WebRTC stack. I see one of two patterns of errors happening with my above patch when streaming a mostly static but complex video scene.

Pattern 1:

WARNING	24192	24340	00:28:04-354		0	Too many frames backed up in the decoder, dropping this one.
WARNING	24192	24340	00:28:04-355		0	Too many frames backed up in the decoder, dropping this one.
WARNING	24192	24340	00:28:04-355		0	Too many frames backed up in the decoder, dropping this one.
WARNING	24192	24340	00:28:04-355		0	Too many frames backed up in the decoder, dropping this one.
WARNING	24192	8708	00:28:04-361		0	Failed to decode frame with timestamp 10290000, error code: -1
WARNING	24192	24340	00:28:04-370		0	Too many frames backed up in the decoder, dropping this one.
WARNING	24192	24340	00:28:04-371		0	Too many frames backed up in the decoder, dropping this one.
WARNING	24192	24340	00:28:04-371		0	Too many frames backed up in the decoder, dropping this one.
WARNING	24192	24340	00:28:04-371		0	Too many frames backed up in the decoder, dropping this one.
INFORMATION	24192	25948	00:28:04-465		0	VideoReceiveStream stats: 49057399, {ssrc: 1583814187, total_bps: 3955184, width: 1920, height: 1080, key: 39, delta: 6835, frames_dropped: 182, network_fps: 44, decode_fps: 34, render_fps: 34, decode_ms: 0, max_decode_ms: 11, first_frame_received_to_decoded_ms: 30, cur_delay_ms: 61, targ_delay_ms: 61, jb_delay_ms: 40, jb_cumulative_delay_seconds: 344.203, jb_emitted_count: 6698, min_playout_delay_ms: 2, sync_offset_ms: 287, cum_loss: 0, nack: 0, fir: 0, pli: 14}
INFORMATION	24192	25948	00:28:04-465		0	Call stats: 49057399, {send_bw_bps: 300000, recv_bw_bps: 4575121, max_pad_bps: 0, pacer_delay_ms: 0, rtt_ms: -1}
WARNING	24192	25948	00:28:04-561		0	No decodable frame in 200 ms, requesting keyframe.
VERBOSE	24192	26712	00:28:04-674		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:U4RMUffO:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:59458->QAZ1Gawk:1:2015363583:local:udp:192.168.86.x:51181|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=250, ms since last received data=2, rtt=100, pings_since_last_response=
WARNING	24192	25948	00:28:04-762		0	No decodable frame in 200 ms, requesting keyframe.
VERBOSE	24192	26712	00:28:04-925		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:U4RMUffO:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:59458->QAZ1Gawk:1:2015363583:local:udp:192.168.86.x:51181|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=501, ms since last received data=3, rtt=100, pings_since_last_response=
WARNING	24192	25948	00:28:04-963		0	No decodable frame in 200 ms, requesting keyframe.
VERBOSE	24192	25948	00:28:05-076		0	Audio delay: 111 current diff: -8 for stream 3855072737
WARNING	24192	25948	00:28:05-164		0	No decodable frame in 200 ms, requesting keyframe.
VERBOSE	24192	26712	00:28:05-175		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:U4RMUffO:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:59458->QAZ1Gawk:1:2015363583:local:udp:192.168.86.x:51181|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=751, ms since last received data=3, rtt=100, pings_since_last_response=
WARNING	24192	25948	00:28:05-365		0	No decodable frame in 200 ms, requesting keyframe.
VERBOSE	24192	26712	00:28:05-426		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:U4RMUffO:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:59458->QAZ1Gawk:1:2015363583:local:udp:192.168.86.x:51181|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=1002, ms since last received data=3, rtt=100, pings_since_last_response=
WARNING	24192	25948	00:28:05-566		0	No decodable frame in 200 ms, requesting keyframe.
VERBOSE	24192	26712	00:28:05-677		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:U4RMUffO:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:59458->QAZ1Gawk:1:2015363583:local:udp:192.168.86.x:51181|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=1253, ms since last received data=5, rtt=100, pings_since_last_response=
WARNING	24192	25948	00:28:05-767		0	No decodable frame in 200 ms, requesting keyframe.
INFORMATION	18196	24664	00:28:05-789		0	"Remote track muted: [object Event]", source: http://glimesh-web.lan:4000/js/app.js (13643)

Pattern 2:

INFORMATION	24192	25948	00:21:33-760		0	PacketBuffer size expanded to 1024
VERBOSE	24192	25948	00:21:33-813		0	Audio delay: 88 current diff: -19 for stream 2380185164
INFORMATION	18196	24664	00:21:33-994		0	"Long poll...", source: http://glimesh-web.lan:4000/js/app.js (13643)
INFORMATION	18196	24664	00:21:33-994		0	"Got a slowlink event on session 6407741692481036", source: http://glimesh-web.lan:4000/js/app.js (13643)
INFORMATION	18196	24664	00:21:33-994		0	"[object Object]", source: http://glimesh-web.lan:4000/js/app.js (13643)
INFORMATION	18196	24664	00:21:33-994		0	"janus-ftl-player: Janus slow link: undefined", source: http://glimesh-web.lan:4000/js/app.js (13643)
VERBOSE	24192	26712	00:21:34-003		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:88KVuqKe:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:51570->4Zjsde6r:1:2015363583:local:udp:192.168.86.x:60308|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=502, ms since last received data=6, rtt=100, pings_since_last_response=
WARNING	24192	25948	00:21:34-072		0	Sequence number 25035 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25036 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25037 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25038 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25039 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25040 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25041 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25042 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25043 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25044 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25045 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25046 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25047 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25048 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25050 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25051 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25052 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25053 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25054 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25055 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25056 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25057 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25058 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25059 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25060 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25061 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25062 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25063 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25064 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25065 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25066 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25067 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25068 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25069 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25070 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25071 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25072 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25073 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25074 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25075 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25076 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25077 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25078 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25079 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25080 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25081 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25082 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-072		0	Sequence number 25083 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-073		0	Sequence number 25084 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-073		0	Sequence number 25085 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-073		0	Sequence number 25086 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-073		0	Sequence number 25087 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-073		0	Sequence number 25088 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-073		0	Sequence number 25089 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-073		0	Sequence number 25090 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-073		0	Sequence number 25091 removed from NACK list due to max retries.
WARNING	24192	25948	00:21:34-073		0	Sequence number 25092 removed from NACK list due to max retries.
INFORMATION	18196	24664	00:21:34-161		0	"Remote track muted: [object Event]", source: http://glimesh-web.lan:4000/js/app.js (13643)
VERBOSE	24192	26712	00:21:34-254		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:88KVuqKe:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:51570->4Zjsde6r:1:2015363583:local:udp:192.168.86.x:60308|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=753, ms since last received data=7, rtt=100, pings_since_last_response=
VERBOSE	24192	26712	00:21:34-505		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:88KVuqKe:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:51570->4Zjsde6r:1:2015363583:local:udp:192.168.86.x:60308|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=1004, ms since last received data=8, rtt=100, pings_since_last_response=
INFORMATION	24192	25948	00:21:34-716		0	PacketBuffer size expanded to 2048
VERBOSE	24192	26712	00:21:34-756		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:88KVuqKe:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:51570->4Zjsde6r:1:2015363583:local:udp:192.168.86.x:60308|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=1255, ms since last received data=10, rtt=100, pings_since_last_response=
VERBOSE	24192	25948	00:21:34-812		0	Audio delay: 87 current diff: -18 for stream 2380185164
VERBOSE	24192	26712	00:21:35-006		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:88KVuqKe:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:51570->4Zjsde6r:1:2015363583:local:udp:192.168.86.x:60308|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=1505, ms since last received data=0, rtt=100, pings_since_last_response=
VERBOSE	24192	26712	00:21:35-257		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:88KVuqKe:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:51570->4Zjsde6r:1:2015363583:local:udp:192.168.86.x:60308|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=1756, ms since last received data=2, rtt=100, pings_since_last_response=
VERBOSE	24192	26712	00:21:35-507		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:88KVuqKe:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:51570->4Zjsde6r:1:2015363583:local:udp:192.168.86.x:60308|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=2006, ms since last received data=6, rtt=100, pings_since_last_response=
VERBOSE	24192	26712	00:21:35-758		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:88KVuqKe:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:51570->4Zjsde6r:1:2015363583:local:udp:192.168.86.x:60308|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=2257, ms since last received data=6, rtt=100, pings_since_last_response=
VERBOSE	24192	25948	00:21:35-811		0	Audio delay: 87 current diff: -2 for stream 2380185164
VERBOSE	24192	26712	00:21:36-010		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:88KVuqKe:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:51570->4Zjsde6r:1:2015363583:local:udp:192.168.86.x:60308|CRWS|S|1|0|8655920682762255870|0]: UpdateState(), ms since last received response=2509, ms since last received data=8, rtt=100, pings_since_last_response=
VERBOSE	24192	26712	00:21:36-010		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:88KVuqKe:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:51570->4Zjsde6r:1:2015363583:local:udp:192.168.86.x:60308|CRWS|S|1|0|8655920682762255870|0]: Sending STUN ping, id=334d4950462f6b346830626e, nomination=0
VERBOSE	24192	26712	00:21:36-010		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:88KVuqKe:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:51570->4Zjsde6r:1:2015363583:local:udp:192.168.86.x:60308|CRWS|S|1|0|8655920682762255870|0]: Sent STUN BINDING request, id=334d4950462f6b346830626e, use_candidate=0, nomination=0
VERBOSE	24192	26712	00:21:36-010		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:88KVuqKe:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:51570->4Zjsde6r:1:2015363583:local:udp:192.168.86.x:60308|CRWI|S|1|0|8655920682762255870|0]: Received STUN BINDING response, id=334d4950462f6b346830626e, code=0, rtt=0, pings_since_last_response=334d4950462f6b346830626e
VERBOSE	24192	26712	00:21:36-010		0	Conn[5aa00:audio:Net[any:0.0.0.x/0:Wildcard:id=0]:88KVuqKe:1:0:local:udp:3142e6dd-1584-4001-823e-67335afe67d6.local:51570->4Zjsde6r:1:2015363583:local:udp:192.168.86.x:60308|CRWS|S|1|0|8655920682762255870|0]: set_state
WARNING	24192	25948	00:21:36-047		0	PacketBuffer is already at max size (2048), failed to increase size.
WARNING	24192	25948	00:21:36-047		0	Clear PacketBuffer and request key frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [27000, 27003] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26981, 26999] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26970, 26973] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26966, 26969] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26941, 26965] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26937, 26940] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26910, 26936] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26906, 26909] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26902, 26905] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26877, 26901] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26873, 26876] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26850, 26872] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26846, 26849] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26842, 26845] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26819, 26841] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26815, 26818] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26794, 26814] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26790, 26793] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26786, 26789] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26756, 26785] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26752, 26755] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26725, 26751] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26721, 26724] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26717, 26720] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26689, 26716] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26685, 26688] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26660, 26684] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26656, 26659] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26652, 26655] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26627, 26651] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26623, 26626] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26598, 26622] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26594, 26597] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26590, 26593] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26564, 26589] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26560, 26563] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26529, 26559] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26525, 26528] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26521, 26524] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26492, 26520] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26488, 26491] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26464, 26487] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26460, 26463] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26456, 26459] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26429, 26455] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26425, 26428] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26401, 26424] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26397, 26400] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26393, 26396] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26372, 26392] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26368, 26371] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26325, 26367] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26321, 26324] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26317, 26320] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26286, 26316] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26282, 26285] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26256, 26281] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26252, 26255] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26248, 26251] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26223, 26247] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26219, 26222] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26196, 26218] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26192, 26195] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26188, 26191] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26169, 26187] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26165, 26168] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26123, 26164] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26119, 26122] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26115, 26118] has no GoP, dropping frame.
WARNING	24192	25948	00:21:36-085		0	Generic frame with packet range [26070, 26114] has no GoP, dropping frame.

I also at one point saw the following from the ingest node:

Mar 15 00:38:48 ftl-ingest-1 janus[458847]: [2021-03-15 00:38:48.465] [debug] 152 keyframe packets recorded @ timestamp 8100000
Mar 15 00:38:54 ftl-ingest-1 janus[458847]: [2021-03-15 00:38:54.461] [debug] 121 keyframe packets recorded @ timestamp 8640000
Mar 15 00:39:00 ftl-ingest-1 janus[458847]: [2021-03-15 00:39:00.465] [debug] 109 keyframe packets recorded @ timestamp 9180000
Mar 15 00:39:05 ftl-ingest-1 janus[458847]: [h264 @ 0x7f7e34144300] Invalid level prefix
Mar 15 00:39:05 ftl-ingest-1 janus[458847]: [h264 @ 0x7f7e34144300] error while decoding MB 74 56
Mar 15 00:39:05 ftl-ingest-1 janus[458847]: [h264 @ 0x7f7e34144300] concealing 1415 DC, 1415 AC, 1415 MV errors in I frame
Mar 15 00:39:06 ftl-ingest-1 janus[458847]: [2021-03-15 00:39:06.465] [debug] 119 keyframe packets recorded @ timestamp 9720000
Mar 15 00:39:12 ftl-ingest-1 janus[458847]: [2021-03-15 00:39:12.465] [debug] 100 keyframe packets recorded @ timestamp 10260000

Pattern 2 looks like packet loss, but I'm fairly sure my home VM lab has zero packet loss. Pattern 1 looks like either packet loss or (the way I'm leaning) somehow incorrect packetization of the video stream by ftl-sdk or OBS. The error messages at least give us something to google.

I think the next step is to test ftl-sdk outside of OBS to see if the problem repros. If it does, the next step is to instrument ftl-sdk and OBS to save both the raw video frames and raw packets, and start looking at how they line up.

@danstiner
Collaborator

danstiner commented Mar 16, 2021

Confirmed this happens even with zero packet loss or re-ordering on the edge/ingest nodes.
Edit: As determined by counting the incoming sequence numbers in ftl-land.

@danstiner
Collaborator

https://webrtchacks.com/video_replay/ looks interesting for debugging this from the perspective of the browser.

@danstiner
Collaborator

danstiner commented Mar 17, 2021

After instrumenting everything, I found at least one repro case where very large keyframes lead to bursts of packet loss between the edge <-> browser that cause stuttering. The issue is worse with the above patch that removes rate limiting from ftl-sdk. It was resolved by increasing the kernel's tunable receive/transmit socket buffers.
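
For reference, the buffer increase is just a sysctl change along these lines (the values here are illustrative only, not what was settled on; net.core.rmem_max / wmem_max are the standard Linux knobs that cap how large a socket buffer an application may request):

  # /etc/sysctl.d/90-udp-buffers.conf -- illustrative values, not a recommendation
  # Raise the ceiling and defaults for socket receive/send buffer sizes (bytes).
  net.core.rmem_max = 26214400
  net.core.rmem_default = 26214400
  net.core.wmem_max = 26214400
  net.core.wmem_default = 26214400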

Looking at the remaining issues, I know of several other major causes of stuttering, but I'm not sure any of them are consistent enough to explain this issue:

  1. Packet loss between streamer <-> ingest node (re-working NACK handling in janus-ftl-plugin will help, as can lowering bitrates, but some ISPs may just be so bad you cannot stream on them)
  2. The rate limiting bug in ftl-sdk (a fix means patching OBS)
  3. Packet loss on our servers due to buffer overruns (can be fixed by increasing socket buffer sizes at the cost of a bit of RAM)
  4. Packet loss between edge node <-> browser (in theory janus handles NACKs for us well enough that this is not a concern)

@danstiner danstiner changed the title RTP keyframe fragmentation can result in consistent freezes on every keyframe Large keyframes can result in consistent freezes on every keyframe Apr 16, 2021
@danstiner
Collaborator

danstiner commented Apr 16, 2021

Think I've finally found the root cause of the consistent stuttering on every keyframe: it's actually client side, in the WebRTC stack. Note this specific issue only happens in low-motion streams; other forms of intermittent stuttering or complete loss of video are usually due to packet loss in my testing, which will be at least partially addressed by #95

tl;dr in this scenario the WebRTC stack in Chrome/Firefox/etc is not waiting for all the data for a video frame to arrive before trying to render it, leading to skipped/corrupted frames.

For context, every frame in an RTP video stream is "packetized", or broken down into one or more packets. The larger the frame, the more packets and the longer it takes for them all to reach the client. To account for this and other network delays, WebRTC has a jitter buffer that holds packets for a short time until it's time to render the frame. If not all packets are available when it's time to render the frame, you'll see either a pause in the video as the frame is skipped, or the renderer will try its best with missing info and you might see visual artifacts / loss of color.

Normally the jitter buffer works fine and a reasonable playout delay is calculated that is long enough for all packets of a frame to arrive before WebRTC tries to render it. However, in my testing there's poor behavior that happens when I-frames (keyframes) are significantly larger than P-frames (predicted/inter frames). For example, one test saw WebRTC calculate a delay of 100ms, but the keyframes were ~250KB for a 5 Mbps stream; due to the rate limiting in ftl-sdk those keyframes take about 320ms to be sent over the network. This leads to roughly thirteen skipped frames at 60fps, and so the video paused on every keyframe, which is the hallmark of this bug.
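
A quick back-of-the-envelope check of those numbers (the sizes and delays are the ones from my test; the code itself is only an illustration):

  #include <stdio.h>

  int main(void) {
    double keyframe_bits     = 250000.0 * 8.0;   /* ~250 KB keyframe */
    double send_rate_kbps    = 5000.0 * 1.2;     /* 5 Mbps stream, ftl-sdk allows ~1.2x bursts */
    double playout_delay_ms  = 100.0;            /* delay WebRTC chose in this test */
    double frame_interval_ms = 1000.0 / 60.0;    /* 60 fps */

    /* kbit per second is numerically equal to bit per millisecond, so this is in ms. */
    double send_ms = keyframe_bits / send_rate_kbps;                       /* ~333 ms */
    double late_ms = send_ms - playout_delay_ms;
    printf("keyframe send time: %.0f ms\n", send_ms);
    printf("frame intervals missed: %.0f\n", late_ms / frame_interval_ms); /* ~14, same ballpark as the ~13 skips observed */
    return 0;
  }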

In my previous testing I was able to work around this by effectively removing the rate limiting in ftl-sdk; this let OBS spike the send rate and get the keyframe sent significantly faster, so it fit within the buffer delay calculated by WebRTC. This only works if the streamer and viewer both have network bandwidth at least 3-4x the stream bitrate. That's more than many connections can handle, and big spikes in UDP traffic can actually lead to high packet loss that also causes stuttering/lost frames, so just increasing the send rate like I was trying in Glimesh/ftl-sdk#3 is not a slam-dunk fix.

In my testing there were many factors that make this issue more/less prevalent by affecting the ratio of I-frame to P-frame size. Encoder and network issues often seem like black magic but I'll try to document what I found:

  • baseline profile has more issues than main or high; I believe this is because main/high have better compression tools, so the encoder uses fewer bytes on a keyframe while still maintaining good image quality
  • streams with lots of motion do not hit this bug as much, both because their P-frames are large enough that WebRTC picks a large enough jitter buffer delay, and because video encoders try to stay at a target bitrate, so large P-frames mean there are fewer bits "available" to encode a huge keyframe
  • The NVENC encoder seems to happily send tiny P-frames and then spike huge I-frames, while the h264 encoder sends larger P-frames and has a "smoother" bitrate that is slightly less likely to trigger this behavior.
  • The h264 encoder has a customizable buffer size that can be used to further "smooth" the bitrate at the cost of either latency or video quality
  • There are probably other encoder settings that can also "smooth" the bitrate and reduce the impact of this issue

However, I think the real fix here has to be on the viewer side, in WebRTC itself (changes to encoder settings would also help but are much more difficult for us to control and tune for all situations). While this could result in 100-200ms of latency being added to the playout delay, in my view that is better than stuttering video. Once we have this solved we can go back and look at optimal encoder settings, or at changing the rate limiting in OBS, to reduce that latency by reducing the time it takes for keyframes to be delivered.

So there are a couple potential next steps:

I'd still like to come back to the other causes of stuttering I outlined above, but separately from this issue. For example #95 covers adding back NACKs to help with packet loss on the streamer side.

Here's a sample clip that can be used to reproduce the stuttering behavior. Importantly, it's a mostly static image that is difficult to compress, so keyframes are very large, but P-frames are small because there is almost no motion. The clock helps see when stuttering happens (there is no stuttering in this clip; it's a source clip to reproduce the issue, not a capture of what the viewer sees):
https://user-images.githubusercontent.com/52513/114984590-bf574900-9e46-11eb-8fd2-d11778fa1a97.mp4

@danstiner
Collaborator

danstiner commented Apr 16, 2021

To get into the nitty-gritty, I added some logging to WebRTC and tested in Chromium to see the behavior live. It appears what happens in the jitter estimator is that the keyframe is correctly calculated to have arrived ~200ms later than expected, but then an upper bound is applied to the frame delay. This actually makes it look like the keyframe arrived faster than expected for its size, which I think is why the estimated jitter delay actually goes down when processing the keyframe; then it goes back up slightly because the following P-frames were delayed behind the large keyframe.

There are other parts of the calculation that also try to exclude large keyframes, but these lines seem like the main issue in my testing:

  // Cap frameDelayMS based on the current time deviation noise.
  int64_t max_time_deviation_ms =
      static_cast<int64_t>(time_deviation_upper_bound_ * sqrt(_varNoise) + 0.5);
  frameDelayMS = std::max(std::min(frameDelayMS, max_time_deviation_ms),
                          -max_time_deviation_ms);

https://chromium.googlesource.com/external/webrtc/+/refs/heads/master/modules/video_coding/jitter_estimator.cc#159
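
To make the effect of that clamp concrete, here's a small worked example; the noise level and sigma multiplier below are assumed purely for illustration, not values taken from a real run:

  #include <math.h>
  #include <stdio.h>

  int main(void) {
    double time_deviation_upper_bound = 3.5;  /* sigmas (assumed for illustration) */
    double var_noise = 100.0;                 /* assumed noise variance => sigma = 10 ms */
    double frame_delay_ms = 200.0;            /* large keyframe arrives ~200 ms late */

    double max_dev_ms = time_deviation_upper_bound * sqrt(var_noise) + 0.5;  /* 35.5 ms */
    double clamped = fmax(fmin(frame_delay_ms, max_dev_ms), -max_dev_ms);

    /* The estimator is told the frame was only ~35 ms late instead of 200 ms,
       so the jitter delay it produces never grows enough to cover the keyframe. */
    printf("cap: %.1f ms, delay fed to estimator: %.1f ms (actual: %.1f ms)\n",
           max_dev_ms, clamped, frame_delay_ms);
    return 0;
  }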

This sample data shows the default behavior. Notice how the large keyframes actually shrink the bound because of the capping behavior, the frames just after the keyframe grow the bound slightly because they were delayed, and then once things are back to normal the bound shrinks again as small P-frames are delivered with low jitter. Also notice there should be ~119 frames between keyframes for this 60fps stream, but a large number of them are getting dropped, because the large keyframe delays so many subsequent frames that WebRTC gets upset and starts dropping frames until one successfully decodes, logging errors like:

Failed to decode frame with timestamp 14421000, error code: -1
Too many frames backed up in the decoder, dropping this one.
No decodable frame in 200 ms, requesting keyframe.

[chart: jitter estimate and frame-delay bounds with the default cap in place]

However, if I remove the frame delay cap the stuttering goes away! And I see very different behavior in the jitter estimator that explains this. In this graph you can clearly see the estimated delay has grown large enough to handle the large keyframes that take nearly 200ms to arrive. While the upper/lower bounds that would have been used to cap the frame delay do expand a bit, the large keyframes are still well outside the bounds.

[chart: jitter estimate and frame-delay bounds with the cap removed]

I don't think the solution will actually be to just remove this limit; I'm sure it was added for a good reason. But this does prove to me that playout delay is the core issue here, and that WebRTC is where we should primarily be looking to solve it.

@haydenmc
Member Author

Bad ass analysis!

We should get in touch with the folks responsible for this code. Is the Chromium bug tracker the best place to kick this off? Or maybe we can start a thread with the author.

@danstiner
Collaborator

danstiner commented Apr 22, 2021

Thanks! That looks like a very relevant commit.

There is a WebRTC section on that tracker, though it may be better to start with the more casual mailing list: https://www.w3.org/groups/wg/webrtc

@haydenmc
Member Author

Do you want to shoot a message with your findings to that DL and add me? Or do you want to meet and compose something together?

@compucat

compucat commented May 22, 2021

I've managed to mitigate this issue on the encoder side by tuning x264 to forcibly prevent bitrate spikes from keyframes. I've had most success with CRF encoding, using vbv-minrate and vbv-maxrate to set the target bitrate range and setting vbv-bufsize = <vbv-maxrate>/<FPS>. This can sometimes be a bit unstable, though, so I ended up doubling vbv-bufsize for a bit more stability with a theoretical latency penalty*.
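
As a concrete instance of that formula, using the settings from the footnote below: at 30 fps with vbv-maxrate=3000, vbv-bufsize comes out to 3000 / 30 = 100, doubled to 200 for stability. As a custom x264 options string that looks roughly like this (keyint=150 assumes the 5-second keyframe interval at 30 fps):

  crf=20 keyint=150 vbv-minrate=2500 vbv-maxrate=3000 vbv-bufsize=200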

Test videos used were the British Arrows (as an "average video" longer test) and a continuous loop of this F-Zero GX tool-assisted speedrun (which will make any h264 encoder beg for mercy, and reliably triggers the large keyframe issue when played in a loop).

Optimally, this should be combined with intra-refresh=1 (i.e. enable periodic intra refresh, which "keyframes" column-by-column to avoid keyframe bitrate spikes) to avoid keyframe pumping and improve the consistency of the stream's quality. However, for some reason this causes playback issues if packet loss is encountered or if a viewer connects mid-stream; I suspect something is waiting for a whole keyframe (which will never come due to intra-refresh) and stalling. As soon as a keyframe is requested, this shows up in the logs:

[h264 @ 0x560be8bbe800] no frame!
[2021-05-22 08:42:15.510] [warning] Couldn't read stream video size for channel 1111 / stream 3: Error sending a packet for decoding.
[h264 @ 0x560be8bbe800] no frame!
[2021-05-22 08:42:15.510] [warning] Couldn't generate preview for channel 1111 / stream 3: Error sending a packet for decoding.

Of course, this doesn't much help with other encoders (OpenH264, NVENC, Quick Sync, AMD AMF) unless equivalent parameters can be set. :) NVIDIA's docs seem to imply this is possible for NVENC at least, and I suspect it's likely possible for others as well. I haven't been able to test any of this at the moment, though - AMD hardware encoding on Linux is a mess in general, and my NVIDIA GPU is currently being repurposed in a Windows VM.

Come to think of it, enabling intra-refresh on its own (with more typical CBR encoding) may be good enough to mitigate the worst of the keyframe bitrate spikes, as long as we can get everything else in the chain to play nice with that. This IEEE paper seems to imply that periodic intra refresh through WebRTC is a supported combination, at least with HEVC.


*In practice, with my local machine in MN on Comcast and the server running on a NYC-based VPS, I'm achieving anywhere from 330ms to 400ms latency on average and have seen it as low as 250-270ms. Majority of testing was performed with these x264 settings: 1080p30, CRF encoding with CRF=20, keyframes every 5 seconds, faster preset, zerolatency tune, additional flags vbv-minrate 2500 vbv-maxrate=3000 vbv-bufsize=200. (Yes, I'm aware it's a painfully low bitrate for low-latency 1080p30 :)

@StrikerMan780

Any update on this? I get stream hangs very frequently; it's quite disruptive.

@danstiner
Collaborator

danstiner commented Jan 24, 2022

@compucat belated thank you for investigating. I had similar findings, and things like limiting the x264 buffer size can certainly help. intra-refresh is interesting but, as you noted, comes with a quality and latency penalty. Also interesting is Scalable Video Coding. These are discussed in some other issues linked to this one, but basically I think that to give the best experience for our users we should be handling this better on ingest / playback, and not require special encoder settings with drawbacks (though you're of course welcome to use them yourself, as you understand them well).

@StrikerMan780 As an update, we've rolled out NACKs to help with packet loss and will be rolling out the playout delay change which should help with this issue (without needing encoder changes) in the next few weeks.

My test stream for this is similar to the above, but simplified to the bare minimum: a changing clock over a static white-noise background. This makes it easy to measure stream lag; P-frames are small because the change frame-to-frame is very small, but I-frames are huge because the background has high information content (it's hard to compress). For example:
[screenshot: test stream — a clock overlaid on a static white-noise background]

This kind of video stuff is very complex, so it took a while before I had time to get back to looking at it and implementing features to help. I'd still like to push for a fix in Chrome itself if we can prove the playout delay extension helps significantly.

@mdevaev

mdevaev commented Jul 16, 2022

@danstiner Hello! Your research is amazing. Have you managed to find out anything else about this problem? I came across it in my own plugin for Janus, but my case is a little different: I wanted to use the playout-delay extension and playoutDelayHint to get zero latency even at the cost of smoothness. I use a Raspberry Pi hardware encoder and generate keyframes every 30 frames (at 30 fps), and I get significant stuttering every second.

Of course, this is not an encoder problem; I have an alternative protocol and client for receiving H264, and everything works fine there. It seems that WebRTC is too smart and its jitter algorithms spoil everything in my case.

Is fixing the WebRTC stack the only way? Or is there some other option?

Edit: a new place in the sources: https://github.com/maitrungduc1410/webrtc/blob/master/modules/video_coding/timing/jitter_estimator.cc

  // Cap frame_delay based on the current time deviation noise.
  TimeDelta max_time_deviation =
      TimeDelta::Millis(time_deviation_upper_bound_ * sqrt(var_noise_) + 0.5);
  frame_delay.Clamp(-max_time_deviation, max_time_deviation);

@danstiner
Collaborator

danstiner commented Aug 21, 2022

@mdevaev Hi Maxim. Glad my investigations could help you. I don't work on real-time video stuff anymore, but I'll share a couple of thoughts.

  1. If you use H.264, fundamentally there needs to be a playout delay on the receiver long enough to receive the large keyframes over the network, plus time to decode and display the frame. That defines a "sweet spot": the lowest latency possible without stuttering. So I would recommend setting the playout delay extension/hint to that value to eliminate the stuttering. You'd have to figure out the right value experimentally; in theory you shouldn't have to set a playout hint at all and the WebRTC stack should infer the network and render delays, but as we've both seen it doesn't do a great job of that for H.264. That could be improved, but it wouldn't actually change the "sweet spot".
  2. If you do want lower latency, I think you'll have to change your encoding. For example, WebRTC is tuned better for VP9, which has a number of real-time features like cyclic refresh, allowing it to refresh a small area each frame and keep a consistent bitrate, avoiding this problem of huge keyframes taking a long time to send over the network. H.264 has some similar features like intra refresh; it might be worth playing with those if you can't switch outright to VP9 or another encoding more appropriate for your use case.

@mdevaev

mdevaev commented Aug 21, 2022

@danstiner Thank you for sharing this. I will explore WebRTC further and maybe I can improve the frame processing chains. Unfortunately, I can't use VP9 due to the hardware limitations of my platform, so the only option left is fixing the browser :)
