Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add insert of concealed packets #3308

Closed
wants to merge 2 commits into from

Conversation

spscream
Copy link

this changes were created to address #3297 for our environment.
I'm not sure if it is fully correct, but it mostly fixes our problems.

@januscla
Copy link

Thanks for your contribution, @spscream! Please make sure you sign our CLA, as it's a required step before we can merge this.


janus_mutex_lock(&participant->qmutex);

for(int i=1; i < lost_packets; i++) {
Copy link
Author

@spscream spscream Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to get rid of cycle and produce plc data which frame_size will be lost_packets * OPUS_SAMPLES and shouldn't exceed BUFFER_SAMPLES size.
In our production I've seen losses of 20+ packets in a row, so plc for 20 packet won't fit BUFFER_SAMPLES and I'm not sure how to behave in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually the whole thing that perplexes me. The way it's done right now, you're basically breaking FEC, as it would never be done. You're preserving the use_fec = true bit, but it will be useless, since later on we'll try to get redundant information from a packet that doesn't have any (the PLC data you've added). I think PLC and FEC should be handled separately: if the gap is 1, then do what was done before; if it's higher than 1, do PLC (new code, without use_fec=true). Or do you have a reason for implementing it this way?

As a side not, the LOG_ERR there is probably overly verbose, and would clog the logs any time there's bursts of lost packets, which may happen often. A LOG_VERB or even LOG_HUGE would probably be a better choice (but that's something that can be discussed, it may still be useful to have info on the logs when this happens).

Copy link
Author

@spscream spscream Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code generates plc for gap sequence, which can't be recovered using fec. Last gap member is generated using fec of arrived rtp packet, if it's possible(opus_decode will return plc in case fec data isn't available), for example:

  • gap of 1 will be covered using fec, since condition i < lost_packets for cycle wont match
  • gap of 2 will generate 1 plc and 1 will be recovered from fec information using rtp payload

I'm affraid of the following problem now - frame size can be differ from packet to packet and decoder should adapt to changes of frame size accordingly since encoder may change frame_size. Looks like chrome encoder changes frame size in case of losses - opus_packet_get_nb_frames returns 10ms frame size periodically on bursts of losses for rtp packet after gap. Chromium realization of fec decode uses opus_packet_get_nb_frames to determine fec frame_size https://chromium.googlesource.com/external/webrtc/+/refs/heads/main/modules/audio_coding/codecs/opus/opus_interface.cc#681).

I left LOG_ERR for debug purpose, i'm going to change it to LOG_VERB or LOG_HUGE for production.

@spscream spscream marked this pull request as draft December 13, 2023 20:44
Copy link
Member

@lminiero lminiero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I've added a few notes inline with some considerations that we can discuss. I know @atoppi had some comments of his own too, so I'll leave it to him to add more (and possibly correct me if I was wrong somewhere).

@@ -1699,6 +1699,7 @@ typedef struct janus_audiobridge_participant {
uint16_t expected_seq; /* Expected sequence number */
uint16_t probation; /* Used to determine new ssrc validity */
uint32_t last_timestamp; /* Last in seq timestamp */
uint16_t last_seq; /* Last sequence number */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style: indentation of comment looks broken here.

@@ -2135,7 +2136,7 @@ static int janus_audiobridge_resample(int16_t *input, int input_num, int input_r
#define JITTER_BUFFER_MIN_PACKETS 2
#define JITTER_BUFFER_MAX_PACKETS 40
#define JITTER_BUFFER_CHECK_USECS 1*G_USEC_PER_SEC
#define QUEUE_IN_MAX_PACKETS 4
#define QUEUE_IN_MAX_PACKETS 20
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally don't like this: it will cause a permanent delay increase for audio of this participant after a burst of lost packets caused this to grow because of PLC, ven way after the problem was solved. I'd argue that a burst of lost packets should indeed cause artifacts (like a speed up), and that if not 4, then a way smaller number should be more than enough to absorb shorter bursts.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've expiremented with it and low value caused pops/cracks for small bursts.


janus_mutex_lock(&participant->qmutex);

for(int i=1; i < lost_packets; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually the whole thing that perplexes me. The way it's done right now, you're basically breaking FEC, as it would never be done. You're preserving the use_fec = true bit, but it will be useless, since later on we'll try to get redundant information from a packet that doesn't have any (the PLC data you've added). I think PLC and FEC should be handled separately: if the gap is 1, then do what was done before; if it's higher than 1, do PLC (new code, without use_fec=true). Or do you have a reason for implementing it this way?

As a side not, the LOG_ERR there is probably overly verbose, and would clog the logs any time there's bursts of lost packets, which may happen often. A LOG_VERB or even LOG_HUGE would probably be a better choice (but that's something that can be discussed, it may still be useful to have info on the logs when this happens).

pkt->data = g_malloc0(OPUS_SAMPLES * (participant->stereo ? 2 : 1) * sizeof(opus_int16));
pkt->ssrc = 0;
pkt->timestamp = participant->last_timestamp + 960 * i;
pkt->seq_number = participant->last_seq + 1 * i;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure the RTP info here is correct, since last_timestamp and last_seq may come from out of order packets. But it's probably not that relevant, since it has a limited use anyway in the code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought jitter buffer is reordering out of order packets and duplicates? At least I've expiremented with out of order and duplicates on output and everything still worked fine

/* This is a redundant packet, so we can't parse any extension info */
pkt->silence = FALSE;
/* Decode the lost packet using fec=1 */

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why you completely changed the way we handle FEC? The existing FEC management should remain the same, if not broken. See comments above.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FEC management which were implemented earlier didn't work in case of gap > 1, I didn't change too much, just adapted it to work using latest arrived rtp packet info instead of expected value based on last decoded packet.

pkt->length = opus_decode(participant->decoder, payload, plen, (opus_int16 *)pkt->data, output_samples, 1);

JANUS_LOG(LOG_ERR, "[%d] packet fec decoded [%d] pkt->length, timestamp: [%d]\n",
pkt->seq_number, pkt->length, pkt->timestamp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above on LOG_ERR usage.

@@ -8738,7 +8768,7 @@ static void *janus_audiobridge_participant_thread(void *data) {
locked = TRUE;
/* Do not let queue-in grow too much */
guint count = g_list_length(participant->inbuf);
if(count > QUEUE_IN_MAX_PACKETS) {
if((int) count > (QUEUE_IN_MAX_PACKETS + lost_packets)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you increase QUEUE_IN_MAX_PACKETS, then there's no reason to make it even less strict like you're doing now IMHO. It will make the internal mixer queue for the participant grow beyond control and cause huge and uncoverable latencies.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to set it to default 4 and got pops and cracks. On screen is 20-30% of drops set on output channel.
image

With 20 QUEUE_IN_MAX_PACKETS I don't have such pops/cracks and I don't get over the limit.

@lminiero
Copy link
Member

lminiero commented Dec 18, 2023

@spscream have you tried what happens if, instead of ingesting lost_packets concealed packets in that for loop, you add only one? My guess is that the end result should be the same, since it will try to recreate stuff from the last thing it received, and should have the added benefit of not inflating the queue as your patch would do now.

@atoppi
Copy link
Member

atoppi commented Dec 18, 2023

Thanks @spscream for the effort!

My objection to the implementation is related to the timing of PLC packets generation.

FEC needs packet N+1 to decode packet N (so it will naturally introduce/leverage a playout delay), meanwhile PLC is only calculated from already decoded samples. According to this assumption, the generation of a PLC packet should be triggered in a condition where the bridge encoder "needs something from a participant but there's nothing available", and not after the arrival of a random packet not adjacent to the previous one.

Reusing FEC approach will inevitably lead to issues like unwanted delay, discarding of good packets, uncontrollable queue growth etc.

I still have to think about a better approach in detail, however one idea is to trigger a PLC whenever a participant's queue-in is empty (e.g. detected participant underflow). That means that a loss burst has exhausted both the jitter buffer and the participant's queue (decoded samples), hence the mixer has no to other chance but introduce concealment for the participant.

@spscream
Copy link
Author

@spscream have you tried what happens if, instead of ingesting lost_packets concealed packets in that for loop, you add only one? My guess is that the end result should be the same, since it will try to recreate stuff from the last thing it received, and should have the added benefit of not inflating the queue as your patch would do now.

yes, I tried it and sound cracks

@spscream
Copy link
Author

@spscream have you tried what happens if, instead of ingesting lost_packets concealed packets in that for loop, you add only one? My guess is that the end result should be the same, since it will try to recreate stuff from the last thing it received, and should have the added benefit of not inflating the queue as your patch would do now.

yes, I tried it and sound cracks

I can insert n packets of maximum size instead of number of loss packets, but plc should cover time period of lost bursts accordingly to docs. frame_size parameter of opus_decode has the following description:

Number of samples per channel of available space in pcm. If this is less than the maximum packet duration (120ms; 5760 for 48kHz), this function will not be capable of decoding some packets. In the case of PLC (data==NULL) or FEC (decode_fec=1), then frame_size needs to be exactly the duration of audio that is missing, otherwise the decoder will not be in the optimal state to decode the next incoming packet. For the PLC and FEC cases, frame_size must be a multiple of 2.5 ms.

@spscream
Copy link
Author

I still have to think about a better approach in detail, however one idea is to trigger a PLC whenever a participant's queue-in is empty (e.g. detected participant underflow). That means that a loss burst has exhausted both the jitter buffer and the participant's queue (decoded samples), hence the mixer has no to other chance but introduce concealment for the participant.

This buffer underrun can explain why we still have this cracks/pops for our production users. But I afraid I can't rewrite it to such aproach myself or at least not this year.

@lminiero
Copy link
Member

I agree with @atoppi that it's the road we should explore, even though it does come with its own challenges: for one, the mixer is the one who'd read from that queue, and it uses a different thread, meaning I'd prefer it not to be the one that should perform that PLC decode. As such, we'll have to better think of how to do that in the participant thread instead. This is something we can definitely discuss together after the vacations, no need to hurry and no need for you to actually implement it yourself either, if we manage to replicate the problem and come up with a potential fix.

@spscream
Copy link
Author

@atoppi jitter_buffer_get has JITTER_BUFFER_MISSING and JITTER_BUFFER_INSERTION ret codes, so we can use it as a signal to insert plc, you can take it to considerations.

By the way I also discovered possible incorrect usage of JITTER_BUFFER_SET_MARGIN in current implementation - its value should be the same unit as the frame_size (e.g. JITTER_BUFFER_MIN_PACKETS * frame_size), since its used as timestamp under the hood, not as count of packets.

@lminiero
Copy link
Member

its value should be the same unit as the frame_size

I don't think so. Besides, we're changing what actually ends in the buffer: not the payload anymore, but a pointer to a duplicate of the janus_plugin_rtp we got from the core.

@spscream
Copy link
Author

its value should be the same unit as the frame_size

I don't think so. Besides, we're changing what actually ends in the buffer: not the payload anymore, but a pointer to a duplicate of the janus_plugin_rtp we got from the core.

I get the following situation with buffering if JITTER_BUFFER_SET_MARGIN is 2, the jitter buffer doesn't mantain its minimal buffer size:

[WARN] [18844549281349] jitter buffer avail: 0
[WARN] [18844549281349] jitter_buffer_get result 1
[WARN] [18844549281771] jitter buffer avail: 0
[WARN] [18844549281771] jitter_buffer_get result 1
[WARN] [18844549281829] jitter buffer avail: 2
[WARN] [18844549281829] jitter_buffer_get result 0
[WARN] [18844549300041] jitter buffer avail: 0
[WARN] [18844549300041] jitter_buffer_get result 1
[WARN] [18844549300112] jitter buffer avail: 2
[WARN] [18844549300112] jitter_buffer_get result 0
[WARN] [18844549302419] jitter buffer avail: 0
[WARN] [18844549302419] jitter_buffer_get result 1
[WARN] [18844549321048] jitter buffer avail: 2
[WARN] [18844549321048] jitter_buffer_get result 0
[WARN] [18844549321144] jitter buffer avail: 0
[WARN] [18844549321144] jitter_buffer_get result 1
[WARN] [18844549323703] jitter buffer avail: 0
[WARN] [18844549323703] jitter_buffer_get result 1

if I multiply JITTER_BUFFER_MIN_PACKETS by frame_size for JITTER_BUFFER_SET_MARGIN:

[WARN] [18844964419785] jitter_buffer_get result 0
[WARN] [18844964438318] jitter buffer avail: 3
[WARN] [18844964438318] jitter_buffer_get result 0
[WARN] [18844964459607] jitter buffer avail: 3
[WARN] [18844964459607] jitter_buffer_get result 0
[WARN] [18844964478355] jitter buffer avail: 3
[WARN] [18844964478355] jitter_buffer_get result 0
[WARN] [18844964499529] jitter buffer avail: 3
[WARN] [18844964499529] jitter_buffer_get result 0
[WARN] [18844964518172] jitter buffer avail: 3
[WARN] [18844964518172] jitter_buffer_get result 0
[WARN] [18844964539348] jitter buffer avail: 3
[WARN] [18844964539348] jitter_buffer_get result 0
[WARN] [18844964557793] jitter buffer avail: 3
[WARN] [18844964557793] jitter_buffer_get result 0
[WARN] [18844964578942] jitter buffer avail: 3
[WARN] [18844964578942] jitter_buffer_get result 0
[WARN] [18844964600042] jitter buffer avail: 3
[WARN] [18844964600042] jitter_buffer_get result 0
[WARN] [18844964618636] jitter buffer avail: 3
[WARN] [18844964618636] jitter_buffer_get result 0
[WARN] [18844964639613] jitter buffer avail: 3
[WARN] [18844964639613] jitter_buffer_get result 0

@atoppi
Copy link
Member

atoppi commented Jan 8, 2024

By the way I also discovered possible incorrect usage of JITTER_BUFFER_SET_MARGIN in current implementation - its value should be the same unit as the frame_size (e.g. JITTER_BUFFER_MIN_PACKETS * frame_size), since its used as timestamp under the hood, not as count of packets.

@spscream yeah you're probably right, the comments in the speexdsp code are probably outdated/misleading.
Internally the macro sets a buffer_margin variable that is used as a RTP timestamp.

@spscream
Copy link
Author

spscream commented Jan 9, 2024

I've tried another approach here and it works in our case even with huge loses: master...spscream:janus-gateway:fix/am/audiobridge-cracks-fix-rnnoise#diff-476cfd26c102e4b3311b3b8e8f848bfeba283c1712a5f3f117254e0593a986f4R8959

@atoppi could you please look on it? If it is ok, i could refactor it and make another pr for you. Idea is the following:

  • if i got ret == JITTER_BUFFER_MISSING || ret == JITTER_BUFFER_INSERTION from jitter_buffer_get I insert concealed packets
  • if loss gap is huge I'm stopping to insert plc

I also tried to insert fec, but it seams it also leads to cracks

@atoppi
Copy link
Member

atoppi commented Jan 10, 2024

@spscream the diff is huge and includes unrelated changes, can you please provide a more readable patch?

@spscream
Copy link
Author

@spscream
Copy link
Author

spscream commented Feb 6, 2024

@atoppi hi! any thoughts?

@spscream
Copy link
Author

spscream commented Feb 6, 2024

If it is ok, I could make a new pr with my latest changes if it will be more convenient

@atoppi
Copy link
Member

atoppi commented Feb 6, 2024

@spscream If you're able to propose a cleaner version of the patch it would greatly help thanks.
I'll try to review this in the next few days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants