[BUG] packets are not being delivered to the application using default SRT_LIVE with a broadcast bonded connection in localhost network #2871

Open
suhrm opened this issue Feb 12, 2024 · 7 comments
Labels: [core], help wanted, Type: Bug
suhrm commented Feb 12, 2024

Describe the bug
I have a set of tests that covers integration of the SRT protocol, using the non-blocking API, in various scenarios with a larger program.
The test produces a fixed number of packets, sends them through the SRT integration, and checks whether the expected number is received on the other side over a localhost connection, i.e. internally on the same machine.

Here, I have observed that sometimes when using a broadcast bonding connection with the default SRT_LIVE preset, not all packets are getting through to the application on the other side. This is not a deterministic bug but rather sporadic, which means that most of the time the test passes without issue. Furthermore, if I disable the TLPKT_DROP and TSBPD, the issue goes away.

In a similar test with only a single connection using the same scaffolding around the SRT API, this issue has not been observed even once.
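A check like the one described above can report which packets went missing rather than only how many arrived. The following is a minimal sketch, assuming each test payload carries an application-level packet ID; the issue does not say so, and the `delivery_report` helper is hypothetical, not part of the SRT API.

```python
from collections import Counter


def delivery_report(sent_ids, received_ids):
    """Compare the IDs that were sent with the IDs the receiver delivered.

    Assumption: every test payload carries a unique application-level ID.
    """
    sent = list(sent_ids)
    received = Counter(received_ids)
    return {
        # Sent but never delivered to the application.
        "missing": [i for i in sent if received[i] == 0],
        # Delivered more than once (should not happen after group deduplication).
        "duplicates": sorted(i for i, n in received.items() if n > 1),
        # Delivered but never sent by this test.
        "unexpected": sorted(set(received) - set(sent)),
    }


if __name__ == "__main__":
    report = delivery_report(range(10), [0, 1, 2, 3, 3, 5, 6, 7, 8])
    print(report)
```

Knowing whether the missing packets are at the start, the end, or scattered through the stream may help narrow down whether they were dropped or simply never signalled as readable.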

To Reproduce
The larger program is not publicly available, but all I have to do is rerun the same test until it fails, so if further logs are needed, please let me know.

Expected behavior
I would expect this kind of test to always pass, as it runs over localhost.

Desktop (please provide the following information):

  • OS: Linux
  • SRT Version / commit ID: 1.5.3

Additional context
From what I can see in the SRT-related logging, when the test fails I observe fewer "BEGIN ASYNC MODE" messages (e.g. 1) than in the cases where it passes (e.g. 10). So maybe I am using the API incorrectly somehow?

Thanks in advance.
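The comparison of "BEGIN ASYNC MODE" counts between a passing and a failing run can be automated. This is a minimal sketch that only assumes the marker text appears verbatim somewhere in each relevant log line; the helper names and the sample log lines in the usage note are hypothetical.

```python
import sys

MARKER = "BEGIN ASYNC MODE"


def count_marker(lines, marker=MARKER):
    """Return how many log lines contain the marker substring."""
    return sum(1 for line in lines if marker in line)


def compare_runs(working_lines, hanging_lines, marker=MARKER):
    """Count the marker in both runs so the difference is easy to see."""
    return {
        "working": count_marker(working_lines, marker),
        "hanging": count_marker(hanging_lines, marker),
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python count_marker.py <working log file> <hanging log file>
    with open(sys.argv[1], errors="replace") as working, \
            open(sys.argv[2], errors="replace") as hanging:
        print(compare_runs(working, hanging))
```

For example, `compare_runs(["connect: BEGIN ASYNC MODE"] * 10, ["connect: BEGIN ASYNC MODE"])` returns `{"working": 10, "hanging": 1}`.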

@suhrm added the Type: Bug label Feb 12, 2024
@ethouris
Collaborator

A pcap would be helpful. Debug logs as well, but I'd limit them to selected FAs (functional areas) to avoid overloading the application; I'll try to determine which ones later. Collecting statistics could also be helpful to see what the packet drops look like (some packets might already be dropped by the sender).

With TSBPD turned off, TLPKTDROP is ignored anyway, as is e.g. LATENCY.

The log with "BEGIN ASYNC MODE" is reported from the connecting function to declare that it is using the non-blocking connection mode, that is, it returns immediately.

@suhrm
Author

suhrm commented Feb 13, 2024

> A pcap would be helpful. Debug logs as well, but I'd limit them to selected FAs (functional areas) to avoid overloading the application; I'll try to determine which ones later. Collecting statistics could also be helpful to see what the packet drops look like (some packets might already be dropped by the sender).

I have attached two pcap files: one where the test passes and one where the test fails (hangs).
I am not too familiar with the log filters; which ones should I apply to provide the information needed?
srt_working.txt
srt_hanging.txt
I have changed the file extension from pcap to txt to be able to upload them to the GitHub issue.

> With TSBPD turned off, TLPKTDROP is ignored anyway, as is e.g. LATENCY.
>
> The log with "BEGIN ASYNC MODE" is reported from the connecting function to declare that it is using the non-blocking connection mode, that is, it returns immediately.

@ethouris
Collaborator

Note that pcap files should have "pcap" extension, this one was misinterpreted as a text file. Whatever, I got them.
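A renamed capture can still be recognized by its first four bytes, whatever the file extension says. The sketch below checks them against the well-known pcap and pcapng magic numbers; the `capture_format` helper is hypothetical and inspects only the header, not the rest of the file.

```python
import sys

# Magic numbers as they appear on disk (first four bytes of the file).
CAPTURE_MAGICS = {
    bytes.fromhex("a1b2c3d4"): "pcap (microsecond timestamps, big-endian)",
    bytes.fromhex("d4c3b2a1"): "pcap (microsecond timestamps, little-endian)",
    bytes.fromhex("a1b23c4d"): "pcap (nanosecond timestamps, big-endian)",
    bytes.fromhex("4d3cb2a1"): "pcap (nanosecond timestamps, little-endian)",
    bytes.fromhex("0a0d0d0a"): "pcapng",
}


def capture_format(header: bytes):
    """Return a description of the capture format, or None if unrecognized."""
    return CAPTURE_MAGICS.get(bytes(header[:4]))


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python capture_format.py <capture file>
    with open(sys.argv[1], "rb") as f:
        print(capture_format(f.read(4)) or "not a pcap/pcapng capture")
```

If the header matches, renaming the file back to a `.pcap` extension should be enough for capture tools to open it.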

I don't understand anything from this. The "working" capture shows 10 packets being sent, and it is recorded from the handshake up to the shutdown. The transmission was so slow that an ACK was received after every packet. Not sure why.

The "hanging" version looks exactly the same, except that there aren't any shutdown packets and it looks like it was cut off after the 10th packet.

If your test contains such a slow transmission, I think you can just as well turn on debug logs without filtering. I should be able to determine something from them for the "hanging" case.

@suhrm
Author

suhrm commented Feb 14, 2024

> Note that pcap files should have "pcap" extension, this one was misinterpreted as a text file. Whatever, I got them.
>
> I don't understand anything from this. The "working" capture shows 10 packets being sent, and it is recorded from the handshake up to the shutdown. The transmission was so slow that an ACK was received after every packet. Not sure why.

The rate at which I am generating packets is quite low in this specific case, i.e. one every 10 ms. However, the hang still occurs even if I change the interval to e.g. 1 ms.
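The send interval mentioned above may also explain the one-ACK-per-packet pattern in the capture. The arithmetic sketch below assumes that SRT sends its periodic full ACK roughly every 10 ms; that period is an assumption stated here, not something confirmed in this thread, and `packets_per_ack` is a hypothetical helper.

```python
# Assumption: the receiver sends a periodic full ACK about every 10 ms.
ACK_PERIOD_MS = 10.0


def packets_per_ack(send_interval_ms, ack_period_ms=ACK_PERIOD_MS):
    """Average number of data packets covered by one periodic ACK."""
    return ack_period_ms / send_interval_ms


if __name__ == "__main__":
    for interval_ms in (10.0, 1.0):
        print(interval_ms, packets_per_ack(interval_ms))
```

Under that assumption, one packet every 10 ms gives about one packet per ACK, while one packet every 1 ms gives about ten.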

> The "hanging" version looks exactly the same, except that there aren't any shutdown packets and it looks like it was cut off after the 10th packet.

That is because the test itself times out. If needed, I can provide a longer trace.

> If your test contains such a slow transmission, I think you can just as well turn on debug logs without filtering. I should be able to determine something from them for the "hanging" case.

I have attached log files for both scenarios:
hanging_log.txt
working_log.txt

These are wrapped in our own logging format but should contain the full log string from SRT as well.

@maxsharabayko
Collaborator

From the description, the issue looks like the one fixed in #2766. But you state that you tested SRT v1.5.3, which already contains the fix.

@maxsharabayko added the [core] label Apr 18, 2024
@maxsharabayko added this to the Backlog milestone Apr 18, 2024
@maxsharabayko
Collaborator

@suhrm

> Here, I have observed that sometimes when using a broadcast bonding connection with the default SRT_LIVE preset, not all packets are getting through to the application on the other side. This is not a deterministic bug but rather sporadic, which means that most of the time the test passes without issue. Furthermore, if I disable the TLPKT_DROP and TSBPD, the issue goes away.

I would suggest logging the epoll events set by SRT. You can find them in the code as .m_EPoll.update_usock(..). Likely the packets are in the RCV buffer, but the application does not receive a notification from epoll.

@maxsharabayko added the help wanted label Apr 18, 2024
@suhrm
Author

suhrm commented Apr 29, 2024

I will see if I can collect the logs this week.

