Packet loss / ENOBUFs with kqueue(2) and tap(4) on OpenBSD #374

Open
mato opened this issue Jun 28, 2019 · 12 comments
Labels
bug help wanted host/openbsd Applicable to OpenBSD hosts target/hvt Applicable to hvt target

Comments

@mato
Member

mato commented Jun 28, 2019

This issue concerns the work-in-progress multiple device support in #373. While testing with the newly added "dual interface" tests/test_net_2if on OpenBSD, I've found what seems to be a bug in the interaction between kqueue(2) and tap(4).

This is reproducible on OpenBSD 6.4 (under nested KVM) and OpenBSD 6.5 (on bare metal). The issue does not occur on FreeBSD, which uses identical calls to kqueue(2), or on Linux, which uses epoll(2).

To reproduce:

  1. Build the branch from #373, "(WIP) Implement support for multiple devices":
    ./configure.sh
    gmake
    
  2. Set up the TAP interfaces:
    doas tests/setup-tests.sh
    
  3. Start the service unikernel:
    doas tenders/hvt/solo5-hvt --net:service0=tap100 --net:service1=tap101 tests/test_net_2if/test_net_2if.hvt
    
  4. In another session, start a flood ping to the service0 interface:
    doas ping -f 10.0.0.2
    
  5. Observe that the flood ping is functioning correctly, with no packets dropped.
  6. In another session, start a normal ping to the service1 interface:
    doas ping 10.1.0.2
    
  7. Observe that, for each ping sent to service1, a packet is dropped by service0. This shows as a . in the first flood ping's output.
  8. Kill the normal ping to service1 and, in another session, start a flood ping to the service1 interface instead:
    doas ping -f 10.1.0.2
    
  9. Observe massive packet loss on both interfaces, and ping complaining about No buffer space available (ENOBUFS). Note that netstat -m shows a large (but not inordinately large) number of mbufs in use.

In an attempt to find the root cause of this issue, I've looked at the source to OpenBSD's if_tun.c and written a patch for solo5-hvt that dumps the IFQ_LEN() returned by the kqueue filter to userspace:
kqueue-diag.patch.txt.

Rebuilding the branch with this patch, and repeating up to step (7) above, note that for each ping sent to service1 (tap101, h=1), the queue length (q=) reported by the filter for service0 (tap100, h=0) increases by one and never decreases while the flood ping to service0 (tap100) is running. In fact, if we leave the normal ping from (6) running long enough for the queue length to reach 255, then we see that the flood ping (4) starts reporting ENOBUFS.
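As a back-of-the-envelope check on that observation, here is a toy model in Python. It is not kernel or Solo5 code. It assumes the tap100 queue reports ENOBUFS once its length reaches 255 (the value seen above), that the flood ping traffic itself is drained as fast as it arrives, and that each ping to service1 leaves exactly one packet stranded on the service0 queue. The names `QUEUE_LIMIT` and `pings_until_enobufs` are made up for illustration.

```python
QUEUE_LIMIT = 255  # assumption: queue length at which ENOBUFS was observed


def pings_until_enobufs(stranded_per_ping=1, queue_limit=QUEUE_LIMIT):
    """Count pings to service1 needed before service0's queue hits the limit.

    Models only the symptom: the queue length grows by `stranded_per_ping`
    for every ping sent to the other interface and never shrinks.
    """
    if stranded_per_ping <= 0:
        raise ValueError("stranded_per_ping must be positive")
    queue_len = 0
    pings = 0
    while queue_len < queue_limit:
        pings += 1
        queue_len += stranded_per_ping
    return pings


if __name__ == "__main__":
    pings = pings_until_enobufs()
    # ping(8) sends one packet per second by default, so pings == seconds.
    print(f"{pings} pings, about {pings // 60}m{pings % 60}s at one ping per second")
```

Under these assumptions the normal ping from (6) needs to run for a little over four minutes before the flood ping in (4) starts failing, which gives a rough idea of how long to leave the test running.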

My hypothesis is that there is something wrong in the interaction between kqueue(2) and tap(4), perhaps a race condition that is causing events to be dropped? Of course there may also be an implementation error in the kqueue(2) code in #373, but given that FreeBSD does not have this problem, and neither does the functionally equivalent Linux epoll(2) code, this seems unlikely.

@adamsteen I'd appreciate it if you could confirm that you can reproduce this. Not sure what to do about it -- my idea would be to first take test_net_2if.c and rewrite it as a conventional UNIX program by cut-n-pasting from the hvt tender, to see if the problem still occurs. If it still does, then we should report that as a bug (with the non-Solo5 test case) to the OpenBSD tech@ mailing list.

In the meantime, I'll probably just disable the two-interface test for OpenBSD CI in #373.

@mato mato added help wanted host/openbsd Applicable to OpenBSD hosts target/hvt Applicable to hvt target labels Jun 28, 2019
@mato mato self-assigned this Jun 28, 2019
@adamsteen
Contributor

@mato I have tested the scenario described and I too see the dropped packets. Issue confirmed.

I will set up a test_net_2if UNIX program and see what happens.

@mato
Member Author

mato commented Jul 2, 2019 via email

@adamsteen
Contributor

@mato I have set something up so far (not working yet), test_net_2if. I need to have a look at where I went wrong with yield (error: Yield returned false, but handles in set!).

Just doing a little bit now and then as I have time.

@adamsteen
Contributor

adamsteen commented Jul 4, 2019

@mato Using my very naive translation, I was able to reproduce this error with a UNIX program.

@mato
Member Author

mato commented Jul 4, 2019

@adamsteen I can also reproduce the behaviour on 6.5 with your UNIX version. So, I suggest asking on tech@ what people think and whether or not this is a bug. Perhaps include a link to this issue, and point out that the UNIX version is a quick-n-dirty hack for the purposes of having a standalone test case.

@adamsteen
Contributor

Here is the bug report sent to bugs@: https://marc.info/?l=openbsd-bugs&m=156229879107900&w=2

@mato
Member Author

mato commented Jul 5, 2019

@adamsteen Thanks for filing that. Is there some way I can get myself added to a Cc: list on that bug?

@adamsteen
Contributor

@mato I have Cc'd you on the bug report.

@mato
Member Author

mato commented Aug 12, 2019

@adamsteen Any progress on this? I've not seen any traffic from bugs@. Perhaps ask on tech@ instead?

@adamsteen
Contributor

@mato Sorry for the late reply, I have had limited internet access for the last few weeks (holidays). Once I have caught up on things, I will ask on tech@ and see what we can do.

@adamsteen
Contributor

@mato

I have sent an email to tech@: "Packet loss / ENOBUFs with kqueue(2) and tap(4)"

@mato mato added the bug label Sep 19, 2019
@adamsteen
Contributor

I have been reading a little further in tap(4), and found this tidbit:

Writes never block. If the protocol queue is full, the packet is dropped, a “collision” is counted, and ENOBUFS is returned.

See the 4th paragraph of the Description section in tap(4): https://man.openbsd.org/tap.4

I'm not sure exactly what that means or where to look, but I wanted to note it down.
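To make the quoted sentence concrete, here is a minimal Python sketch of a bounded queue whose writes never block. It only mirrors the wording of the man page (drop the packet, count a "collision", return ENOBUFS). It is not how if_tun.c is implemented, and `BoundedPacketQueue` and its fields are hypothetical names.

```python
import errno
import os
from collections import deque


class BoundedPacketQueue:
    """Toy stand-in for a protocol queue with a fixed length limit."""

    def __init__(self, limit):
        self.limit = limit
        self.packets = deque()
        self.collisions = 0  # packets dropped because the queue was full

    def write(self, packet):
        """Enqueue without blocking; fail with ENOBUFS if the queue is full."""
        if len(self.packets) >= self.limit:
            self.collisions += 1  # the packet is dropped, not queued
            raise OSError(errno.ENOBUFS, os.strerror(errno.ENOBUFS))
        self.packets.append(packet)
        return len(packet)


if __name__ == "__main__":
    q = BoundedPacketQueue(limit=2)
    for payload in (b"one", b"two", b"three"):
        try:
            q.write(payload)
        except OSError as exc:
            print("write failed:", exc.strerror, "collisions =", q.collisions)
```

If nothing ever drains the queue, every write after the limit is reached fails the same way, which at least resembles the ENOBUFS errors that ping reports once the queue length reaches 255.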

@mato mato removed their assignment May 14, 2020