Packets are dropped when net buffer is full even if receiver is trying to read #1341

Open
robgjansen opened this issue May 11, 2021 · 14 comments · May be fixed by #2524
Labels: Component: Main (Composing the core Shadow executable), Type: Bug (Error or flaw producing unexpected results)

Comments

@robgjansen
Member

Suppose a UDP buffer fits 10 packets. The receiver called read and is waiting for incoming packets. The sender sent 100 packets all at the same time.

Those 100 packets arrive at the receiver all at t=0. After the first one gets added to the buffer, we notice that the receiver needs to be woken up to read. We add an event to the queue to wake up the receiver. But if it gets added after those other 99 packet events that are already in the queue, then:

  • the first 10 packets will make it to the buffer
  • the buffer will be full
  • the next 90 packets will be dropped
  • we wake up the receiver so they read those 10 packets

Clearly, we need a way to wake up the receiver in between some number of packet receive events, so they get a chance to read out some of the packets, so we don't drop any of them.
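
For concreteness, here is a toy, self-contained sketch of the ordering described above (illustrative C, not Shadow code; the buffer capacity and packet count are just the numbers from the example):

#include <stdio.h>

#define BUF_CAPACITY 10   // socket buffer holds 10 packets
#define PACKETS_SENT 100  // all arrive at the same simulated time t=0

int main(void) {
    int buffered = 0, dropped = 0;

    // The 100 packet-arrival events are already queued for t=0, and the
    // "wake up the receiver" event gets appended after them.
    for (int i = 0; i < PACKETS_SENT; i++) {
        if (buffered < BUF_CAPACITY) {
            buffered++;  // packet fits in the socket buffer
        } else {
            dropped++;   // buffer is full, so the packet is dropped
        }
    }

    // Only now does the wake-up event run, and the receiver drains the buffer.
    printf("delivered=%d dropped=%d\n", buffered, dropped);  // prints 10 and 90
    return 0;
}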


We have verified that an infinite size buffer fixes the problem.

One possible way to deal with this is for each host to keep 2 separate queues, one for packets (which other hosts can insert into), and one for local host tasks (which a host can only insert into itself). Then at a specific nanosecond t and packet processing threshold p, we:

  • process all tasks with time=t
  • process p packets with time=t
  • repeat until no more packets or tasks exist with time=t

This gives us better control over executing simulator tasks vs processing new packets.
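
A minimal toy sketch of what that interleaving could look like (plain C, not Shadow code; the queues are reduced to counters, and the threshold and buffer sizes are illustrative):

#include <stdio.h>

#define P 10              // packet processing threshold per round (illustrative)
#define BUF_CAPACITY 10   // receiver's socket buffer, in packets

int main(void) {
    int packets_at_t = 100;  // packet events queued for nanosecond t
    int tasks_at_t = 0;      // local host tasks queued for nanosecond t
    int buffered = 0, delivered = 0, dropped = 0;

    while (packets_at_t > 0 || tasks_at_t > 0) {
        // 1) process all tasks with time == t (e.g., receiver wake-ups)
        while (tasks_at_t > 0) {
            tasks_at_t--;
            delivered += buffered;  // the blocked reader drains its buffer
            buffered = 0;
        }

        // 2) process at most P packets with time == t
        int delivered_any = 0;
        for (int i = 0; i < P && packets_at_t > 0; i++) {
            packets_at_t--;
            if (buffered < BUF_CAPACITY) {
                buffered++;
                delivered_any = 1;
            } else {
                dropped++;
            }
        }

        // A delivery makes the blocked reader runnable again, so queue a
        // wake-up task for the same nanosecond t.
        if (delivered_any) {
            tasks_at_t++;
        }
    }

    printf("delivered=%d dropped=%d\n", delivered, dropped);  // prints 100 and 0
    return 0;
}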

robgjansen added the Type: Bug and Component: Main labels May 11, 2021
robgjansen added this to To do in Release v2.0 via automation May 11, 2021
@sporksmith
Contributor

Another solution we talked about is to smear on the sender side; the sender shouldn't actually be able to send all 100 packets within a single nanosecond, should it?

@robgjansen
Member Author

I think the sender-side approach breaks when a large number of senders send packets to the same receiver socket all at the same instant, since with UDP you don't have a socket buffer per "connection pair".

(For TCP that would only happen on listening sockets, and then those packets don't enter the read queue (the control packets to create the connection don't contain user data) so we should be OK.)

@sporksmith
Contributor

I think the sender-side approach breaks when a large number of senders send packets to the same receiver socket all at the same instant, since with UDP you don't have a socket buffer per "connection pair".

In some sense that seems realistic - if many senders each send at the receiver's full capacity simultaneously, some packets are going to be dropped, right?

OTOH in a real network there'd be some latency jitter, and maybe some "smoothing" happening where packets from multiple senders arrive at a single upstream router and are buffered before being retransmitted, making it less likely that they'd all arrive "simultaneously" in practice. Maybe your proposal is an easier solution than trying to model those things :)

@sporksmith
Contributor

sporksmith commented May 11, 2021

and maybe some "smoothing" happening where packets from multiple senders arrive at a single upstream router and are buffered before being retransmitted

Maybe just using a bigger "receive" buffer would work, since it's actually modeling not just the local buffer, but all of the upstream router buffers? (EDIT: and currently the sender's 'send' buffer, since we're not buffering/smearing on the sender side)

@robgjansen
Member Author

robgjansen commented May 11, 2021

We do model the upstream routers and upstream router buffering in src/main/routing/router*, separately from modeling the sender and receiver network rates in shadow/src/main/host/network_interface.*. I think the issue here is independent of the upstream router buffer or the network interface buffer.

We did verify that bigger socket buffers would help. In TCP we have TCP auto-tuning to dynamically adjust the buffer size based on the number of packets that could possibly arrive from the network at any time. Because of that auto-tuning, I doubt this problem occurs on our TCP connections. I don't think UDP sockets have this feature though:

socket buffer size: For UDP, buffer size is not related to RTT the way TCP is, but the defaults are still not large enough. Setting the socket buffer to 4M seems to help a lot in most cases

More generally, I think the problem is an artifact of how we choose to line things up in our event queue for a given nanosecond, and we could certainly line them up differently per my suggestion. The network interface modeling won't allow us to exceed our configured network rate in any case, so we don't have to worry about that. A nice side effect of my suggestion is that the Shadow network sockets may actually need to store fewer packet pointers (and less data) overall, since we'll be getting those out to the application (and freeing them) much sooner.
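
For reference, the knob the quoted advice is talking about is the standard SO_RCVBUF socket option (nothing Shadow-specific); an application would request a larger UDP receive buffer roughly like this:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    int requested = 4 * 1024 * 1024;  // 4 MiB, as suggested in the quote
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0) {
        perror("setsockopt(SO_RCVBUF)");
        return 1;
    }

    // Linux doubles the requested value (for bookkeeping overhead) and caps
    // it at net.core.rmem_max, so read back what was actually granted.
    int actual = 0;
    socklen_t len = sizeof(actual);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);
    printf("receive buffer is now %d bytes\n", actual);
    return 0;
}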

robgjansen removed this from To do in Release v2.0 Jul 1, 2021
robgjansen added this to To do in Release 2.0.0-pre.2 via automation Jul 1, 2021
robgjansen removed this from To do in Release 2.0.0-pre.2 Jul 8, 2021
robgjansen added this to To do in Release v2.0 via automation Jul 8, 2021
robgjansen removed this from To do in Release v2.0 Sep 16, 2021
robgjansen added this to To do in Future Release via automation Sep 16, 2021
robgjansen removed this from To do in Future Release Oct 15, 2021
@stevenengler
Contributor

We add an event to the queue to wake up the receiver. But if it gets added after those other 99 packet events that are already in the queue, then:

Just documenting here that whether the new event gets added before or after the other 99 packets depends on the relative host IDs. If the receiving host has a host ID smaller than the host ID from the packets' source host, then the new event will be added before the other 99 packet events. Otherwise the new event is added after the other 99 packet events. This isn't ideal since the network behaviour depends on the host order, and changing the hostnames can significantly change the network throughput and drop rate in some cases.
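
To illustrate the effect, suppose same-time events are tie-broken by the ID of the host that inserted them; this is a hypothetical sketch of the behaviour described above, not Shadow's actual event key:

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t time;      // simulated nanosecond
    uint32_t src_host;  // ID of the host that inserted the event
} EventKey;

// Order by time, then break ties by the inserting host's ID.
static int eventkey_compare(EventKey a, EventKey b) {
    if (a.time != b.time) return a.time < b.time ? -1 : 1;
    if (a.src_host != b.src_host) return a.src_host < b.src_host ? -1 : 1;
    return 0;
}

int main(void) {
    EventKey wakeup = {.time = 0, .src_host = 1};  // inserted by the receiving host
    EventKey packet = {.time = 0, .src_host = 2};  // inserted by the sending host

    // With receiver ID < sender ID the wake-up sorts first; swap the IDs and
    // it sorts after all of the packet events instead.
    printf("wake-up runs %s the packets\n",
           eventkey_compare(wakeup, packet) < 0 ? "before" : "after");
    return 0;
}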

@stevenengler
Contributor

stevenengler commented Oct 29, 2022

Clearly, we need a way to wake up the receiver in between some number of packet receive events, so they get a chance to read out some of the packets, so we don't drop any of them.

I'm not sure I'm convinced anymore :) If the application is set up to send for example 100,000 bytes at a time, but only uses a receive buffer size of 10,000 bytes, that seems like a problem with the application, not Shadow. It would be nice to know how this example application would perform in the real world. Would the OS really be able to schedule the process fast enough for it to handle all the packets, or would it have the same issue where all the packets arrive before the OS has a chance to schedule the process and it would drop some? I agree that it seems like network jitter would play a role here, and would maybe be a more realistic way of solving this.

Then at a specific nanosecond t and packet processing threshold p, we:

  • process all tasks with time=t
  • process p packets with time=t
  • repeat until no more packets or tasks exist with time=t

I don't see how we could choose a value of p other than 1. If p were chosen as 2, the receiver had a receive buffer size of 3000 bytes, and the sender sent two packets of size 2000 bytes each, we would still drop a packet here. Or are we okay with still dropping packets in this scenario, and we only want to reduce the number of packets we drop? If so, how do we choose p? We could technically peek at the next packet and look forward into the network interface's socket buffers to see if there's room for it, and if not start processing local tasks before trying to process packets again, but this seems a little strange.

@robgjansen
Member Author

I agree that it seems like network jitter would play a role here, and would maybe be a more realistic way of solving this.

I agree that it would be more realistic, but it's not a viable way to solve the problem without making runtime performance significantly worse.

Conceptually, each host currently organizes its packet sending into "batches" - it sends all the packets it is allowed to send for one millisecond (enforced by refilling our bandwidth token buckets every 1 millisecond). We get much better runtime performance by batching this way.

The alternative suggestion as I understand it is to basically reduce the batch time from one millisecond down to the time it takes to send a single packet (based on the host's configured bandwidth) - in other words, eliminating batched sending. I agree that this would introduce a network jitter effect and be more realistic, and in fact this is how I first implemented the sending behavior over 10 years ago. However, without batching, the number of send events required to send the same number of packets would significantly increase and very likely reduce runtime performance (which is why I moved to batch sending in the first place).
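
For a rough sense of the cost (illustrative arithmetic, not a measurement): a host configured at 1 Gbit/s sends about 125,000 bytes per millisecond, i.e. roughly 80 full-size 1500-byte packets. With 1 ms batching those ~80 packets are handled together when the token bucket refills; without batching, each packet would need its own send event at its own nanosecond, multiplying the number of events the scheduler must process per host by roughly the per-batch packet count.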

I don't think it's even worth checking, but if you wanted to, you could reduce the batch time by reducing the token refill interval in this function:

static TokenBucket* _networkinterface_create_tb(uint64_t bwKiBps) {
    uint64_t refill_interval_nanos = SIMTIME_ONE_MILLISECOND;
    uint64_t refill_size = bwKiBps * 1024 / 1000;

    // The `CONFIG_MTU` part represents a "burst allowance" which is common in
    // token buckets. Only the `capacity` of the bucket is increased by
    // `CONFIG_MTU`, not the `refill_size`. Therefore, the long term rate limit
    // enforced by the token bucket (configured by `refill_size`) is not
    // affected much.
    //
    // What the burst allowance ensures is that we don't lose tokens that are
    // unused because we don't fragment packets. If we set the capacity of the
    // bucket to exactly the refill size (i.e., without the `CONFIG_MTU` burst
    // allowance) and there are only 1499 tokens left in this sending round, a
    // full packet would not fit. The next time the bucket refills, it adds
    // `refill_size` tokens but in doing so 1499 tokens would fall over the top
    // of the bucket; these tokens would represent wasted bandwidth, and could
    // potentially accumulate in every refill interval leading to a
    // significantly lower achievable bandwidth.
    //
    // A downside of the `CONFIG_MTU` burst allowance is that the sending rate
    // could possibly become "bursty" with a behavior such as:
    // - interval 1: send `refill_size` + `CONFIG_MTU` bytes, sending over the
    //   allowance by 1500 bytes
    // - refill: `refill_size` token gets added to the bucket
    // - interval 2: send `refill_size` - `CONFIG_MTU` bytes, sending under the
    //   allowance by 1500 bytes
    // - refill: `refill_size` token gets added to the bucket
    // - interval 3: send `refill_size` + `CONFIG_MTU` bytes, sending over the
    //   allowance by 1500 bytes
    // - repeat
    //
    // So it could become less smooth and more "bursty" even though the long
    // term average is maintained. But I don't think this would happen much in
    // practice, and we are batching sends for performance reasons.
    uint64_t capacity = refill_size + CONFIG_MTU;

    debug("creating token bucket with capacity=%" G_GUINT64_FORMAT " refill_size=%" G_GUINT64_FORMAT
          " refill_interval_nanos=%" G_GUINT64_FORMAT,
          capacity, refill_size, refill_interval_nanos);

    return tokenbucket_new(capacity, refill_size, refill_interval_nanos);
}
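
If someone did want to try that experiment, a minimal (untested) tweak would be to scale the first two assignments in the function above by the same factor so the long-term rate stays at the configured bandwidth, e.g. for a 10 microsecond batch:

    // Untested sketch: divide both values by the same factor (here 100) so the
    // long-term rate stays at bwKiBps. Integer truncation matters for small
    // bandwidths, so a real change would need more care.
    uint64_t refill_interval_nanos = SIMTIME_ONE_MILLISECOND / 100;  // 10 microseconds
    uint64_t refill_size = bwKiBps * 1024 / (1000 * 100);            // bytes per 10 microseconds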

@robgjansen
Member Author

robgjansen commented Oct 30, 2022

I don't see how we could choose a value of p other than 1...

It does seem to me that the minimum buffer size required to enable smooth sending behavior in Shadow would be larger than it would be in Linux, because Shadow is operating with a batched sending algorithm and Linux is operating in real time.

So I think we could choose any p as long as we also ensure that those p packets will fit into the socket buffer by default.

@stevenengler
Contributor

So I think we could choose any p as long as we also ensure that those p packets will fit into the socket buffer by default.

We allow for packets up to 64 KiB with UDP, so we would need the receive buffer size to be at least 64*p KiB. But what should we do if the application sets its own non-default buffer size? For example, if p is 5 and an application sets a receive buffer size of 90 KiB, should we set the receive buffer size to something like 90*5 = 450 KiB?

@robgjansen
Member Author

Changing the buffer size like that does seem odd. The way I would implement it is dynamic, i.e., don't use a fixed p value everywhere but instead figure out when to switch to the task queue based on the current state of the buffers. How about something like:

// Assumes all packets and tasks have an equal time
while !packet_queue.is_empty() // or we hit the end of the scheduler round
  let pkt = packet_queue.next()
  let space_remaining = socket.push(pkt)
  if space_remaining < PROTO_PACKET_SIZE_MAX // protocol-specific
    while !task_queue.is_empty()
      run_task(task_queue.next()) // Give the application a chance to read

Then we just need to make sure the socket buffers can always hold at least one packet (which we should probably be doing in any case). So if the application tries to set a buffer size of b bytes, we set the actual size to
size <-- MAX(b, PROTO_PACKET_SIZE_MAX).

Does this work?

@stevenengler
Contributor

The way I would implement it is dynamic, i.e., don't use a fixed p value everywhere but instead figure out when to switch to the task queue based on the current state of the buffers. How about something like:

Okay this seems like it would work. Right now there's no way for the code to get the socket from the event, since the event has a callback and the callback passes the packet to the network interface. We'll need to change the host's event system to deal with packets and the network interface directly and only use events for local tasks.

Then we just need to make sure the socket buffers can always hold at least one packet (which we should probably be doing in any case). So if the application tries to set a buffer size of b bytes, we set the actual size to
size <-- MAX(b, PROTO_PACKET_SIZE_MAX).

I don't see why we'd need to do this? If the buffer size is smaller than the packet size I think we want to just drop the packet, and I think that's compatible with the above code.

@robgjansen
Member Author

If the buffer size is smaller than the packet size I think we want to just drop the packet, and I think that's compatible with the above code.

OK, sounds good!

stevenengler added a commit to stevenengler/shadow that referenced this issue Nov 7, 2022
@robgjansen
Member Author

Once we get this working, we should check if we want to increase the speed threshold in the tgen tests from 85% up to something higher (e.g., 90%).
