This repository has been archived by the owner on Aug 11, 2023. It is now read-only.

MessageBufferPool in OutgoingMessageQueue could cause OOM #258

Open
kai-chen-ustc opened this issue Nov 3, 2017 · 6 comments

Comments

@kai-chen-ustc

The MessageBufferPool implementation used in OutgoingMessageQueue has an unbounded object pool, so it can cause an Out-Of-Memory error if you repeatedly publish large messages such as CompressedImage while the network cannot transfer them fast enough.
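To illustrate the failure mode, here is a minimal, self-contained sketch (a toy model, not the actual MessageBufferPool code): a pool that never discards buffers, combined with a publisher that outruns the network writer, keeps more and more buffers alive until the heap is exhausted.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative only: a simplified unbounded buffer pool, not rosjava's MessageBufferPool.
public class UnboundedPoolDemo {
  private final ConcurrentLinkedQueue<byte[]> pool = new ConcurrentLinkedQueue<byte[]>();

  // Returns a pooled buffer, or allocates a new one when the pool is empty.
  byte[] acquire() {
    byte[] buffer = pool.poll();
    return buffer != null ? buffer : new byte[64 * 1024]; // 64 KB stands in for a large message
  }

  // Every released buffer is kept forever; the pool never shrinks.
  void release(byte[] buffer) {
    pool.add(buffer);
  }

  public static void main(String[] args) {
    UnboundedPoolDemo demo = new UnboundedPoolDemo();
    List<byte[]> inFlight = new ArrayList<byte[]>();
    // The publisher outruns the writer: buffers are acquired faster than they are
    // released, so the total number of live buffers grows without bound. With real
    // image-sized messages this eventually exhausts the heap.
    for (int i = 0; i < 1000; i++) {
      inFlight.add(demo.acquire());         // serialize + hand off to a slow network write
      if (i % 10 == 0 && !inFlight.isEmpty()) {
        demo.release(inFlight.remove(0));   // only 1 in 10 writes completes in time
      }
    }
    System.out.println("Buffers still awaiting write completion: " + inFlight.size());
  }
}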

@jubeira

jubeira commented Nov 3, 2017

Hi @dragonjack,
Is that causing any issues for you? Do you suggest any particular fix for this problem?

@kai-chen-ustc
Author

Hi @jubeira,
Thanks for your prompt reply. Yes, this causes my app to crash: I am continuously publishing compressed images with quality set to 100, and it usually takes less than a minute for an OOM to occur.

@jubeira

jubeira commented Nov 7, 2017

I see.
Do you have a fix in mind? I haven't experienced this issue on my side, but I don't think I've ever tried to publish messages that extensively. If you send a PR, I'd be glad to review and test it.

@kai-chen-ustc
Author

Hi @jubeira,
Sorry for the late reply. I could not think of a good policy to solve this issue; the root cause appears to be this code in OutgoingMessageQueue.java:

private final class Writer extends CancellableLoop {
    @Override
    public void loop() throws InterruptedException {
      T message = deque.takeFirst();
      final ChannelBuffer buffer = messageBufferPool.acquire();
      serializer.serialize(message, buffer);
      if (DEBUG) {
        log.info(String.format("Writing %d bytes to %d channels.", buffer.readableBytes(),
            channelGroup.size()));
      }
      // Note that the buffer is automatically "duplicated" by Netty to avoid
      // race conditions. However, the duplicated buffer and the original buffer
      // share the same backing array. So, we have to wait until the write
      // operation is complete before returning the buffer to the pool.
      channelGroup.write(buffer).addListener(new ChannelGroupFutureListener() {
        @Override
        public void operationComplete(ChannelGroupFuture future) throws Exception {
          messageBufferPool.release(buffer);
        }
      });
    }
  }

If someone publishes large messages extensively, the pool grows larger and larger until an OOM is triggered.

Maybe we could limit the size of the outgoing message queue, but in that case new messages would be arbitrarily discarded and the client would not know. Do you have any idea how to solve this issue?
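One possible direction, sketched below only as an illustration (the DroppingBufferPool class, the MAX_POOLED constant, and the buffer size are made up here and not tested against rosjava), would be to cap the number of idle buffers the pool keeps, so that buffers released while the pool is already full simply become eligible for garbage collection:

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a size-capped buffer pool; names and the cap value are illustrative.
public class DroppingBufferPool {
  private static final int MAX_POOLED = 16;         // hypothetical cap on idle buffers
  private static final int BUFFER_SIZE = 256 * 1024;

  private final ConcurrentLinkedQueue<byte[]> idle = new ConcurrentLinkedQueue<byte[]>();
  private final AtomicInteger idleCount = new AtomicInteger();

  public byte[] acquire() {
    byte[] buffer = idle.poll();
    if (buffer != null) {
      idleCount.decrementAndGet();
      return buffer;
    }
    return new byte[BUFFER_SIZE];                    // allocate when the pool is empty
  }

  public void release(byte[] buffer) {
    // Keep at most MAX_POOLED idle buffers; anything beyond that is dropped
    // and reclaimed by the garbage collector instead of accumulating forever.
    if (idleCount.get() < MAX_POOLED) {
      idle.add(buffer);
      idleCount.incrementAndGet();
    }
  }
}

The check in release() is only approximate under heavy concurrency, but it keeps the pool from growing without bound.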

@jubeira

jubeira commented Nov 21, 2017

@dragonjack I would say that if you are trying to send large quantities of information and your channel has narrow bandwidth, you should start discarding messages. You are probably more into the details than I am at this point, but perhaps the available memory could be checked before trying to get a buffer from the pool; if there isn't enough, just discard the message and log a warning.

I know it's not ideal, but discarding the message and logging a warning is better than throwing an unhandled exception, at least in my opinion.
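A rough sketch of that idea, using only the standard Runtime API (the MemoryGuard class, the 0.9 threshold, and the surrounding print calls are hypothetical, not rosjava code):

// Sketch: drop a message instead of buffering it when the heap is nearly exhausted.
public final class MemoryGuard {
  private static final double MAX_HEAP_USAGE = 0.9;  // illustrative threshold

  // Returns true if it is safe to acquire another message buffer.
  public static boolean hasHeadroom() {
    Runtime rt = Runtime.getRuntime();
    long used = rt.totalMemory() - rt.freeMemory();  // bytes currently in use
    long max = rt.maxMemory();                       // hard heap limit (-Xmx)
    return (double) used / max < MAX_HEAP_USAGE;
  }

  public static void main(String[] args) {
    if (hasHeadroom()) {
      System.out.println("enough headroom: would acquire a buffer and write the message");
    } else {
      System.out.println("WARN: low memory, dropping outgoing message");
    }
  }
}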

@nileshsuri

I wanted to check whether there has been any update on this problem. Transporting image data over a low-bandwidth network still causes this OOM issue.
I did some digging, and I think the problem is rooted in the use of a fixed queue capacity for all message types, as shown by the following line in OutgoingMessageQueue.java:

private static final int DEQUE_CAPACITY = 16;

However, since a single image message is much larger than most other message types, this causes the system to crash after extensive memory buildup.
Ideally, the queue size should be a dynamically adjustable parameter that can be tuned to the type of data being published. (That's how ROS does it.)
If not, maybe we can just reduce the size of the queue as a temporary fix for the OOM issue.
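For reference, here is a minimal sketch of what a configurable, bounded queue could look like (the BoundedOutgoingQueue name and its drop-oldest policy are illustrative assumptions, not rosjava's actual deque implementation):

import java.util.ArrayDeque;

// Sketch of a bounded outgoing queue whose capacity is a constructor parameter
// rather than a hard-coded constant; when full, the oldest message is dropped.
public class BoundedOutgoingQueue<T> {
  private final ArrayDeque<T> deque = new ArrayDeque<T>();
  private final int capacity;

  public BoundedOutgoingQueue(int capacity) {
    this.capacity = capacity;  // e.g. small for images, larger for tiny messages
  }

  public synchronized void add(T message) {
    if (deque.size() == capacity) {
      deque.removeFirst();     // drop the oldest message to bound memory use
    }
    deque.addLast(message);
  }

  public synchronized T poll() {
    return deque.pollFirst();
  }
}

An image publisher could then use a small capacity (e.g. 2), while small, frequent messages keep a larger one.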
