
OOM under no load (dump added) #3282

Open · febael opened this issue Aug 2, 2018 · 7 comments

febael commented Aug 2, 2018

Hello ladies and gentlemen,

We are getting OOM errors on the server (dump file attached).

  • We get these errors unexpectedly, at random times. They tend to occur less frequently since our switch from version 7.x to 8, but we still had them several times, at intervals of several days, under no load.
  • Nothing special is needed to trigger it: just wait with 2 idle local nodes connected to the server. We do run a few jobs, but it does not seem related to that.
  • We run version 8.0.0 on Ubuntu, in a Docker container.
  • You can check the dump file. What is interesting is a 1.1 GB byte array received on the network, and the OOM occurring while trying to allocate a new byte array to store it. Delving into the array shows it starts with "/ HTTP/1.0 \13\10\13\10" followed by more than a billion zero bytes (see the worked example after the stack trace). The call stack of the thread at the time of the OOM is:

"New I/O worker #2" prio=5 tid=16 RUNNABLE
at java.lang.OutOfMemoryError.&lt;init&gt;(OutOfMemoryError.java:48)
at org.jboss.netty.buffer.HeapChannelBuffer.&lt;init&gt;(HeapChannelBuffer.java:42)
at org.jboss.netty.buffer.BigEndianHeapChannelBuffer.&lt;init&gt;(BigEndianHeapChannelBuffer.java:34)
at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
at org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
at org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
at org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder.getBuffer(PNPServerFrameDecoder.java:194)
at org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder.decode(PNPServerFrameDecoder.java:138)
at org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder.callDecode(PNPServerFrameDecoder.java:104)
Local Variable: java.net.InetSocketAddress#157
at org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder.handleUpstream(PNPServerFrameDecoder.java:86)
Local Variable: org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder#3
Local Variable: org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext#20
Local Variable: org.jboss.netty.buffer.BigEndianHeapChannelBuffer#6
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
Local Variable: org.jboss.netty.channel.DefaultChannelPipeline#6
Local Variable: org.jboss.netty.channel.UpstreamMessageEvent#1
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
Local Variable: org.jboss.netty.channel.socket.nio.NioAcceptedSocketChannel#3
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
Local Variable: sun.nio.ch.SelectionKeyImpl#40
Local Variable: java.util.HashMap$KeyIterator#1
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
Local Variable: sun.nio.ch.EPollSelectorImpl#4
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
Local Variable: org.jboss.netty.channel.socket.nio.NioWorker#3
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
Local Variable: java.lang.String#8438
Local Variable: java.lang.String#8439
Local Variable: org.jboss.netty.util.ThreadRenamingRunnable#5
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Local Variable: org.jboss.netty.util.internal.DeadLockProofWorker$1#4
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Local Variable: java.util.concurrent.ThreadPoolExecutor$Worker#5
at java.lang.Thread.run(Thread.java:748)
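
For what it's worth, one possible reading of those numbers (an assumption about the PNP wire format, not verified against the ProActive sources): if the decoder reads a 4-byte big-endian length prefix, then an ordinary HTTP request such as "GET / HTTP/1.0" reaching the PNP port would have its first four bytes "GET " interpreted as a frame length of 0x47455420 = 1,195,725,856 bytes, roughly the 1.1 GB seen, and the rest of the request line ("/ HTTP/1.0" plus CR LF CR LF) would end up at the start of the freshly allocated, otherwise zero-filled array. A minimal sketch of that arithmetic:

// Hedged illustration only: assumes the PNP frame format begins with a
// 4-byte big-endian length prefix (NOT verified against the ProActive code).
// It just shows the arithmetic behind the 1.1 GB figure.
public class MisreadLengthPrefix {
    public static void main(String[] args) {
        byte[] httpRequest = "GET / HTTP/1.0\r\n\r\n"
                .getBytes(java.nio.charset.StandardCharsets.US_ASCII);

        // First four bytes are 'G', 'E', 'T', ' ' -- read as a big-endian int.
        int frameLength = ((httpRequest[0] & 0xFF) << 24)
                        | ((httpRequest[1] & 0xFF) << 16)
                        | ((httpRequest[2] & 0xFF) << 8)
                        |  (httpRequest[3] & 0xFF);

        System.out.printf("announced frame length = %d bytes (~%.2f GiB)%n",
                frameLength, frameLength / (1024.0 * 1024.0 * 1024.0));
        // Prints 1195725856 bytes (~1.11 GiB). The remainder of the request
        // line ("/ HTTP/1.0" + CR LF CR LF) would then sit at the start of the
        // freshly allocated, otherwise zero-filled buffer -- which matches the
        // byte array seen in the heap dump.
    }
}

If that reading is right, it would also fit the "no load" observation: any stray HTTP probe (a health check or port scanner) reaching the PNP port would be enough to trigger the huge allocation.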

tobwiens (Contributor) commented Aug 3, 2018

@febael Thank you very much for reporting this issue.

Your finding is very interesting. That message received on the network looks faulty, containing essentially nothing but zeros.

How much maximum heap memory do you give the Java virtual machine when you start it?

@fviale @luispinedamx What do you think about this memory issue?

tobwiens (Contributor) commented Aug 3, 2018

@febael Sadly I cannot analyze your dump file. I downloaded it twice but I get the following error:
java.io.IOException: Unrecognized magic number: 529205248
    at com.sun.tools.hat.internal.parser.Reader.readFile(Reader.java:94)
    at com.sun.tools.hat.Main.main(Main.java:159)

febael (Author) commented Aug 3, 2018

@tobwiens hi,

There is no setting for the min/max heap memory.
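
(For reference: with no -Xmx, the JVM picks a platform default, typically a quarter of the physical memory it detects, and depending on the JVM version it may not take the Docker container's memory limit into account. A minimal sketch to read the limit the JVM actually chose, using only Runtime.maxMemory():

// Hedged sketch: prints the heap ceiling the running JVM actually applied.
public class PrintMaxHeap {
    public static void main(String[] args) {
        long max = Runtime.getRuntime().maxMemory();
        System.out.printf("max heap = %d bytes (~%.2f GiB)%n",
                max, max / (1024.0 * 1024.0 * 1024.0));
    }
}

The same value is also visible in jvisualvm's Monitor tab, or via java -XX:+PrintFlagsFinal -version under MaxHeapSize.)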

The original huge dump file has the expected magic number:
xxd java_pid6.hprof | head -1
00000000: 4a41 5641 2050 524f 4649 4c45 2031 2e30 JAVA PROFILE 1.0

Running the same on the tar.gz file I get:
xxd dump.tar.gz | head -1
00000000: 1f8b 0800 37e4 625b 0003 ec7c 7974 1377 ....7.b[...|yt.w
echo $(( 16#1f8b0800 ))
529205248

So I assume you tried to directly open the tar.gz file :)
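
A small Java equivalent of that check (a hypothetical helper, it just reads the first four bytes of the file given as argument):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical helper: reads the first four bytes and reports whether the file
// looks like a gzip archive (0x1f8b...) or a plain HPROF dump ("JAVA"...).
public class MagicNumber {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            int magic = in.readInt(); // big-endian, same byte order as the xxd output above
            System.out.printf("magic = 0x%08x (%d)%n", magic, magic);
            if ((magic >>> 16) == 0x1f8b) {
                System.out.println("looks like gzip -- extract it first");
            } else if (magic == 0x4a415641) { // ASCII "JAVA"
                System.out.println("looks like an HPROF dump");
            }
        }
    }
}

On the tar.gz it would print 529205248, the exact value in the jhat error above.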

fviale (Member) commented Aug 3, 2018

Hello @febael,

I analysed the heap dump you provided using Eclipse Memory Analyzer and it does not show anything special. As you said, the scheduler seems to be quite empty; no object occupies an abnormal amount of memory (the biggest retainer is the JVM class loader, but nothing alarming).

What you describe regarding the 1 GB PNP message is quite surprising though. A few questions:

  • How did you manage to decode the network message? In particular, since its huge size triggered the out-of-memory error, how did you manage to analyse it retroactively?
  • While this OOM appeared, were you really doing nothing at the scheduler level? There was no task running? No file transfer in progress?
  • Do you see any unusual error in the scheduler logs (scheduler_home/logs/Scheduler.log)?
  • Could you send us the full thread dump of the JVM at the moment of the OOM?

febael (Author) commented Aug 3, 2018

Thank you @fviale for your analysis. I used the jvisualvm tool (it should ship with every JDK installation), and with it I could get all the thread dumps and the objects in the heap.

  • How did you manage to decode the network message? In particular, since its huge size triggered the out-of-memory error, how did you manage to analyse it retroactively?
    • The thread causing the OOM shows the allocation of a byte array in a constructor. In the list of biggest objects, one byte array is the clear winner with 1.1 GB. You can easily see the contents of this byte array with jvisualvm. It is wrapped by many layers and resides under a DefaultChannelPipeline object, as a value in a HashMap whose key is "pnpDecoder".
  • While this OOM appeared, were you really doing nothing at the scheduler level? There was no task running? No file transfer in progress?
    • I can say with 99.9% certainty that nothing was being done at the time of the error.
  • Do you see any unusual error in the scheduler logs (scheduler_home/logs/Scheduler.log)?
    • Nothing unusual in Scheduler.log or Node-local-LocalNodes-0.log.
  • Could you send us the full thread dump of the JVM at the moment of the OOM?
    • I think jvisualvm can help for this purpose.

Do you have any ideas on how to monitor this case for more clues?

fviale (Member) commented Aug 3, 2018

Thank you for your detailed explanation; I can indeed observe what you describe in JVisualVM.

Apparently, one frame decoded by PNPServerFrameDecoder (part of the low-level ProActive network protocol) is excessively big. The maximum frame size set in the code is simply MAXINT bytes (about 2 GB).
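
As an illustration only (this is not the actual PNPServerFrameDecoder code, and the limit and field names below are made up), a decoder in the Netty 3.x style used here could cap the announced frame length before allocating anything:

import org.jboss.netty.buffer.ChannelBuffer;
import org.jboss.netty.channel.Channel;
import org.jboss.netty.channel.ChannelHandlerContext;
import org.jboss.netty.handler.codec.frame.FrameDecoder;
import org.jboss.netty.handler.codec.frame.TooLongFrameException;

// Hypothetical sketch: assumes a 4-byte big-endian length prefix and a
// configurable cap, neither taken from the real PNP implementation.
public class BoundedFrameDecoder extends FrameDecoder {

    private final int maxFrameLength; // e.g. 64 MiB instead of Integer.MAX_VALUE

    public BoundedFrameDecoder(int maxFrameLength) {
        this.maxFrameLength = maxFrameLength;
    }

    @Override
    protected Object decode(ChannelHandlerContext ctx, Channel channel, ChannelBuffer buffer)
            throws Exception {
        if (buffer.readableBytes() < 4) {
            return null; // wait for the full length prefix
        }
        buffer.markReaderIndex();
        int length = buffer.readInt();
        if (length < 0 || length > maxFrameLength) {
            // Fail the channel instead of allocating a gigantic buffer.
            throw new TooLongFrameException("frame length " + length + " exceeds " + maxFrameLength);
        }
        if (buffer.readableBytes() < length) {
            buffer.resetReaderIndex();
            return null; // wait until the whole frame has arrived
        }
        return buffer.readBytes(length);
    }
}

Throwing TooLongFrameException makes Netty fail that connection rather than attempt a multi-gigabyte allocation, which is usually preferable when the announced length is clearly garbage.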

A low-level PNP message can be many different things, and from the content of the frame bytes alone it is nearly impossible to determine the origin of the message.

What I suggest is that you change the log level of the PNP protocol to TRACE in scheduler_server/config/log/server.properties:

log4j.logger.proactive.pnp = TRACE
log4j.logger.proactive.remoteobject = DEBUG

This will be very verbose, but whenever the problem happens again, we might be able to find the provenance of the last message.

febael (Author) commented Aug 6, 2018

Thank you @fviale, I have done that and will let you know what I find.
