
crash, probably due to stack overflow #61

Open
bfrk opened this issue Aug 20, 2020 · 13 comments

@bfrk

bfrk commented Aug 20, 2020

We have a situation where our IOC crashes inside libstream.so. There are lots of communication errors right from the start. The protocol file is:

Terminator = 0x03;
ReadTimeout = 100;   
ReplyTimeout = 1000;
ExtraInput = Ignore;

readSonde {
 in "%*/\|0062\|2\|\$1\|\$2\|\$3\|\$4\|/%f";
}

The core dump suggests that it runs into a stack overflow due to an endless recursion between StreamCore::readCallback and StreamCore::evalIn:

Program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
#0  0x00007f65a07d7a8a in StreamCore::readCallback (this=0x0, status=StreamIoSuccess, 
    input=0x0, size=0) at ../StreamCore.cc:932
#1  0x00007f65a07d78cc in StreamCore::evalIn (this=0x56230f60e890)
    at ../StreamCore.cc:903
#2  0x00007f65a07d7e81 in StreamCore::readCallback (this=0x56230f60e890, 
    status=StreamIoNoReply, input=0x0, size=0) at ../StreamCore.cc:975
#3  0x00007f65a07d78cc in StreamCore::evalIn (this=0x56230f60e890)
    at ../StreamCore.cc:903
#4  0x00007f65a07d7e81 in StreamCore::readCallback (this=0x56230f60e890, 
    status=StreamIoNoReply, input=0x0, size=0) at ../StreamCore.cc:975
#5  0x00007f65a07d78cc in StreamCore::evalIn (this=0x56230f60e890)
    at ../StreamCore.cc:903

and so on and so on (> 1300 stack entries). The StreamDevice version is 2.8.10.

@dirk-zimoch
Member

Probably fixed in commit 8b9359, which is included from 2.8.11 onward. Please update and try again.

@bfrk
Author

bfrk commented Aug 20, 2020

Unfortunately, upgrading did not help. I have tried with 2.8.16. The core dump says:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f914d054db1 in epicsMutexOsdLock (pmutex=0x56130bec9230)
    at ../../../src/libCom/osi/os/posix/osdMutex.c:127
127     {
[Current thread is 1 (Thread 0x7f9147dfb700 (LWP 20781))]
(gdb) bt
#0  0x00007f914d054db1 in epicsMutexOsdLock (pmutex=0x56130bec9230)
    at ../../../src/libCom/osi/os/posix/osdMutex.c:127
#1  0x00007f914d04dd59 in epicsMutexLock (pmutexNode=<optimized out>)
    at ../../../src/libCom/osi/epicsMutex.cpp:143
#2  0x00007f914d04df0c in epicsMutex::lock (this=<optimized out>)
    at ../../../src/libCom/osi/epicsMutex.cpp:244
#3  0x00007f914d9e86b8 in StreamCore::MutexLock::MutexLock (_stream=0x56130bec8bb0, 
    this=<synthetic pointer>) at ../StreamCore.h:121
#4  StreamCore::readCallback (this=0x56130bec8bb0, status=StreamIoNoReply, input=0x0, 
    size=0) at ../StreamCore.cc:956
#5  0x00007f914d9e3b54 in StreamCore::evalIn (this=this@entry=0x56130bec8bb0)
    at ../StreamCore.cc:919
#6  0x00007f914d9e8804 in StreamCore::readCallback (this=0x56130bec8bb0, 
    status=<optimized out>, input=<optimized out>, size=<optimized out>)
    at ../StreamCore.cc:1107
#7  0x00007f914d9e3b54 in StreamCore::evalIn (this=this@entry=0x56130bec8bb0)
    at ../StreamCore.cc:919
#8  0x00007f914d9e8804 in StreamCore::readCallback (this=0x56130bec8bb0, 
    status=<optimized out>, input=<optimized out>, size=<optimized out>)
    at ../StreamCore.cc:1107
[...more of the same...]
#1268 0x00007f914d9e8804 in StreamCore::readCallback (this=0x56130bec8bb0, 
    status=<optimized out>, input=<optimized out>, size=<optimized out>)
    at ../StreamCore.cc:1107
#1269 0x00007f914d9cc57d in StreamBusInterface::readCallback (this=0x56130bec9310, 
    size=0, input=0x0, status=StreamIoNoReply) at ../StreamBusInterface.h:121
#1270 AsynDriverInterface::readHandler (this=this@entry=0x56130bec9310)
    at ../AsynDriverInterface.cc:1054
#1271 0x00007f914d9ce7d8 in AsynDriverInterface::handleRequest (this=0x56130bec9310)
    at ../AsynDriverInterface.cc:1513
#1272 0x00007f914e659909 in portThread (pport=0x56130be7b7c0)
    at ../../asyn/asynDriver/asynManager.c:902
#1273 0x00007f914d052b40 in start_routine (arg=0x56130be7bd70)
    at ../../../src/libCom/osi/os/posix/osdThread.c:403
#1274 0x00007f914b96a4a4 in start_thread (arg=0x7f9147dfb700) at pthread_create.c:456
#1275 0x00007f914bc68d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

@dirk-zimoch
Member

I will have a look. I assume your record is an ai?
What is your device sending?
Is it TCP or serial?

@bfrk
Author

bfrk commented Aug 20, 2020

Luckily(?) it crashed almost immediately after I enabled asyn tracing (it usually takes a few minutes after a restart). These are the IOC's last words:

SIOC66C>asynSetTraceIOMask PS3 0 0x2
SIOC66C>asynSetTraceMask PS3 0 0x9
SIOC66C>2020/08/20 13:00:18.737 192.168.21.139:4003 read 69
@\005@\0200\002RD\003\025\0201\004`\005`\0200\002|0062|2|1|0|1241|0|0.7665E-1 \003\033\0201\004\004A\005A\0200\002RD\003\025\0201\004a\005
2020/08/20 13:00:18.740 192.168.21.139:4003 read 3
a\0200
2020/08/20 13:00:18.750 192.168.21.139:4003 read 1
\002
2020/08/20 13:00:18.789 192.168.21.139:4003 read 32
|0062|2|9|1|189|400|0.1377E-1 \003 
2020/08/20 13:00:18.802 192.168.21.139:4003 read 2
\0201
2020/08/20 13:00:18.807 192.168.21.139:4003 read 1
\004
2020/08/20 13:00:18.817 192.168.21.139:4003 read 1
\004
2020/08/20 13:00:18.827 192.168.21.139:4003 read 2
B\005
2020/08/20 13:00:19.929 192.168.21.139:4003 read 107
B\0200\002RD\003\025\0201\004b\005b\0200\002|0062|2|1|0|1260|0|0.8232E-1 \003\021\0201\004\004C\005C\0200\002RD\003\025\0201\004c\005c\0200\002|0062|2|9
2020/08/20 13:00:19.930 192.168.21.139:4003 read 2
D\005
2020/08/20 13:00:21.031 192.168.21.139:4003 read 164
D\0200\002RD\003\025\0201\004d\005d\0200\002|0062|2|1|0|1206|0|0.7924E-1 \003\022\0201\004\004E\005E\0200\002RD\003\025\0201\004e\005e\0200\002|0062|2|9
2020/08/20 13:00:22.133 192.168.21.139:4003 read 117
G\0200\002RD\003\025\0201\004g\005g\0200\002|0062|2|9|1|229|400|0.3521E-2 \003-\0201\004\004H\005H\0200\002RD\003\025\0201\004h\005h\0200\002|0062|2|
2020/08/20 13:00:22.145 192.168.21.139:4003 read 5
\0201\004i\005
2020/08/20 13:00:22.196 192.168.21.139:4003 read 3
i\0200
2020/08/20 13:00:22.202 192.168.21.139:4003 read 1
\002
2020/08/20 13:00:22.247 192.168.21.139:4003 read 33
|0062|2|9|1|188|400|-0.2593E-2 \003\000
2020/08/20 13:00:22.261 192.168.21.139:4003 read 3
\0201\004
2020/08/20 13:00:22.273 192.168.21.139:4003 read 1
\004
2020/08/20 13:00:24.376 192.168.21.139:4003 read 111
P\005P\0200\002RD\003\025\0201\004p\005p\0200\002|0062|2|1|0|1264|0|0.6131E-1 \003\033\0201\004\004Q\005Q\0200\002RD\003\025\0201\004q\005q\0200\002|0062|2
2020/08/20 13:00:29.487 192.168.21.139:4003 read 70
\004R\005R\0200\002RD\003\025\0201\004r\005r\0200\002|0062|2|1|0|1262|0|0.8247E-1 \003\021\0201\004\004S\005S\0200\002RD\003\025\0201\004s\005
2020/08/20 13:00:29.525 192.168.21.139:4003 read 3
s\0200
2020/08/20 13:00:29.531 192.168.21.139:4003 read 1
\002
2020/08/20 13:00:29.574 192.168.21.139:4003 read 32
|0062|2|9|1|240|400|0.4661E-2 \003\"
2020/08/20 13:00:29.586 192.168.21.139:4003 read 3
\0201\004
2020/08/20 13:00:29.598 192.168.21.139:4003 read 1
\004
2020/08/20 13:00:29.608 192.168.21.139:4003 read 2
T\005
2020/08/20 13:00:29.703 192.168.21.139:4003 read 3
T\0200
2020/08/20 13:00:29.713 192.168.21.139:4003 read 5
\002RD\003\025
2020/08/20 13:00:29.727 192.168.21.139:4003 read 5
\0201\004t\005
2020/08/20 13:00:29.775 192.168.21.139:4003 read 3
t\0200
2020/08/20 13:00:29.781 192.168.21.139:4003 read 1
\002
2020/08/20 13:00:29.824 192.168.21.139:4003 read 32
|0062|2|10|0|1258|0|0.7930E-1 \003,
2020/08/20 13:00:29.843 192.168.21.139:4003 read 3
\0201\004
2020/08/20 13:00:29.853 192.168.21.139:4003 read 1
\004
2020/08/20 13:00:29.865 192.168.21.139:4003 read 2
U\005
2020/08/20 13:00:29.872 192.168.21.139:4003 read 3
U\0200
2020/08/20 13:00:29.889 192.168.21.139:4003 read 5
\002RD\003\025
2020/08/20 13:00:29.904 192.168.21.139:4003 read 5
\0201\004u\005
2020/08/20 13:00:29.953 192.168.21.139:4003 read 3
u\0200
2020/08/20 13:00:29.959 192.168.21.139:4003 read 1
\002
2020/08/20 13:00:30.002 192.168.21.139:4003 read 32
|0062|2|9|1|205|400|0.1771E-1 \003%
2020/08/20 13:00:30.018 192.168.21.139:4003 read 3
\0201\004
2020/08/20 13:00:30.029 192.168.21.139:4003 read 1
\004
@@@ Current time: [2020-08-20 13:00:31]
@@@ Child process is shutting down, a new one will be restarted shortly

The device talks serial, but it is connected via a terminal server, like this:

serial via drvAsynIPPortConfigure ("PS3", "192.168.21.139:4003",0,0,0)

@bfrk
Author

bfrk commented Aug 21, 2020

I should perhaps mention that the actual protocol used by the device is something we are currently in the process of reverse engineering. The protocol file listed above was the first thing we tried, because it had worked (more or less) a few years ago on similar devices. I guess that by accident we happened to hit a code path that is not exercised under normal circumstances.

@dirk-zimoch
Member

My analysis so far:
StreamCore::evalIn() calls StreamCore::readCallback() if there is (still/already) data in the input buffer at the beginning of executing the in command.
StreamCore::readCallback() calls StreamCore::evalIn() for SCAN="I/O Intr" records in case of a read timeout (an unterminated, i.e. incomplete, input line), in order to restart the (supposedly blocking) in command.
However, it should do so with the old input discarded (unparsedInput = false), and evalIn only calls back into readCallback if (unparsedInput). I do not yet understand how this can end up in a recursion loop.

@dirk-zimoch
Member

Do you still have the core dump? Can you print *this?
Can you try to switch off optimization? Is this a normal PC or some embedded system?

@dirk-zimoch
Member

Please run again with var StreamDebug 1.

@bfrk
Author

bfrk commented Aug 28, 2020

Do you still have the core dump? Can you print *this?

Yes and yes. I guess the one you are interested in is at the bottom:

(gdb) up
#1  0x00007f914d04dd59 in epicsMutexLock (pmutexNode=<optimized out>)
    at ../../../src/libCom/osi/epicsMutex.cpp:143
143             epicsMutexOsdLock(pmutexNode->id);
(gdb) print *this
No symbol "this" in current context.
(gdb) up
#2  0x00007f914d04df0c in epicsMutex::lock (this=<optimized out>)
    at ../../../src/libCom/osi/epicsMutex.cpp:244
244         epicsMutexLockStatus status = epicsMutexLock ( this->id );
(gdb) print *this
value has been optimized out
(gdb) up
#3  0x00007f914d9e86b8 in StreamCore::MutexLock::MutexLock (_stream=0x56130bec8bb0, 
    this=<synthetic pointer>) at ../StreamCore.h:121
121                 { _stream->lockMutex(); }
(gdb) print *this
$1 = {stream = 0x56130bec8bb0}
(gdb) up
#4  StreamCore::readCallback (this=0x56130bec8bb0, status=StreamIoNoReply, input=0x0, 
    size=0) at ../StreamCore.cc:956
956         MutexLock lock(this);
(gdb) print *this
$2 = {<StreamProtocolParser::Client> = {<No data fields>}, <StreamBusInterface::Client> = {<No data fields>}, next = 0x56130bec9720, static first = 0x56130beaf7f0, 
  streamname = 0x56130bfb38f0 "RADZC:B3S2", flags = 261, protocolname = {
    local = "readSonde\000\061\000\060\000\061\062\066\060\000\060", '\000' <repeats 43 times>, len = 20, cap = 64, offs = 0, buffer = 0x56130bec8be0 "readSonde"}, 
  lockTimeout = 5000, writeTimeout = 100, replyTimeout = 1000, readTimeout = 100, 
  pollPeriod = 1000, maxInput = 0, inTerminatorDefined = true, 
  outTerminatorDefined = true, inTerminator = {
    local = "\003", '\000' <repeats 62 times>, len = 1, cap = 64, offs = 0, 
    buffer = 0x56130bec8c78 "\003"}, outTerminator = {
    local = "\003", '\000' <repeats 62 times>, len = 1, cap = 64, offs = 0, 
    buffer = 0x56130bec8cd8 "\003"}, separator = {local = '\000' <repeats 63 times>, 
    len = 0, cap = 64, offs = 0, buffer = 0x56130bec8d38 ""}, commands = {
    local = "\001%*/\033|0062\033|2\033|1\033|0\033|1260\033|0\033|/%f", '\000' <repeats 30 times>, len = 33, cap = 64, offs = 0, 
    buffer = 0x56130bec8d98 "\001%*/\033|0062\033|2\033|1\033|0\033|1260\033|0\033|/%f"}, onInit = {local = '\000' <repeats 63 times>, len = 0, cap = 64, offs = 0, 
    buffer = 0x56130bec8df8 ""}, onWriteTimeout = {local = '\000' <repeats 63 times>, 
    len = 0, cap = 64, offs = 0, buffer = 0x56130bec8e58 ""}, onReplyTimeout = {
    local = '\000' <repeats 63 times>, len = 0, cap = 64, offs = 0, 
    buffer = 0x56130bec8eb8 ""}, onReadTimeout = {local = '\000' <repeats 63 times>, 
    len = 0, cap = 64, offs = 0, buffer = 0x56130bec8f18 ""}, onMismatch = {
    local = '\000' <repeats 63 times>, len = 0, cap = 64, offs = 0, 
    buffer = 0x56130bec8f78 ""}, 
  commandIndex = 0x56130bec8d99 "%*/\033|0062\033|2\033|1\033|0\033|1260\033|0\033|/%f", activeCommand = 1 '\001', outputLine = {local = '\000' <repeats 63 times>, len = 0, 
    cap = 64, offs = 0, buffer = 0x56130bec8fe8 ""}, inputBuffer = {
    local = '\000' <repeats 63 times>, len = 5, cap = 256, offs = 103, 
---Type <return> to continue, or q <return> to quit---
    buffer = 0x7f9130000a10 "T\020\060\002RD\003\025\020\061\004t\005t\020\060\002|0062|2|10|0|1258|0|0.7997E-1 \003!\020\061\004\004U\005U\020\060\002RD\003\025\020\061\004u\005u\020\060\002|0062|2|9|1|205|400|0.1094E-1 \003)\020\061\004\004"}, inputLine = {
    local = "\025\020\061\004u\005u\020\060\002|0062|2|9|1|205|400|0.1094E-1 ", '\000' <repeats 23 times>, len = 40, cap = 64, offs = 0, 
    buffer = 0x56130bec90a8 "\025\020\061\004u\005u\020\060\002|0062|2|9|1|205|400|0.1094E-1 "}, consumedInput = 0, runningHandler = StreamCore::Success, fieldAddress = {
    local = '\000' <repeats 63 times>, len = 0, cap = 64, offs = 0, 
    buffer = 0x56130bec9118 ""}, lastInputStatus = StreamIoNoReply, 
  unparsedInput = true}

Can you try to switch off optimization?

I think so, but it's not quite that simple. Will try.

Is this a normal PC or some embedded system?

As far as I know it's a "blade" (whatever that means exactly; I can ask if it is important), so neither a normal PC nor really an embedded system. It is the same hardware most of our servers run on. Definitely Linux on x86_64.

Please run again with var StreamDebug 1.

Will do.

@dirk-zimoch
Member

Thanks. I see unparsedInput = true, which caused the branch in evalIn to be taken that calls readCallback. There are indeed 40 bytes in inputBuffer. This is in frame #4. But in frame #6 we are at line 1107, just after:

            unparsedInput = false;
            inputBuffer.clear();

Thus nothing should be in inputBuffer and unparsedInput should be false. Unless, of course, something else has re-filled the buffer in the meantime from another thread. However, the only method that does that is readCallback() itself, and we hold the semaphore, so no other thread can call readCallback() at the moment.

Hmm...

@dirk-zimoch
Copy link
Member

Frame #1269 calls StreamCore::readCallback the first time with status=StreamIoNoReply, which has been stored in this->lastInputStatus. This should take the branch case StreamIoNoReply in line 985 and then, as we are in AsyncMode (this->flags=261=0x105, AsyncMode=0x0004), continue to line 991, calling evalIn() in order to restart the protocol (assuming inputBuffer is empty), and then return. Instead we somehow end up at line 1107 in frame #1268, as if we had received input. I don't understand how.

Maybe the optimizer has decided to replace lines 991/992 with the equivalent of "goto line 1107" (which executes the same code). In that case the stack dump would be misleading and we would be recursing from line 991, not 1107. Switching on var streamDebug 1 should show the difference.

@bfrk
Author

bfrk commented Sep 1, 2020

I am currently having trouble reproducing the crash; working on it. When/if I can reproduce it, I will make the changes you suggested and follow up.

@bfrk
Author

bfrk commented Sep 18, 2020

I have finally gotten around to reproducing the crash with the requested changes: optimizations disabled, and var streamDebug 1 set. The IOC log with var streamDebug is too large to attach here (35 MiB). You can download both the backtrace and the log from https://www-csr.bessy.de/control/scratch/TRACE and https://www-csr.bessy.de/control/scratch/LOG.
