
Sporadic segfaults #100

Open
skyeagle opened this issue Sep 16, 2020 · 13 comments
@skyeagle

System Information

  • OS: Debian GNU/Linux 10
  • Ruby: 2.6.5 compiled with jemalloc
  • Version: 0.7.42

Description

We are getting sporadic segfaults.

Actual behavior

There is no way we can reproduce this: it happens out of the blue in production, and the entire container dies without any additional debug information other than the following backtrace:

[BUG] Segmentation fault at 0x00007f6deb4e1000
ruby 2.6.5p114 (2019-10-01 revision 67812)+jemalloc [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0001 p:---- s:0003 E:0019c0 (none) [FINISH]

-- Machine register context ------------------------------------------------
RIP: 0x00007f5739084995 RBP: 0x00007f5716dfb9e0 RSP: 0x00007f5716dfb968
RAX: 0x00007f6deb4d810b RBX: 0x0000000000000001 RCX: 0x00007f6deb4d80ea
RDX: 0xffffffffffff8aea RDI: 0x00007f6deb4df620 RSI: 0x00007f6deb4e0fdf
R8: 0xffffffffffffffeb  R9: 0x00007f5734ffb160 R10: 0x00007f6deb4d810a
R11: 0x00007f6deb4d810b R12: 0x00007f5716dfc640 R13: 0x00007f5734fa7606
R14: 0x00007f5723d9cf80 R15: 0x00007f5716dfbb70 EFL: 0x0000000000010286

-- C level backtrace information -------------------------------------------
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(rb_vm_bugreport+0x50d) [0x7f573979313d]
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(0xb6c0b) [0x7f57395f7c0b]
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(0x1c2e92) [0x7f5739703e92]
/lib/x86_64-linux-gnu/libpthread.so.0(__restore_rt+0x0) [0x7f573930f730]
/lib/x86_64-linux-gnu/libc.so.6(0x7f5739084995) [0x7f5739084995]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(http1_consume_data+0x154) [0x7f5734f90c99]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(0x62e67) [0x7f5734f90e67]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(0x1805d) [0x7f5734f4605d]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(0x162a1) [0x7f5734f442a1]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(fio_defer_perform+0x21) [0x7f5734f44361]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(0x16404) [0x7f5734f44404]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(0x79640) [0x7f5734fa7640]
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(rb_thread_call_without_gvl+0x90) [0x7f5739746c00]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(0x796b9) [0x7f5734fa76b9]
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(0x201b65) [0x7f5739742b65]
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(0x204061) [0x7f5739745061]
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(0x2045a4) [0x7f57397455a4]
/lib/x86_64-linux-gnu/libpthread.so.0(0x7fa3) [0x7f5739304fa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f57390214cf]

-- Other runtime information -----------------------------------------------

* Loaded script: /var/www/homebase/vendor/bundle/ruby/2.6.0/bin/iodine] 68.170.93.138: idle

* Loaded features:
0 enumerator.so
1 thread.rb
2 rational.so
3 complex.so
4 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/x86_64-linux/enc/encdb.so
5 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/x86_64-linux/enc/trans/transdb.so
6 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/x86_64-linux/rbconfig.rb
7 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/rubygems/compatibility.rb
8 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/rubygems/defaults.rb
9 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/rubygems/deprecate.rb
10 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/rubygems/errors.rb
11 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/rubygems/version.rb
12 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/rubygems/requirement.rb

It doesn't happen very often, but it is still a bit concerning going forward.

Any help would be appreciated! Thank you!

@boazsegev
Owner

Hi @skyeagle,

Thank you for opening this issue.

It really annoys me that there's a bug and I have no idea where to hunt for it. Then again, sometimes it seems like my job is writing tests and hunting bugs more than writing code.

If you have any ideas on how to reproduce this, that would be great, since I can't test for the bug or verify a patch unless I can reproduce the issue.

Also:

  • Did you compile iodine with any flags (i.e., are you using iodine's internal local allocator or is it bypassed)? See the note after this list.

  • Are you calling any callbacks for the idle state?

  • Are you running iodine in cluster mode?
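
To clarify the allocator question: bypassing facil.io's built-in allocator is a compile-time choice. A minimal sketch, assuming the FIO_FORCE_MALLOC compile flag, passed to the gem build the same way as any other C flag:

bundle config build.iodine --with-cflags="-DFIO_FORCE_MALLOC"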

The thing that really gets to me is that the whole container crashes. Assuming iodine runs in cluster mode, any child that crashes should be respawned.

If the master process crashes, you would probably get more log messages than you do (child processes will complain before shutting down). Besides, the master process is fairly lazy: except for pub/sub and Redis connectivity, all it does is monitor the worker processes for failures (crashes).

Could you run iodine with debug level verbosity (it's a lot of junk output, I know, sorry about that)?

To get debugging messages from iodine, run it with -V5 (capital V). Maybe one of those messages will shed some light.
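
For example, a minimal sketch of such an invocation (the rackup file name and the worker/thread counts are placeholders; adjust them to your own setup):

bundle exec iodine -V5 -w 2 -t 4 config.ru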

Waiting for more information.

Thank you again for opening this issue!
Kindly,
Bo.

@skyeagle
Author

@boazsegev thanks for your feedback and patience.

Answering your questions...

Did you compile iodine with any flags (i.e., are you using iodine's internal local allocator or is it bypassed)?

Yes, it's compiled with

bundle config build.iodine --with-cflags="-DHTTP_MAX_HEADER_LENGTH=32768 -DHTTP_DEFAULT_BODY_LIMIT=104857600"
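
(That is, roughly a 32 KiB maximum header length and a 100 MiB default body limit, if I read the byte values right.)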

Are you calling any callbacks for the idle state?

I'm not aware of any.

Are you running iodine in cluster mode?

Yes. It's been running with 1-N workers. Is that considered cluster mode?

I have not seen this issue today, but I made some changes per your suggestion for better debugging in production on the next occurrence. I'll report back as soon as I get something new.

Thanks again for great work on this server!

@boazsegev
Owner

boazsegev commented Sep 19, 2020

It's been running with 1-N workers. Is that considered cluster mode?

Cluster mode is activated when iodine runs more than a single process, in which case it automatically monitors child processes (workers) for failure and re-spawns them.

Usually this is achieved either automatically (on systems with more than 2 CPU cores) or by using the -w CLI option. i.e.,

iodine -w -1 -t 8

I'll wait for more information.

@fabio

fabio commented Apr 24, 2021

Hi,

Please find attached a segfault log I just got on one of my servers. I hope it is useful.
segfault.log

System Information

  • OS: Linux 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 GNU/Linux
  • Ruby: 3.0.1
  • Version: 0.7.44
  • OpenSSL: OpenSSL 1.1.1d 10 Sep 2019

Iodine start up info:

INFO: Starting up Iodine:
 * Iodine 0.7.44
 * Ruby 3.0.1
 * facil.io 0.7.4 (epoll)
 * 1 Workers X 1 Threads per worker.
 * Maximum 131056 open files / sockets per worker.
 * Master (root) process: 16009.
INFO: Server is running 1 worker X 1 thread with facil.io 0.7.4 (epoll)
* Linked to OpenSSL 1.1.1d  10 Sep 2019
* Detected capacity: 131056 open file limit

@fabio

fabio commented May 11, 2021

Hi,

Maybe the C backtrace with source line numbers could be helpful:

...
.rbenv/versions/3.0.1/lib/ruby/gems/3.0.0/gems/iodine-0.7.44/lib/iodine/iodine.so(fio_flush+0x261) [0x7fe257fce221] fio.c:2993
.rbenv/versions/3.0.1/lib/ruby/gems/3.0.0/gems/iodine-0.7.44/lib/iodine/iodine.so(fio_flush) (null):0
.rbenv/versions/3.0.1/lib/ruby/gems/3.0.0/gems/iodine-0.7.44/lib/iodine/iodine.so(fio_flush_all+0x55) [0x7fe257fce325] fio.c:3053
.rbenv/versions/3.0.1/lib/ruby/gems/3.0.0/gems/iodine-0.7.44/lib/iodine/iodine.so(iodine_io_thread+0x3e) [0x7fe25802a54e] iodine_defer.c:38
...
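
For context, frames like these can be mapped back to source lines with gdb when the extension is built with debug symbols; a minimal sketch, using the symbol and offset from the fio_flush frame above:

gdb -batch -ex 'info line *(fio_flush+0x261)' .rbenv/versions/3.0.1/lib/ruby/gems/3.0.0/gems/iodine-0.7.44/lib/iodine/iodine.so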

The full log is downloadable from the link in my previous comment.

Thanks.
-- fabio

@boazsegev
Owner

I went over the logs but they didn't provide enough data, especially considering I cannot seem to reproduce the issue.

Are you using iodine's TLS/SSL layer?

I know there are some possible OpenSSL memory management issues (they might be iodine related or OpenSSL bugs, I'm not too sure).

@fabio

fabio commented May 11, 2021

Hi Bo,

Thanks for your reply.

Yes, my Iodine setup includes OpenSSL 1.1.1d 10 Sep 2019, even though my app does not require it. I'll try disabling TLS/SSL and let you know in the next couple of weeks.

-- fabio

@boazsegev
Owner

Hi Fabio,

If you're not actively using TLS/SSL (i.e., via iodine -tls or any other way), then simply linking to OpenSSL shouldn't normally raise any issues.

-- Bo

@fabio

fabio commented May 12, 2021

Hi,

I've enabled Iodine's full log (-V5) on one production server, where the problem occurs on average about every two weeks. I'll report back with the log.

-- fabio

@boazsegev
Owner

Thanks Fabio.

Do I understand correctly that this occurs on only one of the production servers? Do you use iodine on other production servers?

If so, what's the difference? What features of iodine are you using on the server that sporadically fails that you might not be using on other servers? What other C extensions are you using on that server (it might be a memory corruption error from another extension that causes this)?

-- Bo.

@fabio

fabio commented May 14, 2021

Do I understand correctly that this occurs on only one of the production servers? Do you use iodine on other production servers?

I use Iodine on 6 servers with the same system and application configuration. All servers have sporadic segfaults, so there is no helpful information to be gained from comparing them.

I've enabled the verbose log on only one server for now, as a precaution, just to see whether the verbosity could cause other issues (loss of performance, disk space saturation, etc.). If I see no problems, I'll enable full logging on all servers.

-- fabio

@fabio

fabio commented May 27, 2021

Hi,

Please find attached the log of the core dump with debug-level verbosity (iodine -V5 ...):

syslog-20210527.log.gz

I intentionally left a lot of lines before the crash in the file to give you more context. The core dump is at line 7332.

I really hope this helps narrow down the issue.

-- fabio

@boazsegev
Owner

I'm looking into this... The only thing I can think of is that the fallback IO thread (the one that will make sure to send any pending data even when the Ruby threads are all busy) is somehow attempting to access an invalid pointer in an empty (new) connection.

boazsegev added a commit that referenced this issue Sep 12, 2022