
Sporadic segfaults #100

Open
skyeagle opened this issue Sep 16, 2020 · 13 comments
@skyeagle

System Information

  • OS: Debian GNU/Linux 10
  • Ruby: 2.6.5 compiled with jemalloc
  • Version: 0.7.42

Description

We are getting sporadic segfaults.

Actual behavior

There is no way we can reproduce this: it happens out of the blue in production, and the entire container dies without any additional debug information other than the following backtrace:

[BUG] Segmentation fault at 0x00007f6deb4e1000
ruby 2.6.5p114 (2019-10-01 revision 67812)+jemalloc [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0001 p:---- s:0003 E:0019c0 (none) [FINISH]

-- Machine register context ------------------------------------------------
RIP: 0x00007f5739084995 RBP: 0x00007f5716dfb9e0 RSP: 0x00007f5716dfb968
RAX: 0x00007f6deb4d810b RBX: 0x0000000000000001 RCX: 0x00007f6deb4d80ea
RDX: 0xffffffffffff8aea RDI: 0x00007f6deb4df620 RSI: 0x00007f6deb4e0fdf
R8: 0xffffffffffffffeb  R9: 0x00007f5734ffb160 R10: 0x00007f6deb4d810a
R11: 0x00007f6deb4d810b R12: 0x00007f5716dfc640 R13: 0x00007f5734fa7606
R14: 0x00007f5723d9cf80 R15: 0x00007f5716dfbb70 EFL: 0x0000000000010286

-- C level backtrace information -------------------------------------------
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(rb_vm_bugreport+0x50d) [0x7f573979313d]
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(0xb6c0b) [0x7f57395f7c0b]
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(0x1c2e92) [0x7f5739703e92]
/lib/x86_64-linux-gnu/libpthread.so.0(__restore_rt+0x0) [0x7f573930f730]
/lib/x86_64-linux-gnu/libc.so.6(0x7f5739084995) [0x7f5739084995]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(http1_consume_data+0x154) [0x7f5734f90c99]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(0x62e67) [0x7f5734f90e67]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(0x1805d) [0x7f5734f4605d]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(0x162a1) [0x7f5734f442a1]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(fio_defer_perform+0x21) [0x7f5734f44361]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(0x16404) [0x7f5734f44404]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(0x79640) [0x7f5734fa7640]
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(rb_thread_call_without_gvl+0x90) [0x7f5739746c00]
/var/www/homebase/vendor/bundle/ruby/2.6.0/gems/iodine-0.7.42/lib/iodine/iodine.so(0x796b9) [0x7f5734fa76b9]
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(0x201b65) [0x7f5739742b65]
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(0x204061) [0x7f5739745061]
/usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/libruby.so.2.6(0x2045a4) [0x7f57397455a4]
/lib/x86_64-linux-gnu/libpthread.so.0(0x7fa3) [0x7f5739304fa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f57390214cf]

-- Other runtime information -----------------------------------------------

* Loaded script: /var/www/homebase/vendor/bundle/ruby/2.6.0/bin/iodine] 68.170.93.138: idle

* Loaded features:
0 enumerator.so
1 thread.rb
2 rational.so
3 complex.so
4 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/x86_64-linux/enc/encdb.so
5 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/x86_64-linux/enc/trans/transdb.so
6 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/x86_64-linux/rbconfig.rb
7 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/rubygems/compatibility.rb
8 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/rubygems/defaults.rb
9 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/rubygems/deprecate.rb
10 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/rubygems/errors.rb
11 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/rubygems/version.rb
12 /usr/lib/fullstaq-ruby/versions/2.6.5-jemalloc/lib/ruby/2.6.0/rubygems/requirement.rb

It doesn't happen very often, but it is still a bit concerning going forward.

Any help would be appreciated! Thank you!

@boazsegev
Owner

Hi @skyeagle,

Thank you for opening this issue.

It really annoys me that there's a bug and I have no idea where to hunt for it. Then again, sometimes it seems like my job is writing tests and hunting bugs more than writing code.

If you have any ideas on how to reproduce this, that would be great, since I can't test for the bug or verify a patch unless I can reproduce the issue.

Also:

  • Did you compile iodine with any flags (i.e., are you using iodine's internal local allocator or is it bypassed)? See the note after this list.

  • Are you calling any callbacks for the idle state?

  • Are you running iodine in cluster mode?
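
To clarify the allocator question: bypassing facil.io's built-in allocator is a compile-time choice. A minimal sketch, assuming the FIO_FORCE_MALLOC compile flag, passed to the gem build the same way as any other C flag:

bundle config build.iodine --with-cflags="-DFIO_FORCE_MALLOC"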

The thing that really gets to me is that the whole container crashes. Assuming iodine runs in cluster mode, any child that crashes should be respawned.

If the master process crashes, you would probably get more log messages than you do (child processes will complain before shutting down). Besides, the master process is fairly lazy: except for pub/sub and Redis connectivity, all it does is monitor the worker processes for failures (crashes).

Could you run iodine with debug level verbosity (it's a lot of junk output, I know, sorry about that)?

To get debugging messages from iodine, run it with -V5 (capital V). Maybe one of those messages will shed some light.
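
For example, a minimal sketch of such an invocation (the rackup file name and the worker/thread counts are placeholders; adjust them to your own setup):

bundle exec iodine -V5 -w 2 -t 4 config.ru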

Waiting for more information.

Thank you again for opening this issue!
Kindly,
Bo.

@skyeagle
Author

@boazsegev thanks for your feedback and patience.

Answering your questions...

Did you compile iodine with any flags (i.e., are you using iodine's internal local allocator or is it bypassed)?

Yes, it's compiled with

bundle config build.iodine --with-cflags="-DHTTP_MAX_HEADER_LENGTH=32768 -DHTTP_DEFAULT_BODY_LIMIT=104857600"
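
(That is, roughly a 32 KiB maximum header length and a 100 MiB default body limit, if I read the byte values right.)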

Are you calling any callbacks for the idle state?

I'm not aware of any.

Are you running iodine in cluster mode?

Yes. It's been running with 1-N workers. Is that considered cluster mode?

I have not seen this issue today, but I made some changes per your suggestion for better debugging in production on the next occurrence. I'll report back as soon as I get something new.

Thanks again for great work on this server!

@boazsegev
Owner

boazsegev commented Sep 19, 2020

It's been running with 1-N workers. Is that considered cluster mode?

Cluster mode is activated when iodine runs more than a single process, in which case it automatically monitors child processes (workers) for failure and re-spawns them.

Usually this is achieved either automatically (on systems with more than 2 CPU cores) or by using the -w CLI option. i.e.,

iodine -w -1 -t 8

I'll wait for more information.

@fabio

fabio commented Apr 24, 2021

Hi,

Please find attached a segfault log I just got on one of my servers. I hope it is useful.
segfault.log

System Information

  • OS: Linux 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 GNU/Linux
  • Ruby: 3.0.1
  • Version: 0.7.44
  • OpenSSL: OpenSSL 1.1.1d 10 Sep 2019

Iodine start up info:

INFO: Starting up Iodine:
 * Iodine 0.7.44
 * Ruby 3.0.1
 * facil.io 0.7.4 (epoll)
 * 1 Workers X 1 Threads per worker.
 * Maximum 131056 open files / sockets per worker.
 * Master (root) process: 16009.
INFO: Server is running 1 worker X 1 thread with facil.io 0.7.4 (epoll)
* Linked to OpenSSL 1.1.1d  10 Sep 2019
* Detected capacity: 131056 open file limit

@fabio

fabio commented May 11, 2021

Hi,

Maybe the C backtrace with source line numbers could be helpful:

...
.rbenv/versions/3.0.1/lib/ruby/gems/3.0.0/gems/iodine-0.7.44/lib/iodine/iodine.so(fio_flush+0x261) [0x7fe257fce221] fio.c:2993
.rbenv/versions/3.0.1/lib/ruby/gems/3.0.0/gems/iodine-0.7.44/lib/iodine/iodine.so(fio_flush) (null):0
.rbenv/versions/3.0.1/lib/ruby/gems/3.0.0/gems/iodine-0.7.44/lib/iodine/iodine.so(fio_flush_all+0x55) [0x7fe257fce325] fio.c:3053
.rbenv/versions/3.0.1/lib/ruby/gems/3.0.0/gems/iodine-0.7.44/lib/iodine/iodine.so(iodine_io_thread+0x3e) [0x7fe25802a54e] iodine_defer.c:38
...
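
For context, frames like these can be mapped back to source lines with gdb when the extension is built with debug symbols; a minimal sketch, using the symbol and offset from the fio_flush frame above:

gdb -batch -ex 'info line *(fio_flush+0x261)' .rbenv/versions/3.0.1/lib/ruby/gems/3.0.0/gems/iodine-0.7.44/lib/iodine/iodine.so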

The full log is downloadable from the link in my previous comment.

Thanks.
-- fabio

@boazsegev
Owner

I went over the logs but they didn't provide enough data, especially considering I cannot seem to reproduce the issue.

Are you using iodine's TLS/SSL layer?

I know there are some possible OpenSSL memory management issues (they might be iodine related or OpenSSL bugs, I'm not too sure).

@fabio

fabio commented May 11, 2021

Hi Bo,

Thanks for your reply.

Yes, my Iodine setup includes OpenSSL 1.1.1d 10 Sep 2019, even though my app does not require it. I'll try disabling TLS/SSL and let you know in the next couple of weeks.

-- fabio

@boazsegev
Owner

Hi Fabio,

If you're not actively using TLS/SSL (i.e., via iodine -tls or any other way), then simply linking to OpenSSL shouldn't normally raise any issues.

-- Bo

@fabio

fabio commented May 12, 2021

Hi,

I've enabled Iodine's full log (-V5) on one production server, where the problem occurs on average about every two weeks. I'll report back with the log.

-- fabio

@boazsegev
Owner

Thanks Fabio.

Do I understand correctly that this occurs on only one of the production servers? Do you use iodine on other production servers?

If so, what's the difference? What features of iodine are you using on the server that sporadically fails that you might not be using on other servers? What other C extensions are you using on that server (it might be a memory corruption error from another extension that causes this)?

-- Bo.

@fabio

fabio commented May 14, 2021

Do I understand correctly that this occurs on only one of the production servers? Do you use iodine on other production servers?

I use Iodine on 6 servers with the same system and application configuration. All servers have sporadic segfaults, so there is no helpful information to be gained from comparing them.

I've enabled the verbose log on only one server for now, as a precaution, just to see whether the verbosity could cause other issues (loss of performance, disk space saturation, etc.). If I see no problems, I'll enable full logging on all servers.

-- fabio

@fabio

fabio commented May 27, 2021

Hi,

Please find attached the log of the core dump with debug-level verbosity (iodine -V5 ...):

syslog-20210527.log.gz

I intentionally left a lot of lines before the crash in the file to give you more context. The core dump is at line 7332.

I really hope this helps narrow down the issue.

-- fabio

@boazsegev
Owner

I'm looking into this... The only thing I can think of is that the fallback IO thread (the one that will make sure to send any pending data even when the Ruby threads are all busy) is somehow attempting to access an invalid pointer in an empty (new) connection.

boazsegev added a commit that referenced this issue Sep 12, 2022