[bug]: FATAL: unhandled exception PJLIB/No memory! #727

mikepultz · 2024-04-30T21:22:29Z

Severity

Major

Versions

21.2.0

Components/Modules

pjproject

Operating Environment

Amazon Linux 2 (RHEL)

Frequency of Occurrence

One Time

Issue Description

One of our Asterisk instances crashed today when trying to allocate memory to a pj_pool:

[2024-04-30 11:11:04] ERROR[28315]: pjproject:<?>: except.c ..!!!FATAL: unhandled exception PJLIB/No memory!

This is the first time it's happened, and we have 6 identical instances total (load balanced) that have been running for about a month, and so far it's only happened the once.

The system is not experiencing memory issues, and it doesn't appear to be a memory leak (no visible downward trend over time in on our Available memory graphs):

The servers are not under a high load - there was only around 120 active calls when this happened. I've included a backtrace from the core dump.

It's a production device, so I'm not able to run it under Valgrind, but I can probably provide some redacted config files if that's helpful..

Relevant log output

#0  0x00001479505e6cd0 in __longjmp_chk () from /lib64/libc.so.6
#1  0x0000147952e2952e in pj_throw_exception_ (exception_id=1) at ../src/pj/except.c:53
#2  0x0000147952d96f90 in pool_callback (pool=<optimized out>, size=<optimized out>) at ../src/pjsip/sip_endpoint.c:143
#3  0x0000147952e2d9f5 in pj_pool_create_block (size=7598263421698388000, pool=0x14792c65baa0) at ../src/pj/pool.c:62
#4  pj_pool_allocate_find (pool=0x14792c65baa0, size=7598263421698387232) at ../src/pj/pool.c:148
#5  0x0000147952e2da52 in pj_pool_alloc (pool=<optimized out>, size=<optimized out>) at ../include/pj/pool_i.h:61
#6  0x0000147952e39b59 in pj_strdup (pool=pool@entry=0x14792c65baa0, dst=dst@entry=0x147902c0f7f8, src=0x1e58c48) at ../include/pj/string_i.h:42
#7  0x0000147952d9fab9 in pjsip_tpmgr_find_local_addr2 (tpmgr=0x1e3f5c8, pool=pool@entry=0x14792c65baa0, prm=prm@entry=0x147902c0f7d0) at ../src/pjsip/sip_transport.c:1770
#8  0x000014790b63ada5 in filter_on_tx_message (tdata=0x14792c65bb48) at res_pjsip/pjsip_message_filter.c:253
#9  0x0000147952d96f3c in endpt_on_tx_msg (endpt=<optimized out>, tdata=0x14792c65bb48) at ../src/pjsip/sip_endpoint.c:1125
#10 0x0000147952d9e9a4 in pjsip_transport_send (tr=0x14793d6f9838, tdata=tdata@entry=0x14792c65bb48, addr=addr@entry=0x1479441374f4, addr_len=28, token=token@entry=0x147944137408, 
    cb=cb@entry=0x147952db02b0 <transport_callback>) at ../src/pjsip/sip_transport.c:944
#11 0x0000147952db0715 in tsx_send_msg (tsx=0x147944137408, tdata=0x14792c65bb48) at ../src/pjsip/sip_transaction.c:2351
#12 0x0000147952db296e in tsx_on_state_proceeding_uas (event=0x147902c0fa50, tsx=0x147944137408) at ../src/pjsip/sip_transaction.c:2948
#13 tsx_on_state_trying (tsx=0x147944137408, event=0x147902c0fa50) at ../src/pjsip/sip_transaction.c:2878
#14 0x0000147952db4707 in pjsip_tsx_send_msg (tsx=0x147944137408, tdata=tdata@entry=0x14792c65bb48) at ../src/pjsip/sip_transaction.c:1969
#15 0x000014790b62bf61 in ast_sip_send_stateful_response (rdata=rdata@entry=0x14793ed75598, tdata=0x14792c65bb48, sip_endpoint=sip_endpoint@entry=0x14792cf94368) at res_pjsip.c:2460
#16 0x000014790b63178f in send_options_response (rdata=0x14793ed75598, code=code@entry=200) at res_pjsip/pjsip_options.c:256
#17 0x000014790b63190f in options_on_rx_request (rdata=<optimized out>) at res_pjsip/pjsip_options.c:305
#18 0x0000147952d97ca1 in pjsip_endpt_process_rx_data (endpt=0x1bdb5a8, rdata=rdata@entry=0x14793ed75598, p=p@entry=0x14790b87a400 <param>, p_handled=p_handled@entry=0x147902c0fc1c)
    at ../src/pjsip/sip_endpoint.c:930
#19 0x000014790b6466ef in distribute (data=0x14793ed75598) at res_pjsip/pjsip_distributor.c:952
#20 0x00000000005a0bee in ast_taskprocessor_execute ()
#21 0x00000000005a732b in execute_tasks ()
#22 0x00000000005a0bee in ast_taskprocessor_execute ()
#23 0x00000000005a7c81 in worker_start ()
#24 0x00000000005af939 in dummy_start ()
#25 0x00001479524e944b in start_thread () from /lib64/libpthread.so.0
#26 0x00001479505d652f in clone () from /lib64/libc.so.6

Asterisk Issue Guidelines

Yes, I have read the Asterisk Issue Guidelines

The text was updated successfully, but these errors were encountered:

jcolp · 2024-04-30T22:11:38Z

It's not a memory issue. It's trying to allocate an absurdly large amount, due to likely reading a value from freed memory. Unless it can be reproduced it would likely be hard to isolate and identify, but the backtrace shows it having to do with a transport so information about transports in use would be needed, along with attaching a full backtrace[1].

[1] https://docs.asterisk.org/Development/Debugging/Getting-a-Backtrace/?h=backtrace

mikepultz · 2024-04-30T22:39:07Z

Thanks @jcolp - yes - I probably should have mentioned the insanely large allocation size ;)

Here is my current transport configuration - used by all endpoints on the system; I'll run ast_coredumper in a few minutes to get those details.

[transport](!)
type=transport
local_net=x.x.x.x/22
external_media_address=y.y.y.y
external_signaling_address=y.y.y.y
allow_reload=yes
tos=cs5            
cos=5

[transport-udp](transport)
type=transport
protocol=udp
bind=0.0.0.0:5060

[transport-udp6](transport)
type=transport
protocol=udp
bind=[::]:5060

[transport-tcp](transport)
type=transport
protocol=tcp
bind=0.0.0.0:5060

[transport-tcp6](transport)
type=transport
protocol=tcp
bind=[::]:5060

[transport-tls](transport)
type=transport
protocol=tls
bind=0.0.0.0:5061
priv_key_file=/etc/fonolo/ssl/hostname.key
cert_file=/etc/fonolo/ssl/hostname.crt
ca_list_file=/etc/pki/tls/certs/ca-bundle.crt
cipher=ECDHE-ECDSA-AES256-GCM-SHA384,DHE-DSS-AES128-GCM-SHA256,ECDHE-ECDSA-AES128-GCM-SHA256,DHE-DSS-AES256-GCM-SHA384,ECDHE-RSA-AES256-GCM-SHA384,ECDHE-RSA-AES128-GCM-SHA256,AES256-SHA256,AES128-SHA256
verify_client=no
verify_server=no
method=tlsv1_2
allow_wildcard_certs=yes

[transport-tls6](transport)
type=transport
protocol=tls
bind=[::]:5061
priv_key_file=/etc/fonolo/ssl/hostname.key
cert_file=/etc/fonolo/ssl/hostname.crt
ca_list_file=/etc/pki/tls/certs/ca-bundle.crt
cipher=ECDHE-ECDSA-AES256-GCM-SHA384,DHE-DSS-AES128-GCM-SHA256,ECDHE-ECDSA-AES128-GCM-SHA256,DHE-DSS-AES256-GCM-SHA384,ECDHE-RSA-AES256-GCM-SHA384,ECDHE-RSA-AES128-GCM-SHA256,AES256-SHA256,AES128-SHA256
verify_client=no
verify_server=no
method=tlsv1_2
allow_wildcard_certs=yes

mikepultz · 2024-04-30T22:46:36Z

Results from ast_coredumper:

core-asterisk-2024-04-30T15-11-04Z-brief.txt
core-asterisk-2024-04-30T15-11-04Z-full.txt
core-asterisk-2024-04-30T15-11-04Z-info.txt
core-asterisk-2024-04-30T15-11-04Z-locks.txt
core-asterisk-2024-04-30T15-11-04Z-thread1.txt

gtjoseph · 2024-05-02T13:51:41Z

@mikepultz If you still have that coredump around, I'd be interested in the results of the following...

# gdb /usr/sbin/asterisk <coredump>
(gdb) frame 6
(gdb) p *dst
(gdb) p *src
(gdb) q

mikepultz · 2024-05-02T17:52:37Z

Yup

(gdb) frame 6
#6  0x0000147952e39b59 in pj_strdup (pool=pool@entry=0x14792c65baa0, dst=dst@entry=0x147902c0f7f8, src=0x1e58c48) at ../include/pj/string_i.h:42
42              dst->ptr = (char*)pj_pool_alloc(pool, src->slen);
(gdb) p *dst
$1 = {ptr = 0x0, slen = 0}
(gdb) p *src
$2 = {ptr = 0x3066323936316332 <error: Cannot access memory at address 0x3066323936316332>, slen = 7598263421698387232}
(gdb)

gtjoseph · 2024-05-06T13:30:56Z

@mikepultz Thanks. That just confirms that the "src" pointer parameter is either corrupted or the contents of that location are. It looks like the build is optimized but the full coredump may help anyway.

Run sudo /var/lib/asterisk/scripts/ast_coredumper --tarball-coredumps <coredump>
If your asterisk binary files have been stripped of debug info, please download the get_binaries.sh script, make it executable, then run it as sudo ./get_binaries.sh /tmp. It will attempt to get the debug symbols from /usr/lib/debug if they're installed.
Upload the resulting tarballs to the file hosting service of your choice (Google Drive, Dropbox, etc).
Send the link to asteriskteam@sangoma.com mentioning this issue in the Subject.

mikepultz · 2024-05-06T17:41:33Z

hey @gtjoseph - that --tarball-coredumps is pretty aggressive- it looks like it takes a copy of most of my instance, including contents of root's and users home directories- I can't share that data from a production system.

I can include a tar of the asterisk binaries, all libraries on the system, and the core dump if that gives you what you need?

I definitely don't have the debug symbols on the system- it's a custom build that we package for our environment; I can include the build string as well if that's helpful?

Mike

gtjoseph · 2024-05-06T19:30:45Z

that --tarball-coredumps is pretty aggressive- it looks like it takes a copy of most of my instance, including contents of root's and users home directories- I can't share that data from a production system.

Eh what??? --tarball-coredumps should only grab the coredump itself, the *.txt files, the asterisk binary, the modules, and /etc/os-release. It should never try root or home directories or anything else. Not even /etc/asterisk. I know it works fine on RHEL but I wonder if Amazon Linux does something goofy with the directory layout.

In any case, what we'd really need is the asterisk binary and modules, the accompanying debug symbols if the binaries are stripped, the coredump itself, and /etc/os-release. From that we can usually spin up a matching docker container, copy in the binaries and symbols, and run gdb. The symbols are really important though. Are you certain they're not available? You wouldn't have been able to run that gdb command snippet I gave you without them. What does the get_binaries.sh script produce?

mikepultz · 2024-05-06T20:07:08Z

Ok- so it's doing something really weird then; it was taking a while to run, so I checked the process list, which showed:

and when I looked in that /tmp directory, I saw:

[mike@sip4.us1 ~]$ cd /tmp/core-asterisk-2024-04-30T15-11-04Z.output/
[mike@sip4.us1 core-asterisk-2024-04-30T15-11-04Z.output]$ ls -la
total 32
drwxr-xr-x 13 root root  169 May  6 13:17 .
drwxrwxrwt 11 root root 4096 May  6 13:22 ..
lrwxrwxrwx  1 root root    7 Mar  3  2021 bin -> usr/bin
dr-xr-xr-x  4 root root 4096 Apr  1 15:28 boot
drwxr-xr-x 14 root root 4096 Apr  8 22:04 dev
drwxr-xr-x 97 root root 8192 Apr 30 14:15 etc
drwxr-xr-x  7 root root   78 Jan  8 13:25 home
lrwxrwxrwx  1 root root    7 Mar  3  2021 lib -> usr/lib
lrwxrwxrwx  1 root root    9 Mar  3  2021 lib64 -> usr/lib64
drwxr-xr-x  2 root root    6 Mar  3  2021 local
drwxr-xr-x  2 root root    6 Apr  9  2019 media
drwxr-xr-x  2 root root    6 Apr  9  2019 mnt
drwxr-xr-x  4 root root   27 Mar  3  2021 opt
drwx------  8 root root  118 May  6 13:17 proc
drwxr-xr-x  2 root root 4096 May  6 13:17 tmp
drwxr-xr-x  3 root root   19 May  6 13:17 usr

get_binaries.sh only finds the os-release file

The main issue is that we use our own RPM package for our systems- I'll see if I can build a debuginfo RPM for our package and then tar everything up to send over.

Mike

gtjoseph · 2024-05-07T12:29:36Z

That's weird. I'm sure it doesn't have anything to do with this issue but I'd like to figure out why ast_coredumper isn't working in your environment. What AMI id are you using for Amazon Linux 2? Are you doing anything special with the filesystem layout or asterisk installation directories? Also, the core-asterisk-2024-04-30T15-11-04Z-info.txt file has no good info in it so can you tell me the exact version of asterisk you're running? If you're building from git, the commit-id would work. Are you applying your own patches to the source?

mikepultz · 2024-05-07T20:53:31Z

hey @gtjoseph

I just emailed in a dropbox link with the files; hopefully it has everything you need - if there's something missing or would make it easier, just let me know and I'll see if I can provide it.

RE: ast_coredumper

Amazon Linux 2 - the original AMI was amzn2-ami-hvm-2.0.20210303.0-x86_64-gp2 - but it's regularly updated, and the AMI name doesn't change - so that's not accurate; but it's an up-to-date Amazon Linux 2 instance, with kernel 5.15.152-100.162.amzn2.x86_64. There's nothing really "strange" about the system set up or layout.
I'm using asterisk 21.2.0 - from the website- I assume that's probably the 21.2.0 tag from github
I'm installing asterisk under /usr/local/asterisk, with config under /etc/asterisk
I do use --disable-rpath when building to make packaging on our system easier- that might affect it?
I do apply some patches, but nothing really substantial - mostly disabling some logs.

I've included the full asterisk install dir (including ast_coredumper) from my system, as well as a .patch file with all the changes I make, and all the config files, with the layout from my system in that dropbox file I sent- hopefully that gives you what you need.

Mike

mikepultz added bug triage labels Apr 30, 2024

jcolp added feedback-required support-level-core Functionality with core support level labels Apr 30, 2024

jcolp assigned mikepultz Apr 30, 2024

jcolp unassigned mikepultz May 1, 2024

jcolp removed the feedback-required label May 1, 2024

gtjoseph added the feedback-required label May 2, 2024

gtjoseph assigned mikepultz May 2, 2024

gtjoseph removed the feedback-required label May 6, 2024

gtjoseph unassigned mikepultz May 6, 2024

gtjoseph assigned mikepultz May 6, 2024

gtjoseph added feedback-required and removed triage labels May 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug]: FATAL: unhandled exception PJLIB/No memory! #727

[bug]: FATAL: unhandled exception PJLIB/No memory! #727

mikepultz commented Apr 30, 2024

jcolp commented Apr 30, 2024

mikepultz commented Apr 30, 2024

mikepultz commented Apr 30, 2024

gtjoseph commented May 2, 2024

mikepultz commented May 2, 2024

gtjoseph commented May 6, 2024

mikepultz commented May 6, 2024

gtjoseph commented May 6, 2024

mikepultz commented May 6, 2024

gtjoseph commented May 7, 2024

mikepultz commented May 7, 2024

[bug]: FATAL: unhandled exception PJLIB/No memory! #727

[bug]: FATAL: unhandled exception PJLIB/No memory! #727

Comments

mikepultz commented Apr 30, 2024

Severity

Versions

Components/Modules

Operating Environment

Frequency of Occurrence

Issue Description

Relevant log output

Asterisk Issue Guidelines

jcolp commented Apr 30, 2024

mikepultz commented Apr 30, 2024

mikepultz commented Apr 30, 2024

gtjoseph commented May 2, 2024

mikepultz commented May 2, 2024

gtjoseph commented May 6, 2024

mikepultz commented May 6, 2024

gtjoseph commented May 6, 2024

mikepultz commented May 6, 2024

gtjoseph commented May 7, 2024

mikepultz commented May 7, 2024