Orion intermittently exits with code 139 #3452

Closed
minimalisti opened this issue Mar 7, 2019 · 5 comments

@minimalisti

In our FIWARE system consisting of Orion, STH-Comet, and Mongo, Orion worked well for some months, but then started dying every now and then with exit code 139. This has been difficult for me to debug, as I can't see any cause for it in the logs.

End of Orion's log up to untimely exit:

    orion        | time=Friday 07 Dec 10:51:27 2018.571Z | lvl=WARN | corr=0c2f26ba-fa0e-11e8-8b6c-0242ac120002 | trans=1544121223-508-00000139202 | from=192.168.142.93 | srv=outside_temperature | subsrv=/ | comp=Orion | op=AlarmManager.cpp[432]:badInputReset | msg=Releasing alarm BadInput 192.168.142.93
    orion        | time=Friday 07 Dec 10:52:51 2018.614Z | lvl=WARN | corr=3e472d00-fa0e-11e8-ba7b-0242ac120002 | trans=1544121223-508-00000139410 | from=192.168.142.93 | srv=service list | subsrv=<default> | comp=Orion | op=AlarmManager.cpp[405]:badInput | msg=Raising alarm BadInput 192.168.142.93: bad character in tenant name - only underscore and alphanumeric characters are allowed. Offending character:
    orion        | time=Friday 07 Dec 10:52:57 2018.572Z | lvl=WARN | corr=41d4402a-fa0e-11e8-bd3c-0242ac120002 | trans=1544121223-508-00000139411 | from=192.168.142.93 | srv=outside_temperature | subsrv=/ | comp=Orion | op=AlarmManager.cpp[432]:badInputReset | msg=Releasing alarm BadInput 192.168.142.93
    orion        | time=Friday 07 Dec 10:55:19 2018.627Z | lvl=WARN | corr=96802652-fa0e-11e8-9ee0-0242ac120002 | trans=1544121223-508-00000139799 | from=192.168.142.93 | srv=service list | subsrv=<default> | comp=Orion | op=AlarmManager.cpp[405]:badInput | msg=Raising alarm BadInput 192.168.142.93: bad character in tenant name - only underscore and alphanumeric characters are allowed. Offending character:
    orion        | time=Friday 07 Dec 10:55:24 2018.261Z | lvl=WARN | corr=99432998-fa0e-11e8-b495-0242ac120002 | trans=1544121223-508-00000139816 | from=192.168.142.93 | srv=solar_panels | subsrv=/ | comp=Orion | op=AlarmManager.cpp[432]:badInputReset | msg=Releasing alarm BadInput 192.168.142.93
    orion        | time=Friday 07 Dec 11:04:30 2018.798Z | lvl=WARN | corr=N/A | trans=1544121223-508-00000141471 | from=pending | srv=pending | subsrv=pending | comp=Orion | op=AlarmManager.cpp[328]:notificationError | msg=Raising alarm NotificationError http://sth-comet:8666/notify: (curl_easy_perform failed: Timeout was reached)
    orion        | time=Friday 07 Dec 13:10:34 2018.172Z | lvl=WARN | corr=7b25874a-fa21-11e8-bc92-0242ac120002 | trans=1544121223-508-00000165344 | from=192.168.142.93 | srv=<default> | subsrv=<default> | comp=Orion | op=AlarmManager.cpp[405]:badInput | msg=Raising alarm BadInput 192.168.142.93: service '
    orion        | time=Friday 07 Dec 13:10:36 2018.281Z | lvl=WARN | corr=7c673fea-fa21-11e8-b126-0242ac120002 | trans=1544121223-508-00000165345 | from=192.168.142.93 | srv=outside_temperature | subsrv=/ | comp=Orion | op=AlarmManager.cpp[432]:badInputReset | msg=Releasing alarm BadInput 192.168.142.93
    orion exited with code 139

As I wrote, at this time I have no further information to give about this problem. I would appreciate any advice on how to proceed with getting this solved.

An earlier issue has some similarities: exit code 139 is common to both, but the reason for it seems to be different.

@fgalan
Member

fgalan commented Mar 8, 2019

(I have cleaned up comments from #3326 to keep each issue focused on its topic)

@fgalan
Member

fgalan commented Mar 8, 2019

Exit code 139 means that the process ended due to signal 11 (SIGSEGV). It would be useful to have the backtrace in order to have more information. The typical procedure to get the backtrace is:

ulimit -c unlimited    # to enable core generation
contextBroker -fg ...  # to run Orion
# once the core is generated run
gdb contextBroker core.xxxxx
(gdb) bt

However, I'm not sure how this can be achieved if Orion runs inside a Docker container. Some investigation is needed on this.

Have you been able to reproduce the problem running Orion outside a Docker container? This issue and the aforementioned #3326 make me wonder if the cause of the problem could be in Docker instead of Orion. Or, more precisely, in the way that Orion is running inside Docker. Maybe the resources available to Orion inside Docker are limited (e.g. a small amount of RAM) and that's causing the problem.
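
As a quick sanity check of the exit-code arithmetic above: shells and Docker conventionally report a process killed by signal N as exit code 128 + N, so 139 corresponds to signal 11. A minimal Python sketch of that mapping follows; the helper name `describe_exit_code` is made up for illustration, and Linux signal numbering is assumed.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a shell-style exit code to a signal name where applicable."""
    if code > 128:
        # Codes above 128 conventionally mean "killed by signal (code - 128)".
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return f"unknown signal {code - 128}"
    return f"normal exit ({code})"


print(describe_exit_code(139))  # SIGSEGV on Linux
```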

@minimalisti
Author

minimalisti commented Mar 12, 2019

I have always run Orion as a service in Docker Compose inside a virtual machine, so I haven't tried to reproduce the error outside Docker. Maybe I should try it, but I presumed that running with Docker was the natural way to use FIWARE: much of the documentation is dedicated to using FIWARE with Docker and Docker Compose, and I would like to continue using them.

I haven't tried enabling core generation within a Docker container, but from some quick Googling it seems possible: https://stackoverflow.com/questions/28335614/how-to-generate-core-file-in-docker-container#47694315. I'll likely have time to look into this in the coming weeks.

Today I changed the image used for Orion in Docker Compose from fiware/orion:latest to fiware/orion:2.2.0, keeping the versions in line with https://github.com/Fiware/catalogue/releases.

My current docker-compose.yml file:

    version: "3"
    services:
        mongo:
            image: mongo:3.4
            command: --nojournal
            container_name: mongo
            hostname: mongo
            volumes:
              - ~/mongo_data/db:/data/db
        orion:
            image: fiware/orion:2.2.0
            hostname: orion
            container_name: orion
            links:
                - "mongo:iot-mongo"
            expose:
                - "1026"
            ports:
                - "1026:1026"
            command: -dbhost mongo
            deploy:
              restart_policy:
                condition: any
                delay: 30s
                window: 120s
        sth-comet:
           image: fiware/sth-comet:2.5.0
           container_name: sth-comet
           hostname: sth-comet
           links:
             - mongo:iot-mongo
           ports:
             - "8666:8666"
           environment:
             - STH_HOST=0.0.0.0
             - DB_URI=mongo:27017
             - NAME_ENCODING=true
             - DATA_MODEL=collection-per-attribute
             - LOGOPS_LEVEL=DEBUG

The resources of the virtual machine should be sufficient: it has about 30 GB of memory and four 2.1 GHz cores. The OS is Ubuntu 18.04. But looking at the virtual machine's logs, I do see some troubling lines about libmicrohttpd segfaulting:

[Thu Jan 24 17:55:32 2019] libmicrohttpd[49015]: segfault at 306 ip 00007fa8d92d0f19 sp 00007fa8be7e9c40 error 4 in libc-2.17.so[7fa8d9284000+1c3000]
[Fri Feb 22 08:17:54 2019] libmicrohttpd[121197]: segfault at 306 ip 00007f5f07612f19 sp 00007f5ede7f1c40 error 4 in libc-2.17.so[7f5f075c6000+1c3000]
[Fri Mar  1 13:15:11 2019] libmicrohttpd[25364]: segfault at 306 ip 00007fc9f1091f19 sp 00007fc9e37f3c40 error 4 in libc-2.17.so[7fc9f1045000+1c3000]

These segfaults might be a cause or a consequence of the problem our system is facing. This could shift the cause of the error away from Orion.
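
One way to get a little more out of the kernel lines above is to subtract the library's load address (the first number inside the square brackets) from the faulting instruction pointer (`ip`). The result is the offset of the crashing instruction inside libc-2.17.so. The sketch below does this for the three lines quoted above; the regular expression and the function name `libc_offset` are illustrative and not part of any existing tool.

```python
import re

# Kernel segfault lines copied from the report above (timestamps dropped).
LINES = [
    "libmicrohttpd[49015]: segfault at 306 ip 00007fa8d92d0f19 "
    "sp 00007fa8be7e9c40 error 4 in libc-2.17.so[7fa8d9284000+1c3000]",
    "libmicrohttpd[121197]: segfault at 306 ip 00007f5f07612f19 "
    "sp 00007f5ede7f1c40 error 4 in libc-2.17.so[7f5f075c6000+1c3000]",
    "libmicrohttpd[25364]: segfault at 306 ip 00007fc9f1091f19 "
    "sp 00007fc9e37f3c40 error 4 in libc-2.17.so[7fc9f1045000+1c3000]",
]

# Captures the instruction pointer and the mapping's base address (both hex).
PATTERN = re.compile(r"ip (?P<ip>[0-9a-f]+) .* in \S+\[(?P<base>[0-9a-f]+)\+")


def libc_offset(line: str) -> int:
    """Return the faulting instruction's offset within the mapped library."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a segfault line: {line!r}")
    return int(match.group("ip"), 16) - int(match.group("base"), 16)


for line in LINES:
    print(hex(libc_offset(line)))  # 0x4cf19 for each of the three lines
```

All three crashes give the same offset (0x4cf19) and the same faulting address (0x306), which suggests the same code path is failing each time. On x86, `error 4` is the page-fault code for a user-mode read of an unmapped page. With debug symbols for that exact libc build, the offset could probably be resolved to a function name, for example with `addr2line`.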

There is little new information in this comment, but Orion hasn't crashed for a while now.

@fgalan
Member

fgalan commented Mar 12, 2019

Thank you for the detailed report. Please keep posting if you have any new findings.

@fgalan
Member

fgalan commented Aug 24, 2023

After a long quarantine of more than 4 years :) I think this issue can be closed.

If the problem persists, please open a new issue with fresh, updated information.

@fgalan fgalan closed this as completed Aug 24, 2023