
catch on the exception mongo::UserException [it was "container crashes with exit code 139"] #3326

Closed
cdupont opened this issue Oct 1, 2018 · 38 comments


@cdupont

cdupont commented Oct 1, 2018

I recently upgraded Orion to version 1.15.
However, it crashes regularly with exit code 139.
For instance:

$ docker-compose up
orion_1       | time=Sunday 30 Sep 19:40:34 2018.747Z | lvl=ERROR | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=mongoConnectionPool.cpp[152]:mongoConnect | msg=Database Startup Error (cannot connect to mongo - doing 100 retries with a 1000 microsecond interval)
orion_1       | terminate called after throwing an instance of 'mongo::UserException'
orion_1       |   what():  DBClientBase::findN: transport error: mongo:27017 ns: admin.$cmd query: { ismaster: 1 }
platform_orion_1 exited with code 139

It seems related to a memory problem.
I run the Docker version (together with a bunch of other Docker containers).
This bug happens when my PC is quite loaded (running many tasks together). It happens systematically on Travis CI.

@fgalan
Member

fgalan commented Oct 2, 2018

So does it happen all the time? Or only when your PC is loaded (I mean, when your PC is not loaded, the problem doesn't happen)?

@cdupont
Author

cdupont commented Oct 2, 2018

It happens quite intermittently...
It seems to happen more often when my PC is loaded (for example when I use Skype or play a video)...
Here is a log where it happens on Travis: https://api.travis-ci.org/v3/job/435775229/log.txt
See line 2043. It seems to be related to the order in which the containers are started, or the load on Docker at startup, or the connectivity with Mongo?
I don't have a reproducible example. Here is my application: https://github.com/Waziup/Platform
If you follow the install instructions in the README, it should reproduce the problem (sometimes).
It happens quite often, I would say once every 4-6 launches.

@fgalan
Member

fgalan commented Oct 2, 2018

How much memory does the CB container have allocated?

@cdupont
Author

cdupont commented Oct 2, 2018

I use the default from docker-compose. I'm not sure how much that is, probably 2 GB?
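
A quick way to check what limit, if any, the container actually got would be something like this (container name taken from my compose project; the value is in bytes, and 0 means no memory limit is set):

$ docker inspect -f '{{.HostConfig.Memory}}' platform_orion_1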

@cdupont
Author

cdupont commented Oct 3, 2018

Here is the result of docker logs and inspect:

$ docker logs b09deea0a855
terminate called after throwing an instance of 'mongo::UserException'
  what():  DBClientBase::findN: transport error: mongo:27017 ns: admin.$cmd query: { ismaster: 1 }
$ docker inspect b09deea0a855
[
    {
        "Id": "b09deea0a85562037f33865e4ea84c07359e2af245ad5c791a1f6ffc2cc78b76",
        "Created": "2018-10-03T08:40:04.484018714Z",
        "Path": "/usr/bin/contextBroker",
        "Args": [
            "-fg",
            "-multiservice",
            "-ngsiv1Autocast",
            "-dbhost",
            "mongo"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 139,
            "Error": "",
            "StartedAt": "2018-10-03T08:40:04.952913231Z",
            "FinishedAt": "2018-10-03T08:40:05.287269743Z"
        },
        "Image": "sha256:dc67096e0cafbd0688270d2bae48fba95f73d8eb56cb7e017d68222ad2ff8bc3",
        "ResolvConfPath": "/var/lib/docker/containers/b09deea0a85562037f33865e4ea84c07359e2af245ad5c791a1f6ffc2cc78b76/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/b09deea0a85562037f33865e4ea84c07359e2af245ad5c791a1f6ffc2cc78b76/hostname",
        "HostsPath": "/var/lib/docker/containers/b09deea0a85562037f33865e4ea84c07359e2af245ad5c791a1f6ffc2cc78b76/hosts",
        "LogPath": "/var/lib/docker/containers/b09deea0a85562037f33865e4ea84c07359e2af245ad5c791a1f6ffc2cc78b76/b09deea0a85562037f33865e4ea84c07359e2af245ad5c791a1f6ffc2cc78b76-json.log",
        "Name": "/platform_orion_1",
        "RestartCount": 0,
        "Driver": "overlay2",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": [],
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "host",
            "PortBindings": {
                "1026/tcp": [
                    {
                        "HostIp": "",
                        "HostPort": "1026"
                    }
                ]
            },
            "RestartPolicy": {
                "Name": "",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": [],
            "CapAdd": null,
            "CapDrop": null,
            "Dns": null,
            "DnsOptions": null,
            "DnsSearch": null,
            "ExtraHosts": [
                "localhost:127.0.0.1",
                "mongo:127.0.0.1"
            ],
            "GroupAdd": null,
            "IpcMode": "",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": null,
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "ConsoleSize": [
                0,
                0
            ],
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "NanoCpus": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": null,
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": null,
            "DiskQuota": 0,
            "KernelMemory": 0,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": -1,
            "OomKillDisable": false,
            "PidsLimit": 0,
            "Ulimits": null,
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0
        },
        "GraphDriver": {
            "Name": "overlay2",
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/e79bad88ad0b4d65939daf647c65c77cefc0d0c9dd9c6f9919b5dbb1c5838598-init/diff:/var/lib/docker/overlay2/45221d1ed96eee6a6938d5dab3b551b6c172635ff02f31d49481c65be4736893/diff:/var/lib/docker/overlay2/dbec426be400f08c9a87ad8a4efafc576a3a6f9e30acf45a207f1dee4fed92fe/diff",
                "MergedDir": "/var/lib/docker/overlay2/e79bad88ad0b4d65939daf647c65c77cefc0d0c9dd9c6f9919b5dbb1c5838598/merged",
                "UpperDir": "/var/lib/docker/overlay2/e79bad88ad0b4d65939daf647c65c77cefc0d0c9dd9c6f9919b5dbb1c5838598/diff",
                "WorkDir": "/var/lib/docker/overlay2/e79bad88ad0b4d65939daf647c65c77cefc0d0c9dd9c6f9919b5dbb1c5838598/work"
            }
        },
        "Mounts": [],
        "Config": {
            "Hostname": "orion",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "ExposedPorts": {
                "1026/tcp": {}
            },
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "ORION_USER=orion",
                "GIT_REV_ORION=1.15.0",
                "CLEAN_DEV_TOOLS=1"
            ],
            "Cmd": [
                "-dbhost",
                "mongo"
            ],
            "Image": "fiware/orion:1.15.0",
            "Volumes": null,
            "WorkingDir": "/",
            "Entrypoint": [
                "/usr/bin/contextBroker",
                "-fg",
                "-multiservice",
                "-ngsiv1Autocast"
            ],
            "OnBuild": null,
            "Labels": {
                "build-date": "20170911",
                "com.docker.compose.config-hash": "215fd55a19c24f52466f9718a00cbaf91df72dd2be3d6cf750d5a4e1c16c709a",
                "com.docker.compose.container-number": "1",
                "com.docker.compose.oneoff": "False",
                "com.docker.compose.project": "platform",
                "com.docker.compose.service": "orion",
                "com.docker.compose.version": "1.9.0",
                "license": "GPLv2",
                "name": "CentOS Base Image",
                "vendor": "CentOS"
            }
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "672b8d5353b0d66050bfcc2922d1d2604934068913dad764bd50759dcb08b046",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": null,
            "SandboxKey": "/var/run/docker/netns/default",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {
                "host": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "NetworkID": "434c876a37bda002e9653cf3543f416f9dfa132c6696607df943686615021470",
                    "EndpointID": "",
                    "Gateway": "",
                    "IPAddress": "",
                    "IPPrefixLen": 0,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": ""
                }
            }
        }
    }
]

Notice that it happens only half a second after startup.

@fgalan
Member

fgalan commented Oct 3, 2018

See line 2043. It seems to be related to the order in which the containers are started, or the load on Docker at startup, or the connectivity with Mongo?

A problem in the starting order (i.e. the CB container is started before MongoDB is started) could make sense. In order to confirm/discard it, you could do the following test (a rough sketch of the commands is included after the link below):

  1. First, run MongoDB with docker run.
  2. Ensure MongoDB is up and ready (for instance, by accessing it with the mongo shell).
  3. Now run the Orion docker with docker run.

More detail on how to run the individual containers instead of using docker-compose.yml can be found at https://github.com/telefonicaid/fiware-orion/tree/master/docker#2b-mongodb-runs-on-another-docker-container
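
Roughly, the test could look like this (image tags and container names here are only illustrative, not necessarily the exact ones from the README):

# 1. Start MongoDB first
$ docker run --name mongodb -d mongo:3.4

# 2. Make sure it is ready before continuing, e.g. by pinging it from a mongo shell
$ docker run -it --rm --link mongodb:mongodb mongo:3.4 mongo --host mongodb --eval "db.adminCommand('ping')"

# 3. Only then start Orion, pointing it at the mongodb container
$ docker run -d --name orion --link mongodb:mongodb -p 1026:1026 fiware/orion:1.15.0 -dbhost mongodb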

@cdupont
Author

cdupont commented Oct 3, 2018

I did as suggested; it works in both cases (mongo before orion, orion before mongo), with the appropriate messages.
It seems to crash when there is load at startup. That's probably why it doesn't crash when starting individual containers, but crashes when I start my full platform with docker-compose up.
For example, today it crashed twice when typing docker-compose up. Other times it was fine. Quite the mystery :)

@fgalan
Member

fgalan commented Oct 3, 2018

In any case it seems to be an issue related more to Docker or to the available resources on your host system than to Orion itself. What do you think?

@cdupont
Author

cdupont commented Oct 3, 2018

I agree. I'm wondering how to debug it, though. Do you know how I can trace/graph the memory consumption of Docker containers?
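
(For what it's worth, a rough live per-container view, although not a graph over time, can be obtained with plain docker stats:)

$ docker stats --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.CPUPerc}}"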

@cdupont
Author

cdupont commented Oct 5, 2018

It happens really consistently, on my PC, on Travis and on colleagues' PCs.
Here is a docker-compose file that reproduces the problem:

version: '2'

services:

# Broker
  mongo:
    image: mongo:3.2.8
    ports:
      - "27017:27017"
    volumes:
      - ./data/mongo:/data/db

  orion:
    image: fiware/orion:1.15.0
    ports:
      - "1026:1026"
    network_mode: host
    depends_on:
      - mongo

Run with:

$ docker-compose up --force-recreate orion
Recreating platform_mongo_1
Recreating platform_orion_1
Attaching to platform_orion_1
orion_1  | terminate called after throwing an instance of 'mongo::UserException'
orion_1  |   what():  DBClientBase::findN: transport error: localhost:27017 ns: admin.$cmd query: { ismaster: 1 }
platform_orion_1 exited with code 139

$ docker-compose down
Stopping platform_mongo_1 ... done
Removing platform_orion_1 ... done
Removing platform_mongo_1 ... done
Removing network platform_default

My system is not loaded at the moment. With this setting, it happens systematically.
It seems to be related to the startup of mongo: Orion might make some requests when mongo is in an uncertain state.
I noticed that if the database is big, it happens quite systematically (90% of the time), while when it is empty it happens less often (10% of the time). A big database may mean that mongo takes longer to load.
Can it be that this version of Mongo is not compatible with Orion (i.e. some startup state makes Orion crash)?

@fgalan
Member

fgalan commented Oct 5, 2018

Can it be that this version of Mongo is not compatible with Orion (i.e. some startup state makes Orion crash)?

It shouldn't be, but in order to discard it I'd suggest using MongoDB 3.4 and checking whether the problem persists or not.

@cdupont
Author

cdupont commented Oct 5, 2018

I updated to 3.4, same problem.

@cdupont
Author

cdupont commented Oct 5, 2018

It does seem to be related to memory, after all. After using Skype, my system had little memory available:

$ cat /proc/meminfo
MemTotal:       16321148 kB
MemFree:          260928 kB

Cleaning it made the problem go away:

$ sudo sysctl -w vm.drop_caches=3
$ cat /proc/meminfo 
MemTotal:       16321148 kB
MemFree:         9252460 kB

Now Orion starts correctly.
Probably having less memory makes mongo take more time to start up, flustering Orion?

@fgalan
Member

fgalan commented Oct 5, 2018

It could be.

A way of testing it would be to keep a tail -f on the mongo docker log and on the context broker docker log. That way, you can check whether the error occurs only when Orion has finished its startup while MongoDB is still in the middle of its own.
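
For instance (container names assumed from the docker-compose output above):

# follow both logs in parallel and compare timestamps
$ docker logs -f platform_mongo_1 &
$ docker logs -f platform_orion_1

or simply docker-compose logs -f mongo orion from the compose directory.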

@cdupont
Author

cdupont commented Oct 5, 2018

I seem to reproduce the problem more often when putting memory under load:

stress --vm-bytes $(awk '/MemAvailable/{printf "%d\n", $2 * 0.99;}' < /proc/meminfo)k --vm-keep -m 1

@cdupont
Author

cdupont commented Oct 5, 2018

Anyway, a solution could be to add a catch for the mongo::UserException exception in Orion?

@fgalan
Member

fgalan commented Oct 8, 2018

Orion already implements a process to retry connections to MongoDB, giving up after a number of retries if the connection keeps failing.

However, the idea of capturing the exception makes sense (at least to print a proper error message in the logs :). Let's keep this issue open for that (low priority, I guess).

@fgalan fgalan changed the title container crashes with exit code 139 catch on the exception mongo::UserException [it was "container crashes with exit code 139"] Oct 8, 2018
@cdupont
Author

cdupont commented Oct 16, 2018

As a workaround, I added a healthcheck to mongo, and orion now waits until mongo is healthy.
This avoids the startup crash. Here is my docker-compose.yml:

version: '2.1'

services:
  mongo:
    image: mongo:3.2.8
    hostname: mongo
    ports:
      - "27017:27017"
    healthcheck:
      test: echo 'db.stats().ok' | mongo localhost:27017/local --quiet
      interval: 5s
      timeout: 5s
      retries: 12
    extra_hosts:
      - "localhost:127.0.0.1"

  orion:
    image: fiware/orion:1.15.0
    hostname: orion
    ports:
      - "1026:1026"
    command: -dbhost mongo -logLevel INFO 
    network_mode: host
    extra_hosts:
      - "mongo:127.0.0.1"
      - "localhost:127.0.0.1"
    depends_on:
      mongo:
        condition: service_healthy 
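
To confirm that the healthcheck is actually being evaluated, something like this can be used (container name assumed from the compose project above):

$ docker inspect -f '{{.State.Health.Status}}' platform_mongo_1
# should move from "starting" to "healthy" once the mongo shell probe succeeds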

@fgalan
Member

fgalan commented Oct 17, 2018

Great!

Maybe it would be a good idea to include this in the documentation to help other users in the same situation, in particular in docker/README.md. I mean, in a new section 4 Troubleshooting (the old section 4 would then become section 5), with a 4.1 subsection for this specific case.

What do you think? Would you like to propose a PR with that piece of documentation?

@cdupont
Author

cdupont commented Oct 17, 2018

But this is still considered a bug, no? Orion should catch the exception.
In any case I can propose the PR.

@fgalan
Member

fgalan commented Oct 18, 2018

Yes, we can keep the issue open in order to continue the discussion about (and eventually fix) the Orion exception.

Regarding this, two things to consider:

  1. What should Orion do when the exception is caught? My proposal is to log and exit, as if we reach that point it is because the retry loop has already failed and it doesn't make sense to retry again.
  2. Can the behaviour that raises the error be reproduced in a deterministic way?

@cdupont
Author

cdupont commented Oct 22, 2018

Apparently condition: service_healthy is not supported in docker-compose V3:
peter-evans/docker-compose-healthcheck#3

@fgalan
Member

fgalan commented Oct 23, 2018

Is there any alternative or workaround for docker-compose v3?

@cdupont
Author

cdupont commented Oct 23, 2018

An alternative is to use a tool like wait-for-it or dockerize.
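
For example, wait-for-it can wrap the broker command so that it only starts once the MongoDB port accepts connections (script path, timeout and broker options here are only illustrative):

# block until mongo:27017 accepts TCP connections, then start the broker
$ ./wait-for-it.sh mongo:27017 --timeout=60 --strict -- /usr/bin/contextBroker -fg -multiservice -dbhost mongo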

@fgalan
Member

fgalan commented Oct 23, 2018

One possibility would be to describe both solutions (for docker-compose v2 and v3). However, if you prefer to keep things simple, I'd suggest describing the one you have and know works (based on docker-compose v2), with a final note stating that service_healthy is not going to work in v3 and that wait-for-it should be used instead, but without going into too much detail.

Does it make sense to you?

@cdupont
Author

cdupont commented Oct 28, 2018

Well, I fixed one of my other containers that was consuming a lot of memory on start-up.
Now the problem doesn't appear anymore... So I don't think any documentation action is needed.

@cdupont
Author

cdupont commented Dec 19, 2018

The problem still bites me very often:

orion_1       | time=Wednesday 19 Dec 09:54:32 2018.406Z | lvl=INFO | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=contextBroker.cpp[904]:main | msg=Orion Context Broker is running
orion_1       | terminate called after throwing an instance of 'mongo::UserException'
orion_1       |   what():  DBClientBase::findN: transport error: mongo:27017 ns: admin.$cmd query: { ismaster: 1 }

When that happens I have to do docker-compose down and then docker rm $(docker ps -a -q) to clean everything up.

@fgalan
Member

fgalan commented Jan 14, 2019

Is there any deterministic way to reproduce the problem? It is hard to debug a problem that sometimes happens and sometimes doesn't ;)

@cdupont
Author

cdupont commented Jan 14, 2019

Unfortunately no deterministic way, but it happens very often.
The docker-compose.yml I gave on the 5th Oct reproduces the problem.

@cdupont
Author

cdupont commented May 21, 2019

I still have this issue often with the latest version (2.2.0). If you are at the FIWARE summit in Genova I could show you :)

@fgalan
Member

fgalan commented May 21, 2019

Unfortunately, I'm not at the FIWARE summit :(

What I would need is a deterministic way of reproducing the problem (i.e. "deploy this given docker-compose.yml, then run request X, then run request Y, etc.") to be able to debug it.

@anselmobattisti

anselmobattisti commented Apr 26, 2020

It's an old thread, but this can help someone stuck at this same point.

If you got the docker-compose file from the FIWARE sample code, you will need to define a depends_on directive.

Like this:

version: "3"
services:
    mongo:
        image: mongo:3.6
        command: --nojournal
    orion:
        image: fiware/orion
        depends_on: ["mongo"]
        links:
            - mongo
        ports:
            - "1026:1026"
        command: -dbhost mongo

In this way the broker will only start AFTER mongo!

@fgalan
Member

fgalan commented Apr 27, 2020

Thank you for the hint @anselmobattisti! The thread is somewhat old, but still open, so your feedback is valuable :)

@cdupont could you have a look and see if it helps, please? Thanks!

@cdupont
Author

cdupont commented Apr 27, 2020

@anselmobattisti good hint, thanks.
Unfortunately, depends_on is not really useful because docker-compose may start Orion immediately after Mongo; it does not wait until Mongo is stable.
The only way is to define restart: on-failure and let Orion restart.

@fgalan
Member

fgalan commented Apr 27, 2020

Is there a way to configure in docker-compose a script to be run from the Orion container to check whether the MongoDB container is ready?

Maybe that could be a way to explore. It seems that scripts like that are available out there (for instance, see https://stackoverflow.com/questions/15443106/how-to-check-if-mongodb-is-up-and-ready-to-accept-connections-from-bash-script). A rough sketch of such a check follows.
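
Something along these lines (hostname and retry count are only illustrative) could be run before launching the broker:

#!/bin/sh
# poll MongoDB until it answers a ping, giving up after 60 attempts
for i in $(seq 1 60); do
  if mongo --host mongo --quiet --eval "db.adminCommand('ping').ok" | grep -q 1; then
    echo "MongoDB is ready"
    exit 0
  fi
  echo "Waiting for MongoDB ($i)..."
  sleep 1
done
echo "MongoDB did not become ready in time" >&2
exit 1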

@cdupont
Author

cdupont commented Apr 27, 2020

Yes, it's definitely possible.

@fgalan fgalan added this to the 3.0.0 milestone Apr 12, 2021
@fgalan
Member

fgalan commented Apr 12, 2021

The MongoDB driver has been completely replaced in PR #3622. If this problem is still happening, it will appear in a completely different form, given that the new driver doesn't have the capability of raising any exception (including mongo::UserException).

@cdupont I think it is better to close this issue and, if the problem appears again, open a new issue about it, please.

@fgalan fgalan closed this as completed Apr 12, 2021
@cdupont
Author

cdupont commented Apr 12, 2021

Fermin, that's great. I'll open another issue if necessary.
