ServerSelectionTimeoutError when services connect to mongo. #240

Open
brianwilliams-123 opened this issue Mar 22, 2022 · 10 comments
@brianwilliams-123

Problem

I tried to install StackStorm on Docker using this repository's guide, without making any changes to the files.

After executing docker-compose, every service that depends on the mongo service and tries to connect to its database throws a ServerSelectionTimeoutError:

st2timersengine_1      | 2022-03-22 18:19:08,934 ERROR [-] Failed to connect to database "st2" @ "mongo:27017" as user "None": No servers found yet, Timeout: 3.0s, Topology Description: <TopologyDescription id: 623a13191966d2953888ab75, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None>]>
st2timersengine_1      | 2022-03-22 18:19:08,934 ERROR [-] (PID=1) TimerEngine quit due to exception.                                                                                                     
st2timersengine_1      | Traceback (most recent call last):                                                                                                                                               
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2reactor/cmd/timersengine.py", line 93, in main                                                                        
st2timersengine_1      |     _setup()                                                                                                                                                                     
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2reactor/cmd/timersengine.py", line 50, in _setup                                                                      
st2timersengine_1      |     capabilities=capabilities,                                                                                                                                                   
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/service_setup.py", line 249, in setup                                                                          
st2timersengine_1      |     db_setup()                                                                                                                                                                   
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/database_setup.py", line 55, in db_setup                                                                       
st2timersengine_1      |     connection = db_init.db_setup_with_retry(**db_cfg)                                                                                                                           
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/persistence/db_init.py", line 93, in db_setup_with_retry                                                       
st2timersengine_1      |     ssl_match_hostname=ssl_match_hostname,                                                                                                                                       
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/persistence/db_init.py", line 58, in db_func_with_retry                                                        
st2timersengine_1      |     return retrying_obj.call(db_func, *args, **kwargs)                                                                                                                           
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/retrying.py", line 206, in call                                                                                          
st2timersengine_1      |     return attempt.get(self._wrap_exception)                                                                                                                                     
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/retrying.py", line 247, in get                                                                                           
st2timersengine_1      |     six.reraise(self.value[0], self.value[1], self.value[2])                                                                                                                     
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/six.py", line 696, in reraise                                                                                            
st2timersengine_1      |     raise value                                                                                                                                                                  
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/retrying.py", line 200, in call                                                                                          
st2timersengine_1      |     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)                                                                                                                
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/models/db/__init__.py", line 251, in db_setup                                                                  
st2timersengine_1      |     ssl_match_hostname=ssl_match_hostname,                                                                                                                                       
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/models/db/__init__.py", line 213, in _db_connect                                                               
st2timersengine_1      |     raise e                                                                                                                                                                      
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/models/db/__init__.py", line 204, in _db_connect                                                               
st2timersengine_1      |     connection.admin.command("ping")                                                                                                                                             
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/pymongo/database.py", line 737, in command                                                                               
st2timersengine_1      |     read_preference, session) as (sock_info, slave_ok):                                                                                                                          
st2timersengine_1      |   File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__                                                                                                                 
st2timersengine_1      |     return next(self.gen)                                                                                                                                                        
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1325, in _socket_for_reads                                                                
st2timersengine_1      |     server = self._select_server(read_preference, session)                                                                                                                       
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/pymongo/mongo_client.py", line 1278, in _select_server                                                                   
st2timersengine_1      |     server = topology.select_server(server_selector)                                                                                                                             
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/pymongo/topology.py", line 243, in select_server                                                                         
st2timersengine_1      |     address))                                                                                                                                                                    
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/pymongo/topology.py", line 200, in select_servers                                                                        
st2timersengine_1      |     selector, server_timeout, address)
st2timersengine_1      |   File "/opt/stackstorm/st2/lib/python3.6/site-packages/pymongo/topology.py", line 217, in _select_servers_loop
st2timersengine_1      |     (self._error_message(selector), timeout, self.description))                                                                                                                  
st2timersengine_1      | pymongo.errors.ServerSelectionTimeoutError: No servers found yet, Timeout: 3.0s, Topology Description: <TopologyDescription id: 623a13191966d2953888ab75, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None>]>

After doing some digging, the ServerSelectionTimeoutError exception that triggers this failure is caught at line 205 of https://github.com/StackStorm/st2/blob/master/st2common/st2common/models/db/__init__.py.
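
For reference, a minimal sketch of the connection pattern used there (host and port are the compose defaults; this is an illustration, not the exact st2 code). pymongo's MongoClient() is lazy and always succeeds, so st2 issues a cheap "ping" with a short serverSelectionTimeoutMS, and it is that ping that raises ServerSelectionTimeoutError after 3 seconds:

import pymongo
from pymongo.errors import ConnectionFailure, ServerSelectionTimeoutError

# MongoClient() returns immediately, even if the server is unreachable.
client = pymongo.MongoClient("mongo", 27017, serverSelectionTimeoutMS=3000)

try:
    # "ping" is cheap and needs no auth; it forces server selection,
    # which is what actually times out when mongo cannot be reached.
    client.admin.command("ping")
except (ConnectionFailure, ServerSelectionTimeoutError) as exc:
    print(f"Failed to connect: {exc}")
    raise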

Some services keep restarting indefinitely because of the restart: on-failure option; only st2-docker_st2web_1 and st2-docker_st2client_1 stay up, but they are unusable.

CONTAINER ID   IMAGE                                   COMMAND                  CREATED          STATUS                         PORTS                                                 NAMES
5cab2898f2c9   stackstorm/st2actionrunner:latest       "/st2client-startup.…"   15 minutes ago   Up 2 minutes                                                                         st2-docker_st2client_1
a265df0d2466   stackstorm/st2chatops:latest            "/st2chatops-startup…"   15 minutes ago   Exited (0) 2 minutes ago                                                             st2-docker_st2chatops_1
299d5ab21193   stackstorm/st2web:latest                "/bin/bash -c 'if [ …"   15 minutes ago   Up 2 minutes (unhealthy)       127.0.0.1:80->80/tcp, 443/tcp                         st2-docker_st2web_1
b47e4c381ba9   stackstorm/st2timersengine:latest       "/opt/stackstorm/st2…"   15 minutes ago   Up 5 seconds                                                                         st2-docker_st2timersengine_1
f424783795a7   stackstorm/st2actionrunner:latest       "/opt/stackstorm/st2…"   15 minutes ago   Restarting (1) 5 seconds ago                                                         st2-docker_st2actionrunner_1
e584a6a1dee6   stackstorm/st2stream:latest             "/opt/stackstorm/st2…"   15 minutes ago   Up 1 second                    9102/tcp                                              st2-docker_st2stream_1
75ae17a1eabf   stackstorm/st2auth:latest               "/opt/stackstorm/st2…"   15 minutes ago   Up 9 seconds                   9100/tcp                                              st2-docker_st2auth_1
fe556f0a4bd6   stackstorm/st2workflowengine:latest     "/opt/stackstorm/st2…"   15 minutes ago   Restarting (1) 4 seconds ago                                                         st2-docker_st2workflowengine_1
e0e6f4ad0aa9   stackstorm/st2sensorcontainer:latest    "/opt/stackstorm/st2…"   15 minutes ago   Up 7 seconds                                                                         st2-docker_st2sensorcontainer_1
0676f3a3b7c1   stackstorm/st2garbagecollector:latest   "/opt/stackstorm/st2…"   15 minutes ago   Restarting (1) 4 seconds ago                                                         st2-docker_st2garbagecollector_1
fec487dd323b   stackstorm/st2scheduler:latest          "/opt/stackstorm/st2…"   15 minutes ago   Up 12 seconds                                                                        st2-docker_st2scheduler_1
ece3af683ef5   stackstorm/st2notifier:latest           "/opt/stackstorm/st2…"   15 minutes ago   Up 20 seconds                                                                        st2-docker_st2notifier_1
2e523c091513   stackstorm/st2rulesengine:latest        "/opt/stackstorm/st2…"   15 minutes ago   Up 22 seconds                                                                        st2-docker_st2rulesengine_1
ad9b149797e6   stackstorm/st2api:latest                "/opt/stackstorm/st2…"   15 minutes ago   Up 5 seconds                   9101/tcp                                              st2-docker_st2api_1
97329d112d76   stackstorm/st2actionrunner:latest       "/makesecrets.sh"        15 minutes ago   Exited (0) 2 minutes ago                                                             st2-docker_st2makesecrets_1
01453a00d51b   mongo:4.0                               "docker-entrypoint.s…"   15 minutes ago   Up 2 minutes                   27017/tcp                                             st2-docker_mongo_1
a971e05ba16d   rabbitmq:3.8                            "docker-entrypoint.s…"   15 minutes ago   Up 2 minutes                   4369/tcp, 5671-5672/tcp, 15691-15692/tcp, 25672/tcp   st2-docker_rabbitmq_1
a8a72c873785   redis:6.2                               "docker-entrypoint.s…"   15 minutes ago   Up 2 minutes                   6379/tcp                                              st2-docker_redis_1

Has anyone else encountered this error?

Versions

  • Host OS: Debian GNU/Linux 11 (bullseye)
  • docker: 20.10.5+dfsg1
  • docker-compose: 1.25.0
  • stackstorm version: v3.6.0

How to reproduce

Follow the README of this repository.

@fengqing2021

I got this problem too. I haven't solved it yet.

@rams3sh

rams3sh commented Jul 19, 2022

Facing the same issue here! Unable to find a solution.

@fengqing2021

Going back to the previous version (st2-docker 3.6) solved it.

@dmulyalin

dmulyalin commented Nov 21, 2022

Same issue while trying to start st2 3.7.0 docker images, running on:

[root@lp-bejodeve-ubu st2-docker]# hostnamectl
   Static hostname: localhost.localdomain
Transient hostname: URL
         Icon name: computer-vm
           Chassis: vm
    Virtualization: vmware
  Operating System: Rocky Linux 8.6 (Green Obsidian)
       CPE OS Name: cpe:/o:rocky:rocky:8:GA
            Kernel: Linux 4.18.0-372.9.1.el8.x86_64
      Architecture: x86-64
[root@lp-bejodeve-ubu st2-docker]# docker --version
Docker version 20.10.17, build 100c701
[root@lp-bejodeve-ubu st2-docker]# docker-compose --version
Docker Compose version v2.6.1
[root@lp-bejodeve-ubu st2-docker]#

However, in my case going back to 3.6.0 did not help either; running these containers produces the same issue:

[root@lp-bejodeve-ubu st2-docker]#  docker container ls 
CONTAINER ID   IMAGE                                  COMMAND                  CREATED         STATUS                         PORTS                                                 NAMES
58070676a063   stackstorm/st2web:3.6.0                "/bin/bash -c 'if [ …"   3 minutes ago   Up 2 minutes (unhealthy)       443/tcp, 0.0.0.0:8899->80/tcp                         st2-docker-st2web-1
f82ddfd03b68   stackstorm/st2actionrunner:3.6.0       "/st2client-startup.…"   3 minutes ago   Up 2 minutes                                                                         st2-docker-st2client-1
5c0724094489   stackstorm/st2rulesengine:3.6.0        "/opt/stackstorm/st2…"   3 minutes ago   Up 10 seconds                                                                        st2-docker-st2rulesengine-1
598c8288b678   stackstorm/st2notifier:3.6.0           "/opt/stackstorm/st2…"   3 minutes ago   Restarting (1) 5 seconds ago                                                         st2-docker-st2notifier-1
1d1495532504   stackstorm/st2sensorcontainer:3.6.0    "/opt/stackstorm/st2…"   3 minutes ago   Up 11 seconds                                                                        st2-docker-st2sensorcontainer-1
b016fbfcd6b1   stackstorm/st2workflowengine:3.6.0     "/opt/stackstorm/st2…"   3 minutes ago   Up 10 seconds                                                                        st2-docker-st2workflowengine-1
2e0f0bb87a14   stackstorm/st2actionrunner:3.6.0       "/opt/stackstorm/st2…"   3 minutes ago   Restarting (1) 7 seconds ago                                                         st2-docker-st2actionrunner-1
f892a9365e94   stackstorm/st2scheduler:3.6.0          "/opt/stackstorm/st2…"   3 minutes ago   Up 1 second                                                                          st2-docker-st2scheduler-1
43985cf22e90   stackstorm/st2timersengine:3.6.0       "/opt/stackstorm/st2…"   3 minutes ago   Up 11 seconds                                                                        st2-docker-st2timersengine-1
a220342568ff   stackstorm/st2auth:3.6.0               "/opt/stackstorm/st2…"   3 minutes ago   Up Less than a second          9100/tcp                                              st2-docker-st2auth-1
122eafc2e421   stackstorm/st2garbagecollector:3.6.0   "/opt/stackstorm/st2…"   3 minutes ago   Up 9 seconds                                                                         st2-docker-st2garbagecollector-1
d7c48abdd377   stackstorm/st2stream:3.6.0             "/opt/stackstorm/st2…"   3 minutes ago   Restarting (1) 6 seconds ago                                                         st2-docker-st2stream-1
a2fb891fede2   stackstorm/st2api:3.6.0                "/opt/stackstorm/st2…"   3 minutes ago   Up 1 second                    9101/tcp                                              st2-docker-st2api-1
a8433cc07ddb   mongo:4.4                              "docker-entrypoint.s…"   3 minutes ago   Up 3 minutes                   27017/tcp                                             st2-docker-mongo-1
aae9d96b8915   rabbitmq:3.8                           "docker-entrypoint.s…"   3 minutes ago   Up 3 minutes                   4369/tcp, 5671-5672/tcp, 15691-15692/tcp, 25672/tcp   st2-docker-rabbitmq-1
5b57bd0e2367   redis:6.2                              "docker-entrypoint.s…"   3 minutes ago   Up 3 minutes                   6379/tcp                                              st2-docker-redis-1

Attached logs from containers.
st2 docker logs.txt

@arm4b
Member

arm4b commented Nov 21, 2022

I would check whether the mongo host (container) is reachable from the st2 containers.
If yes, I'd try to connect to it with the mongo client to see whether the DB is working.

If the mongo client can connect to the DB while st2 cannot, it's a stackstorm problem.
If the mongo client cannot connect to the DB, it's a mongodb or docker problem.
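
A minimal reachability check along those lines, runnable from inside one of the st2 containers (assuming the st2 virtualenv's Python with pymongo at /opt/stackstorm/st2/bin/python and the compose service name mongo):

#!/opt/stackstorm/st2/bin/python
import socket

import pymongo

# 1. Is mongo:27017 reachable at the TCP level at all?
socket.create_connection(("mongo", 27017), timeout=3).close()
print("TCP connection to mongo:27017 OK")

# 2. Can a plain pymongo client talk to the database?
client = pymongo.MongoClient("mongo", 27017, serverSelectionTimeoutMS=3000)
client.admin.command("ping")
print("MongoDB ping OK")

If the ping succeeds here while the st2 services still fail, that points at the StackStorm side per the split above.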

@ytjohn
Contributor

ytjohn commented Nov 21, 2022

I actually spent a lot of time back in September trying to solve this issue without reaching any definitive answers. On Ubuntu VMs I encounter the issue, but in WSL2 running Rancher Desktop there are no issues. Between my team and me, we were able to replicate it on Ubuntu 20.04 and 22.04. From my notes, the issue first appears somewhere between dockerd 20.10.13 and 20.10.17. Generally, I have to reboot after downgrading to resolve it, and if I upgrade after it has worked once, I'm good. But on a fresh install: problems.

Ultimately, the issue presents as a network race condition. It's not a DNS timeout; that produces a different error. I also tried adding a script that sleeps before starting services like st2actionrunner, to give the network between the st2actionrunner container and mongo time to establish.

Here's what I know:

When the issue occurs, the error message is relatively vague:

st2api_1               | 2022-09-13 13:15:48,108 INFO [-] Connecting to database "st2" @ "mongo:27017" as user "None".
st2api_1               | 2022-09-13 13:15:51,121 ERROR [-] Failed to connect to database "st2" @ "mongo:27017" as user "None": No servers found yet, Timeout: 3.0s, Topology Description: <TopologyDescription id: 63208284d0b0393f8dd98335, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None>]>

If you manage to shell into the container before it exits, you'll find that the mongo instance is reachable. I had a mongocheck.py script that I would mount in the container and run before the container was restarted.

Here is the script. The comments in it show the errors you see if DNS is bad or the port is unreachable.

import pymongo

# >>> client = pymongo.MongoClient("afakehost", 27017)  # invalid host
# >>> db = client.st2
# >>> db.my_collection.insert_one({"x": 10}).inserted_id
# pymongo.errors.ServerSelectionTimeoutError: afakehost:27017: [Errno -3] Temporary failure in name resolution, Timeout: 30s, Topology Description: <TopologyDescription id: 6308e86ed939509aa52e4eeb, topology_type: Single, servers: [<ServerDescription ('afakehost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('afakehost:27017: [Errno -3] Temporary failure in name resolution')>]>
# >>> client = pymongo.MongoClient("mongo", 27019)  # invalid port
# >>> db = client.st2
# >>> db.my_collection.insert_one({"x": 10}).inserted_id
mongohost = "mongo"
mongoport = 27017

print(f"Connecting to {mongohost}:{mongoport}")
client = pymongo.MongoClient(mongohost, mongoport)
db = client.st2
inserted = db.my_collection.insert_one({"x": 10}).inserted_id
print(f"inserted id of {inserted}")

Here, 20.10.13 of the docker-ce package had it working for me. I had downgraded from 20.10.17, and my docker client is still on that version.

hogenj@devg:~/projects/st2-docker$ apt-cache madison docker-ce
 docker-ce | 5:20.10.17~3-0~ubuntu-jammy | https://download.docker.com/linux/ubuntu jammy/stable amd64 Packages
 docker-ce | 5:20.10.16~3-0~ubuntu-jammy | https://download.docker.com/linux/ubuntu jammy/stable amd64 Packages
 docker-ce | 5:20.10.15~3-0~ubuntu-jammy | https://download.docker.com/linux/ubuntu jammy/stable amd64 Packages
 docker-ce | 5:20.10.14~3-0~ubuntu-jammy | https://download.docker.com/linux/ubuntu jammy/stable amd64 Packages
 docker-ce | 5:20.10.13~3-0~ubuntu-jammy | https://download.docker.com/linux/ubuntu jammy/stable amd64 Packages

sudo apt install docker-ce=5:20.10.13~3-0~ubuntu-jammy
hogenj@devg:~/projects/st2-docker$ docker --version
Docker version 20.10.17, build 100c701
hogenj@devg:~/projects/st2-docker$ docker-compose --version
docker-compose version 1.29.2, build unknown
hogenj@devg:~/projects/st2-docker$ dockerd --version
Docker version 20.10.13, build 906f57f

Since it's been a while and there are newer versions of docker-ce out, I'm going to try upgrading my dev vm and see where things stand.

@ytjohn
Contributor

ytjohn commented Nov 21, 2022

After updates:

Docker version 20.10.21, build baeda1f
Docker version 20.10.21, build 3056208
docker-compose version 1.29.2, build unknown

I made some changes to st2api to try and work around any race conditions:

  st2api:
    image: ${ST2_IMAGE_REPO:-stackstorm/}st2api:${ST2_VERSION:-latest}
...
    command: /startst2api.sh
    # command: /mongocheck2.py
    volumes:
      - ./scripts/st2client-startup.sh:/st2client-startup.sh:ro
      - ./files/mongocheck.py:/mongocheck.py:rw
      - ./files/mongocheck2.py:/mongocheck2.py:rw
      - ./files/startst2api.sh:/startst2api.sh:rw
...

startst2api.sh is:

#!/bin/bash
echo "sleeping 30 seconds"
sleep 30
echo "starting st2api"
/opt/stackstorm/st2/bin/st2api --config-file=/etc/st2/st2.conf --config-file=/etc/st2/st2.docker.conf --config-file=/etc/st2/st2.user.conf

If I start that, I get the vague server selection timeout:

hogenj@devg:~/projects/st2-docker$ docker-compose up st2api
st2-docker_mongo_1 is up-to-date
st2-docker_redis_1 is up-to-date
st2-docker_rabbitmq_1 is up-to-date
Starting st2-docker_st2makesecrets_1 ... done
Recreating st2-docker_st2api_1       ... done
Attaching to st2-docker_st2api_1
st2api_1               | sleeping 30 seconds
st2api_1               | starting st2api
st2api_1               | 2022-11-21 15:08:05,436 INFO [-] Using Python: 3.8.10 (/opt/stackstorm/st2/bin/python)
st2api_1               | 2022-11-21 15:08:05,436 INFO [-] Using fs encoding: utf-8, default encoding: utf-8, locale: en_US.UTF-8, LANG env variable: en_US.UTF-8, PYTHONIOENCODING env variable: notset
st2api_1               | 2022-11-21 15:08:05,436 INFO [-] Using config files: /etc/st2/st2.conf,/etc/st2/st2.docker.conf,/etc/st2/st2.user.conf
st2api_1               | 2022-11-21 15:08:05,437 INFO [-] Using logging config: /etc/st2/logging.api.gunicorn.conf
st2api_1               | 2022-11-21 15:08:05,437 INFO [-] Using coordination driver: redis
st2api_1               | 2022-11-21 15:08:05,437 INFO [-] Using metrics driver: noop
st2api_1               | 2022-11-21 15:08:05,461 INFO [-] Connecting to database "st2" @ "mongo:27017" as user "None".
st2api_1               | 2022-11-21 15:08:08,484 ERROR [-] Failed to connect to database "st2" @ "mongo:27017" as user "None": No servers found yet, Timeout: 3.0s, Topology Description: <TopologyDescription id: 637b945500772fa92caefd9a, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None>]>
(snip)

however...

If I more or less recreate the connection code myself, even without a sleep, it works.

files/mongocheck2.py

#!/opt/stackstorm/st2/bin/python

import pymongo
import mongoengine 
from pymongo.errors import OperationFailure
from pymongo.errors import ConnectionFailure
from pymongo.errors import ServerSelectionTimeoutError
import logging
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

from oslo_config import cfg

# import st2common.config as common_config

from st2api import config

config.register_opts(ignore_errors=True)
import st2common.models.db as db
db.LOG.setLevel(logging.DEBUG)


db_host = "mongo"
db_port = 27017
db_name = "st2"
username = None
password = None
connection_timeout = 3000  # ms
ssl_kwargs = {}
compressor_kwargs = {}

print('running _db_connect')


connection = db._db_connect(db_name, db_host, db_port)

# NOTE: We intentionally set "serverSelectionTimeoutMS" to 3 seconds. By default it's set to
# 30 seconds, which means it will block up to 30 seconds and fail if there are any SSL related
# or other errors
connection_timeout = cfg.CONF.database.connection_timeout
connection = mongoengine.connection.connect(
    db_name,
    host=db_host,
    port=db_port,
    tz_aware=True,
    # alias='foo',
    username=username,
    password=password,
    connectTimeoutMS=connection_timeout,
    serverSelectionTimeoutMS=connection_timeout,
    **ssl_kwargs,
    **compressor_kwargs,
)


# NOTE: Since pymongo 3.0, connect() method is lazy and not blocking (always returns success)
# so we need to issue a command / query to check if connection has been
# successfully established.
# See http://api.mongodb.com/python/current/api/pymongo/mongo_client.html for details
try:
    # The ping command is cheap and does not require auth
    # https://www.mongodb.com/community/forums/t/how-to-use-the-new-hello-interface-for-availability/116748/
    connection.admin.command("ping")
except (ConnectionFailure, ServerSelectionTimeoutError) as e:
    # NOTE: ServerSelectionTimeoutError can also be thrown if SSLHandShake fails in the server
    # Sadly the client doesn't include more information about the error so in such scenarios
    # user needs to check MongoDB server log
    print(f'Failed to connect to database "{db_name}" @ "{db_host}:{db_port}" as user "{username}".')
    raise e


print(f'Successfully connected to database "{db_name}" @ "{db_host}:{db_port}" as user "{username}".')

# connection.close()

And I set that as the command in st2api:

    command: /opt/stackstorm/st2/bin/python /mongocheck2.py
    volumes:
      - ./files/mongocheck2.py:/mongocheck2.py:rw

hogenj@devg:~/projects/st2-docker$ docker-compose up st2api
Starting st2-docker_st2makesecrets_1 ... 
st2-docker_mongo_1 is up-to-date
st2-docker_rabbitmq_1 is up-to-date
Starting st2-docker_st2makesecrets_1 ... done
Starting st2-docker_st2api_1         ... done
Attaching to st2-docker_st2api_1
st2api_1               | running _db_connect
st2api_1               | Successfully connected to database "st2" @ "mongo:27017" as user "None".
st2api_1               | Inserting one using pymongo
st2api_1               | inserted id of 637ba0ecbdf365d52a98264e
st2-docker_st2api_1 exited with code 0

And because this is already a wall of text, I'm providing mongocheck2.py as a gist. It's basically a copy of the st2common db connection.

One interesting note, which might not be related: after running mongocheck2.py, I started getting an error about not calling disconnect() first. That's why mongocheck2.py includes the (commented-out) alias='foo' line.

st2api_1               | running _db_connect
st2api_1               | Traceback (most recent call last):
st2api_1               |   File "/mongocheck2.py", line 40, in <module>
st2api_1               |     connection = mongoengine.connection.connect(
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/mongoengine/connection.py", line 395, in connect
st2api_1               |     raise ConnectionFailure(err_msg)
st2api_1               | mongoengine.connection.ConnectionFailure: A different connection with alias `default` was already registered. Use disconnect() first
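
For context, that error comes from mongoengine registering every connection under an alias ("default" unless told otherwise); calling connect() a second time with different parameters under the same alias raises ConnectionFailure. A minimal sketch of the two ways around it (recent mongoengine versions; the alias name is arbitrary):

import mongoengine

# The first connect() registers the "default" alias.
mongoengine.connect("st2", host="mongo", port=27017)

# Connecting again under the same alias with different parameters would
# raise: A different connection with alias `default` was already
# registered. Use disconnect() first

# Option 1: free the "default" alias before reconnecting.
mongoengine.disconnect()
mongoengine.connect("st2", host="mongo", port=27017, tz_aware=True)

# Option 2: register the new connection under its own alias, which is
# what the alias='foo' line in mongocheck2.py is for.
mongoengine.connect("st2", host="mongo", port=27017, alias="foo")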

@arm4b
Member

arm4b commented Nov 21, 2022

First, it's OK when stackstorm tries to connect to a MongoDB that is not yet available and fails. It will keep retrying until a successful connection is established.

The StackStorm services can handle the DB backend connection in two different ways:

  • retry multiple times until MongoDB connection is established (default)
  • exit process on first failure and let orchestrator handle restarts (default in Docker/K8s environment)

In Docker/K8s we override these settings, because exiting on failure and letting the orchestrator handle the container restarts is the better way for those systems to control recovery, liveness, and availability:
https://github.com/StackStorm/st2-dockerfiles/blob/618f6dcff4d84aa5c26f5dd719a33bf309dbbcbf/base/files/st2.tmp.conf#L17-L21

If you'd like a running process to keep retrying the connection until it succeeds, you can revert to the default st2 settings here:
https://github.com/StackStorm/st2/blob/dfab25aab0468f330a635db55f01286523217135/conf/st2.conf.sample#L129-L136
and also fine-tune the timeouts, number of retries, delays, and so on.


What are the benefits of a middle-man script that waits for MongoDB to be up & running if st2 can handle reconnection itself?

@ytjohn
Contributor

ytjohn commented Nov 22, 2022

I do apologize; I need to remember to summarize when uploading a lot of troubleshooting output. To recap:

  1. The easiest way to reproduce this is to spin up a clean Ubuntu 20.04 or 22.04 VM and install the latest docker and docker-compose. For those who already had this setup working, it seems to continue working through upgrades.
  2. mongo (and redis) are up before st2api: I bring those up first, and they're also dependencies of st2api in the docker-compose file.
  3. Alternative mongo clients, including one I wrote based on the code in st2common/models/db/__init__.py, connect to mongo with no issues.
  4. st2api (and all the others) fail to connect to mongo using stackstorm code.
  5. It's not a DNS error or a closed-port error.
  6. I did the middle-man script to eliminate some potential race conditions, like docker networking not being established. It waits for 30 seconds and then starts st2api. Even with that wait, st2api still fails to connect to mongo.

Per your suggestion, I went ahead and updated my retry configurations.

# Keep retrying the database connection instead of exiting early and letting k8s handle restarts.
[database]
# Connection retry backoff max (seconds).
connection_retry_backoff_max_s = 12
# Backoff multiplier (seconds).
connection_retry_backoff_mul = 2
# Connection retry total time (minutes).
connection_retry_max_delay_m = 5
# Connection and server selection timeout (in ms).
connection_timeout = 6000

Running this, it retries multiple times, and the container exits after about a minute.

Tail end of the output:

st2api_1               | 2022-11-22 01:37:07,051 INFO [-] Using metrics driver: noop
st2api_1               | 2022-11-22 01:37:07,061 INFO [-] Connecting to database "st2" @ "mongo:27017" as user "None".
st2api_1               | 2022-11-22 01:37:13,080 ERROR [-] Failed to connect to database "st2" @ "mongo:27017" as user "None": No servers found yet, Timeout: 6.0s, Topology Description: <TopologyDescription id: 637c27c32cb168d993d30aa9, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None>]>
st2api_1               | 2022-11-22 01:37:13,081 ERROR [-] (PID=1) ST2 API quit due to exception.
st2api_1               | Traceback (most recent call last):
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/cmd/api.py", line 98, in main
st2api_1               |     _setup()
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/cmd/api.py", line 60, in _setup
st2api_1               |     common_setup(
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/service_setup.py", line 252, in setup
st2api_1               |     db_setup()
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/database_setup.py", line 55, in db_setup
st2api_1               |     connection = db_init.db_setup_with_retry(**db_cfg)
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/persistence/db_init.py", line 79, in db_setup_with_retry
st2api_1               |     return db_func_with_retry(
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/persistence/db_init.py", line 58, in db_func_with_retry
st2api_1               |     return retrying_obj.call(db_func, *args, **kwargs)
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/retrying.py", line 206, in call
st2api_1               |     return attempt.get(self._wrap_exception)
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/retrying.py", line 247, in get
st2api_1               |     six.reraise(self.value[0], self.value[1], self.value[2])
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/six.py", line 696, in reraise
st2api_1               |     raise value
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/retrying.py", line 200, in call
st2api_1               |     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/models/db/__init__.py", line 239, in db_setup
st2api_1               |     connection = _db_connect(
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/models/db/__init__.py", line 213, in _db_connect
st2api_1               |     raise e
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/models/db/__init__.py", line 204, in _db_connect
st2api_1               |     connection.admin.command("ping")
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/database.py", line 736, in command
st2api_1               |     with self.__client._socket_for_reads(
st2api_1               |   File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
st2api_1               |     return next(self.gen)
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1325, in _socket_for_reads
st2api_1               |     server = self._select_server(read_preference, session)
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1278, in _select_server
st2api_1               |     server = topology.select_server(server_selector)
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/topology.py", line 241, in select_server
st2api_1               |     return random.choice(self.select_servers(selector,
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/topology.py", line 199, in select_servers
st2api_1               |     server_descriptions = self._select_servers_loop(
st2api_1               |   File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/topology.py", line 215, in _select_servers_loop
st2api_1               |     raise ServerSelectionTimeoutError(
st2api_1               | pymongo.errors.ServerSelectionTimeoutError: No servers found yet, Timeout: 6.0s, Topology Description: <TopologyDescription id: 637c27c32cb168d993d30aa9, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None>]>
st2-docker_st2api_1 exited with code 1

real    1m5.811s
user    0m0.514s
sys     0m0.099s

@rite2hhh

rite2hhh commented Nov 22, 2022

I ran into the same issue. Buried in the output of ONLY the first attempt to start st2 (and, incidentally, mongodb) was the following error:

mongo_1                | {"t":{"$date":"2022-11-22T03:56:33.766+00:00"},"s":"F",  "c":"STORAGE",  "id":4671205, "ctx":"initandlisten","msg":"This version of MongoDB is too recent to start up on the existing data files. Try MongoDB 4.2 or earlier."}

Clearing out my volumes, to ensure I didn't have data compatibility issues, was what resolved this for me.

OR migrate your existing data from your current version to mongo:4.4.
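
If you go the migration route, note that MongoDB 4.4 refuses to start on data files whose feature compatibility version is below 4.2, so 4.0-era data has to be stepped through mongo:4.2 first. A rough sketch of checking and bumping the compatibility version with pymongo (service name mongo assumed; see the MongoDB upgrade docs for the full procedure):

import pymongo

client = pymongo.MongoClient("mongo", 27017)

# Check the feature compatibility version of the existing data files.
fcv = client.admin.command({"getParameter": 1, "featureCompatibilityVersion": 1})
print(fcv["featureCompatibilityVersion"])

# After starting the intermediate server version (e.g. mongo:4.2) on the
# old volume, bump the compatibility version before moving to 4.4.
client.admin.command({"setFeatureCompatibilityVersion": "4.2"})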

You could try docker-compose up mongo and inspect around; decouple the containers and try to launch them individually.

I noticed mongo was dying with exit code 14; each time it respawned it would print some of its log (but not all of it).

UPDATE:

In my case, the other services had a problem connecting to mongo and failed with ServerSelectionTimeoutError, but mongo itself was crashing, and the version-compatibility mismatch message was only emitted at the very beginning.

@arm4b added the bug label Nov 22, 2022