[BUG] Master failback does not work if none of the masters can be resolved #66534

Open
2 of 9 tasks
Jepson2k opened this issue May 16, 2024 · 0 comments
Labels
Bug broken, incorrect, or confusing behavior needs-triage


Description
If a minion is set up in Multi-Master mode, each master is given as a domain name, and none of the domain names can be resolved, the minion keeps retrying only the last master and never tries the first one again, even if the master_failback parameter is set.

Setup
minion config:

master:
    - examplehostname
    - examplehostname.local
master_type: failover
master_failback: True
retry_dns: 0

Please be as specific as possible and give set-up details.

  • on-prem machine
  • VM (Virtualbox, KVM, etc. please specify)
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • or a combination, please be explicit
  • jails if it is FreeBSD
  • classic packaging
  • onedir packaging
  • used bootstrap to install

Steps to Reproduce the behavior

  1. Do not set up a master on the network, to simulate the master being down or disconnected.
  2. Set up the minion with the configuration file above and run salt-minion.

Expected behavior
The minion fails back to trying to resolve the first master if it cannot resolve the last one (because the first master might now be up).
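As a rough illustration of the expected behaviour, the sketch below models a retry loop that walks the whole master list on every round, starting again from the first master. `failback_order`, `resolve`, and the `rounds` parameter are hypothetical names, not Salt APIs. Port 4506 is assumed because it is Salt's default `master_port`, and the wait of `acceptance_wait_time` between rounds that real Salt performs is left out.

```python
import socket
from typing import Callable, Iterable, Iterator, Optional


def resolve(host: str, port: int = 4506) -> Optional[str]:
    """Hypothetical helper: first resolved address for host, or None on failure."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return None
    return infos[0][4][0]  # sockaddr tuple -> address string


def failback_order(
    masters: Iterable[str],
    resolver: Callable[[str], Optional[str]] = resolve,
    rounds: int = 3,
) -> Iterator[tuple[str, Optional[str]]]:
    """Yield (master, address) per attempt, restarting at the first master each round.

    NOT Salt's implementation: a model of the behaviour expected in this report.
    """
    master_list = list(masters)  # keep the full list; never overwrite it with one entry
    for _ in range(rounds):
        for master in master_list:
            addr = resolver(master)
            yield master, addr
            if addr is not None:
                return  # stop at the first master that resolves
```

With the configuration from this report and a resolver that always fails, the attempts alternate between examplehostname and examplehostname.local on every round instead of repeating only the last entry.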

Versions Report

salt --versions-report
No difference in Salt versions between master and minion.
Salt Version:
          Salt: 3007.0
 
Python Version:
        Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]
 
Dependency Versions:
          cffi: 1.16.0
      cherrypy: 18.8.0
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: Not Installed
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 23.1
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.5.2
        PyYAML: 6.0.1
         PyZMQ: 25.1.2
        relenv: 0.15.1
         smmap: Not Installed
       timelib: 0.3.0
       Tornado: 6.3.3
           ZMQ: 4.3.4
 
Salt Package Information:
  Package Type: onedir
 
System Versions:
          dist: ubuntu 22.04.4 jammy
        locale: utf-8
       machine: x86_64
       release: 6.5.0-28-generic
        system: Linux
       version: Ubuntu 22.04.4 jammy

Additional context
I help to manage 50 laptops we use for various events. The setup has to be flexible and work on different networks, so we try to use multicast DNS names for resolution. Some networks don't support mDNS but do resolve the plain hostname. Therefore, we've had decent success by including both the salt-master's hostname and its hostname.local. Unfortunately, neither of these name resolution techniques is very reliable, so it would be useful for the salt minions to keep trying both rather than just the last one.

Potentially why this is occurring
Without diving too deep into the code base, here is what I've observed:

  1. At line 687 in minion.py, opts["master"], which was originally a list, is set to just one of the masters: opts["master"] = master
  2. Since none of the master names can be resolved, the error on line 702 is raised: raise SaltClientError(msg)
  3. The coroutine waits according to the acceptance_wait_time parameter in the minion config.
  4. The loop repeats and eval_master is called again. Since opts["master"] is now a string, the conditional on line 600, elif isinstance(opts["master"], str) and ("master_list" not in opts):, is taken instead of the failed conditional on line 611, elif failed:, which would set opts["master"] back to the list.
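The four steps above can be reproduced with a heavily simplified model. This is not Salt's `eval_master`: only the two quoted conditions (line 600 and line 611) come from the report, while the leading list branch and the helper names `choose_branch` and `simulate_retry` are hypothetical assumptions made for illustration.

```python
def choose_branch(opts: dict, failed: bool) -> str:
    """Return the name of the branch the quoted conditions would select (model only)."""
    if isinstance(opts["master"], list):
        return "list"  # assumed: normal multi-master path
    elif isinstance(opts["master"], str) and ("master_list" not in opts):
        return "str"  # condition quoted from line 600
    elif failed:
        return "failed"  # condition quoted from line 611; would restore the list
    return "none"


def simulate_retry(masters: list) -> tuple:
    """Walk the observed steps and return the branch taken on each evaluation."""
    opts = {"master": list(masters)}
    first = choose_branch(opts, failed=False)
    # Step 1 (line 687): the list is replaced by a single master string.
    opts["master"] = masters[-1]
    # Steps 2-3: SaltClientError is raised, the loop waits, then evaluates again.
    second = choose_branch(opts, failed=True)
    return first, second


print(simulate_retry(["examplehostname", "examplehostname.local"]))
```

The second evaluation lands on the `str` branch even though `failed` is True, which matches step 4: the `failed` branch is never reached.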

Potential Solution
I don't plan on opening a pull request, since I am not familiar enough with Salt to know whether this would break anything else, but changing line 600 in minion.py to elif isinstance(opts["master"], str) and ("master_list" not in opts) and not failed: seemed to fix the issue.
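The proposed guard can be checked against the same kind of simplified model. Again this is a hypothetical sketch, not Salt code: `choose_branch_patched` is an illustrative name, the leading list branch is assumed, and only the conditions themselves come from the report.

```python
def choose_branch_patched(opts: dict, failed: bool) -> str:
    """Model of the branch order with the proposed `and not failed` guard."""
    if isinstance(opts["master"], list):
        return "list"  # assumed: normal multi-master path
    elif (
        isinstance(opts["master"], str)
        and ("master_list" not in opts)
        and not failed  # proposed guard from the report
    ):
        return "str"
    elif failed:
        return "failed"  # per the report, this restores opts["master"] to the list
    return "none"
```

In this model the guard only changes behaviour when `failed` is True, so a first evaluation of a single-string master still takes the `str` branch. What the model cannot show is whether the `failed` branch copes with a minion configured with a single string from the start, which is the kind of side effect the reporter was unsure about.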

Temporary Workaround
Adding an IP address such as 127.0.0.1 to the list of masters fixes this issue.
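A plausible reason the workaround helps, inferred rather than confirmed in the report: an IP literal can be converted to an address without any name lookup, so at least one entry in the list never fails at the resolution stage. The sketch uses the standard `socket.AI_NUMERICHOST` flag to show the difference. `resolves_without_dns` is a hypothetical helper, and 4506 is assumed as Salt's default `master_port`.

```python
import socket


def resolves_without_dns(host: str, port: int = 4506) -> bool:
    """True if getaddrinfo accepts host as a numeric address, with no DNS lookup."""
    try:
        # AI_NUMERICHOST forbids name resolution: hostnames raise gaierror.
        socket.getaddrinfo(host, port, flags=socket.AI_NUMERICHOST)
    except socket.gaierror:
        return False
    return True


for name in ["examplehostname", "examplehostname.local", "127.0.0.1"]:
    print(name, resolves_without_dns(name))
```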

@Jepson2k Jepson2k added Bug broken, incorrect, or confusing behavior needs-triage labels May 16, 2024