systemd service failure inhibits system configuration activation #1535

Open
PAI5REECHO opened this issue Aug 14, 2022 · 2 comments

@PAI5REECHO

Whenever a nixops deployment targets a system that has a systemd service in an activating (auto-restart) or failed state, the deployment fails. I don't understand why nixops is designed this way.

test.........> setting up tmpfiles
test.........> the following new units were started: systemd-coredump@194-238204-0.service
test.........> warning: the following units failed: restic-backups-external.service
test.........> 
test.........> ● test.service - test
test.........>      Loaded: loaded (/etc/systemd/system/test.service; linked; preset: enabled)
test.........>      Active: activating (auto-restart) since Sun 2022-08-14 12:00:08 UTC; 2h 9min ago
test.........> TriggeredBy: ● test.timer
test.........>    Main PID: 8780 (code=exited, status=1/FAILURE)
test.........>         CPU: 512ms
test.........> error: Traceback (most recent call last):
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/deployment.py", line 906, in worker
    raise Exception(
Exception: unable to activate new configuration (exit code 4)

Traceback (most recent call last):
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/bin/.nixops-wrapped", line 9, in <module>
    sys.exit(main())
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/__main__.py", line 56, in main
    args.op(args)
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/script_defs.py", line 715, in op_deploy
    depl.deploy(
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/deployment.py", line 1365, in deploy
    self.run_with_notify("deploy", lambda: self._deploy(**kwargs))
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/deployment.py", line 1354, in run_with_notify
    f()
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/deployment.py", line 1365, in <lambda>
    self.run_with_notify("deploy", lambda: self._deploy(**kwargs))
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/deployment.py", line 1300, in _deploy
    self.activate_configs(
  File "/nix/store/8myxcs76bsyg37n21x2xwnj6srfwfxxm-python3.10-nixops-2.0.0/lib/python3.10/site-packages/nixops/deployment.py", line 947, in activate_configs
    raise Exception(
Exception: activation of 1 of 1 machines failed (namely on ‘test’)
@roberth
Member

roberth commented Aug 15, 2022

Me neither, if what you're saying is that something was skipped because of the error.

Stopping a deployment halfway is incompatible with declarative deployments that do not specify dependencies (we don't), and it is also incompatible with the idea of letting the distributed system converge towards an acceptably (or fully) operational state.
That said, using the deployment process for feedback about the system seems useful. Did your deployment skip anything because of the error? If so, that would be an issue that needs correcting.

Also, we shouldn't be emitting a stack trace for this type of error, and the log should be clear about what did and did not happen.

TODO

  • check that errors are collected but do not interrupt parallel changes
  • report such errors with clarity as to what happened. Specifically answer the question whether a re-deployment is necessary.
  • do not report a stack trace for expected errors that are handled properly
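
A minimal sketch of what the three TODO items could look like together, assuming a thread pool of per-machine tasks. This is not NixOps code: `ActivationError`, `Outcome`, `run_all` and `summarize` are hypothetical names invented for illustration. Expected errors are recorded per machine so the other machines still run to completion, and the summary is plain text rather than a traceback.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


class ActivationError(Exception):
    """Hypothetical 'expected' error: activation ran but reported a problem."""


@dataclass
class Outcome:
    machine: str
    error: Optional[Exception] = None


def run_all(tasks: Dict[str, Callable[[], None]]) -> List[Outcome]:
    """Run every per-machine task to completion, collecting expected errors."""

    def guarded(name: str, task: Callable[[], None]) -> Outcome:
        try:
            task()
            return Outcome(name)
        except ActivationError as exc:
            # Expected failure: record it instead of aborting the other machines.
            return Outcome(name, exc)

    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(guarded, name, task) for name, task in tasks.items()]
        # Unexpected exceptions (anything but ActivationError) still propagate here.
        return [future.result() for future in futures]


def summarize(outcomes: List[Outcome]) -> str:
    """Describe what failed in plain text, without a stack trace."""
    failed = [o for o in outcomes if o.error is not None]
    lines = [f"{o.machine}: {o.error}" for o in failed]
    lines.append(f"activation failed on {len(failed)} of {len(outcomes)} machines")
    return "\n".join(lines)
```

Only the hypothetical `ActivationError` is swallowed; genuine bugs still surface with a full traceback, which keeps the distinction between expected and unexpected errors visible.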

@PAI5REECHO
Author

PAI5REECHO commented Aug 16, 2022

> Did your deployment skip anything because of the error?

Yes. System activation fails because of a failing or pending systemd service, so no changes are applied to the system, which is unexpected. Activation shouldn't depend on the health of systemd services.
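
One way to express the distinction being asked for is to classify the activation exit code instead of treating every non-zero value as a hard failure. The sketch below is hypothetical and not taken from NixOps. It assumes that exit code 4, the value in the log above that follows the "units failed" warning, means the configuration was switched but some units are unhealthy. That reading should be checked against the activation script before relying on it.

```python
from enum import Enum


class ActivationResult(Enum):
    APPLIED = "applied"
    APPLIED_WITH_FAILED_UNITS = "applied, but some units failed"
    NOT_APPLIED = "not applied"


# Assumption: 4 is the exit code reported in this thread's log right after
# the "the following units failed" warning. Its exact meaning is not
# confirmed here.
UNITS_FAILED_EXIT_CODE = 4


def classify(exit_code: int) -> ActivationResult:
    """Map an activation exit code to a coarse result (illustrative only)."""
    if exit_code == 0:
        return ActivationResult.APPLIED
    if exit_code == UNITS_FAILED_EXIT_CODE:
        return ActivationResult.APPLIED_WITH_FAILED_UNITS
    return ActivationResult.NOT_APPLIED


def needs_redeploy(result: ActivationResult) -> bool:
    """Only a configuration that was not applied calls for a re-deployment."""
    return result is ActivationResult.NOT_APPLIED
```

Under that assumption, a failing service would be reported as a warning about system health, while `needs_redeploy` answers the separate question of whether the new configuration actually took effect.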
