Install Telegraf over SSH with pty=False
Telegraf sometimes failed to start when deployed over SSH, even though
the installation script completed with exit code 0. Running the same
script by hand in a terminal, after SSH'ing into an arbitrary VM,
returns no errors and Telegraf starts successfully.

It turns out this behavior has to do with job control and sessions
when daemonizing processes.

When we run a command over SSH with pty=True and the session leader
(the shell) exits, a SIGHUP is sent to the parent process's process
group, which includes the daemonized Telegraf binary. With pty=False,
on the other hand, no pseudo-terminal is allocated: the new shell
connects directly to stdin/stdout/stderr, which its child processes
inherit, allowing them to keep running in the background after the
parent process has exited.

For more details, see: fabric/fabric#395
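
The session mechanics described above can be illustrated locally, without SSH. The following is a minimal sketch and not what this commit does: the commit only switches the pty flag, whereas the sketch uses a different, related technique (starting the child in its own session via `setsid()`, with stdio detached from any terminal) so that a hangup on the parent's terminal would not reach it. The helper names `spawn_detached` and `same_session` are hypothetical, and the code assumes Python 3 on a POSIX system such as Linux.

```python
import os
import subprocess
import sys


def spawn_detached(argv):
    """Start argv in its own session, with stdio not tied to a terminal.

    Hypothetical helper for illustration only.
    """
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # the child calls setsid() before exec
    )


def same_session(pid):
    """Return True if the process `pid` shares this process's session."""
    return os.getsid(pid) == os.getsid(0)


if __name__ == "__main__":
    # A short-lived stand-in for a daemon such as Telegraf.
    child = spawn_detached([sys.executable, "-c", "import time; time.sleep(5)"])
    try:
        print("child shares our session:", same_session(child.pid))
    finally:
        child.terminate()
        child.wait()
```

Because the child leads its own session, it is no longer a member of the parent shell's process group, which is the group the commit message identifies as the SIGHUP target.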
pchristos committed Feb 27, 2018
1 parent 1d0ea86 commit 94e9221
Showing 1 changed file with 7 additions and 6 deletions.
13 changes: 7 additions & 6 deletions src/mist/api/monitoring/tasks.py
@@ -39,15 +39,17 @@ def install_telegraf(machine_id, job=None, job_id=None, plugins=None):
shell = mist.api.shell.Shell(machine.ctl.get_host())
key, user = shell.autoconfigure(machine.owner, machine.cloud.id,
machine.machine_id)
exit_code, stdout = shell.command(unix_install(machine))
exit_code, stdout, stderr = shell.command(unix_install(machine), False)
stdout = stdout.encode('utf-8', 'ignore')
stdout = stdout.replace('\r\n', '\n').replace('\r', '\n')
stderr = stderr.encode('utf-8', 'ignore')
stderr = stderr.replace('\r\n', '\n').replace('\r', '\n')
except Exception as err:
log.error('Error during Telegraf installation: %s', repr(err))
stdout = ''
else:
err = exit_code or None
_log.update({'key_id': key, 'ssh_user': user, 'exit_code': exit_code})
_log.update({'key_id': key, 'ssh_user': user, 'exit_code': exit_code,
'stdout': stdout, 'stderr': stderr})
finally:
# Close the SSH connection.
shell.disconnect()
@@ -58,7 +60,7 @@ def install_telegraf(machine_id, job=None, job_id=None, plugins=None):
else:
machine.monitoring.installation_status.state = 'succeeded'
machine.monitoring.installation_status.finished_at = time.time()
machine.monitoring.installation_status.stdout = stdout
machine.monitoring.installation_status.stdout = stderr
machine.monitoring.installation_status.error_msg = str(err)
machine.save()

@@ -83,8 +85,7 @@ def install_telegraf(machine_id, job=None, job_id=None, plugins=None):
err = 'Deployment of scripts with IDs %s failed' % ','.join(failed)

# Log deployment's outcome.
log_event(action='telegraf_deployment_finished',
stdout=stdout, error=err, **_log)
log_event(action='telegraf_deployment_finished', error=err, **_log)

# Trigger UI update.
trigger_session_update(machine.owner, ['monitoring'])