dswd edited this page Mar 11, 2014 · 8 revisions

Host Management

Downtime guidelines

Sometimes downtimes are inevitable. Host administrators should take the following steps to minimize the impact on the testbed:

  • Two weeks before the downtime: Send a notification about the date and duration of the downtime to users@tomato-lab.org.
  • One week before the downtime: Manually disable the host in ToMaTo to avoid new elements being assigned to it.
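The two deadlines above can be derived from the planned downtime date. The following is a minimal sketch; the function name `downtime_schedule` and the dictionary keys are illustrative and not part of ToMaTo.

```python
from datetime import date, timedelta


def downtime_schedule(downtime_start):
    """Return the latest dates for the two preparation steps listed above."""
    return {
        # Two weeks before: notify users@tomato-lab.org
        "announce_by": downtime_start - timedelta(weeks=2),
        # One week before: disable the host in ToMaTo
        "disable_host_by": downtime_start - timedelta(weeks=1),
    }
```

For a downtime starting on `date(2014, 3, 25)`, the notification would be due by 11 March and the host would be disabled by 18 March.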

Common host errors

Most host errors have descriptive messages, but some are less clear.

Host unreachable

This error can mean that the host really is unreachable, which can be verified by pinging it. If the host does respond to ping, the error means that the hostmanager has crashed or is no longer responding.

In this case, restart the hostmanager with the following command: /etc/init.d/tomato-hostmanager restart. Afterwards, check the log file /var/log/tomato/server.log to confirm that the hostmanager started successfully. If it did not, the end of this file will contain more information about the problem.
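The triage described above (ping first, then look at the end of the log) could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, the `ping` flags assume the Linux iputils version, and the log path is the one given in the text.

```python
from collections import deque
import subprocess

HOSTMANAGER_LOG = "/var/log/tomato/server.log"  # path taken from the text above


def host_answers_ping(host, timeout_s=2):
    """Return True if one ICMP echo request gets a reply (Linux iputils flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def classify_unreachable(ping_ok):
    """Map the ping result to the two cases described above."""
    if ping_ok:
        return "hostmanager crashed or not responding"
    return "host unreachable"


def tail(path, n=20):
    """Return the last n lines of a text file, e.g. the hostmanager log."""
    with open(path, errors="replace") as log:
        return [line.rstrip("\n") for line in deque(log, maxlen=n)]
```

After a restart, `tail(HOSTMANAGER_LOG)` shows the most recent log entries, which is where startup problems would be reported.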

Lots of error dumps

Whenever something on a host goes wrong, an error dump is written to /var/log/tomato/dumps/. If more than 100 error dumps exist on a host, the backend takes this as a sign that something on that host is wrong.

A large number of error dumps can mean either that the host has a problem that ToMaTo does not otherwise detect, or that there is a problem in the ToMaTo code that just happens to show up on that particular host.

The contents of the error dumps can help to identify whether the problem is on the host or in the code. Please send these dumps to the developers (devs@tomato-lab.org), for example as a zip file. Once the problem is fixed, the dump files can be removed from the folder.
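Counting the dumps and packing them for the developers could look like the sketch below. The directory and the limit of 100 come from the text above; the function names are hypothetical, and the sketch assumes that every regular file in the dump directory is an error dump.

```python
import os
import zipfile

DUMP_DIR = "/var/log/tomato/dumps"  # directory named in the text above
DUMP_LIMIT = 100                    # threshold named in the text above


def list_dumps(dump_dir=DUMP_DIR):
    """Return the sorted paths of all regular files in the dump directory."""
    return sorted(
        os.path.join(dump_dir, name)
        for name in os.listdir(dump_dir)
        if os.path.isfile(os.path.join(dump_dir, name))
    )


def too_many_dumps(dump_dir=DUMP_DIR, limit=DUMP_LIMIT):
    """Return True if the host has more dumps than the limit."""
    return len(list_dumps(dump_dir)) > limit


def bundle_dumps(archive_path, dump_dir=DUMP_DIR):
    """Write all dumps into one zip archive and return how many were added."""
    dumps = list_dumps(dump_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in dumps:
            archive.write(path, arcname=os.path.basename(path))
    return len(dumps)
```

The archive should be written outside the dump directory so that it is not mistaken for a dump on the next run.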

Less common host errors

Last query took very long

This error is a sign that something is going wrong on the host. If it appears alone, without other errors, the problem may be one that the other tests do not cover.

If the query time is consistently about 5 seconds, this is a sign that the primary DNS server is not responding.
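One way to check the DNS hypothesis on the host is to time a name lookup and see whether it lands near 5 seconds, which is the default per-server resolver timeout on many Linux systems. The sketch below is an assumption-laden illustration, not part of ToMaTo: the function names and the one-second tolerance are made up for this example.

```python
import socket
import time


def time_lookup(hostname, resolve=socket.getaddrinfo):
    """Return how many seconds a name lookup takes, whether or not it succeeds."""
    start = time.monotonic()
    try:
        resolve(hostname, None)
    except OSError:
        # A failed lookup still shows how long the resolver waited.
        pass
    return time.monotonic() - start


def looks_like_dns_timeout(seconds, timeout_s=5.0, tolerance_s=1.0):
    """Return True if a duration is close to the assumed resolver timeout."""
    return abs(seconds - timeout_s) <= tolerance_s
```

Running `looks_like_dns_timeout(time_lookup("tomato-lab.org"))` several times on the affected host gives a rough indication: repeated `True` results would be consistent with an unresponsive primary DNS server.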