MRG: Merge pull request #541 from octue/fix/improve-crash-diagnostics
Fix and improve crash diagnostics
cortadocodes committed Nov 1, 2022
2 parents b026de9 + 2ee1dff commit d1e3d78
Showing 39 changed files with 1,917 additions and 1,031 deletions.
236 changes: 119 additions & 117 deletions docs/source/inter_service_compatibility.rst

Large diffs are not rendered by default.

23 changes: 8 additions & 15 deletions docs/source/testing_services.rst
@@ -172,7 +172,7 @@ To emulate your children in tests, patch the :mod:`Child <octue.resources.child.
app_src=app_directory_path,
twine=os.path.join(app_directory_path, "twine.json"),
children=children,
service_id="you/your-service:latest",
service_id="your-org/your-service:latest",
)
emulated_children = [
@@ -212,7 +212,8 @@ change correspondingly (or at all). So, it's up to you to define a set of realis
the child's twine - this is because the twine is only available to the real child. This is ok - you're testing your
service, not the child.

You can create test fixtures manually or by recording messages from a real child to a JSON file. To record messages:
You can create test fixtures manually or by using the ``Child.received_messages`` property after questioning a real
child.

.. code-block:: python
@@ -225,20 +226,13 @@ You can create test fixtures manually or by recording messages from a real child
backend={"name": "GCPPubSubBackend", "project_name": "my-project"},
)
result = child.ask(
input_values=[1, 2, 3, 4],
record_messages_to="child_messages.json",
)
with open("child_messages.json") as f:
child_messages = json.load(f)
result = child.ask(input_values=[1, 2, 3, 4])
child_messages
child.received_messages
>>> [
{
'type': 'delivery_acknowledgement',
'delivery_time': '2022-08-16 11:49:57.244263',
'message_number': 0
},
{
'type': 'log_record',
@@ -248,14 +242,11 @@ You can create test fixtures manually or by recording messages from a real child
'levelname': 'INFO',
...
},
'analysis_id': '0ce8386d-564d-47fa-9d11-3b728f557bfe',
'message_number': 1
},
{
'type': 'result',
'output_values': {"some": "results"},
'output_manifest': None,
'message_number': 2
}
]
@@ -266,7 +257,9 @@ You can then feed these into a child emulator to emulate one possible response o
from octue.cloud.emulators import ChildEmulator
child_emulator = ChildEmulator(messages=child_messages)
child_emulator = ChildEmulator(messages=child.received_messages)
child_emulator.ask(input_values=[1, 2, 3, 4])
>>> {"some": "results"}
You can also create test fixtures from :ref:`downloaded service crash diagnostics <test_fixtures_from_crash_diagnostics>`.
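
As a rough illustration of what replaying recorded messages involves, here is a minimal sketch. ``ReplayingChild`` is a hypothetical stand-in written for this note, not octue's ``ChildEmulator``. It assumes only the message types shown in the example output above; the ``log_record`` and ``msg`` keys inside the log message are also assumptions, because that part of the example is elided.

```python
import logging

logger = logging.getLogger("replayed_child")


class ReplayingChild:
    """Hypothetical stand-in for a child emulator: replays recorded messages in order."""

    def __init__(self, messages):
        self.messages = messages

    def ask(self, input_values=None):
        """Replay the recorded messages, ignoring the inputs, and return the recorded result (if any)."""
        for message in self.messages:
            if message["type"] == "log_record":
                # Assumed key name: the real message layout is elided in the docs example.
                logger.info("Replayed log record: %s", message.get("log_record"))
            elif message["type"] == "result":
                return message["output_values"]

        # No result message was recorded.
        return None


recorded_messages = [
    {"type": "delivery_acknowledgement", "delivery_time": "2022-08-16 11:49:57.244263"},
    {"type": "log_record", "log_record": {"msg": "Starting analysis.", "levelname": "INFO"}},
    {"type": "result", "output_values": {"some": "results"}, "output_manifest": None},
]

replayed_output = ReplayingChild(recorded_messages).ask(input_values=[1, 2, 3, 4])
```

The point of the sketch is that the emulated response depends only on the recorded messages, so the same fixture gives the same answer whatever inputs the test passes.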
111 changes: 88 additions & 23 deletions docs/source/troubleshooting_services.rst
@@ -2,40 +2,27 @@
Troubleshooting services
========================

Allowing crash diagnostics
==========================
A parent can give a child permission to save the following data to the cloud in the event the child fails while
processing a question:
Crash diagnostics
=================
Services save the following data to the cloud if they crash while processing a question:

- Input values
- Input manifest and datasets
- Child configuration values
- Child configuration manifest and datasets
- Messages sent from the child to the parent
- Inputs to and messages received in answer to each question the service asked its children (if it has any). These are
stored in the order the questions were asked.

The parent can give permission on a question-by-question basis by setting ``allow_save_diagnostics_data_on_crash=True``
in :mod:`Child.ask <octue.resources.child.Child.ask>`. For example:
.. important::

.. code-block:: python
child = Child(
id="my-organisation/my-service:latest",
backend={"name": "GCPPubSubBackend", "project_name": "my-project"},
)
answer = child.ask(
input_values={"height": 32, "width": 3},
allow_save_diagnostics_data_on_crash=True,
)
For this feature to be enabled, the child must have the ``crash_diagnostics_cloud_path`` field in its service
configuration (:ref:`octue.yaml <octue_yaml>` file) set to a Google Cloud Storage path.

For crash diagnostics to be saved, the child must have the ``crash_diagnostics_cloud_path`` field in its service
configuration (:ref:`octue.yaml <octue_yaml>` file) set to a Google Cloud Storage path.

Accessing crash diagnostics
===========================
In the event of a child crash, the child will upload the crash diagnostics and send the cloud path to them to the
parent as a log message. A user with credentials to access this path can use the ``octue`` CLI to retrieve the crash
diagnostics data:
In the event of a crash, the service will upload the crash diagnostics and send the upload path to the parent as a log
message. A user with credentials to access this path can use the ``octue`` CLI to retrieve the crash diagnostics data:

.. code-block:: shell
@@ -59,4 +46,82 @@ More information on the command:
--local-path DIRECTORY The path to a directory to store the directory of
diagnostics data in. Defaults to the current working
directory.
--download-datasets If provided, download any datasets from the crash
diagnostics and update their paths in their
manifests to the new local paths.
-h, --help Show this message and exit.
.. _test_fixtures_from_crash_diagnostics:

Creating test fixtures from crash diagnostics
=============================================
You can create test fixtures directly from crash diagnostics, allowing you to recreate the exact conditions that caused
your service to fail.

.. code-block:: python

    from unittest.mock import patch

    from octue import Runner
    from octue.utils.testing import load_test_fixture_from_crash_diagnostics


    (
        configuration_values,
        configuration_manifest,
        input_values,
        input_manifest,
        child_emulators,
    ) = load_test_fixture_from_crash_diagnostics(path="path/to/downloaded/crash/diagnostics")

    # You can explicitly specify your children here as shown or
    # read the same information in from your app configuration file.
    children = [
        {
            "key": "my_child",
            "id": "octue/my-child-service:latest",
            "backend": {
                "name": "GCPPubSubBackend",
                "project_name": "my-project",
            }
        },
        {
            "key": "another_child",
            "id": "octue/another-child-service:latest",
            "backend": {
                "name": "GCPPubSubBackend",
                "project_name": "my-project",
            }
        }
    ]

    runner = Runner(
        app_src="path/to/directory_containing_app",
        twine=os.path.join(app_directory_path, "twine.json"),
        children=children,
        configuration_values=configuration_values,
        configuration_manifest=configuration_manifest,
        service_id="your-org/your-service:latest",
    )

    with patch("octue.runner.Child", side_effect=child_emulators):
        analysis = runner.run(input_values=input_values, input_manifest=input_manifest)

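
The ``patch(..., side_effect=child_emulators)`` call relies on a standard ``unittest.mock`` behaviour: when ``side_effect`` is an iterable, each call to the patched object returns the next item. The sketch below shows that behaviour on its own. ``Child``, ``FakeChild``, ``services`` and ``ask_two_children`` are placeholder names invented for this illustration and are not part of octue.

```python
from types import SimpleNamespace
from unittest.mock import patch


class Child:
    """Placeholder for a real child class (hypothetical, for illustration only)."""

    def __init__(self, id):
        self.id = id


class FakeChild:
    """Placeholder emulator that returns a canned answer."""

    def __init__(self, answer):
        self.answer = answer

    def ask(self, input_values=None):
        return self.answer


# Stand-in for the module that the code under test looks `Child` up in.
services = SimpleNamespace(Child=Child)


def ask_two_children():
    """Instantiate two children in order and ask each one a question."""
    first = services.Child(id="octue/my-child-service:latest")
    second = services.Child(id="octue/another-child-service:latest")
    return [first.ask(input_values=1), second.ask(input_values=2)]


emulators = [FakeChild("first answer"), FakeChild("second answer")]

# Each call to the patched `Child` returns the next emulator in the list.
with patch.object(services, "Child", side_effect=emulators):
    answers = ask_two_children()
```

Because the emulators are handed out in call order, the list has to match the order in which the app instantiates its children, and a call beyond the end of the list raises ``StopIteration``.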
Disabling crash diagnostics
===========================
When asking a child a question, a parent can disable the child's crash diagnostics upload for that question by setting
``allow_save_diagnostics_data_on_crash`` to ``False`` in :mod:`Child.ask <octue.resources.child.Child.ask>`.
For example:

.. code-block:: python

    child = Child(
        id="my-organisation/my-service:latest",
        backend={"name": "GCPPubSubBackend", "project_name": "my-project"},
    )

    answer = child.ask(
        input_values={"height": 32, "width": 3},
        allow_save_diagnostics_data_on_crash=False,
    )

63 changes: 63 additions & 0 deletions octue/app_loading.py
@@ -0,0 +1,63 @@
import importlib
import logging
import os
import sys


logger = logging.getLogger(__name__)


class AppFrom:
    """A context manager that imports the module "app" from a file named "app.py" in the given directory on entry (by
    making a temporary addition to the system path) and unloads it (by deleting it from `sys.modules`) on exit. It will
    issue a warning if an existing module called "app" is already loaded. Usage example:

    ```python3
    with AppFrom('/path/to/dir') as app:
        Runner().run(app)
    ```

    :param str app_path: path to directory containing module named "app.py".
    :return None:
    """

    def __init__(self, app_path="."):
        self.app_path = os.path.abspath(os.path.normpath(app_path))
        logger.debug("Initialising AppFrom context at app_path %s", self.app_path)
        self.app_module = None

    def __enter__(self):
        # Warn on an app present on the system path
        if "app" in sys.modules.keys():
            logger.warning(
                "Module 'app' already on system path. Using 'AppFrom' context will yield unexpected results. Avoid "
                "using 'app' as a python module, except for your main entrypoint."
            )

        # Insert the present directory first on the system path.
        sys.path.insert(0, self.app_path)

        # Import the app from the present directory.
        self.app_module = importlib.import_module("app")

        # Immediately clean up the entry to the system path (don't use "remove" because if the user has it in their
        # path, this'll be an unexpected side effect, and don't do it in cleanup in case the called code inserts a path)
        sys.path.pop(0)
        logger.debug("Imported app at app_path and cleaned up temporary modification to sys.path %s", self.app_path)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        """Unload the imported module.

        :return None:
        """
        try:
            del sys.modules["app"]
            logger.debug("Deleted app from sys.modules")

        except KeyError:
            context_manager_name = type(self).__name__
            logger.warning(
                f"Module 'app' was already removed from the system path prior to exiting the {context_manager_name} "
                f"context manager. Using the {context_manager_name} context may yield unexpected results."
            )
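
For readers who want to try the import-then-unload pattern that `AppFrom` uses in isolation, the following is a minimal sketch of the same idea and not the code from this commit. The helper name `temporarily_imported` and the module name `demo_app` are invented for the example (a throwaway module name avoids interfering with any real `app` module).

```python
import importlib
import os
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def temporarily_imported(module_name, directory):
    """Import `module_name` from `directory`, yield it, then remove it from `sys.modules`."""
    sys.path.insert(0, os.path.abspath(directory))
    importlib.invalidate_caches()

    try:
        module = importlib.import_module(module_name)
    finally:
        # Undo the path change straight after importing, as `AppFrom.__enter__` does.
        sys.path.pop(0)

    try:
        yield module
    finally:
        # Unload the module so that a later import reads the file afresh.
        sys.modules.pop(module_name, None)


with tempfile.TemporaryDirectory() as directory:
    # Write a throwaway module to import.
    with open(os.path.join(directory, "demo_app.py"), "w") as f:
        f.write("def run():\n    return 'ran'\n")

    with temporarily_imported("demo_app", directory) as demo_app:
        result = demo_app.run()
        loaded_inside = "demo_app" in sys.modules

loaded_after = "demo_app" in sys.modules
```

Unlike `AppFrom.__exit__`, this sketch uses `sys.modules.pop(name, None)` and so stays silent if the module has already been removed; the class in the diff logs a warning in that case instead.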
44 changes: 40 additions & 4 deletions octue/cli.py
@@ -20,7 +20,7 @@
from octue.definitions import MANIFEST_FILENAME, VALUES_FILENAME
from octue.exceptions import ServiceAlreadyExists
from octue.log_handlers import apply_log_handler, get_remote_handler
from octue.resources import service_backends
from octue.resources import Manifest, service_backends
from octue.runner import Runner
from octue.utils.encoders import OctueJSONEncoder
from twined import Twine
@@ -289,25 +289,61 @@ def start(service_config, revision_tag, timeout, no_rm):
@click.option(
    "--local-path",
    type=click.Path(file_okay=False),
    default=None,
    default=".",
    help="The path to a directory to store the directory of diagnostics data in. Defaults to the current working "
    "directory.",
)
def get_crash_diagnostics(cloud_path, local_path):
@click.option(
    "--download-datasets",
    is_flag=True,
    help="If provided, download any datasets from the crash diagnostics and update their paths in the configuration and "
    "input manifests to the new local paths.",
)
def get_crash_diagnostics(cloud_path, local_path, download_datasets):
    """Download crash diagnostics for an analysis from the given directory in Google Cloud Storage. The cloud path
    should end in the analysis ID.

    CLOUD_PATH: The path to the directory in Google Cloud Storage containing the diagnostics data.
    """
    analysis_id = storage.path.split(cloud_path)[-1]
    local_path = os.path.join((local_path or "."), analysis_id)
    local_path = os.path.join(local_path, analysis_id)

    if download_datasets:
        filter = None
    else:
        filter = lambda blob: any(
            (
                blob.name.endswith(f"configuration_{VALUES_FILENAME}"),
                blob.name.endswith(f"configuration_{MANIFEST_FILENAME}"),
                blob.name.endswith(f"input_{VALUES_FILENAME}"),
                blob.name.endswith(f"input_{MANIFEST_FILENAME}"),
                blob.name.endswith("questions.json"),
            )
        )

    GoogleCloudStorageClient().download_all_files(
        local_path=local_path,
        cloud_path=cloud_path,
        filter=filter,
        recursive=True,
    )

    # Update the manifests with the local paths of the datasets.
    if download_datasets:
        for manifest_type in ("configuration_manifest", "input_manifest"):
            manifest_path = os.path.join(local_path, manifest_type + ".json")

            if not os.path.exists(manifest_path):
                continue

            manifest = Manifest.from_file(manifest_path)

            manifest.update_dataset_paths(
                path_generator=lambda dataset: os.path.join(local_path, f"{manifest_type}_datasets", dataset.name)
            )

            manifest.to_file(manifest_path)

    logger.info("Downloaded crash diagnostics from %r to %r.", cloud_path, local_path)
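
The suffix-based filter in `get_crash_diagnostics` can be exercised without any cloud access. The sketch below is an illustration under stated assumptions and not the committed code: `FakeBlob`, `is_diagnostics_file` and the example blob names are invented, and the values of `VALUES_FILENAME` and `MANIFEST_FILENAME` are guesses inferred from the `manifest_type + ".json"` path used in the diff.

```python
from dataclasses import dataclass

# Assumed values: the diff imports these from `octue.definitions` without showing them.
VALUES_FILENAME = "values.json"
MANIFEST_FILENAME = "manifest.json"

# Name endings of the files kept when datasets are not downloaded.
WANTED_SUFFIXES = (
    f"configuration_{VALUES_FILENAME}",
    f"configuration_{MANIFEST_FILENAME}",
    f"input_{VALUES_FILENAME}",
    f"input_{MANIFEST_FILENAME}",
    "questions.json",
)


@dataclass
class FakeBlob:
    """Hypothetical stand-in for a cloud storage blob; only the `name` attribute is used."""

    name: str


def is_diagnostics_file(blob):
    """Return `True` if the blob is a values, manifest, or questions file (not a dataset file)."""
    # `str.endswith` accepts a tuple of suffixes and returns `True` if any of them matches.
    return blob.name.endswith(WANTED_SUFFIXES)


blobs = [
    FakeBlob("diagnostics/some-analysis/input_values.json"),
    FakeBlob("diagnostics/some-analysis/input_manifest.json"),
    FakeBlob("diagnostics/some-analysis/input_manifest_datasets/my_dataset/data.csv"),
    FakeBlob("diagnostics/some-analysis/questions.json"),
]

kept = [blob.name for blob in blobs if is_diagnostics_file(blob)]
```

Passing a tuple to `str.endswith` is equivalent to the `any(...)` over individual `endswith` calls in the diff, and a named function also avoids shadowing the built-in `filter`.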

