feat: Add cloud profiler to training_utils #828

Merged
merged 69 commits into from
Nov 29, 2021
99b3519
using dispatcher to relay requests for additional event senders
mkovalski Aug 23, 2021
535a0b3
minor change for return value
mkovalski Aug 23, 2021
9c84c23
Merge branch 'master' into profiler
mkovalski Aug 23, 2021
1640a96
reformatted
mkovalski Aug 24, 2021
8c51461
Merge remote-tracking branch 'upstream/master' into profiler
mkovalski Aug 24, 2021
b0594a4
Merge branch 'profiler' of github.com:mkovalski/python-aiplatform int…
mkovalski Aug 24, 2021
fadcf9a
Merge branch 'master' into profiler
mkovalski Aug 25, 2021
83a803e
type hints and changing mutable default
mkovalski Aug 30, 2021
f71c41f
Merge branch 'profiler' of github.com:mkovalski/python-aiplatform int…
mkovalski Aug 30, 2021
49b2727
creating separate RunResourceManager for handling runs
mkovalski Aug 31, 2021
17c3f8b
Merge branch 'main' of https://github.com/googleapis/python-aiplatfor…
mkovalski Sep 1, 2021
833bbbb
moving additional functionality to uploader_utils to be shared by pro…
mkovalski Sep 2, 2021
75048d3
initial profile uploader commit, still needs testing
mkovalski Sep 2, 2021
b7ba730
Merge branch 'profiler' of github.com:mkovalski/python-aiplatform int…
mkovalski Sep 2, 2021
400de49
abstract methods for call, move items to uploader_utils
mkovalski Sep 2, 2021
4bc3912
adding test objects for profiler
mkovalski Sep 8, 2021
097fa34
merging from main
mkovalski Sep 8, 2021
17be71a
have mock client create one platform names for testing
mkovalski Sep 11, 2021
9cd47b6
additional tets for profile events
mkovalski Sep 13, 2021
f3c7032
more tests
mkovalski Sep 14, 2021
9907578
100% coverage on profiler
mkovalski Sep 14, 2021
89fcce0
docstrings and typing
mkovalski Sep 14, 2021
0107c54
Merge remote-tracking branch 'upstream/main' into profiler_impl
mkovalski Sep 14, 2021
5a4cd9a
moving training utils over from DEV branch
mkovalski Sep 14, 2021
7dfd269
rename tensorboard to tensorboard_resource to avoid import conflicts,…
mkovalski Sep 21, 2021
6049df1
Merge remote-tracking branch 'origin/profiler_impl' into profiling_sdk
mkovalski Sep 21, 2021
e469013
adding of tensorboard_resource file
mkovalski Sep 21, 2021
05aaec1
moving profile plugin creation to TensorBoardUploader and added docst…
mkovalski Sep 22, 2021
35ecd1d
merging from main
mkovalski Sep 22, 2021
78f6c5b
Merge branch 'profiler_impl' into profiling_sdk
mkovalski Sep 23, 2021
e68cc5c
adding additional environment variables for profiling SDK
mkovalski Sep 23, 2021
4d2ca8b
adding profile plugin and base plugin
mkovalski Sep 23, 2021
134713d
adding profiler plugin and initializers
mkovalski Sep 27, 2021
919c957
adding tests
mkovalski Sep 28, 2021
415d528
README.rst
mkovalski Sep 29, 2021
5c62724
Merge branch 'main' into profiler_impl
sasha-gitg Oct 4, 2021
6534781
proper docstrings and type hints on files
mkovalski Oct 6, 2021
df86310
Merge branch 'profiler_impl' of github.com:mkovalski/python-aiplatfor…
mkovalski Oct 6, 2021
5459c4d
Merge branch 'profiler_impl' into profiling_sdk
mkovalski Oct 7, 2021
8490990
updates to typing + docstrings
mkovalski Oct 7, 2021
df6c7ba
merges from main, add plugin to setup.py
mkovalski Oct 12, 2021
5930cb4
merging from main
mkovalski Oct 25, 2021
0d5b89d
moving around directory structure
mkovalski Nov 1, 2021
ad926c6
Merge branch 'main' into profiling_sdk
mkovalski Nov 1, 2021
76a4acb
adding HTTP handler port as environment variable
mkovalski Nov 2, 2021
1e92260
add tests to support environment variables module
mkovalski Nov 2, 2021
6ea3718
update tests for environment variables
mkovalski Nov 2, 2021
281c9c4
flake8 + black
mkovalski Nov 2, 2021
38453cd
flake8 + black for tests
mkovalski Nov 2, 2021
cba4bdd
docstring update for initializer
mkovalski Nov 2, 2021
ed42226
update requirements
mkovalski Nov 8, 2021
940e3e1
Merge branch 'main' into profiling_sdk
mkovalski Nov 8, 2021
0bc0251
importlib returns the module, check if this is None instead
mkovalski Nov 8, 2021
17a2698
directly check for None
mkovalski Nov 8, 2021
50b2593
Merge branch 'main' into profiling_sdk
nicain Nov 8, 2021
c957a1b
Merge branch 'main' into profiling_sdk
nicain Nov 8, 2021
6ea028f
must rely on tensorflow 2.4.0
mkovalski Nov 9, 2021
20f1268
Add correct typing to _build_plugin
mkovalski Nov 9, 2021
348c0de
resolving a number of comments
mkovalski Nov 17, 2021
b7469ce
Proper typing for WSGI related variables
mkovalski Nov 18, 2021
983dbd3
Add test for catching ImportError while loading tensoflow
mkovalski Nov 18, 2021
d2e04ac
throw error if importing cloud_profiler fails
mkovalski Nov 19, 2021
7ff3bad
module level import for TBContext
mkovalski Nov 19, 2021
58dd84f
apply black formatting
mkovalski Nov 19, 2021
e6d7ded
Merge branch 'main' into profiling_sdk
mkovalski Nov 19, 2021
2fc8eeb
removed profiler from full_extra_requires, add requirements for testing
mkovalski Nov 19, 2021
e3808a4
Merge branch 'main' into profiling_sdk
sasha-gitg Nov 23, 2021
1c211a6
Merge branch 'main' into profiling_sdk
mkovalski Nov 23, 2021
dba42ff
Merge branch 'main' into profiling_sdk
sasha-gitg Nov 29, 2021
18 changes: 18 additions & 0 deletions README.rst
@@ -464,6 +464,24 @@ To use Explanation Metadata in endpoint deployment and model upload:
aiplatform.Model.upload(..., explanation_metadata=explanation_metadata)


Cloud Profiler
----------------------------

Cloud Profiler allows you to profile your remote Vertex AI Training jobs on demand and visualize the results in Vertex TensorBoard.

To start using the profiler with TensorFlow, update your training script to include the following:

.. code-block:: Python

    from google.cloud.aiplatform.training_utils import cloud_profiler
    ...
    cloud_profiler.init()

Next, run the job with a Vertex TensorBoard instance. For full details on how to do this, visit https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview

Finally, visit your TensorBoard in your Google Cloud Console, navigate to the "Profile" tab, and click the `Capture Profile` button. This will allow users to capture profiling statistics for the running jobs.


Next Steps
~~~~~~~~~~

2 changes: 1 addition & 1 deletion google/cloud/aiplatform/tensorboard/__init__.py
@@ -15,7 +15,7 @@
# limitations under the License.
#

-from google.cloud.aiplatform.tensorboard.tensorboard import Tensorboard
+from google.cloud.aiplatform.tensorboard.tensorboard_resource import Tensorboard


__all__ = ("Tensorboard",)
1 change: 0 additions & 1 deletion google/cloud/aiplatform/tensorboard/uploader_utils.py
@@ -406,7 +406,6 @@ def get_or_create(
filter="display_name = {}".format(json.dumps(str(tag_name))),
)
)

num = 0
time_series = None

20 changes: 20 additions & 0 deletions google/cloud/aiplatform/training_utils/cloud_profiler/README.rst
@@ -0,0 +1,20 @@
Cloud Profiler
=================================

Cloud Profiler allows you to profile your remote Vertex AI Training jobs on demand and visualize the results in Vertex TensorBoard.

Quick Start
------------

To start using the profiler with TensorFlow, update your training script to include the following:

.. code-block:: Python

    from google.cloud.aiplatform.training_utils import cloud_profiler
    ...
    cloud_profiler.init()


Next, run the job with a Vertex TensorBoard instance. For full details on how to do this, visit https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview

Finally, visit your TensorBoard in your Google Cloud Console, navigate to the "Profile" tab, and click the `Capture Profile` button. This will allow users to capture profiling statistics for the running jobs.
35 changes: 35 additions & 0 deletions google/cloud/aiplatform/training_utils/cloud_profiler/__init__.py
@@ -0,0 +1,35 @@
# -*- coding: utf-8 -*-

# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

try:
    import google.cloud.aiplatform.training_utils.cloud_profiler.initializer as initializer
except ImportError as err:
    raise ImportError(
        "Could not load the cloud profiler. To use the profiler, "
        'install the SDK using "pip install google-cloud-aiplatform[cloud-profiler]"'
    ) from err

"""
Initialize the cloud profiler for tensorflow.

Usage:
    from google.cloud.aiplatform.training_utils import cloud_profiler

    cloud_profiler.init(plugin='tensorflow')
"""

init = initializer.initialize
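
The guarded import in this file follows a common pattern for optional extras: attempt the import and, on failure, re-raise an `ImportError` that carries an install hint while chaining the original error. A minimal stdlib-only sketch of that pattern is below. `load_optional` and its parameters are hypothetical names used for illustration; they are not part of the SDK.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import module_name, or raise ImportError carrying an install hint.

    Hypothetical helper for illustration; not part of google-cloud-aiplatform.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # Chain the original error so the root cause stays visible.
        raise ImportError(
            "Could not load {!r}. {}".format(module_name, install_hint)
        ) from err


if __name__ == "__main__":
    # "json" is always available, so this import succeeds.
    json_module = load_optional("json", "It ships with Python.")
    print(json_module.dumps({"ok": True}))
```

Chaining with `from err` keeps the underlying failure in the traceback, which helps when the real cause is a missing transitive dependency rather than the extra itself.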
118 changes: 118 additions & 0 deletions google/cloud/aiplatform/training_utils/cloud_profiler/initializer.py
@@ -0,0 +1,118 @@
# -*- coding: utf-8 -*-

# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import logging
import threading
from typing import Optional, Type
from werkzeug import serving

from google.cloud.aiplatform.training_utils import environment_variables
from google.cloud.aiplatform.training_utils.cloud_profiler import webserver
from google.cloud.aiplatform.training_utils.cloud_profiler.plugins import base_plugin
from google.cloud.aiplatform.training_utils.cloud_profiler.plugins.tensorflow import (
    tf_profiler,
)

# Mapping of available plugins to use
_AVAILABLE_PLUGINS = {"tensorflow": tf_profiler.TFProfiler}


class MissingEnvironmentVariableException(Exception):
    pass


def _build_plugin(
    plugin: Type[base_plugin.BasePlugin],
) -> Optional[base_plugin.BasePlugin]:
    """Builds the plugin given the object.

    Args:
        plugin (Type[base_plugin.BasePlugin]):
            Required. An uninitialized plugin class.

    Returns:
        An initialized plugin, or None if plugin cannot be
        initialized.
    """
    if not plugin.can_initialize():
        logging.warning("Cannot initialize the plugin")
        return None

    plugin.setup()

    if not plugin.post_setup_check():
        return None

    return plugin()


def _run_app_thread(server: webserver.WebServer, port: int):
    """Run the webserver in a separate thread.

    Args:
        server (webserver.WebServer):
            Required. A webserver to accept requests.
        port (int):
            Required. The port to run the webserver on.
    """
    # daemon=True replaces the deprecated setDaemon(True) call; the thread
    # does not keep the training process alive on exit.
    daemon = threading.Thread(
        name="profile_server",
        target=serving.run_simple,
        args=("0.0.0.0", port, server,),
        daemon=True,
    )
    daemon.start()


def initialize(plugin: str = "tensorflow"):
    """Initializes the profiling SDK.

    Args:
        plugin (str):
            Optional. Name of the plugin to initialize.
            Current options are ["tensorflow"]. Defaults to "tensorflow".

    Raises:
        ValueError:
            The plugin does not exist.
        MissingEnvironmentVariableException:
            An environment variable that is needed is not set.
    """
    plugin_obj = _AVAILABLE_PLUGINS.get(plugin)

    if not plugin_obj:
        raise ValueError(
            "Plugin {} not available, must choose from {}".format(
                plugin, list(_AVAILABLE_PLUGINS)
            )
        )

    prof_plugin = _build_plugin(plugin_obj)

    if prof_plugin is None:
        return

    server = webserver.WebServer([prof_plugin])

    if not environment_variables.http_handler_port:
        raise MissingEnvironmentVariableException(
            "'AIP_HTTP_HANDLER_PORT' must be set."
Review comment (Member): Should the user set this using env or is this set by the service?

Reply (Contributor Author): This is set by the service.

        )

    port = int(environment_variables.http_handler_port)

    _run_app_thread(server, port)
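
The pattern used by `_run_app_thread` is to serve HTTP from a daemon thread so that the training loop is never blocked and the process can exit without joining the server. The sketch below illustrates that pattern with the standard library only. It is not the PR's code: `http.server` stands in for werkzeug's `run_simple`, and `_PingHandler` and `start_background_server` are hypothetical names.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class _PingHandler(BaseHTTPRequestHandler):
    """Answers every GET with a fixed body; stands in for profiler routes."""

    def do_GET(self) -> None:
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        # Silence per-request logging so it does not pollute training logs.
        pass


def start_background_server(port: int = 0) -> ThreadingHTTPServer:
    """Start the server in a daemon thread and return it immediately.

    Port 0 asks the OS for a free port. The PR instead takes the port from
    the service-provided AIP_HTTP_HANDLER_PORT setting.
    """
    server = ThreadingHTTPServer(("127.0.0.1", port), _PingHandler)
    thread = threading.Thread(
        name="profile_server", target=server.serve_forever, daemon=True
    )
    thread.start()
    return server


if __name__ == "__main__":
    server = start_background_server()
    connection = http.client.HTTPConnection(
        "127.0.0.1", server.server_address[1], timeout=5
    )
    connection.request("GET", "/")
    response = connection.getresponse()
    print(response.status, response.read())
    connection.close()
    server.shutdown()
    server.server_close()
```

Because the thread is a daemon, an unhandled failure in the server never prevents the training process from finishing; the trade-off is that in-flight requests are dropped when the process exits.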
@@ -0,0 +1,71 @@
# -*- coding: utf-8 -*-

# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import abc
from typing import Callable, Dict
from werkzeug import Response
Review comment (Member): Some request to wrap with informative exception.


class BasePlugin(abc.ABC):
    """Base plugin for cloud training tools endpoints.

    The plugins support registering http handlers to be used for
    AI Platform training jobs.
    """

    @staticmethod
    @abc.abstractmethod
    def setup() -> None:
        """Run any setup code for the plugin before webserver is launched."""
        raise NotImplementedError

    @staticmethod
    @abc.abstractmethod
    def can_initialize() -> bool:
        """Check whether a plugin is able to be initialized.

        Used for checking if correct dependencies are installed, system requirements, etc.

        Returns:
            Bool indicating whether the plugin can be initialized.
        """
        raise NotImplementedError

    @staticmethod
    @abc.abstractmethod
    def post_setup_check() -> bool:
        """Check if after initialization, we need to use the plugin.

        Example: Web server only needs to run for main node for training, others
        just need to have 'setup()' run to start the rpc server.

        Returns:
            A boolean indicating whether post setup checks pass.
        """
        raise NotImplementedError

    @abc.abstractmethod
    def get_routes(self) -> Dict[str, Callable[..., Response]]:
        """Get the mapping from path to handler.

        This is the method in which plugins can assign different routes to
        different handlers.

        Returns:
            A mapping from a route to a handler.
        """
        raise NotImplementedError
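
To show how the four hooks fit together, here is a minimal sketch of a concrete plugin and of the build order used by `_build_plugin` (check, set up, re-check, construct). It is self-contained under stated assumptions: the `BasePlugin` below is a trimmed stand-in whose handlers return `str` instead of a werkzeug `Response`, and `EchoPlugin`, `UnavailablePlugin`, and `build_plugin` are hypothetical names that do not appear in the PR.

```python
import abc
from typing import Callable, Dict, Optional, Type


class BasePlugin(abc.ABC):
    """Trimmed stand-in for the PR's BasePlugin (handlers return str here)."""

    @staticmethod
    @abc.abstractmethod
    def setup() -> None:
        raise NotImplementedError

    @staticmethod
    @abc.abstractmethod
    def can_initialize() -> bool:
        raise NotImplementedError

    @staticmethod
    @abc.abstractmethod
    def post_setup_check() -> bool:
        raise NotImplementedError

    @abc.abstractmethod
    def get_routes(self) -> Dict[str, Callable[..., str]]:
        raise NotImplementedError


class EchoPlugin(BasePlugin):
    """Hypothetical plugin exposing a single route."""

    @staticmethod
    def setup() -> None:
        pass  # Nothing to prepare in this toy example.

    @staticmethod
    def can_initialize() -> bool:
        return True

    @staticmethod
    def post_setup_check() -> bool:
        return True

    def get_routes(self) -> Dict[str, Callable[..., str]]:
        return {"/echo": self.echo}

    def echo(self, message: str = "") -> str:
        return message


class UnavailablePlugin(EchoPlugin):
    """Hypothetical plugin whose requirements are never met."""

    @staticmethod
    def can_initialize() -> bool:
        return False


def build_plugin(plugin: Type[BasePlugin]) -> Optional[BasePlugin]:
    """Mirror the order of _build_plugin: check, set up, re-check, construct."""
    if not plugin.can_initialize():
        return None
    plugin.setup()
    if not plugin.post_setup_check():
        return None
    return plugin()


if __name__ == "__main__":
    built = build_plugin(EchoPlugin)
    print(sorted(built.get_routes()))
    print(build_plugin(UnavailablePlugin))
```

Keeping the three checks as static methods means they can run on the class before any instance exists, so a node that only needs `setup()` never has to construct the plugin or expose its routes.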