feat: Add profiling initialization code to training_utils #732

Closed
wants to merge 41 commits into from
41 commits
99b3519
using dispatcher to relay requests for additional event senders
mkovalski Aug 23, 2021
535a0b3
minor change for return value
mkovalski Aug 23, 2021
9c84c23
Merge branch 'master' into profiler
mkovalski Aug 23, 2021
1640a96
reformatted
mkovalski Aug 24, 2021
8c51461
Merge remote-tracking branch 'upstream/master' into profiler
mkovalski Aug 24, 2021
b0594a4
Merge branch 'profiler' of github.com:mkovalski/python-aiplatform int…
mkovalski Aug 24, 2021
fadcf9a
Merge branch 'master' into profiler
mkovalski Aug 25, 2021
83a803e
type hints and changing mutable default
mkovalski Aug 30, 2021
f71c41f
Merge branch 'profiler' of github.com:mkovalski/python-aiplatform int…
mkovalski Aug 30, 2021
49b2727
creating separate RunResourceManager for handling runs
mkovalski Aug 31, 2021
17c3f8b
Merge branch 'main' of https://github.com/googleapis/python-aiplatfor…
mkovalski Sep 1, 2021
833bbbb
moving additional functionality to uploader_utils to be shared by pro…
mkovalski Sep 2, 2021
75048d3
initial profile uploader commit, still needs testing
mkovalski Sep 2, 2021
b7ba730
Merge branch 'profiler' of github.com:mkovalski/python-aiplatform int…
mkovalski Sep 2, 2021
400de49
abstract methods for call, move items to uploader_utils
mkovalski Sep 2, 2021
4bc3912
adding test objects for profiler
mkovalski Sep 8, 2021
097fa34
merging from main
mkovalski Sep 8, 2021
17be71a
have mock client create one platform names for testing
mkovalski Sep 11, 2021
9cd47b6
additional tets for profile events
mkovalski Sep 13, 2021
f3c7032
more tests
mkovalski Sep 14, 2021
9907578
100% coverage on profiler
mkovalski Sep 14, 2021
89fcce0
docstrings and typing
mkovalski Sep 14, 2021
0107c54
Merge remote-tracking branch 'upstream/main' into profiler_impl
mkovalski Sep 14, 2021
5a4cd9a
moving training utils over from DEV branch
mkovalski Sep 14, 2021
7dfd269
rename tensorboard to tensorboard_resource to avoid import conflicts,…
mkovalski Sep 21, 2021
6049df1
Merge remote-tracking branch 'origin/profiler_impl' into profiling_sdk
mkovalski Sep 21, 2021
e469013
adding of tensorboard_resource file
mkovalski Sep 21, 2021
05aaec1
moving profile plugin creation to TensorBoardUploader and added docst…
mkovalski Sep 22, 2021
35ecd1d
merging from main
mkovalski Sep 22, 2021
78f6c5b
Merge branch 'profiler_impl' into profiling_sdk
mkovalski Sep 23, 2021
e68cc5c
adding additional environment variables for profiling SDK
mkovalski Sep 23, 2021
4d2ca8b
adding profile plugin and base plugin
mkovalski Sep 23, 2021
134713d
adding profiler plugin and initializers
mkovalski Sep 27, 2021
919c957
adding tests
mkovalski Sep 28, 2021
415d528
README.rst
mkovalski Sep 29, 2021
5c62724
Merge branch 'main' into profiler_impl
sasha-gitg Oct 4, 2021
6534781
proper docstrings and type hints on files
mkovalski Oct 6, 2021
df86310
Merge branch 'profiler_impl' of github.com:mkovalski/python-aiplatfor…
mkovalski Oct 6, 2021
5459c4d
Merge branch 'profiler_impl' into profiling_sdk
mkovalski Oct 7, 2021
8490990
updates to typing + docstrings
mkovalski Oct 7, 2021
df6c7ba
merges from main, add plugin to setup.py
mkovalski Oct 12, 2021
2 changes: 1 addition & 1 deletion google/cloud/aiplatform/tensorboard/__init__.py
Original file line number Diff line number Diff line change
@@ -15,7 +15,7 @@
# limitations under the License.
#

-from google.cloud.aiplatform.tensorboard.tensorboard import Tensorboard
+from google.cloud.aiplatform.tensorboard.tensorboard_resource import Tensorboard


__all__ = ("Tensorboard",)
1 change: 0 additions & 1 deletion google/cloud/aiplatform/tensorboard/uploader_utils.py
@@ -318,7 +318,6 @@ def get_or_create(
filter="display_name = {}".format(json.dumps(str(tag_name))),
)
)

num = 0
time_series = None

22 changes: 22 additions & 0 deletions google/cloud/aiplatform/training_utils/__init__.py
@@ -0,0 +1,22 @@
# -*- coding: utf-8 -*-

# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from google.cloud.aiplatform.training_utils.environment_variables import (
    EnvironmentVariables,
)

__all__ = ("EnvironmentVariables",)
20 changes: 20 additions & 0 deletions google/cloud/aiplatform/training_utils/cloud_profiler/README.rst
@@ -0,0 +1,20 @@
Cloud Profiler
=================================

Cloud Profiler lets you profile your remote Vertex AI Training jobs on demand. To understand the performance of your training code, initialize the profiler in your training script and capture a profile session through Vertex TensorBoard.

Quick Start
------------

To start using the profiler, update the training script to include the following:

.. code-block:: Python

    from google.cloud.aiplatform.training_utils import cloud_profiler
    ...
    cloud_profiler.init()


Next, run the job with a Vertex TensorBoard instance. For full details on how to do this, visit https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview

Finally, visit your TensorBoard instance and click the `Capture Profile` button to capture profiling statistics for the running jobs.
29 changes: 29 additions & 0 deletions google/cloud/aiplatform/training_utils/cloud_profiler/__init__.py
@@ -0,0 +1,29 @@
# -*- coding: utf-8 -*-

# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from google.cloud.aiplatform.training_utils.cloud_profiler import initializer

"""
Initialize the cloud profiler for TensorFlow.

Usage:
    from google.cloud.aiplatform.training_utils import cloud_profiler

    cloud_profiler.init(plugin='tensorflow')
"""

init = initializer.initialize
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
# -*- coding: utf-8 -*-

# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import abc
from typing import Dict


class BasePlugin(abc.ABC):
    """Base plugin for cloud training tools endpoints.

    The plugins support registering http handlers to be used for
    AI Platform training jobs.
    """

    @staticmethod
    @abc.abstractmethod
    def setup() -> None:
        """Run any setup code for the plugin before webserver is launched."""
        raise NotImplementedError

    @staticmethod
    @abc.abstractmethod
    def can_initialize() -> bool:
        """Check whether a plugin is able to be initialized.

        Used for checking if correct dependencies are installed, system requirements, etc.

        Returns:
            Bool indicating whether the plugin can be initialized.
        """
        raise NotImplementedError

    @staticmethod
    @abc.abstractmethod
    def post_setup_check() -> bool:
        """Check if after initialization, we need to use the plugin.

        Example: Web server only needs to run for main node for training, others
        just need to have 'setup()' run to start the rpc server.

        Returns:
            A boolean indicating whether post setup checks pass.
        """
        raise NotImplementedError

    @abc.abstractmethod
    def get_routes(self) -> Dict[str, str]:
        """Get the mapping from path to handler.

        This is the method in which plugins can assign different routes to
        different handlers.

        Returns:
            A Dict[str, str] mapping a route to a handler.
        """
        raise NotImplementedError
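
The interface above could be implemented roughly as follows. This is a minimal sketch, not code from the PR: `BasePlugin` is re-declared here as a stand-in so the example runs on its own, and `EchoPlugin`, the `/echo` route, and the `echo_handler` name are all hypothetical.

```python
import abc
from typing import Dict


class BasePlugin(abc.ABC):
    """Stand-in that mirrors the abstract interface from the diff."""

    @staticmethod
    @abc.abstractmethod
    def setup() -> None:
        raise NotImplementedError

    @staticmethod
    @abc.abstractmethod
    def can_initialize() -> bool:
        raise NotImplementedError

    @staticmethod
    @abc.abstractmethod
    def post_setup_check() -> bool:
        raise NotImplementedError

    @abc.abstractmethod
    def get_routes(self) -> Dict[str, str]:
        raise NotImplementedError


class EchoPlugin(BasePlugin):
    """Hypothetical plugin, for illustration only."""

    @staticmethod
    def setup() -> None:
        # A real plugin might start an RPC server here; this one has nothing to do.
        pass

    @staticmethod
    def can_initialize() -> bool:
        # A real plugin would check dependencies or environment variables.
        return True

    @staticmethod
    def post_setup_check() -> bool:
        # A real plugin might return True only on the main training node.
        return True

    def get_routes(self) -> Dict[str, str]:
        # Map each URL path to the name of the handler that serves it.
        return {"/echo": "echo_handler"}
```

Because every method is marked abstract, `BasePlugin` itself cannot be instantiated, while a subclass that overrides all four methods can.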
100 changes: 100 additions & 0 deletions google/cloud/aiplatform/training_utils/cloud_profiler/initializer.py
@@ -0,0 +1,100 @@
# -*- coding: utf-8 -*-

# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import logging
import threading
from typing import Callable
from werkzeug import serving

from google.cloud.aiplatform.training_utils.cloud_profiler import base_plugin
from google.cloud.aiplatform.training_utils.cloud_profiler import webserver
from google.cloud.aiplatform.training_utils.cloud_profiler.plugins import tf_profiler


_AVAILABLE_PLUGINS = {"tensorflow": tf_profiler.TFProfiler}
_HOST_PORT = 6010


def _build_plugin(
    plugin: Callable[[], base_plugin.BasePlugin]
) -> base_plugin.BasePlugin:
    """Builds the plugin given the object.

    Args:
        plugin (Callable[[], base_plugin.BasePlugin]):
            Required. An uninitialized plugin.

    Returns:
        An initialized plugin, or None if it cannot be initialized or is not needed.
    """
    if not plugin.can_initialize():
        logging.warning("Cannot initialize the plugin")
        return

    plugin.setup()

    if not plugin.post_setup_check():
        return

    return plugin()


def _run_app_thread(server: webserver.WebServer):
    """Run the webserver in a separate thread.

    Args:
        server (webserver.WebServer):
            Required. A webserver to accept requests.
    """
    daemon = threading.Thread(
        name="profile_server",
        target=serving.run_simple,
        args=("0.0.0.0", _HOST_PORT, server,),
    )
    daemon.daemon = True
    daemon.start()


def initialize(plugin: str = "tensorflow"):
    """Initializes the profiling SDK.

    Args:
        plugin (str):
            Optional. Name of the plugin to initialize.
            Current options are ["tensorflow"]

    Raises:
        ValueError:
            The plugin does not exist.
    """

    plugin_obj = _AVAILABLE_PLUGINS.get(plugin)

    if not plugin_obj:
        raise ValueError(
            "Plugin {} not available, must choose from {}".format(
                plugin, _AVAILABLE_PLUGINS.keys()
            )
        )

    prof_plugin = _build_plugin(plugin_obj)

    if not prof_plugin:
        return

    server = webserver.WebServer([prof_plugin])
    _run_app_thread(server)
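
The gating order in `_build_plugin` (dependency check, then setup, then post-setup check, then instantiation) can be exercised in isolation. The sketch below assumes nothing from the SDK: `FakePlugin`, its `available` and `is_chief` flags, and the `build_plugin` helper are hypothetical names that only imitate the control flow shown in the diff.

```python
from typing import Optional, Type


class FakePlugin:
    """Hypothetical plugin used only to exercise the gating logic."""

    available = True   # stands in for "required dependencies are installed"
    is_chief = True    # stands in for "this is the main training node"
    setup_calls = 0    # counts how often setup() ran for this class

    @classmethod
    def can_initialize(cls) -> bool:
        return cls.available

    @classmethod
    def setup(cls) -> None:
        cls.setup_calls += 1

    @classmethod
    def post_setup_check(cls) -> bool:
        return cls.is_chief


def build_plugin(plugin_cls: Type[FakePlugin]) -> Optional[FakePlugin]:
    """Return an instance only if every gate passes, else None."""
    if not plugin_cls.can_initialize():
        return None  # setup() is never reached
    plugin_cls.setup()
    if not plugin_cls.post_setup_check():
        return None  # setup() ran, but no instance is needed on this node
    return plugin_cls()


class MissingDeps(FakePlugin):
    available = False


class WorkerNode(FakePlugin):
    is_chief = False
```

In this sketch, `build_plugin(MissingDeps)` returns `None` without calling `setup()`, `build_plugin(WorkerNode)` calls `setup()` once and still returns `None`, and `build_plugin(FakePlugin)` returns an instance. That matches the docstring in `base_plugin.py`, which notes that non-main nodes only need `setup()` to run.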