feat: Add cloud profiler to training_utils #828

Merged
merged 69 commits into from Nov 29, 2021

mkovalski
Contributor

@mkovalski mkovalski commented Nov 8, 2021

Adds the ability to profile Vertex AI training jobs using the TensorBoard profiler.

  • Add a base plugin and a TensorFlow profiler plugin to the cloud training tools.
  • Create helpers for uploading profiled items to the TensorBoard backend.
  • Add environment variables for setting the web server port.

Fixes #519
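The plugin layout described above (a base plugin plus a profiler-specific plugin) might be sketched roughly as follows. This is only an illustration of the pattern, not the code in this PR: `BasePlugin`, `EchoProfilerPlugin`, `build_route_table`, and every method name here are hypothetical.

```python
import abc
from typing import Callable, Dict, Iterable

# A handler takes a parsed request (a plain dict here) and returns a response body.
Handler = Callable[[dict], str]


class BasePlugin(abc.ABC):
    """Hypothetical base class; the interface in the actual PR may differ."""

    # Name used to namespace the plugin's routes.
    PLUGIN_NAME: str = ""

    @abc.abstractmethod
    def can_initialize(self) -> bool:
        """Return True if the environment supports this plugin."""

    @abc.abstractmethod
    def get_routes(self) -> Dict[str, Handler]:
        """Map URL paths (relative to the plugin) to handlers."""


class EchoProfilerPlugin(BasePlugin):
    """Toy stand-in for a profiler plugin; it does no real profiling."""

    PLUGIN_NAME = "echo"

    def can_initialize(self) -> bool:
        return True

    def get_routes(self) -> Dict[str, Handler]:
        return {"/capture": self._capture}

    def _capture(self, request: dict) -> str:
        return f"captured {request.get('duration_ms', 0)} ms"


def build_route_table(plugins: Iterable[BasePlugin]) -> Dict[str, Handler]:
    """Collect routes from every plugin that reports it can initialize."""
    table: Dict[str, Handler] = {}
    for plugin in plugins:
        if not plugin.can_initialize():
            continue  # Skip plugins whose requirements are not met.
        for path, handler in plugin.get_routes().items():
            table[f"/{plugin.PLUGIN_NAME}{path}"] = handler
    return table
```

A web server could then dispatch on the keys of `build_route_table([...])`, so adding another profiler only means adding another `BasePlugin` subclass.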

mkovalski and others added 30 commits August 23, 2021 15:10
@mkovalski mkovalski requested a review from a team as a code owner November 8, 2021 16:04
@product-auto-label product-auto-label bot added the "api: aiplatform" label (Issues related to the AI Platform API.) Nov 8, 2021
@google-cla google-cla bot added the "cla: yes" label (This human has signed the Contributor License Agreement.) Nov 8, 2021
@nicain
Contributor

nicain commented Nov 8, 2021

@mkovalski: I am assigning you as owner of this PR; feel free to ping reviewers as needed to make sure the review process progresses in a timely fashion, or provide guidance on who might better own the process of getting the PR reviewed, passing continuous testing, and merged. Reach out if you have questions.

tests/unit/aiplatform/test_cloud_profiler.py Outdated

if not environment_variables.http_handler_port:
    raise MissingEnvironmentVariableException(
        "'AIP_HTTP_HANDLER_PORT' must be set."
Member

Should the user set this using env or is this set by the service?

Contributor Author

This is set by the service.
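For readers following the exchange above, a minimal sketch of such a check is shown below. It assumes the port arrives as the `AIP_HTTP_HANDLER_PORT` environment variable, as the error message suggests. The helper name `get_http_handler_port` and the local exception class are stand-ins written for this example, not the PR's actual code.

```python
import os
from typing import Mapping, Optional


class MissingEnvironmentVariableException(Exception):
    """Stand-in for the exception named in the reviewed snippet."""


def get_http_handler_port(environ: Optional[Mapping[str, str]] = None) -> int:
    """Return the handler port, or raise if the service did not provide it.

    `environ` defaults to os.environ; passing a dict makes the check testable.
    """
    environ = os.environ if environ is None else environ
    port = environ.get("AIP_HTTP_HANDLER_PORT")
    if not port:
        # Covers both a missing variable and an empty string.
        raise MissingEnvironmentVariableException(
            "'AIP_HTTP_HANDLER_PORT' must be set."
        )
    return int(port)
```

Because the service is expected to set the variable, hitting this exception most likely means the code is running outside a training job, for example on a local machine.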


from google.cloud.aiplatform.training_utils.cloud_profiler.plugins import base_plugin
from typing import List
from werkzeug import wrappers
Member

Wrap with an informative ImportError exception.
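One way to act on this suggestion is sketched below: catch the `ImportError` raised by an optional dependency and re-raise it with a message telling the user what to install. The helper `import_or_explain` and the hint text are hypothetical; only `importlib.import_module` is a real standard-library call.

```python
import importlib
from types import ModuleType


def import_or_explain(module_name: str, install_hint: str) -> ModuleType:
    """Import a module, replacing a bare ImportError with an actionable one."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the full traceback stays available.
        raise ImportError(
            f"Could not import '{module_name}'. {install_hint}"
        ) from exc
```

For the import under review this could be used as `wrappers = import_or_explain("werkzeug.wrappers", "Install it with: pip install werkzeug")`. The same effect can be had with a plain `try`/`except ImportError` around the import statements at the top of the module.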

setup.py Outdated
  full_extra_require = list(
-     set(tensorboard_extra_require + metadata_extra_require + xai_extra_require)
+     set(
Member

TF version should be handled explicitly since TB, XAI, and Profiler have different version bounds.

Contributor Author

Done.
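To illustrate the concern raised in this thread: if each extra carries its own TensorFlow bound, a plain `set()` union of the requirement lists keeps every distinct bound string, so the combined extra ends up with several `tensorflow` entries. A rough sketch of handling the bound explicitly follows. All package names and version bounds in it are placeholders invented for the example, not the project's real pins, and `build_extras` is a hypothetical helper.

```python
from typing import Dict, List

# Placeholder dependencies per extra, deliberately WITHOUT any tensorflow entry.
NON_TF_DEPS: Dict[str, List[str]] = {
    "tensorboard": ["example-tensorboard-dep"],
    "xai": ["example-xai-dep"],
    "cloud_profiler": ["example-profiler-dep", "werkzeug"],
}

# Placeholder TensorFlow bound for each extra when installed on its own.
TF_BOUND: Dict[str, str] = {
    "tensorboard": "tensorflow >=2.3.0, <3.0.0",
    "xai": "tensorflow >=2.3.0, <3.0.0",
    "cloud_profiler": "tensorflow >=2.4.0, <3.0.0",
}

# Single bound chosen by hand for the combined extra (here, the strictest one).
FULL_TF_BOUND = "tensorflow >=2.4.0, <3.0.0"


def build_extras(
    non_tf_deps: Dict[str, List[str]],
    tf_bound: Dict[str, str],
    full_tf_bound: str,
) -> Dict[str, List[str]]:
    """Build extras so that every extra has exactly one tensorflow entry."""
    extras = {
        name: sorted(deps + [tf_bound[name]])
        for name, deps in non_tf_deps.items()
    }
    # Merge only the non-TF dependencies, then add the one explicit TF bound.
    merged = {dep for deps in non_tf_deps.values() for dep in deps}
    extras["full"] = sorted(merged | {full_tf_bound})
    return extras


EXTRAS = build_extras(NON_TF_DEPS, TF_BOUND, FULL_TF_BOUND)
```

A dictionary shaped like `EXTRAS` is what `setup()` accepts as `extras_require`.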

@sasha-gitg sasha-gitg merged commit 6d5c7c4 into googleapis:main Nov 29, 2021
Labels
api: aiplatform (Issues related to the AI Platform API.)
cla: yes (This human has signed the Contributor License Agreement.)
Development

Successfully merging this pull request may close these issues.

Add remote tensorflow profiling to training jobs.
3 participants