feat: Framework splitting #9318

Merged
merged 20 commits into main from framework_splitting on May 15, 2024

MikhailKardash
Contributor

@MikhailKardash MikhailKardash commented May 6, 2024

Ticket

MD-272

Description

Makes pytorch-ngc our default image and sets tensorflow-ngc in tests that require TensorFlow.
Moves the legacy PyTorchTrial launcher from horovod to torch.distributed.launch.
Updates the docs extensively.
Updates the master restart test to include a configurable timeout. The test now has to pull the pytorch image, so it times out. This will be fixed in a later environments PR, but this PR has to be merged first.
Also splits our tests more cleanly at the directory level to draw a clearer distinction between PyTorch and TensorFlow.
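
As a rough illustration of "sets tensorflow-ngc in tests when required", the override might look like the minimal sketch below. It assumes the experiment config accepts an `environment.image` mapping with `cpu` and `cuda` keys. The helper name `with_tf_image`, the sample config, and the choice of tag are hypothetical and are not code from this PR.

```python
import copy

# Assumed tag for illustration only; use whatever tag the tests actually pin.
TF_NGC_IMAGE = "determinedai/tensorflow-ngc-dev:8b3bea3"


def with_tf_image(config: dict, image: str = TF_NGC_IMAGE) -> dict:
    """Return a copy of an experiment config that pins a TensorFlow image.

    Assumes `environment.image` may be a mapping with `cpu` and `cuda` keys.
    The input config is left untouched so other tests keep the default image.
    """
    patched = copy.deepcopy(config)
    env = patched.setdefault("environment", {})
    env["image"] = {"cpu": image, "cuda": image}
    return patched


# Hypothetical TensorFlow test config that would otherwise get the default image.
base = {"name": "iris_tf_keras", "entrypoint": "model_def:IrisTrial"}
tf_config = with_tf_image(base)
```

Returning a patched copy keeps the PyTorch default in place for every config that does not opt in.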

Test Plan

Unit tests pass, experiments default to pytorch-ngc images.
Verify that Tensorboard launches from ngc-based image experiments.

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@MikhailKardash MikhailKardash requested review from a team as code owners May 6, 2024 21:13
@cla-bot cla-bot bot added the cla-signed label May 6, 2024
@determined-ci determined-ci added the documentation Improvements or additions to documentation label May 6, 2024
@determined-ci determined-ci requested a review from a team May 6, 2024 21:13
@MikhailKardash MikhailKardash requested review from tara-det-ai and removed request for a team May 6, 2024 21:13

netlify bot commented May 6, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit e448f0a
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6643ac52380ae40008c93a35


codecov bot commented May 6, 2024

Codecov Report

Attention: Patch coverage is 93.75%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 45.29%. Comparing base (4f180db) to head (e448f0a).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9318   +/-   ##
=======================================
  Coverage   45.28%   45.29%           
=======================================
  Files        1227     1227           
  Lines      154048   154051    +3     
  Branches     2404     2404           
=======================================
+ Hits        69766    69776   +10     
+ Misses      84090    84083    -7     
  Partials      192      192           
Flag Coverage Δ
backend 41.75% <ø> (ø)
harness 64.09% <93.75%> (+0.02%) ⬆️
web 36.33% <ø> (ø)

Files Coverage Δ
harness/determined/exec/launch.py 78.00% <ø> (ø)
...ness/tests/experiment/keras/test_tf_keras_trial.py 98.83% <100.00%> (+0.08%) ⬆️
harness/tests/launch/test_launch.py 100.00% <100.00%> (ø)
harness/determined/core/_profiler.py 57.56% <0.00%> (ø)
harness/tests/experiment/pytorch/test_local.py 91.66% <91.66%> (ø)

... and 4 files with indirect coverage changes

Contributor

@djanicekpach djanicekpach left a comment

CI config changes seemed reasonable to me.

Member

@tara-det-ai tara-det-ai left a comment

suggested edits


@highvelcty highvelcty left a comment

I am new to determined and new to this code set, but the changes look good to me! I made one minor comment about an import within a test method instead of at the module level. Feel free to take it or leave it :). Thanks for the changes!

Contributor

@gt2345 gt2345 left a comment

Confirming that experiment image gets updated

"cpu": "determinedai/pytorch-tensorflow-cpu-dev:8b3bea3",
"cuda": "determinedai/pytorch-tensorflow-cuda-dev:8b3bea3",
"cpu": "determinedai/tensorflow-ngc-dev:8b3bea3",
"cuda": "determinedai/tensorflow-ngc-dev:8b3bea3\"",
Contributor

typo?

@azhou-determined
Contributor

do we install tensorboard in the pytorch-only images? IIRC, launching a tensorboard from an experiment will default to the image defined in an experiment.

another thing to test would be if you launch a tensorboard with multiple experiments, where one has pytorch-only and another is tf-only.

after this change, what image do we expect to be used to launch NTSC?

@MikhailKardash
Contributor Author

MikhailKardash commented May 13, 2024

do we install tensorboard in the pytorch-only images? IIRC, launching a tensorboard from an experiment will default to the image defined in an experiment.

Yes we do, that's done in environments: https://github.com/determined-ai/environments/blob/8b3bea3b81bd3934ad885bdf159320f11f6ea1ba/Dockerfile-pytorch-ngc#L21

another thing to test would be if you launch a tensorboard with multiple experiments, where one has pytorch-only and another is tf-only.

I'll verify this and then add it to testing instructions.

after this change, what image do we expect to be used to launch NTSC?

After this change, we expect to launch in determinedai/pytorch-ngc-dev:<HASH> images. These will be retagged by release-party to be determinedai/pytorch-ngc:<VERSION>
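
The naming convention described above can be sketched as a small helper. This only illustrates the `-dev:<HASH>` to `:<VERSION>` mapping; how release-party actually performs the retag is not shown in this thread, and the function name `release_tag` and the version string are hypothetical.

```python
def release_tag(dev_image: str, version: str) -> str:
    """Map a `<repo>-dev:<HASH>` image name to `<repo>:<VERSION>`."""
    repo, sep, _commit_hash = dev_image.rpartition(":")
    if not sep or not repo.endswith("-dev"):
        raise ValueError(f"not a dev image: {dev_image!r}")
    # Drop the "-dev" suffix and swap the commit hash for the release version.
    return f"{repo[: -len('-dev')]}:{version}"


# "0.0.0" is a placeholder version, not a real release number.
print(release_tag("determinedai/pytorch-ngc-dev:8b3bea3", "0.0.0"))
# prints "determinedai/pytorch-ngc:0.0.0"
```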

@MikhailKardash
Contributor Author

general question: is there a doc somewhere that catalogues all the different images we offer and their intended use?

We have this page in our docs: https://docs.determined.ai/latest/model-dev-guide/prepare-container/set-environment-images.html#set-environment-images
Which is updated in this PR: docs/model-dev-guide/prepare-container/set-environment-images.rst

@azhou-determined
Contributor

azhou-determined commented May 13, 2024

do we install tensorboard in the pytorch-only images? IIRC, launching a tensorboard from an experiment will default to the image defined in an experiment.

Yes we do, that's done in environments: https://github.com/determined-ai/environments/blob/8b3bea3b81bd3934ad885bdf159320f11f6ea1ba/Dockerfile-pytorch-ngc#L21

seems like that environments image only installs the profilers? where does actual tensorboard get installed, is it in the base image maybe?

also, if we only install torch-tb-profiler in torch images, and tensorflow-plugin-profile in tf images, then the case when someone compares multiple experiments where one profiles with torch and the other with TF profiler, seems this might cause issues?

@MikhailKardash
Contributor Author

seems like that environments image only installs the profilers? where does actual tensorboard get installed, is it in the base image maybe?

torch-tb-profiler installs tensorboard as a dependency. See: https://github.com/pytorch/kineto/blob/84d95b8232167674eee17c11c2198276f4a6482c/tb_plugin/setup.py#L29
They version pin tensorboard on their end anyways.
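
One way to confirm this from inside an image, as a minimal sketch: read the installed package's declared requirements with the standard-library `importlib.metadata` and look for `tensorboard`. The helper names below are hypothetical, and the check only reports what the installed distribution declares, not which version ends up installed.

```python
import re
from importlib import metadata


def _normalize(name: str) -> str:
    """Normalize a project name so `-`, `_` and `.` compare as equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def declares_dependency(requirements, wanted: str) -> bool:
    """Check whether any requirement string names the package `wanted`."""
    for req in requirements or []:
        # A requirement string starts with the project name, followed by
        # extras, version specifiers, or environment markers.
        match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", req)
        if match and _normalize(match.group(1)) == _normalize(wanted):
            return True
    return False


def installs_tensorboard(package: str = "torch-tb-profiler") -> bool:
    """Return True if the installed `package` declares tensorboard."""
    try:
        return declares_dependency(metadata.requires(package), "tensorboard")
    except metadata.PackageNotFoundError:
        # The package is not installed in this environment.
        return False
```

Running `installs_tensorboard()` inside a pytorch-ngc container would report whether the profiler plugin pulls in tensorboard there.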

also, if we only install torch-tb-profiler in torch images, and tensorflow-plugin-profile in tf images, then the case when someone compares multiple experiments where one profiles with torch and the other with TF profiler, seems this might cause issues?

They do all load in tensorboard right now. See the following image, where trial 3 is a pytorch mnist trial while trials 4-6 are iris tfkeras trials. Note that the tensorboard inherits the image used in the trial it is associated with. I tried with ngc-tensorflow-dev, ngc-pytorch-dev, and even pytorch-tensorflow and they all work.
[Image: TensorBoard screenshot of the trials described above]

Contributor

@azhou-determined azhou-determined left a comment

nice

@MikhailKardash MikhailKardash merged commit a96cafd into main May 15, 2024
85 of 98 checks passed
@MikhailKardash MikhailKardash deleted the framework_splitting branch May 15, 2024 16:00
kkunapuli pushed a commit that referenced this pull request May 16, 2024
* make ngc-pytorch images default
* fix a profiler bug
* update docs for ngc image changes
billboggs added a commit that referenced this pull request May 22, 2024
Labels
cla-signed documentation Improvements or additions to documentation

8 participants