feat: convert k8s submissions from pods to jobs #9296

Open
wants to merge 12 commits into
base: main

Conversation

stoksc
Contributor

@stoksc stoksc commented May 2, 2024

Ticket

[RM-203,RM-204,RM-205,RM-206,RM-208,RM-213]

Description

Update our Kubernetes resource manager to submit one job per Determined task instead of many pods. This is a complicated change but we think it is worth it because:

  • Jobs play nice with resource quotas and other Kubernetes features out of the box.
  • Eventually we can delegate restarts, TTL, pause/resume (using suspend), and more to jobs.
  • They allow us to integrate with kueue immediately.
  • If we want to support VolcanoJobs we are much closer (and it is easier to maintain Job+VolcanoJob than Pods+VolcanoJob).
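
As a rough illustration of the shift described above, the sketch below contrasts submitting N pods with submitting one job sized for N pods. The `jobSpec` struct is a simplified stand-in for a few fields of the Kubernetes batch/v1 `JobSpec` (`parallelism`, `completions`, `suspend`). The helper names, the `-job` suffix, and the choice to set parallelism equal to completions are assumptions for illustration, not code from this PR.

```go
package main

import "fmt"

// jobSpec is a simplified, hypothetical stand-in for a few batch/v1 JobSpec
// fields. It is not the real Kubernetes type.
type jobSpec struct {
	Name        string
	Parallelism int32 // pods that may run at the same time
	Completions int32 // pods that must succeed for the job to finish
	Suspend     bool  // true pauses the job without deleting it
}

// podNames mimics the old model: numPods separately submitted pods.
func podNames(taskID string, numPods int) []string {
	names := make([]string, 0, numPods)
	for i := 0; i < numPods; i++ {
		names = append(names, fmt.Sprintf("%s-pod-%d", taskID, i))
	}
	return names
}

// jobForTask mimics the new model: one job per task, sized for numPods.
// Setting Parallelism == Completions is an assumption of this sketch.
func jobForTask(taskID string, numPods int) jobSpec {
	return jobSpec{
		Name:        taskID + "-job",
		Parallelism: int32(numPods),
		Completions: int32(numPods),
	}
}

func main() {
	fmt.Println(len(podNames("trial-1", 4)), "objects submitted in the pod model")
	fmt.Printf("job model submits one object: %+v\n", jobForTask("trial-1", 4))
}
```

With a single object per task, a job-level feature such as `suspend` would apply to every pod of the task at once, which is roughly the appeal of delegating pause/resume to jobs.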

Test Plan

Covered by automated tests.

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.
  • File ticket for rp.namespace weirdness, graceful shutdown, using indexed completions

@cla-bot cla-bot bot added the cla-signed label May 2, 2024

netlify bot commented May 2, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 96f1ce4
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/664e6d94bdb6a6000850cfea
😎 Deploy Preview https://deploy-preview-9296--determined-ui.netlify.app

codecov bot commented May 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 40.60%. Comparing base (ca45198) to head (96f1ce4).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9296      +/-   ##
==========================================
- Coverage   48.57%   40.60%   -7.98%     
==========================================
  Files        1234      662     -572     
  Lines      158841    77260   -81581     
  Branches     2778        0    -2778     
==========================================
- Hits        77155    31368   -45787     
+ Misses      81511    45892   -35619     
+ Partials      175        0     -175     
Flag Coverage Δ
harness 37.91% <ø> (-26.11%) ⬇️
web ?


Files Coverage Δ
master/internal/db/postgres_test_utils.go 82.37% <ø> (+0.21%) ⬆️
master/internal/rm/agentrm/resource_pool.go 34.20% <ø> (+0.07%) ⬆️
master/internal/rm/kubernetesrm/informer.go 81.92% <ø> (+0.97%) ⬆️
...nal/rm/kubernetesrm/kubernetes_resource_manager.go 29.15% <ø> (+0.76%) ⬆️
master/internal/rm/kubernetesrm/request_queue.go 86.58% <ø> (+1.96%) ⬆️
master/internal/rm/kubernetesrm/request_workers.go 89.58% <ø> (+1.05%) ⬆️
master/internal/rm/kubernetesrm/resource_pool.go 44.47% <ø> (+1.09%) ⬆️
master/internal/rm/kubernetesrm/spec.go 73.16% <ø> (+0.98%) ⬆️
master/internal/sproto/resources.go 23.94% <ø> (-1.06%) ⬇️
master/internal/sproto/task.go 41.37% <ø> (+0.47%) ⬆️
... and 7 more

... and 671 files with indirect coverage changes

@stoksc stoksc force-pushed the stoksc/feat/pods2jobs branch 2 times, most recently from 07bfc25 to e99300c Compare May 6, 2024 21:36
Contributor

@carolinaecalderon carolinaecalderon left a comment

notes from my first skim through

master/internal/rm/kubernetesrm/job.go (outdated, resolved)
master/internal/rm/kubernetesrm/job.go (outdated, resolved)
master/internal/rm/kubernetesrm/job.go (resolved)
master/internal/rm/kubernetesrm/job.go (outdated, resolved)
master/internal/rm/kubernetesrm/job.go (outdated, resolved)
helm/charts/determined/values.yaml (outdated, resolved)
helm/charts/determined/templates/master-deployment.yaml (outdated, resolved)
master/internal/rm/kubernetesrm/spec.go (outdated, resolved)
master/internal/rm/kubernetesrm/spec.go (resolved)
@stoksc stoksc changed the title Stoksc/feat/pods2jobs feat: convert k8s submissions from pods to jobs May 9, 2024
@stoksc stoksc force-pushed the stoksc/feat/pods2jobs branch 5 times, most recently from ae8e8e0 to 101af3d Compare May 16, 2024 18:27
@stoksc stoksc force-pushed the stoksc/feat/pods2jobs branch 4 times, most recently from e8a4aab to aad1242 Compare May 17, 2024 21:17
@stoksc stoksc marked this pull request as ready for review May 17, 2024 23:01
@stoksc stoksc requested review from a team as code owners May 17, 2024 23:01
@stoksc
Contributor Author

stoksc commented May 17, 2024

All the failing tests are also failing on main, but I'm going to make an attempt to fix the relevant ones before landing at least.

@stoksc stoksc changed the base branch from main to stoksc/feat/kubernetesjobs May 17, 2024 23:25
@stoksc stoksc requested review from a team as code owners May 17, 2024 23:25
@stoksc stoksc requested a review from ashtonG May 17, 2024 23:25
@stoksc stoksc removed request for a team and ashtonG May 17, 2024 23:27
@@ -0,0 +1,751 @@
package kubernetesrm
Contributor Author

@stoksc stoksc May 17, 2024

note to reviewer: this code started as a copy of the old pod.go (but modified so heavily that GitHub's diff algorithm switched from calling it a rename+changes to a delete+rewrite). jobs.go is the same, copied from pods.go. I'm calling this out to say: I refactored the code where I thought it was necessary and where it made me more confident in its correctness, but I didn't go much further. I'll probably do a style-oriented refactor once this PR is in (it's all going to a feature branch for now).

@@ -2158,6 +2158,8 @@ jobs:
- setup-python-venv:
executor: <<pipeline.parameters.machine-image>>
- setup-go-intg-deps
- run: curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 && sudo install minikube-linux-amd64 /usr/local/bin/minikube
Member

Maybe not a thing for this PR, but can we move this (and maybe the start too) out to a command so we don't have a proliferation of these huge command lines everywhere minikube is used? Should I open that as an issue somewhere?

Contributor

sounds like a job for infra

Contributor Author

yeah can do. but fyi this is the only such line, I cherry picked from carolina's PR you reviewed. it'll go away once she lands and i rebase. mb, should've waited to mark it ready for review. would've saved you a bit of time.

Member

Ah, ok. Cool.

Contributor

fyi, I merged the PR, so you can rebase against main to pick it up!

Member

@dannysauer dannysauer left a comment

Infra-relevant parts look fine.

Is "Eventually we can deleted restarts..." in the PR description supposed to be "delegate" though?

@stoksc
Contributor Author

stoksc commented May 18, 2024

Yep typo, thanks.

Contributor

@carolinaecalderon carolinaecalderon left a comment

got through everything but jobs.go/jobs_test.go -- everything looks like it checks out, excited for the bug bash tomorrow!

master/internal/rm/kubernetesrm/job.go (outdated, resolved)
master/internal/rm/kubernetesrm/job.go (outdated, resolved)
master/internal/rm/kubernetesrm/job.go (outdated, resolved)
master/internal/rm/kubernetesrm/job.go (outdated, resolved)
master/internal/rm/kubernetesrm/job.go (outdated, resolved)
master/internal/rm/kubernetesrm/jobs.go (outdated, resolved)
master/internal/rm/kubernetesrm/jobs.go (outdated, resolved)
master/internal/rm/kubernetesrm/jobs.go (outdated, resolved)
master/internal/rm/kubernetesrm/jobs.go (outdated, resolved)
master/internal/rm/kubernetesrm/resource_pool.go (outdated, resolved)
Contributor

@carolinaecalderon carolinaecalderon left a comment

Approved, but before you merge: I pointed out 2 changes that I think should either be called out explicitly in the PR description or given their own PR, because I think they're unrelated to this feat.
Also, I think you should reword the log messages to be a little clearer by putting the action verb first.
Besides that, just style comments, which I assume will make it into their own PR.

Comment on lines 335 to 339
j.syslog.Infof("saw pod %s in state %s", podName, cproto.Pulling)
j.container.State = cproto.Pulling
j.informTaskResourcesState()

j.syslog.Infof("saw pod %s in state %s", podName, cproto.Starting)
Contributor

nit: I don't love the wording of these "saw pod X in state Y" messages; maybe try "pulling/starting pod %s" --> strings.ToLower(cproto.Pulling) + " pod " + podName
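
A minimal sketch of the verb-first wording suggested in this comment. The `State` type here is a hypothetical stand-in for `cproto.State` (assumed to be a string-based type, hence the explicit `string(...)` conversion), and `podStateMessage` is an illustrative helper name, not something from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// State is a hypothetical stand-in for cproto.State, assumed to be a string type.
type State string

const (
	Pulling  State = "PULLING"
	Starting State = "STARTING"
)

// podStateMessage puts the action first: lowercased state, then the pod name.
func podStateMessage(state State, podName string) string {
	return strings.ToLower(string(state)) + " pod " + podName
}

func main() {
	fmt.Println(podStateMessage(Pulling, "example-pod"))
	fmt.Println(podStateMessage(Starting, "example-pod"))
}
```

A helper like this would keep the wording consistent across states, at the cost of tying the log text to the spelling of the state constants.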

@determined-ci determined-ci requested a review from a team May 21, 2024 23:29
@determined-ci determined-ci added the documentation Improvements or additions to documentation label May 21, 2024
@stoksc stoksc changed the base branch from stoksc/feat/kubernetesjobs to main May 22, 2024 00:40
@determined-ci determined-ci removed the documentation Improvements or additions to documentation label May 22, 2024
Member

@tara-det-ai tara-det-ai left a comment

LGTM

stoksc added 12 commits May 22, 2024 12:11
various ci fixes

add consts

fix import style

revert unneeded helm changes

last bit of review feedback

lint fixes

bring in carolina's config changes

tmp

stuff that i'm definitely keeping

amends

lints

debug logging for weird failure only in CI

debug logging

test fixes

test fixes

fixes for reattach tests

self review

more self review

fix annoyance

pass numPods to recreateJobHandler

final self review

fix: job queue state not recovered on reattach

various fixes
@@ -99,11 +99,13 @@ build/mock_gen.stamp: $(MOCK_INPUTS)
mockery --quiet --name=PodInterface --srcpkg=k8s.io/client-go/kubernetes/typed/core/v1 --output internal/mocks --filename pod_iface.go
mockery --quiet --name=EventInterface --srcpkg=k8s.io/client-go/kubernetes/typed/core/v1 --output internal/mocks --filename event_iface.go
mockery --quiet --name=NodeInterface --srcpkg=k8s.io/client-go/kubernetes/typed/core/v1 --output internal/mocks --filename node_iface.go
mockery --quiet --name=JobInterface --srcpkg=k8s.io/client-go/kubernetes/typed/batch/v1 --output internal/mocks --filename job_iface.go
Contributor

we need to switch to a config file ASAP

https://hpe-aiatscale.atlassian.net/browse/RM-277

req.State = msg.State
if sproto.ScheduledStates[req.State] {
k.allocationIDToRunningPods[id]++
k.allocationIDToRunningPods[msg.AllocationID] += msg.NumPods
Contributor

does the job know how many pods it has running? it is a little tragic that we need to keep track of this map

feel free to just make a follow up ticket or ignore any of these

Contributor Author

I agree, tragic, but no, it doesn't know how many are "running", where our definition of "running" is post-scheduling and bound to a node.
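
For readers following this thread, here is a small sketch of the bookkeeping under discussion: a per-allocation count that grows by the job's pod count once the job reaches a scheduled state. Only `AllocationID` and `NumPods` come from the diff; the `stateChange` struct, the `Scheduled` flag (standing in for the `sproto.ScheduledStates` lookup), and the `runningPods` type are assumptions made for illustration.

```go
package main

import "fmt"

// stateChange is a hypothetical message type. Scheduled stands in for the
// sproto.ScheduledStates[req.State] check in the real code.
type stateChange struct {
	AllocationID string
	Scheduled    bool
	NumPods      int
}

// runningPods maps an allocation ID to the number of pods counted as running,
// meaning scheduled and bound to a node.
type runningPods map[string]int

// apply adds the job's pod count when the message reports a scheduled state.
func (r runningPods) apply(msg stateChange) {
	if msg.Scheduled {
		r[msg.AllocationID] += msg.NumPods
	}
}

func main() {
	counts := runningPods{}
	counts.apply(stateChange{AllocationID: "alloc-a", Scheduled: true, NumPods: 4})
	counts.apply(stateChange{AllocationID: "alloc-b", Scheduled: false, NumPods: 2})
	fmt.Println(counts["alloc-a"], counts["alloc-b"])
}
```

The count moves by `NumPods` rather than by one because a single job-level message now covers several pods.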

7 participants