-
Notifications
You must be signed in to change notification settings - Fork 346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: convert k8s submissions from pods to jobs #9296
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for determined-ui ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #9296 +/- ##
==========================================
- Coverage 48.57% 40.60% -7.98%
==========================================
Files 1234 662 -572
Lines 158841 77260 -81581
Branches 2778 0 -2778
==========================================
- Hits 77155 31368 -45787
+ Misses 81511 45892 -35619
+ Partials 175 0 -175
Flags with carried forward coverage won't be shown. Click here to find out more.
|
07bfc25
to
e99300c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
notes from my first skim through
a49c29b
to
dd9f56d
Compare
ae8e8e0
to
101af3d
Compare
e8a4aab
to
aad1242
Compare
All the failing tests are also failing on main, but I'm going to make an attempt to fix the relevant ones before landing at least. |
aad1242
to
a94c3c0
Compare
@@ -0,0 +1,751 @@ | |||
package kubernetesrm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note to reviewer: this code started as a copy of the old pod.go
(but modified so heavily github's diff algorithm switch from calling it a rename+changes to a delete+rewrite). jobs.go
is the same, copied from pods.go
. I'm calling this out to say: I refactored the code as I thought it was necessary and as it made me more confident in its correctness but I didn't go a lot further. I'll probably do a style-oriented refactor once this PR is in (it's all going to a feature branch for now).
.circleci/real_config.yml
Outdated
@@ -2158,6 +2158,8 @@ jobs: | |||
- setup-python-venv: | |||
executor: <<pipeline.parameters.machine-image>> | |||
- setup-go-intg-deps | |||
- run: curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 && sudo install minikube-linux-amd64 /usr/local/bin/minikube |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not a thing for this PR, but can we move this (and maybe the start too) out to a command so we don't have a proliferation of these huge command lines everywhere minikube is used? Should I open that as an issue somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds like a job for infra
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah can do. but fyi this is the only such line, I cherry picked from carolina's PR you reviewed. it'll go away once she lands and i rebase. mb, should've waited to mark it ready for review. would've saved you a bit of time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok. Cool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi, I merged the PR, so you can rebase against main to pick it up!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Infra-relevant parts look fine.
Is "Eventually we can deleted restarts..." in the PR description supposed to be "delegate" though?
Yep typo, thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got through everything but jobs.go/jobs_test.go -- everything looks like it checks out, excited for the bug bash tomorrow!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved, but before you merge I pointed out 2 changes that I think should either be called out explicitly in the PR description or given their own PR because I think they're unrelated to this feat
Also, I think you should reword the log messages into something a little more clear by putting the action verb first.
Besides that, just style comments, which I assume will make it into their own PR
j.syslog.Infof("saw pod %s in state %s", podName, cproto.Pulling) | ||
j.container.State = cproto.Pulling | ||
j.informTaskResourcesState() | ||
|
||
j.syslog.Infof("saw pod %s in state %s", podName, cproto.Starting) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't love the wording of these "saw pod X in state Y", maybe try "pulling/starting/ pod %s" --> strings.ToLower(cproto.Pulling) + " pod " + stringName
6fa894e
to
6d8e486
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
various ci fixes add consts fix import style revert unneeded helm changes last bit of review feedback lint fixes bring in carolina's config changes tmp stuff that i'm definitely keeping amends lints debug logging for weird failure only in CI debug logging test fixes test fixes fixes for reattach tests self review more self review fix annoyance pass numPods to recreateJobHandler final self review fix: job queue state not recovered on reattach various fixes
ed5620b
to
96f1ce4
Compare
@@ -99,11 +99,13 @@ build/mock_gen.stamp: $(MOCK_INPUTS) | |||
mockery --quiet --name=PodInterface --srcpkg=k8s.io/client-go/kubernetes/typed/core/v1 --output internal/mocks --filename pod_iface.go | |||
mockery --quiet --name=EventInterface --srcpkg=k8s.io/client-go/kubernetes/typed/core/v1 --output internal/mocks --filename event_iface.go | |||
mockery --quiet --name=NodeInterface --srcpkg=k8s.io/client-go/kubernetes/typed/core/v1 --output internal/mocks --filename node_iface.go | |||
mockery --quiet --name=JobInterface --srcpkg=k8s.io/client-go/kubernetes/typed/batch/v1 --output internal/mocks --filename job_iface.go |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to switch to a config file ASAP
req.State = msg.State | ||
if sproto.ScheduledStates[req.State] { | ||
k.allocationIDToRunningPods[id]++ | ||
k.allocationIDToRunningPods[msg.AllocationID] += msg.NumPods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the job know how many pods it has running, it is a little tragic that we need to keep track of this map
feel free to just make a follow up ticket or ignore any of these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, tragic but no it doesn't know how many are "running" where our definition of "running" is post scheduling and bound to a node.
Ticket
[RM-203,RM-204,RM-205,RM-206,RM-208,RM-213]
Description
Update our Kubernetes resource manager to submit one job per Determined task instead of many pods. This is a complicated change but we think it is worth it because:
Test Plan
Covered by automated tests.
Checklist
docs/release-notes/
.See Release Note for details.