Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

omnibus actual concurrency and major refactor #1530

Merged
merged 96 commits into from
May 16, 2024
Merged

Conversation

technillogue
Copy link
Contributor

@technillogue technillogue commented Feb 12, 2024

this should be reviewable at last. I'm mostly interested in whether the changes are comprehensible/legible and what comments I can add. there's a grab-bag of random changes like a /ready route, predictor.log, cancellation fixes, etc and the core change of moving worker into runner. I'm very open to reconsidering this (e.g. call it worker instead and maybe move some of it into http), but not right now. once this is merged, the plan is to gradually cut small changes from the async branch to merge into main, and review those changes more thoroughly.

original description I originally tried to split up my work in #1499 and #1508 as "refactor runner + add concurrency" and "fix uploads/downloads" but ended up interleaving these changes. this PR will just be the overall changeset for now, and hopefully as this coalesces more it'll be clear how to carve it up into separate changesets

major points:

  • add concurrency to cog.yaml
  • use httpx for everything except URLFile, pull out all the client code from everywhere else
  • completely rework URLPath
  • very dirty hack to unblock us on large file uploads
  • merge worker into runner (maybe the other way around would be better?)

Copy link
Contributor

@yorickvP yorickvP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't look at the tests yet

pkg/config/config.go Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
pkg/config/validator.go Outdated Show resolved Hide resolved
python/cog/predictor.py Outdated Show resolved Hide resolved
python/cog/predictor.py Outdated Show resolved Hide resolved
python/cog/server/clients.py Outdated Show resolved Hide resolved
python/cog/server/clients.py Outdated Show resolved Hide resolved
python/cog/server/clients.py Outdated Show resolved Hide resolved
python/cog/server/http.py Outdated Show resolved Hide resolved
python/cog/server/clients.py Outdated Show resolved Hide resolved
@technillogue technillogue force-pushed the syl/more-refactor branch 2 times, most recently from 8f7a594 to 3563178 Compare February 19, 2024 18:02
@technillogue technillogue marked this pull request as ready for review February 19, 2024 23:53
@technillogue technillogue force-pushed the syl/more-refactor branch 2 times, most recently from 9bc6ece to f390777 Compare February 21, 2024 19:21
@technillogue technillogue force-pushed the syl/more-refactor branch 2 times, most recently from d965185 to 644d1cd Compare February 29, 2024 21:45
Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
…busy state to runner

Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
…point return the same result and fix tests somewhat

Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
Image string `json:"image,omitempty" yaml:"image"`
Predict string `json:"predict,omitempty" yaml:"predict"`
Train string `json:"train,omitempty" yaml:"train"`
Concurrency int `json:"concurrency,omitempty" yaml:"concurrency"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of me wonders if this should be an object with a single key (max) so that it can be extended in future if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sounds like a pretty reasonable idea, what other keys do you think it might have?

# del response
# except ValidationError as e:
# _log_invalid_output(e)
# raise HTTPException(status_code=500, detail=str(e)) from e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the implications of commenting this out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputs are not validated, no error is raised if you return a string instead of an int. this turned out to be a bottleneck with sufficient throughput in profiling, and also led to a bug mentioned in the comments where Path is converted to str but then validation converts it back to Path and this round-trip no longer works the way it used to

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned when we spoke, it seems quite confusing to me to take 90% of the code from Worker and put it in PredictionRunner rather than making Worker do what we need.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, I'm not sure how to find the line for all the enter_predict/exit_predict stuff if it's bound by being def () -> AsyncIterator[Event]. I would love more detailed input on this.

@@ -355,26 +178,34 @@ def _loop_sync(self) -> None:
break
if isinstance(ev, PredictionInput):
self._predict_sync(ev)
elif isinstance(ev, Cancel):
pass # we should have gotten a signal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand this. Are cancel events sent down the pipe or are they communicated via signal? It should be one or the other, not both, and if we're sending them down the pipe but then ignoring them... why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the parent doesn't necessarily know if the predictor is async or not, we send both the event and the signal. sync predictors ignore the event, async predictors ignore the signal.

def test_return_wrong_type(client):
resp = client.post("/predictions")
assert resp.status_code == 500
# it's not the worst idea to validate outputs but it's slow and not required
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this change fits in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see previous comment, this is also related to File(upload(File(x))) != x

# We call cancel a WHOLE BUNCH to make sure that we don't propagate
# any of those cancelations to subsequent predictions, regardless
# of the internal implementation of exceptions raised inside signal
# handlers.
for _ in range(100):
w.cancel()
w.cancel(input1.id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes, does this mean we're relying on Worker.predict to modify its arguments in order to communicate prediction ID back to the caller? That seems very hacky.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no? the prediction ID must be fixed before it makes it to Worker.predict. it is autogenerated if not set on PredictionRequest and then you're supposed to PredictionInput.from_request(request...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I missing something? You deleted the Worker class so how come all these tests aren't failing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's due to pytest.skip(allow_module_level=True)

Copy link
Member

@nickstenning nickstenning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is (as advertised, to be fair) a bit of a mishmash of changes with different intents and scopes, and as such it's a bit hard to review.

If I understand it correctly the gist of the effort here is to update the Worker API to support the management of multiple predictions, indexed by their IDs. That seems sensible, but I don't understand why we moved 80% of Worker to PredictionRunner to do that.

I'm also more than a little bit suspicious at the fact that the CI checks on this branch are all green, especially given that by far the most substantive part of the cog test suite (test_worker.py) is apparently not running at all.

@technillogue technillogue reopened this May 10, 2024
@technillogue technillogue force-pushed the syl/more-refactor branch 7 times, most recently from bd69678 to 7098fde Compare May 16, 2024 20:09
Signed-off-by: technillogue <technillogue@gmail.com>
…ancelation and validation

Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
@technillogue technillogue merged commit 0ebfc54 into async May 16, 2024
10 checks passed
@technillogue technillogue deleted the syl/more-refactor branch May 16, 2024 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants