omnibus actual concurrency and major refactor #1530
Conversation
didn't look at the tests yet
Image       string `json:"image,omitempty" yaml:"image"`
Predict     string `json:"predict,omitempty" yaml:"predict"`
Train       string `json:"train,omitempty" yaml:"train"`
Concurrency int    `json:"concurrency,omitempty" yaml:"concurrency"`
Part of me wonders if this should be an object with a single key (`max`) so that it can be extended in future if needed.
that sounds like a pretty reasonable idea, what other keys do you think it might have?
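For illustration, a minimal sketch of how an object form with a `max` key could stay backward compatible with a bare integer. This is not cog's actual schema; `parse_concurrency` is a hypothetical helper:

```python
from typing import Any, Dict

def parse_concurrency(value: Any) -> Dict[str, int]:
    # hypothetical parser: a bare int is shorthand for {"max": value}
    if isinstance(value, bool):
        raise TypeError("concurrency must be an int or a mapping")
    if isinstance(value, int):
        return {"max": value}
    if isinstance(value, dict):
        if "max" not in value:
            raise ValueError("concurrency object requires a 'max' key")
        return dict(value)
    raise TypeError(f"unsupported concurrency value: {value!r}")

assert parse_concurrency(4) == {"max": 4}
assert parse_concurrency({"max": 4}) == {"max": 4}
```

Accepting both shapes would let a plain `concurrency: 4` keep working while leaving room for additional keys later.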
# del response
# except ValidationError as e:
#     _log_invalid_output(e)
#     raise HTTPException(status_code=500, detail=str(e)) from e
What are the implications of commenting this out?
outputs are not validated: no error is raised if you return a string instead of an int. this turned out to be a bottleneck at sufficient throughput in profiling, and it also led to a bug mentioned in the comments where Path is converted to str but validation then converts it back to Path, and this round-trip no longer works the way it used to
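A self-contained sketch of the round-trip described above (the helper names are illustrative, not cog's real code): the output `Path` is serialized to `str`, but a re-validation pass coerces it back to `Path`, so the value downstream is no longer what was serialized.

```python
from pathlib import Path

def serialize_output(value):
    # convert Path to str for the HTTP response body
    return str(value) if isinstance(value, Path) else value

def validate_output(value, annotation):
    # naive re-validation that coerces str back to the annotated type,
    # mimicking what a pydantic-style output model does
    if annotation is Path and isinstance(value, str):
        return Path(value)
    return value

out = serialize_output(Path("/tmp/result.png"))
revalidated = validate_output(out, Path)
# the str we carefully produced has been turned back into a Path
```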
python/cog/server/runner.py (Outdated)
As mentioned when we spoke, it seems quite confusing to me to take 90% of the code from `Worker` and put it in `PredictionRunner` rather than making `Worker` do what we need.
unfortunately, I'm not sure how to find the line for all the `enter_predict`/`exit_predict` stuff if it's bound by being `def () -> AsyncIterator[Event]`. I would love more detailed input on this.
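To make the constraint concrete, here is a hedged sketch of the shape in question: with predict exposed as an async iterator of events, busy-state bookkeeping (`enter_predict`/`exit_predict` are used as stand-in hooks here, not cog's exact signatures) has to wrap the whole generator rather than sit at a single call site.

```python
import asyncio
from typing import AsyncIterator

Event = str  # stand-in for cog's event types

state = []

def enter_predict() -> None:
    state.append("busy")  # mark the runner busy

def exit_predict() -> None:
    state.append("idle")  # mark the runner idle again

async def predict_events() -> AsyncIterator[Event]:
    enter_predict()
    try:
        yield "started"
        await asyncio.sleep(0)  # stand-in for real prediction work
        yield "output"
    finally:
        exit_predict()  # runs even if the consumer bails out early

async def main() -> list:
    return [ev async for ev in predict_events()]

events = asyncio.run(main())
```

The `try`/`finally` inside the generator is what makes the enter/exit pair hard to lift out: it has to straddle every `yield`.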
python/cog/server/worker.py (Outdated)
@@ -355,26 +178,34 @@ def _loop_sync(self) -> None:
            break
        if isinstance(ev, PredictionInput):
            self._predict_sync(ev)
        elif isinstance(ev, Cancel):
            pass  # we should have gotten a signal
I don't think I understand this. Are cancel events sent down the pipe or are they communicated via signal? It should be one or the other, not both, and if we're sending them down the pipe but then ignoring them... why?
because the parent doesn't necessarily know whether the predictor is async or not, we send both the event and the signal. sync predictors ignore the event; async predictors ignore the signal.
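A hedged sketch of that dual path (names and signatures are illustrative, not cog's real API): the parent fires both mechanisms, and each predictor type consumes only one of them.

```python
import os
import signal
from dataclasses import dataclass

@dataclass
class Cancel:
    id: str  # which prediction to cancel

def parent_cancel(conn, child_pid: int, prediction_id: str) -> None:
    # the parent can't tell whether the child runs a sync or an async
    # predictor, so it fires both mechanisms
    conn.send(Cancel(prediction_id))    # consumed by async predictors
    os.kill(child_pid, signal.SIGUSR1)  # consumed by sync predictors

def sync_loop_step(ev) -> None:
    # the sync event loop ignores the Cancel event: the signal handler
    # has already interrupted the running prediction
    if isinstance(ev, Cancel):
        pass

sync_loop_step(Cancel("pred-1"))  # no-op by design
```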
def test_return_wrong_type(client):
    resp = client.post("/predictions")
    assert resp.status_code == 500
    # it's not the worst idea to validate outputs but it's slow and not required
I don't think this change fits in this PR.
see previous comment, this is also related to `File(upload(File(x))) != x`
# We call cancel a WHOLE BUNCH to make sure that we don't propagate
# any of those cancelations to subsequent predictions, regardless
# of the internal implementation of exceptions raised inside signal
# handlers.
for _ in range(100):
    w.cancel(input1.id)  # was: w.cancel()
Yikes, does this mean we're relying on `Worker.predict` to modify its arguments in order to communicate the prediction ID back to the caller? That seems very hacky.
no? the prediction ID must be fixed before it makes it to `Worker.predict`. it is autogenerated if not set on `PredictionRequest`, and then you're supposed to use `PredictionInput.from_request(request...)`
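A minimal sketch of that flow (simplified dataclasses, not cog's real definitions): the ID is fixed at `from_request` time, before the worker ever sees the input, so nothing needs to be written back to the caller's arguments.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class PredictionRequest:
    input: dict
    id: Optional[str] = None  # callers may omit this

@dataclass
class PredictionInput:
    payload: dict
    id: str

    @classmethod
    def from_request(cls, request: PredictionRequest) -> "PredictionInput":
        # autogenerate an id when the caller didn't set one, so the id
        # is fixed before the worker ever sees the input
        return cls(payload=request.input, id=request.id or uuid.uuid4().hex)

pi = PredictionInput.from_request(PredictionRequest(input={"prompt": "hi"}))
assert pi.id  # always set by the time Worker.predict runs
```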
Am I missing something? You deleted the `Worker` class, so how come all these tests aren't failing?
it's due to `pytest.skip(allow_module_level=True)`
This PR is (as advertised, to be fair) a bit of a mishmash of changes with different intents and scopes, and as such it's a bit hard to review.
If I understand it correctly, the gist of the effort here is to update the `Worker` API to support the management of multiple predictions, indexed by their IDs. That seems sensible, but I don't understand why we moved 80% of `Worker` to `PredictionRunner` to do that.
I'm also more than a little bit suspicious at the fact that the CI checks on this branch are all green, especially given that by far the most substantive part of the cog test suite (`test_worker.py`) is apparently not running at all.
this should be reviewable at last. I'm mostly interested in whether the changes are comprehensible/legible and what comments I can add. there's a grab-bag of random changes like a /ready route, predictor.log, cancellation fixes, etc and the core change of moving worker into runner. I'm very open to reconsidering this (e.g. call it worker instead and maybe move some of it into http), but not right now. once this is merged, the plan is to gradually cut small changes from the async branch to merge into main, and review those changes more thoroughly.
original description
I originally tried to split up my work in #1499 and #1508 as "refactor runner + add concurrency" and "fix uploads/downloads" but ended up interleaving these changes. this PR will just be the overall changeset for now, and hopefully as this coalesces more it'll be clear how to carve it up into separate changesets.
major points:
- `concurrency` to cog.yaml