omnibus actual concurrency and major refactor #1530

technillogue · 2024-02-12T21:16:38Z

this should be reviewable at last. I'm mostly interested in whether the changes are comprehensible/legible and what comments I can add. there's a grab-bag of unrelated changes (a /ready route, predictor.log, cancellation fixes, etc.) plus the core change of moving worker into runner. I'm very open to reconsidering that move (e.g. call it worker instead and maybe move some of it into http), but not right now. once this is merged, the plan is to gradually cut small changes from the async branch to merge into main, and review those changes more thoroughly.

original description

I originally tried to split up my work in #1499 and #1508 as "refactor runner + add concurrency" and "fix uploads/downloads" but ended up interleaving these changes. this PR will just be the overall changeset for now, and hopefully as it coalesces it'll become clear how to carve it up into separate changesets.

major points:

- add concurrency to cog.yaml
- use httpx for everything except URLFile, pull out all the client code from everywhere else
- completely rework URLPath
- very dirty hack to unblock us on large file uploads
- merge worker into runner (maybe the other way around would be better?)

yorickvP

didn't look at the tests yet

pkg/config/config.go

pyproject.toml

pkg/config/validator.go

python/cog/predictor.py

python/cog/server/clients.py

python/cog/server/http.py

python/cog/server/clients.py


nickstenning · 2024-05-10T12:37:35Z

pkg/config/config.go

+	Image       string `json:"image,omitempty" yaml:"image"`
+	Predict     string `json:"predict,omitempty" yaml:"predict"`
+	Train       string `json:"train,omitempty" yaml:"train"`
+	Concurrency int    `json:"concurrency,omitempty" yaml:"concurrency"`


Part of me wonders if this should be an object with a single key (max) so that it can be extended in future if needed.

that sounds like a pretty reasonable idea, what other keys do you think it might have?
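To make the suggestion above concrete, here is a minimal sketch of how a loader could accept both the plain integer form from this PR and the proposed object form with a single `max` key. It is written in Python for illustration only; `Concurrency`, `parse_concurrency`, the default of 1, and the rejection of unknown keys are all assumptions, not part of cog's actual config code.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Union


@dataclass(frozen=True)
class Concurrency:
    # Hypothetical normalized form; only `max` is discussed in the thread.
    max: int = 1


def parse_concurrency(raw: Union[None, int, Mapping[str, Any]]) -> Concurrency:
    """Normalize `concurrency: 4` and `concurrency: {max: 4}` to one shape."""
    if raw is None:
        return Concurrency()  # assumed default: one prediction at a time
    if isinstance(raw, bool):
        # bool is a subclass of int, so reject it before the int branch
        raise TypeError("concurrency must be an integer or a mapping")
    if isinstance(raw, int):
        value = raw
    elif isinstance(raw, Mapping):
        unknown = set(raw) - {"max"}
        if unknown:
            raise ValueError(f"unknown concurrency keys: {sorted(unknown)}")
        value = raw.get("max", 1)
    else:
        raise TypeError("concurrency must be an integer or a mapping")
    if isinstance(value, bool) or not isinstance(value, int) or value < 1:
        raise ValueError("concurrency.max must be a positive integer")
    return Concurrency(max=value)
```

With an object form, later keys could be added without breaking configs that already use `max`; which keys those would be is the open question in this thread.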


nickstenning · 2024-05-10T12:39:42Z

python/cog/server/http.py

+        #     del response
+        # except ValidationError as e:
+        #     _log_invalid_output(e)
+        #     raise HTTPException(status_code=500, detail=str(e)) from e


What are the implications of commenting this out?

outputs are no longer validated, so no error is raised if you return a string instead of an int. validation turned out to be a bottleneck at sufficient throughput in profiling, and it also led to a bug mentioned in the comments: Path is converted to str, but then validation converts it back to Path, and this round trip no longer works the way it used to
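As an illustration of why a Path to str to Path round trip can be fragile, the sketch below shows one general way it fails to be the identity: pathlib collapses repeated slashes, so a URL string does not survive being parsed as a path. This is an assumption about the kind of problem involved, using a made-up example URL; it is not necessarily the exact failure hit in this PR.

```python
from pathlib import PurePosixPath

# Hypothetical URL that an uploaded output might have been replaced with.
url = "https://example.com/outputs/out.png"

# Parsing the string as a path normalizes "//" to "/", so converting back
# to str yields something that is no longer a valid URL.
round_tripped = str(PurePosixPath(url))

print(round_tripped)  # https:/example.com/outputs/out.png
```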

nickstenning · 2024-05-10T12:41:32Z

python/cog/server/runner.py

As mentioned when we spoke, it seems quite confusing to me to take 90% of the code from Worker and put it in PredictionRunner rather than making Worker do what we need.

unfortunately, I'm not sure how to find the line for all the enter_predict/exit_predict stuff if it's bound by being def () -> AsyncIterator[Event]. I would love more detailed input on this.

nickstenning · 2024-05-10T12:43:56Z

python/cog/server/worker.py

@@ -355,26 +178,34 @@ def _loop_sync(self) -> None:
                break
            if isinstance(ev, PredictionInput):
                self._predict_sync(ev)
+            elif isinstance(ev, Cancel):
+                pass  # we should have gotten a signal


I don't think I understand this. Are cancel events sent down the pipe or are they communicated via signal? It should be one or the other, not both, and if we're sending them down the pipe but then ignoring them... why?

because the parent doesn't necessarily know if the predictor is async or not, we send both the event and the signal. sync predictors ignore the event, async predictors ignore the signal.
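The rule described in the reply above can be sketched as follows. This is a simplified model, not the code from this PR: `SyncChild`, `AsyncChild`, `parent_cancel`, and the `on_event`/`on_signal` methods are hypothetical names, and plain method calls stand in for the real pipe and OS signal.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Cancel:
    id: str


class SyncChild:
    """Sync predictor: one prediction at a time, cancelled by signal."""

    def __init__(self, current_id: str) -> None:
        self.current_id = current_id
        self.cancelled: List[str] = []

    def on_event(self, ev: object) -> None:
        if isinstance(ev, Cancel):
            pass  # ignored: we should have gotten a signal

    def on_signal(self) -> None:
        self.cancelled.append(self.current_id)


class AsyncChild:
    """Async predictor: many predictions, cancelled by ID via the event."""

    def __init__(self) -> None:
        self.tasks: Dict[str, asyncio.Task] = {}

    def on_event(self, ev: object) -> None:
        if isinstance(ev, Cancel):
            task = self.tasks.get(ev.id)
            if task is not None:
                task.cancel()

    def on_signal(self) -> None:
        pass  # ignored: a signal cannot say which prediction to cancel


def parent_cancel(child, prediction_id: str) -> None:
    # The parent does not know which kind of child it has, so it does both.
    child.on_event(Cancel(prediction_id))
    child.on_signal()


async def demo_async() -> List[bool]:
    child = AsyncChild()
    for pid in ("p1", "p2"):
        child.tasks[pid] = asyncio.create_task(asyncio.sleep(60))
    parent_cancel(child, "p1")
    await asyncio.gather(child.tasks["p1"], return_exceptions=True)
    result = [child.tasks["p1"].cancelled(), child.tasks["p2"].cancelled()]
    child.tasks["p2"].cancel()  # clean up the task that was left running
    await asyncio.gather(child.tasks["p2"], return_exceptions=True)
    return result
```

In this model each child acts on exactly one of the two channels, so sending both never cancels a prediction twice.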

nickstenning · 2024-05-10T12:44:38Z

python/tests/server/test_http_output.py

-def test_return_wrong_type(client):
-    resp = client.post("/predictions")
-    assert resp.status_code == 500
+# it's not the worst idea to validate outputs but it's slow and not required


I don't think this change fits in this PR.

see previous comment, this is also related to File(upload(File(x))) != x

nickstenning · 2024-05-10T12:45:46Z

python/tests/server/test_worker.py

            # We call cancel a WHOLE BUNCH to make sure that we don't propagate
            # any of those cancelations to subsequent predictions, regardless
            # of the internal implementation of exceptions raised inside signal
            # handlers.
            for _ in range(100):
-                w.cancel()
+                w.cancel(input1.id)


Yikes, does this mean we're relying on Worker.predict to modify its arguments in order to communicate prediction ID back to the caller? That seems very hacky.

no? the prediction ID must be fixed before it makes it to Worker.predict. it is autogenerated if not set on PredictionRequest, and then you're supposed to call PredictionInput.from_request(request...)
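A minimal sketch of the flow described in that reply, assuming the ID is filled in when the request object is created. The class names `PredictionRequest` and `PredictionInput.from_request` come from the thread; the fields, the use of `uuid4`, and the single-argument `from_request` signature are assumptions for illustration.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class PredictionRequest:
    input: Dict[str, Any]
    id: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed behaviour: generate an ID if the caller did not supply one.
        if self.id is None:
            self.id = uuid.uuid4().hex


@dataclass(frozen=True)
class PredictionInput:
    payload: Dict[str, Any]
    id: str

    @classmethod
    def from_request(cls, request: PredictionRequest) -> "PredictionInput":
        # The ID is already fixed here; nothing downstream mutates it.
        if request.id is None:
            raise ValueError("request has no id")
        return cls(payload=dict(request.input), id=request.id)
```

Under these assumptions the caller already holds the ID before the worker sees the input, so `w.cancel(input1.id)` does not depend on `predict` modifying its arguments.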

nickstenning · 2024-05-10T12:50:49Z

python/tests/server/test_worker.py

Am I missing something? You deleted the Worker class so how come all these tests aren't failing?

it's due to pytest.skip(allow_module_level=True)

nickstenning

This PR is (as advertised, to be fair) a bit of a mishmash of changes with different intents and scopes, and as such it's a bit hard to review.

If I understand it correctly the gist of the effort here is to update the Worker API to support the management of multiple predictions, indexed by their IDs. That seems sensible, but I don't understand why we moved 80% of Worker to PredictionRunner to do that.

I'm also more than a little bit suspicious of the fact that the CI checks on this branch are all green, especially given that by far the most substantive part of the cog test suite (test_worker.py) is apparently not running at all.
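For readers following the design question, here is a rough sketch of what "managing multiple predictions, indexed by their IDs" could look like, independent of whether it lives in Worker or PredictionRunner. The `Runner` class, its method names, and the busy rule are hypothetical and are not taken from the PR.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class Runner:
    def __init__(self, max_concurrency: int) -> None:
        self._max = max_concurrency
        self._tasks: Dict[str, asyncio.Task] = {}

    def is_busy(self) -> bool:
        return len(self._tasks) >= self._max

    def predict(
        self, prediction_id: str, work: Callable[[], Awaitable[str]]
    ) -> asyncio.Task:
        existing = self._tasks.get(prediction_id)
        if existing is not None:
            return existing  # same ID returns the same in-flight task
        if self.is_busy():
            raise RuntimeError("runner is busy")
        task = asyncio.ensure_future(work())
        # Forget the task once it finishes so the slot frees up.
        task.add_done_callback(lambda _: self._tasks.pop(prediction_id, None))
        self._tasks[prediction_id] = task
        return task

    def cancel(self, prediction_id: str) -> bool:
        task = self._tasks.get(prediction_id)
        if task is None:
            return False
        task.cancel()
        return True


async def demo():
    runner = Runner(max_concurrency=2)

    async def work() -> str:
        await asyncio.sleep(0.01)
        return "done"

    first = runner.predict("a", work)
    again = runner.predict("a", work)
    second = runner.predict("b", work)
    busy = runner.is_busy()
    runner.cancel("b")
    results = await asyncio.gather(first, second, return_exceptions=True)
    cancelled = isinstance(results[1], asyncio.CancelledError)
    return first is again, busy, results[0], cancelled
```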


technillogue force-pushed the async branch 2 times, most recently from 55c0468 to cc246ac, February 13, 2024 07:37

technillogue force-pushed the syl/more-refactor branch from 2fd12fe to f68a968, February 13, 2024 07:41

technillogue force-pushed the async branch 3 times, most recently from 85d2814 to f57474d, February 13, 2024 07:45

yorickvP added the async label Feb 13, 2024

yorickvP reviewed Feb 13, 2024


technillogue force-pushed the syl/more-refactor branch 2 times, most recently from 8f7a594 to 3563178, February 19, 2024 18:02

technillogue force-pushed the async branch from 03659b2 to f57474d, February 19, 2024 23:53

technillogue marked this pull request as ready for review February 19, 2024 23:53

technillogue force-pushed the syl/more-refactor branch 2 times, most recently from 9bc6ece to f390777, February 21, 2024 19:21

technillogue force-pushed the async branch 3 times, most recently from 1e8c300 to 335f67b, February 21, 2024 21:16

technillogue force-pushed the syl/more-refactor branch 2 times, most recently from d965185 to 644d1cd, February 29, 2024 21:45

technillogue added 11 commits March 12, 2024 15:44

add concurrency to config

d1158d3

Signed-off-by: technillogue <technillogue@gmail.com>

this basically works!

c75606e

Signed-off-by: technillogue <technillogue@gmail.com>

more descriptive names for predict functions

a73fbce

Signed-off-by: technillogue <technillogue@gmail.com>

maybe pass through prediction id and try to make cancelation do both?

c7a775d

Signed-off-by: technillogue <technillogue@gmail.com>

don't cancel from signal handler if a loop is running. expose worker busy state to runner

c6b03aa

Signed-off-by: technillogue <technillogue@gmail.com>

move handle_event_stream to PredictionEventHandler

d596273

Signed-off-by: technillogue <technillogue@gmail.com>

make setup and canceling work

cd2d115

Signed-off-by: technillogue <technillogue@gmail.com>

drop some checks around cancelation

52ccf7b

Signed-off-by: technillogue <technillogue@gmail.com>

try out eager_predict_state_change

faed8c1

Signed-off-by: technillogue <technillogue@gmail.com>

keep track of multiple runner prediction tasks to make idempotent endpoint return the same result and fix tests somewhat

9f0e8d0

Signed-off-by: technillogue <technillogue@gmail.com>

fix idempotent tests

5ab395f

Signed-off-by: technillogue <technillogue@gmail.com>

technillogue force-pushed the syl/more-refactor branch from b5b29ce to 4cf7566, May 8, 2024 20:47

technillogue force-pushed the async branch from 8ec2a17 to c62cf67, May 8, 2024 21:00

clean up

6b8ab71

Signed-off-by: technillogue <technillogue@gmail.com>

nickstenning reviewed May 10, 2024


nickstenning reviewed May 10, 2024


technillogue closed this May 10, 2024

technillogue reopened this May 10, 2024

technillogue force-pushed the syl/more-refactor branch 7 times, most recently from bd69678 to 7098fde, May 16, 2024 20:09

technillogue added 3 commits May 16, 2024 16:18

codecov

f69667f

Signed-off-by: technillogue <technillogue@gmail.com>

describe the remaining problems with this PR and add comments about cancelation and validation

e780834

Signed-off-by: technillogue <technillogue@gmail.com>

add a test

40bdb54

Signed-off-by: technillogue <technillogue@gmail.com>

technillogue force-pushed the syl/more-refactor branch from 7098fde to 40bdb54, May 16, 2024 20:18

technillogue merged commit 0ebfc54 into async May 16, 2024
10 checks passed

technillogue deleted the syl/more-refactor branch May 16, 2024 21:08

This was referenced May 17, 2024

fix flaky runner test #1669

Merged

[async] Include prediction id upload request #1680

Open

technillogue mentioned this pull request Jun 4, 2024

fix upload redirect handling #1714

Merged
