Session Cleanup+Simplifications #7315
base: main
Conversation
Ended up spinning out the support for serving nested execs from the executor here (it ended up just becoming removal of the shim entirely). Coming back here now to finish up the rest of this refactor on top of that.
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
The --experimental-privileged-nesting flag was broken when used with terminal due to a panic around registering clients. This was fixed by commits before this one, which completely removed the need to register clients, but backfilling the coverage now.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
// TODO: is it safe to update the json name or do we need cloud coordination?
SessionID string `json:"server_id,omitempty"`
This is a question for @vito @aluzzardi; not sure if we can just change the name of this json field and cloud will be fine, or if it's more involved.
It doesn't actually matter that much, mostly aesthetics, but I guess discrepancies like this will add up to confusion in the long term.
Looking at the server-side analytics code, I think you're good to rename this. There's a struct with this field defined, but nothing references it after unmarshaling. I can put up a quick PR to change it.
Hit a test failure flake I haven't seen previously:
Will presume it's this PR's fault until proven otherwise; may be missing some synchronization of services somewhere? Or could the service be exiting prematurely for unrelated reasons?
@vito FYI I think I may have hit the theoretical flake you described in the OTEL PR:
(tangential feature request - easier to copy logs from the cloud traces output 😄)

Does that look like it may be the flake you were imagining? This PR obviously changes all kinds of timing of almost everything, so not sure if it's just that or a legit issue.
Yep, that's the one. Sorry about that!

I have an idea of how to fix it but it's a bit tricky. The problem is that trace and log data arrives independently, and we can't know whether a span has logs until we see logs for it for the first time, after which point we wait until EOF. But the test can still flake if we don't see the start of the logs before calling

In terms of an actual fix, I'm thinking we would need to set an attribute on the span to indicate that logs or at least an EOF should be consumed for it, and update the draining logic accordingly. But the problem is we don't control the span creation. We can man-in-the-middle it, and look for spans starting with
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
These values are prohibited in HTTP/2+, but since we are proxying requests in dagger session/listen that may be coming from HTTP/1 clients, we need to be sure to clear them out before forwarding to the engine server, which is now HTTP/2. Before this, I was seeing failures in just the typescript SDK tests that used node 18, which apparently sets `Connection: close` and caused all sorts of strange behavior in the go http client.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
I realized that since FunctionCall includes arbitrary arg literal values and arbitrary parent object field literal vals, it can theoretically get almost arbitrarily large. This is a problem with HTTP headers since they have a max size (Go default is 1 MB, though it can be raised). Fortunately, there's no actual need to include this in HTTP headers; it was mildly convenient, but since function call clients are served direct from the executor in the engine process, we can just directly provide the metadata to the session server *alongside* the http requests rather than stuff it into the http requests themselves. Another possibility would be to move it to the body one way or another, but this approach was simpler overall than that. Included an integ test for coverage of this (fails before this commit, passes now).

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Signed-off-by: Alex Suraci <alex@dagger.io>
I was still having to wait 5 minutes for OTEL logs to drain on some test runs that involved service tunneling. I believe there was a race condition where the tunnel code could end up writing logs after they were closed, which seems to create a leak and cause us to have to wait 5 minutes before a timeout is hit. I also saw a few other places where otel logs were created and not closed, so added a close there too.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Can repro this one locally, think I know what's happening:
cc @vito to double-check this adds up to you too.

If that's correct, I think this is technically independent of this PR and more likely just getting triggered now due to timing differences introduced here. But I need to double-check and will attempt a fix in the interest of not increasing the probability of flakes here.
On a positive note, with the extra telemetry draining fix commits appended to this PR, I'm seeing full engine tests take as little as 9 minutes, which is the fastest I've seen in a very long time 🎉
Found and fixed that flake here #7610
Another part of the effort to support #6916 while also doing general cruft cleanup and setup for various upcoming efforts.
This changeset focuses on making sessions + associated state management simpler:
- All of the state associated with a session is now held in a `daggerSession` object, and all of the state associated with a given client in a session is in a `daggerClient` object
- Removed the `Session` method from buildkit's upstream controller interface (see `engine/server/conn.go`) and no longer need to be paranoid about gRPC max message limits in as many places

Details

Objects + state + naming

- `Server` - formerly known as `BuildkitController`
  - The name replaces the old `Server`, which was really more like the session state (and was thus confusing)
  - Client identity arrives as `clientMetadata` (which is now sent in an http header). Code.
  - Keeping the `BuildkitController` API since we do have some reliance on `ListWorker` at least, though we are free to change any+all of those to core API (i.e. gql) calls at any point (just got a bit too out of scope here)
- `daggerSession` and `daggerClient`
  - Each moves through explicit states: `uninitialized`, `initialized`, `deleted` (not complicated enough to go full-on state machine, but this still makes it all more obvious and easy to follow I think, especially when it comes to locking the state for mutations)
  - Moved `core.Query` and `buildkit.Client` into these structs too so there's less places to look+think about
- `ClientCallContext` - instead of trying to register all of that, we're "stateless" in that the module+function-call metadata for a client is just plumbed through `ExecutionMetadata` + `ClientMetadata`, following the request path rather than both that and a pre-registration side-channel
  - For `ClientMetadata`, it was plumbed in the requests made by the nested client

HTTP/2 Usage

- Clients now connect to the engine server over HTTP/2 (replacing the `BuildkitController.Session` API and its associated complications, etc.)
- Works with `docker-container`, `kube-pod`, etc. (spawns a subprocess)
- Session Attachables are served via the `/sessionAttachables` http endpoint, which the server hijacks and uses as a raw conn for establishing the gRPC streams
  - buildkit has a `SessionManager.HandleHTTP` method, which is somewhat similar, but it didn't handle the switch from http->grpc synchronously and was resulting in data getting mixed sometimes