[Do not merge yet] Fix/fetch negotiation improv (#6532) #6540
base: main
Conversation
A hide callback being set on a revwalk, by itself, doesn’t need to be a condition for pre-walking. The callback can be applied as the revwalk walks. This allows the revwalk to be used without incurring the performance penalty of pre-walking.
Negotiation code updated to pretty much match core git’s behaviour (v0/1).
Thanks for the PR and the copious details. I'll review 🔜. Coincidentally, I'm doing some refactoring of the networking code, in particular moving us from blocking sockets under the hood to nonblocking sockets so that we can time out reads. But independently, we should also start sending keepalives when we connect and do work. #5201 started this, but it's polling the underlying socket, not the TLS stack. (The encrypted data could have all been read off the socket but be sitting in the TLS stack's buffer, since we haven't consumed it yet. So we really need to poll the actual TLS stack, not the underlying socket.)
Note there are two different keep-alive concepts here. The one addressed by #5201 is git's protocol-level keepalive; the one I'm talking about is HTTP keep-alive (persistent connections). The problem is we are kinda lying: we do reuse the socket if it's still up, but we are unable to re-open a socket if it closes between requests, which is likely to happen since default web server timeouts are very short.

How this applies to this PR: the negotiation is performed in multiple requests. When using the HTTP transport, these are normal, independent HTTP requests. In the current release, negotiation is performed by simply traversing the graph and sending 20 "have" commands each time until we stop. Although it's still possible for the HTTP socket to be closed between requests, it's unlikely, since generating each request is a matter of traversing 20 commits. It will likely only happen if the connection has huge latency and the machine generating the requests has slow storage that is being thrashed by something else.

The new version in the patch I submitted handles negotiation like git, with a window of "have" commands that increases in size each iteration (among many other differences that optimize the size of the downloaded pack). This means each subsequent request will take longer to generate than the one before, unless we hit a stop condition. In practice this is still very quick (for most real-world fetch operations, negotiation is done in a single step, or in 2 or 3), but in some edge cases it may not be.

As an example, the only way I could cause this to happen while working in ideal conditions (server running on a VM on the same Mac as the client) was to host the Linux repository inside the VM, but with all commits deleted except the root one, plus one additional commit, causing Linux's "main" to diverge right after the root. On the client (still with a "real" clone of Linux), I ran git fetch.
In that situation, the negotiation algorithm had to traverse the entire repository (which is now entirely "uncommon" except the root) and, in each step, generate a request with more "have" statements than before. Both core git and libgit2 start to slow down, and eventually both take more than 5 seconds to generate the next request; in both cases, the Apache running on my VM closed the connection. However, core git, per the HTTP spec, creates a new connection and keeps going, while we stop and return an "unexpected EOF" error.

So, we should have a facility somewhere in the transport/HTTP stack to abstract all this. Ideally, this would be something that tries to send an HTTP request and, if it either fails doing that or fails on the subsequent read, reopens the connection and replays the request. The part I still don't fully understand in the code is the entire sequence of writing requests and reading replies through the whole transport stack, although it shouldn't be too hard. I may try to get something working, although it's probably going to conflict with your changes.

Sorry for the long reply, hope it makes sense. 🙂
Yep, your analysis is spot on. We should check if we have sent more than one request on the socket, and if our first
# Conflicts:
#	src/libgit2/transports/smart_protocol.c
Description
This PR adds fetch negotiation similar to core git. It's divided into 3 commits:
4edf6c1 (in PR #6539, Partial fix for #6532: insert-by-date order, which this PR includes): fixes revwalker bugs related to insertion sorted by date. This is low-risk and can be merged safely, hence the separate PR.
068f3ae Changes revwalker so that a hide callback is not by itself a reason for pre-walking. Instead, the callback is called as walking is performed, saving the caller from incurring the performance hit of traversing the entire graph. This is especially useful for algorithms that very likely will stop traversing very early, like the one introduced in this PR.
525ee5c Implements a negotiation algorithm similar to core git. This includes specific behaviour depending on the ACK type ("ready", "common", etc), sending "have" statements only for commits not known to be common, multiple steps with increasing number of statements, stop conditions, etc.
I've seen situations where libgit2 was downloading a packfile more than 10× the size of the one core git downloaded. In every situation I could test, with these patches, libgit2 now downloads exactly the same number of objects core git does.
This is not v2 protocol yet, but most of this can be extracted to shared code (like core git does with negotiators) and reused.
There is a problem though, and this is why I didn't request merging yet: although traversing is usually very fast, since the hide callback avoids traversing commits we already know won't be used for "have" statements (unlike core git, which traverses them), there are edge cases where traversing can take so long that it causes a timeout. This happens with HTTP connections, since libgit2 does not implement reconnection if a keep-alive connection drops between requests.
Although this could also happen before, the probability increases with this patch. Note this is not a problem with this patch per se, but a deviation from the HTTP spec in libgit2 that must be fixed anyway. As an example, Apache ships with a default 5-second timeout for keep-alive connections, which is really short: even if traversing is mostly instantaneous, users on high-latency connections can hit it.
I tried to look into this but still couldn't fully grasp all the layers (transport, subtransport, httpclient, etc.). Any help on what the best layer to do this is would be welcome. This is kinda hard to do, since a "write" can actually succeed on a connection that was already dropped if the FIN packet is still in transit, in which case the subsequent "read" fails; or it can fail on the "write" itself. Point being, there has to be a way to replay the initial request. I believe there is already support for replaying inside httpclient, due to proxies etc., so maybe that's a good place to look. I'm not too worried about SSH, since timeouts are generally much higher there and I believe libssh2 manages the session's pings and keep-alives. Thoughts? Could really use some guidance here. 🙏