[Do not merge yet] Fix/fetch negotiation improv (#6532) #6540

Open · wants to merge 4 commits into main
Conversation

@arroz (Contributor) commented Mar 27, 2023

Description

This PR adds fetch negotiation similar to core git. It's divided into 3 commits:

  1. 4edf6c1 (in PR "Partial fix for #6532: insert-by-date order" #6539, which this one includes): fixes revwalker bugs related to insertion sorted by date. This is low risk and can be merged safely, hence the separate PR.

  2. 068f3ae Changes revwalker so that a hide callback is not by itself a reason for pre-walking. Instead, the callback is called as walking is performed, saving the caller from incurring the performance hit of traversing the entire graph. This is especially useful for algorithms that very likely will stop traversing very early, like the one introduced in this PR.

  3. 525ee5c Implements a negotiation algorithm similar to core git. This includes specific behaviour depending on the ACK type ("ready", "common", etc), sending "have" statements only for commits not known to be common, multiple steps with increasing number of statements, stop conditions, etc.

I've seen situations where libgit2 was downloading a packfile >10x the size core git did. For all the situations I could test, with these patches, libgit2 now downloads the exact same number of objects core git does.

This is not v2 protocol yet, but most of this can be extracted to shared code (like core git does with negotiators) and reused.

⚠️ Problem

There is a problem though, and this is why I didn't request merging yet: although traversal is usually very fast, since the hide callback avoids traversing commits we already know won't be used for "have" statements (unlike core git, which traverses them), there are edge cases where traversal can take so long that it causes a timeout. This happens with HTTP connections, since libgit2 does not implement reconnection if the connection drops while keep-alive is in effect.

Although this could also happen before, the probability increases with this patch. Note this is not a problem with this patch per se, but a deviation from the HTTP spec in libgit2 that must be fixed anyway (as an example, Apache ships with a default 5-second timeout for keep-alive connections, which is short enough that users on high-latency connections could hit it even when traversal is nearly instantaneous).

I tried to look into this but still couldn't fully grasp all the layers (transport, subtransport, httpclient, etc.). Any guidance on the best layer to do this is welcome. This is somewhat hard to do, since a "write" can actually succeed on a connection that was already dropped if the FIN packet is still in transit, in which case the subsequent "read" fails. Or it can fail on the "write" itself. Point being, there has to be a way to replay the initial request. I believe there is already support for replaying inside httpclient, due to proxies, etc., so maybe that's a good place to look. I'm not too worried about SSH, since timeouts there are generally much higher and I believe libssh2 manages the session's pings and keep-alives. Thoughts? Could really use some guidance here. 🙏

A hide callback being set on a revwalk, by itself, doesn’t need to be a condition for pre-walking. The callback can be applied as the revwalk walks. This allows the revwalk to be used without incurring the performance penalty of pre-walking.
Negotiation code updated to pretty much match core git’s behaviour (v0/1).
@arroz arroz changed the title Fix/fetch negotiation improv (#6532) [Do not merge yet] Fix/fetch negotiation improv (#6532) Mar 27, 2023
@ethomson (Member) commented
Thanks for the PR and the copious details. I'll review 🔜 .

Coincidentally, I'm doing some refactoring of the networking code. In particular, moving us from using blocking sockets under the hood to nonblocking sockets so that we can time out reads. But independently, we should also start sending keepalives when we connect and do work.

#5201 started this, but it's polling on the underlying socket, not the TLS stack.

(The encrypted data could have all been read but still be sitting in the TLS stack's buffer, since we haven't read it yet. So we really need to poll the actual TLS stack, not the underlying socket.)

@arroz (Contributor, Author) commented Mar 29, 2023

Note there are two different keep-alives concepts here.

The one addressed by #5201 is git's keepalive functionality, which prevents a socket from being closed by something (an impatient server, a NAT box, a stateful firewall, etc.) if, mid-request, there is no traffic for a while because the server (on fetch) or the client (on push) is taking too long to generate the pack file after negotiation (especially if the "quiet" flag is on).

The one I'm talking about is the HTTP keep-alive feature, which works at a different level. Since we declare HTTP/1.1 on our requests and don't explicitly announce via a header that we do not support keep-alive, per the HTTP/1.1 specification we are implicitly announcing that we do support it. This keep-alive, unlike git's, is meant to keep the socket open (and reuse it) between different requests. And per the HTTP specification, the socket can be closed at any time between requests, and the client is supposed to open a new one for the next request; this is normal and expected behaviour.

The problem is we are kinda lying: we do reuse the socket if it's still up, but we are unable to re-open a socket if it closes between requests, which is likely to happen since default web server timeouts are very short. Specifically, in http_client_connect, if client->connected && client->keepalive is true and the states are acceptable, we assume a socket is open, which may not be the case. And, as I mentioned in the previous comment, it's not possible to determine with 100% reliability whether the socket is already closed when we write, because the FIN packet sent by the server may still be in transit, so both write and read need to test for this condition (currently I believe we do neither; if either write or read fails, we give up immediately). So, technically, we don't really support HTTP keep-alive, and for now we should probably explicitly announce that we don't, using the appropriate header.

How this applies to this PR: the negotiation is performed in multiple requests. When using the HTTP transport, these are normal, independent HTTP requests. In the current release, negotiation is performed by simply traversing the graph and sending 20 "have" commands each time until we stop. Although it's still possible for the HTTP socket to be closed between requests, it's unlikely, since generating each request is a matter of traversing 20 commits. It will likely only happen if the connection has huge latency and the machine generating the requests has slow storage that is being thrashed by something else.

The new version in the patch I submitted handles negotiation like git, with a window of "have" commands that increases in size each iteration (among many other differences that optimize the size of the downloaded pack). This means each subsequent request takes longer to generate than the one before, unless we hit a stop condition. In practice this is still very quick (for most real-world fetch operations, negotiation is done in a single step, or 2 or 3), but in some edge cases it may not be.

As an example, the only way I could cause this to happen while working under ideal conditions (server running in a VM on the same Mac as the client) was to host the Linux repository inside the VM, delete all commits except the root one, and create an additional commit, causing Linux's "main" to diverge right after the root. On the client (still with a "real" clone of Linux), I ran git fetch. In that situation, the negotiation algorithm had to traverse the entire repository (which is now entirely "uncommon" except for the root) and, in each step, generate a request with more "have" statements than before. Both core git and libgit2 start to slow down, and eventually both take more than 5 seconds to generate the next request; in both cases, Apache running in my VM closed the connection. However, core git, per the HTTP spec, creates a new connection and keeps going, while we stop and return an "Unexpected EOF" error.

So, we should have a facility somewhere in the transport/HTTP stack to abstract all this stuff. Ideally, this should be something that tries to send an HTTP request, and if it either fails doing that, or fails on the subsequent "read", it reopens the connection and replays the request. The part I still didn't fully understand in the code is the entire sequence of writing requests and reading replies through the whole transport stack, although it shouldn't be too hard. I may try to get something working, although it's probably going to conflict with your changes.

Sorry for the long reply, hope it makes sense. 🙂

@ethomson (Member) commented
> So, technically, we don't really support HTTP keep-alive, and we probably should explicitly announce we don't for now using the appropriate header.

Yep, your analysis is spot on. We should check whether we have sent more than one request on the socket, and retry if our first read after sending the request fails. This is a little gross, since the whole smart logic tries to abstract some things away. Patch forthcoming.

#Conflicts:
#	src/libgit2/transports/smart_protocol.c