Gracefully handle closed keep-alive connections #6541
base: main
Conversation
cc @arroz
Test, against our testing HTTP server, that we support the case where the server (legally) breaks the connection after the first successful request/response on a keep-alive connection.
The server may break a keep-alive connection after a successful request, in which case we should try to re-send it. Indicate this state in the httpclient with `GIT_RETRY` and re-connect and re-send the request.
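To illustrate the mechanism the commit message describes, here is a minimal sketch, not the actual libgit2 code: the struct and function names are hypothetical, and `GIT_RETRY`'s value is a placeholder for whatever this PR defines. The idea is that an EOF on the very first read of a response on a reused (kept-alive) connection maps to `GIT_RETRY`, while an EOF anywhere else remains a hard error.

```c
#include <assert.h>
#include <stddef.h>

#define GIT_OK      0
#define GIT_ERROR  (-1)
#define GIT_RETRY  (-99)  /* placeholder value; the real code is defined by this PR */

/* Hypothetical per-connection state for illustration. */
typedef struct {
	int reused;           /* connection was kept alive from a prior request */
	size_t response_read; /* bytes of the current response read so far */
} http_conn_state;

/* Classify the result of one socket read of the response. */
static int classify_read_result(http_conn_state *conn, long n)
{
	if (n > 0) {
		conn->response_read += (size_t)n;
		return GIT_OK;
	}

	/* The server legally closed the keep-alive socket before answering:
	 * the caller should reconnect and re-send the request. */
	if (n == 0 && conn->reused && conn->response_read == 0)
		return GIT_RETRY;

	/* EOF mid-response, or a read error: a real failure. */
	return GIT_ERROR;
}
```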
Force-pushed from 8ab39e3 to 5b61b77 (Compare)
	(error = handle_response(&complete, stream, &response, true)) < 0)
		goto done;
	(error = handle_response(&complete, stream, &response, true)) < 0) {
		if (error != GIT_RETRY)
This won't work. We'll hit the steps limit. We need another flag for "retry" or something.
		goto done;
	(error = handle_response(&complete, stream, &response, true)) < 0) {
		if (error != GIT_RETRY)
			goto done;
What's `replay_count`, and do we need to increase it by one for this case?
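As a hedged sketch of the question above: one way a keep-alive retry could be charged against the same replay counter. `GIT_RETRY`, the cap value, and all names here are assumptions for illustration, not libgit2's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

#define GIT_RETRY  (-99)  /* hypothetical "re-send the request" code */
#define REPLAY_MAX   7    /* hypothetical cap on replays */

/* A "run one request" callback; C's stand-in for a closure. */
typedef int (*request_fn)(void *payload);

/* Run one request; if it reports GIT_RETRY (the kept-alive socket was
 * closed), replay it, charging each replay against the cap so a
 * misbehaving server can't make us loop forever. */
static int send_with_replays(request_fn request, void *payload)
{
	int error, replay_count = 0;

	while ((error = request(payload)) == GIT_RETRY) {
		if (++replay_count >= REPLAY_MAX)
			return -1; /* too many replays; give up */
	}

	return error;
}

/* Test double: fail with GIT_RETRY a fixed number of times, then succeed. */
static int flaky_request(void *payload)
{
	int *failures_left = payload;
	return (*failures_left)-- > 0 ? GIT_RETRY : 0;
}
```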
@ethomson When I was reading your patch, I realized the two functions that do the replay (…); I think we can contain the retrying mechanism to these two functions. This would probably be another loop around the sequence of generating the request, sending it, and reading the response. This is enough since the keep-alive problem only happens between requests, so we don't have to worry about it on the subsequent reads, when …

The hard part is replacing the socket/TLS transport with a fresh one from there, but I think it's doable (maybe there's already an API for that, didn't check yet). This would be nice, since we wouldn't need to leak this detail to the upper layers that deal with a generic transport and not HTTP specifically. It would also avoid the need to deal with retrying all over the place (at least in two different places of the fetch negotiator, and one in the push negotiator).

I can give this a try tomorrow and over the weekend; I'll send you the patch assuming I don't hit any unexpected wall.
Hum, looks like …
Yes, this is disappointing. But I think that a refactoring here would be a deep investment. 😢
I was trying to store the initial buffer/len for a write request in the …

I guess we really need to handle the retrying loop at the upper layers… I'll try to write a function that takes a closure with the logic that sends a request and processes the response, and see if we can at least wrap the retrying loop nicely.

Was there any debate in the past about replacing all the HTTP code with an existing library (libcurl, being highly portable, would be a good candidate)? Seems like this should be a solved problem that we could just leverage.
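The "function that takes a closure" idea above can be sketched in C as a function pointer plus a payload pointer. Everything here (names, codes, the reconnect step) is a hypothetical illustration, not the PR's actual API: on `GIT_RETRY`, swap in a fresh connection and run the closure exactly once more, so the retry loop lives in one place instead of being sprinkled through the fetch and push negotiators.

```c
#include <assert.h>
#include <stddef.h>

#define GIT_OK     0
#define GIT_RETRY (-99)  /* hypothetical "re-send the request" code */

typedef struct connection {
	int generation; /* bumped every time we (re)connect */
	int alive;      /* is the underlying socket still usable? */
} connection;

/* The "closure": sends one request on `conn` and processes its response. */
typedef int (*exchange_fn)(connection *conn, void *payload);

static void reconnect(connection *conn)
{
	conn->generation++;
	conn->alive = 1;
}

/* Wrap the retry loop once, at one layer: on GIT_RETRY, replace the stale
 * transport with a fresh one and replay the exchange exactly once. */
static int with_retry(connection *conn, exchange_fn exchange, void *payload)
{
	int error = exchange(conn, payload);

	if (error == GIT_RETRY) {
		reconnect(conn);
		error = exchange(conn, payload);
	}

	return error;
}

/* Example closure: fails on a dead kept-alive socket, succeeds after a
 * reconnect. */
static int fetch_step(connection *conn, void *payload)
{
	(void)payload;
	return conn->alive ? GIT_OK : GIT_RETRY;
}
```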
We used to use libcurl, but it adds another dependency that we have to manage, which is particularly annoying on Windows. Plus it gives us less control over crypto settings.

We wanted to move to a single HTTP provider; having too many different variants (it used to be three) is awful for many reasons. After #6533 the end is in sight for bringing this number to one. Making that one libcurl would probably have been possible, but vendoring it vs. linking against the existing one, and dealing with its crypto stack settings, is very annoying. (Nor, I think, did it solve this problem for us, but I may be misremembering.)
I think inverting the structure of the "smart" transports, so that we have an HTTP transport and an SSH transport that get the high-level commands (like upload-pack) and then use a shared smart-handling library, would resolve this reasonably well. I've never really liked the structure. It seems like in the status quo, adding a new "smart subtransport" would be easy, since you're just given the git protocol stream directly. Unfortunately that's not really true, since you're trying to juggle connection management with an imperfect view of what you're actually doing.

I wonder if it's actually not all that hard to do this refactoring. 🤔
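A rough sketch of the inversion described above, with entirely hypothetical names (none of these exist in libgit2 today): transports expose high-level commands, and a shared smart-protocol layer drives whichever transport it is given, with no knowledge of HTTP vs. SSH and no connection management of its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical command-level transport interface. Each transport owns its
 * own connections, and therefore its own keep-alive retries. */
typedef struct command_transport {
	int (*upload_pack)(void *self, const char *url);
	int (*receive_pack)(void *self, const char *url);
	void *self;
} command_transport;

/* Shared smart handling: issues high-level commands, nothing lower. */
static int smart_fetch(command_transport *t, const char *url)
{
	return t->upload_pack(t->self, url);
}

/* Toy transport for illustration: returns a canned result. */
static int fake_upload_pack(void *self, const char *url)
{
	(void)url;
	return *(int *)self;
}

static int fake_receive_pack(void *self, const char *url)
{
	(void)self; (void)url;
	return 0;
}
```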
So, a draft on top of your PR: #6542

I still didn't address push, but for fetch, this kinda works! This is incredibly hard and annoying to test, since it never really fails the way I expect it to when pausing the debugger in multiple places. But, adding …

Some thoughts so far: …
Anyway, I'll try to address …
Update: push is now working. However, it required bubbling the error up explicitly from the transport. Some transports are already more detailed than others (…

I'll do the same for the read functions tomorrow, to improve the error detection. I won't be able to do this for the Windows transport stuff, though; I have no way to test or even compile it. I'll try to do it for …
Ok, this should be ready to merge into yours. I standardized on …

I updated the secure transport, socket, and OpenSSL transports. Didn't touch the others; they need to return …

This seems to be working for both fetch and push. The connection is re-opened if it takes too long to send the next request.
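The "bubbling the error up explicitly from the transport" approach might look something like the following hedged sketch: each low-level transport maps its raw read result to a distinct "connection closed" code so the HTTP layer can decide whether the close was a legal keep-alive shutdown. The error names here are hypothetical placeholders, not what this branch actually standardized on.

```c
#include <assert.h>
#include <errno.h>

#define XPORT_ERROR  (-1)
#define XPORT_CLOSED (-100) /* hypothetical "peer closed the connection" code */

/* Map a raw recv()-style result (byte count, 0 on EOF, -1 with errno on
 * error) to a transport-level code the HTTP layer can act on. */
static int map_socket_read(long n, int saved_errno)
{
	if (n > 0)
		return (int)n;       /* bytes read */
	if (n == 0)
		return XPORT_CLOSED; /* orderly EOF from the peer */
	if (saved_errno == ECONNRESET || saved_errno == EPIPE)
		return XPORT_CLOSED; /* peer dropped the connection */
	return XPORT_ERROR;      /* anything else is a hard failure */
}
```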
We naively assumed that we could re-use an existing HTTP connection if we were using keep-alive, which is often true but definitely not always true: the server is permitted to close the connection. Handle this gracefully. httpclient will return `GIT_RETRY` if the first read of a response on a socket that has been kept alive returns `0` (EOF). Retry exactly once in this case.

Note that we're not checking for failures to write the request. We should probably do that in this PR, but usually it's the `read` that will actually fail on a closed socket. But I suspect there are plenty of failure modes in the `write` case, too.