Arroz/keep alive #6542
base: ethomson/keepalive
Conversation
src/libgit2/transports/httpclient.c
```diff
@@ -1211,7 +1211,7 @@ GIT_INLINE(int) client_read_and_parse(git_http_client *client)
 	/* recv returned 0, the server hung up on us */
 	else if (!parsed_len) {
 		git_error_set(GIT_ERROR_HTTP, "unexpected EOF");
-		return -1;
+		return GIT_EEOF;
```
I'm not sure that I understand yet. If I'm reading correctly, this takes any socket failure and retries. For example, this will return `GIT_EEOF` even though we have already read some data from the server (or we would have returned from line 1151). This is not handling the case where the server disconnects a kept-alive socket; this socket was disconnected mid-stream. I'm not sure that we should retry that at all. The same goes for changing the behavior of `socket_read` and `socket_write` to return `GIT_EEOF`. The retry should be context-aware: it really needs to be a failure that occurs while we're reading the first byte of a response that is not the response to the first request on a socket, or a failure while we're writing the first byte of a request that is not the first request on a socket.

In other words, we should be detecting only the case when the server hangs up on us while we were keeping a connection alive.
And actually, this should really only be the failsafe. We don't want to be in a position where we retry. Resending POST data is expensive and we should avoid it wherever possible. If we're getting into a position where we're getting hung up on, we should change our behavior so that we send keep-alive packets every second or so while we're doing work to avoid getting timed out at all.
But maybe I am not understanding the problem that you're trying to solve. Are you seeing other sorts of failures? 🤔
To clarify: I'm only trying to solve the problem of an HTTP connection being dropped between two different requests. I'm not dealing with connections dropping during the request-response loop (that can be solved with git's keepalive feature, as you pointed out). Using your words, this: "we should be detecting only the case when the server hangs up on us while we were keeping a connection alive".
You're right regarding the code above, I pushed a fix reverting it to -1.
Regarding how to detect, and more importantly tell apart, a connection dropped during the keep-alive period versus the server hanging up during an active request-response loop: it's not as easy as it seems. Two examples (assume plain HTTP (not HTTPS), using the socket transport, with the kept-alive connection closed before the relevant negotiation step begins):
- `wait_while_ack` (called by the fetch negotiator) eventually calls `socket_read`. Since it's the first read after writing the POST request, `socket_read` usually returns 0. But not necessarily: if I pause the debugger right before the `p_recv` for a few seconds and then resume, it actually returns a value < 0 with `errno` set to 32 (`EPIPE`). I believe this is caused by a timeout at the OS level: the file descriptor is kept open for a few seconds so the program has a chance to call read and get 0 bytes back, but eventually the OS closes it and the program gets error 32. Is this timeout the same on all operating systems? Who knows. For our purposes this works, since it's all converted into `GIT_EEOF`, which eventually bubbles up and is retried if appropriate; but the point is that assuming `p_recv` will always return 0 in this situation may not hold.
- `push` is even weirder. Assume `git_smart__push` takes some time before entering the `RUN_WITH_KEEP_ALIVE` loop in my patch, or before calling `git_smart__get_push_stream` in the original code. I added `printf("Going to write initial payload\n");` right before that call and, inside `stream_thunk`, another line `printf("Going to write subsequent payload\n");` before the `write` call, with some print lines added to `socket_write` as well. This is the result:
```
Sleeping for keep-alive drop
Resuming
Going to write initial payload
Socket p_send returned 286
Socket p_send returned 4
Socket p_send returned 136
Socket p_send returned 2
Going to write subsequent payload
Socket p_send returned <0, errno: 32
```
It only fails on the second high-level write (or the 5th actual write to the socket, making it even more arbitrary!), and it fails by returning a negative value with `errno` 32 again. I suspect this depends on many factors, like the size of the TCP window, how long it takes for (or whether) FIN and RST packets arrive, the TCP stack implementation (Linux vs. BSDs vs. Windows vs. AmigaOS and everything else libgit2 runs on), etc.
In this situation I think it's not really possible to tell whether the server dropped the connection mid-request or whether it was dropped earlier, during the keep-alive period. EDIT: In fact, technically, at the socket level these are exactly the same situation: a peer dropping the connection for whatever reason. Hence why it's not really possible to tell the two apart.
Since negotiation can take a while (I still haven't looked into the push negotiator code that decides what to upload, but it does take a few seconds, especially in large repos), if we are too conservative about this it may be literally impossible to push from a large repo over HTTP to a server with a short timeout (again, Apache by default uses 5 seconds). This is why I prefer to retry when in doubt. I don't see this as a problem: since the intent is to push, and we retry at most once, it's better to waste some bandwidth and make sure the upload happens than to fail.
We never pass here on the first read, due to the test added above, so it’s ok.