Arroz/keep alive #6542
Status: Open
arroz wants to merge 9 commits into base: ethomson/keepalive

Conversation

@arroz (Contributor) commented Apr 2, 2023:

No description provided.

@arroz marked this pull request as ready for review April 3, 2023 22:06
@@ -1211,7 +1211,7 @@ GIT_INLINE(int) client_read_and_parse(git_http_client *client)
 	/* recv returned 0, the server hung up on us */
 	else if (!parsed_len) {
 		git_error_set(GIT_ERROR_HTTP, "unexpected EOF");
-		return -1;
+		return GIT_EEOF;
Member commented:

I'm not sure that I understand yet. If I'm reading correctly, this takes any socket failure and retries. For example, this will return GIT_EEOF but we have already read some data from the server (or we would have returned from line 1151).

This is not handling the case where the server disconnects a kept-alive socket; this socket was disconnected mid-stream.

I'm not sure that we should retry that at all. Same with changing the behavior of socket_read and socket_write to return GIT_EEOF. The retry should be context-aware -- it really needs to be a failure when we're reading the first byte of the response when that response is not the response to the first request on a socket, or a failure when we're writing the first byte of the request when that request is not the first request on a socket.

In other words, we should be detecting only the case when the server hangs up on us while we were keeping a connection alive.

And actually, this should really only be the failsafe. We don't want to be in a position where we retry. Resending POST data is expensive and we should avoid it wherever possible. If we're getting into a position where we're getting hung up on, we should change our behavior so that we send keep-alive packets every second or so while we're doing work to avoid getting timed out at all.

But maybe I am not understanding the problem that you're trying to solve. Are you seeing other sorts of failures? 🤔
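
For illustration, a minimal sketch of that "context-aware" condition (the struct and function names, demo_conn_state and demo_hangup_is_retryable, are hypothetical, not libgit2 API): a hang-up is treated as a benign keep-alive drop, and therefore retryable, only when it happens on the very first byte of a request or response on a socket that has already served at least one request.

/* Hypothetical helper: decide whether a hang-up looks like the server
 * reaping an idle kept-alive socket (retryable) or a real failure. */
#include <stdbool.h>
#include <stddef.h>

struct demo_conn_state {
	size_t requests_completed;  /* requests already served on this socket */
	size_t request_bytes;       /* bytes of the current request written */
	size_t response_bytes;      /* bytes of the current response read */
};

static bool demo_hangup_is_retryable(const struct demo_conn_state *s,
                                     bool failed_while_reading)
{
	/* A hang-up on a fresh socket is a real failure, never a retry. */
	if (s->requests_completed == 0)
		return false;

	if (failed_while_reading)
		/* Retry only if not a single response byte has arrived yet. */
		return s->response_bytes == 0;

	/* Retry only if not a single byte of the request has gone out yet. */
	return s->request_bytes == 0;
}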

@arroz (Contributor, Author) replied Apr 4, 2023:

To clarify: I'm only trying to solve the problem of an HTTP connection being dropped between two different requests. I'm not dealing with connections dropping during the request-response loop (that can be solved with git's keepalive feature, as you pointed out). In your words: "we should be detecting only the case when the server hangs up on us while we were keeping a connection alive".

You're right regarding the code above, I pushed a fix reverting it to -1.

Regarding how to detect, and more importantly tell apart, a connection that was dropped during the keep-alive period from the server hanging up during an active request-response loop: it's not as easy as it seems. Two examples (assume plain HTTP (not HTTPS), the socket transport, and that the kept-alive connection was closed before the beginning of the relevant negotiation step):

  1. wait_while_ack (called by the fetch negotiator) eventually calls socket_read. Since it's the first read after writing the POST request, socket_read usually returns 0. But not necessarily: if I pause the debugger right before the p_recv for a few seconds and then resume, it actually returns <0 with errno set to 32 (EPIPE). I believe this is caused by a timeout at the OS level that keeps the file descriptor open for a few seconds so the program has a chance to call read and get 0 bytes back, but eventually the OS closes the file descriptor and the program gets error 32. Is this timeout the same on all operating systems? Who knows. For our purposes this works, since both outcomes are converted into GIT_EEOF, which eventually bubbles up and is retried if appropriate; but the point is that assuming p_recv will always return 0 in this situation may not hold.

  2. push is even weirder. Assume git_smart__push takes some time before entering the RUN_WITH_KEEP_ALIVE loop in my patch (or before git_smart__get_push_stream in the original code). I added printf("Going to write initial payload\n"); right before that call, and, inside stream_thunk, another printf("Going to write subsequent payload\n"); before the write call, plus some print lines in socket_write. This is the result:

Sleeping for keep-alive drop
Resuming
Going to write initial payload
Socket p_send returned 286
Socket p_send returned 4
Socket p_send returned 136
Socket p_send returned 2
Going to write subsequent payload
Socket p_send returned <0, errno: 32

It only fails on the second high-level write (or the 5th actual write to the socket, which makes it even more arbitrary!), again by returning a negative value with errno 32. And I suspect this may depend on many factors: the size of the TCP window, how long it takes (or whether) FIN and RST packets arrive, the TCP stack implementation (Linux vs. BSDs vs. Windows vs. AmigaOS and everything else libgit2 runs on), and so on.

In this situation I don't think it's really possible to tell whether the server dropped the connection on us mid-request or whether the connection had already been dropped during the keep-alive period. EDIT: in fact, at the socket level these are exactly the same situation, a peer dropping the connection for whatever reason, which is why the two cases can't be told apart.
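
For illustration, a minimal read-side sketch of that situation in plain POSIX C (demo_socket_read and DEMO_EOF are made up for this example; they are not libgit2's actual socket_read or error codes): both the 0-byte read and the later EPIPE/ECONNRESET end up as a single "peer hung up" result, and the caller decides whether a retry is appropriate.

#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#define DEMO_OK      0
#define DEMO_ERROR (-1)
#define DEMO_EOF   (-2)  /* stands in for GIT_EEOF */

static int demo_socket_read(int fd, void *buf, size_t len, ssize_t *bytes_read)
{
	ssize_t n = recv(fd, buf, len, 0);

	if (n > 0) {
		*bytes_read = n;
		return DEMO_OK;
	}

	*bytes_read = 0;

	/* An orderly shutdown (0 bytes) and "the OS already tore the socket
	 * down" (EPIPE/ECONNRESET) are the same event from our point of view. */
	if (n == 0 || errno == EPIPE || errno == ECONNRESET)
		return DEMO_EOF;

	return DEMO_ERROR;
}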

Since negotiation can take a while (I still haven't looked into the push negotiator code that decides what to upload, but it does take a few seconds, especially in large repos), being too conservative here could make it literally impossible to push a large repo over HTTP to a server with a short timeout (again, Apache uses 5 seconds by default). This is why, when in doubt, I prefer to retry. I don't see this as a problem: if the intent is to push, and we only try twice at most, it's better to waste some bandwidth and make sure the upload happens than to fail.
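
And, again purely for illustration, a hedged sketch of that "retry at most once" policy on the write side (the names are made up, this is not the actual patch, and it assumes SIGPIPE is ignored so a dead peer reports EPIPE instead of killing the process): if the first attempt fails because the peer hung up, reconnect and resend the payload exactly once.

#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#define DEMO_OK      0
#define DEMO_ERROR (-1)
#define DEMO_EOF   (-2)  /* stands in for GIT_EEOF */

/* Write the whole payload; map EPIPE/ECONNRESET to DEMO_EOF so the caller
 * can tell a hang-up from other failures. */
static int demo_socket_write(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = send(fd, p, len, 0);

		if (n < 0)
			return (errno == EPIPE || errno == ECONNRESET) ?
			       DEMO_EOF : DEMO_ERROR;

		p += n;
		len -= (size_t)n;
	}

	return DEMO_OK;
}

static int demo_write_with_retry(int *fd, int (*reconnect)(void),
                                 const void *buf, size_t len)
{
	int error = demo_socket_write(*fd, buf, len);

	if (error == DEMO_EOF) {
		/* Assume a reaped keep-alive socket: open a fresh connection
		 * and resend the payload, but never more than once. */
		if ((*fd = reconnect()) < 0)
			return DEMO_ERROR;
		error = demo_socket_write(*fd, buf, len);
	}

	return error;
}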

We never pass through here on the first read, because of the check added above, so it's OK.