Arroz/keep alive #6542
Status: Open
arroz wants to merge 9 commits into base: ethomson/keepalive

Conversation

@arroz (Contributor) commented Apr 2, 2023:

No description provided.

@arroz marked this pull request as ready for review April 3, 2023 22:06
@@ -1211,7 +1211,7 @@ GIT_INLINE(int) client_read_and_parse(git_http_client *client)
 	/* recv returned 0, the server hung up on us */
 	else if (!parsed_len) {
 		git_error_set(GIT_ERROR_HTTP, "unexpected EOF");
-		return -1;
+		return GIT_EEOF;
Member commented:

I'm not sure that I understand yet. If I'm reading correctly, this takes any socket failure and retries. For example, this will return GIT_EEOF but we have already read some data from the server (or we would have returned from line 1151).

This is not handling the case where the server disconnects a kept-alive socket; this socket was disconnected mid-stream.

I'm not sure that we should retry that at all. Same with changing the behavior of socket_read and socket_write to return GIT_EEOF. The retry should be context-aware -- it really needs to be a failure when we're reading the first byte of the response when that response is not the response to the first request on a socket, or a failure when we're writing the first byte of the request when that request is not the first request on a socket.

In other words, we should be detecting only the case when the server hangs up on us while we were keeping a connection alive.

And actually, this should really only be the failsafe. We don't want to be in a position where we retry. Resending POST data is expensive and we should avoid it wherever possible. If we're getting into a position where we're getting hung up on, we should change our behavior so that we send keep-alive packets every second or so while we're doing work to avoid getting timed out at all.

But maybe I am not understanding the problem that you're trying to solve. Are you seeing other sorts of failures? 🤔
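
For illustration, a minimal sketch of that "context-aware" condition (the struct and function names, demo_conn_state and demo_hangup_is_retryable, are hypothetical, not libgit2 API): a hang-up is treated as a benign keep-alive drop, and therefore retryable, only when it happens on the very first byte of a request or response on a socket that has already served at least one request.

/* Hypothetical helper: decide whether a hang-up looks like the server
 * reaping an idle kept-alive socket (retryable) or a real failure. */
#include <stdbool.h>
#include <stddef.h>

struct demo_conn_state {
	size_t requests_completed;  /* requests already served on this socket */
	size_t request_bytes;       /* bytes of the current request written */
	size_t response_bytes;      /* bytes of the current response read */
};

static bool demo_hangup_is_retryable(const struct demo_conn_state *s,
                                     bool failed_while_reading)
{
	/* A hang-up on a fresh socket is a real failure, never a retry. */
	if (s->requests_completed == 0)
		return false;

	if (failed_while_reading)
		/* Retry only if not a single response byte has arrived yet. */
		return s->response_bytes == 0;

	/* Retry only if not a single byte of the request has gone out yet. */
	return s->request_bytes == 0;
}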

@arroz (Contributor, Author) replied Apr 4, 2023:

To clarify: I'm only trying to solve the problem of an HTTP connection being dropped between two different requests. I'm not dealing with connections dropping during the request-response loop (that can be solved with git's keepalive feature, as you pointed out). In your words: "we should be detecting only the case when the server hangs up on us while we were keeping a connection alive".

You're right regarding the code above, I pushed a fix reverting it to -1.

Regarding how to detect, and more importantly tell apart, a connection that was dropped during the keep-alive period from the server hanging up during an active request-response loop: it's not as easy as it seems. Two examples (assume plain HTTP (not HTTPS), the socket transport, and that the kept-alive connection was closed before the beginning of the relevant negotiation step):

  1. wait_while_ack (called by the fetch negotiator) eventually calls socket_read. Since it's the first read after writing the POST request, socket_read usually returns 0. But not necessarily: if I pause the debugger right before the p_recv for a few seconds and then resume, it actually returns <0 with errno set to 32 (EPIPE). I believe this is caused by a timeout at the OS level that keeps the file descriptor open for a few seconds so the program has a chance to call read and get 0 bytes back, but eventually the OS closes the file descriptor and the program gets error 32. Is this timeout the same on all operating systems? Who knows. For our purposes this works, since both outcomes are converted into GIT_EEOF, which eventually bubbles up and is retried if appropriate; but the point is that assuming p_recv will always return 0 in this situation may not hold.

  2. push is even weirder. Assume git_smart__push takes some time before entering the RUN_WITH_KEEP_ALIVE loop in my patch (or before git_smart__get_push_stream in the original code). I added printf("Going to write initial payload\n"); right before that call, and, inside stream_thunk, another printf("Going to write subsequent payload\n"); before the write call, plus some print lines in socket_write. This is the result:

Sleeping for keep-alive drop
Resuming
Going to write initial payload
Socket p_send returned 286
Socket p_send returned 4
Socket p_send returned 136
Socket p_send returned 2
Going to write subsequent payload
Socket p_send returned <0, errno: 32

It only fails on the second high-level write (or the 5th actual write to the socket, which makes it even more arbitrary!), again by returning a negative value with errno 32. And I suspect this may depend on many factors: the size of the TCP window, how long it takes (or whether) FIN and RST packets arrive, the TCP stack implementation (Linux vs. BSDs vs. Windows vs. AmigaOS and everything else libgit2 runs on), and so on.

In this situation I don't think it's really possible to tell whether the server dropped the connection on us mid-request or whether the connection had already been dropped during the keep-alive period. EDIT: in fact, at the socket level these are exactly the same situation, a peer dropping the connection for whatever reason, which is why the two cases can't be told apart.
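
For illustration, a minimal read-side sketch of that situation in plain POSIX C (demo_socket_read and DEMO_EOF are made up for this example; they are not libgit2's actual socket_read or error codes): both the 0-byte read and the later EPIPE/ECONNRESET end up as a single "peer hung up" result, and the caller decides whether a retry is appropriate.

#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#define DEMO_OK      0
#define DEMO_ERROR (-1)
#define DEMO_EOF   (-2)  /* stands in for GIT_EEOF */

static int demo_socket_read(int fd, void *buf, size_t len, ssize_t *bytes_read)
{
	ssize_t n = recv(fd, buf, len, 0);

	if (n > 0) {
		*bytes_read = n;
		return DEMO_OK;
	}

	*bytes_read = 0;

	/* An orderly shutdown (0 bytes) and "the OS already tore the socket
	 * down" (EPIPE/ECONNRESET) are the same event from our point of view. */
	if (n == 0 || errno == EPIPE || errno == ECONNRESET)
		return DEMO_EOF;

	return DEMO_ERROR;
}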

Since negotiation can take a while (I still haven't looked into the push negotiator code that decides what to upload, but it does take a few seconds, especially in large repos), being too conservative here could make it literally impossible to push a large repo over HTTP to a server with a short timeout (again, Apache uses 5 seconds by default). This is why, when in doubt, I prefer to retry. I don't see this as a problem: if the intent is to push, and we only try twice at most, it's better to waste some bandwidth and make sure the upload happens than to fail.
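
And, again purely for illustration, a hedged sketch of that "retry at most once" policy on the write side (the names are made up, this is not the actual patch, and it assumes SIGPIPE is ignored so a dead peer reports EPIPE instead of killing the process): if the first attempt fails because the peer hung up, reconnect and resend the payload exactly once.

#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#define DEMO_OK      0
#define DEMO_ERROR (-1)
#define DEMO_EOF   (-2)  /* stands in for GIT_EEOF */

/* Write the whole payload; map EPIPE/ECONNRESET to DEMO_EOF so the caller
 * can tell a hang-up from other failures. */
static int demo_socket_write(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = send(fd, p, len, 0);

		if (n < 0)
			return (errno == EPIPE || errno == ECONNRESET) ?
			       DEMO_EOF : DEMO_ERROR;

		p += n;
		len -= (size_t)n;
	}

	return DEMO_OK;
}

static int demo_write_with_retry(int *fd, int (*reconnect)(void),
                                 const void *buf, size_t len)
{
	int error = demo_socket_write(*fd, buf, len);

	if (error == DEMO_EOF) {
		/* Assume a reaped keep-alive socket: open a fresh connection
		 * and resend the payload, but never more than once. */
		if ((*fd = reconnect()) < 0)
			return DEMO_ERROR;
		error = demo_socket_write(*fd, buf, len);
	}

	return error;
}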

We never pass through here on the first read, because of the check added above, so it's OK.