Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows builds constantly failing with IOException: connection was forcibly closed by the remote host #4467

Open
avdv opened this issue Aug 3, 2023 · 14 comments

Comments

@avdv
Copy link

avdv commented Aug 3, 2023

We are using Bazel 6.2.0 currently, but the issue existed before (I tried to follow suggestions in #992 but no dice).

The issue only happens for the Windows builds, Linux and Darwin are fine.

ERROR: The Build Event Protocol upload failed: Not retrying publishBuildEvents, no more attempts left: status='Status{code=UNAVAILABLE, description=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host
	at java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)
	at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source)
	at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
	at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
	at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
	at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
	at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:258)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)
}' UNAVAILABLE: An existing connection was forcibly closed by the remote host UNAVAILABLE: An existing connection was forcibly closed by the remote host
INFO: Build completed successfully, 511 total actions
Error: Process completed with exit code 38.

CI runs: 1 (bb), 2 (bb), 3 (bb)

Note, we have 4 jobs in the workflow running on Windows, and one of them succeeded: https://github.com/tweag/rules_haskell/actions/runs/5738410548/job/15564769980?pr=1925

The successful job took a bit over one hour, the failing jobs took >5 hours. Could this be just some sort of timeout?

We are using:

--bes_upload_mode=wait_for_upload_complete
--bes_timeout=60s

Should we try something else? Or tune different parameters? Thank you for any suggestions!

@siggisim
Copy link
Member

siggisim commented Aug 3, 2023

@sluongng any chance you're able to reproduce this on Windows?

This kind of feels like an error that Bazel would handle gracefully / retry on Linux (this suggests that this is the windows error message for a RST stream).

I believe our servers send an RST stream when they're restarting (and the 3 errors linked are around this time of our release yesterday when all of the apps are started, which happened around noon PST).

The question for me is why isn't Bazel able to successfully retry the BES stream on Windows like it seems to be able to on Linux / Mac (otherwise we could never restart our servers / perform releases).

@sluongng
Copy link
Contributor

sluongng commented Aug 4, 2023

I have not seen this before while I was trying to build BuildBuddy on Windows 🤔

Gona reach out to Tweag folks on Slack to see if I could reproduce with rules_haskell or not... though I doubt that this is specific to a repo, there might be some flags/options specific to rules_haskell CI (or their GitHub Action Runner) that could affect this.

@siggisim
Copy link
Member

siggisim commented Aug 4, 2023

Seems like it happened during a release, you could try running a Windows build against a local server and restarting it mid build.

@sluongng
Copy link
Contributor

sluongng commented Aug 4, 2023

Slightly hard since our server does not build on Windows 😅
But I took that as a challenge: I set up a personal VPN on my Windows VM and ran //server on my MacBook.

Tried to stop and restart server on my MacBook 2 times mid-build but could not reproduce the issue 🤔

@sluongng
Copy link
Contributor

sluongng commented Aug 7, 2023

I tried a bunch of approaches and could not replicate the exact error.

@avdv It would be helpful if you could help capture the java log in $(bazel info output-base)/java.log*, specifically the logs from BuildEventServiceUploader would be pretty beneficial. I want to check the timing of the Retrying stream log and its reasons.

It's likely that the 5 hours build duration overlapping with our weekly release time frame was the root cause of this: our old server received the stream -> was interrupted for upgrade -> Bazel retried to another server that would also get interrupted later. If this is indeed the case, bumping --experimental_build_event_upload_max_retries=4 to a higher number would be a good solution.

@avdv
Copy link
Author

avdv commented Aug 7, 2023

Thank you!

It would be helpful if you could help capture the java log in $(bazel info output-base)/java.log*, specifically the logs from BuildEventServiceUploader would be pretty beneficial. I want to check the timing of the Retrying stream log and its reasons.

I'll try to reproduce and attach the files.

@sluongng
Copy link
Contributor

sluongng commented Aug 7, 2023

discussed with @avdv on Slack and tested it myself in https://github.com/sluongng/rules_haskell/actions/runs/5785750975/job/15678998276 https://app.buildbuddy.io/invocation/c912e3eb-c949-4872-9246-47a743b7425d

It seems that the issue is unrelated to our weekly release. I could not reproduce it on AWS EC2 Windows VM, but it seems like the Github Action's managed runner (on Azure) could be easily reproduced.

@siggisim
Copy link
Member

siggisim commented Aug 7, 2023

One thing I've noticed is that the Github action runners on Linux seem to have pretty flaky networking, and that setting this kernel param seems to help:

sudo sysctl -w net.ipv4.tcp_keepalive_time=60

I wonder if there is an equivalent for Windows. Based on this it seems the default is 2 hours and it can be tweaked: https://serverfault.com/a/735768

@avdv
Copy link
Author

avdv commented Aug 8, 2023

I have made a PR here where I uploaded log files as artifacts. See here.

Looking at the "windows-latest rules_haskell bzlmod" job's logs:

230808 05:23:49.822:I 1 [com.google.devtools.build.lib.runtime.BlazeRuntime.main] Starting Bazel server with pid 1876 and , args [--max_idle_secs=10800, --noshutdown_on_low_│
sys_mem, --connect_timeout_secs=30, --output_user_root=c:\_bzl, --install_base=c:\_bzl\install\b81619668f3c61a40ed47421c0adf7f4, --install_md5=b81619668f3c61a40ed47421c0adf7│
f4, --output_base=c:\_bzl\minshlu6, --workspace_directory=d:\a\rules_haskell\rules_haskell, --default_system_javabase=C:/hostedtoolcache/windows/Java_Temurin-Hotspot_jdk/8.0│
.382-5/x64, --failure_detail_out=c:\_bzl\minshlu6\failure_detail.rawproto, --expand_configs_in_place, --idle_server_tasks, --write_command_log, --nowatchfs, --nofatal_event_│
bus_exceptions, --nowindows_enable_symlinks, --client_debug=false, --host_jvm_args=-Dbazel.windows_unix_root=C:\Program Files\Git\usr\bin\bash.exe, --host_jvm_args=-Djdk.tls│
.client.protocols=TLSv1.2, --product_name=Bazel, --option_sources=host_Ujvm_Uargs:d_C\a\rules_Uhaskell\rules_Uhaskell\.bazelrc.common:output_Uuser_Uroot:d_C\a\rules_Uhaskell│
\rules_Uhaskell\.bazelrc.local]
...
230808 05:23:53.701:I 20 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.logIds] Streaming Build Event Protocol to 'grpcs://remote.buildbuddy.io' wi│
th build_request_id: '0ae44fad-579a-44e4-a296-fd383a6c5571' and invocation_id: '514cd9ca-ecd6-45e6-89f5-f3b0bea0defa'                                                        │
230808 05:24:00.465:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=2           │
230808 05:36:46.267:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
scription=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host                                                             │
230808 05:36:47.273:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=237         │
230808 05:41:22.864:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
scription=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host                                                             │
230808 05:41:24.467:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=237         │
230808 05:46:00.879:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
scription=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host                                                             │
230808 05:46:03.721:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=237         │
230808 05:50:40.630:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
scription=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host                                                             │
230808 05:50:44.732:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=237         │
230808 05:57:58.894:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Not retrying publishBuildEvents, no more attempts le│
ft: status='Status{code=UNAVAILABLE, description=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host                      │
230808 05:57:58.924:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] About to cancel all local file uploads              │
230808 05:57:58.925:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Spent 0 milliseconds doing local file upload cancell│
ation                                                                                                                                                                        │
230808 05:57:59.062:X 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.logAndSetException] The Build Event Protocol upload failed: Not retrying│
 publishBuildEvents, no more attempts left: status='Status{code=UNAVAILABLE, description=io exception, cause=java.io.IOException: An existing connection was forcibly closed │
by the remote host                                                                                                                                                           │
230808 06:35:42.655:I 20 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose] Spent 0 milliseconds doing waiting for │
BES close for invocation 514cd9ca-ecd6-45e6-89f5-f3b0bea0defa                                                                                                                │
230808 06:35:42.669:WT 20 [com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand] While running afterCommand() on BazelBuildEventServiceModule                     │
com.google.devtools.build.lib.util.AbruptExitException: The Build Event Protocol upload failed: Not retrying publishBuildEvents, no more attempts left: status='Status{code=U│
NAVAILABLE, description=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host                                               │
        at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.logAndSetException(BuildEventServiceUploader.java:298)                                  │
        at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.run(BuildEventServiceUploader.java:334)                                                 │
Caused by: com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader$DetailedStatusException: UNAVAILABLE: io exception                                      │
        at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.withFailureDetail(BuildEventServiceUploader.java:731)                                   │
        at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents(BuildEventServiceUploader.java:547)                                  │
        at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.run(BuildEventServiceUploader.java:311)                                                 │
230808 06:35:42.669:I 20 [com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec] Exit status was DetailedExitCode{exitCode=PUBLISH_ERROR, failureDetail=message: │
"The Build Event Protocol upload failed: Not retrying publishBuildEvents, no more attempts left: status=\'Status{code=UNAVAILABLE, description=io exception, cause=java.io.IO│
Exception: An existing connection was forcibly closed by the remote host\r\n\tat java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)\r\n\tat java.base/sun.nio.ch.Sock│
etDispatcher.read(Unknown Source)\r\n\tat java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)\r\n\tat java.base/sun.nio.ch.IOUtil.read(Unknown Source)\r\n\tat j│
ava.base/sun.nio.ch.IOUtil.read(Unknown Source)\r\n\tat java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)\r\n\tat io.netty.buffer.PooledByteBuf.setBytes(PooledByte│
Buf.java:258)\r\n\tat io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)\r\n\tat io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel│
.java:350)\r\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)\r\n\tat io.netty.channel.nio.NioEventLoop.processSelectedK│
ey(NioEventLoop.java:722)\r\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)\r\n\tat io.netty.channel.nio.NioEventLoop.processSele│
ctedKeys(NioEventLoop.java:584)\r\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)\r\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleT│
hreadEventExecutor.java:986)\r\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\r\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(F│
astThreadLocalRunnable.java:30)\r\n\tat java.base/java.lang.Thread.run(Unknown Source)\r\n}\' UNAVAILABLE: An existing connection was forcibly closed by the remote host UNAV│
AILABLE: An existing connection was forcibly closed by the remote host"

So the build started at 05:23:49 and the first disconnection happened at 05:36:46. Then it happens after 7, 5, 4 and 7 minutes again. I'll just try to bump the retries to a high number.

Thank you @siggisim, I'll look into the TCP keepalive setting too.

@avdv
Copy link
Author

avdv commented Aug 9, 2023

I did set the KeepAliveTime registry entry to 300000 milliseconds = 5 mins, but that did not help. Yet it almost seems to happen at steady rate once every 5mins now...

230809 06:29:01.478:I 20 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.logIds] Streaming Build Event Protocol to 'grpcs://remote.buildbuddy.io' wi│
230809 06:29:21.604:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=2           │
230809 06:34:05.272:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 06:34:06.272:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=134         │
230809 06:45:19.297:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 06:45:20.902:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=318         │
230809 06:49:50.790:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 06:49:53.353:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=318         │
230809 06:54:22.370:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 06:54:26.471:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=318         │
230809 06:58:55.971:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 06:59:02.538:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=318         │
230809 07:03:31.893:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:03:42.384:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=318         │
230809 07:10:15.104:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:10:31.887:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=386         │
230809 07:19:28.817:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:19:55.674:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=389         │
230809 07:24:26.380:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:25:09.346:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=389         │
230809 07:29:39.393:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:30:48.115:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=389         │
230809 07:35:17.988:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:37:07.943:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=390         │
230809 07:41:37.300:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:44:33.241:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=390         │
230809 07:50:15.981:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:54:57.464:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=445         │
230809 07:54:57.620:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] BES uploader is half-closed                         │
230809 07:54:58.451:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] publishBuildEvents was successful                   │
230809 07:54:58.451:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] About to cancel all local file uploads              │
230809 07:54:58.451:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Spent 0 milliseconds doing local file upload cancell│
230809 07:54:58.621:I 20 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose] Spent 21048 milliseconds doing waiting │

Setting --experimental_build_event_upload_max_retries=256 did remedy the issue although I saw "The Build Event Protocol upload timed out." in one job instead.

avdv added a commit to avdv/rules_haskell that referenced this issue Aug 10, 2023
This remedies an issue (buildbuddy-io/buildbuddy#4467) where the
build would fail eventually because the connections to the remote get closed intermittently.
@avdv
Copy link
Author

avdv commented Aug 10, 2023

Just to note (seems I only mentioned this in Slack), we are mostly seeing this error with PRs from forks (where the BuildBuddy API secret is not set, so we only have a read-only cache).

Also, I made a PR adding the max retries setting and it failed with an upload timeout error again.

230809 10:35:02.715:I 20 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose] Spent 60000 milliseconds doing waiting for BES close for invocation 0aca44ca-32b2-41f6-9760-2a7ca8956818
230809 10:35:02.720:WT 20 [com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand] While running afterCommand() on BazelBuildEventServiceModule
com.google.devtools.build.lib.util.AbruptExitException: The Build Event Protocol upload timed out. com.google.common.util.concurrent.TimeoutFuture$TimeoutFutureException: Timed out: NonCancellationPropagatingFuture@4cd9776e[status=PENDING, info=[delegate=[SettableFuture@9dbe091[status=PENDING]]]]
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.createAbruptExitException(BuildEventServiceModule.java:851)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:512)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.closeBepTransports(BuildEventServiceModule.java:608)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.afterCommand(BuildEventServiceModule.java:623)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:625)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:649)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:240)
	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:550)
	at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:614)
	at io.grpc.Context$1.run(Context.java:566)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.ExecutionException: com.google.common.util.concurrent.TimeoutFuture$TimeoutFutureException: Timed out: NonCancellationPropagatingFuture@4cd9776e[status=PENDING, info=[delegate=[SettableFuture@9dbe091[status=PENDING]]]]
	at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:588)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:567)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:113)
	at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:240)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:506)
	... 11 more

I'll try to increase the timeout and see if it makes a difference (no high hopes, I guess 60s should be plenty already).

avdv added a commit to avdv/rules_haskell that referenced this issue Aug 10, 2023
This remedies an issue (buildbuddy-io/buildbuddy#4467) where the
build would fail eventually because the connections to the remote get closed intermittently.
@sluongng
Copy link
Contributor

https://www.buildbuddy.io/docs/troubleshooting-slow-upload#the-build-event-protocol-upload-timed-out

We do recommend setting the BES timeout value to 600s here. I think it should help with cases like yours, but it would be better to know where in the network stack did the failure occur 🤔

@avdv
Copy link
Author

avdv commented Aug 14, 2023

Thank you @sluongng, but seems like it did not help: https://github.com/tweag/rules_haskell/actions/runs/5834070440/job/15838104954?pr=1935

I'll retry and try to get a log file. \edit: the job ran successfully on the third try.

avdv added a commit to avdv/rules_haskell that referenced this issue Aug 16, 2023
This remedies an issue (buildbuddy-io/buildbuddy#4467) where the
build would fail eventually because the connections to the remote get closed intermittently.
@sluongng
Copy link
Contributor

sluongng commented Nov 6, 2023

We have recently helped a few customers troubleshoot their AWS network connectivity. Folks who connect to BuildBuddy through a NAT Gateway would often be subjected to the gateway's TCP timeout, especially when it comes to a long-running grpc stream.

It's unknown how Github is setting up the managed Windows runners, but I guess that it's also subjected to similar limitations.
https://learn.microsoft.com/en-us/azure/nat-gateway/troubleshoot-nat-connectivity#tcp-idle-timeout-timers-set-higher-than-the-default-value

On Bazel side, there are --grpc_keepalive_time and --grpc_keepalive_timeout flags that might help with these behavior.

There is also --experimental_remote_execution_keepalive but I don't think RBE is relevant here.

Also here is how you can tune the setting on Linux's sysctl https://www.buildbuddy.io/docs/troubleshooting-rbe#warning-remote-cache-unavailable-io-exception. It seems like you can achieve a similar effect on Windows https://serverfault.com/questions/735515/tcp-timeout-for-established-connections-in-windows. However, since we are discussing Github-managed runners, I think these are less relevant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants