
Yarn fails with ESOCKETTIMEDOUT while installing a large package on a slow disk #8242

Closed
darkk opened this issue Jul 17, 2020 · 16 comments · May be fixed by #8363
Labels
fixed-in-modern: This issue has been fixed / implemented in Yarn 2+.

Comments


darkk commented Jul 17, 2020

Bug description

The Windows build of our Electron app is consistently failing. yarn install --frozen-lockfile failed to download https://registry.yarnpkg.com/date-fns/-/date-fns-2.12.0.tgz and https://registry.yarnpkg.com/@material-ui/icons/-/icons-4.9.1.tgz. The failure to download @material-ui/icons was reported as ESOCKETTIMEDOUT. However, I expected the build host to have a well-provisioned network, as it was a GitHub Actions runner, and the Linux build was working fine.

I suspected that high-latency disk IO might be the cause and managed to produce a test case that reproduces the issue reliably: ESOCKETTIMEDOUT is consistently triggered on Linux when a small, realistic delay (8 ms) is injected into disk IO system calls.

ESOCKETTIMEDOUT being reported because of slow disk IO is very confusing behavior, as it sounds like a temporary network error while the root cause is something else. It does not match my understanding of the "Developing JavaScript projects shouldn't leave the door open to surprises" motto, so I'm reporting this test case as a separate issue despite possible duplicates in the issue tracker. 🙂

Command

ubuntu $ docker run --privileged --rm -ti --net=host node:12-buster /bin/bash
docker # apt update && apt install -y strace
docker # mkdir ~/prj && cd ~/prj
docker # strace -f -o /dev/null -e trace=stat -e inject=stat:delay_exit=8000 yarn add @material-ui/icons@^4.5.1
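
Note: delay_exit=8000 is given in microseconds, so strace adds an 8 ms delay on return from every stat() call made by yarn and its child processes (followed via -f).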

What is the current behavior?

yarn add v1.22.4
info No lockfile found.
[1/4] Resolving packages...
[2/4] Fetching packages...
info There appears to be trouble with your network connection. Retrying...
info There appears to be trouble with your network connection. Retrying...
info There appears to be trouble with your network connection. Retrying...
info There appears to be trouble with your network connection. Retrying...
error An unexpected error occurred: "https://registry.yarnpkg.com/@material-ui/icons/-/icons-4.9.1.tgz: ESOCKETTIMEDOUT".
info If you think this is a bug, please open a bug report with the information provided in "/root/prj/yarn-error.log".
info Visit https://yarnpkg.com/en/docs/cli/add for documentation about this command.

What is the expected behavior?
If I run exactly the same command with delay_exit=1 (0.001 ms) instead of delay_exit=8000 (8 ms), I get the expected behavior:

yarn add v1.22.4
info No lockfile found.
[1/4] Resolving packages...
[2/4] Fetching packages...
[3/4] Linking dependencies...
warning " > @material-ui/icons@4.9.1" has unmet peer dependency "@material-ui/core@^4.0.0".
warning " > @material-ui/icons@4.9.1" has unmet peer dependency "react@^16.8.0".
warning " > @material-ui/icons@4.9.1" has unmet peer dependency "react-dom@^16.8.0".
[4/4] Building fresh packages...
success Saved lockfile.
success Saved 3 new dependencies.
info Direct dependencies
└─ @material-ui/icons@4.9.1
info All dependencies
├─ @babel/runtime@7.10.5
├─ @material-ui/icons@4.9.1
└─ regenerator-runtime@0.13.5
Done in 53.01s.

Steps to Reproduce

  1. See "Command" paragraph. Also, the test-case may need some small changes depending on environment.

First, strace adds some overhead of its own, which may affect reproducibility. For example, yarn add @material-ui/icons@^4.5.1 is Done in 5.76s. in the very same environment without the strace wrapper. That's why I compare strace-with-delay against strace-without-delay rather than against a "clean" run.

Second, I chose the stat() call to target based on the following:

  • I launched strace -f -o ~/yarn-trace yarn add @material-ui/icons@^4.5.1
  • I looked at the output of grep -F AccessAlarmsRounded.d.ts ~/yarn-trace. It showed 5 openat() calls, 4 lstat() calls, 1 stat() call and 1 chmod() call, so I picked stat(/usr/local/share/.cache/yarn/v6/.../AccessAlarmsRounded.d.ts) as the place to inject the delay.

Third, I chose an 8 ms delay assuming that there is a single stat() system call per unpacked file and that I was emulating an HDD-based system with roughly 125 IOPS. It's all a ballpark estimate: a 1 ms delay works on my system, 2 ms fails with ESOCKETTIMEDOUT once but manages to install the package after a retry, and 4 ms and 8 ms fail reliably.
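For reference, the 8 ms figure is just the reciprocal of the assumed IOPS budget: 1000 ms / 125 IOPS = 8 ms per injected stat() call.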

Fourth, since TCP buffering is involved (see the comment on TCP ZeroWindow below), the available network bandwidth and the size of the socket buffer may also play a role in how reproducible the bug is. I've reproduced the bug with these exact values on an Ubuntu 16.04 laptop connected over a 100 Mbit/s link in St. Petersburg, Russia, and on a Linode VM in Newark (see below).

Fifth, your node build may interact with the OS kernel a bit differently, e.g. it may use open() instead of openat(). So if the test case fails for you, try increasing the injected latency or targeting a different disk-related system call. I reproduced the issue on an Ubuntu 18.04 VM in the Linode Newark availability zone, but I had to use openat as the latency-injection point instead of stat: 4 statx() and 3 openat() syscalls were made for the aforementioned filename on that VM.

Comments and assumptions

SQLite has "faster than FS" benchmarks showing that Windows had pretty bad performance (compared to Linux) while operating with lots of small files. Both date-fns and @material-ui/icons have thousands of files as well as packages mentioned in "Possible duplicates" section. That explains that Windows users are suffering way more from ESOCKETTIMEDOUT happening while installing packages with thousands of files.

@FredyC came to the same conclusion in #6221 (comment): a high-latency HDD being used instead of a low-latency SSD triggers the ESOCKETTIMEDOUT.

@Hinaser made an excellent comment describing a packet capture in #5259 (comment): yarn probably stops reading from the socket (so the client OS sends TCP ZeroWindow) and eventually closes the socket from the client side.

I assume that node or yarn is busy unpacking a well-compressed tarball full of small files and does not resume reading from the socket for long enough that ESOCKETTIMEDOUT is triggered. I also assume that the code does not disable the socket timeout while the stream is in a paused state.
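
A minimal standalone sketch of that failure mode (hypothetical code, not yarn's actual fetcher): if the response stream stays paused while an inactivity timeout remains armed, the timeout fires even though the network itself is healthy.

const https = require('https');

const url = 'https://registry.yarnpkg.com/@material-ui/icons/-/icons-4.9.1.tgz';
const req = https.get(url, (res) => {
  // Emulate back-pressure from a slow extractor: stop reading from the socket
  // and never call res.resume(), so no data flows for the rest of the response.
  res.pause();
});
// Arm a 30 s inactivity timeout, analogous to yarn's default network-timeout.
req.setTimeout(30000, () => {
  // No socket activity for 30 s, so the timer fires; this is how an
  // ESOCKETTIMEDOUT-style error can surface despite a healthy network.
  req.destroy(new Error('ESOCKETTIMEDOUT'));
});
req.on('error', (err) => console.error(err.message));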

I assume a possible fix is to download the .tgz to a temporary file, with timeouts applied to the network interactions, and then unpack it without any timeouts, since the disk can't write any faster anyway. Unfortunately, I'm not familiar enough with the yarn codebase to provide a good PR.
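
A rough sketch of that two-phase idea (a hypothetical helper, not an actual yarn patch; downloadToFile, extractTarball and the path variables are made-up names): the socket timeout only covers the download into a temporary file, and the disk-heavy extraction happens afterwards with no socket involved.

const fs = require('fs');
const https = require('https');
const { pipeline } = require('stream');

// Phase 1: network-bound, so it keeps a socket inactivity timeout.
function downloadToFile(url, dest, timeoutMs, cb) {
  let done = false;
  const finish = (err) => { if (!done) { done = true; cb(err); } };
  const req = https.get(url, { timeout: timeoutMs }, (res) => {
    // Only the raw .tgz bytes are written here, which is cheap compared to
    // unpacking thousands of small files.
    pipeline(res, fs.createWriteStream(dest), finish);
  });
  req.on('timeout', () => req.destroy(new Error('ESOCKETTIMEDOUT')));
  req.on('error', finish);
}

// Phase 2 (sketched): disk-bound, performed with no socket and no timeout.
// downloadToFile(tarballUrl, tmpPath, 30000, (err) => {
//   if (err) throw err;
//   extractTarball(tmpPath, destDir); // hypothetical extraction step
// });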

Environment

  • Node Version: 12.18.2
  • Yarn v1 Version: 1.22.4
  • OS and version: Docker container node:12-buster running on top of Ubuntu 16.04 or 18.04

yarn-error.log is the following:

Arguments: 
  /usr/local/bin/node /opt/yarn-v1.22.4/bin/yarn.js add @material-ui/icons@^4.5.1

PATH: 
  /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

Yarn version: 
  1.22.4

Node version: 
  12.18.2

Platform: 
  linux x64

Trace: 
  Error: https://registry.yarnpkg.com/@material-ui/icons/-/icons-4.9.1.tgz: ESOCKETTIMEDOUT
      at ClientRequest.<anonymous> (/opt/yarn-v1.22.4/lib/cli.js:141375:19)
      at Object.onceWrapper (events.js:421:28)
      at ClientRequest.emit (events.js:315:20)
      at TLSSocket.emitRequestTimeout (_http_client.js:709:9)
      at Object.onceWrapper (events.js:421:28)
      at TLSSocket.emit (events.js:327:22)
      at TLSSocket.Socket._onTimeout (net.js:481:8)
      at listOnTimeout (internal/timers.js:549:17)
      at processTimers (internal/timers.js:492:7)

npm manifest: 
  No manifest

yarn manifest: 
  No manifest

Lockfile: 
  No lockfile

Possible duplicates:


ievgennaida commented Jul 20, 2020

Windows 10.
yarn v1.22.4
node v10.19.0
WSL

Executed:
npm init react-app my-ap

Failed:
[2/4] Fetching packages...
error An unexpected error occurred: "https://registry.yarnpkg.com/rxjs/-/rxjs-6.5.4.tgz: ESOCKETTIMEDOUT".

@AmirTugi

I'm experiencing the same exact issue.
It's a massive issue for us, since it's blocking deploys to production.
Does anyone have any idea how to solve this?
What changed since last week?


darkk commented Jul 21, 2020

@AmirTugi the workaround that works for me is yarn config set network-timeout 300000, which raises the timeout to 5 minutes from the default 30 seconds.
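If I'm not mistaken, the same value can also be passed per invocation instead of changing the global config, e.g. yarn install --network-timeout 300000.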

However, that's just a workaround, not a fix.

@AmirTugi

Right, I tried to raise it to 1000000 and it didn't work.
So that's an arbitrary work-around :)

@m0hamm3d-ac

We are facing this issue on our pipeline servers too. Has anything changed in yarn?

@AlanKnightly

Me too. I have tried more than 30 times these days and always got a timeout, which is so annoying.


ankasani commented Sep 7, 2020

I see the same problem in the App Center build services. Can anyone please look into this issue?

Is there any yarn status page available?

@m0hamm3d-ac

> I see the same problem in the App Center build services. Can anyone please look into this issue?
>
> Is there any yarn status page available?

The previous failures I observed corresponded to npm outages shown on this page: https://status.npmjs.org/


anhvut commented Sep 30, 2020

@darkk gave a wonderful description of the bug. I proposed a PR with his suggested fix in mind:

  • first download the .tgz file into the cache, without uncompressing at the same time; this part is still constrained by the network timeout
  • then read the .tgz file from the cache folder to uncompress it and do the other work (compute checksums, build metadata ...)

@toby-griffiths

I'm also seeing this on my builds on a DigitalOcean (SSD) build server over the last couple of days (since setting the build server up).

@marikaner

This is happening on GitHub Actions for us as well. Every day a few of our checks fail because of that. Current solution: rerun...

@mikehardy
Copy link

For GitHub Actions failures you might like https://github.com/nick-invision/retry/

merceyz (Member) commented Jan 2, 2021

Closing as fixed in v2, where the timeout logic is less susceptible to this sort of issue.

https://yarnpkg.com/getting-started/migration

merceyz closed this as completed on Jan 2, 2021
merceyz added the fixed-in-modern (This issue has been fixed / implemented in Yarn 2+.) label on Jan 2, 2021
@jjangga0214

@merceyz

I have a few questions.

  1. Is this not supposed to be fixed in v1?
  2. Does this happen only on a specific OS (e.g. macOS)?
  3. Why does this happen and how is this fixed in v2?

I would appreciate it if you could share some information.

Thanks!


darkk commented Feb 10, 2021

@jjangga0214 Regarding Q#2: it might happen on any OS. It's just more likely to trigger the bug on macOS and Windows due to the performance characteristics of their filesystems. An HDD (or any other high-latency medium) instead of an SSD also increases the probability.


ben1one commented Jun 19, 2021

thx so much!!!!

yarnpkg locked this issue as resolved and limited conversation to collaborators on Jun 19, 2021