Observations regarding instability of native fetch in AWS Lambda x86_64 nodejs20.x #3133

Open
jandppw opened this issue Apr 17, 2024 · 24 comments

@jandppw commented Apr 17, 2024

We observe consistent intermittent errors when using native fetch in AWS Lambda x86_64 nodejs20.x. We were
unable to find the particular place in the undici code that causes this behavior.

The intention of this submission is to report the issues and to record our observations as accurately as possible, so that
people with more detailed knowledge can isolate and possibly fix the issues, if they are not already fixed (fetch is still
experimental in Node 20).

Stable precursor

Versions of this service have been in production since 2022-05-19 on AWS Lambda x86_64 nodejs16.x using
node-fetch@2.6.7 without many issues. Notably, in the last 6 months of
2023 exactly 1 communication error calling the remote API was logged. It handles ~700 invocations per day.

Where are the issues observed?

The functionality of the service requires it to do ±25 remote API calls per invocation. The remote API is hosted on a
Windows VM through IIS 8, and is accessed through a CloudFront distribution, set up to use HTTP/2 and 3 (http2and3).

We observe this issue, or these issues, only in AWS Lambda x86_64 nodejs20.x, and have not observed it during
development on macOS. (This does not guarantee that the issues are limited to AWS Lambda x86_64 nodejs20.x, only that
it is possible that they are.)

A new version of the code was released on 2024-03-25, but this did not have any impact on the issues. It seems AWS
Lambda x86_64 nodejs20.x changed its Node version on 2024-04-03. We cannot report the exact Node version (and thus
undici version) because sadly we do not log that. Given that the latest LTS version of Node uses undici@5.28.4, this
report covers versions undici@<=5.28.4. The issues might have been fixed already in later releases, although we don't see
any relevant mention of this in the release notes since v5.25.1. Between
2024-02-06 and 2024-04-03 the errors are reported at node:internal/deps/undici/undici:12344:11. Between 2024-04-03 and
2024-04-15 the errors are reported at node:internal/deps/undici/undici:12345:11.

~1% of invocations fail with 4 different types of failures

The version of this service that shows the issues has been in production since 2024-02-06. It is deployed on AWS Lambda
x86_64 nodejs20.x, and replaced node-fetch with native fetch. Between 2024-02-06 and 2024-04-15 about 49000 Lambda
invocations have been made and logged, of which 338 failed. That is about 0.7%. When an invocation does not fail, it
calls fetch ~25 times. This means ~0.03% of fetch calls fail, but, as you will see, the impact might be greater,
because most failures appear to happen on the first fetch call of an invocation of a fresh Lambda instance.

All failures are logged as 1 of 4 “fetch failed” cause variants:

  • { cause: { errno: -110, code: 'ETIMEDOUT', syscall: 'write' } } (163)
  • UND_ERR_SOCKET (152)
  • { cause: { name: 'ConnectTimeoutError', code: 'UND_ERR_CONNECT_TIMEOUT', message: 'Connect Timeout Error' } } (16)
  • { cause: { errno: -32, code: 'EPIPE', syscall: 'write' } } (7)

The UND_ERR_SOCKET causes are of the general form

{
  cause: {
    name: 'SocketError',
    code: 'UND_ERR_SOCKET',
    socket: {
      localAddress: 'NNN.NNN.NN.N',
      localPort: NNNNN,
      remoteAddress: undefined,
      remotePort: undefined,
      remoteFamily: undefined,
      timeout: undefined,
      bytesWritten: 580,
      bytesRead: 1915
    }
  }
}

Sometimes the remote address and remote port are filled in, sometimes not.

Sadly, we cannot pinpoint exactly where the errors occur, since our logs do not store the stack trace of cause. Trying to find
out through code inspection did not turn up anything; the usage of processResponse is a bit opaque.

ETIMEDOUT and EPIPE

All samples of the ETIMEDOUT and EPIPE cause we examined show < 15ms between the FETCHING… log and the ERROR
log, and these calls are not registered in the remote API access log. We see the FETCHING… log, but never the
RESPONSE_RECEIVED log in these cases (see below).

In all samples we examined, the error occurs in the first call to fetch in the invocation. We assume these calls never
left the Lambda environment, and they might occur on the first call to fetch after spawning a Lambda instance.

This makes us believe these cases are bugs in undici, or in the AWS Lambda x86_64 nodejs20.x OS.

UND_ERR_CONNECT_TIMEOUT and UND_ERR_SOCKET

The UND_ERR_CONNECT_TIMEOUT and UND_ERR_SOCKET errors do not follow this pattern. Multiple calls to the remote
API with fetch succeed before the error occurs. We assume these calls do go out to the remote API.

Yet, such errors rarely occurred when we used this code on AWS Lambda x86_64 nodejs16.x with
node-fetch@2.6.7 (1 such error in the last 6 months of 2023).

Code

fetch is used twice in this code base, but all errors stem from the same use:

  
  log.info(module, 'get', flowId, 'FETCHING …', { uri })
  const response = await fetch(uri, {
    headers: { accept: 'application/json', authorization }
  })
  log.info(module, 'get', flowId, 'RESPONSE_RECEIVED', { uri, statusCode: response.status })
  

The log.info code is:

/**
 * @param {NodeModule} callingModule
 * @param {string} functionName
 * @param {UUID} flowId
 * @param {string} phase
 * @param {object} [details]
 */
function info(callingModule, functionName, flowId, phase, details) {
  strictEqual(typeof flowId, 'string')
  try {
    const now = Date.now()
    const message = {
      lambda: staticLambdaData(),
      flowId,
      now,
      level: 'INFO',
      module: callingModule.filename,
      function: functionName,
      phase,
      details: masked(details, [])
    }
    console.log(JSON.stringify(message))
  } catch (ignore) /* istanbul ignore next */ {
    errorLogging(callingModule, functionName, flowId, phase)
  }
}

Note that now in the logged JSON is determined synchronously, immediately before the call to fetch.

The try - catch around the call to fetch is several levels higher in the call stack:


} catch (err) {
  log.error(module, 'handler', responseFlowId, err)
  return {
    statusCode: 500
  }
}

The log.error code is:

/**
 * @param {NodeModule} callingModule
 * @param {string} functionName
 * @param {UUID} flowId
 * @param {*} err
 */
function error(callingModule, functionName, flowId, err) {
  strictEqual(typeof flowId, 'string')
  try {
    const now = Date.now()
    const message = {
      lambda: staticLambdaData(),
      flowId,
      now,
      level: 'ERROR',
      module: callingModule.filename,
      function: functionName,
      message: err && err.message,
      stack: err && err.stack,
      details: util.inspect(masked(err, []), { depth: 9, maxArrayLength: 5 })
    }
    console.log(JSON.stringify(message))
  } catch (ignore) /* istanbul ignore next */ {
    errorLogging(callingModule, functionName, flowId, 'ERROR')
  }
}

Note that now in the logged JSON is determined here.

From the logged now in the FETCHING… log and the logged now in the ERROR log we can determine an upper bound of
the duration between our fetch call and the occurrence of the error.
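
As an illustration only, a minimal sketch of that computation, assuming the two parsed JSON log entries are available as plain objects (function and variable names below are hypothetical):

// Hedged sketch: upper bound on the fetch-to-error duration, from the `now`
// fields of the FETCHING… and ERROR log entries shown above.
function fetchToErrorUpperBoundMs(fetchingEntry, errorEntry) {
  return errorEntry.now - fetchingEntry.now
}

// Example: an upper bound of 4 ms.
console.log(fetchToErrorUpperBoundMs({ now: 1713340000000 }, { now: 1713340000004 }))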

Possibly related issues

@jandppw added the bug label Apr 17, 2024
@ronag (Member) commented Apr 17, 2024

Can you try using the undici package instead of the bundled version, to see if the problem still occurs in newer versions of undici?
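
For reference, a minimal sketch of that swap, assuming the standalone undici package is added to the deployment bundle; uri and authorization are the values from the call site shown in the issue above:

// Hedged sketch: use fetch from the undici npm package instead of the version
// bundled with Node (assumes `npm install undici` in the Lambda bundle).
import { fetch } from 'undici'

const response = await fetch(uri, {
  headers: { accept: 'application/json', authorization }
})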

@jandppw (Author) commented Apr 17, 2024

Because the issues are intermittent, and do not occur locally, we cannot really test for them. Realistically, we can only get log information from failed invocations in production, over a number of hours or days. We just deployed a new version that should log the stack trace of the cause of the errors, to get a bit more information.

I understand that rolling out a version that uses the latest version of undici directly would help in isolating the issues. Yet, we believe it would be unwise to expose our users to more experiments than necessary for several days, without addressing the issue they have.

Because this is an annoying production issue (~3.5% of interactions failing), we must focus on stability first. Our plan is to revert to using node-fetch (on Node 20) asap. fetch is still experimental in Node 20, after all, and maybe we jumped the gun.

In the meantime, we will report further information as the errors appear with more log information.

We understand the report is not very concrete, and regret we cannot be of any more help. Yet we believe the observations we made are important enough to share here, if only for reference for other people who encounter similar issues (e.g., #2964).

@ronag (Member) commented Apr 17, 2024

@jandppw Have you considered simply using undici.request, which is our recommended API? fetch should only be used if you require compatibility with runtimes other than Node.js.
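
For comparison, a minimal sketch of the same GET via undici.request, with the headers from the call site shown in the issue above (uri and authorization assumed to be in scope):

// Hedged sketch: the equivalent request with undici.request instead of fetch.
import { request } from 'undici'

const { statusCode, body } = await request(uri, {
  method: 'GET',
  headers: { accept: 'application/json', authorization }
})
const data = await body.json() // body exposes the usual body-mixin helpers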

@jandppw (Author) commented Apr 17, 2024

@ronag No, we haven’t. We weren’t aware that undici.request is the recommended API, nor that fetch should only be used if you require compatibility with runtimes other than Node.js. Is there any documentation on that?

We were under the impression that fetch is the generic way to go in the future in JavaScript. One common API. The whole idea is to minimize dependencies, isn’t it?

@ronag (Member) commented Apr 17, 2024

Is there any documentation on that?

Not sure. Where would you expect that to be documented?

We were under the impression that fetch is the generic way to go in the future in JavaScript. One common API. The whole idea is to minimize dependencies, isn’t it?

fetch was added due to popular demand, not due to technical merits. It's never going to be as fast as undici.request. I would stay away from fetch unless you are writing isomorphic code, or if your team is mostly comfortable with fetch and performance is not critical.

@jandppw (Author) commented Apr 17, 2024

But that would require adding undici as a direct dependency in the project, no? In that situation, undici is a contender next to node-fetch, axios, got, superagent, …, no?

If that is the case, in the current project, at this time, we’d rather revert to proven safe ground (i.e., node-fetch).

Before the above occurred, we were not familiar with undici. Until your comment, to us undici was the thing used inside Node for fetch. Doing some web searches, blogs anno 2024 still refer to the usual suspects, and do not mention undici.

For future work, no dependency trumps any dependency, unless there are very good reasons (such as working with promises instead of consuming events ourselves). In that space, now that we have learned about undici, we will certainly take it into account in future work.

But, to answer the question “Where would you expect that to be documented?”: well, first of all in blogs that I hit with a Google search about doing API calls inside a Node application. This would point me to undici @ npm. None of this is mentioned there. Next, we would have to choose between the contenders. For us, this is probably a bit late. We've been using axios and node-fetch for years now, and would need good reasons to change away from things that are proven in usage.

Now, your mention that we should stay away from using fetch in a Node-only application, such as a Lambda, is of a different order. That should be clear first, because no dependency trumps any dependency, unless there are very good reasons. I have not encountered that point before. So documenting that, and getting the message out, should then be the first step. If this is the official Node point of view, maybe the Node fetch documentation should mention it. But this is probably not the forum for that :-).

@mcollina (Member) commented Apr 17, 2024

I would really love it if the AWS Lambda team helped out on this one. We lack visibility into the Node.js distribution they are running. They are likely doing certain things to the Node.js runtime that are causing this.

This makes us believe these cases are bugs in undici, or in the AWS Lambda x86_64 nodejs20.x OS.

Given you are using Node.js for free but paying for AWS Lambda, I recommend you contact AWS support.


In all samples we examined, the error occurs in the first call to fetch in the invocation. We assume these calls never
left the Lambda environment, and they might occur on the first call to fetch after spawning a Lambda instance.

I suspect it's due to keep-alive connections. Lambda suspends the process and all its state after a response is sent. Undici keeps the connections alive, but when Lambda resumes the process we have no way of knowing whether we need to reset all the connections. There is no event for us to hook into on resume or suspend.

Try adding a connection: 'close' header to your requests. Alternatively, create a new undici.Client for each request and pass it as a dispatcher property to undici.fetch(), then .close() the client.
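
A minimal sketch of those two suggestions, keeping the headers from the call site in the issue otherwise unchanged; treat it as illustrative rather than tested:

// Hedged sketch of the two suggested workarounds (uri and authorization as in
// the call site quoted in the issue).
import { fetch, Client } from 'undici'

// Option 1: ask for the connection not to be kept alive.
const res1 = await fetch(uri, {
  headers: { accept: 'application/json', authorization, connection: 'close' }
})

// Option 2: a fresh Client per request, passed as the dispatcher, then closed.
const client = new Client(new URL(uri).origin)
try {
  const res2 = await fetch(uri, {
    dispatcher: client,
    headers: { accept: 'application/json', authorization }
  })
  // … consume res2 here …
} finally {
  await client.close()
}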

@jandppw (Author) commented Apr 17, 2024

@mcollina You are absolutely right that this might very well be an AWS issue, or an interop issue (as noted in the report). We just thought that sharing our observations publicly might help other people that might encounter the same issue (as in “no, I’m not crazy”).

AWS support will be made aware of the issue.

Your hint about a Lambda resuming seems plausible.

@jandppw (Author) commented Apr 17, 2024

@mcollina Actually, it turns out: no. There seems to be no easy and 0€ way to make AWS aware of the issue for this client.

Our main focus is getting stability back to our users.

I share your frustration about AWS not being involved, but there seems to be no easy way for us to get AWS on the case without paying extra €. The only way to get AWS involved seems to be public side channels, such as social media. This public report might be a first effort.

Any response from AWS now (if I were in their place anyway) would be that native fetch on Node 20 is still experimental, so go away.

Apart from all that, there are no ill feelings here regarding undici or Node. In any case, by the time AWS offers Node 22, there will have to be some resolution, since native fetch is no longer experimental as of Node 21.

In the meantime, this report might help other people realize that it is probably not a good idea to use native fetch in AWS Lambdas for the time being.

@mweberxyz (Contributor)

Out of curiosity, I attempted to replicate this in us-east-1 and couldn't.

I set up a CloudFront distribution with a Lambda named PrivateLambda (invoked via HTTPS through a function URL) as the origin. Then I set up a second Lambda named PublicLambda (with a function URL) that calls the CloudFront distribution 25 times in a loop.
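
For readers who do not open the attached template, a hedged sketch of what the PublicLambda handler described above could look like (DISTRIBUTION_URL and the handler shape are assumptions; the actual code is in the attached zip):

// Hedged sketch of PublicLambda: call the CloudFront distribution 25 times in a
// loop and report how many calls succeeded (the k6 check expects a body of "25").
export const handler = async () => {
  let ok = 0
  for (let i = 0; i < 25; i++) {
    const res = await fetch(process.env.DISTRIBUTION_URL)
    if (res.ok) ok++
  }
  return { statusCode: 200, body: String(ok) }
}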

I called PublicLambda repeatedly using k6, ramping concurrency up and down to suss out cold/warm start issues. Every request succeeded:

% k6 run test.js

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: test.js
        output: -

     scenarios: (100.00%) 1 scenario, 50 max VUs, 35m30s max duration (incl. graceful stop):
              * default: Up to 50 looping VUs for 35m0s over 7 stages (gracefulRampDown: 30s, gracefulStop: 30s)


     ✓ is status 200
     ✓ is body 25

     checks.........................: 100.00% ✓ 99290     ✗ 0
     data_received..................: 17 MB   8.1 kB/s
     data_sent......................: 7.4 MB  3.5 kB/s
     http_req_blocked...............: avg=149.91µs min=1µs      med=7µs      max=170.28ms p(90)=9µs      p(95)=9µs
     http_req_connecting............: avg=59.64µs  min=0s       med=0s       max=60.26ms  p(90)=0s       p(95)=0s
     http_req_duration..............: avg=668.06ms min=465.52ms med=625.45ms max=5.7s     p(90)=825.18ms p(95)=900.9ms
       { expected_response:true }...: avg=668.06ms min=465.52ms med=625.45ms max=5.7s     p(90)=825.18ms p(95)=900.9ms
     http_req_failed................: 0.00%   ✓ 0         ✗ 49645
     http_req_receiving.............: avg=136.27µs min=16µs     med=117µs    max=204.25ms p(90)=167µs    p(95)=199µs
     http_req_sending...............: avg=36.73µs  min=6µs      med=34µs     max=10.91ms  p(90)=48µs     p(95)=53µs
     http_req_tls_handshaking.......: avg=81.24µs  min=0s       med=0s       max=130.21ms p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=667.89ms min=465.38ms med=625.3ms  max=5.7s     p(90)=824.94ms p(95)=900.73ms
     http_reqs......................: 49645   23.638233/s
     iteration_duration.............: avg=668.46ms min=465.76ms med=625.79ms max=5.76s    p(90)=825.95ms p(95)=901.25ms
     iterations.....................: 49645   23.638233/s
     vus............................: 1       min=0       max=50
     vus_max........................: 50      min=50      max=50

Over 35 minutes this made 49,645 requests to PublicLambda, which means we successfully made 1,241,125 fetch calls.

Hopefully I did my math right and this little experiment only cost me a few dollars 😅

Attached cloudformation template for set-up and k6 test file:
replicate-undici-3133.zip

@metcoder95 (Member)

The issues shared seem to follow a pattern of constant connections and abrupt disconnections; I believe this is a similar issue to the one @mcollina describes. I'd suggest giving his suggestion a try and starting from there.

The Lambda runtime might be doing constant cleanups while resuming, causing intermittent connect issues.

@jandppw (Author) commented Apr 18, 2024

Yesterday we deployed a version that logs the stack trace of the cause, while working on a revert to node-fetch. Up until now 2 errors of the ETIMEDOUT variant have been logged, both with the same stack trace:

TypeError: fetch failed
    at node:internal/deps/undici/undici:12345:11
    at async getImplementation (/var/task/lib/dal/get.js:75:20)
{ cause: { errno: -110, code: 'ETIMEDOUT', syscall: 'write' } }
Error: write ETIMEDOUT
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16)
    at handleWriteReq (node:internal/stream_base_commons:58:21)
    at writeGeneric (node:internal/stream_base_commons:149:15)
    at Socket._writeGeneric (node:net:952:11)
    at Socket._write (node:net:964:8)
    at writeOrBuffer (node:internal/streams/writable:564:12)
    at _write (node:internal/streams/writable:493:10)
    at Writable.write (node:internal/streams/writable:502:10)
    at write (node:internal/deps/undici/undici:9269:18)
    at _resume (node:internal/deps/undici/undici:9172:33)
    at resume (node:internal/deps/undici/undici:9082:7)
    at [dispatch] (node:internal/deps/undici/undici:8327:11)
    at Client.Intercept (node:internal/deps/undici/undici:8000:20)
    at Client.dispatch (node:internal/deps/undici/undici:6606:44)
    at [dispatch] (node:internal/deps/undici/undici:6837:32)
    at Pool.dispatch (node:internal/deps/undici/undici:6606:44)
    at [dispatch] (node:internal/deps/undici/undici:9957:27)
    at Agent.Intercept (node:internal/deps/undici/undici:8000:20)
    at Agent.dispatch (node:internal/deps/undici/undici:6606:44)
    at node:internal/deps/undici/undici:10929:55
    at new Promise (<anonymous>)
    at dispatch (node:internal/deps/undici/undici:10929:16)
    at httpNetworkFetch (node:internal/deps/undici/undici:10824:73)
    at httpNetworkOrCacheFetch (node:internal/deps/undici/undici:10712:39)
    at httpFetch (node:internal/deps/undici/undici:10545:43)
    at schemeFetch (node:internal/deps/undici/undici:10469:18)
    at node:internal/deps/undici/undici:10342:26
    at mainFetch (node:internal/deps/undici/undici:10361:11)
    at fetching (node:internal/deps/undici/undici:10309:7)
    at fetch (node:internal/deps/undici/undici:10173:20)
    at fetch (node:internal/deps/undici/undici:12344:10)
    at fetch (node:internal/bootstrap/web/exposed-window-or-worker:79:16)
    at getImplementation (/var/task/lib/dal/get.js:75:26)
    …

These again were the first fetch calls during the invocation. The time between the fetch call and the logging of the error was <4ms.

@ajinkyashukla29dj

We are using the amazon-cognito-identity-js package inside an AWS Lambda function to talk to AWS Cognito. The Lambda function makes multiple Cognito calls. We see this error randomly for some of the calls.

{
    "cause": {
        "name": "ConnectTimeoutError",
        "code": "UND_ERR_CONNECT_TIMEOUT",
        "message": "Connect Timeout Error"
    }
}
Error: Network error
    at /var/task/node_modules/amazon-cognito-identity-js/lib/Client.js:117:15
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  code: 'NetworkError'
}

The Lambda function is configured to use the Node.js 18.x runtime. The earliest occurrence of this error seems to be around six months ago, but we need to investigate the AWS CloudWatch logs to see whether it occurred earlier too.

@mweberxyz (Contributor)

@ajinkyashukla29dj amazon-cognito-identity-js uses isomorphic-unfetch, which uses the globally defined fetch if available. Can you disable experimental fetch by setting the NODE_OPTIONS [1] environment variable in your Lambdas to --no-experimental-fetch [2] and see if your issue persists?

[1] https://docs.aws.amazon.com/lambda/latest/dg/runtimes-modify.html
[2] https://nodejs.org/api/cli.html#--no-experimental-fetch

@jandppw (Author) commented Apr 18, 2024

One more error came in, this time of the UND_ERR_SOCKET variant:

TypeError: fetch failed
    at node:internal/deps/undici/undici:12345:11
    at async getImplementation (/var/task/lib/dal/get.js:75:20)
{
  cause: {
    name: 'SocketError',
    code: 'UND_ERR_SOCKET',
    socket: {
      localAddress: '169.NNN.NN.N',
      localPort: NNNNN,
      remoteAddress: undefined,
      remotePort: undefined,
      remoteFamily: undefined,
      timeout: undefined,
      bytesWritten: 1185,
      bytesRead: 69979
    }
  }
}

cause stack:

SocketError: other side closed
    at TLSSocket.onSocketEnd (node:internal/deps/undici/undici:8903:26)
    at TLSSocket.emit (node:events:530:35)
    at endReadableNT (node:internal/streams/readable:1696:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

This again was the first fetch call during the invocation. The time between the fetch call and the logging of the error was <3ms.

@mweberxyz (Contributor) commented Apr 18, 2024

@jandppw I think your best course of action is to swap out native fetch for node-fetch, keeping everything else the same. If the errors go away, it might be an undici issue. If the issues persist, it is either a remote server issue, or at a lower level than undici.

btw: not an undici code contributor, just curious at this point.

@jandppw (Author) commented Apr 19, 2024

3 more errors of the ETIMEDOUT variant came in this morning, with the same stack trace.

A new version that no longer uses native fetch is now rolled out to production, so this is the last observation we can share.

@ajinkyashukla29dj

@mweberxyz Thank you for your input. I don't see the ConnectTimeoutError after setting the NODE_OPTIONS env variable to --no-experimental-fetch. However, I am seeing the following response even though the username and password are correct:

{
    "size": 0,
    "timeout": 0
}
ERROR	NotAuthorizedException: Incorrect username or password.
    at /var/task/node_modules/amazon-cognito-identity-js/lib/Client.js:136:19
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  code: 'NotAuthorizedException'
}

@mcollina removed the bug label Apr 19, 2024
@mweberxyz (Contributor)

@ajinkyashukla29dj Sorry, I can't help with that error, since it's coming from the amazon-cognito-identity-js library. It looks like the HTTP request succeeded, though!

Can you provide some more details about your Lambda so I can try to replicate the UND_ERR_CONNECT_TIMEOUT you encountered when using native fetch? Such as:

  • How often did the UND_ERR_CONNECT_TIMEOUT occur? Every request or just intermittently?
  • What AWS region?
  • Is the lambda running in a VPC? If so, public or private subnet? And is there any PrivateLink to AWS services defined in the VPC?
  • How is the lambda invoked (API Gateway, manual test from the console, etc.)?

@uNetworkingAB

For me this is not AWS related; it happens on macOS "all the time" (very often) for a very simple scraper project that does about 500 fetch calls. To confirm, I ported it to another implementation of fetch and it worked without stability issues there.

The issue I have is that ETIMEDOUT and similar errors are not propagated to my try/catch but instead just crash Node.js somewhere inside undici.

It can be reproduced with await fetch inside a try/catch leading to a crashed Node.js process.
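
A minimal sketch of that shape, with a placeholder URL; per the report, some ETIMEDOUT-style failures crash the process instead of reaching the catch block:

// Hedged repro sketch: the error should land in the catch block, but the report
// is that the process sometimes crashes inside undici instead.
try {
  const res = await fetch('https://example.com/') // placeholder URL
  console.log(res.status)
} catch (err) {
  console.error('caught:', err.cause ?? err)
}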

@mweberxyz (Contributor)

@uNetworkingAB Can you give a code sample? Knowing which remote servers you are having trouble with would help with replication.

@mcollina (Member)

Please verify if undici v6.14.1 fixes this.

@lucamezzalira

@jandppw my name is Luca, I work for AWS as a Serverless Specialist. I'd love to connect with you to review your challenge with the service team. Would you mind pinging me in private so we can arrange a chat?
