-
Bug description

My Next.js app is deployed to Vercel and uses a lambda route for a GraphQL server (apollo-server-micro) that is configured with Prisma + Nexus. Lambda cold starts on Vercel lead to slow queries that take approximately 7 seconds. I see the 7 seconds on the private deploy with project name "blogody". A typical cold start signature looks as follows:

x-vercel-id | cdg1::iad1::28qc9-1622127003386-1cc4995040e3

As I cannot share this repo publicly, I made a smaller example that still shows smaller but significant cold start times of approximately 2.5 seconds. I have not managed to find the influencing factors, and I hope Vercel can shed some light on it. Here is the deploy output for the serverless functions:
I see the cold starts after approximately 10 minutes of inactivity, but that could vary. I put some simple timestamps into the app, both on the client and the server. From those timestamps, you can see that in the case of a cold start, the total query time is dominated by the waiting time between query initiation and endpoint function invocation. Some screenshots from the example:

How to reproduce
Expected behavior

I know that cold starts cannot be fully eliminated, but cold start times of 2–7 seconds are a problem for me. I can accept a cold start time of roughly 1 second. Thus, I expect the following help from this issue:
I have also opened an issue with @prisma to see if the issue is amplified by that stack.

Additional information

You can find the example linked above. I'd be happy to provide more information if needed.
-
@styxlab The issue you experienced is caused by the database (in this case Nexus) not being optimised for serverless connections. Database connections cannot be shared between serverless invocations across cold boots. Therefore, each time your serverless function is called (while cold), a new database connection will need to be established. You can get around this by adding a pooler between your database and the serverless function or by switching to a serverless-friendly database. You can find more information about this over at https://vercel.com/docs/solutions/databases#connecting-to-your-database.
-
Thanks @williamli for looking into this issue. Unfortunately, connection pooling cannot explain the issue; I ruled that out already. Why? Because I took the database out of the example: there is not a single call to a database! In the example I simply return mocked data in the GraphQL resolver, which is where a real-world example would make a request to the database. Nexus is not a database, it is a GraphQL schema generator, and Prisma is an ORM, or model mapper. I include them in the example because they have an influence on the problem (maybe through lambda function bundle size, I don't know).
-
Just for the record, I enhanced the reproduction example with
-
Here are some additional findings:
As Vercel infra is basically a black box, I would very much appreciate some more insight into what determines the cold starts and what can be done to reduce them (both in user land and on Vercel's side). It would also be interesting to know why warming does not help in all cases.
-
If you develop with plain AWS, you can significantly decrease cold start time by increasing the function memory size (which also gives you more virtual CPU cores). I think you can also change the memory setting in Vercel.
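For reference, Vercel does let you raise the memory per function in vercel.json; the allowed values depend on your plan, and the glob pattern below is illustrative:

```json
{
  "functions": {
    "api/**/*.ts": {
      "memory": 1024
    }
  }
}
```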
-
@styxlab I'm having this issue as well; my backend uses Prisma + Nexus + Vercel. I'm using connection pooling, so I know it's not what @williamli mentioned in his comment. Have you made any more progress on this issue?
-
@nhuesmann This is still an unsolved problem for me, and that's why I still run my API endpoints on a DigitalOcean droplet (everything else on Vercel). The best I could do with Vercel lambdas was to call the endpoints every 3–4 minutes (warming), but even that didn't help reliably. It's also difficult to test that from different regions. I am planning to write an in-depth blog article about my findings but do not yet know when I will have the time for this.
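Vercel has since added first-class cron jobs, so a warming ping of this kind can now be configured declaratively in vercel.json (schedule granularity depends on your plan; the path below is a placeholder):

```json
{
  "crons": [
    {
      "path": "/api/warm",
      "schedule": "*/5 * * * *"
    }
  ]
}
```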
-
We're (https://github.com/gooditcollective) quite interested in this as well, since we build all of our clients' projects on Vercel. Specifically, we use a GraphQL function on the plain Node.js Vercel environment, made with Apollo Server. Even a completely minimal setup (no external connections to databases or similar, no extra code dependencies, just a plain Apollo Server initialisation and a single HTTP handler made with micro) boots up in about 1.5–2 seconds. We'd love to find a way to make it reasonable (I guess 200–500 ms would already be satisfactory). I wonder whether there are any suggestions or ideas from Vercel's team or the community. We'll try limiting function memory size, but I reckon that will have little effect, if any. Rewarming is something we will also do, but it feels like a broken solution and is unreliable. Is there anything we can try?
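For anyone trying to reproduce this, a minimal setup of the kind described might look like the sketch below (assuming apollo-server-micro v3 inside a Next.js API route; the schema is a placeholder):

```ts
// pages/api/graphql.ts — minimal sketch: no database, no extra dependencies
import { ApolloServer, gql } from 'apollo-server-micro';
import type { NextApiRequest, NextApiResponse } from 'next';

const typeDefs = gql`
  type Query {
    ping: String!
  }
`;

const resolvers = {
  Query: {
    // stubbed resolver: any cold-start cost here is pure init/bundle overhead
    ping: () => 'pong',
  },
};

const server = new ApolloServer({ typeDefs, resolvers });
const started = server.start(); // v3 requires start() before handling requests

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  await started;
  return server.createHandler({ path: '/api/graphql' })(req, res);
}

// Apollo parses the request body itself, so Next's body parser must be disabled
export const config = { api: { bodyParser: false } };
```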
-
@neoromantic I don't want to get in the way of a reply from @vercel, but it's good to see you reporting figures that correspond very well with my own observations (I am also using apollo-server-micro and tested with empty resolvers, no DB connection). I am also very interested in moving my temporary solution (GraphQL API endpoints on DO) back to Vercel, but the performance difference is really huge, as I am consistently getting <~100 ms there without worrying about cold starts. I am a bit puzzled as to why this topic does not get more attention; it seems to me that all apps using serverless functions would run into this issue sooner or later. In any case, a real solution would probably have to come from AWS, so maybe it's better addressed there?
-
If I remember correctly, Vercel's position and strategy is that their platform is very much cache-oriented. So the main use case for Vercel is not hosting real-time API functions, but generating a response and caching it so it can be delivered statically. Personally, I want to consider solutions like fly.io, which allows a multi-zone setup for a GraphQL server and a Redis cache backend, for example. But I have been on Vercel (called Zeit back then) since the very first versions, and I adore their ideology and wonderful support. So I'm very hopeful that we will at least get an understanding of how to manage cold boot times.
-
I am experiencing 10s+ cold starts with a 255 kB function; it's quite a deal breaker.
-
@timuric This sounds a bit high; with a 255 kB function I would expect cold start times of ~1 second. Did you make the following checks?
I missed the latter check initially; that's why I ended up with ~7 secs, because individual cold starts accumulated. Once you understand your access pattern, you can optimize. However, the barrier of ~1 second remains, which is still a big issue for me.
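A quick way to see whether cold starts are accumulating is to compare a client-side timestamp with one taken when the function actually starts executing, along the lines of this sketch (the invokedAt field is hypothetical and would have to be returned by your handler; clock skew between client and server makes this approximate):

```ts
// client-side probe: splits total latency into "wait before invocation" and the rest
const sentAt = Date.now();
const t0 = performance.now();

const resp = await fetch('/api/graphql', {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({ query: '{ __typename }' }),
});

// hypothetical: the handler records Date.now() on entry and echoes it back
const { invokedAt } = await resp.json();

console.log('wait before invocation (ms):', invokedAt - sentAt);
console.log('total round trip (ms):', performance.now() - t0);
```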
-
Just wanted to chime in here to confirm that we are also running into this exact same issue, with, in fact, the same stack causing it (apollo-server-micro with Nexus). As the OP already mentioned, this has nothing to do with the database, as we also ruled that out entirely (returning stubbed data performs exactly as poorly as it does with a database connection).
-
We are experiencing similar issues, though cold start times are shorter for us, at around 1.5 s. Something that surprised me was that for Next.js (which is what we use), API endpoints are bundled together up to a size of 50 MB. Therefore, despite having a number of separate API endpoints, they are actually bundled together with a size of ~30 MB. This is meant to reduce the number of cold starts and keep things warm. However, when there is a cold start (which happens quite frequently, as the API does not experience high traffic), it is long enough to cause issues for our application. I haven't tried creating a small endpoint to keep it warm yet, but I will try that next and see what effect it has.
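If the endpoints really do share one bundle, the warming endpoint can be trivial; something like this sketch should keep the whole shared function warm when pinged (the file name is illustrative):

```ts
// pages/api/warm.ts — no-op endpoint; pinging it keeps the shared lambda warm
import type { NextApiRequest, NextApiResponse } from 'next';

export default function handler(_req: NextApiRequest, res: NextApiResponse) {
  res.status(200).json({ ok: true, ts: Date.now() });
}
```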
-
A solution to the problem: https://vercel.com/docs/concepts/functions/edge-functions ? |
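Edge Functions do sidestep the Node.js lambda boot, though with a constrained runtime (Web APIs only, tight size limits). A minimal Next.js edge route looks roughly like this sketch (newer Next.js versions use runtime: 'edge'; Next 12.2 called it 'experimental-edge'):

```ts
// pages/api/edge-ping.ts — runs on the edge runtime, no Node.js APIs available
export const config = { runtime: 'edge' };

export default async function handler(_req: Request): Promise<Response> {
  return new Response(JSON.stringify({ ok: true }), {
    headers: { 'content-type': 'application/json' },
  });
}
```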
-
@piotrpawlik: I haven't noticed longer cold start times after 12.1.0. However, I am also experiencing accumulating cold start issues with on-demand revalidation. Unfortunately, this comes on top of the ~1 s cold start of the calling lambda function itself (hence accumulating). With some inevitable network latency, the cold start of a revalidate endpoint takes approximately 3 seconds in total. I experimented with warming, but you would have to trigger every edge server worldwide, so this is not a practical workaround. I am sad to say it, but cold start issues are the biggest bummer with Vercel/AWS lambda.
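For context, an on-demand revalidation endpoint is itself just another lambda, so it cold-starts like any other before it can do its work. A typical sketch (Next.js 12.2+; the path and secret are placeholders):

```ts
// pages/api/revalidate.ts — this function must cold-start before it can revalidate
import type { NextApiRequest, NextApiResponse } from 'next';

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  // guard the endpoint with a shared secret
  if (req.query.secret !== process.env.REVALIDATE_SECRET) {
    return res.status(401).json({ message: 'Invalid token' });
  }
  await res.revalidate('/blog/my-post'); // placeholder path
  return res.json({ revalidated: true });
}
```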
-
Currently having this problem.
-
We are experiencing long cold starts for Next.js SSR; the resulting bundles are roughly 260 B. From the logs we see an Init Duration of 4–5 s, which is far from an acceptable dynamic web response time. Is it possible to increase the memory size for SSR functions?
-
This has become a completely untenable issue for our application. API endpoints are basically useless on Vercel because of the cold start issue. Why is this not solved?
-
+1. API endpoints take way too long on a cold start; we will have to find a different solution. The lack of response here from the Vercel team is quite sad. I also find it strange that my API function size is 30 MB+ even though I only have a couple of small functions (and @next/bundle-analyzer is reporting them at 200 kB...). I love Vercel, but this is super disappointing.
-
Having the same issue with signin/signup API routes. I spent a while adjusting email generation, trialling templating, switching from SMTP to a REST API for mail, etc., but I have now discovered that although I see ~8 s for the first attempt, if I log out and straight back in, the second attempt is much faster (<~1 s, which is fine for me). So I guess my problem is not my emailing but the cold start behaviour? Reading the above, it seems I might significantly reduce the 8 s by flattening my API requests down to a single endpoint, as in the sketch below. I am not sure that will be ideal for DRY, but it would be worth it if it cuts 8 s down to 2 s, which would be a bit annoying but no longer terrible. Maybe edge functions will solve my problem; I will have to try and see if I can rewrite using those (my functions are small, so hopefully they will fit in the 1 MB limit).
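Flattening routes into a single function can be done with a catch-all API route, roughly as follows (the signin/signup actions are just stand-ins for the routes mentioned above):

```ts
// pages/api/[...route].ts — one function (one cold start) serving several actions
import type { NextApiRequest, NextApiResponse } from 'next';

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  const [action] = (req.query.route as string[]) ?? [];

  switch (action) {
    case 'signin':
      // ...real sign-in logic would go here
      return res.status(200).json({ ok: true, action });
    case 'signup':
      // ...real sign-up logic would go here
      return res.status(200).json({ ok: true, action });
    default:
      return res.status(404).end();
  }
}
```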
-
I wrote up ways to debug and detect the root issue behind Serverless Function performance decreases. Edit: Since this was posted, Prisma (commonly mentioned here as being used in Vercel Functions) has done significant work to improve cold starts. Ensure you are on the latest versions of both Prisma and Next.js. Next.js 14 also shipped cold start improvements, including up to 80% smaller functions in some instances.
-
Just as a follow-up, we ended up switching to a simple VPS for hosting our Next application and its database, away from Vercel (and also away from PlanetScale). I loved the super simple and straightforward integrations and optimizations that Vercel provided, but this issue was just untenable, and so we had to switch. The additional bonus is that the VPS hosting is a lot less expensive, so there's that.
-
+1, we also ended up moving away from Vercel and onto DO. The cold start times were just not acceptable for our application (or any application, IMO), and we simply do not have time to try to debug a black box. It would be great if Vercel would let us see how the serverless functions are bundled, so we could try to figure out why the functions are so huge. I did some experimenting, and a simple function that just returned a 200 status was somehow 30 MB+ (I even made sure the function was bundled separately via config). I realize this was probably some kind of error or misconfiguration on my end, but there is no way to debug and fix it. I would happily move back to Vercel if you provided a way to either a) have my API routes not be serverless functions at all, or b) debug these kinds of issues.
-
I also hit this problem. My reason for using … I have chosen not to address the cold start issue itself, but I am able to side-step the problem by switching to … Another way to reduce the impact of cold starts would be SWR: using caching and stale data to reduce the time to displaying useful content, with the fresh data loading in automatically when it's ready. This doesn't help if there's nothing in the cache or if stale data has no value, but it's another good option for my use case, where I don't expect the data to change often.
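On Vercel, the same stale-while-revalidate idea can also be applied at the CDN layer with cache headers; a cache hit is served without invoking the function at all, so it dodges the cold start entirely. A rough sketch (the max-age values are arbitrary):

```ts
// pages/api/cached-data.ts — the edge cache serves hits; the function only runs on misses
import type { NextApiRequest, NextApiResponse } from 'next';

export default function handler(_req: NextApiRequest, res: NextApiResponse) {
  // fresh for 60 s; after that, serve stale for up to 10 min while revalidating
  res.setHeader('Cache-Control', 's-maxage=60, stale-while-revalidate=600');
  res.status(200).json({ data: 'payload', generatedAt: Date.now() });
}
```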
-
On AWS, we usually just provision a little concurrency to avoid pure cold starts. Is there any reason why Vercel couldn't enable provisioned concurrency to help with this issue?
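For comparison, on plain AWS this is a single command against a published version or alias (the names below are placeholders); Vercel currently exposes no equivalent knob:

```sh
# keep two execution environments pre-initialised for the "prod" alias
aws lambda put-provisioned-concurrency-config \
  --function-name my-api-function \
  --qualifier prod \
  --provisioned-concurrent-executions 2
```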
-
I just wanted to share my experience and research, as I'm confused about why this problem isn't more widespread. I want to start by saying I love the work @leerob and the rest of the Vercel team are doing... but this has caused me some real headaches, and I've sunk a lot of time into trying to find a solution. So if there's any advice that can help alleviate this issue, it would be most welcome!

Account Login

Initial Request 🐌

This is the first interaction a user would have with the application; it's the first API call that is made. As you can see, when the site has been idle, the first request takes a fair bit of time.

Subsequent Request 🚀

Perhaps this is a Database / Prisma / tRPC Issue?

Lightweight "health-check" request

health-check.ts

Query the "lightweight" API endpoint roughly every 10 mins

Attempted Solutions

Choose the correct region for your functions: I am based in the UK, so I set the Serverless Function Region as follows:

Choose smaller dependencies inside your functions: There are zero dependencies in the "health check" function.

Use proper caching headers: Whilst this may improve the situation for repeat visitors, it doesn't fix the initial cold start.

Migrate to "Always-on" (Heroku) ✅: This does fix the issue; the response times are 1/10th, but there are tradeoffs as discussed here. I don't want to migrate to Heroku. I like Vercel, but I'm sure you can appreciate this issue makes it untenable.

Open Questions

First Load JS shared by all
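The contents of health-check.ts are not shown above, but a zero-dependency endpoint of the kind described would look roughly like this hypothetical reconstruction:

```ts
// pages/api/health-check.ts — hypothetical reconstruction; zero dependencies,
// so any latency on a cold hit is pure platform start-up cost
import type { NextApiRequest, NextApiResponse } from 'next';

export default function handler(_req: NextApiRequest, res: NextApiResponse) {
  res.status(200).json({ ok: true });
}
```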
-
Are there any guidelines on what bundle size to aim for to get a reasonable cold start time? 330 to 350 kB seems to consistently result in 6 to 6.5 seconds or so. 😭 Can Vercel do anything here? Is Vercel doing anything? (I've read the guide.) Thanks
-
@leerob I think there are two feature requests needed to alleviate this issue.
This will let all of us stay on Vercel 100% and not have to move to DO, etc., for hosting a non-serverless API
-
Adding my name to the pile of people having problems. I'm experiencing sluggish cold starts for API routes as well. I've set up a repo to experiment, https://github.com/perenstrom/database-timings-test, with two tests. The frontend makes five requests in a row, printing the timings of all of them. Everything is hosted in Frankfurt (API functions, data proxy, database).

Build output

Database example

NextJS frontend -> NextJS API route -> Prisma Data Proxy -> Render Postgres DB

The code looks as follows:

```ts
// pages/api/films
import { NextApiRequest, NextApiResponse } from 'next';
import { prismaContext } from 'lib/prisma';

const films = async (req: NextApiRequest, res: NextApiResponse) => {
  if (req.method === 'GET') {
    // time only the database call, so it can be separated from the
    // total request time measured on the client
    const startTime = performance.now();
    const result = await prismaContext.prisma.film.findMany({});
    const timing = performance.now() - startTime;

    return res.json({ data: result, timing });
  } else {
    res.status(404).end();
  }
};

export default films;
```

The database example shows the following timings. The "database" timings are the timings as seen in the code above, i.e. the time for the Prisma call, and the "total" timing is measured by the client: the amount of time the call to the API route takes. As you can see, the first request is really slow, and any subsequent calls are down to a reasonable time. The function logs in the Vercel dashboard show the following time for the first call, which matches the times measured in the frontend. I did a test without running Prisma Data Proxy, and that was even worse. So the proxy at least did something, but it's clear that cold starts are the main issue.

Simple example

The simple example does what many above have done: just return some JSON from the API route:

```ts
// pages/api/metrics — no database, no dependencies beyond Next itself
import { NextApiRequest, NextApiResponse } from 'next';

const metrics = async (req: NextApiRequest, res: NextApiResponse) => {
  if (req.method === 'GET') {
    return new Promise((resolve) => {
      res.status(200).json({ data: 'Hello world' });
      resolve('');
    });
  } else {
    res.status(404).end();
  }
};

export default metrics;
```

This gives the following timings. The same as for many before me: the cold start for this simple, simple function is still over 3 seconds. Any and all help is appreciated. If this is not solvable, I guess I have to do the same as the people above and pay for a server somewhere (I REALLY don't want to migrate away from Vercel; I do despise devops).