
This lib seems to block event loop #683

Open
chrisdostert opened this issue Aug 5, 2020 · 10 comments
Labels
api: cloudprofiler Issues related to the googleapis/cloud-profiler-nodejs API. 🚨 This issue needs some love. triage me I really want to be triaged.

Comments

@chrisdostert

Environment details

  • OS: alpine linux
  • Node.js version: 14
  • @google-cloud/profiler version: 4.0.2

Steps to reproduce

  1. start an app with blocked-at and @google-cloud/profiler wired up
  2. watch the logs and you'll see frequent blocking operations of between 2 and 8 seconds logged; for example:
Blocked for 7586.2085ms, operation started here: [
  '    at Profiler.collectProfile (/src/backend/node_modules/@google-cloud/profiler/src/profiler.ts:393:25)',
  '    at Profiler.runLoop (/src/backend/node_modules/@google-cloud/profiler/src/profiler.ts:379:36)',
  '    at listOnTimeout (internal/timers.js:551:17)',
  '    at processTimers (internal/timers.js:494:7)'
]
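The kind of detection blocked-at performs can be approximated with a dependency-free timer-drift check. Below is a minimal sketch of that idea (the `watchEventLoop` helper and its threshold are illustrative names, not part of blocked-at or the profiler):

```javascript
// Minimal event-loop block detector in the spirit of blocked-at (no deps):
// schedule a timer at a fixed interval and report whenever it fires late.
function watchEventLoop(thresholdMs, onBlocked, intervalMs = 100) {
  let last = process.hrtime.bigint();
  const timer = setInterval(() => {
    const now = process.hrtime.bigint();
    const driftMs = Number(now - last) / 1e6 - intervalMs;
    if (driftMs > thresholdMs) onBlocked(driftMs);
    last = now;
  }, intervalMs);
  timer.unref(); // don't keep the process alive just for the watcher
  return timer;
}

// Demo: block the loop synchronously for ~300ms and observe the report.
watchEventLoop(50, (ms) => console.log(`Blocked for ~${ms.toFixed(0)}ms`));
setTimeout(() => {
  const end = Date.now() + 300;
  while (Date.now() < end) {} // synchronous busy-wait blocks the loop
}, 150);
```

blocked-at itself additionally captures the stack of the operation that started the block (as in the log above), which a plain timer-drift check cannot do.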
@product-auto-label product-auto-label bot added the api: cloudprofiler Issues related to the googleapis/cloud-profiler-nodejs API. label Aug 5, 2020
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Aug 6, 2020
@nolanmar511
Contributor

nolanmar511 commented Aug 10, 2020

I'm not able to reproduce multi-second blocks of the event loop.

I was able to sometimes get long blocks (up to ~750ms) at this same location, but those did not occur consistently -- ~20ms of blocking was more common.

Here's an example of what was logged; note that the blocking was always logged after "Successfully created profile WALL.":

@google-cloud/profiler Successfully created profile WALL.
Blocked for 742.9036171875ms, operation started here: [
  '    at Profiler.collectProfile (/node_modules/@google-cloud/profiler/build/src/profiler.js:298:31)',
  '    at Profiler.runLoop (/node_modules/@google-cloud/profiler/build/src/profiler.js:285:40)',
  '    at listOnTimeout (internal/timers.js:549:17)',
  '    at processTimers (internal/timers.js:492:7)'
]

Even though the line numbers in the stack trace suggest that the long delay may occur at the line which sends the CreateProfile request, the fact that I only see this when a wall profile is being collected suggests it may be something about the collection of wall profiles.

@chrisdostert -- If you would be able to enable debug logging (documented here; if modifying the source code isn't feasible, setting the environment variable GCLOUD_PROFILER_LOGLEVEL will also work) and check where within the logs the messages about long blocks are printed, that would be helpful. I'm hoping that I'm reproducing the issue you're observing, but it would be nice to confirm.
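For reference, the environment-variable route looks something like this (a hypothetical example: `app.js` is a placeholder for the real entrypoint, and the assumption that 4 corresponds to debug-level output should be checked against the version in use):

```shell
# Raise the profiler's logging verbosity via the environment before starting
# the app (assuming level 4 means debug output in this version).
export GCLOUD_PROFILER_LOGLEVEL=4
node app.js   # app.js is a placeholder for the actual entrypoint
```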

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Aug 10, 2020
@nolanmar511
Contributor

I added additional logging to local versions of cloud-profiler-nodejs and pprof-nodejs.

It looks like the blocking happens just after the call into native code to begin profile collection (here in pprof-nodejs).

I definitely see longer blocks on alpine-linux than I do when running in a non-alpine linux docker image (30-80ms vs. ~300ms).

My next step is to make a smaller reproducer which depends only on pprof-nodejs (since that seems to be where the issue is).

@nolanmar511
Contributor

Yesterday, I added logging throughout pprof-nodejs, and was able to confirm that sometimes the method used to start the V8 CPU profiler takes a long time to return:

05:22:38.187643595: Calling cpuProfiler->StartProfiling
05:22:38.411110419: Finished calling cpuProfiler->StartProfiling
05:22:38.412645170: Finished calling startProfiling
05:22:38.414679435: Thu Aug 13 2020 05:22:38 GMT+0000 (Coordinated Universal Time): Blocked for 228.049232421875ms, operation started here: [
05:22:38.416546091: '    at Profiler.collectProfile (/cloud-profiler-nodejs/build/src/profiler.js:298:31)',
05:22:38.417701251: '    at Profiler.runLoop (/cloud-profiler-nodejs/build/src/profiler.js:285:40)',
05:22:38.419203220: '    at listOnTimeout (internal/timers.js:551:17)',
05:22:38.420860367: '    at processTimers (internal/timers.js:494:7)'
05:22:38.422373919: ] PROMISE undefined

The logging for "Calling cpuProfiler->StartProfiling" and "Finished calling cpuProfiler->StartProfiling" was added immediately around the line on which the V8 CPU profiler is started.
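In JavaScript, the same bracketing technique looks roughly like this (`timeSync` is an illustrative helper name, not something from either library):

```javascript
// Bracket a synchronous call with high-resolution timestamps and log how
// long it blocked, mirroring the logging added around StartProfiling.
function timeSync(label, fn) {
  const t0 = process.hrtime.bigint();
  const result = fn();
  const ms = Number(process.hrtime.bigint() - t0) / 1e6;
  console.log(`${new Date().toISOString()}: ${label} took ${ms.toFixed(3)}ms`);
  return result;
}

// Example: time a deliberately slow synchronous operation.
timeSync('busy-wait', () => {
  const end = Date.now() + 50;
  while (Date.now() < end) {} // synchronous busy-wait
});
```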

I'm still working on creating a smaller reproducer (something which depends only on pprof-nodejs).

@nolanmar511
Contributor

I'm actually less worried about a reproducer; I think I should look into whether I can make all calls into C++ in pprof-nodejs asynchronous.

Then, issues like this won't block the event loop.

@nolanmar511
Contributor

nolanmar511 commented Aug 13, 2020

Making the call to start the V8 CPU profiler asynchronous isn't straightforward.

This NAN example has the following comment:

  // Executed inside the worker-thread.
  // It is not safe to access V8, or V8 data structures
  // here, so everything we need for input and output
  // should go on `this`.
  void Execute () {
...

To make starting the CPU profiler asynchronous (following traditional patterns), we would need to start the V8 profiler within the "Execute" method. So, we can't use standard NAN helpers to call cpuProfiler->StartProfiling asynchronously.

It still might be possible to do this... Maybe we can spawn a C++ thread which calls StartProfiling and wait for that thread to finish within Execute.

Before I continue looking into making our call to the V8 profiler asynchronous, though, I'm going to look a bit into the V8 CPU profiler itself.

@nolanmar511
Contributor

@psmarshall -- Could it be expected that CpuProfiler::StartProfiling sometimes takes on the order of 100ms-1s?

If it's within the range of expected behaviors, then I really want to make our call to start the CPU profiler asynchronous. But if it's not expected, then I might be better off looking at optimizing CpuProfiler::StartProfiling.

@aalexand
Contributor

I think CpuProfiler::StartProfiling does some JIT code logging / dumps, so that may need to be optimized. I think we touched on this in b/112464936#comment12 before. It would be good to profile the StartProfiling function with Linux perf or a similar profiler, using a node build with debug information (but an optimized binary, of course).

@psmarshall

Yes, it's expected that CpuProfiler::StartProfiling takes a few hundred ms; we've seen that before for big sites, e.g. facebook.com. I am pretty sure the slow part is where we iterate the heap and log some metadata about each function, in ExistingCodeLogger::LogCompiledFunctions.

I don't remember ever trying to optimize the startup so there could be wins to find there.

We do all this logging on startup on the main thread because we need to access the V8 heap and copy all of the metadata we use for symbolizing functions during profiling into an off-heap data structure; that structure is then accessed from a background thread during profiling to keep the overhead of profiling itself low. Given this work has to happen on the main thread, I don't think making it async is an option.

@psmarshall

Here is a repro just in V8 tests BTW: https://chromium-review.googlesource.com/c/v8/v8/+/2366695

Profiler startup takes ~4.8 seconds with 500k functions on the heap.

@aalexand
Contributor

nodejs/diagnostics#444 is related to this I think.
