Issues retrieving code coverage for a large codebase #5554

Closed
amtrack opened this issue Apr 24, 2024 · 11 comments

@amtrack

amtrack commented Apr 24, 2024

Summary

Retrieving the code coverage for a large codebase is very slow and uses a lot of RAM.

In our case, the unit tests only take 20 minutes to run, but retrieving the code coverage takes 30 minutes and even fails with a JavaScript heap out of memory error in some environments.

Steps To Reproduce

Create a Scratch Org and deploy some generated Apex Classes (see this repository).

sf org create scratch -f config/project-scratch-def.json -a cc -d
sf project deploy start

Run the unit tests:

sf apex run test -l RunLocalTests --output-dir test-results --wait 60 --code-coverage

Expected result

After the unit tests have finished, I expect it will not take more than 5 minutes to retrieve the code coverage.

Actual result

Retrieving the code coverage either takes very long (30 minutes) or fails with a JavaScript heap out of memory error in some environments.

System Information

{
  "architecture": "darwin-arm64",
  "cliVersion": "@salesforce/cli/2.37.4",
  "nodeVersion": "node-v20.10.0",
  "osVersion": "Darwin 23.4.0",
  "rootPath": "/Users/john.doe/.config/yarn/global/node_modules/@salesforce/cli",
  "shell": "zsh",
  "pluginVersions": [
    "@oclif/plugin-autocomplete 3.0.13 (core)",
    "@oclif/plugin-commands 3.2.2 (core)",
    "@oclif/plugin-help 6.0.20 (core)",
    "@oclif/plugin-not-found 3.1.2 (core)",
    "@oclif/plugin-plugins 5.0.7 (core)",
    "@oclif/plugin-search 1.0.20 (core)",
    "@oclif/plugin-update 4.2.4 (core)",
    "@oclif/plugin-version 2.0.16 (core)",
    "@oclif/plugin-warn-if-update-available 3.0.15 (core)",
    "@oclif/plugin-which 3.1.7 (core)",
    "@salesforce/cli 2.37.4 (core)",
    "apex 3.1.4 (core)",
    "auth 3.5.5 (core)",
    "data 3.2.5 (core)",
    "deploy-retrieve 3.5.6 (core)",
    "info 3.1.3 (core)",
    "limits 3.2.1 (core)",
    "marketplace 1.1.1 (core)",
    "org 4.0.4 (core)",
    "packaging 2.3.0 (core)",
    "schema 3.2.1 (core)",
    "settings 2.1.2 (core)",
    "sobject 1.2.2 (core)",
    "source 3.2.3 (core)",
    "telemetry 3.1.18 (core)",
    "templates 56.1.1 (core)",
    "trust 3.5.4 (core)",
    "user 3.4.3 (core)"
  ]
}

Additional information

This is probably a consequence of the fix for an issue with incomplete code coverage reports, which was released in @salesforce/cli v2.22.7.
We hadn't noticed the incomplete coverage reports, but I can now reproduce this with an old version.

  1. Debug logs from a Mac when retrieving the code coverage succeeds but takes 30 minutes:

    $ sf apex test get --code-coverage --output-dir test-results --test-run-id 7076s000023eTl5 --dev-debug
    ...
    [18:02:55.648] DEBUG (sf:elapsedTime): CodeCoverage.fetchResults - enter
    [18:03:17.554] DEBUG (sf:connection): request
        method: "GET"
        url: "https://efficiency-velocity-8823-dev-ed.scratch.my.salesforce.com/services/data/v60.0/tooling/query/0r8xx2sNrVvXQ7WAUW-2000"
        headers: {
        "content-type": "application/json",
        "user-agent": "sfdx toolbelt:"
        }
    [18:03:38.407] DEBUG (sf:connection): request
        method: "GET"
        url: "https://efficiency-velocity-8823-dev-ed.scratch.my.salesforce.com/services/data/v60.0/tooling/query/0r8xx2sNrVokQDMAU2-4000"
        headers: {
        "content-type": "application/json",
        "user-agent": "sfdx toolbelt:"
        }
    [18:03:39.872] DEBUG (sf:connection): request
        method: "GET"
        url: "https://efficiency-velocity-8823-dev-ed.scratch.my.salesforce.com/services/data/v60.0/tooling/query/0r8xx2sNrVvXQ7WAUW-4000"
        headers: {
        "content-type": "application/json",
        "user-agent": "sfdx toolbelt:"
        }
    ...
    [18:31:50.203] DEBUG (sf:connection): request
        method: "GET"
        url: "https://efficiency-velocity-8823-dev-ed.scratch.my.salesforce.com/services/data/v60.0/tooling/query/0r8xx2sNrVokQDMAU2-158000"
        headers: {
        "content-type": "application/json",
        "user-agent": "sfdx toolbelt:"
        }
    [18:32:06.048] DEBUG (sf:elapsedTime): CodeCoverage.fetchResults - exit
        elapsedTime: 1750444.311667
    ...
    Error (1): Invalid string length
    
    *** Internal Diagnostic ***
    
    RangeError: Invalid string length
        at JSON.stringify (<anonymous>)
        at stringify (/Users/john.doe/.config/yarn/global/node_modules/@salesforce/plugin-apex/node_modules/@salesforce/apex-node/lib/src/tests/utils.js:31:17)
        at TestService.writeResultFiles (/Users/john.doe/.config/yarn/global/node_modules/@salesforce/plugin-apex/node_modules/@salesforce/apex-node/lib/src/tests/testService.js:211:48)
        at descriptor.value (/Users/john.doe/.config/yarn/global/node_modules/@salesforce/plugin-apex/node_modules/@salesforce/apex-node/lib/src/utils/elapsedTime.js:57:48)
        at TestReporter.report (file:///Users/john.doe/.config/yarn/global/node_modules/@salesforce/plugin-apex/lib/reporters/testReporter.js:34:31)
        at Test.run (file:///Users/john.doe/.config/yarn/global/node_modules/@salesforce/plugin-apex/lib/commands/apex/get/test.js:64:29)
        at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
        at async Test._run (/Users/john.doe/.config/yarn/global/node_modules/@oclif/core/lib/command.js:311:22)
        at async Config.runCommand (/Users/john.doe/.config/yarn/global/node_modules/@salesforce/cli/node_modules/@oclif/core/lib/config/config.js:433:25)
        at async run (/Users/john.doe/.config/yarn/global/node_modules/@salesforce/cli/node_modules/@oclif/core/lib/main.js:92:16)
    ...
    
  2. JavaScript heap out of memory error on a CI machine with 8GB RAM and heap size increased from 2GB default to 4GB using NODE_OPTIONS=--max_old_space_size=4096.

    [2024-04-19T08:54:08.222Z] + npm run --silent test
    [2024-04-19T09:45:53.046Z]
    [2024-04-19T09:45:53.046Z] <--- Last few GCs --->
    [2024-04-19T09:45:53.046Z]
    [2024-04-19T09:45:53.046Z] [30942:0x5853b10]  3076338 ms: Mark-Compact 4019.7 (4133.0) -> 4004.6 (4133.5) MB, 6141.71 / 0.00 ms  (average mu = 0.113, current mu = 0.025) allocation failure; scavenge might not succeed
    [2024-04-19T09:45:53.046Z] [30942:0x5853b10]  3082724 ms: Mark-Compact 4020.4 (4133.5) -> 4005.3 (4134.3) MB, 6250.92 / 0.00 ms  (average mu = 0.070, current mu = 0.021) allocation failure; scavenge might not succeed
    [2024-04-19T09:45:53.046Z]
    [2024-04-19T09:45:53.046Z]
    [2024-04-19T09:45:53.046Z] <--- JS stacktrace --->
    [2024-04-19T09:45:53.046Z]
    [2024-04-19T09:45:53.046Z] FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
    [2024-04-19T09:45:53.046Z] ----- Native stack trace -----
    [2024-04-19T09:45:53.046Z]
    [2024-04-19T09:45:53.046Z]  1: 0xb84bd6 node::OOMErrorHandler(char const*, v8::OOMDetails const&) [node]
    [2024-04-19T09:45:53.046Z]  2: 0xefec30 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node]
    [2024-04-19T09:45:53.046Z]  3: 0xefef17 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node]
    [2024-04-19T09:45:53.046Z]  4: 0x1110925  [node]
    [2024-04-19T09:45:53.046Z]  5: 0x1110eb4 v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [node]
    [2024-04-19T09:45:53.046Z]  6: 0x1127da4 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::internal::GarbageCollectionReason, char const*) [node]
    [2024-04-19T09:45:53.046Z]  7: 0x11285bc v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
    [2024-04-19T09:45:53.046Z]  8: 0x10fe8c1 v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
    [2024-04-19T09:45:53.046Z]  9: 0x10ffa55 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
    [2024-04-19T09:45:53.046Z] 10: 0x10dd0a6 v8::internal::Factory::NewFillerObject(int, v8::internal::AllocationAlignment, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
    [2024-04-19T09:45:53.046Z] 11: 0x1537e36 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node]
    [2024-04-19T09:45:53.046Z] 12: 0x7fdaabed9ef6
    [2024-04-19T09:45:53.046Z] test.sh: line 17: 30942 Aborted                 (core dumped) sf apex run test -l RunLocalTests --output-dir test-results --wait 60 --code-coverage
    

Thank you for filing this issue. We appreciate your feedback and will review the issue as soon as possible. Remember, however, that GitHub isn't a mechanism for receiving support under any agreement or SLA. If you require immediate assistance, contact Salesforce Customer Support.

@mshanemc mshanemc transferred this issue from forcedotcom/cli Apr 24, 2024
@mshanemc
Contributor

Moved this since it's in the VS Code team's Apex library.

@peternhale
Contributor

@amtrack we are aware of the issue around volume and additional time to process code coverage. I see that you have already tried bumping max heap space to no avail. I also see that the machine being used for CI/CD is not very robust.

Given the limited memory on the CI/CD machine, our recommendation would be to break up the test runs into smaller groups. The CLI supports running Apex Test Suites, as well as taking a list of Apex test classes to run.
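
For illustration only (the suite and class names below are hypothetical), smaller runs could look something like this:

sf apex run test --suite-names SmallSuiteA --code-coverage --output-dir test-results --wait 60
sf apex run test --class-names AccountServiceTest,OrderServiceTest --code-coverage --output-dir test-results --wait 60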

Thank you for the example project that challenges both the time and space dimensions of this issue. We also use similar configurations for our testing. The largest of those produces almost a million code coverage records and requires 32GB of heap space.

We are looking into alternatives to help minimize the memory consumption.

@amtrack
Author

amtrack commented Apr 25, 2024

@peternhale Thanks for your detailed response!
Good to know that you are already exploring this topic.

I'm wondering if you at Salesforce could eventually make the ApexCodeCoverage object available via the Bulk API / Bulk API 2.0. That could probably solve both the time and space challenges (at least on the client).

Apparently this is possible for the MetadataComponentDependency Tooling API object:

Using Bulk API 2.0, you can query the MetadataComponentDependency Tooling API object and retrieve up to 100,000 records in a single query.
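
For reference, a Bulk API 2.0 query job is created with a plain REST call roughly like the one below (instance URL and token are placeholders; whether the endpoint accepts a Tooling object such as MetadataComponentDependency is exactly what's unclear to me):

curl -X POST "$INSTANCE_URL/services/data/v60.0/jobs/query" \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"operation": "query", "query": "SELECT MetadataComponentName, RefMetadataComponentName FROM MetadataComponentDependency"}'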

@peternhale
Contributor

@amtrack It would be nice to have the ApexCodeCoverage object available in Bulk API 2.0. I don't believe either of the bulk endpoints supports Tooling objects.

As for improving fetch time, the Coverage field is a blob of sorts that stores two arrays: covered and uncovered lines. When it is included in a query, fetch times go up dramatically due to the field's complexity.
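
To give a sense of the volume (line numbers below are made up), each ApexCodeCoverage record carries both arrays in its Coverage value, so a query like the following returns that structure for every test-method/class pair:

sf data query --use-tooling-api --json -q "SELECT ApexClassOrTrigger.Name, TestMethodName, Coverage FROM ApexCodeCoverage LIMIT 1"

"Coverage": { "coveredLines": [1, 2, 3, 5, 8], "uncoveredLines": [4, 6, 7] }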

@amtrack
Author

amtrack commented Apr 25, 2024

I don't believe either of the bulk endpoints supports Tooling objects.

I honestly believe the same, but the docs say that Bulk API 2.0 does support the MetadataComponentDependency Tooling API object.

I wanted to test it myself but quickly failed:

$ sf data query -q "SELECT Id FROM MetadataComponentDependency" --use-tooling-api --bulk
Error (2): The following errors occurred:
  --bulk=true cannot also be provided when using --use-tooling-api
  --use-tooling-api=true cannot also be provided when using --bulk
See more help with --help

Looking at the code in the plugin-data repo, I found that it isn't as easy as just removing the flag constraint.
I've given up for now but am wondering why the docs make such a claim.

@peternhale
Contributor

The errors you are seeing come from the CLI query command rather than from the Bulk API. As of now, bulk and tooling don't mix.

@peternhale
Contributor

@amtrack have you made any progress in getting the tests w/coverage to run on your CI/CD machine?

@amtrack
Author

amtrack commented May 6, 2024

@peternhale thanks for asking!
Unfortunately, 8GB of RAM was not enough, and scaling up the paid cloud infrastructure even further just because of this is not an option for us.

After trying lots of things and reading https://developer.salesforce.com/blogs/developer-relations/2012/11/how-code-coverage-works we're probably going to switch to Aggregated Code Coverage.
This works fine in terms of speed and memory.
I think there is a small bug in sf apex run test when Aggregated Code Coverage is enabled: the coverage per class isn't queried or printed.
I'll create an issue when I find time.
But I was able to query ApexCodeCoverageAggregate for the few hundred classes with a simple and fast Tooling API query, and that seems to give us the info we need.
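
Something along these lines (the exact field list is illustrative):

sf data query --use-tooling-api -q "SELECT ApexClassOrTrigger.Name, NumLinesCovered, NumLinesUncovered FROM ApexCodeCoverageAggregate"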

@peternhale
Contributor

@amtrack Thanks for getting back to me. We do have work on the backlog to better handle large test runs. Please file the issue you mentioned once you have gathered the details.

I would like to close this issue if it is ok with you.

@amtrack
Author

amtrack commented May 7, 2024

@peternhale OK, I'm looking forward to testing it again once better handling of large test runs is implemented.

@amtrack amtrack closed this as completed May 7, 2024