
[APM] Requesting trace sample of a dependency operation from large time interval results in browser tab with Kibana to hang #178985

Closed
ablnk opened this issue Mar 19, 2024 · 15 comments · Fixed by #183802
Labels: apm:dependencies-ui · bug (Fixes for quality problems that affect the customer experience) · Team:obs-ux-infra_services (Observability Infrastructure & Services User Experience Team)

Comments


ablnk commented Mar 19, 2024

Version:
Serverless project v8.14.0
Stateful deployment v8.14.0-SNAPSHOT

Description:
When requesting a trace sample of a dependency operation, within or outside the boost window, the result doesn't return, which causes the browser tab with Kibana to hang and become unresponsive. Chrome shows high memory usage (~5.2 GB).

Steps to reproduce:

  1. Go to Applications - Dependencies.
  2. Filter data by the last 30 days (Note: I also reproduced it with shorter time intervals, such as 3 days).
  3. Select a dependency with ~1000 tpm throughput.
  4. Open the "Operations" tab.
  5. Select a span with the highest throughput.
  6. Wait for the trace sample to load.

Expected behavior:
Trace sample is loaded.

Screenshots (if relevant):
memoryusage (screenshot)

chrome_vVTaU0Z61o.mp4

Logs:
I found the following errors:

Kibana

Error decoding JSON: json: cannot unmarshal string into Go value of type map[string]interface {}

ES

java.io.IOException: Failed to delete blobs [e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114230, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114233, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114232, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114235, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114234, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114237, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114236, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114242, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114243, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114244]
	at org.elasticsearch.repositories.s3.S3BlobStore.deleteBlobsIgnoringIfNotExists(S3BlobStore.java:334)
	at org.elasticsearch.stateless@8.14.0/co.elastic.elasticsearch.stateless.objectstore.ObjectStoreService$ShardFilesDeleteTask.doRun(ObjectStoreService.java:799)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null; Request ID: HNAWF6D6Y1S4WTGS; S3 Extended Request ID: udLz1kUEPbMHAfUa5mc2ZNWuyrPKdHwm9UbGZMpJJvDUtZknmMaz0NdC0q6voJWzcaMhubfwllE=; Proxy: null), S3 Extended Request ID: udLz1kUEPbMHAfUa5mc2ZNWuyrPKdHwm9UbGZMpJJvDUtZknmMaz0NdC0q6voJWzcaMhubfwllE=
	at com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:2345)
	at org.elasticsearch.repositories.s3.S3BlobStore.deletePartition(S3BlobStore.java:345)
	at org.elasticsearch.repositories.s3.S3BlobStore.lambda$deleteBlobsIgnoringIfNotExists$1(S3BlobStore.java:327)
	at org.elasticsearch.repositories.s3.SocketAccess.lambda$doPrivilegedVoid$0(SocketAccess.java:46)
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:319)
	at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedVoid(SocketAccess.java:45)
	at org.elasticsearch.repositories.s3.S3BlobStore.deleteBlobsIgnoringIfNotExists(S3BlobStore.java:318)
@ablnk ablnk added bug Fixes for quality problems that affect the customer experience Team:APM All issues that need APM UI Team support labels Mar 19, 2024
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:APM)

@ablnk ablnk changed the title [APM] Requesting trace sample a dependency operation from large time interval results in browser tab with Kibana to hang [APM] Requesting trace sample of a dependency operation from large time interval results in browser tab with Kibana to hang Mar 19, 2024
@chrisdistasio

I'm not sure this is related to serverless or the boost window. My hypothesis is that this could be recreated pretty easily in any deployment.

@ablnk
Author

ablnk commented Mar 26, 2024

@chrisdistasio I've just tried it on a stateful deployment with similar data and reproduced it. I noted the version in the description.

@smith smith added the Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team label Mar 29, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@smith smith added apm:dependencies-ui needs-refinement Needs PM's to refine scope and removed Team:APM All issues that need APM UI Team support labels Mar 29, 2024
@chrisdistasio

chrisdistasio commented Apr 8, 2024

Looking at the video, I would suggest the first problem we need to look at is the fact that it took over a minute to render the inventory list of dependencies for 30 days.

@cauemarcondes
Contributor

I see that the current API, /internal/apm/dependencies/top_dependencies, might return too much data. It returns current/previous stats along with the time series.

We should break it down as we do on the service inventory page: first fetch the top dependencies with their summary statistics, then, in a follow-up request, fetch the time series only for the dependencies visible on the screen. As the user navigates, we keep fetching only the time series for the visible items (see the sketch below).
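
For illustration, a minimal sketch of that split in TypeScript — the `detailed_statistics` route, payload shape, and types below are assumptions, not the actual Kibana APM API:

```ts
// Hypothetical types for illustration; not the actual Kibana APM contracts.
interface DependencySummary {
  dependencyName: string;
  latency: number;
  throughput: number;
  failedTransactionRate: number;
}

interface DependencyTimeseries {
  dependencyName: string;
  timeseries: Array<{ x: number; y: number | null }>;
}

// Phase 1: fetch only the summary statistics for the top dependencies.
async function fetchTopDependencies(start: string, end: string): Promise<DependencySummary[]> {
  const res = await fetch(
    `/internal/apm/dependencies/top_dependencies?start=${start}&end=${end}`
  );
  return res.json();
}

// Phase 2: fetch time series only for the rows currently visible on screen,
// re-invoked as the user scrolls and new rows enter the viewport.
async function fetchVisibleTimeseries(
  visibleDependencyNames: string[],
  start: string,
  end: string
): Promise<DependencyTimeseries[]> {
  // Assumed follow-up endpoint, mirroring the service inventory pattern.
  const res = await fetch('/internal/apm/dependencies/detailed_statistics', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ dependencyNames: visibleDependencyNames, start, end }),
  });
  return res.json();
}
```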

@dgieselaar
Member

@cauemarcondes you probably want to look at a trace first. The underlying requests are a little different from other places; we need to jump through some hoops to be able to resolve a destination service name from an exit span metric.

@crespocarlos crespocarlos self-assigned this May 8, 2024
@crespocarlos
Contributor

> Looking at the video, I would suggest the first problem we need to look at is the fact that it took over a minute to render the inventory list of dependencies for 30 days.

There might be some improvement on that front after #178979 is merged.

There is probably also some optimization that needs to be done on the frontend. The tab that loads the page is constantly crashing and becoming unresponsive; the 5.2 GB memory usage shown in the screenshot is a huge number.

@crespocarlos
Contributor

crespocarlos commented May 13, 2024

The trace sample component doesn't necessarily need to render data for a large time range to show performance issues.

I've tested it with a 2h range, and despite the API responding relatively fast, the trace sample component takes too much time to render the data. While it does so, the page becomes unresponsive for a short while and the tab consumes ~2.9 GB of memory.

trace_slow_2h.mov

The profiler shows the evolution of memory consumption (see screenshot).

Testing it with a 4h range crashes the tab:

trace_sample_unresponsive.mov

The performance seems to worsen as the trace object becomes more complex. After digging into the code, I found what might be the two main performance detractors:

1 - getWaterfallMaxLevel - a recursive function that can consume a lot of heap memory depending on how complex the structure is; it could be optimized with a BFS traversal (see the sketch after this list).

2 - AccordionWaterfall - another recursive call, and it renders everything at once. Instead, it could render the trace data on demand as the user scrolls down the page.

There might be other micro-optimizations that would make the component render faster.
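
For reference, a minimal sketch of the BFS idea from item 1 — the `WaterfallItem` type and the `childrenByParentId` lookup are illustrative, not the actual Kibana waterfall model:

```ts
interface WaterfallItem {
  id: string;
}

// Computes the deepest nesting level iteratively. An explicit queue bounds
// the work by the number of items instead of growing the call stack the way
// a recursive traversal does.
function getWaterfallMaxLevel(
  root: WaterfallItem,
  childrenByParentId: Map<string, WaterfallItem[]>
): number {
  let maxLevel = 0;
  const queue: Array<{ item: WaterfallItem; level: number }> = [{ item: root, level: 0 }];
  let head = 0; // index-based dequeue avoids O(n) Array.shift calls

  while (head < queue.length) {
    const { item, level } = queue[head++];
    maxLevel = Math.max(maxLevel, level);
    for (const child of childrenByParentId.get(item.id) ?? []) {
      queue.push({ item: child, level: level + 1 });
    }
  }
  return maxLevel;
}
```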

@crespocarlos
Contributor

Parsing the JSON was fast enough, so we can move forward with the fix by improving the performance on the frontend.

@crespocarlos
Contributor

crespocarlos commented May 29, 2024

@chrisdistasio @paulb-elastic, I've managed to refactor the trace sample and optimize the performance, but another problem surfaced:

image

In large date ranges, the waterfall miscalculates the position of the elements in the row.

image

I'll open a new ticket for this, but it doesn't seem like something related to serverless performance.

@paulb-elastic
Contributor

@crespocarlos to clarify, does this only happen because of the changed implementation?

If so, the concern with addressing this separately is that this new (incorrect) behaviour will impact users as soon as it's merged, until a subsequent fix is implemented and merged; in that case it would be better to fix it in the same PR (have I understood correctly?).

@crespocarlos
Contributor

@paulb-elastic, it's an existing issue. It can be seen on the edge cluster:

https://edge-oblt.kb.us-west2.gcp.elastic-cloud.com/app/apm/dependencies/operation?comparisonEnabled=true&dependencyName=checkoutService&detailTab=timeline&environment=ENVIRONMENT_ALL&kuery=&offset=1d&rangeFrom=now-30m&rangeTo=now&showCriticalPath=false&spanName=POST%20%2Fnodejs%2FplaceOrder&spanId=0000000000104250&page=0

image

It seems like a completely different problem. It has to be addressed, but IMO it'd be better to do so in a separate ticket so as not to increase the scope of this PR. It could also be a problem with the data on the edge cluster, because I couldn't find any SDH complaining about the rendering problem.

@paulb-elastic
Contributor

> it'd be better to do so in a separate ticket so as not to increase the scope of this PR

@crespocarlos thanks for confirming it's an existing issue and not related to this PR - I fully agree it should be tackled separately. As it's on edge, it's also not specifically related to serverless, so would you mind creating a separate issue for it that goes onto the infra & services backlog (as opposed to this project)? Thanks.

@smith smith removed the needs-refinement Needs PM's to refine scope label May 31, 2024
crespocarlos added a commit that referenced this issue Jun 6, 2024
Fixes #178985 
## Summary

This PR changes the frontend logic that renders the trace waterfall
component. Instead of recursively rendering transactions/spans and their
child transactions/spans, which causes high memory usage depending on
the amount of data and the complexity of the trace being rendered, it now
uses a tree data structure and BFS/DFS algorithms.

Besides that, the trace sample component can render a very long list. To
avoid rendering too many elements in the DOM, this PR changes it to use
a virtual list.
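
A minimal sketch of the combined idea, assuming react-window for the virtual list — the real PR uses Kibana's own components, and `TraceNode` is an illustrative type, not the actual waterfall model. The tree is flattened with an iterative DFS so the row order matches the waterfall, and only the rows in the viewport are mounted:

```tsx
import React from 'react';
import { FixedSizeList } from 'react-window';

// Illustrative trace node; not the actual Kibana waterfall types.
interface TraceNode {
  id: string;
  name: string;
  children: TraceNode[];
}

interface FlattenedRow {
  node: TraceNode;
  level: number;
}

// Iterative DFS flattens the tree into the ordered list of waterfall rows,
// avoiding deep recursion on complex traces.
function flattenTrace(root: TraceNode): FlattenedRow[] {
  const rows: FlattenedRow[] = [];
  const stack: FlattenedRow[] = [{ node: root, level: 0 }];
  while (stack.length > 0) {
    const row = stack.pop()!;
    rows.push(row);
    // Push children in reverse so they come off the stack in original order.
    for (let i = row.node.children.length - 1; i >= 0; i--) {
      stack.push({ node: row.node.children[i], level: row.level + 1 });
    }
  }
  return rows;
}

// The virtual list mounts only the rows visible in the viewport.
function Waterfall({ root }: { root: TraceNode }) {
  const rows = React.useMemo(() => flattenTrace(root), [root]);
  return (
    <FixedSizeList height={600} width="100%" itemCount={rows.length} itemSize={32}>
      {({ index, style }) => (
        <div style={{ ...style, paddingLeft: rows[index].level * 16 }}>
          {rows[index].node.name}
        </div>
      )}
    </FixedSizeList>
  );
}
```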


### Memory consumption

15 minutes' worth of data

| before | after |
|-------|-------|
|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/45746f12-3119-4641-9d68-a725a1fff6ac">|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/64e7e5f2-8dda-40eb-8abc-f1974aeb7072">|


30 minutes' worth of data

| before | after |
|-------|-------|
|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/a0b32774-4bb9-4d8c-a088-b4baea0c204a">|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/b09188e3-2fa9-4d38-b344-f3dd3656bde8">|


1 hour's worth of data
| before | after |
|-------|-------|
|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/c33f61ff-e7f8-4f1c-ac49-28bb4c819303">|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/ad5299cd-7a72-43e1-aa4a-407c99acb107">|

### Extra

Sticky header fix


https://github.com/elastic/kibana/assets/2767137/632485ee-80c5-486d-aaa2-c34047b9c11e

### How to test

The best way to test is to connect to an oblt cluster:
- Navigate to APM > Dependencies
- Go into `cartService`
- Click on the `Operations` tab and click on the `POST /nodejs/addToCart` operation
- Select different date ranges and services

### For reviewers

There is a problem with positioning the trace elements in the grid when
rendering data for large date ranges (#178985 (comment)). This won't be
addressed in this PR.

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@crespocarlos
Contributor

Tested the fix in QA with a 30-day range:

Image
