
[APM] Requesting trace sample of a dependency operation from large time interval results in browser tab with Kibana to hang #178985

Closed
ablnk opened this issue Mar 19, 2024 · 15 comments · Fixed by #183802
Labels: apm:dependencies-ui · bug (Fixes for quality problems that affect the customer experience) · Team:obs-ux-infra_services (Observability Infrastructure & Services User Experience Team)

Comments


ablnk commented Mar 19, 2024

Version:
Serverless project v8.14.0
Stateful deployment v8.14.0-SNAPSHOT

Description:
When requesting a trace sample of a dependency operation, within or outside the boost window, the result doesn't return, which causes the browser tab with Kibana to hang and become unresponsive. Chrome shows high memory usage (~5.2 GB).

Steps to reproduce:

  1. Go to Applications - Dependencies.
  2. Filter data by the last 30 days (Note: I also reproduced it with shorter time intervals, such as 3 days).
  3. Select a dependency with ~1000 tpm throughput.
  4. Open the "Operations" tab.
  5. Select a span with the highest throughput.
  6. Wait for the trace sample to load.

Expected behavior:
Trace sample is loaded.

Screenshots (if relevant):
memoryusage (screenshot)

chrome_vVTaU0Z61o.mp4

Logs:
I found the following errors:

Kibana

Error decoding JSON: json: cannot unmarshal string into Go value of type map[string]interface {}

ES

java.io.IOException: Failed to delete blobs [e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114230, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114233, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114232, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114235, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114234, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114237, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114236, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114242, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114243, e5fc0b6c-es-f8ac88cd//indices/aHfK8qi0QQmWiDJuB7yYGQ/0/1/stateless_commit_114244]
	at org.elasticsearch.repositories.s3.S3BlobStore.deleteBlobsIgnoringIfNotExists(S3BlobStore.java:334)
	at org.elasticsearch.stateless@8.14.0/co.elastic.elasticsearch.stateless.objectstore.ObjectStoreService$ShardFilesDeleteTask.doRun(ObjectStoreService.java:799)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null; Request ID: HNAWF6D6Y1S4WTGS; S3 Extended Request ID: udLz1kUEPbMHAfUa5mc2ZNWuyrPKdHwm9UbGZMpJJvDUtZknmMaz0NdC0q6voJWzcaMhubfwllE=; Proxy: null), S3 Extended Request ID: udLz1kUEPbMHAfUa5mc2ZNWuyrPKdHwm9UbGZMpJJvDUtZknmMaz0NdC0q6voJWzcaMhubfwllE=
	at com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:2345)
	at org.elasticsearch.repositories.s3.S3BlobStore.deletePartition(S3BlobStore.java:345)
	at org.elasticsearch.repositories.s3.S3BlobStore.lambda$deleteBlobsIgnoringIfNotExists$1(S3BlobStore.java:327)
	at org.elasticsearch.repositories.s3.SocketAccess.lambda$doPrivilegedVoid$0(SocketAccess.java:46)
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:319)
	at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedVoid(SocketAccess.java:45)
	at org.elasticsearch.repositories.s3.S3BlobStore.deleteBlobsIgnoringIfNotExists(S3BlobStore.java:318)
@ablnk ablnk added bug Fixes for quality problems that affect the customer experience Team:APM All issues that need APM UI Team support labels Mar 19, 2024
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:APM)

@ablnk ablnk changed the title [APM] Requesting trace sample a dependency operation from large time interval results in browser tab with Kibana to hang [APM] Requesting trace sample of a dependency operation from large time interval results in browser tab with Kibana to hang Mar 19, 2024
@chrisdistasio

I'm not sure this is related to serverless or the boost window. My hypothesis is that this could be recreated pretty easily in any deployment.

@ablnk
Author

ablnk commented Mar 26, 2024

@chrisdistasio I've just tried it on a stateful deployment with similar data and reproduced it. I noted the version in the description.

@smith smith added the Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team label Mar 29, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@smith smith added apm:dependencies-ui needs-refinement Needs PM's to refine scope and removed Team:APM All issues that need APM UI Team support labels Mar 29, 2024
@chrisdistasio

chrisdistasio commented Apr 8, 2024

Looking at the video, I would suggest the first problem we need to look at is the fact that it took over a minute to render the inventory list of dependencies for 30 days.

@cauemarcondes
Contributor

I see that the current API, /internal/apm/dependencies/top_dependencies, might return too much data. It returns current/previous stats along with the time series.

We should break it down as we do on the service inventory page: first fetch the top dependencies with their summary statistics, then, in a follow-up request, fetch the time series only for the dependencies visible on the screen. As the user navigates, we keep fetching only the time series for the visible items (see the sketch below).
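
For illustration, a minimal sketch of that split in TypeScript — the `detailed_statistics` route, payload shape, and types below are assumptions, not the actual Kibana APM API:

```ts
// Hypothetical types for illustration; not the actual Kibana APM contracts.
interface DependencySummary {
  dependencyName: string;
  latency: number;
  throughput: number;
  failedTransactionRate: number;
}

interface DependencyTimeseries {
  dependencyName: string;
  timeseries: Array<{ x: number; y: number | null }>;
}

// Phase 1: fetch only the summary statistics for the top dependencies.
async function fetchTopDependencies(start: string, end: string): Promise<DependencySummary[]> {
  const res = await fetch(
    `/internal/apm/dependencies/top_dependencies?start=${start}&end=${end}`
  );
  return res.json();
}

// Phase 2: fetch time series only for the rows currently visible on screen,
// re-invoked as the user scrolls and new rows enter the viewport.
async function fetchVisibleTimeseries(
  visibleDependencyNames: string[],
  start: string,
  end: string
): Promise<DependencyTimeseries[]> {
  // Assumed follow-up endpoint, mirroring the service inventory pattern.
  const res = await fetch('/internal/apm/dependencies/detailed_statistics', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ dependencyNames: visibleDependencyNames, start, end }),
  });
  return res.json();
}
```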

@dgieselaar
Member

@cauemarcondes you probably want to look at a trace first. The underlying requests are a little different from other places; we need to jump through some hoops to be able to resolve a destination service name from an exit span metric.

@crespocarlos crespocarlos self-assigned this May 8, 2024
@crespocarlos
Contributor

> Looking at the video, I would suggest the first problem we need to look at is the fact that it took over a minute to render the inventory list of dependencies for 30 days.

There might be some improvement on that front after #178979 is merged.

There is probably also some optimization that needs to be done on the frontend. The tab that loads the page is constantly crashing and becoming unresponsive; the 5.2 GB memory usage shown in the screenshot is a huge number.

@crespocarlos
Contributor

crespocarlos commented May 13, 2024

The trace sample component doesn't necessarily need to render data for a large time range to show performance issues.

I've tested it with a 2h range, and despite the API responding relatively fast, the trace sample component takes too much time to render the data. While it does so, the page becomes unresponsive for a short while and the tab consumes ~2.9 GB of memory.

trace_slow_2h.mov

The profiler shows the evolution of memory consumption (see screenshot).

Testing it with a 4h range crashes the tab:

trace_sample_unresponsive.mov

The performance seems to worsen as the trace object becomes more complex. After digging into the code, I found what might be the two main performance detractors:

1 - getWaterfallMaxLevel - a recursive function that can consume a lot of heap memory depending on how complex the structure is; it could be optimized with a BFS traversal (see the sketch after this list).

2 - AccordionWaterfall - another recursive call, and it renders everything at once. Instead, it could render the trace data on demand as the user scrolls down the page.

There might be other micro-optimizations that would make the component render faster.
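
For reference, a minimal sketch of the BFS idea from item 1 — the `WaterfallItem` type and the `childrenByParentId` lookup are illustrative, not the actual Kibana waterfall model:

```ts
interface WaterfallItem {
  id: string;
}

// Computes the deepest nesting level iteratively. An explicit queue bounds
// the work by the number of items instead of growing the call stack the way
// a recursive traversal does.
function getWaterfallMaxLevel(
  root: WaterfallItem,
  childrenByParentId: Map<string, WaterfallItem[]>
): number {
  let maxLevel = 0;
  const queue: Array<{ item: WaterfallItem; level: number }> = [{ item: root, level: 0 }];
  let head = 0; // index-based dequeue avoids O(n) Array.shift calls

  while (head < queue.length) {
    const { item, level } = queue[head++];
    maxLevel = Math.max(maxLevel, level);
    for (const child of childrenByParentId.get(item.id) ?? []) {
      queue.push({ item: child, level: level + 1 });
    }
  }
  return maxLevel;
}
```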

@crespocarlos
Contributor

Parsing the JSON was fast enough, so we can move forward with the fix by improving the performance on the frontend.

@crespocarlos
Contributor

crespocarlos commented May 29, 2024

@chrisdistasio @paulb-elastic, I've managed to refactor the trace sample and optimize the performance, but another problem surfaced:

image

In large date ranges, the waterfall miscalculates the position of the elements in the row.

image

I'll open a new ticket for this, but it doesn't seem like something related to serverless performance.

@paulb-elastic
Contributor

@crespocarlos to clarify, does this only happen because of the changed implementation?

If so, the concern with addressing this separately is that this new (incorrect) behaviour will impact users as soon as it's merged, until a subsequent fix is implemented and merged; in that case it would be better to fix it in the same PR (have I understood correctly?).

@crespocarlos
Contributor

@paulb-elastic, it's an existing issue. It can be seen on the edge cluster:

https://edge-oblt.kb.us-west2.gcp.elastic-cloud.com/app/apm/dependencies/operation?comparisonEnabled=true&dependencyName=checkoutService&detailTab=timeline&environment=ENVIRONMENT_ALL&kuery=&offset=1d&rangeFrom=now-30m&rangeTo=now&showCriticalPath=false&spanName=POST%20%2Fnodejs%2FplaceOrder&spanId=0000000000104250&page=0

image

It seems like a completely different problem. It has to be addressed, but IMO it'd be better to do so in a separate ticket so as not to increase the scope of this PR. It could also be a problem with the data on the edge cluster, because I couldn't find any SDH complaining about the rendering problem.

@paulb-elastic
Contributor

> it'd be better to do so in a separate ticket so as not to increase the scope of this PR

@crespocarlos thanks for confirming it's an existing issue and not related to this PR - I fully agree it should be tackled separately. As it's on edge, it's also not specifically related to serverless, so would you mind creating a separate issue for it that goes onto the infra & services backlog (as opposed to this project)? Thanks.

@smith smith removed the needs-refinement Needs PM's to refine scope label May 31, 2024
crespocarlos added a commit that referenced this issue Jun 6, 2024
Fixes #178985 
## Summary

This PR changes the frontend logic that renders the trace waterfall
component. Instead of recursively rendering transactions/spans and their
child transactions/spans, which causes high memory usage depending on
the amount of data and the complexity of the trace being rendered, it now
uses a tree data structure and BFS/DFS algorithms.

Besides that, the trace sample component can render a very long list. To
avoid rendering too many elements in the DOM, this PR changes it to use
a virtual list.
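
A minimal sketch of the combined idea, assuming react-window for the virtual list — the real PR uses Kibana's own components, and `TraceNode` is an illustrative type, not the actual waterfall model. The tree is flattened with an iterative DFS so the row order matches the waterfall, and only the rows in the viewport are mounted:

```tsx
import React from 'react';
import { FixedSizeList } from 'react-window';

// Illustrative trace node; not the actual Kibana waterfall types.
interface TraceNode {
  id: string;
  name: string;
  children: TraceNode[];
}

interface FlattenedRow {
  node: TraceNode;
  level: number;
}

// Iterative DFS flattens the tree into the ordered list of waterfall rows,
// avoiding deep recursion on complex traces.
function flattenTrace(root: TraceNode): FlattenedRow[] {
  const rows: FlattenedRow[] = [];
  const stack: FlattenedRow[] = [{ node: root, level: 0 }];
  while (stack.length > 0) {
    const row = stack.pop()!;
    rows.push(row);
    // Push children in reverse so they come off the stack in original order.
    for (let i = row.node.children.length - 1; i >= 0; i--) {
      stack.push({ node: row.node.children[i], level: row.level + 1 });
    }
  }
  return rows;
}

// The virtual list mounts only the rows visible in the viewport.
function Waterfall({ root }: { root: TraceNode }) {
  const rows = React.useMemo(() => flattenTrace(root), [root]);
  return (
    <FixedSizeList height={600} width="100%" itemCount={rows.length} itemSize={32}>
      {({ index, style }) => (
        <div style={{ ...style, paddingLeft: rows[index].level * 16 }}>
          {rows[index].node.name}
        </div>
      )}
    </FixedSizeList>
  );
}
```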


### Memory consumption

15 minutes' worth of data

| before | after |
|-------|-------|
|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/45746f12-3119-4641-9d68-a725a1fff6ac">|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/64e7e5f2-8dda-40eb-8abc-f1974aeb7072">|


30 minutes' worth of data

| before | after |
|-------|-------|
|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/a0b32774-4bb9-4d8c-a088-b4baea0c204a">|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/b09188e3-2fa9-4d38-b344-f3dd3656bde8">|


1 hour's worth of data
| before | after |
|-------|-------|
|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/c33f61ff-e7f8-4f1c-ac49-28bb4c819303">|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/ad5299cd-7a72-43e1-aa4a-407c99acb107">|

### Extra

Sticky header fix


https://github.com/elastic/kibana/assets/2767137/632485ee-80c5-486d-aaa2-c34047b9c11e

### How to test

The best way to test is to connect to an oblt cluster:
- Navigate to APM > Dependencies
- Go into `cartService`
- Click on the `Operations` tab and click on the `POST /nodejs/addToCart` operation
- Select different date ranges and services

### For reviewers

There is a problem with positioning the trace elements in the grid when
rendering data for large date ranges (#178985 (comment)). This won't be
addressed in this PR.

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@crespocarlos
Contributor

Tested the fix in QA with a 30-day range:

Image
