[APM] Requesting trace sample of a dependency operation from large time interval results in browser tab with Kibana to hang #178985
Comments
Pinging @elastic/apm-ui (Team:APM)
I'm not sure this is related to serverless or the boost window. I have a hypothesis that this could be recreated pretty easily in any deployment.
@chrisdistasio I've just tried that on a stateful deployment with similar data and reproduced it. Noted the version in the description.
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)
Looking at the video, I would suggest the first problem we need to look at is the fact that it took over a minute to render the inventory list of dependencies for 30 days.
I see that the current API fetches the summary statistics and time series together in a single request. We should break it down as we are doing on the service inventory page: first fetch the top dependencies with the summary statistics, then, in a follow-up request, get the time series only for the dependencies visible on the screen. As the user navigates, we keep fetching only the time series for the visible items.
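A minimal sketch of the two-phase fetching pattern described in the comment above, assuming hypothetical endpoint paths and response shapes (none of these names are Kibana's actual API):

```typescript
// Illustrative sketch of two-phase fetching. Endpoint paths and types
// are assumptions for demonstration, not Kibana's real API surface.
interface DependencyStats {
  name: string;
  latency: number;
  throughput: number;
}

interface TimeseriesPoint {
  x: number;
  y: number | null;
}

// Phase 1: fetch the top dependencies with summary statistics only.
async function fetchTopDependencies(
  start: string,
  end: string
): Promise<DependencyStats[]> {
  const res = await fetch(`/internal/apm/dependencies/top?start=${start}&end=${end}`);
  return res.json();
}

// Phase 2: fetch time series only for the dependencies currently on screen;
// called again as the user scrolls and the visible set changes.
async function fetchVisibleTimeseries(
  visibleNames: string[],
  start: string,
  end: string
): Promise<Record<string, TimeseriesPoint[]>> {
  const names = encodeURIComponent(visibleNames.join(','));
  const res = await fetch(`/internal/apm/dependencies/timeseries?names=${names}&start=${start}&end=${end}`);
  return res.json();
}
```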
@cauemarcondes you probably want to look at a trace first. The underlying requests are a little different from other places. We need to jump through some hoops to be able to resolve a destination service name from an exit span metric.
There might be some improvement on that after #178979 is merged, but some optimization probably also needs to be done on the frontend. The tab that loads the page keeps crashing and becoming unresponsive; the 5.2 GB memory usage shown in the screenshot is a huge number.
The trace sample component doesn't necessarily need to render data for a large time range to present performance issues. I've tested it with a 2h range, and despite the API responding relatively fast, the trace sample component takes too much time to render the data. While it does that, the page becomes unresponsive for a short while and the tab consumes ~2.9 GB of memory.

trace_slow_2h.mov

The profiler shows the evolution of memory consumption. Testing it with a 4h range crashes the tab.

trace_sample_unresponsive.mov

The performance seems to worsen as the trace object becomes more complex. After digging into the code, I found what might be the two main performance detractors:

1. `getWaterfallMaxLevel` - a recursive function that can consume a lot of heap memory depending on how complex the structure is; it could be optimized with a BFS traversal (see the sketch after this comment).
2. `AccordionWaterfall` - another recursive call, and it renders everything at once. Instead, it could render the trace data on demand, as the user scrolls down the page.

There might be other micro-optimizations that would make the component render faster.
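A minimal sketch of replacing the recursive depth computation with an iterative BFS, assuming a simplified `WaterfallItem` type and a `childrenByParentId` lookup map (both are illustrative stand-ins, not the actual Kibana types):

```typescript
// Illustrative sketch: compute the maximum depth of a waterfall tree with
// an iterative BFS instead of recursion, so traversal cost is bounded by
// an explicit queue rather than the call stack. Types are assumptions.
interface WaterfallItem {
  id: string;
}

function getWaterfallMaxLevel(
  root: WaterfallItem,
  childrenByParentId: Map<string, WaterfallItem[]>
): number {
  let maxLevel = 0;
  // The queue doubles as the visit order; an index pointer avoids the
  // O(n) cost of Array#shift on every dequeue.
  const queue: Array<{ item: WaterfallItem; level: number }> = [
    { item: root, level: 0 },
  ];

  for (let i = 0; i < queue.length; i++) {
    const { item, level } = queue[i];
    if (level > maxLevel) {
      maxLevel = level;
    }
    for (const child of childrenByParentId.get(item.id) ?? []) {
      queue.push({ item: child, level: level + 1 });
    }
  }

  return maxLevel;
}
```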
Parsing the JSON was fast enough. We can move forward with the fix by improving the performance on the frontend.
@chrisdistasio @paulb-elastic, I've managed to refactor the trace sample and optimize the performance, but another problem surfaced: in large date ranges, the waterfall miscalculates the position of the elements in the row. I'll open a new ticket for this, but it doesn't seem like something related to serverless performance.
@crespocarlos to clarify, does this only happen because of the changed implementation? If so, the concern with addressing this separately is that this new (incorrect) behaviour will impact users as soon as it's merged, until a subsequent fix is implemented/merged, in which case it would be better to fix it in the same PR (have I understood it correctly?).
@paulb-elastic, it's an existing issue; it can be seen on the edge cluster. It seems like a completely different problem. It has to be addressed, but IMO it'd be better to do so in a separate ticket so as not to increase the scope of this PR. It could also be a problem with the data on the edge cluster, because I couldn't find any SDH complaining about the rendering problem.
@crespocarlos thanks for confirming it's an existing issue and not related to this PR - fully agree it should be tackled separately. Since it happens on edge, it's also not specifically related to serverless, so would you mind creating a separate issue for it that goes onto the infra&services backlog (as opposed to this project)? Thanks.
Fixes #178985

## Summary

This PR changes the frontend logic that renders the trace waterfall component. Instead of recursively rendering transactions/spans and their child transactions/spans, which causes high memory usage depending on the amount of data and how complex the trace is, it now uses a tree data structure and BFS/DFS algorithms. Besides that, the trace sample component can render a very long list; to avoid rendering too many elements in the DOM, this PR changes it to use a virtual list (a sketch of the virtualization idea follows after this description).

### Memory consumption

15 minutes' worth of data

| before | after |
|-------|-------|
|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/45746f12-3119-4641-9d68-a725a1fff6ac">|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/64e7e5f2-8dda-40eb-8abc-f1974aeb7072">|

30 minutes' worth of data

| before | after |
|-------|-------|
|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/a0b32774-4bb9-4d8c-a088-b4baea0c204a">|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/b09188e3-2fa9-4d38-b344-f3dd3656bde8">|

1 hour's worth of data

| before | after |
|-------|-------|
|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/c33f61ff-e7f8-4f1c-ac49-28bb4c819303">|<img width="590" alt="image" src="https://github.com/elastic/kibana/assets/2767137/ad5299cd-7a72-43e1-aa4a-407c99acb107">|

### Extra

Sticky header fix

https://github.com/elastic/kibana/assets/2767137/632485ee-80c5-486d-aaa2-c34047b9c11e

### How to test

The best way to test is to connect to an oblt cluster:

- Navigate to APM > Dependencies
- Go into `cartService`
- Click on the `Operations` tab and click on the `POST /nodejs/addToCart` operation
- Select different date ranges and services

### For reviewers

There is a problem with positioning the trace elements in the grid when rendering data for large date ranges (#178985 (comment)). This won't be addressed in this PR.

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
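A minimal sketch of the list virtualization idea referenced in the summary, assuming `react-window` and a flattened array of waterfall rows; the component, props, and sizing values are illustrative assumptions, not the code from this PR:

```tsx
// Illustrative sketch: render only the visible rows of a long waterfall
// list with react-window, so DOM size stays small regardless of how many
// items the flattened trace contains. Names and types are assumptions.
import React from 'react';
import { FixedSizeList, ListChildComponentProps } from 'react-window';

interface WaterfallRow {
  id: string;
  label: string;
  level: number; // depth in the trace tree, used for indentation
}

export function WaterfallList({ rows }: { rows: WaterfallRow[] }) {
  // Rows outside the viewport are never mounted; react-window positions
  // each mounted row via the provided inline style.
  const Row = ({ index, style }: ListChildComponentProps) => {
    const row = rows[index];
    return (
      <div style={{ ...style, paddingLeft: row.level * 16 }}>{row.label}</div>
    );
  };

  return (
    <FixedSizeList height={600} width="100%" itemCount={rows.length} itemSize={32}>
      {Row}
    </FixedSizeList>
  );
}
```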
Version:
Serverless project v8.14.0
Stateful deployment v8.14.0-SNAPSHOT
Description:
When requesting a trace sample of a dependency operation within or outside the boost window, the result doesn't return, and the browser tab running Kibana hangs and becomes unresponsive. Chrome shows high memory usage (~5.2 GB).
Steps to reproduce:
Expected behavior:
Trace sample is loaded.
Screenshots (if relevant):
chrome_vVTaU0Z61o.mp4
Logs:
I found the following errors:
Kibana
ES