investigate: long find lsn for timestamp operations #7729

Open
koivunej opened this issue May 13, 2024 · 1 comment
Labels
c/storage/pageserver Component: storage: pageserver

Comments

@koivunej
Contributor

Follow-up from a recent incident, which stopped causing end-user problems with #7585. We still don't know why the query takes so long for so many tenants. It is not limited to tenants with many timelines, because single-timeline tenants show it as well.

Guesses so far:

  • a branch is created but never receives writes in an Lsn range where commit density is high => that range is expensive to search through every time
    • the assumption is that multiple branches make this N times harder
    • the cache should help, but it is insufficient when a single search takes a long time: the cache has churned before the next similar timeline is searched
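
To make the guess above concrete, here is a minimal toy sketch of a bisection over an Lsn range with a cached probe. It is not the pageserver implementation: `COMMITS`, `has_newer_commit`, `find_lsn_for_timestamp` as written here, and the `STATS` counter are all hypothetical names, and the probe is a stand-in for whatever expensive page reconstruction the real search has to do.

```python
from bisect import bisect_right
from functools import lru_cache

# Hypothetical toy commit log: (lsn, commit_timestamp), sorted by lsn.
COMMITS = [(100, 10), (200, 20), (300, 30), (400, 40)]
COMMIT_LSNS = [lsn for lsn, _ in COMMITS]

# Counts how often the expensive probe actually runs (cache misses only).
STATS = {"probes": 0}


@lru_cache(maxsize=None)  # assumption: an unbounded cache; a real one can churn
def has_newer_commit(lsn: int, timestamp: int) -> bool:
    """Stand-in for the expensive step: is any commit with LSN <= lsn
    newer than `timestamp`?"""
    STATS["probes"] += 1
    visible = COMMITS[: bisect_right(COMMIT_LSNS, lsn)]
    return any(ts > timestamp for _, ts in visible)


def find_lsn_for_timestamp(low: int, high: int, timestamp: int):
    """Bisect [low, high] for the largest LSN at which no commit newer than
    `timestamp` is visible. Returns None if `low` already sees one."""
    if has_newer_commit(low, timestamp):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always progresses
        if has_newer_commit(mid, timestamp):
            high = mid - 1
        else:
            low = mid
    return low
```

In this toy model, a second search with the same timestamp over the same range is served entirely from the cache. With a bounded cache and slow probes, the entries could be evicted before the next similar timeline is searched, which is the churn suspected above.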
@koivunej
Contributor Author

#7755 shows that a configuration change brings a particularly bad bisection down from 90s to 13s.

I think that there are still cases where we end up doing a lot more work than should reasonably be done:

  • the prod project with 777 branches, assuming they had "backup alike branches", would have searched for the PITR Lsn over the same pages multiple times
    • with a high slru count this would have been prohibitively long
    • "backup alike branches" as in branches where last_record_lsn == ancestor_lsn
    • perhaps we should special case the last_record_lsn == ancestor_lsn case -- we currently do not have metrics on how many timelines have never progressed beyond their ancestor_lsn
  • even if the many timelines were able to find different PITR lsns (from their branch), we could still do duplicate work if we need to reconstruct past ancestor_lsn
    • I think this is what we ultimately saw during the bug where image layers were created only for the first partition
    • in that case, the cost of reconstructing the clog pages at the parent was prohibitive
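
A rough sketch of the missing metric and of the special case discussed above, under stated assumptions: the `Timeline` record, `never_progressed`, `count_never_progressed`, and `group_shared_searches` are hypothetical names for illustration, not existing pageserver types or functions. The idea is only that branches with last_record_lsn == ancestor_lsn that share an ancestor and branch point could, in principle, share one search.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Timeline:
    # Hypothetical, simplified view of a timeline's metadata.
    timeline_id: str
    ancestor_id: Optional[str]
    ancestor_lsn: Optional[int]
    last_record_lsn: int


def never_progressed(t: Timeline) -> bool:
    """True for a "backup alike branch": last_record_lsn == ancestor_lsn."""
    return t.ancestor_id is not None and t.last_record_lsn == t.ancestor_lsn


def count_never_progressed(timelines) -> int:
    """The metric we do not currently have: timelines stuck at ancestor_lsn."""
    return sum(1 for t in timelines if never_progressed(t))


def group_shared_searches(timelines) -> dict:
    """Group never-progressed branches by (ancestor_id, ancestor_lsn).
    Assumption: every branch in a group sees identical data, so one search
    result per group could be reused instead of repeating the search."""
    groups = defaultdict(list)
    for t in timelines:
        if never_progressed(t):
            groups[(t.ancestor_id, t.ancestor_lsn)].append(t.timeline_id)
    return dict(groups)
```

Whether the real search can be delegated to the ancestor this way is an open question here; the sketch only shows how many searches would collapse if it can.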
