Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

walredo metrics: some are just for apply_batch_postgres, others include apply_batch_neon #7595

Open
problame opened this issue May 2, 2024 · 1 comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug triaged bugs that were already triaged

Comments

@problame
Copy link
Contributor

problame commented May 2, 2024

Problem

apply_batch_postgres does .observe() on these metrics

WAL_REDO_TIME.observe(duration.as_secs_f64());
WAL_REDO_RECORDS_HISTOGRAM.observe(len as f64);
WAL_REDO_BYTES_HISTOGRAM.observe(nbytes as f64);

whereas apply_batch_neon does .observe() only on this metric

// FIXME: using the same metric here creates a bimodal distribution by default, and because
// there could be multiple batch sizes this would be N+1 modal.
WAL_REDO_TIME.observe(duration.as_secs_f64());

This makes it error-prone to reason about walredo metrics, as, e.g., WAL_REDO_TIME has a different _count than the WAL_REDO_RECORDS_HISTOGRAM and WAL_REDO_BYTES_HISTOGRAM

I stumbled across this and wasted multiple hours when trying to qualify async walredo.
(Actually, Joonas added that FIXME in the code quoted above. This issue addresses the FIXME)

Solution

  1. For metrics that make sense for both functions (WAL_REDO_TIME, WAL_REDO_RECORDS_HISTOGRAM)
  2. Group them together under a struct metrics::WalRedoManagerMetrics
  3. observe them in both functions and add a label called replayer=walredoproc|inprocess to distinguish them.
  4. The WAL_REDO_BYTES_HISTOGRAM is walredo process specific
  5. Group it under a struct metrics::WalRedoProcessMetrics (maybe we already have something like that)
  6. Rename the metric from pageserver_wal_redo_bytes_histogram to pageserver_wal_redo_proc_bytes_histogram
@problame problame added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels May 2, 2024
@koivunej
Copy link
Contributor

koivunej commented May 3, 2024

Regarding my FIXME I think we concluded yesterday that the batch size does not matter for these metrics, because the time taken will be a big component of actual request completion.

@jcsp jcsp added the triaged bugs that were already triaged label May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug triaged bugs that were already triaged
Projects
None yet
Development

No branches or pull requests

3 participants