Added new commit statistics metrics #10993

Status: Open, 3 commits to be merged into main

Collaborator

@oleg68 oleg68 commented Oct 18, 2023

Problem statement

It is currently hard to tune an FDB cluster for a write-intensive workload.

Description

When tuning an FDB cluster for a write-intensive application, the bottleneck is often the commit latency: as the degree of parallelism of the transaction workload increases, the commit latency grows and prevents any further increase in transaction throughput.

Many settings and knobs influence the commit latency: the number of commit proxies, the number of resolvers, the number of tlog processes, and commit batching knobs such as MAX_COMMIT_BATCH_INTERVAL, COMMIT_TRANSACTION_BATCH_INTERVAL_MAX, COMMIT_TRANSACTION_BATCH_INTERVAL_SMOOTHER_ALPHA and others.

At the moment, however, there is no information about where the root cause of a high commit latency lies, so it is unclear what should be changed.

Proposal

  • Collect and log latency statistics for each part of the commit workflow:
    • Waiting for a batch
    • Preresolution (allocating a commit version)
    • Resolution
    • Postresolution
    • Pushing to TLog
    • Replying
  • Collect and log batch size statistics:
    • The number of transactions in one batch
    • The total bytes in the transaction batch
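As an illustration of the proposal above, here is a minimal sketch of a per-phase latency accumulator. It is not the code in this PR: the `CommitPhase` enum and the `CommitPhaseStats` class are hypothetical names, and the sketch keeps only a running sum and a sample count per phase, whereas a real implementation would more likely keep a full latency distribution.

```cpp
#include <array>
#include <cstddef>

// Phases of the commit workflow, in the order listed in the proposal.
// The enum and its names are assumptions made for this sketch.
enum class CommitPhase : std::size_t {
    BatchWaiting = 0,
    Preresolution,
    Resolution,
    Postresolution,
    TLogPush,
    Reply,
    Count // number of phases, not a phase itself
};

// Hypothetical accumulator: a running sum and a sample count per phase.
class CommitPhaseStats {
public:
    // Record one latency sample (in seconds) for the given phase.
    void record(CommitPhase phase, double seconds) {
        const std::size_t i = static_cast<std::size_t>(phase);
        sum_[i] += seconds;
        ++count_[i];
    }

    // Mean latency of a phase, or 0 if no samples were recorded for it.
    double mean(CommitPhase phase) const {
        const std::size_t i = static_cast<std::size_t>(phase);
        return count_[i] == 0 ? 0.0 : sum_[i] / static_cast<double>(count_[i]);
    }

    // The phase with the largest mean latency, i.e. the first place to look
    // when the overall commit latency is too high.
    CommitPhase slowest() const {
        std::size_t best = 0;
        for (std::size_t i = 1; i < kPhases; ++i) {
            if (mean(static_cast<CommitPhase>(i)) > mean(static_cast<CommitPhase>(best))) {
                best = i;
            }
        }
        return static_cast<CommitPhase>(best);
    }

private:
    static constexpr std::size_t kPhases = static_cast<std::size_t>(CommitPhase::Count);
    std::array<double, kPhases> sum_{};
    std::array<std::size_t, kPhases> count_{};
};
```

A caller would invoke `record` once per phase for each commit batch and then compare `mean` across phases, or call `slowest`, to decide which knob or process count to adjust.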

PR content

This PR implements this proposal: the following new metrics are logged and exposed in status json:

  • CommitBatchTransactions (commit_batch_transactions) - the number of transactions in one batch
  • CommitBatchBytes (commit_batch_bytes) - the total size of one commit batch in bytes
  • CommitBatchingWaiting (commit_batching_waiting) - the time a transaction spends waiting for its batch to become ready
  • CommitPreresolutionLatency (commit_preresolution_latency) - the time of the Preresolution phase
  • CommitResolutionLatency (commit_resolution_latency) - the time of the Resolution phase
  • CommitPostResolutionLatency (commit_postresolution_latency) - the time of the Postresolution phase
  • CommitTLogLoggingLatency (commit_tlog_logging_latency) - the time of the TlogLogging phase
  • CommitReplyLatency (commit_reply_latency) - the time of the Reply phase

The sum of the added *_latency_mean metrics should equal the commit_latency_mean metric that already exists.
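The relation between the phase means and the overall commit latency can be checked mechanically. The following is a sketch only: it assumes the mean values have already been extracted from status json into a flat map keyed by metric name, and the functions `sumPhaseLatencyMeans` and `phasesAccountForTotal`, as well as the default key `commit_latency` for the overall metric, are hypothetical choices made for this example.

```cpp
#include <cmath>
#include <map>
#include <string>

// Sum the means of every entry whose name ends in "_latency", except the
// overall metric itself. Entries without that suffix are ignored.
double sumPhaseLatencyMeans(const std::map<std::string, double>& means,
                            const std::string& totalKey = "commit_latency") {
    const std::string suffix = "_latency";
    double sum = 0.0;
    for (const auto& kv : means) {
        const std::string& name = kv.first;
        if (name == totalKey) {
            continue; // do not count the overall latency as one of its parts
        }
        if (name.size() >= suffix.size() &&
            name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0) {
            sum += kv.second;
        }
    }
    return sum;
}

// True if the phase means add up to the overall mean within `tolerance`
// seconds. Returns false if the overall metric is missing from the map.
bool phasesAccountForTotal(const std::map<std::string, double>& means,
                           double tolerance,
                           const std::string& totalKey = "commit_latency") {
    const auto it = means.find(totalKey);
    if (it == means.end()) {
        return false;
    }
    return std::fabs(sumPhaseLatencyMeans(means, totalKey) - it->second) <= tolerance;
}
```

A tolerance is used rather than exact equality because the means are floating-point values and are sampled independently.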

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 03cf4dc
  • Duration 0:12:41
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 03cf4dc
  • Duration 0:16:01
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 03cf4dc
  • Duration 0:17:40
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 03cf4dc
  • Duration 0:17:48
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 03cf4dc
  • Duration 0:18:14
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 03cf4dc
  • Duration 0:22:37
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Oleg Samarin added 2 commits October 19, 2023 10:13
(cherry picked from commit ed145d9b9b0b72aa1b25acee2fc8b5dc135d8014)
(cherry picked from commit 54996a8)
(cherry picked from commit ae8aef7)

# Conflicts:
#	fdbserver/CommitProxyServer.actor.cpp
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 8ae2ad1
  • Duration 0:20:46
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 8ae2ad1
  • Duration 0:31:37
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 8ae2ad1
  • Duration 0:46:17
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 8ae2ad1
  • Duration 1:06:50
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 8ae2ad1
  • Duration 1:07:00
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 8ae2ad1
  • Duration 1:24:26
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@oleg68
Collaborator Author

oleg68 commented Oct 19, 2023

It seems the first two tests failed due to environment problems:

Server:
ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
errors pretty printing info

sbodagala previously approved these changes Oct 26, 2023

TraceEventFields const& commitBatchTransactions = metrics.at("CommitBatchTransactions");
if (commitBatchTransactions.size()) {
    obj["commit_batch_transactions"] = addLatencyStatistics(commitBatchTransactions);
Contributor

Since this modifies status json, you need to update documentation/sphinx/source/mr-status-json-schemas.rst.inc and fdbclient/Schemas.cpp as well.

Collaborator Author

I added the new elements to fdbclient/Schemas.cpp, but currently there are no cluster.processes..roles objects in documentation/sphinx/source/mr-status-json-schemas.rst.inc at all.

I can add descriptions of the new metrics, but I cannot describe all the other elements of roles.

(cherry picked from commit 25fe53b)
(cherry picked from commit 5b54ebf)
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: a6b8f54
  • Duration 0:21:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: a6b8f54
  • Duration 0:33:34
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: a6b8f54
  • Duration 0:47:04
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: a6b8f54
  • Duration 1:17:15
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: a6b8f54
  • Duration 1:22:48
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: a6b8f54
  • Duration 1:29:57
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

Contributor

@jzhou77 jzhou77 left a comment

@sbodagala can you run a correctness for this PR before merging?

@sbodagala
Contributor

@sbodagala can you run a correctness for this PR before merging?

I ran a correctness test (with 100000 simulation tests). The run stopped after 99994 test runs, at which point 10 simulation tests had failed (not all of these failures were necessarily caused by this change set, though).

ended=99994 fail=10 fail_fast=10 max_runs=100000

The majority of the failures are in tests in the "tests/restarting/from_7.3.0/" directory:

SourceVersion="9cbaacab35d8c1e1230497afd6fec9fbd177cf51"

WillRestart="0", RandomSeed="1062210747", BuggifyEnabled="1", TestFile="tests/fast/RandomUnitTests.toml"

WillRestart="1", RandomSeed="1380708830", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/SnapTestSimpleRestart-1.txt"

WillRestart="0", RandomSeed="1380708831", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/SnapTestSimpleRestart-2.txt"

WillRestart="1", RandomSeed="2128522034", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/SnapTestSimpleRestart-1.txt"

WillRestart="0", RandomSeed="2128522035", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/SnapTestSimpleRestart-2.txt"

WillRestart="0", RandomSeed="398579556", BuggifyEnabled="0", TestFile="tests/fast/BlobGranuleVerifySmall.toml"

WillRestart="1", RandomSeed="110402234", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-1.toml"

WillRestart="0", RandomSeed="110402235", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-2.toml"

WillRestart="1", RandomSeed="3174333056", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-1.toml"

WillRestart="0", RandomSeed="3174333057", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-2.toml"

WillRestart="1", RandomSeed="869995033", BuggifyEnabled="1", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-1.toml"

WillRestart="0", RandomSeed="869995034", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/DrUpgradeRestart-2.toml"

WillRestart="1", RandomSeed="630498846", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/SnapTestRestart-1.txt"

WillRestart="0", RandomSeed="630498847", BuggifyEnabled="0", TestFile="tests/restarting/from_7.3.0/SnapTestRestart-2.txt"
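To see at a glance where failures like those listed above cluster, the lines can be grouped by test directory. This is a small sketch and not part of the correctness tooling: the helper names `attribute` and `failuresByDirectory` are hypothetical, and the sketch assumes each line uses the `Key="value"` form shown above.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Extract the value of Key="value" from one line.
// Returns an empty string if the key is absent or the value is unterminated.
std::string attribute(const std::string& line, const std::string& key) {
    const std::string marker = key + "=\"";
    std::size_t start = line.find(marker);
    if (start == std::string::npos) {
        return "";
    }
    start += marker.size();
    const std::size_t end = line.find('"', start);
    if (end == std::string::npos) {
        return "";
    }
    return line.substr(start, end - start);
}

// Count lines per test directory, where the directory is everything before
// the last '/' of the TestFile value. Lines without a TestFile are skipped.
std::map<std::string, int> failuresByDirectory(const std::vector<std::string>& lines) {
    std::map<std::string, int> counts;
    for (const std::string& line : lines) {
        const std::string file = attribute(line, "TestFile");
        if (file.empty()) {
            continue;
        }
        const std::size_t slash = file.rfind('/');
        const std::string dir = slash == std::string::npos ? "" : file.substr(0, slash);
        ++counts[dir];
    }
    return counts;
}
```

Note that a restarting test appears as two lines (a `-1` and a `-2` part), so the per-directory line count is not the same as the number of failed test runs.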
