WT-12954: optimizing Cluster performance jitter caused by checkpoint #10565

Open
wants to merge 3 commits into
base: develop

y123456yz
Contributor

@y123456yz y123456yz commented May 7, 2024

Today a user ran into a similar issue. They are putting a lot of pressure on us and have asked us to resolve it within a week. I want to solve it by limiting the checkpoint speed, but I am not sure this will fix the problem, and I am concerned about bugs in my code. I therefore need your help to make sure the change is safe.

With this PR, MongoDB's checkpoint speed can be limited with the following command:
db.adminCommand( { setParameter : 1, "wiredTigerEngineRuntimeConfig" : "io_capacity=(checkpoint=1M)"})

If possible, please prioritize reviewing this PR. Thank you.


Hi @y123456yz, thank you for your submission!
Please make sure to sign our Contributor Agreement (if you haven't already) and provide us with editor permissions on your branch. Instructions on how to do that can be found here.

@y123456yz
Contributor Author

Background:
We run many low-spec MongoDB instances in the cloud, such as 2C4G (2 CPU cores, 4 GB of memory). When write QPS is even slightly elevated during a checkpoint cycle, we often encounter the following problems:

  1. CPU spikes and jitter
  2. Slow queries affecting business

Analysis of diagnostic.data confirms that the main cause is the checkpoint, which typically completes within seconds, so its writes arrive as a short, intense burst. This is the root cause of the problem.
This PR limits the write speed of checkpoint I/O so that checkpoints run more smoothly and stably.
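
To illustrate the general idea of pacing checkpoint writes, here is a minimal Python sketch. It is not the code in this PR and not WiredTiger's actual capacity logic; the names `WriteThrottle`, `reserve`, and `simulate_checkpoint` are hypothetical, and time is simulated rather than measured. It assumes a limit expressed in bytes per second and shows how a fixed budget stretches a burst of writes over a longer interval.

```python
class WriteThrottle:
    """Hypothetical pacing helper: allows at most `bytes_per_sec` on average."""

    def __init__(self, bytes_per_sec, now=0.0):
        if bytes_per_sec <= 0:
            raise ValueError("bytes_per_sec must be positive")
        self.bytes_per_sec = bytes_per_sec
        # Earliest (simulated) time at which the next write may be issued.
        self.next_free = now

    def reserve(self, nbytes, now):
        """Reserve budget for `nbytes`; return how long the caller should wait."""
        start = max(now, self.next_free)
        # The write consumes nbytes / bytes_per_sec seconds of budget.
        self.next_free = start + nbytes / self.bytes_per_sec
        return start - now


def simulate_checkpoint(total_bytes, block_size, bytes_per_sec):
    """Return the simulated seconds needed to write `total_bytes` under the limit."""
    throttle = WriteThrottle(bytes_per_sec)
    now = 0.0
    written = 0
    while written < total_bytes:
        n = min(block_size, total_bytes - written)
        now += throttle.reserve(n, now)  # stands in for sleeping before the write
        written += n
    # Budget for the last block is used up at this point in simulated time.
    return throttle.next_free


MIB = 1024 * 1024
# Assumed workload: 8 MiB of dirty data written in 64 KiB blocks at a 1 MiB/s limit.
duration = simulate_checkpoint(8 * MIB, 64 * 1024, 1 * MIB)
```

Under these assumed numbers the writes are spread over `total_bytes / bytes_per_sec` seconds instead of being issued back to back, which is the smoothing effect the limit is meant to achieve. The trade-off is a longer checkpoint, so the limit has to stay above the rate at which dirty data accumulates.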

Reviewing past MongoDB user support tickets, we found that this issue affects at least dozens of user clusters. That count excludes clusters whose users have not reported it, so the real number is higher.

(three images attached)

y123456yz changed the title from "optimizing Cluster performance jitter caused checkpoint" to "[Urgent, please prioritize processing]-optimizing Cluster performance jitter caused checkpoint" on May 7, 2024
y123456yz changed the title from "[Urgent, please prioritize processing]-optimizing Cluster performance jitter caused checkpoint" to "[Urgent, please help prioritize review code]-optimizing Cluster performance jitter caused checkpoint" on May 7, 2024
y123456yz changed the title from "[Urgent, please help prioritize review code]-optimizing Cluster performance jitter caused checkpoint" to "WT-12954: [Urgent, please help prioritize review code]-optimizing Cluster performance jitter caused checkpoint" on May 7, 2024
y123456yz changed the title from "WT-12954: [Urgent, please help prioritize review code]-optimizing Cluster performance jitter caused checkpoint" to "WT-12954: [Urgent, please help prioritize review code]-optimizing Cluster performance jitter caused by checkpoint" on May 7, 2024
@y123456yz
Contributor Author

WT-12954

@y123456yz
Contributor Author

y123456yz commented May 7, 2024

A question: this PR needs a test case for io_capacity.checkpoint, similar to the test case added in WT-11877. However, I found that WT-11877 was reverted. I think the WT-11877 PR is worthwhile, and the bug in its test script needs to be fixed. How should I proceed here?

y123456yz changed the title from "WT-12954: [Urgent, please help prioritize review code]-optimizing Cluster performance jitter caused by checkpoint" to "WT-12954: [Urgent, please help prioritize deal this PR]-optimizing Cluster performance jitter caused by checkpoint" on May 7, 2024
y123456yz changed the title from "WT-12954: [Urgent, please help prioritize deal this PR]-optimizing Cluster performance jitter caused by checkpoint" to "WT-12954: optimizing Cluster performance jitter caused by checkpoint" on May 17, 2024