[BUG] Performance impact and write amplification with CHANGELOG_SAVE_MODE = 2 #540
Hi, this is what you can find in mfsmaster.cfg file description:
So mode 2 will always perform the fsync operation after every write, which is why you see such a performance drop.
Of course I expected a performance drop and an I/O increase, but not by a thousand times... or 3-4000x.
Changelogs are recorded at least every one second, even if the system is idle. If it's not idle, they are even more frequent. You are asking your kernel to fsync several write operations per second. This has a BIG impact on performance. This option is there only for cases that demand the highest level of security and has no application in most scenarios.
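The per-write fsync cost being discussed is easy to reproduce outside MooseFS. This is a minimal standalone benchmark (not MooseFS code; file names and record format are invented for illustration) comparing buffered appends against appends with an fsync after every record, roughly what mode 2 does:

```python
import os
import tempfile
import time

def append_records(path, n, fsync_each):
    """Append n small records; optionally fsync after each one,
    mimicking the per-write fsync of CHANGELOG_SAVE_MODE = 2."""
    with open(path, "ab") as f:
        t0 = time.perf_counter()
        for i in range(n):
            f.write(b"changelog entry %d\n" % i)
            f.flush()
            if fsync_each:
                os.fsync(f.fileno())
        return time.perf_counter() - t0

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        buffered = append_records(os.path.join(d, "a.log"), 500, False)
        synced = append_records(os.path.join(d, "b.log"), 500, True)
        print(f"buffered: {buffered:.4f}s  fsync-per-record: {synced:.4f}s")
```

On rotational or even SATA flash storage the fsync-per-record run is typically orders of magnitude slower, since every fsync forces a device-level flush.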
I'm only asking whether this is intended and whether there is room for some grouping to reduce fsync calls. One block write to a file with dd causes 6 writes by the mfs master (some will be filesystem...).
Grouping fsyncs is lessening security. This option really isn't meant for regular use and was added only after specific requests from our users. MooseFS used to have only option 0 (that is, writing changelogs in the background, aka a separate process), because we knew that any other option would severely impact performance. But for security reasons, when performance wasn't an issue, some users wanted to have the option. We added it, but we always said: use at your own risk. A kind of middle ground is option 1, that is, writing changelogs in the foreground, but without those fsyncs. No danger of the background process hanging (and nobody noticing, which was the main issue as I remember), but also not such a big impact on performance, at the cost of no fsyncs (so still a possibility of losing the tail of the latest changelog file in case of hardware failure - but with grouped fsyncs that would also exist).
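For reference, the three behaviours described above map to a single setting in `mfsmaster.cfg`. The comments below paraphrase the descriptions given in this thread; they are not the verbatim text of the shipped file:

```
# mfsmaster.cfg (excerpt; comments paraphrased from this discussion)
#   0 - write changelogs in the background (separate process); the default
#   1 - write changelogs in the foreground, without fsync
#   2 - write changelogs in the foreground, fsync after every write
#       (highest durability, largest performance impact)
CHANGELOG_SAVE_MODE = 2
```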
Grouping fsyncs means less security, that's true, but what about one fsync per operation, done before sending OK to the client? In this scenario, losing one fsync would just force the client to retry the operation? A pure metadata operation like touch or mkdir causes only one fsync, but appending a single line to a file with "date >> myfile" results every time in 4 fsyncs by mfsmaster, and dd causes 3 fsyncs for every single block.
Master doesn't know what the client process would consider a "single operation" for a file - it doesn't know that there is one "dd" command performed that requests 3 actions to be performed on one file. From the master's point of view those 3 operations might as well have been requested by 3 different processes using the same client. We would have to complicate the protocol and introduce some kind of markers from clients to the master to tell the master where the "fsync points" for metadata should be, but that would introduce a whole new level of "complicated" in master - client interactions, which would not be without a significant impact on performance.
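The "grouped fsync" idea debated above is essentially group commit: each caller still blocks until its record is durable, but one fsync covers every record that arrived in the meantime. A minimal sketch of that pattern (this is NOT MooseFS code; the class, its methods, and the queue-draining logic are invented for illustration):

```python
import os
import queue
import threading

class GroupCommitLog:
    """Illustrative group-commit log writer: callers block until their
    record is durable, but one fsync covers the whole queued batch."""

    def __init__(self, path):
        self._f = open(path, "ab")
        self._q = queue.Queue()
        threading.Thread(target=self._writer, daemon=True).start()

    def append(self, record: bytes):
        done = threading.Event()
        self._q.put((record, done))
        done.wait()  # unblocks only after the batch fsync completes

    def _writer(self):
        while True:
            batch = [self._q.get()]  # wait for at least one record
            while True:              # then drain everything already queued
                try:
                    batch.append(self._q.get_nowait())
                except queue.Empty:
                    break
            for record, _ in batch:
                self._f.write(record + b"\n")
            self._f.flush()
            os.fsync(self._f.fileno())  # one fsync for the entire batch
            for _, done in batch:
                done.set()
```

Under concurrent load this collapses many per-operation fsyncs into one per batch while still acknowledging each caller only after durability; the trade-off the maintainer notes still applies, since records from distinct logical operations share one fsync point.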
In my test setup I use DRBD to replicate metadata (protocol B and RDMA transport) and manage the moosefs master via the Pacemaker cluster manager to create an HA setup.
Looks like you have answered your own question - it is a feature, not a bug. CHANGELOG_SAVE_MODE=2 is simply a higher-priced insurance premium a cluster admin decides to pay to lessen the risks and impact of inevitable unplanned outages. Considering you could operate mfsmaster on a bare-bones RPi, any specialized/low-latency storage device or replication over RDMA easily falls under the 'sophisticated hardware' umbrella.
Have you read through available documentation, open Github issues and Github Q&A Discussions?
Yes
Your moosefs version and its origin (moosefs.com, packaged by distro, built from source, ...).
3.0.117 from official repositories by moosefs
Operating system (distribution) and kernel version.
Linux RHEL 8.8 4.18.0-477.10.1.el8_8.x86_64
Hardware / network configuration, and underlying filesystems on master, chunkservers, and clients.
dedicated server with AMD EPYC 7282, 256GB RAM, XFS(chunk/meta), 40G networking
How much data is tracked by moosefs master (order of magnitude)?
Describe the problem you observed.
With CHANGELOG_SAVE_MODE = 2 set, I observe a drastic increase in the master server's IOPS and drive utilization.
Metadata is stored on a single Intel Optane 905P with an XFS filesystem and default mount/mkfs options except noatime.
CHANGELOG_SAVE_MODE = 1
drive utilization: ~0.01%
IOPS(write only): 2-5
avg write: 0.6MB/s (could be inaccurate, just observed from graph)
CHANGELOG_SAVE_MODE = 2
drive utilization: 99.7%
IOPS(write only): 17.4K
avg write: 42MB/s (could be inaccurate)
From strace, the master is calling more than 9K fsync/s (with only 3K block writes/s by dd).
This is a test setup and there is no other load on the master or chunk servers, nor on the client; for test purposes only one client is writing with dd.
dd if=/dev/zero of=test bs=4k count=2621440 oflag=dsync status=progress
Write performance drops from 20.3MB/s (mode 1) to 11.8MB/s (mode 2) and mfsmaster is most of the time in D state; I can even write a much bigger file, but performance is stable in both cases. How can this cause 17.4K IOPS on metadata (changelog) writes and more than 9K fsync calls?
I'm not sure if this is a bug.