
[BUG] Performance impact and write amplification with CHANGELOG_SAVE_MODE = 2 #540

Open
Lathanderjk opened this issue May 23, 2023 · 9 comments


@Lathanderjk

Lathanderjk commented May 23, 2023

Have you read through available documentation, open Github issues and Github Q&A Discussions?

Yes

Your moosefs version and its origin (moosefs.com, packaged by distro, built from source, ...).

3.0.117 from the official MooseFS repositories

Operating system (distribution) and kernel version.

Linux RHEL 8.8 4.18.0-477.10.1.el8_8.x86_64

Hardware / network configuration, and underlying filesystems on master, chunkservers, and clients.

dedicated server with AMD EPYC 7282, 256GB RAM, XFS(chunk/meta), 40G networking

How much data is tracked by moosefs master (order of magnitude)?

  • All fs objects: 44777185
  • Total space: 27TB
  • Free space: 3.5TB
  • RAM used: ~30GB(master)
  • last metadata save duration: 7.4s

Describe the problem you observed.

With CHANGELOG_SAVE_MODE = 2 I observe a drastic increase in the master server's IOPS and drive utilization.
Metadata is stored on a single Intel Optane 905P with an XFS filesystem and default mount/mkfs options, except noatime.

CHANGELOG_SAVE_MODE = 1
drive utilization: ~0.01%
IOPS(write only): 2-5
avg write: 0.6MB/s (could be inaccurate, just observed from graph)

CHANGELOG_SAVE_MODE = 2
drive utilization: 99.7%
IOPS(write only): 17.4K
avg write: 42MB/s (could be inaccurate)
From strace, the master is calling more than 9K fsync/s (while dd writes only ~3K blocks/s).

This is a test setup; there is no other load on the master or chunk servers, and likewise on the client. For the test, only one client is writing with dd:
dd if=/dev/zero of=test bs=4k count=2621440 oflag=dsync status=progress
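For reference, oflag=dsync makes dd open its output file with O_DSYNC, so every 4 KiB write must reach stable storage before the next one starts - roughly equivalent to this C loop (an illustrative sketch, not what dd actually executes):

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    char buf[4096] = {0};
    /* O_DSYNC: each write() returns only after the data has reached
       stable storage, so every 4 KiB block becomes a separate
       synchronous operation on the filesystem */
    int fd = open("test", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0)
        return 1;
    for (long i = 0; i < 2621440; i++) {
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
            return 1;
    }
    close(fd);
    return 0;
}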

Write performance drops from 20.3 MB/s (mode 1) to 11.8 MB/s (mode 2), and mfsmaster is in the D state most of the time. I can write a much bigger file and performance stays stable in both cases. How can this cause 17.4K IOPS on metadata (changelog) writes and more than 9K fsync calls?
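As a rough consistency check of these numbers (taking the mode-2 throughput above, the 4 KiB block size, and the per-block fsync/write counts measured later in this thread):

    11.8 MB/s / 4 KiB per block       ≈ 2.9K blocks/s
    2.9K blocks/s x 3 fsyncs per block ≈ 9K fsync/s
    2.9K blocks/s x 6 writes per block ≈ 17.4K write IOPS

so the observed rates are consistent with a few changelog entries, each individually fsynced, per written block.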

I'm not sure if this is a bug.

@xandrus
Member

xandrus commented May 23, 2023

Hi,

this is what you can find in mfsmaster.cfg file description:

# Changelog save mode. There are three modes of writing changelogs:
# 0 - write in background by different process (less safe, but doesn't make master stop in case of heavy hdd load)
# 1 - write in foreground without syncing data (master waits for every changelog to be saved to hdd, but without syncing - a little more safe than the background option, but may cause master to stop and wait for flushing hdd buffers)
# 2 - write in foreground with fsync after each write (very safe, but may make your master very slow unless you have very sophisticated hardware)
# CHANGELOG_SAVE_MODE = 0

So mode 2 will always perform the fsync operation after every write, which is why you see such a performance drop.
And yes - this is not a bug.
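In rough C terms, the difference between the three modes looks like this (a minimal sketch; changelog_store() is a hypothetical name for illustration, not actual MooseFS source):

#include <string.h>
#include <unistd.h>

/* hypothetical per-entry changelog writer, parameterized by save mode */
void changelog_store(int fd, const char *entry, int save_mode) {
    if (save_mode == 0) {
        /* mode 0: hand the entry to a separate background process and
           return immediately - the master never blocks on the disk */
        return; /* (enqueue for the background writer here) */
    }
    write(fd, entry, strlen(entry));  /* modes 1 and 2: foreground write */
    if (save_mode == 2)
        fsync(fd);                    /* mode 2 only: flush to disk before
                                         the operation is acknowledged */
}

With mode 2, every changelog entry pays a full device flush, which is what multiplies the device IOPS under a synchronous write workload.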

@Lathanderjk
Author

Lathanderjk commented May 23, 2023

Of course I expected a performance drop and an I/O increase, but not a thousand times... or 3-4000x.

@chogata
Copy link
Member

chogata commented May 23, 2023

Changelogs are recorded at least every second, even if the system is idle. If it's not idle, they are recorded even more frequently. You are asking your kernel to fsync every single one of those write operations. This has a BIG impact on performance. This option exists only for cases that demand the highest level of security and has no application in most scenarios.

@Lathanderjk
Author

I'm only asking whether this is intended, and whether there is room for some grouping that would reduce the fsync calls. One block written to a file with dd causes 6 writes by mfsmaster (some of those will be the filesystem...).
I also tried strace with dd count=10, 100, etc., and it performs at least 3 fsyncs per single written block.

@chogata
Member

chogata commented May 25, 2023

Grouping fsyncs lessens security. This option really isn't meant for regular use and was added only after specific requests from our users. MooseFS used to have only option 0 (that is, writing changelogs in the background, in a separate process), because we knew that any other option would severely impact performance. But for security reasons, when performance wasn't an issue, some users wanted the option. We added it, but we always said: use at your own risk. A kind of middle ground is option 1, that is, writing changelogs in the foreground but without the fsyncs. There is no danger of the background process hanging (and nobody noticing, which as I remember was the main issue), and not such a big impact on performance, at the cost of no fsyncs - so there is still a possibility of losing the tail of the latest changelog file in case of hardware failure. But with grouped fsyncs that possibility would also exist.
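To make that trade-off concrete, a grouped-fsync writer might look something like this (a hypothetical sketch, not MooseFS code; the interval and function name are invented for illustration):

#include <string.h>
#include <time.h>
#include <unistd.h>

#define FLUSH_INTERVAL_NS 100000000L  /* flush at most every 100 ms */

static struct timespec last_flush;

void changelog_store_grouped(int fd, const char *entry) {
    write(fd, entry, strlen(entry));  /* entry reaches the page cache only */

    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    long elapsed = (now.tv_sec - last_flush.tv_sec) * 1000000000L
                 + (now.tv_nsec - last_flush.tv_nsec);
    if (elapsed >= FLUSH_INTERVAL_NS) {
        fsync(fd);                    /* one flush covers the whole batch */
        last_flush = now;
    }
    /* any entry acknowledged after write() but before the next fsync()
       is lost on power failure - the same exposure as mode 1, merely
       bounded by FLUSH_INTERVAL_NS */
}

This would cut fsyncs from one per entry to at most ten per second, but as explained above, the operations acknowledged between flushes are exactly what a crash can lose.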

@Lathanderjk
Author

Grouping fsyncs means less security, that's true, but what about a single fsync per operation, issued before sending the OK to the client? In that scenario, wouldn't a lost fsync just force the client to retry the operation?

A pure metadata operation like touch or mkdir causes only one fsync, but appending a single line to a file with "date >> myfile" results in 4 fsyncs by mfsmaster every time, and dd causes 3 fsyncs for every single block.

@chogata
Member

chogata commented Jun 5, 2023

Master doesn't know what the client process would consider a "single operation" for a file - it doesn't know that there is one "dd" command requesting 3 actions on one file. From the master's point of view, those 3 operations might as well have been requested by 3 different processes using the same client. We would have to complicate the protocol and introduce some kind of markers from clients to the master, telling the master where the "fsync points" for metadata should be, but that would introduce a whole new level of "complicated" in master-client interactions, and not without a significant impact on performance.
May I ask why you are using save mode 2? As far as we know, this is a very rarely used option.

@Lathanderjk
Author

In my test setup I use DRBD to replicate metadata (protocol B over an RDMA transport) and manage the MooseFS master via the Pacemaker cluster manager to create an HA setup.
CHANGELOG_SAVE_MODE=2 is necessary for clients and I/O operations to recover correctly and immediately after a master failover. Without CHANGELOG_SAVE_MODE=2 you end up with stuck clients, holes in metadata reported by the master, or missing I/O operations - not a very reliable filesystem for production.
I really like the performance, low resource requirements and simplicity of MooseFS... metadata operations (reads) are blazing fast compared to other distributed filesystems, even with tens of millions of files.

@borkd
Collaborator

borkd commented Oct 12, 2023

Looks like you have answered your own question - it is a feature, not a bug. CHANGELOG_SAVE_MODE=2 is simply a higher-priced insurance premium a cluster admin decides to pay to lessen the risks and impact of inevitable unplanned outages. Considering you could operate mfsmaster on a barebones RPi, any specialized/low-latency storage device or replication over RDMA easily falls under the 'sophisticated hardware' umbrella.
