
[BUG] Performance impact and write amplification with CHANGELOG_SAVE_MODE = 2 #540

Open
Lathanderjk opened this issue May 23, 2023 · 9 comments


@Lathanderjk

Lathanderjk commented May 23, 2023

Have you read through available documentation, open Github issues and Github Q&A Discussions?

Yes

Your moosefs version and its origin (moosefs.com, packaged by distro, built from source, ...).

3.0.117 from the official MooseFS repositories

Operating system (distribution) and kernel version.

Linux RHEL 8.8 4.18.0-477.10.1.el8_8.x86_64

Hardware / network configuration, and underlying filesystems on master, chunkservers, and clients.

dedicated server with AMD EPYC 7282, 256GB RAM, XFS(chunk/meta), 40G networking

How much data is tracked by moosefs master (order of magnitude)?

  • All fs objects: 44777185
  • Total space: 27TB
  • Free space: 3.5TB
  • RAM used: ~30GB(master)
  • last metadata save duration: 7.4s

Describe the problem you observed.

With CHANGELOG_SAVE_MODE = 2 I observe a drastic increase in the master server's IOPS and drive utilization.
Metadata is stored on a single Intel Optane 905P with an XFS filesystem and default mount/mkfs options, except noatime.

CHANGELOG_SAVE_MODE = 1
drive utilization: ~0.01%
IOPS(write only): 2-5
avg write: 0.6MB/s (could be inaccurate, just observed from graph)

CHANGELOG_SAVE_MODE = 2
drive utilization: 99.7%
IOPS(write only): 17.4K
avg write: 42MB/s (could be inaccurate)
From strace, the master is calling more than 9K fsync/s (while dd writes only ~3K blocks/s).

This is a test setup; there is no other load on the master or chunk servers, and likewise on the client. For the test, only one client is writing with dd:
dd if=/dev/zero of=test bs=4k count=2621440 oflag=dsync status=progress
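For reference, oflag=dsync makes dd open its output file with O_DSYNC, so every 4 KiB write must reach stable storage before the next one starts - roughly equivalent to this C loop (an illustrative sketch, not what dd actually executes):

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    char buf[4096] = {0};
    /* O_DSYNC: each write() returns only after the data has reached
       stable storage, so every 4 KiB block becomes a separate
       synchronous operation on the filesystem */
    int fd = open("test", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0)
        return 1;
    for (long i = 0; i < 2621440; i++) {
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
            return 1;
    }
    close(fd);
    return 0;
}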

Write performance drops from 20.3 MB/s (mode 1) to 11.8 MB/s (mode 2), and mfsmaster is in the D state most of the time. I can write a much bigger file and performance stays stable in both cases. How can this cause 17.4K IOPS on metadata (changelog) writes and more than 9K fsync calls?
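As a rough consistency check of these numbers (taking the mode-2 throughput above, the 4 KiB block size, and the per-block fsync/write counts measured later in this thread):

    11.8 MB/s / 4 KiB per block       ≈ 2.9K blocks/s
    2.9K blocks/s x 3 fsyncs per block ≈ 9K fsync/s
    2.9K blocks/s x 6 writes per block ≈ 17.4K write IOPS

so the observed rates are consistent with a few changelog entries, each individually fsynced, per written block.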

I'm not sure if this is a bug.

@xandrus
Member

xandrus commented May 23, 2023

Hi,

this is what you can find in mfsmaster.cfg file description:

# Changelog save mode. There are three modes of writing changelogs:
# 0 - write in background by different process (less safe, but doesn't make master stop in case of heavy hdd load)
# 1 - write in foreground without syncing data (master waits for every changelog to be saved to hdd, but without syncing - a little more safe than the background option, but may cause master to stop and wait for flushing hdd buffers)
# 2 - write in foreground with fsync after each write (very safe, but may make your master very slow unless you have very sophisticated hardware)
# CHANGELOG_SAVE_MODE = 0

So mode 2 will always perform the fsync operation after every write, which is why you see such a performance drop.
And yes - this is not a bug.
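In rough C terms, the difference between the three modes looks like this (a minimal sketch; changelog_store() is a hypothetical name for illustration, not actual MooseFS source):

#include <string.h>
#include <unistd.h>

/* hypothetical per-entry changelog writer, parameterized by save mode */
void changelog_store(int fd, const char *entry, int save_mode) {
    if (save_mode == 0) {
        /* mode 0: hand the entry to a separate background process and
           return immediately - the master never blocks on the disk */
        return; /* (enqueue for the background writer here) */
    }
    write(fd, entry, strlen(entry));  /* modes 1 and 2: foreground write */
    if (save_mode == 2)
        fsync(fd);                    /* mode 2 only: flush to disk before
                                         the operation is acknowledged */
}

With mode 2, every changelog entry pays a full device flush, which is what multiplies the device IOPS under a synchronous write workload.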

@Lathanderjk
Author

Lathanderjk commented May 23, 2023

Of course I expected a performance drop and an I/O increase, but not a thousand times... or 3-4000x.

@chogata
Copy link
Member

chogata commented May 23, 2023

Changelogs are recorded at least every second, even if the system is idle. If it's not idle, they are recorded even more frequently. You are asking your kernel to fsync every single one of those write operations. This has a BIG impact on performance. This option exists only for cases that demand the highest level of security and has no application in most scenarios.

@Lathanderjk
Author

I'm only asking whether this is intended, and whether there is room for some grouping that would reduce the fsync calls. One block written to a file with dd causes 6 writes by mfsmaster (some of those will be the filesystem...).
I also tried strace with dd count=10, 100, etc., and it performs at least 3 fsyncs per single written block.

@chogata
Member

chogata commented May 25, 2023

Grouping fsyncs lessens security. This option really isn't meant for regular use and was added only after specific requests from our users. MooseFS used to have only option 0 (that is, writing changelogs in the background, in a separate process), because we knew that any other option would severely impact performance. But for security reasons, when performance wasn't an issue, some users wanted the option. We added it, but we always said: use at your own risk. A kind of middle ground is option 1, that is, writing changelogs in the foreground but without the fsyncs. There is no danger of the background process hanging (and nobody noticing, which as I remember was the main issue), and not such a big impact on performance, at the cost of no fsyncs - so there is still a possibility of losing the tail of the latest changelog file in case of hardware failure. But with grouped fsyncs that possibility would also exist.
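To make that trade-off concrete, a grouped-fsync writer might look something like this (a hypothetical sketch, not MooseFS code; the interval and function name are invented for illustration):

#include <string.h>
#include <time.h>
#include <unistd.h>

#define FLUSH_INTERVAL_NS 100000000L  /* flush at most every 100 ms */

static struct timespec last_flush;

void changelog_store_grouped(int fd, const char *entry) {
    write(fd, entry, strlen(entry));  /* entry reaches the page cache only */

    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    long elapsed = (now.tv_sec - last_flush.tv_sec) * 1000000000L
                 + (now.tv_nsec - last_flush.tv_nsec);
    if (elapsed >= FLUSH_INTERVAL_NS) {
        fsync(fd);                    /* one flush covers the whole batch */
        last_flush = now;
    }
    /* any entry acknowledged after write() but before the next fsync()
       is lost on power failure - the same exposure as mode 1, merely
       bounded by FLUSH_INTERVAL_NS */
}

This would cut fsyncs from one per entry to at most ten per second, but as explained above, the operations acknowledged between flushes are exactly what a crash can lose.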

@Lathanderjk
Author

Grouping fsyncs means less security, that's true, but what about a single fsync per operation, issued before sending the OK to the client? In that scenario, wouldn't a lost fsync just force the client to retry the operation?

A pure metadata operation like touch or mkdir causes only one fsync, but appending a single line to a file with "date >> myfile" results in 4 fsyncs by mfsmaster every time, and dd causes 3 fsyncs for every single block.

@chogata
Member

chogata commented Jun 5, 2023

Master doesn't know what the client process would consider a "single operation" for a file - it doesn't know that there is one "dd" command requesting 3 actions on one file. From the master's point of view, those 3 operations might as well have been requested by 3 different processes using the same client. We would have to complicate the protocol and introduce some kind of markers from clients to the master, telling the master where the "fsync points" for metadata should be, but that would introduce a whole new level of "complicated" in master-client interactions, and not without a significant impact on performance.
May I ask why you are using save mode 2? As far as we know, this is a very rarely used option.

@Lathanderjk
Author

In my test setup I use DRBD to replicate metadata (protocol B over an RDMA transport) and manage the MooseFS master via the Pacemaker cluster manager to create an HA setup.
CHANGELOG_SAVE_MODE=2 is necessary for clients and I/O operations to recover correctly and immediately after a master failover. Without CHANGELOG_SAVE_MODE=2 you end up with stuck clients, holes in metadata reported by the master, or missing I/O operations - not a very reliable filesystem for production.
I really like the performance, low resource requirements and simplicity of MooseFS... metadata operations (reads) are blazing fast compared to other distributed filesystems, even with tens of millions of files.

@borkd
Collaborator

borkd commented Oct 12, 2023

Looks like you have answered your own question - it is a feature, not a bug. CHANGELOG_SAVE_MODE=2 is simply a higher-priced insurance premium a cluster admin decides to pay to lessen the risks and impact of inevitable unplanned outages. Considering you could operate mfsmaster on a barebones RPi, any specialized/low-latency storage device or replication over RDMA easily falls under the 'sophisticated hardware' umbrella.
