dentry_hashsize #4335
Just to note: the only inode_table with a non-default dentry_hashsize is the SHD, which uses 131. This may in part explain the generally poor performance of SHD when things go badly wrong here, although it's been drastically better in 11.x.
Which version are you using? Do you really feel the bottleneck in posix_readdirp_fill() is inode_grep() and not posix_pstat() (which goes all the way to disk to fetch metadata, multiple times)?
Hi @mykaul. Currently on version 11.1. Kernel is 6.4.6 with https://lore.kernel.org/lkml/20230727081237.18217-1-jaco@uls.co.za/ (there is a follow-up discussion and I never got around to v3 ...; v2 works but is apparently definitely wrong and should crash the kernel, yet it doesn't, nor does it seem to leak memory, so I think I'm just over-allocating a lot of memory pointlessly). I don't understand things well enough to give you an offhand answer. What I can state (without having to think about it) is that getdents() is not performing well; where the bottleneck sits inside glusterfs I can only guess at. On a folder with ~10k files in it (unfortunately not that unusual; it was 70k, and we've got folders with >100k):
So yea, there you go: at ~500 entries/second it takes ~20 seconds just to enumerate the file list. For larger folders it gets much, much slower very quickly. On the above folder, back when it held around 70k files, we've waited several minutes for this to complete. The further into the sequence you get, the slower the calls become (it's as if glusterfs restarts the readdir on the underlying filesystem almost every time). For maildir this is probably the single most important sequence of system calls. Just to give an idea (df -hi for one of the bricks):
And:
So that's an average of 62.90 dentries per folder (assuming my calculations are correct). However, a great many folders, at least two thirds, generally have nothing in them, and many clients use POP3 so those folders remain small too. As stated, the larger folders can trivially have upwards of 100k entries; I'll need to make some effort to find a few of these again, but a very quick search over some users known to have largish mailboxes turned up a fair number with 40k+ entries. Before patching the kernel it sometimes took 3-4 seconds just to get perhaps 18 entries per getdents() call. How can I confirm for you whether the bottleneck is posix_readdirp_fill(), inode_grep() or posix_pstat()?
I used to look at 'perf' results, but counting time (with debug code added) around them could also be useful. I think there's also a way to dump all entries and then see how many land on the same hash, which will help determine whether we should indeed make it larger. Other than some potentially wasted memory, I don't see any harm in making it larger.
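Something along these lines is what I mean by debug timing (a rough sketch only; now_ns() is a made-up helper, and the exact insertion points inside posix_readdirp_fill() are left to you):

```c
/* Minimal timing helper for the "debug code added" idea: wrap the
 * inode_grep() and posix_pstat() calls and accumulate per-request totals. */
#include <stdint.h>
#include <time.h>

static inline uint64_t
now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Inside the readdirp fill loop (pseudo-placement):
 *
 *     uint64_t t0 = now_ns();
 *     inode = inode_grep(itable, fd->inode, entry->d_name);
 *     grep_ns += now_ns() - t0;
 *
 *     t0 = now_ns();
 *     posix_pstat(...);
 *     pstat_ns += now_ns() - t0;
 *
 * Log grep_ns vs pstat_ns once per request to see which one dominates. */
```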
I don't think changing dentry_hashsize will improve the posix_pstat() function, but it can improve inode_grep(). dentry_hashsize has no impact on system call performance (like getdents/stat), so posix_pstat() won't get any faster; it improves inode_grep(), and through that you can gain in readdirp latency. You can generate a statedump and compare the readdirp latency for the posix xlator after changing dentry_hashsize. If readdirp latency at the posix xlator improves, you can assume dentry_hashsize is helping.
Is the readdir-optimize option enabled, btw? It should be valuable, especially if you have many empty directories.
readdir-optimize is useful only for dht subvolumes; here the volume type is replicate, so it would not help.
It's enabled regardless.
Yea, with the way maildir works (create a file in tmp/ then rename() it into the appropriate location), any kind of distribute results in absolute chaos over time; we find that a simple replicate remains best. On less IO-intensive workloads (voice recordings) where we typically have three or more nodes, a 3 x (2 + 1) setup works very well, even though we tend to link() and unlink() quite a bit anyway. However, we've done a LOT of work there to keep file counts down to <1000/folder as far as possible, and I believe 95%+ of cases are below 100/folder. For the maildir workload, though, we find that as files/folder goes up, getdents() time goes up equally badly, and at any point in time we've got upwards of 5000 connected clients, with definite hotspots during the day where sequences of readdir() (getdents in kernel speak) happen. We're swapping courier-imap for dovecot in a couple of days, which should alleviate the pressure on readdir() + open/read/close cycles quite significantly, but the core readdir problem remains when clients select folders (basically: check the new/ folder for new files, generate UID values for these files, rename them into cur/, then loop readdir() over cur/ checking that all files have UID values, generating them where missing, create a new UID list file in tmp/ and rename it into place). If that readdir loop over cur/ takes 10 minutes it creates a really bad experience. So, the immediate plan of action: patch two of the four fuse mounts to have a larger dentry_hashsize (100003 seems like a sensible larger value); grab a statedump before remount, let it run for a few hours after remount, and then get another statedump so things can be compared. From yesterday's statedump I'm assuming the values we're looking for are on the READDIRP line here:
What's the unit for these measurements? ns?
Yes, latency is in ns. To observe the improvement you can also generate a statedump for a brick process and compare posix.latency.READDIRP with/without the patch.
Is it possible that the counters are wrapping over?
Note the latter two lines, where the later values for CNT and TOTAL are smaller than the earlier ones. Time difference = 68815s = 19h06m55s. Or is there something going on that resets the CNT and TOTAL values? Would overflow even be detected and handled half-properly by, say, resetting both values to zero? I can't imagine CNT overflowing a 64-bit unsigned; it is conceivable that with billions of operations over the lifetime of a process TOTAL would, but I don't see that happening here either (the larger TOTAL value is about 0.003% of the way to overflowing a 64-bit unsigned, and given the average response time, assuming it's accurate, it would take about 252x10^9 readdirp operations to overflow, which isn't going to happen in just over 19h; 3.7m readdirp's/s isn't going to happen). So what's the reset condition?
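For reference, the arithmetic above as a tiny standalone program (the average latency is an assumed placeholder; substitute TOTAL/CNT from a real statedump):

```c
/* Back-of-the-envelope overflow check: with TOTAL stored as a 64-bit
 * unsigned count of nanoseconds, how many READDIRP calls would it take
 * to wrap?  The average latency here is an assumption, not measured. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t avg_latency_ns = 73000000ULL;            /* assumed ~73 ms per call */
    uint64_t calls_to_wrap = UINT64_MAX / avg_latency_ns;

    printf("calls needed to overflow TOTAL: %llu\n",  /* ~2.5e11 at 73 ms */
           (unsigned long long)calls_to_wrap);
    return 0;
}
```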
AFAIK the variables are reset when we generate a statedump.
I've only patched two of the FUSE mounts so far, and these don't seem to have posix.latency.READDIRP; they've been patched for just the last few minutes:
It looks like one of the bricks is worse than the other if I read that correctly? Unpatched brick:
Am I interpreting this correctly, that the bulk of the latency comes from the bricks rather than the FUSE mounts? Which honestly is NOT what I expected, but not impossible either. So this means that a way to tune inode_hashsize on the bricks (rather than using the fixed 2^16) would probably be of benefit here too? Not that (according to /proc) either brick has >1000 open file descriptors with the kernel. Or am I completely missing the boat here, as I suspect? I'll only be able to patch the brick nodes this evening (~12h from now).
We should generate the statedumps at the same time; currently the data is not comparable. As per the current data, average readdirp latency on the brick side is 63ms and at the fuse level 132ms, but strace was not showing a similar level of latency; it was showing > 800ms.
Not all cases are as pathological ... it's primarily folders with a larger number of entries, but other cases do bite on this too sometimes; then it's still 1s or so max and you have the full listing, whereas with >500 entries it gets to around 1s per 500 entries, and that 1s goes up (I think the longest I've observed via strace is around 8s per getdents() call on a folder with 100k entries). I haven't got recent examples, but I also recall that sometimes it would sit nicely in the <500ms/call region, then jump to something like 3.5s for a call, back down for a few calls, then jump up again, and back down. That is what first made me think it's probably some form of insufficient cache. What I'm now starting to think (based on the fact that there aren't many thousands of open fds on the brick side) is that glusterfs restarts the underlying readdir() to the OS, and I'm wondering if that's not perhaps affecting performance too; that said, a find on the folder with 70k entries took a few seconds on the brick whereas it took several minutes through fuse. Based on "reset at dump time" I've now done a statedump on ALL glusterfs-related processes via cssh, and will post all the latency values from a similar dump of ALL processes later.
Underlying FS performance is attributed to dir_index (ext4), which was enabled on one of the two underlying filesystems. This is the first case in my experience where dir_index seems to improve performance. It's enabled on the other filesystem now too, but obviously it'll take time for all folders to pick this up, so we'll see if we can schedule downtime for an e2fsck -D there to push that through; I still don't think that will be sufficient on its own.
brick 1:
brick 2 (previously without dir_index):
And then the four fuse mounts:
One thing that bugs me here immediately: why doesn't the fourth FUSE mount also have a mail-client-1? netstat -ntap shows it connected to both bricks. The last two mounts also run on nodes where there are bricks (hence, I suspect, the discrepancy in CNT between one and the other on number 3). Regardless, based on this it looks like dir_index is a more significant factor than initially deduced, and an fsck -D should help significantly. It seems the posix translator (just above the underlying filesystem?) is about the only place where we can even think of gaining any real additional performance, since this is where the bulk of it goes. This implies we need to:
Given the experience with find directly on the brick I'm not inclined to think 1 is a problem. Looking into posix_readdirp() and its sub-calls: under what circumstances would GET_ANCESTRY_DENTRY_KEY on the dict be true? As far as I can tell this happens only if quotas are in effect, but I'd rather get confirmation. That wouldn't apply to me then, so the get_ancestry and the resulting loop to count the returned elements don't apply either? How do quotas affect readdirp() anyhow? This doesn't quite make sense to me.
That comment is quite interesting and probably belongs in posix_fill_readdir() rather than where it's called from. Looking at posix_fill_readdir() (a call to which is the only thing after the comment), it has this in it (line 5716): entry = sys_readdir(pfd->dir, scratch); Which I would have said is a HUGE issue, were it not for the fact that glibc already uses getdents() behind the scenes and the above is simply a wrapper for glibc's entry = readdir(pfd->dir). What remains true is that we're basically copying the same data over and over, so calling getdents() with a 256KB buffer and decoding it straight into the eventual return memory may well make sense and may provide (probably marginal) improvements. I do have a concern: is it possible for multiple posix_fill_readdir() calls to be invoked concurrently for the same pfd by multiple clients? That would result in a lot of seeking and possible races where different clients get wrong results. I don't see this protected by a lock anywhere, but I may simply be looking in the wrong place. Otherwise, busy looking through the code to see if I can spot anything obvious.
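Purely as an illustration of the getdents()-with-a-256KB-buffer idea (a standalone sketch, not how posix_fill_readdir() is structured today):

```c
/* Pull a large batch of raw linux_dirent64 records in one syscall and walk
 * the buffer, instead of one readdir() per entry. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

#define BUF_SIZE (256 * 1024)

int main(int argc, char **argv)
{
    static char buf[BUF_SIZE];
    int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return 1;

    for (;;) {
        long nread = syscall(SYS_getdents64, fd, buf, BUF_SIZE);
        if (nread <= 0)
            break;                               /* 0 = end of dir, <0 = error */
        for (long off = 0; off < nread;) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);
            puts(d->d_name);                     /* here the entries would be filled */
            off += d->d_reclen;
        }
    }
    close(fd);
    return 0;
}
```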
Is this worth fixing to !entry && errno? readdir() will only return NULL on EOF or error, and on error errno gets set (on EOF it isn't changed, so setting errno=0 before sys_readdir() is correct).
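As a standalone sketch of the idiom in question (not the glusterfs code itself):

```c
/* readdir() returns NULL for both end-of-directory and error, so errno has
 * to be cleared before the call and checked afterwards. */
#include <dirent.h>
#include <errno.h>

static int
drain_dir(DIR *dir)
{
    struct dirent *entry;

    for (;;) {
        errno = 0;                /* reset so NULL + errno == 0 means clean EOF */
        entry = readdir(dir);
        if (!entry) {
            if (errno)
                return -1;        /* genuine error */
            return 0;             /* end of directory */
        }
        /* ... process entry ... */
    }
}
```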
I never got to complete my work, but there were some low-hanging fruits in the readdir(p) paths. There are many other nano-improvements; none would be as useful as figuring out whether there's any disk access we are doing that we can skip (either by re-using previous data or because it's not needed at all).
I read that as "if there are micro-optimizations, please submit a patch and let's take the small wins where we can"? I don't mind taking on slightly bigger things either; it seems, for example, that the readdirp() variant does a second iteration over all the data on top of the normal readdir() pass ... this would pretty much trash CPU cache coherency as well, which again is a micro-optimization but looks like possibly a HUGE alteration overall. I think I need to understand this code a bit better before making huge changes, so a "micro optimizations" patch whilst busy figuring all that out is not unachievable.
@jkroonza - the community appreciates any patch, big or small. I'm just mentioning that in my experience the smaller improvements are all negligible, almost noise-level, compared to removing a disk access (even one that's already in the cache). There are countless of those that can be made if you trace the calls that readdir(p) does. Code that goes twice over the results (I believe fuse does it as well - see fuse_readdir_cbk() for example) probably does a first pass to see how much memory is needed for all entries and a second to actually populate it.
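Roughly this shape, as a generic sketch (not the actual fuse_readdir_cbk() code):

```c
/* Two-pass pattern: pass 1 computes how much memory all entries need,
 * pass 2 populates the buffer.  Types and names here are illustrative only. */
#include <stdlib.h>
#include <string.h>

struct name_entry {
    const char *name;
    struct name_entry *next;
};

static char *
pack_names(struct name_entry *head, size_t *out_len)
{
    size_t need = 0;
    struct name_entry *e;

    /* pass 1: size everything */
    for (e = head; e; e = e->next)
        need += strlen(e->name) + 1;

    char *buf = malloc(need ? need : 1);
    if (!buf)
        return NULL;

    /* pass 2: actually fill the buffer */
    size_t off = 0;
    for (e = head; e; e = e->next) {
        size_t len = strlen(e->name) + 1;
        memcpy(buf + off, e->name, len);
        off += len;
    }

    *out_len = need;
    return buf;
}
```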
Looks like the dentry_hashsize did make some difference, only dumping the bricks though:
Average here is now 15ms; it was previously around 63ms (with dir_index).
Average here is now 37ms, previously 173ms. dir_index here was enabled before both "stretches"; however, I don't know how much of the filesystem has been dir_indexed in the meantime, so part of this could be dir_index kicking in, but I doubt all of it. Will issue an fsck -D late tonight. Also, we perform touch .../$(hostname) on all fuse mounts from time to time and alert if it takes more than 5s; this normally went completely nuts during a find over the full filesystem, and whilst not completely silent now, it's definitely much better. Looking at the above figures my gut feeling of a ~4x improvement holds, so for ~7x the "base table" memory we gain a 4x performance boost. Looking at list.h, each list_head is 16 bytes, so at 100003 entries that's 1.6MB of RAM; given that, I'm happy to multiply it by 10 at least once more if it provides additional gains. I thus think it's critical that this value becomes configurable. It also seems (non-scientifically) to reduce disk IO requirements: iostat says both the nvme and the underlying disk are quieter, yet the rrd IO graphs say overall read throughput has increased (write is about the same). This is conflicting, as there has been no reduction (rrd graphs) in disk writes on either medium. The iostat %util has gone down drastically though, as have load average values. It could just be the time of day as well, so I might be comparing apples with oranges; we do have a similar number of client connections online though. Is there a way to see how full the hash tables are getting and how many entries are in them (which would also allow calculating average chain length)? IIRC the hash chains are doubly-linked lists, and a previous suggestion was that the overhead of maintaining those is higher than the gain (I think the big benefit is when unlinking entries) compared to singly-linked lists (to remove you need to re-iterate to find prev, which is OK on short chains). I suspect that would be more of a memory saving than any real performance gain though.
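Something like the following is what I have in mind for inspecting the dentry hash (purely hypothetical; the field names name_hash, dentry_hashsize and lock are my reading of libglusterfs inode.h and need to be verified against the tree, and list_for_each is the glusterfs list.h macro):

```c
/* Hypothetical dump-time helper: walk every dentry hash bucket and report
 * occupancy and the longest chain.  Field names are assumptions and must be
 * checked against libglusterfs/src/glusterfs/inode.h. */
static void
dentry_hash_stats(inode_table_t *table)
{
    size_t used = 0, entries = 0, longest = 0;
    size_t i;

    pthread_mutex_lock(&table->lock);
    for (i = 0; i < table->dentry_hashsize; i++) {
        size_t chain = 0;
        struct list_head *pos;

        list_for_each(pos, &table->name_hash[i])
            chain++;
        if (chain) {
            used++;
            entries += chain;
            if (chain > longest)
                longest = chain;
        }
    }
    pthread_mutex_unlock(&table->lock);

    gf_log("inode", GF_LOG_INFO,
           "dentry hash: %zu/%zu buckets used, %zu entries, longest chain %zu",
           used, (size_t)table->dentry_hashsize, entries, longest);
}
```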
Thanks for sharing the data; it seems like a big improvement. Do we have any data on how much improvement we are getting?
What am I looking for? On the fuse mounts I can see these sections, which are semi-useful but not useful enough:
On the bricks I'm seeing:
But that doesn't say anything about how used the dentry_hashsize is, only inode_hashsize. And this again shows why I patched the fuse mount to support setting inode_hashsize larger: those chains average length 8 here, and lowering the chain length would also improve performance. What I find interesting is that active_size is actually really tiny, but experience has shown that lowering lru_limit causes excessive disk IO, to the point that even with a 97%-effective NVMe-based dm-cache things stall badly. So as it stands I think:
1 is trivial, and I think 2+3 (3 requires a prime validator, or do we just document that the number should be prime without enforcing it?) should be fairly easy; I don't know about 4 since there may be no counter available to base this on. I'll check if I can get the perf stuff working; the package I've got is https://perf.wiki.kernel.org/ - is that the correct one? I was contemplating recompiling with gprof and working from there, but you're more familiar with perf, so I think it's more useful for me to learn a new tool than to blatantly waste your time.
It is true that currently there is no counter available specific to dentry usage; I think it would not be difficult to add a new counter and save the number.
Just like the other counters, I think it's the wrong approach. Count when you dump. All those 'inode->table->active_size++;' and 'inode->table->active_size--;' are not needed; since we take a lock when dumping anyway, let's just count the number of entries in the purge, lru and active lists.
Yes, I was referring to the same thing: we don't need to maintain a counter; during the dump we can count the entries.
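A sketch of the count-at-dump-time idea, assuming the statedump path already holds table->lock (the list head names active/lru/purge come from my reading of inode.h and may differ):

```c
/* Count at dump time instead of maintaining active_size/lru_size counters.
 * Assumes the caller already holds the inode table lock. */
static size_t
count_list(struct list_head *head)
{
    size_t n = 0;
    struct list_head *pos;

    list_for_each(pos, head)
        n++;
    return n;
}

/* In the inode table dump, with the table lock held:
 *
 *     size_t active = count_list(&table->active);
 *     size_t lru    = count_list(&table->lru);
 *     size_t purge  = count_list(&table->purge);
 */
```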
#4339 starts with the four points I mentioned. I think 1 is complete and I've started on 2, which is as far as I'm going to get today; I'd like to know if this looks OK so far at least.
Description of problem:
Slow performance ... as usual :).
The exact command to reproduce the issue:
None.
Mandatory info:
- The output of the gluster volume info command:
- Is there any crash? Provide the backtrace and coredump:
No crash.
Additional info:
I'd like to discuss options to increase:
network.inode-lru-limit: 1048576
Our fuse mounts are mounted with --lru-limit=2097152 --inode-table-size=2097152 (the state dump shows just over 1m entries in use) and we reckon bumping the same for bricks should have some effect. As I understand it, network.inode-lru-limit sets the same/similar for bricks. Whilst the bricks run on NL-SAS drives behind a battery-backed RAID controller with a 1TB NVMe write-through LVM dm-cache, iostat is starting to show some pressure on the NVMe units, so we would like to sacrifice some more RAM in order to improve overall performance.
There is also dentry_hashsize, which I believe may make a difference, in particular because the single largest bottleneck is readdir(). Even though we've applied kernel patches to read 256KB at a time rather than 4KB at a time (in line with glibc's getdents() calls), we suspect performance here can still be improved if glusterfs can more frequently "pick up where it left off".
Unfortunately I don't understand all the impacts; previous issue reports had some patches that caused other problems, so we're looking for simpler, lower-impact changes, such as simply increasing the amount of memory used.
From the gluster state dump for one of the fuse mounts:
So we're hoping that we can bump that 14057 to something a little larger (say 100003). Whilst we can certainly just apply this:
We are not certain we understand the impact or everything that would be affected, and as such we don't want to blindly jump in and make random changes.
Kind regards,
Jaco