Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Special vdevs weren't special enough for embedded_logs #15991

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rincebrain
Copy link
Contributor

@rincebrain rincebrain commented Mar 13, 2024

(Yes, I know the property isn't in the man page, I was going to wait and see what people thought before adding it.)

Do people think the allocation property for this needs to be persistent across imports? Would a pool-wide property make more sense than per-vdev? Should it default to off to avoid changing existing behavior? (I would argue no, because the number of cases where this will be a problem seems much lower than the number of cases where it would be a benefit, and choosing to make that a default now effectively means having to tell people to turn it on forever, like we used to do with compression=on...)

Motivation and Context

People keep wanting to use a special device as a slog as well. cf. #15881

Description

I just redid the dance we did adding the sacrificial metaslab for embedded_log, but for special vdevs, and made it prefer that class if it exists to the embedded_log class.

Also threw in a vdev property to turn off doing this. Oddly, none of the non-user properties seem to persist right now, they all seem to be derived values - I could change the current state of this property to make it persistent across export/import, but figured I'd ask if there was some reason none of them are like that right now first...

How Has This Been Tested?

Did a bunch of sync IO to the pool, twiddled the property and watched the IO count drastically shift between vdevs.

Also tried zpool freeze mid-run and then importing it on an unmodified copy of the code running with --enable-debug to see if it died horribly on ZIL blocks on the special, and it did not.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

People keep wanting to use part of their special device as a slog,
but manually partitioning off some space for it seems unnecessarily complex.

So let's just redo the embedded_log dance, but for special vdevs, and make
the allocator prefer those if they exist.

Also plumbs in a vdev property for turning off this behavior on devices
in case this is not desirable.

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
@amotin
Copy link
Member

amotin commented Mar 13, 2024

Do people think the allocation property for this needs to be persistent across imports?

Non-persistent properties would be a pain to manage after every pool import. IMO they should be persistent. On the other hand, having embedded log metaslab initialized and so unavailable for normal allocation, but then never used due to tunable set to false sounds like a waste.

Would a pool-wide property make more sense than per-vdev?

From one side I can imagine when there are several special vdevs with different characteristics. From another per-vdev properties are still a relatively new concept and I regularly forget about them personally. ;)

Should it default to off to avoid changing existing behavior?

Not every special vdev can be a good ZIL, either due to high latency (though hopefully better than regular vdevs), or due to limited throughput, simply because there are much less of them in a system, or due to limited write life time. As I mentioned it before, I'd probably prefer some unified ability to explicitly specify use(s) for each special vdev (small blocks, metadata, DDT+BRT, ZIL, etc) instead of adding this special hack for embedded ZIL there. Embedded ZIL may be a good way to implement ZIL there, but I don't like a custom way of configuring it.

@rincebrain
Copy link
Contributor Author

Sure, not every special device is a fantastic ZIL. But I would argue that the worst case behavior of any SSD is better than the best case behavior of many spinning rust disks, and I don't think the argument hold especially well that we shouldn't use the special vdevs for this if we're saying they can be used for all metadata writes of any type, since larger things should hit the "skip double writing to ZIL" logic anyway, unless I don't understand the allocation behavior at all.

Embedded ZIL may be a good way to implement ZIL there, but I don't like a custom way of configuring it.

I'm not really sure I understand your argument here, sorry. I agree that having multiple ways to configure this is bad, but I think per-vdev properties are probably the "right" way to do this, short of a new bespoke configuration interface, and I don't really think one of those is a great idea, either.

I don't especially like the idea of holding up this PR behind backfilling a better interface for all our historical methods of configuration, because that seems likely to get bogged down in debates about what the defaults should be and interactions with how it used to work, since we had A) global tunables and B) the special "dedup" vdev class for dedup-only storage, and figuring out what to do with those going forward seems like a messy rabbit hole to go down. I agree it would be good to do, if we agree this is a reasonable interface, but I would prefer that be a separate PR, since I think it's likely to be a much deeper rabbit hole to clean up the multiple disparate things that exist.

I'm happy to go do it, if nobody else cares to, but I would prefer this be independent of that.

I forget about per-vdev properties all the time

Yeah, me too. I'm not running 2.2.x on any of my personal systems, so I don't encounter them very often, and they're not that directly useful to me in most cases at the moment. But I think it's probably the right scope and location for the settings, even if it's not the most intuitive for CLI users (and GUI users can expose things however they want).

Having an embedded_log metaslab unusable if we don't use this functionality seems like a waste.

I would agree, but hopefully that would be a very uncommon case.

If I recall the logic correctly, I think we could add handling to the initing of metaslabs that notices the vdev property is set to not use this functionality, and either not carve it out in the first place or un-carve it out. (The first I know should be safe, the second I believe can be done but would probably require more glue to quiesce it for allocations then shuffle it around, since I believe the first happens before the device is marked allocable at all...), but since these are flash media, I'm not sure this is the worst outcome in the world.

(An alternate approach, I suppose, would be to pick a metaslab at, say, 50% capacity, or something, and put that in the allocation group, so that it rotates which metaslab is getting sacrificed out of partially full ones each import.

Thinking about this also makes me think we might want to reserve more than one metaslab at least sometimes for this, because I think in some cases we could exceed e.g. 8 GiB of a single metaslab on a special for this, and funneling multiple vdevs into the single one makes that seem more likely...but that also compounds the inefficiency issues with doing this in the first place.)

Conceivably, we could add handling to the non-ZIL special allocation path to make it fall back to the special_embedded_log metaslab in the event that it runs out of other options, since we might prefer that to falling onto data vdevs, but we'd still get pretty unbalanced utilization across metaslabs there. (Of course, on an SSD, we probably don't care that much, since the FTL is abstracting space beneath us anyway...)

@rincebrain
Copy link
Contributor Author

(Also, as written, this trips an IMPLY in zdb -mmmmm, which I have a local change to fix and will push, but A) makes me wonder if there's some other edge case that would come up outside of zdb, and B) made me notice our IMPLY macros in userland are broken and don't print line/file, so figuring out which assertion I was hitting was fun.)

@amotin
Copy link
Member

amotin commented Mar 13, 2024

Sure, not every special device is a fantastic ZIL.

Right now users who care about massive sync write workload should explicitly care about SLOG. Keeping it as default does not sound terrible for me. But I don't insist. Mentioned "skip double writing to ZIL" logic in absence of SLOG will push data blocks to the normal vdevs, waiting for cache flushes there. After that it is not so important actually where we write the logs, unless we consider special embedded log as an SLOG.

I agree it would be good to do, if we agree this is a reasonable interface, but I would prefer that be a separate PR, since I think it's likely to be a much deeper rabbit hole to clean up the multiple disparate things that exist.

This patch introduces one more interface to manage it. Having it in makes the following task of interfaces unification only more complicated. I'd prefer it to be done either same time or in opposite order.

If I recall the logic correctly, I think we could add handling to the initing of metaslabs that notices the vdev property is set to not use this functionality, and either not carve it out in the first place or un-carve it out.

Yes, I think it would be better. Would we not even create the embedded_log class for the vdev based on some property, initing would not try to populate it. Then we could think what to do about it later. I don't remember how dedup class specification is handled now, but would we be able to specify something like "meta,small,dedup,zil" instead of "special" on vdev creation, we would get the information early enough to not need it to be changed often.

Thinking about this also makes me think we might want to reserve more than one metaslab at least sometimes for this, because I think in some cases we could exceed e.g. 8 GiB of a single metaslab on a special for this, and funneling multiple vdevs into the single one makes that seem more likely...but that also compounds the inefficiency issues with doing this in the first place.)

I'd just like to note that 8GB is the absolute minimum for SLOG these days with default dirty_max_max settings, worst case space inefficiency of ZIL allocation on mixed-sized writes and current need to store last 2 or 3 TXGs. We often want more, something closer to 32GB, if not 64GB.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants