Introduce new disk partitioning scheme #478

Open · wants to merge 2 commits into main

Conversation

@unbel13ver (Contributor) commented Feb 16, 2024

The new disk configuration lays the groundwork for upcoming features, such as A/B software updates, the Storage VM, and more.

The Lenovo X1 config has two LVM pools: the first is a fixed-size 250G "system" partition (which is going to be encrypted), and the rest of the disk is dedicated to the StorageVM.
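To illustrate, here is a minimal disko-style sketch of the described layout (the attribute names follow disko's GPT/LVM schema, but the device path, partition names, and pool names are illustrative assumptions, not the PR's actual file):

{
  disko.devices.disk.nvme = {
    device = "/dev/nvme0n1";
    type = "disk";
    content = {
      type = "gpt";
      partitions = {
        esp = {
          size = "500M";
          type = "EF00";
          content = {
            type = "filesystem";
            format = "vfat";
            mountpoint = "/boot";
          };
        };
        # fixed-size 250G "system" pool (to be encrypted later)
        system = {
          size = "250G";
          content = {
            type = "lvm_pv";
            vg = "pool";
          };
        };
        # whatever remains on the disk goes to the StorageVM pool
        vmstore = {
          size = "100%";
          content = {
            type = "lvm_pv";
            vg = "vmstore";
          };
        };
      };
    };
  };
}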

Description of changes

Checklist for things done

  • Summary of the proposed changes in the PR description
  • More detailed description in the commit message(s)
  • Commits are squashed into relevant entities - avoid a lot of minimal dev time commits in the PR
  • Contribution guidelines followed
  • Ghaf documentation updated with the commit - https://tiiuae.github.io/ghaf/
  • PR linked to architecture documentation and requirement(s) (ticket id)
  • Test procedure described (or includes tests). Select one or more:
    • Tested on Lenovo X1 x86_64
    • Tested on Jetson Orin NX or AGX aarch64
    • Tested on Polarfire riscv64
  • Author has run nix flake check --accept-flake-config and it passes
  • All automatic GitHub Actions checks pass - see actions
  • Author has added reviewers and removed PR draft status

Testing

The current upstream partitioning scheme is as follows:

[ghaf@ghaf-host:~]$ lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda             8:0    0 465.8G  0 disk 
├─sda1          8:1    0   236M  0 part 
└─sda2          8:2    0   8.6G  0 part 
nvme0n1       259:0    0 953.9G  0 disk 
├─nvme0n1p1   259:1    0     1M  0 part 
├─nvme0n1p2   259:2    0   500M  0 part /boot
└─nvme0n1p3   259:3    0 953.4G  0 part 
  └─pool-root 254:0    0 953.4G  0 lvm  /nix/store
                                        /

To test the new partitioning scheme, boot the lenovo-x1-carbon-gen11-debug-installer image and install Ghaf onto the internal SSD of the Lenovo X1.
After the installation completes and the laptop boots into the Ghaf system, check the partitions with the lsblk command.
The output should be as follows:

[ghaf@ghaf-host:~]$ lsblk
NAME                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                     8:0    0 465.8G  0 disk 
├─sda1                  8:1    0   1.8G  0 part 
└─sda2                  8:2    0     3M  0 part 
nvme0n1               259:0    0 953.9G  0 disk 
├─nvme0n1p1           259:1    0     1M  0 part 
├─nvme0n1p2           259:2    0   500M  0 part /boot
├─nvme0n1p3           259:3    0   500M  0 part 
├─nvme0n1p4           259:4    0   250G  0 part 
│ ├─pool-vm_storage_b 254:1    0    30G  0 lvm  
│ ├─pool-vm_storage_a 254:2    0    30G  0 lvm  /vmstore
│ ├─pool-root_b       254:3    0    50G  0 lvm  
│ ├─pool-root_a       254:4    0    50G  0 lvm  /nix/store
│ │                                             /
│ ├─pool-reserved_b   254:5    0    10G  0 lvm  
│ ├─pool-reserved_a   254:6    0    10G  0 lvm  
│ ├─pool-gp_storage   254:7    0    50G  0 lvm  
│ └─pool-recovery     254:8    0    20G  0 lvm  
└─nvme0n1p5           259:5    0 702.9G  0 part 
  └─vmstore-storagevm 254:0    0 702.9G  0 lvm

# SPDX-License-Identifier: Apache-2.0
# Example to create a BIOS-compatible GPT partition
#
# The partitioning scheme is hardcoded for a 500G disk.
Contributor:

With this comment we could actually make this specific to the Lenovo X1 Carbon Gen 11 SKU (stock keeping unit) with 500G of internal NVMe. I'm afraid we must build target-specific installers, as we have not yet described disk space extension for the case where the internal NVMe is larger. Optionally, the installer could resolve the size of the disk with lsblk and parametrize the size to generate a disk config for generic use. I think all the other fixed sizes could remain the same and only the remaining space here https://github.com/tiiuae/ghaf/pull/478/files#diff-8a1fbff8658406e95003d28116ba21ced73c81b4fb8d1a5edad5fe5aeef1c427R154 would have to be calculated. Please check with @remimimimimi whether this is feasible. The next iterations of the installer must support other installer parameters - like user-specific secrets - as well.

Contributor Author:

> Optionally, the installer could resolve the size of the disk with lsblk and parametrize the size to generate a disk config for generic use.

This exact approach assumes that the installer program changes the disk-config.nix file before passing it to the disko utility, which I am pretty sure we want to avoid.

> I think all the other fixed sizes could remain the same and only the remaining space would have to be calculated.

Unfortunately, the disko utility for some reason does not support the size = "100%" parameter in that exact place; it fails with an error while partitioning the disk. I probably should have mentioned this in a comment inside the file itself.

> The next iterations of the installer must support other installer parameters - like user-specific secrets - as well.

This is out of scope for this PR.

Contributor:

> Optionally, the installer could resolve the size of the disk with lsblk and parametrize the size to generate a disk config for generic use.
>
> This exact approach assumes that the installer program changes the disk-config.nix file before passing it to the disko utility, which I am pretty sure we want to avoid.

Ok. I'm fine with these assumptions for this iteration so that we can start development of the other areas this enables, but I think we've identified a few issues that affect the installer and target-device support design going forward. Given the constraints and your scoping, I propose changing the file location and name, because it's no longer a template and no longer generic. It wasn't really one in #340 either, but we kind of agreed to iterate it into proper shape for this device in the next iteration, which is this one.

My proposal for placement is <ghaf_repo_root>/targets/lenovo-x1-carbon-disk-config.nix, because this is very specific to that device. I don't think that is a problem, because the current iteration of the installer only supports the X1. If we want a template (or rather a disko-config.nix Nix module that accepts arguments from the installer) and a generic x86_64 disko-config, we must also change the dependency in the installer. Feel free to disagree, but IMO the filename remains confusing if we don't change it.

> I think all the other fixed sizes could remain the same and only the remaining space would have to be calculated.
>
> Unfortunately, the disko utility for some reason does not support the size = "100%" parameter in that exact place; it fails with an error while partitioning the disk. I probably should have mentioned this in a comment inside the file itself.

Sure, it's never too late to add insight in comments.

The next iterations of installer must support other installer parameters - like user specific secrets - as well.

This is out of scope of this PR.

Please see my previous comment. Scoping and addressing dependencies (even if in later iterations of development) are both important.

Contributor:

> Unfortunately, the disko utility for some reason does not support the size = "100%" parameter in that exact place; it fails with an error while partitioning the disk. I probably should have mentioned this in a comment inside the file itself.

It seems that this should work, @unbel13ver.

@Mic92 (Collaborator) commented Feb 21, 2024

Co-maintainer of disko here. Regarding secrets, one can use the diskoImagesScript to generate full system images that also contain secrets. The script is run outside of the NixOS sandbox. Documentation is here: https://github.com/nix-community/disko/blob/master/docs/reference.md#generating-disk-images-with-secrets-included-using-disko
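A minimal flake sketch of that flow (the host name, disk-config path, and secret paths are hypothetical; the diskoImagesScript build attribute and the --post-format-files flag come from the linked reference):

{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    disko.url = "github:nix-community/disko";
    disko.inputs.nixpkgs.follows = "nixpkgs";
  };

  outputs = { self, nixpkgs, disko, ... }: {
    nixosConfigurations.ghaf-host = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [ disko.nixosModules.disko ./disk-config.nix ];
    };
    # Build the image-generator script with:
    #   nix build .#nixosConfigurations.ghaf-host.config.system.build.diskoImagesScript
    # and run ./result outside the sandbox, copying secrets in, e.g.:
    #   ./result --post-format-files ./secrets/ssh_host_key /etc/ssh/ssh_host_key
  };
}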

Contributor Author:

> My proposal for placement is <ghaf_repo_root>/targets/lenovo-x1-carbon-disk-config.nix, because this is very specific to that device.

That is reasonable; implemented.

Also, I managed to put the rest of the disk to use for the StorageVM (I should have used "100%FREE" instead of "100%").
@leivos-unikie, could you retest this please? I've updated the expected partitioning in the header of this PR.
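For reviewers, the change boils down to something like this in the disko config (a sketch: the VG/LV names match the lsblk output above, while the mountpoint is an assumption):

{
  disko.devices.lvm_vg.vmstore = {
    type = "lvm_vg";
    lvs.storagevm = {
      # plain "100%" fails while partitioning; LVM's own syntax for
      # "all remaining space in the volume group" is "100%FREE"
      size = "100%FREE";
      content = {
        type = "filesystem";
        format = "ext4";
        mountpoint = "/storagevm";
      };
    };
  };
}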

Contributor:

I spotted these deviations in the log messages:

  • when booting the installer
    failed_creating_link

  • at the beginning of the installation process
    dont_know_msg

  • after the installation finished and Enter was pressed for reboot
    failed_unmounting

Otherwise, no issues were found in testing.

Contributor Author:

Thanks for testing! Those are not related to the subject of this PR (and they are not something worth worrying about, anyway).

@vilvo (Contributor) left a comment

Please see the detailed comments.

@leivos-unikie (Contributor):

@unbel13ver, can you provide test instructions for this PR? What is the expected outcome of the new partitioning scheme?

@unbel13ver (Contributor Author):

> @unbel13ver, can you provide test instructions for this PR? What is the expected outcome of the new partitioning scheme?

Testing instructions added.

@leivos-unikie (Contributor) left a comment

  • lsblk gives expected output
  • ci-test-automation passed
  • performance ok
  • apps launch

@leivos-unikie added the "Tested on Lenovo X1 Carbon" label Feb 20, 2024
@unbel13ver removed the "Tested on Lenovo X1 Carbon" label Feb 22, 2024
@leivos-unikie added the "Tested on Lenovo X1 Carbon" label Feb 23, 2024
];
};
};
vm_storage_a = {
Collaborator:

How are vm_storage_a and vm_storage_b used? For system updates it's clear to me that one would apply changes to alternating partitions, but how does this work for VM state?

Contributor Author:

There have been some ideas that some VMs could be supplied as pre-built images, so that the end user would not be able to change the contents of those images. Those VMs would be updated together with the main system and must correspond to the specific version of the base system.
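Sketched in disko terms, the A/B idea could look roughly like this (sizes taken from the expected lsblk output above; the attribute layout is an assumption, not the PR's exact file):

{
  disko.devices.lvm_vg.pool.lvs = {
    vm_storage_a = {
      size = "30G";
      content = {
        type = "filesystem";
        format = "ext4";
        # active slot, mounted by the running system
        mountpoint = "/vmstore";
      };
    };
    vm_storage_b = {
      # inactive slot: a system update writes new VM images here,
      # and the next boot swaps which slot is mounted
      size = "30G";
    };
  };
}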

name = "boot";
size = "1M";
type = "EF02";
};
@Mic92 (Collaborator) commented Feb 26, 2024:

The EFI standard allows only one boot loader to be registered; how would the second partition be used in case the other one "fails"?

Contributor Author:

Currently there is only one bootloader partition. Duplicating it would require some external tooling to switch between A and B, which we decided to avoid.

mountOptions = ["umask=0077"];
};
};
# LVM pool that is going to be encrypted
Collaborator:

How is the encryption done here? Wouldn't it be cryptsetup -> LVM?

Contributor Author:

There is some discussion going on about that in the mail thread. I do not know at this moment how it is going to be implemented, so I decided to remove this comment to avoid confusion.

};
};
# LVM pool that is going to be passed to the Storage VM
other_2 = {
Collaborator:

Are VMs not encrypted?

Contributor Author:

The StorageVM partition is not going to be encrypted, since it is supposed to store data that is already encrypted. This allows us to back up and recover the stored data safely if needed.

];
};
};
recovery = {
Collaborator:

Is the recovery image also some NixOS? How is it booted?

Contributor Author:

This is just a placeholder currently, but yes, it is going to contain some factory recovery image.

@Mic92 (Collaborator) commented Feb 26, 2024

Brian asked me to compare using LVM vs ZFS for this setup. Here is a comparison:

  • With LVM one has to plan in advance how much space is needed for each purpose. The A/B partitioning will take 50% of the space.
  • With ZFS one would create a single pool, and datasets can dynamically allocate space from it. Instead of A/B partitioning, one would use snapshots -> potentially also create boot loader entries for rollback.
  • ext4 will probably provide the best performance for most desktop workloads.
  • ZFS makes better use of storage. 500GB is not much, and compression can help here. Checksums are also nice for more confidence that data is correct.
  • LVM is accepted upstream, so any Linux kernel will work. With ZFS there is usually some delay until the latest kernel can be used. NixOS by default uses an LTS version of the kernel, which will always support ZFS. Apart from that, there is zfs.latestCompatibleLinuxPackages, which gives the latest supported Linux kernel.
  • I'd recommend having swap with ZFS, because page eviction in the ARC is not as quick as in the normal page cache - to normal users it appears that ZFS uses way more memory, but this is mostly because the ARC is not accounted as page cache.
  • Encryption: there are some open issues with ZFS's native encryption. cryptsetup is recommended if one wants to use zfs send at some point.
  • Impermanence: it's possible to implement automatic state reset with both LVM and ZFS. With LVM one would probably recreate/reformat the volatile state partitions. With ZFS it would be a rollback of a dataset to some blank snapshot (arguably a bit easier).

I currently do not have the full picture of how this A/B partitioning will be implemented in the end, especially around VM storage.
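To make the ZFS alternative concrete, a rough disko sketch of the single-pool approach (the backing partition declaration is omitted, and the pool/dataset names are illustrative, not from this PR):

{
  disko.devices.zpool.zroot = {
    type = "zpool";
    rootFsOptions = {
      # compression makes better use of a 500GB disk
      compression = "zstd";
    };
    datasets = {
      root = {
        type = "zfs_fs";
        mountpoint = "/";
        # no fixed size: datasets allocate from the pool on demand,
        # and snapshots replace the A/B split for rollback
      };
      vmstore = {
        type = "zfs_fs";
        mountpoint = "/vmstore";
      };
    };
  };
}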

format = "ext4";
mountpoint = "/";
mountOptions = [
"defaults"
Collaborator:

Maybe add noatime for all of the ext4 mountpoints; atime is not useful for most applications.
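For the hunk above that would mean, as a sketch:

mountOptions = [
  "defaults"
  "noatime"
];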

Contributor Author:

Thanks for the hint, done

type = "filesystem";
format = "vfat";
mountpoint = "/boot";
mountOptions = ["umask=0077"];
Collaborator:

I like adding "nofail" for non-root mounts. It makes recovery easier in case there is something wrong with the mountpoints; otherwise systemd will go into "emergency" mode. This can even happen on nixos-rebuild.

Suggested change
mountOptions = ["umask=0077"];
mountOptions = ["umask=0077" "nofail"];

Contributor Author:

Thanks for the hint, done

];
};
};
reserved_a = {
Collaborator:

Maybe a comment on what this is for would be helpful.
If this is just a placeholder, do we need a filesystem on it?

Contributor Author:

I have removed the filesystem from the reserved partitions and the recovery partition. I have also added a comment section at the beginning of the file to explain some of its parts.
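In disko terms, dropping the filesystem amounts to omitting the content attribute, roughly (a sketch; the attribute path is assumed from the PR's layout):

{
  disko.devices.lvm_vg.pool.lvs.reserved_a = {
    size = "10G";
    # no `content` attribute: disko creates the LV but leaves it
    # unformatted, which is enough for a reserved placeholder
  };
}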

@vilvo (Contributor) commented Apr 3, 2024

> Brian asked me to compare using LVM vs ZFS for this setup. Here is a comparison:
> ...
>   • LVM is accepted upstream, so any Linux kernel will work. With ZFS there is usually some delay until the latest kernel can be used. NixOS by default uses an LTS version of the kernel, which will always support ZFS. Apart from that, there is zfs.latestCompatibleLinuxPackages, which gives the latest supported Linux kernel.

Ghaf hardened requires a ZFS kernel config fragment. It's not an issue, but it's a dependency to keep in mind.

> I currently do not have the full picture of how this A/B partitioning will be implemented in the end, especially around VM storage.

Yeah, the A/B partitioning was specified and documented before the latest architecture documentation process change. It would probably help to collect the previously documented designs and the LVM/ZFS comparisons together according to the new process. With this many contributors, I'd like to call it a collaborative documentation process, to ensure a better full picture for contributors.

@vunnyso (Contributor) commented Apr 8, 2024

Is it possible, with the current implementation, to use a smaller partition size and a resizing mechanism later? As it is, the final generated image will be 260GB, which takes more time to generate and flash.

If not, will the following post-boot commands still be valid to execute?

New disk configuration provides grounds for upcoming features,
such as AB software updates and Storage VM and many more.

Signed-off-by: Ivan Nikolaenko <ivan.nikolaenko@unikie.com>
Signed-off-by: Ivan Nikolaenko <ivan.nikolaenko@unikie.com>
@unbel13ver (Contributor Author) commented May 15, 2024

I have added a second commit with the ZFS configuration so it is easier to compare what has changed.
ZFS also requires the option boot.kernelPackages = config.boot.zfs.package.latestCompatibleLinuxPackages; which I think changes the host's kernel version.

Known issues: for some reason, disko now fails with an OOM during the nix store copying process if I give its VM less than 16G of RAM.

@brianmcgillion (Collaborator):

> Known issues: for some reason, disko now fails with an OOM during the nix store copying process if I give its VM less than 16G of RAM.

That should be fixed if you rebase against main; it was fixed in disko, and main has been bumped to use the newer version.

@lheckemann (Collaborator):

> ZFS also requires the option boot.kernelPackages = config.boot.zfs.package.latestCompatibleLinuxPackages; which I think changes the host's kernel version.

This shouldn't be the case since, as Mic92 mentioned, NixOS uses an LTS kernel by default, which should generally be supported by ZFS. But I don't see the line in the diff anyway, so it's fine, I guess :)

# Copyright 2022-2024 TII (SSRC) and the Ghaf contributors
# SPDX-License-Identifier: Apache-2.0
#
# !!! To utilize this partition scheme, the disk size must be >252G !!!
Collaborator:

Is this necessary? I suspect the first pool can be a fair bit smaller.

A quota on a ZFS filesystem limits its maximum size but doesn't guarantee availability of the given amount of space. If you really want to guarantee availability, you should set a reservation or refreservation on the filesystems.
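For example, in a disko dataset definition (a sketch; disko passes these options through as ZFS properties when creating the dataset):

{
  disko.devices.zpool.zroot.datasets.root_a = {
    type = "zfs_fs";
    mountpoint = "/";
    options = {
      quota = "50G";       # caps how large the dataset may grow
      reservation = "50G"; # guarantees the space is actually available
    };
  };
}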

# - ESP-B : (500M)
#
# First ZFS pool contains next partitions:
# - root-A : (50G) Root FS
Collaborator:

I think if we're using ZFS, it doesn't make that much sense to have exactly two root filesystems. It would make more sense to:

  • Allow more than two, since ZFS doesn't force us to allocate the space ahead of time
  • Not set a quota on them, since they should (AFAIU) be immutable and therefore fixed-size anyway

Depending on our threat model, it may also be undesirable to use ZFS for the root filesystem(s) at all, since this mixes them up with the mutable pieces of the filesystem and doesn't allow us to use something like dm-verity to authenticate the whole partition (guarding against malicious filesystem images, for instance).

I've also personally experienced some unreliability with ZFS, where the ZFS kernel driver would crash and any further access to files on ZFS filesystems would hang until reboot. Using ext4 (with dm-verity to counteract both malicious modification and bitrot) would probably be more reliable.

# - root-B : (50G)
# - vm-storage-A : (30G) Possible standalone pre-built VM images are stored here
# - vm-storage-B : (30G)
# - reserved-A : (10G) Reserved partition, no use
Collaborator:

This shouldn't be necessary. ZFS gives us the flexibility to add or remove filesystems from a pool after the fact if we end up needing them.

};
};
};
zroot_2 = {
Collaborator:

We should probably avoid importing this pool in the host at all, but I'm not sure that's currently feasible with disko. @Mic92?

# - root-B : (50G)
# - vm-storage-A : (30G) Possible standalone pre-built VM images are stored here
# - vm-storage-B : (30G)
# - reserved-A : (10G) Reserved partition, no use
Collaborator:

Like with ZFS, LVM affords us some flexibility here (though indeed not as much). Why were you thinking of allocating this? I could see it being more necessary with LVM than with ZFS, but I'm still skeptical about its usefulness.

# - reserved-A : (10G) Reserved partition, no use
# - reserved-B : (10G)
# - gp-storage : (50G) General purpose storage for some common insecure cases
# - recovery : (rest of the LVM pool) Recovery factory image is stored here
Collaborator:

Suggested change
# - recovery : (rest of the LVM pool) Recovery factory image is stored here
# - recovery : (rest of the ZFS pool) Recovery factory image is stored here

@@ -0,0 +1,181 @@
# Copyright 2022-2024 TII (SSRC) and the Ghaf contributors
Collaborator:

Should we actually have this file if we're not referencing it anywhere?

Labels: Tested on Lenovo X1 Carbon
8 participants