Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement: Encrypted scratch area #1942

Open
taltenbach opened this issue Apr 15, 2024 · 10 comments
Open

Enhancement: Encrypted scratch area #1942

taltenbach opened this issue Apr 15, 2024 · 10 comments

Comments

@taltenbach
Copy link
Contributor

Currently, when using swap-scratch upgrade strategy with encrypted firmware images, the scratch area has to be placed in internal flash memory, even when BOOT_SWAP_SAVE_ENCTLV is enabled, since MCUboot is decrypting when copying the image data from the secondary slot to the scratch area. However, in order to save some internal flash memory and also potentially to increase the size of the scratch area, it could be interesting to be able to put that area in external flash memory. This would be especially valuable for MCUs having large internal flash memory sectors, like e.g. the STM32F413, which has 128 KiB flash sectors.

I was able with minor changes to perform the decryption when copying image data from the scratch area to the primary slot instead, and it seems to work fine. I haven't performed extensive testing though, so I might have missed something. Is there any good reason explaining why MCUboot is decrypting when writing to the scratch area? If not, would you be interested by these changes?

@nordicjm
Copy link
Collaborator

Swap using scratch shouldn't be used, swap using move should be used instead @d3zd3z

@taltenbach
Copy link
Contributor Author

Swap using scratch shouldn't be used, swap using move should be used instead @d3zd3z

@nordicjm I'm working on a project with an STM32F413, which has 1.5 MiB of internal flash memory divided in 128 KiB sectors. Using swap-move would mean that two sectors of the internal flash memory could not be used for storing the app (one must be reserved for the swap-move algorithm and another for the MCUboot trailer). Wasting 256 KiB (~17 %) is not acceptable in my case, that's why I'm considering swap-scratch

@nordicjm
Copy link
Collaborator

Swap using scratch shouldn't be used, swap using move should be used instead @d3zd3z

@nordicjm I'm working on a project with an STM32F413, which has 1.5 MiB of internal flash memory divided in 128 KiB sectors. Using swap-move would mean that two sectors of the internal flash memory could not be used for storing the app (one must be reserved for the swap-move algorithm and another for the MCUboot trailer). Wasting 256 KiB (~17 %) is not acceptable in my case, that's why I'm considering swap-scratch

I would suggest using direct XIP or overwrite only in that case then. Swap using scratch is known to brick devices for reasons that are not known, it's use is not recommended and @d3zd3z has considered removing support from it from MCUboot entirely

@taltenbach
Copy link
Contributor Author

Swap using scratch shouldn't be used, swap using move should be used instead @d3zd3z

@nordicjm I'm working on a project with an STM32F413, which has 1.5 MiB of internal flash memory divided in 128 KiB sectors. Using swap-move would mean that two sectors of the internal flash memory could not be used for storing the app (one must be reserved for the swap-move algorithm and another for the MCUboot trailer). Wasting 256 KiB (~17 %) is not acceptable in my case, that's why I'm considering swap-scratch

I would suggest using direct XIP or overwrite only in that case then. Swap using scratch is known to brick devices for reasons that are not known, it's use is not recommended and @d3zd3z has considered removing support from it from MCUboot entirely

Well, I wasn't aware of such a critical issue with swap-scratch, this is very valuable information, thank you. I saw you add a note about that in the Zephyr-specific documentation, perhaps it would be helpful to also add a (big) warning in design.md. Unless that issue is specific to Zephyr of course, but from what you say I understand the algorithm itself is suspected to be broken?

Unfortunately, in my case direct XIP is not possible and overwrite only not an option because the revert mechanism is desired. I guess I will have to develop my own strategy, perhaps something similar to #1902.

I saw several comments from @d3zd3z suggesting swap-scratch has no added value over swap-move and should not be used. I am missing something here? Because unless I'm mistaken, the swap-move requires one additional flash sector compared to swap-scratch (for storing the MCUboot trailer), which seems totally fine in case sector size is small, but can represent a significant waste of memory for MCUs having large flash sectors like numerous STM32 devices. IMO dropping support for swap-scratch without providing a revert-able alternative other than swap-move is likely to be an issue for a lot of projects

@nordicjm
Copy link
Collaborator

Well, I wasn't aware of such a critical issue with swap-scratch, this is very valuable information, thank you. I saw you add a note about that in the Zephyr-specific documentation, perhaps it would be helpful to also add a (big) warning in design.md. Unless that issue is specific to Zephyr of course, but from what you say I understand the algorithm itself is suspected to be broken?

No-one actually knows, the issue has just occurred randomly on devices, only been seen on zephyr as far as I'm aware.

Unfortunately, in my case direct XIP is not possible and overwrite only not an option because the revert mechanism is desired. I guess I will have to develop my own strategy, perhaps something similar to #1902.

Direct XIP does support revert which @de-nordic added, you have to set it up for revert and then it functions similarly to swap using move, you load an update and I believe mark it for test and reboot, it gets booted and if you reboot without confirming it that image slot gets erased and the previous image gets booted, otherwise if it is confirmed than the alternative slot gets erased and you use the newer image.

@taltenbach
Copy link
Contributor Author

No-one actually knows, the issue has just occurred randomly on devices, only been seen on zephyr as far as I'm aware.

Noted, thanks for those details.

Direct XIP does support revert which @de-nordic added, you have to set it up for revert and then it functions similarly to swap using move, you load an update and I believe mark it for test and reboot, it gets booted and if you reboot without confirming it that image slot gets erased and the previous image gets booted, otherwise if it is confirmed than the alternative slot gets erased and you use the newer image.

Yes, sorry, my message wasn't clear enough, I meant:

  • Direct XIP is not possible in my case because the secondary slot is in external flash memory, which doesn't support XIP.
  • Overwrite only isn't an option either because the revert mechanism is desired.

@utzig
Copy link
Member

utzig commented Apr 21, 2024

I was able with minor changes to perform the decryption when copying image data from the scratch area to the primary slot instead, and it seems to work fine. I haven't performed extensive testing though, so I might have missed something. Is there any good reason explaining why MCUboot is decrypting when writing to the scratch area?

I don't remember the exact reason, but have you tested your implementation on the simulator? What I have in mind is that there was one specific step, either the last or first sector swap that contained information that needs to be known in plain text to resume an interrupted swap, and one of those sectors might be in the scratch area when an interrupt happens. It could also be the whatever the reason was, it is not an issue anymore.

This would be especially valuable for MCUs having large internal flash memory sectors, like e.g. the STM32F413, which has 128 KiB flash sectors.

It would not be hard to update the swap with move algorithm so that it performs the moves and swaps using 128KB, which would also work for the external flash. The main issue would be that you need the least 256KB free, one for the status, and another for the spare sector where the first move is done. Using swap with scratch even the last sector can be used as long as there is enough space for status.

No-one actually knows, the issue has just occurred randomly on devices, only been seen on zephyr as far as I'm aware.

JUUL vaping devices have been running swap with scratch, although they use Mynewt. I have no idea how many devices they have on the field, but there's no lack of people using vaping devices! If there was a bug, I find it very hard to believe even statistically that they would not have found it. Anyone upgrading millions of devices should bump into every last bug there is. That said, there might be something Zephyr specific. The swap algorithms themselves are very deterministic for a given slot/scratch size and image size combination, so even someone with rudimentary testing skills should be able to find a way to consistently reproduce this bug, once found! Since that never happened, things which might have some impact, are flash driver specifics, like writes that return success when the write failed, probably more so for external (Q)SPI flashes, or something like multitasking being enabled and some other task/thread interfering with the swap. MCUboot was designed to run as a sole application without any other extraneous stuff happening.

All that said, swap with move is a much better algorithm if the limitations are removed. Btw, when I fist added it with PR #589 it was broken for more than 3 years until the issue was found and I fixed it with #1597. This was not a bug that "could" happen, but an issue that always happened as long as a reset happened at the right time. That it was broken for 3 years all the while people kept praising it, still makes me chuckle. And also a testament for my lack of testing this algorithm at the time, but at least I did a lot more testing after #1597 and it looked solid afterwards! ;-)

@taltenbach
Copy link
Contributor Author

taltenbach commented Apr 22, 2024

I don't remember the exact reason, but have you tested your implementation on the simulator? What I have in mind is that there was one specific step, either the last or first sector swap that contained information that needs to be known in plain text to resume an interrupted swap, and one of those sectors might be in the scratch area when an interrupt happens. It could also be the whatever the reason was, it is not an issue anymore.

My point is that only the payload of MCUboot images is encrypted, i.e. all metadata added for MCUboot (header, TLVs and trailer) are non-encrypted. That means unless MCUboot depends on data within the raw firmware image during the swap (which is not the case unless I'm mistaken), having the image in plain text in the scratch area shouldn't be required. However, I didn't tested on the simulator, I will definitely give it a try.

It would not be hard to update the swap with move algorithm so that it performs the moves and swaps using 128KB, which would also work for the external flash. The main issue would be that you need the least 256KB free, one for the status, and another for the spare sector where the first move is done. Using swap with scratch even the last sector can be used as long as there is enough space for status.

Yes, in fact I initially configured MCUboot with swap-move and it was working like a charm with 128 KiB sectors. As you mentioned, the issue here is the 256 KiB overhead: on the project I'm working on, the application is quite huge and losing 256 KiB (or even 128 KiB) of internal flash memory isn't an option. However, there is a lot of spare memory in the external flash, that's why I'm considering using swap-scratch and placing a large scratch area in external flash. Since the scratch area is large flash wear isn't an issue. But this solution is only possible assuming the firmware image is encrypted in the scratch area.

JUUL vaping devices have been running swap with scratch, although they use Mynewt. I have no idea how many devices they have on the field, but there's no lack of people using vaping devices! If there was a bug, I find it very hard to believe even statistically that they would not have found it. Anyone upgrading millions of devices should bump into every last bug there is. That said, there might be something Zephyr specific. The swap algorithms themselves are very deterministic for a given slot/scratch size and image size combination, so even someone with rudimentary testing skills should be able to find a way to consistently reproduce this bug, once found! Since that never happened, things which might have some impact, are flash driver specifics, like writes that return success when the write failed, probably more so for external (Q)SPI flashes, or something like multitasking being enabled and some other task/thread interfering with the swap. MCUboot was designed to run as a sole application without any other extraneous stuff happening.

This definitely makes sense to me and in hindsight I think using swap-scratch in my case is a way better option than implementing my own custom upgrade strategy. Perhaps there is actually an issue with the swap-scratch implementation, that occurs in very specific cases, but it sounds far more likely that the issue is due to a non-MCUboot component. Also, being used and tested for years I expect the swap-scratch implementation to be much more robust than any custom strategy I might design.

On the other hand, is the issue @nordicjm mentioned occurring when resetting the device during an upgrade process? If it does, I think there might be some nondeterministic/hardware-specific behavior here if the reset happens precisely while a sector containing an MCUboot trailer is being erased. My knowledge in hardware is limited so perhaps this doesn't make sense at all, but is there something guaranteeing that during an erase all bits flips exactly at the same time? My understanding would be that if a reset occurs during an erase, the sector that was being erased isn't in a defined state: some bits might have flipped but some other not. If that assumption is true, then if a reset occurs while a MCUboot trailer is being erased, there is a chance that the trailer's magic is good but the swap status totally corrupts, which might lead to MCUboot doing crazy things at next boot. Is MCUboot assuming that with a 16-byte long magic number, the likelihood of such an issue occurring is close to zero?

@utzig
Copy link
Member

utzig commented Apr 22, 2024

On the other hand, is the issue @nordicjm mentioned occurring when resetting the device during an upgrade process? If it does, I think there might be some nondeterministic/hardware-specific behavior here if the reset happens precisely while a sector containing an MCUboot trailer is being erased. My knowledge in hardware is limited so perhaps this doesn't make sense at all, but is there something guaranteeing that during an erase all bits flips exactly at the same time? My understanding would be that if a reset occurs during an erase, the sector that was being erased isn't in a defined state: some bits might have flipped but some other not. If that assumption is true, then if a reset occurs while a MCUboot trailer is being erased, there is a chance that the trailer's magic is good but the swap status totally corrupts, which might lead to MCUboot doing crazy things at next boot. Is MCUboot assuming that with a 16-byte long magic number, the likelihood of such an issue occurring is close to zero?

The whole reset recovery resilience is builtin; without having it calling this a secure bootloader would be quite a stretch! That's also why the simulator is used, when it runs an upgrade/revert, it resets at every flash write so we can be sure that every step is tested for resilience.

If you find any brick issues, please dump that last sector of the first slot, so we can find out at what step in the upgrade process it was running. I am afraid it won't be enough information to fix the issue, and extra debug traces would be required, but it's some more information than we have.

@taltenbach
Copy link
Contributor Author

taltenbach commented May 2, 2024

@utzig I finally was able to test my implementation on the simulator and you were right: there is a corner case I didn't anticipate that was causing a failure when the device is reset after the header of the image has been erased in the secondary slot. I guess this is the reason why the swap-scratch is currently decrypting before copying to the scratch area.

I was able to fix the issue though and my implementation now seems to work fine on the simulator. I created a PR for you to be able to check my changes and choose whether or not you consider it is desirable to merge them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants