Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

driver/mtd_spi_nor, pkg/littlefs: improve reliability with corrupted flash (new PR) #20589

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

crasbe
Copy link
Contributor

@crasbe crasbe commented Apr 17, 2024

Contribution description

This is a takeover of PR #14908, which has been abandoned. The goal is to implement the proposed changes in the original PR to make it mergeable.

The changes to make it mergeable after four years were necessary due to the following commits:

  1. 60eb75d
    This commit removed the "mtd_write_page" function in favor of the "mtd_write_page_raw" function.

  2. ea105d3
    This commit removed the "mtd_spi_nor_write" function, which was changed in the original PR as well but these changes are not necessary anymore.

Open Tasks:

  • The review from @benpicco ( mtd_spi_nor | littlefs: improve reliability with corrupted flash #14908 (review) ) has to be included in this PR still.
    My plan was to add a header file to the drivers/mtd_spi_nor folder which would probably be called something like "drivers/mtd_spi_nor/include/mtd_spi_nor_security.h" "drivers/mtd_spi_nor/include/mtd_spi_nor_defines.h".

  • The security bits used by this PR are not universal to all manufacturers of Flash chips, specifically the Flash used by the original author @vincent-d is from Macronix and for example ISSI uses a different register and different bit positions. I'll have to check if they are manufacturer specific or device specific.

  • The Kconfig related changes can probably be were deleted, as the Kconfig subsystem has been removed from the mtd_spi_nor driver.

  • There is no timeout for the added functions wait_for_write_enable_cleared and wait_for_write_enable_set, which would get the thread stuck if the chip does not answer.

  • The function wait_for_write_complete does not have a timeout either but it counts the attempts and how many times it yielded the thread. This can be used as a basis for a timeout.

  • Write tests for the Macronix and ISSI flash security features.

  • Add IS25LP128 and IS25LE01G as DuTs to the test (and possibly more, whatever the magic drawer might supply).

Testing procedure

Testing this might be somewhat difficult, as I do not have a flash chip with broken memory cells (at least none that I know of) and I do not have the Macronix chip.

I will buy the MX25L51245G flash from the original PR to test the functionality and for testing the handling of broken memory cells we'll have to get creative.
Update: Two MX25L12873F chips are on the way to me which have the same security register as the aforementioned chip.
Update 2: The mechanisms in the code to detect corrupt memory areas can be triggered by applying protection to a certain area in the chip. Therefore it is possible to create a test case by "destroying" a chip.

The PR comes with new tests that utilize the Block Protect functions from NOR Flashs which triggers the Program Fail and Erase Fail flags. The Block Protect bits are not permanent and can be reset easily.
The nRF52840DK development board has a suitable flash chip on board and is covered by the test.

All tests should run successfully to verify the PR is working correctly.

Issues/PRs references

This closes #14908.

@github-actions github-actions bot added Area: pkg Area: External package ports Area: drivers Area: Device drivers Area: Kconfig Area: Kconfig integration labels Apr 17, 2024
@crasbe
Copy link
Contributor Author

crasbe commented Apr 17, 2024

@benpicco I think I found a suitable solution to get rid of the Kconfig configuration options and have a versatile option to enable manufacturer dependent security options.
The mtd_spi_nor_params_t struct (https://github.com/RIOT-OS/RIOT/blob/master/drivers/include/mtd_spi_nor.h) has a flag field, which is evaluated during compile time and is currently used to signify the page size. However it is a 16-bit wide flag field, so there should be plenty of space to add security flags.

The flags are defined in the global header file https://github.com/RIOT-OS/RIOT/blob/master/drivers/include/mtd_spi_nor.h
This would follow the approach that the mtd_spi_nor driver is configured at a board level.
One "downside" is that the common Opcode structure in https://github.com/RIOT-OS/RIOT/blob/master/drivers/mtd_spi_nor/mtd_spi_nor_configs.c can not be used in this form anymore, since the security related Opcodes are not shared between manufacturers and even worse, the Opcodes collide. While it would be possible to just ignore the collision and use the right opcode in the code later on, it's not the right approach since it blocks future extension of the code (if anyone wants to use the ACTUAL opcode).

Therefore I would introduce four new structures: two for Macronix with security (1byte and 4byte access) and two for ISSI (1byte and 4byte access).
The comment above the structures says that the compiler only adds the selected structure to the memory, so it does not take up any additional space.

I'll probably have the preliminary commit with the proposed changes ready in a couple of hours, that should make everything clearer.

@github-actions github-actions bot removed the Area: Kconfig Area: Kconfig integration label Apr 17, 2024
@crasbe
Copy link
Contributor Author

crasbe commented Apr 17, 2024

The static-test codespell complains about these lines in drivers/mtd_spi_nor/mtd_spi_nor.c being a typo (it would like to have WELL instead of WEL):

/* Wait for WEL to be set */
...
/* Wait for WEL to be cleared */

I'm not sure how to fix that? Oh wel... 😅
Interestingly it does not occur when I run "make static-test" locally.

@kfessel
Copy link
Contributor

kfessel commented Apr 18, 2024

Interestingly it does not occur when I run "make static-test" locally.

the static-test checks if the tool that is running a test is installed - so you probably do not have codespell installed.

I'm not sure how to fix that? Oh wel...

it there is a long form (write enable L... what is the L? eg WREL?) or alternative of WEL (STATUS_WEL ?) you could use that -and workaround the spell checker or as it suggest it could be added to the ignored words list.

@crasbe
Copy link
Contributor Author

crasbe commented Apr 19, 2024

Interestingly it does not occur when I run "make static-test" locally.

the static-test checks if the tool that is running a test is installed - so you probably do not have codespell installed.
That is the weird part about it: I did install codespell and it did complain about a couple of actual typos, but at the moment the tests are all running without failure:

chris@ThinkPias:~/flashdev-riot/RIOT$ make static-test
./dist/tools/ci/static_tests.sh
...
Running "./dist/tools/codespell/check.sh" ✓
Running "./dist/tools/uncrustify/uncrustify.sh --check" ✓
Command output:

	All files are uncrustified!

Running "./dist/tools/shellcheck/check.sh" ✓

I'm not sure how to fix that? Oh wel...

it there is a long form (write enable L... what is the L? eg WREL?) or alternative of WEL (STATUS_WEL ?) you could use that -and workaround the spell checker or as it suggest it could be added to the ignored words list.

WEL stands for "Write Enable Latch". However I found out that codespell added an inline option to ignore certain words. That would avoid adding a global ignored word (and potentially missing typos of "well" in the future): https://github.com/codespell-project/codespell?tab=readme-ov-file#inline-ignore

@kfessel
Copy link
Contributor

kfessel commented Apr 19, 2024

WEL stands for "Write Enable Latch". However I found out that codespell added an inline option to ignore certain words. That would avoid adding a global ignored word (and potentially missing typos of "well" in the future): https://github.com/codespell-project/codespell?tab=readme-ov-file#inline-ignore

citing from https://github.com/codespell-project/codespell?tab=readme-ov-file#ignoring-words

note that spelling errors are case-insensitive but words to ignore are case-sensitive.

i don't think you need to go the inline route (wel would still be detected if WEL was added to the ignore file and if someone writes WEL i got somthing its probably by choice)

@github-actions github-actions bot added Area: doc Area: Documentation Area: tools Area: Supplementary tools labels Apr 19, 2024
@crasbe
Copy link
Contributor Author

crasbe commented Apr 19, 2024

Okay, this was a bit naive of me. Obviously the spellcheck only checks with the ignored word list that is currently in master. So either I have to ignore the failed spell test or open a separate PR with just the updated ignored word list 🤔

@kfessel
Copy link
Contributor

kfessel commented Apr 19, 2024

Okay, this was a bit naive of me. Obviously the spellcheck only checks with the ignored word list that is currently in master. So either I have to ignore the failed spell test or open a separate PR with just the updated ignored word list 🤔

that PR also would get spellchecked i think -> there is a egg and chicken problem

seems like (our)codespell does not follow their doc codespell-project/codespell#3272 (don't know which date our version is

seems like our codespell is case-insensitive in either case

@kfessel
Copy link
Contributor

kfessel commented Apr 19, 2024

i read it wrong by case sensitive they mean you need to type it the way they have it in their typo database (usually lowercase) (i don't know why they do it like this) codespell-project/codespell#2451 (comment)

so for WEL to not trigger you need to add wel but then wel and wEl and Wel would also not trigger

@github-actions github-actions bot removed Area: doc Area: Documentation Area: tools Area: Supplementary tools labels Apr 19, 2024
@crasbe
Copy link
Contributor Author

crasbe commented Apr 19, 2024

So this was quite a rabbit hole. No matter what I did I could not reproduce the issue... It turns out that Linux Mint had an old version of codespell. Now I installed it via pip and not form the package repository and now it complains about WEL as well, yay.

BUT... the latest Release of codespell does not support the inline ignores yet. It's only implemented in the development version.

A workaround is changing "WEL" to "WEL-Flag" and now the spellcheck passes.

@github-actions github-actions bot added Area: doc Area: Documentation Area: tests Area: tests and testing framework labels Apr 24, 2024
@crasbe
Copy link
Contributor Author

crasbe commented Apr 24, 2024

The last PR adds a new test for the mtd_spi_nor driver which tests the security features. Contrary to what I wrote in the first post, you can trigger the P_FAIL (Program Failed) and E_FAIL (Erase Failed) flags by protecting a block with the Block Protect flags in the status register.

The test is not suuuper neat because I had to copy some internal functions from the mtd_spi_nor driver to have direct access to the register. This seemed the least intrusive way to create the tests.

This is a log from the most relevant test for this PR, checking the security functions with the Block Protect bits.

2024-04-24 19:02:42,301 # ..test_mtd_block_protect: Checking the Block Protect Functions
2024-04-24 19:02:42,357 # mtd_spi_nor_write_page: 0x20000800, 0x200017d0, 0x0, 0x0, 0x10
2024-04-24 19:02:42,360 # mtd_spi_cmd: 0x20000800, 06
2024-04-24 19:02:42,364 # mtd_spi_cmd_read: 0x20000800, 05, 0x2000173f, 1
2024-04-24 19:02:42,369 # mtd_spi_nor: wait device status = 0x3e, waiting WEL-Flag
2024-04-24 19:02:42,374 # mtd_spi_cmd_addr_write: 0x20000800, 02, (000000), 0x200017d0, 16
2024-04-24 19:02:42,379 # mtd_spi_cmd_read: 0x20000800, 05, 0x2000172f, 1
2024-04-24 19:02:42,383 # mtd_spi_nor: wait device status = 0x3c, waiting !WIP
2024-04-24 19:02:42,388 # wait loop 0 times, yield 0 times, total wait 0us
2024-04-24 19:02:42,392 # mtd_spi_cmd_read: 0x20000800, 05, 0x2000173f, 1
2024-04-24 19:02:42,397 # mtd_spi_nor: wait device status = 0x3c, waiting !WEL-Flag
2024-04-24 19:02:42,401 # mtd_spi_cmd_read: 0x20000800, 2b, 0x2000176f, 1
2024-04-24 19:02:42,405 # mtd_spi_nor_write: ERR: page program failed!
2024-04-24 19:02:42,409 # mtd_spi_nor_read: 0x20000800, 0x200017e0, 0x0, 0x10
2024-04-24 19:02:42,415 # mtd_spi_cmd_addr_read: 0x20000800, 03, (000000), 0x200017e0, 16
2024-04-24 19:02:42,420 # mtd_spi_nor_erase: 0x20000800, 0x0, 0x1000
2024-04-24 19:02:42,423 # mtd_spi_cmd: 0x20000800, 06
2024-04-24 19:02:42,427 # mtd_spi_cmd_read: 0x20000800, 05, 0x2000176f, 1
2024-04-24 19:02:42,432 # mtd_spi_nor: wait device status = 0x3e, waiting WEL-Flag
2024-04-24 19:02:42,437 # mtd_spi_cmd_addr_write: 0x20000800, 20, (000000), 0, 0
2024-04-24 19:02:42,441 # mtd_spi_cmd_read: 0x20000800, 05, 0x2000175f, 1
2024-04-24 19:02:42,446 # mtd_spi_nor: wait device status = 0x3c, waiting !WIP
2024-04-24 19:02:42,450 # wait loop 0 times, yield 0 times, total wait 0us
2024-04-24 19:02:42,454 # mtd_spi_cmd_read: 0x20000800, 05, 0x2000176f, 1
2024-04-24 19:02:42,459 # mtd_spi_nor: wait device status = 0x3c, waiting !WEL-Flag
2024-04-24 19:02:42,463 # mtd_spi_cmd_read: 0x20000800, 2b, 0x20001797, 1
2024-04-24 19:02:42,467 # mtd_spi_nor_erase: ERR: erase failed!
2024-04-24 19:02:42,467 # 
2024-04-24 19:02:42,468 # OK (4 tests)
2024-04-24 19:03:01,934 # Exiting Pyterm

These two messages are generated by the new code that was created by @vincent-d:

2024-04-24 19:02:42,405 # mtd_spi_nor_write: ERR: page program failed!
2024-04-24 19:02:42,467 # mtd_spi_nor_erase: ERR: erase failed!

HOWEVER... I do not love the solution with the configuration (setting both Flag and Opcode) in drivers/mtd_spi_nor/mtd_spi_nor_configs.c. On one hand it's unnecessary duplication and on the other hand it is really easy to forget to set one of the two (ask me how I know...). And if you forget to set the correct opcodes, it results in false positive security behaviour. An undefined opcode is set to 0x00 (NOP) and that returns 0x00. This 0x00 is interpreted as "everything is fine" by the security code.

My preferred option would be adding some Pseudomodules, similar to the AT24CXXX driver. However I'm not sure if it would make sense to do it the same way by allocating every namespace that begins with "M25F...", "M25R...", "IS25LP...", "IS25LE...", "SST25V..." etc. A better option would probably be something like "MTD_SPI_NOR_MX", "MTD_SPI_NOR_IS", "MTD_SPI_NOR_SST" for defining the opcode set and "MTD_SPI_NOR_SECURITY" and (in a future expansion) "MTD_SPI_NOR_ECC" for defining the features.

Any other ideas about this are very welcome.

@crasbe
Copy link
Contributor Author

crasbe commented Apr 25, 2024

@benpicco You reviewed the original PR, could you take a quick look at this to give me some guidance on the previous question? You don't have to review this PR yet.

@crasbe
Copy link
Contributor Author

crasbe commented Apr 30, 2024

I implemented the Pseudomodule solution and IMO it is way more elegant than the original idea and scales a lot better.
One thing that became clear is that the ISSI flash behaves a lot different than the Macronix chip. The flags in the security register of the Macronix flash reset themselves after a successful read and the ISSI flags remain until cleared with a separate OpCode.
Furthermore, it appears like the WEL (Write Enable Latch) Flag on the ISSI is not reset when the program or erase operation was not successful. This is not clear from the datasheet, it only states that the WEL flag is reset on completion of a program or erase operation, not that is has to be successful. Therefore the WEL flag has to be cleared with a dedicated command.
When you don't reset the WEL flag, the wait_for_wel_reset function will poll the flag indefinitely.

So one thing to add to this function is trying to clear the flag manually perhaps after a certain number of tries. (As well as adding the timeout).

Therefore it took a lot of testing and fiddling until the ISSI chip was running as expected and the solution with the Pseudomodules proved to be superior in this case because I was able to add the additional OpCodes conditionally.

Unfortunately I did not separate the changes in separate commits because I did not intend them to become this involved...

@crasbe
Copy link
Contributor Author

crasbe commented May 6, 2024

I think this PR is already quite extensive and beyond the originally intended scope, so I would like to postpone adding the timeout tasks for now to a separate PR.

@crasbe crasbe marked this pull request as ready for review May 6, 2024 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: doc Area: Documentation Area: drivers Area: Device drivers Area: pkg Area: External package ports Area: tests Area: tests and testing framework
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants