
[Discussion] NAND flash file system #11979

Open
resyfer opened this issue Mar 23, 2024 · 3 comments

Comments

@resyfer
Contributor

resyfer commented Mar 23, 2024

Hi, as part of GSoC '24, I wish to propose a file system for NAND flash that takes into account the challenges of an RTOS, the limited memory of embedded systems, and the practical range of devices used in these systems.

The file system I have in mind, which I call mnemofs, is not entirely new. It is a heavily modified version of littlefs, adapted to NAND flash needs like bad block management.

It will be quite long, and I would really love suggestions and constructive criticism on this, and I would be grateful if you could give this a read.

NAND Flash vs NOR Flash

While this page and this page sum it up pretty well, these are the advantages and disadvantages of NAND flash in a nutshell:

  • Advantages: smaller size, (about 40%) lower cost per byte, faster writes/erases.
  • Disadvantages: slower reads, bad blocks present right from production, higher chance of bit flips, higher standby power requirement.

Mnemofs

Behavior

Considering the reasoning behind similar file systems like UBIFS and littlefs, mnemofs will also be a Copy-On-Write (CoW) file system. This is reinforced by the fact that a block needs to be erased before a page in it can be updated, thus requiring out-of-place updates.

Journal

To facilitate atomic updates, which in turn promotes power-loss resilience, a journal is required.

The journal in mnemofs will consist of n + 2 blocks in the form of a circular singly linked list. A singly linked list is sufficient, since any case where traversal is required also starts from the beginning of the journal (e.g. recreating a file from its logged updates).

The first n blocks will store updates, each followed by its checksum (very similar to littlefs). The function of the last two blocks is explained later.

Due to the limitations of journaling file systems, the journal will move once its blocks have gone through enough P/E (Program/Erase) cycles. The moving of the journal is explained later.

The journal will be committed once it is full; until then, any log that needs to be inserted can be written to it.
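As a rough sketch of the circular singly linked list of journal blocks, assuming a hypothetical on-flash header (the names and layout below are illustrative, not mnemofs' actual format): each block records the block number of the next one, and traversal always starts from the head, so forward pointers alone suffice.

```c
#include <stdint.h>

/* Illustrative sketch only. Each journal block records the block number
 * of the next one, forming a circular singly linked list of n + 2
 * blocks. */

struct jrnl_blk_hdr
{
  uint32_t magic;    /* identifies a journal block after power loss */
  uint32_t next_blk; /* next journal block in the circular list */
};

/* Follow the list n hops from the head block. `hdrs` is indexed by
 * block number here purely for illustration. */

static uint32_t jrnl_nth_block(const struct jrnl_blk_hdr *hdrs,
                               uint32_t head, unsigned n)
{
  uint32_t blk = head;
  while (n-- > 0)
    {
      blk = hdrs[blk].next_blk;
    }

  return blk;
}
```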

Master Node & Master Blocks

The directories will be, as usual, in a tree structure, with the master node as the root. As mnemofs is CoW, when the journal is committed, the tree will be updated from the bottom up by making copies of nodes with the updated values. This is done to prevent multiple updates to the inner nodes.

This means that, as demonstrated in the littlefs design document, the root, which is the master node, will be the last to be updated.

Every update of the master node will be written to one of the last two blocks of the journal, alternating between them. These two blocks at the end of the journal are called the master blocks; two are needed because the smallest erasable unit is one block. To avoid traversal to find the master nodes, the head of the journal will store the locations of the master blocks.

The master node contains a pointer to the journal, while the journal, through this mechanism, contains pointers to the master node.
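Under the alternating scheme above, revision r of the master node would land in master block r mod 2, and on mount the newer of the two surviving master nodes is the one with the larger revision counter. A minimal sketch, with all field names being illustrative assumptions:

```c
#include <stdint.h>

/* Illustrative only: the two master blocks are written alternately,
 * so the block for a given revision is rev % 2, and the fresher of the
 * two master nodes is simply the one with the larger revision. */

struct master_node
{
  uint32_t rev;       /* monotonically increasing revision */
  uint32_t root_blk;  /* root of the directory tree */
  uint32_t jrnl_head; /* pointer to the journal's head block */
};

static inline int master_block_for(uint32_t rev)
{
  return (int)(rev & 1); /* alternate between the two master blocks */
}

static const struct master_node *
latest_master(const struct master_node *a, const struct master_node *b)
{
  return (a->rev > b->rev) ? a : b;
}
```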

Moving journal

The journal will move when the blocks it occupies wear down to the upper limit; the entire journal is then relocated. To avoid requiring contiguous space, a linked-list structure is used.

The journal will be copied by first copying the latest change, such that block n + 1 holds one log while block n + 2 holds at most one log. Then the head of the journal will be copied, followed by the rest of the blocks.

This means the master block will be updated after the move to point to the new journal, and a new log recording this update will be written into the new journal.

Power-Loss Resilience

The journal, together with keeping the old master node, provides power-loss resilience, as the old tree can be reconstructed on the next mount.

Thus it will look something like this:
(attached image)

Block Allocator

Blocks will have a 2-bit wear value, i.e. 0-3. Once all blocks hit 3, all wear values will be reset.

There is a need for a block allocator with a single job: return the most suitable block to write to whenever asked. The storage will be divided into groups of 2^14 blocks, called chunks. A chunk's wear is the total of the wear of the blocks inside it.

The averages are:

| Chunk Wear Range | Chunk Average Wear |
| --- | --- |
| 0 | 0 |
| (0, 16384] | 1 |
| (16384, 32768] | 2 |
| (32768, 49152] | 3 |
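With 2^14 = 16384 blocks per chunk, the table above amounts to the chunk's total wear divided by the block count, rounded up. A minimal sketch, assuming exactly that rounding (names are illustrative):

```c
#include <stdint.h>

#define CHUNK_BLOCKS (1u << 14) /* 16384 blocks per chunk */

/* Average wear of a chunk, rounded up, matching the table above.
 * total_wear is the sum of the 2-bit wear values of the chunk's
 * blocks, so it lies in [0, 3 * 16384]. */

static uint8_t chunk_avg_wear(uint32_t total_wear)
{
  return (uint8_t)((total_wear + CHUNK_BLOCKS - 1) / CHUNK_BLOCKS);
}
```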

The block allocator will have 3 separate data structures, but all of them will share the same node structure.

(attached image)

The free chunks are completely empty chunks, sorted in a min-heap according to their wear levels. They are written in the image below as chunk_no (wear).

(attached image)

The partially filled chunks will be kept in a min-heap by their wear levels, as well as in a red-black tree (using a different set of pointers) by their chunk numbers for easy searching.

(attached image)

The final section, the fully filled chunks, will be sorted by chunk number in a red-black tree.

(attached image)

The block allocator will use up all of the free chunks first, and then the partially filled chunks.
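The shared node structure mentioned earlier could carry link fields for both orderings at once, with only the relevant links live depending on whether the chunk is free, partially filled, or full. A sketch, with all names being illustrative assumptions rather than mnemofs' actual layout:

```c
#include <stdint.h>

/* Hypothetical shared node: one node per chunk, linkable into a
 * min-heap (ordered by wear) and a red-black tree (ordered by chunk
 * number) through separate fields, as described above. */

struct chunk_node
{
  uint32_t chunk_no;           /* index of the chunk */
  uint32_t total_wear;         /* sum of the 2-bit wear of its blocks */

  uint32_t heap_idx;           /* position in the wear min-heap */

  struct chunk_node *rb_left;  /* red-black links, keyed by chunk_no */
  struct chunk_node *rb_right;
  struct chunk_node *rb_parent;
  uint8_t rb_color;
};
```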

Within a selected chunk, a randomized algorithm will try to find a block whose wear is below the chunk's average. If half of the iterations pass without success, it will fall back to a linear search. Another roaring bitmap will be used here to avoid checking a block twice.
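A sketch of this search, assuming a plain boolean array standing in for the visited-block roaring bitmap and a wear array standing in for the real per-block wear queries (both assumptions for illustration):

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

/* Hypothetical block search within a chosen chunk: random probes for
 * at most half the chunk, marking each probed block as visited so no
 * block is checked twice, then a linear fallback. A block qualifies if
 * its wear is below the chunk's average wear. */

static int32_t find_block(const uint8_t *wear, uint8_t avg_wear,
                          bool *visited, uint32_t nblocks)
{
  for (uint32_t i = 0; i < nblocks / 2; i++)
    {
      uint32_t blk = (uint32_t)rand() % nblocks;
      if (visited[blk])
        {
          continue; /* already rejected; don't check twice */
        }

      visited[blk] = true;
      if (wear[blk] < avg_wear)
        {
          return (int32_t)blk;
        }
    }

  /* Half the iterations are over; fall back to a linear search. */

  for (uint32_t blk = 0; blk < nblocks; blk++)
    {
      if (!visited[blk] && wear[blk] < avg_wear)
        {
          return (int32_t)blk;
        }
    }

  return -1; /* no block below the average wear in this chunk */
}
```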

Blocks in a chunk

The status of blocks will be kept in roaring bitmap(s). The graphs in the first comment show the comparisons.

Data

Directories and direntries

Directories will be inaccessible-to-user files containing directory entries.

Each file and directory name will be hashed to an integer to reduce string comparisons. The hash will be chosen such that lexically similar strings get different hashes, so that when a collision does occur, the string comparison resolves quickly.
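One hash with that property (illustrative only; the proposal does not specify the function) is FNV-1a, which diffuses even single-character differences across the whole 32-bit hash:

```c
#include <stdint.h>

/* FNV-1a, shown as one example of a hash where lexically similar names
 * hash very differently; a collision then almost always involves
 * strings that differ early and compare quickly. */

static uint32_t name_hash(const char *s)
{
  uint32_t h = 2166136261u;        /* FNV offset basis */
  while (*s)
    {
      h ^= (uint8_t)*s++;
      h *= 16777619u;              /* FNV prime */
    }

  return h;
}
```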

See the graph in the first comment for a comparison involving a personally devised hash search.

Files

On-flash files will be arranged in a CTZ skip list, as in littlefs.
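One concrete detail carried over from the littlefs design document: in a CTZ skip list, the block at index n >= 1 stores ctz(n) + 1 back-pointers (to blocks n - 1, n - 2, n - 4, ...), letting traversal skip in power-of-two strides. A minimal sketch using the GCC/Clang count-trailing-zeros intrinsic:

```c
/* Number of pointers stored by the block at index n (n >= 1) in a CTZ
 * skip list, per the littlefs design document: ctz(n) + 1. */

static unsigned ctz_pointers(unsigned n)
{
  return (unsigned)__builtin_ctz(n) + 1;
}
```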

There will be a global LRU for changes to files and directories. When a node is evicted from the LRU, it will be written to the journal. Thus, to recreate a portion of a file after a change, the LRU is read first, then the journal, and finally the flash.

This is quicker than reading the journal or pages directly from flash, at the expense of some RAM.

mkmnemofs

This will format the flash to the file system's needs. It will be available as a standalone utility as well as through the mount options.

Conclusion

This is mnemofs. It was quite a lot and, believe me, my original attempt was around 19 pages of information. If you need clarification on any part, or if you find any glaring holes, please let me know and I will update this accordingly. Thank you in advance!

Glaring Limitations

  • It follows in the footsteps of JFFS2 in consuming a fair bit of memory, which grows linearly with usage, although efforts have been made to reduce it.
  • It only supports SLC, not MLC yet. Given the rising popularity of MLC, supporting it is one of the future plans.

Testing

Apart from simulation testing, I will be testing this on some Arm boards I have, but I am hopeful the community will be willing to test this file system.

It would be really great to cover as many architectures (and boards) as possible. Apart from a controlled testing environment, a file system's main test lies in real-life usage, including treatment as brutal as possible.

@resyfer
Contributor Author

resyfer commented Mar 23, 2024

Graphs

Block status


Hash Search

The left graph shows all points, and the right shows a random few points from the left, to make the comparison visible.

Randomness

POSIX random function slightly modified:

@acassis
Contributor

acassis commented Mar 24, 2024

@resyfer wow!!! Amazing work!

@xiaoxiang781216 could you or someone from Xiaomi with previous experience with NAND file system review this proposal?

@xiaoxiang781216
Contributor

@Donny9 could you review the plan?
