Superblock corruption #953
make that three: another device has just succumbed to the same issue, whatever it is.
there is a slight suspicion that #901 may have been a catalyst for it, though it needs confirmation.
Hmmm, this certainly looks like an unrelated mdir is being written to blocks 0x{0,1}. That at least rules out CTZ skip-list issues. Is the log_090-20240301-205905.log file stored in the root dir? To be honest this sounds a lot like a driver issue, though I also know from other discussion you've been using littlefs for a while. Has anything changed hardware-wise? Another user had a similar issue that turned out to be using the wrong chip which led to silent address truncation, for example. Is it possible because of the increased writing this is the first time littlefs has wrapped around storage due to wear-leveling? How large is your storage? It would be interesting to see if the last blocks on the storage contain any interesting data.
I agree, it'd be interesting to see if any other superblocks (blocks with "littlefs" at offset=8 bytes) exist on the system. Well, if these devices started with blank storage. It could also just be wrap-around related. If you have devices under test still, it would be interesting to put an assert in
Feel free to send it to geky at geky.net, though no promises I'll be able to find anything. It's a bit hard to understand a filesystem when the superblock is gone, since you don't know what blocks are out-of-date. Aside from the above points, the only other thing I can think of would be to use
Ah, correction, we rely on
yes (everything is in the root dir in our case)
nope, we've been using the same block layer for ages
that is certainly a possibility.
in this case - 448 KB (we have devices with varying fs sizes). about half of it is one big file (index.html.gz), and apart from that it's a dozen or so small files - config, some storage... and the recent addition are the log files, a set of at most 4, at most 4K each.
no, the last 2 blocks appear to be data blocks (they contain what appears to be valid files - a log file and a config json file respectively).
no, in fact string
i sent filesystem dumps by email, along with some logs
we only format externally, when preparing an image, so i can differentiate between usage by device vs mklfs (our image creation tool).
also dropped
Well that rules out address truncation, though it might still be that some layer thinks the block device is larger than it actually is.
That makes an issue with #901 seem less likely. Adding more superblocks should make the "littlefs" string appear more often, not less.
Hmm, I haven't received these yet. Not in spam either. GitHub is also sending notification updates through this domain just fine. Maybe try sending an email with no attachments first? Gmail may just be being a butt...
re-sent with a link instead
Two other possibilities come to mind:
ok, so we did not get any crashes on the block alloc assertion, but we are still getting corruptions where the superblock ceases to be super. we added this to our write function:
which basically is - crash if writing block 0 or 1 and the first 128 bytes (our io size) do not contain
contents of the
another one:
buffer:
does this help? we have memory dumps corresponding to these but don't quite know what to look for.
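Our reconstruction of the guard described above (the actual snippet didn't survive the paste, so function names here are ours): before programming block 0 or 1, scan the first 128-byte prog unit for the "littlefs" magic and trap if it's absent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IO_SIZE 128  /* our prog/io size */

/* scan the first prog unit for the "littlefs" magic string */
static int first_prog_has_magic(const uint8_t *buf, size_t len) {
    if (len > IO_SIZE) len = IO_SIZE;
    for (size_t i = 0; i + 8 <= len; i++) {
        if (memcmp(&buf[i], "littlefs", 8) == 0) return 1;
    }
    return 0;
}

static void checked_prog(uint32_t block, uint32_t off,
                         const uint8_t *buf, size_t len) {
    if (block <= 1 && off == 0) {
        assert(first_prog_has_magic(buf, len) &&
               "superblock write without littlefs magic");
    }
    /* ... actual flash program here ... */
    (void)buf; (void)len;
}
```

The scan (rather than a fixed offset) is because the magic lives inside a metadata commit, so its exact position in the block can shift between compactions.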
Smart! It is only asserting when the cache is flushed, which is a bit removed from when the new metadata is "staged" (written to pcache), but this is still useful info. You could add this assert to littlefs's internal

An aside: What exact commit hash are you on? The line numbers don't seem to quite line up with 2.8.2.

One thing to note is this stack trace is going through a normal compaction (lfs_dir_compact in lfs_dir_splittingcompact), NOT the expanding superblock compaction route. This is another point against a #901 bug, though it's still not impossible.

I'm writing as I'm looking into this, sorry if my thoughts are out of order or lead nowhere.
This is a very strange tag. The super-type 0x100 (0x9fe & 0x700) should be internal-only and never written to disk. But we never use the sub-type 0xfe with LFS_TYPE_FROM. Maybe this is caused by an underflow somewhere? Not sure how this tag is possible but it looks too much like a tag to ignore.
This looks like a checksum tag (crc + padding) but... isn't a checksum tag? (0x5xx -> checksum, 0x3xx -> userattr?). Do your files have userattrs that end in a bunch of 0xffs? Looking at the second dump:
This commit is just a mess. The tag

On second thought, moving this assert into littlefs's internal

Something else I'm concerned about: conf9.json appears to reside in id=0. In blocks 0,1 this should normally be the superblock entry. In

I'm not really sure what could cause this. Do you ever move the
it's 2.8.2 + assertion + lfs_probe from #947
no, it is allocated on open and never moved while the file is open.
i think it does. at least doesn't look obviously stomped over:
and this is for the second one:
Hmm, it doesn't look corrupted. So at least arbitrary memory corruption is unlikely. It is weird that the id=14/23, but the tag's id ends up 0. This may be a stretch, but have you checked for stack overflow? The commit sequence of code gets quite deep. A change in stack depth could also be explained by a littlefs version change. Will dig more...
no, we're ok wrt stack: 6K and 4K free out of 8K total. additionally, we have a stack canary watchpoint at the end of the stack (
we have an early indication that it is caused by a change between 2.7.0 and 2.8.2 - we managed to isolate what we think is a reliable repro for this, and going back to 2.7.0 code (with the same fs state) makes it go away (at least the lfs_prog assertion).
I stand corrected, I forgot about the tag xoring. These dumps make more sense than I thought. Running the first dump through readmdir.py:

```
$ ./scripts/readmdir.py dump1 512 0 -a -T
mdir {0x0} rev 513 -> {0x75, 0x76}
off       tag       type       id  len
00000008: 601ffc08  hardtail    .    8
00000008: 75 00 00 00 76 00 00 00                          u...v...
00000014: 500ffc6c  ccrc 0x0    .  108
00000014: be b2 f8 b2 ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000024: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000034: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000044: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000054: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000064: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000074: ff ff ff ff ff ff ff ff ff ff ff ff              ............
...
```

This is a well-formed tail commit, but the superblock is missing. And the file contents are missing?

```
$ ./scripts/readmdir.py dump2 512 0 -a -T
mdir {0x0} rev 181 (corrupted!)
off       tag       type       id  len
00000008: 0c0000e5  name 0xc0   0  229
00000008: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000018: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000028: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000038: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000048: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000058: ff ff ff 10 1f e0 78 40 00 00 0a 63 6f 6e 66 39  ......x@...conf9
00000068: 2e 6a 73 6f 6e 20 30 00 02 1a 00 00 00 d2 06 00  .json 0.........
00000078: 00 6f d0 48 08 1f ff a8 ff ff ff ff ff ff ff ff  .o.H............
00000088: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000098: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000000a8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000000b8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000000c8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000000d8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000000e8: ff ff ff ff ff                                    .....
...
```

This commit makes less sense. The type

It's curious that these commits are doing very different things...
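For anyone following along, a sketch of how the 32-bit metadata tags in these dumps unpack (this is our reading of the littlefs v2 tag layout; note that on disk each tag is xored with the previous one, which readmdir.py has already undone here):

```c
#include <assert.h>
#include <stdint.h>

/* littlefs v2 metadata tag layout (after un-xoring):
 *   [31]    valid bit
 *   [30:20] type (11 bits); the top 3 bits are the "super-type"
 *   [19:10] id (10 bits); 0x3ff means "no id"
 *   [9:0]   length (10 bits) */
struct tag {
    uint16_t type;
    uint16_t id;
    uint16_t len;
};

static struct tag tag_decode(uint32_t t) {
    struct tag d;
    d.type = (t >> 20) & 0x7ff;
    d.id   = (t >> 10) & 0x3ff;
    d.len  = t & 0x3ff;
    return d;
}
```

For example, the first tag in dump1, 0x601ffc08, unpacks to type 0x601 (hardtail), id 0x3ff (none, printed as "."), length 8, matching the readmdir output above.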
I forgot I asked you to add asserts, that would offset the line numbers a bit wouldn't it...
Looking at the changes, the biggest ones are making
A reproducible example would be quite a good find :)
From an out-of-band discussion, thanks to work from Nikola Kosturski and @rojer a reproducible test case for the missing magic was found. What's interesting is this turned out to not actually be an erroneous case as far as

But this is an issue in terms of documentation and expected behavior. I've put up #959 to try to fix the state of things.

There is still an issue with the superblock entry disappearing entirely, which shouldn't happen and whose cause is unknown.
In the context of #959, the first dump makes perfect sense now. It's a superblock mdir with the superblock entry deleted (but a valid tail pointer pointing to the new superblock). The second dump still does not make sense:
It looks like the magic tag is being corrupted somehow. The fact that there also look to be some remnants of a file (conf9.json) in the superblock suggests either superblock expansion hasn't occurred, or the error is during superblock expansion.

Thinking out loud, I wonder if the block ends up not erased. This corruption could come from flash masking if multiple progs occur without an erase...

Also a random thought, but to make results show up quicker, you could drop the
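To illustrate the flash-masking theory: NOR flash programming can only clear bits (1 -> 0), so a prog without a preceding erase effectively ANDs the new data into whatever is already there. Two different commits landing on the same un-erased region would mangle each other, which could produce exactly this kind of half-recognizable tag. A minimal model (not littlefs code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

static uint8_t flash[BLOCK_SIZE];

/* erase sets all bits to 1 */
static void flash_erase(void) {
    memset(flash, 0xff, sizeof(flash));
}

/* programming can only clear bits: new data is ANDed into old data */
static void flash_prog(size_t off, const uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        flash[off + i] &= buf[i];
    }
}
```

Programming 0xaa and then 0x55 to the same byte without an erase in between yields 0x00, not 0x55: neither commit survives intact.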
Double checking block erase during superblock expansion: It seems hard for littlefs to miss an erase in this case. It's the first thing we do in
What is your opinion: is #962 something similar or totally different? What are you using to read out such nice hex dumps (are you also using some mcu with an external flash)?
This looks like the same issue. #959 is a bit of a red herring. The magic string can go missing, but the current driver works around this without erroring. It's more a spec/impl disagreement. Ignoring #959, there is still an issue where superblock expansion can lead to filesystem corruption. Seemingly due to a rogue tail pointer commit. Nikola has also found a stack trace that suggests
That would be a question for @rojer, fortunately they've been able to send the extracted binaries. There may be some external tool that can read SPI flash chips, otherwise you could write a loop to print the disk a byte at a time over your JTAG/debugger. Note if you go over JTAG you should really have some form of error-detection, JTAG/SWD can be noisy. The easiest thing to do is send each block three times. If you also happen to have an SD card or network interface, going that route may be easier.
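The "send each block three times" idea can be sketched as a per-byte 2-of-3 majority vote on the receiving side (a minimal sketch, names ours): a bit is kept in the output iff at least two of the three copies agree on it, so any single corrupted copy is outvoted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 2-of-3 bitwise majority vote across three copies of a block,
 * recovering from noise that hits only one copy at a given bit */
static void majority3(const uint8_t *a, const uint8_t *b,
                      const uint8_t *c, uint8_t *out, size_t len) {
    for (size_t i = 0; i < len; i++) {
        out[i] = (a[i] & b[i]) | (a[i] & c[i]) | (b[i] & c[i]);
    }
}
```

This only detects/corrects disagreement between copies; if the link corrupts the same bit the same way twice, it still gets through, so a checksum over the final image is a sensible addition.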
re: dumps - some come from reading the device's flash, some were taken in a gdb session while examining a crash dump. after that, it's

@geky speaking of #959 - i'm somewhat concerned by its current open status, do you think there may be something wrong with it? or is it awaiting review? we are about to ship a production release with it.
I was hoping to resolve the remaining issue in the same release. Or at least better understand what is going on. #959 is in a weird place where it's technically a bug, but it doesn't actually break anything. And since the current behavior is already out there and in use, it doesn't seem like the highest priority to bring in. That and it's a behavior change which affects the disk, so we need to be confident it is correct. We can't really take it back after releasing.
Until/unless we find more info, I'll go ahead and merge #959
Two devices in our test fleet experienced a weird corruption where the filesystem is not obviously corrupt but the super block seems to have lost its superness - missing the littlefs tag, for example. I have two flash dumps; here are the first two blocks from one of them: https://gist.github.com/rojer/6a9fe5f2947a12b660570534252474f8
I can share full dumps privately, please feel free to reach out.
I have examined console logs from the devices at the time the failure occurred and it doesn't seem to be associated with anything unusual - no power loss event, for example, though both happened shortly after a soft reboot. The devices remained responsive, though seemingly having lost the ability to read files (file not found errors). Upon rebooting, both were unable to mount their filesystems (unsurprisingly) and thus became bricks.