Superblock corruption #953

Open · rojer opened this issue Mar 6, 2024 · 28 comments
Labels: needs investigation (no idea what is wrong)

rojer commented Mar 6, 2024

Two devices in our test fleet experienced a weird corruption where the filesystem is not obviously corrupt, but the super block seems to have lost its superness - missing the littlefs tag, for example.
I have two flash dumps; here are the first two blocks from one of them: https://gist.github.com/rojer/6a9fe5f2947a12b660570534252474f8

I can share full dumps privately, please feel free to reach out.

I have examined console logs from the devices at the time the failure occurred and it doesn't seem to be associated with anything unusual - no power loss event, for example, though both happened shortly after a soft reboot. The devices remained responsive, though they seemingly lost the ability to read files (file not found errors). Upon rebooting, both were unable to mount their filesystems (unsurprisingly) and thus became bricks.

rojer commented Mar 6, 2024

make that three: another device has just succumbed to the same issue, whatever it is.
we are using 2.8.2 in disk format 2.0 mode.
the only somewhat relevant change in the environment has been file logging: we started logging to filesystem, which would increase write activity and fs utilization significantly.

rojer commented Mar 7, 2024

there is a slight suspicion that #901 may have been a catalyst for it, though it needs confirmation.

geky added the needs investigation label on Mar 7, 2024

geky commented Mar 7, 2024

Hmmm, this certainly looks like an unrelated mdir is being written to blocks 0x{0,1}. That at least rules out CTZ skip-list issues. Is the log_090-20240301-205905.log file stored in the root dir?

To be honest this sounds a lot like a driver issue, though I also know from other discussions you've been using littlefs for a while. Has anything changed hardware-wise? Another user had a similar issue that turned out to be caused by using the wrong chip, which led to silent address truncation, for example.

Is it possible that, because of the increased writing, this is the first time littlefs has wrapped around the storage due to wear-leveling? How large is your storage?

It would be interesting to see if the last blocks on the storage contain any interesting data.

there is a slight suspicion that #901 may have been a catalyst for it, though it needs confirmation.

I agree, it'd be interesting to see if any other superblocks (blocks with "littlefs" at offset=8 bytes) exist on the system. Well, if these devices started with blank storage.
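
A rough host-side sketch of such a scan, for reference (this is just an illustration, not an existing littlefs script; pass whatever block size your filesystem uses):

/* scan_magic.c - walk a raw flash dump and report blocks whose bytes at
 * offset 8 spell "littlefs", i.e. candidate superblocks. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <image> <block_size>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    long block_size = strtol(argv[2], NULL, 0);

    unsigned char magic[8];
    for (long block = 0; ; block++) {
        if (fseek(f, block*block_size + 8, SEEK_SET) != 0) break;
        if (fread(magic, 1, sizeof(magic), f) != sizeof(magic)) break;
        if (memcmp(magic, "littlefs", 8) == 0) {
            printf("candidate superblock at block %ld\n", block);
        }
    }

    fclose(f);
    return 0;
}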

It could also just be wrap-around related.

If you have devices under test still, it would be interesting to put an assert in lfs_alloc (here) to assert that it never returns block 0 or 1 (actually why don't we already have this?).

I can share full dumps privately, please feel free to reach out.

Feel free to send it to geky at geky.net, though no promises I'll be able to find anything. It's a bit hard to understand a filesystem when the superblock is gone, since you don't know what blocks are out-of-date.

Aside from the above points, the only other thing I can think of would be to use ./scripts/readmdir.py to try to see if any mdirs contain a pointer to blocks 0, 1, which shouldn't happen normally.

geky commented Mar 7, 2024

to assert that it never returns block 0 or 1 (actually why don't we already have this?)

Ah, correction, we rely on lfs_alloc returning blocks 0, 1 in lfs_format_. So this would only work if you don't format.

rojer commented Mar 7, 2024

Is the log_090-20240301-205905.log file stored in the root dir?

yes (everything is in the root dir in our case)

Has anything changed hardware-wise?

nope, we've been using the same block layer for ages

Is it possible because of the increased writing this is the first time littlefs has wrapped around storage due to wear-leveling?

that is certainly a possibility.

How large is your storage?

in this case - 448 KB (we have devices with varying fs sizes). about half of it is one big file (index.html.gz), and apart from that it's a dozen or so small files - config, some storage... and the recent addition is the log files, a set of at most 4, at most 4K each.

see if the last blocks on the storage contain any interesting data

no, the last 2 blocks appear to be data blocks (they contain what appear to be valid files - a log file and a config json file, respectively).

if any other superblocks (blocks with "littlefs" at offset=8 bytes) exist on the system

no, in fact string littlefs does not appear in the entire image

rojer commented Mar 7, 2024

i sent filesystem dumps by email, along with some logs

rojer commented Mar 7, 2024

assert that it never returns block 0 or 1
this would only work if you don't format

we only format externally, when preparing an image, so i can differentiate between usage by device vs mklfs (our image creation tool).
i added the assertion to lfs_alloc:

diff --git a/littlefs/lfs.c b/littlefs/lfs.c
index 105915b..d9ad91f 100644
--- a/littlefs/lfs.c
+++ b/littlefs/lfs.c
@@ -662,7 +662,11 @@ static int lfs_alloc(lfs_t *lfs, lfs_block_t *block) {
                     lfs->free.i += 1;
                     lfs->free.ack -= 1;
                 }
-
+// Debug LFS corruption
+// https://github.com/littlefs-project/littlefs/issues/953#issuecomment-1984394787
+#ifndef LFS_TOOLS_BUILD
+                LFS_ASSERT(!(*block == 0 || *block == 1));
+#endif
                 return 0;
             }
         }

also dropped LFS_NO_ASSERT (which we had defined to save space). will try over the next few days and report.

geky commented Mar 7, 2024

see if the last blocks on the storage contain any interesting data

no, last 2 blocks appear to be data blocks (contain what appears to be valid failes - a log file and a config json file respectively).

Well that rules out address truncation, though it might still be that some layer thinks the block device is larger than it actually is.

if any other superblocks (blocks with "littlefs" at offset=8 bytes) exist on the system

no, in fact string littlefs does not appear in the entire image

That makes an issue with #901 seem less likely. Adding more superblocks should make the "littlefs" string appear more often, not less.

i sent filesystem dumps by email, along with some logs

Hmm, I haven't received these yet. Not in spam either. GitHub is also sending notification updates through this domain just fine.

Maybe try sending an email with no attachments first? Gmail may just be being a butt...

rojer commented Mar 7, 2024

re-sent with a link instead

geky commented Mar 7, 2024

Two other possibilities come to mind:

  1. in this case - 448 KB (we have devices with varying fs sizes)

    It could be a lookahead buffer vs block count alignment issue (in littlefs). The block allocator/lookahead buffer was changed a bit recently. We have tests over this, but there can always be holes in tests.

    What lookahead_size are you using? I'm assuming block_count=112 for the 448 KiB (see the example config sketch after this list).

  2. It could be something internally is causing littlefs to delete id 0 in the superblock, which is what the string "littlefs" and all the superblock info is attached to.

    Though I'm not sure what could trigger this if it happens without powerloss...
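
For reference (and for point 1 above), a sketch of the config fields in question. These values are guesses derived from the 448 KiB figure and the 128-byte io size mentioned in this thread, not your actual settings:

#include "lfs.h"

// block device op pointers (.read/.prog/.erase/.sync) omitted for brevity
static const struct lfs_config cfg = {
    .read_size      = 128,
    .prog_size      = 128,
    .block_size     = 4096,  // 112 * 4 KiB = 448 KiB
    .block_count    = 112,
    .cache_size     = 128,
    .block_cycles   = 500,
    .lookahead_size = 16,    // must be a multiple of 8; the alignment question
                             // is how this size interacts with block_count=112
};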

rojer commented Mar 13, 2024

ok, so we did not get any crashes on the block alloc assertion, but we are still getting corruptions where the superblock ceases to be super. we added this to our write function:

  if (block < 2 && off == 0) {
    struct shos_str content = SHOS_MK_STR_N((const char *) buffer, size);
    struct shos_str lfs = SHOS_MK_STR("littlefs");
    assert(shos_strstr(content, lfs) != NULL);
  }

which basically is - crash if writing block 0 or 1 and the first 128 bytes (our io size) do not contain littlefs.
and we get quite a few of these, for example:

...
#2  0x403924c6 in __assert_func (file=<optimized out>, file@entry=0x3c170a7d "shos_vfs_lfs.c", line=line@entry=69, func=<optimized out>, func@entry=0x3c170a3c <__func__.14970> "shos_lfs_prog", expr=<optimized out>, expr@entry=0x3c170a58 "shos_strstr(content, lfs) != NULL") at assert.c:85
#3  0x4203ffe8 in shos_lfs_prog (c=0x3fca94e8, block=0, off=0, buffer=0x3fca9644, size=128) at shos_vfs_lfs.c:69
#4  shos_lfs_prog (c=0x3fca94e8, block=0, off=0, buffer=0x3fca9644, size=128) at shos_vfs_lfs.c:64
#5  0x42041e30 in lfs_bd_flush (lfs=lfs@entry=0x3fca9538, pcache=pcache@entry=0x3fca9548, rcache=rcache@entry=0x3fca9538, validate=validate@entry=false) at lfs.c:181
#6  0x420422b4 in lfs_bd_sync (validate=false, rcache=0x3fca9538, pcache=0x3fca9548, lfs=0x3fca9538) at lfs.c:215
#7  lfs_dir_commitcrc (lfs=lfs@entry=0x3fca9538, commit=commit@entry=0x3fc9a0f8 <shos_task_stack+5620>) at lfs.c:1725
#8  0x42042e12 in lfs_dir_compact (lfs=lfs@entry=0x3fca9538, dir=dir@entry=0x3fc9a210 <shos_task_stack+5900>, attrs=attrs@entry=0x3fc9a2b0 <shos_task_stack+6060>, attrcount=attrcount@entry=2, source=source@entry=0x3fc9a210 <shos_task_stack+5900>, begin=begin@entry=0, end=end@entry=0) at lfs.c:2026
#9  0x42042ffc in lfs_dir_splittingcompact (begin=0, end=0, source=0x3fc9a210 <shos_task_stack+5900>, attrcount=2, attrs=0x3fc9a2b0 <shos_task_stack+6060>, dir=0x3fc9a210 <shos_task_stack+5900>, lfs=0x3fca9538) at lfs.c:2180
#10 lfs_dir_relocatingcommit (lfs=lfs@entry=0x3fca9538, dir=dir@entry=0x3fc9a210 <shos_task_stack+5900>, pair=pair@entry=0x3fcc2ad8, attrs=attrs@entry=0x3fc9a2b0 <shos_task_stack+6060>, attrcount=attrcount@entry=2, pdir=pdir@entry=0x3fc9a230 <shos_task_stack+5932>) at lfs.c:2298
#11 0x42043438 in lfs_dir_orphaningcommit (lfs=lfs@entry=0x3fca9538, dir=0x3fcc2ad8, attrs=0x3fc9a2b0 <shos_task_stack+6060>, attrcount=2) at lfs.c:2379
#12 0x420438ee in lfs_dir_commit (lfs=0x3fca9538, dir=<optimized out>, attrs=<optimized out>, attrcount=<optimized out>) at lfs.c:2551
#13 0x42043af2 in lfs_file_rawsync (file=0x3fcc2ad0, lfs=0x3fca9538) at lfs.c:3409
#14 lfs_file_rawsync (lfs=0x3fca9538, file=0x3fcc2ad0) at lfs.c:3373
#15 0x42043b2c in lfs_file_rawclose (lfs=lfs@entry=0x3fca9538, file=file@entry=0x3fcc2ad0) at lfs.c:3190
#16 0x420445ea in lfs_file_close (lfs=lfs@entry=0x3fca9538, file=file@entry=0x3fcc2ad0) at lfs.c:6067
#17 0x4204030e in shos_vfs_fs_lfs_close (fs=<optimized out>, fd=<optimized out>) at shos_vfs_lfs.c:364
...

contents of the buffer passed to shos_lfs_prog (frame 3):

00000000  01 02 00 00 9f e0 03 f7  75 00 00 00 76 00 00 00  |........u...v...|
00000010  30 10 00 64 be b2 f8 b2  ff ff ff ff ff ff ff ff  |0..d............|
00000020  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000030  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000040  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000050  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000060  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000070  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|

another one:

...
#2  0x40097d2c in __assert_func (file=<optimized out>, line=<optimized out>, func=<optimized out>, expr=<optimized out>) at assert.c:85
#3  0x40105c2b in shos_lfs_prog (c=0x3ffcd248, block=0, off=0, buffer=0x3ffcd3a4, size=128) at shos_vfs_lfs.c:64
#4  shos_lfs_prog (c=0x3ffcd248, block=0, off=0, buffer=0x3ffcd3a4, size=128) at shos_vfs_lfs.c:64
#5  0x401073e8 in lfs_bd_flush (lfs=0x3ffcd298, pcache=0x3ffcd2a8, rcache=0x3ffcd298, validate=false) at lfs.c:181
#6  0x401074dc in lfs_bd_prog (size=0, buffer=<optimized out>, off=<optimized out>, block=<optimized out>, validate=<optimized out>, rcache=<optimized out>, pcache=<optimized out>, lfs=<optimized out>) at lfs.c:251
#7  lfs_bd_prog (lfs=0x3ffcd298, pcache=0x3ffcd2a8, rcache=0x3ffcd298, validate=false, block=0, off=<optimized out>, buffer=<optimized out>, size=1) at lfs.c:227
#8  0x40107525 in lfs_dir_commitprog (lfs=0x3ffcd298, commit=0x3ffc4000 <shos_task_stack+7188>, buffer=0x3ffc3ea4 <shos_task_stack+6840>, size=1) at lfs.c:1559
#9  0x401075d4 in lfs_dir_commitattr (lfs=0x3ffcd298, commit=0x3ffc4000 <shos_task_stack+7188>, tag=<optimized out>, buffer=0x3ffc3f88 <shos_task_stack+7068>) at lfs.c:1608
#10 0x401075f1 in lfs_dir_commit_commit (p=0x3ffc402c <shos_task_stack+7232>, tag=2348810469, buffer=0x3ffc3f88 <shos_task_stack+7068>) at lfs.c:1887
#11 0x40106aa5 in lfs_dir_traverse (lfs=0x3ffcd298, dir=0x3ffc4130 <shos_task_stack+7492>, off=<optimized out>, ptag=2348810469, attrs=0x3ffc41e0 <shos_task_stack+7668>, attrcount=2, tmask=tmask@entry=1074789376, begin=begin@entry=0, end=end@entry=23, diff=diff@entry=0, 
    cb=cb@entry=0x401075e0 <lfs_dir_commit_commit>, data=data@entry=0x3ffc402c <shos_task_stack+7232>, ttag=0) at lfs.c:1039
#12 0x40108110 in lfs_dir_compact (lfs=0x3ffcd298, dir=0x3ffc4130 <shos_task_stack+7492>, attrs=0x3ffc41e0 <shos_task_stack+7668>, attrcount=2, source=0x3ffc4130 <shos_task_stack+7492>, begin=<optimized out>, end=end@entry=23) at lfs.c:1969
#13 0x4010831e in lfs_dir_splittingcompact (begin=0, end=23, source=0x3ffc4130 <shos_task_stack+7492>, attrcount=2, attrs=0x3ffc41e0 <shos_task_stack+7668>, dir=0x3ffc4130 <shos_task_stack+7492>, lfs=0x3ffcd298) at lfs.c:2180
#14 lfs_dir_relocatingcommit (lfs=0x3ffcd298, dir=0x3ffc4130 <shos_task_stack+7492>, pair=0x3ffd6d90, attrs=0x3ffc41e0 <shos_task_stack+7668>, attrcount=2, pdir=<optimized out>) at lfs.c:2298
#15 0x4010873a in lfs_dir_orphaningcommit (lfs=0x3ffcd298, dir=0x3ffd6d90, attrs=0x3ffc41e0 <shos_task_stack+7668>, attrcount=2) at lfs.c:2379
#16 0x40108bac in lfs_dir_commit (lfs=0x3ffcd298, dir=0x3ffd6d90, attrs=0x3ffc41e0 <shos_task_stack+7668>, attrcount=2) at lfs.c:2551
#17 0x40108d58 in lfs_file_rawsync (file=0x3ffd6d88, lfs=0x3ffcd298) at lfs.c:3409
#18 lfs_file_rawsync (lfs=0x3ffcd298, file=0x3ffd6d88) at lfs.c:3373
#19 0x4010972f in lfs_file_sync (lfs=0x3ffcd298, file=0x3ffd6d88) at lfs.c:6083
#20 0x40105e81 in shos_vfs_fs_lfs_fsync (fs=<optimized out>, fd=0) at shos_vfs_lfs.c:380
#21 0x40104bb1 in shos_vfs_fsync (vfd=256) at shos_vfs.c:433
#22 0x400e84ad in esp_vfs_fsync (fd=<optimized out>) at vfs.c:565
...
...

buffer:

00000000  b5 00 00 00 f3 ff ff 1a  ff ff ff ff ff ff ff ff  |................|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000020  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000030  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000040  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000050  ff ff ff ff ff ff ff ff  ff ff ff 10 1f e0 78 40  |..............x@|
00000060  00 00 0a 63 6f 6e 66 39  2e 6a 73 6f 6e 20 30 00  |...conf9.json 0.|
00000070  02 1a 00 00 00 d2 06 00  00 6f d0 48 08 1f ff a8  |.........o.H....|

does this help? we have memory dumps corresponding to these but don't quite know what to look for.

geky commented Mar 15, 2024

which basically is - crash if writing block 0 or 1 and the first 128 bytes (our io size) do not contain littlefs.

Smart! It is only asserting when the cache is flushed, which is a bit removed from when the new metadata is "staged" (written to pcache), but this is still useful info.

You could add this assert to littlefs's internal lfs_bd_read function, but I'm not sure it would really give us more than what you've found.

An aside: What exact commit hash are you on? The line numbers don't seem to quite line up with 2.8.2.

One thing to note is this stack trace is going through a normal compaction (lfs_dir_compact in lfs_dir_splittingcompact), NOT the expanding superblock compaction route. This is another point against a #901 bug, though it's still not impossible.


I'm writing as I'm looking into this, sorry if my thoughts are out of order or lead nowhere.

9f e0 03 f7 

This is a very strange tag. The super-type 0x100 (0x9fe & 0x700) should be internal-only and never written to disk. But we never use the sub-type 0xfe with LFS_TYPE_FROM. Maybe this is caused by an underflow somewhere? Not sure how this tag is possible but it looks too much like a tag to ignore.

00000000  01 02 00 00 9f e0 03 f7  75 00 00 00 76 00 00 00  |........u...v...|
00000010  30 10 00 64 be b2 f8 b2  ff ff ff ff ff ff ff ff  |0..d............|

This looks like a checksum tag (crc + padding) but... isn't a checksum tag? (0x5xx -> checksum, 0x3xx -> userattr?). Do your files have userattrs that end in a bunch of 0xffs?

Looking at the second dump:

00000000  b5 00 00 00 f3 ff ff 1a  ff ff ff ff ff ff ff ff  |................|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000020  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000030  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000040  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000050  ff ff ff ff ff ff ff ff  ff ff ff 10 1f e0 78 40  |..............x@|
00000060  00 00 0a 63 6f 6e 66 39  2e 6a 73 6f 6e 20 30 00  |...conf9.json 0.|
00000070  02 1a 00 00 00 d2 06 00  00 6f d0 48 08 1f ff a8  |.........o.H....|

This commit is just a mess. The tag f3 ff ff 1a has an immense size that is probably an underflow/uninitialized/etc. The 00 00 0a 63 tag is at least a proper looking name tag, but the preceding bytes don't make sense as far as I can tell.


On second thought, moving this assert into littlefs's internal lfs_bd_read function may be useful to at least rule out driver/hardware related issues.


Something else I'm concerned about: conf9.json appears to reside in id=0. In blocks 0,1 this should normally be the superblock entry.

In lfs_file_t, .m.pair={0,1} and .id=0 should NEVER happen. It may be fruitful to add an assert on this in lfs_file_sync.
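
Something along these lines (a sketch only, assuming it lives in lfs.c so lfs_file_t and LFS_ASSERT are in scope; exact placement in lfs_file_rawsync is up for grabs):

// An open regular file should never occupy the superblock's slot, i.e. the
// mdir pair {0,1} with id 0. Call this from lfs_file_rawsync before committing.
static void lfs_file_assert_not_superblock_slot(const lfs_file_t *file) {
    int in_pair01 =
            (file->m.pair[0] == 0 || file->m.pair[0] == 1) &&
            (file->m.pair[1] == 0 || file->m.pair[1] == 1);
    LFS_ASSERT(!(in_pair01 && file->id == 0));
}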

I'm not really sure what could cause this. Do you ever move the lfs_file_t struct? Does the lfs_file_t's .type field make sense? (littlefs always sets this to LFS_TYPE_REG, 0x001, otherwise maybe memory corruption?)

rojer commented Mar 15, 2024

What exact commit hash are you on? The line numbers don't seem to quite line up with 2.8.2.

it's 2.8.2 + assertion + lfs_probe from #947
Here it is, for reference: https://gist.github.com/rojer/cc5fd75cd65a7c64df85c73287616bcb

Do you ever move the lfs_file_t struct?

no, it is allocated on open and never moved while the file is open.

Does the lfs_file_t's .type field make sense?

i think it does. at least doesn't look obviously stomped over:

(gdb) frame 16
#16 0x420445ea in lfs_file_close (lfs=lfs@entry=0x3fca9538, file=file@entry=0x3fcc2ad0) at lfs.c:6067
6067    lfs.c: No such file or directory.
(gdb) print *file
$1 = {next = 0x0, id = 23, type = 1 '\001', m = {pair = {1, 0}, rev = 512, off = 4096, etag = 1343224952, count = 26, erased = false, split = false, tail = {4294967295, 4294967295}}, ctz = {head = 116, size = 136}, flags = 66818, pos = 136, block = 116, off = 136, cache = {block = 4294967295, off = 128, size = 8, 
    buffer = 0x3fcc2c64 '\377' <repeats 128 times>, "T"}, cfg = 0x3c170d38 <defaults>}

(gdb) print *lfs
$1 = {rcache = {block = 4294967295, off = 128, size = 128, buffer = 0x3fca95c0 '\377' <repeats 128 times>, "\200"}, pcache = {block = 0, off = 0, size = 24, buffer = 0x3fca9644 "\001\002"}, root = {117, 118}, mlist = 0x3fcc2ad0, seed = 596460605, gstate = {tag = 0, pair = {0, 0}}, gdisk = {tag = 0, pair = {0, 0}}, 
  gdelta = {tag = 0, pair = {0, 0}}, free = {off = 116, size = 256, i = 3, ack = 254, buffer = 0x3fca96c8}, cfg = 0x3fca94e8, block_count = 256, name_max = 255, file_max = 2147483647, attr_max = 1022}

rojer commented Mar 15, 2024

and this is for the second one:

(gdb) frame 19
#19 0x4010972f in lfs_file_sync (lfs=0x3ffcd298, file=0x3ffd6d88) at lfs.c:6083
6083    lfs.c: No such file or directory.
(gdb) print *file
$1 = {next = 0x0, id = 14, type = 1 '\001', m = {pair = {1, 0}, rev = 180, off = 4096, etag = 1343224922, count = 23, erased = true, split = false, tail = {4294967295, 4294967295}}, ctz = {head = 25, size = 3354}, flags = 67842, pos = 3354, block = 25, off = 3354, cache = {block = 4294967295, off = 3328, size = 26, 
    buffer = 0x3ffd7000 '\377' <repeats 128 times>, " "}, cfg = 0x3f410b74 <defaults>}
(gdb) print *lfs
$1 = {rcache = {block = 1, off = 3968, size = 128, buffer = 0x3ffcd320 "\020\037\340x@"}, pcache = {block = 0, off = 0, size = 128, buffer = 0x3ffcd3a4 "\265"}, root = {1, 0}, mlist = 0x3ffd6d88, seed = 2227447927, gstate = {tag = 0, pair = {0, 0}}, gdisk = {tag = 0, pair = {0, 0}}, gdelta = {tag = 0, pair = {0, 
      0}}, free = {off = 106, size = 112, i = 33, ack = 112, buffer = 0x3ffcd428}, cfg = 0x3ffcd248, block_count = 112, name_max = 255, file_max = 2147483647, attr_max = 1022}

geky commented Mar 15, 2024

Hmm, it doesn't look corrupted. So at least arbitrary memory corruption is unlikely.

It is weird that the file's id is 14/23, but the tag's id ends up 0.

This may be a stretch, but have you checked for stack overflow? The commit sequence of code gets quite deep. A change in stack depth could also be explained by a littlefs version change.

Will dig more...

rojer commented Mar 15, 2024

no, we're ok wrt stack: 6K and 4K free out of 8K total. additionally, we have a stack canary watchpoint at the end of the stack (CONFIG_FREERTOS_WATCHPOINT_END_OF_STACK=y if you are familiar with idf) - it causes an exception in case of stack overflow.

rojer commented Mar 15, 2024

we have an early indication that it is caused by a change between 2.7.0 and 2.8.2 - we managed to isolate what we think is a reliable repro for this, and going back to 2.7.0 code (with the same fs state) makes it go away (at least the lfs_prog assertion).
this needs more work to fully confirm (the week ended, leaving us with a bit of a cliffhanger there :) ) but if you focus on the changes between 2.7.0 and 2.8.2, it might be there somewhere.

geky commented Mar 15, 2024

I stand corrected, I forgot about the tag xoring. These dumps make more sense than I thought.
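
For anyone decoding the dumps by hand, here's a quick sketch of roughly what readmdir.py is doing with tags (my summary of the v2 format: tags are 32-bit big-endian, and each tag is stored xored with the previous one, the first with 0xffffffff):

#include <stdint.h>
#include <stdio.h>

// decode the first tag of dump1 (9f e0 03 f7), whose "previous tag" is 0xffffffff
int main(void) {
    const uint8_t raw[4] = {0x9f, 0xe0, 0x03, 0xf7};
    uint32_t tag = ((uint32_t)raw[0] << 24) | ((uint32_t)raw[1] << 16)
                 | ((uint32_t)raw[2] << 8)  |  (uint32_t)raw[3];
    tag ^= 0xffffffff;

    printf("tag   0x%08x\n", (unsigned)tag);                    // 0x601ffc08
    printf("valid %u\n",     (unsigned)(tag >> 31));            // 0 -> valid
    printf("type  0x%03x\n", (unsigned)((tag >> 20) & 0x7ff));  // 0x601 -> hard tail
    printf("id    0x%03x\n", (unsigned)((tag >> 10) & 0x3ff));  // 0x3ff -> no id
    printf("len   %u\n",     (unsigned)(tag & 0x3ff));          // 8 bytes (the tail pair)
    return 0;
}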

Running through readmdir.py:

$ ./scripts/readmdir.py dump1 512 0 -a -T      
mdir {0x0} rev 513 -> {0x75, 0x76}                                           
off       tag       type            id  len                                  
00000008: 601ffc08  hardtail         .    8                                  
  00000008: 75 00 00 00 76 00 00 00                          u...v...        
00000014: 500ffc6c  ccrc 0x0         .  108                                  
  00000014: be b2 f8 b2 ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000024: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000034: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000044: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000054: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000064: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000074: ff ff ff ff ff ff ff ff ff ff ff ff              ............    
  ...

This is a well-formed tail commit, but the superblock entry is missing. And the file contents are missing?

$ ./scripts/readmdir.py dump2 512 0 -a -T      
mdir {0x0} rev 181 (corrupted!)                                              
off       tag       type            id  len                                  
00000008: 0c0000e5  name 0xc0        0  229                                  
  00000008: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000018: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000028: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000038: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000048: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000058: ff ff ff 10 1f e0 78 40 00 00 0a 63 6f 6e 66 39  ......x@...conf9
  00000068: 2e 6a 73 6f 6e 20 30 00 02 1a 00 00 00 d2 06 00  .json 0.........
  00000078: 00 6f d0 48 08 1f ff a8 ff ff ff ff ff ff ff ff  .o.H............
  00000088: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000098: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000a8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000b8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000c8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000d8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000e8: ff ff ff ff ff                                   .....           
  ...

This commit makes less sense. The type 0x0c0 doesn't exist; you would expect the superblock type 0x0ff here. Its contents are all 0xffs, which also doesn't make sense, and it seems to run over the following tags, though that's likely just because the size was corrupted.

It's curious that these commits are doing very different things...

geky commented Mar 15, 2024

it's 2.8.2 + assertion + lfs_probe from #947

I forgot I asked you to add asserts; that would offset the line numbers a bit, wouldn't it...

we have an early indication that it is caused by a change between 2.7.0 and 2.8.2

Looking at the changes, the biggest ones are making block_count optional and some cleanup to lfs_alloc. Both could be culprits. I wonder if a lfs->cfg->block_count -> lfs->block_count change was missed somewhere...

we managed to isolate what we think is a reliable repro for this, and going back to 2.7.0 code

A reproducible example would be quite a good find :)

geky commented Mar 19, 2024

From an out-of-band discussion: thanks to work from Nikola Kosturski and @rojer, a reproducible test case for the missing magic was found.

What's interesting is this turned out to not actually be an erroneous case as far as lfs_mount is concerned. lfs_mount is happy as long as a valid superblock entry exists anywhere in the superblock chain.

But this is an issue in terms of documentation and expected behavior. I've put up #959 to try to fix the state of things.

There is still an issue with the superblock entry disappearing entirely, which shouldn't happen and whose cause is unknown.

geky commented Mar 19, 2024

In the context of #959, the first dump makes perfect sense now. It's a superblock mdir with the superblock entry deleted (but a valid tail pointer pointing to the new superblock).

The second dump still does not make sense:

$ ./scripts/readmdir.py dump2 512 0 -a -T      
mdir {0x0} rev 181 (corrupted!)                                              
off       tag       type            id  len                                  
00000008: 0c0000e5  name 0xc0        0  229                                  
  00000008: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000018: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000028: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000038: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000048: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000058: ff ff ff 10 1f e0 78 40 00 00 0a 63 6f 6e 66 39  ......x@...conf9
  00000068: 2e 6a 73 6f 6e 20 30 00 02 1a 00 00 00 d2 06 00  .json 0.........
  00000078: 00 6f d0 48 08 1f ff a8 ff ff ff ff ff ff ff ff  .o.H............
  00000088: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000098: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000a8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000b8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000c8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000d8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000e8: ff ff ff ff ff                                   .....           
  ...

It looks like the magic tag is being corrupted somehow.

The fact that there are also what look like remnants of a file (conf9.json) in the superblock suggests that either superblock expansion hasn't occurred, or the error happens during superblock expansion.

Thinking out loud, I wonder if the block ends up not erased. This corruption could come from flash masking if multiple progs occur without an erase...
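
To illustrate the masking: programming NOR flash without an erase can only clear bits, so overlapping progs effectively AND together. A toy example:

#include <assert.h>
#include <stdint.h>

int main(void) {
    uint8_t cell = 0xff;   // erased state
    cell &= 0x9f;          // first prog
    cell &= 0xe0;          // second prog of the same byte, without an erase
    assert(cell == 0x80);  // neither value survives intact
    return 0;
}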

Also a random thought, but to make results show up quicker, you could drop block_cycles very low, say block_cycles=1. This is what we do in the local littlefs tests.

geky commented Mar 19, 2024

Double checking block erase during superblock expansion: It seems hard for littlefs to miss an erase in this case. It's the first thing we do in lfs_dir_compact after creating the new superblock:
https://github.com/littlefs-project/littlefs/blob/4dd30c1b8f1b416633fe63a338ede8934b6449a9/lfs.c#L1967

nobody19 commented Apr 10, 2024

What is your opinion: is #962 something similar or totally different?

What are you using to read out such nice hex dumps (are you also using some mcu with an external flash)?

geky commented Apr 12, 2024

What is your opinion is #962 something similar or totally different?

This looks like the same issue.

#959 is a bit of a red herring. The magic string can go missing, but the current driver works around this without erroring. It's more a spec/impl disagreement.

Ignoring #959, there is still an issue where superblock expansion can lead to filesystem corruption, seemingly due to a rogue tail pointer commit. Nikola has also found a stack trace that suggests lfs_dir_splittingcompact is suspicious.

geky commented Apr 12, 2024

What are you using to read out such nice hex dumps (are you also using some mcu with an external flash)?

That would be a question for @rojer, fortunately they've been able to send the extracted binaries.

There may be some external tool that can read SPI flash chips; otherwise you could write a loop to print the disk a byte at a time over your JTAG/debugger. Note that if you go over JTAG you should really have some form of error detection, as JTAG/SWD can be noisy. The easiest thing to do is send each block three times.
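
As a starting point, a sketch of such a dump loop built on the lfs_config callbacks (illustrative only; error handling omitted, and you'd still want the send-three-times pass on top):

#include <stdio.h>
#include "lfs.h"

// Print one block as hex over the console, to be captured via your
// JTAG/UART terminal. Reads in read_size chunks per the lfs_config read
// contract; assumes read_size <= 256 (it's 128 in this thread).
static void dump_block(const struct lfs_config *cfg, lfs_block_t block) {
    static uint8_t buf[256];
    for (lfs_off_t off = 0; off < cfg->block_size; off += cfg->read_size) {
        cfg->read(cfg, block, off, buf, cfg->read_size);
        for (lfs_size_t i = 0; i < cfg->read_size; i++) {
            if (i % 16 == 0) {
                printf("\n%08lx:",
                        (unsigned long)block*cfg->block_size + off + i);
            }
            printf(" %02x", buf[i]);
        }
    }
    printf("\n");
}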

If you also happen to have an SD card or network interface, going that route may be easier.

rojer commented Apr 12, 2024

re: dumps - some come from reading the device's flash, some were taken in a gdb session while examining a crash dump. after that, it's hexdump -C, nothing special.

@geky speaking of #959 - i'm somewhat concerned by its current open status, do you think there may be something wrong with it? or awaiting review? we are about to ship a production release with it.

geky commented Apr 12, 2024

@geky speaking of #959 - i'm somewhat concerned by its current open status, do you think there may be something wrong with it? or awaiting review? we are about to ship a production release with it.

I was hoping to resolve the remaining issue in the same release. Or at least better understand what is going on.

#959 is in a weird place where it's technically a bug, but it doesn't actually break anything. And since the current behavior is already out there and in use, it doesn't seem like the highest priority to bring in.

That and it's a behavior change which affects the disk, so we need to be confident it is correct. We can't really take it back after releasing.

geky commented Apr 29, 2024

Until/unless we find more info, I'll go ahead and merge #959
