Superblock corruption #953

Open · rojer opened this issue Mar 6, 2024 · 28 comments
Labels: needs investigation (no idea what is wrong)

rojer commented Mar 6, 2024

Two devices in our test fleet experienced a weird corruption where the filesystem is not obviously corrupt, but the super block seems to have lost its superness - missing the littlefs tag, for example.
I have two flash dumps; here are the first two blocks from one of them: https://gist.github.com/rojer/6a9fe5f2947a12b660570534252474f8

I can share full dumps privately, please feel free to reach out.

I have examined console logs from the devices at the time the failure occurred and it doesn't seem to be associated with anything unusual - no power loss event, for example, though both happened shortly after a soft reboot. The devices remained responsive, though they seemingly lost the ability to read files (file not found errors). Upon rebooting, both were unable to mount their filesystems (unsurprisingly) and thus became bricks.

rojer commented Mar 6, 2024

make that three: another device has just succumbed to the same issue, whatever it is.
we are using 2.8.2 in disk format 2.0 mode.
the only somewhat relevant change in the environment has been file logging: we started logging to filesystem, which would increase write activity and fs utilization significantly.

rojer commented Mar 7, 2024

there is a slight suspicion that #901 may have been a catalyst for it, though it needs confirmation.

geky added the needs investigation label on Mar 7, 2024

geky commented Mar 7, 2024

Hmmm, this certainly looks like an unrelated mdir is being written to blocks 0x{0,1}. That at least rules out CTZ skip-list issues. Is the log_090-20240301-205905.log file stored in the root dir?

To be honest this sounds a lot like a driver issue, though I also know from other discussions you've been using littlefs for a while. Has anything changed hardware-wise? Another user had a similar issue that turned out to be caused by using the wrong chip, which led to silent address truncation, for example.

Is it possible that, because of the increased writing, this is the first time littlefs has wrapped around the storage due to wear-leveling? How large is your storage?

It would be interesting to see if the last blocks on the storage contain any interesting data.

there is a slight suspicion that #901 may have been a catalyst for it, though it needs confirmation.

I agree, it'd be interesting to see if any other superblocks (blocks with "littlefs" at offset=8 bytes) exist on the system. Well, if these devices started with blank storage.
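
A rough host-side sketch of such a scan, for reference (this is just an illustration, not an existing littlefs script; pass whatever block size your filesystem uses):

/* scan_magic.c - walk a raw flash dump and report blocks whose bytes at
 * offset 8 spell "littlefs", i.e. candidate superblocks. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <image> <block_size>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    long block_size = strtol(argv[2], NULL, 0);

    unsigned char magic[8];
    for (long block = 0; ; block++) {
        if (fseek(f, block*block_size + 8, SEEK_SET) != 0) break;
        if (fread(magic, 1, sizeof(magic), f) != sizeof(magic)) break;
        if (memcmp(magic, "littlefs", 8) == 0) {
            printf("candidate superblock at block %ld\n", block);
        }
    }

    fclose(f);
    return 0;
}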

It could also just be wrap-around related.

If you have devices under test still, it would be interesting to put an assert in lfs_alloc (here) to assert that it never returns block 0 or 1 (actually why don't we already have this?).

I can share full dumps privately, please feel free to reach out.

Feel free to send it to geky at geky.net, though no promises I'll be able to find anything. It's a bit hard to understand a filesystem when the superblock is gone, since you don't know what blocks are out-of-date.

Aside from the above points, the only other thing I can think of would be to use ./scripts/readmdir.py to try to see if any mdirs contain a pointer to blocks 0, 1, which shouldn't happen normally.

geky commented Mar 7, 2024

to assert that it never returns block 0 or 1 (actually why don't we already have this?)

Ah, correction, we rely on lfs_alloc returning blocks 0, 1 in lfs_format_. So this would only work if you don't format.

rojer commented Mar 7, 2024

Is the log_090-20240301-205905.log file stored in the root dir?

yes (everything is in the root dir in our case)

Has anything changed hardware-wise?

nope, we've been using the same block layer for ages

Is it possible because of the increased writing this is the first time littlefs has wrapped around storage due to wear-leveling?

that is certainly a possibility.

How large is your storage?

in this case - 448 KB (we have devices with varying fs sizes). about half of it is one big file (index.html.gz), and apart from that it's a dozen or so small files - config, some storage... and the recent addition is the log files, a set of at most 4, at most 4K each.

see if the last blocks on the storage contain any interesting data

no, the last 2 blocks appear to be data blocks (they contain what appear to be valid files - a log file and a config json file, respectively).

if any other superblocks (blocks with "littlefs" at offset=8 bytes) exist on the system

no, in fact string littlefs does not appear in the entire image

rojer commented Mar 7, 2024

i sent filesystem dumps by email, along with some logs

rojer commented Mar 7, 2024

assert that it never returns block 0 or 1
this would only work if you don't format

we only format externally, when preparing an image, so i can differentiate between usage by device vs mklfs (our image creation tool).
i added the assertion to lfs_alloc:

diff --git a/littlefs/lfs.c b/littlefs/lfs.c
index 105915b..d9ad91f 100644
--- a/littlefs/lfs.c
+++ b/littlefs/lfs.c
@@ -662,7 +662,11 @@ static int lfs_alloc(lfs_t *lfs, lfs_block_t *block) {
                     lfs->free.i += 1;
                     lfs->free.ack -= 1;
                 }
-
+// Debug LFS corruption
+// https://github.com/littlefs-project/littlefs/issues/953#issuecomment-1984394787
+#ifndef LFS_TOOLS_BUILD
+                LFS_ASSERT(!(*block == 0 || *block == 1));
+#endif
                 return 0;
             }
         }

also dropped LFS_NO_ASSERT (which we had defined to save space). will try over the next few days and report.

geky commented Mar 7, 2024

see if the last blocks on the storage contain any interesting data

no, last 2 blocks appear to be data blocks (contain what appears to be valid failes - a log file and a config json file respectively).

Well that rules out address truncation, though it might still be that some layer thinks the block device is larger than it actually is.

if any other superblocks (blocks with "littlefs" at offset=8 bytes) exist on the system

no, in fact string littlefs does not appear in the entire image

That makes an issue with #901 seem less likely. Adding more superblocks should make the "littlefs" string appear more often, not less.

i sent filesystem dumps by email, along with some logs

Hmm, I haven't received these yet. Not in spam either. GitHub is also sending notification updates through this domain just fine.

Maybe try sending an email with no attachments first? Gmail may just be being a butt...

rojer commented Mar 7, 2024

re-sent with a link instead

geky commented Mar 7, 2024

Two other possibilities come to mind:

  1. in this case - 448 KB (we have devices with varying fs sizes)

    It could be a lookahead buffer vs block count alignment issue (in littlefs). The block allocator/lookahead buffer was changed a bit recently. We have tests over this, but there can always be holes in tests.

    What lookahead_size are you using? I'm assuming block_count=112 for the 448 KiB (see the example config sketch after this list).

  2. It could be something internally is causing littlefs to delete id 0 in the superblock, which is what the string "littlefs" and all the superblock info is attached to.

    Though I'm not sure what could trigger this if it happens without powerloss...
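
For reference (and for point 1 above), a sketch of the config fields in question. These values are guesses derived from the 448 KiB figure and the 128-byte io size mentioned in this thread, not your actual settings:

#include "lfs.h"

// block device op pointers (.read/.prog/.erase/.sync) omitted for brevity
static const struct lfs_config cfg = {
    .read_size      = 128,
    .prog_size      = 128,
    .block_size     = 4096,  // 112 * 4 KiB = 448 KiB
    .block_count    = 112,
    .cache_size     = 128,
    .block_cycles   = 500,
    .lookahead_size = 16,    // must be a multiple of 8; the alignment question
                             // is how this size interacts with block_count=112
};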

rojer commented Mar 13, 2024

ok, so we did not get any crashes on the block alloc assertion, but we are still getting corruptions where the superblock ceases to be super. we added this to our write function:

  if (block < 2 && off == 0) {
    struct shos_str content = SHOS_MK_STR_N((const char *) buffer, size);
    struct shos_str lfs = SHOS_MK_STR("littlefs");
    assert(shos_strstr(content, lfs) != NULL);
  }

which basically is - crash if writing block 0 or 1 and the first 128 bytes (our io size) do not contain littlefs.
and we get quite a few of these, for example:

...
#2  0x403924c6 in __assert_func (file=<optimized out>, file@entry=0x3c170a7d "shos_vfs_lfs.c", line=line@entry=69, func=<optimized out>, func@entry=0x3c170a3c <__func__.14970> "shos_lfs_prog", expr=<optimized out>, expr@entry=0x3c170a58 "shos_strstr(content, lfs) != NULL") at assert.c:85
#3  0x4203ffe8 in shos_lfs_prog (c=0x3fca94e8, block=0, off=0, buffer=0x3fca9644, size=128) at shos_vfs_lfs.c:69
#4  shos_lfs_prog (c=0x3fca94e8, block=0, off=0, buffer=0x3fca9644, size=128) at shos_vfs_lfs.c:64
#5  0x42041e30 in lfs_bd_flush (lfs=lfs@entry=0x3fca9538, pcache=pcache@entry=0x3fca9548, rcache=rcache@entry=0x3fca9538, validate=validate@entry=false) at lfs.c:181
#6  0x420422b4 in lfs_bd_sync (validate=false, rcache=0x3fca9538, pcache=0x3fca9548, lfs=0x3fca9538) at lfs.c:215
#7  lfs_dir_commitcrc (lfs=lfs@entry=0x3fca9538, commit=commit@entry=0x3fc9a0f8 <shos_task_stack+5620>) at lfs.c:1725
#8  0x42042e12 in lfs_dir_compact (lfs=lfs@entry=0x3fca9538, dir=dir@entry=0x3fc9a210 <shos_task_stack+5900>, attrs=attrs@entry=0x3fc9a2b0 <shos_task_stack+6060>, attrcount=attrcount@entry=2, source=source@entry=0x3fc9a210 <shos_task_stack+5900>, begin=begin@entry=0, end=end@entry=0) at lfs.c:2026
#9  0x42042ffc in lfs_dir_splittingcompact (begin=0, end=0, source=0x3fc9a210 <shos_task_stack+5900>, attrcount=2, attrs=0x3fc9a2b0 <shos_task_stack+6060>, dir=0x3fc9a210 <shos_task_stack+5900>, lfs=0x3fca9538) at lfs.c:2180
#10 lfs_dir_relocatingcommit (lfs=lfs@entry=0x3fca9538, dir=dir@entry=0x3fc9a210 <shos_task_stack+5900>, pair=pair@entry=0x3fcc2ad8, attrs=attrs@entry=0x3fc9a2b0 <shos_task_stack+6060>, attrcount=attrcount@entry=2, pdir=pdir@entry=0x3fc9a230 <shos_task_stack+5932>) at lfs.c:2298
#11 0x42043438 in lfs_dir_orphaningcommit (lfs=lfs@entry=0x3fca9538, dir=0x3fcc2ad8, attrs=0x3fc9a2b0 <shos_task_stack+6060>, attrcount=2) at lfs.c:2379
#12 0x420438ee in lfs_dir_commit (lfs=0x3fca9538, dir=<optimized out>, attrs=<optimized out>, attrcount=<optimized out>) at lfs.c:2551
#13 0x42043af2 in lfs_file_rawsync (file=0x3fcc2ad0, lfs=0x3fca9538) at lfs.c:3409
#14 lfs_file_rawsync (lfs=0x3fca9538, file=0x3fcc2ad0) at lfs.c:3373
#15 0x42043b2c in lfs_file_rawclose (lfs=lfs@entry=0x3fca9538, file=file@entry=0x3fcc2ad0) at lfs.c:3190
#16 0x420445ea in lfs_file_close (lfs=lfs@entry=0x3fca9538, file=file@entry=0x3fcc2ad0) at lfs.c:6067
#17 0x4204030e in shos_vfs_fs_lfs_close (fs=<optimized out>, fd=<optimized out>) at shos_vfs_lfs.c:364
...

contents of the buffer passed to shos_lfs_prog (frame 3):

00000000  01 02 00 00 9f e0 03 f7  75 00 00 00 76 00 00 00  |........u...v...|
00000010  30 10 00 64 be b2 f8 b2  ff ff ff ff ff ff ff ff  |0..d............|
00000020  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000030  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000040  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000050  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000060  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000070  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|

another one:

...
#2  0x40097d2c in __assert_func (file=<optimized out>, line=<optimized out>, func=<optimized out>, expr=<optimized out>) at assert.c:85
#3  0x40105c2b in shos_lfs_prog (c=0x3ffcd248, block=0, off=0, buffer=0x3ffcd3a4, size=128) at shos_vfs_lfs.c:64
#4  shos_lfs_prog (c=0x3ffcd248, block=0, off=0, buffer=0x3ffcd3a4, size=128) at shos_vfs_lfs.c:64
#5  0x401073e8 in lfs_bd_flush (lfs=0x3ffcd298, pcache=0x3ffcd2a8, rcache=0x3ffcd298, validate=false) at lfs.c:181
#6  0x401074dc in lfs_bd_prog (size=0, buffer=<optimized out>, off=<optimized out>, block=<optimized out>, validate=<optimized out>, rcache=<optimized out>, pcache=<optimized out>, lfs=<optimized out>) at lfs.c:251
#7  lfs_bd_prog (lfs=0x3ffcd298, pcache=0x3ffcd2a8, rcache=0x3ffcd298, validate=false, block=0, off=<optimized out>, buffer=<optimized out>, size=1) at lfs.c:227
#8  0x40107525 in lfs_dir_commitprog (lfs=0x3ffcd298, commit=0x3ffc4000 <shos_task_stack+7188>, buffer=0x3ffc3ea4 <shos_task_stack+6840>, size=1) at lfs.c:1559
#9  0x401075d4 in lfs_dir_commitattr (lfs=0x3ffcd298, commit=0x3ffc4000 <shos_task_stack+7188>, tag=<optimized out>, buffer=0x3ffc3f88 <shos_task_stack+7068>) at lfs.c:1608
#10 0x401075f1 in lfs_dir_commit_commit (p=0x3ffc402c <shos_task_stack+7232>, tag=2348810469, buffer=0x3ffc3f88 <shos_task_stack+7068>) at lfs.c:1887
#11 0x40106aa5 in lfs_dir_traverse (lfs=0x3ffcd298, dir=0x3ffc4130 <shos_task_stack+7492>, off=<optimized out>, ptag=2348810469, attrs=0x3ffc41e0 <shos_task_stack+7668>, attrcount=2, tmask=tmask@entry=1074789376, begin=begin@entry=0, end=end@entry=23, diff=diff@entry=0, 
    cb=cb@entry=0x401075e0 <lfs_dir_commit_commit>, data=data@entry=0x3ffc402c <shos_task_stack+7232>, ttag=0) at lfs.c:1039
#12 0x40108110 in lfs_dir_compact (lfs=0x3ffcd298, dir=0x3ffc4130 <shos_task_stack+7492>, attrs=0x3ffc41e0 <shos_task_stack+7668>, attrcount=2, source=0x3ffc4130 <shos_task_stack+7492>, begin=<optimized out>, end=end@entry=23) at lfs.c:1969
#13 0x4010831e in lfs_dir_splittingcompact (begin=0, end=23, source=0x3ffc4130 <shos_task_stack+7492>, attrcount=2, attrs=0x3ffc41e0 <shos_task_stack+7668>, dir=0x3ffc4130 <shos_task_stack+7492>, lfs=0x3ffcd298) at lfs.c:2180
#14 lfs_dir_relocatingcommit (lfs=0x3ffcd298, dir=0x3ffc4130 <shos_task_stack+7492>, pair=0x3ffd6d90, attrs=0x3ffc41e0 <shos_task_stack+7668>, attrcount=2, pdir=<optimized out>) at lfs.c:2298
#15 0x4010873a in lfs_dir_orphaningcommit (lfs=0x3ffcd298, dir=0x3ffd6d90, attrs=0x3ffc41e0 <shos_task_stack+7668>, attrcount=2) at lfs.c:2379
#16 0x40108bac in lfs_dir_commit (lfs=0x3ffcd298, dir=0x3ffd6d90, attrs=0x3ffc41e0 <shos_task_stack+7668>, attrcount=2) at lfs.c:2551
#17 0x40108d58 in lfs_file_rawsync (file=0x3ffd6d88, lfs=0x3ffcd298) at lfs.c:3409
#18 lfs_file_rawsync (lfs=0x3ffcd298, file=0x3ffd6d88) at lfs.c:3373
#19 0x4010972f in lfs_file_sync (lfs=0x3ffcd298, file=0x3ffd6d88) at lfs.c:6083
#20 0x40105e81 in shos_vfs_fs_lfs_fsync (fs=<optimized out>, fd=0) at shos_vfs_lfs.c:380
#21 0x40104bb1 in shos_vfs_fsync (vfd=256) at shos_vfs.c:433
#22 0x400e84ad in esp_vfs_fsync (fd=<optimized out>) at vfs.c:565
...
...

buffer:

00000000  b5 00 00 00 f3 ff ff 1a  ff ff ff ff ff ff ff ff  |................|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000020  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000030  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000040  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000050  ff ff ff ff ff ff ff ff  ff ff ff 10 1f e0 78 40  |..............x@|
00000060  00 00 0a 63 6f 6e 66 39  2e 6a 73 6f 6e 20 30 00  |...conf9.json 0.|
00000070  02 1a 00 00 00 d2 06 00  00 6f d0 48 08 1f ff a8  |.........o.H....|

does this help? we have memory dumps corresponding to these but don't quite know what to look for.

geky commented Mar 15, 2024

which basically is - crash if writing block 0 or 1 and the first 128 bytes (our io size) do not contain littlefs.

Smart! It is only asserting when the cache is flushed, which is a bit removed from when the new metadata is "staged" (written to pcache), but this is still useful info.

You could add this assert to littlefs's internal lfs_bd_read function, but I'm not sure it would really give us more than what you've found.

An aside: What exact commit hash are you on? The line numbers don't seem to quite line up with 2.8.2.

One thing to note is this stack trace is going through a normal compaction (lfs_dir_compact in lfs_dir_splittingcompact), NOT the expanding superblock compaction route. This is another point against a #901 bug, though it's still not impossible.


I'm writing as I'm looking into this, sorry if my thoughts are out of order or lead nowhere.

9f e0 03 f7 

This is a very strange tag. The super-type 0x100 (0x9fe & 0x700) should be internal-only and never written to disk. But we never use the sub-type 0xfe with LFS_TYPE_FROM. Maybe this is caused by an underflow somewhere? Not sure how this tag is possible but it looks too much like a tag to ignore.

00000000  01 02 00 00 9f e0 03 f7  75 00 00 00 76 00 00 00  |........u...v...|
00000010  30 10 00 64 be b2 f8 b2  ff ff ff ff ff ff ff ff  |0..d............|

This looks like a checksum tag (crc + padding) but... isn't a checksum tag? (0x5xx -> checksum, 0x3xx -> userattr?). Do your files have userattrs that end in a bunch of 0xffs?

Looking at the second dump:

00000000  b5 00 00 00 f3 ff ff 1a  ff ff ff ff ff ff ff ff  |................|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000020  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000030  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000040  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000050  ff ff ff ff ff ff ff ff  ff ff ff 10 1f e0 78 40  |..............x@|
00000060  00 00 0a 63 6f 6e 66 39  2e 6a 73 6f 6e 20 30 00  |...conf9.json 0.|
00000070  02 1a 00 00 00 d2 06 00  00 6f d0 48 08 1f ff a8  |.........o.H....|

This commit is just a mess. The tag f3 ff ff 1a has an immense size that is probably an underflow/uninitialized/etc. The 00 00 0a 63 tag is at least a proper looking name tag, but the preceding bytes don't make sense as far as I can tell.


On second thought, moving this assert into littlefs's internal lfs_bd_read function may be useful to at least rule out driver/hardware related issues.


Something else I'm concerned about: conf9.json appears to reside in id=0. In blocks 0,1 this should normally be the superblock entry.

In lfs_file_t, .m.pair={0,1} and .id=0 should NEVER happen. It may be fruitful to add an assert on this in lfs_file_sync.
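
Something along these lines (a sketch only, assuming it lives in lfs.c so lfs_file_t and LFS_ASSERT are in scope; exact placement in lfs_file_rawsync is up for grabs):

// An open regular file should never occupy the superblock's slot, i.e. the
// mdir pair {0,1} with id 0. Call this from lfs_file_rawsync before committing.
static void lfs_file_assert_not_superblock_slot(const lfs_file_t *file) {
    int in_pair01 =
            (file->m.pair[0] == 0 || file->m.pair[0] == 1) &&
            (file->m.pair[1] == 0 || file->m.pair[1] == 1);
    LFS_ASSERT(!(in_pair01 && file->id == 0));
}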

I'm not really sure what could cause this. Do you ever move the lfs_file_t struct? Does the lfs_file_t's .type field make sense? (littlefs always sets this to LFS_TYPE_REG, 0x001, otherwise maybe memory corruption?)

rojer commented Mar 15, 2024

What exact commit hash are you on? The line numbers don't seem to quite line up with 2.8.2.

it's 2.8.2 + assertion + lfs_probe from #947
Here it is, for reference: https://gist.github.com/rojer/cc5fd75cd65a7c64df85c73287616bcb

Do you ever move the lfs_file_t struct?

no, it is allocated on open and never moved while the file is open.

Does the lfs_file_t's .type field make sense?

i think it does. at least doesn't look obviously stomped over:

(gdb) frame 16
#16 0x420445ea in lfs_file_close (lfs=lfs@entry=0x3fca9538, file=file@entry=0x3fcc2ad0) at lfs.c:6067
6067    lfs.c: No such file or directory.
(gdb) print *file
$1 = {next = 0x0, id = 23, type = 1 '\001', m = {pair = {1, 0}, rev = 512, off = 4096, etag = 1343224952, count = 26, erased = false, split = false, tail = {4294967295, 4294967295}}, ctz = {head = 116, size = 136}, flags = 66818, pos = 136, block = 116, off = 136, cache = {block = 4294967295, off = 128, size = 8, 
    buffer = 0x3fcc2c64 '\377' <repeats 128 times>, "T"}, cfg = 0x3c170d38 <defaults>}

(gdb) print *lfs
$1 = {rcache = {block = 4294967295, off = 128, size = 128, buffer = 0x3fca95c0 '\377' <repeats 128 times>, "\200"}, pcache = {block = 0, off = 0, size = 24, buffer = 0x3fca9644 "\001\002"}, root = {117, 118}, mlist = 0x3fcc2ad0, seed = 596460605, gstate = {tag = 0, pair = {0, 0}}, gdisk = {tag = 0, pair = {0, 0}}, 
  gdelta = {tag = 0, pair = {0, 0}}, free = {off = 116, size = 256, i = 3, ack = 254, buffer = 0x3fca96c8}, cfg = 0x3fca94e8, block_count = 256, name_max = 255, file_max = 2147483647, attr_max = 1022}

rojer commented Mar 15, 2024

and this is for the second one:

(gdb) frame 19
#19 0x4010972f in lfs_file_sync (lfs=0x3ffcd298, file=0x3ffd6d88) at lfs.c:6083
6083    lfs.c: No such file or directory.
(gdb) print *file
$1 = {next = 0x0, id = 14, type = 1 '\001', m = {pair = {1, 0}, rev = 180, off = 4096, etag = 1343224922, count = 23, erased = true, split = false, tail = {4294967295, 4294967295}}, ctz = {head = 25, size = 3354}, flags = 67842, pos = 3354, block = 25, off = 3354, cache = {block = 4294967295, off = 3328, size = 26, 
    buffer = 0x3ffd7000 '\377' <repeats 128 times>, " "}, cfg = 0x3f410b74 <defaults>}
(gdb) print *lfs
$1 = {rcache = {block = 1, off = 3968, size = 128, buffer = 0x3ffcd320 "\020\037\340x@"}, pcache = {block = 0, off = 0, size = 128, buffer = 0x3ffcd3a4 "\265"}, root = {1, 0}, mlist = 0x3ffd6d88, seed = 2227447927, gstate = {tag = 0, pair = {0, 0}}, gdisk = {tag = 0, pair = {0, 0}}, gdelta = {tag = 0, pair = {0, 
      0}}, free = {off = 106, size = 112, i = 33, ack = 112, buffer = 0x3ffcd428}, cfg = 0x3ffcd248, block_count = 112, name_max = 255, file_max = 2147483647, attr_max = 1022}

geky commented Mar 15, 2024

Hmm, it doesn't look corrupted. So at least arbitrary memory corruption is unlikely.

It is weird that the file's id is 14/23, but the tag's id ends up 0.

This may be a stretch, but have you checked for stack overflow? The commit sequence of code gets quite deep. A change in stack depth could also be explained by a littlefs version change.

Will dig more...

rojer commented Mar 15, 2024

no, we're ok wrt stack: 6K and 4K free out of 8K total. additionally, we have a stack canary watchpoint at the end of the stack (CONFIG_FREERTOS_WATCHPOINT_END_OF_STACK=y if you are familiar with idf) - it causes an exception in case of stack overflow.

rojer commented Mar 15, 2024

we have an early indication that it is caused by a change between 2.7.0 and 2.8.2 - we managed to isolate what we think is a reliable repro for this, and going back to 2.7.0 code (with the same fs state) makes it go away (at least the lfs_prog assertion).
this needs more work to fully confirm (the week ended, leaving us with a bit of a cliffhanger there :) ) but if you focus on the changes between 2.7.0 and 2.8.2, it might be there somewhere.

geky commented Mar 15, 2024

I stand corrected, I forgot about the tag xoring. These dumps make more sense than I thought.
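
For anyone decoding the dumps by hand, here's a quick sketch of roughly what readmdir.py is doing with tags (my summary of the v2 format: tags are 32-bit big-endian, and each tag is stored xored with the previous one, the first with 0xffffffff):

#include <stdint.h>
#include <stdio.h>

// decode the first tag of dump1 (9f e0 03 f7), whose "previous tag" is 0xffffffff
int main(void) {
    const uint8_t raw[4] = {0x9f, 0xe0, 0x03, 0xf7};
    uint32_t tag = ((uint32_t)raw[0] << 24) | ((uint32_t)raw[1] << 16)
                 | ((uint32_t)raw[2] << 8)  |  (uint32_t)raw[3];
    tag ^= 0xffffffff;

    printf("tag   0x%08x\n", (unsigned)tag);                    // 0x601ffc08
    printf("valid %u\n",     (unsigned)(tag >> 31));            // 0 -> valid
    printf("type  0x%03x\n", (unsigned)((tag >> 20) & 0x7ff));  // 0x601 -> hard tail
    printf("id    0x%03x\n", (unsigned)((tag >> 10) & 0x3ff));  // 0x3ff -> no id
    printf("len   %u\n",     (unsigned)(tag & 0x3ff));          // 8 bytes (the tail pair)
    return 0;
}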

Running through readmdir.py:

$ ./scripts/readmdir.py dump1 512 0 -a -T      
mdir {0x0} rev 513 -> {0x75, 0x76}                                           
off       tag       type            id  len                                  
00000008: 601ffc08  hardtail         .    8                                  
  00000008: 75 00 00 00 76 00 00 00                          u...v...        
00000014: 500ffc6c  ccrc 0x0         .  108                                  
  00000014: be b2 f8 b2 ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000024: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000034: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000044: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000054: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000064: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000074: ff ff ff ff ff ff ff ff ff ff ff ff              ............    
  ...

This is a well-formed tail commit, but the superblock entry is missing. And the file contents are missing?

$ ./scripts/readmdir.py dump2 512 0 -a -T      
mdir {0x0} rev 181 (corrupted!)                                              
off       tag       type            id  len                                  
00000008: 0c0000e5  name 0xc0        0  229                                  
  00000008: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000018: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000028: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000038: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000048: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000058: ff ff ff 10 1f e0 78 40 00 00 0a 63 6f 6e 66 39  ......x@...conf9
  00000068: 2e 6a 73 6f 6e 20 30 00 02 1a 00 00 00 d2 06 00  .json 0.........
  00000078: 00 6f d0 48 08 1f ff a8 ff ff ff ff ff ff ff ff  .o.H............
  00000088: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000098: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000a8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000b8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000c8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000d8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000e8: ff ff ff ff ff                                   .....           
  ...

This commit makes less sense. The type 0x0c0 doesn't exist; you would expect the superblock type 0x0ff here. Its contents are all 0xffs, which also doesn't make sense, and it seems to run over the following tags, though that's likely just because the size was corrupted.

It's curious that these commits are doing very different things...

geky commented Mar 15, 2024

it's 2.8.2 + assertion + lfs_probe from #947

I forgot I asked you to add asserts; that would offset the line numbers a bit, wouldn't it...

we have an early indication that it is caused by a change between 2.7.0 and 2.8.2

Looking at the changes, the biggest ones are making block_count optional and some cleanup to lfs_alloc. Both could be culprits. I wonder if a lfs->cfg->block_count -> lfs->block_count change was missed somewhere...

we managed to isolate what we think is a reliable repro for this, and going back to 2.7.0 code

A reproducible example would be quite a good find :)

geky commented Mar 19, 2024

From an out-of-band discussion: thanks to work from Nikola Kosturski and @rojer, a reproducible test case for the missing magic was found.

What's interesting is this turned out to not actually be an erroneous case as far as lfs_mount is concerned. lfs_mount is happy as long as a valid superblock entry exists anywhere in the superblock chain.

But this is an issue in terms of documentation and expected behavior. I've put up #959 to try to fix the state of things.

There is still an issue with the superblock entry disappearing entirely, which shouldn't happen and whose cause is unknown.

geky commented Mar 19, 2024

In the context of #959, the first dump makes perfect sense now. It's a superblock mdir with the superblock entry deleted (but a valid tail pointer pointing to the new superblock).

The second dump still does not make sense:

$ ./scripts/readmdir.py dump2 512 0 -a -T      
mdir {0x0} rev 181 (corrupted!)                                              
off       tag       type            id  len                                  
00000008: 0c0000e5  name 0xc0        0  229                                  
  00000008: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000018: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000028: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000038: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000048: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000058: ff ff ff 10 1f e0 78 40 00 00 0a 63 6f 6e 66 39  ......x@...conf9
  00000068: 2e 6a 73 6f 6e 20 30 00 02 1a 00 00 00 d2 06 00  .json 0.........
  00000078: 00 6f d0 48 08 1f ff a8 ff ff ff ff ff ff ff ff  .o.H............
  00000088: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  00000098: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000a8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000b8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000c8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000d8: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  000000e8: ff ff ff ff ff                                   .....           
  ...

It looks like the magic tag is being corrupted somehow.

The fact that there are also what look like remnants of a file (conf9.json) in the superblock suggests that either superblock expansion hasn't occurred, or the error happens during superblock expansion.

Thinking out loud, I wonder if the block ends up not erased. This corruption could come from flash masking if multiple progs occur without an erase...
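
To illustrate the masking: programming NOR flash without an erase can only clear bits, so overlapping progs effectively AND together. A toy example:

#include <assert.h>
#include <stdint.h>

int main(void) {
    uint8_t cell = 0xff;   // erased state
    cell &= 0x9f;          // first prog
    cell &= 0xe0;          // second prog of the same byte, without an erase
    assert(cell == 0x80);  // neither value survives intact
    return 0;
}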

Also a random thought, but to make results show up quicker, you could drop block_cycles very low, say block_cycles=1. This is what we do in the local littlefs tests.

geky commented Mar 19, 2024

Double checking block erase during superblock expansion: It seems hard for littlefs to miss an erase in this case. It's the first thing we do in lfs_dir_compact after creating the new superblock:
https://github.com/littlefs-project/littlefs/blob/4dd30c1b8f1b416633fe63a338ede8934b6449a9/lfs.c#L1967

nobody19 commented Apr 10, 2024

What is your opinion: is #962 something similar or totally different?

What are you using to read out such nice hex dumps (are you also using some mcu with an external flash)?

geky commented Apr 12, 2024

What is your opinion is #962 something similar or totally different?

This looks like the same issue.

#959 is a bit of a red herring. The magic string can go missing, but the current driver works around this without erroring. It's more a spec/impl disagreement.

Ignoring #959, there is still an issue where superblock expansion can lead to filesystem corruption, seemingly due to a rogue tail pointer commit. Nikola has also found a stack trace that suggests lfs_dir_splittingcompact is suspicious.

geky commented Apr 12, 2024

What are you using to read out such nice hex dumps (are you also using some mcu with an external flash)?

That would be a question for @rojer, fortunately they've been able to send the extracted binaries.

There may be some external tool that can read SPI flash chips; otherwise you could write a loop to print the disk a byte at a time over your JTAG/debugger. Note that if you go over JTAG you should really have some form of error detection, as JTAG/SWD can be noisy. The easiest thing to do is send each block three times.
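
As a starting point, a sketch of such a dump loop built on the lfs_config callbacks (illustrative only; error handling omitted, and you'd still want the send-three-times pass on top):

#include <stdio.h>
#include "lfs.h"

// Print one block as hex over the console, to be captured via your
// JTAG/UART terminal. Reads in read_size chunks per the lfs_config read
// contract; assumes read_size <= 256 (it's 128 in this thread).
static void dump_block(const struct lfs_config *cfg, lfs_block_t block) {
    static uint8_t buf[256];
    for (lfs_off_t off = 0; off < cfg->block_size; off += cfg->read_size) {
        cfg->read(cfg, block, off, buf, cfg->read_size);
        for (lfs_size_t i = 0; i < cfg->read_size; i++) {
            if (i % 16 == 0) {
                printf("\n%08lx:",
                        (unsigned long)block*cfg->block_size + off + i);
            }
            printf(" %02x", buf[i]);
        }
    }
    printf("\n");
}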

If you also happen to have an SD card or network interface, going that route may be easier.

rojer commented Apr 12, 2024

re: dumps - some come from reading the device's flash, some were taken in a gdb session while examining a crash dump. after that, it's hexdump -C, nothing special.

@geky speaking of #959 - i'm somewhat concerned by its current open status, do you think there may be something wrong with it? or awaiting review? we are about to ship a production release with it.

geky commented Apr 12, 2024

@geky speaking of #959 - i'm somewhat concerned by its current open status, do you think there may be something wrong with it? or awaiting review? we are about to ship a production release with it.

I was hoping to resolve the remaining issue in the same release. Or at least better understand what is going on.

#959 is in a weird place where it's technically a bug, but it doesn't actually break anything. And since the current behavior is already out there and in use, it doesn't seem like the highest priority to bring in.

That and it's a behavior change which affects the disk, so we need to be confident it is correct. We can't really take it back after releasing.

geky commented Apr 29, 2024

Until/unless we find more info, I'll go ahead and merge #959
