3.13.0~rc1: epic fail? Growing amount of missing chunks; "replication status: IO error" #746

Open
onlyjob opened this issue Aug 27, 2018 · 26 comments

@onlyjob
Member

onlyjob commented Aug 27, 2018

After upgrading chunkservers to 3.13.0~rc1 I'm afraid I'm not getting away without massive data loss: mfsmaster logs "replication status: IO error" all the time, and as replication progresses, the CGI's Chunks view reports a growing (!) number of missing chunks in ec and xor goals.

Bloody hell... :( :( :(

@njhurst

njhurst commented Aug 27, 2018

Oh no! :(:(:( Thanks for the warning and for being the sacrificial one. I'll wait for now. I hope you have a backup.

@onlyjob
Member Author

onlyjob commented Aug 27, 2018

Thank you for the kind words, @njhurst. No, there is no backup. Where would you back up 100+ TiB? Systems like LizardFS are meant to protect from disasters, not cause them... Unforgivable...

Now I have 100_000+ missing chunks... I'm guessing that 3.13 destroyed chunks that had not finished a goal change and had excess chunks (a mix of replicated and EC chunks).

Most of the damage occurred in the most precious data with goals ec(2,2) and ec(3,2) - those files should have been protected by 2 redundant chunks (RAID-6 level of safety).
All lost files were readable. No hardware failure was involved.

Replicated goals were not affected as far as I can tell... Maybe a safe upgrade path would be to change all EC goals to replicated ones, wait until no EC chunks are left, and only then upgrade...
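
For anyone who wants to try that route, here is a rough sketch of the idea (the mount point and the replicated goal name "3" are assumptions; use whatever is defined in your mfsgoals.cfg):

    # Sketch only: recursively switch everything to a replicated goal, then wait
    # for the rebalance to finish before upgrading. Paths and goal name are assumptions.
    lizardfs getgoal -r /mnt/lizardfs      # summary of the goals currently in use
    lizardfs setgoal -r 3 /mnt/lizardfs    # switch the whole tree to replicated goal "3"
    # Periodically confirm that no EC parts remain, e.g. by sampling fileinfo output:
    find /mnt/lizardfs -type f -print0 | xargs -0 lizardfs fileinfo | grep -c 'of ec('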

@Blackpaw

Blackpaw commented Aug 27, 2018 via email

@onlyjob
Member Author

onlyjob commented Aug 28, 2018

Yes, I've managed to retrieve some files from snapshots, though only some...

I've managed to recover some missing chunks from an older (recently replaced) HDD by connecting it to a 3.12 chunkserver (chunkserver 3.13 rapidly deletes valid EC chunks).

@onlyjob
Member Author

onlyjob commented Sep 3, 2018

Here is a quick summary of the devastating upgrade from 3.12.0: tebibytes of data destroyed; 100_000+ missing chunks; 80_000+ files damaged. Almost all data in EC goals is gone, due to either direct or collateral damage.

The pattern of damage is "not enough parts available":

        chunk 0: 00000D9CA2DCB53A_00000001 / (id:14966398432570 ver:1)
                copy 1: 192.168.0.130:9422:wks part 4/4 of ec(2,2)
                not enough parts available
        chunk 0: 00000D9CA2DCB6EA_00000001 / (id:14966398433002 ver:1)
                copy 1: 192.168.0.250:9622:stor part 1/4 of ec(2,2)
                not enough parts available
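
A rough way to enumerate the affected files is to walk the tree and grep lizardfs fileinfo output for that string (sketch only; the mount point is an assumption):

    # Sketch: list every file with at least one chunk reporting
    # "not enough parts available".
    find /mnt/lizardfs -type f -print0 |
    while IFS= read -r -d '' f; do
        if lizardfs fileinfo "$f" | grep -q 'not enough parts available'; then
            printf '%s\n' "$f"
        fi
    done > damaged-files.txt
    wc -l damaged-files.txt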

Before the upgrade I had fully replicated files with ec(2,2) goals, and most of them are gone despite there being no undergoal files prior to the upgrade.

I also have significant loss (at least 50%) in ec(3,2) chunks, some of which were fully replicated and some of which were in the process of changing goal from std:3 to ec(3,2), so there were enough replicas to avoid data loss.

This is how damaged ec(3,2) files look, according to lizardfs fileinfo:

        chunk 0: 0000000000954D73_00000001 / (id:9784691 ver:1)
                copy 1: 192.168.0.204:9422:pool part 4/5 of ec(3,2)
                copy 2: 192.168.0.250:9622:stor part 1/5 of ec(3,2)
                not enough parts available
        chunk 0: 0000000000954B9D_00000001 / (id:9784221 ver:1)
                copy 1: 192.168.0.2:9422:pool part 4/5 of ec(3,2)
                copy 2: 192.168.0.3:9622:pool part 3/5 of ec(3,2)
                not enough parts available
        chunk 1: 0000000000954BA5_00000001 / (id:9784229 ver:1)
                copy 1: 192.168.0.2:9422:pool part 3/5 of ec(3,2)
                copy 2: 192.168.0.204:9422:pool part 4/5 of ec(3,2)
                not enough parts available
        chunk 2: 0000000000954BAC_00000001 / (id:9784236 ver:1)
                copy 1: 192.168.0.130:9422:wks part 2/5 of ec(3,2)
                not enough parts available
        chunk 3: 0000000000954BB8_00000001 / (id:9784248 ver:1)
                copy 1: 192.168.0.2:9422:pool part 4/5 of ec(3,2)
                copy 2: 192.168.0.3:9622:pool part 3/5 of ec(3,2)
                copy 3: 192.168.0.4:9422:wks part 1/5 of ec(3,2)
                copy 4: 192.168.0.204:9422:pool
                copy 5: 192.168.0.250:9422:stor part 2/5 of ec(3,2)
                copy 6: 192.168.0.250:9522:stor part 5/5 of ec(3,2)
                copy 7: 192.168.0.250:9622:stor
        chunk 4: 0000000000954BBC_00000001 / (id:9784252 ver:1)
                copy 1: 192.168.0.2:9422:pool part 3/5 of ec(3,2)
                copy 2: 192.168.0.3:9622:pool
                copy 3: 192.168.0.4:9422:wks part 2/5 of ec(3,2)
                copy 4: 192.168.0.204:9422:pool part 4/5 of ec(3,2)
                copy 5: 192.168.0.250:9422:stor part 5/5 of ec(3,2)
                copy 6: 192.168.0.250:9522:stor part 1/5 of ec(3,2)
                copy 7: 192.168.0.250:9622:stor
        chunk 5: 0000000000954BC4_00000002 / (id:9784260 ver:2)
                copy 1: 192.168.0.2:9422:pool part 1/5 of ec(3,2)
                copy 2: 192.168.0.3:9622:pool
                copy 3: 192.168.0.130:9422:wks part 2/5 of ec(3,2)
                copy 4: 192.168.0.204:9422:pool part 4/5 of ec(3,2)
                copy 5: 192.168.0.250:9422:stor
                copy 6: 192.168.0.250:9522:stor part 3/5 of ec(3,2)
                copy 7: 192.168.0.250:9622:stor part 5/5 of ec(3,2)

Snapshots were useless for recovering data unless the snapshots had different goals. In the aftermath I'll probably make an std:1 goal to use exclusively on snapshots and pin it to a slow-ish chunkserver.
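
Something along these lines should do it, if I remember the goal/label configuration correctly (the goal id/name, the "slow" label and the paths are assumptions):

    # On the master, in mfsgoals.cfg: a single-copy goal kept on "slow"-labelled servers
    #   11 snap1 : slow
    # On the slow chunkserver, in mfschunkserver.cfg:
    #   LABEL = slow
    # Then assign the goal to the snapshot tree from a client:
    lizardfs setgoal -r snap1 /mnt/lizardfs/snapshots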

3.13.0~rc1 is very unsafe for EC chunks. It deletes valid copies causing massive replication of remaining data. Beware...

@eleaner

eleaner commented Sep 14, 2018

Hi guys. Now you've scared me shitless.
I just started my adventure with LizardFS and, obviously, with 3.13.0~rc1.
200k chunks and I don't see any major problems yet, except maybe #765.

What is the safest way forward? The whole point of using LizardFS was to use EC instead of btrfs/zfs parity.
Is there a way to downgrade LizardFS to a working version?

@onlyjob
Member Author

onlyjob commented Sep 14, 2018

If you started on v3.13.0~rc1 then you might be safe. Between 3.12 and 3.13 they made a very unsafe change to how EC chunks created by earlier LizardFS versions are converted: e76c386. Unless there are other issues affecting EC chunks, this particular one is about the upgrade to 3.13.0~rc1.

@eleaner

eleaner commented Sep 14, 2018

@onlyjob
Yes. I started in v3.13.0~rc1
So I hope I am safe

@creolis

creolis commented Aug 10, 2019

We updated to 3.13.0rc1 in order to get proper bandwidth limit handling - but we did not notice this ticket prior to updating.

We also use EC goals (EC6,3 and EC7,2) with 660280 chunks (404057 fs objects).
Now, 3 days into the update, we have started losing chunks in EC6,3 for no apparent reason;
we have lost 5 chunks (3 files) so far.

Looking into the issue I stumbled upon your ticket, and now I'm not sure how to proceed.
I'm not sure whether downgrading is an option, since I don't know whether the recalculation is still running and whether we have to expect lost chunks to keep adding up if we stay on 3.13.0rc1.

To be honest I would love to see at least some sign of life from skytech here ... at least a heartbeat showing that they acknowledge our findings and issues.

@onlyjob: how did you ultimately proceed?

@onlyjob
Member Author

onlyjob commented Aug 11, 2019

@creolis, I think a downgrade is the only option to save your data. It is especially important to avoid upgrading chunkservers (or to downgrade them ASAP). I don't know if anything else can be done. From memory, the ~rc chunkservers were aggressively removing valid EC chunks.

I've lost terabytes of data due to this bug and ultimately moved away from LizardFS.
IMHO the current governance of LizardFS cannot be trusted, and even if they cared to repair the trust it would take a lot of time, expertise and communication with the community.
Skytech is hopeless. It's been almost a year and they couldn't care less... :(

Knowing no better alternatives, I recommend using MooseFS instead of LizardFS.

@onlyjob onlyjob pinned this issue Aug 11, 2019
@creolis

creolis commented Aug 11, 2019

*sigh* I hate to accept this ... but data loss is the only thing I can't cope with, and I have a hard time trusting a FS that allows this, even if we're talking about an RC.

For me the real issue comes down to two things:

  1. I wonder why 3.13.0-rc1 is still online and the preferred link if you hit "Download" on lizardfs.com, without the slightest note or warning that there is a chance of losing data on existing EC goals.

  2. I really don't know why an absolute showstopper like this is not handled as a priority. If your users lose data, you get a reputation problem, even if you're working on an updated branch. Just dedicate enough time to 3.13.0-rc2 to prevent data loss. Management features that do not work? I'll survive that. Lost performance? I can deal with that. Unstable chunkservers? I'll watchdog them with a shell script and restart them if necessary (rough sketch below). Data loss? I can't deal with this.
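
The watchdog I have in mind is nothing fancy; roughly something like this (process and service names are assumptions for a Debian-style install, adjust to your init system):

    # Rough watchdog sketch: restart the chunkserver if its process disappears.
    while true; do
        if ! pgrep -x mfschunkserver >/dev/null; then
            logger "mfschunkserver not running, restarting"
            systemctl restart lizardfs-chunkserver
        fi
        sleep 60
    done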

@onlyjob
Member Author

onlyjob commented Aug 12, 2019

Regarding 1, I'd be very reluctant to install software from source on production infrastructure, let alone a pre-release. To some extent Debian users are protected from this regression, because I could not upload such a broken release knowing the severity of the problems. The official Debian package is not a panacea, but it is better/safer because at least the package maintainer has double-checked the release.

As for 2, it seems there is nobody left who cares. I think the developers have either left or been pulled away from the project. You can see from the commit history that the senior developers stopped committing a while ago, then there were no commits at all, then (after a while) a new (junior?) developer started to work on simple issues.
IMHO under the current governance there is no hope for this project: #805.

Indeed data loss is the worst, but there are other severe issues I've listed in the milestone: https://github.com/lizardfs/lizardfs/milestone/2. Notably, #662 causes a lot of grief in CI because various git commands randomly fail. I'm not sure what else might be affected by #780 and #672, but it feels very insecure. #754 leaves even less confidence, and #742/#743 show how much worse the quality of ~rc1 has become compared to previous releases.

IMHO the priorities of this project have drifted too far away from quality (towards features?), causing so much damage to trust that I'd be surprised if it ever recovers... :(

@zicklag

zicklag commented Sep 11, 2019

LizardFS is now under new management (see #805 (comment)), so hopefully LizardFS will start to get back on track again.

@BloodBlight

Is there any progress on this issue?

@creolis

creolis commented Jun 30, 2022

Nah.

Also, we ended up with another error that presented itself as "bit rot" in replication (non-EC!) goals and was not detected by the chunk check loop. At that point we could not stay with LizardFS and had to migrate away.

We disbanded our 20 node LizardFS array and switched to another storage solution (and no, it's not MooseFS, due to the lack of features that made LizardFS exactly what we needed in our - apparently weird - use case).

We've waited several years now for any sign of progress or change; this regression has been open since 2018 (4 years at the time of writing!), but since there seems to be no interest in fixing critical flaws that result in data loss, we had to close this chapter. I'm still reading here, hoping that the guys will eventually resurrect this project, but I don't think that is going to happen in the current situation. I had really high hopes for LizardFS ... it's a pity.

  • Daniel

@BloodBlight

Oh, that's no good!

I am using both Moose and Lizard right now, but we plan on migrating our Moose cluster to Ceph as we have the hardware to do it...

I still use Lizard at home, but am actively looking for SOMETHING that can do what it does and still let me migrate disks in and out...

May I ask what you switched to?

@Blackpaw

Blackpaw commented Jul 1, 2022

I still use Lizard at home, but am actively looking for SOMETHING that can do what it does and still let me migrate disks in and out...

Well, there's MooseFS, though there are no EC goals in the free version.

If you're only using a single server, there's bcachefs, if you don't mind beta filesystems and building a custom kernel :)

@BloodBlight

Ya, EC is basically a must as well. Seaweed looked interesting, but the response I got from the dev when specifically asking for a commercial license was less than inspiring...

I had NOT heard of bcachefs! And no, I don't mind a custom kernel at all.

Thanks. :)

@jkiebzak

jkiebzak commented Jul 1, 2022

@creolis what did you end up going with?

@onlyjob
Member Author

onlyjob commented Jul 5, 2022

We disbanded our 20 node LizardFS array and switched to another storage solution (and no, it's not MooseFS, due to the lack of features that made LizardFS exactly what we needed in our - apparently weird - use case).

What features would those be??

EC is overrated. It allows slightly more efficient utilisation of space at the expense of performance, higher administration and maintenance costs, more troubleshooting, more downtime and the risk of paying the ultimate price -- data loss.

Are those troubles worth the price of several (cheap/slow) high-capacity HDDs to accommodate non-EC replicas on MooseFS?
In our case the answer is a definitive no. After switching to MooseFS we have more reliable storage, fewer bugs, greater availability, less administration effort, better performance, lower access latency, better support, etc.

And not just that. MooseFS has an awesome feature that compensates for the lack of EC: Storage Classes, which allow pinning data to disks of different capacity/performance and designing tiers for efficient utilisation of hybrid storage with SSDs and rotational disks.
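
For example, something along these lines (from memory; the "S" label, class name and paths are assumptions, and mfsscadmin(1)/mfssetsclass(1) have the authoritative syntax):

    # On SSD-backed chunkservers, in mfschunkserver.cfg:
    #   LABELS = S
    # On a client mount: define a class that keeps 2 copies on S-labelled servers
    # and assign it to a hot directory.
    mfsscadmin /mnt/mfs create -K 2S ssd2
    mfssetsclass -r ssd2 /mnt/mfs/hot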

As for master/master replication, we've also found that it is not as useful as it seems. It can be re-implemented as a fail-over to the metadata backup logger, with some downtime. However, the point is to avoid an accidental/automatic switch between masters (for example during network switch maintenance), hence it is safer to move the master manually when required.

So my conclusion is that MooseFS is massively superior to LizardFS even without EC, because TCO and reliability matter. Less reliable storage tends to be more costly to operate.

P.S. We've tried and considered almost every open-source storage solution (e.g. Ceph, GFarm, GlusterFS, RozoFS, SeaweedFS, XtreemFS and a few others that I don't recall at the moment), but nothing comes even close to MooseFS.
LeoFS was also considered, but I had spent so much time on trialling and comparing everything else, and was already so happy with MooseFS, that I never got a chance to try LeoFS... If anyone has tried it, please let me know your impressions. Thanks.

@Blackpaw

Blackpaw commented Jul 5, 2022

EC is overrated. It allows slightly more efficient utilisation of space at the expense of performance, higher administration and maintenance costs, more troubleshooting, more downtime and the risk of paying the ultimate price -- data loss.

Gotta admit, when I migrated my media server from LizardFS to MooseFS I didn't worry about giving up on EC goals. Had a large section of media that was on ec(2,1), converted it to Goal 2, added an extra 5TB disk, problem solved.

@creolis

creolis commented Jul 5, 2022

What features would those be??

A flexible number of replicas to be configured (in one extreme, with an 80-node LizardFS cluster, 80 replicas!).
Weird use case, I know. But it was exactly (!!) what we needed.

@lgsilva3087
Contributor

As @onlyjob pointed out, we confirmed that this bug was introduced by the following commit:

e76c386

The issue only appears after upgrading from version 3.12 (the last officially released version) to version 3.13 (still in release-candidate status) while EC chunks are being rebalanced. We have reproduced the issue in our testing infrastructure and are working on fixing it.

The Safe Scenarios are:

  • Installation of version 3.12 having files with EC replication goals.
  • Installation of version 3.13 having files with EC replication goals. (Clean installation, not upgraded from v3.12 if you have files with EC goals!).
  • Upgrade from version 3.12 to 3.13 if no EC replication goals are used on the v3.12 cluster (a quick check for this is sketched below).
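
A quick way to check whether the last scenario applies to an existing 3.12 cluster (sketch; the config path and mount point are assumptions):

    # Which goal definitions on the master use erasure coding:
    grep -n 'ec(' /etc/mfs/mfsgoals.cfg
    # And which goal names are actually set on files (recursive summary):
    lizardfs getgoal -r /mnt/lizardfs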

@creolis

creolis commented Jul 14, 2022

Well, my issue was that while using replication goals (and again, we're talking about 80 replicas, 'cause we use it as a kind of sync), more and more replicas ended up with garbage inside the files, and LizardFS never noticed that this had happened, flagging all of them as available and okay. We could not find a clear reproducer; it "just happened" for individual files that were all around 2 GB in size .. but not for all of them.

Anyway - maybe some day lizardFS will release a new version - maybe even with the announced complete rewrite .. then I will happily take a look at it again, as it served me really well for years. I'm looking forward to it :)

@borkd

borkd commented Jul 16, 2022

@creolis - did you, at any point, have any non-ECC-memory systems connected as clients which process the data? Workstations, etc.?

@creolis

creolis commented Jul 20, 2022

@borkd Yes, we had. The clients (which are also chunkservers) whose VMware template cache partitions have been kept in sync using LizardFS replication targets are non-ECC. It worked flawlessly for so long ... so to be honest I never even bothered to think bit rot could be a problem due to non-ECC memory ... sheesh
