
Data loss -- Fsync parent directory on file creation and rename #35

Open
aganesan4 opened this issue Sep 20, 2016 · 2 comments

@aganesan4

I am running a three-node MongoDB cluster, using MongoDB 3.0.11 with RocksDB as the storage engine. When I insert a new item into the store, I set w=3, j=True. Running strace on mongod shows the following file-system operations on the node:

creat("data_dir/db/000004.sst")
append("data_dir/db/000004.sst")
fdatasync("data_dir/db/000004.sst")
creat("data_dir/db/MANIFEST-000005")
append("data_dir/db/MANIFEST-000005")
fdatasync("data_dir/db/MANIFEST-000005")
creat("data_dir/db/000005.dbtmp")
append("data_dir/db/000005.dbtmp")
fdatasync("data_dir/db/000005.dbtmp")
rename(source="data_dir/db/000005.dbtmp", dest="data_dir/db/CURRENT")
unlink("data_dir/db/MANIFEST-000001")
creat("data_dir/db/journal/000006.log")
unlink("data_dir/db/journal/000003.log")
fsync("data_dir/db")
trunc("data_dir/mongod.lock")
----client insert request----
append("data_dir/db/journal/000006.log")
----client ack----

When a new file is created or a file is renamed, the parent directory needs to be explicitly fsynced to persist the new directory entry. See https://www.quora.com/Linux/When-should-you-fsync-the-containing-directory-in-addition-to-the-file-itself and http://research.cs.wisc.edu/wind/Publications/alice-osdi14.pdf. If the node crashes before the directory entry is persisted, the new log file and any subsequent appends to it may be lost. If the crash happens on two or more nodes of a three-node cluster, one of those nodes could become the leader, making global data loss possible. We have reproduced this particular data loss issue using our testing framework.

Similarly, if the SST file or the MANIFEST file goes missing after a crash because the directory was not fsynced, the node fails to start again, which can leave the cluster unavailable for quorum writes. As a fix, it would be safe to fsync the parent directory on creat or rename of files.

@mdcallag
Contributor

We have discussed this before. I thought RocksDB was doing the right thing but I haven't looked at that code recently. I see places where it is likely done...

find . -type f -name \*\.cc -print | xargs grep -i sync | grep -i direct
./utilities/backupable/backupable_db.cc:      backup_private_directory->Fsync();
./utilities/backupable/backupable_db.cc:      private_directory_->Fsync();
./utilities/backupable/backupable_db.cc:      meta_directory_->Fsync();
./utilities/backupable/backupable_db.cc:      shared_directory_->Fsync();
./utilities/backupable/backupable_db.cc:      backup_directory_->Fsync();
./utilities/checkpoint/checkpoint.cc:      s = checkpoint_directory->Fsync();
./utilities/env_librados.cc:  // Fsync directory. Can be called concurrently from multiple threads.
./utilities/persistent_cache/persistent_cache_test.cc:  rocksdb::SyncPoint::GetInstance()->SetCallBack("NewRandomAccessFile:O_DIRECT",
./utilities/persistent_cache/persistent_cache_test.cc:  rocksdb::SyncPoint::GetInstance()->SetCallBack("NewWritableFile:O_DIRECT",
./utilities/persistent_cache/persistent_cache_test.cc:  rocksdb::SyncPoint::GetInstance()->SetCallBack("NewRandomAccessFile:O_DIRECT",
./db/db_impl.cc:      s = directories_.GetWalDir()->Fsync();
./db/db_impl.cc:    status = directories_.GetWalDir()->Fsync();
./db/db_impl.cc:          // We only sync WAL directory the first time WAL syncing is
./db/db_impl.cc:          status = directories_.GetWalDir()->Fsync();
./db/db_impl.cc:      s = impl->directories_.GetDbDir()->Fsync();
./db/filename.cc:                      Directory* directory_to_fsync) {
./db/filename.cc:    if (directory_to_fsync != nullptr) {
./db/filename.cc:      directory_to_fsync->Fsync();
./db/compaction_job.cc:  if (output_directory_ && !db_options_.disableDataSync) {
./db/compaction_job.cc:    output_directory_->Fsync();
./db/version_set.cc:                         db_options_->disableDataSync ? nullptr : db_directory);
./db/flush_job.cc:    if (!db_options_.disableDataSync && output_file_directory_ != nullptr) {
./db/flush_job.cc:      output_file_directory_->Fsync();

@igorcanadi
Contributor

Thanks for the bug report. We do fsync the parent directory on the first WAL write; however, we do it only if you pass in the sync flag with your write. MongoRocks 3.0 has a known issue where it doesn't pass the fsync flag down even when it was requested. The bug is fixed in MongoRocks 3.2; can you please try upgrading?

This is where the parent directory fsync happens in RocksDB: https://github.com/facebook/rocksdb/blob/master/db/db_impl.cc#L4771
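For anyone following along, the sync flag referred to above is `WriteOptions::sync` in the RocksDB C++ API. A minimal sketch of a write that opts into the durable path (the DB path here is just an example) might look like:

```cpp
// Sketch only: requires librocksdb. Shows the caller-side flag that
// triggers RocksDB's WAL fsync path, including the one-time parent
// directory fsync on the first synced WAL write.
#include <cassert>
#include <rocksdb/db.h>

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/rocksdb_sync_demo", &db);
  assert(s.ok());

  rocksdb::WriteOptions wo;
  wo.sync = true;  // without this, the write is not fsynced to the WAL
  s = db->Put(wo, "key", "value");
  assert(s.ok());

  delete db;
  return 0;
}
```

The MongoRocks 3.0 bug described above amounts to leaving `wo.sync` false even when the client asked for j=True.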

igorsol pushed a commit to igorsol/mongo-rocks that referenced this issue Jan 9, 2017
PSMDB-97 use static LZ4 library via shim_lz4 proxy