
Sync WAL during db Close() #12556

Open · hx235 wants to merge 1 commit into main

Conversation

@hx235 (Contributor) commented Apr 17, 2024

Context/Summary:
The crash test below found that we don't sync the WAL upon DB close, which can lead to unsynced data loss. This PR syncs it.

./db_stress --threads=1 --disable_auto_compactions=1 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=2 --block_size=16384 --bloom_before_level=1 --bloom_bits=29.895303579352174 --bottommost_compression_type=disable --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=0 --compaction_readahead_size=0 --compaction_style=0 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --default_temperature=kUnknown --default_write_temperature=kUnknown --delete_obsolete_files_period_micros=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=1 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=none --fill_cache=0 --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=6 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=0 --key_len_percent_dist=1,30,69 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=0 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 
--max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=3 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=0 --recycle_log_file_num=0 --reopen=2 --report_bg_io_stats=1 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=0 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=3 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=100

 Verification failed for column family 0 key 000000000000B9D1000000000000012B000000000000017D (4756691): value_from_db: , value_from_expected: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, msg: Iterator verification: Value not found: NotFound: 
Verification failed :(

Test:

  • New UT
  • Same stress test command failed before this fix but passes after it
  • CI

@hx235 changed the title from "Disable reopen upon unsync data loss in crash test" to "Disable reopen with unsynced data loss in crash test" on Apr 17, 2024
@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 requested review from ajkr and ltamasi on Apr 17, 2024 at 21:04
@ajkr (Contributor) commented Apr 17, 2024

Can we sync the WAL during DB close?

@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 changed the title from "Disable reopen with unsynced data loss in crash test" to "Sync WAL during db Close()" on Apr 18, 2024
@hx235 (Contributor, Author) commented Apr 18, 2024

> Can we sync the WAL during DB close?

Good point - updated.

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -0,0 +1 @@
DB `Close()` will sync the WALs now
Contributor:

It seems to me that we're trading performance for consistency here: a clean closing of the database now ensures we're safe from host crashes but closing itself can become much more expensive, right? This feels like a big change; I'm wondering if we should make it optional (and disabled by default to prevent surprises for clients).

Contributor Author:

That's a good point - let me think a bit more about this. I also wonder how much Close() performance matters, since adding a DB option is "expensive" in terms of usability.

Contributor Author:

With more research, I realized our API already documents this, so the responsibility for WAL sync seems to have been placed on users @ajkr:

  // This will not fsync the WAL files.
  // If syncing is required, the caller must first call SyncWAL(), or Write()
  // using an empty write batch with WriteOptions.sync=true.
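
For reference, here is a minimal sketch (mine, not part of this PR; the helper name is made up) of what that documented contract asks of callers today: the application has to sync the WAL itself before Close(), either via SyncWAL() or an empty write batch with sync=true. Error handling is simplified.

  #include "rocksdb/db.h"

  // Sketch: explicitly sync the WAL before closing, which the quoted API
  // comment currently requires for durability of WAL-enabled writes.
  rocksdb::Status CloseWithSyncedWal(rocksdb::DB* db) {
    // Option 1: sync all unsynced WAL files directly.
    rocksdb::Status s = db->SyncWAL();
    // Option 2 (per the same comment): write an empty batch with sync=true:
    //   rocksdb::WriteBatch empty_batch;
    //   rocksdb::WriteOptions write_options;
    //   write_options.sync = true;
    //   s = db->Write(write_options, &empty_batch);
    if (s.ok()) {
      s = db->Close();
    }
    delete db;
    return s;
  }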

Contributor Author:

Let's not block our crash test on this discussion - #12567

Contributor:

> With more research, I realized our API already documents this, so the responsibility for WAL sync seems to have been placed on users @ajkr

I think good documentation cannot make up for poor API quality. In my opinion, a safe default for close would be to sync the data that we will have no future opportunities to sync.

Contributor:

I don't have a strong opinion on this but I think in general we don't explicitly sync the WAL files we're done with unless we absolutely have to. One thing this means is that if we're going to do it in this case, we'll probably have to call SyncClosedLogs to make sure any earlier unsynced WALs are also synced.

Contributor Author:

> One thing this means is that if we're going to do it in this case, we'll probably have to call SyncClosedLogs to make sure any earlier unsynced WALs are also synced

Let me look into that

Contributor:

There's a SyncWAL() function that is safe

Contributor Author:

Fixed with SyncWAL()

facebook-github-bot pushed a commit that referenced this pull request Apr 19, 2024
Summary:
**Context/Summary:**
See #12556 for the original problem.
The [fix](#12556) encountered some design [discussion](#12556 (comment)) that might take longer than I expected. Temporarily disable reopen with unsynced data loss for now just to stabilize our crash test, since the original problem has already been root-caused.

Pull Request resolved: #12567

Test Plan: CI

Reviewed By: ltamasi

Differential Revision: D56365503

Pulled By: hx235

fbshipit-source-id: 0755e82617c065f42be4c8429e86fa289b250855
@hx235 (Contributor, Author) commented Apr 24, 2024

Working on a UT

@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 (Contributor, Author) commented Apr 25, 2024

Looking into the ASAN test failure. The other one seems flaky.

@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

@@ -495,9 +495,6 @@ IOStatus WritableFileWriter::SyncWithoutFlush(const IOOptions& opts,
IOStatus s = SyncInternal(io_options, use_fsync);
TEST_SYNC_POINT("WritableFileWriter::SyncWithoutFlush:2");
if (!s.ok()) {
#ifndef NDEBUG
sync_without_flush_called_ = true;
Contributor Author (@hx235, Apr 25, 2024):

@anand1976 Can we remove sync_without_flush_called_? If we run DBCompactionTest.ManualCompactionFailsInReadOnlyMode with this PR and without this removal, we hit Assertion `sync_without_flush_called_' failed.

This is what happened:
mock_env->SetFilesystemActive(false); will call set_seen_error(). Later, during DB close, we sync the WAL and so enter AssertFalseAndGetStatusForPrevError() even though a WAL sync has not been called previously. So it seems to me that
"// This should only happen if SyncWithoutFlush() was called." is no longer true.

I also still don't understand why we want the behavior below in debug mode. Can we just "return IOStatus::IOError("Writer has previous error.");" instead of crashing?

  // SyncWithoutFlush() is the function that is allowed to be called
  // concurrently with other function. One of the concurrent call
  // could set seen_error_, and the other one would hit assertion
  // in debug mode.

@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

tools/ldb_cmd_test.cc (outdated comment, resolved)
@@ -626,6 +626,19 @@ Status DBImpl::CloseHelper() {
job_context.Clean();
mutex_.Lock();
}
if (!mutable_db_options_.avoid_flush_during_shutdown && !logs_.empty()) {
mutex_.Unlock();
Status s = SyncWAL();
Contributor Author:

Looking for some ideas to benchmark @ajkr @ltamasi: since DB Close() with a WAL sync will now be slower, how should we answer the question of "how much slower?"

I'm thinking of:
(1) Give a rough estimate of the per-sync time in the OS ... probably by googling for this number? And tell people to multiply it by their number of WAL files.

(2) Similar to (1), but microbenchmark SyncWAL() on one WAL to estimate the per-sync time (see the sketch below).

(3) ...?
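
As a rough illustration of option (2), a minimal sketch (mine, not from this PR; the helper name is made up) that times a single SyncWAL() call after an unsynced write on a DB with one live WAL; the number will of course depend heavily on the filesystem and device:

  #include <chrono>
  #include <iostream>
  #include "rocksdb/db.h"

  // Sketch: estimate the per-WAL sync cost by timing one SyncWAL() call
  // right after an unsynced write.
  void TimeOneWalSync(rocksdb::DB* db) {
    rocksdb::WriteOptions write_options;  // sync defaults to false
    rocksdb::Status put_status = db->Put(write_options, "bench_key", "bench_value");
    const auto start = std::chrono::steady_clock::now();
    rocksdb::Status sync_status = db->SyncWAL();
    const auto end = std::chrono::steady_clock::now();
    const auto micros =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << "Put: " << put_status.ToString()
              << ", SyncWAL: " << sync_status.ToString()
              << ", sync took " << micros << " us" << std::endl;
  }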

Contributor:

> how should we answer the question of "how much slower?"

IMO, "a lot slower", plus a reminder that the old behavior of not guaranteeing durability is available via avoid_flush_during_shutdown == true.

On the topic of durability violations, now avoid_flush_during_shutdown == true both makes WAL-enabled writes unsynced and drops WAL-disabled writes. I thought about the possible use cases (especially ones that mix WAL-enabled/disabled writes) and think the risk of bundling these violations into one option is low, but I could be wrong.
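
To make the escape hatch concrete, a small sketch (mine, not from this thread) of how a client that prefers the old fast shutdown would keep it after this change, using the existing option mentioned above:

  #include "rocksdb/options.h"

  rocksdb::Options MakeFastShutdownOptions() {
    rocksdb::Options options;
    // Existing DB option: skip flushing memtables on shutdown. With the
    // condition added in CloseHelper() in this PR, it also skips the WAL sync
    // on Close(), i.e. it keeps the pre-PR behavior (WAL-enabled writes may be
    // left unsynced and WAL-disabled writes may be dropped on clean shutdown).
    options.avoid_flush_during_shutdown = true;
    return options;
  }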

Contributor:

If you wanted to reduce the risk, you could introduce avoid_sync_during_shutdown. It would allow the following (hypothetical, AFAIK) use case to get the old behavior: (1) sometimes writes with the WAL disabled, (2) must not lose WAL-disabled writes on clean shutdown, and (3) does not want to pay the WAL sync overhead. It's kind of a confusing use case because (2) implies they are willing to pay the SST sync overhead, making (3) sound contradictory. But that is the case I thought of, and the struggle to make sense of it is why I thought the risk is low.

Contributor:

> I thought the risk is low.

... with respect to bundling the durability violations into one option. Regarding the whole thing about changing the default behavior, that feels more like medium.

Contributor Author (@hx235, May 13, 2024):

Okay - now I'm leaning more toward introducing avoid_sync_during_shutdown and setting it to true by default, so there is no change to existing behavior.

Contributor:

Ok, but we will need to reserve time to actually follow through with users to get them to migrate / eventually change the default. Otherwise it'll be confusing for years to come why we built a database without durability by default.

Contributor Author (@hx235, May 15, 2024):

Sounds good - will do so, particularly with avoid_flush_during_shutdown=false users.

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 requested a review from ltamasi on Apr 25, 2024 at 16:44
@ajkr (Contributor) left a comment:

LGTM!


@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.
