
Sync WAL during db Close() #12556

Open · hx235 wants to merge 1 commit into main

Conversation

@hx235 (Contributor) commented Apr 17, 2024

Context/Summary:
The crash test below found that we don't sync the WAL upon DB close, which can lead to unsynced data loss. This PR syncs it.

./db_stress --threads=1 --disable_auto_compactions=1 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=2 --block_size=16384 --bloom_before_level=1 --bloom_bits=29.895303579352174 --bottommost_compression_type=disable --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=0 --compaction_readahead_size=0 --compaction_style=0 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --default_temperature=kUnknown --default_write_temperature=kUnknown --delete_obsolete_files_period_micros=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=1 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=none --fill_cache=0 --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=6 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=0 --key_len_percent_dist=1,30,69 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=0 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 
--max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=3 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=0 --recycle_log_file_num=0 --reopen=2 --report_bg_io_stats=1 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=0 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=3 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=100

 Verification failed for column family 0 key 000000000000B9D1000000000000012B000000000000017D (4756691): value_from_db: , value_from_expected: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, msg: Iterator verification: Value not found: NotFound: 
Verification failed :(

Test:

  • New UT
  • Same stress test command failed before this fix but passes after it
  • CI

@hx235 changed the title from "Disable reopen upon unsync data loss in crash test" to "Disable reopen with unsynced data loss in crash test" on Apr 17, 2024
@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 requested review from ajkr and ltamasi on Apr 17, 2024 at 21:04
@ajkr (Contributor) commented Apr 17, 2024

Can we sync the WAL during DB close?

@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 changed the title from "Disable reopen with unsynced data loss in crash test" to "Sync WAL during db Close()" on Apr 18, 2024
@hx235 (Contributor, Author) commented Apr 18, 2024

> Can we sync the WAL during DB close?

Good point - updated.

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -0,0 +1 @@
DB `Close()` will sync the WALs now
Contributor:

It seems to me that we're trading performance for consistency here: a clean closing of the database now ensures we're safe from host crashes but closing itself can become much more expensive, right? This feels like a big change; I'm wondering if we should make it optional (and disabled by default to prevent surprises for clients).

Contributor Author:

That's a good point - let me think a bit more about this. I also wonder how much Close() performance matters, since adding a DB option is "expensive" in terms of usability.

Contributor Author:

With more research, I realized our API already documents this, so the responsibility for WAL sync seems to have been placed on users @ajkr:

  // This will not fsync the WAL files.
  // If syncing is required, the caller must first call SyncWAL(), or Write()
  // using an empty write batch with WriteOptions.sync=true.
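
For reference, here is a minimal sketch (mine, not part of this PR; the helper name is made up) of what that documented contract asks of callers today: the application has to sync the WAL itself before Close(), either via SyncWAL() or an empty write batch with sync=true. Error handling is simplified.

  #include "rocksdb/db.h"

  // Sketch: explicitly sync the WAL before closing, which the quoted API
  // comment currently requires for durability of WAL-enabled writes.
  rocksdb::Status CloseWithSyncedWal(rocksdb::DB* db) {
    // Option 1: sync all unsynced WAL files directly.
    rocksdb::Status s = db->SyncWAL();
    // Option 2 (per the same comment): write an empty batch with sync=true:
    //   rocksdb::WriteBatch empty_batch;
    //   rocksdb::WriteOptions write_options;
    //   write_options.sync = true;
    //   s = db->Write(write_options, &empty_batch);
    if (s.ok()) {
      s = db->Close();
    }
    delete db;
    return s;
  }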

Contributor Author:

Let's not block our crash test on this discussion - #12567

Contributor:

> With more research, I realized our API already documents this, so the responsibility for WAL sync seems to have been placed on users @ajkr

I think good documentation cannot make up for poor API quality. In my opinion, a safe default for close would be to sync the data that we will have no future opportunities to sync.

Contributor:

I don't have a strong opinion on this but I think in general we don't explicitly sync the WAL files we're done with unless we absolutely have to. One thing this means is that if we're going to do it in this case, we'll probably have to call SyncClosedLogs to make sure any earlier unsynced WALs are also synced.

Contributor Author:

> One thing this means is that if we're going to do it in this case, we'll probably have to call SyncClosedLogs to make sure any earlier unsynced WALs are also synced

Let me look into that

Contributor:

There's a SyncWAL() function that is safe

Contributor Author:

Fixed with SyncWAL()

facebook-github-bot pushed a commit that referenced this pull request Apr 19, 2024
Summary:
**Context/Summary:**
See #12556 for the original problem.
The [fix](#12556) encountered some design [discussion](#12556 (comment)) that might take longer than I expected. Temporarily disable reopen with unsynced data loss for now just to stabilize our crash test, since the original problem has already been root-caused.

Pull Request resolved: #12567

Test Plan: CI

Reviewed By: ltamasi

Differential Revision: D56365503

Pulled By: hx235

fbshipit-source-id: 0755e82617c065f42be4c8429e86fa289b250855
@hx235 (Contributor, Author) commented Apr 24, 2024

Working on a UT

@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 (Contributor, Author) commented Apr 25, 2024

Looking into the ASAN test failure. The other one seems flaky.

@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

@@ -495,9 +495,6 @@ IOStatus WritableFileWriter::SyncWithoutFlush(const IOOptions& opts,
IOStatus s = SyncInternal(io_options, use_fsync);
TEST_SYNC_POINT("WritableFileWriter::SyncWithoutFlush:2");
if (!s.ok()) {
#ifndef NDEBUG
sync_without_flush_called_ = true;
Contributor Author (@hx235, Apr 25, 2024):

@anand1976 Can we remove sync_without_flush_called_? If we run DBCompactionTest.ManualCompactionFailsInReadOnlyMode with this PR and without this removal, we hit Assertion `sync_without_flush_called_' failed.

This is what happened:
mock_env->SetFilesystemActive(false); will call set_seen_error(). Later, during DB close, we sync the WAL and so enter AssertFalseAndGetStatusForPrevError() even though a WAL sync has not been called previously. So it seems to me that
"// This should only happen if SyncWithoutFlush() was called." is no longer true.

I also still don't understand why we want the behavior below in debug mode. Can we just "return IOStatus::IOError("Writer has previous error.");" instead of crashing?

  // SyncWithoutFlush() is the function that is allowed to be called
  // concurrently with other function. One of the concurrent call
  // could set seen_error_, and the other one would hit assertion
  // in debug mode.

@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

tools/ldb_cmd_test.cc (outdated comment, resolved)
@@ -626,6 +626,19 @@ Status DBImpl::CloseHelper() {
job_context.Clean();
mutex_.Lock();
}
if (!mutable_db_options_.avoid_flush_during_shutdown && !logs_.empty()) {
mutex_.Unlock();
Status s = SyncWAL();
Contributor Author:

Looking for some ideas to benchmark @ajkr @ltamasi: since DB Close() with a WAL sync will now be slower, how should we answer the question of "how much slower?"

I'm thinking of:
(1) Give a rough estimate of the per-sync time in the OS ... probably by googling for this number? And tell people to multiply it by their number of WAL files.

(2) Similar to (1), but microbenchmark SyncWAL() on one WAL to estimate the per-sync time (see the sketch below).

(3) ...?
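
As a rough illustration of option (2), a minimal sketch (mine, not from this PR; the helper name is made up) that times a single SyncWAL() call after an unsynced write on a DB with one live WAL; the number will of course depend heavily on the filesystem and device:

  #include <chrono>
  #include <iostream>
  #include "rocksdb/db.h"

  // Sketch: estimate the per-WAL sync cost by timing one SyncWAL() call
  // right after an unsynced write.
  void TimeOneWalSync(rocksdb::DB* db) {
    rocksdb::WriteOptions write_options;  // sync defaults to false
    rocksdb::Status put_status = db->Put(write_options, "bench_key", "bench_value");
    const auto start = std::chrono::steady_clock::now();
    rocksdb::Status sync_status = db->SyncWAL();
    const auto end = std::chrono::steady_clock::now();
    const auto micros =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << "Put: " << put_status.ToString()
              << ", SyncWAL: " << sync_status.ToString()
              << ", sync took " << micros << " us" << std::endl;
  }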

Contributor:

> how should we answer the question of "how much slower?"

IMO, "a lot slower", plus a reminder that the old behavior of not guaranteeing durability is available via avoid_flush_during_shutdown == true.

On the topic of durability violations, now avoid_flush_during_shutdown == true both makes WAL-enabled writes unsynced and drops WAL-disabled writes. I thought about the possible use cases (especially ones that mix WAL-enabled/disabled writes) and think the risk of bundling these violations into one option is low, but I could be wrong.
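
To make the escape hatch concrete, a small sketch (mine, not from this thread) of how a client that prefers the old fast shutdown would keep it after this change, using the existing option mentioned above:

  #include "rocksdb/options.h"

  rocksdb::Options MakeFastShutdownOptions() {
    rocksdb::Options options;
    // Existing DB option: skip flushing memtables on shutdown. With the
    // condition added in CloseHelper() in this PR, it also skips the WAL sync
    // on Close(), i.e. it keeps the pre-PR behavior (WAL-enabled writes may be
    // left unsynced and WAL-disabled writes may be dropped on clean shutdown).
    options.avoid_flush_during_shutdown = true;
    return options;
  }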

Contributor:

If you wanted to reduce the risk, you could introduce avoid_sync_during_shutdown. It would allow the following (hypothetical, AFAIK) use case to get the old behavior: (1) sometimes writes with the WAL disabled, (2) must not lose WAL-disabled writes on clean shutdown, and (3) does not want to pay the WAL sync overhead. It's kind of a confusing use case because (2) implies they are willing to pay the SST sync overhead, making (3) sound contradictory. But that is the case I thought of, and the struggle to make sense of it is why I thought the risk is low.

Contributor:

> I thought the risk is low.

... with respect to bundling the durability violations into one option. Regarding the whole thing about changing the default behavior, that feels more like medium.

Contributor Author (@hx235, May 13, 2024):

Okay - now I'm leaning more toward introducing avoid_sync_during_shutdown and setting it to true by default, so there is no change to existing behavior.

Contributor:

Ok, but we will need to reserve time to actually follow through with users to get them to migrate / eventually change the default. Otherwise it'll be confusing for years to come why we built a database without durability by default.

Contributor Author (@hx235, May 15, 2024):

Sounds good - will do so, particularly with avoid_flush_during_shutdown=false users.

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 requested a review from ltamasi on Apr 25, 2024 at 16:44
@ajkr (Contributor) left a comment:

LGTM!


@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.
