Skip to content

Releases: facebook/rocksdb

RocksDB 9.1.1

22 Apr 18:36
Compare
Choose a tag to compare

9.1.1 (2024-04-17)

Bug Fixes

  • Fixed Java SstFileMetaData to prevent throwing java.lang.NoSuchMethodError
  • Fixed a regression when ColumnFamilyOptions::max_successive_merges > 0 where the CPU overhead for deciding whether to merge could have increased unless the user had set the option ColumnFamilyOptions::strict_max_successive_merges

RocksDB 9.1.0

18 Apr 18:47
Compare
Choose a tag to compare

9.1.0 (2024-03-22)

New Features

  • Added an option, GetMergeOperandsOptions::continue_cb, to give users the ability to end GetMergeOperands()'s lookup process before all merge operands were found.
  • *Add sanity checks for ingesting external files that currently checks if the user key comparator used to create the file is compatible with the column family's user key comparator.
    *Support ingesting external files for column family that has user-defined timestamps in memtable only enabled.
  • On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read.
  • Some enhancements and fixes to experimental Temperature handling features, including new default_write_temperature CF option and opening an SstFileWriter with a temperature.
  • WriteBatchWithIndex now supports wide-column point lookups via the GetEntityFromBatch API. See the API comments for more details.
  • *Implement experimental features: API Iterator::GetProperty("rocksdb.iterator.write-time") to allow users to get data's approximate write unix time and write data with a specific write time via WriteBatch::TimedPut API.

Public API Changes

  • Best-effort recovery (best_efforts_recovery == true) may now be used together with atomic flush (atomic_flush == true). The all-or-nothing recovery guarantee for atomically flushed data will be upheld.
  • Remove deprecated option bottommost_temperature, already replaced by last_level_temperature
  • Added new PerfContext counters for block cache bytes read - block_cache_index_read_byte, block_cache_filter_read_byte, block_cache_compression_dict_read_byte, and block_cache_read_byte.
  • Deprecate experimental Remote Compaction APIs - StartV2() and WaitForCompleteV2() and introduce Schedule() and Wait(). The new APIs essentially does the same thing as the old APIs. They allow taking externally generated unique id to wait for remote compaction to complete.
  • *For API WriteCommittedTransaction::GetForUpdate, if the column family enables user-defined timestamp, it was mandated that argument do_validate cannot be false, and UDT based validation has to be done with a user set read timestamp. It's updated to make the UDT based validation optional if user sets do_validate to false and does not set a read timestamp. With this, GetForUpdate skips UDT based validation and it's users' responsibility to enforce the UDT invariant. SO DO NOT skip this UDT-based validation if users do not have ways to enforce the UDT invariant. Ways to enforce the invariant on the users side include manage a monotonically increasing timestamp, commit transactions in a single thread etc.
  • Defined a new PerfLevel kEnableWait to measure time spent by user threads blocked in RocksDB other than mutex, such as a write thread waiting to be added to a write group, a write thread delayed or stalled etc.
  • RateLimiter's API no longer requires the burst size to be the refill size. Users of NewGenericRateLimiter() can now provide burst size in single_burst_bytes. Implementors of RateLimiter::SetSingleBurstBytes() need to adapt their implementations to match the changed API doc.
  • Add write_memtable_time to the newly introduced PerfLevel kEnableWait.

Behavior Changes

  • RateLimiters created by NewGenericRateLimiter() no longer modify the refill period when SetSingleBurstBytes() is called.
  • Merge writes will only keep merge operand count within ColumnFamilyOptions::max_successive_merges when the key's merge operands are all found in memory, unless strict_max_successive_merges is explicitly set.

Bug Fixes

  • Fixed kBlockCacheTier reads to return Status::Incomplete when I/O is needed to fetch a merge chain's base value from a blob file.
  • Fixed kBlockCacheTier reads to return Status::Incomplete on table cache miss rather than incorrectly returning an empty value.
  • Fixed a data race in WalManager that may affect how frequent PurgeObsoleteWALFiles() runs.
  • Re-enable the recycle_log_file_num option in DBOptions for kPointInTimeRecovery WAL recovery mode, which was previously disabled due to a bug in the recovery logic. This option is incompatible with WriteOptions::disableWAL. A Status::InvalidArgument() will be returned if disableWAL is specified.

Performance Improvements

  • Java API multiGet() variants now take advantage of the underlying batched multiGet() performance improvements.
    Before
Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units
MultiGetBenchmarks.multiGetList10 no_column_family 10000 16 100 64 thrpt 25 6315.541 ± 8.106 ops/s
MultiGetBenchmarks.multiGetList10 no_column_family 10000 16 100 1024 thrpt 25 6975.468 ± 68.964 ops/s

After

Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units
MultiGetBenchmarks.multiGetList10 no_column_family 10000 16 100 64 thrpt 25 7046.739 ± 13.299 ops/s
MultiGetBenchmarks.multiGetList10 no_column_family 10000 16 100 1024 thrpt 25 7654.521 ± 60.121 ops/s

RocksDB 9.0.1

17 Apr 20:51
Compare
Choose a tag to compare

9.0.1 (2024-04-11)

Bug Fixes

  • Fixed CMake Javadoc and source jar builds
  • Fixed Java SstFileMetaData to prevent throwing java.lang.NoSuchMethodError

RocksDB 8.11.4

09 Apr 19:44
Compare
Choose a tag to compare

8.11.4 (2024-04-09)

Bug Fixes

  • Fixed CMake Javadoc build
  • Fixed Java SstFileMetaData to prevent throwing java.lang.NoSuchMethodError

RocksDB 9.0.0

19 Mar 16:44
Compare
Choose a tag to compare

9.0.0 (2024-02-16)

New Features

  • Provide support for FSBuffer for point lookups. Also added support for scans and compactions that don't go through prefetching.
  • *Make SstFileWriter create SST files without persisting user defined timestamps when the Option.persist_user_defined_timestamps flag is set to false.
  • Add support for user-defined timestamps in APIs DeleteFilesInRanges and GetPropertiesOfTablesInRange.
  • Mark wal_compression feature as production-ready. Currently only compatible with ZSTD compression.

Public API Changes

  • Allow setting Stderr logger via C API
  • Declare one Get and one MultiGet variant as pure virtual, and make all the other variants non-overridable. The methods required to be implemented by derived classes of DB allow returning timestamps. It is up to the implementation to check and return an error if timestamps are not supported. The non-batched MultiGet APIs are reimplemented in terms of batched MultiGet, so callers might see a performance improvement.
  • Exposed mode option to Rate Limiter via c api.
  • Removed deprecated option access_hint_on_compaction_start
  • Removed deprecated option ColumnFamilyOptions::check_flush_compaction_key_order
  • *Remove the default WritableFile::GetFileSize and FSWritableFile::GetFileSize implementation that returns 0 and make it pure virtual, so that subclasses are enforced to explicitly provide an implementation.
  • Removed deprecated option ColumnFamilyOptions::level_compaction_dynamic_file_size
  • *Removed tickers with typos "rocksdb.error.handler.bg.errro.count", "rocksdb.error.handler.bg.io.errro.count", "rocksdb.error.handler.bg.retryable.io.errro.count".
  • Remove the force mode for EnableFileDeletions API because it is unsafe with no known legitimate use.
  • Removed deprecated option ColumnFamilyOptions::ignore_max_compaction_bytes_for_input
  • sst_dump --command=check now compares the number of records in a table with num_entries in table property, and reports corruption if there is a mismatch. API SstFileDumper::ReadSequential() is updated to optionally do this verification. (#12322)

Behavior Changes

  • format_version=6 is the new default setting in BlockBasedTableOptions, for more robust data integrity checking. DBs and SST files written with this setting cannot be read by RocksDB versions before 8.6.0.
  • Compactions can be scheduled in parallel in an additional scenario: multiple files are marked for compaction within a single column family
  • For leveled compaction, RocksDB will try to do intra-L0 compaction if the total L0 size is small compared to Lbase (#12214). Users with atomic_flush=true are more likely to see the impact of this change.

Bug Fixes

  • Fixed a data race in DBImpl::RenameTempFileToOptionsFile.
  • Fix some perf context statistics error in write steps. which include missing write_memtable_time in unordered_write. missing write_memtable_time in PipelineWrite when Writer stat is STATE_PARALLEL_MEMTABLE_WRITER. missing write_delay_time when calling DelayWrite in WriteImplWALOnly function.
  • Fixed a bug that can, under rare circumstances, cause MultiGet to return an incorrect result for a duplicate key in a MultiGet batch.
  • Fix a bug where older data of an ingested key can be returned for read when universal compaction is used

RocksDB 8.11.3

28 Feb 01:51
Compare
Choose a tag to compare

8.11.3 (2024-02-27)

  • Correct CMake Javadoc and source jar builds

8.11.2 (2024-02-16)

  • Update zlib to 1.3.1 for Java builds

8.11.1 (2024-01-25)

Bug Fixes

  • Fix a bug where older data of an ingested key can be returned for read when universal compaction is used
  • Apply appropriate rate limiting and priorities in more places.

8.11.0 (2024-01-19)

New Features

  • Add new statistics: rocksdb.sst.write.micros measures time of each write to SST file; rocksdb.file.write.{flush|compaction|db.open}.micros measure time of each write to SST table (currently only block-based table format) and blob file for flush, compaction and db open.

Public API Changes

  • Added another enumerator kVerify to enum class FileOperationType in listener.h. Update your switch statements as needed.
  • Add CompressionOptions to the CompressedSecondaryCacheOptions structure to allow users to specify library specific options when creating the compressed secondary cache.
  • Deprecated several options: level_compaction_dynamic_file_size, ignore_max_compaction_bytes_for_input, check_flush_compaction_key_order, flush_verify_memtable_count, compaction_verify_record_count, fail_if_options_file_error, and enforce_single_del_contracts
  • Exposed options ttl via c api.

Behavior Changes

  • rocksdb.blobdb.blob.file.write.micros expands to also measure time writing the header and footer. Therefore the COUNT may be higher and values may be smaller than before. For stacked BlobDB, it no longer measures the time of explictly flushing blob file.
  • Files will be compacted to the next level if the data age exceeds periodic_compaction_seconds except for the last level.
  • Reduced the compaction debt ratio trigger for scheduling parallel compactions
  • For leveled compaction with default compaction pri (kMinOverlappingRatio), files marked for compaction will be prioritized over files not marked when picking a file from a level for compaction.

Bug Fixes

  • Fix bug in auto_readahead_size that combined with IndexType::kBinarySearchWithFirstKey + fails or iterator lands at a wrong key
  • Fixed some cases in which DB file corruption was detected but ignored on creating a backup with BackupEngine.
  • Fix bugs where rocksdb.blobdb.blob.file.synced includes blob files failed to get synced and rocksdb.blobdb.blob.file.bytes.written includes blob bytes failed to get written.
  • Fixed a possible memory leak or crash on a failure (such as I/O error) in automatic atomic flush of multiple column families.
  • Fixed some cases of in-memory data corruption using mmap reads with BackupEngine, sst_dump, or ldb.
  • Fixed issues with experimental preclude_last_level_data_seconds option that could interfere with expected data tiering.
  • Fixed the handling of the edge case when all existing blob files become unreferenced. Such files are now correctly deleted.

RocksDB 8.10.2

20 Feb 23:14
Compare
Choose a tag to compare

8.10.2 (2024-02-16)

  • Update zlib to 1.3.1 for Java builds

8.10.1 (2024-01-16)

Bug Fixes

  • Fix bug in auto_readahead_size that combined with IndexType::kBinarySearchWithFirstKey + fails or iterator lands at a wrong key

RocksDB 8.10.0

10 Jan 19:52
Compare
Choose a tag to compare

8.10.0 (2023-12-15)

New Features

  • Provide support for async_io to trim readahead_size by doing block cache lookup
  • Added initial wide-column support in WriteBatchWithIndex. This includes the PutEntity API and support for wide columns in the existing read APIs (GetFromBatch, GetFromBatchAndDB, MultiGetFromBatchAndDB, and BaseDeltaIterator).

Public API Changes

  • Custom implementations of TablePropertiesCollectorFactory may now return a nullptr collector to decline processing a file, reducing callback overheads in such cases.

Behavior Changes

  • Make ReadOptions.auto_readahead_size default true which does prefetching optimizations for forward scans if iterate_upper_bound and block_cache is also specified.
  • Compactions can be scheduled in parallel in an additional scenario: high compaction debt relative to the data size
  • HyperClockCache now has built-in protection against excessive CPU consumption under the extreme stress condition of no (or very few) evictable cache entries, which can slightly increase memory usage such conditions. New option HyperClockCacheOptions::eviction_effort_cap controls the space-time trade-off of the response. The default should be generally well-balanced, with no measurable affect on normal operation.

Bug Fixes

  • Fix a corner case with auto_readahead_size where Prev Operation returns NOT SUPPORTED error when scans direction is changed from forward to backward.
  • Avoid destroying the periodic task scheduler's default timer in order to prevent static destruction order issues.
  • Fix double counting of BYTES_WRITTEN ticker when doing writes with transactions.
  • Fix a WRITE_STALL counter that was reporting wrong value in few cases.
  • A lookup by MultiGet in a TieredCache that goes to the local flash cache and finishes with very low latency, i.e before the subsequent call to WaitAll, is ignored, resulting in a false negative and a memory leak.

Performance Improvements

  • Java API extensions to improve consistency and completeness of APIs
    • Extended RocksDB.get([ColumnFamilyHandle columnFamilyHandle,] ReadOptions opt, ByteBuffer key, ByteBuffer value) which now accepts indirect buffer parameters as well as direct buffer parameters
    • Extended RocksDB.put( [ColumnFamilyHandle columnFamilyHandle,] WriteOptions writeOpts, final ByteBuffer key, final ByteBuffer value) which now accepts indirect buffer parameters as well as direct buffer parameters
    • Added RocksDB.merge([ColumnFamilyHandle columnFamilyHandle,] WriteOptions writeOptions, ByteBuffer key, ByteBuffer value) methods with the same parameter options as put(...) - direct and indirect buffers are supported
    • Added RocksIterator.key( byte[] key [, int offset, int len]) methods which retrieve the iterator key into the supplied buffer
    • Added RocksIterator.value( byte[] value [, int offset, int len]) methods which retrieve the iterator value into the supplied buffer
    • Deprecated get(final ColumnFamilyHandle columnFamilyHandle, final ReadOptions readOptions, byte[]) in favour of get(final ReadOptions readOptions, final ColumnFamilyHandle columnFamilyHandle, byte[]) which has consistent parameter ordering with other methods in the same class
    • Added Transaction.get( ReadOptions opt, [ColumnFamilyHandle columnFamilyHandle, ] byte[] key, byte[] value) methods which retrieve the requested value into the supplied buffer
    • Added Transaction.get( ReadOptions opt, [ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value) methods which retrieve the requested value into the supplied buffer
    • Added Transaction.getForUpdate( ReadOptions readOptions, [ColumnFamilyHandle columnFamilyHandle, ] byte[] key, byte[] value, boolean exclusive [, boolean doValidate]) methods which retrieve the requested value into the supplied buffer
    • Added Transaction.getForUpdate( ReadOptions readOptions, [ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value, boolean exclusive [, boolean doValidate]) methods which retrieve the requested value into the supplied buffer
    • Added Transaction.getIterator() method as a convenience which defaults the ReadOptions value supplied to existing Transaction.iterator() methods. This mirrors the existing RocksDB.iterator() method.
    • Added Transaction.put([ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value [, boolean assumeTracked]) methods which supply the key, and the value to be written in a ByteBuffer parameter
    • Added Transaction.merge([ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value [, boolean assumeTracked]) methods which supply the key, and the value to be written/merged in a ByteBuffer parameter
    • Added Transaction.mergeUntracked([ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value) methods which supply the key, and the value to be written/merged in a ByteBuffer parameter

RocksDB 8.9.1

11 Dec 22:14
Compare
Choose a tag to compare

8.9.1 (2023-12-08)

Bug Fixes

  • Avoid destroying the periodic task scheduler's default timer in order to prevent static destruction order issues.

8.9.0 (2023-11-17)

New Features

  • Add GetEntity() and PutEntity() API implementation for Attribute Group support. Through the use of Column Families, AttributeGroup enables users to logically group wide-column entities.

Public API Changes

  • Added rocksdb_ratelimiter_create_auto_tuned API to create an auto-tuned GenericRateLimiter.
  • Added clipColumnFamily() to the Java API to clip the entries in the CF according to the range [begin_key, end_key).
  • Make the EnableFileDeletion API not default to force enabling. For users that rely on this default behavior and still
    want to continue to use force enabling, they need to explicitly pass a true to EnableFileDeletion.
  • Add new Cache APIs GetSecondaryCacheCapacity() and GetSecondaryCachePinnedUsage() to return the configured capacity, and cache reservation charged to the secondary cache.

Behavior Changes

  • During off-peak hours defined by daily_offpeak_time_utc, the compaction picker will select a larger number of files for periodic compaction. This selection will include files that are projected to expire by the next off-peak start time, ensuring that these files are not chosen for periodic compaction outside of off-peak hours.
  • If an error occurs when writing to a trace file after DB::StartTrace(), the subsequent trace writes are skipped to avoid writing to a file that has previously seen error. In this case, DB::EndTrace() will also return a non-ok status with info about the error occured previously in its status message.
  • Deleting stale files upon recovery are delegated to SstFileManger if available so they can be rate limited.
  • Make RocksDB only call TablePropertiesCollector::Finish() once.
  • When WAL_ttl_seconds > 0, we now process archived WALs for deletion at least every WAL_ttl_seconds / 2 seconds. Previously it could be less frequent in case of small WAL_ttl_seconds values when size-based expiration (WAL_size_limit_MB > 0 ) was simultaneously enabled.

Bug Fixes

  • Fixed a crash or assertion failure bug in experimental new HyperClockCache variant, especially when running with a SecondaryCache.
  • Fix a race between flush error recovery and db destruction that can lead to db crashing.
  • Fixed some bugs in the index builder/reader path for user-defined timestamps in Memtable only feature.

RocksDB 8.8.1

22 Nov 18:00
Compare
Choose a tag to compare

8.8.1 (2023-11-17)

Bug fixes

  • Make the cache memory reservation accounting in Tiered cache (primary and compressed secondary cache) more accurate to avoid over/under charging the secondary cache.
  • Allow increasing the compressed_secondary_ratio in the Tiered cache after setting it to 0 to disable.

8.8.0 (2023-10-23)

New Features

  • Introduce AttributeGroup by adding the first AttributeGroup support API, MultiGetEntity(). Through the use of Column Families, AttributeGroup enables users to logically group wide-column entities. More APIs to support AttributeGroup will come soon, including GetEntity, PutEntity, and others.
  • Added new tickers rocksdb.fifo.{max.size|ttl}.compactions to count FIFO compactions that drop files for different reasons
  • Add an experimental offpeak duration awareness by setting DBOptions::daily_offpeak_time_utc in "HH:mm-HH:mm" format. This information will be used for resource optimization in the future
  • Users can now change the max bytes granted in a single refill period (i.e, burst) during runtime by SetSingleBurstBytes() for RocksDB rate limiter

Public API Changes

  • The default value of DBOptions::fail_if_options_file_error changed from false to true. Operations that set in-memory options (e.g., DB::Open*(), DB::SetOptions(), DB::CreateColumnFamily*(), and DB::DropColumnFamily()) but fail to persist the change will now return a non-OK Status by default.
  • Add new Cache APIs GetSecondaryCacheCapacity() and GetSecondaryCachePinnedUsage() to return the configured capacity, and cache reservation charged to the secondary cache.

Behavior Changes

  • For non direct IO, eliminate the file system prefetching attempt for compaction read when Options::compaction_readahead_size is 0
  • During a write stop, writes now block on in-progress recovery attempts
  • Deleting stale files upon recovery are delegated to SstFileManger if available so they can be rate limited.

Bug Fixes

  • Fix a bug in auto_readahead_size where first_internal_key of index blocks wasn't copied properly resulting in corruption error when first_internal_key was used for comparison.
  • Fixed a bug where compaction read under non direct IO still falls back to RocksDB internal prefetching after file system's prefetching returns non-OK status other than Status::NotSupported()
  • Add bounds check in WBWIIteratorImpl and make BaseDeltaIterator, WriteUnpreparedTxn and WritePreparedTxn respect the upper bound and lower bound in ReadOption. See 11680.
  • Fixed the handling of wide-column base values in the max_successive_merges logic.
  • Fixed a rare race bug involving a concurrent combination of Create/DropColumnFamily and/or Set(DB)Options that could lead to inconsistency between (a) the DB's reported options state, (b) the DB options in effect, and (c) the latest persisted OPTIONS file.
  • Fixed a possible underflow when computing the compressed secondary cache share of memory reservations while updating the compressed secondary to total block cache ratio.

Performance Improvements

  • Improved the I/O efficiency of DB::Open a new DB with create_missing_column_families=true and many column families.