
Serious performance loss in benchmark.sh/db_bench under multithreading (reporter_agent / rocksdb::Stats::FinishedOps) #12594

bernard035 opened this issue Apr 29, 2024 · 1 comment


@bernard035

A proposal to improve db_bench/benchmark.sh

Hi RocksDB community! I am Bernard Jiang, a master's student at SPAIL (System Performance Analytics and Intelligence Lab), Zhejiang University. My current research direction is system performance analysis.

I'm doing performance testing on RocksDB via tools/benchmark.sh based on db_bench. I chose tools/benchmark.sh since it is the recommended benchmark tool in the RocksDB project, and it has been used to test the performance of several versions.

However, with some basic analysis, I found that the symbol rocksdb::Stats::FinishedOps in db_bench consumes a large share of cycles under multi-threaded randomread workloads, and the share of cycles that perf-record attributes to it grows as the number of threads increases.

I suspect something is wrong there. Such an issue can cause db_bench to report values in multi-threaded conditions that do not reflect the actual performance of RocksDB.

I explored this and conducted a series of experiments. My current conclusion is that db_bench's default random-read run incurs significant performance overhead in a multi-threaded environment, due to the design of the feature that periodically writes QPS to a single CSV file, which was introduced nine years ago.

Next, I'm going to introduce my experiments.

Experiment Environment

  • RocksDB: version 9.2.0

  • CPU: 2 * Intel(R) Xeon(R) Platinum 8383C CPU @ 2.70GHz 

    • HyperThreading ON

    • 40 cores per socket

    • 160 hardware threads

  • CPU Cache: 61440 KB

  • Memory: 512 GB

  • OS: Ubuntu 22.04 5.15.0-102-generic

  • Workload Description: randomread in tools/benchmark.sh

    • I configured the CPU affinity of the task via taskset.

    • I used the following parameters to run the benchmark's randomread workload. The only parameter that differs across the experiments below is NUM_THREADS.

    # the parameters of benchmark.sh 
    export DB_DIR="./db"
    export WAL_DIR=./wal
    export NUM_KEYS=900000000
    export CACHE_SIZE=6442450944
    export DURATION=300
    export NUM_THREADS=1 # only this changed in the following different experiments
    
    ./tools/benchmark.sh randomread
    
    # equivalent to running db_bench directly:
    ./db_bench --benchmarks=readrandom,stats --use_existing_db=1 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=16 --max_write_buffer_number=8 --undefok=use_blob_cache,use_shared_block_and_blob_cache,blob_cache_size,blob_cache_numshardbits,prepopulate_blob_cache,multiread_batched,cache_low_pri_pool_ratio,prepopulate_block_cache --db=./db --wal_dir=/home/jupyter-lab/XJ_code/rocksdb/rocksdb-perf/data/randomread/static/t1//perf-stat-sar/wal --num=900000000 --key_size=20 --value_size=400 --block_size=8192 --cache_size=6442450944 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=zstd --bytes_per_sync=1048576 --benchmark_write_rate_limit=0 --write_buffer_size=134217728 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=60 --report_interval_seconds=1 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --num_levels=8 --min_level_to_compress=-1 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --duration=300 --threads=1 --seed=1713189836 

The results are somewhat noisy, but should be enough to get a ballpark performance estimate.

1 thread vs. 160 threads

I started by comparing the cases where the number of threads is 1 and 160. Note that a thread count of 160 means all hardware threads on the server (across both CPU sockets) are allocated to db_bench. I got the data below.

[figure: ops_sec, 1 thread vs. 160 threads]

We can see that at 160 threads, ops_sec is only about 7.4 times that of 1 thread.

At the same time, the CPI is much higher than the case where the number of threads is 1 (7.59 vs. 0.36).

[figure: CPI, 1 thread vs. 160 threads]

These numbers lead me to suspect that perhaps there are mutexes or global variables that limit performance in multithreaded scenarios.

To further determine the cause, I conducted the following experiment.

from 1 to 40 threads

I configured CPU affinity to distribute all the threads to different physical cores of the same processor, and increased the number of threads from 1 to 40.

When the number of threads exceeds 8, the throughput metric ops_sec no longer increases steadily.

[figure: ops_sec from 1 to 40 threads]

I also used perf-record for observations (perf record -F 97). In the data parsed by perf-report, the percentage of cycles samples attributed to the symbol rocksdb::Stats::FinishedOps rises as the number of threads increases. When the number of threads is greater than 20, this symbol accounts for more than 80% of all cycles samples. Note that when the number of threads is 1, this symbol accounts for only a small percentage (<5%).

[figure: percentage of cycles samples attributed to rocksdb::Stats::FinishedOps vs. number of threads]

Based on these experiments, it is evident that performance does not grow linearly with the number of threads. This implies contention or blocking that hinders the efficient utilization of parallel resources.

Remove Bottleneck

I did a series of analyses and finally found that the biggest performance bottleneck was the reporter_agent call in the function rocksdb::Stats::FinishedOps. After removing that call, performance in multi-threaded scenarios improved greatly.
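
To illustrate why a single reporting call can dominate at high thread counts: roughly, every worker thread reports each finished operation to one shared reporter so that interval QPS can be written out periodically. Below is a minimal sketch of that pattern, with hypothetical names (ReporterAgentSketch, FinishedOpsSketch); it is not the actual db_bench code.

    #include <atomic>
    #include <cstdint>

    // Sketch of the pattern only: every worker thread funnels its completed-op
    // count into one shared reporter so a background thread can periodically
    // write interval QPS to the report file.
    class ReporterAgentSketch {
     public:
      void ReportFinishedOps(int64_t num_ops) {
        // One atomic add per finished operation, from every worker thread,
        // all hitting the same cache line.
        total_ops_done_.fetch_add(num_ops, std::memory_order_relaxed);
      }

     private:
      std::atomic<int64_t> total_ops_done_{0};
    };

    // Per-operation hook, called by each worker thread for every completed op;
    // with 160 threads the shared counter above becomes the hot spot.
    void FinishedOpsSketch(ReporterAgentSketch* reporter_agent, int64_t num_ops) {
      if (reporter_agent != nullptr) {
        reporter_agent->ReportFinishedOps(num_ops);
      }
    }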

The ops_sec with 160 threads is now about 61 times that of 1 thread (323,795,384 vs. 5,293,853).

[figure: ops_sec after removing the bottleneck, 1 thread vs. 160 threads]

When I repeated the 1-to-40-threads randomread experiment above, ops_sec grew almost linearly as the number of threads increased.

[figure: ops_sec from 1 to 40 threads after removing the bottleneck]

I have two suggestions for improvement: a) add a db_bench parameter for this feature and disable it by default, and b) modify the implementation of this feature to reduce its overhead (a sketch of one possible approach to b) follows below).
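
For suggestion b), one possible direction is to let each worker thread update its own counter and have the background reporting thread aggregate them once per interval, so worker threads never contend on a shared cache line. This is only an illustrative sketch under that assumption (ShardedReporterSketch is a hypothetical name), not a concrete patch:

    #include <atomic>
    #include <cstdint>
    #include <vector>

    // Per-thread counter padded to a cache line to avoid false sharing.
    struct alignas(64) PaddedCounter {
      std::atomic<int64_t> ops{0};
    };

    class ShardedReporterSketch {
     public:
      explicit ShardedReporterSketch(size_t num_threads) : counters_(num_threads) {}

      // Called from worker thread `tid`; touches only that thread's cache line.
      void ReportFinishedOps(size_t tid, int64_t num_ops) {
        counters_[tid].ops.fetch_add(num_ops, std::memory_order_relaxed);
      }

      // Called once per reporting interval by the background reporting thread.
      int64_t TotalOps() const {
        int64_t total = 0;
        for (const auto& c : counters_) {
          total += c.ops.load(std::memory_order_relaxed);
        }
        return total;
      }

     private:
      std::vector<PaddedCounter> counters_;
    };

With this shape, the interval QPS could still be written to the same CSV file by the reporting thread, while the per-operation cost on each worker thread stays local to that thread.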

I also have some questions: I want to do random read / random write / mixed random read-write performance analysis for RocksDB. Do you have any recommended benchmarks or db_bench parameters?

If so, would it be possible to set them as the default parameters for benchmark.sh?

Thanks in advance,

Bernard Jiang

@ajkr

ajkr commented May 3, 2024

Thank you for looking into this and for the detailed analysis. Our benchmarks usually do not reach tens of millions of lookups per second since we usually measure large datasets that involve relatively slow storage accesses. Nevertheless, it is always nice to remove bottlenecks in the measurement system, so please let us know any ideas or PRs you wish to contribute towards that end.

I also have some questions: I want to do random read / random write / mixed random read-write performance analysis for RocksDB. Do you have any recommended benchmarks or db_bench parameters?

I guess it depends on what the goal is. This paper has some commands in the appendix that reflect production workloads to some extent: https://www.usenix.org/system/files/fast20-cao_zhichao.pdf. If you are interested in making an architectural improvement to our benchmarking tools, #9478 is a good issue, and this paper has more explanation of the risks associated with closed-loop benchmarking: https://www.vldb.org/pvldb/vol13/p449-luo.pdf
