
Serious performance loss in benchmark.sh/db_bench under multithreading (reporter_agent / rocksdb::Stats::FinishedOps) #12594

bernard035 opened this issue Apr 29, 2024 · 1 comment


@bernard035

A proposal to improve db_bench/benchmark.sh

Hi RocksDB community! I am Bernard Jiang, a master's student at SPAIL (System Performance Analytics and Intelligence Lab), Zhejiang University. My current research direction is system performance analysis.

I'm doing performance testing on RocksDB via tools/benchmark.sh based on db_bench. I chose tools/benchmark.sh since it is the recommended benchmark tool in the RocksDB project, and it has been used to test the performance of several versions.

However, with some basic analysis, I found that the symbol rocksdb::Stats::FinishedOps in db_bench consumes a large share of cycles under multi-threaded randomread workloads, and the share of cycles that perf-record attributes to it grows as the number of threads increases.

I suspect something is wrong there. Such an issue can cause db_bench to report values in multi-threaded conditions that do not reflect the actual performance of RocksDB.

I explored this and conducted a series of experiments. My current conclusion is that db_bench's default random-read run incurs significant performance overhead in a multi-threaded environment, due to the design of the feature that periodically writes QPS to a single CSV file, which was introduced nine years ago.

Next, I'm going to introduce my experiments.

Experiment Environment

  • RocksDB: version 9.2.0

  • CPU: 2 * Intel(R) Xeon(R) Platinum 8383C CPU @ 2.70GHz 

    • HyperThreading ON

    • 40 cores per socket

    • 160 hardware threads

  • CPU Cache: 61440 KB

  • Memory: 512 GB

  • OS: Ubuntu 22.04 5.15.0-102-generic

  • Workload Description: randomread in tools/benchmark.sh

    • I configured the CPU affinity of the task via taskset.

    • I used the following parameters to run the benchmark's randomread workload. The only parameter that differs across the experiments below is NUM_THREADS.

    # the parameters of benchmark.sh 
    export DB_DIR="./db"
    export WAL_DIR=./wal
    export NUM_KEYS=900000000
    export CACHE_SIZE=6442450944
    export DURATION=300
    export NUM_THREADS=1 # only this changed in the following different experiments
    
    ./tools/benchmark.sh randomread
    
    # equivalent to running db_bench directly:
    ./db_bench --benchmarks=readrandom,stats --use_existing_db=1 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=16 --max_write_buffer_number=8 --undefok=use_blob_cache,use_shared_block_and_blob_cache,blob_cache_size,blob_cache_numshardbits,prepopulate_blob_cache,multiread_batched,cache_low_pri_pool_ratio,prepopulate_block_cache --db=./db --wal_dir=/home/jupyter-lab/XJ_code/rocksdb/rocksdb-perf/data/randomread/static/t1//perf-stat-sar/wal --num=900000000 --key_size=20 --value_size=400 --block_size=8192 --cache_size=6442450944 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=zstd --bytes_per_sync=1048576 --benchmark_write_rate_limit=0 --write_buffer_size=134217728 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=60 --report_interval_seconds=1 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --num_levels=8 --min_level_to_compress=-1 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --duration=300 --threads=1 --seed=1713189836 

The results are somewhat noisy, but should be enough to get a ballpark performance estimate.

1 thread vs. 160 threads

I started by comparing the cases where the number of threads is 1 and 160. Note that a thread count of 160 means all hardware threads on the server (across both CPU sockets) are allocated to db_bench. I got the data below.

[figure: ops_sec, 1 thread vs. 160 threads]

We can see that at 160 threads, ops_sec is only about 7.4 times that of 1 thread.

At the same time, the CPI is much higher than the case where the number of threads is 1 (7.59 vs. 0.36).

[figure: CPI, 1 thread vs. 160 threads]

These numbers lead me to suspect that perhaps there are mutexes or global variables that limit performance in multithreaded scenarios.

To further determine the cause, I conducted the following experiment.

from 1 to 40 threads

I configured CPU affinity to distribute all the threads to different physical cores of the same processor, and increased the number of threads from 1 to 40.

When the number of threads exceeds 8, the throughput metric ops_sec no longer increases steadily.

[figure: ops_sec from 1 to 40 threads]

I also used perf-record for observations (perf record -F 97). In the data parsed by perf-report, the percentage of cycles samples attributed to the symbol rocksdb::Stats::FinishedOps rises as the number of threads increases. When the number of threads is greater than 20, this symbol accounts for more than 80% of all cycles samples. Note that when the number of threads is 1, this symbol accounts for only a small percentage (<5%).

[figure: percentage of cycles samples attributed to rocksdb::Stats::FinishedOps vs. number of threads]

Based on these experiments, it is evident that performance does not grow linearly with the number of threads. This implies contention or blocking that hinders the efficient utilization of parallel resources.

Remove Bottleneck

I did a series of analyses and finally found that the biggest performance bottleneck was the reporter_agent call in the function rocksdb::Stats::FinishedOps. After removing that call, performance in multi-threaded scenarios improved greatly.
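
To illustrate why a single reporting call can dominate at high thread counts: roughly, every worker thread reports each finished operation to one shared reporter so that interval QPS can be written out periodically. Below is a minimal sketch of that pattern, with hypothetical names (ReporterAgentSketch, FinishedOpsSketch); it is not the actual db_bench code.

    #include <atomic>
    #include <cstdint>

    // Sketch of the pattern only: every worker thread funnels its completed-op
    // count into one shared reporter so a background thread can periodically
    // write interval QPS to the report file.
    class ReporterAgentSketch {
     public:
      void ReportFinishedOps(int64_t num_ops) {
        // One atomic add per finished operation, from every worker thread,
        // all hitting the same cache line.
        total_ops_done_.fetch_add(num_ops, std::memory_order_relaxed);
      }

     private:
      std::atomic<int64_t> total_ops_done_{0};
    };

    // Per-operation hook, called by each worker thread for every completed op;
    // with 160 threads the shared counter above becomes the hot spot.
    void FinishedOpsSketch(ReporterAgentSketch* reporter_agent, int64_t num_ops) {
      if (reporter_agent != nullptr) {
        reporter_agent->ReportFinishedOps(num_ops);
      }
    }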

The ops_sec with 160 threads is now about 61 times that of 1 thread (323,795,384 vs. 5,293,853).

[figure: ops_sec after removing the bottleneck, 1 thread vs. 160 threads]

When I repeated the 1-to-40-threads randomread experiment above, ops_sec grew almost linearly as the number of threads increased.

[figure: ops_sec from 1 to 40 threads after removing the bottleneck]

I have two suggestions for improvement: a) add a db_bench parameter for this feature and disable it by default, and b) modify the implementation of this feature to reduce its overhead (a sketch of one possible approach to b) follows below).
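
For suggestion b), one possible direction is to let each worker thread update its own counter and have the background reporting thread aggregate them once per interval, so worker threads never contend on a shared cache line. This is only an illustrative sketch under that assumption (ShardedReporterSketch is a hypothetical name), not a concrete patch:

    #include <atomic>
    #include <cstdint>
    #include <vector>

    // Per-thread counter padded to a cache line to avoid false sharing.
    struct alignas(64) PaddedCounter {
      std::atomic<int64_t> ops{0};
    };

    class ShardedReporterSketch {
     public:
      explicit ShardedReporterSketch(size_t num_threads) : counters_(num_threads) {}

      // Called from worker thread `tid`; touches only that thread's cache line.
      void ReportFinishedOps(size_t tid, int64_t num_ops) {
        counters_[tid].ops.fetch_add(num_ops, std::memory_order_relaxed);
      }

      // Called once per reporting interval by the background reporting thread.
      int64_t TotalOps() const {
        int64_t total = 0;
        for (const auto& c : counters_) {
          total += c.ops.load(std::memory_order_relaxed);
        }
        return total;
      }

     private:
      std::vector<PaddedCounter> counters_;
    };

With this shape, the interval QPS could still be written to the same CSV file by the reporting thread, while the per-operation cost on each worker thread stays local to that thread.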

I also have some questions: I want to do random read / random write / mixed random read-write performance analysis for RocksDB. Do you have any recommended benchmarks or db_bench parameters?

If so, would it be possible to set them as the default parameters for benchmark.sh?

Thanks in advance,

Bernard Jiang

@ajkr

ajkr commented May 3, 2024

Thank you for looking into this and for the detailed analysis. Our benchmarks usually do not reach tens of millions of lookups per second since we usually measure large datasets that involve relatively slow storage accesses. Nevertheless, it is always nice to remove bottlenecks in the measurement system, so please let us know any ideas or PRs you wish to contribute towards that end.

I also have some questions: I want to do random read / random write / mixed random read-write performance analysis for RocksDB. Do you have any recommended benchmarks or db_bench parameters?

I guess it depends on what the goal is. This paper has some commands in the appendix that reflect production workloads to some extent: https://www.usenix.org/system/files/fast20-cao_zhichao.pdf. If you are interested in making an architectural improvement to our benchmarking tools, #9478 is a good issue, and this paper has more explanation of the risks associated with closed-loop benchmarking: https://www.vldb.org/pvldb/vol13/p449-luo.pdf
