
Custom storage performance tuning questions #7657

Open
tjjh89017 opened this issue Apr 4, 2024 · 4 comments

tjjh89017 commented Apr 4, 2024

Please provide the following information

Hi,

I wrote a program named EZIO, a disk/partition/filesystem deployment tool built on libtorrent's custom storage feature.
We've run into an issue where the old version of EZIO, based on libtorrent 1.x, is consistently faster than the current EZIO based on libtorrent 2.0.
Even after switching from mmap to pread/pwrite to avoid page faults, which did improve performance, it is still somehow slower than the old version. (In our scenario, pieces are usually read from disk and miss the cache.)
For example, with our first release shipped in Clonezilla, we published a journal paper showing deployment to 32 machines at close to the full 1 Gbps line rate.
Right now we only get about half that performance in the same environment. (As an aside, another odd thing: multicast deployment is now much faster than it was a few years ago.)
In my environment, I could always show that BitTorrent is faster and more stable than multicast for deployment.

The Clonezilla team and I suspect it may be related to the cache model. In libtorrent 1.x, libtorrent manages the cache itself and can suggest cached pieces to other peers.
But libtorrent 2.0 is not aware of the OS cache (whether via mmap or pread/pwrite);
it depends on the custom storage's buffer implementation.
Our current buffer implementation is a fixed-length array (16 MB), split into 16 KB units.
In our testing, we found the cache never even reaches half full.
disk_buffer_holder is always released immediately.
Does that mean libtorrent doesn't always suggest the pieces that are in the cache?
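
For reference, the pool is roughly like the sketch below (simplified; names and layout are illustrative, not EZIO's actual code): a fixed 16 MB arena carved into 16 KB blocks with a free list.

```cpp
// Simplified sketch of a fixed-size block pool: a 16 MiB arena carved into
// 16 KiB blocks, handed out for disk buffer allocations and returned when
// the corresponding disk_buffer_holder is released. Names are illustrative.
#include <cstddef>
#include <mutex>
#include <vector>

class block_pool
{
public:
    static constexpr std::size_t block_size = 16 * 1024;        // one block
    static constexpr std::size_t pool_size  = 16 * 1024 * 1024; // whole arena

    block_pool() : m_arena(pool_size)
    {
        // every block starts out free
        for (std::size_t off = 0; off < pool_size; off += block_size)
            m_free.push_back(m_arena.data() + off);
    }

    // returns nullptr when the pool is exhausted
    char* allocate()
    {
        std::lock_guard<std::mutex> l(m_mutex);
        if (m_free.empty()) return nullptr;
        char* b = m_free.back();
        m_free.pop_back();
        return b;
    }

    void release(char* b)
    {
        std::lock_guard<std::mutex> l(m_mutex);
        m_free.push_back(b);
    }

private:
    std::mutex m_mutex;
    std::vector<char> m_arena;
    std::vector<char*> m_free;
};
```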

I want to do some "read-ahead" and feed those pieces into the suggested-pieces pool.
I found an API where I can ask the torrent_handle to read specific pieces with read_piece() and suggest them to other peers via suggest_read_cache.
But I want to understand the timing of when the store_buffer or disk_buffer_holder is released,
especially for the read cache.
If the disk_buffer_holder is only released from the cache after I consume the alert, that means I need to keep an eye on the alerts to make sure the cache doesn't get filled up by the "read-ahead" mechanism.
The goal is to make the process faster in the case where the bottleneck is not the HDD's random-read speed, but gaps where libtorrent isn't reading any data and the disk sits idle.
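
Roughly what I have in mind is the sketch below. read_piece() and read_piece_alert are the documented API; the prefetch policy and function name are just illustrative.

```cpp
// Sketch of the read-ahead idea: ask libtorrent to load a few upcoming
// pieces, then reap the resulting read_piece_alert so the heap buffer each
// alert carries can be freed again. The choice of next_pieces is up to the
// caller; only read_piece() and read_piece_alert are the real API here.
#include <libtorrent/session.hpp>
#include <libtorrent/torrent_handle.hpp>
#include <libtorrent/alert_types.hpp>
#include <vector>

void prefetch(lt::session& ses, lt::torrent_handle& th
    , std::vector<lt::piece_index_t> const& next_pieces)
{
    // kick off asynchronous reads; as a side effect the data lands in the
    // OS page cache
    for (lt::piece_index_t p : next_pieces)
        th.read_piece(p);

    // reap alerts so the heap allocations carried by read_piece_alert
    // don't pile up (they are separate from any disk cache)
    std::vector<lt::alert*> alerts;
    ses.pop_alerts(&alerts);
    for (lt::alert* a : alerts)
    {
        if (auto* rp = lt::alert_cast<lt::read_piece_alert>(a))
        {
            // rp->piece has been read; its buffer is released once the
            // alert itself is recycled by the session
            (void)rp;
        }
    }
}
```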

That said, I know this may already be the best the HDD can do without any tuning. Over the past few years, we have occasionally left all peers in a chaotic state and gotten the fastest deployments ever, but we don't know why.
We also tested with an NVMe SSD and a 10 Gbps network; it reached about 200~300 MB/s, which we thought could be faster.
Of course, we always try to keep EZIO as simple as possible, because the Clonezilla team and I all have full-time jobs.
It's just a side project; we won't put in a lot of effort for only a small gain at the cost of higher maintenance difficulty.

Do you have any suggestions for how to profile or benchmark to find the bottleneck?

Thank you!
And thank you again for this awesome project and your contributions; it helps us a lot that we don't need to implement BitTorrent from scratch.

libtorrent version (or branch): RC2.0

platform/architecture: Linux / amd64 (Debian sid)

compiler and compiler version: gcc 11

please describe what symptom you see, what you would expect to see instead and
how to reproduce it.

arvidn commented Apr 6, 2024

Hi @tjjh89017 It appears the mmap storage in libtorrent 2.0 indeed performs worse than the pread/pwrite in libtorrent 1.x. I've been working on an alternative backend for 2.1 that uses pread/pwrite.

There are quite a few knobs that can affect the transfer rate, the disk buffer size is one, but also making sure peers are saturated with piece requests (to cover the bandwidth delay product). There are heuristics around determining the number of outstanding piece requests, that also interfere with timeout logic. There's essentially a "slow-start" mechanism, where the number of outstanding piece requests is doubled until the download rate plateaus.
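
For example, knobs like these are set through settings_pack (the specific settings and values below are illustrative, not recommendations):

```cpp
// Illustrative only: a few settings_pack knobs related to outstanding piece
// requests and disk I/O. The values are examples, not tuned recommendations.
#include <libtorrent/session.hpp>
#include <libtorrent/settings_pack.hpp>

void apply_tuning(lt::session& ses)
{
    lt::settings_pack pack;
    // allow more outstanding piece requests per peer, to cover a larger
    // bandwidth-delay product
    pack.set_int(lt::settings_pack::max_out_request_queue, 1500);
    // number of disk I/O threads
    pack.set_int(lt::settings_pack::aio_threads, 8);
    // target amount of outgoing data to keep queued per peer
    pack.set_int(lt::settings_pack::send_buffer_watermark, 3 * 1024 * 1024);
    ses.apply_settings(pack);
}
```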

The main tool to tune and optimize the disk I/O throughput is the stats logging that can be enabled in libtorrent. There's this python script that interprets the resulting output: https://github.com/arvidn/libtorrent/blob/RC_2_0/tools/parse_session_stats.py

To enable the logging:

  • call post_session_stats() regularly.
  • when receiving the session_stats_header_alert, print its message() to a log file.
  • when receiving the session_stats_alert, print its message() to the log file.

the resulting log file can then be parsed by parse_session_stats.py. It requires gnuplot.
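
A minimal sketch of those three steps, assuming the client already runs an alert loop (the function name and log handling here are illustrative):

```cpp
// Sketch of the stats logging loop: post the stats, then write the header
// and each stats row to a log file that parse_session_stats.py can read.
#include <libtorrent/session.hpp>
#include <libtorrent/alert_types.hpp>
#include <fstream>
#include <vector>

void log_session_stats(lt::session& ses, std::ofstream& log)
{
    // ask the session to post a session_stats_alert; it may arrive in a
    // later pop_alerts() call, since posting is asynchronous
    ses.post_session_stats();

    std::vector<lt::alert*> alerts;
    ses.pop_alerts(&alerts);
    for (lt::alert* a : alerts)
    {
        // the header names the columns and is posted once
        if (auto* h = lt::alert_cast<lt::session_stats_header_alert>(a))
            log << h->message() << "\n";
        // each stats alert is one row of counter values
        else if (auto* s = lt::alert_cast<lt::session_stats_alert>(a))
            log << s->message() << "\n";
    }
}
```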

Regarding read_piece(), it will read 16kiB blocks from disk into a heap allocated piece. When all blocks have been read, that heap allocation (in a shared_ptr<>) is passed back to the client in an alert. It's good to reap those alerts to free the heap allocation, but it's separate from the disk cache. The buffer_holder is freed by the main thread after the block has been copied into the piece buffer to be returned to the client.

The current disk I/O backend in libtorrent 2.0 (mmap) does not have its own read buffer, its intention is to rely on the block cache.

The patch I'm working on for a multi-threaded pread/pwrite backend also doesn't have a read-buffer per-se. It has a store buffer while blocks are waiting to be flushed to disk.

It's possible that one performance benefit libtorrent 1.2 has is that it actually holds a disk buffer in user space, implementing an ARC cache. It might save syscalls when pulling data from the cache.


tjjh89017 commented Apr 6, 2024

Hi @tjjh89017 It appears the mmap storage in libtorrent 2.0 indeed performs worse than the pread/pwrite in libtorrent 1.x. I've been working on an alternative backend for 2.1 that uses pread/pwrite.

I read this PR already.

There are quite a few knobs that can affect the transfer rate, the disk buffer size is one, but also making sure peers are saturated with piece requests (to cover the bandwidth delay product). There are heuristics around determining the number of outstanding piece requests, that also interfere with timeout logic. There's essentially a "slow-start" mechanism, where the number of outstanding piece requests is doubled until the download rate plateaus.

The main tool to tune and optimize the disk I/O throughput is the stats logging that can be enabled in libtorrent. There's this python script that interprets the resulting output: https://github.com/arvidn/libtorrent/blob/RC_2_0/tools/parse_session_stats.py

To enable the logging:

  • call post_session_stats() regularly.
  • when receiving the session_stats_header_alert, print its message() to a log file.
  • when receiving the session_stats_alert, print its message() to the log file.

the resulting log file can then be parsed by parse_session_stats.py. It requires gnuplot.

I think I should try this first.
Thank you!

Regarding read_piece(), it will read 16kiB blocks from disk into a heap allocated piece. When all blocks have been read, that heap allocation (in a shared_ptr<>) is passed back to the client in an alert. It's good to reap those alerts to free the heap allocation, but it's separate from the disk cache. The buffer_holder is freed by the main thread after the block has been copied into the piece buffer to be returned to the client.

OK, it seems that in my case it won't be easy to control read_piece().

The current disk I/O backend in libtorrent 2.0 (mmap) does not have its own read buffer, its intention is to rely on the block cache.

The patch I'm working on for a multi-threaded pread/pwrite backend also doesn't have a read-buffer per-se. It has a store buffer while blocks are waiting to be flushed to disk.

I also changed my implementation to pread/pwrite without an ARC cache.
The OS still keeps data in the page cache,
so pread is sometimes still fast thanks to the page cache.
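
The read path is now essentially a plain pread loop, roughly like the sketch below (simplified; the real code also maps piece/offset to raw partition offsets).

```cpp
// Simplified sketch of the pread-based block read. A hit in the OS page
// cache makes this a memcpy plus one syscall, without the page-fault cost
// of the mmap path.
#include <unistd.h>
#include <cerrno>
#include <cstddef>

// returns bytes read, or -1 on error
ssize_t read_block(int fd, char* buf, std::size_t len, off_t file_offset)
{
    std::size_t done = 0;
    while (done < len)
    {
        ssize_t r = ::pread(fd, buf + done, len - done, file_offset + done);
        if (r < 0)
        {
            if (errno == EINTR) continue; // retry if interrupted by a signal
            return -1;
        }
        if (r == 0) break; // end of file/device
        done += static_cast<std::size_t>(r);
    }
    return static_cast<ssize_t>(done);
}
```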

It's possible that one performance benefit libtorrent 1.2 has is that it actually holds a disk buffer in user space, implementing an ARC cache. It might save syscalls when pulling data from the cache.

But I think implementing ARC on my own would take too much effort.
Some useful libtorrent classes live in aux_, so I can't rely on them.
I think I will keep the current implementation for now.

In today's testing on an NVMe SSD with a 10 Gbps network, pread/pwrite is faster than the mmap path
(pread/pwrite 400 MB/s vs. mmap 230 MB/s; we think 400 MB/s is already about the best a TLC SSD can do).
I think the mmap page-fault penalty is too high because our scenario has a high cache-miss rate:
we transfer each piece only once, and the other peers help distribute those pieces among themselves.
The main seeder needs to do its best to send out different pieces rather than the same pieces to different peers.

Thank you!

@tjjh89017

Hi @arvidn
I read #7013, and it looks great!
Once the PR is merged, I will check it again and try to implement the disk cache in EZIO.
I also hope you could export some of the aux_ APIs for custom storage, so we can reuse them and avoid re-implementing the same thing.

And I hope you won't be affected by the people with "strong" wording in that PR.
I sponsored you a small amount; I can't offer much, but I want to say, as always, thank you for this awesome project.
