
Have you measured the maximum throughput quiche can reach? #10

Open
adcen0107 opened this issue Nov 1, 2021 · 9 comments

@adcen0107

No description provided.

@ktprime

ktprime commented Jan 25, 2022

The test is CPU-bound; on a 9700, a ping-pong run reaches roughly 540-600 MB/s, while TCP can reach 1800 MB/s.

@adcen0107
Author

That is still a pretty big gap. Did you use a quiche demo dedicated to throughput testing?

@ktprime

ktprime commented Jan 25, 2022

A heavily optimized demo was used to test QUIC performance. The performance gap is mainly in packet sending (sendto/sendmmsg), with part of it in the gquic protocol stack.

@Rouzip

Rouzip commented Apr 19, 2022

The test is CPU-bound; on a 9700, a ping-pong run reaches roughly 500 MB/s, while TCP can reach 1800 MB/s.

How was this 500 MB/s measured? Is it effective traffic through a QUIC proxy, or raw UDP packet traffic on the NIC? And under what conditions was it tested: one stream on one connection?

@ktprime

ktprime commented Apr 25, 2022

The test is CPU-bound; on a 9700, a ping-pong run reaches roughly 540-600 MB/s, while TCP can reach 1800 MB/s.

How was this 500 MB/s measured? Is it effective traffic through a QUIC proxy, or raw UDP packet traffic on the NIC? And under what conditions was it tested: one stream on one connection?

A hand-written QUIC client and server demo; the bandwidth figure counts effective stream payload, with one stream per connection (multiple streams do not improve ping-pong performance).
The server binds to 127.0.0.1; after connecting, the client pre-sends 10-1000 KB of data (encryption is disabled during handshake negotiation), and the server sends whatever it receives straight back to the client.

Compilation used LTO and PGO with GCC 10.3 on Ubuntu 20.04 (Windows WSL2).
To raise performance, a lot of optimization work went into things like batched replies (sendmmsg/GSO).

@ktprime

ktprime commented Jan 2, 2024

After years of continuous, aggressive performance optimization of gquic (based on the 2023.05 version of quiche), with some features trimmed out:

On Win10 + WSL2 with a 12700 CPU, the ping-pong test reaches 1 GB+/s of bandwidth, with a single core handling 800,000+ QUIC packets (receive + send) per second.
Application-layer CPU usage dropped from 45% to 25%. Next I plan to optimize system CPU time; combined with DPDK, CPU performance could probably more than double.

Statics:1118 [C=10001]  1 ep_fd/last_udp = 1/3, active_udps, udps = 1, 1 tid = 19: client send_batch:1
Statics:1121 [C=10001]  2 ep_conns/all_conns/ep_streams/all_entry = 1|1|1|1, fail_conns/new_conns = 0|0
Statics:1124 [C=10001]  3 poll_calls = 13860, notify_calls = 0, send_calls = 20929, recv_calls = 13860, time_calls = 92199, event_calls = 13860
Statics:1127 [C=10001]  4 send_packets, recv_packets = 812988, 806831/s send_bytes, recv_bytes = 1091022.23, 1084537.95 KB/s
Statics:1130 [C=10001]  5 timer size = 2, once/runs/schedules = 63/  0/  63 /sec
Statics:1133 [C=10001]  6 thread cpu(user_time, sys_time) = 98.93% (24.05%, 74.88%) process 98.93% mem:11.00 MB
Statics:1136 [C=10001]  7 recv, sent, migrations, all_discons, online_sec = 108865573, 109765566, 0, 0, 0 sec
Statics:1141 [C=10001]  8 retrans, loss, duplicate, error, fail_conn, zero_conn =(0.00%%, 0.00%%, 0.00%%, 0.00%%, 0.00%, 100.00%) online = 0.04 hr

DumpQuicStats:924 [C=10001]     1 send,recv,slow_sent = 109765566, 108865573, 9219057:  (quid = 1)
DumpQuicStats:929 [C=10001]     2 loss_timeout, pto = 0, 1, lost, transmit = 27 25 (sp_trans 0. sp_lost 1)
DumpQuicStats:932 [C=10001]     3 slowstart_packets_lost,tcp_loss_events,packets_reordered = 14, 3, 0
DumpQuicStats:934 [C=10001]     4 min_rtt_us, srtt_us, max_reordering/max_send_packet = 24 100 us, 0/1470
DumpQuicStats:938 [C=10001]     5 bw = 2815380 k/s [trans:0.00%% 127.0.0.1:10060] |online = 135 sec, last_recv/last_send = 1/0 ms
DumpQuicStats:941 [C=10001] status =  packets_sent: 109765566 packets_received: 108865573 stream_bytes_received: 148342862802 bytes_retransmitted: 31375 packets_retransmitted: 25 packets_lost: 27 slowstart_packets_sent: 9219057 slowstart_packets_lost: 14 slowstart_bytes_lost: 13225 pto_count: 1 min_rtt_us: 24 srtt_us: 100 egress_mtu: 1470 max_egress_mtu: 1470 ingress_mtu: 1470 estimated_bandwidth: 22.52 Gbits/s (2.82 Gbytes/s) tcp_loss_events: 3 }

:147   binheap[ 1] size = 2, ups/ups_downs/runs = 50546/99%/122
-----------------------------------------------------------------------TPS: 79778/sec, QPS: 266905/sec, RBW 1042.60, SBW: 1042.60 MB/sec Cons: 1
-----------------------------------------------------------------------TPS: 79544/sec, QPS: 264894/sec, RBW 1034.74, SBW: 1034.74 MB/sec Cons: 1
-----------------------------------------------------------------------TPS: 80098/sec, QPS: 266174/sec, RBW 1039.74, SBW: 1039.74 MB/sec Cons: 1

@FreeMind-LJ

(quoting ktprime's Jan 2, 2024 comment above)

What optimization ideas did you use?

@ktprime

ktprime commented Feb 29, 2024

The core data structures in the lower layers of quiche were almost all replaced with high-performance alternatives (small vector/map/set/list/hash).
The absl containers used originally are not a good fit for small collections.

Time and timer handling is a significant cost; aggressive optimization eliminated a large share (about 90%) of the unnecessary high-frequency calls.

gcov was used to measure branch coverage and improve the if statements (roughly 20% of the if statements on the core send/receive path were removed, along with some rarely used features).
Most of the defensive code was also removed (which requires extensive functional and stability testing).

The biggest win was memory optimization: after the changes, the normal send/receive path no longer performs any dynamic memory allocation.

Finally, perf + LTO were used to analyze the remaining bottlenecks; they all came down to code-level details. Watch the side effects of every line of code on the hot path.

@ktprime

ktprime commented Feb 29, 2024

perfs
