With multiple sessions (multiple algorithms) in CPU compute scenarios, the internal thread pool performs 50% worse than the OpenMP thread pool #2854

Open
zhenjing opened this issue Apr 29, 2024 · 4 comments
Labels
question Further information is requested

Comments

@zhenjing

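MNN internal thread pool: `MNN_THREAD_POOL_MAX_TASKS 2` limits the thread pool to at most 2 algorithms at a time.
Shortcomings of the original MNN thread pool: 1) concurrent tasks are always assigned to the lowest-numbered threads, so the higher-numbered threads do no compute work; 2) when a concurrent task runs, all threads are woken up, and because the threads use spin locks, every thread beyond the task's concurrency level spins idly.

I tested the yolov8n.mnn model using the Session API with a shared input image, comparing the internal thread pool against the OpenMP thread pool.

Conclusions:
1. The OpenMP thread pool performs best: with 6 algorithm handles, throughput is 90 and average latency is 65 ms. That is roughly 76% above the MNN internal thread pool's peak throughput of 51; with the same 6 handles, the internal thread pool's average latency is 176 ms.
2. The multi-sub-pool scheme reaches a throughput of 73 with an average latency of 95 ms at 7 handles, roughly 43% above the MNN internal thread pool's peak throughput of 51; with the same 7 handles, the internal thread pool's average latency is 193 ms.
3. For the yolov8 model, the compute time of a concurrent task depends on the number of handles: the average is 0.1 ms with 1 handle and 0.6 ms with 15 handles.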
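
As a quick check on the percentages in the conclusions above, the relative gain can be recomputed directly from the quoted throughput figures. The helper below is a hypothetical illustration (the name `relativeGainPercent` is not part of MNN); it only restates the arithmetic.

```cpp
#include <cassert>

// Relative gain of `candidate` over `baseline`, in percent.
// Hypothetical helper for illustration; not an MNN API.
double relativeGainPercent(double candidate, double baseline) {
    assert(baseline > 0.0);  // a zero baseline has no meaningful ratio
    return (candidate / baseline - 1.0) * 100.0;
}
```

With the throughput numbers quoted above, `relativeGainPercent(90.0, 51.0)` comes out at about 76 and `relativeGainPercent(73.0, 51.0)` at about 43.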

Why is the internal thread pool the default thread pool option?

@zhenjing
Author

MNN build option: MNN_ARM82
Test: yolov8n.mnn using the Session API with a shared input image.

Test data on Kunpeng 920:
Internal thread pool:
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1 | 1 | 302.21 | 299.49 | 307.17 | 101 | 100 | 119 | 99 | 3.31 |
| 1 | 2 | 163.55 | 160.16 | 328.65 | 3080 | 2989 | 122 | 29 | 6.11 |
| 1 | 3 | 133.54 | 131.09 | 153.16 | 3057 | 3026 | 119 | 28 | 7.49 |
| 1 | 4 | 104.17 | 103.15 | 108.18 | 3011 | 2988 | 115 | 29 | 9.60 |
| 1 | 5 | 89.63 | 88.48 | 100.63 | 2982 | 2958 | 122 | 29 | 11.15 |
| 1 | 6 | 84.53 | 83.17 | 144.62 | 2927 | 2897 | 115 | 30 | 11.82 |
| 1 | 7 | 73.92 | 72.51 | 121.87 | 2933 | 2901 | 116 | 31 | 13.52 |
| 1 | 8 | 68.15 | 66.50 | 98.63 | 2910 | 2875 | 115 | 31 | 14.67 |
| 1 | 9 | 69.13 | 67.43 | 78.88 | 2873 | 2841 | 120 | 32 | 14.45 |
| 1 | 10 | 65.51 | 62.18 | 109.37 | 2861 | 2816 | 115 | 33 | 15.25 |
| 1 | 11 | 66.33 | 64.78 | 104.34 | 2842 | 2810 | 114 | 33 | 15.06 |
| 1 | 12 | 64.23 | 62.21 | 104.18 | 2829 | 2794 | 112 | 34 | 15.56 |
| 1 | 13 | 60.36 | 57.25 | 91.09 | 2862 | 2807 | 125 | 34 | 16.55 |
| 1 | 14 | 57.46 | 53.00 | 86.55 | 2825 | 2764 | 125 | 34 | 17.39 |
| 1 | 15 | 57.06 | 54.58 | 79.47 | 2810 | 2763 | 119 | 35 | 17.51 |
| 1 | 16 | 53.06 | 51.46 | 84.32 | 2791 | 2757 | 117 | 35 | 18.83 |
| 1 | 17 | 56.50 | 53.74 | 75.90 | 2826 | 2773 | 124 | 36 | 17.69 |
| 1 | 18 | 58.38 | 53.27 | 105.62 | 2826 | 2785 | 124 | 36 | 17.11 |
| 1 | 19 | 58.60 | 56.58 | 78.08 | 2798 | 2767 | 117 | 36 | 17.05 |
| 1 | 20 | 57.50 | 55.27 | 100.63 | 2794 | 2757 | 117 | 37 | 17.38 |
| 1 | 21 | 55.87 | 54.27 | 66.56 | 2778 | 2744 | 115 | 37 | 17.89 |
| 1 | 22 | 55.07 | 53.46 | 69.17 | 2773 | 2738 | 117 | 37 | 18.15 |
| 1 | 23 | 51.62 | 49.48 | 70.91 | 2771 | 2732 | 116 | 37 | 19.36 |
| 1 | 24 | 50.34 | 48.55 | 97.78 | 2762 | 2720 | 117 | 38 | 19.85 |
| 1 | 25 | 49.48 | 47.93 | 81.96 | 2756 | 2717 | 117 | 38 | 20.20 |
| 1 | 26 | 49.00 | 47.16 | 62.74 | 2748 | 2706 | 116 | 38 | 20.40 |
| 1 | 27 | 45.74 | 43.68 | 83.62 | 2736 | 2701 | 116 | 39 | 21.85 |
| 1 | 28 | 44.77 | 42.97 | 91.32 | 2726 | 2692 | 119 | 39 | 22.32 |
| 1 | 29 | 43.99 | 42.50 | 83.11 | 2715 | 2681 | 119 | 39 | 22.71 |
| 1 | 30 | 43.39 | 41.85 | 56.30 | 2703 | 2666 | 117 | 39 | 23.03 |
| 1 | 31 | 42.99 | 41.19 | 60.30 | 2709 | 2670 | 117 | 40 | 23.24 |
| 1 | 32 | 44.53 | 41.08 | 78.13 | 2691 | 2658 | 125 | 40 | 22.43 |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+

+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1 | 4 | 92.00 | 91.06 | 106.52 | 2993 | 2954 | 118 | 27 | 10.86 |
| 2 | 4 | 309.84 | 91.53 | 550.67 | 1843 | 1102 | 203 | 29 | 6.43 |
| 3 | 4 | 311.59 | 91.94 | 518.50 | 1389 | 1201 | 287 | 31 | 9.59 |
| 4 | 4 | 316.22 | 93.55 | 520.95 | 1577 | 1186 | 371 | 33 | 12.58 |
| 5 | 4 | 320.57 | 92.92 | 534.82 | 1677 | 1227 | 455 | 34 | 15.44 |
| 6 | 4 | 322.07 | 95.07 | 482.06 | 1653 | 1377 | 539 | 36 | 18.54 |
| 7 | 4 | 327.64 | 94.54 | 587.61 | 1420 | 1364 | 620 | 38 | 21.24 |
| 8 | 4 | 338.09 | 136.99 | 542.82 | 2291 | 1820 | 707 | 39 | 23.29 |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+

+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1 | 2 | 161.70 | 159.80 | 180.62 | 3081 | 3031 | 668 | 33 | 6.18 |
| 2 | 2 | 307.31 | 159.33 | 468.00 | 2619 | 1820 | 584 | 32 | 6.48 |
| 3 | 2 | 311.39 | 160.73 | 467.74 | 2181 | 1872 | 584 | 32 | 9.58 |
| 4 | 2 | 313.20 | 161.55 | 457.25 | 2222 | 1893 | 584 | 32 | 12.72 |
| 5 | 2 | 322.94 | 160.54 | 471.00 | 2519 | 2034 | 586 | 32 | 15.35 |
| 6 | 2 | 336.08 | 161.32 | 657.59 | 2477 | 2194 | 586 | 32 | 17.53 |
| 7 | 2 | 341.25 | 163.78 | 696.82 | 2339 | 2000 | 636 | 33 | 20.35 |
| 8 | 2 | 338.36 | 163.89 | 472.60 | 2075 | 1976 | 709 | 33 | 23.28 |
| 9 | 2 | 349.00 | 174.64 | 659.26 | 2844 | 2261 | 793 | 34 | 25.46 |
| 10 | 2 | 358.57 | 168.42 | 659.79 | 2969 | 2197 | 876 | 34 | 27.12 |
| 11 | 2 | 368.14 | 167.50 | 749.81 | 2890 | 2392 | 958 | 35 | 29.22 |
| 12 | 2 | 379.36 | 169.97 | 700.04 | 2888 | 2167 | 1040 | 35 | 30.91 |
| 13 | 2 | 391.24 | 168.41 | 1212.57 | 2845 | 2205 | 1119 | 36 | 32.12 |
| 14 | 2 | 431.56 | 165.49 | 1474.95 | 2701 | 2385 | 1208 | 36 | 30.71 |
| 15 | 2 | 398.22 | 175.14 | 859.35 | 3111 | 2128 | 1295 | 37 | 36.19 |
| 16 | 2 | 429.22 | 173.18 | 1837.27 | 3070 | 2747 | 1379 | 37 | 36.14 |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+

OpenMP thread pool:
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1 | 1 | 302.93 | 298.91 | 322.26 | 101 | 100 | 120 | 99 | 3.30 |
| 1 | 2 | 167.06 | 165.37 | 170.31 | 189 | 185 | 121 | 99 | 5.98 |
| 1 | 3 | 126.03 | 122.64 | 157.94 | 262 | 256 | 129 | 99 | 7.93 |
| 1 | 4 | 100.89 | 96.24 | 127.97 | 335 | 324 | 143 | 98 | 9.91 |
| 1 | 5 | 109.94 | 87.66 | 170.26 | 357 | 345 | 152 | 98 | 9.09 |
| 1 | 6 | 98.22 | 78.51 | 112.05 | 403 | 385 | 149 | 97 | 10.18 |
| 1 | 7 | 96.84 | 94.59 | 127.38 | 414 | 409 | 158 | 97 | 10.32 |
| 1 | 8 | 90.21 | 70.48 | 109.52 | 451 | 446 | 152 | 97 | 11.08 |
| 1 | 9 | 93.04 | 90.55 | 125.76 | 442 | 432 | 163 | 96 | 10.74 |
| 1 | 10 | 86.95 | 85.53 | 98.38 | 470 | 463 | 161 | 96 | 11.50 |
| 1 | 11 | 95.68 | 93.90 | 130.65 | 450 | 446 | 174 | 95 | 10.45 |
| 1 | 12 | 91.33 | 84.73 | 115.54 | 485 | 466 | 182 | 95 | 10.94 |
| 1 | 13 | 92.05 | 90.02 | 131.43 | 488 | 479 | 182 | 94 | 10.86 |
| 1 | 14 | 90.80 | 88.67 | 131.91 | 492 | 484 | 186 | 94 | 11.01 |
| 1 | 15 | 88.31 | 86.11 | 126.08 | 516 | 506 | 190 | 94 | 11.32 |
| 1 | 16 | 86.64 | 83.26 | 126.51 | 537 | 527 | 184 | 93 | 11.54 |
| 1 | 17 | 98.74 | 94.63 | 200.81 | 506 | 492 | 186 | 93 | 10.12 |
| 1 | 18 | 98.36 | 94.58 | 131.67 | 506 | 496 | 185 | 93 | 10.16 |
| 1 | 19 | 99.25 | 96.09 | 129.91 | 511 | 498 | 201 | 92 | 10.07 |
| 1 | 20 | 102.10 | 97.94 | 127.53 | 505 | 497 | 196 | 92 | 9.79 |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+

+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1 | 4 | 98.97 | 98.10 | 105.72 | 365 | 336 | 195 | 92 | 10.10 |
| 2 | 4 | 111.05 | 99.38 | 125.07 | 638 | 614 | 282 | 93 | 17.95 |
| 3 | 4 | 125.96 | 102.84 | 146.91 | 931 | 924 | 350 | 93 | 23.78 |
| 4 | 4 | 129.78 | 106.55 | 184.36 | 1230 | 1136 | 438 | 93 | 30.68 |
| 5 | 4 | 121.97 | 103.38 | 160.35 | 1603 | 1451 | 456 | 97 | 40.64 |
| 6 | 4 | 129.07 | 105.19 | 161.60 | 1875 | 1592 | 586 | 97 | 46.11 |
| 7 | 4 | 132.34 | 108.91 | 180.89 | 2182 | 2021 | 695 | 97 | 52.15 |
| 8 | 4 | 134.77 | 109.17 | 212.90 | 2461 | 1950 | 815 | 97 | 58.69 |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+

+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1 | 2 | 167.21 | 165.35 | 171.93 | 991 | 220 | 498 | 93 | 5.98 |
| 2 | 2 | 189.64 | 169.03 | 211.02 | 376 | 355 | 504 | 93 | 10.53 |
| 3 | 2 | 184.25 | 166.91 | 214.19 | 554 | 535 | 507 | 93 | 16.24 |
| 4 | 2 | 194.37 | 170.20 | 223.24 | 743 | 692 | 507 | 94 | 20.39 |
| 5 | 2 | 191.27 | 172.01 | 225.70 | 924 | 850 | 563 | 94 | 26.02 |
| 6 | 2 | 202.51 | 169.93 | 225.91 | 1096 | 990 | 653 | 94 | 29.38 |
| 7 | 2 | 208.42 | 173.66 | 239.81 | 1268 | 1144 | 723 | 94 | 33.17 |
| 8 | 2 | 198.68 | 172.22 | 239.96 | 1469 | 1294 | 806 | 94 | 39.72 |
| 9 | 2 | 202.30 | 174.92 | 235.47 | 1643 | 1366 | 896 | 94 | 43.80 |
| 10 | 2 | 207.00 | 180.15 | 252.07 | 1839 | 1526 | 1010 | 97 | 47.65 |
| 11 | 2 | 206.64 | 178.10 | 267.44 | 2026 | 1519 | 1137 | 97 | 52.55 |
| 12 | 2 | 210.54 | 180.79 | 266.52 | 2204 | 1881 | 1224 | 97 | 56.25 |
| 13 | 2 | 218.15 | 186.11 | 261.90 | 2349 | 1654 | 1306 | 97 | 58.45 |
| 14 | 2 | 220.81 | 183.64 | 277.31 | 2555 | 1935 | 1380 | 97 | 62.21 |
| 15 | 2 | 234.77 | 194.08 | 282.70 | 2697 | 1904 | 1456 | 97 | 62.86 |
| 16 | 2 | 231.98 | 184.98 | 328.54 | 2828 | 2338 | 1530 | 97 | 67.51 |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
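
A note on reading the tables above: the throughput column looks consistent with `HandleCount * 1000 / AVG (ms)`, i.e. completed inferences per second across all handles. This relation is inferred from the numbers and is not stated in the thread, so treat the sketch below as an assumption; `estimateThroughput` is a hypothetical name.

```cpp
#include <cassert>

// Estimated aggregate throughput when `handleCount` handles run
// concurrently and each inference takes `avgLatencyMs` on average.
// Assumption: throughput = handles * 1000 / average latency in ms.
double estimateThroughput(int handleCount, double avgLatencyMs) {
    assert(handleCount > 0 && avgLatencyMs > 0.0);
    return handleCount * 1000.0 / avgLatencyMs;
}
```

For example, 1 handle at 302.21 ms gives about 3.31, matching the first row of the internal thread pool table, and 2 handles at 111.05 ms give about 18.0 against the reported 17.95. The small gaps suggest the reported column is measured rather than derived.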

@zhenjing
Author

Can the internal thread pool be optimized to match or outperform the OpenMP thread pool?

@zhenjing
Author

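I also tested replacing the queue with a lock-free queue, https://github.com/cameron314/concurrentqueue. The data is below.
Thread pool:
1. Multiple sub thread pools, each with 4 concurrent threads; the task queue is a lock-free blocking queue.
2. Each algorithm handle is bound to a specific thread pool.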

+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1 | 4 | 52.11 | 43.25 | 91.51 | 310 | 299 | 127 | 95 | 19.18 |
| 2 | 4 | 73.77 | 45.71 | 145.01 | 587 | 565 | 220 | 95 | 27.06 |
| 3 | 4 | 76.27 | 48.91 | 132.90 | 859 | 811 | 313 | 95 | 39.23 |
| 4 | 4 | 81.51 | 48.48 | 139.89 | 1136 | 1070 | 406 | 95 | 48.85 |
| 5 | 4 | 87.28 | 53.61 | 136.57 | 1440 | 1295 | 499 | 95 | 57.06 |
| 6 | 4 | 90.79 | 54.60 | 150.20 | 1694 | 1509 | 592 | 95 | 65.70 |
| 7 | 4 | 94.99 | 56.78 | 154.01 | 1901 | 1691 | 685 | 95 | 73.11 |
| 8 | 4 | 119.03 | 59.02 | 213.66 | 2135 | 1873 | 778 | 95 | 66.81 |
| 9 | 4 | 145.49 | 66.99 | 245.22 | 2397 | 2065 | 871 | 95 | 61.40 |
| 10 | 4 | 166.84 | 70.86 | 305.54 | 2639 | 2322 | 965 | 95 | 59.49 |
| 11 | 4 | 184.74 | 80.89 | 350.71 | 2878 | 2659 | 1057 | 95 | 59.23 |
| 12 | 4 | 229.78 | 81.02 | 329.65 | 3128 | 2964 | 1150 | 95 | 52.00 |
| 13 | 4 | 230.75 | 103.11 | 336.60 | 3344 | 2895 | 1243 | 95 | 56.08 |
| 14 | 4 | 268.87 | 93.64 | 431.35 | 3531 | 2992 | 1335 | 95 | 51.82 |
| 15 | 4 | 262.94 | 72.90 | 2979.28 | 3176 | 2362 | 1429 | 95 | 53.75 |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
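
The multi-sub-pool configuration measured above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used for the measurements: a mutex plus condition variable queue stands in for the lock-free blocking queue, the names `SubPool`, `PoolGroup`, and `poolIndexFor` are hypothetical, and binding a handle by `handleId % poolCount` is an assumed policy (the thread does not say how handles are mapped to pools).

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// One sub-pool: a fixed set of workers draining a private task queue.
// Assumption: a mutex-protected queue replaces the lock-free queue.
class SubPool {
public:
    explicit SubPool(std::size_t threads) {
        for (std::size_t i = 0; i < threads; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~SubPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wake_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();  // wake one sleeping worker, not the whole pool
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                // Workers sleep here instead of spinning while idle.
                wake_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stop requested, queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock
        }
    }
    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool stop_ = false;
};

// Several sub-pools; every handle is pinned to exactly one of them.
class PoolGroup {
public:
    PoolGroup(std::size_t poolCount, std::size_t threadsPerPool) {
        assert(poolCount > 0 && threadsPerPool > 0);
        for (std::size_t i = 0; i < poolCount; ++i)
            pools_.push_back(std::unique_ptr<SubPool>(new SubPool(threadsPerPool)));
    }
    // Assumed binding policy: round-robin by handle id.
    std::size_t poolIndexFor(std::size_t handleId) const {
        return handleId % pools_.size();
    }
    void submit(std::size_t handleId, std::function<void()> task) {
        pools_[poolIndexFor(handleId)]->submit(std::move(task));
    }

private:
    std::vector<std::unique_ptr<SubPool>> pools_;
};
```

A caller would construct something like `PoolGroup group(poolCount, 4)` and route each handle's parallel work through `group.submit(handleId, task)`. Because a handle only ever touches its own sub-pool, handles bound to different pools do not contend on the same queue.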

Thread pool:
1. A single thread pool; the task queue is a lock-free blocking queue.

+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| HandleCount | ThreadNum | AVG (ms) | Min (ms) | Max (ms) | MaxCPU | AvgCPU | MaxMemory(MB) | userTimeRatio | throughput |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+
| 1 | 4 | 72.99 | 55.21 | 112.32 | 304 | 294 | 127 | 94 | 13.70 |
| 2 | 4 | 84.10 | 77.08 | 123.71 | 591 | 574 | 221 | 94 | 23.73 |
| 3 | 4 | 93.11 | 84.60 | 152.11 | 887 | 858 | 313 | 94 | 32.19 |
| 4 | 4 | 97.14 | 86.18 | 135.40 | 1173 | 1123 | 407 | 94 | 41.06 |
| 5 | 4 | 105.34 | 74.98 | 143.07 | 1458 | 1354 | 500 | 95 | 47.25 |
| 6 | 4 | 113.20 | 98.05 | 143.42 | 1745 | 1631 | 593 | 95 | 52.93 |
| 7 | 4 | 145.99 | 112.52 | 189.18 | 2033 | 1854 | 685 | 95 | 47.81 |
| 8 | 4 | 164.72 | 134.42 | 201.22 | 2288 | 2230 | 778 | 95 | 48.45 |
| 9 | 4 | 167.02 | 118.06 | 213.49 | 2529 | 2265 | 863 | 95 | 53.73 |
| 10 | 4 | 196.22 | 101.76 | 256.27 | 2821 | 2496 | 963 | 95 | 50.69 |
| 11 | 4 | 241.54 | 143.93 | 293.38 | 3027 | 2773 | 1057 | 95 | 45.31 |
| 12 | 4 | 251.07 | 190.37 | 297.89 | 3262 | 2887 | 1150 | 95 | 47.72 |
| 13 | 4 | 283.10 | 168.88 | 371.04 | 3471 | 3106 | 1242 | 95 | 45.72 |
| 14 | 4 | 310.47 | 214.52 | 369.59 | 3665 | 3216 | 1335 | 95 | 44.92 |
| 15 | 4 | 359.19 | 172.35 | 458.20 | 3826 | 3775 | 1429 | 95 | 41.52 |
+-------------+-----------+----------+----------+----------+--------+--------+---------------+---------------+------------+

@jxt1234
Collaborator

jxt1234 commented May 8, 2024

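The internal thread pool is mainly designed to accelerate a small number of instances (fewer than 2). With multiple instances, we generally recommend running every instance single-threaded and using a thread pool on the outside; you can also switch to OpenMP yourself.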
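
The multi-instance recommendation above (each instance single-threaded, parallelism supplied by the caller) might look like the sketch below. It is an illustration only: `runInstances` and the `inferOnce` callback are hypothetical names, and the callback is assumed to wrap a session that was created for single-threaded execution (in MNN this presumably means setting `numThread` to 1 in the schedule config). For brevity it uses one caller-side thread per instance rather than a reusable pool.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs `iterations` inferences on each of `instanceCount` instances.
// Each instance is driven by its own caller-side thread; `inferOnce(id)`
// is assumed to execute single-threaded inside the inference engine.
std::vector<int> runInstances(std::size_t instanceCount, int iterations,
                              const std::function<void(std::size_t)>& inferOnce) {
    assert(iterations >= 0);
    std::vector<int> completed(instanceCount, 0);
    std::vector<std::thread> threads;
    for (std::size_t id = 0; id < instanceCount; ++id) {
        threads.emplace_back([&, id] {
            for (int i = 0; i < iterations; ++i) {
                inferOnce(id);
                ++completed[id];  // each thread writes only its own slot
            }
        });
    }
    for (auto& t : threads) t.join();  // all work finishes before returning
    return completed;
}
```

With this layout the inference engine's own pool is never shared between instances, so the contention measured in the multi-handle tables does not arise; the trade-off is that a single instance no longer gets intra-op parallelism.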

@jxt1234 jxt1234 added the question Further information is requested label May 8, 2024