# Performance

The following two tables compare the performance of LightSeq and Faster Transformer (FT), tested on a Tesla T4 with a Transformer-base model. We also provide a TensorFlow (TF) baseline whose code is taken from Faster Transformer. All speedups are the baseline latency divided by the accelerated latency (see the sketch after the beam search table).

## Beam search

| batch_size | beam_size | seq_len | TF (ms) | FT (ms) | LightSeq (ms) | PyTorch (ms) | FT speedup | LightSeq speedup | PyTorch speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4 | 32 | 419.53 | 26.25 | 29.66 | 385.23 | 15.98 | 14.14 | 1.09 |
| 1 | 4 | 64 | 806.38 | 54.02 | 63.04 | 760.77 | 14.93 | 12.79 | 1.06 |
| 8 | 4 | 32 | 439.64 | 35.99 | 34.77 | 416.06 | 12.22 | 12.64 | 1.06 |
| 8 | 4 | 64 | 891.54 | 79.82 | 79.43 | 835.79 | 11.17 | 11.22 | 1.07 |
| 32 | 4 | 32 | 536 | 82.82 | 59.49 | 429.78 | 6.47 | 9.01 | 1.25 |
| 32 | 4 | 64 | 1116.74 | 198.95 | 155.08 | 929.97 | 5.61 | 7.20 | 1.20 |
| 64 | 4 | 32 | 668.45 | 144.53 | 101.54 | 520.66 | 4.62 | 6.58 | 1.28 |
| 64 | 4 | 64 | 1476.17 | 351.14 | 277.4 | 1237.79 | 4.20 | 5.32 | 1.19 |
| 128 | 4 | 32 | 996.88 | 271.8 | 200.49 | 721.66 | 3.67 | 4.97 | 1.38 |
| 128 | 4 | 64 | 2157.85 | 671.76 | 502.91 | 2158.81 | 3.21 | 4.29 | 1.00 |
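
Each speedup column is simply the TF baseline latency divided by that engine's latency. A minimal sketch of the arithmetic, using the first row of the beam search table above (the helper name is illustrative):

```python
# Latencies from the first row of the beam search table (batch_size=1, beam_size=4, seq_len=32).
tf_ms, ft_ms, lightseq_ms, pytorch_ms = 419.53, 26.25, 29.66, 385.23

def speedup(baseline_ms: float, accelerated_ms: float) -> float:
    """Speedup = baseline latency / accelerated latency, rounded as in the tables."""
    return round(baseline_ms / accelerated_ms, 2)

print(speedup(tf_ms, ft_ms))        # 15.98  (FT speedup)
print(speedup(tf_ms, lightseq_ms))  # 14.14  (LightSeq speedup)
print(speedup(tf_ms, pytorch_ms))   # 1.09   (PyTorch speedup)
```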

## Sampling

In the topk/topp column, fractional values (e.g. 0.75) are top-p (nucleus) sampling thresholds and integer values (e.g. 32) are top-k sampling sizes.

| batch_size | topk/topp | seq_len | FT (ms) | LightSeq (ms) | LightSeq speedup |
| --- | --- | --- | --- | --- | --- |
| 1 | 0.75 | 32 | 34.4 | 29.66 | 1.16 |
| 1 | 0.75 | 64 | 71.45 | 59.72 | 1.20 |
| 32 | 0.75 | 32 | 56.61 | 40.40 | 1.40 |
| 32 | 0.75 | 64 | 120.39 | 100.36 | 1.20 |
| 128 | 0.75 | 32 | 111.4 | 94.68 | 1.18 |
| 128 | 0.75 | 64 | 246.97 | 270.55 | 0.91 |
| 1 | 32 | 32 | 34.35 | 28.06 | 1.22 |
| 1 | 32 | 64 | 72.48 | 56.4 | 1.29 |
| 32 | 32 | 32 | 40.15 | 39.23 | 1.02 |
| 32 | 32 | 64 | 87.46 | 98.62 | 0.89 |
| 128 | 32 | 32 | 99 | 90.83 | 1.09 |
| 128 | 32 | 64 | 222.62 | 262 | 0.85 |
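
Latencies like those above are typically measured with a warmed-up, GPU-synchronized timing loop. Below is a minimal sketch of such a harness, assuming a CUDA-enabled PyTorch install; `run_generation` is a stand-in for whichever engine is being timed (TF, FT, LightSeq, or PyTorch), not a real API of any of them:

```python
import time

import torch

def benchmark_ms(run_generation, warmup: int = 5, iters: int = 20) -> float:
    """Average latency of one generation call, in milliseconds."""
    for _ in range(warmup):       # warm up kernels, caches, and any autotuning
        run_generation()
    torch.cuda.synchronize()      # drain queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        run_generation()
    torch.cuda.synchronize()      # wait for the last kernel before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / iters
```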

## Machine Translation

The following table shows results for a fr2en translation model, a Transformer-big decoded with a beam size of 4 and a target vocabulary of approximately 30k. FP32 models are tested on a Tesla P4, and FP16 models are tested on a Tesla T4.

| batch_size | seq_len | tf-fp32 (ms) | lightseq-fp32 (ms) | lightseq-fp16 (ms) | speedup: lightseq-fp32 vs. tf-fp32 | speedup: lightseq-fp16 vs. lightseq-fp32 | speedup: lightseq-fp16 vs. tf-fp32 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 6 | 303 | 47 | 27 | 6.44 | 1.74 | 11.22 |
| 1 | 12 | 399 | 63 | 38 | 6.33 | 1.66 | 10.5 |
| 1 | 18 | 702 | 108 | 59 | 6.5 | 1.83 | 11.9 |
| 1 | 24 | 1071 | 167 | 82 | 6.41 | 2.04 | 13.06 |
| 1 | 36 | 1234 | 192 | 105 | 6.42 | 1.83 | 11.75 |
| 1 | 46 | 1445 | 227 | 110 | 6.36 | 2.06 | 13.14 |
| 1 | 58 | 1887 | 303 | 142 | 6.22 | 2.13 | 13.29 |
| 1 | 70 | 2771 | 428 | 197 | 6.47 | 2.17 | 14.07 |
| 2 | 6 | 317 | 57 | 32 | 5.56 | 1.78 | 9.91 |
| 2 | 12 | 418 | 73 | 39 | 5.72 | 1.87 | 10.72 |
| 2 | 18 | 723 | 131 | 66 | 5.51 | 1.98 | 10.95 |
| 2 | 24 | 1113 | 201 | 91 | 5.53 | 2.21 | 12.23 |
| 2 | 36 | 1276 | 234 | 104 | 5.45 | 2.25 | 12.27 |
| 2 | 46 | 1521 | 282 | 121 | 5.39 | 2.33 | 12.57 |
| 2 | 58 | 2004 | 371 | 159 | 5.4 | 2.33 | 12.6 |
| 2 | 70 | 2965 | 542 | 221 | 5.47 | 2.45 | 13.42 |
| 4 | 6 | 326 | 61 | 39 | 5.34 | 1.56 | 8.36 |
| 4 | 12 | 433 | 85 | 47 | 5.09 | 1.81 | 9.21 |
| 4 | 18 | 761 | 154 | 77 | 4.94 | 2 | 9.88 |
| 4 | 24 | 1195 | 245 | 113 | 4.87 | 2.17 | 10.58 |
| 4 | 36 | 1391 | 282 | 128 | 4.93 | 2.2 | 10.87 |
| 4 | 46 | 1679 | 339 | 153 | 4.95 | 2.22 | 10.97 |
| 4 | 58 | 2232 | 455 | 199 | 4.9 | 2.29 | 11.22 |
| 4 | 70 | 3406 | 673 | 285 | 5.06 | 2.36 | 11.95 |
| 8 | 6 | 364 | 76 | 43 | 4.78 | 1.77 | 8.47 |
| 8 | 12 | 470 | 110 | 56 | 4.27 | 1.96 | 8.39 |
| 8 | 18 | 854 | 205 | 91 | 4.16 | 2.25 | 9.38 |
| 8 | 24 | 1381 | 318 | 139 | 4.34 | 2.29 | 9.94 |
| 8 | 36 | 1628 | 378 | 156 | 4.3 | 2.42 | 10.44 |
| 8 | 46 | 1989 | 459 | 193 | 4.33 | 2.38 | 10.31 |
| 8 | 58 | 2683 | 617 | 254 | 4.34 | 2.43 | 10.56 |
| 8 | 70 | 4251 | 949 | 382 | 4.47 | 2.48 | 11.13 |
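
The three speedup columns are consistent with one another: the fp16-over-TF number is, up to rounding, the product of the other two, since (tf / ls_fp32) × (ls_fp32 / ls_fp16) = tf / ls_fp16. A quick check on the first row (note that the fp32 and fp16 latencies come from different GPUs, P4 and T4, so this is pure arithmetic rather than an isolated precision comparison):

```python
# First row of the fr2en table: batch_size=1, seq_len=6.
tf_fp32_ms, ls_fp32_ms, ls_fp16_ms = 303.0, 47.0, 27.0

fp32_vs_tf = tf_fp32_ms / ls_fp32_ms      # ~6.45 (table: 6.44)
fp16_vs_fp32 = ls_fp32_ms / ls_fp16_ms    # ~1.74
fp16_vs_tf = tf_fp32_ms / ls_fp16_ms      # ~11.22

# The end-to-end speedup factors exactly into the two intermediate ratios.
assert abs(fp16_vs_tf - fp32_vs_tf * fp16_vs_fp32) < 1e-9
```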

The following table shows results for an en2zh translation model, a Transformer-deep (the same configuration as Transformer-big except for a 16-layer encoder), decoded with a beam size of 4 and a target vocabulary of approximately 30k. FP32 models are tested on a Tesla P4, and FP16 models are tested on a Tesla T4.

| batch_size | seq_len | tf-fp32 (ms) | lightseq-fp32 (ms) | lightseq-fp16 (ms) | speedup: lightseq-fp32 vs. tf-fp32 | speedup: lightseq-fp16 vs. lightseq-fp32 | speedup: lightseq-fp16 vs. tf-fp32 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 12 | 544 | 86 | 43 | 6.32 | 2 | 12.65 |
| 1 | 24 | 914 | 131 | 66 | 6.97 | 1.98 | 13.85 |
| 1 | 36 | 1290 | 200 | 93 | 6.45 | 2.15 | 13.87 |
| 1 | 48 | 1836 | 233 | 106 | 7.89 | 2.2 | 17.32 |
| 1 | 72 | 3456 | 482 | 212 | 7.17 | 2.27 | 16.3 |
| 1 | 84 | 2626 | 431 | 193 | 6.09 | 2.23 | 13.61 |
| 2 | 12 | 566 | 100 | 50 | 5.66 | 2 | 11.32 |
| 2 | 24 | 842 | 158 | 70 | 5.32 | 2.26 | 12.03 |
| 2 | 36 | 1287 | 247 | 103 | 5.21 | 2.4 | 12.5 |
| 2 | 48 | 1504 | 288 | 118 | 5.22 | 2.44 | 12.75 |
| 2 | 72 | 3131 | 611 | 240 | 5.12 | 2.55 | 13.05 |
| 2 | 84 | 2789 | 546 | 217 | 5.1 | 2.52 | 12.85 |
| 4 | 12 | 590 | 118 | 58 | 5 | 2.03 | 10.17 |
| 4 | 24 | 885 | 187 | 89 | 4.73 | 2.1 | 9.94 |
| 4 | 36 | 1380 | 301 | 127 | 4.58 | 2.37 | 10.87 |
| 4 | 48 | 1622 | 352 | 149 | 4.6 | 2.36 | 10.89 |
| 4 | 72 | 3492 | 763 | 311 | 4.57 | 2.45 | 11.23 |
| 4 | 84 | 3145 | 687 | 282 | 4.57 | 2.44 | 11.15 |
| 8 | 12 | 631 | 150 | 66 | 4.2 | 2.27 | 9.56 |
| 8 | 24 | 979 | 248 | 103 | 3.94 | 2.41 | 9.5 |
| 8 | 36 | 1584 | 412 | 156 | 3.84 | 2.64 | 10.15 |
| 8 | 48 | 1880 | 477 | 186 | 3.94 | 2.56 | 10.11 |
| 8 | 72 | 4218 | 1069 | 404 | 3.94 | 2.65 | 10.44 |
| 8 | 84 | 3831 | 976 | 373 | 3.92 | 2.62 | 10.27 |
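
For concreteness, the two architectures can be summarized as below. The Transformer-big values are the standard hyperparameters from the original Transformer paper; the only difference stated above for Transformer-deep is the 16-layer encoder, so the remaining values are assumed to carry over unchanged:

```python
# Standard Transformer-big hyperparameters (Vaswani et al., 2017).
transformer_big = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "hidden_size": 1024,
    "ffn_size": 4096,
    "attention_heads": 16,
}

# Transformer-deep as described above: a deeper encoder, everything else unchanged.
transformer_deep = {**transformer_big, "encoder_layers": 16}
```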

## BERT

The following table compares the Hugging Face BERT-base model with the LightSeq model on a Tesla T4 using FP16.

| batch_size | seq_len | Hugging Face (ms) | LightSeq (ms) | LightSeq speedup |
| --- | --- | --- | --- | --- |
| 1 | 16 | 15.23 | 2.19 | 6.95 |
| 1 | 32 | 16.24 | 1.99 | 8.16 |
| 1 | 64 | 19.32 | 2.35 | 8.22 |
| 1 | 128 | 16.57 | 2.98 | 5.56 |
| 1 | 256 | 23.99 | 4.60 | 5.22 |
| 8 | 16 | 13.06 | 3.47 | 3.76 |
| 8 | 32 | 13.27 | 4.46 | 2.98 |
| 8 | 64 | 23.02 | 7.43 | 3.10 |
| 8 | 128 | 59.35 | 17.27 | 3.44 |
| 8 | 256 | 117.06 | 40.74 | 2.87 |
| 32 | 16 | 29.27 | 12.38 | 2.36 |
| 32 | 32 | 54.90 | 17.68 | 3.11 |
| 32 | 64 | 109.13 | 36.20 | 3.01 |
| 32 | 128 | 260.13 | 66.03 | 3.94 |
| 32 | 256 | 498.84 | 145.57 | 3.43 |
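
A minimal sketch of how the Hugging Face side of such a measurement can be set up. The batch size, sequence length, checkpoint name, and warmup/iteration counts are illustrative choices, the LightSeq side would be timed the same way with its own inference call, and the numbers will not exactly reproduce the table:

```python
import time

import torch
from transformers import BertModel

batch_size, seq_len = 8, 64   # one configuration from the table above (illustrative)
model = BertModel.from_pretrained("bert-base-uncased").half().cuda().eval()
input_ids = torch.randint(0, model.config.vocab_size, (batch_size, seq_len), device="cuda")

with torch.no_grad():
    for _ in range(5):                        # warm-up iterations
        model(input_ids)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(20):
        model(input_ids)
    torch.cuda.synchronize()

print(f"Hugging Face BERT-base fp16: {(time.perf_counter() - start) * 1000 / 20:.2f} ms")
```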