
PaddlePaddle 2.5.0 Release Note EN


PaddlePaddle 2.5.0 Release Note

1. Highlights

  • New dynamic-static unification architecture: Implement a new dynamic-to-static plus compiler execution model built on the basic operator system, and complete the whole dynamic-to-static, combinator and neural network compiler optimization and acceleration pipeline on the ResNet50 and Bert models. For dynamic-to-static, complete development of the core whole-graph fallback capability, which falls back to dynamic graph training execution when dynamic-to-static conversion fails. For the combinator, design a basic operator system containing more than 150 basic operators, implement the Python-layer forward operator splitting mechanism and the static graph backward operator splitting mechanism, and realize splitting of more than 70 commonly used forward and backward operators. For the CINN compiler, fix correctness bugs, develop key Passes, add manual schedule rules, achieve automatic generation of kernel code, and improve performance of the ResNet50 model by 12% and the Bert model by 10%.
  • Operator architecture unification of the PHI operator library: Unify all remaining 350+ operator kernels of the original operator system into the PHI operator library, and unify the way operators are defined in the original operator system into the operator definition form of the PHI operator library (YAML-based configuration of operator definitions), enhancing architectural unity and reducing the comprehension cost of framework development. Decouple all the Fluid header files that the PHI operator library depends on and compile it independently as a dynamic link library, providing a lighter way to reuse the operator library for secondary development of the framework. Continue to standardize and adjust non-conforming operators and operator kernels in the PaddlePaddle framework, making the framework easier for developers to understand and reducing the cost of hardware integration.
  • Full go-live of the new executor for static graphs: The new static graph executor implements a number of functional and performance optimizations, and completes the unification and replacement of the original multiple sets of old executors. The new executor is now the default back-end execution engine for static graph single-card and distributed training at the Python-side entry, as well as for dynamic-to-static, control flow, CINN, etc. This significantly improves the scheduling performance of the framework, the functional architecture is clearer, and the secondary development capability is significantly enhanced.
  • Python API supporting 0-dimensional tensor: Clear semantics are defined to distinguish a tensor of shape [1] from a tensor of shape [], and many API behaviors (e.g. paddle.sum) are fixed to support tensors of shape [].
  • New environment adaptation: Adapt to CUDA 12. Compilation with gcc12 is supported.

2. Incompatibility Upgrade

  • PaddlePaddle API supports 0-dimensional tensor. PaddlePaddle previously used a 1-dimensional tensor of shape [1] in place of a 0-dimensional tensor, which differs from current mainstream conventions, increases the development and debugging cost of models, and sometimes leads to unintended errors. This release fixes 376 APIs that need to support 0-dimensional tensor, enabling tools widely used by the community such as EinOps. For example, the output loss in model training used to be a 1-dimensional tensor, so taking out or printing the loss required code like loss.numpy()[0]. After this change, the output loss is a 0-dimensional tensor and loss.numpy() is enough to take out or print the loss; the code is shorter, easier to understand, and in line with industry conventions (a minimal before/after sketch follows this list).
  • paddle.fluid API is fully decommissioned. As previewed in the last version, 1116 paddle.fluid APIs and related internal interfaces have been decommissioned; the remaining few related internal interfaces will be cleaned up in the next version. The fluid APIs are historical APIs that PaddlePaddle 2.0 had planned to remove but whose cleanup was delayed for compatibility and other reasons. This decommissioning does not affect programs developed on PaddlePaddle 2.0, and makes the PaddlePaddle API system more concise and easier to understand.
  • Complete the code cleanup of the old dynamic graph on the Python side. From now on, the Python side only uses the new dynamic graph to call the C++ core logic.
  • In order to unify the data-parallel training methods for static graph models, the original single-process multi-card training method is abandoned, including the paddle.static.ParallelExecutor and paddle.static.CompiledProgram().with_data_parallel() APIs, because this set of APIs only supports single-machine multi-card training, does not support multi-machine multi-card training, and has poor underlying execution performance. It is recommended to use the multi-process multi-card training method uniformly, i.e., the paddle.distributed.launch API, for distributed training with data parallelism. This upgrade affects only static graphs; it does not affect dynamic graphs or dynamic-to-static training. If you use the decommissioned APIs, please refer to the documentation on data parallelism to modify your model code. #50351, #50501, #51240, #51701, #51616, #51369, #52671
  • Remove the original adaptation code for Ascend NPU and Cambricon MLU from the framework, upgrade both to the CustomDevice plug-in adaptation, and migrate the adaptation code for Ascend NPU and Cambricon MLU to the PaddleCustomDevice repository.
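A minimal before/after sketch of the 0-dimensional tensor change described in the first item above; the toy model and data are illustrative only.

```python
import paddle

# Toy training step: the loss is a scalar.
x = paddle.rand([8, 10])
linear = paddle.nn.Linear(10, 1)
loss = paddle.nn.functional.mse_loss(linear(x), paddle.zeros([8, 1]))

# PaddlePaddle < 2.5: loss had shape [1], so extracting the value needed indexing:
#     value = loss.numpy()[0]
# PaddlePaddle 2.5: loss is a 0-dimensional tensor of shape [], so:
value = loss.numpy()   # a 0-d array; no indexing required
print(loss.shape, float(value))
```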

3. Training Framework (Including Distributed)

Python API

API supporting 0-dimensional tensor

new API

  • Add paddle.autograd.jacobian and paddle.autograd.hessian APIs for scientific computing. #53331
  • Add sparse computing APIs, such as paddle.sparse.reshape, paddle.sparse.sum and paddle.sparse.slice. #46694, #51513, #53794, #51406
  • Add other APIs, such as paddle.optimizer.LBFGS, paddle.index_put and paddle.logaddexp. #53314, #51912, #52886, #50843, #47282, #52284
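A small usage sketch for one of the newly added APIs, paddle.logaddexp, which computes log(exp(x) + exp(y)) elementwise; the input values are arbitrary examples.

```python
import paddle

x = paddle.to_tensor([-1.0, -2.0, -3.0])
y = paddle.to_tensor([-1.0, -2.0, -3.0])

# Numerically stable elementwise log(exp(x) + exp(y)).
out = paddle.logaddexp(x, y)
print(out)  # roughly [-0.31, -1.31, -2.31], i.e. x + log(2) for equal inputs
```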

Dynamic graphs

New features

  • Add paddle.nn.utils.clip_grad_norm_ to support gradient clipping, and paddle.Tensor.data_ptr to get the memory/GPU-memory address of a Tensor's data (a usage sketch follows this list). PR49935, PR48235, PR49173
  • Add the saved_tensors_hooks mechanism, for temporary storage and retrieval of forward Tensor used in backward computation. PR45763, PR46215, PR48124
  • Tensor supports the pickle protocol, enabling serialization of Tensor. PR47025, PR48179
  • Add debug logs to print the forward Python stack when nan/inf appears in the backward pass. PR53217, PR52639, PR52729
  • Add higher-order differentiation support for expand_v2, tile, concat, assign and slice. PR45941, PR45942, PR45940, PR45879, PR45960
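A hedged usage sketch of paddle.nn.utils.clip_grad_norm_ and paddle.Tensor.data_ptr from the first item of this list; the max_norm value and the toy model are illustrative assumptions.

```python
import paddle

model = paddle.nn.Linear(10, 10)
x = paddle.rand([4, 10])
loss = model(x).sum()
loss.backward()

# Clip the global gradient norm of all parameters (max_norm chosen arbitrarily).
paddle.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# data_ptr() returns the address of the tensor's underlying memory / GPU memory.
print(model.weight.data_ptr())
```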

Improvements

  • Optimize log printing for dynamic graphs, including log content, VLog level, and error reporting content. PR45783, PR46349, PR46934, PR47724
  • Add FLAGS_auto_growth_chunk_size_in_mb to set the minimum chunk size of the auto_growth allocator. PR52204
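A minimal sketch of configuring the flag above, assuming the usual pattern that Paddle reads FLAGS_* settings from environment variables at import time; the 64 MB value is only an example.

```python
import os

# Set the minimum chunk size (in MB) of the auto_growth allocator before
# paddle is imported; 64 is an arbitrary example value.
os.environ["FLAGS_auto_growth_chunk_size_in_mb"] = "64"

import paddle

x = paddle.rand([1024, 1024])  # allocations now use the configured chunk size
print(x.shape)
```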

bug fix

  • Fix bugs in some operators, including batch_norm, slice, set_value, scale, multinomial, adam, conv, transpose2_grad, conv2d_transpose_double_grad. PR47802, PR47634, PR47349, PR46124, PR46147, PR50388, PR48626, PR48519, PR50386, PR48432, PR51851
  • Fix some PyLayer bugs. PR51740, PR47154, PR47323, PR54041, PR48533
  • Ensure that sync_batch_norm runs sequentially in the backward pass, to avoid hangs or precision errors caused by misordering. PR52268, PR52860, PR52779
  • Fix a bug of linspace under AMP. PR46088
  • Fix an incorrect Python C API call that caused crashes on Windows. PR46833
  • Fix the bug that DataLoader may fail to delete /dev/shm. PR48511
  • Fix some bugs of paddle.grad. PR47151
  • Add error message for operators that do not support higher order differentiation. PR47231
  • Add NumPy array support for Python operators. PR48229
  • Remove one of the duplicate element_size APIs. PR49631
  • Fix a crash when enabling VLOG in the old dynamic graph. PR47115
  • For XPU, replace D2D copies with D2H+H2D copies to solve a multi-threading problem. PR48373

Performance optimization

Static graphs

The new static graph executor is now fully live.

The new static graph executor implements a number of functional and performance optimizations, and completes the unification and replacement of the original multiple sets of old executors. The new executor is now the default back-end execution engine for static graph single-card and distributed training at the Python-side entry, as well as for dynamic-to-static, control flow, CINN, etc. This significantly improves the scheduling performance of the framework, the functional architecture is clearer, and the secondary development capability is significantly enhanced. #45913, #46025, #48911, #50239, #45696, #46092, #48158, #51389, #49708, #49275, #48789, #49939, #51149, #52652

Operator library

Enhance functionality of custom operators

Add new function support to the custom operator extension mechanism, enabling C++ extension operation functions to be bound to the Python side, which further enhances the framework's secondary development capability. The extension supports custom hardware using the custom operator mechanism, to meet the needs of hardware vendors that implement operations not existing in Paddle. It also supports advanced mechanisms in custom operators such as inplace, vector<Tensor> output and optional<Tensor> input. Optimize scheduling performance of custom operators in dynamic graph mode, with a 25.4% performance improvement for operators with multiple input parameters. Add commonly used operators and APIs for the custom operator Tensor extension, support chained calls, and simplify code writing. Optimize the operator kernel selection mechanism. Improve the logic of some operator kernels, enhance the supported data types, and optimize performance. Add and improve 100+ XPU kernels. Fix 170+ bugs. #49222, #51773, #51923, #53080, #50731, #50563, #50840, #50983, #51713, #48733, #50558, #50764, #51973, #52216, #51027, #50745, #50756, #50886, #50813, #50869, #51085, #51646, #51620, #51844, #52421, #52872, #52597, #50582, #52114, #52915, #50928, #48272, #48702, #52191, #52191, #47374, #47375, #47378, #54126, #47638, #47661, #50606, #53528, #50599, #51727, #50825, #50773, #50979, #53336, #53555, #53716, #53753, #53981, #53977, #53980, #54043, #54066, #52866, #53043, #53325, #54323, #54367, #51353, #53749, #50013, #47570, #50997, #51241, #49537

Unification of operator architecture

Unify all remaining 350+ operator kernels of the original operator system into the PHI operator library, and unify the way operators are defined in the original operator system into the operator definition form of the PHI operator library (YAML-based configuration of operator definitions), enhancing architectural unity and reducing the comprehension cost of framework development. Decouple all Fluid header files the PHI operator library depends on and compile it independently as a dynamic link library, providing a lighter way to reuse the operator library for secondary development of the framework. Continue to standardize and adjust non-conforming operators and operator kernels in the PaddlePaddle framework, making the framework easier for developers to understand and reducing the cost of hardware integration. #47856, #49328, #49138, #52014, #52044, #52116, #52486, #52101, #52882, #53003, #53034, #51914, #49116, #52626, #52878, #52879, #52880, #52875, #51600, #51601, #51590, #51887, #51891, #52036, #52130, #52134, #51951, #51886, #52274, #52263, #51913, #52145, #52347, #52370, #52437, #52424, #52231, #52522, #52529, #52802, #52799, #52855, #52711, #52940, #53309, #47817, #48001, #48063, #48049, #48168, #48415, #48696, #48970, #50183, #50407, #50498, #50419, #50282, #50870, #50911, #50865, #51288, #53735, #47248, #47787, #52202, #47579, #49444, #45772, #51264, #51634, #51631, #47385, #46342, #47510, #47532, #47702, #47860, #49470, #50358, #49121, #50190, #52374, #52372, #52375, #52371

Dynamic-to-static plus combinator

New features

  • Add the combination rules for combinators such as dropout, silu, stack, relu, expand, unsqueeze, pow, squeeze, meshgrid, batch_norm, layer_norm, group_norm, instance_norm, full_like, split, split_with_num, gelu, mean, flatten, rsqrt, hardswish #50497, #50838, #50861, #50819, #50810, #51527, #51070, #51539, #51061, #49894, #50422, #51874, #51341, #50295, #50298, #50672, #51432, #51003
  • Add the vjp rule for combinators such as gather_nd, reduce_max, group_norm, relu, reduce_max, gather, topk, sqrt, elementwise_pow, softmax, batch_norm, prod, multiply, expand, div, relu, slice, cumsum, sigmoid, layer_norm, sin, cos, roll, instance_norm, abs, assign, tile, scatter_nd_add, erf, floor, log, silu, leaky_relu, pad #50966, #51653, #52663, #51742, #52203, #50794, #50305, #50786, #50679, #51045, #51230, #51474, #51283, #51238, #49831, #51838, #50771, #50565, #51768, #51750, #51748, #52532, #52935, #50963, #51430, #53141, #52469, #50436, #51059, #51296, #52533, #53374
  • Add the second-order differentiation rule for combinators such as matmul, tanh, and elementwise #50452, #52192, #53014
  • Add the bf16 datatype support for combinators such as exp, reduce_mean, softmax, divide, cast, layer_norm, prod, meshgrid, expand_as, dropout, concat, gather_nd, elementwise_max, elementwise_pow, reduce_max #54263#54236, #53865, #54175, #54399
  • Add support for assigning semantics to containers in control flow in dynamic-to-static. #51248
  • For to_static, add full graph fallback function. When dynamic-to-static conversion fails, the whole graph can fall back to the dynamic graph mode of execution. For the fallback mechanism, add the set_eval_frame API. #50111, #52006
  • For to_static, support the combinator mechanism. Support the scenario of using register_hook under to_static decoration; #49836, #52948, #53572
  • Add a backend parameter to the to_static API, which can be set to CINN or None. When it is set to CINN, the CINN compiler will be used to accelerate training and inference (a hedged sketch follows this list). #52596
  • Add the code automatic generation function for the primitive API. Based on operator definitions in ops.yaml and legacy_ops.yaml, automatically generate code for the primitive API. Automatically generate the Tensor computation API. #50315, #49654, #50642
  • Add the function of forward combination of operators. By registering the combination rules of forward operators, it can split forward operators into base operators. #49605
  • Add combinator switches. You can set environment variables in the shell to split operators in different ways. #50309
  • Add OpTest combination test function to guarantee accuracy of operators. Add elementwise class base operator unit test. Add batch_norm CINN unit test. #50509, #50807, #52815
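A hedged sketch of the new backend parameter of to_static mentioned above, assuming paddle.jit.to_static accepts backend='CINN' and that Paddle was built with CINN support; the decorated function is a trivial example.

```python
import paddle

def fn(x):
    return paddle.nn.functional.relu(x) * 2.0

# backend can be 'CINN' or None; with 'CINN', the CINN compiler is used to
# accelerate the converted static graph (requires a CINN-enabled build).
static_fn = paddle.jit.to_static(fn, backend='CINN')

x = paddle.rand([4, 8])
print(static_fn(x).shape)
```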

Improvements

  • Add combinator to support FP16 operation and AMP O1 operation. Add AMP logic for softmax and layer_norm operators. #52397, #52598, #51473
  • Simplify combination rules and vjp rules of the combinator batch_norm. #54012, #51827, #51933,
  • Optimize combination rules for combinators, and improve the performance of combination rules involving scalars. Optimize log printing for combinators. #51960, #50160
  • Combinator supports the jit.save API. Add custom VJP rule API. #52344, #50885
  • Remove the overwrite parameter from combinator gather_grad. #52707
  • Clean up dynamic-to-static code style, optimize error message, and standardize logs. #48637, #46128, #52527, #46800,#46415
  • For dynamic-to-static, call append_backward to obtain the grad var name, to fix an error in higher-order gradient computation. #53250
  • Upgrade the dynamic-to-static function, and clean up the temporary directory of to_static to speed up code conversion. Enhance to_static to automatically skip internal API. Support use of to_static decorator in the program. #47102, #50596, #45768
  • For dynamic-to-static, optimize the print function conversion to support printing Tensor parameters at the model-building stage. Upgrade the parameter collection mechanism. #48672, #50336

bug fix

Performance optimization

  • Add scope caching and reuse mechanism during execution of run_program_op in dynamic-to-static, to avoid passing new scope for each step. #45813

Distributed training

Dynamic graph distributed training

Automatic parallel

  • Improve semi-automatic parallel for static graphs:
    • Add FLOPs computation function for multiple operators, and add computation Cost modelling based on FLOPs. #48083,#47978,#47595,#48083,#48084,#47816
    • Improve API ease of use. Improve the DistAttr, Process Mesh, Engine API, information printing, and input/output modules. Implement the new Engine cost API, which can be used to theoretically analyze model running time and GPU memory overhead. #47503, #46416, #46554, #46633, #49214, #53848, #46552, #47043, #49665, #52912, #45776, #47263
    • Optimize the generality and ease of use of Pass. Support more scenarios, and reduce time spent on Pass pre-analysis. #46519,#47358,#46391, #51035
    • Enhance debugging capabilities with distributed randomness control mechanisms and hybrid parallel precision alignment tools. #52903,#49865
    • Support automatic sharding of inference generation task networking. Adapt special usage of control flow and conditional block in the generation model. #46771, #54067
    • Improve grad_clip to support load balancing in data parallel scenarios. #49510, #49249
  • Semi-automatic parallel performance improvement for static graphs:
    • Add the Sharding Pass automated communication Fuse and multi-streams communication functions, with throughput performance improved by 26% on two machines for GPT 6.7B model. #48604, #47180,#46180
    • Add Recompute optimization strategy tuning function. Select optimal recompute checkpoint settings based on GPU memory and model size. #48608, #47846, #49010
    • For the pipeline parallel, add 1F1B scheduling optimization Pass #54260, #45915
    • Optimize data parallel. Support optimizations such as fused communication and communication-computation overlap, with performance improved by 5% on the GPT 1.3B model. #48092, #45643, #49744, #47578
    • Optimize the concat performance of the Reshard module, reducing the number of concats in some scenarios. #47809
    • Optimize mixed precision: upgrade Pass performance, support BF16 low precision, and adapt auto mixed precision to the while loop control flow. #51285, #51147, #49219, #49079
  • Improve function of fully automatic parallel for static graphs:

Parameter server

  • Clean up the all list in the ps directory, in which APIs are not exposed #51289
  • Clean up cvm operator #48989
  • For GPUPS, add support for AFS. #46611
  • Lower the log level of PGLBOX2.0, fix a hang issue with dense parameters, fix the bug that barrier does not take effect, and add the get_epoch_finish Python-side interface #49946, #50166, #50349
  • Support switching GPUPS runs to a specified mode. #51115
  • GPUPS is added to benchmark. #49587,#49649
  • Fix the GPUPS optimizer selection bug, fix reader reading problem, and fix RPC compilation problem. #47026,#47192,#49878, #46356,#46575,#49389,#46258,#50136
  • Add rocksdb compilation method. #46074

CUDA

New features

  • Add compilation support for CUDA 12.0. Fix related unit test. (#49539, #54542)
  • Add CUDNN Frontend API compilation support and related unit test. You can use WITH_CUDNN_FRONTEND=ON compilation option for start. (#47524, #47612)

Improvements

bug fix

  • Fix computation errors in several operators such as trace, roll, dropout_nd and log_softmax, a stack overflow, and some unit test errors. (#50243, #52012, #53795, #53149, #53654, #51054, #49373, #53038)
  • Fix the problem that conv operator exhaustive search does not work in some scenarios. (#47065)
  • Fix timeout problem of collective_reduce_scatter and other operators on A100. (#54513)
  • Fix the problem of attribute error in FusedLinear unit test. (#50359)
  • Fix the OOM problem that may occur when using Profiler. (#46089)

Performance optimization

Intermediate Representation

In order to guarantee the stability of the IR system and reduce its R&D cost, we have developed a new IR system for PaddlePaddle, completing the basic data structure definitions, operator definition generation, and execution system adaptation. To better support the higher-order requirements of scientific computing scenarios, higher-order adaptation of operators such as silu and cast is completed.

CINN compiler

New features

  • Add CINN support for 0D-Tensor. At present, in order to cooperate with the upgrade of the main framework, it is supported by adding pass temporarily. We will replace and upgrade the solution later. (#53382, #53955, #54064, #54118, #54216, #53454)
  • Add CINN support for int8/uint8/int16/uint16/bf16 data types. (#50566, #53637)
  • Add support for the CINN expand operator. (#46776)
  • Add CINN support for PaddleInference. (#45009)

Improvements

  • For CINN compiler, pass skip_gc_vars attribute to CINN subgraph. CINN adds fetch operator for skip_gc_vars. #49471, #49553
  • For CINN compiler, conv2d and conv2d_grad do not use cinn operator by default. #51645
  • Add build_cinn_pass to BuildStrategy for use in dynamic-to-static (a hedged sketch follows this list). (#49496)
  • Add a unit test for the reshape operator under the combinator mechanism. (#51276)
  • Change version of the main framework binding CINN from fixed commit to develop. (#49775)
  • Set default Target parameter for CINN. (#50182)
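A hedged sketch of enabling build_cinn_pass through BuildStrategy for dynamic-to-static, as mentioned above; it assumes paddle.jit.to_static accepts a build_strategy argument and that Paddle was compiled with CINN.

```python
import paddle

def fn(x):
    return paddle.tanh(x) + 1.0

# Enable the CINN subgraph pass via BuildStrategy (requires a CINN-enabled build).
build_strategy = paddle.static.BuildStrategy()
build_strategy.build_cinn_pass = True

static_fn = paddle.jit.to_static(fn, build_strategy=build_strategy)
x = paddle.rand([4, 8])
print(static_fn(x).shape)
```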

bug fix

  • Fix the problem of inconsistent operator order after topology sorting during CINN symbolization. (#52556)
  • Fix some operator computation errors, accuracy degradation, and unit test related problems. (#53859, #54261, #46801, #53676, #53772)
  • Fix the problem of CINN support for float16 type. (#48249)
  • Fix the problem in build_cinn_pass. (#46843)
  • Fix the problem of no data area due to incorrect GC when CINN is turned on during combinator + dynamic-to-static. (#50116)
  • Fix the problems of compiler dropout amp error, combinator resnet error, and inplace variable not found #51688, #52813, #51769

Performance optimization

  • Optimize reshape related fusion strategy (#53066)
  • Optimize performance of BuildCINNPass. (#49696)
  • Optimize performance of subgraph detection module. (#45040, #46937)

Hardware support

CustomDevice

  • Add support for the distributed strategies MP/Sharding/PP/MoE and recompute on the training side, and for the distributed strategy MP on the inference side. Hardware such as Ascend NPU and Cambricon MLU accessed through CustomDevice automatically inherits all newly added distributed strategies of CustomDevice without any code changes. #52872, #54384, #53220, #54572, #54573, #54676, #53044, #53719, #53701, #53702, #53703
  • Add the API paddle.device.is_compiled_with_custom_device, so that users can conveniently check whether the current environment supports the plug-in device backend of a certain hardware (a hedged sketch follows this list). #49271
  • Add the environment variable CUSTOM_DEVICE_BLACK_LIST, so that blacklisted operators automatically run heterogeneously on the CPU. #50409, #50666
  • Optimize CustomDevice performance by reducing number of calls to get_device_count interface in runtime. #46963
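A hedged sketch combining the new paddle.device.is_compiled_with_custom_device API and the CUSTOM_DEVICE_BLACK_LIST environment variable described above; the device name 'npu' and the blacklisted operator are placeholders.

```python
import os

# Operators listed here are automatically run heterogeneously on the CPU
# ('softmax' is only an illustrative placeholder).
os.environ["CUSTOM_DEVICE_BLACK_LIST"] = "softmax"

import paddle

# Check whether the current build supports a given plug-in device backend
# ('npu' is a placeholder name registered by a PaddleCustomDevice plug-in).
if paddle.device.is_compiled_with_custom_device("npu"):
    paddle.set_device("npu")
else:
    paddle.set_device("cpu")

print(paddle.device.get_device())
```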

KUNLUNXIN XPU

4. Deployment Direction (Paddle Inference)

New features

  • Support sharing GPU memory among multiple Paddle-TensorRT subgraph engines, or among TensorRT engines of different Predictors, in order to save GPU memory. #45842, #47631
  • For the C++ API, add APIs to obtain the shape and data type of input Tensors and of output Tensors. For the C API, add SetExecStream, EnableMkldnnInt8 and other APIs that already exist in C++, for service-oriented deployment. #49758
  • Add the paddle.inference.Predictor.register_output_hook() API, which supports printing the output of each layer under GPU inference for debugging, and can be used in control-flow models such as While. Note that this API does not support Paddle-TensorRT. #54433, #47050, #54254
  • The Paddle Inference Predictor API supports paddle::Tensor as input and output, so users can directly reuse PaddlePaddle dynamic graph code for pre-inference and post-inference processing. (#50445)
  • Enhance Paddle-TensorRT dynamic shape support: the config.enable_tuned_tensorrt_dynamic_shape() API can now build the TensorRT Engine at runtime without passing any parameters, so it is unnecessary to collect shape information before running. To avoid rebuilding at runtime, the first several runs need to cover the minimum and maximum shapes (a hedged sketch follows this list). #52162
  • Paddle-TensorRT supports model input in NHWC format. #49633
  • Extend config.Exp_DisableTensorRtOPs API to disable access to TensorRT by specifying the name of the Tensor variable. #49497
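A hedged sketch of the runtime TensorRT dynamic-shape tuning described above, assuming a typical paddle.inference setup; the model/parameter file names and the default enable_tensorrt_engine arguments are illustrative.

```python
from paddle.inference import Config, create_predictor

# Hypothetical model files.
config = Config("model.pdmodel", "model.pdiparams")
config.enable_use_gpu(256, 0)

# Enable the TensorRT subgraph engine (defaults used for brevity).
config.enable_tensorrt_engine()

# New in 2.5: build the TensorRT engine at runtime without passing any
# parameters or pre-collected shape information. Warm up with several runs
# covering the minimum and maximum input shapes to avoid later rebuilds.
config.enable_tuned_tensorrt_dynamic_shape()

predictor = create_predictor(config)
```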

Improvements

  • Enhance GPU mixed-precision inference (non-Paddle TensorRT scenarios): the Config.enable_use_gpu API is enhanced so that the precision type can be set (a hedged sketch follows this list). #47993
  • Support double type input for inference. #51786
  • Since TensorRT does not support the INT64 type, models containing INT64 data used to fail to run. Paddle-TensorRT is enhanced to automatically convert such models so that they run in INT32 when INT64 data types are present. #45547
  • Paddle-TensorRT supports more operators for TensorRT inference, including:
  • Enhance Paddle-TensorRT mapping operators strided_slice, instance_norm, prelu, argmax, cast, nearest_interp_v2, elementwise, bilinear. #46819, #47998, #48043, #48998, #49675, #47495
  • Some Paddle-TensorRT operators (scale, square, sum, swish, expand_as_v2, prelu, gelu, hard_swish, hard_sigmoid, leaky_relu, softmax, stack, clip, cast, flatten_contiguous_range, unary, equal, elementwise_op) support 0-dimensional Tensor. #53660, #53627, #53634, #53714, #53729, #53769, #53506, #53704
  • Support compilation for versions earlier than GCC12 + CUDA 12.0. #50106
  • Paddle-TensorRT's DeformableConv plugin supports dynamic Shape input. #50698
  • For Paddle-TensorRT, add plugin support for lookup_table operator. #46613
  • Add config.enable_low_precision_io() API to support low-precision type input in Paddle-TensorRT scenario. #52485
  • Paddle-TensorRT's LayerNorm plugin supports FP16 computation. #45043
  • Predictor's input data paddle_infer::Tensor supports bool type. #49388
  • Paddle-TensorRT enhanced Convolution implementation uses ConvolutionNd. #47653
  • conv2d_fusion operator supports NHWC format. #49047
  • Adjust the directory structure related to Phi operators under C++ inference library. #53091
  • Support rebuilding TensorRT Engine instead of reporting errors when TensorRT serialization and loading versions do not match. #50775
  • Optimize Paddle-TensorRT runtime to print log messages. #50181
  • Support elementwise 0-dimensional Tensor inputs for oneDNN-based CPU inference. #51656
  • Clean up and normalize support for Paddle-TensorRT's FC, matmul, matmul_v2 operators, and unify and upgrade to use TensorRT's IMatrixMultiplyLayer for support. #52222
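A hedged sketch of the enhanced Config.enable_use_gpu call from the first item of this list, assuming it accepts an additional PrecisionType argument (Half here requests FP16 mixed-precision inference without TensorRT); the file names are placeholders.

```python
from paddle.inference import Config, PrecisionType, create_predictor

config = Config("model.pdmodel", "model.pdiparams")

# Assumed enhanced signature: memory pool size (MB), device id, precision type.
config.enable_use_gpu(256, 0, PrecisionType.Half)

predictor = create_predictor(config)
```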

Performance optimization

  • Support fusing multiple lookup_tables into Paddle-TensorRT's Embedding+Eltwise+LayerNorm fusion. #46243, #46230
  • Add MoE fusion Phi operator to improve inference performance of MoE model. #48703
  • In the scenario of INT8 quantized inference, Paddle-TensorRT plugin can fall back to FP16 computation, instead of FP32 computation. #50554
  • Optimize memory and GPU memory usage during inference. #49051, #49046, #53930
  • Optimize Layout and enhance Pass. #52997
  • Support caching of operator Shape inferences to improve model inference performance. #48312
  • Optimize bias+add+relu fusion using half2 instructions. #49048
  • Optimize Concat Kernel for multiple inputs using vectorization operations. #49540
  • Implement Convolution, Depthwise Convolution and related fusion operators based on CUTLASS to improve inference speed. #47989#50603#51792#50603
  • Paddle-TensorRT supports FlashAttention’s plugin, to improve inference speed of models such as StableDiffusion. #49438
  • Add Transpose+LayerNorm fusion PASS, to improve inference speed of models such as StableDiffusion. #50082
  • Add Elementwise+Transpose fusion. #50081
  • Optimize Paddle-TensorRT Group Norm plugin implementation. #49160
  • Add a use_cuda_graph parameter to the Config.EnableTensorRtEngine() API to enable CUDA Graph and reduce runtime overhead; note that the model input shape must remain unchanged during usage. #53406
  • Support inplace operation of Reshape, to reduce copying time of the model at runtime. #49146
  • Optimize LayerNorm kernel implementation based on oneDNN. #47782
  • Support fusion of quantize+transpose and transpose+dequantize based on oneDNN. #49509
  • When MKLDNN is turned on in CPU inference, FC-related fusion pass is enabled by default, to improve performance. #45704
  • CPU oneDNN inference supports squeeze2 + transpose2 fusion. #47592

XPU inference enhancement and performance optimization

  • Add the ExpRunWithRuntimeConfig API and XpuRuntimeConfig, to allow setting parameters such as external streams and L3 cache during inference. The GetExecStream API supports obtaining Kunlun external stream objects. Input and output support Kunlun device memory, reducing D2H and H2D overhead. #53334, #52466, #53240
  • Add multi-encoder, fused_multi_transformer and fusion pass, to improve performance of ERNIE and Transformer class models. #50570#51346#50499#53982#50759#51571#53144#53306
  • Optimize BeamSearch performance. Transform, remove and fuse fine-grained operators such as write_read_array and gather, to improve model performance when beam_size=1. #53130
  • Transform multiple stack operators with the same input into unsqueeze operators that support broadcast. Unsqueeze/squeeze supports inplace computation. #52099
  • Add support for exporting multi-card inference models for Kunlunxin. #50490
  • Add the embedding_with_eltwise_add fusion pass and operator phi kernel, to reduce device memory usage and improve inference performance. #50590
  • interpolate class operator phi kernel supports FP16. #52358
  • argmax operator supports INT32 type output. #51303
  • Fix the error of having only a model file when saving a serialized model after turning on mixed-precision inference mode. #52994
  • Fix a segmentation fault in instance_norm when scale and bias are empty. #52627
  • conv_transpose operator supports FP16. #53626
  • Add yolo_box_xpu fusion pass and operator phi kernel, to optimize YOLO model generic substructure. #54163
  • Add conv2d_xpu fusion pass and operator phi kernel, and support FP16 inference, to optimize convolution operation inference consumption time. #52247#53626
  • Add the sigmoid_elementmul generic fusion pass, which fuses into the swish operator and matches the conv2d_fusion pass, to improve YOLO model inference performance. #53580
  • Add act_add fusion pass and operator phi kernel to improve inference performance. #53965
  • Add fold_interp_outsize fusion pass, to improve inference performance. #54245
  • Solve the problem of incorrect results due to duplicate fusion when there is shared weight in FC. #51108#51039
  • Remove the op_device attribute, which is only used for training, to prevent the wrong training place from being chosen during inference. #51029
  • Support saving of optimized models, allowing PASS optimization to be skipped in case of re-inference, to reduce first time inference time. #53696
  • Solve the problem of computation error caused by the CPUPlace input of operator Kernel being forced to copy to XPU. #51306
  • Sub-blocks support copying parameters H2D ahead of time, to improve inference performance. #51876
  • Fix scale memory size of the output activation of Kunlunxin 2nd generation chip. #53505
  • Support asynchronous execution of Kunlunxin D2D copies in the new executor. #51876
  • Remove concat operator with only one input. #52304
  • lookup_table_v2 supports FP16 to remove redundant cast operator. #52888
  • Control flow While operator supports caching scope, to reduce overhead of creating new scope every time. #52628
  • Scatter newly supports FP16, to remove redundant cast operators and elementwise_mul operators with an input of 1. #52831

Model quantization

  • Upgrade of dynamic graph quantization function.
    • Add a new API for quantization-aware training of dynamic graph models: paddle.quantization.QAT. It supports passing quantization-related parameters through a configuration object, simplifying the quantization training process and reducing the difficulty of secondary development (a hedged sketch follows this list). (#49398)
    • Add a new offline quantization API: paddle.quantization.PTQ. Support exporting the quantized model to a model format supported by inference. (#50107)
    • Add STUB operator to simulate actual quantization operation during training process. (#50510)
  • Support loading the parameters of an offline quantization model into a quantization-aware training model. Support more operators for quantization, including matmul, scale and conv1d. #47892, #45911, #48912
  • Support hybrid parallel training of static graph quantization training. #52219
  • Fix problems in the dynamic graph quantization process:
    • Fix repeated insertion of quantization nodes when exporting quantization-aware training models. #48751
    • Fix the problem of inserting quantization nodes into model input. #49926
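A hedged sketch of the new dynamic graph quantization-aware training flow built on paddle.quantization.QAT, as described above; the QuantConfig arguments and the FakeQuanterWithAbsMaxObserver quanter are assumptions about the configurable design, and the model is a toy example.

```python
import paddle
from paddle.quantization import QAT, QuantConfig
# Assumed quanter class; any quanter registered under paddle.quantization
# could be plugged in here instead.
from paddle.quantization.quanters import FakeQuanterWithAbsMaxObserver

model = paddle.nn.Sequential(paddle.nn.Linear(16, 16), paddle.nn.ReLU())

# Quantization-related parameters are passed through a configuration object.
quanter = FakeQuanterWithAbsMaxObserver(moving_rate=0.9)
q_config = QuantConfig(activation=quanter, weight=quanter)

qat = QAT(q_config)
quant_model = qat.quantize(model)  # insert fake-quant observers for training
print(quant_model)
```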

5. Environment Adaptation

Improve the efficiency of source code compilation, and promote the setuptools + ninja compilation method to increase development efficiency: in the CPU scenario, full compilation time is reduced by 20 minutes and compilation speed is increased by 24.52%; in the GPU scenario, full compilation time is reduced by 22 minutes and compilation speed is increased by 29.31%. In order to adapt to mainstream development environments, PaddlePaddle supports GCC 12 compilation and C++17 in the source code, and adapts to the latest CUDA 12. In terms of code quality, compilation warnings are fully cleaned up to improve the compilation experience. At the third-party dependency level, we have upgraded the underlying protobuf version to reduce dependencies, cleaned up deprecated attributes of some earlier-version dependency libraries and old code formats, and removed support for Python 2.x.

6. Security

Thanks to our Contributors

This release contains contributions from: 1want2sleep, 201716010711, 404988613, 5u13, 6clc, Ackeraa, Aganlengzi, ahahahahahaha, Ainavo, Allen Guo, andyj, Asthestarsfalll, Aurelius84, Ayuan, BellaZYL, Bjmw3, Bo Zhang, bukejiyu, caozhou, carryyu, Ccc, ccrrong, ceci3, chalsliu, Chang Xu, CHANGer, Charles-hit, Chen Weihang, chenjian, Chenxiao Niu, chenxiao120660, chenxujun, Chitsing KUI, cifar10, co63oc, CollaborativeFiltering, csy0225, cxxly, cyber-pioneer, cyberslack_lee, czr-gc, Dandelight, danleifeng, Danyang Zhang, dasen, denglianbin, Difer, dongfangshenzhu, DrowFish19, duanboqiang, duanyanhui, engineer, engineer1109, Epsilon Luoo, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, Fisher, FlyingQianMM, Frank Lin, Galaxy1458, GaoYuYang, gaoziyuan, gem5, GGBond8488, Ghost Screaming, gongenlei, gouzil, Guanghua Yu, Guo Sheng, Guoxia Wang, Hamid Zare, Hanchiao, handiz, Haohongxiang, haosicheng, haozi, Happyd99, heliqi, hellockx, hellolllw, heyanru, hg-1099255210, hh-qiao, hjyp, hong, HongyuJia, houj04, hua-zi, Huang Jiyi, Huang Zhengjie, huangjiyi, huangjun12, Hui Zhang, Huihuang Zheng, Hulek, hwa, HydrogenSulfate, Ikko Eltociear Ashimine, iLeGend, Infinity_lee, Infrared1029, Jacek Czaja, jakpiase, james, jameszhang, Jiabin Yang, jiahongyu, jiangcheng, jiangfan06, Jianghai, jiaqianjing, jingsongliu, JingZhuangzhuang, jjyaoao, joanna.wozna.intel, junxiu777, Jx-qi, JYChen, JZ-LIANG, jzhang533, Kai Song, Kai Xing, Kaipeng Deng, Kang Zhao, kangguangli, Kevin Wu Jiawen , Kim, Kim Yann, knamg, kuizhiqing, lanxianghit, Leding Li, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, Ligoml, lijialin03, lijin23, limingshu, Lin Manhui, LinearTemporalLogic, Linjie Chen, lishicheng1996, Little-chick, littleforest, liu zhengxi, liulinduo, liuruyan, liuzhenhai93, LiYuRio, lj970926, LokeZhou, LoneRanger, lubiu, Lucas, lugimzzz, Lux et Veritas, lxsbupt, LyndonKong, lzy, lzydev, Mahmoud Ashraf, Manan Goel, Maple Xie, Matsumoto Ruko, mayang002, MayYouBeProsperous, megemini, mengziheng, Meteor Liu, mhy, mhy-666, Ming-Xu Huang, ming1753, minghaoBD, mjxs, Moqim, Mountagha, Mr.Juice, mrcangye, NetPunk, Netpunk, nihao, niuliling123, Nyakku Shigure, OccupyMars2025, Ouyang Chao, pangengzheng, pangyoki, parap1uie-s, Paulina Gacek, Piotr Paturej, PommesPeter, PPGitub, PPPPzhang, PuQing, Qi Li, Qi Shao, QingshuChen, qipengh, qizhaoaoe, Rayman, RedContritio, RichardWooSJTU, risemeup1, Roc, ronnywang, Ruibiao Chen, Ruibin Cheung, RuohengMa, Ryan, SaltFish11, Sanbu, Scotty, scotty, seemingwang, Shaojie WANG, ShenLiang, shentanyue, Shijie, Shuangchi He, Siming Dai, Sing_chan, sneaxiy, Sonder, sprouteer, Sqhttwl, sunli, superwinner1, supplyout, SylarTiaNII, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao Luo, Taylor-Layrose, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, Tian, Tian Zheng, tiancaishaonvjituizi, tianshuo78520a, tifa, Tinson Lai, Tomasz Socha, Tony Cao, ucsk, umiswing, ustiniankw, Vegetable dog, Vigi Zhang, Vvsmile, Wang Bojun, Wang Xin, Wang Xinyu, wangfengsheng1999, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, wangshengxiang, wangxiaoning, wangxinxin08, Wangzheee, WangZhen, wangzhen38, wasupandceacar, wawltor, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, wentao yu, wenzhe.wang, westfish, whisky-12, whs, Wilber, will-jl944, winter-wang, Winters Montagne, WJJ1995, wuhuachaocoding, wuyefeilin, wz1qqx, XiangGao, xiaoguoguo626807, xiaohemaikoo, xiaoluomi, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiaoyuanzi914, Xinger, Xinyu Chen, xiongkun, xjmxyt, xu98bin, xysheng-baidu, yangguohao, 
yangjianfengo1, YangQun, YangZhou, yeliang2258, YepKong, Yichen Zhang, yikaikkk, Yiqun Liu, yjphhw, ykkk2333, Young-Flash, yu wentao, Yuang Liu, Yuanle Liu, YuanRisheng, yuchen202, yuehuayingxueluo, YuhangLi, Yulong Ao, YUNSHEN XIE, yunyaoXYY, YuRonan, zachary sun, ZeKai Zhou, Zenghui Yuan, zengshao0622, Zero Rains, Zhan Rongrui, Zhang Jun, Zhang Na, Zhang Ting, Zhang Zheng, zhangbo9674, ZhangDY-6483, zhangkaihuo, zhangxin81, zhangyikun02, zhangyingying520, zhangyuqin1998, zhaocaibei123, zhaoyingli, Zhen Wang, Zheng-Bicheng, Zhenghai Zhang, Zheng_Bicheng, zhenyun, Zhibao Li, zhiboniu, Zhong Hui, Zhou Wei, ZhouMengLei1999, zhoutianzi666, zhouzj, zhupengyang, zhurou603, zhuyipin, zhwesky2010, ziyoujiyi, zlsh80826, Zman, zmxdream, zqw_1997, Zuza Gawrysiak, zxcd, zyfncg, ZZK, zzk0, Ding Yi, Fu Jianhan, Liu Ge Gu Tou, Lu Lin, Zhou Zhouzhou, Jiang Yongyong, Xue Zhawu, Zhang Chunqiao, Zhang Zhenghai, Ning Meng Wei, Wang Mingdong, Shi Xiaowei, Chao Ji Ma Niu, Chen Cangye, Qi Ma Xiao Mao
