
PaddlePaddle 2.5.0 Release Note EN


PaddlePaddle 2.5.0 Release Note

1. Highlights

  • New dynamic-static unification architecture: Implement a new dynamic-to-static plus compiler execution model built on the basic operator system, and complete the whole dynamic-to-static, combinator and neural network compiler optimization and acceleration pipeline on the ResNet50 and Bert models. For dynamic-to-static, complete development of the core whole-graph fallback capability, which falls back to dynamic graph training execution when dynamic-to-static conversion fails. For the combinator, design a basic operator system containing more than 150 basic operators, implement the Python-layer forward operator splitting mechanism and the static graph backward operator splitting mechanism, and realize splitting of more than 70 commonly used forward and backward operators. For the CINN compiler, fix correctness bugs, develop key Passes, add manual schedule rules, achieve automatic generation of kernel code, and improve performance of the ResNet50 model by 12% and the Bert model by 10%.
  • Operator architecture unification of the PHI operator library: Unify all remaining 350+ operator kernels of the original operator system into the PHI operator library, and unify the way operators are defined in the original operator system into the operator definition form of the PHI operator library (YAML-based configuration of operator definitions), enhancing architectural unity and reducing the comprehension cost of framework development. Decouple all the Fluid header files that the PHI operator library depends on and compile it independently as a dynamic link library, providing a lighter way to reuse the operator library for secondary development of the framework. Continue to standardize and adjust non-conforming operators and operator kernels in the PaddlePaddle framework, making the framework easier for developers to understand and reducing the cost of hardware integration.
  • Full go-live of the new executor for static graphs: The new static graph executor implements a number of functional and performance optimizations, and completes the unification and replacement of the original multiple sets of old executors. The new executor is now the default back-end execution engine for static graph single-card and distributed training at the Python-side entry, as well as for dynamic-to-static, control flow, CINN, etc. This significantly improves the scheduling performance of the framework, the functional architecture is clearer, and the secondary development capability is significantly enhanced.
  • Python API supporting 0-dimensional tensor: Clear semantics are defined to distinguish a tensor of shape [1] from a tensor of shape [], and many API behaviors (e.g. paddle.sum) are fixed to support tensors of shape [].
  • New environment adaptation: Adapt to CUDA 12. Compilation with gcc12 is supported.

2. Incompatibility Upgrade

  • PaddlePaddle API supports 0-dimensional tensor. PaddlePaddle previously used a 1-dimensional tensor of shape [1] in place of a 0-dimensional tensor, which differs from current mainstream conventions, increases the development and debugging cost of models, and sometimes leads to unintended errors. This release fixes 376 APIs that need to support 0-dimensional tensor, enabling tools widely used by the community such as EinOps. For example, the output loss in model training used to be a 1-dimensional tensor, so taking out or printing the loss required code like loss.numpy()[0]. After this change, the output loss is a 0-dimensional tensor and loss.numpy() is enough to take out or print the loss; the code is shorter, easier to understand, and in line with industry conventions (a minimal before/after sketch follows this list).
  • paddle.fluid API is fully decommissioned. As previewed in the last version, 1116 paddle.fluid APIs and related internal interfaces have been decommissioned; the remaining few related internal interfaces will be cleaned up in the next version. The fluid APIs are historical APIs that PaddlePaddle 2.0 had planned to remove but whose cleanup was delayed for compatibility and other reasons. This decommissioning does not affect programs developed on PaddlePaddle 2.0, and makes the PaddlePaddle API system more concise and easier to understand.
  • Complete the code cleanup of the old dynamic graph on the Python side. From now on, the Python side only uses the new dynamic graph to call the C++ core logic.
  • In order to unify the data-parallel training methods for static graph models, the original single-process multi-card training method is abandoned, including the paddle.static.ParallelExecutor and paddle.static.CompiledProgram().with_data_parallel() APIs, because this set of APIs only supports single-machine multi-card training, does not support multi-machine multi-card training, and has poor underlying execution performance. It is recommended to use the multi-process multi-card training method uniformly, i.e., the paddle.distributed.launch API, for distributed training with data parallelism. This upgrade affects only static graphs; it does not affect dynamic graphs or dynamic-to-static training. If you use the decommissioned APIs, please refer to the documentation on data parallelism to modify your model code. #50351, #50501, #51240, #51701, #51616, #51369, #52671
  • Remove the original adaptation code for Ascend NPU and Cambricon MLU from the framework, upgrade both to the CustomDevice plug-in adaptation, and migrate the adaptation code for Ascend NPU and Cambricon MLU to the PaddleCustomDevice repository.
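A minimal before/after sketch of the 0-dimensional tensor change described in the first item above; the toy model and data are illustrative only.

```python
import paddle

# Toy training step: the loss is a scalar.
x = paddle.rand([8, 10])
linear = paddle.nn.Linear(10, 1)
loss = paddle.nn.functional.mse_loss(linear(x), paddle.zeros([8, 1]))

# PaddlePaddle < 2.5: loss had shape [1], so extracting the value needed indexing:
#     value = loss.numpy()[0]
# PaddlePaddle 2.5: loss is a 0-dimensional tensor of shape [], so:
value = loss.numpy()   # a 0-d array; no indexing required
print(loss.shape, float(value))
```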

3. Training Framework (Including Distributed)

Python API

API supporting 0-dimensional tensor

new API

  • Add paddle.autograd.jacobian and paddle.autograd.hessian APIs for scientific computing. #53331
  • Add sparse computing APIs, such as paddle.sparse.reshape, paddle.sparse.sum and paddle.sparse.slice. #46694, #51513, #53794, #51406
  • Add other APIs, such as paddle.optimizer.LBFGS, paddle.index_put and paddle.logaddexp. #53314, #51912, #52886, #50843, #47282, #52284
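A small usage sketch for one of the newly added APIs, paddle.logaddexp, which computes log(exp(x) + exp(y)) elementwise; the input values are arbitrary examples.

```python
import paddle

x = paddle.to_tensor([-1.0, -2.0, -3.0])
y = paddle.to_tensor([-1.0, -2.0, -3.0])

# Numerically stable elementwise log(exp(x) + exp(y)).
out = paddle.logaddexp(x, y)
print(out)  # roughly [-0.31, -1.31, -2.31], i.e. x + log(2) for equal inputs
```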

Dynamic graphs

New features

  • Add paddle.nn.utils.clip_grad_norm_ to support gradient clipping, and paddle.Tensor.data_ptr to get the memory/GPU-memory address of a Tensor's data (a usage sketch follows this list). PR49935, PR48235, PR49173
  • Add the saved_tensors_hooks mechanism, for temporary storage and retrieval of forward Tensor used in backward computation. PR45763, PR46215, PR48124
  • Tensor supports the pickle protocol, enabling serialization of Tensor. PR47025, PR48179
  • Add debug logs to print the forward Python stack when nan/inf appears in the backward pass. PR53217, PR52639, PR52729
  • Add higher-order differentiation support for expand_v2, tile, concat, assign and slice. PR45941, PR45942, PR45940, PR45879, PR45960
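A hedged usage sketch of paddle.nn.utils.clip_grad_norm_ and paddle.Tensor.data_ptr from the first item of this list; the max_norm value and the toy model are illustrative assumptions.

```python
import paddle

model = paddle.nn.Linear(10, 10)
x = paddle.rand([4, 10])
loss = model(x).sum()
loss.backward()

# Clip the global gradient norm of all parameters (max_norm chosen arbitrarily).
paddle.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# data_ptr() returns the address of the tensor's underlying memory / GPU memory.
print(model.weight.data_ptr())
```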

Improvements

  • Optimize log printing for dynamic graphs, including log content, VLog level, and error reporting content. PR45783, PR46349, PR46934, PR47724
  • Add FLAGS_auto_growth_chunk_size_in_mb to set the minimum chunk size of the auto_growth allocator. PR52204
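A minimal sketch of configuring the flag above, assuming the usual pattern that Paddle reads FLAGS_* settings from environment variables at import time; the 64 MB value is only an example.

```python
import os

# Set the minimum chunk size (in MB) of the auto_growth allocator before
# paddle is imported; 64 is an arbitrary example value.
os.environ["FLAGS_auto_growth_chunk_size_in_mb"] = "64"

import paddle

x = paddle.rand([1024, 1024])  # allocations now use the configured chunk size
print(x.shape)
```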

bug fix

  • Fix bugs in some operators, including batch_norm, slice, set_value, scale, multinomial, adam, conv, transpose2_grad, conv2d_transpose_double_grad. PR47802, PR47634, PR47349, PR46124, PR46147, PR50388, PR48626, PR48519, PR50386, PR48432, PR51851
  • Fix some PyLayer bugs. PR51740, PR47154, PR47323, PR54041, PR48533
  • Ensure that sync_batch_norm runs sequentially in the backward pass, to avoid hangs or precision errors caused by misordering. PR52268, PR52860, PR52779
  • Fix a bug of linspace under AMP. PR46088
  • Fix an incorrect Python C API call that caused crashes on Windows. PR46833
  • Fix the bug that DataLoader may fail to delete /dev/shm. PR48511
  • Fix some bugs of paddle.grad. PR47151
  • Add error message for operators that do not support higher order differentiation. PR47231
  • Add NumPy array support for Python operators. PR48229
  • Remove one of the duplicate element_size APIs. PR49631
  • Fix a crash when enabling VLOG in the old dynamic graph. PR47115
  • For XPU, replace D2D copies with D2H+H2D copies to solve a multi-threading problem. PR48373

Performance optimization

Static graphs

The new static graph executor is now fully live.

The new static graph executor implements a number of functional and performance optimizations, and completes the unification and replacement of the original multiple sets of old executors. The new executor is now the default back-end execution engine for static graph single-card and distributed training at the Python-side entry, as well as for dynamic-to-static, control flow, CINN, etc. This significantly improves the scheduling performance of the framework, the functional architecture is clearer, and the secondary development capability is significantly enhanced. #45913, #46025, #48911, #50239, #45696, #46092, #48158, #51389, #49708, #49275, #48789, #49939, #51149, #52652

Operator library

Enhance functionality of custom operators

Add new function support to the custom operator extension mechanism, enabling C++ extension operation functions to be bound to the Python side, which further enhances the framework's secondary development capability. The extension supports custom hardware using the custom operator mechanism, to meet the needs of hardware vendors that implement operations not existing in Paddle. It also supports advanced mechanisms in custom operators such as inplace, vector<Tensor> output and optional<Tensor> input. Optimize scheduling performance of custom operators in dynamic graph mode, with a 25.4% performance improvement for operators with multiple input parameters. Add commonly used operators and APIs for the custom operator Tensor extension, support chained calls, and simplify code writing. Optimize the operator kernel selection mechanism. Improve the logic of some operator kernels, enhance the supported data types, and optimize performance. Add and improve 100+ XPU kernels. Fix 170+ bugs. #49222, #51773, #51923, #53080, #50731, #50563, #50840, #50983, #51713, #48733, #50558, #50764, #51973, #52216, #51027, #50745, #50756, #50886, #50813, #50869, #51085, #51646, #51620, #51844, #52421, #52872, #52597, #50582, #52114, #52915, #50928, #48272, #48702, #52191, #52191, #47374, #47375, #47378, #54126, #47638, #47661, #50606, #53528, #50599, #51727, #50825, #50773, #50979, #53336, #53555, #53716, #53753, #53981, #53977, #53980, #54043, #54066, #52866, #53043, #53325, #54323, #54367, #51353, #53749, #50013, #47570, #50997, #51241, #49537

Unification of operator architecture

Unify all remaining 350+ operator kernels of the original operator system into the PHI operator library, and unify the way operators are defined in the original operator system into the operator definition form of the PHI operator library (YAML-based configuration of operator definitions), enhancing architectural unity and reducing the comprehension cost of framework development. Decouple all Fluid header files the PHI operator library depends on and compile it independently as a dynamic link library, providing a lighter way to reuse the operator library for secondary development of the framework. Continue to standardize and adjust non-conforming operators and operator kernels in the PaddlePaddle framework, making the framework easier for developers to understand and reducing the cost of hardware integration. #47856, #49328, #49138, #52014, #52044, #52116, #52486, #52101, #52882, #53003, #53034, #51914, #49116, #52626, #52878, #52879, #52880, #52875, #51600, #51601, #51590, #51887, #51891, #52036, #52130, #52134, #51951, #51886, #52274, #52263, #51913, #52145, #52347, #52370, #52437, #52424, #52231, #52522, #52529, #52802, #52799, #52855, #52711, #52940, #53309, #47817, #48001, #48063, #48049, #48168, #48415, #48696, #48970, #50183, #50407, #50498, #50419, #50282, #50870, #50911, #50865, #51288, #53735, #47248, #47787, #52202, #47579, #49444, #45772, #51264, #51634, #51631, #47385, #46342, #47510, #47532, #47702, #47860, #49470, #50358, #49121, #50190, #52374, #52372, #52375, #52371

Dynamic-to-static plus combinator

New features

  • Add the combination rules for combinators such as dropout, silu, stack, relu, expand, unsqueeze, pow, squeeze, meshgrid, batch_norm, layer_norm, group_norm, instance_norm, full_like, split, split_with_num, gelu, mean, flatten, rsqrt, hardswish #50497, #50838, #50861, #50819, #50810, #51527, #51070, #51539, #51061, #49894, #50422, #51874, #51341, #50295, #50298, #50672, #51432, #51003
  • Add the vjp rule for combinators such as gather_nd, reduce_max, group_norm, relu, reduce_max, gather, topk, sqrt, elementwise_pow, softmax, batch_norm, prod, multiply, expand, div, relu, slice, cumsum, sigmoid, layer_norm, sin, cos, roll, instance_norm, abs, assign, tile, scatter_nd_add, erf, floor, log, silu, leaky_relu, pad #50966, #51653, #52663, #51742, #52203, #50794, #50305, #50786, #50679, #51045, #51230, #51474, #51283, #51238, #49831, #51838, #50771, #50565, #51768, #51750, #51748, #52532, #52935, #50963, #51430, #53141, #52469, #50436, #51059, #51296, #52533, #53374
  • Add the second-order differentiation rule for combinators such as matmul, tanh, and elementwise #50452, #52192, #53014
  • Add the bf16 datatype support for combinators such as exp, reduce_mean, softmax, divide, cast, layer_norm, prod, meshgrid, expand_as, dropout, concat, gather_nd, elementwise_max, elementwise_pow, reduce_max #54263#54236, #53865, #54175, #54399
  • Add support for assigning semantics to containers in control flow in dynamic-to-static. #51248
  • For to_static, add full graph fallback function. When dynamic-to-static conversion fails, the whole graph can fall back to the dynamic graph mode of execution. For the fallback mechanism, add the set_eval_frame API. #50111, #52006
  • For to_static, support the combinator mechanism. Support the scenario of using register_hook under to_static decoration; #49836, #52948, #53572
  • Add a backend parameter to the to_static API, which can be set to CINN or None. When it is set to CINN, the CINN compiler will be used to accelerate training and inference (a hedged sketch follows this list). #52596
  • Add the code automatic generation function for the primitive API. Based on operator definitions in ops.yaml and legacy_ops.yaml, automatically generate code for the primitive API. Automatically generate the Tensor computation API. #50315, #49654, #50642
  • Add the function of forward combination of operators. By registering the combination rules of forward operators, it can split forward operators into base operators. #49605
  • Add combinator switches. You can set environment variables in the shell to split operators in different ways. #50309
  • Add OpTest combination test function to guarantee accuracy of operators. Add elementwise class base operator unit test. Add batch_norm CINN unit test. #50509, #50807, #52815
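A hedged sketch of the new backend parameter of to_static mentioned above, assuming paddle.jit.to_static accepts backend='CINN' and that Paddle was built with CINN support; the decorated function is a trivial example.

```python
import paddle

def fn(x):
    return paddle.nn.functional.relu(x) * 2.0

# backend can be 'CINN' or None; with 'CINN', the CINN compiler is used to
# accelerate the converted static graph (requires a CINN-enabled build).
static_fn = paddle.jit.to_static(fn, backend='CINN')

x = paddle.rand([4, 8])
print(static_fn(x).shape)
```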

Improvements

  • Add combinator to support FP16 operation and AMP O1 operation. Add AMP logic for softmax and layer_norm operators. #52397, #52598, #51473
  • Simplify combination rules and vjp rules of the combinator batch_norm. #54012, #51827, #51933,
  • Optimize combination rules for combinators, and improve the performance of combination rules involving scalars. Optimize log printing for combinators. #51960, #50160
  • Combinator supports the jit.save API. Add custom VJP rule API. #52344, #50885
  • Remove the overwrite parameter from combinator gather_grad. #52707
  • Clean up dynamic-to-static code style, optimize error message, and standardize logs. #48637, #46128, #52527, #46800,#46415
  • For dynamic-to-static, call append_backward to obtain the grad var name, to fix an error in higher-order gradient computation. #53250
  • Upgrade the dynamic-to-static function, and clean up the temporary directory of to_static to speed up code conversion. Enhance to_static to automatically skip internal API. Support use of to_static decorator in the program. #47102, #50596, #45768
  • For dynamic-to-static, optimize the print function conversion to support printing Tensor parameters at the model-building stage. Upgrade the parameter collection mechanism. #48672, #50336

bug fix

Performance optimization

  • Add scope caching and reuse mechanism during execution of run_program_op in dynamic-to-static, to avoid passing new scope for each step. #45813

Distributed training

Dynamic graph distributed training

Automatic parallel

  • Improve semi-automatic parallel for static graphs:
    • Add FLOPs computation function for multiple operators, and add computation Cost modelling based on FLOPs. #48083,#47978,#47595,#48083,#48084,#47816
    • Improve API ease of use. Improve the DistAttr, Process Mesh, Engine API, information printing, and input/output modules. Implement the new Engine cost API, which can be used to theoretically analyze model running time and GPU memory overhead. #47503, #46416, #46554, #46633, #49214, #53848, #46552, #47043, #49665, #52912, #45776, #47263
    • Optimize the generality and ease of use of Pass. Support more scenarios, and reduce time spent on Pass pre-analysis. #46519,#47358,#46391, #51035
    • Enhance debugging capabilities with distributed randomness control mechanisms and hybrid parallel precision alignment tools. #52903,#49865
    • Support automatic sharding of inference generation task networking. Adapt special usage of control flow and conditional block in the generation model. #46771, #54067
    • Improve grad_clip to support load balancing in data parallel scenarios. #49510, #49249
  • Semi-automatic parallel performance improvement for static graphs:
    • Add the Sharding Pass automated communication Fuse and multi-streams communication functions, with throughput performance improved by 26% on two machines for GPT 6.7B model. #48604, #47180,#46180
    • Add Recompute optimization strategy tuning function. Select optimal recompute checkpoint settings based on GPU memory and model size. #48608, #47846, #49010
    • For the pipeline parallel, add 1F1B scheduling optimization Pass #54260, #45915
    • Optimize data parallel. Support optimizations such as fused communication and communication-computation overlap, with performance improved by 5% on the GPT 1.3B model. #48092, #45643, #49744, #47578
    • Optimize the concat performance of the Reshard module, reducing the number of concats in some scenarios. #47809
    • Optimize mixed precision: upgrade Pass performance, support BF16 low precision, and adapt auto mixed precision to the while loop control flow. #51285, #51147, #49219, #49079
  • Improve function of fully automatic parallel for static graphs:

Parameter server

  • Clean up the all list in the ps directory, in which APIs are not exposed #51289
  • Clean up cvm operator #48989
  • For GPUPS, add support for AFS. #46611
  • Lower the log level of PGLBOX2.0, fix a hang issue with dense parameters, fix the bug that barrier does not take effect, and add the get_epoch_finish Python-side interface #49946, #50166, #50349
  • Support switching GPUPS runs to a specified mode. #51115
  • GPUPS is added to benchmark. #49587,#49649
  • Fix the GPUPS optimizer selection bug, fix reader reading problem, and fix RPC compilation problem. #47026,#47192,#49878, #46356,#46575,#49389,#46258,#50136
  • Add rocksdb compilation method. #46074

CUDA

New features

  • Add compilation support for CUDA 12.0. Fix related unit test. (#49539, #54542)
  • Add CUDNN Frontend API compilation support and related unit test. You can use WITH_CUDNN_FRONTEND=ON compilation option for start. (#47524, #47612)

Improvements

bug fix

  • Fix computation errors in several operators such as trace, roll, dropout_nd and log_softmax, a stack overflow, and some unit test errors. (#50243, #52012, #53795, #53149, #53654, #51054, #49373, #53038)
  • Fix the problem that conv operator exhaustive search does not work in some scenarios. (#47065)
  • Fix timeout problem of collective_reduce_scatter and other operators on A100. (#54513)
  • Fix the problem of attribute error in FusedLinear unit test. (#50359)
  • Fix the OOM problem that may occur when using Profiler. (#46089)

Performance optimization

Intermediate Representation

In order to guarantee the stability of the IR system and reduce its R&D cost, we have developed a new IR system for PaddlePaddle, completing the basic data structure definitions, operator definition generation, and execution system adaptation. To better support the higher-order requirements of scientific computing scenarios, higher-order adaptation of operators such as silu and cast is completed.

CINN compiler

New features

  • Add CINN support for 0D-Tensor. At present, in order to cooperate with the upgrade of the main framework, it is supported by adding pass temporarily. We will replace and upgrade the solution later. (#53382, #53955, #54064, #54118, #54216, #53454)
  • Add CINN support for int8/uint8/int16/uint16/bf16 data types. (#50566, #53637)
  • Add support for the CINN expand operator. (#46776)
  • Add CINN support for PaddleInference. (#45009)

Improvements

  • For CINN compiler, pass skip_gc_vars attribute to CINN subgraph. CINN adds fetch operator for skip_gc_vars. #49471, #49553
  • For CINN compiler, conv2d and conv2d_grad do not use cinn operator by default. #51645
  • Add build_cinn_pass to BuildStrategy for use in dynamic-to-static (a hedged sketch follows this list). (#49496)
  • Add a unit test for the reshape operator under the combinator mechanism. (#51276)
  • Change version of the main framework binding CINN from fixed commit to develop. (#49775)
  • Set default Target parameter for CINN. (#50182)
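A hedged sketch of enabling build_cinn_pass through BuildStrategy for dynamic-to-static, as mentioned above; it assumes paddle.jit.to_static accepts a build_strategy argument and that Paddle was compiled with CINN.

```python
import paddle

def fn(x):
    return paddle.tanh(x) + 1.0

# Enable the CINN subgraph pass via BuildStrategy (requires a CINN-enabled build).
build_strategy = paddle.static.BuildStrategy()
build_strategy.build_cinn_pass = True

static_fn = paddle.jit.to_static(fn, build_strategy=build_strategy)
x = paddle.rand([4, 8])
print(static_fn(x).shape)
```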

bug fix

  • Fix the problem of inconsistent operator order after topology sorting during CINN symbolization. (#52556)
  • Fix some operator computation errors, accuracy degradation, and unit test related problems. (#53859, #54261, #46801, #53676, #53772)
  • Fix the problem of CINN support for float16 type. (#48249)
  • Fix the problem in build_cinn_pass. (#46843)
  • Fix the problem of no data area due to incorrect GC when CINN is turned on during combinator + dynamic-to-static. (#50116)
  • Fix the problems of compiler dropout amp error, combinator resnet error, and inplace variable not found #51688, #52813, #51769

Performance optimization

  • Optimize reshape related fusion strategy (#53066)
  • Optimize performance of BuildCINNPass. (#49696)
  • Optimize performance of subgraph detection module. (#45040, #46937)

Hardware support

CustomDevice

  • Add support for the distributed strategies MP/Sharding/PP/MoE and recompute on the training side, and for the distributed strategy MP on the inference side. Hardware such as Ascend NPU and Cambricon MLU accessed through CustomDevice automatically inherits all newly added distributed strategies of CustomDevice without any code changes. #52872, #54384, #53220, #54572, #54573, #54676, #53044, #53719, #53701, #53702, #53703
  • Add the API paddle.device.is_compiled_with_custom_device, so that users can conveniently check whether the current environment supports the plug-in device backend of a certain hardware (a hedged sketch follows this list). #49271
  • Add the environment variable CUSTOM_DEVICE_BLACK_LIST, so that blacklisted operators automatically run heterogeneously on the CPU. #50409, #50666
  • Optimize CustomDevice performance by reducing number of calls to get_device_count interface in runtime. #46963
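A hedged sketch combining the new paddle.device.is_compiled_with_custom_device API and the CUSTOM_DEVICE_BLACK_LIST environment variable described above; the device name 'npu' and the blacklisted operator are placeholders.

```python
import os

# Operators listed here are automatically run heterogeneously on the CPU
# ('softmax' is only an illustrative placeholder).
os.environ["CUSTOM_DEVICE_BLACK_LIST"] = "softmax"

import paddle

# Check whether the current build supports a given plug-in device backend
# ('npu' is a placeholder name registered by a PaddleCustomDevice plug-in).
if paddle.device.is_compiled_with_custom_device("npu"):
    paddle.set_device("npu")
else:
    paddle.set_device("cpu")

print(paddle.device.get_device())
```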

KUNLUNXIN XPU

4. Deployment Direction (Paddle Inference)

New features

  • Support sharing GPU memory among multiple Paddle-TensorRT subgraph engines, or among TensorRT engines of different Predictors, in order to save GPU memory. #45842, #47631
  • For the C++ API, add APIs to obtain the shape and data type of input Tensors and of output Tensors. For the C API, add SetExecStream, EnableMkldnnInt8 and other APIs that already exist in C++, for service-oriented deployment. #49758
  • Add the paddle.inference.Predictor.register_output_hook() API, which supports printing the output of each layer under GPU inference for debugging, and can be used in control-flow models such as While. Note that this API does not support Paddle-TensorRT. #54433, #47050, #54254
  • The Paddle Inference Predictor API supports paddle::Tensor as input and output, so users can directly reuse PaddlePaddle dynamic graph code for pre-inference and post-inference processing. (#50445)
  • Enhance Paddle-TensorRT dynamic shape support: the config.enable_tuned_tensorrt_dynamic_shape() API can now build the TensorRT Engine at runtime without passing any parameters, so it is unnecessary to collect shape information before running. To avoid rebuilding at runtime, the first several runs need to cover the minimum and maximum shapes (a hedged sketch follows this list). #52162
  • Paddle-TensorRT supports model input in NHWC format. #49633
  • Extend config.Exp_DisableTensorRtOPs API to disable access to TensorRT by specifying the name of the Tensor variable. #49497
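A hedged sketch of the runtime TensorRT dynamic-shape tuning described above, assuming a typical paddle.inference setup; the model/parameter file names and the default enable_tensorrt_engine arguments are illustrative.

```python
from paddle.inference import Config, create_predictor

# Hypothetical model files.
config = Config("model.pdmodel", "model.pdiparams")
config.enable_use_gpu(256, 0)

# Enable the TensorRT subgraph engine (defaults used for brevity).
config.enable_tensorrt_engine()

# New in 2.5: build the TensorRT engine at runtime without passing any
# parameters or pre-collected shape information. Warm up with several runs
# covering the minimum and maximum input shapes to avoid later rebuilds.
config.enable_tuned_tensorrt_dynamic_shape()

predictor = create_predictor(config)
```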

Improvements

  • Enhance GPU mixed-precision inference (non-Paddle TensorRT scenarios): the Config.enable_use_gpu API is enhanced so that the precision type can be set (a hedged sketch follows this list). #47993
  • Support double type input for inference. #51786
  • Since TensorRT does not support the INT64 type, models containing INT64 data used to fail to run. Paddle-TensorRT is enhanced to automatically convert such models so that they run in INT32 when INT64 data types are present. #45547
  • Paddle-TensorRT supports more operators for TensorRT inference, including:
  • Enhance Paddle-TensorRT mapping operators strided_slice, instance_norm, prelu, argmax, cast, nearest_interp_v2, elementwise, bilinear. #46819, #47998, #48043, #48998, #49675, #47495
  • Some Paddle-TensorRT operators (scale, square, sum, swish, expand_as_v2, prelu, gelu, hard_swish, hard_sigmoid, leaky_relu, softmax, stack, clip, cast, flatten_contiguous_range, unary, equal, elementwise_op) support 0-dimensional Tensor. #53660, #53627, #53634, #53714, #53729, #53769, #53506, #53704
  • Support compilation for versions earlier than GCC12 + CUDA 12.0. #50106
  • Paddle-TensorRT's DeformableConv plugin supports dynamic Shape input. #50698
  • For Paddle-TensorRT, add plugin support for lookup_table operator. #46613
  • Add config.enable_low_precision_io() API to support low-precision type input in Paddle-TensorRT scenario. #52485
  • Paddle-TensorRT's LayerNorm plugin supports FP16 computation. #45043
  • Predictor's input data paddle_infer::Tensor supports bool type. #49388
  • Paddle-TensorRT enhanced Convolution implementation uses ConvolutionNd. #47653
  • conv2d_fusion operator supports NHWC format. #49047
  • Adjust the directory structure related to Phi operators under C++ inference library. #53091
  • Support rebuilding TensorRT Engine instead of reporting errors when TensorRT serialization and loading versions do not match. #50775
  • Optimize Paddle-TensorRT runtime to print log messages. #50181
  • Support elementwise 0-dimensional Tensor inputs for oneDNN-based CPU inference. #51656
  • Clean up and normalize support for Paddle-TensorRT's FC, matmul, matmul_v2 operators, and unify and upgrade to use TensorRT's IMatrixMultiplyLayer for support. #52222
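A hedged sketch of the enhanced Config.enable_use_gpu call from the first item of this list, assuming it accepts an additional PrecisionType argument (Half here requests FP16 mixed-precision inference without TensorRT); the file names are placeholders.

```python
from paddle.inference import Config, PrecisionType, create_predictor

config = Config("model.pdmodel", "model.pdiparams")

# Assumed enhanced signature: memory pool size (MB), device id, precision type.
config.enable_use_gpu(256, 0, PrecisionType.Half)

predictor = create_predictor(config)
```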

Performance optimization

  • Support fusing multiple lookup_tables into Paddle-TensorRT's Embedding+Eltwise+LayerNorm fusion. #46243, #46230
  • Add MoE fusion Phi operator to improve inference performance of MoE model. #48703
  • In the scenario of INT8 quantized inference, Paddle-TensorRT plugin can fall back to FP16 computation, instead of FP32 computation. #50554
  • Optimize memory and GPU memory usage during inference. #49051, #49046, #53930
  • Optimize Layout and enhance Pass. #52997
  • Support caching of operator Shape inferences to improve model inference performance. #48312
  • Optimize bias+add+relu fusion using half2 instructions. #49048
  • Optimize Concat Kernel for multiple inputs using vectorization operations. #49540
  • Implement Convolution, Depthwise Convolution and related fusion operators based on CUTLASS to improve inference speed. #47989#50603#51792#50603
  • Paddle-TensorRT supports FlashAttention’s plugin, to improve inference speed of models such as StableDiffusion. #49438
  • Add Transpose+LayerNorm fusion PASS, to improve inference speed of models such as StableDiffusion. #50082
  • Add Elementwise+Transpose fusion. #50081
  • Optimize Paddle-TensorRT Group Norm plugin implementation. #49160
  • Add a use_cuda_graph parameter to the Config.EnableTensorRtEngine() API to enable CUDA Graph and reduce runtime overhead; note that the model input shape must remain unchanged during usage. #53406
  • Support inplace operation of Reshape, to reduce copying time of the model at runtime. #49146
  • Optimize LayerNorm kernel implementation based on oneDNN. #47782
  • Support fusion of quantize+transpose and transpose+dequantize based on oneDNN. #49509
  • When MKLDNN is turned on in CPU inference, FC-related fusion pass is enabled by default, to improve performance. #45704
  • CPU oneDNN inference supports squeeze2 + transpose2 fusion. #47592

XPU inference enhancement and performance optimization

  • Add the ExpRunWithRuntimeConfig API and XpuRuntimeConfig, to allow setting parameters such as external streams and L3 cache during inference. The GetExecStream API supports obtaining Kunlun external stream objects. Input and output support Kunlun device memory, reducing D2H and H2D overhead. #53334, #52466, #53240
  • Add multi-encoder, fused_multi_transformer and fusion pass, to improve performance of ERNIE and Transformer class models. #50570#51346#50499#53982#50759#51571#53144#53306
  • Optimize BeamSearch performance. Transform, remove and fuse fine-grained operators such as write_read_array and gather, to improve model performance when beam_size=1. #53130
  • Transform multiple stack operators with the same input into unsqueeze operators that support broadcast. Unsqueeze/squeeze supports inplace computation. #52099
  • Add support for exporting multi-card inference models for Kunlunxin. #50490
  • Add the embedding_with_eltwise_add fusion pass and operator phi kernel, to reduce device memory usage and improve inference performance. #50590
  • interpolate class operator phi kernel supports FP16. #52358
  • argmax operator supports INT32 type output. #51303
  • Fix the error of having only a model file when saving a serialized model after turning on mixed-precision inference mode. #52994
  • Fix a segmentation fault in instance_norm when scale and bias are empty. #52627
  • conv_transpose operator supports FP16. #53626
  • Add yolo_box_xpu fusion pass and operator phi kernel, to optimize YOLO model generic substructure. #54163
  • Add conv2d_xpu fusion pass and operator phi kernel, and support FP16 inference, to optimize convolution operation inference consumption time. #52247#53626
  • Add the sigmoid_elementmul generic fusion pass, which fuses into the swish operator and matches the conv2d_fusion pass, to improve YOLO model inference performance. #53580
  • Add act_add fusion pass and operator phi kernel to improve inference performance. #53965
  • Add fold_interp_outsize fusion pass, to improve inference performance. #54245
  • Solve the problem of incorrect results due to duplicate fusion when there is shared weight in FC. #51108#51039
  • Remove the op_device attribute, which is only used for training, to prevent the wrong training place from being chosen during inference. #51029
  • Support saving of optimized models, allowing PASS optimization to be skipped in case of re-inference, to reduce first time inference time. #53696
  • Solve the problem of computation error caused by the CPUPlace input of operator Kernel being forced to copy to XPU. #51306
  • Sub-blocks support copying parameters H2D ahead of time, to improve inference performance. #51876
  • Fix scale memory size of the output activation of Kunlunxin 2nd generation chip. #53505
  • Support asynchronous execution of Kunlunxin D2D copies in the new executor. #51876
  • Remove concat operator with only one input. #52304
  • lookup_table_v2 supports FP16 to remove redundant cast operator. #52888
  • Control flow While operator supports caching scope, to reduce overhead of creating new scope every time. #52628
  • Scatter newly supports FP16, to remove redundant cast operators and elementwise_mul operators with an input of 1. #52831

Model quantization

  • Upgrade of dynamic graph quantization function.
    • Add a new API for quantization-aware training of dynamic graph models: paddle.quantization.QAT. It supports passing quantization-related parameters through a configuration object, simplifying the quantization training process and reducing the difficulty of secondary development (a hedged sketch follows this list). (#49398)
    • Add a new offline quantization API: paddle.quantization.PTQ. Support exporting the quantized model to a model format supported by inference. (#50107)
    • Add STUB operator to simulate actual quantization operation during training process. (#50510)
  • Support loading the parameters of an offline quantization model into a quantization-aware training model. Support more operators for quantization, including matmul, scale and conv1d. #47892, #45911, #48912
  • Support hybrid parallel training of static graph quantization training. #52219
  • Fix problems in the dynamic graph quantization process:
    • Fix repeated insertion of quantization nodes when exporting quantization-aware training models. #48751
    • Fix the problem of inserting quantization nodes into model input. #49926
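A hedged sketch of the new dynamic graph quantization-aware training flow built on paddle.quantization.QAT, as described above; the QuantConfig arguments and the FakeQuanterWithAbsMaxObserver quanter are assumptions about the configurable design, and the model is a toy example.

```python
import paddle
from paddle.quantization import QAT, QuantConfig
# Assumed quanter class; any quanter registered under paddle.quantization
# could be plugged in here instead.
from paddle.quantization.quanters import FakeQuanterWithAbsMaxObserver

model = paddle.nn.Sequential(paddle.nn.Linear(16, 16), paddle.nn.ReLU())

# Quantization-related parameters are passed through a configuration object.
quanter = FakeQuanterWithAbsMaxObserver(moving_rate=0.9)
q_config = QuantConfig(activation=quanter, weight=quanter)

qat = QAT(q_config)
quant_model = qat.quantize(model)  # insert fake-quant observers for training
print(quant_model)
```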

5. Environment Adaptation

Improve the efficiency of source code compilation, and promote the setuptools + ninja compilation method to increase development efficiency: in the CPU scenario, full compilation time is reduced by 20 minutes and compilation speed is increased by 24.52%; in the GPU scenario, full compilation time is reduced by 22 minutes and compilation speed is increased by 29.31%. In order to adapt to mainstream development environments, PaddlePaddle supports GCC 12 compilation and C++17 in the source code, and adapts to the latest CUDA 12. In terms of code quality, compilation warnings are fully cleaned up to improve the compilation experience. At the third-party dependency level, we have upgraded the underlying protobuf version to reduce dependencies, cleaned up deprecated attributes of some earlier-version dependency libraries and old code formats, and removed support for Python 2.x.

6. Security

Thanks to our Contributors

This release contains contributions from: 1want2sleep, 201716010711, 404988613, 5u13, 6clc, Ackeraa, Aganlengzi, ahahahahahaha, Ainavo, Allen Guo, andyj, Asthestarsfalll, Aurelius84, Ayuan, BellaZYL, Bjmw3, Bo Zhang, bukejiyu, caozhou, carryyu, Ccc, ccrrong, ceci3, chalsliu, Chang Xu, CHANGer, Charles-hit, Chen Weihang, chenjian, Chenxiao Niu, chenxiao120660, chenxujun, Chitsing KUI, cifar10, co63oc, CollaborativeFiltering, csy0225, cxxly, cyber-pioneer, cyberslack_lee, czr-gc, Dandelight, danleifeng, Danyang Zhang, dasen, denglianbin, Difer, dongfangshenzhu, DrowFish19, duanboqiang, duanyanhui, engineer, engineer1109, Epsilon Luoo, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, Fisher, FlyingQianMM, Frank Lin, Galaxy1458, GaoYuYang, gaoziyuan, gem5, GGBond8488, Ghost Screaming, gongenlei, gouzil, Guanghua Yu, Guo Sheng, Guoxia Wang, Hamid Zare, Hanchiao, handiz, Haohongxiang, haosicheng, haozi, Happyd99, heliqi, hellockx, hellolllw, heyanru, hg-1099255210, hh-qiao, hjyp, hong, HongyuJia, houj04, hua-zi, Huang Jiyi, Huang Zhengjie, huangjiyi, huangjun12, Hui Zhang, Huihuang Zheng, Hulek, hwa, HydrogenSulfate, Ikko Eltociear Ashimine, iLeGend, Infinity_lee, Infrared1029, Jacek Czaja, jakpiase, james, jameszhang, Jiabin Yang, jiahongyu, jiangcheng, jiangfan06, Jianghai, jiaqianjing, jingsongliu, JingZhuangzhuang, jjyaoao, joanna.wozna.intel, junxiu777, Jx-qi, JYChen, JZ-LIANG, jzhang533, Kai Song, Kai Xing, Kaipeng Deng, Kang Zhao, kangguangli, Kevin Wu Jiawen , Kim, Kim Yann, knamg, kuizhiqing, lanxianghit, Leding Li, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, Ligoml, lijialin03, lijin23, limingshu, Lin Manhui, LinearTemporalLogic, Linjie Chen, lishicheng1996, Little-chick, littleforest, liu zhengxi, liulinduo, liuruyan, liuzhenhai93, LiYuRio, lj970926, LokeZhou, LoneRanger, lubiu, Lucas, lugimzzz, Lux et Veritas, lxsbupt, LyndonKong, lzy, lzydev, Mahmoud Ashraf, Manan Goel, Maple Xie, Matsumoto Ruko, mayang002, MayYouBeProsperous, megemini, mengziheng, Meteor Liu, mhy, mhy-666, Ming-Xu Huang, ming1753, minghaoBD, mjxs, Moqim, Mountagha, Mr.Juice, mrcangye, NetPunk, Netpunk, nihao, niuliling123, Nyakku Shigure, OccupyMars2025, Ouyang Chao, pangengzheng, pangyoki, parap1uie-s, Paulina Gacek, Piotr Paturej, PommesPeter, PPGitub, PPPPzhang, PuQing, Qi Li, Qi Shao, QingshuChen, qipengh, qizhaoaoe, Rayman, RedContritio, RichardWooSJTU, risemeup1, Roc, ronnywang, Ruibiao Chen, Ruibin Cheung, RuohengMa, Ryan, SaltFish11, Sanbu, Scotty, scotty, seemingwang, Shaojie WANG, ShenLiang, shentanyue, Shijie, Shuangchi He, Siming Dai, Sing_chan, sneaxiy, Sonder, sprouteer, Sqhttwl, sunli, superwinner1, supplyout, SylarTiaNII, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao Luo, Taylor-Layrose, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, Tian, Tian Zheng, tiancaishaonvjituizi, tianshuo78520a, tifa, Tinson Lai, Tomasz Socha, Tony Cao, ucsk, umiswing, ustiniankw, Vegetable dog, Vigi Zhang, Vvsmile, Wang Bojun, Wang Xin, Wang Xinyu, wangfengsheng1999, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, wangshengxiang, wangxiaoning, wangxinxin08, Wangzheee, WangZhen, wangzhen38, wasupandceacar, wawltor, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, wentao yu, wenzhe.wang, westfish, whisky-12, whs, Wilber, will-jl944, winter-wang, Winters Montagne, WJJ1995, wuhuachaocoding, wuyefeilin, wz1qqx, XiangGao, xiaoguoguo626807, xiaohemaikoo, xiaoluomi, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiaoyuanzi914, Xinger, Xinyu Chen, xiongkun, xjmxyt, xu98bin, xysheng-baidu, yangguohao, 
yangjianfengo1, YangQun, YangZhou, yeliang2258, YepKong, Yichen Zhang, yikaikkk, Yiqun Liu, yjphhw, ykkk2333, Young-Flash, yu wentao, Yuang Liu, Yuanle Liu, YuanRisheng, yuchen202, yuehuayingxueluo, YuhangLi, Yulong Ao, YUNSHEN XIE, yunyaoXYY, YuRonan, zachary sun, ZeKai Zhou, Zenghui Yuan, zengshao0622, Zero Rains, Zhan Rongrui, Zhang Jun, Zhang Na, Zhang Ting, Zhang Zheng, zhangbo9674, ZhangDY-6483, zhangkaihuo, zhangxin81, zhangyikun02, zhangyingying520, zhangyuqin1998, zhaocaibei123, zhaoyingli, Zhen Wang, Zheng-Bicheng, Zhenghai Zhang, Zheng_Bicheng, zhenyun, Zhibao Li, zhiboniu, Zhong Hui, Zhou Wei, ZhouMengLei1999, zhoutianzi666, zhouzj, zhupengyang, zhurou603, zhuyipin, zhwesky2010, ziyoujiyi, zlsh80826, Zman, zmxdream, zqw_1997, Zuza Gawrysiak, zxcd, zyfncg, ZZK, zzk0, Ding Yi, Fu Jianhan, Liu Ge Gu Tou, Lu Lin, Zhou Zhouzhou, Jiang Yongyong, Xue Zhawu, Zhang Chunqiao, Zhang Zhenghai, Ning Meng Wei, Wang Mingdong, Shi Xiaowei, Chao Ji Ma Niu, Chen Cangye, Qi Ma Xiao Mao
