2.4.2 Release Note

This release fixes known issues and adds a small number of features.

Training Framework (including distributed training)

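Fix the error raised when paddle.utils.dlpack.to_dlpack creates dlpack objects repeatedly inside a for loop, and fix the reference-counting error that caused the memory a dlpack object points to to be destructed. #50138
Fix out-of-bounds memory access in the paddle.multiplex API when the input Tensors are multi-dimensional, and add a check mechanism. #49368
Introduce cutlass and implement a fused gemm+gather+scatter kernel; optimize the training and inference performance of sparse conv; optimize the inference performance of batch_norm on 1D input data. #50118
Fix the compilation failure in the gcc54 environment caused by the use of constexpr. #50421
Migrate the Kernel of the sum op to the PHI operator library, and fix the bug that infermeta could not get the correct dim of SelectedRows. #49342
Fix occasional compilation errors caused by incorrect inclusion of eigen header files. #48157
Fix out-of-bounds memory access in the fold operator with large batch-size (bs) inputs. #49491
Add a type check to resolve the pipeline-parallel hang caused by inconsistent dimensions when sending tensors. #50337
Fix the bug that the output of the backward operator may be None when the output-gradient parameters of a custom operator are not in contiguous order. #48656
Fix the bug that paddle.squeeze_ produces wrong results because the shape is modified repeatedly during the inplace operation. #49903
Fix the problem that a Layer without parameters cannot call backward in dynamic-to-static mode. #49812
Fix the compilation problem with CUDA11.8 on Windows. #50205
Fix the unsupported error of FusedDropoutActBiasGrad on H100. #47285
Add the debug_graphviz_path option to build_strategy. #46531
Fix the unclosed popen object. #47053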
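
The paddle.multiplex fix (#49368) concerns row selection driven by an index. As a rough illustration of why a bounds check matters, here is a pure-Python sketch of the documented selection rule, out[i] = inputs[index[i]][i]. It is not Paddle's kernel: the function name multiplex_reference is hypothetical, the index is simplified to a flat list of integers, and the checks shown are an assumption about what a reasonable check might cover.

```python
def multiplex_reference(inputs, index):
    """Hypothetical pure-Python reference: out[i] = inputs[index[i]][i].

    inputs: list of candidates, each a list of rows (all with the same row count)
    index:  flat list of ints; index[i] names the candidate that supplies row i
    """
    if not inputs:
        raise ValueError("inputs must contain at least one candidate")
    num_rows = len(inputs[0])
    for candidate in inputs:
        if len(candidate) != num_rows:
            raise ValueError("all candidates must have the same number of rows")
    if len(index) > num_rows:
        raise IndexError("index has more entries than the candidates have rows")

    out = []
    for row, chosen in enumerate(index):
        # Without this check, a bad index would read past the candidate list.
        if not 0 <= chosen < len(inputs):
            raise IndexError(
                f"index[{row}]={chosen} is outside [0, {len(inputs)})"
            )
        out.append(list(inputs[chosen][row]))
    return out


first = [[1, 2], [3, 4]]
second = [[5, 6], [7, 8]]
selected = multiplex_reference([first, second], [1, 0])
print(selected)  # [[5, 6], [3, 4]]
```

Row 0 comes from the second candidate and row 1 from the first. An index entry of 2 would raise IndexError here instead of reading invalid memory.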

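Improve mixed-precision inference and make it more stable. Refactor the underlying implementation of the two-stage convert_to_mixed_precision interface, and add a precision parameter to enable_use_gpu to support the one-stage approach. #49077, #49239, #49477
Support compilation on the Jetson Ampere architecture. #49364
Fix the precision problem of the fc kernel in low-precision mode. #49781
Fix the incorrect type of the trt workspace parameter in the CAPI. #48350
Fix the inference error that occurs because arg_max and arg_min in Paddle 1.x lack the flatten and dtype parameters. #49771
Fix the loss of lod logic information after the split infermeta refactoring. #49745
Fix the incorrect setting in the constant-folding pass that left conv2d weights non-persistable after folding, so that they did not enter the TensorRT engine. #50105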

V2.4.2 fixes known bugs and adds a small number of features.

Fix the error raised when the paddle.utils.dlpack.to_dlpack API creates dlpack objects multiple times in a for loop, and fix the reference-counting error that caused the memory pointed to by a dlpack object to be destructed unexpectedly. #50138
Fix out-of-bounds memory access in the paddle.multiplex API when the input tensor is multi-dimensional, and add a check mechanism. #49368
Fix the occasional compilation error caused by incorrect referencing of the Eigen header file. #48157
Fix the bug that the output value of the backward operator may be None when the output-gradient parameters of a custom operator are not in contiguous order. #48656
Add cutlass and implement the fused gather+gemm+scatter kernel; optimize the training and inference performance of sparse convolution; optimize the inference performance of batch_norm on 1D input data. #50118
Fix the compilation failure in the gcc54 environment caused by using constexpr. #50421
Move the sum op kernel to PHI and fix the bug that infermeta cannot get the correct dims of SelectedRows. #49342
Fix out-of-bounds memory access in the fold operator with large batch-size (bs) input. #49491
Fix the problem that a Layer without parameters cannot call backward in dynamic-to-static mode. #49812
Fix the compilation problem of CUDA11.8 on the Windows platform. #50205
Fix the unsupported error for FusedDropoutActBiasGrad on H100. #47285
Add the debug_graphviz_path option to build_strategy. #46531
Fix the unclosed popen object. #47053
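
The last item (#47053) concerns a popen object that was never closed. The Paddle change itself is not reproduced in this note. As a general illustration only, the sketch below shows one common way to make sure a child process's pipes are released in Python, using subprocess.Popen as a context manager. The helper name run_and_capture is hypothetical.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command and return (exit_code, stdout_text).

    Using Popen in a `with` block closes the pipe file objects and waits
    for the child on exit, even if an exception is raised inside the block.
    """
    with subprocess.Popen(args, stdout=subprocess.PIPE, text=True) as proc:
        output, _ = proc.communicate()
    return proc.returncode, output


# Launch the current interpreter so the example does not depend on any
# platform-specific shell command.
code, out = run_and_capture([sys.executable, "-c", "print('ok')"])
print(code, out.strip())
```

The same pattern applies to os.popen, whose return value also supports the `with` statement.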

Improve the functionality and stability of mixed-precision inference. Refactor the underlying implementation of the two-stage convert_to_mixed_precision interface, and add a precision parameter to the enable_use_gpu interface to support the one-stage approach. #49077, #49239, #49477
Support compilation on the Jetson Ampere architecture. #49364
Fix the precision problem of the fc kernel in low-precision mode. #49781
Fix the incorrect type of the trt workspace parameter in the CAPI. #48350
Fix the inference error that occurs because arg_max and arg_min in Paddle 1.x lack the flatten and dtype parameters. #49771
Fix the loss of lod logic information after the split infermeta refactoring. #49745
Fix the bug in the constant-folding pass that caused conv2d weights to be non-persistable after folding and therefore not enter the TensorRT engine. #50105