
PaddlePaddle 1.6.0

Released by @XiaoguangHu01 on 01 Nov 12:53


Release Notes

This version focuses on comprehensive optimization of user experience and usability. It adds and optimizes a large number of OPs, further improves the performance of basic training, inference, and large-scale distributed training, and fully upgrades the model libraries and supporting tool components.

Important Updates

  • Dedicated improvements to user experience and ease of use, including comprehensive document optimization, error message optimization, configuration option optimization, interface optimization, compilation optimization, multi-platform support, and an all-round improvement of programming usability.
  • The training framework is further optimized for speed, the GPU memory optimization mechanism has been improved, and custom C++/CUDA operators can now be defined outside the framework. A large number of operators have been added, and many existing operators have been optimized along multiple dimensions, including compatibility, behavioral consistency and functionality.
  • For distributed training, strategies such as LocalSGD and GEO-SGD have been added, the speed of large-scale synchronous and asynchronous training continues to increase, and K8S + Volcano job submission is supported.
  • Deployment Capability Enhancement:
    • A C API has been added to the server-side inference library, version compatibility checks are supported, and extensive performance optimization work has been completed.
    • PaddleLite has been released. It is positioned as a high-performance, multi-platform, lightweight on-device inference engine, and can also serve as an acceleration library for the server-side inference library.
    • PaddleServing has added ultra-large-scale distributed prediction service capability.
    • PaddleSlim has strengthened quantization-aware training and added hardware-aware small model search.
  • Improved ease of use and richness of the model libraries:
    • PaddleNLP: released new seq2seq APIs and example models for text generation. Added the XLNet pre-trained model to the semantic representation library PaddleLARK. Open-sourced D-NET, the champion model of the EMNLP2019 reading comprehension competition (MRQA 2019 Shared Task on Generalization), making it convenient for users to compete on 18 different extractive reading comprehension datasets. Released PALM (PAddLe Multi-task learning), a toolkit that facilitates research in multi-task machine learning.
    • PaddleCV: released the end-to-end image segmentation library PaddleSeg, covering training through deployment. For image classification, added 43 pre-trained models, including EfficientNet. PaddleDetection added the champion model of the 2019 Objects365 Full Track, small face detection models such as BlazeFace, and pre-trained models for pedestrian and vehicle detection. PaddleVideo added the winning model of the ActivityNet Challenge 2019 and extended its coverage to video caption and video grounding models.
    • Released PaddleSpeech, upgrading the speech recognition model DeepSpeech to support Fluid APIs and adding the new text-to-speech model DeepVoice3.
    • Increased the coverage of models in PaddleRec.
  • Comprehensive upgrade of supporting tool components:
    • PaddleHub added the Auto Fine-tune hyper-parameter optimization function, fully improved the flexibility and usability of Fine-tune, and substantially increased the number of pre-trained models.
    • The official version of the PaddlePaddle graph learning framework PGL has been released, with comprehensive improvements in usability, scalability and richness.
    • The parallel capability of the PaddlePaddle deep reinforcement learning framework PARL has been further enhanced, and evolutionary algorithms are supported.
    • Paddle2ONNX and X2Paddle have been fully upgraded, making it more convenient to convert models between PaddlePaddle and other frameworks.
    • The PaddlePaddle federated learning framework PaddleFL has been released.

User Experience Improvements

  • Improved Programming Usability
    • Convenience for fetching variables: the old rename-based memory reuse strategy forced every variable in the fetch list to be set with persistable = True. The Inplace reuse and cross-operator reuse strategies have been refactored so that fetched variables no longer require persistable = True, no variable is renamed, and result correctness is guaranteed.
    • The position sensitivity of optimizer.minimize and other interface calls has been addressed:
      • Since users tend to place exe.run(startup_program) after optimizer.minimize when building a network, which used to cause obscure errors, an initialization check with an easy-to-understand message has been added to the Optimizer ops, so that users can quickly locate such errors when they occur.
      • Since users tend to place test_program = main_program.clone(for_test=True) after optimizer.minimize when building a network, which used to produce incorrect model test results, an automatic backward-pruning strategy has been added to clone, so that cloning test_program no longer depends on its order relative to optimizer.minimize. A sketch of the overall pattern follows.
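    A minimal sketch of the program-construction order discussed above, using a toy classifier (the network, names and sizes are illustrative, not from the release):

        import paddle.fluid as fluid

        main_program = fluid.Program()
        startup_program = fluid.Program()
        with fluid.program_guard(main_program, startup_program):
            images = fluid.data(name='images', shape=[None, 784], dtype='float32')
            labels = fluid.data(name='labels', shape=[None, 1], dtype='int64')
            logits = fluid.layers.fc(input=images, size=10, act='softmax')
            loss = fluid.layers.mean(
                fluid.layers.cross_entropy(input=logits, label=labels))

            # Since 1.6, clone(for_test=True) prunes the backward pass
            # automatically, so it is correct before or after minimize.
            test_program = main_program.clone(for_test=True)
            fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

        exe = fluid.Executor(fluid.CPUPlace())
        exe.run(startup_program)  # run startup only after the network is built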
  • Updated Default Configuration Options
    • The GPU memory garbage collection switch is turned on by default (corresponding to the environment variable FLAGS_eager_delete_tensor_gb = 0).
    • Options of build_strategy (see the sketch after this list):
      • The build_strategy.enable_inplace strategy is turned on by default, providing better memory optimization.
      • The default behavior of build_strategy.memory_optimize (the cross-OP memory reuse optimization strategy) has been adjusted: it is turned off by default when the garbage collection strategy is on (to avoid the case where combining the two performs worse than garbage collection alone), and turned on by default when garbage collection is off. Users can explicitly set build_strategy.memory_optimize = True/False to force it on or off.
      • build_strategy.fuse_all_reduce_ops and build_strategy.fuse_broadcast_ops are turned on by default, which reduces the number of nodes in the computation graph and thus accelerates graph execution.
    • Option of execution_strategy:
      • The default value of num_iteration_per_drop_scope has been changed from 1 to 100, so that the synchronization formerly performed after every iteration is no longer needed, improving speed.
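    A minimal sketch of the 1.6 defaults above and how to override them explicitly (the tiny network is illustrative only):

        import paddle.fluid as fluid

        x = fluid.data(name='x', shape=[None, 8], dtype='float32')
        loss = fluid.layers.mean(fluid.layers.fc(input=x, size=1))

        build_strategy = fluid.BuildStrategy()
        build_strategy.enable_inplace = True        # 1.6 default
        build_strategy.memory_optimize = False      # off by default while GC is on
        build_strategy.fuse_all_reduce_ops = True   # 1.6 default
        build_strategy.fuse_broadcast_ops = True    # 1.6 default

        exec_strategy = fluid.ExecutionStrategy()
        exec_strategy.num_iteration_per_drop_scope = 100  # new default, was 1

        compiled = fluid.CompiledProgram(
            fluid.default_main_program()).with_data_parallel(
                loss_name=loss.name,
                build_strategy=build_strategy,
                exec_strategy=exec_strategy)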
  • Interface Optimization
    • paddle.fluid.memory_optimize has been deprecated because its optimization effect was poor and unstable. From this version on, calling it does nothing, and it may be removed entirely in a future version. It is recommended to delete paddle.fluid.memory_optimize from your code and use the garbage collection strategy for memory optimization (set the environment variable FLAGS_eager_delete_tensor_gb to 0; it is on by default).
    • DataLoader, a new interface for reading data, has been added. Users can create data loaders through fluid.io.DataLoader.from_xxx, such as DataLoader.from_generator and DataLoader.from_dataset, and iterate over them with a plain Python for loop, which simplifies usage and unifies the interface form. Other data reading interfaces such as py_reader will be deprecated in future versions.
    • The RecordIO interface has been removed and is no longer supported.
    • The data interface has been optimized. Unlike fluid.layers.data, the new fluid.data interface checks the shape and dtype of the input data, and uses None and -1 to support variable-length dimensions. If the input shape or dtype is incorrect, an error is reported. A sketch combining fluid.data and DataLoader follows.
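    A minimal sketch of fluid.data together with DataLoader.from_generator (the feature size 784, batch size 32 and random data are illustrative only):

        import numpy as np
        import paddle.fluid as fluid

        image = fluid.data(name='image', shape=[None, 784], dtype='float32')
        label = fluid.data(name='label', shape=[None, 1], dtype='int64')
        loss = fluid.layers.mean(fluid.layers.fc(input=image, size=10))

        loader = fluid.io.DataLoader.from_generator(
            feed_list=[image, label], capacity=4)

        def batch_generator():
            for _ in range(10):
                yield (np.random.rand(32, 784).astype('float32'),
                       np.random.randint(0, 10, size=(32, 1)).astype('int64'))

        place = fluid.CPUPlace()
        loader.set_batch_generator(batch_generator, places=place)

        exe = fluid.Executor(place)
        exe.run(fluid.default_startup_program())
        for data in loader():  # plain Python for-loop iteration
            exe.run(fluid.default_main_program(), feed=data, fetch_list=[loss])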
  • Error Message Optimization:
    • Simplified the C++ stack output by filtering out stack frames and symbols that are unrelated to Paddle functions and of little help in debugging, greatly shortening the stack and improving the debugging experience.
    • Re-typeset the error message stack, adding clear segmentation identifiers and prompts and placing the core message at the end, so that users can quickly locate the important information.
    • Added input type checks for 34 key Python APIs, so that mismatched input types are reported correctly and misleading error messages are avoided.
    • Enhanced the dimension-check error messages of 34 key OPs; detailed dimension information is now printed, making debugging easier.
    • Since the error message was unclear when sequence OPs received an input Tensor without LoD, an input Tensor LoD check has been added to these OPs, making the errors more intuitive and easier to understand.
    • Strengthened the automated error message output: CI now enforces the use of PADDLE_ENFORCE_XXX in place of the generic PADDLE_ENFORCE interface, so that more specific error messages are printed from templates; existing occurrences have been migrated accordingly.
  • Document Optimization
    • All the Chinese and English documents of all APIs have been optimized to ensure the correctness, standardization and legibility of the documents, and improve the corresponding examples.
    • More documentation and examples have been added for DyGraph programming.
    • The inference tutorial has been reworked overall; the structure and content have been reorganized, improving readability and practicality.
    • Some guide documents have been optimized.
  • Compile Optimization:
    • The default CMAKE_BUILD_TYPE has been changed from RelWithDebInfo to Release, reducing the build directory size for first-time developers and avoiding compilation failures caused by an oversized build directory.
    • Fixed random compilation failures caused by inference_lib.cmake.
    • The use_fast_math compile option has been removed, since it traded reduced CPU/GPU precision for performance.
  • Enhanced Windows Support
    • Support compiling Paddle from source with Visual Studio 2017.
    • The compilation process has been optimized: the build dependencies between third-party libraries and Paddle have been decoupled, and the pre-built OpenBLAS library is no longer required.
    • Support CUDA 10 on Windows.
    • Added support for more models, fixing models that previously failed to run on Windows.
    • Added a Paddle CPU version offline installation package.
    • Added the C-API inference SDK.

Training Framework

  • Performance Optimization
    • GPU Performance Optimization
      • Used the cuRAND library to optimize the GPU implementation of the dropout operator: the dropout op itself is 3.4x faster, and training of the Transformer base and big models on V100 is accelerated by 3.8% and 3.0% respectively.
      • Replaced the Eigen implementation of the smooth_label operator with a CUDA kernel, making the op itself 1.47x faster.
      • Replaced redundant tensor copies in recurrent_op with data sharing, and scopes are now deleted once their computation finishes. This optimization reduces the GPU memory consumption of the RNN-related benchmark models by 3-4x and speeds them up by 2% to several times.
    • CPU Performance Optimization
      • BERT optimization: added MKL support for multi-head matmul.
      • Fused lookup_table_op with sequence_pool_op (sum type) and applied sparse GEMM optimization, increasing the CPU training speed of the PyramidDNN model by 8%.
    • CPU/GPU Memory Optimization:
      • Added an MKLDNN layered caching strategy and cleanup strategy for variable-length inputs, fixing an MKLDNN memory leak under variable-length input.
      • Added GPU memory optimization support for multi-level nested control flow operators.
      • Allocator fault tolerance mechanism: to handle transient out-of-memory peaks caused by concurrent multi-threaded GPU memory allocation requests, an allocation retry strategy has been designed. After the first allocation fails, the allocator waits for up to 10s and retries (if memory is released during that period, the retry is triggered earlier).
      • GPU memory cache cleanup: resolved the problem that the TemporaryAllocator and cuDNN workspace singletons cached GPU memory without releasing it, improving GPU memory utilization.
      • Added the AutoGrowth GPU memory allocation strategy. Users can set the environment variable FLAGS_allocator_strategy=auto_growth to allocate GPU memory on demand, solving the problem that the original strategy pre-allocates 92% of the available GPU memory and cannot allocate on demand, without affecting model training speed. Both flags can be set as sketched below.
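    A minimal sketch of setting the two memory flags discussed above; they are read from the environment, so they must be set before paddle.fluid is first imported:

        import os

        os.environ['FLAGS_eager_delete_tensor_gb'] = '0'        # garbage collection (default)
        os.environ['FLAGS_allocator_strategy'] = 'auto_growth'  # on-demand allocation

        import paddle.fluid as fluid  # the flags take effect at import time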
  • OP
    • Supports user-defined C++/CUDA operators built outside the framework, with no need to modify the framework itself.
    • New OPs
      • The eye operator has been added to build an identity matrix or a batch of identity matrices.
      • The gather_nd operator, a high-dimensional generalization of the gather operator, has been added; it gathers slices of the input data into a tensor of the shape specified by the index.
      • The scatter_nd operator, a high-dimensional generalization of the scatter operator, has been added. It is similar to scatter_nd_add, except that the tensor being added to is zero-initialized; accordingly, scatter_nd(index, updates, shape) is equivalent to scatter_nd_add(fluid.layers.zeros(shape, updates.dtype), index, updates). It scatters the updates into a new (initially zero) tensor according to the indices (see the sketch after this list).
      • The scatter_nd_add operator has been added: the output Variable is obtained by applying sparse addition to single values or slices of a Variable.
      • The center_loss operator has been added: it assists softmax loss in face training. Softmax loss separates different classes, while center loss compresses samples within one class: it maintains a center for each class and minimizes the distance between each sample in a mini-batch and its class center, thereby reducing intra-class distance.
      • The Lookahead optimizer has been added, as Paddle previously did not support the Lookahead optimization algorithm. Its core principle: two sets of parameters are maintained; the fast parameters do forward/backward computation normally, and after the fast parameters have been updated k times, they are used to update the slow parameters so that the two stay synchronized. On some models this converges faster.
      • The InstanceNorm operator for instance normalization has been added: it normalizes by the per-channel mean and variance of each sample. It is generally used in image generation models to transfer the style of one sample to another.
      • The PreciseRoiPooling operator has been added: PrROI Pooling computes the value of each pooling region by integration, treating the interpolation over the region as continuous. It integrates over all interpolation points to obtain the sum over the region and divides by the region's area, so the result is more accurate.
      • The hard_swish operator has been added: the hard_swish activation function, proposed in MobileNetV3, offers better numerical stability and faster computation than the swish activation function.
      • The mse_loss operator has been added: a mean squared loss function that computes the mean squared error between two inputs.
      • float/double kernels for elementwise_mod have been added.
      • The strided_slice operator has been added.
      • MKLDNN kernel updates:
        • The leaky_relu MKL-DNN kernel and a conv + activation fusion pass have been added.
        • The softmax MKL-DNN kernel now supports different axes.
        • The FP32 MKL-DNN kernel code of 5 operators (conv, pooling, batch_norm, softmax, LRN) has been refactored to enhance maintainability and readability.
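    A minimal sketch of two of the new OPs, eye and scatter_nd, including the scatter_nd_add equivalence noted above (shapes and values are illustrative only):

        import numpy as np
        import paddle.fluid as fluid

        index = fluid.data(name='index', shape=[None, 1], dtype='int32')
        updates = fluid.data(name='updates', shape=[None], dtype='float32')

        eye3 = fluid.layers.eye(3)  # 3x3 identity matrix
        # Scatter `updates` into a zero tensor of shape [6]; equivalent to
        # scatter_nd_add(fluid.layers.zeros([6], updates.dtype), index, updates).
        out = fluid.layers.scatter_nd(index=index, updates=updates, shape=[6])

        exe = fluid.Executor(fluid.CPUPlace())
        exe.run(fluid.default_startup_program())
        result, = exe.run(
            feed={'index': np.array([[1], [3]], dtype='int32'),
                  'updates': np.array([9.0, 10.0], dtype='float32')},
            fetch_list=[out])
        print(result)  # [ 0.  9.  0. 10.  0.  0.]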
      • OP Function Optimization and Upgrade
      • Some operators have been upgraded so that their parameters accept a tensor, or a list containing tensors, with shape inference of the constant dimensions (see the sketch after this list):
        • the slice operator, for the parameters starts and ends;
        • the reshape operator, for the parameter shape;
        • the expand operator, for the parameter expand_times;
        • the pow operator, for the parameter factor;
        • the fill_constant operator, for the parameter shape; the fill_constant_batch_size_like used in the calc_gradient interface has been replaced with fill_constant;
        • the uniform_random operator, for the parameter shape, which supports a tensor and a list containing tensors;
        • the image_resize, resize_nearest, resize_bilinear and resize_trilinear operators support out_shape as a tensor or a list containing tensors, with shape inference of the constant dimensions, and their scale parameter supports a tensor;
        • the crop_tensor operator has been added; its shape parameter supports a tensor or a list containing tensors, with shape inference of the constant dimensions.
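    A minimal sketch of passing a run-time tensor inside a shape list (values are illustrative; the other parameters listed above accept tensors in the same way):

        import paddle.fluid as fluid

        x = fluid.data(name='x', shape=[None, 6], dtype='float32')
        # A 1-element tensor computed at run time, used as one dimension.
        dim = fluid.layers.fill_constant(shape=[1], dtype='int32', value=3)
        # reshape accepts a list mixing Python ints and 1-element tensors.
        y = fluid.layers.reshape(x, shape=[-1, 2, dim])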
      • The dimension checks on the input tensors of some operators have been optimized:
        • The restriction that the last dimension of the input shape must be 1 in the huber_loss, rank_loss and cross_entropy operators has been removed, and the shape of the output loss is now consistent with that of the label;
        • The fluid.one_hot and fluid.embedding interfaces have been added, removing the restriction that the last dimension of the input shape must be 1;
        • The shape of length in the sequence_pad and sequence_unpad operators has been simplified from [n, 1] to [n];
      • Some operators have been upgraded to support channel_last format input:
        • conv2d, conv3d, pool2d and pool3d have gained a data_format parameter, supporting channel_last format input.
        • conv2d_transpose and conv3d_transpose have gained a data_format parameter, supporting channel_last format input.
        • image_resize, resize_nearest, resize_bilinear and resize_trilinear have gained a data_format parameter, supporting channel_last format input.
        • group_norm supports channel_last format input.
      • Operators involving padding now support asymmetric padding as well as the SAME and VALID padding modes (see the sketch after this list):
        • conv2d, conv3d, pool2d and pool3d support the above padding modes.
        • conv2d_transpose and conv3d_transpose support the above padding modes.
      • Inplace GPU memory optimization support has been added for the following operators:
        • elementwise_add_grad_grad, elementwise_sub_grad_grad, elementwise_mul_grad_grad, elementwise_div_grad_grad, relu_grad_grad, leaky_relu_grad_grad, sqrt_grad_grad, square_grad_grad. To address the high GPU memory consumption of gradient-penalty GAN models, inplace support has been added to these double-backward operators to reduce their memory footprint.
      • Some operators that previously only supported LoDTensor input have been upgraded to also accept padded input, including the linear_crf, crf_decoding, hash, edit_distance, chunk_eval, warpctc, ctc_align and row_conv operators.
  • Intel N-Graph Integration
    • Added ngraph_subgraph_pass support for training; N-Graph can be activated through the build strategy, providing support for the parallel executor.
    • Fixed N-Graph multi-threading problems, providing support for multi-threaded inference.
  • Dynamic Graph
    • Performance Optimization
      • The underlying execution mechanism of the dynamic graph has been reconstructed so that most models have a speed increase of about 30%, and the GPU memory occupancy has been reduced by about 2%.
    • Improved Functionality:
      • Automatic pruning based on the stop_gradient setting and the detach interface are supported, meeting the need to freeze parts of a network (see the sketch after this section).
      • data_transform can be executed on different devices within a model, so functions such as less_than/greater_than can be used.
      • Several operators (unsqueeze, unstack, flatten, fill_constant, etc.) have been re-implemented so that they support dynamic graphs.
    • Improved Usability:
      • Optimized error messages are provided for interfaces not supported in dynamic graph mode (including Variable-related and Optimizer-related interfaces).
      • Interfaces for accessing the parameters in a Layer are provided.
      • The save/load interfaces of dynamic graphs have been optimized; the old save_persistables interface under dygraph has been removed.
      • Layer's call() now accepts keyword arguments, so custom arguments can be passed in during forward execution.
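    A minimal sketch of stop_gradient and detach in dygraph mode (the tiny two-layer network and the 1.6-era FC layer API are assumptions for illustration):

        import numpy as np
        import paddle.fluid as fluid

        with fluid.dygraph.guard():
            fc1 = fluid.dygraph.FC('fc1', size=4)
            fc2 = fluid.dygraph.FC('fc2', size=1)

            x = fluid.dygraph.to_variable(np.random.rand(2, 3).astype('float32'))
            h = fc1(x)
            h.stop_gradient = True    # freeze everything upstream of h
            # h2 = h.detach()         # alternatively, branch off a no-grad copy
            loss = fluid.layers.reduce_mean(fc2(h))
            loss.backward()           # fc1's parameters receive no gradient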

Inference Deployment

  • Server & Cloud Inference Library
    • Interface Optimization
      • A C API for inference has been added.
      • Setting the environment variable GLOG_v=4 prints detailed logs of the inference process, including model ops and op fusion, which exposes considerable information. A DisableGlogInfo() interface has therefore been added to AnalysisConfig (currently it may be called at most once globally), making it easy to turn off GLOG output and avoid leaking the model structure.
      • Since users could not easily obtain the input shapes recorded in the model description when using the C++ inference library, a GetInputTensorShape() interface has been added to AnalysisPredictor, so that users can read the input shapes from the model before running the inference engine and avoid feeding wrongly shaped inputs.
    • Function Optimization
      • The model version number and operator compatibility information are now recorded in the model. From this version on, a compatibility check is performed when a model from an older version runs inference through AnalysisPredictor on a newer Paddle library.
      • CPU INT8 quantization inference support continues to be strengthened: post-training quantization of mobilenet-ssd is supported, with an accuracy drop within 1% and a 3x performance improvement on a second-generation Intel Xeon Scalable processor (6271); an INT8 MKL-DNN kernel for the Mul op has been added.
    • Performance Optimization
      • The inference speed of MobileNetV2, ShuffleNet and EfficientNet under CUDA GPU has been optimized: MobileNetV2 drops from 5.3 ms to 1.9 ms, ShuffleNetV2 from 6.3 ms to 1.4 ms, and EfficientNet from 60 ms to 32 ms.
      • Implemented a pass that simplifies basic ops in the graph: at inference time, dropout ops of the upscale_in_train type are removed directly, and dropout ops of the downgrade_in_infer type are replaced with scale ops. This optimization increases the inference speed of the ERNIE model on the P40 by 1.8%.
      • Implemented cudnn_placement_pass, which sets use_cudnn to true for all ops in the graph. This optimization increases the inference speed of the ERNIE model on the P40 by 10%.
      • Implemented a GPU kernel for the fc op, supporting the fusion of activation operations into the fc op. This optimization increases the inference speed of the ERNIE model on the P40 by 2.1%.
      • Implemented a pass and GPU kernel that fuse the fc + elementwise_add + layer_norm operations. This optimization increases the inference speed of the ERNIE model on the P40 by 4%.
      • Implemented the pass and kernel for the multihead matmul fusion algorithm. This optimization increases the speed of the ERNIE model on the P4 GPU by more than 30%.
      • Optimized the speed at which models produced by QAT (quantization-aware training) execute on the CPU INT8 kernels. By modifying the QAT-trained model through passes, combined with post-training optimization passes, QAT-trained models keep the accuracy change (relative to FP32 simulated quantization) within 0.1% on MobileNetV1, MobileNetV2, ResNet50 and VGG16, and within 0.3% on ResNet101 and VGG19, while performance on these 6 models reaches 4-9x that of the original unoptimized QAT models on a second-generation Intel Xeon Scalable processor (6271).
    • Problem Fixes
      • Since setting FLAGS_profile had no effect in the previous AnalysisPredictor, an EnableProfile() interface has been added to AnalysisConfig; users can now call it to turn on the inference profiler without setting the flag.
      • Added uint8 template support to methods such as copy_from_cpu and mutable_data of ZeroCopyTensor; ZeroCopyRun can now correctly receive uint8 input for inference.
      • Fixed Paddle-TRT bugs such as repeatedly setting weights and prematurely deleting parameters in models where multiple ops share the same parameter, such as retinanet, faster_rcnn and cascade_rcnn; Paddle-TRT now supports these models.
  • Mobile and Embedded On-device Inference Library
    • PaddleLite has been released. It is positioned as a high-performance, multi-platform, lightweight on-device inference engine, and can also serve as an acceleration library for the server-side PaddlePaddle native inference library. See https://github.com/PaddlePaddle/Paddle-Lite for details.
  • Paddle Serving
    • Ultra-large-scale distributed prediction service capability has been added:
      • Released cube, a high-performance distributed key-value store from inside Baidu that has been validated on massive data. It provides distributed storage and lookup of sparse parameters; under high concurrency, its throughput per unit time is 13 times that of redis and 6 times that of the single-machine KV store rocksDB.
      • Released the Elastic CTR solution: for CTR tasks with ultra-large-scale sparse parameters, it provides process documentation for distributed training on a k8s cluster and for serving with distributed parameter deployment, together with a one-click solution.
    • Faster PaddleServing Compilation
      • The compilation dependency of the inference interface has been changed from the Paddle source code to the Paddle inference lib, increasing compilation speed by 6x.
    • Improved PaddleServing Usability
      • Python client is supported.
  • PaddleSlim
    • Hardware-aware small model structure search has been added.
    • The classification model examples have been extended for the three strategies of quantization-aware training, distillation and channel pruning, and detection model examples have been added.
    • Partial quantization is now supported: users can choose to quantize only some of the ops of a given type.
    • Quantization-aware training support has been added for ops such as pool2d and elementwise_add.

Distributed Training

  • Performance Optimization
    • The LocalSGD multi-machine training algorithm has been added: in multi-machine multi-card synchronous GPU training, trainers run at (randomly) inconsistent speeds, causing synchronization waits. A local asynchronous training strategy has been designed: several asynchronous training steps (with no communication blocking) amortize the time spent on slow trainers, improving synchronous training performance. With 4 machines and 32 V100 GPU cards, training throughput on the ResNet50 ImageNet classification task increases by 8.16% while the test set top-5 accuracy reaches 93%. Model link: https://github.com/PaddlePaddle/Fleet/tree/develop/examples/local_sgd/resnet
    • The GEO-SGD distributed CPU multi-threaded fully asynchronous training algorithm has been added: each training node maintains independent parameters with multiple rounds of local updates, while the global parameters are updated incrementally, greatly reducing the communication share of training. On the text-matching Simnet_bow model with 25 nodes and 12 threads, GEO-SGD trains 2.65x faster than the PaddlePaddle 1.5 fully asynchronous mode while matching its quality. On the Word2Vec model with 4, 8, 16 and 32 nodes and 16 threads, it trains 3.79x, 3.92x, 4.69x and 6.88x faster respectively, again with matching quality.
    • Fast ResNet: using strategies such as variable image sizes, variable batch sizes and rectangular validation images, the training speed of the ResNet50 model on the ImageNet dataset is significantly increased. With 4 machines and 32 V100 GPU cards, the time to reach 93% top-5 accuracy drops to 35 minutes, a 2.21x speedup in convergence. Model link: https://github.com/PaddlePaddle/Fleet/tree/develop/examples/fast_imagenet
  • RecomputeOptimizer, an optimizer for very large batch training, has been added. With memory fixed, the Recompute optimizer significantly increases the batch size a model can run, by 17%-309%. Training quality is unaffected and the convergence curves are consistent, though actual throughput suffers some loss.
  • Collective OPs have been added: all_reduce_op, broadcast_op, all_gather_op and reduce_scatter_op, supporting inter-process communication in network construction.
  • Fault Tolerance
    • A heartbeat check for training nodes has been added to the CPU fully asynchronous training mode, detecting abnormal nodes in time.
    • A retry mechanism has been added to fix RPC error code 14.
  • Deployment
    • Paddle-K8S-Operator now supports Volcano Job submission, enabling CPU distributed training.

Model Construction

  • Optimization of Usability
    • We have optimized the user experience of the main PaddleNLP and PaddleCV models (Transformer, BERT, DMTK, PaddleDetection, PaddleGAN, PaddleVideo, ImageClassification), including a clearer installation guideline, custom data support, and better Windows support. Most models now have a brief guideline on using your own data for training, fine-tuning and prediction.
  • PaddleNLP:
    • The text generation library Seq2seq has been released:
      • We have opened sources for multiple text generation models, including vanilla seq2seq, seq2seq with memory network and variational seq2seq.
    • The reading comprehension library has been upgraded:
      • Open-sourced the D-Net model and related pre-trained models with which Baidu won the EMNLP2019 reading comprehension competition; they support parallel training, high-performance evaluation, and building a reading comprehension serving pipeline on the 18 extractive reading comprehension datasets released by MRQA2019.
    • The semantic representation library has been upgraded:
      • The semantic representation model XLNet has been added.
    • The open multi-task learning library PALM has been released:
      • Open-sourced PALM, the multi-task learning framework Baidu used to win the MRQA2019 competition. Only dozens of lines of code are needed to implement multi-task learning algorithms such as hard sharing and hierarchical sharing on top of ERNIE, BERT and other pre-trained models.
  • PaddleCV
    • The image segmentation library PaddleSeg has been released, with four key features: rich data augmentation, modular design, high performance, and end-to-end deployment.
      • Model
        • Support for four networks, DeeplabV3+/UNet/PSPNet/ICNet, has been added, with 18 corresponding pre-trained models.
        • Three new pretrained models for lane segmentation, portrait segmentation, and body part segmentation have been added.
      • Function
        • softmax loss, bce loss and dice loss are supported, including combined loss configurations.
        • More than ten data augmentation strategies such as flip, rotation, multi-scale transformation, blur, and color saturation adjustment, etc. are supported.
        • Usability features such as data inspection, evaluation during training, model export, automatic visualization, and a parameter tuning mode are supported.
        • FP16 mixed precision training and dynamic Loss Scaling are supported.
        • Multi-process training and data pre-processing are supported.
      • End-to-end Deployment
        • The compilation, development and deployment of multi-platform (Windows/Linux) C++ high-performance inference library are provided.
        • High-performance image segmentation service deployment capability based on Paddle Serving is provided.
    • The detection library PaddleDetection has been upgraded
      • Added the championship-winning model of the 2019 Objects365 Full Track competition, DeformableConv series models, VGG-SSD series models, the Cascade+Mask+FPN model, more COCO-based two-stage models, pedestrian and vehicle detection pre-trained models, and the face detection models Faceboxes and the BlazeFace series, including improved lightweight versions.
      • Functions:
        • Multi-scale training and multi-scale testing are supported, as well as group norm and FP16 training. C++ inference deployment capability has been added, supporting Windows and Linux systems.
        • Model compression examples for quantization and pruning have been added.
      • Documentation: Chinese documents have been added, along with quick-start on small datasets, transfer learning, model export and inference deployment documents, as well as an inference benchmark document.
    • The image classification models have been improved
      • 9 EfficientNet pre-trained models have been released: EfficientNet-b0, EfficientNet-b1, EfficientNet-b2, EfficientNet-b3, EfficientNet-b4, EfficientNet-b5, EfficientNet-b6, EfficientNet-b7, and EfficientNet-small. Their accuracy matches the paper.
      • 34 more pre-trained models have been added: DarkNet53, DenseNet121, Densenet161, DenseNet169, DenseNet201, DenseNet264, SqueezeNet1_0, SqueezeNet1_1, ResNeXt50_vd_32x4d, ResNeXt152_64x4d, ResNeXt101_32x8d_wsl, ResNeXt101_32x16d_wsl, ResNeXt101_32x32d_wsl, ResNeXt101_32x48d_wsl, Fix_ResNeXt101_32x48d_wsl, ResNet18_vd, ResNet34_vd, MobileNetV1_x0_25, MobileNetV1_x0_5, MobileNetV1_x0_75, MobileNetV2_x0_75, MobileNetV3_small_x1_0, DPN68, DPN92, DPN98, DPN107, DPN131, ResNeXt101_vd_32x4d, ResNeXt152_vd_64x4d, Xception65, Xception71, Xception41_deeplab, Xception65_deeplab, and SE_ResNet50_vd.
    • PaddleVideo has been upgraded
      • The following action localization models have been added: BMN and BSN; the BMN model is the champion of the ActivityNet 2019 contest.
      • The baseline model for the video grounding direction has been added: TALL.
      • The baseline model for the video captioning direction has been added: ETS.
    • PaddleGAN has been upgraded
      • The SPADE model has been added.
      • Replaced the InstanceNorm implementation, increasing the discriminator speed on STGAN by about 12%.
  • PaddleSpeech:
    • Upgraded the speech recognition model DeepSpeech to the latest version of PaddlePaddle.
    • Open-sourced the speech synthesis model DeepVoice3.
  • PaddleRec:
    • DeepFM, XDeepFM and DeepCrossNetwork, with distributed training support, have been added.

Utility Components

  • PaddleHub
    • The Auto Fine-tune function has been added for hyper-parameter optimization: given a hyper-parameter search space, it automatically searches for a better hyper-parameter combination.
      • Two hyper-parameter optimization algorithms are supported: HAZero, based on Bayesian optimization, and PSHE2, based on Hamiltonian systems.
      • Two evaluation methods are supported: Full-Trail and Population-Based.
    • Enriched Pre-trained Models
      • Upgraded the ERNIE 1.0 Chinese model, improving its performance on long text (max_seq_len=512).
      • Upgraded the LAC model to v2.0.0, streamlining the model structure and improving inference speed while maintaining accuracy.
      • ERNIE 2.0 English pre-trained model has been added.
      • Ultra-Light-Fast-Generic-Face-Detector-1MB face detection model has been added.
      • The ACE2P human body part segmentation model has been added.
      • The DeepLabv3+ based portrait segmentation model HumanSeg has been added.
      • The image generation models STGAN, AttGAN and StarGAN have been added.
    • Fine-tune API Upgrade: Improved Flexibility and Usability
      • Reading comprehension Fine-tune task has been added.
      • The multi-indicator evaluation function has been added.
      • The predict interface has been optimized, improving inference performance.
      • The ULMFiT optimization strategy has been added, including the following three configurations:
        • Slanted triangular learning rates: fine-tuning with a slanted-triangle learning rate schedule.
        • Discriminative fine-tuning: supports applying different learning rates layer by layer in topological order of the computation graph.
        • Gradual unfreezing: unfreezes parameters layer by layer according to the topological structure of the computation graph.
  • PGL Graph Learning Framework
    • The official version of the PaddlePaddle graph learning framework, PGL v1.0, has been released.
    • Usability: Metapath sampling and Message Passing mechanisms for heterogeneous graphs have been added, supporting heterogeneous graph modeling with multiple types of nodes and edge features; heterogeneous graph algorithms such as Metapath2vec and GATNE have been added. Documents, APIs and tutorials have also been further improved.
    • Scalability: a distributed graph engine and distributed embedding have been added, supporting multiple distributed training modes for giant graphs with billions of nodes and tens of billions of edges. Two distributed examples, distributed deepwalk and distributed graphSage, have been added.
    • Richness: 8 new graph learning models have been added, for a total of 13, covering the mainstream models of graph neural networks and graph representation learning. The eight new models are LINE, struc2vec, metapath2vec, GES, GATNE, SGC, Unsup-GraphSage, and DGI.
  • PARL Deep Reinforcement Learning Framework
    • Released the corresponding version of the PaddlePaddle deep reinforcement learning framework, PARL 1.2.
    • A more complete parallel RL mechanism with cluster resource scheduling further lowers the threshold for implementing parallel algorithms.
    • Supports massively parallel evolutionary algorithms, with hundreds of CPUs searching concurrently (https://github.com/PaddlePaddle/PARL/tree/develop/examples/ES).
    • A more comprehensive official PARL documentation site is now available (https://parl.readthedocs.io/en/latest/).
  • PaddleFL Federated Learning
    • The PaddlePaddle federated learning framework PaddleFL has been released, providing quick and convenient support for federated learning and AI privacy algorithm research. It implements the FedAvg algorithm and a differential-privacy-based SGD algorithm, supporting research on distributed secure shared learning. https://github.com/PaddlePaddle/PaddleFL
  • Paddle2ONNX
    • Paddle2onnx has been upgraded to v0.2 accordingly.
    • The pip installation method has been added.
    • Adapted to the PaddlePaddle v1.6 operators and ONNX v1.5.
    • A precision alignment framework has been added, providing correctness verification for newly added code and model conversion.
    • Supports conversion of 10 Paddle image classification models, such as ResNet and DenseNet.
    • Supports conversion of 4 Paddle object detection models, such as SSD_MobileNet and YoloV3_DarkNet53.
  • X2Paddle
    • We have upgraded X2Paddle to v0.5 accordingly.
    • The pip installation method has been added.
    • A unified intermediate representation for caffe, tensorflow and onnx model computation graphs has been added.
    • Supports conversion of multi-branch caffe models.
    • Significantly improved model conversion coverage for the mainstream frameworks, supporting 44 tensorflow OPs, 33 caffe Layers and 48 ONNX OPs.
    • Provides multi-framework model deployment capability for Paddle Lite, supporting lossless conversion of 18 models covering image classification, object detection and semantic segmentation.

Bug Fixes

  • Fixed the bug that the rnn_search model could not run.
  • Fixed a bug in save_inference_model when pruning recurrent_op (this bug caused some RNN models to fail at inference after save_inference_model).
  • Fixed problems in dynamic graph mode: parameters such as act and bias being ineffective in several Layers (including BilinearTensorProduct, GRUUnit, Conv2DTranspose, LayerNorm and NCE), a bug in optimizer saving, a memory leak on the Python side, a segmentation fault in minimize for some parameters, and the failure of has_attr in Python.
  • Fixed the accuracy diff problem of the FC mkldnn pass on AVX2 machines.
  • Upgraded MKL-DNN to 0.20 and increased MKL-DNN unit test coverage to over 90%.
  • Fixed the squash problem of convolution and dequant ops in MKL-DNN post-training quantization.

Code Reconstruction and Upgrade

  • We have cleaned up 6 obsolete third-party libraries, i.e. recordio, snappystream, snappy, jemalloc, anakin and gzstream.

Thanks to the contributors

This release contains contributions from many Intel engineers: Adam Grygielski, Jacek Czaja, DanQing Li, Michal Gallus, Joanna Wozna, Minghui Yu, Feiyue Zhai, and Bob Zhu. We would also like to thank everyone who asked questions and reported issues during this release.