Releases · horovod/horovod · GitHub

12 Jun 09:26

maxhgerlach

v0.28.1: Build fixes (ROCm, GCC 12)

Fixed

Fixed build with gcc 12. (#3925)
PyTorch: Fixed build on ROCm. (#3928)
TensorFlow: Fixed local_rank_op. (#3940)

10 May 09:13

maxhgerlach

v0.28.0: Keras 2.11+ optimizers, faster reducescatter, fixes for latest TensorFlow, CUDA, NCCL

Added

TensorFlow: Added new get_local_and_global_gradients to PartialDistributedGradientTape to retrieve local and non-local gradients separately. (#3859)

Changed

Improved reducescatter performance by allocating output tensors before enqueuing the operation. (#3824)
TensorFlow: Ensured that tf.logical_and within allreduce tf.cond runs on CPU. (#3885)
TensorFlow: Added support for Keras 2.11+ optimizers. (#3860)
CUDA_VISIBLE_DEVICES environment variable is no longer passed to remote nodes. (#3865)
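
The CUDA_VISIBLE_DEVICES change concerns which environment variables the launcher forwards to remote workers. The following is a minimal sketch of that kind of filtering, not Horovod's actual code: the helper name `exportable_env` and the deny-list patterns are hypothetical and only illustrate dropping a host-specific variable before the environment is sent elsewhere.

```python
import re

# Hypothetical deny-list for this sketch; the real launcher's list may differ.
IGNORED_PATTERNS = [r"CUDA_VISIBLE_DEVICES", r"BASH_FUNC_.*"]


def exportable_env(env):
    """Return a copy of env without variables whose name matches a deny-list pattern."""
    return {
        name: value
        for name, value in env.items()
        if not any(re.fullmatch(pattern, name) for pattern in IGNORED_PATTERNS)
    }


local_env = {
    "CUDA_VISIBLE_DEVICES": "0,1",  # describes GPUs on the launching host only
    "PATH": "/usr/bin",
    "HOROVOD_FUSION_THRESHOLD": "67108864",
}
remote_env = exportable_env(local_env)
print(sorted(remote_env))
```

A GPU index list that is valid on the launching host may not match the devices on a remote node, which is one plausible reason to keep such a variable local.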

Fixed

Fixed build with ROCm. (#3839, #3848)
Fixed build of Docker image horovod-nvtabular. (#3851)
Fixed linking recent NCCL by defaulting CUDA runtime library linkage to static and ensuring that weak symbols are overridden. (#3867, #3846)
Fixed compatibility with TensorFlow 2.12 and recent nightly versions. (#3864, #3894, #3906, #3907)
Fixed missing arguments of Keras allreduce function. (#3905)
Updated with_device functions in MXNet and PyTorch to skip unnecessary cudaSetDevice calls. (#3912)

01 Feb 17:51

maxhgerlach

Custom data loaders in Spark TorchEstimator, more model parallelism in Keras, improved allgather performance, fixes for latest PyTorch and TensorFlow versions

Added

Keras: Added PartialDistributedOptimizer API. (#3738)
Added HOROVOD_SPARK_USE_LOCAL_RANK_GPU_INDEX environment variable to ignore GPU device indices assigned by Spark and always use local rank GPU device in Spark estimators. (#3737)
Added support for reducescatter arguments prescale_factor and postscale_factor and moved averaging into Horovod backend. (#3815)
Spark Estimator: Added support for custom data loaders in TorchEstimator. (#3787)
Spark Estimator: Added NVTabular data loader for TorchEstimator. (#3787)
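
To make the reducescatter arguments prescale_factor and postscale_factor concrete, here is a pure-Python simulation of the semantics. It does not call Horovod: `reducescatter_sim` is a hypothetical helper, the inputs are plain lists standing in for one tensor per rank, and the sketch assumes the first dimension divides evenly across ranks. Under these assumptions, averaging can be viewed as a sum followed by a post-scale of 1/size.

```python
def reducescatter_sim(rank_inputs, prescale_factor=1.0, postscale_factor=1.0):
    """Simulate a sum-reducescatter over equally sized 1-D inputs.

    rank_inputs[r] is the list held by rank r. Returns one output list per rank.
    Assumes the input length is evenly divisible by the number of ranks.
    """
    size = len(rank_inputs)
    length = len(rank_inputs[0])
    assert length % size == 0, "sketch handles only evenly divisible inputs"
    # Scale every input first, then reduce element-wise across ranks.
    reduced = [
        sum(prescale_factor * inputs[i] for inputs in rank_inputs)
        for i in range(length)
    ]
    # Scale the reduced values, then hand each rank its contiguous chunk.
    scaled = [postscale_factor * value for value in reduced]
    chunk = length // size
    return [scaled[r * chunk:(r + 1) * chunk] for r in range(size)]


inputs = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
summed = reducescatter_sim(inputs)
averaged = reducescatter_sim(inputs, postscale_factor=1.0 / len(inputs))
print(summed)
print(averaged)
```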

Changed

Improved NCCL performance for fused allgather operations through padding for better memory alignment. (#3727)
Improved look-ahead tensor fusion buffer size estimates when allgather and other operations are mixed. (#3727)
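
The padding idea behind these two entries can be sketched as an offset calculation: each tensor placed in a shared buffer starts at an aligned address, so the buffer size estimate must include the padding. This is an illustration only; the helper `padded_offsets` and the 64-byte alignment are assumptions, not values taken from Horovod.

```python
def padded_offsets(sizes_bytes, alignment=64):
    """Return (offsets, total) for placing buffers back to back so that
    each one starts at a multiple of `alignment` bytes."""
    offsets = []
    cursor = 0
    for size in sizes_bytes:
        offsets.append(cursor)
        # Round the end of this buffer up to the next aligned boundary.
        cursor = (cursor + size + alignment - 1) // alignment * alignment
    return offsets, cursor


offsets, total = padded_offsets([100, 64, 1])
print(offsets, total)
```

Without padding the three buffers would need 165 bytes; with the assumed alignment the estimate grows to 256 bytes, which is why a look-ahead size estimate has to account for it.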

Fixed

ROCm: Fixed GPU MPI operations support in build. (#3746)
PyTorch: Fixed linking order to avoid using Gloo from PyTorch dynamic libraries. (#3750)
Fixed memory leak in MPI_GPUAllgather. (#3727)
TensorFlow: Fixed deprecation warnings when building with TensorFlow 2.11. (#3767)
Keras: Added support for additional arguments to SyncBatchNormalization._moments(). (#3775)
Fixed version number parsing with pypa/packaging 22.0. (#3794)
TensorFlow: Fixed linking with nightly versions leading up to TensorFlow 2.12. (#3755)
TensorFlow: Fixed handling of tf.IndexedSlices types when scaling local gradients. (#3786)
Added missing MEMCPY_IN_FUSION_BUFFER timeline event for reducescatter. (#3808)
Fixed build of Docker image horovod-nvtabular. (#3817)
TensorFlow: Several fixes for allreduce and grouped allreduce handling of tf.IndexedSlices. (#3813)
Spark: Restricted PyArrow to versions < 11.0. (#3830)
TensorFlow: Resolved conflicts between multiple optimizer wrappers reusing the same gradient accumulation counter. (#3783)
TensorFlow/Keras: Fixed DistributedOptimizer with Keras 2.11+. (#3822)
PyTorch, ROCm: Fixed allreduce average on process sets. (#3815)

14 Oct 08:20

EnricoMi

Hotfix: Fixing packaging import during install

Fixed

Fixed packaging import during install to occur after install_requires. (#3741)

13 Oct 12:29

EnricoMi

Better support for model parallel, more reduction operations for allreduce (min, max, product), grouped allgather and reducescatter, Petastorm reader level parallel shuffling, NVTabular data loader

Added

Spark Estimator: Added support for custom data loaders in KerasEstimator. (#3603)
Spark Estimator: Added NVTabular data loader for KerasEstimator. (#3603)
Spark Estimator: Added gradient accumulation support to Spark torch estimator. (#3681)
TensorFlow: Added register_local_var functionality to distributed optimizers and local gradient aggregators. (#3695)
TensorFlow: Added support for local variables for BroadcastGlobalVariablesCallback. (#3703)
Enabled use of native ncclAvg op for NCCL allreduces. (#3646)
Added support for additional reduction operations for allreduce (min, max, product). (#3660)
Added 2D torus allreduce using NCCL. (#3608)
Added support for Petastorm reader level parallel shuffling. (#3665)
Added random seed support for Lightning datamodule to generate reproducible data loading outputs. (#3665)
Added support for int8 and uint8 allreduce and grouped_allreduce in TensorFlow. (#3649)
Added support for batched memory copies in GPUAllgather. (#3590)
Added support for batched memory copies in GPUReducescatter. (#3621)
Added hvd.grouped_allgather() and hvd.grouped_reducescatter() operations. (#3594)
Added warning messages if output tensor memory allocations fail. (#3594)
Added register_local_source and use_generic_names functionality to DistributedGradientTape. (#3628)
Added PartialDistributedGradientTape() API for model parallel use cases. (#3643)
Spark/Lightning: Added reader_worker_count and reader_pool_type. (#3612)
Spark/Lightning: Added transformation_edit_fields and transformation_removed_fields params for EstimatorParams. (#3651)
TensorFlow: Added doc string for hvd.grouped_allreduce(). (#3594)
ROCm: Enabled alltoall. (#3654)
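
The additional reduction operations for allreduce (min, max, product) listed above all follow the same pattern: values are combined element-wise across ranks, and every rank receives the same result. The sketch below simulates that with plain lists. It does not use Horovod; `allreduce_sim` and the string op names are hypothetical.

```python
from functools import reduce
import operator

# Hypothetical op table for this sketch; the keys are not Horovod identifiers.
REDUCE_OPS = {
    "sum": operator.add,
    "min": min,
    "max": max,
    "product": operator.mul,
}


def allreduce_sim(rank_inputs, op="sum"):
    """Element-wise reduce equally sized lists, one per rank.

    Every rank receives its own copy of the same reduced list.
    """
    combine = REDUCE_OPS[op]
    result = [reduce(combine, column) for column in zip(*rank_inputs)]
    return [list(result) for _ in rank_inputs]


inputs = [[1, 8, 3], [4, 2, 6], [7, 5, 9]]
for op in ("min", "max", "product"):
    print(op, allreduce_sim(inputs, op)[0])
```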

Changed

Default Petastorm reader pool is changed from process to thread for lower memory usage. (#3665)
Keras: Support only legacy optimizers in Keras 2.11+. (#3725)
Gloo: When negotiating, use gather rather than allgather. (#3633)
Use packaging.version instead of distutils version classes. (#3700)

Deprecated

Deprecated field shuffle_buffer_size from EstimatorParams. Use shuffle to enable or disable shuffling instead. (#3665)

Removed

Build: Removed std::regex use for better cxxabi11 compatibility. (#3584)

Fixed

TensorFlow: Fixed the optimizer iteration increments when backward_passes_per_step > 1. (#3631)
Fixed FuseResponses() on BATCHED_D2D_PADDING edge cases for Reducescatter and/or ROCm. (#3621)
PyTorch: Fixed Reducescatter functions to raise HorovodInternalError rather than RuntimeError. (#3594)
PyTorch on GPUs without GPU operations: Fixed grouped allreduce to set CPU device in tensor table. (#3594)
Fixed race condition in PyTorch allocation handling. (#3639)
Build: Fixed finding nvcc (if not in $PATH) with older versions of CMake. (#3682)
Fixed reducescatter() and grouped_reducescatter() to raise clean exceptions for scalar inputs. (#3699)
Updated Eigen submodule to fix build on macOS with aarch64. (#3619)
Build: Correctly select files in torch/ directory to be hipified. (#3588)
Build: Modify regex match for CUDA|ROCm in FindPytorch.cmake. (#3593)
Build: Fixed ROCm-specific build failure. (#3630)

21 Jun 09:19

EnricoMi

Reducescatter for NCCL, MPI and Gloo, AMD GPU XLA Op implementation, Spark Estimator improvements, TensorFlow Data Service Horovod job, Elastic run API

Added

Added hvd.reducescatter() operation with implementations in NCCL, MPI, and Gloo. (#3299, #3574)
Added AMD GPU XLA Op Implementation. (#3486)
Added Horovod job to spin up distributed TensorFlow Data Service. (#3525)
Spark: Expose random seed as an optional parameter. (#3517)
Add Helm Chart. (#3546)
Elastic: Add elastic run API. (#3503)
Spark Estimator: Expose random seed for model training reproducibility. (#3517)
Spark Estimator: Add option whether to use GPUs at all. (#3526)
Spark Estimator: Expose parameter to set start method for multiprocessing. (#3580)

Changed

MXNet: Updated allreduce functions to newer op API. (#3299)
TensorFlow: Make TensorFlow output allocations asynchronous when using NCCL backend. (#3464)
TensorFlow: Clear locally accumulated gradients by assigning zeros_like, so that infinite gradients are cleared correctly. (#3505)
Make HorovodVersionMismatchError subclass ImportError instead of just a standard Exception. (#3549)
Elastic: Catch any exception to prevent the discovery thread from silently dying. (#3436)
Horovodrun: Exit check_build (--check-build) via sys.exit to flush stdout. (#3272)
Spark: Use env to set environment vars in remote shell. (#3489)
Build: Avoid redundant ptx generation for maximum specified compute capability. (#3509)
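
The HorovodVersionMismatchError change affects callers that guard optional imports. A minimal sketch of the effect follows, using a stand-in class rather than Horovod's own; `load_extension` and `extension_available` are hypothetical helpers.

```python
class HorovodVersionMismatchError(ImportError):
    """Stand-in for the real class; only the base class matters here."""


def load_extension():
    # Pretend the compiled extension was built against another framework version.
    raise HorovodVersionMismatchError("extension built for a different version")


def extension_available():
    try:
        load_extension()
        return True
    except ImportError:
        # Also catches HorovodVersionMismatchError because it subclasses ImportError.
        return False


print(extension_available())
```

With a plain Exception base class, the `except ImportError` guard would not catch the mismatch and the error would propagate to the caller.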

Deprecated

MXNet: Deprecated average argument of allreduce functions. (#3299)
Public and internal APIs: Deprecated use of np, min_np, max_np; use num_proc, min_num_proc, and max_num_proc, respectively, instead. (#3409)
Horovodrun: Providing multiple NICs as a comma-separated string via --network-interface is deprecated; use --network-interface multiple times or --network-interfaces instead. (#3506)
horovod.run: Argument network_interface with a comma-separated string is deprecated; use network_interfaces with Iterable[str] instead. (#3506)
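
One common way to handle a deprecation like the network_interface one is to accept both forms, warn on the old one, and normalize to a list. The sketch below shows that pattern under stated assumptions: `normalize_network_interfaces` is a hypothetical helper, and its behavior is not taken from Horovod's implementation.

```python
import warnings


def normalize_network_interfaces(network_interface=None, network_interfaces=None):
    """Merge the deprecated comma-separated form with the iterable form."""
    result = list(network_interfaces or [])
    if network_interface is not None:
        if "," in network_interface:
            warnings.warn(
                "comma-separated network_interface is deprecated; "
                "pass network_interfaces as an iterable of strings",
                DeprecationWarning,
                stacklevel=2,
            )
        # Split the legacy string and drop empty fragments.
        result.extend(n.strip() for n in network_interface.split(",") if n.strip())
    return result


# Record warnings so the deprecated call can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = normalize_network_interfaces(network_interface="eth0,eth1")

preferred = normalize_network_interfaces(network_interfaces=["eth0", "ib0"])
print(legacy, preferred, len(caught))
```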

Fixed

Fallback to NCCL shared lib if static one is not found. (#3500)
Spark/Lightning: Added missing transform_spec for Petastorm datamodule. (#3543)
Spark/Lightning: Fixed PTL Spark example with checkpoint usage by calling save_hyperparameters(). (#3527)
Elastic: Fixed empty hostname returned from HostDiscoveryScript. (#3490)
TensorFlow 2.9: Fixed build for API change related to tensorflow_accelerator_device_info. (#3513)
TensorFlow 2.10: Bumped build partially to C++17. (#3558)
TensorFlow: Fixed gradient update timing in TF AggregationHelperEager. (#3496)
TensorFlow: Fixed resource NotFoundError in TF AggregationHelper. (#3499)

21 Apr 08:28

EnricoMi

Hotfix: DBFSLocalStore get_localized_path implementation

Fixed

Make DBFSLocalStore support "file:/dbfs/...", implement get_localized_path. (#3510)

10 Mar 18:38

tgaddair

Hotfix: Fix ignored cuda arch flags

Fixed

Setup: Require fsspec >= 2021.07.0. (#3451)
Fixed ignored CUDA arch flags. (#3462)

03 Mar 20:39

EnricoMi

Hotfix: CMake better finding CUDA

Fixed

Extended CMake build script to find CUDA in more cases, even if nvcc is not in $PATH. (#3444)

02 Mar 15:57

tgaddair

Elastic mode improvements, MXNet async dependency engine, fixes for latest PyTorch and TensorFlow versions

Added

Ray: Added elastic keyword parameters to RayExecutor API: This API supports both static (non-elastic) and elastic Horovod jobs. (#3190)
TensorFlow: Added in-place broadcasting of variables. (#3128)
Elastic: Added support for resurrecting blacklisted hosts. (#3319)
MXNet: Added support for MXNet async dependency engine. (#3242, #2963)
Spark/Lightning: Added history to lightning estimator. (#3214)

Changed

Moved to CMake version 3.13 with first-class CUDA language support and re-enabled parallelized builds. Uses a temporary installation of CMake if CMake 3.13 is not found. (#3261, #3371)
Moved released Docker image horovod and horovod-cpu to Ubuntu 20.04 and Python 3.8. (#3393)
Spark Estimator: Don't shuffle row groups if the training data must not be shuffled. (#3369)
Spark/Lightning: Reduced memory footprint of async dataloader. (#3239)
Elastic: Improved handling of NCCL errors in elastic scenarios. (#3112)
Spark/Lightning: Do not overwrite model with checkpoint by default. (#3201)
Make checkpoint name optional so that user can save to h5 format. (#3411)

Deprecated

Deprecated ElasticRayExecutor APIs in favor of the new RayExecutor API. (#3190)

Removed

Spark: Removed h5py<3 constraint, which is no longer needed for TensorFlow > 2.5.0. (#3301)

Fixed

Elastic Spark: Fixed indices in initial task-to-task registration. (#3410)
PyTorch: Fixed GIL-related deadlock with PyTorch 1.10.1. (#3352)
PyTorch: Fixed finalization of ProcessSetTable. (#3351)
Fixed remote trainers to point to the correct shared lib path. (#3258)
Fixed imports from tensorflow.python.keras with tensorflow 2.6.0+. (#3403)
Fixed Adasum communicator init logic. (#3379)
Lightning: Fixed resume logger. (#3375)
Fixed the checkpoint directory structure for pytorch and pytorch lightning. (#3362)
Fixed possible integer overflow in multiplication. (#3368)
Fixed the pytorch_lightning_mnist.py example. (#3245, #3290)
Fixed barrier segmentation fault. (#3313)
Fixed hvd.barrier() tensor queue management. (#3300)
Fixed PyArrow "list index out of range" IndexError. (#3274)
Elastic: Fixed all workers sometimes failing on elastic Horovod failure. (#3264)
Spark/Lightning: Fixed setting limit_train_batches and limit_val_batches. (#3237)
Elastic: Fixed ElasticSampler and hvd.elastic.state losing some indices of processed samples when nodes dropped. (#3143)
Spark/Lightning: Fixed history metrics for estimator serialization. (#3216)
Ray: Fixed RayExecutor to fail when num_workers=0 and num_hosts=None. (#3210)
Spark/Lightning: Fixed checkpoint callback dirpath typo. (#3204)
