Releases: horovod/horovod

v0.28.1: Build fixes (ROCm, GCC 12)

12 Jun 09:26
1d217b5

Fixed

  • Fixed build with GCC 12. (#3925)
  • PyTorch: Fixed build on ROCm. (#3928)
  • TensorFlow: Fixed local_rank_op. (#3940)

v0.28.0: Keras 2.11+ optimizers, faster reducescatter, fixes for latest TensorFlow, CUDA, NCCL

10 May 09:13

Added

  • TensorFlow: Added new get_local_and_global_gradients to PartialDistributedGradientTape to retrieve local and non-local gradients separately. (#3859)

Changed

  • Improved reducescatter performance by allocating output tensors before enqueuing the operation. (#3824)
  • TensorFlow: Ensured that tf.logical_and within allreduce tf.cond runs on CPU. (#3885)
  • TensorFlow: Added support for Keras 2.11+ optimizers. (#3860)
  • CUDA_VISIBLE_DEVICES environment variable is no longer passed to remote nodes. (#3865)
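
The change to CUDA_VISIBLE_DEVICES forwarding can be pictured as a filter applied to the launcher's environment before it is handed to a remote node. The following is a minimal sketch only; `LOCAL_ONLY_VARS` and `env_for_remote` are hypothetical names used for illustration and are not taken from the Horovod code base.

```python
# Hypothetical set of variables that describe the launching host only
# and should therefore not be forwarded to remote nodes.
LOCAL_ONLY_VARS = {"CUDA_VISIBLE_DEVICES"}


def env_for_remote(env):
    """Return a copy of env without host-specific variables."""
    return {key: value for key, value in env.items() if key not in LOCAL_ONLY_VARS}


local_env = {"CUDA_VISIBLE_DEVICES": "0,1", "PATH": "/usr/bin"}
remote_env = env_for_remote(local_env)
print(remote_env)
```

With a filter like this, each remote node keeps whatever GPU visibility its own environment defines instead of inheriting the launcher's.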

Fixed

  • Fixed build with ROCm. (#3839, #3848)
  • Fixed build of Docker image horovod-nvtabular. (#3851)
  • Fixed linking recent NCCL by defaulting CUDA runtime library linkage to static and ensuring that weak symbols are overridden. (#3867, #3846)
  • Fixed compatibility with TensorFlow 2.12 and recent nightly versions. (#3864, #3894, #3906, #3907)
  • Fixed missing arguments of Keras allreduce function. (#3905)
  • Updated with_device functions in MXNet and PyTorch to skip unnecessary cudaSetDevice calls. (#3912)

Custom data loaders in Spark TorchEstimator, more model parallelism in Keras, improved allgather performance, fixes for latest PyTorch and TensorFlow versions

01 Feb 17:51
bfaca90

Added

  • Keras: Added PartialDistributedOptimizer API. (#3738)
  • Added HOROVOD_SPARK_USE_LOCAL_RANK_GPU_INDEX environment variable to ignore GPU device indices assigned by Spark and always use local rank GPU device in Spark estimators. (#3737)
  • Added support for reducescatter arguments prescale_factor and postscale_factor and moved averaging into Horovod backend. (#3815)
  • Spark Estimator: Added support for custom data loaders in TorchEstimator. (#3787)
  • Spark Estimator: Added NVTabular data loader for TorchEstimator. (#3787)
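
To make the effect of prescale_factor and postscale_factor on a reducescatter concrete, here is a pure-Python simulation that does not call Horovod. The function name `simulate_reducescatter` is hypothetical, inputs are plain 1-D lists, and the handling of lengths that do not divide evenly (leading ranks get one extra element) is an assumption made for this sketch.

```python
def simulate_reducescatter(per_rank, prescale_factor=1.0, postscale_factor=1.0):
    """Simulate a sum reducescatter over equally sized 1-D inputs.

    per_rank[r] is the input held by rank r. The r-th entry of the result
    is the slice of the reduced data that rank r would receive.
    """
    size = len(per_rank)
    length = len(per_rank[0])
    # Scale every input before the reduction, then sum element-wise.
    reduced = [
        sum(prescale_factor * rank_input[i] for rank_input in per_rank)
        for i in range(length)
    ]
    # Scale the reduced values once more after the reduction.
    reduced = [postscale_factor * value for value in reduced]
    # Assumption: leading ranks receive one extra element when the length
    # is not divisible by the number of ranks.
    base, extra = divmod(length, size)
    outputs, start = [], 0
    for rank in range(size):
        count = base + (1 if rank < extra else 0)
        outputs.append(reduced[start:start + count])
        start += count
    return outputs


inputs = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
print(simulate_reducescatter(inputs))
print(simulate_reducescatter(inputs, postscale_factor=0.5))
```

In this model, averaging is a sum followed by a postscale of 1 / number of ranks, which is one way to read "moved averaging into Horovod backend".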

Changed

  • Improved NCCL performance for fused allgather operations through padding for better memory alignment. (#3727)
  • Improved look-ahead tensor fusion buffer size estimates when allgather and other operations are mixed. (#3727)
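
The idea of padding entries in a fused buffer for better memory alignment can be sketched as follows. This is an illustration only: the helper `aligned_offsets` and the 64-byte alignment are assumptions chosen for the example, not values taken from Horovod.

```python
def aligned_offsets(sizes, alignment=64):
    """Place entries of the given byte sizes back to back in a buffer,
    rounding each entry up to a multiple of `alignment`.

    Returns the start offset of every entry and the total padded size.
    """
    offsets = []
    cursor = 0
    for size in sizes:
        offsets.append(cursor)
        # Round the entry size up to the next multiple of the alignment.
        padded = -(-size // alignment) * alignment
        cursor += padded
    return offsets, cursor


offsets, total = aligned_offsets([100, 30, 64])
print(offsets, total)
```

Every entry then starts on an aligned boundary, at the cost of some unused bytes between entries. A buffer size estimate has to account for those extra bytes.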

Fixed

  • ROCm: Fixed GPU MPI operations support in build. (#3746)
  • PyTorch: Fixed linking order to avoid using Gloo from PyTorch dynamic libraries. (#3750)
  • Fixed memory leak in MPI_GPUAllgather. (#3727)
  • TensorFlow: Fixed deprecation warnings when building with TensorFlow 2.11. (#3767)
  • Keras: Added support for additional arguments to SyncBatchNormalization._moments(). (#3775)
  • Fixed version number parsing with pypa/packaging 22.0. (#3794)
  • TensorFlow: Fixed linking with nightly versions leading up to TensorFlow 2.12. (#3755)
  • TensorFlow: Fixed handling of tf.IndexedSlices types when scaling local gradients. (#3786)
  • Added missing MEMCPY_IN_FUSION_BUFFER timeline event for reducescatter. (#3808)
  • Fixed build of Docker image horovod-nvtabular. (#3817)
  • TensorFlow: Several fixes for allreduce and grouped allreduce handling of tf.IndexedSlices. (#3813)
  • Spark: Restricted PyArrow to versions < 11.0. (#3830)
  • TensorFlow: Resolved conflicts between multiple optimizer wrappers reusing the same gradient accumulation counter. (#3783)
  • TensorFlow/Keras: Fixed DistributedOptimizer with Keras 2.11+. (#3822)
  • PyTorch, ROCm: Fixed allreduce average on process sets. (#3815)

Hotfix: Fixing packaging import during install

14 Oct 08:20
3460487

Fixed

  • Fixed packaging import during install to occur after install_requires. (#3741)

Better support for model parallel, more reduction operations for allreduce (min, max, product), grouped allgather and reducescatter, Petastorm reader level parallel shuffling, NVTabular data loader

13 Oct 12:29
c638dce

Added

  • Spark Estimator: Added support for custom data loaders in KerasEstimator. (#3603)
  • Spark Estimator: Added NVTabular data loader for KerasEstimator. (#3603)
  • Spark Estimator: Added gradient accumulation support to Spark torch estimator. (#3681)
  • TensorFlow: Added register_local_var functionality to distributed optimizers and local gradient aggregators. (#3695)
  • TensorFlow: Added support for local variables for BroadcastGlobalVariablesCallback. (#3703)
  • Enabled use of native ncclAvg op for NCCL allreduces. (#3646)
  • Added support for additional reduction operations for allreduce (min, max, product). (#3660)
  • Added 2D torus allreduce using NCCL. (#3608)
  • Added support for Petastorm reader level parallel shuffling. (#3665)
  • Added random seed support for Lightning datamodule to generate reproducible data loading outputs. (#3665)
  • Added support for int8 and uint8 allreduce and grouped_allreduce in TensorFlow. (#3649)
  • Added support for batched memory copies in GPUAllgather. (#3590)
  • Added support for batched memory copies in GPUReducescatter. (#3621)
  • Added hvd.grouped_allgather() and hvd.grouped_reducescatter() operations. (#3594)
  • Added warning messages if output tensor memory allocations fail. (#3594)
  • Added register_local_source and use_generic_names functionality to DistributedGradientTape. (#3628)
  • Added PartialDistributedGradientTape() API for model parallel use cases. (#3643)
  • Spark/Lightning: Added reader_worker_count and reader_pool_type. (#3612)
  • Spark/Lightning: Added transformation_edit_fields and transformation_removed_fields param for EstimatorParams. (#3651)
  • TensorFlow: Added doc string for hvd.grouped_allreduce(). (#3594)
  • ROCm: Enabled alltoall. (#3654)
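
The additional allreduce reduction operations (min, max, product) differ from the default sum only in how per-rank values are combined. The pure-Python simulation below illustrates this without calling Horovod; `simulate_allreduce` and the string keys in `REDUCE_OPS` are hypothetical names for this sketch, not Horovod's own constants.

```python
from functools import reduce
import operator

# Hypothetical mapping from an operation name to a binary combiner.
REDUCE_OPS = {
    "sum": operator.add,
    "min": min,
    "max": max,
    "product": operator.mul,
}


def simulate_allreduce(per_rank, op="sum"):
    """Combine equally sized 1-D inputs element-wise across ranks.

    per_rank[r] is the input held by rank r. Every rank receives the
    same fully reduced result.
    """
    combine = REDUCE_OPS[op]
    reduced = [reduce(combine, column) for column in zip(*per_rank)]
    return [list(reduced) for _ in per_rank]


inputs = [[1, 5], [4, 2], [3, 3]]
for name in ("sum", "min", "max", "product"):
    print(name, simulate_allreduce(inputs, name)[0])
```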

Changed

  • Default Petastorm reader pool is changed from process to thread for lower memory usage. (#3665)
  • Keras: Support only legacy optimizers in Keras 2.11+. (#3725)
  • Gloo: When negotiating, use gather rather than allgather. (#3633)
  • Use packaging.version instead of distutils version classes. (#3700)

Deprecated

  • Deprecated field shuffle_buffer_size from EstimatorParams. Use shuffle to enable or disable shuffling instead. (#3665)

Removed

  • Build: Removed std::regex use for better cxxabi11 compatibility. (#3584)

Fixed

  • TensorFlow: Fixed the optimizer iteration increments when backward_passes_per_step > 1. (#3631)
  • Fixed FuseResponses() on BATCHED_D2D_PADDING edge cases for Reducescatter and/or ROCm. (#3621)
  • PyTorch: Fixed Reducescatter functions to raise HorovodInternalError rather than RuntimeError. (#3594)
  • PyTorch on GPUs without GPU operations: Fixed grouped allreduce to set CPU device in tensor table. (#3594)
  • Fixed race condition in PyTorch allocation handling. (#3639)
  • Build: Fixed finding nvcc (if not in $PATH) with older versions of CMake. (#3682)
  • Fixed reducescatter() and grouped_reducescatter() to raise clean exceptions for scalar inputs. (#3699)
  • Updated Eigen submodule to fix build on macOS with aarch64. (#3619)
  • Build: Correctly select files in torch/ directory to be hipified. (#3588)
  • Build: Modify regex match for CUDA|ROCm in FindPytorch.cmake. (#3593)
  • Build: Fixed ROCm-specific build failure. (#3630)

Reducescatter for NCCL, MPI and Gloo, AMD GPU XLA Op implementation, Spark Estimator improvements, TensorFlow Data Service Horovod job, Elastic run API

21 Jun 09:19
48e0aff

Added

  • Added hvd.reducescatter() operation with implementations in NCCL, MPI, and Gloo. (#3299, #3574)
  • Added AMD GPU XLA Op Implementation. (#3486)
  • Added Horovod job to spin up distributed TensorFlow Data Service. (#3525)
  • Spark: Expose random seed as an optional parameter. (#3517)
  • Add Helm Chart. (#3546)
  • Elastic: Add elastic run API. (#3503)
  • Spark Estimator: Expose random seed for model training reproducibility. (#3517)
  • Spark Estimator: Add option whether to use GPUs at all. (#3526)
  • Spark Estimator: Expose parameter to set start method for multiprocessing. (#3580)

Changed

  • MXNet: Updated allreduce functions to newer op API. (#3299)
  • TensorFlow: Make TensorFlow output allocations asynchronous when using NCCL backend. (#3464)
  • TensorFlow: Clear locally accumulated gradients by assigning zeros_like, so that infinite gradients are cleared correctly. (#3505)
  • Make HorovodVersionMismatchError subclass ImportError instead of just a standard Exception. (#3549)
  • Elastic: Catch any exception to prevent the discovery thread from silently dying. (#3436)
  • Horovodrun: Exit check_build (--check-build) via sys.exit to flush stdout. (#3272)
  • Spark: Use env to set environment vars in remote shell. (#3489)
  • Build: Avoid redundant ptx generation for maximum specified compute capability. (#3509)
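
Making a version mismatch error a subclass of ImportError matters for callers that probe for an optional dependency with `except ImportError`. The sketch below uses a stand-in class, `VersionMismatchError`, and hypothetical helper functions rather than Horovod's actual exception or loader.

```python
class VersionMismatchError(ImportError):
    """Stand-in for an error raised when a compiled extension was built
    against a different framework version than the one installed."""


def load_extension():
    # Hypothetical loader that always fails, to keep the example small.
    raise VersionMismatchError("extension built against a different framework version")


def extension_available():
    """Probe for the extension the way optional imports are usually probed."""
    try:
        load_extension()
    except ImportError:
        # Also catches VersionMismatchError because of the subclassing.
        return False
    return True


print(extension_available())
```

If the error derived directly from Exception, the probe above would let it propagate instead of treating the extension as unavailable.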

Deprecated

  • MXNet: Deprecated average argument of allreduce functions. (#3299)
  • Public and internal APIs: deprecate use of np, min_np, max_np,
    use num_proc, min_num_proc, and max_num_proc, respectively, instead. (#3409)
  • Horovodrun: Providing multiple NICs as comma-separated string via --network-interface is deprecated,
    use --network-interface multiple times or --network-interfaces instead. (#3506)
  • horovod.run: Argument network_interface with comma-separated string is deprecated,
    use network_interfaces with Iterable[str] instead. (#3506)
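
A common way to deprecate an argument name while still accepting it is a small shim that warns and forwards the value. The function `launch` below is hypothetical and only borrows the argument names np and num_proc from the entry above. It does not reproduce any Horovod signature.

```python
import warnings


def launch(num_proc=None, np=None):
    """Resolve the number of processes, accepting the deprecated `np` name."""
    if np is not None:
        warnings.warn(
            "np is deprecated, use num_proc instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # The new name wins if both are given.
        if num_proc is None:
            num_proc = np
    if num_proc is None:
        raise ValueError("num_proc is required")
    return num_proc


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(launch(np=4), len(caught))
```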

Fixed

  • Fallback to NCCL shared lib if static one is not found. (#3500)
  • Spark/Lightning: Added missing transform_spec for Petastorm datamodule. (#3543)
  • Spark/Lightning: Fixed PTL Spark example with checkpoint usage by calling save_hyperparameters(). (#3527)
  • Elastic: Fixed empty hostname returned from HostDiscoveryScript. (#3490)
  • TensorFlow 2.9: Fixed build for API change related to tensorflow_accelerator_device_info. (#3513)
  • TensorFlow 2.10: Bumped build partially to C++17. (#3558)
  • TensorFlow: Fixed gradient update timing in TF AggregationHelperEager. (#3496)
  • TensorFlow: Fixed resource NotFoundError in TF AggregationHelper. (#3499)

Hotfix: DBFSLocalStore get_localized_path implementation

21 Apr 08:28

Fixed

  • Make DBFSLocalStore support "file:/dbfs/...", implement get_localized_path. (#3510)

Hotfix: Fix ignored cuda arch flags

10 Mar 18:38

Fixed

  • [Setup] Require fsspec >= 2021.07.0. (#3451)
  • Fix ignored cuda arch flags. (#3462)

Hotfix: CMake better finding CUDA

03 Mar 20:39
ebd1350

Fixed

  • Extended CMake build script to find CUDA in more cases, even if nvcc is not in $PATH. (#3444)

Elastic mode improvements, MXNet async dependency engine, fixes for latest PyTorch and TensorFlow versions

02 Mar 15:57
b089df6

Added

  • Ray: Added elastic keyword parameters to RayExecutor API: This API supports both static (non-elastic) and elastic Horovod jobs. (#3190)
  • TensorFlow: Added in-place broadcasting of variables. (#3128)
  • Elastic: Added support for resurrecting blacklisted hosts. (#3319)
  • MXNet: Added support for MXNet async dependency engine. (#3242, #2963)
  • Spark/Lightning: Added history to lightning estimator. (#3214)
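
Resurrecting blacklisted hosts can be modeled as a blacklist whose entries expire after a cooldown period. The class below is a minimal sketch under that assumption; `HostBlacklist`, its methods, and the fixed cooldown are hypothetical and do not mirror Horovod's elastic driver.

```python
class HostBlacklist:
    """Blacklist whose entries expire after a fixed cooldown (in seconds)."""

    def __init__(self, cooldown_seconds):
        self._cooldown = cooldown_seconds
        self._blacklisted_at = {}

    def blacklist(self, host, now):
        """Record that `host` failed at time `now`."""
        self._blacklisted_at[host] = now

    def is_blacklisted(self, host, now):
        """Return True while the host is still inside its cooldown window."""
        since = self._blacklisted_at.get(host)
        if since is None:
            return False
        if now - since >= self._cooldown:
            # Cooldown elapsed: the host becomes eligible again.
            del self._blacklisted_at[host]
            return False
        return True


hosts = HostBlacklist(cooldown_seconds=60)
hosts.blacklist("node-1", now=0)
print(hosts.is_blacklisted("node-1", now=30), hosts.is_blacklisted("node-1", now=60))
```

Passing the current time explicitly keeps the sketch deterministic. A real driver would more likely read a monotonic clock.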

Changed

  • Moved to CMake version 3.13 with first-class CUDA language support and re-enabled parallelized builds. Uses a temporary installation of CMake if CMake 3.13 is not found. (#3261, #3371)
  • Moved released Docker image horovod and horovod-cpu to Ubuntu 20.04 and Python 3.8. (#3393)
  • Spark Estimator: Don't shuffle row groups if training data requires non-shuffle. (#3369)
  • Spark/Lightning: Reduced memory footprint of async dataloader. (#3239)
  • Elastic: Improved handling NCCL errors under elastic scenario. (#3112)
  • Spark/Lightning: Do not overwrite model with checkpoint by default. (#3201)
  • Make checkpoint name optional so that user can save to h5 format. (#3411)

Deprecated

  • Deprecated ElasticRayExecutor APIs in favor of the new RayExecutor API. (#3190)

Removed

  • Spark: Removed h5py<3 constraint as this is not needed anymore for TensorFlow >2.5.0. (#3301)

Fixed

  • Elastic Spark: Fixed indices in initial task-to-task registration. (#3410)
  • PyTorch: Fixed GIL-related deadlock with PyTorch 1.10.1. (#3352)
  • PyTorch: Fixed finalization of ProcessSetTable. (#3351)
  • Fixed remote trainers to point to the correct shared lib path. (#3258)
  • Fixed imports from tensorflow.python.keras with TensorFlow 2.6.0+. (#3403)
  • Fixed Adasum communicator init logic. (#3379)
  • Lightning: Fixed resume logger. (#3375)
  • Fixed the checkpoint directory structure for PyTorch and PyTorch Lightning. (#3362)
  • Fixed possible integer overflow in multiplication. (#3368)
  • Fixed the pytorch_lightning_mnist.py example. (#3245, #3290)
  • Fixed barrier segmentation fault. (#3313)
  • Fixed hvd.barrier() tensor queue management. (#3300)
  • Fixed PyArrow "list index out of range" IndexError. (#3274)
  • Elastic: Fixed all workers sometimes failing on elastic Horovod failure. (#3264)
  • Spark/Lightning: Fixed setting limit_train_batches and limit_val_batches. (#3237)
  • Elastic: Fixed ElasticSampler and hvd.elastic.state losing some indices of processed samples when nodes dropped. (#3143)
  • Spark/Lightning: Fixed history metrics for estimator serialization. (#3216)
  • Ray: Fixed RayExecutor to fail when num_workers=0 and num_hosts=None. (#3210)
  • Spark/Lightning: Fixed checkpoint callback dirpath typo. (#3204)