Releases · horovod/horovod
v0.28.1: Build fixes (ROCm, GCC 12)
v0.28.0: Keras 2.11+ optimizers, faster reducescatter, fixes for latest TensorFlow, CUDA, NCCL
Added
- TensorFlow: Added new `get_local_and_global_gradients` to `PartialDistributedGradientTape` to retrieve local and non-local gradients separately (see the sketch below). (#3859)
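A minimal sketch of the new accessor. This assumes the tape wrapper is exposed as `hvd.PartialDistributedGradientTape` with a `local_layers` argument and that the accessor returns the two gradient sets in (local, global) order; the exact signature may differ, so treat this as illustrative:

```python
# Illustrative sketch only: local_layers and the return order of
# get_local_and_global_gradients are assumptions, not confirmed API.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),  # synchronized across ranks
    tf.keras.layers.Dense(1),                      # kept rank-local below
])
x = tf.random.normal([8, 4])
y = tf.random.normal([8, 1])

with tf.GradientTape() as tape:
    loss = tf.keras.losses.MeanSquaredError()(y, model(x))

# Wrap the tape; layers passed as local are excluded from allreduce.
tape = hvd.PartialDistributedGradientTape(tape, local_layers=[model.layers[1]])

# New in v0.28.0 (#3859): fetch local and non-local gradients separately.
local_grads, global_grads = tape.get_local_and_global_gradients(
    loss, model.trainable_variables)
```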
Changed
- Improved reducescatter performance by allocating output tensors before enqueuing the operation. (#3824)
- TensorFlow: Ensured that `tf.logical_and` within allreduce `tf.cond` runs on CPU. (#3885)
- TensorFlow: Added support for Keras 2.11+ optimizers. (#3860)
- `CUDA_VISIBLE_DEVICES` environment variable is no longer passed to remote nodes. (#3865)
Fixed
- Fixed build with ROCm. (#3839, #3848)
- Fixed build of Docker image horovod-nvtabular. (#3851)
- Fixed linking recent NCCL by defaulting CUDA runtime library linkage to static and ensuring that weak symbols are overridden. (#3867, #3846)
- Fixed compatibility with TensorFlow 2.12 and recent nightly versions. (#3864, #3894, #3906, #3907)
- Fixed missing arguments of Keras allreduce function. (#3905)
- Updated with_device functions in MXNet and PyTorch to skip unnecessary cudaSetDevice calls. (#3912)
v0.27.0: Custom data loaders in Spark TorchEstimator, more model parallelism in Keras, improved allgather performance, fixes for latest PyTorch and TensorFlow versions
Added
- Keras: Added `PartialDistributedOptimizer` API. (#3738)
- Added `HOROVOD_SPARK_USE_LOCAL_RANK_GPU_INDEX` environment variable to ignore GPU device indices assigned by Spark and always use the local-rank GPU device in Spark estimators. (#3737)
- Added support for reducescatter arguments `prescale_factor` and `postscale_factor` and moved averaging into the Horovod backend (see the sketch after this list). (#3815)
- Spark Estimator: Added support for custom data loaders in TorchEstimator. (#3787)
- Spark Estimator: Added NVTabular data loader for TorchEstimator. (#3787)
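A minimal sketch of the new scaling arguments, shown with the PyTorch API; the argument names follow the release note, but verify them against your installed version:

```python
# Illustrative sketch: prescale_factor scales inputs before the reduction,
# postscale_factor scales the reduced shard afterwards (averaging now
# happens inside the Horovod backend).
import torch
import horovod.torch as hvd

hvd.init()

# Each rank contributes a full tensor and receives one reduced shard.
t = torch.ones(4 * hvd.size())

shard = hvd.reducescatter(
    t, op=hvd.Sum, prescale_factor=1.0, postscale_factor=1.0 / hvd.size())
```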
Changed
- Improved NCCL performance for fused allgather operations through padding for better memory alignment. (#3727)
- Improved look-ahead tensor fusion buffer size estimates when allgather and other operations are mixed. (#3727)
Fixed
- ROCm: Fixed GPU MPI operations support in build. (#3746)
- PyTorch: Fixed linking order to avoid using Gloo from PyTorch dynamic libraries. (#3750)
- Fixed memory leak in `MPI_GPUAllgather`. (#3727)
- TensorFlow: Fixed deprecation warnings when building with TensorFlow 2.11. (#3767)
- Keras: Added support for additional arguments to `SyncBatchNormalization._moments()`. (#3775)
- Fixed version number parsing with pypa/packaging 22.0. (#3794)
- TensorFlow: Fixed linking with nightly versions leading up to TensorFlow 2.12. (#3755)
- TensorFlow: Fixed handling of `tf.IndexedSlices` types when scaling local gradients. (#3786)
- Added missing `MEMCPY_IN_FUSION_BUFFER` timeline event for reducescatter. (#3808)
- Fixed build of Docker image horovod-nvtabular. (#3817)
- TensorFlow: Several fixes for allreduce and grouped allreduce handling of `tf.IndexedSlices`. (#3813)
- Spark: Restricted PyArrow to versions < 11.0. (#3830)
- TensorFlow: Resolved conflicts between multiple optimizer wrappers reusing the same gradient accumulation counter. (#3783)
- TensorFlow/Keras: Fixed `DistributedOptimizer` with Keras 2.11+. (#3822)
- PyTorch, ROCm: Fixed allreduce average on process sets. (#3815)
v0.26.1: Hotfix: Fixing packaging import during install
Fixed
- Fixed packaging import during install to occur after install_requires. (#3741)
v0.26.0: Better support for model parallel, more reduction operations for allreduce (min, max, product), grouped allgather and reducescatter, Petastorm reader level parallel shuffling, NVTabular data loader
Added
- Spark Estimator: Added support for custom data loaders in KerasEstimator. (#3603)
- Spark Estimator: Added NVTabular data loader for KerasEstimator. (#3603)
- Spark Estimator: Added gradient accumulation support to Spark torch estimator. (#3681)
- TensorFlow: Added `register_local_var` functionality to distributed optimizers and local gradient aggregators. (#3695)
- TensorFlow: Added support for local variables for `BroadcastGlobalVariablesCallback`. (#3703)
- Enabled use of native `ncclAvg` op for NCCL allreduces. (#3646)
- Added support for additional reduction operations for `allreduce` (min, max, product; see the sketch after this list). (#3660)
- Added 2D torus `allreduce` using NCCL. (#3608)
- Added support for Petastorm reader level parallel shuffling. (#3665)
- Added random seed support for Lightning datamodule to generate reproducible data loading outputs. (#3665)
- Added support for `int8` and `uint8` `allreduce` and `grouped_allreduce` in TensorFlow. (#3649)
- Added support for batched memory copies in `GPUAllgather`. (#3590)
- Added support for batched memory copies in `GPUReducescatter`. (#3621)
- Added `hvd.grouped_allgather()` and `hvd.grouped_reducescatter()` operations. (#3594)
- Added warning messages if output tensor memory allocations fail. (#3594)
- Added `register_local_source` and `use_generic_names` functionality to `DistributedGradientTape`. (#3628)
- Added `PartialDistributedGradientTape()` API for model parallel use cases. (#3643)
- Spark/Lightning: Added `reader_worker_count` and `reader_pool_type`. (#3612)
- Spark/Lightning: Added `transformation_edit_fields` and `transformation_removed_fields` param for `EstimatorParams`. (#3651)
- TensorFlow: Added doc string for `hvd.grouped_allreduce()`. (#3594)
- ROCm: Enabled `alltoall`. (#3654)
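A minimal sketch of the new reduction ops and grouped collectives, shown with PyTorch and assuming the op constants are exposed as `hvd.Min`, `hvd.Max`, and `hvd.Product` (names inferred from the release note):

```python
# Illustrative sketch; op constant names are inferred from the release note.
import torch
import horovod.torch as hvd

hvd.init()

t = torch.tensor([float(hvd.rank() + 1)])

global_min = hvd.allreduce(t, op=hvd.Min)       # element-wise minimum over ranks
global_max = hvd.allreduce(t, op=hvd.Max)       # element-wise maximum over ranks
global_prod = hvd.allreduce(t, op=hvd.Product)  # element-wise product over ranks

# Grouped variants fuse one operation over a list of tensors (#3594).
gathered = hvd.grouped_allgather([torch.ones(2), torch.ones(3)])
scattered = hvd.grouped_reducescatter(
    [torch.ones(2 * hvd.size()), torch.ones(3 * hvd.size())])
```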
Changed
- Default Petastorm reader pool is changed from `process` to `thread` for lower memory usage. (#3665)
- Keras: Support only legacy optimizers in Keras 2.11+. (#3725)
- Gloo: When negotiating, use `gather` rather than `allgather`. (#3633)
- Use `packaging.version` instead of `distutils` version classes. (#3700)
Deprecated
- Deprecated field `shuffle_buffer_size` from `EstimatorParams`; use the boolean `shuffle` parameter to enable or disable shuffling instead (see the sketch below). (#3665)
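A minimal sketch of the replacement parameter, assuming the Keras Spark estimator; all other constructor arguments are elided for brevity:

```python
# Illustrative sketch: only the shuffle-related argument is shown; a real
# estimator also needs model, optimizer, loss, feature/label columns, etc.
from horovod.spark.keras import KerasEstimator

estimator = KerasEstimator(
    num_proc=4,
    shuffle=True,  # replaces the deprecated shuffle_buffer_size=...
)
```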
Removed
- Build: Removed std::regex use for better cxxabi11 compatibility. (#3584)
Fixed
- TensorFlow: Fixed the optimizer iteration increments when `backward_passes_per_step > 1`. (#3631)
- Fixed `FuseResponses()` on `BATCHED_D2D_PADDING` edge cases for Reducescatter and/or ROCm. (#3621)
- PyTorch: Fixed Reducescatter functions to raise `HorovodInternalError` rather than `RuntimeError`. (#3594)
- PyTorch on GPUs without GPU operations: Fixed grouped allreduce to set CPU device in tensor table. (#3594)
- Fixed race condition in PyTorch allocation handling. (#3639)
- Build: Fixed finding `nvcc` (if not in `$PATH`) with older versions of CMake. (#3682)
- Fixed `reducescatter()` and `grouped_reducescatter()` to raise clean exceptions for scalar inputs. (#3699)
- Updated Eigen submodule to fix build on macOS with aarch64. (#3619)
- Build: Correctly select files in `torch/` directory to be hipified. (#3588)
- Build: Modify regex match for CUDA|ROCm in `FindPytorch.cmake`. (#3593)
- Build: Fixed ROCm-specific build failure. (#3630)
v0.25.0: Reducescatter for NCCL, MPI and Gloo, AMD GPU XLA Op implementation, Spark Estimator improvements, TensorFlow Data Service Horovod job, Elastic run API
Added
- Added `hvd.reducescatter()` operation with implementations in NCCL, MPI, and Gloo (see the sketch after this list). (#3299, #3574)
- Added AMD GPU XLA Op Implementation. (#3486)
- Added Horovod job to spin up distributed TensorFlow Data Service. (#3525)
- Spark: Expose random seed as an optional parameter. (#3517)
- Add Helm Chart. (#3546)
- Elastic: Add elastic run API. (#3503)
- Spark Estimator: Expose random seed for model training reproducibility. (#3517)
- Spark Estimator: Add option whether to use GPUs at all. (#3526)
- Spark Estimator: Expose parameter to set start method for `multiprocessing`. (#3580)
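A minimal sketch of the new collective, shown with the TensorFlow API; each rank contributes a full tensor and receives one reduced shard:

```python
# Illustrative sketch of hvd.reducescatter() (#3299, #3574).
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# A [size * 2] input; after reducescatter each rank holds a [2] shard of the sum.
t = tf.ones([hvd.size() * 2])
shard = hvd.reducescatter(t, op=hvd.Sum)
```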
Changed
- MXNet: Updated allreduce functions to newer `op` API. (#3299)
- TensorFlow: Make TensorFlow output allocations asynchronous when using NCCL backend. (#3464)
- TensorFlow: Clear locally accumulated gradients by assigning `zeros_like` so they are not left uncleared. (#3505)
- Make `HorovodVersionMismatchError` subclass `ImportError` instead of just a standard `Exception`. (#3549)
- Elastic: Catch any exception to prevent the discovery thread from silently dying. (#3436)
- Horovodrun: Exit check_build (`--check-build`) via `sys.exit` to flush stdout. (#3272)
- Spark: Use `env` to set environment vars in remote shell. (#3489)
- Build: Avoid redundant ptx generation for maximum specified compute capability. (#3509)
Deprecated
- MXNet: Deprecated `average` argument of allreduce functions. (#3299)
- Public and internal APIs: deprecated use of `np`, `min_np`, and `max_np`; use `num_proc`, `min_num_proc`, and `max_num_proc`, respectively, instead. (#3409)
- Horovodrun: Providing multiple NICs as a comma-separated string via `--network-interface` is deprecated; use `--network-interface` multiple times or `--network-interfaces` instead. (#3506)
- horovod.run: Argument `network_interface` with a comma-separated string is deprecated; use `network_interfaces` with `Iterable[str]` instead (see the sketch after this list). (#3506)
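A minimal sketch of the non-deprecated spelling, assuming the programmatic launcher is importable as `horovod.run` (as the note above refers to it); `num_proc` is used instead of the likewise-deprecated `np`:

```python
# Illustrative sketch: pass NICs as an Iterable[str] via network_interfaces
# instead of a comma-separated network_interface string.
from horovod import run as hvd_run

def train():
    import horovod.tensorflow as hvd
    hvd.init()
    return hvd.rank()

# Old (deprecated): hvd_run(train, np=2, network_interface="eth0,eth1")
results = hvd_run(train, num_proc=2, network_interfaces=["eth0", "eth1"])
```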
Fixed
- Fallback to NCCL shared lib if static one is not found. (#3500)
- Spark/Lightning: Added missing `tranform_spec` for Petastorm datamodule. (#3543)
- Spark/Lightning: Fixed PTL Spark example with checkpoint usage by calling `save_hyperparameters()`. (#3527)
- Elastic: Fixed empty hostname returned from `HostDiscoveryScript`. (#3490)
- TensorFlow 2.9: Fixed build for API change related to `tensorflow_accelerator_device_info`. (#3513)
- TensorFlow 2.10: Bumped build partially to C++17. (#3558)
- TensorFlow: Fixed gradient update timing in TF `AggregationHelperEager`. (#3496)
- TensorFlow: Fixed resource `NotFoundError` in TF `AggregationHelper`. (#3499)
v0.24.3: Hotfix: DBFSLocalStore get_localized_path implementation
Fixed
- Make `DBFSLocalStore` support "file:/dbfs/...", implement `get_localized_path` (see the sketch below). (#3510)
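A minimal sketch of the fixed behavior, assuming `Store.create` dispatches "file:/dbfs/..." prefixes to `DBFSLocalStore`; the path below is illustrative:

```python
# Illustrative sketch of the v0.24.3 hotfix (#3510).
from horovod.spark.common.store import Store

store = Store.create("file:/dbfs/horovod_runs")  # now recognized as DBFS
local_path = store.get_localized_path(store.get_train_data_path())
```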
v0.24.2: Hotfix: Fix ignored cuda arch flags
v0.24.1: Hotfix: CMake better finding CUDA
Fixed
- Extended CMake build script to often find CUDA even if `nvcc` is not in `$PATH`. (#3444)
v0.24.0: Elastic mode improvements, MXNet async dependency engine, fixes for latest PyTorch and TensorFlow versions
Added
- Ray: Added elastic keyword parameters to RayExecutor API: This API supports both static (non-elastic) and elastic Horovod jobs. (#3190)
- TensorFlow: Added in-place broadcasting of variables (see the sketch after this list). (#3128)
- Elastic: Added support for resurrecting blacklisted hosts. (#3319)
- MXNet: Added support for MXNet async dependency engine. (#3242, #2963)
- Spark/Lightning: Added history to lightning estimator. (#3214)
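A minimal sketch of broadcasting variables after `hvd.init()`; the in-place variant is shown commented out since its exact keyword is an assumption here:

```python
# Illustrative sketch of variable broadcasting in TensorFlow.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

w = tf.Variable(tf.random.normal([4, 4]))

# Standard broadcast: every rank receives root rank 0's values.
hvd.broadcast_variables([w], root_rank=0)

# v0.24.0 adds an in-place variant (#3128); keyword name assumed, verify first:
# hvd.broadcast_variables([w], root_rank=0, inplace=True)
```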
Changed
- Moved to CMake version 3.13 with first-class CUDA language support and re-enabled parallelized builds. Uses a temporary installation of CMake if CMake 3.13 is not found. (#3261, #3371)
- Moved released Docker images `horovod` and `horovod-cpu` to Ubuntu 20.04 and Python 3.8. (#3393)
- Spark Estimator: Don't shuffle row groups if the training data must not be shuffled. (#3369)
- Spark/Lightning: Reduced memory footprint of async dataloader. (#3239)
- Elastic: Improved handling NCCL errors under elastic scenario. (#3112)
- Spark/Lightning: Do not overwrite model with checkpoint by default. (#3201)
- Make checkpoint name optional so that users can save to h5 format. (#3411)
Deprecated
- Deprecated ElasticRayExecutor APIs in favor of the new RayExecutor API. (#3190)
Removed
- Spark: Removed `h5py<3` constraint as it is no longer needed for TensorFlow > 2.5.0. (#3301)
Fixed
- Elastic Spark: Fixed indices in initial task-to-task registration. (#3410)
- PyTorch: Fixed GIL-related deadlock with PyTorch 1.10.1. (#3352)
- PyTorch: Fixed finalization of ProcessSetTable. (#3351)
- Fixed remote trainers to point to the correct shared lib path. (#3258)
- Fixed imports from `tensorflow.python.keras` with TensorFlow 2.6.0+. (#3403)
- Fixed Adasum communicator init logic. (#3379)
- Lightning: Fixed resume logger. (#3375)
- Fixed the checkpoint directory structure for PyTorch and PyTorch Lightning. (#3362)
- Fixed possible integer overflow in multiplication. (#3368)
- Fixed the `pytorch_lightning_mnist.py` example. (#3245, #3290)
- Fixed barrier segmentation fault. (#3313)
- Fixed `hvd.barrier()` tensor queue management. (#3300)
- Fixed PyArrow "list index out of range" IndexError. (#3274)
- Elastic: Fixed all workers sometimes failing on elastic Horovod failure. (#3264)
- Spark/Lightning: Fixed setting `limit_train_batches` and `limit_val_batches`. (#3237)
- Elastic: Fixed ElasticSampler and `hvd.elastic.state` losing some indices of processed samples when nodes dropped. (#3143)
- Spark/Lightning: Fixed history metrics for estimator serialization. (#3216)
- Ray: Fixed RayExecutor to fail when `num_workers=0` and `num_hosts=None`. (#3210)
- Spark/Lightning: Fixed checkpoint callback `dirpath` typo. (#3204)