Releases: openucx/ucx

v1.16.0

16 Apr 13:39
e4bb802

1.16.0 (April 15, 2024)

Features:

UCP

  • Added tag offload rendezvous protocol in new infrastructure
  • Added rcache to old protocols infrastructure
  • Added multi-fragment protocols for stream API in new infrastructure
  • Enabled new protocols infrastructure by default
  • Removed context param from ucp_memh_put
  • Added assertion if trying to register unsupported memory type
  • Adjusted rendezvous latency to improve scalability
  • Improved endpoint configuration logging information
  • Added check for max length of user defined Active Message header
  • Added rcache support for mem type memory registration
  • Enabled error handling for rndv/put_zcopy protocol
  • Enabled v2 as default client/server connection establishment packet version
  • Enabled rendezvous protocol selection for reachable MDs only
  • Added ucp_rkey_compare API to enable rkey comparison
  • Added release version to worker address to enable wire compatibility
  • Added support for memory invalidation for rendezvous through DC transport
  • Enabled the use of strong fence with new protocols infrastructure

UCT

  • Added UCS_MEMORY_TYPE_RDMA memory type for better latency on supported devices
  • Implemented is_reachable_v2 API for IB transport
  • Added ep_is_connected API

RDMA CORE (IB, ROCE, etc.)

  • Added Floating LID (FLID) based routing support
  • Added latency and min_zcopy configuration variables to ROCm-IPC
  • Added support for indirect MR for cross-gvmi mkey instead of direct MR with DEVX UMEM

TCP

  • Added filter to eliminate bridge devices from lane selection

GPU (CUDA, ROCM)

  • Added support for handling memh with multiple registrations
  • Added performance estimation BW based on GPU type
  • Adjusted rocm/ipc latency and zcopy threshold parameters
  • Improved error message when libnvidia-ml is not installed
  • Added profiling to Cuda runtime API calls
  • Adjusted gdr_copy estimated BW to improve protocol selection
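
Several entries above adjust estimated latency or bandwidth "to improve protocol selection". The sketch below illustrates why such estimates matter, assuming a simple linear cost model (time = latency + size / bandwidth) and picking the cheapest candidate. It is not UCX code: the class, the function, the protocol names, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolEstimate:
    """Hypothetical per-protocol performance estimate (not a UCX type)."""
    name: str
    latency_s: float       # fixed startup cost, in seconds
    bandwidth_bps: float   # sustained bandwidth, in bytes per second

    def time_for(self, size: int) -> float:
        # Linear cost model: fixed latency plus transfer time.
        return self.latency_s + size / self.bandwidth_bps


def select_protocol(candidates, size):
    """Return the candidate with the lowest estimated time for this size."""
    return min(candidates, key=lambda p: p.time_for(size))


# Made-up numbers: low latency but low bandwidth vs. the opposite.
candidates = [
    ProtocolEstimate("eager-like", latency_s=1e-6, bandwidth_bps=2e9),
    ProtocolEstimate("rendezvous-like", latency_s=5e-6, bandwidth_bps=10e9),
]

for size in (1024, 1 << 20):
    print(size, select_protocol(candidates, size).name)
```

With these numbers the crossover sits at 10,000 bytes, so changing either bandwidth estimate moves the message size at which the selection flips.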

Shared Memory

  • Adjusted FIFO_SIZE to improve scalability
  • Removed redundant rcache implementation in knem transport
  • Added support for symmetric rkey to improve memory usage

UCS

  • Improved scalability of connection establishment flow
  • Improved memtype cache performance by replacing pthread lock with spinlock
  • Added support for VLAN over channel bonding interface
  • Added LRU cache and Usage Tracker data structures
  • Improved cross-NUMA device detection
  • Added support for PCIe gen5 bandwidth detection
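
For readers unfamiliar with the LRU cache named in the list above, here is a minimal sketch of the general technique: a bounded map that evicts the least recently used entry when full. It says nothing about how the UCS data structure is implemented or what its API looks like; the class and method names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Illustrative bounded cache with least-recently-used eviction."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict least recently used

    def keys(self):
        return list(self._items)


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" becomes the most recently used entry
cache.put("c", 3)   # evicts "b"
print(cache.keys())  # ['a', 'c']
```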

Build

  • Added LCOV coverage report as a build option
  • Added binutils 2.40 library dependencies
  • Added development modulefile

Tools

  • Added information about sizes of ucp_request_t fields in ucx_info
  • Added ucx env to profiling output
  • Added MAD RTE in ucx_perftest to support setups without IPoIB

Tests

  • Added GTEST_LOG_LEVEL env var to set log level just before test run
  • Disabled protov1 and ud_verbs tests for valgrind mode
  • Reduced gtest execution time

Documentation

  • Added a few details to coding style

Bugfixes:

UCP

  • Reverted wireup latency calculation which caused lanes selection issue
  • Fixed strong fence to always ensure ordering
  • Fixed registration of memh for RNDV protocol
  • Fixed rndv_put and rkey_ptr assertion failure
  • Fixed performance estimation for multi-fragment protocols
  • Fixed memory registration error handling
  • Fixed buffer overflow of large log messages
  • Fixed progress enabling for selected lanes
  • Fixed atomic lanes progress enabling
  • Added missing rendezvous schemes to environment variable documentation
  • Fixed bcopy BW estimation for AMD
  • Fixed lanes information printing for new protocols infrastructure
  • Fixed rndv_am protocol thresholds
  • Fixed fp8 packing issue
  • Fixed Intel OneAPI compilation error
  • Fixed CM address packing on server side
  • Fixed endpoint reconfiguration issue due to asymmetrical selection
  • Fixed asymmetrical selection due to wire compatibility issue
  • Fixed potential deadlock with cuda_copy and RTR protocol
  • Fixed tag_recv return value on immediate completion
  • Fixed memory corruption by proper memh handling in tag offload rendezvous
  • Changed default allocator to not use reserved huge pages
  • Fixed rndv put protocol to avoid early completion
  • Fixed rndv_put transport selection for device to device scenario
  • Disabled rendezvous pipeline protocol selection when using non-contiguous buffer
  • Fixed crash in rendezvous protocol rkey pack after failed memory registration

RDMA CORE (IB, ROCE, etc.)

  • Fixed compilation failure when DevX is explicitly disabled
  • Fixed crash when using PCIe relaxed ordering
  • Fixed remote access error with rc_verbs transport
  • Fixed endpoint address management in unified mode
  • Fixed assertion failure when configured with UCX_IB_ADDR_TYPE=ib_global
  • Fixed overwritten MD attribute capabilities when querying a device
  • Fixed ibv_reg_mr error by registering memory in rcache callback
  • Disabled MR multithreading registration
  • Fixed mlx5 WQE posting error due to compiler memory copy optimizations

TCP

  • Fixed asymmetric lanes selection issue due to inconsistent device listing

GPU (CUDA, ROCM)

  • Fixed compilation flags to support ROCm 6.0
  • Fixed values of D2H_THRESH and latency params
  • Fixed Cuda memory support for iov datatype
  • Increased max number of agents in ROCm
  • Fixed cuda_ipc transport being disabled if a CUDA device is not set during initialization

Shared Memory

  • Fixed posix and cma transport selection by enhancing reachability checks
  • Fixed UGNI build failure
  • Fixed latency overhead for knem and cma transports
  • Fixed possible out-of-order issue in mm_iface

UCS

  • Fixed a deadlock when forked debugger is attached during an error in rcache operation
  • Fixed crash due to passing null pointer to log function
  • Fixed crash due to incorrect hashing method
  • Fixed crash in configuration parser cleanup by moving it after profiler cleanup
  • Fixed floating point division by zero during protocols initialization

UCM

  • Fixed occasional crash in bistro hooks by adding a lock before hooking
  • Fixed compilation error when building on PPC64

Java

  • Fixed go tests by setting CUDA device before allocating CUDA memory
  • Fixed perftest error detection and hanging issue

Tools

  • Fixed cpu model type for AMD Genoa in ucx_info
  • Enhanced multi-thread test output

Build

  • Fixed JUCX package publishing, so it will include support for ARM
  • Fixed ROCm building and testing
  • Removed libnvidia-compute version dependency
  • Removed libibmad/libumad from default build configuration to avoid runtime dependency

Packaging

  • Fixed already existing target error when using cmake find_package(ucx) twice

v1.16.0 RC5

03 Apr 10:56
e20264e
Pre-release

1.16.0 RC5 (April 02, 2024)

Features:

UCS

  • Added support for PCIe gen5 bandwidth detection
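
As background for the PCIe gen5 entry above, the following sketch computes the theoretical one-direction link bandwidth from the publicly specified per-lane transfer rates and line encodings. It is plain arithmetic, not the detection logic UCX uses; the table and function names are hypothetical, and real throughput is lower once packet and protocol overhead are counted.

```python
# Per-lane transfer rate (GT/s) and line-encoding efficiency per generation.
# Gen1/2 use 8b/10b encoding; gen3 and later use 128b/130b.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def pcie_bandwidth_gbytes(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    gts, efficiency = PCIE_GENERATIONS[gen]
    return gts * efficiency * lanes / 8  # divide by 8: bits to bytes


print(round(pcie_bandwidth_gbytes(5, 16), 2))  # 63.02
print(round(pcie_bandwidth_gbytes(4, 16), 2))  # 31.51
```

A gen5 x16 link therefore offers roughly twice the raw bandwidth of gen4 x16, so a device reported at the wrong generation would have its bandwidth misjudged by about a factor of two.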

Bugfixes:

UCP

  • Fixed rndv_put transport selection for device to device scenario

RDMA CORE (IB, ROCE, etc.)

  • Disabled MR multithreading registration

v1.16.0 RC4

12 Mar 14:11
5b996de
Pre-release

1.16.0 RC4 (March 12, 2024)

Bugfixes:

UCP

  • Disabled rendezvous pipeline protocol selection when using non-contiguous buffer

RDMA CORE (IB, ROCE, etc.)

  • Fixed mlx5 WQE posting error due to compiler memory copy optimizations

GPU (CUDA, ROCM)

  • Fixed cuda_ipc transport being disabled if a CUDA device is not set during initialization

UCM

  • Fixed compilation error when building on PPC64

Packaging

  • Fixed already existing target error when using cmake find_package(ucx) twice

v1.16.0 RC3

20 Feb 13:07
35eb167
Pre-release

1.16.0 RC3 (February 20, 2024)

Bugfixes:

UCP

  • Fixed crash in rendezvous protocol rkey pack after failed memory registration

v1.16.0 RC2

18 Feb 16:00
34d9966
Pre-release

1.16.0 RC2 (January 21, 2024)

Features:

UCP

  • Added tag offload rendezvous protocol in new infrastructure
  • Added rcache to old protocols infrastructure
  • Added multi-fragment protocols for stream API in new infrastructure
  • Enabled new protocols infrastructure by default
  • Removed context param from ucp_memh_put
  • Added assertion if trying to register unsupported memory type
  • Adjusted rendezvous latency to improve scalability
  • Improved endpoint configuration logging information
  • Added check for max length of user defined Active Message header
  • Added rcache support for mem type memory registration
  • Enabled error handling for rndv/put_zcopy protocol
  • Enabled v2 as default client/server connection establishment packet version
  • Enabled rendezvous protocol selection for reachable MDs only
  • Added ucp_rkey_compare API to enable rkey comparison
  • Added release version to worker address to enable wire compatibility
  • Added support for memory invalidation for rendezvous through DC transport
  • Enabled the use of strong fence with new protocols infrastructure

UCT

  • Added UCS_MEMORY_TYPE_RDMA memory type for better latency on supported devices
  • Implemented is_reachable_v2 API for IB transport
  • Added ep_is_connected API

RDMA CORE (IB, ROCE, etc.)

  • Added Floating LID (FLID) based routing support
  • Added latency and min_zcopy configuration variables to ROCm-IPC
  • Added support for indirect MR for cross-gvmi mkey instead of direct MR with DEVX UMEM

TCP

  • Added filter to eliminate bridge devices from lane selection

GPU (CUDA, ROCM)

  • Added support for handling memh with multiple registrations
  • Added performance estimation BW based on GPU type
  • Adjusted rocm/ipc latency and zcopy threshold parameters
  • Improved error message when libnvidia-ml is not installed
  • Added profiling to Cuda runtime API calls
  • Adjusted gdr_copy estimated BW to improve protocol selection

Shared Memory

  • Adjusted FIFO_SIZE to improve scalability
  • Removed redundant rcache implementation in knem transport
  • Added support for symmetric rkey to improve memory usage

UCS

  • Improved scalability of connection establishment flow
  • Improved memtype cache performance by replacing pthread lock with spinlock
  • Added support for VLAN over channel bonding interface
  • Added LRU cache and Usage Tracker data structures
  • Improved cross-NUMA device detection

Build

  • Added LCOV coverage report as a build option
  • Added binutils 2.40 library dependencies
  • Added development modulefile

Tools

  • Added information about sizes of ucp_request_t fields in ucx_info
  • Added ucx env to profiling output
  • Added MAD RTE in ucx_perftest to support setups without IPoIB

Tests

  • Added GTEST_LOG_LEVEL env var to set log level just before test run
  • Disabled protov1 and ud_verbs tests for valgrind mode
  • Reduced gtest execution time

Documentation

  • Added a few details to coding style

Bugfixes:

UCP

  • Reverted wireup latency calculation which caused lanes selection issue
  • Fixed strong fence to always ensure ordering
  • Fixed registration of memh for RNDV protocol
  • Fixed rndv_put and rkey_ptr assertion failure
  • Fixed performance estimation for multi-fragment protocols
  • Fixed memory registration error handling
  • Fixed buffer overflow of large log messages
  • Fixed progress enabling for selected lanes
  • Fixed atomic lanes progress enabling
  • Added missing rendezvous schemes to environment variable documentation
  • Fixed bcopy BW estimation for AMD
  • Fixed lanes information printing for new protocols infrastructure
  • Fixed rndv_am protocol thresholds
  • Fixed fp8 packing issue
  • Fixed Intel OneAPI compilation error
  • Fixed CM address packing on server side
  • Fixed endpoint reconfiguration issue due to asymmetrical selection
  • Fixed asymmetrical selection due to wire compatibility issue
  • Fixed potential deadlock with cuda_copy and RTR protocol
  • Fixed tag_recv return value on immediate completion
  • Fixed memory corruption by proper memh handling in tag offload rendezvous
  • Changed default allocator to not use reserved huge pages
  • Fixed rndv put protocol to avoid early completion

RDMA CORE (IB, ROCE, etc.)

  • Fixed compilation failure when DevX is explicitly disabled
  • Fixed crash when using PCIe relaxed ordering
  • Fixed remote access error with rc_verbs transport
  • Fixed endpoint address management in unified mode
  • Fixed assertion failure when configured with UCX_IB_ADDR_TYPE=ib_global
  • Fixed overwritten MD attribute capabilities when querying a device
  • Fixed ibv_reg_mr error by registering memory in rcache callback

TCP

  • Fixed asymmetric lanes selection issue due to inconsistent device listing

GPU (CUDA, ROCM)

  • Fixed compilation flags to support ROCm 6.0
  • Fixed values of D2H_THRESH and latency params
  • Fixed Cuda memory support for iov datatype
  • Increased max number of agents in ROCm

Shared Memory

  • Fixed posix and cma transport selection by enhancing reachability checks
  • Fixed UGNI build failure
  • Fixed latency overhead for knem and cma transports
  • Fixed possible out-of-order issue in mm_iface

UCS

  • Fixed a deadlock when forked debugger is attached during an error in rcache operation
  • Fixed crash due to passing null pointer to log function
  • Fixed crash due to incorrect hashing method
  • Fixed crash in configuration parser cleanup by moving it after profiler cleanup
  • Fixed floating point division by zero during protocols initialization

UCM

  • Fixed occasional crash in bistro hooks by adding a lock before hooking

Java

  • Fixed go tests by setting CUDA device before allocating CUDA memory
  • Fixed perftest error detection and hanging issue

Tools

  • Fixed cpu model type for AMD Genoa in ucx_info
  • Enhanced multi-thread test output

Build

  • Fixed JUCX package publishing, so it will include support for ARM
  • Fixed ROCm building and testing
  • Removed libnvidia-compute version dependency
  • Removed libibmad/libumad from default build configuration to avoid runtime dependency

v1.16.0-rc1

28 Dec 08:46
76758f8
Pre-release
Merge pull request #9557 from yosefe/topic/uct-ib-add-flid-based-routing-support-v1.16.x

UCT/IB: Add FLID based routing support - v1.16.x

v1.15.0

29 Sep 11:22
348d14f

1.15.0 (September 28, 2023)

Features:

UCP

  • Added 2-stage pipeline protocol in the new protocol infrastructure
  • Added reset and abort functionality of rendezvous protocols in the new infrastructure
  • Added zero-copy rendezvous data send protocol in the new infrastructure
  • Added support for user memory handle in the new protocol infrastructure
  • Added option to force ODP registration for certain memory types
  • Enabled lock free memory region deregistration
  • Updated allow/deny transport list feature to control auxiliary transport selection
  • Multiple performance improvements of the new protocol infrastructure
  • Multiple improvements in error and debug messages

UCT

  • Split UCT_MD_MKEY_PACK_FLAG_INVALIDATE into two flags for RMA and AMO
  • Added put_zcopy and get_zcopy scheme support for self transport
  • Added base implementation of is_reachable_v2 API using intra/inter flag
  • Introduced MD capability for non-blocking registration memory types

RDMA CORE (IB, ROCE, etc.)

  • Added implementation of is_reachable_v2 routine to IB interface
  • Added option to control CQE zipping per CQ RX/TX direction
  • Added option to specify how DCI selects port under RoCE LAG
  • Added hw_dcs to the list of policies to select DCI by an endpoint
  • Removed implicit on-demand paging
  • Added option to set RoCE lag dct port for response under queue affinity mode
  • Improved IB memlock limit logging

UCS

  • Added ucs_string_buffer_rbrk() to split token

GPU (CUDA, ROCM)

  • Added support for atomic reply_buffer on GPU memory
  • Added system device information for AMD GPUs
  • Improved performance estimation of gdr_copy transport
  • Added a simplistic implementation of performance estimation of cuda_ipc transport
  • Improved performance estimation of cuda_ipc on Hopper architecture
  • Added rcache parameters for rocm transports
  • Introduced dmabuf support for rocm transports
  • Implemented asynchronous progress for the zcopy operations in the rocm_copy transport
  • Added option to enable using cross-device dmabuf file descriptor for rocm

Java

  • Added Java bindings for exported memh feature

Tests

  • Added a rocm docker container for testing
  • Added option to send client_id in iodemo test
  • Added support for multiple connections to the same server in iodemo test
  • Added synchronization before exit to hello world examples

Tools

  • Added user-side memcpy option for AM benchmarks in ucx_perftest
  • Added wireshark LUA dissectors for some UCX protocols

Build

  • Added support for binutils 2.40
  • Added versioned dependency to switch between packages with the same names
  • Added a separate xpmem deb subpackage
  • Added aarch64 support to the binary distribution pipeline
  • Removed dependency on libnuma

Bugfixes:

UCP

  • Fixed assertion when sending from non-contiguous GPU buffer to managed buffer
  • Fixed the race condition on endpoint configurations
  • Fixed endpoint reconfiguration issues due to asymmetrical selection
  • Fixed endpoint reconfiguration error due to wrong locality detection
  • Fixed crash during connection manager cleanup
  • Fixed rkey index calculation for rendezvous protocol
  • Fixed rcache dump function
  • Removed logging from rkey unpack in release mode
  • Fixed double free of rkey in rendezvous protocol
  • Fixed rendezvous pipeline protocol error flow
  • Fixed error handling in rendezvous get zcopy protocol
  • Replay pending requests of wireup EP CM during connection establishment to prevent potential ordering issues and wrong configuration
  • Pass user-provided memory type to the function that checks whether the buffer can be sent inline or not
  • Avoid memory registration during UCP context initialization
  • Fixed CPU/device atomics selection in the new protocol infrastructure
  • Multiple fixes in the new protocol infrastructure information output

UCT

  • Added check for dmabuf kernel support in ROCm memory domain
  • Fixed exported memh packing
  • Fixed an error in checking return status of multi-threaded memory registration function

RDMA CORE (IB, ROCE, etc.)

  • Fixed dma-buf based memory region registration
  • Fixed memory handle data corruption when PCIe relaxed ordering is enabled
  • Fixed performance degradation when indirect atomic key is not supported by the hardware
  • Fixed remote access error to strict-order keys because of wrong offset
  • Added check for UAR support to memory domain opening
  • Fixed updating port counters for devx qp
  • Fixed ibv_create_cq error message on node without Infiniband
  • Fixed performance degradation due to using 2 paths on NDR400 by default
  • Removed unnecessary async lock which otherwise would block UD progress

GPU (CUDA, ROCM)

  • Fixed CUDA IPC performance degradation due to libnuma removal

UCS

  • Fixed lane selection and added bandwidth estimation for Sapphire Rapids family
  • Fixed displaying wrong environment variable suggestions
  • Fixed VFS warning output
  • Fixed SEGV in ucs_debug_backtrace_next(), upon previous SEGV handling, due to ENOMEM situation
  • Fixed memory corruption when using UCX_MPOOL_FIFO=y

UCM

  • Fixed conditional jump patching
  • Fixed mremap() override

GPU (CUDA, ROCM)

  • Fixed usage of dmabuf when the buffer is not page-aligned
  • Removed async_cb from cuda_copy to avoid the issue with UCP worker async lock

Java

  • Fixed leakage of jucx_request global references

Documentation

  • Updated ucp_worker_release_address description

Tests

  • Fixed wrong usage of ep_close in examples

Tools

  • Fixed memory access flags in perftest
  • Removed support for librte from perf
  • Fixed worker flush deadlock when using multiple workers in ucx_perftest

Build

  • Changed 'unsupported option' ICC command line warning to error
  • Removed never used fault-injection configuration option
  • Fixed obsolete macro warnings in new autoconf/libtool
  • Fixed building UCX with GCC 13
  • Fixed UCX RPM build on machines that have libxpmem-devel rpm from MLNX_OFED installation
  • Fixed ucx-rdmacm package requirements
  • Fixed compilation errors with armcc-22.1
  • Fixed passing port number to goperftest

v1.15.0 RC6

21 Sep 07:18
e674114
Pre-release

1.15.0 RC6 (September 20, 2023)

Bugfixes:

UCP

  • Fixed assertion when sending from non-contiguous GPU buffer to managed buffer

v1.15.0 RC5

12 Sep 15:51
9b3aeaa
Pre-release

1.15.0 RC5 (September 12, 2023)

Bugfixes:

UCP

  • Fixed the data race on endpoint configurations.

v1.15.0 RC4

03 Sep 07:31
efdf63b
Pre-release

1.15.0 RC4 (August 30, 2023)

Bugfixes:

RDMA CORE (IB, ROCE, etc.)

  • Fixed dma-buf based memory region registration
  • Fixed memory handle data corruption when PCIe relaxed ordering is enabled

UCS

  • Fixed lane selection, adding bandwidth estimation for Sapphire Rapids family