



# In-Network Collective Operations: Game Changer or Challenge for AI Workloads?

Torsten Hoefler, ETH Zürich and Microsoft

Mikhail Khalilov, ETH Zürich

Josiah Clark, AMD

Surendra Anubolu, Mohan Kalkunte, Karen Schramm, and Eric Spada, Broadcom, Inc.

Duncan Roweth and Keith Underwood, Hewlett Packard Enterprise

Adrian Caulfield, Abdul Kabbani, and Amirreza Rastegari, Microsoft

This paper summarizes the opportunities of in-network collective operations for accelerated collective operations in artificial intelligence (AI) workloads. We provide sufficient detail to make this important field accessible to nonexperts in AI or networking, fostering a connection between these communities.

Modern artificial intelligence (AI) and high-performance computing (HPC) workloads necessitate substantial computational resources, particularly when dealing with large language models (LLMs). The size and complexity of these models render single accelerators, such as graphics processing units (GPUs), insufficient for efficient training and inference. Consequently, distributed systems have become integral to the training, serving, and inference processes of LLMs.

Digital Object Identifier 10.1109/MC.2025.3616048  
Date of current version: 22 December 2025

To address these computational demands, organizations are deploying some of the most powerful supercomputers, equipped with tens of thousands of GPUs. These supercomputers facilitate the extensive parallel processing required for LLM training. The distribution of LLMs primarily uses three fundamental dimensions of parallelism: data parallelism (DP), pipeline parallelism (PP), and tensor parallelism (TP).<sup>1</sup> All of these methods use collective operations that may be accelerated with special “in-network collective” (INC) hardware support. We first explain their potential use for AI model training and inference and then discuss their benefits and complexities.

DP is used during training: it replicates the model and distributes mini-batches of data across multiple accelerators. Each accelerator processes a different subset of the data simultaneously, thereby enhancing training efficiency. Fully sharded data parallelism (FSDP)<sup>2,3</sup> optimizes the memory usage of DP by partitioning the model parameters, optimizer state, and gradients across different accelerators, thereby preventing an increase in memory requirements.

PP is used in inference and training where it distributes the model weights across multiple accelerators, effectively partitioning the model to fit within the local memory of each accelerator. This approach executes different segments of the model pipeline on distinct accelerators, thereby optimizing memory usage and computational efficiency. Yet, usually, the processing time for each batch is increased when going through the pipeline.

TP (or operator parallelism) is used in inference and training where it accelerates the processing of individual pipeline stages by distributing the tensors (data arrays) across multiple accelerators. This method divides the tensors into smaller chunks that can be processed in parallel, thus reducing the overall computation time required for each stage of the model.

**Figure 1** illustrates how the three forms of parallelism can be used to distribute training of a transformer model (left to right, three replicas) with three decoder blocks, each consisting of the typical multi-head attention, normalization, and feed-forward layers. DP shows how the whole model is replicated, while PP and TP show the sharding of the parameters layer-wise and tensor-block-wise, respectively. Each color marks the distribution to three different accelerators in each example.

While other forms of parallelism, such as sequence parallelism, exist, they exhibit similarities to the aforementioned dimensions and are omitted for simplicity. When distributing models, it is crucial to consider the data distribution and resulting communication required for forward and backward passes of the model. Efficient distribution strategies ensure that data are partitioned and synchronized across all accelerators, minimizing communication overhead and maximizing computational throughput.

## THE ROLE OF COLLECTIVE OPERATIONS AND CCLS IN AI

Collective communication operations are used for inter-processor communication during AI training and inference. These nonblocking operations are defined in the Message Passing Interface (MPI) Standard.<sup>4</sup> Subsets of these operations specific to different GPU architectures are found in processor element (PE) specific collective communication libraries (CCLs).<sup>a</sup>

**FIGURE 1.** Three dimensions of AI parallelism. Colors mark three different accelerators, potentially 27 total when combined.

# OUTLOOK 2026: TECHNOLOGY PREDICTIONS

MPI-like collective operations form the backbone of most distributed deep learning frameworks. In Allgather, each PE starts with a scalar and broadcasts it to every other PE. The Broadcast collective is a subset of Allgather. Allreduce starts with a vector at each PE and ends with the sum of the vectors at each PE. Reduce\_scatter performs the same summation but shards the result such that each PE ends up with a single scalar. Alltoall transposes a distributed matrix.
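The semantics described above can be made concrete with a small reference model. The following is a minimal sketch in plain Python, not any CCL's API: each list index stands for one PE, the function names are illustrative, and `reduce_scatter` assumes the vector length equals the number of PEs so that each PE keeps exactly one scalar.

```python
from typing import List


def allgather(values: List[float]) -> List[List[float]]:
    """Each PE contributes one scalar; every PE ends with all scalars."""
    return [list(values) for _ in values]


def allreduce(vectors: List[List[float]]) -> List[List[float]]:
    """Each PE contributes a vector; every PE ends with the element-wise sum."""
    total = [sum(column) for column in zip(*vectors)]
    return [list(total) for _ in vectors]


def reduce_scatter(vectors: List[List[float]]) -> List[float]:
    """Same sum as allreduce, but PE i keeps only element i of the result."""
    total = [sum(column) for column in zip(*vectors)]
    return [total[i] for i in range(len(vectors))]


def alltoall(matrix: List[List[float]]) -> List[List[float]]:
    """PE i sends matrix[i][j] to PE j, which transposes the distributed matrix."""
    return [list(row) for row in zip(*matrix)]


if __name__ == "__main__":
    per_pe = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(allreduce(per_pe)[0])    # the summed vector, identical on every PE
    print(reduce_scatter(per_pe))  # one summed scalar per PE
    print(alltoall(per_pe))        # rows become columns
```

In this model, an Allreduce is equivalent to a Reduce\_scatter followed by an Allgather of the resulting scalars.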

DP and TP use Allreduce for tasks like gradient averaging and distributed tensor processing (for example, matrix multiplication). FSDP uses Allgather and Reduce\_scatter operations to distribute model parameters, optimizer states, and gradients across multiple accelerators. CCLs support the MPI concept of communicators that defines subsets of communicating groups of processes to perform collective operations. This feature is crucial for AI workloads and INC implementations because collective operations typically use subsets of the processes in a job, grouped into communicators, as defined by the parallelism schemes outlined previously.
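To illustrate how parallelism schemes induce communicators, the sketch below groups the ranks of a job into TP and DP groups. The rank ordering (TP index varying fastest, then PP, then DP) and the function names are assumptions made for illustration; real frameworks choose their own layouts.

```python
from typing import Dict, List, Tuple


def coords(rank: int, pp: int, tp: int) -> Tuple[int, int, int]:
    """Map a global rank to (dp, pp, tp) coordinates; TP varies fastest (assumed)."""
    d, rest = divmod(rank, pp * tp)
    p, t = divmod(rest, tp)
    return d, p, t


def build_groups(dp: int, pp: int, tp: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (tp_groups, dp_groups): the rank subsets that would share a communicator."""
    tp_groups: Dict[Tuple[int, int], List[int]] = {}
    dp_groups: Dict[Tuple[int, int], List[int]] = {}
    for rank in range(dp * pp * tp):
        d, p, t = coords(rank, pp, tp)
        tp_groups.setdefault((d, p), []).append(rank)  # same replica and stage
        dp_groups.setdefault((p, t), []).append(rank)  # same stage and tensor shard
    return list(tp_groups.values()), list(dp_groups.values())


if __name__ == "__main__":
    tp_groups, dp_groups = build_groups(dp=3, pp=3, tp=3)  # 27 accelerators
    print(len(tp_groups), tp_groups[0])
    print(len(dp_groups), dp_groups[0])
```

With three-way parallelism in each dimension, every Allreduce in this layout involves only 3 of the 27 ranks, which is why an INC implementation has to handle many small groups rather than one job-wide collective.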

The Mixture of Experts model, as exemplified by DeepSeek-V3,<sup>5</sup> relies on Alltoall(v) operations to distribute data among experts. While other distribution schemes might necessitate broadcast operations, such rooted collectives are relatively rare. They are often implemented as Allgather operations, effectively functioning as an all-broadcast. Additionally, scan operations could be beneficial for workloads that utilize small neural networks and operate under stringent latency constraints. However, there are no widely known use cases yet for MPI's scan or exscan in this context.

## OPPORTUNITIES FOR IN-NETWORK COMPUTATION TO ACCELERATE AI

Edge-INC offloads some of the CCL collective operations from the compute accelerator to the local network ("edge") interfaces. This approach reduces memory contention and overall latency at the endpoint by enabling data to be forwarded directly in a streaming manner,<sup>6</sup> without the need to be stored in the accelerator's memory. Consequently, this method enhances efficiency and minimizes the performance bottlenecks typically associated with memory access.

Core-INC offloads some collective operations into the network ("core") switches themselves, as exemplified by NVIDIA's SHARP.<sup>7</sup> Switches actively participate in the operation, performing computations such as summing data. This approach leverages the computational capabilities embedded within the network infrastructure, thereby optimizing data processing and reducing latency or bandwidth consumption within the network core. By utilizing Core-INC, the network can achieve significant performance improvements through distributed computation at the network core.

### Collective acceleration using edge-INC

Edge-INC exclusively involves the host network interface (NI). Approaches such as Portals 4<sup>8</sup> and sPIN<sup>6</sup> enable this method and have a minimal implementation footprint, exemplified by techniques such as triggered operations.<sup>9</sup> Additionally, they can implement advanced communication protocols, such as multicast-based constant-time broadcast,<sup>10</sup> to further enhance data transfer efficiency and reduce latency. Using specific NI offloads, these methods minimize the computational burden on the host and optimize overall system performance.

Edge-INC enables asynchronous progression and full offload of collective operations from the accelerator to the NI, allowing computation on the accelerator to overlap completely with communication in the network. Edge-INC eliminates host memory access latency and bandwidth demands by directly forwarding messages from the NI. It simultaneously reduces control overhead on the host while saving significant latency for large-scale systems.

Figure 2 illustrates an edge-INC scenario with N nodes, each equipped with an accelerated NI, depicted by the red cog. It also annotates where core-INC would live. In the context of a pipelined linear Broadcast or the second phase of a ring reduction, node i receives data from node i-1 (modulo N), deposits it into the main memory, and subsequently sends it to node i+1 (modulo N). The figure demonstrates a yellow packet arriving at step 1, being deposited into the main memory and simultaneously forwarded at step 2 by the NI.

Without edge-INC, the NI would first write this data into dynamic random-access memory and then read it again to transmit it, resulting in an additional memory transaction. This process essentially doubles the load on the host bus and the CPU or accelerator memory (illustrated in blue). For high-bandwidth NIs, this increased load can lead to significant computational slowdowns.


<sup>a</sup>For example, NVIDIA Collective Communications Library (NCCL).

### Collective acceleration using core-INC

While edge-INC reduces node-level contention and latencies, core-INC reduces communication costs at the network level. However, not all collective operations can be equally accelerated by core-INC. Core-INC is most effective when data are reduced during the operation, as in (All) Reduce, or replicated during the operation, as in broadcast and Allgather.

Broadcast operations benefit from various bandwidth- and latency-optimal algorithms tailored for endpoint-based implementations. For instance, pipelined ring algorithms are optimal for handling very large (infinitely large) messages, while Fibonacci trees are most efficient for the smallest messages, effectively covering the range in between.<sup>11</sup> The tradeoff between bandwidth and latency is particularly noteworthy; core-INC can leverage switch-reliable multicast to significantly reduce latency during broadcast operations.

**FIGURE 2.** A hybrid system with edge-INC and core-INC.

Allreduce operations benefit most from the network switches performing the operations. Core-INC also cuts the needed network bandwidth in half.<sup>12</sup> Endpoint-based algorithms must send and receive each segment twice, once in the reduction and once in the broadcast phase, but core-INC only requires sending it once and receiving it once because it is reduced by the switches.
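The factor of two can be checked with a small simulation. The sketch below models a textbook ring Allreduce (a reduce-scatter phase followed by an allgather phase) and counts how many elements each node injects; the function name and the counting convention are illustrative assumptions, and the core-INC figure is simply the vector length, since each node sends its vector once.

```python
from typing import List, Tuple


def ring_allreduce(vectors: List[List[float]]) -> Tuple[List[List[float]], List[int]]:
    """Simulate a ring Allreduce; return final per-node data and elements sent per node."""
    n, size = len(vectors), len(vectors[0])
    assert size % n == 0, "sketch assumes the vector splits evenly into n chunks"
    chunk = size // n
    data = [list(v) for v in vectors]
    sent = [0] * n

    def span(c: int) -> range:
        return range(c * chunk, (c + 1) * chunk)

    for phase in ("reduce", "gather"):
        for step in range(n - 1):
            # Snapshot all messages first to model simultaneous sends.
            offset = 0 if phase == "reduce" else 1
            msgs = []
            for i in range(n):
                c = (i + offset - step) % n
                msgs.append((c, [data[i][k] for k in span(c)]))
                sent[i] += chunk
            for i, (c, payload) in enumerate(msgs):
                dst = (i + 1) % n
                for k, v in zip(span(c), payload):
                    # Reduction phase accumulates; broadcast phase overwrites.
                    data[dst][k] = data[dst][k] + v if phase == "reduce" else v
    return data, sent


if __name__ == "__main__":
    n, size = 4, 8
    vectors = [[10 * i + k for k in range(size)] for i in range(n)]
    result, sent = ring_allreduce(vectors)
    print(result[0])
    print("ring elements sent per node:", sent[0], "core-INC:", size)
```

Each node in the ring sends 2(N-1)/N times the vector length, which approaches twice the core-INC volume as N grows.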

Figure 3 shows the benefits of a core-INC-based Allreduce compared to an endpoint-bandwidth-optimal ring algorithm. The figure shows a full fat tree and a subset of nodes marked "B" that participate in the Allreduce. Core-INC communicates along a single tree (shown in green) where all nodes send to the root at the top-left switch in the network and then the root reliably broadcasts the result back to all involved nodes. Here, each node sends each segment once and then receives the final result. During the reduction phase, the highlighted switch "X" collects the input from its three incoming children, reduces (for example, sums) it, and sends it toward the root switch. During the broadcast phase, switch X replicates the data from the root switch to all its green children. The red schedule shows an endpoint-based ring algorithm and possible routes through the network. This algorithm, albeit occupying the same number of links, has to perform two rounds.

**FIGURE 3.** The INC-based Allreduce algorithm requires less total data movement in comparison to the ring algorithm.

Allgather is more complex as the data are not reduced in the network. Yet, it can save network bandwidth because Allgather is an all-broadcast where each node broadcasts to all other nodes. Node S in Figure 3 would send its data to the root switch, which would replicate it twice to its children, who would then replicate it again to their children, who in turn will deliver it to the destination nodes. Excluding edge links, the data traverses only nine links. The red schedule excluding the dashed lines shows a standard pipelined Broadcast occupying a total of 13 links. Larger jobs and oversubscribed networks would show even more pronounced benefits. Thus, core-INC can reduce the network bandwidth utilization significantly.<sup>12</sup>

Reduce\_scatter can be implemented as multiple reductions that are identical to broadcast trees. Here, the same idea as Allgather applies and bandwidth can be saved in the core network. Furthermore, concurrent Reduce\_scatter can be combined with Allgather in Core-INC to gain bandwidth savings up to 2× due to their different bottlenecks.<sup>12</sup>

Alltoall cannot be optimized easily with core-INC as there is no reduction in data at all. Alltoall simply transposes a large array. Here, edge- or core-INC could be used to synchronize nodes to orchestrate congestion-free schedules for sending the data. This is generally hard given that the network may not be exclusively used by a single tenant.

Core-INC and edge-INC both have complex relationships with the system itself. From a reliability perspective, core-INC utilizes fewer links, but becomes more dependent on the links it uses and carries state in the switches that is not resilient to switch failure. Edge-INC can use traditional techniques to route around failures, but incurs delays during failures and is more dependent on a high fraction of the bandwidth being available. Job fragmentation can cause point-to-point communications in edge-INC to take substantially longer, whereas core-INC is more immune to fragmentation, until the fragmentation reaches a level where the network is no longer able to achieve a data reduction. Both edge-INC and core-INC can be complementary and multiply their benefits. For example, a local NI can take charge of coordinating the core-INC such that the accelerator is completely freed from communication overheads and full overlap of communication and computation can be achieved.

## PROBLEMS FOR IN-NETWORK COMPUTE ACCELERATION OF AI

The advantages of INC are both evident and substantial, with potential traffic reductions at both edge links and in the core network of up to 2× for operations such as Allreduce, Reduce\_scatter, Broadcast, and Allgather. Additionally, INC can significantly reduce host memory load and provide opportunities for overlapping computation and communication during collective operations.<sup>4</sup> However, the intricacies of INC can be complex and challenging to navigate. In the following sections, we present several obstacles that architects and engineers developing INC systems must consider.

### Low-precision data types

One of the most effective optimizations in deep learning and AI is the utilization of low-precision data types. Reducing the number of bits used to represent numerical values not only linearly decreases data volume and movement but can also lead to a quadratic increase in computational performance. Specifically, reducing values from 16-bit to 8-bit precision results in a 2× savings in memory and memory bandwidth, as well as a potential 4× speedup in computations. The computational speedup is because the multiplication of $n$-bit integers generally requires $O(n^2)$ logic or time, making low-precision representations highly advantageous for efficient AI computations.

Low-precision types can only express a narrower range of numbers. An 8-bit integer can represent 256 distinct values, while a 4-bit integer is limited to just 16 different values. The selection of which numbers these types represent can be determined either algorithmically, such as setting a uniform range like -128 to 127 for signed integers, or through a predefined code-book. Moreover, the range can be dynamically adjusted by scaling factors that are applied block- or tensor-wise to better fit the data's dynamic range. However, a significant challenge with low-bit representations, particularly in processing long sequences, is temporary overflow, underflow, or accumulating rounding errors. This occurs when intermediate computations exceed the representational capacity of the low-bit format, even if the final result could theoretically be accurately represented within that format.

To illustrate the challenges with low-precision arithmetic, consider performing a series of operations using a signed int4 type (range -8 to 7): $7 - 5 + 5 + 5 - 3 - 7$. If we compute from left to right, we get intermediate results of 7, 2, 7, then an overflow to -4, -7, and an underflow to 2. Although the final result might be correct, these intermediate results are inaccurate, leading to errors when used in further calculations like dot-products or matrix multiplications, especially when combined with multiplication operations. Another example is the multiplication task $2 \cdot 2 \cdot 3 / 2$, which yields intermediate results of 2, 4, then an overflow to -4, and finally the incorrect -2, instead of the expected 6. These precision issues also affect sequences of mixed additions and multiplications. Since floating-point operations fundamentally involve both multiplication and addition (for example, floating-point multiplication multiplies the mantissas and adds the exponents), the same problems can occur. To prevent these errors, AI accelerators employ higher-precision internal accumulation registers.
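The wraparound behavior can be reproduced in a few lines. This is a minimal sketch that assumes two's-complement wraparound on overflow (saturating hardware would behave differently); the helper names are illustrative.

```python
from typing import List


def wrap_int4(x: int) -> int:
    """Wrap an integer into the signed 4-bit range [-8, 7] (two's complement, assumed)."""
    return (x + 8) % 16 - 8


def running_sum_int4(terms: List[int]) -> List[int]:
    """Left-to-right sum where every intermediate result is stored as int4."""
    acc, trace = 0, []
    for term in terms:
        acc = wrap_int4(acc + term)
        trace.append(acc)
    return trace


def running_product_int4() -> List[int]:
    """Evaluate 2 * 2 * 3 / 2 left to right with int4 intermediates."""
    trace = [2]
    trace.append(wrap_int4(trace[-1] * 2))
    trace.append(wrap_int4(trace[-1] * 3))   # 12 does not fit and wraps
    trace.append(wrap_int4(trace[-1] // 2))
    return trace


def wide_accumulator_sum(terms: List[int]) -> int:
    """Accumulate in a wide register and narrow only the final result."""
    return wrap_int4(sum(terms))


if __name__ == "__main__":
    terms = [7, -5, 5, 5, -3, -7]
    print(running_sum_int4(terms))      # [7, 2, 7, -4, -7, 2]
    print(running_product_int4())       # [2, 4, -4, -2]
    print(wide_accumulator_sum(terms))  # 2
```

The wide accumulator produces the same final sum here, but its intermediates (12, 9) are never corrupted, which is what matters once they feed further multiplications.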

Core-INC requires sending intermediate results to upstream switches by design, which introduces challenges with precision and bandwidth. Since the maximum bandwidth savings from core-INC is a factor of two, transmitting numbers with higher precision to maintain accuracy would essentially negate this advantage. Compounding the issue, most internal higher-precision registers are significantly larger than just twice the size of the input data types. This situation is commonly referred to as the “problem of communicating the accumulator” in core-INC systems, where accumulators need to be relayed between switches. Conversely, in edge-INC scenarios, where large vectors of numbers are reduced, one can mitigate this by designating a specific host for each range, thereby allowing for the local maintenance of a high-precision accumulator at each NI, keeping the benefits of precision without the overhead of excessive data transmission.

One potential, though complex, workaround involves sending only low-precision values from the edge hosts to the first reduction switch, then using a high-precision accumulator for communication between core switches, finally casting down to the lower target precision at the root switch. This approach requires careful consideration because the accumulation switches are not always the first in the chain. For instance, in the core-INC example in Figure 3, the source S does not directly connect to an accumulation switch; instead, the accumulator size should increase at the second hop, which coincidentally is also the root. In other parts of the green network tree in our example, the precision would need to be increased at the second switch encountered. If more than two group members were connected to an edge switch, then the result would need to be upcast there. Even if this strategy is implemented correctly, it still necessitates transmitting at least double the data volume upwards through intermediate links, thus diminishing the potential for bandwidth savings within the core network. Furthermore, such savings would depend strongly on the tree topology.

### Vector data types

Some deep learning accelerators and workloads leverage vector data types, which, like the quantized numbers previously discussed, incorporate scaling factors. However, these scaling factors are not uniformly applied across entire tensors or large blocks but are instead tailored to smaller blocks that constitute the basic units of computation. An example of this is a set of 16 integer values scaled by an exponential floating-point value, a method referred to as block floating point. This technique has been adopted to formulate various blocked data types, such as MxFP,<sup>13</sup> enhancing efficiency in deep learning computations.
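As an illustration of the idea, the sketch below encodes a block of 16 values as integers that share one power-of-two scale. It is a simplified model, not the MxFP format: the 8-bit mantissa width, the exponent selection rule, and all function names are assumptions made for this example.

```python
import math
from typing import List, Tuple

Block = Tuple[int, List[int]]  # (shared exponent, integer mantissas)


def bfp_encode(values: List[float], mantissa_bits: int = 8) -> Block:
    """Encode values as integers sharing a single scale of 2**exponent."""
    qmax = 2 ** (mantissa_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # Smallest exponent for which the largest magnitude still fits in qmax.
    exponent = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exponent
    mantissas = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return exponent, mantissas


def bfp_decode(exponent: int, mantissas: List[int]) -> List[float]:
    """Recover approximate values from a block."""
    return [m * 2.0 ** exponent for m in mantissas]


def bfp_add(a: Block, b: Block, mantissa_bits: int = 8) -> Block:
    """Reduce two blocks: decode, add, and re-encode with a new shared exponent."""
    total = [x + y for x, y in zip(bfp_decode(*a), bfp_decode(*b))]
    return bfp_encode(total, mantissa_bits)


if __name__ == "__main__":
    block = [0.5 * i for i in range(-8, 8)]  # 16 values
    encoded = bfp_encode(block)
    print(encoded)
    print(bfp_add(encoded, encoded))
```

Note that `bfp_add` cannot simply add mantissas: the two inputs may carry different exponents, and the sum may need a new one. An INC switch or NI that reduces such blocks natively would have to perform this kind of rescaling per block.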

If block floating-point and similar vector data types are extensively utilized, INC systems might need to inherently support these formats. Although one could convert these block formats into types that INC currently supports, such conversions introduce additional overhead. This overhead could diminish the performance benefits of INC, particularly since operations on these vector types can be executed very efficiently on native architectures using traditional collective algorithms like ring. Consequently, implementing complex type handling within the INC switch or NI might be necessary, which could increase both cost and design complexity.

Another broad challenge in this domain is the rapid evolution of data types used in deep learning. Over the past decade, we've witnessed a "Cambrian explosion" of various data types, with new formats like BF16, E4M3, and E5M2 gaining quick and widespread acceptance due to their availability on modern accelerators. This swift pace of change in data types poses a significant challenge for the networking field, which often operates within a slower silicon design cycle, making it difficult to keep pace with the latest advancements in deep learning.


### Sparse vector reductions

Sparse computations offer significant advantages for deep learning workloads by reducing computational overhead and memory usage.<sup>14</sup> This advantage extends into network computations but introduces the problem of fill-in. Similar to the issue of communicating accumulators, fill-in can lead to even more substantial slowdowns. The core problem arises when summing two sparse vectors in a large index space; the result often becomes much larger because it must accommodate as many elements as the union of the indices from both vectors. If nonzeros are randomly distributed, this leads to a rapid increase in vector size as computations progress through the reduction tree. Consequently, vectors might become so dense that even switching to a dense representation becomes more efficient at some height along the tree.<sup>15,16</sup>
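Fill-in is easy to observe in a small experiment. The sketch below is an illustrative model under stated assumptions: nonzero indices are drawn uniformly at random, the reduction tree is binary, the number of leaves is a power of two, and only index sets (not values) are tracked.

```python
import random
from typing import List


def fill_in_sizes(num_leaves: int, index_space: int, nnz: int, seed: int = 0) -> List[int]:
    """Return the largest nonzero count per tree level, from the leaves up to the root."""
    rng = random.Random(seed)
    level = [set(rng.sample(range(index_space), nnz)) for _ in range(num_leaves)]
    sizes = [max(len(s) for s in level)]
    while len(level) > 1:
        # Summing two sparse vectors yields the union of their index sets.
        level = [level[i] | level[i + 1] for i in range(0, len(level), 2)]
        sizes.append(max(len(s) for s in level))
    return sizes


if __name__ == "__main__":
    sizes = fill_in_sizes(num_leaves=64, index_space=10_000, nnz=100)
    for height, size in enumerate(sizes):
        print(f"height {height}: up to {size} nonzeros ({size / 10_000:.1%} dense)")
```

With these parameters the vectors start at 1% density and roughly double in size at each level near the leaves, so the messages closest to the root are far larger than the inputs.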

Supporting sparse reductions in core-INC is inherently complex and prone to errors, which can negate much of the efficiency gains provided. To circumvent this issue, one potential strategy involves sharding the index space across endpoints, thereby directing all values associated with a particular index space to their respective endpoint. It's mathematically provable, given certain distributions of nonzero elements, that such an approach can optimize communication volume, potentially reducing the data traffic in the core network more effectively than with core-INC methods. These advanced, endpoint-based schemes<sup>17</sup> could be effectively realized through edge-INC, allowing for all the associated benefits to be fully exploited.

### Bitwise reduction result reproducibility

For certain workloads and use cases, such as debugging, bitwise reproducibility is critical. This is straightforward with integer data types, but floating-point addition is not associative. Therefore, floating-point calculations can only achieve bitwise reproducibility if the exact order of operations is maintained or if specific reproducibility schemes are employed.<sup>18,19</sup> However, these schemes typically introduce up to twice the data and computational overhead, which can negate the $2\times$ efficiency gain offered by core-INC, akin to the problem encountered with accumulator communication. While there might be room for innovation in developing schemes for INC, ensuring a consistent order of operations is often the simplest solution.
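The effect of reduction order is visible even with four values. The sketch below sums the same IEEE 754 double-precision inputs once sequentially, as a ring-like schedule might, and once pairwise, as a reduction tree might; the two schedules are illustrative stand-ins, not a model of any specific implementation.

```python
from typing import List


def sequential_sum(values: List[float]) -> float:
    """Add values strictly left to right."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def tree_sum(values: List[float]) -> float:
    """Add neighboring pairs level by level, as in a binary reduction tree."""
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through unchanged
        level = nxt
    return level[0]


if __name__ == "__main__":
    values = [1e16, 1.0, -1e16, 1.0]  # exact sum is 2.0
    print(sequential_sum(values))  # 1.0
    print(tree_sum(values))        # 0.0
```

Neither order returns the exact sum of 2.0, and the two results differ from each other, so a job whose reduction tree changes shape between runs cannot expect bitwise-identical output.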

Users care about intra-job and inter-job reproducibility. Intra-job reproducibility ensures that reductions within a single job are bitwise identical, while inter-job reproducibility extends this guarantee across different jobs, specifically to those involving the same number of processes. Achieving inter-job reproducibility necessitates ensuring an identical tree structure for data reduction across jobs, which presents significant challenges because the distribution of processes across nodes can vary greatly between different job executions. Ensuring consistent tree ordering is both theoretically and practically complex and is an open research problem. It might not be feasible for certain configurations of process-to-node mappings.

### Endpoint interfaces and coordination

When implementing INC, it's essential for the switch to incorporate a basic networking stack to handle reliability, flow control, and congestion management. This is due to the shared nature of the physical link, where both INC and regular end-to-end traffic compete for bandwidth, necessitating compatible congestion control mechanisms. Additionally, NIs at the endpoints must manage INC states and contexts in hardware, leading to increased complexity in logic (for example, when considering link aggregation) and additional memory overhead.

Constructing core-INC groups and their corresponding reduction trees presents significant system-level challenges and has hindered early adoption. The hardware resources available at each switch constrain the number of trees it can support. Given that switches are interconnected in complex topologies, and jobs launch on dynamic subsets of nodes, creating and dismantling these trees must be managed efficiently at job (or even communicator) initiation and termination. Multitenancy can exacerbate this by increasing the number of trees within the network, potentially necessitating strict isolation measures. However, many of these complexities can be sidestepped with edge-INC systems, which can operate effectively using simpler switches designed purely for data movement, thereby reducing the overhead associated with managing intricate network structures.

### Encryption and authentication

Encryption and authentication are crucial in confidential computing systems. Unfortunately, core-INC faces significant challenges in supporting end-to-end encryption due to its data manipulation operations. While homomorphic encryption presents a potential solution, its application is currently limited and most effective for integer data types only. Chrapek et al. develop several initial ideas for homomorphic core-INC systems.<sup>20</sup> Yet, for a system to support all operations and data types, it would necessitate extending trust to the switches, thereby substantially expanding the trust domain. Employing separate keys for INC communication can help limit the trust domain's scope. However, any data transmitted through INC remains susceptible to compromise.

Implementing encryption and authentication in a core-INC system remains challenging and a topic for research, even when utilizing separate keys and security domains. This complexity arises because switches must partake in key rotation and rekeying, potentially incorporating key derivation and other advanced security mechanisms. The necessary overhead in memory for managing these keys and the logic required for system administration is not only intricate and costly but also introduces additional security vulnerabilities. Edge-INC systems may simplify the security handling as only the local NI would need to be part of the trust domain.

## PERFORMANCE FOR INC IN DEEP LEARNING

The primary objective of INC is to expedite computations within the network and decrease communication volumes. While reducing communication volumes is advantageous because it frees network resources to handle other types of traffic, accelerating computations presents a more intricate challenge. In the context of accelerating real-world applications, a variant of Amdahl's Law becomes a significant constraint. Consequently, claims of a tenfold improvement in application performance should prompt scrutiny from astute performance experts.

**FIGURE 4.** Performance model for INC-enabled DP training.

In Figure 4, we consider the following simple example of DP training where we model a fixed 8 GiB Allreduce with varying iteration times to adjust the communication overhead (portion of iteration time (%) spent in Allreduce) between 10% and 50% on the x-axis. Without INC, the ring Allreduce takes 352 ms, and INC can reduce the time to 151 ms, a nearly 60% speedup. Yet, due to Amdahl's Law, the maximum speedup achieved is 34%. In a realistic case where nonoverlapped communication overheads are 20%, INC would be limited to an 11% overall speedup. We note that even a 5% reduction in runtime due to INC can be substantial given that the network typically costs less than 20% of the system.
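The Amdahl-style limit can be sketched with a one-line model. The function below is a simplification we assume for illustration (only the nonoverlapped Allreduce time shrinks, everything else is unchanged); it is not necessarily the exact model behind Figure 4, and its name and parameters are hypothetical. The default timings are the 352 ms and 151 ms quoted above.

```python
def runtime_reduction(comm_fraction: float,
                      t_base_ms: float = 352.0,
                      t_inc_ms: float = 151.0) -> float:
    """Fraction of iteration time saved when only the Allreduce is accelerated.

    comm_fraction: share of the baseline iteration spent in nonoverlapped Allreduce.
    """
    comm_saving = 1.0 - t_inc_ms / t_base_ms  # share of Allreduce time removed
    return comm_fraction * comm_saving


if __name__ == "__main__":
    for fraction in (0.10, 0.20, 0.30):
        print(f"{fraction:.0%} communication -> "
              f"{runtime_reduction(fraction):.1%} shorter iterations")
```

Under this model, a 20% communication share yields roughly an 11% shorter iteration, in line with the figure quoted above, even though the Allreduce itself becomes more than twice as fast.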

We highlight numerous potential benefits that INC can offer to collective operations driving AI inference and training. However, we also identify several major challenges specific to AI workloads that any INC system must address. We predict that these complexities and the potentially limited advantages may result in slow adoption. Consequently, we foresee that the adoption of INC will remain gradual over the next few years.

A promising area for the adoption of INC is within a local network context, where communication costs are particularly high. In such environments, single-switch solutions can provide significant benefits. These solutions would be simpler to engineer compared to more complex multiswitch deployments, as they can essentially function as extensions to the existing network nodes. By leveraging single-switch INC solutions, it becomes feasible to enhance performance and reduce communication overhead without extensive reengineering of the network infrastructure. Given the current advancements in networking technology, we anticipate that these single-switch INC solutions will be adopted in more use cases soon and will prove to be highly effective in addressing the communication challenges faced by local networks. This approach not only streamlines implementation but also maximizes the performance gains in a cost-effective manner.

### ABOUT THE AUTHORS

**TORSTEN HOEFLER** is a professor of computer science at ETH Zürich, 8092 Zürich, Switzerland, and a scientific advisor to Microsoft, Redmond, WA 98052 USA. Hoefler received his Ph.D. from Indiana University. He is a Fellow of IEEE. Contact him at htor@ethz.ch.

**MIKHAIL KHALILOV** is a doctoral student at ETH Zürich, 8092 Zürich, Switzerland. His research interests include high-performance computing and datacenter interconnects. Khalilov received his M.Sc. in applied math and informatics from HSE University. Contact him at mikhailov@ethz.ch.

**JOSIAH CLARK** is an ML engineer at AMD, Austin, TX 78735 USA. Contact him at josiah.clark@amd.com.

**SURENDRA ANUBOLU** is a senior technical director and a distinguished engineer at Broadcom, Palo Alto, CA 94304 USA. Contact him at surendra.anubolu@broadcom.com.

**MOHAN KALKUNTE** is the vice president of Architecture and Technology at Broadcom, Palo Alto, CA 94304 USA. He is a Fellow of IEEE. Contact him at mohan.kalkunte@broadcom.com.

**KAREN SCHRAMM** is a vice president of Architecture at Broadcom, Palo Alto, CA 94304 USA. Contact her at karen.schramm@broadcom.com.

**ERIC SPADA** is a distinguished engineer at Broadcom, Palo Alto, CA 94304 USA. Contact him at eric.spada@broadcom.com.

**DUNCAN ROWETH** is an HPE Fellow and the chief architect at Hewlett Packard Enterprise, Palo Alto, CA 94304 USA. Contact him at duncan.roweth@hpe.com.

**KEITH UNDERWOOD** is a senior distinguished technologist at Hewlett Packard Enterprise, Palo Alto, CA 94304 USA. Contact him at keith.underwood@hpe.com.

**ADRIAN CAULFIELD** is a partner engineering manager at Microsoft, Redmond, WA 98052 USA. Caulfield received his Ph.D. from the University of California, San Diego. Contact him at acaulfie@microsoft.com.

**ABDUL KABBANI** is an engineer at Microsoft, Redmond, WA 98052 USA and an adjunct professor at the University of California, Santa Cruz. Kabbani received his Ph.D. from Stanford University. Contact him at abdulkabbani@microsoft.com.

**AMIRREZA RASTEGARI** is the lead HPC performance engineer at Microsoft Azure, Redmond, WA 98052 USA. Rastegari received his Ph.D. in scientific computing from the University of Michigan, Ann Arbor. Contact him at arastegari@microsoft.com.

Standardization provides a common ground for various products and approaches, making them accessible to a wide community and enabling fair market competition. Thus, it is crucial for the success and widespread adoption of INC technologies. One promising initiative toward this goal is Ultra Ethernet, spearheaded by a consortium working group that includes many of the coauthors of this article. The current plans focus on core-INC, with edge-INC on the horizon. The development of any INC specification must be lean and straightforward to justify the necessary investments. Ensuring simplicity and clarity will be key to its effectiveness and practical implementation. Although it remains to be seen whether the standard will meet the high expectations set for it, we strongly expect that the upcoming specification will significantly benefit the adoption of INC. Nevertheless, it is important to acknowledge that INC will continue to face challenges, particularly in terms of technical complexity and integration into existing systems. The ongoing efforts in standardization and collaboration within the consortium are vital steps toward overcoming these challenges and realizing the potential benefits of INC.

## ACKNOWLEDGMENT

We thank the whole INC working group in UEC and all others who have contributed to the development of the thinking behind this article.

## REFERENCES

1. T. Ben-Nun and T. Hoefler, "Demystifying parallel and distributed deep learning: An in-depth concurrency analysis," *ACM Comput. Surv.*, vol. 52, no. 4, pp. 1-43, 2019, doi: [10.1145/3320060](https://doi.org/10.1145/3320060).
2. S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, "Zero: Memory optimizations toward training trillion parameter models," in *Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal. (SC)*, 2020, pp. 1-16, doi: [10.1109/SC41405.2020.00024](https://doi.org/10.1109/SC41405.2020.00024).
3. Y. Zhao et al., "PyTorch FSDP: Experiences on scaling fully sharded data parallel," *Proc. VLDB Endowment*, vol. 16, no. 12, pp. 3848-3860, 2023.
4. T. Hoefler, A. Lumsdaine, and W. Rehm, "Implementation and performance analysis of non-blocking collective operations for MPI," in *Proc. ACM/IEEE Conf. Supercomput. (SC)*, 2007, pp. 1-10.
5. A. Liu et al., "Deepseek-v3 technical report," 2025, arXiv:2412.19437.
6. T. Hoefler, S. D. Girolamo, K. Taranov, R. E. Grant, and R. Brightwell, "sPIN: High-performance streaming processing in the network," in *Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal. (SC)*, 2017, pp. 1-16.
7. R. L. Graham et al., "Scalable hierarchical aggregation protocol (SHArP): A hardware architecture for efficient data reduction," in *Proc. 1st Int. Workshop Commun. Optim. HPC (COMHPC)*, Piscataway, NJ, USA: IEEE Press, 2016, pp. 1-10, doi: [10.1109/COMHPC.2016.006](https://doi.org/10.1109/COMHPC.2016.006).
8. S. D. Girolamo, P. Jolivet, K. D. Underwood, and T. Hoefler, "Exploiting offload-enabled network interfaces," *IEEE Micro*, vol. 36, no. 4, pp. 6-17, Jul./Aug. 2016, doi: [10.1109/MM.2016.56](https://doi.org/10.1109/MM.2016.56).
9. R. Brightwell et al., "The portals 4.3 network programming interface," Sandia Nat. Lab. (SNL-NM), Albuquerque, NM, USA, Tech. Rep. Jun. 2023. [Online]. Available: <https://www.sandia.gov/app/uploads/sites/144/2023/03/portals43.pdf>
10. T. Hoefler, C. Siebert, and W. Rehm, "A practically constant-time MPI broadcast algorithm for large-scale InfiniBand clusters with multicast," in *Proc. IEEE Int. Parallel Distrib. Process. Symp.*, 2007, pp. 1-8, doi: [10.1109/IPDPS.2007.370475](https://doi.org/10.1109/IPDPS.2007.370475).
11. T. Hoefler and D. Moor, "Energy, memory, and runtime tradeoffs for implementing collective communication operations," *J. Supercomput. Frontiers Innov.*, vol. 1, no. 2, pp. 58-75, 2014.
12. M. Khalilov, S. D. Girolamo, M. Chrapek, R. Nudelman, G. Bloch, and T. Hoefler, "Network-offloaded bandwidth-optimal broadcast and Allgather for distributed AI," in *Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal. (SC)*, Piscataway, NJ, USA: IEEE Press, 2024, pp. 1-17, doi: [10.1109/SC41406.2024.00109](https://doi.org/10.1109/SC41406.2024.00109).
13. B. D. Rouhani et al., "Microscaling data formats for deep learning," 2023, arXiv:2310.10537.
14. T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste, "Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks," *J. Mach. Learn. Res.*, vol. 22, no. 1, pp. 10,882-11,005, 2021.
15. C. Renggli, D. Alistarh, M. Aghagolzadeh, and T. Hoefler, "SparCML: High-performance sparse communication for machine learning," in *Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal. (SC)*, 2019, pp. 1-15, doi: [10.1145/3295500.3356222](https://doi.org/10.1145/3295500.3356222).
16. D. D. Sensi, S. D. Girolamo, S. Ashkboos, S. Li, and T. Hoefler, "Flare: Flexible in-network allreduce," in *Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal. (SC)*, 2021, pp. 1-16.
17. S. Li and T. Hoefler, "Near-optimal sparse allreduce for distributed deep learning," in *Proc. 27th ACM SIGPLAN Symp. Princ. Pract. Parallel Program. (PPoPP)*, 2022, pp. 1-16, doi: [10.1145/3503221.3508399](https://doi.org/10.1145/3503221.3508399).
18. W. Ahrens, H. D. Nguyen, and J. Demmel, "Efficient reproducible floating point summation and BLAS," Elect. Eng. Comput. Sci. Univ. of California, Berkeley, CA, USA, Tech. Rep. no. UCB/EECS-2016-121, Dec. 2015.
19. A. Arteaga, O. Fuhrer, and T. Hoefler, "Designing bit-reproducible portable high-performance applications," in *Proc. IEEE 28th Int. Parallel Distrib. Process. Symp.*, 2014, pp. 1235-1244, doi: [10.1109/IPDPS.2014.127](https://doi.org/10.1109/IPDPS.2014.127).
20. M. Chrapek, M. Khalilov, and T. Hoefler, "HEAR: Homomorphically encrypted allreduce," in *Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal. (SC)*, New York, NY, USA: ACM, 2023, pp. 1-17.