

# Scalable Multi-QPU Circuit Design for Dicke State Preparation: Optimizing Communication Complexity and Local Circuit Costs

Ziheng Chen<sup>1,2</sup>, Junhong Nie<sup>3</sup>, Xiaoming Sun<sup>1,2,\*</sup>, Jialin Zhang<sup>1,2</sup>, Jiadong Zhu<sup>1,\*</sup>

<sup>1</sup> Institute of Computing Technology, Chinese Academy of Sciences

<sup>2</sup> University of Chinese Academy of Sciences

<sup>3</sup> Shandong University

January 29, 2026

## Abstract

Preparing large-qubit Dicke states is of broad interest in quantum computing and quantum metrology. However, the number of qubits available on a single quantum processing unit (QPU) is limited—motivating the distributed preparation of such states across multiple QPUs as a practical approach to scalability. In this article, we investigate the distributed preparation of  $n$ -qubit  $k$ -excitation Dicke states  $D(n, k)$  across a general number  $p$  of QPUs, presenting a distributed quantum circuit (each QPU hosting approximately  $\lceil n/p \rceil$  qubits) that prepares the state with communication complexity  $O(p \log k)$ , circuit size  $O(nk)$ , and circuit depth  $O(p^2 k + \log k \log(n/k))$ . To the best of our knowledge, this is the first construction to simultaneously achieve logarithmic communication complexity and polynomial circuit size and depth.

We also establish a lower bound on the communication complexity of  $p$ -QPU distributed state preparation for a general target state. This lower bound is formulated in terms of the canonical polyadic rank (CP-rank) of a tensor associated with the target state. For the special case  $p = 2$ , we explicitly compute the CP-rank corresponding to the Dicke state  $D(n, k)$  and derive a lower bound of  $\lceil \log(k+1) \rceil$ , which shows that the communication complexity of our construction matches this fundamental limit.

**Keywords:** Distributed quantum computing, communication complexity, Dicke state, quantum state preparation, quantum circuit synthesis.

## 1 Introduction

Despite the rapid progress of quantum hardware in recent decades, a fundamental limitation persists in the number of qubits that can be controlled and made to interact coherently on a single quantum processing unit (QPU) [1, 2, 3]. Consequently, distributed quantum computing (DQC) has emerged as a natural, practically grounded solution for executing quantum tasks that demand large qubit counts [4, 5, 6, 7, 8]. In DQC, quantum computing tasks (e.g., quantum state preparation, unitary operation execution) are implemented collaboratively across interconnected QPUs—devices linked via quantum or classical communication channels—thereby circumventing the qubit-capacity bottleneck of standalone devices while remaining fully compatible with hardware constraints [9, 10, 11, 12, 13]. DQC is usually modelled using distributed quantum circuits—collections of quantum circuits that support non-local two-qubit gates, which act on data qubits belonging to different QPUs [10, 13, 14, 15, 16, 17]. In DQC research, besides the traditional metrics used to quantify intra-processor computing costs (such as quantum circuit size and depth), one must also optimize inter-QPU communication costs—specifically, the number of non-local two-qubit gates [13, 18, 19, 20, 21]<sup>1</sup>.

---

\*Corresponding authors. sunxiaoming@ict.ac.cn; zhujiadong2016@163.com

<sup>1</sup>While communication cost is typically quantified by the number of exchanged qubits in quantum communication complexity research, the count of non-local two-qubit gates serves as a more natural metric within the circuit model. Notably, these two measures are equivalent up to a constant factor—and thus equivalent in the complexity-theoretic sense.

Dicke states [22]—equal superpositions of all computational-basis states with fixed hamming weight—play an important role in several related areas of quantum information science, such as quantum networking[23, 24], quantum game theory[25], quantum tomography[26] and quantum metrology[27, 28]. Formally, for an  $n$ -qubit system, a Dicke state with  $k$  excitations (denoted  $D(n, k)$ ) is defined as

$$D(n, k) = \frac{1}{\sqrt{\binom{n}{k}}} \sum_{\substack{x \in \{0,1\}^n \\ \text{ham}(x)=k}} |x\rangle,$$

where the hamming weight  $\text{ham}(x) = k$  denotes the number of 1-valued bits in the  $n$ -bit string  $x$ . Owing to their theoretically attractive properties and practical utility, a wealth of research has emerged in recent years focused on the efficient preparation of Dicke states—specifically, the optimization of the size and depth of the quantum circuits required to generate such states [29, 30, 31, 32, 33].

The potential applications of Dicke state preparation encompass quantum computing tasks demanding a large number of qubits; for instance, Dicke states  $D(n, n/2)$  surpass the classical precision limit of  $O(1/\sqrt{n})$  for  $n$  particles and attain the Heisenberg limit of  $O(1/n)$  in quantum metrology, a scaling that enables far higher precision for large qubit counts [27, 28]. For this reason, investigating low-complexity Dicke state preparation methods within the DQC paradigm is critical to advancing such large-scale quantum operations. In the DQC paradigm, two existing approaches can be used for preparing Dicke states across interconnected QPUs, though each suffers from notable complexity-related drawbacks. The first approach adapts circuit designs for general distributed state preparation to the specific case of Dicke states, while the second splits a single-QPU Dicke state preparation circuit into subcircuits that are executed across distributed QPUs. The former prioritizes minimizing the number of non-local gates, thereby lowering communication complexity. For instance, extending the general distributed state preparation framework from [34] to Dicke states yields a logarithmic communication cost for the 2-QPU scenario<sup>2</sup>. However, this design abstracts away local computation costs, ignoring the exponential circuit size and depth incurred on each individual QPU. In contrast, the second approach partitions the qubit register of a single-QPU Dicke state preparation circuit into disjoint subsets, each assigned to a separate QPU [13, 21]. Since state-of-the-art single-QPU Dicke state preparation circuits achieve polynomial scaling in both circuit size and depth [29, 30, 31], this qubit-partitioning strategy preserves polynomial circuit complexity on each QPU—a marked improvement over the first approach’s prohibitive exponential local complexity. That said, the method incurs a communication complexity that is at least polynomial in scale, representing an exponential degradation relative to the first approach’s logarithmic communication cost. Crucially, from the perspective of experimental quantum implementations, inter-processor communication imposes far more stringent bottlenecks than intra-processor computing operations in distributed quantum systems: communication introduces qubit decoherence, transmission latency, and hardware synchronization challenges that are far harder to mitigate than local gate noise. This is why, even though the second approach circumvents the first’s catastrophic local complexity issues, its polynomial communication cost still renders it undesirable for large-scale distributed Dicke state preparation.

In this paper, we seek to address the limitations of the two aforementioned approaches and propose a distributed Dicke state preparation circuit that simultaneously optimizes both communication overhead and local computation costs. Formally, we investigate the following problem: Given a Dicke state  $D(n, k)$  and  $p$  QPUs, each allocated approximately  $\lceil n/p \rceil$  qubits<sup>3</sup>, can we construct a distributed quantum circuit spanning  $p$  QPUs such that the intra-QPU computation cost—i.e., circuit size and depth—scales polynomially, while the inter-QPU communication complexity scales logarithmically?

---

<sup>2</sup>In fact, [34] establishes that both the upper and lower bounds on communication complexity for general distributed state preparation over 2 QPUs equal the log-rank of a matrix associated with the target state. This work, however, does not address the special case of Dicke states; accordingly, the logarithmic upper and lower bound results for Dicke states are derived by the authors. For further details, see Section 5.

<sup>3</sup>In addition to the  $\lceil n/p \rceil$  data qubits, a modest number of ancillary qubits are permitted per QPU, but their usage is tightly restricted. This per-QPU qubit limit reflects practical hardware constraints: while relaxing it would enable state preparation using fewer QPUs, doing so violates the foundational goal of distributed quantum computing (DQC) research—addressing the qubit capacity limitations of standalone devices.

## 1.1 Our Contributions

We answer the problem in the affirmative. Specifically, for a Dicke state  $D(n, k)$ , we obtain the following results:

- **Efficient circuit design:** We propose a distributed Dicke state preparation circuit for deployment across  $p$  QPUs, where each QPU is equipped with approximately  $\lceil n/p \rceil$  qubits. This circuit achieves logarithmic-scaling communication complexity, alongside polynomial circuit size and depth. A detailed performance comparison is provided in Table 1. Notably, for constant  $p$ , the circuit depth and size match those of the state-of-the-art non-distributed preparation.
- **Communication complexity lower bound:** We establish that for any target state  $|\psi\rangle$ , a communication complexity lower bound for preparing  $|\psi\rangle$  across  $p$  QPUs is given by the logarithm of the CP-rank of a  $p$ -order tensor associated with  $|\psi\rangle$ . Unfortunately, computing the CP-rank of a tensor is NP-hard for  $p \geq 3$ [35], and we have not yet succeeded in evaluating the CP-rank of Dicke states for this general case. In the special case  $p = 2$ , our CP-rank-based lower bound reduces to the rank-based lower bound derived in [34]. In this 2-QPU scenario, we explicitly compute the lower bound, which equals  $\log k$ —a result that shows our construction’s upper bound is tight.

Table 1: Resource costs for preparing the Dicke state  $D(n, k)$  across  $p$  QPUs, assuming for simplicity that  $p$  divides  $n$ .

| Reference        | Size     | Depth                         | Qubits per QPU | Communication Complexity |
|------------------|----------|-------------------------------|----------------|--------------------------|
| [34] ( $p = 2$ ) | $O(2^n)$ | $O(2^n)$                      | $n/2$          | $O(\log k)$              |
| [31]             | $O(nk)$  | $O(k + \log(k) \log(n/k))$    | $n/p$          | $\Omega(k)$              |
| Ours             | $O(nk)$  | $O(p^2k + \log(k) \log(n/k))$ | $n/p + 2$      | $O(p \log k)$            |

## 1.2 Paper Organization

The remainder of the paper is organized as follows. In Section 2, we introduce key background concepts necessary for understanding the construction and analysis of our proposed distributed quantum circuit. In Section 3, to make our presentation more clear, we focus on 2-QPU scenario present our circuit design for the and establish its optimality by analyzing both its communication complexity and local circuit complexity. In Section 4, we generalize the aforementioned results to the general  $p$ -QPU case. In Section 5, we study the lower bound of communication complexity for dsitributed quantum state preparation. In Section 6, we conclude our paper and discuss some open and future problems.

## 2 Preliminaries

In the remainder of the paper, we adopt the notation  $[n] = \{0, 1, \dots, n - 1\}$ .

### 2.1 Distributed Quantum Computing

In this subsection, we briefly review fundamental concepts in quantum computing, and refer readers to [36] for a comprehensive treatment. We then transition to DQC, where we formally define key metrics for evaluating the implementation cost of distributed quantum circuits.

In quantum computing, qubits serve as the fundamental units of information. The state of a single qubit can be a superposition of the computational basis states  $|0\rangle$  and  $|1\rangle$ , and is written as  $\alpha|0\rangle + \beta|1\rangle$ , where  $\alpha, \beta \in \mathbb{C}$  and  $|\alpha|^2 + |\beta|^2 = 1$ . More generally, the state of an  $n$ -qubit system can be expressed as

$$\sum_{x \in [2]^n} \alpha_x |x\rangle, \quad \alpha_x \in \mathbb{C},$$

where  $x$  denotes a binary string of length  $n$  and the amplitudes satisfy the normalization condition

$$\sum_{x \in [2]^n} |\alpha_x|^2 = 1.$$

Quantum gates form the elementary operations on qubits, with each gate represented by a unitary matrix that describes its action on the qubit space. The most widely adopted model for quantum computing is the circuit model, which consists of a sequence of quantum gates applied to selected qubits. Following standard quantum computing conventions, in this paper we restrict our focus to circuits constructed from a natural universal gate set, comprising CNOT gates and arbitrary single-qubit gates. This gate set is universal, meaning it can synthesize any arbitrary quantum operation with arbitrary precision.

Within the framework of DQC, we partition a regular quantum circuit across QPUs. The parent circuit is decomposed into interconnected subcircuits that interact via inter-QPU communication gates—i.e., gates that act on qubits located on different QPUs. Similar to non-distributed quantum computing, two key metrics in DQC are circuit size and depth: the size captures the overall computational cost, while the depth reflects the temporal cost of execution, constrained by the coherence time of the qubits. In DQC, these metrics quantify local computation cost, as opposed to communication cost. Instead of considering the individual circuit of each QPU in isolation, we evaluate both metrics by treating the quantum circuits across different QPUs as a single, unified parent circuit. The formal definitions of circuit size and depth are as follows.

**Definition 1** (Circuit Size). *The size of a distributed quantum circuit is defined as the total number of quantum gates in the unified parent circuit, excluding inter-QPU communication gates.*

**Definition 2** (Circuit Depth). *The depth of a distributed quantum circuit is the number of sequential gate layers in the unified parent circuit. Gates acting on disjoint qubit sets—capable of parallel execution—are grouped into the same layer.*

Quantum gates acting on qubits located across different QPUs incur substantially higher physical realization costs, owing to the inter-QPU communication required for their execution. Consequently, it is necessary to introduce a dedicated metric that quantifies the count of these non-local gates, rather than relying solely on conventional circuit size metrics. We formalize this metric as follows.

**Definition 3** (Communication Complexity). *Let  $C$  be a quantum circuit on  $n$  qubits consisting only of one- and two-qubit gates. Let  $f : [n] \rightarrow [p]$  be a partition that assigns the  $n$  qubits to  $p$  QPUs. The communication complexity of  $C$  with respect to the partition  $f$ , denoted by  $\text{c.c.}^f(C)$ , is the number of two-qubit gates whose operands are assigned to different QPUs under  $f$ .*

For a quantum state  $|\psi\rangle$ , its communication complexity on  $p$  QPUs is defined as

$$\text{c.c.}_p(|\psi\rangle) := \min_{\substack{\text{balanced } f : [n] \rightarrow [p] \\ C : C|0\rangle = |\psi\rangle}} \text{c.c.}^f(C),$$

where a partition  $f$  is balanced if, for every  $i \in [p]$ , the size of its preimage  $|f^{-1}(i)|$  is either  $\lfloor n/p \rfloor$  or  $\lceil n/p \rceil + 1$ .

## 2.2 Dicke State

In this subsection, we present the formal definition of the Dicke state  $D(n, k)$  and the concept of the Dicke unitary—a key operation that underpins the non-distributed Dicke states preparation.

**Definition 4** (Hamming Weight). *The Hamming weight of a binary string  $x$ , denoted by  $\text{ham}(x)$ , is the number of 1s in  $x$ .*

**Definition 5** (Dicke State). *Dicke state  $D(n, k)$  is an  $n$ -qubit quantum state that is an equal superposition of all computational basis states with Hamming weight  $k$ .*

$$D(n, k) = \frac{1}{\sqrt{\binom{n}{k}}} \sum_{\substack{x \in \{0,1\}^n \\ \text{ham}(x)=k}} |x\rangle.$$

Existing non-distributed approaches to preparing Dicke states do more than merely generate a single target state [29, 30, 31]. In fact, these methods implement a broader transformation that maps a set of basis states to the corresponding Dicke states. This motivates the following definition.

**Definition 6** (Dicke Unitary). *A Dicke unitary  $U_k^n$  is defined by*

$$U_k^n |0^{n-j}1^j\rangle = D(n,j), \quad \forall j \in [k+1].$$

The state-of-the-art synthesis of the Dicke unitary in the non-distributed setting, which will be useful for our circuit design in Sections 3 and 4, is due to [31]. Specifically, they establish the following result.

**Lemma 1** ([31]). *For integers  $n \geq k \geq 0$ , the Dicke unitary  $U_k^n$  can be implemented by a quantum circuit consisting of CNOT gates and single-qubit gates with circuit depth  $O(k + \log k \log(n/k))$  and size  $O(nk)$ .*

### 2.3 Tensor and Canonical Polyadic Rank

In this subsection, we introduce the definition of tensors and the canonical polyadic rank (CP-rank)—a metric that plays a pivotal role in our analysis of communication complexity lower bounds. A tensor is a natural generalization of vectors and matrices, extending the structure of one-dimensional vectors and two-dimensional matrices to higher-order, multi-dimensional arrays.

**Definition 7** (Tensor). *A  $p$ -order tensor  $T = (T_{i_0 i_1 \dots i_{p-1}})$  is a  $p$ -dimensional array of numbers, where each index  $i_j, j \in [p]$  corresponds to one mode of the tensor.*

When  $p = 1$ , a tensor reduces to a vector, and when  $p = 2$ , it coincides with a matrix. In this work, we need an extension of the classical notion of matrix rank to higher-order tensors.

**Definition 8** (Canonical Polyadic Rank (CP-rank)). *For a  $p$ -order tensor  $T$ , a canonical polyadic decomposition (CP-decomposition) of  $T$  is an expression of the form*

$$T = \sum_{i \in [r]} v_i^{(0)} \otimes v_i^{(1)} \otimes \dots \otimes v_i^{(p-1)},$$

where  $v_i^{(0)}, v_i^{(1)}, \dots, v_i^{(p-1)}$  are vectors for each  $i \in [r]$ , and  $\otimes$  denotes the outer product. The canonical polyadic rank, or CP-rank, of  $T$ , denoted by  $\text{CPrank}(T)$ , is the smallest integer  $r$  for which such a decomposition exists.

Intuitively, the CP-rank measures the minimum number of rank-one tensors needed to express  $T$  as a sum of separable components, generalizing the concept of matrix rank to higher-order settings. In particular, when  $p = 2$ , the CP-rank coincides with the usual matrix rank: for any matrix  $M$ , we have  $\text{rank}(M) = \text{CPrank}(M)$ .

### 2.4 Some Useful Circuit Implementations

In this subsection, we collect several circuit implementation results that will be instrumental to our circuit design.

Lemma 2 characterizes the asymptotically optimal cost for preparing an arbitrary quantum state, while Lemma 3 extends this result to the controlled state preparation setting.

**Lemma 2** (State Preparation, [37, 38]). *Given  $m$  ancillary qubits, any  $n$ -qubit quantum state  $|\psi\rangle$  can be prepared by a quantum circuit consisting of CNOT gates and single-qubit gates. The circuit has depth  $O\left(n + \frac{2^n}{n+m}\right)$  and size  $O(2^n)$ .*

**Lemma 3** (Controlled State Preparation, [38]). *Let  $k, m \geq 0$  and  $n > 0$  be integers. Given a family of  $n$ -qubit quantum states  $\{|\psi_i\rangle\}_{i \in [2^k]}$ , the controlled state preparation*

$$|i\rangle |0\rangle^{\otimes n} \mapsto |i\rangle |\psi_i\rangle, \forall i \in [2^k],$$

can be implemented by a quantum circuit consisting of CNOT gates and single-qubit gates. The circuit has depth  $O\left(n + k + \frac{2^{n+k}}{n+k+m}\right)$ , size  $O(2^{n+k})$ , and uses  $m$  ancillary qubits.

Lemma 4 characterizes the cost of implementing an addition circuit.

**Lemma 4** (Ripple-Carry Adder, [39]). *Let  $|a\rangle$  and  $|b\rangle$  denote two  $n$ -bit integers encoded in binary, and let  $|z\rangle$  be a single-qubit register. The adder*

$$|a\rangle |b\rangle |z\rangle \mapsto |a\rangle |a+b\rangle |z \oplus (a+b)_n\rangle,$$

where  $(a+b)_n$  denotes the most significant (carry-out) bit of the sum  $a+b$ , can be implemented by a quantum circuit of depth  $O(n)$  and size  $O(n)$ .

Lemma 5 characterizes the cost of a particular circuit structure composed of Toffoli gates, leveraging the notion of quantum fan-out gates as defined in Definition 9. The cost of implementing quantum fan-out gates is established in Lemma 6.

**Definition 9** (Quantum Fan-out Gate). *For  $n \geq 1$ , the quantum fan-out gate  $F_n$  is the  $(n+1)$ -qubit unitary operator defined by*

$$F_n |a\rangle |b_1\rangle \cdots |b_n\rangle = |a\rangle |a \oplus b_1\rangle \cdots |a \oplus b_n\rangle.$$

**Lemma 5** ([40]). *Suppose there are  $n$  Toffoli gates which share a common control qubit, and all other qubits involved are pairwise disjoint. Then the circuit can be implemented using fan-out gates, CNOT gates, and single-qubit gates with size  $O(n)$  and depth  $O(1)$ .*

**Lemma 6** ([41]). *The quantum fan-out gate  $F_n$  can be implemented using  $n$  CNOT gates with circuit depth  $O(\log n)$  and without ancillary qubits.*

### 3 2-QPU Dicke State Preparation

In this section, we present a circuit for preparing the  $n$ -qubit Dicke state  $D(n, k)$  across two QPUs, each containing  $(n/2 + 2)$  qubits. Throughout this section, we assume for simplicity that  $n$  is even.

**Theorem 1.** *For two QPUs, each equipped with  $(n/2 + 2)$  qubits, there exists a quantum circuit that prepares the Dicke state  $D(n, k)$  with circuit size  $O(nk)$  and circuit depth  $O(k + \log k \log(n/k))$ . Moreover, the number of communication gates required between the two QPUs is  $\lceil \log(k+1) \rceil$ .*

Remarkably, the circuit depth and size costs in Theorem 1 are directly inherited from Lemma 1. In other words, up to constant factors, our distributed preparation protocol introduces no additional overhead in circuit depth or size. The remainder of this section is devoted to a detailed description of the construction.

Intuitively, the communication complexity quantifies the amount of information that must be shared between the QPUs. For the Dicke state, without loss of generality, assume that  $k \leq n/2$ . Our construction relies on the following decomposition:

$$\sqrt{\binom{n}{k}} D(n, k) = \sum_{j=0}^k \sqrt{\binom{n/2}{j}} D(n/2, j) \sqrt{\binom{n/2}{k-j}} D(n/2, k-j).$$

Specifically, the QPUs coherently share a superposition over the index  $j$ . As a result, the distributed state preparation problem is reduced to the corresponding non-distributed preparation of local Dicke states.

To reduce the communication cost, the index  $j$  should be encoded in binary using  $\lceil \log(k+1) \rceil$  qubits. Recall from Definition 6 that the Dicke unitary expects the index  $j$  in unary encoding. Therefore, the main challenge is to efficiently implement a transformation between these encodings. To this end, we introduce one-hot encoding as an intermediate representation. The formal definitions of the binary, one-hot, and unary encodings are given below.

**Definition 10.** *Let  $n \in \mathbb{N}$  and let  $j \in [n+1]$ . There are three standard ways to encode the integer  $j$  into a quantum state:*

- Binary encoding:

$$|j\rangle = |j_{\lceil \log_2(n+1) \rceil - 1} \dots j_1 j_0\rangle,$$

where

$$j = \sum_{l=0}^{\lceil \log_2(n+1) \rceil - 1} j_l 2^l, \quad j_l \in \{0, 1\}.$$

- One-hot encoding:

$$|e_j^{n+1}\rangle = |0^{n-j} 1^j\rangle,$$

which is an  $(n+1)$ -qubit state with a single excitation at position  $j$ .

- Unary encoding:

$$|u_j^n\rangle = |0^{n-j} 1^j\rangle,$$

consisting of  $j$  consecutive ones followed by zeros.

Overall, our construction proceeds in three phases. In the first phase, QPU<sub>0</sub> prepares a superposition of Hamming weights encoded in binary and transfers this state to QPU<sub>1</sub>. In the second phase, we convert the binary encoding into a one-hot encoding; this is the key step of our construction. Using only two ancillary qubits, we implement this transformation with polynomial size and depth. In the final phase, we transform the one-hot encoding into a unary encoding and invoke Lemma 1 to complete the state preparation.

### Phase 1: Hamming-weight Distribution.



Figure 1: **Phase 1.** Using the quantum state preparation circuit for a general state, QPU<sub>0</sub> distributes the amplitudes according to  $D(n/2, j)$  and  $D(n/2, k - j)$ , which are prepared on QPU<sub>0</sub> and QPU<sub>1</sub>, respectively. Subsequently, QPU<sub>0</sub> “sends” the  $|j\rangle$  register to QPU<sub>1</sub>, thereby assigning the corresponding task that QPU<sub>1</sub> must complete. After the minus circuit, the roles of QPU<sub>0</sub> and QPU<sub>1</sub> in the subsequent phases become symmetric; specifically, each performs the mapping  $|j\rangle \mapsto D(n/2, j)$  using ancillary qubits.

Figure 1 illustrates the three steps used to distribute the Hamming weights across the QPUs. Specifically, each computational basis state  $|j\rangle$  is to be mapped to  $D(n/2, j)$  in the subsequent phases. After the completion of this phase, no further communication between QPUs is required, and all operations within each QPU are performed symmetrically and in parallel. The following describes these steps in detail.

- QPU<sub>0</sub> prepares the state

$$\sum_{j=0}^k \alpha_j |j\rangle = \frac{1}{\sqrt{\binom{n}{k}}} \sum_{j=0}^k \sqrt{\binom{n/2}{j}} \sqrt{\binom{n/2}{k-j}} |j\rangle$$

on  $\lceil \log(k+1) \rceil$  qubits. The amplitudes correspond to the fraction of  $(j, k-j)$  pairs distributed across the two QPUs. This step is implemented following Lemma 2. By utilizing all  $(n/2+2)$  available qubits, the resulting circuit has size  $O(2^{\log k}) = O(k)$  and depth  $O(\log k + (2k/n)) = O(\log k)$ .

- QPU<sub>0</sub> copies the  $\lceil \log(k+1) \rceil$  qubits to QPU<sub>1</sub> using a single layer of  $\lceil \log(k+1) \rceil$  CNOT gates. This step has circuit size  $\lceil \log(k+1) \rceil$  and depth 1.
- QPU<sub>1</sub> performs the mapping  $|j\rangle \mapsto |k-j\rangle$  on the  $\lceil \log(k+1) \rceil$  qubits using the adder circuit described in Lemma 4. This circuit has size and depth  $O(\log k)$ .



Figure 2: An example of **Phase 2** for  $k = 3$ . The key observation is that, under the one-hot encoding, the operation  $+2^l$ , and hence  $+j_l \times 2^l$ , can be implemented straightforwardly using swap and controlled-swap gates, respectively. Moreover, the state of the control qubit  $|j_l\rangle$  can be inferred by detecting the presence of a swap operation, which reduces the number of ancillary qubits required from  $O(\log k)$  to  $O(1)$ .

In summary, Phase 1 requires circuit size

$$O(k) + \lceil \log(k+1) \rceil + O(\log k) = O(k). \quad (1)$$

and depth

$$O(\log k) + 1 + O(\log k) = O(\log k). \quad (2)$$

The communication complexity incurred in this phase is  $\lceil \log(k+1) \rceil$ .

### Phase 2: Binary Encoding to One-hot Encoding.

From this phase, we focus on a specific QPU, noting that the others operate similarly. In general, we recursively convert the binary encoding into the one-hot encoding, proceeding from the lower bits to the higher bits.

Take  $k = 3$  as an illustrative example. The corresponding circuit is shown in Figure 2, where qubits unaffected by the operations are omitted for clarity. Let

$$j = j_1 \times 2^1 + j_0 \times 2^0, \quad j_0, j_1 \in [2].$$

Our goal is to implement the transformation

$$|j_1 j_0 000\rangle \mapsto |0e_j^4\rangle.$$

Recalling Definition 10, we begin with

$$|\psi_0\rangle = |j_1 j_0 001\rangle = |j_1 j_0 00e_0^1\rangle.$$

The controlled-swap gate effectively performs an addition  $+j_0 \times 2^0$  in the one-hot encoding, resulting in

$$|\psi_1\rangle = |j_1 j_0 0e_0^2\rangle.$$

The occurrence of the swap indicates that  $j_0 = 1$ . Consequently, a CNOT gate can be applied to reset the qubit  $|j_0\rangle$  to  $|0\rangle$ , yielding

$$|\psi_2\rangle = |j_1 00e_{j_0}^2\rangle.$$

Thus, the qubit originally storing  $|j_0\rangle$  is freed and can be reused as part of the one-hot encoding register.

Next, we apply two controlled-swap gates to realize the addition  $+j_1 \times 2^1$ , and then eliminate  $|j_1\rangle$  in an analogous manner. For simplicity, we summarize the results of these steps as

$$\begin{aligned} |\psi_3\rangle &= |j_1 e_{j_1 j_0}^4\rangle, \\ |\psi_4\rangle &= |0e_{j_1 j_0}^4\rangle. \end{aligned}$$

Returning to the general case, Figure 3 illustrates the circuit segment that processes the qubit  $|j_l\rangle$  for some  $l \in [\lceil \log(k+1) \rceil]$ . The  $2^l$  controlled-swap gates collectively implement the operation  $+j_l \cdot 2^l$  on the register  $|e_{j_{l-1} \dots j_1 j_0}^{2^l}\rangle$ . Subsequently,  $2^l$  CNOT gates are applied to eliminate the qubit  $|j_l\rangle$ , as analyzed in the preceding example. Iterating this procedure from  $|j_0\rangle$  through  $|j_{\lceil \log(k+1) \rceil - 1}\rangle$  completes this phase.



Figure 3: The subcircuit handling  $|ji\rangle$  in **Phase 2**. After each such subcircuit, the number of qubits used for the one-hot encoding doubles, while the number of qubits used for the binary encoding decreases by one. The qubit freed from the binary encoding is repurposed for the one-hot encoding, except for the most significant qubit,  $|j_{\lceil \log(k+1) \rceil - 1}\rangle$ . In addition to the qubit representing 0 in the one-hot encoding, a total of two ancillary qubits are required.

Each controlled-swap gate is decomposed into three Toffoli gates, and Lemmas 5 and 6 are used to optimize their implementation. As a result, the controlled-swap layer can be realized with circuit size  $O(2^l)$  and depth  $O(l)$ . The subsequent CNOT layer corresponds to a fan-in operation; hence, Lemma 6 applies directly, yielding circuit size  $O(2^l)$  and depth  $O(l)$ . Summing over all  $l \in [[\log(k+1)]]$ , we obtain a total circuit size

$$\sum_{l=0}^{\lceil \log(k+1) \rceil} O(2^l) = O(k) \quad (3)$$

and depth

$$\sum_{l=0}^{\lceil \log(k+1) \rceil} O(l) = O(\log^2 k) \quad (4)$$

for this phase.

### Phase 3: Local Non-distributed Preparation.



Figure 4: **Phase 3.** The downstairs CNOT gates transform the one-hot encoding into the unary encoding, introducing an additional 1 due to the representation of 0 in the one-hot encoding. The subsequent  $X$  gate removes this extra 1. After this correction, the Dicke unitary, implemented in a non-distributed manner, can be applied directly.

In the final phase, illustrated in Figure 4, we first convert the one-hot encoding  $|e_j^{k+1}\rangle$  into the unary encoding  $|u_j^k\rangle$ . This conversion is implemented using  $k$  CNOT gates. Since Definition 10 represents the one-hot encoding on  $(k+1)$  qubits and the unary encoding on  $k$  qubits, the extra qubit is removed by a final  $X$  gate. For clarity, we explicitly describe the mapping as

$$|e_i^{k+1}\rangle \mapsto |0^{k-j}1^{j+1}\rangle \mapsto |u_i^k 0\rangle.$$

This transformation has circuit size  $O(k)$  and depth  $O(k)$ .

Subsequently, we apply Lemma 1 and use the Dicke unitary  $U_k^n$  to prepare the state  $D(n/2, j)$ . The resulting circuit depth and size are directly inherited from the non-distributed Dicke state preparation.

The cost of this phase consists of the circuit size

$$O(k) + O(nk) = O(nk). \quad (5)$$

and depth

$$O(k) + O(k + \log k \log(n/k)) = O(k + \log k \log(n/k)) \quad (6)$$

Summing the contributions from equations 1, 3, and 5, we obtain the total size of the circuit:

$$O(k) + O(k) + O(nk) = O(nk).$$

Similarly, summing the bounds from equations 2, 4, and 6, the total depth of the circuit is

$$O(\log k) + O(\log^2 k) + O(k + \log k \log(n/k)) = O(k + \log k \log(n/k)),$$

Communication occurs only in Phase 1, with complexity  $\lceil \log(k+1) \rceil$ .

In summary, the circuit depth and size incurred by distributing the Hamming weight and performing the encoding conversion do not exceed those of the non-distributed Dicke unitary, as established in Lemma 1. This completes the proof of Theorem 1.

## 4 General $p$ -QPU Dicke State Preparation

In this section, we extend our circuit construction to the general case of  $p$  QPUs. For simplicity, we assume throughout the remainder of the paper that  $p$  divides  $n$  evenly, i.e.,  $n = kp$  for some positive integer  $k$ . The results proved for the divisible case can be straightforwardly generalized to the non-divisible case by substituting  $n/p$  with  $\lceil n/p \rceil$  in Table 1.

**Theorem 2.** *For  $p$  QPUs, each equipped with  $(n/p+2)$  qubits, there exists a quantum circuit that prepares the Dicke state  $D(n, k)$  with circuit size  $O(nk)$  and circuit depth  $O(p^2k + \log k \log(n/k))$ . Moreover, the number of communication gates required between the QPUs is  $O(p \log k)$ .*

*Proof.* The proof is structured into three parts. In Part 1, we compare the cases  $p = 2$  and  $p \geq 3$ , noting that the  $p \geq 3$  case differs from the  $p = 2$  case only in Phase 1 of the  $p = 2$  construction. By emphasizing these differences, we outline the core idea of our distributed circuit construction. In Part 2, we detail how to adapt Phase 1—the Hamming-weight-distributed phase—from the  $p = 2$  case to the general  $p \geq 3$  scenario. In Part 3, we analyze the total circuit cost (size, depth, and communication complexity) and thereby complete the proof.

### Part 1.

Compared with the construction in Section 3, the overall construction differs only in Phase 1. Specifically, we must now coordinate the actions of the  $p$  QPUs to prepare a coherent superposition of the form  $|j_0\rangle|j_1\rangle\cdots|j_{p-1}\rangle$  distributed across the  $p$  QPUs. Once this distribution is achieved, the remaining phases proceed in close analogy to the 2-QPU case and admit a similar cost analysis.

Unlike the 2-QPU case, when  $p \geq 3$  we can no longer assume that  $k \leq n/p$ . We therefore define

$$k' = \min(n/p, k),$$

which represents the maximum Hamming weight which can be assigned to a single QPU. Let  $J = (j_i)_{i \in [p]}$ . The feasible region for the distribution of Hamming weight is then given by

$$F_p = \left\{ J : j_i \in [k'+1] \forall i \in [p], \text{ and } \sum_{i \in [p]} j_i = k \right\}.$$

Now we decompose the Dicke state  $D(n, k)$  into  $p$  components as

$$\sqrt{\binom{n}{k}} D(n, k) = \sum_{J \in F_p} \sqrt{\binom{n/p}{j_0}} D(n/p, j_0) \cdots \sqrt{\binom{n/p}{j_{p-1}}} D(n/p, j_{p-1}).$$

## Part 2.

In what follows, we detail the modified Hamming-weight distribution phase. We propagate communication sequentially across the QPUs—a design choice well suited to a linear QPU connection topology. A key observation underpins this approach: each QPU only requires knowledge of the total Hamming weight prepared by the preceding QPUs, rather than their individual contributions. This intentional design significantly reduces the communication complexity of the distributed circuit.



Figure 5: Modified **Phase 1** for the general  $p$ -QPU case. The first  $\lceil \log(k+1) \rceil$  qubits in  $QPU_i$  are used to encode  $|j_i\rangle$ , which specifies the local state  $D(n/p, j_i)$  to be prepared in the subsequent phases.  $QPU_0$  allocates the amplitudes of  $|j_0\rangle$  using the subcircuit QSP. For  $i = 1, 2, \dots, p-2$ ,  $QPU_i$  allocates the amplitudes of  $|j_i\rangle$  and computes the partial sum  $|J_i\rangle$  within the subcircuit  $V_i$  (See Figure 6.), using  $|J_{i-1}\rangle$  as input, where the register  $|J_{i-1}\rangle$  is copied to a second set of  $\lceil \log(k+1) \rceil$  qubits by  $QPU_{i-1}$ . The final processor,  $QPU_{p-1}$ , only needs to compute  $|j_{p-1}\rangle$  via the subcircuit  $A$ , which implements the operation  $k - J_{p-2}$ . Finally, the temporarily stored partial sums are uncomputed using the subcircuits  $W_i$  (See Figure 7.) for  $i = p-2, p-3, \dots, 1$ , followed by a final layer of  $\lceil \log(k+1) \rceil$  CNOT gates to complete the cleanup.

For notation convenience, define the partial sums  $J_i = j_0 + j_1 + \dots + j_i$  for  $i \in [p]$ , and set

$$\alpha_{J_i} = \sum_{\substack{j_{i+1}, \dots, j_{p-1} \\ \text{such that } J \in F_p}} \binom{n/p}{j_{i+1}} \dots \binom{n/p}{j_{p-1}},$$



Figure 6: The subcircuit  $V_i, i = 1, 2, \dots, p - 2$  in Figure 5. Conditioned on the total Hamming weight  $|J_{i-1}\rangle$  accumulated from the previous QPUs, QPU $_i$  allocates the amplitudes of  $|j_i\rangle$  according to their frequency in  $F_p$ . The register  $|j_i\rangle$  is then combined with  $|J_{i-1}\rangle$  to form  $|J_i\rangle$ , which is subsequently forwarded to QPU $_{i+1}$ .



Figure 7: The subcircuit  $W_i, i = 1, 2, \dots, p - 2$  in Figure 5. After eliminating  $|J_i\rangle$  in QPU $_{i+1}$ , the partial sum  $|J_i\rangle$  stored in QPU $_i$  is restored to  $|J_{i-1}\rangle$  via a subtraction subcircuit, enabling further uncomputation.

which is well-defined by the definition of  $F_p$ . Moreover, we will henceforth use  $k$  in place of  $k'$ . Since  $k' = \min(n/p, k) \leq k$ , this substitution does not affect our upper-bound analysis.

Figure 5 depicts the circuit implementing the modified Phase 1 for the general  $p$ -QPU case. The procedure is carried out according to the following steps.

- QPU $_0$  prepares the state

$$\frac{1}{\sqrt{\binom{n}{k}}} \sum_{\substack{j_0 \in J \\ \text{such that } \exists J \in F_p}} \sqrt{\binom{n/p}{j_0}} \sqrt{\alpha_{J_0}} |j_0\rangle$$

using  $\lceil \log(k+1) \rceil$  qubits, which is depicted in Figure 5 as the gate  $QSP$ . It then copies these qubits to QPU $_1$  via a single layer of  $\lceil \log(k+1) \rceil$  CNOT gates. By utilizing all available  $(n/p + 2)$  qubits, Lemma 2 implies that the resulting circuit has size  $O(k)$  and depth  $O(\log k)$ .

- For  $i = 1, 2, \dots, p - 2$ , QPU $_i$  applies Lemma 3 to implement the controlled quantum state preparation

$$|J_{i-1}\rangle |0\rangle \mapsto |J_{i-1}\rangle \frac{1}{\sqrt{\alpha_{J_{i-1}}}} \sum_{\substack{j_i \in J \\ \text{such that } \exists J \in F_p}} \sqrt{\binom{n/p}{j_i}} \sqrt{\alpha_{J_i}} |j_i\rangle$$

on  $2\lceil \log(k+1) \rceil$  qubits. The register  $|j_i\rangle$  is then added to  $|J_{i-1}\rangle$  to obtain  $|J_i\rangle$ , namely,

$$|J_{i-1}\rangle |j_i\rangle \mapsto |J_{i-1} + j_i\rangle |j_i\rangle = |J_i\rangle |j_i\rangle.$$

Subsequently,  $|J_i\rangle$  is copied to QPU $_{i+1}$  using a single layer of  $\lceil \log(k+1) \rceil$  CNOT gates. This operation is depicted in Figure 5 as the gate  $V_i$ , with further details shown in Figure 6. Again, by exploiting all  $(n/p + 2)$  available qubits, Lemmas 3 and 4 together imply that the resulting circuit has size  $O(k^2)$  and depth  $O(pk)$ .

- Upon reaching  $\text{QPU}_{p-1}$ , the copied register  $|J_{p-2}\rangle$  suffices to determine the remaining Hamming weight. Therefore,  $\text{QPU}_{p-1}$  only needs to implement

$$|J_{p-2}\rangle |0\rangle \mapsto |J_{p-2}\rangle |k - J_{p-2}\rangle = |J_{p-2}\rangle |j_{p-1}\rangle,$$

thereby completing the preparation, which is depicted in Figure 5 as the gate  $A$ . By Lemma 4, both the circuit size and depth are  $O(\log k)$ .

- Finally, for  $i = p-2, p-3, \dots, 1$ , we uncompute the auxiliary registers  $|J_i\rangle$  stored in  $\text{QPU}_{i+1}$ . Specifically, we first erase the register  $|J_i\rangle$  in  $\text{QPU}_{i+1}$  using its copy in  $\text{QPU}_i$  via a layer of  $\lceil \log(k+1) \rceil$  CNOT gates. Then we apply

$$|J_i\rangle |j_i\rangle \mapsto |J_i - j_i\rangle |j_i\rangle = |J_{i-1}\rangle |j_i\rangle$$

within  $\text{QPU}_i$  to prepare for the next iteration. This operation is depicted in Figure 5 as the gate  $W_i$ , with further details shown in Figure 7. Upon reaching  $\text{QPU}_0$ , a final layer of  $\lceil \log(k+1) \rceil$  CNOT gates suffices to complete the uncomputation. By Lemma 4, this entire uncomputation step incurs circuit size and depth  $O(\log k)$ .

In this phase, our circuit achieves the size of

$$O(k) + O(pk^2) + O(\log k) + O(p \log k) = O(pk^2). \quad (7)$$

and the depth of

$$O(\log k) + O(p^2k) + O(\log k) + O(p \log k) = O(p^2k) \quad (8)$$

The number of communication gates across the  $p$  QPUs is  $O(p \log k)$ .

### Part 3.

Now we combine this with Phase 2 and 3 within each QPU, as described in section 3. Summing the contributions from equations 7, 3, and 5, the total circuit size is

$$O(pk^2) + O(k) + O(nk) = O(nk).$$

Similarly, combining the bounds from equations 8, 4, and 6, the total circuit depth satisfies

$$O(p^2k) + O(\log^2 k) + O(k + \log k \log(n/k)) = O(p^2k + \log k \log(n/k)),$$

The equation follows from our replacement of  $k'$  with  $k$ . The communication complexity is  $O(p \log k)$  as analyzed in Part 2. This completes the proof of Theorem 2.  $\square$

## 5 Lower Bound of Communication Complexity

In this section, we investigate lower bounds on communication complexity. Specifically, we show that preparing a state  $|\psi\rangle$  across  $p$  QPUs requires a communication complexity that is lower bounded by the logarithm of the CP-rank of a  $p$ -order tensor. This tensor is obtained from the state vector of  $|\psi\rangle$  by reshaping it according to the qubit assignment among the QPUs. Unfortunately, computing the CP-rank of a general  $p$ -order tensor is NP-hard for  $p \geq 3$  [35], and we are therefore unable to evaluate the CP-rank of the tensor associated with the Dicke state in this regime. However, for the case  $p = 2$ , the CP-rank reduces to the matrix rank. In this setting, we explicitly compute the rank of the Dicke state's state matrix and show that our construction for  $p = 2$  achieves this lower bound.

### 5.1 $p$ -QPU Distributed Preparation for General States

We first introduce the state tensor of a quantum state.

**Definition 11** (State Tensor). *Let  $f : [n] \rightarrow [p]$  be a partition that assigns the  $n$  qubits to  $p$  QPUs, and let  $Q_j = f^{-1}(j)$  denote the set of qubits assigned to  $\text{QPU}_j$ . Any quantum state  $|\psi\rangle$  distributed across the  $p$  QPUs can be written as*

$$|\psi\rangle = \sum_{\substack{i_j \in [2^{|Q_j|}] \\ j \in [p]}} T_{i_0 i_1 \dots i_{p-1}}^f |i_0 i_1 \dots i_{p-1}\rangle,$$

where  $i_j$  indexes the computational basis states of  $\text{QPU}_j$  for each  $j \in [p]$ . Collecting these coefficients, we define the tensor

$$T_{|\psi\rangle}^f = \left( T_{i_0 i_1 \dots i_{p-1}}^f \right)$$

to be the state tensor associated with the state  $|\psi\rangle$  and the partition  $f$ .

Intuitively, a local unitary operation acting on  $\text{QPU}_j$  corresponds to a linear transformation applied along mode  $j$  of the state tensor. As we show below, the communication complexity required to prepare  $|\psi\rangle$  is closely related to the CP-rank of its associated state tensor.

**Theorem 3.** *For any quantum state  $|\psi\rangle$  and the number of QPUs  $p$ , the communication complexity satisfies*

$$\text{c.c.}_p(|\psi\rangle) \geq \min_{\text{balanced } f : [n] \rightarrow [p]} \log \text{CPrank} \left( T_{|\psi\rangle}^f \right).$$

*Proof.* We prove the theorem via its contrapositive. Specifically, we show that for any quantum state  $|\psi\rangle$  and the number of QPUs  $p$ , if the communication complexity of  $|\psi\rangle$  on  $p$  QPUs,  $\text{c.c.}_p(|\psi\rangle) = l$  for some integer  $l \geq 0$ , then for all balanced partition  $f : [n] \rightarrow [p]$ , the CP-rank of the corresponding state tensor satisfies  $\text{CPrank} \left( T_{|\psi\rangle}^f \right) \leq 2^l$ .

We proceed by induction on the communication complexity  $l$  for any balanced partition  $f : [n] \rightarrow [p]$ . For the base case  $l = 0$ , no communication is allowed, so the QPUs can only prepare a product state  $|\psi_0\rangle$ , where  $\text{CPrank} \left( T_{|\psi_0\rangle}^f \right) = 1$ .

Next, we consider the induction step. Suppose, by the induction hypothesis, that for communication complexity  $l - 1$  we obtain

$$|\psi_{l-1}\rangle = \sum_{w \in [2]^{l-1}} |v_0(w), v_1(w), \dots, v_{p-1}(w)\rangle.$$

Without loss of generality, assume that the protocol requires  $\text{QPU}_0$  to perform a local operation and then communicate with  $\text{QPU}_1$ . Let

$$|\psi'_{l-1}\rangle = \sum_{w \in [2]^{l-1}} |v'_0(w), v_1(w), \dots, v_{p-1}(w)\rangle$$

denote the state after the local operation on  $\text{QPU}_0$ . When  $\text{QPU}_0$  sends a qubit to  $\text{QPU}_1$ , it must first separate this qubit from its local system. In general, this qubit may be entangled with the other qubits in  $\text{QPU}_0$ , so the resulting state can be written as

$$|\psi_l\rangle = |\psi'_{l-1}\rangle = \sum_{w \in [2]^{l-1}, z \in [2]} |v''_0(w, z), c(w, z), v_1(w), \dots, v_{p-1}(w)\rangle,$$

where  $c(w, z)$  represents the communicated qubit and

$$|v'_0(w)\rangle = \sum_{z \in [2]} |v''_0(w, z), c(w, z)\rangle.$$

It follows that  $\text{CPrank} \left( T_{|\psi_l\rangle}^f \right) \leq 2^l$ , as required.  $\square$

## 5.2 2-QPU Distributed Preparation for Dicke States

We establish a lower bound on the communication complexity required to prepare Dicke states across 2 QPUs as follows.

**Theorem 4.** *To prepare  $D(n, k)$  across two quantum processing units (QPUs), each holding  $n/2$  qubits, the communication complexity satisfies*

$$\text{c.c.}_2(D(n, k)) \geq \lceil \log(k+1) \rceil.$$

Recall Theorem 1. By Theorem 4, we show that when  $p = 2$ , our construction for preparing Dicke states attains the lower bound on communication complexity.

The proof of Theorem 4 relies directly on Theorem 3. In the special case where  $p = 2$ , the CP-rank of the state tensor coincides with the rank of the state matrix, which is considerably easier to compute.

*Proof.* By definition,  $D(n, k)$  is symmetric under permutations of qubits. Consequently, it is unnecessary to specify the partition  $f$  in the proof. Therefore, let  $M_{D(n,k)} = (a_{x,y})$  be the matrix defined in Definition 11 when  $p = 2$ , and let  $R_x$  denote the row of  $M_{D(n,k)}$  indexed by  $x$ . By definition,

$$a_{x,y} = \begin{cases} 1, & \text{if } \text{ham}(x) + \text{ham}(y) = k, \\ 0, & \text{otherwise.} \end{cases}$$

We first observe that each row  $R_x$  depends only on the Hamming weight  $\text{ham}(x)$ . Indeed, if  $\text{ham}(x_1) = \text{ham}(x_2)$ , then for every  $y$ ,

$$\text{ham}(x_1) + \text{ham}(y) = k \iff \text{ham}(x_2) + \text{ham}(y) = k,$$

and hence

$$R_{x_1} = R_{x_2}.$$

Since  $\text{ham}(x)$  ranges over  $\{0, 1, \dots, k\}$  for rows that are not identically zero, there are at most  $k + 1$  distinct rows. Therefore,

$$\text{rank}(M_{D(n,k)}) \leq k + 1.$$

On the other hand, if  $\text{ham}(x_1) \neq \text{ham}(x_2)$ , then the supports of the corresponding rows are disjoint. Indeed, there is no  $y$  such that

$$\text{ham}(x_1) + \text{ham}(y) = k \text{ and } \text{ham}(x_2) + \text{ham}(y) = k$$

hold simultaneously. Consequently,

$$\text{supp}(R_{x_1}) \cap \text{supp}(R_{x_2}) = \emptyset.$$

It follows that the rows corresponding to distinct Hamming weights are linearly independent, and thus

$$\text{rank}(M_{D(n,k)}) \geq k + 1.$$

Combining the above inequalities, we conclude that

$$\text{rank}(M_{D(n,k)}) = k + 1.$$

Substituting this into Theorem 3, we obtain

$$\text{c.c.}_2(D(n, k)) \geq \lceil \log \text{rank}(M_{D(n,k)}) \rceil = \lceil \log(k + 1) \rceil.$$

□

## 6 Conclusion

This work addresses the critical scalability challenge in preparing large-qubit  $k$ -excitation Dicke states  $D(n, k)$  by exploring their distributed preparation across a general number  $p$  of QPUs. We present a novel distributed quantum circuit architecture that achieves both logarithmic communication complexity and polynomial circuit depth and size for distributed Dicke state preparation. Additionally, we establish a general lower bound on the communication complexity of  $p$ -QPU distributed state preparation, formulated via the CP-rank of a tensor associated with the target state. For  $p = 2$ , we explicitly compute the CP-rank of the corresponding Dicke state tensor, derive a lower bound of  $\log k$ , and verify that our construction's communication complexity matches this fundamental limit, confirming its optimality in this case.

While this work advances the state of the art in distributed quantum state preparation, it also opens avenues for future research. We conjecture that our upper bound on communication complexity is tight for  $p \geq 3$ , though we have not yet proven this assertion. A key direction for future work is to derive the CP-rank of the Dicke state-associated tensor for  $p \geq 3$ —a result that would enable rigorous verification of our conjecture, establish general optimality bounds for distributed Dicke state preparation, and deepen the understanding of tensor ranks in the context of multi-QPU distributed quantum systems. Further extensions could also explore the robustness of our circuit to noise in inter-QPU communication channels, as well as its practical implementation on emerging distributed quantum hardware.

## References

- [1] J. Preskill, “Quantum computing in the nisq era and beyond,” *Quantum*, vol. 2, p. 79, 2018.
- [2] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, R. Biswas, S. Boixo, F. G. Brandao, D. A. Buell *et al.*, “Quantum supremacy using a programmable superconducting processor,” *Nature*, vol. 574, no. 7779, pp. 505–510, 2019.
- [3] K. Bharti, A. Cervera-Lierta, T. H. Kyaw, T. Haug, S. Alperin-Lea, A. Anand, M. Degroote, H. Heimonen, J. S. Kottmann, T. Menke *et al.*, “Noisy intermediate-scale quantum algorithms,” *Reviews of Modern Physics*, vol. 94, no. 1, p. 015004, 2022.
- [4] V. S. Denchev and G. Pandurangan, “Distributed quantum computing: a new frontier in distributed systems or science fiction?” *SIGACT News*, vol. 39, no. 3, p. 77–95, Sep. 2008.
- [5] A. S. Cacciapuoti, M. Caleffi, F. Tafuri, F. S. Cataliotti, S. Gherardini, and G. Bianchi, “Quantum internet: Networking challenges in distributed quantum computing,” *IEEE Network*, vol. 34, no. 1, pp. 137–143, 2020.
- [6] M. Caleffi, M. Amoretti, D. Ferrari, J. Illiano, A. Manzalini, and A. S. Cacciapuoti, “Distributed quantum computing: A survey,” *Computer Networks*, vol. 254, p. 110672, 2024.
- [7] D. Barral, F. J. Cardama, G. Díaz-Camacho, D. Faflde, I. F. Llovo, M. Mussa-Juane, J. Vázquez-Pérez, J. Villasuso, C. Piñeiro, N. Costas, J. C. Pichel, T. F. Pena, and A. Gómez, “Review of distributed quantum computing: From single qpu to high performance quantum computing,” *Computer Science Review*, vol. 57, p. 100747, 2025.
- [8] G.-L. Long, “Interfacing superconducting and atomic qubits via unconventional geometric quantum operations,” *Science China Physics, Mechanics & Astronomy*, vol. 65, no. 4, p. 240361, 2022.
- [9] L. Li, X. Sun, J. Zhang, and J. Zhu, “Space-bounded communication complexity of unitaries,” *arXiv preprint arXiv:2511.04250*, 2025.
- [10] D. Main, P. Drmota, D. Nadlinger, E. Ainley, A. Agrawal, B. Nichol, R. Srinivas, G. Araneda, and D. Lucas, “Distributed quantum computing across an optical network link,” *Nature*, pp. 1–6, 2025.
- [11] Z. Davarzani, M. Zomorodi, and M. Houshmand, “A hierarchical approach for building distributed quantum systems,” *Scientific Reports*, vol. 12, no. 1, p. 15421, 2022.
- [12] D. Ferrari, S. Carretta, and M. Amoretti, “A modular quantum compilation framework for distributed quantum computing,” *IEEE Transactions on Quantum Engineering*, vol. 4, pp. 1–13, 2023.
- [13] I. Ghodsollahee, Z. Davarzani, M. Zomorodi, P. Pławiak, M. Houshmand, and M. Houshmand, “Connectivity matrix model of quantum circuits and its application to distributed quantum circuit optimization,” *Quantum Information Processing*, vol. 20, no. 7, p. 235, 2021.
- [14] M. Sarvaghad-Moghaddam and M. Zomorodi, “A general protocol for distributed quantum gates,” *Quantum Information Processing*, vol. 20, no. 8, p. 265, 2021.

- [15] D. Dadkhah, M. Zomorodi, and S. E. Hosseini, “A new approach for optimization of distributed quantum circuits,” *International Journal of Theoretical Physics*, vol. 60, no. 9, pp. 3271–3285, 2021.
- [16] J.-Y. Wu, K. Matsui, T. Forrer, A. Soeda, P. Andrés-Martínez, D. Mills, L. Henaut, and M. Murao, “Entanglement-efficient bipartite-distributed quantum computing,” *Quantum*, vol. 7, p. 1196, 2023.
- [17] M. G. Davis, J. Chung, D. Englund, and R. Kettimuthu, “Towards distributed quantum computing by qubit and gate graph partitioning techniques,” in *2023 IEEE International Conference on Quantum Computing and Engineering (QCE)*, vol. 1. IEEE, 2023, pp. 161–167.
- [18] X. Chen, Z. Chen, P. Zhu, X. Cheng, and Z. Guan, “Circuit partitioning and transmission cost optimization in distributed quantum circuits,” *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 2025.
- [19] M. Zomorodi-Moghadam, M. Houshmand, and M. Houshmand, “Optimizing teleportation cost in distributed quantum circuits,” *International Journal of Theoretical Physics*, vol. 57, no. 3, pp. 848–861, 2018.
- [20] M. Bandic, L. Prielerger, J. Nüßlein, A. Ovide, S. Rodrigo, S. Abadal, H. Van Someren, G. Vardoyan, E. Alarcon, C. G. Almudever *et al.*, “Mapping quantum circuits to modular architectures with qubo,” in *2023 IEEE International Conference on Quantum Computing and Engineering (QCE)*, vol. 1. IEEE, 2023, pp. 790–801.
- [21] F. Burt, K.-C. Chen, and K. K. Leung, “Generalised circuit partitioning for distributed quantum computing,” in *2024 IEEE International Conference on Quantum Computing and Engineering (QCE)*, vol. 2. IEEE, 2024, pp. 173–178.
- [22] R. H. Dicke, “Coherence in spontaneous radiation processes,” *Physical review*, vol. 93, no. 1, p. 99, 1954.
- [23] R. Prevedel, G. Cronenberg, M. S. Tame, M. Paternostro, P. Walther, M. S. Kim, and A. Zeilinger, “Experimental realization of dicke states of up to six qubits for multiparty quantum networking,” *Phys. Rev. Lett.*, vol. 103, p. 020503, Jul 2009.
- [24] A. Chiuri, C. Greganti, M. Paternostro, G. Vallone, and P. Mataloni, “Experimental quantum networking protocols via four-qubit hyperentangled dicke states,” *Phys. Rev. Lett.*, vol. 109, p. 173604, Oct 2012.
- [25] S. K. Özdemir, J. Shimamura, and N. Imoto, “A necessary and sufficient condition to play games in quantum mechanical settings,” *New Journal of Physics*, vol. 9, no. 2, p. 43, 2007.
- [26] G. Tóth, W. Wieczorek, D. Gross, R. Krischek, C. Schwemmer, and H. Weinfurter, “Permutationally invariant quantum tomography,” *Physical review letters*, vol. 105, no. 25, p. 250403, 2010.
- [27] P. Hyllus, W. Laskowski, R. Krischek, C. Schwemmer, W. Wieczorek, H. Weinfurter, L. Pezzé, and A. Smerzi, “Fisher information and multiparticle entanglement,” *Physical Review A—Atomic, Molecular, and Optical Physics*, vol. 85, no. 2, p. 022321, 2012.
- [28] Y. Ouyang, N. Shettell, and D. Markham, “Robust quantum metrology with explicit symmetric states,” *IEEE Transactions on Information Theory*, vol. 68, no. 3, pp. 1809–1821, 2021.
- [29] A. Bärtschi and S. Eidenbenz, “Deterministic preparation of dicke states,” in *International Symposium on Fundamentals of Computation Theory*. Springer, 2019, pp. 126–139.
- [30] ———, “Short-depth circuits for dicke state preparation,” in *2022 IEEE International Conference on Quantum Computing and Engineering (QCE)*. IEEE, 2022, pp. 87–96.
- [31] P. Yuan and S. Zhang, “Depth-efficient quantum circuit synthesis for deterministic dicke state preparation,” *arXiv preprint arXiv:2505.15413*, 2025.
- [32] J. Yu, S. R. Muleady, Y.-X. Wang, N. Schine, A. V. Gorshkov, and A. M. Childs, “Efficient preparation of dicke states,” *Physical Review Letters*, vol. 136, no. 3, p. 030601, 2026.

- [33] Y. Li, G. Tian, X. He, and X. Sun, “Preparation of hamming-weight-preserving quantum states with log-depth quantum circuits,” *arXiv preprint arXiv:2508.14470*, 2025.
- [34] A. Ambainis, L. J. Schulman, A. Ta-Shma, U. Vazirani, and A. Wigderson, “The quantum communication complexity of sampling,” *SIAM Journal on Computing*, vol. 32, no. 6, pp. 1570–1585, 2003.
- [35] J. Håstad, “Tensor rank is np-complete,” *Journal of algorithms*, vol. 11, no. 4, pp. 644–654, 1990.
- [36] M. A. Nielsen and I. L. Chuang, *Quantum computation and quantum information*. Cambridge university press, 2010.
- [37] X. Sun, G. Tian, S. Yang, P. Yuan, and S. Zhang, “Asymptotically optimal circuit depth for quantum state preparation and general unitary synthesis,” *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 42, no. 10, pp. 3301–3314, 2023.
- [38] P. Yuan and S. Zhang, “Optimal (controlled) quantum state preparation and improved unitary synthesis by quantum circuits with any number of ancillary qubits,” *Quantum*, vol. 7, p. 956, 2023.
- [39] M. Remaud, “Optimizing t and cnot gates in quantum ripple-carry adders and comparators,” in *Proceedings of Recent Advances in Quantum Computing and Technology*, 2024, pp. 56–61.
- [40] P. Gokhale, S. Koretsky, S. Huang, S. Majumder, A. Drucker, K. R. Brown, and F. T. Chong, “Quantum fan-out: Circuit optimizations and technology modeling,” in *2021 IEEE International Conference on Quantum Computing and Engineering (QCE)*. IEEE, 2021, pp. 276–290.
- [41] W. Zi, J. Nie, and X. Sun, “Shallow quantum circuit implementation of symmetric functions with limited ancillary qubits,” *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 2025.