



# Leveraging Technology Insight in SFQ Logic Synthesis

Rassul Bairamkulov

LSI EPFL

September 1, 2023



## OUTLINE

**Rapid Single-Flux Quantum – Fundamentals**

**Gate Compounding Technique**

**Multiphase Clocking**

**Summary**

# Superconductive Electronics: Applications

- Today
  - Digital signal processing
  - Radars and communications
  - High-speed analog-to-digital converters
- Goal
  - Energy-efficient data centers
  - Space electronics
  - Interface to quantum computers



- Holmes, D. S., Ripple, A. L., & Manheimer, M. A. (2013). Energy-efficient superconducting computing—Power budgets and requirements. *IEEE Transactions on Applied Superconductivity*, 23(3), 1701610-1701610.

# Superconductive Electronics: Challenges

- Fabrication technology
  - State of the art: 6k JJ/mm<sup>2</sup>
  - Compare to 90M transistors/mm<sup>2</sup> (TSMC 7nm)
- Expensive memory
  - Poor density of inductor-based memory
- Architectural differences
  - **Gate-level pipelining**
  - **Fanout of one**




# Superconductive Electronics: History

- Discovered by H.K. Onnes in 1911
  - Resistance of Hg disappeared at 4.2 K
  - Nobel Prize in Physics 1913
- Resistance of exactly 0
  - No voltage drop
  - Persistent current



# Superconductive Electronics: Josephson Effect

- Discovered in 1962 by B. Josephson
- Nobel Prize in Physics 1973
- Tunneling of supercurrent through insulating layer
- No voltage drop



# Josephson Junction (JJ)

- Two superconductors separated by a weak link
- No voltage drop for DC current below the critical current $I_c$
- Increasing the current beyond $I_c$
  - Increases the inductance of the JJ
  - Produces a voltage pulse
    - Time-integrated area of approximately $2.07 \text{ mV}\cdot\text{ps}$ (one flux quantum, $\Phi_0 = h/2e$)
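The quoted pulse area can be checked against the magnetic flux quantum $\Phi_0 = h/2e$. The snippet below is a minimal sketch using the exact SI values of the Planck constant and the elementary charge; the variable names are illustrative and not taken from the slides.

```python
# Magnetic flux quantum Phi_0 = h / (2e), from the exact SI (2019) constants.
PLANCK_H = 6.62607015e-34            # J*s
ELEMENTARY_CHARGE = 1.602176634e-19  # C

phi0_wb = PLANCK_H / (2 * ELEMENTARY_CHARGE)  # weber, i.e. V*s
phi0_mv_ps = phi0_wb / 1e-15                  # 1 mV*ps = 1e-3 V * 1e-12 s

print(f"Phi_0 = {phi0_mv_ps:.3f} mV*ps")  # Phi_0 = 2.068 mV*ps
```

A pulse 1 ps wide therefore has an average amplitude of roughly 2 mV, which gives a sense of the signal scale.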



# Basic JJ Loop




# Storage Loop



# Josephson Transmission Line



- Transmits SFQ from A to B
- Rejects SFQ from B to A



# Merger = JTL + 2xBuffer

- Combines pulses from two branches into one output branch
- Could be viewed as asynchronous OR



# Splitter

- Duplicates SFQ pulse to two output branches



# Clocked Elements: DFF

- Most logic gates in RSFQ are clocked
- Storage loop holds the state
- Clock signal reads the state




# Clocked Elements: Inverter

- Relatively expensive operation
- Produces an output pulse on the clock if no input has arrived at D
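The behavior of the clocked elements described so far (DFF and inverter) can be sketched as a small pulse-level model. This is an illustrative abstraction only: the class and method names are hypothetical, a pulse is modeled as a method call, and no analog timing is captured.

```python
class DFF:
    """Storage loop: a data pulse sets the state, a clock pulse reads and resets it."""

    def __init__(self):
        self.stored = False

    def data(self):
        # An incoming SFQ pulse on D is stored in the loop.
        self.stored = True

    def clock(self):
        # The clock reads out the state and clears the loop.
        out, self.stored = self.stored, False
        return out


class Inverter(DFF):
    """Emits a pulse on the clock only if no data pulse arrived since the last clock."""

    def clock(self):
        return not super().clock()


def run(gate, data_per_cycle):
    """Apply one optional data pulse per cycle, then clock the gate."""
    outputs = []
    for has_pulse in data_per_cycle:
        if has_pulse:
            gate.data()
        outputs.append(gate.clock())
    return outputs


print(run(DFF(), [True, False, True]))       # [True, False, True]
print(run(Inverter(), [True, False, True]))  # [False, True, False]
```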




# Clocked Elements: XOR



# Clocked Elements: OR = Merger + DFF

- DFFs ensure simultaneous arrival of pulses to the merger
- AND/OR based on tuning  $I_0$ 
  - Large  $I_0$  – OR
  - Small  $I_0$  – AND



# Path Balancing

- Ensure correct order of signal arrival
- Fundamental issue in superconductive electronics
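One common way to ensure the correct arrival order is to insert DFFs so that every path into a gate crosses the same number of clocked stages. The sketch below counts such path-balancing DFFs under simplifying assumptions that are not from the slides: single-phase clocking, every gate clocked, one DFF per skipped level, and no sharing of DFFs between fanout branches. The function names and the example netlist are hypothetical.

```python
from collections import defaultdict


def levels(edges, inputs):
    """Longest-path level of each node in a DAG; primary inputs sit at level 0.

    Assumes every non-input node has at least one fanin listed in `edges`.
    """
    fanins = defaultdict(list)
    for src, dst in edges:
        fanins[dst].append(src)
    memo = {node: 0 for node in inputs}

    def level(node):
        if node not in memo:
            memo[node] = 1 + max(level(f) for f in fanins[node])
        return memo[node]

    for _, dst in edges:
        level(dst)
    return memo


def path_balancing_dffs(edges, inputs):
    """An edge spanning k levels needs k - 1 DFFs to delay the early signal."""
    lvl = levels(edges, inputs)
    return sum(lvl[dst] - lvl[src] - 1 for src, dst in edges)


# Hypothetical netlist: input c reaches g2 one stage earlier than g1 does.
edges = [("a", "g1"), ("b", "g1"), ("g1", "g2"), ("c", "g2")]
print(path_balancing_dffs(edges, inputs=["a", "b", "c"]))  # 1
```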









# Three Types of SFQ Devices

- Asynchronous (AA)
  - Asynchronous input
  - Asynchronous output
- Synchronizers (AS)
  - Asynchronous input
  - **Synchronized** output
- Synchronized (SA)
  - **Synchronized** input
  - Asynchronous output






# Gate Compounding Technique

- AS right before SA
  - Synchronize inputs to SA gates
- AA gates at periphery
  - Enrich functionality without extra cycles
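The placement rule above (an AS element directly before every SA gate) can be expressed as a simple netlist check. This is a hedged sketch: the `Kind` enum, the function name, and the example netlist are hypothetical, and the check covers only the "AS right before SA" rule, not the full compounding procedure.

```python
from enum import Enum


class Kind(Enum):
    AA = "asynchronous input, asynchronous output"
    AS = "asynchronous input, synchronized output"
    SA = "synchronized input, asynchronous output"


def unsynchronized_sa_inputs(kinds, edges):
    """Return the edges that feed an SA gate from anything other than an AS gate."""
    return [
        (src, dst)
        for src, dst in edges
        if kinds.get(dst) is Kind.SA and kinds.get(src) is not Kind.AS
    ]


# Hypothetical netlist: s1 and s2 are AS, m is AA, g is SA.
kinds = {"s1": Kind.AS, "s2": Kind.AS, "m": Kind.AA, "g": Kind.SA}
good = [("s1", "g"), ("s2", "g")]
bad = [("s1", "g"), ("m", "g")]

print(unsynchronized_sa_inputs(kinds, good))  # []
print(unsynchronized_sa_inputs(kinds, bad))   # [('m', 'g')]
```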



- R. Bairamkulov and G. De Micheli, "Compound Logic Gates for Pipeline Depth Minimization in Single Flux Quantum Integrated Systems," *Proceedings of the ACM Great Lakes Symposium on VLSI, Knoxville, Tennessee, 2023.*

# Gate Compounding: Example

- SA gates sensitive to input timing
  - Place them after AS gates



# Gate Compounding: NIMPLY

- Replace DFF with inverter




# Gate Compounding: XNOR






# Example

10 elements  
4 cycles



6 elements  
1 cycle



# 4-bit Brent-Kung Adder: Simulation

- Gate compounding
  - 4 clock cycles
  - 658 JJs



- Conventional gates\*
  - 6 clock cycles
  - 928 JJs

\*T. Kawaguchi, K. Takagi, and N. Takagi. 2021. Rapid Single-Flux-Quantum Logic Circuits Using Clockless Gates. IEEE Transactions on Applied Superconductivity 31, 4 (June 2021).

# Technology Mapping: Enumeration

- Synthesize using precomputed optimal structures
  - 65536 4-input functions
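The count of 65536 follows from the size of a truth table: 4 inputs give $2^4 = 16$ rows, and each row can independently be 0 or 1, so there are $2^{16}$ functions. The sketch below encodes a function as a 16-bit integer, which is one convenient key for a table of precomputed structures. The encoding and the names are assumptions for illustration, not the authors' data structure.

```python
NUM_INPUTS = 4
NUM_ROWS = 2 ** NUM_INPUTS     # 16 input combinations
NUM_FUNCTIONS = 2 ** NUM_ROWS  # one output bit per row


def evaluate(truth_table, inputs):
    """Evaluate a function stored as an integer truth table.

    Bit k of `truth_table` is the output for the input row with index k,
    where inputs[0] is the least significant bit of the row index.
    """
    row = sum(bit << pos for pos, bit in enumerate(inputs))
    return (truth_table >> row) & 1


AND4 = 1 << (NUM_ROWS - 1)  # only row 0b1111 produces 1
XOR4 = sum(1 << row for row in range(NUM_ROWS) if bin(row).count("1") % 2)

print(NUM_FUNCTIONS)                 # 65536
print(hex(XOR4))                     # 0x6996
print(evaluate(AND4, (1, 1, 1, 1)))  # 1
print(evaluate(XOR4, (1, 1, 0, 0)))  # 0
```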






# Synthesis Results

- EPFL and ISCAS benchmarks
- On average
  - 24% smaller area
  - 33% smaller depth

| Benchmark | #DFF (baseline) | #DFF (our work) | #DFF ratio | #JJ (baseline) | #JJ (our work) | #JJ ratio | Delay (baseline) | Delay (our work) | Delay ratio | Runtime (s) |
|-----------|----------|----------|-------|----------|----------|-------|----------|----------|-------|------------|
| sin       | 13,666   | 17,627   | 1.29  | 215,318  | 126,694  | 0.59  | 182      | 86       | 0.47  | 0.399      |
| cavlc     | 522      | 987      | 1.89  | 16,339   | 15,098   | 0.92  | 17       | 11       | 0.65  | 0.009      |
| dec       | 8        | 16       | 2.00  | 5,469    | 6,324    | 1.16  | 4        | 4        | 1.00  | 0.006      |
| int2float | 270      | 443      | 1.64  | 6,432    | 5,616    | 0.87  | 16       | 10       | 0.63  | 0.004      |
| priority  | 9,064    | 14,754   | 1.63  | 102,085  | 95,370   | 0.93  | 127      | 125      | 0.98  | 0.013      |
| c499      | 476      | 512      | 1.08  | 7,758    | 5,593    | 0.72  | 13       | 8        | 0.62  | 0.040      |
| c880      | 774      | 1,179    | 1.52  | 12,909   | 8,359    | 0.65  | 22       | 13       | 0.59  | 0.013      |
| c1908     | 696      | 799      | 1.15  | 12,013   | 5,553    | 0.46  | 20       | 11       | 0.55  | 0.025      |
| c3540     | 1,159    | 1,556    | 1.34  | 28,300   | 22,231   | 0.79  | 31       | 18       | 0.58  | 0.034      |
| c5315     | 2,908    | 3,727    | 1.28  | 52,033   | 33,524   | 0.64  | 23       | 13       | 0.57  | 0.091      |
| c7552     | 2,429    | 4,744    | 1.95  | 48,482   | 28,900   | 0.60  | 19       | 13       | 0.68  | 0.115      |
| Average   |          |          | 1.53  |          |          | 0.76  |          |          | 0.67  |            |

- R. Bairamkulov, A. Tempia Calvino and G. De Micheli, "Synthesis of SFQ Circuits with Compound Gates," to appear in Proceedings of the IEEE/IFIP International Conference on Very Large Scale Integration, Sharjah, United Arab Emirates, 2023
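The headline averages can be reproduced from the ratio columns of the table. The sketch below takes the #JJ ratio as the area figure and the delay ratio as the depth figure, which is an interpretation of the slide rather than something it states explicitly.

```python
from statistics import mean

# Ratio columns copied from the table (our work / baseline), one entry per benchmark.
jj_ratio = [0.59, 0.92, 1.16, 0.87, 0.93, 0.72, 0.65, 0.46, 0.79, 0.64, 0.60]
delay_ratio = [0.47, 0.65, 1.00, 0.63, 0.98, 0.62, 0.59, 0.55, 0.58, 0.57, 0.68]

avg_jj = mean(jj_ratio)
avg_delay = mean(delay_ratio)

print(f"area:  ratio {avg_jj:.2f}, {1 - avg_jj:.0%} smaller")        # ratio 0.76, 24% smaller
print(f"depth: ratio {avg_delay:.2f}, {1 - avg_delay:.0%} smaller")  # ratio 0.67, 33% smaller
```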




# Multiphase Clocking

- Effective method for reducing the number of path-balancing (PB) DFFs
- No high-frequency clock
- Tradeoff with throughput



# State of the Art\*

- -40% area with two phases
- Up to -70% with 10 phases

TABLE I  
NUMBER OF REQUIRED DFFS WITH GIVEN NUMBER OF CLOCK PHASES

| Designs                | AMD2901                       | ss_pcm | simple_spi | des_area | ethernet | pci_bridge32 | spi   | mem_ctrl | Avg.         | Improvement  |
|------------------------|-------------------------------|--------|------------|----------|----------|--------------|-------|----------|--------------|--------------|
| Original Clocked Gates | 1042                          | 524    | 946        | 3947     | 275      | 22225        | 3004  | 8362     | <b>5041</b>  | N/A          |
| Clock Phases           | Number of Required Extra DFFs |        |            |          |          |              |       |          |              |              |
| <b>1</b>               | 3546                          | 995    | 2995       | 4571     | 564      | 168810       | 13140 | 51321    | <b>30743</b> | N/A          |
| <b>2</b>               | 1539                          | 305    | 1146       | 1510     | 210      | 76751        | 5851  | 23255    | <b>13821</b> | <b>55.0%</b> |
| <b>3</b>               | 982                           | 105    | 599        | 756      | 90       | 47798        | 3477  | 13821    | <b>8454</b>  | <b>72.5%</b> |
| <b>4</b>               | 600                           | 24     | 288        | 399      | 61       | 33167        | 2304  | 9234     | <b>5760</b>  | <b>81.3%</b> |
| <b>5</b>               | 533                           | 21     | 132        | 262      | 32       | 22419        | 1577  | 6060     | <b>3880</b>  | <b>87.4%</b> |
| <b>6</b>               | 334                           | 9      | 103        | 194      | 26       | 18504        | 1149  | 4574     | <b>3112</b>  | <b>89.9%</b> |
| <b>7</b>               | 229                           | 8      | 60         | 116      | 25       | 15272        | 897   | 3143     | <b>2469</b>  | <b>92.0%</b> |
| <b>8</b>               | 200                           | 7      | 42         | 88       | 21       | 11755        | 813   | 3047     | <b>1997</b>  | <b>93.5%</b> |
| <b>9</b>               | 159                           | 7      | 36         | 64       | 0        | 9488         | 507   | 1715     | <b>1497</b>  | <b>95.1%</b> |
| <b>10</b>              | 103                           | 6      | 22         | 64       | 0        | 8714         | 426   | 1670     | <b>1376</b>  | <b>95.5%</b> |

TABLE II  
TOTAL AREA ( $mm^2$ ) WITH GIVEN NUMBER OF CLOCK PHASES

| Clock Phases | AMD2901 | ss_pcm | simple_spi | des_area | ethernet | pci_bridge32 | spi | mem_ctrl | Avg. | Improvement |
|--------------|---------|--------|------------|----------|----------|--------------|-----|----------|------|-------------|
| <b>1</b>     | 23.1                  | 7.8    | 19.7       | 49.5     | 4.1      | 837.2        | 79.5 | 270.9    | <b>161.5</b> | N/A          |
| <b>2</b>     | 16.3                  | 5.3    | 12.8       | 38.8     | 2.8      | 474.3        | 53.9 | 162.7    | <b>95.9</b>  | <b>40.6%</b> |
| <b>3</b>     | 15.0                  | 4.6    | 10.5       | 36.6     | 2.4      | 361.5        | 45.1 | 130.2    | <b>75.7</b>  | <b>53.1%</b> |
| <b>4</b>     | 13.4                  | 4.4    | 9.5        | 36.5     | 2.3      | 303.3        | 39.3 | 113.0    | <b>65.2</b>  | <b>59.6%</b> |
| <b>5</b>     | 12.6                  | 4.5    | 8.7        | 36.7     | 2.2      | 258.9        | 37.5 | 99.1     | <b>57.5</b>  | <b>64.4%</b> |
| <b>6</b>     | 12.9                  | 4.4    | 8.5        | 37.2     | 2.2      | 244.3        | 36.4 | 91.5     | <b>54.7</b>  | <b>66.1%</b> |
| <b>7</b>     | 11.4                  | 4.4    | 9.0        | 37.3     | 2.2      | 237.2        | 34.8 | 87.0     | <b>52.9</b>  | <b>67.2%</b> |
| <b>8</b>     | 10.6                  | 4.6    | 8.6        | 37.6     | 2.2      | 218.7        | 35.3 | 87.7     | <b>50.7</b>  | <b>68.6%</b> |
| <b>9</b>     | 10.6                  | 4.4    | 9.5        | 37.8     | 2.1      | 219.4        | 32.9 | 82.3     | <b>49.9</b>  | <b>69.1%</b> |
| <b>10</b>    | 10.4                  | 4.6    | 8.9        | 37.8     | 2.1      | 212.1        | 33.0 | 83.3     | <b>49.0</b>  | <b>69.6%</b> |

\*X. Li, M. Pan, T. Liu, and P. A. Beerel, “Multi-Phase Clocking for Multi-Threaded Gate-Level-Pipelined Superconductive Logic,” Proceedings of the IEEE Computer Society Symposium on VLSI, pp. 62–67, 2022.

- Assign phases to gates
- Minimize number of DFFs
- Only valid for AS gates
- No support for asynchronous elements



$$\min_{\sigma(g)\ \forall g \in G} \sum_{(i,j) \in E} \left\lfloor \frac{\sigma(j) - \sigma(i)}{n} \right\rfloor,$$

Subject to:

$$\sigma(i) < \sigma(j) \quad \forall (i, j) \in E,$$

$$\left\lfloor \frac{\sigma(i)}{n} \right\rfloor = 0 \quad \forall i \in I,$$

$$\left\lfloor \frac{\sigma(i)}{n} \right\rfloor = \left\lfloor \frac{\sigma(j)}{n} \right\rfloor \quad \forall i, j \in O.$$
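The formulation above can be evaluated directly for a candidate assignment. The sketch below implements the objective and constraints exactly as written, treating $\sigma$ as an integer stage index and $n$ as the number of phases; that reading of the symbols, the function names, and the example netlist are assumptions. It only scores a given assignment and does not solve the optimization.

```python
def dff_cost(sigma, edges, n):
    """Objective as written: sum over edges of floor((sigma[j] - sigma[i]) / n)."""
    return sum((sigma[j] - sigma[i]) // n for i, j in edges)


def is_feasible(sigma, edges, inputs, outputs, n):
    """Check the three constraints of the formulation for one assignment."""
    ordered = all(sigma[i] < sigma[j] for i, j in edges)
    inputs_first = all(sigma[i] // n == 0 for i in inputs)
    outputs_aligned = len({sigma[o] // n for o in outputs}) <= 1
    return ordered and inputs_first and outputs_aligned


# Hypothetical netlist and stage assignment.
edges = [("a", "g1"), ("b", "g2"), ("g1", "g2"), ("g2", "o")]
sigma = {"a": 0, "b": 0, "g1": 1, "g2": 5, "o": 6}

print(is_feasible(sigma, edges, inputs=["a", "b"], outputs=["o"], n=4))  # True
print(dff_cost(sigma, edges, n=4))  # 2
print(dff_cost(sigma, edges, n=1))  # 11
```

With the same assignment, increasing $n$ lowers the cost, which mirrors the trend in Table I.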



# Multiphase Clocking: Example



# Multiphase Clocking: Example



# Multiphase Clocking: Conflicts



# Multiphase Clocking: Conflicts



# Multiphase Clocking: Conflicts



- Placing DFFs is intractable
- Split the problem into two levels
  1. Phase assignment (macro)
     - Associate each gate with a stage
  2. DFF placement (micro)
     - Can be solved optimally

$$\min_{\sigma(g) \forall g \in G} \sum_{(i,j) \in E} \left\lfloor \frac{\sigma(j) - \sigma(i) + (j \in G_{SA})}{n} \right\rfloor,$$

Subject to:

$$\sigma(i) < \sigma(j) \quad \forall (i,j) \in E, j \in G_{AS},$$

$$\sigma(i) \leq \sigma(j) \quad \forall (i,j) \in E, j \notin G_{AS},$$

$$\sigma(g) \geq \sigma(a) + (a \notin G_{AS}) \quad \forall g \in G_{SA}, a \in FI(g),$$

$$\left\lfloor \frac{\sigma(i)}{n} \right\rfloor = 0 \quad \forall i \in I,$$

$$\left\lfloor \frac{\sigma(i)}{n} \right\rfloor = \left\lfloor \frac{\sigma(j)}{n} \right\rfloor \quad \forall i, j \in O.$$
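The extended formulation differs from the baseline in the bracketed terms, which act as 0/1 indicators. The sketch below evaluates it for one assignment, using Python booleans as the indicators. The names, the example netlist, and the choice to treat primary inputs as not belonging to $G_{AS}$ are illustrative assumptions.

```python
def compound_dff_cost(sigma, edges, n, sa_gates):
    """Objective as written: the (j in G_SA) indicator adds 1 for edges into SA gates."""
    return sum((sigma[j] - sigma[i] + (j in sa_gates)) // n for i, j in edges)


def compound_is_feasible(sigma, edges, inputs, outputs, n, as_gates, sa_gates):
    """Check the constraints of the extended formulation for one assignment."""
    for i, j in edges:
        if j in as_gates and not sigma[i] < sigma[j]:
            return False  # AS gates need a strictly later stage than their fanins
        if j not in as_gates and not sigma[i] <= sigma[j]:
            return False  # other gates may share a stage with their fanins
        if j in sa_gates and sigma[j] < sigma[i] + (i not in as_gates):
            return False  # SA gates need one extra stage after a non-AS fanin
    inputs_first = all(sigma[i] // n == 0 for i in inputs)
    outputs_aligned = len({sigma[o] // n for o in outputs}) <= 1
    return inputs_first and outputs_aligned


# Hypothetical netlist: s is AS, m is AA, x is SA.
edges = [("a", "s"), ("b", "m"), ("s", "x"), ("m", "x")]
sigma = {"a": 0, "b": 0, "s": 1, "m": 0, "x": 1}
as_gates, sa_gates = {"s"}, {"x"}

print(compound_is_feasible(sigma, edges, ["a", "b"], ["x"], 2, as_gates, sa_gates))  # True
print(compound_dff_cost(sigma, edges, 2, sa_gates))  # 1
```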

# Multiphase Clocking: Optimization



# Multiphase Clocking: Optimization



- Each DFF location is a binary variable
- Satisfy constraints with fewest variables

- Input
  - compound-gate circuits without DFF
- Output
  - Multiphase netlist
    - phases assigned to each gate
    - inserted DFFs for path balancing
- Comparison with Dual Clocking Method (DCM)\*
- Up to 5x reduction in area

TABLE I  
COMPARISON OF MULTIPHASE CLOCKING WITH DUAL CLOCKING METHOD [10] FOR DIFFERENT THROUGHPUTS

| Benchmark | DCM #DFF (1/7) | DCM #JJ (1/7) | Multiphase #DFF (1/7) | Multiphase #JJ (1/7) | Change #DFF (1/7) | Change #JJ (1/7) |  | DCM #DFF (1/12) | DCM #JJ (1/12) | Multiphase #DFF (1/12) | Multiphase #JJ (1/12) | Change #DFF (1/12) | Change #JJ (1/12) |
|-----------|----------------|-----------|------------|-----------|---------|---------|-----------------|--------|------------|--------|-----------|----------|---------|
| int2float | 117            | 7'770     | 217        | 5'136     | +85.47% | -33.90% |                 | 39     | 5'140      | 46     | 3'939     | +17.95%  | -23.37% |
| priority  | 8'562          | 257'252   | 3'285      | 45'094    | -61.63% | -82.47% |                 | 4'225  | 158'568    | 1'775  | 34'524    | -57.99%  | -78.23% |
| voter     | 7'204          | 447'044   | 2'180      | 162'804   | -69.74% | -63.58% |                 | 3'732  | 355'144    | 1'568  | 158'520   | -57.98%  | -55.36% |
| c432      | 224            | 10'734    | 342        | 5'116     | +52.68% | -52.34% |                 | 118    | 7'124      | 240    | 4'402     | +103.39% | -38.21% |
| c880      | 362            | 14'658    | 254        | 6'190     | -29.83% | -57.77% |                 | 187    | 9'483      | 119    | 5'245     | -36.36%  | -44.69% |
| c1908     | 282            | 13'169    | 125        | 3'529     | -55.67% | -73.20% |                 | 144    | 8'739      | 69     | 3'137     | -52.08%  | -64.10% |
| c3540     | 776            | 43'437    | 589        | 17'016    | -24.10% | -60.83% |                 | 282    | 26'897     | 440    | 15'973    | +56.03%  | -40.61% |
| c1355     | 193            | 8'739     | 46         | 4'515     | -76.17% | -48.34% |                 | 119    | 6'149      | 44     | 4'501     | -63.03%  | -26.80% |
| s13207    | 1'795          | 106'346   | 1'837      | 42'382    | +2.34%  | -60.15% |                 | 571    | 60'766     | 1'082  | 37'097    | +89.49%  | -38.95% |
| s5378     | 645            | 50'766    | 808        | 21'761    | +25.27% | -57.13% |                 | 255    | 34'053     | 368    | 18'681    | +44.31%  | -45.14% |
| s382      | 56             | 4'448     | 89         | 2'411     | +58.93% | -45.80% |                 | 9      | 2'750      | 48     | 2'124     | +433.33% | -22.76% |
| Geomean   | 557.04         | 29'868.20 | 413.48     | 11'965.54 | -25.77% | -59.94% |                 | 227.85 | 19'565.38  | 229.57 | 10'467.10 | +0.76%   | -46.50% |

\*G. Pasandi and M. Pedram, "Depth-Bounded Graph Partitioning Algorithm and Dual Clocking Method for Realization of Superconducting SFQ Circuits," *ACM JETC*, Vol. 17, No. 1, October 2020.
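The "Change" columns are relative differences with respect to DCM, and the "up to 5x" figure corresponds to the largest #JJ ratio in the table. The sketch below recomputes two entries and that ratio from the table values (with the thousands separators removed); using #JJ as the area measure is an assumption, and the function name is hypothetical.

```python
def percent_change(baseline, new):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# int2float at 1/7 throughput: DCM versus multiphase.
print(f"{percent_change(117, 217):+.2f}%")    # +85.47%
print(f"{percent_change(7770, 5136):+.2f}%")  # -33.90%

# priority at 1/7 throughput: largest #JJ reduction in the table.
print(f"{257252 / 45094:.1f}x fewer JJs")     # 5.7x fewer JJs
```

The first two lines show that the DFF count can rise while the total JJ count still falls, as in the int2float row.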



# Summary

- SFQ technology offers orders of magnitude reduction in power and delay
  - ~100x lower power (including refrigeration)
  - ~10x faster speed (tens to hundreds of GHz)
- Technological issues
  - Manufacturing
    - e.g. JJ/Inductor scaling
  - Architectural
    - Gate-level pipelining
    - Limited fanin
- Gate compounding technique maximizes the use of asynchronous gates
  - Lower pipeline depth → smaller path balancing overhead
- Multiphase clocking efficiently trades throughput to reduce area
  - About 60% fewer JJs on average (geomean at 1/7 throughput), and up to 82% fewer, compared with the dual clocking method



Thank you!