

# Ultra-Low Power 18-Transistor Fully-Static Contention-Free Single-Phase Clocked Flip-Flop in 65nm CMOS

Yunpeng Cai, *Student Member, IEEE*, Anand Savanth, *Member, IEEE*, Pranay Prabhat, *Member, IEEE*, James Myers, *Member, IEEE*, Alex S. Weddell, *Member, IEEE*, Tom J. Kazmierski, *Senior Member, IEEE*

**Abstract**—Flip-flops are essential building blocks of sequential digital circuits, but typically occupy a substantial proportion of chip area and consume significant amounts of power. This work proposes 18TSPC, a new topology of fully-static contention-free Single-Phase Clocked (SPC) Flip-Flop (FF) with only 18 transistors, the lowest number reported for this type. Implemented in 65nm CMOS, it achieves a 20% cell area reduction compared to the conventional Transmission Gate FF (TGFF). Simulation results show the proposed 18TSPC is about 2 times more efficient than TGFF in the Energy-Delay space. To demonstrate EDA compatibility and circuit/system-level benefits, a shift register and an AES-128 encryption engine have been implemented. Chip experimental measurements at 0.6V, 25°C show that, compared to TGFF, the proposed 18TSPC achieves reductions of 68% and 73% in overall and clock dynamic power, respectively, and 27% lower leakage.

**Index Terms**—ultra-low power, single-phase clocked, flip-flop

## I. INTRODUCTION

THE rapid growth in deployment of Internet of Things (IoT) devices [1] means that processors are now becoming pervasive. IoT finds applications in various areas including healthcare, smart environments, and transportation [2]. However, along with the widespread deployment of these devices there comes a natural desire to reduce their energy/power demands: this can extend device active times, or mean that their batteries can be made smaller (reducing their cost and size). There is also a need to reduce the cost of device production, and minimizing the silicon area occupied by processors is a key consideration.

Scaling down the supply voltage brings power reduction benefits. Sub-threshold techniques adopt aggressive supply voltage scaling, below the threshold voltage, but have significant impact on variability and performance [3]. In contrast, Near-Threshold Voltage (NTV) techniques [4] allow the supply voltage to be brought close to the threshold voltage (but not below it), with a reduced impact on variability and performance characteristics, making this regime of operation more suitable for industry adoption. However, the variability-related issues are still significant compared to conventional super-threshold techniques, calling for careful circuit design [5]. Flip-flops (FFs), as the key component in sequential logic circuits, have a great impact on the performance, robustness, size and total power consumption of systems [6][7]. Motivated by this, recent research in FFs has focused on developing low-power and reduced-variability circuits, especially in the low voltage operation region [8][9].

Yunpeng Cai, Anand Savanth, Alex S. Weddell and Tom J. Kazmierski are with Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, U.K.

Pranay Prabhat and James Myers are with ARM Ltd., Cambridge, CB1 9NJ, U.K.  
E-mail: yc4g13@soton.ac.uk

Fig. 1. Conventional Transmission Gate FF (TGFF) [12]

Robustness is a primary factor in the design of ultra-low power FFs for low voltage operation. Conventionally, dynamic logic is introduced to achieve better speed [6]. However, dynamic logic is vulnerable to process variation, making dynamic circuits less robust at NTV [9]; *fully-static* FF operation is therefore desirable for ultra-low power designs. Additionally, *single-phase clocked* (SPC) operation maximizes power efficiency in the NTV region, since the inverter chain (which provides the complemented clock signal) can be eliminated [10]. Contention paths also need to be eliminated, since the contention between the pull-down network and the keepers results in increased power consumption [11]. Moreover, any ratioed logic is vulnerable to process variation, which may be exacerbated at NTV levels [9]. Thus, ultra-low power FFs should be *contention-free*, avoiding data contention paths. Finally, reduced area helps reduce silicon real-estate costs.

By analyzing the properties of the widely-used Transmission Gate FF (TGFF) [12][17] (Fig. 1) and other state-of-the-art (SoA) ultra-low power FF designs including the Topologically-Compressed FF (TCFF) [8] (Section II), it was found that SoA ultra-low power FFs do not meet all the above requirements, and their claimed benefits can reduce significantly as yield, EDA and system level issues are addressed [15].

This paper proposes 18TSPC, an SPC FF with only 18 transistors (the lowest reported for a fully-static contention-free SPC FF) with a novel master-slave topology (Section III). With a simplified topology, it delivers a 20% reduction in cell area compared to TGFF. Unlike SoA designs, 18TSPC meets all ultra-low power FF design requirements. It has been implemented in 65nm CMOS, along with a TGFF, in a 320-bit shift register and an AES-128 encryption engine. This proves EDA compatibility and demonstrates circuit and system-level benefits. The design was first simulated (Section IV) then experimentally validated (Section V) at 0.6V, 25°C, at various data activity rates ( $\alpha$ ), showing that the proposed 18TSPC achieves reductions of 68% and 73% in overall ( $P_{\alpha=10\%}$ ) and clock dynamic power ( $P_{\alpha=0\%}$ ), respectively, and 27% lower leakage compared to TGFF. Furthermore, unlike TCFF, the measurements indicate superior 18TSPC performance at NTV.

## II. REVIEW OF STATE-OF-THE-ART SPC FFs

The Static Single-Phase Contention-Free FF (S2CFF) [9] (Fig. 2a) is based on the dynamic True Single-Phase Clock (TSPC) FF [18], with an additional conventional slave latch, and uses 24 transistors (equal to the conventional TGFF). It is a fully-static circuit without contention issues, which suggests suitability for NTV operation. The topology improves power efficiency at all  $\alpha$  compared to TGFF; however, although it has the same transistor count, its complex topology results in layout area overheads. Additionally, it has 5 transistors connected to the clock (highlighted in red in Fig. 2a), leading to higher clock tree capacitances and associated power overheads.

In the Cross Charge-Control FF (XCFF) (Fig. 2b) [13] and Adaptive-Coupling FF (ACFF) (Fig. 2c) [14], dynamic logic nodes and contention paths are introduced in the design to improve speed. This can, however, degrade robustness when the supply voltage is decreased. Furthermore, the contention current results in extra power consumption during data transitions. In XCFF, the dynamic nodes are indicated as  $X_1$  and  $X_2$ . Contending devices and nodes in XCFF and ACFF are the highlighted inverters. Although the contention issue in ACFF can be mitigated by carefully modifying the width ratio of transistors in the slave latch, or by adding devices, this results in area and power overheads.

The Topologically-Compressed FF (TCFF) (Fig. 2d) [8] uses 21 transistors (fewer than the conventional TGFF). Its fully compressed topology improves power efficiency for all  $\alpha$  compared to TGFF. However, a design limitation can be observed. For correct operation, in the case when  $D$  is rising at  $CK = 0$ , data 0 is expected to be latched at node  $n_1$  if  $n_2$  is pulled up to  $vd_2$  (turns on  $M03$ ). For this,  $vd_2$  should be at supply voltage (VDD), otherwise  $n_2$  can be weakened which leads to high setup time and latch failure. However, in practice a voltage drop is observed at  $vd_1$  and  $vd_2$  in this condition. Owing to the latency of  $M19$  turning off, a temporary short-circuit path exists, weakening  $vd_1$  from VDD via the path  $M11 \rightarrow M12 \rightarrow M15 \rightarrow M19 \rightarrow GND$ . Since  $M18$  is on,  $vd_2$  is lower than VDD. The  $M05$  pull-up effort is weakened since  $vd_2 < VDD$ . Note that  $M15$  will not be off, since  $n_3$  will not be pulled down to zero until  $n_2$  crosses the mid-rail of VDD.  $n_2$  in this scenario can be slowly rising, or



Fig. 4. Design development of proposed 18TSPC from multiplexer scheme with topology compression.

floating at mid-rail, due to the degraded  $vd_2$ . This analysis is supported by the SPICE simulation results (Fig. 3) at both supply voltages ( $VDD = 1.2V$  and  $0.6V$ ). Also, the voltage drop issue cannot efficiently be resolved just by resizing, as the Monte-Carlo simulation of [15] still shows a high setup time and very low yield (approx. 5%) owing to this limitation.

Recently, a True-Single-Phase-Clock FF with 18 transistors was proposed [16], shown in Fig. 2e. A dynamic node (N1) and contention paths (pull-up network M15, M16 contend with pull-down network M11, M12; pull-up transistor M10 contends with M17, M18) exist in the design. The FF was implemented in 28nm FDSOI and achieved a 40% improvement in energy/cycle at 0.4V compared with a conventional MSFF. However, a non-complementary topology is used in its slave latch, i.e. an NMOS (M16) is used for pull-up, which can lead to voltage degradation at internal node N3. To mitigate the voltage drop issue, a poly-bias technique is applied to the highlighted transistors. To enable ultra-low voltage operation, a back-bias voltage is applied to lower the threshold voltage of the design, requiring extra design effort. In addition, the output buffer is eliminated, which makes the circuit vulnerable to noise at output port Q [19] and reduces its fanout. To improve the robustness and increase the fanout of the FF, an output inverter needs to be inserted, which would increase the total transistor count to 20.

## III. PROPOSED SINGLE-PHASE CLOCKED FF

#### A. SPC FF design approach

The aim of the design is to carry forward the enhancements achieved by previously-reported FFs in terms of cell area, power consumption and performance, but to overcome the limitations of these designs. To do this, the initial step is to evaluate the Boolean function of a positive-edge triggered Master-Slave FF (MSFF):

$$D_{ML}^{present} = \overline{CK} \cdot D + CK \cdot D_{ML}^{previous} \quad (1)$$

$$D_{SL}^{present} = \overline{CK} \cdot D_{SL}^{previous} + CK \cdot D_{ML}^{present} \quad (2)$$



Fig. 5. (a) Net states at  $Y1$  and  $Y2$  at different  $D$ ,  $D_{SL}^{previous}$  and  $CK$  states. (b) The proposed 18-Transistors Single-Phase Clocked FF (18TSPC).

In Equation 1,  $D$  is the data input,  $D_{ML}^{present}$  is the present data in the master latch, and  $D_{ML}^{previous}$  is the data latched from  $D$  during the previous low phase of  $CK$ . In Equation 2,  $D_{SL}^{present}$  is the present data in the slave latch, and  $D_{SL}^{previous}$  is the data latched into the slave latch from the output of the master latch during the previous high phase of  $CK$ .
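As an illustration of Equations 1 and 2, the following behavioral sketch evaluates the two latch equations on sampled waveforms. It is not a transistor-level model: the function names, the sampled-waveform representation and the initial latch states (both 0) are assumptions made for this example only.

```python
def master_latch(ck, d, ml_prev):
    """Eq. (1): transparent to D while CK is low, holds while CK is high."""
    return ((1 - ck) & d) | (ck & ml_prev)


def slave_latch(ck, ml_now, sl_prev):
    """Eq. (2): holds while CK is low, follows the master while CK is high."""
    return ((1 - ck) & sl_prev) | (ck & ml_now)


def simulate(ck_wave, d_wave, ml=0, sl=0):
    """Apply Eq. (1) then Eq. (2) at every sample and record the slave state."""
    q = []
    for ck, d in zip(ck_wave, d_wave):
        ml = master_latch(ck, d, ml)
        sl = slave_latch(ck, ml, sl)
        q.append(sl)
    return q


# Hypothetical stimulus: D changes while CK is high and while CK is low.
ck_wave = [0, 1, 1, 0, 0, 1, 1, 0]
d_wave = [1, 1, 0, 0, 0, 0, 1, 1]
q_wave = simulate(ck_wave, d_wave)
print(q_wave)
```

In this stimulus the recorded output only changes at the samples where $CK$ goes from 0 to 1, and it takes the value that $D$ held during the preceding low phase, which is the positive-edge-triggered behavior the two equations describe.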

Based on these equations, MSFF can be abstracted using two multiplexers [12], shown in Fig. 4(a). However, the original MUX2-based FF requires inverters to apply a complemented clock signal. To eliminate the internal clock inverters for the select ( $CK$ ) pin, a combination of a compound OR-AND-INVERTER (OAI21) gate and a NAND2 gate topology is adopted as the MUX2 circuit (Fig. 4b). By adopting the OAI21-based MUX2, the MSFF (Fig. 4a) can be constructed in a reduced gate level topology (Fig. 4c).
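To make the idea of a clock-inverter-free MUX2 concrete, the sketch below checks one possible OAI21 plus NAND2 decomposition against the multiplexer function of Equation 1 by exhaustive enumeration. The specific wiring (complemented data on one OR input of the OAI21, the NAND2 output on its AND input) is an assumption chosen for illustration and is not necessarily the arrangement drawn in Fig. 4b; the helper names are likewise hypothetical.

```python
from itertools import product


def nand2(a, b):
    """Two-input NAND."""
    return 1 - (a & b)


def oai21(a, b, c):
    """OR-AND-INVERT: NOT((a OR b) AND c)."""
    return 1 - ((a | b) & c)


def mux_reference(ck, d, prev):
    """MUX2 of Eq. (1): selects d when ck = 0 and prev when ck = 1."""
    return ((1 - ck) & d) | (ck & prev)


def mux_oai21_nand2(ck, d_n, prev):
    """Assumed decomposition: only the true clock is used as a gate input.

    d_n is the complemented data input (an assumption of this sketch).
    """
    return oai21(ck, d_n, nand2(ck, prev))


# Exhaustive check over all eight input combinations.
all_match = all(
    mux_oai21_nand2(ck, 1 - d, prev) == mux_reference(ck, d, prev)
    for ck, d, prev in product((0, 1), repeat=3)
)
print(all_match)
```

The point of the check is that the select input appears only in its true form, so no complemented clock has to be generated; the inversion burden moves to the data path instead.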

#### B. SPC FF circuitry reduction

It can be observed from the table in Fig. 4d that  $F1$  and  $F3$  are logically equivalent in all scenarios. This implies that NAND gate  $N3$  in the slave latch (Fig. 4c) is redundant, so  $N3$  can be merged with gate  $N1$  (Fig. 4e). In the schematic-level design, gates  $R1-N2$  and  $R2-N4$  are combined as compound OAI21 gates. Removing the redundant NAND gate saves four transistors. The reduced topology results in a 20-transistor FF, with six transistors connected to  $CK$  (Fig. 4f).

Fig. 6. (a) 18TSPC operation diagram at different  $CK$  and  $D$  states, highlighting the active devices, logic high nets and logic low nets. (b) Worst-case hold time path analysis of the 18TSPC.

To further reduce the number of clocked transistors, a transistor merging process is applied to the 20-transistor SPC FF (Fig. 4f). When  $CK$  is low, the clock-connected PMOS transistors  $M1$  and  $M3$  are turned on, and nodes  $X1$  and  $X2$  are pulled up to VDD. Otherwise,  $X1$  and  $X2$  are floating. Hence  $M1$  and  $M3$  can be merged. Further, when  $CK = 1$ , NMOS transistors  $M2$  and  $M4$  are on, and nodes  $Y1$  and  $Y2$  are pulled down to 0. When  $CK = 0$ ,  $M2$  and  $M4$  are turned off, and the voltage levels at nodes  $Y1$  and  $Y2$  depend on the signals  $D$  and  $F2$ , respectively (see Fig. 5a). This shows that  $M2$  and  $M4$  can be replaced with a single clocked NMOS (connected between  $Y1$  and  $Y2$ ), working as a pass transistor. When  $CK = 0$ ,  $Y1$  and  $Y2$  are isolated since the clocked NMOS is off. For  $CK = 1$ , the states of  $Y1$  and  $Y2$  are the same ( $Y1 = Y2 = 0$ ). This transistor merging results in the proposed 18-transistor SPC FF (18TSPC), shown in Fig. 5b.

#### C. 18TSPC operation and timing path analysis

Fig. 6a shows the operation of the 18TSPC at different  $CK$  and  $D$  states. No contention paths or dynamic nodes are observed in any of the scenarios. When  $CK = 0$ , devices on  $D$  only change the state of  $L1$  in the master latch. Since the slave latch remains isolated from  $D$  for  $CK = 0$ , the switching on  $L1$  does not induce any data corruption in the slave latch. When  $CK = 1$ ,  $D$  is isolated, and the FF outputs the data previously latched at  $L1$  in the master latch.

The setup time of the 18TSPC is determined by the propagation delay from  $D$  to  $F1$ . The hold time is determined by the speed of  $L2$  settling to its final value after the rising edge of  $CK$ . As shown in Fig. 6b, the worst-case hold time scenario is experienced when  $D$  falls too close to the rising edge of  $CK$ . If  $M6$  is turned off by  $D$  before net  $L2$  is fully discharged, a hold violation may be observed. The highlighted path in Fig. 6b is the critical hold time path of the design, and due to the proposed topology, the hold time is positive. The SPICE simulation waveform illustrates both correct operation and the hold-violation scenarios.

Fig. 7. Layout of the S2CFF, TGFF and the proposed 18TSPC.

## IV. SIMULATION RESULTS AND ANALYSIS

To evaluate the proposed design, 18TSPC, S2CFF and TGFF have been laid out and characterized in TSMC 65nm CMOS technology. For fair comparison, the transistor sizes of each FF were tuned to achieve the minimum energy ( $E_0$ ) point of the Energy-Efficient Curve (EEC), which is considered as the minimum size for correct functionality [20]. Post-layout Monte-Carlo simulations (10k runs) were performed for each FF, to evaluate functionality at different PVT corners. For EDA synthesis and further place-and-route (P&R) considerations, only the M1 metal layer is adopted in the proposed FF layouts. Fig. 7 shows the layouts of S2CFF, TGFF and the proposed

TABLE I  
DYNAMIC POWER, ENERGY/CYCLE AND ENERGY-DELAY PRODUCT (ED) COMPARISON

| PVT Corner | FFs | CK pin power (nW), D=0, CK rise | CK pin power (nW), D=0, CK fall | CK pin power (nW), D=1, CK rise | CK pin power (nW), D=1, CK fall | D pin power (nW), CK=0, D rise | D pin power (nW), CK=0, D fall | D pin power (nW), CK=1, D rise | D pin power (nW), CK=1, D fall | Q pin power (nW), rise | Q pin power (nW), fall | Energy at $\alpha = 100\%$ (fJ/cycle) | Norm. ED (i.e. Min E) |
|----------------|--------|------|------|------|------|------|------|------|------|------|------|-------|-------|
| TT/0.6V/25°C*  | TGFF   | 1.33 | 1.68 | 1.48 | 1.52 | 0.93 | 1.64 | 0.06 | 0.62 | 1.86 | 1.93 | 6.64  | 46.81 |
|                | S2CFF  | 0.10 | 1.63 | ~0.0 | 0.50 | 1.24 | 1.49 | ~0.0 | 0.16 | 1.79 | 1.88 | 4.01  | 40.84 |
|                | 18TSPC | 0.28 | 0.80 | ~0.0 | 0.32 | 0.37 | 0.73 | ~0.0 | 0.17 | 1.73 | 1.42 | 2.99  | 21.99 |
| TT/1.2V/25°C** | TGFF   | 5.51 | 7.04 | 6.03 | 6.56 | 3.82 | 7.22 | ~0.0 | 2.78 | 7.97 | 8.74 | 27.10 | 43.97 |
|                | S2CFF  | 7.37 | 0.22 | ~0.0 | 2.36 | 5.36 | 6.46 | ~0.0 | 0.82 | 7.75 | 8.15 | 17.91 | 40.71 |
|                | 18TSPC | 0.93 | 3.53 | ~0.0 | 1.50 | 1.53 | 3.39 | ~0.0 | 0.61 | 7.42 | 6.13 | 11.80 | 23.99 |

\*CK transition time: 1.4038 ns, D transition time: 2.7556 ns, Q load capacitance: 0.0202 pF, CK Frequency: 6.66 MHz

\*\*CK transition time: 0.0894 ns, D transition time: 0.3255 ns, Q load capacitance: 0.0279 pF, CK Frequency: 6.66 MHz

ED = (Energy/cycle)·(D-to-Q delay). D-to-Q delay= setup time + CK-to-Q delay

Norm. ED: Normalized by an unloaded minimum inverter ( $ED_{invo} = 0.58 \text{ fJ} \cdot \text{ns}$  at 0.6V,  $0.21 \text{ fJ} \cdot \text{ns}$  at 1.2V)



Fig. 8. Normalized Energy/cycle with  $\alpha = 100\%$  at nominal supply voltage (1.2V for 65nm CMOS, 1.0V for 45nm FDSOI [9]) and NTV (0.6V for 65nm CMOS, 0.4V for 45nm FDSOI [9]).

18TSPC, which shows a 20% and 29% reduction in cell area over TGFF and S2CFF, respectively. Owing to its reduced circuitry and lower transistor count, 18TSPC achieves the lowest leakage power (104 pW at TT/1.2V/25°C) of the three FF cells, 27% less than TGFF and 32% less than S2CFF. In 18TSPC, the clocked transistor count is 4, one fewer than in S2CFF. Hence, the clock pin capacitance of 18TSPC (2.16 fF) is 37% less than that of S2CFF. Since only two transistors are directly connected to its CK pin, TGFF achieves the lowest clock pin capacitance (1.09 fF) of the three FFs. However, more transistors (12 in total) are clock signal related in TGFF, which leads to higher dynamic power. To reduce the area and CK network power for the conventional MSFF, one single clock inverter chain can be shared with multiple FFs, i.e. the Multi-Bit FF (MBFF) topology [21]. Benefiting from the reduced topology, a multi-bit 18TSPC still has lower area compared to TGFF-based multi-bit FFs. The 18TSPC-based design shows an 11% area reduction versus a 2-bit TGFF-based MBFF. Compared with a 4-bit design, the area saving is 5%. A multi-bit 18TSPC also shows superior power efficiency compared to TGFF-based MBFF cells. At  $\alpha = 0\%$ , a 2-bit 18TSPC-based MBFF achieves a 66% power saving and a 4-bit 18TSPC-based design achieves a 60% power saving compared to the TGFF-based MBFFs. At  $\alpha = 100\%$ , the same designs show a 56% and 54% power saving, respectively.

Table I shows the dynamic power and energy (per cycle) of each FF at different  $D$  and  $CK$  switching scenarios at TT/0.6V/25°C and TT/1.2V/25°C. The power data is the mean value collected from the power lookup table in generated Liberty files (.lib). In TGFF, the  $CK$  pin power is evenly distributed across the scenarios.

The dynamic power in SPC FFs is activity-dependent. In S2CFF, more  $CK$  power is consumed when  $D = 0$  &  $CK$  rising and  $D = 1$  &  $CK$  falling. In 18TSPC, higher  $CK$  power is reported for  $D = 0$  &  $CK$  falling and  $D = 1$  &  $CK$  falling. How unevenly the  $CK$  power is distributed across transition scenarios depends strongly on the topology of the SPC FF. Overall, the proposed 18TSPC achieves lower dynamic power at different  $D$  and  $CK$  switching scenarios, and achieves the lowest energy (2.99 fJ/cycle at TT/0.6V/25°C and 11.8 fJ/cycle at TT/1.2V/25°C) among the three FFs. The normalized results shown in Fig. 8 highlight a 55% energy reduction versus TGFF at TT/0.6V/25°C and a 56% energy saving against TGFF at TT/1.2V/25°C. Since the FFs are implemented to achieve  $E_0$ , the Energy-Delay (ED) product can be considered as the  $MinE$  point on the EEC [20]. The  $MinE$  of the proposed 18TSPC is about  $1.8\times$  and  $1.7\times$  better than TGFF and S2CFF, respectively, in the ED space at 1.2V. At 0.6V, the 18TSPC is about  $2.1\times$  more efficient than TGFF and  $1.9\times$  better than S2CFF in the ED space. The 18TSPC is thus more energy-efficient in the ED space at both nominal voltage and NTV operation.
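The energy and ED comparisons quoted above follow directly from the last two columns of Table I. A minimal sketch of that arithmetic is given below; the dictionary layout and function names are choices made for this example, while the numbers are transcribed from Table I.

```python
# (energy at alpha = 100% in fJ/cycle, normalized ED) from Table I.
table_i = {
    "0.6V": {"TGFF": (6.64, 46.81), "S2CFF": (4.01, 40.84), "18TSPC": (2.99, 21.99)},
    "1.2V": {"TGFF": (27.10, 43.97), "S2CFF": (17.91, 40.71), "18TSPC": (11.80, 23.99)},
}


def energy_saving_pct(corner, reference):
    """Percentage energy/cycle reduction of 18TSPC relative to `reference`."""
    e_ref = table_i[corner][reference][0]
    e_new = table_i[corner]["18TSPC"][0]
    return 100.0 * (1.0 - e_new / e_ref)


def ed_gain(corner, reference):
    """Ratio of normalized ED of `reference` to that of 18TSPC."""
    ed_ref = table_i[corner][reference][1]
    ed_new = table_i[corner]["18TSPC"][1]
    return ed_ref / ed_new


for corner in ("0.6V", "1.2V"):
    print(corner,
          f"energy saving vs TGFF: {energy_saving_pct(corner, 'TGFF'):.1f}%",
          f"ED gain vs TGFF: {ed_gain(corner, 'TGFF'):.2f}x",
          f"ED gain vs S2CFF: {ed_gain(corner, 'S2CFF'):.2f}x")
```

Rounded to the precision used in the text, these ratios reproduce the 55% and 56% energy savings and the  $2.1\times$ ,  $1.9\times$ ,  $1.8\times$  and  $1.7\times$  ED figures.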

Fig. 9 shows the D-to-Q delay simulation results for the three FFs at SS/0.6V/25°C. No functional failure was observed over 10K simulations. The proposed 18TSPC has a lower mean ( $\mu$ ) value in D-to-Q delay distribution than S2CFF (35% lower). The result shows the proposed design has a higher  $\mu$  than TGFF (19% higher), considered as the performance penalty. The  $\mu + 3\sigma$  value of the 18TSPC D-to-Q delay over 10K simulations is 14.78 ns, 34% lower than S2CFF, and 17% higher than TGFF.

To evaluate EDA compatibility and system level characteristics, all three FFs were used to implement AES-128 macros using industry-standard EDA tools. Fig. 10 shows the floorplan for each design. The clock trees are highlighted, illustrating similar complexity of each design. In the AES-128 macro, FFs contribute 4% of standard cells and all variants were synthesized for identical area and timing constraints, as highlighted in Table II.

Fig. 9. 10K Monte-Carlo simulation results of D-to-Q Delay.

Fig. 10. AES 128 floorplan of (a) 18TSPC, (b) TGFF and (c) S2CFF, the clock tree is highlighted.

Owing to the positive hold time characteristic of the 18TSPC, more hold buffers are inserted into the 18TSPC-based AES implementation. Because of this, the 18TSPC based AES-128 macro consumes higher combinational power ( $P_{comb.}$ ), 2% and 0.8% higher than the TGFF and S2CFF-based designs, respectively. Due to the better power efficiency of the proposed design, the register and clock network power ( $P_{REG}+P_{CK\_net}$ ) of 18TSPC-based design is 37% lower than the TGFF-based macro and 36% lower than the S2CFF-based macro with clock gating applied. However, owing to the limited contribution of FFs in AES-128, the overall dynamic power ( $P_{total}$ ) is merely 2.3% lower than the TGFF-based implementation (vector based simulation). A small negative slack in hold time is observed in the synthesis result, verified through static timing analysis after full RC extraction.
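The percentages in this paragraph can be cross-checked against the power rows of Table II. The sketch below does so under the assumption that the sequential power is simply  $P_{REG}+P_{CK\_net}$  as stated in the text; the dictionary keys and helper names are introduced here for illustration only.

```python
# Power values transcribed from Table II (all in microwatts).
table_ii = {
    "18TSPC": {"p_reg": 2.76, "p_ck_net": 1.62, "p_comb": 57.4, "p_total": 61.8},
    "TGFF": {"p_reg": 5.65, "p_ck_net": 1.38, "p_comb": 56.3, "p_total": 63.3},
    "S2CFF": {"p_reg": 5.15, "p_ck_net": 1.68, "p_comb": 56.97, "p_total": 63.8},
}


def sequential_power(ff):
    """Register plus clock-network power for one AES-128 variant."""
    row = table_ii[ff]
    return row["p_reg"] + row["p_ck_net"]


def percent_lower(value, reference):
    """How much lower `value` is than `reference`, in percent."""
    return 100.0 * (1.0 - value / reference)


seq_vs_tgff = percent_lower(sequential_power("18TSPC"), sequential_power("TGFF"))
seq_vs_s2cff = percent_lower(sequential_power("18TSPC"), sequential_power("S2CFF"))
total_vs_tgff = percent_lower(table_ii["18TSPC"]["p_total"], table_ii["TGFF"]["p_total"])
# Combinational power is higher for 18TSPC, so report the increase instead.
comb_increase_vs_tgff = -percent_lower(table_ii["18TSPC"]["p_comb"], table_ii["TGFF"]["p_comb"])

print(f"sequential power vs TGFF:  {seq_vs_tgff:.1f}% lower")
print(f"sequential power vs S2CFF: {seq_vs_s2cff:.1f}% lower")
print(f"total power vs TGFF:       {total_vs_tgff:.2f}% lower")
print(f"comb. power vs TGFF:       {comb_increase_vs_tgff:.2f}% higher")
```

The computed values agree with the quoted 37%, 36%, 2.3% and 2% to within rounding, which also shows why a large saving in sequential power translates into only a small saving in total power when combinational logic dominates.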

In modern SoC design, FFs are implemented with scan paths for testability. A MUX2 gate is added to the proposed 18TSPC at the cell level, and the resulting cell is named S\_18TSPC. The S\_18TSPC and the standard scan FF (S\_TGFF) were also used to implement the AES-128 macro with the same setup as in Table II, as highlighted in Table III. The S\_18TSPC still has a 4%

TABLE II  
AES-128 SYNTHESIS RESULTS COMPARISON

| Metric           | Unit    | 18TSPC | TGFF | S2CFF |
|------------------|---------|--------|------|-------|
| **CK Buffers**   | #       | 17     | 11   | 15    |
| **Hold Buffers** | #       | 72     | 0    | 0     |
| **NO. FFs**      | #       | 385    | 385  | 385   |
| $P_{REG}$        | $\mu W$ | 2.76   | 5.65 | 5.15  |
| $P_{CK\_net}$    | $\mu W$ | 1.62   | 1.38 | 1.68  |
| $P_{comb.}$      | $\mu W$ | 57.4   | 56.3 | 56.97 |
| $P_{total}$      | $\mu W$ | 61.8   | 63.3 | 63.8  |
| **WNS_SETUP**    | ps      | 0      | 0    | 0     |
| **WNS_HOLD**     | ps      | 9      | 0    | 0     |

Die Area :  $200 \mu m \times 299.6 \mu m$ , Target CK Frequency: 20 MHz, Clock Uncertainty: 30 ps, Clock-Gating applied  
Process Corners: 1.2V/TT/25°C, 1.08V/SS/125°C, 1.32V/FF/-40°C  
WNS\_HOLD: Worst Negative Hold Slack  
WNS\_SETUP: Worst Negative Setup Slack

TABLE III  
SCAN FFs IMPLEMENTATION RESULTS

| Level                         | Metric           | Unit      | S_18TSPC | S_TGFF |
|-------------------------------|------------------|-----------|----------|--------|
| **Cell Level (TT/1.2V/25°C)** | **+Transistors** | #         | 10       | 8      |
| **Cell Level (TT/1.2V/25°C)** | **Area**         | norm.     | 0.96     | 1      |
| **Cell Level (TT/1.2V/25°C)** | **Hold**         | ps        | 3        | -6     |
| **Cell Level (TT/1.2V/25°C)** | **Setup**        | ps        | 170      | 91     |
| **AES-128**                   | **Area**         | $\mu m^2$ | 36710    | 36514  |
| **AES-128**                   | **NO. FFs**      | #         | 385      | 385    |
| **AES-128**                   | $P_{REG}$        | $\mu W$   | 4.84     | 5.82   |
| **AES-128**                   | $P_{CK\_net}$    | $\mu W$   | 1.56     | 1.35   |
| **AES-128**                   | $P_{comb.}$      | $\mu W$   | 56.3     | 57.5   |
| **AES-128**                   | $P_{total}$      | $\mu W$   | 62.7     | 64.6   |
| **AES-128**                   | **WNS_SETUP**    | ps        | 0        | 0      |
| **AES-128**                   | **WNS_HOLD**     | ps        | 0        | 0      |

Die Area :  $200 \mu m \times 299.6 \mu m$ , Target CK Frequency: 20 MHz, Clock Uncertainty: 30 ps, Clock-Gating applied  
Process Corners: 1.2V/TT/25°C, 1.08V/SS/125°C, 1.32V/FF/-40°C  
WNS\_HOLD: Worst Negative Hold Slack  
WNS\_SETUP: Worst Negative Setup Slack

area saving at the cell level. Owing to the MUX gate added in the data-input path of the 18TSPC, the data path delay is increased, inducing a higher setup time compared to the S\_TGFF. However, the S\_18TSPC has a lower hold time than the original 18TSPC, since the added MUX2 increases the data path delay.

In the AES-128 macro, the S\_18TSPC-based design has slightly higher area overhead (0.5%) than the S\_TGFF-based implementation due to the higher number of inserted clock buffers in the clock tree. Accordingly, the  $P_{CK\_net}$  of the S\_18TSPC-based macro is 15% higher than that of the S\_TGFF design. Note that the dominant contributor to sequential power consumption ( $P_{REG}+P_{CK\_net}$ ) is  $P_{REG}$ , which accounts for up to 80% of the total sequential power. As shown in Table III, the  $P_{REG}$  of the S\_18TSPC-based design consumes 23% less power than the S\_TGFF design. Hence, the total sequential power consumption of the S\_18TSPC-based design is still 15% lower than the S\_TGFF-based macro with clock gating



Fig. 11. Block Diagram of the 320-bit Shift-Reg.



Fig. 12. (a) Die micrograph: two blocks are built in the test chip, the AES-128 and the Shift Register (Shift-Reg). (b) Test Board.

applied. Since the hold time of S\_18TSPC is lower than 18TSPC, no hold time violation is observed. The  $P_{total}$  of the S\_18TSPC-based design is 3% lower than the S\_TGFF based implementation (vector based simulation).

According to the simulation results and analysis, 18TSPC shows advantages in power characteristics and cell area, and its EDA compatibility has been proved.

## V. EXPERIMENTAL VALIDATION

To validate the proposed design, the 18TSPC-based AES-128 macro, targeting nominal voltage operation ( $VDD = 1.2V$ ), was included in a test chip. As discussed in Section IV, the proportion of FF cells in the AES-128 block is limited (4%), so it can be difficult to show the power benefit of the proposed design. Therefore, to quantify the benefits of the FF in isolation, two 320-bit shift registers (18TSPC and TGFF-based) with synthesized clock trees were also implemented for nominal voltage operation ( $VDD = 1.2V$ ), with no hold buffers required between FF stages. Referring to the S2CFF and TGFF ED product (Table I), TGFF was chosen as the reference design for comparison. The block diagram is shown in Fig. 11, and the fabricated test chip is shown in Fig. 12a. A test board based on a 32-bit Arm Cortex-M0 [22] microcontroller is shown in Fig. 12b; it provides the state monitor, power supply connections and USB interfaces for function monitoring, power measurement and communication with the host computers, respectively.

Fig. 11. Block Diagram of the 320-bit Shift-Reg.

Fig. 12. (a) Die micrograph: two blocks are built in the test chip, the AES-128 and the Shift Register (Shift-Reg). (b) Test Board.

Fig. 13. Measured power of 320-bit Shift-Reg against  $\alpha$  at (a) 1.2V (with  $0^\circ C$ ,  $25^\circ C$  and  $85^\circ C$ ) (b) 0.6V with  $25^\circ C$

Fig. 14. Measured total power of 320-bit Shift-Reg with (a)  $\alpha = 100\%$  (b)  $\alpha = 0\%$  with fixed clock frequency ( $F_{Board\_MAX} = 66$  MHz) at different supply voltage.

Fig. 15. Measured results of the 18TSPC AES-128 block (Typical Die).

Fig. 13a shows the measured normalized power vs  $\alpha$  at 1.2V with the maximum clock frequency of the board ( $F_{Board\_MAX} = 66$  MHz). At  $\alpha = 0\%$ , the total power is reduced by 68.5%. The average  $\alpha$  of FFs in systems is typically 5% to 15% [14]. Measurement results show a 62.5% power saving at  $\alpha = 10\%$ . The benefits are retained at 0°C and 85°C. Fig. 13b shows the measured power vs  $\alpha$  at 0.6V and 25°C. The measured results indicate that at  $\alpha = 0\%$  the total power saving increases to 73%, and at  $\alpha = 10\%$  the power saving increases to 68%.

Fig. 14a shows the measured power with  $\alpha = 100\%$  at different VDD. The clock frequency is set to 66 MHz ( $F_{Board\_MAX}$ ). When VDD drops below 0.85V, the 18TSPC-based Shift-Reg ceases to work at 66 MHz. Due to the performance penalty (Fig. 9), 18TSPC needs to work at a lower frequency when  $VDD < 0.85V$ . For TGFF, with its better D-to-Q delay characteristic, functionality was maintained with a 66 MHz clock frequency down to 0.65V. From the result, it can be seen that the proposed 18TSPC saves 39% power at 1.2V and the power benefit is maintained as VDD is decreased until the TGFF register fails at 0.65V. The power benefit with  $\alpha = 0\%$  ( $P_{\alpha=0\%}$ ) (CK pin dynamic power dominant) is shown in Fig. 14b. At 1.2V, the  $P_{\alpha=0\%}$  of 18TSPC is 68.7% less than the reference TGFF. At the minimum VDD of 18TSPC for 66 MHz clock frequency operation (VDD = 0.85V), the  $P_{\alpha=0\%}$  saving increases to 69.4%. Although Fig. 14a shows that total power is equivalent for both designs at their minimum operating voltage (TGFF = 0.65V and 18TSPC = 0.85V) with fixed frequency ( $F_{Board\_MAX}$ ), note that the  $P_{\alpha=0\%}$  of 18TSPC at 0.85V is still 54.3% less than TGFF at 0.65V.

Logic Built-In Self-Test (LBIST) is applied to the 18TSPC-based AES-128 for functional test, active power and maximum frequency measurements. Fig. 15 shows the total power of the AES-128 macro at different supply voltages with the respective maximum clock frequency. Leakage power is also measured at various supply voltages. Although the AES-128 is functionally correct at 0.6V, only the results with acceptable clock frequencies ( $F_{CK} > 0.1$  MHz) are shown. Under this criterion, the minimum operating voltage of the 18TSPC AES-128 macro is 0.7V, with a maximum  $F_{CK}$  of 0.81 MHz. The leakage power at 0.7V is 62 nW. For 1.2V operation, the test macro shows a maximum  $F_{CK}$  of 56 MHz with 2.3 mW active power consumption, and the leakage power at 1.2V is 390 nW.
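For readers who prefer an energy view of the 1.2V operating point, the reported active power and maximum clock frequency can be combined into an energy per clock cycle. This derived figure is not reported in the measurements above; it is a back-of-envelope estimate that assumes the 2.3 mW was measured at the 56 MHz maximum frequency, and the function name is hypothetical.

```python
def energy_per_cycle_pj(active_power_w, clock_hz):
    """Energy per clock cycle in picojoules, E = P / f."""
    return active_power_w / clock_hz * 1e12


# Measured 1.2V operating point of the 18TSPC AES-128 macro.
active_power_w = 2.3e-3   # 2.3 mW active power
clock_hz = 56e6           # 56 MHz maximum clock frequency
leakage_w = 390e-9        # 390 nW leakage at 1.2V

aes_energy_pj = energy_per_cycle_pj(active_power_w, clock_hz)
leakage_share_pct = 100.0 * leakage_w / active_power_w

print(f"energy per cycle: {aes_energy_pj:.2f} pJ")
print(f"leakage as share of active power: {leakage_share_pct:.3f}%")
```

Under these assumptions the macro spends roughly 41 pJ per cycle at 1.2V, and leakage is a negligible fraction of the active power at this operating point.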

Fig. 16 shows the measurement of the minimum functional voltage ( $V_{min}$ ) of the Shift-Reg over 92 test chips. Note that the on-chip macro design was targeted for 1.0-1.2V operation, but the measurement results show a mean  $V_{min}$  of 0.63V. The functionality at low voltage is mainly limited by the increased hold time of the FFs. To enable lower voltage operation, hold buffers should be inserted between stages during macro implementation.

Fig. 16.  $V_{min}$  distribution of Shift-Reg over 92 test chips.

Fig. 17. Functional  $V_{min}$  of AES-128 block and Shift-Reg with 0.1 MHz clock frequency at different temperature condition.

For the temperature-related measurements, the chip was placed in a temperature chamber. The temperature effects on the functional  $V_{min}$  of AES-128 and Shift-Reg with 0.1 MHz clock frequency are shown in Fig. 17. For both blocks, the functional  $V_{min}$  decreases as the temperature increases. This is because, at low VDD, the circuits are more sensitive to the reduction in threshold voltage induced by higher temperature, which decreases gate delay and strengthens temperature inversion effects [23]. The AES block is a combinational logic-dominant circuit with a variety of hold paths, some containing multiple 2-stack gates, which makes the  $V_{min}$  of the AES more sensitive to temperature. Only one type of hold path exists in the shift register, meaning that temperature has less effect on its  $V_{min}$  ( $\Delta V_{min} = 70\text{mV}$  over 0 °C to 80 °C) compared to the AES-128 ( $\Delta V_{min} = 120\text{mV}$  over 0 °C to 80 °C).
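To put the two $\Delta V_{min}$ figures on a common footing, they can be expressed as an average sensitivity in mV per °C. The sketch below assumes a linear trend across the 0 °C to 80 °C range, which is a simplification; the actual curves in Fig. 17 need not be linear. The function name is hypothetical.

```python
def avg_vmin_sensitivity_mv_per_c(delta_vmin_mv: float,
                                  t_low_c: float,
                                  t_high_c: float) -> float:
    """Average magnitude of the V_min change per degree Celsius.

    V_min falls as temperature rises, so this is the magnitude of a
    negative slope, averaged over the whole range (ASSUMED linear).
    """
    return delta_vmin_mv / (t_high_c - t_low_c)

shift_reg = avg_vmin_sensitivity_mv_per_c(70.0, 0.0, 80.0)
aes_128 = avg_vmin_sensitivity_mv_per_c(120.0, 0.0, 80.0)

print(f"Shift-Reg: {shift_reg:.3f} mV/degC")
print(f"AES-128:   {aes_128:.3f} mV/degC")
print(f"AES-128 is {aes_128 / shift_reg:.2f}x more temperature-sensitive")
```

On this averaged basis, the AES-128 $V_{min}$ moves by 1.5 mV/°C against 0.875 mV/°C for the Shift-Reg, a ratio of roughly 1.7.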

## VI. CONCLUSION

This work proposed 18TSPC, a fully-static and contention-free SPC FF with the lowest reported number of transistors (18), demonstrating a 20% cell area reduction with respect to the conventional TGFF. Fewer devices also result in 27% lower leakage. With a *MinE*-driven circuit implementation, the proposed design has higher D-to-Q delay and hold time than TGFF. Although a performance penalty is observed, thanks to the low power characteristic of the proposed design, 18TSPC achieves 1.8× better ED product. Chip measurement results show a 62.5% reduction in overall power at  $\alpha = 10\%$ , and a 68% reduction in  $P_{\alpha=0\%}$  at 1.2V, 25°C. When VDD scales down to the NTV level (VDD = 0.6 V), the overall power benefit at  $\alpha = 10\%$  increases to 68% and the  $P_{\alpha=0\%}$  benefit increases to 73% compared to the conventional TGFF. In addition, the chip test with an AES-128 macro demonstrates the compatibility of the proposed 18TSPC with automatic EDA implementation based on standard cells. A brief summary of the proposed 18TSPC and a comparison with prior works is shown in Table IV. The proposed 18TSPC has better power characteristics than the SoA S2CFF design.

TABLE IV  
SUMMARY OF COMPARISON WITH PRIOR WORKS

| FF Design                 | 18TSPC    | TGFF     | S2CFF        | TCFF        | ACFF*         | XCFF*        | TSPC-18T*     |
|---------------------------|-----------|----------|--------------|-------------|---------------|--------------|---------------|
| Year                      | This work | std-cell | ISSCC'14 [9] | JSSC'14 [8] | ISSCC'11 [14] | VLSI'05 [13] | TCASI'18 [16] |
| Technology (Reported)     | 65nm      | -        | 45nm SOI     | 40nm        | 40nm          | 100nm        | 28nm FDSOI    |
| Type                      | Static    | Static   | Static       | Static      | Static        | Dynamic      | Semi-Dynamic  |
| Contention                | No        | No       | No           | No          | Partial       | Yes          | Yes           |
| Single Phase Clock        | Yes       | No       | Yes          | Yes         | Yes           | Yes          | Yes           |
| Complementary Topology    | Yes       | Yes      | Yes          | Yes         | Yes           | Yes          | No            |
| Output Inverter           | Yes       | Yes      | Yes          | Yes         | Yes           | No           | No            |
| Poly Biasing              | No        | No       | No           | No          | No            | No           | Yes           |
| FBB/RBB                   | No        | No       | No           | No          | No            | No           | Yes           |
| Transistors (CK/TOTAL)    | 4/18      | 12/24    | 5/24         | 3/21        | 4/22          | 4/21         | 4/18          |
| Norm. Power @10% $\alpha$ | 0.32      | 1        | 0.6 [9]      | 0.34 [8]    | 0.4 [14]      | 1.2 [13]     | 0.42 [16]     |
| Setup (ns)**              | 9.2       | 4.66     | 14.7         | 137         | -             | -            | -             |
| Hold (ns)**               | 11        | -2.9     | -10.2        | -8          | -             | -            | -             |
| CK-to-Q (ns)**            | 14.6      | 14.8     | 14.5         | 13.4        | -             | -            | -             |

*Not implemented. **SS/0.54V/25°C. 18TSPC, TGFF, S2CFF and TCFF are characterised in 65nm.
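The normalised power row of Table IV can be read as a percentage saving relative to TGFF. The short sketch below tabulates that conversion using only the values in the table. Note that the figures for S2CFF, TCFF, ACFF, XCFF and TSPC-18T are those attributed to the cited works in the table and were reported in different technologies, so the ranking is indicative only.

```python
# Normalised power at 10% activity, copied from Table IV (TGFF = 1).
norm_power = {
    "18TSPC": 0.32,
    "TGFF": 1.0,
    "S2CFF": 0.6,
    "TCFF": 0.34,
    "ACFF": 0.4,
    "XCFF": 1.2,
    "TSPC-18T": 0.42,
}

# Saving relative to TGFF in percent; a negative value means higher power.
saving_pct = {name: round((1.0 - p) * 100) for name, p in norm_power.items()}

# Rank designs from lowest to highest normalised power.
ranking = sorted(norm_power, key=norm_power.get)

for name in ranking:
    print(f"{name:9s} norm. power {norm_power[name]:.2f}  saving {saving_pct[name]:+d}%")
```

The 68% saving obtained for 18TSPC from the table matches the overall power reduction at $\alpha = 10\%$ reported at 0.6V.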

#### ACKNOWLEDGMENT

The authors would like to thank David Flynn, David Bull, Sheng Yang and Shidhartha Das for useful discussions and helpful reviews. Thanks are also extended to Rohan Gaddh and Graham Knight for support with crypto-engine implementation and chip measurement. Experimental data used in this paper can be found at DOI: <http://doi.org/10.5258/SOTON/D0678>.

#### REFERENCES

- [1] L. Atzori, A. Iera, and G. Morabito, "The Internet of Things: A survey," *Computer Networks*, vol. 54, no. 15, pp. 2787–2805, 2010.
- [2] K. L. Chang, J. S. Chang, B. H. Gwee, and K. S. Chong, "Synchronous-Logic and Asynchronous-Logic 8051 Microcontroller Cores for Realizing the Internet of Things: A Comparative Study on Dynamic Voltage Scaling and Variation Effects," *IEEE Trans. Emerg. Sel. Topics Circuits Syst.*, vol. 3, no. 1, pp. 23–34, March 2013.
- [3] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, "Near-Threshold Computing: Reclaiming Moore's Law Through Energy Efficient Integrated Circuits," *Proc. IEEE*, vol. 98, no. 2, pp. 253–266, Feb 2010.
- [4] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar, "Near-Threshold Voltage (NTV) design: opportunities and challenges," in *Proc. 49th Annu. Design Automation Conf.* ACM, 2012, pp. 1153–1158.
- [5] U. R. Karpuzcu, N. S. Kim, and J. Torrellas, "Coping with parametric variation at Near-Threshold Voltages," *Micro, IEEE*, vol. 33, no. 4, pp. 6–14, 2013.
- [6] V. Stojanovic and V. G. Oklobdzija, "Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems," *IEEE J. Solid-State Circuits*, vol. 34, no. 4, pp. 536–548, 1999.
- [7] M. Alioto, E. Consoli, and G. Palumbo, "Variations in Nanometer CMOS Flip-Flops: Part I: Impact of Process Variations on Timing," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 62, no. 8, pp. 2035–2043, Aug 2015.
- [8] N. Kawai, S. Takayama, J. Masumi, N. Kikuchi, Y. Itoh, K. Ogawa, A. Ugawa, H. Suzuki, and Y. Tanaka, "A fully static topologically-compressed 21-transistor flip-flop with 75% power saving," *IEEE J. Solid-State Circuits*, vol. 49, no. 11, pp. 2526–2533, 2014.
- [9] Y. Kim, W. Jung, I. Lee, Q. Dong, M. Henry, D. Sylvester, and D. Blaauw, "A static contention-free single-phase-clocked 24T flip-flop in 45nm for low-power applications," in *IEEE ISSCC Dig. Tech. Papers*, Feb 2014, pp. 466–467.
- [10] N. Pinckney, D. Blaauw, and D. Sylvester, "Low-Power Near-Threshold Design: Techniques to Improve Energy Efficiency," *IEEE Solid-State Circuits Mag.*, vol. 7, no. 2, pp. 49–57, Spring 2015.
- [11] V. Kursun and E. G. Friedman, "Variable threshold voltage keeper for contention reduction in dynamic circuits," in *ASIC/SOC Conference*. IEEE, 2002, pp. 314–318.
- [12] J. M. Rabaey, A. Chandrakasan, and B. Nikolić, *Digital Integrated Circuits: A Design Perspective*. Prentice-Hall, 2002.
- [13] A. Hirata, K. Nakanishi, M. Nozoe, and A. Miyoshi, "The cross charge-control flip-flop: A low-power and high-speed flip-flop suitable for mobile application SoCs," in *Symp. VLSI Circuits Dig. Tech. Papers*. IEEE, 2005, pp. 306–307.
- [14] C. K. Teh, T. Fujita, H. Hara, and M. Hamada, "A 77% energy-saving 22-transistor single-phase-clocking D-flip-flop with adaptive-coupling configuration in 40nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb 2011, pp. 338–340.
- [15] Y. Cai, A. Savanth, P. Prabhat, J. Myers, A. S. Weddell, and T. Kazmierski, "Evaluation and analysis of single-phase clock flip-flops for NTV applications," in *Power and Timing Modeling, Optimization and Simulation (PATMOS)*, Sept 2017, pp. 1–6.
- [16] F. Stas and D. Bol, "A 0.4-V 0.66-fJ/cycle Retentive True-Single-Phase-Clock 18T Flip-Flop in 28-nm Fully-Depleted SOI CMOS," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 65, no. 3, pp. 935–945, March 2018.
- [17] G. Gerosa, S. Gary, C. Dietz, D. Pham, K. Hoover, J. Alvarez, H. Sanchez, P. Ippolito, T. Ngo, S. Litch, J. Eno, J. Golab, N. Vanderschaaf, and J. Kahle, "A 2.2 W, 80 MHz superscalar RISC microprocessor," *IEEE J. Solid-State Circuits*, vol. 29, no. 12, pp. 1440–1454, Dec 1994.
- [18] J. Yuan and C. Svensson, "New single-clock CMOS latches and flipflops with improved speed and power savings," *IEEE J. Solid-State Circuits*, vol. 32, no. 1, pp. 62–69, Jan 1997.
- [19] N. H. Weste and D. Harris, *CMOS VLSI design: a circuits and systems perspective*. Pearson Education India, 2015.
- [20] M. Alioto, E. Consoli, and G. Palumbo, "General strategies to design nanometer flip-flops in the energy-delay space," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 57, no. 7, pp. 1583–1596, July 2010.
- [21] Y. T. Shyu, J. M. Lin, C. P. Huang, C. W. Lin, Y. Z. Lin, and S. J. Chang, "Effective and efficient approach for power reduction by using multi-bit flip-flops," *IEEE Trans. Very Large Scale Integration (VLSI) Syst*, vol. 21, no. 4, pp. 624–635, April 2013.
- [22] "Arm, Cortex-M0 Revision: r0p0 Technical Reference Manual," <http://www.arm.com/products/processors/cortex-m/cortex-m0.php>, accessed: 2017-09-30.
- [23] D. Flynn, R. Aitken, A. Gibbons, and K. Shi, *Low power methodology manual: for system-on-chip design*. Springer Science & Business Media, 2007.



**Pranay Prabhat** received the B.Tech. degree in electronics and communication engineering from Indian Institute of Technology, Guwahati, India, and the M.Sc. degree in microelectronics systems design from the University of Southampton, Southampton, U.K., in 2005 and 2007, respectively. He is currently a Principal Research Engineer with Arm, Cambridge, U.K. His research interests include low power digital circuits and systems.



**Yunpeng Cai** received the BEng degree in Electronic Engineering from University of Central Lancashire, UK, in 2013, and the MSc degree in Microelectronics System Design from University of Southampton, UK, in 2014. He is currently pursuing the PhD degree in Electronics and Electrical Engineering. From 2016 to 2017, he completed a seven-month internship in Arm Research, Cambridge, U.K. He was a visiting Ph.D. student in Arm Research for one month in 2017. His research interests also include computer vision, 3D object reconstruction and reliable ultra-low power system design.



**Anand Savanth** (M'13) received the degree in ECE from Visvesvaraya Technological University, India, in 2008, and the Master's degree in microelectronics from the University of Liverpool. He is currently pursuing the Ph.D. degree with the ECS Group, University of Southampton, with a focus on custom and analog-assisted circuits for IoT platforms and energy harvesting applications. He has been with the Applied Silicon Group, Arm Research, developing low power custom circuits for Arm embedded and applications processors. He received the Sir Robin Saxby Award from the University of Liverpool in 2010.



**James Myers** (M'14) received the M.Eng. degree in electrical and electronic engineering from Imperial College, London, U.K., in 2004. He is a Principal Research Engineer with Arm Research, Cambridge, U.K., where he leads research into energy constrained wireless sensor nodes for the Internet of Things. In 2007, he joined Arm where he was initially responsible for developing reference implementation flows for the various Arm soft processor cores. Since 2009, he has been with R&D full time focused on deployable techniques for reduction of CPU and SoC power, embodied across a dozen tape-outs. He holds multiple U.S. patents. His research interests include energy harvesting, power gating, sub-threshold circuits, and better than worst-case design.



**Alex S. Weddell** (GSM'06-M'10) received the M.Eng. degree (1st class honors) in electronic engineering from the University of Southampton, U.K., in 2005 and the Ph.D. degree from the same institution in 2010 with the thesis "A Comprehensive Scheme for Reconfigurable Energy-Aware Wireless Sensor Nodes". He is currently working as Research Fellow at the University of Southampton, on the EPSRC-funded project "Next-Generation Energy Harvesting Electronics: A Holistic Approach". His main research focus is in the areas of energy harvesting, energy management, and wireless sensing. He has published 14 peer-reviewed papers in these areas, and has worked on a number of associated projects including the U.K. Ministry of Defence consortium-funded "Adaptive Energy-Aware Sensor Networks", and the EPSRC-funded platform grant "Future Directions in Intelligent Sensing". He contributed a chapter on "Wireless Devices and Sensor Networks" in the book *Energy Harvesting for Autonomous Systems* (Artech House, 2010).



**Tom Kazmierski** (M'95-SM'10) received the M.S. degree in electronic engineering from Warsaw University of Technology, Warsaw, Poland, in 1973 and the Ph.D. degree from the Military Academy of Technology, Warsaw, in 1976.

In 1984, he joined the Department of Electronics and Computer Science, University of Southampton, Southampton, U.K., where he has pursued research in numerical modelling, simulation, and synthesis techniques for computer-aided design of very large scale integration (VLSI) circuits. From 1990 to 1991, he was a Visiting Research Scientist with the IBM VLSI Technology Division, San Jose, CA, where he developed and patented synchronization techniques for multi-solver simulation backplanes. He has contributed to the development of the VHDL-AMS standard by IEEE, served as Chair of the IEEE DASC P1076.1 (VHDL-AMS) Working Group from 1999 to 2005, and is currently serving as the P1076.1 WG Secretary. He has published over 100 papers and given a number of invited talks and tutorials mostly in the area of analogue and mixed signal synthesis and HDLs. In recent years, he has been working on web-based electronic design frameworks and applications of VHDL-AMS to high-level system modelling and synthesis, involving modelling of mixed-domain systems, automated analog, and mixed-signal synthesis for ASIC design, including synthesis of artificial and VLSI neural networks.