

## SEQUENTIAL CIRCUIT DESIGN

- \* Sequential circuit output depends on the both past and present inputs.
- SC are mainly made up of flip-flops and latches. These are called as Sequencing elements.
- The Sequencing elements inserts delay to the circuit.

- ① Dynamic Circuits → clock is one of the inputs → Eg: domino, CMOS logic
- ② Static circuits → No clock. Eg: CMOS, pass transistors.

### \* Sequencing of static circuits:



### \* Different types of sequencing methods:

- 1) flip-flops
- 2) 2 Phase transparent latches
- 3) Pulsed latches

① flip-flops:



can be also connected as



② 2-Phase transparent latches



### ③ Pulsed latches



### \* Max-Delay Constraints:

#### • TPmining notations:



$t_{cd} \rightarrow$  contamination delay

$tpd \rightarrow$  propagation delay



$t_{setup} \rightarrow$  Setup time

$t_{hold} \rightarrow$  hold time

$t_{ceq} \rightarrow$  Contaminat'n clock to  $\Phi Q$

$t_{pca} \rightarrow$  Propogat'n clock to  $\Phi$

\* Max Delay Constraints:



$$T_{c\min} \geq t_{\text{pcq}} + t_{\text{pd}} + t_{\text{setup} 2}$$

$$T_{c\min} \rightarrow f_{\max}$$

- Ideally clock pulse would be available for computation of combination logic
- The sequencing overheads of flip flops consume these times.
- If combinational logic delay is large the receiving element will miss setup time and sample wrong value resulting in setup time failure or max-delay failure.
- Solution for this is increasing clock period.
- The fig show max-delay timing constraints on a path from one flip-flop to next.
- Clock is assumed to be ideal
- Path begin with rising edge of clock @  $F_1$  triggering  $F_1$ . The data propagates as output of  $F_1$  is a  $Q$  through combinational logic to  $D$ . Setting up  $F_2$  before next rising

### \* Min-delay constraints: edge.

→ This implies that clock period must be large

→ Maximum delay is given by

$$tpd \leq T_c - (t_{setup} + t_{pcq})$$

↳ Sequencing overhead.

### \* Min-delay Constraints



→ Sequencing elements can be placed back to back without any intervening combinational circuit and still get proper functionality.

→ but if the hold time is large and the contamination delay is small then data can incorrectly propagate between the stages/system.

→ This is called race condition, hold time failure, min-delay failure

- It can only be fixed by redesigning the logic, not by slowing the clock.
- fig shows the min-delay timing constraints on a path from  $D_1$  to  $D_2$  assuming ideal clocks with no skew.
- The path begins with the rising edge of clock triggering  $F_1$ .
- The data may begin to change @  $D_1$  after a clk-to-a contamination delay. & @  $D_2$  after another logic contamination delay
- However, it must not reach  $D_2$  until at least the hold time thold after the clock edge, lest it corrupt the contents of  $F_2$ . Hence we solve for the minimum logic contamination delay :

$$tcd \geq thold - tcrq$$

- If the contamination delay through the flip-flop exceeds the hold time, you can safely use back-to-back f-fs.
- If not, you must explicitly add delay between the f-fs. (e.g. with buffer) or use special slow f-fs with greater than normal contamination delay on paths that require back-to-back flops.
- scan chains are common example of paths with back-to-back flops.

## True Single-phase-clock (TSPC) latches and flip-flops

- Conventional latches require both true and complementary clock signals.
- In modern CMOS Systems, the complement is normally generated locally with an inverter in the latch cell
- The TSPC latches and flip-flops replace the inverter-transmission gate or C<sup>2</sup>MOS stage with a pair of stages requiring only clk, not its complement.



- Note that this  $f_f$  produces a momentary glitch on  $\bar{Q}$  after the rising clock edge when D is low for multiple cycles; this  $\uparrow$ s the activity factor of downstream cks and costs power.
- extends TSPC principle to handle domino, RAMs, and other precharged circuits.
- The dynamic TSPC latches were used on ground breaking Alpha 21064 microprocessor.
- Logic can be built  $\pitchfork$  into first stage of each latch.
- The latch is not easy to staticize.
- In any case, the clk must also be reasonably sharp to prevent races with both transistors are partially ON.

## GLOBAL CLOCK GENERATION.

A global clock generator receives an external clock signal and produces the clock that will be distributed across the die. Clock generator is simply a buffer to drive the large capacitance of the clock distribution network.

However, the input pad, buffering, distribution network, & gates have significant delay that leads to a large skew bt external clk & physical cks received @ cked elements.

Moreover, this delay varies with processing & environment. Becos of this skew, clocked elements on the chip are no longer synchronized with external I/O slgs.

Guaranteeing setup & hold times becomes problematic, particularly @ high freq where skew exceeds half of clk period.

More sophisticated clock generators use phase-locked loops (PLLs) to compensate for this delay.

PLLs can perform freq multiplication to provide an on-chip @ a higher frequency than the external clock.

→ Synchronous chip interface with PLL



chip1 sends a clk and Dout to chip2 and receives Din back from chip2. The data should be synchronized to clk so each chip can sample it on the <sup>edge</sup> clk edge. However, the internal clk on chip2 is delayed through clk distribution network. The PLL adjusts internal clock in chip2 to correct for this delay & keep the clock synchronized with the data.

→ A PLL receives a reference clock and a feedback clock and produces an output clock. The phase and frequency of the output clock is automatically adjusted until the feedback clock is exactly aligned with the reference clk.

In our application, ref clk comes from chip 1. The fdbk clk is tapped off one of the physical clock wires that also drives registers. The Dp clk gclk signal sent into clock distribution network. ∴ PLL adjusts gclk until clk driving data registers is aligned with the external clock exclk. This eliminates the symmetric skew caused by the clock distribution delay.



↳ Block diagram of typical PLL

The heart of the PLL is a Voltage-controlled oscillator (VCO). The control voltage is adjusted until the oscillator produces an Dp clk of proper phase @ same freq as ref clk. This control is performed with charge pump & RC filter. A phase detector determines whether feedback clk leads or lags ref clk. The charge pump consists of a pair of current sources enabled by the up and down signals to adjust the voltage on Vctrl until the feedback clk becomes aligned with the reference.



The above diagram is block diagram of PLL that performs freq multiplication. It uses a divide-by-N counter on the fclk to generate a clk aligned in phase but @ N times freq of the ref clk. Ex: The Pentium 4 uses a 100MHz external sys clk that can be multiplied up to a 3GHz core clock. With another divide - by-M counter on refclk terminal, PLL can produce any rational N/M multiple of the ref freq.

### \* DLL (Delay-locked loop)



## Conventional CMOS latches

→ Pass Transistor latch  
transparent latch buffer



→ has four limitations

Limitations:  
→ O/p floats, diffusion i/p  
→ O/p noise sensitivity

- 1] O/p doesn't swing from rail to rail (from gnd- vdd).
- 2] It never rises above  $V_{DD} - V_T$
- 3] dynamic output
- 4] potential noise issue.



→ uses CMOS TG instead of single CMOS pass transistor to offer rail-to-rail o/p swing  
↑ transmission gate

No  $V_T$  drop

→ It requires additional complementary clock.



Inverting buffer  
+ Restoring  
- Inverted o/p

→ Adds an output inverter so that the state node is isolated from the output noise

→ This creates an inverting latch.



→ behaves as inverting latch with buffered input and non buffered output.

Tristate feedback



+ static  
+ static latches  
are now  
essential

$\overline{Q} \rightarrow$  When clock = 1 Input transmission  
gate is ON and feedback  
tristate is OFF  
 $\rightarrow$  When clock = 0 Input transmi  
ssion gate is OFF however feedback  
tristate is ON.



$\rightarrow$  adds an input to the  
buffer so that there is  
input @ the transistor  
gate rather than  
unbuffered diffusion

but both (e) & (f) introduces large noise glitches @  
the output that may enter @ x through the feedback  
circuit.

g)



Buffered  
+ static  
+ No voltage drop  
+ Noise sensitivity  
+ diffusion ilp

$\rightarrow$  Robust transparent latch  
that address all the  
deficiencies mentioned so far.  
 $\rightarrow$  The latch is static  
 $\rightarrow$  all nodes swing rail to  
rails

$\rightarrow$  Input drives transistor gate rather than diffusion  
 $\rightarrow$  static noise is isolated.

Datapath latch  
used data path applications



- Input noise is preferably controlled
- much faster
- more compact.

+ faster & compact

Ex: Intel uses this as datapath latch



- jamb latch, a variation of
- Reduces clock load
- Saves two transistor by using a weak feedback inverter instead of a tristor.



- jamb latch commonly used in register files, FPGA cells
- They read out on to a single data output line.

→ and any one is enabled (Read/write) @ any given time with its RD signal.

→ commonly used in register files and fpga cells

### Conventional flip-flop design

• flip-flop is built as pair of back-to-back latches



a) → dynamic inverter flip flop

b)  $\rightarrow$  with feedback , non-inverting static flip flop



## Global clock distribution.

→ The global clock must be distributed across the chip in a way that reaches all of the clocked elements @ nearly same time.

The clock distribution system must be carefully designed to equalize the flight time between the clock generator and the clocked receivers

→ can be classified as grids, H-trees, spines, adhoc or hybrid.

\* **Grids:** A clock grid is a mesh of horizontal and vertical wires driven from the middle or edges.

The mesh is fine enough to deliver the clock to points nearby every clocked element. The resistance is low bt any two nearby points in the mesh so the skew is also low bt nearby clocked elements.

This reduces the chance of hold-time problems because such problems tend to occur bt nearby elements where propagat<sup>n</sup> delay bt elements is also small.

Grids also compensate for much of random skew because shorting the clock together makes variations in delay irrelevant.

They can be routed early in the design without detailed knowledge of latch placement. However, grids do have significant systematic skew bt pts closest to drivers and the

points furthest away. They also consume a large amt of metal resources & hence have a high switching cap and power consumptn.

\* H-Trees An H-tree is a fractal structure built by drawing an H shape, then recursively drawing H shapes on each of the vertices.

With enough recursions, the H-tree can distribute a clock from the center to within an arbitrarily short distance of every point on the chip while maintaining exactly equal wire lengths. Buffers are added as necessary to serve as repeaters. If the clock loads were uniformly distributed around the chip, the H-tree would have zero systematic skew.

Moreover, the trees tend to use less wire and thus have lower capacitance than grids.

In practice, H-tree still shows some skew because the clock loads are not uniform, loading some leaves of the tree more than others. Moreover, the tree often must be routed around obstructions such as memory arrays. The leaves of H do not reach every point on the chip, so some short physical clock wires are required after the local clock gather.



\* Spines: As with the grid, the clock buffers are located in a few rows across the chip. However, instead of driving single clock grid across entire die, the spines drive length-matched serpentine wire to each small group of clocked elements. If loads are uniform, the spine avoids the systematic skew of the grid by matching the length of the clock wires. Each serpentine is driven individually so gaters can be used to save power by not switching certain wires.

The serpentine is also easy to design and each load can be tuned individually. However, a system with many clocked elements may require a large number of serpentine routes, leading to high area and capacitance for the clock NW. Like trees, spines also may have large local skews between nearby elements driven by different serpentines.

\* Ad-hoc: Many ASICs running @ relatively low frequency (100's of MHz) still get away with an ad-hoc clock distribution network in which the clock is routed haphazardly with some attempt to equalize wire lengths or add buffers to equalize delay. Such adhoc NWs can have reasonably low systematic skews because the buffers sizes can be adjusted until the nominal delays

are nearly equal.

However, they are subject to serve random skew when process variations affect wire and gate delays differently. This is the level that most commonly available tools support. Most design teams using ad-hoc clock networks also lack the resources to do a careful analysis of random skew, jitter, & drift.

∴ They should be conservative in defining a skew budget & must be careful about hold time violations.

\* **hybrid**: A hybrid combination of the H-tree and grid offers lower skew than either an H-tree or grid alone. In the hybrid approach, an H-tree is used to distribute the clock to a large number of points across the die. A grid shorts these points together. Compared to a simple grid, the hybrid approach has lower systematic skew because the grid is driven from many points instead of just the middle or edge.

Compared to H-tree, the hybrid approach is less susceptible to skew from nonuniform load distribution. The grid also reduces local skew & brings the clock to nearly every location where it is needed.