

**Latch** sync.: clock ; async: local handshake

Sync. drawback: time wasted waiting on clk if logic finishes early (dead time)

CSE (clocked storage element)

1. latch (level sensitive)

2. FF (edge triggered)

3. register (sys-level latch / FF)

Behavioral



Transparent high latch

if  $clk = 1$ ,  $Q = D$

Timing  $t_{DQ}$   $t_{CQ}$  (rising)

$t_{setup}$   $t_{hold}$  window before / after falling edge that  $D$  must be stable

Build 1. cross-coupled gates (pos. feedback)



3 op. points (2 stable)

2. cap. (leak)

Latch static

stronger, need fight stored transistor



dynam

Voltage transfer compressed  $\downarrow$   $\rightarrow$  only one op. point

if too weak, not writable. Also fight  $\rightarrow I_{\text{crowbar}}$

Want to shut feedback while writing. Gate w/  $\overline{clk}$



↳ gated feedback (not in SRAM, since want density)

Dyn. issue: leak (same w/ dyn. logic)

Not used any more



FF (edge)

param  $t_{cq}$ ,  $t_{setup}$ ,  $t_{hold}$

L1 L2  
master-slave



@negedge, L1 closes, L2 opens
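A minimal behavioral sketch of the master-slave arrangement, assuming L1 is transparent while clk = 1 and L2 while clk = 0 (inferred from the negedge note above). The class names `Latch` and `MasterSlaveFF` are illustrative, and delays are ignored.

```python
class Latch:
    """Level-sensitive latch: follows d while clk equals transparent_level."""

    def __init__(self, transparent_level):
        self.level = transparent_level
        self.q = 0

    def step(self, clk, d):
        if clk == self.level:  # transparent phase
            self.q = d
        return self.q          # opaque phase holds the stored value


class MasterSlaveFF:
    """L1 (master) transparent high, L2 (slave) transparent low."""

    def __init__(self):
        self.l1 = Latch(1)
        self.l2 = Latch(0)

    def step(self, clk, d):
        return self.l2.step(clk, self.l1.step(clk, d))


ff = MasterSlaveFF()
stimulus = [(1, 1), (0, 0), (0, 1), (1, 0), (0, 1)]  # (clk, d) pairs
trace = [ff.step(clk, d) for clk, d in stimulus]
print(trace)  # [0, 1, 1, 1, 0]
```

Q only changes on the steps where clk falls, and it takes the value D had just before the fall, which is the edge-triggered behavior the L1/L2 pair is meant to give.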

Late mode (long path)



$t_{cq} + t_{setup} = \text{latch overhead}$

$$t_{cq} + t_d + t_{setup} \;(\pm\, t_{skew}) \leq T_{cycle}$$

Early mode (short path)  $t_{cq} + t_d \geq t_{hold}$  ($+\, t_{skew}$) (E. shift register)

Short path padding w/ inverters (buffers)
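The late-mode and early-mode inequalities can be checked numerically. This is a sketch under stated assumptions: skew is treated as a worst-case magnitude that hurts both checks, `check_timing` is a hypothetical helper, and the numbers (in ps) are made up for illustration.

```python
def check_timing(t_cq, t_d_max, t_d_min, t_setup, t_hold, t_cycle, t_skew=0):
    """Return (late_ok, early_ok) for one FF-to-FF path; all times in one unit."""
    # Late mode (long path): data must settle a setup time before the next edge.
    late_ok = t_cq + t_d_max + t_setup + t_skew <= t_cycle
    # Early mode (short path): data must not change inside the hold window.
    early_ok = t_cq + t_d_min >= t_hold + t_skew
    return late_ok, early_ok


# Shift-register-like path: no logic between FFs (t_d_min = 0).
print(check_timing(50, 800, 0, 40, 30, 1000, t_skew=30))   # (True, False)
# Same path after padding the short path with 20 ps of buffers.
print(check_timing(50, 800, 20, 40, 30, 1000, t_skew=30))  # (True, True)
```

The first call fails only the early-mode check, which slowing the clock cannot fix; the padding in the second call is what repairs it.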

Need indep. control of clk phase in  $\phi/\phi'$

sln.



$t_{setup, hold}$ : master

$t_{cq}$ : slave



rising edge of slave  $\rightarrow$  captured by falling edge of master

non-overlap window for all  $\rightarrow$  master receive earlier



con: deadtime at window  $\rightarrow$  less time for long path  $\rightarrow f_{max} \downarrow$

can be tuned in modern processor



Pulse mode clocking (D-latch  $\rightsquigarrow$  ff)

worse early m. issue ( $+t_{pulse}$ )  $\rightarrow$  two-sided constraints

clk chopper (one shot) clk pulse @ negedge

if global chop  $\rightarrow$  distribute  $\rightarrow$  runt pulses.

if local chop  $\rightarrow$  more gates :(

(separated)

Transparent latch design



Make logic between M-S

Long path (global constraints)

1. D must arrive before closing edge of any latch (assume  $t_{setup} = 0$ )

2. Loop latency  $\leq$  # loop cycles

Cycle stealing: if one Q group too big, OK



Arbitrarily choose a starting point

Now 1, 0.5, 0.5, 0 (skewed)

barely works, no deadtime, can stop



Now 0.25, 1, 0.5, 0.25 2.25 cycles  $\geq 2$ , loop violation?

Data does not have to launch at CLK $\uparrow$. Can push 1<sup>st</sup> 0.25 back,  
so cycle is still 2.25  $\rightarrow$  false violation ✓

Timing violation may not be local



Dyn. latch / FF (rare)



(master-slave) ↗ fast!

if clk, clk-bar both 1, race, may flush

NORA (no race, C<sup>2</sup>MOS)

break connection w/ pt



TSPC (true single phase clocking)



if c=1,  $Q = \bar{Q}_i = D$

if c=0, opaque (dynamic if D or  $Q_i = 1$ )

either Q /  $Q_i$  dyn., but not both

FF (↗)

inverting



Dyn. flaw: wastes half of the cycle on precharge

pipeline 2-phase, hide pre

when  $\phi_1$ , H,  $\phi_2$  pre, closed



Partition issue. Need = :(

If eliminate  $\phi_2$  latch,  $\phi_2$  prevents next to be corrupted by pre. Need value before pre.

Race  $\phi_2$  eval vs  $\phi_1$  pre, so mostly  $\phi_2$  latch still kept

Sta. mix dyn.



1. Static at end of any cycle, before pp

X: power. If FF outputs same Q  $\rightarrow$  static \*; if diff Q  $\rightarrow$  dyn. \* (only switch once, hazard-free)

static after dyn. is worst ————— (flutter when driven by dyn.)

## Microprocessor (E. GPU, ...)

1. Data path : logic blocks handling / operating on data (E. adder, shifter, register)

2. Control (random logic, not structured)

tends to be buggy :(

3. Arrays : memory (very structured) (E. RF, SRAM, embedded DRAM)

Num: fixed point (int), signed, unsigned, 2's c

fp (mantissa + exp), fpu

2's c overflow =  $C_n \oplus C_{n-1}$
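The overflow rule can be verified exhaustively for a small word size. In this sketch `overflow` and `to_signed` are illustrative helper names, $C_n$ is the carry out of the MSB, and $C_{n-1}$ is the carry into it.

```python
def overflow(a, b, n=8):
    """Overflow flag for adding two n-bit patterns: C_n XOR C_{n-1}."""
    low_mask = (1 << (n - 1)) - 1
    c_into_msb = ((a & low_mask) + (b & low_mask)) >> (n - 1)  # C_{n-1}
    c_out = (a + b) >> n                                       # C_n
    return c_out ^ c_into_msb


def to_signed(x, n=8):
    """Interpret an n-bit pattern as a two's complement integer."""
    return x - (1 << n) if x & (1 << (n - 1)) else x


# Exhaustive check for 4-bit operands: flag set exactly when the true sum
# falls outside the representable range [-8, 7].
n = 4
for a in range(1 << n):
    for b in range(1 << n):
        true_sum = to_signed(a, n) + to_signed(b, n)
        assert overflow(a, b, n) == int(not (-8 <= true_sum <= 7))
print("overflow rule holds for all 4-bit pairs")
```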

endian in virtuoso, make bus  $a\langle 0:31 \rangle$

little  $a\langle 31:0 \rangle$

big  $a\langle 0:31 \rangle$



Adder (FA)  $S = a \oplus b \oplus c_{in}$

$$Cout = ab + ac_{in} + bc_{in} \quad (\text{majority})$$

self dual  $\bar{s}(a, b, c_{in}) = s(\bar{a}, \bar{b}, \bar{c}_{in})$ ,  $\overline{Cout}(a, b, c_{in}) = Cout(\bar{a}, \bar{b}, \bar{c}_{in})$

if building push-pull, identical pdn, pun  $\rightarrow$  great for layout
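Self-duality of the full adder is easy to confirm by enumerating all eight input combinations. A short sketch, with `fa_sum` and `fa_carry` as illustrative names:

```python
from itertools import product


def fa_sum(a, b, c):
    return a ^ b ^ c


def fa_carry(a, b, c):
    return (a & b) | (a & c) | (b & c)  # majority


# Complementing every input complements both outputs.
for a, b, c in product((0, 1), repeat=3):
    assert 1 - fa_sum(a, b, c) == fa_sum(1 - a, 1 - b, 1 - c)
    assert 1 - fa_carry(a, b, c) == fa_carry(1 - a, 1 - b, 1 - c)
print("sum and carry are both self-dual")
```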

E. 8-bit ripple carry



for ripple,  $S_3$  slowest

layout : just FA, and tile it well



Can make generate  $Cout$  and  $\sum$ . alternate inverting inputs

Sizing: every stage  $2w/w$ ; want  $w \rightarrow \infty$ , but dim. returns.  
 $\downarrow$  load on  $C$  by  $S$  circuit

$S$  not in crit. path. make  $S \downarrow \rightarrow$  size  $C$  chain (same size)

sub invert  $B$ ,  $Cin = 1$
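A bit-level sketch of the ripple-carry adder and of subtraction by inverting $B$ with $Cin = 1$. The function names and the default width of 8 bits are assumptions for illustration.

```python
def full_adder(a, b, cin):
    """One FA cell: returns (sum, carry_out)."""
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)


def ripple_add(a, b, cin=0, n=8):
    """Chain n FA cells from LSB to MSB; returns (n-bit sum, carry out)."""
    s, c = 0, cin
    for i in range(n):
        bit, c = full_adder((a >> i) & 1, (b >> i) & 1, c)
        s |= bit << i
    return s, c


def ripple_sub(a, b, n=8):
    """a - b computed as a + ~b + 1 on the same adder."""
    mask = (1 << n) - 1
    return ripple_add(a, ~b & mask, cin=1, n=n)


print(ripple_add(100, 27))  # (127, 0)
print(ripple_sub(5, 3))     # (2, 1)
print(ripple_sub(3, 5))     # (254, 0)  i.e. -2 in 8-bit two's complement
```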



$$\text{Parallel} \quad C_{\text{out}} = ab + ac_{\text{in}} + bc_{\text{in}}$$

$$= g + pc_{\text{in}} \quad g \equiv ab, \; p \equiv a+b \text{ or } a \oplus b \; (\text{XOR form} \rightarrow S = p \oplus c_{\text{in}})$$

1. Carry bypass E. break 16-bit into  $4 \times 4$

MUX, sel bit ANDs all  $p_i = a_i \oplus b_i$

E.  $n=16$ :  $t_{\text{crit}} \approx 3t_{\text{mux}} + 2 \cdot 4\, t_{\text{ripple}}$
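The 16-bit estimate can be generalized to other group sizes. This sketch assumes the worst case ripples through the first and last groups and bypasses every group in between, with one mux per bypass. That model is an extrapolation from the $4 \times 4$ case, and `bypass_delay` is a hypothetical helper.

```python
def bypass_delay(n, k, t_ripple, t_mux):
    """Rough critical path of an n-bit carry-bypass adder with k-bit groups."""
    groups = n // k
    # Ripple through first and last groups, mux through the rest.
    return 2 * k * t_ripple + (groups - 1) * t_mux


# With unit delays, compare group sizes for n = 16.
for k in (2, 4, 8):
    print(k, bypass_delay(16, k, t_ripple=1, t_mux=1))
# 2 11
# 4 11
# 8 17
```

For n = 16 and k = 4 this reduces to 8 ripple delays plus 3 mux delays, matching the estimate above.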

2. Carry select E.  $4 \times 4$

Calculate all carries for both  $c_{\text{in}} = 0$  and  $c_{\text{in}} = 1$

use MUX to select

$$t_{\text{crit}} = t_{\text{pg}} + t_{\text{carry}0} + t_{\text{carry}1} + 4t_{\text{mux}} + t_{\text{sum}}$$

3. CLA

$$C_{i+1} = g_i + p_i c_i = g_i + p_i g_{i-1} + \dots + p_i \cdots p_1 g_0 + p_i \cdots p_0 c_0$$

E. 4 : radix-4 lookahead

E. 16 can take  $4 \times$  radix 4 & ripple :(

Hierarchical  $P_0 \equiv p_3 p_2 p_1 p_0$  (group carry out  $= G_0 + P_0 c_{\text{in}}$ )

$$G_0 \equiv g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0$$
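The group signals can be checked against a plain ripple of $c_{i+1} = g_i + p_i c_i$. In this sketch the group propagate is taken as the AND of the four $p_i$ and the group carry out as $G_0 + P_0 c_{\text{in}}$; function names are illustrative.

```python
from itertools import product


def group_pg(g, p):
    """g, p: lists of 4 bit-level generate/propagate signals, index 0 = LSB."""
    P = p[3] & p[2] & p[1] & p[0]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return G, P


def ripple_carry_out(g, p, cin):
    """Reference: apply c_{i+1} = g_i + p_i c_i four times."""
    c = cin
    for gi, pi in zip(g, p):
        c = gi | (pi & c)
    return c


# Exhaustive check over all 4-bit operand pairs and both carry-in values.
for a, b in product(range(16), repeat=2):
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    G, P = group_pg(g, p)
    for cin in (0, 1):
        assert (G | (P & cin)) == ripple_carry_out(g, p, cin)
print("group G/P reproduce the rippled carry out")
```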



(x4)



tree adders (radix-4). Radix 2 is simpler.  $g_0, p_0 \rightarrow$



RS 0-1-2-4

fo of 1 or 2!






area efficient, but not same fanout



harder to layout



Shifter 1. logical always pad 0 (shamt)

2. arithmetic pad MSB (sext) when shift right

logarithmic shifter

$s = \text{shamt}$

if  $s_2$ , need shift by 4.

if arithmetic, shift in msb

stack = 2 + (input inverter) :(

can buffer w/ inverters
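A behavioral sketch of the logarithmic shifter for right shifts: stage $i$ shifts by $2^i$ when bit $s_i$ of shamt is set, filling with 0 (logical) or the MSB (arithmetic). The function name and the 8-bit default width are assumptions.

```python
def log_shift_right(x, shamt, n=8, arithmetic=False):
    """Shift an n-bit pattern right through log2(n) conditional stages."""
    mask = (1 << n) - 1
    x &= mask
    msb = (x >> (n - 1)) & 1
    stage = 0
    while (1 << stage) < n:
        k = 1 << stage                      # this stage shifts by 1, 2, 4, ...
        if (shamt >> stage) & 1:            # controlled by bit s_stage of shamt
            fill = ((1 << k) - 1) << (n - k) if (arithmetic and msb) else 0
            x = (x >> k) | fill
        stage += 1
    return x


print(bin(log_shift_right(0b10010000, 3)))                   # 0b10010
print(bin(log_shift_right(0b10010000, 3, arithmetic=True)))  # 0b11110010
```

A shamt of 3 activates the shift-by-1 and shift-by-2 stages and skips the shift-by-4 stage.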



Barrel shifter (pt) (4-bit)

shamt: 1-hot  $s_{3:0}$

arith >

2 stack height ⚡

need ⊥

Rotator out → in



## Control

1. random logic  $\rightarrow$  multi-level  $\rightarrow$  semi-custom

2. structured logic (PLA) programmable logic array, area ineff.

1. RTL (structured, careful)  $\rightarrow$  logic synthesis tool  $\xrightarrow{\text{layout}}$  P&R

cells (lib)  $\xrightarrow{\text{static timing analysis}}$



P&R uses feedback to optimize

**PLA** programmable logic array  
structured (E. instr. decode)

Two-level logic : SoP, PoS

Espresso : minimize SoP terms (heuristic)

bubble pushing AND-OR  $\Rightarrow$  NOR-NOR

Static PLA E.  $f_0 = x_0 x_1 + \bar{x}_2$ ,  $f_1 = x_0 x_1 x_2 + \bar{x}_2 + \bar{x}_0 x_1$



Need 4 prod. terms. Pseudo-nmos pullup for each



(noninverted)

Espresso format (0: use $\bar{x}$, 1: use $x$, -: d.c.)

| AND plane $x_0$ | AND plane $x_1$ | AND plane $x_2$ | OR plane $f_0$ | OR plane $f_1$ |
|-----------------|-----------------|-----------------|----------------|----------------|
| 1               | 1               | -               | 1              | 0              |
| -               | -               | 0               | 1              | 1              |
| 1               | 1               | 1               | 0              | 1              |
| 0               | 1               | -               | 0              | 1              |
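The four product-term rows can be evaluated directly and compared with the equations for $f_0$ and $f_1$. A small sketch; the tuple layout and the `eval_pla` helper are illustrative, not a real Espresso file parser.

```python
from itertools import product

# One entry per product term: (AND-plane pattern over x0 x1 x2, OR-plane bits f0 f1).
PLA = [
    ("11-", "10"),
    ("--0", "11"),
    ("111", "01"),
    ("01-", "01"),
]


def eval_pla(pla, x):
    """Evaluate all outputs for input tuple x = (x0, x1, x2)."""
    out = [0] * len(pla[0][1])
    for and_row, or_row in pla:
        # A term fires when every non-don't-care literal matches its input.
        fires = all(lit == "-" or int(lit) == xi for lit, xi in zip(and_row, x))
        if fires:
            for j, bit in enumerate(or_row):
                out[j] |= int(bit)
    return tuple(out)


# Compare against the SoP equations on every input combination.
for x0, x1, x2 in product((0, 1), repeat=3):
    f0 = (x0 & x1) | (1 - x2)
    f1 = (x0 & x1 & x2) | (1 - x2) | ((1 - x0) & x1)
    assert eval_pla(PLA, (x0, x1, x2)) == (f0, f1)
print("table matches f0 and f1")
```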

Dynamic PLA dyn. nor → dyn. nor (turn pseudo-n into clked)

but no inverter available for domino → clk w/  $\phi_{\text{AND}}$ ,  $\phi_{\text{OR}}$ . Don't start  $\phi_{\text{OR}}$  before AND eval'd.

self-timed → need delay elements

Replica: same circuit topology, timing testing



(large C on precharge node E. 3 nfets, all inputs high)

slow eval put 1 input high, rest 0

wait for output to be eval'd to 0 →  $\phi_{\text{OR}} \uparrow$



## Testing

$I_{DDQ}$ :  $I_{DD}$  at quiescent (some leak, but can't be huge)

Shut any static current (E. analog, half latch)

## Memory (array) SRAM / DRAM (volatile)

DRAM: trench into Si (deep reactive ion etching) → cap

SRAM: row decoder : wordlines (demux 1-hot)

col mux : bitlines

Logical organization E.  $1024 \times 64$  (10b addr.)

Physical E.  $256 \times 256$ : 8b addr. to row dec, 2b addr. to col. mux
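The address split for the $1024 \times 64$ logical, $256 \times 256$ physical example can be sketched as below. Which address bits feed the column mux is an assumption here (the low two bits), and `split_address` is a hypothetical helper.

```python
def split_address(addr, row_bits=8, col_bits=2):
    """Split a 10-bit word address into (wordline index, column-mux select)."""
    assert 0 <= addr < (1 << (row_bits + col_bits))
    col_sel = addr & ((1 << col_bits) - 1)  # assumed: low bits pick 1 of 4 words in a row
    row = addr >> col_bits                  # remaining 8 bits pick 1 of 256 wordlines
    return row, col_sel


print(split_address(5))     # (1, 1)
print(split_address(1023))  # (255, 3)
```

Each 256-bit physical row then holds four 64-bit words, and the 4:1 column mux picks one of them.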



Cell (6T)

R 1. precharge bit,  $\overline{\text{bit}}$

(row dec)

2. select 1 wordline (access on for word)

3. state pulls low bit or  $\overline{\text{bit}}$

4. detect (don't want to wait all the way  $\rightarrow 0$ , slow, power)

issue R may upset cell (read stability)

at precharge, both sides  $V_{DD}$ , inverter output can't go all  $\rightarrow 0$ . (wing stability)

Need access transistor  $1.5 \sim 2\times$  smaller than the inverter (pull-down nfet).

W 1. precharge

2. sel. wordline

3. Pull bit/ $\overline{\text{bit}}$  down  $\rightarrow$  flip SRAM

issue W writability (want diagram to collapse)

access T  $3\times$  stronger than inverter (pull-up pfet)

Sizing by changing L and W.  $\rightarrow$  dense

Layout w/ pushed rules (denser DRC)

small  $\rightarrow$  dopant variation  $\rightarrow V_T \uparrow \rightarrow$  strength  $\rightsquigarrow \rightarrow$  cells may fail

sln. 1. when not accessing, run at  $< V_{DD}$ . Only  $V_{DD}$  @ reading

2. redundancy (E columns), take out if bad



$\overline{\Phi_2} \rightarrow Q$   $\overline{\Phi_2} \rightarrow Q$  pre



Peripheral self timed; or  $\phi_1, \phi_2$

R1. pre @  $\phi_2 = 1$

2. access @  $\phi_1 = 1$  (clk qualification)

detect: skewed inverter (single-ended)

W same pre, access clk

data needs to be  $\phi_1$ -stable (E. from  $\phi_2$  latch)

3-high stack

+Mux column mux gives more space for W.

4-high stack

sln. combine 2-stack AND logic



SA (differential read)

limited by GBW  $\rightarrow$  regenerative circuits

need know when to amplify  $\rightarrow$  clk



access iso on, guard off (inverters off), one low

pre: iso off, guard on, inverters amplify w/ regenerative feedback

issue if inv. not exact same, offset  $\rightarrow$  flip to wrong state

Inverters w/ regenerative feedback not limited by GBW issue  $\rightarrow$  fast, but need clk.

Decoder (glorified AND) E. 3 $\rightarrow$ 8: 8 ANDs

issue: input ↑, high  $\rightarrow$  multilevel

Final output must match SRAM row pitch  $\rightarrow$  special DR for dec.

Typically dynamic nor:

race-based nor

Make  $\phi_1$  delay  $\phi_2$  to avoid race cond.

inverter pfet critical  $\rightarrow$  need wide fingers



predecoder decoder



and-and

$\downarrow$  bubble



nand-nor
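A gate-level sketch of the predecoder/decoder split after bubble pushing: NAND predecoders produce active-low one-hot lines, and a NOR per wordline combines them. The 4-to-16 size and all function names are assumptions chosen for illustration.

```python
def nand(*xs):
    return 0 if all(xs) else 1


def nor(*xs):
    return 0 if any(xs) else 1


def predecode2_n(a1, a0):
    """2 -> 4 predecoder from NANDs; line i is 0 only when (a1, a0) == i."""
    lits1 = (1 - a1, a1)
    lits0 = (1 - a0, a0)
    return [nand(lits1[i >> 1], lits0[i & 1]) for i in range(4)]


def decode4(addr):
    """4 -> 16 decoder: two predecoders, then one 2-input NOR per wordline."""
    hi = predecode2_n((addr >> 3) & 1, (addr >> 2) & 1)
    lo = predecode2_n((addr >> 1) & 1, addr & 1)
    # NOR of two active-low lines equals AND of the active-high versions.
    return [nor(hi[w >> 2], lo[w & 3]) for w in range(16)]


# Every address should raise exactly its own wordline.
for addr in range(16):
    assert decode4(addr) == [1 if w == addr else 0 for w in range(16)]
print(decode4(5))
```

Each wordline gate has only two inputs regardless of address width per predecoder, which is how predecoding keeps fan-in and stack height down.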

