

## HDL (Hardware Description Language)

1. Verilog → C-like
2. VHDL ⇒ VHSIC Description language  
Very High speed Integrated Circuit
3. BSV ⇒ Bluespace System Verilog

Simulate:

Isim Xilinx ISE 14.7

Synthesize FPGA lab.

Verilog programme ⇒ Collection of modules.

Inside "always" '=' is blocking assignment & ' $\leftarrow$ ' is non-blocking.

Outside "always" we have to write:

assign a = b      otherwise it is an error.

Take care while writing combinational logic inside "always".

1. All RHS must appear in sensitivity list
2. All LHS must be assigned/ altered in each possible case.
- 3.

Basic Logic Design:

Decoder: Latency of decoder → Time taken to decode

↳ Depends on slowest of operations × no. of bits.

3 bit decoder:  $b_0 = \bar{a}_2 \bar{a}_1 \bar{a}_0$

$b_1 = \bar{a}_2 \bar{a}_1 a_0$

So latency of decoder × no. of inputs.

Latency → dictates the fastness of logic circuit

CLASSMATE

Date \_\_\_\_\_  
Page \_\_\_\_\_



Multiplexer:



| $a_0$ | $a_1$ | sel | $b$   |
|-------|-------|-----|-------|
| x     | x     | 0   | $a_0$ |
| x     | x     | 1   | $a_1$ |

$$b = a_0 \cdot \overline{sel} + a_1 \cdot sel$$



Increasing no of input in a Mux increases latency.

Encoder:



only one of  
these is 1.  
Others are 0.

Priority Encoder:  $n \rightarrow 2^n$  →  $n \rightarrow 1$

When multiple bits are 1 as input in an encoder, priority is set (inside) for among diff. bits.

E.g.  $a_0=1, a_1=1, a_2=0, a_3=0, a_4=1, a_5=0, a_6=1, a_7=0, a_8=0$

Suppose it is programmed/fixed for 4 as higher priority than 2 and 7, then output is  $b_0=0, b_1=0, b_2=1$ .

Therefore  $b_0=0$  as second priority is 0.

Comparator:

equal  
less than  
greater than

Any Digital logic:

Performance  $\propto$  latency  $\propto$  Longest Path.

PLA: Programmable Logic Array.

Implementing full adder

$$\text{sum} = \bar{a}\bar{b}c_{in} + \bar{a}b\bar{c}_{in} + a\bar{b}\bar{c}_{in} + abc_{in}$$

$$c_{out} = \bar{a}bc_{in} + a\bar{b}c_{in} + ab\bar{c}_{in} + abc_{in}$$



Positive level-sensitive latches



Positive edge triggered

Synchronous  $\rightarrow$  occurring at same time (in sync)  
 Ex. when 2 flip-flops have same clock, they are said to be synchronous.

classmate

Date \_\_\_\_\_  
 Page \_\_\_\_\_

Setup time: D must be in a the required state a short time before clockedge

Hold time: A minimum time after clock edge for which D should remain in ~~the~~ same state for correct functioning



Propagation delay: Clock to Q delay.  
 (in Flip-flop)

✓ Cycle time  $\geq$  setup time + latency of f + propagation delay of X of Y  
 + clock skew time.  
 (between X and Y)



Clock skew time: Time difference b/w clock edge at X and Y due to wire length or capacitive-resistive delays etc.

### Register file (Memory structure)





2 Write ports : Two  $n$  to  $2^n$  decoders, 2 Write Enable, 2 Data Inputs.

$c^o$  for clock 0 and  $c'$  for clock 1.



Read ports use very large multiplexers and hence can be very slow. So, different design of Read ports:



For 2 working read-write ports, we need 2 decoders, 2 switches at each flip-flop.

FPGA (Xilinx Spartan 3E)

CLB: 2 logic slices (SLICE L)

2 Memory slices (SLICEM)

## Shift Registers:

Registers which shift the stored data in forward/reverse direction.

Inputs: [shift amount]  
[shift direction]  
and sometimes

## Asynchronous input:

Use D flip-flop as synchronizer



Failure:



Solution: Chains of 2 flip-flops

Works if metastability period < clock period.

## Memory Structures:

Register file (discussed)

SRAM (Static RAM) → volatile memory (Holds memory as long as given power)

Fast memory → Gigantic Register file

Chip select → to enable specific chip out of a bunch of chips  
↳ 1 bit

Rows are called vaults

Each bit → 2 cross-coupled inverters.

All bits in column are connected to a common wire called  
bitline



$\begin{array}{c} \diagup \\ \text{T} \\ \diagdown \end{array}$  : Pass transistor / Access transistor / NMOS

For 2 read ports: 2 decoders, 2 bitlines for every column,  
2 NMOS at each bit, 2 output enable.

- \* Increasing height of SRAM increases length of bitline and size of decoder also, leading to SRAM being slowed down.
- \* Increasing width of SRAM increases length of Wordline leading to SRAM being slowed down.

Building large SRAMs :



Another way: (Can be Power effective)



### Dynamic RAM:

Each cell contains only a capacitor, content leaks away.

Why?

We require high density storage.



DRAM Bank : 1 wordline per row

Row Decoder

Row Address



2 Bitlines per column :

Bitline are precharged to  $V_{DD}/2$ .  
When wordline is high, both switches start closing /

gaining charge

Sense Amplifier senses this and interprets the state of

column

Why this ?  $\rightarrow$  Less area required.

RAS is <sup>desired</sup> sent along with row address ( RAS is basically read enable ).

Row Address  
Decoder

Row Address & Column address are never sent simultaneously  
So, a single decoder can be used.

Burst length of DRAM module :

We can read  $N$  consecutive columns by giving address of only 1st column.

Channel: 64 bit, let output width be  $k$

if  $\times k$  chips ~~are~~, so no. of chips in

$$\text{a rank} = \frac{64}{k}$$

E.g.  $\times 4$  chips  $\Rightarrow$  16 chips in a rank for 64-bit channel

classmate

Date \_\_\_\_\_

Page \_\_\_\_\_

- Steps:
1. Precharge the bank (Close existing row)
  2. Activating a row
  3. Read Column (CAS)

Row Hit (only step 3), Row miss (step 2 & 3), Row conflict (All steps)

Example:

Let's assume each step above requires  $x$  cycles.

Read from the foll addresses:

(r0, c0), (r2, c3), (r3, c1), (r0, c2), (r3, c3)

If DRAM reads in the given order, it takes  $15x$  cycles.

If DRAM intelligently reads, it first reads all columns from a single row and then move onto next row rather than changing row & column each time ( $\because$  changing a row requires more cycles).

So, it takes  $3x + n + 3x + 3x + n = 11x$  cycles.

DDR ~~RA~~ DRAM

Double Data Rate

Can transfer the data on both clock edges: Two data transfers in a cycle.

We can't increase clock frequency  $\because$  power consumption & frequency and it ~~also~~ increases.

The first burst has variable latency and after that each half cycle brings a ~~burst~~ (column width of data  $\times$  no. of chips) i.e. (no. of bits in channel) amount of data.



DDR3 - 2133  $\Rightarrow$  1.067 GHz

DDR3 - 2400

DDR4 - 2400  $\Rightarrow$  2400 million transfers per sec.

$\downarrow$   
1.2 GHz



SRAM gives an address A (32 bit to DRAM controller).

& DRAM controller has to send 64 bytes from address A to address A+63 back to SRAM.

## Address Decoding:



## Error correction:

Each DRAM address: 9 bits of Data + Ecc (Error correction code)  
(8 bits + Parity)

SECDED: Single error correct Double error detect codes.

## Finite State Machines

State elements in any sequential logic.

⇒ Finite no. of states

State Diag



Output  $f^n$  depends only on current state: Moore machine  
on current state & Input: Mealy machine.

## Finite State Machines

Two's Compliment:

Rule: an integer  $x$  and its negative add up to  $2^n$

$$\Rightarrow x + (-x) = (2^n - 1) + 1$$

$$\Rightarrow (-x) = (2^n - 1) - x + 1$$

$(2^n - 1) - x$  is bitwise inversion of  $x$ .

So, to find  $-x$ ,

1. Bitwise invert  $x$
2. Add one to it.

- To convert 2's complement binary no. to decimal, use  $-2^{n-1}$  as coeff of for MSB. and rest is same as regular binary number.

Ex:  $7 + (-6)$

$$+6 = 0110$$

$$-6 = 1010 \quad (\text{In 2's complement})$$

$$\begin{array}{r} 7 \\ \hline -6 \\ \hline 10001 \end{array}$$

$$(-7) + 6$$

$$-7 = 1001$$

$$6 = 0110$$

$$1111 = -2^3 + 2^2 + 2 + 1 = -1$$

$$(b_{n-1} b_{n-2} \dots b_0) = b_{n-1} \times (-2^{n-1}) + \sum_{i=n-2}^0 b_i 2^i$$

$\alpha$

In 2's complement

$$-Y = (1 b_{n-1} \dots b_0) = -2^{n-1} + \alpha$$

$$\begin{aligned} Y + (-Y) &= 2^n \\ Y &= 2^n - (-Y) = 2^n - \cancel{-} \\ &= 2^n + 2^{n-1} - \cancel{\alpha} = 2^{n-1} + \alpha \\ \Rightarrow -Y &= -2^{n-1} + \alpha. \end{aligned}$$



$\boxed{A} + \boxed{B}$

Here A & B must be supplied in 2's complement

$\boxed{A} - \boxed{B}$

form.

Imbalanced ranges on positive and negative sides.

One's Complement:

+7 - 6

0 1 1 1

1 0 0 1

1 0 0 0 0 X

There are different cases that require different treatment.

Biased Representation

$n=4$ , bias = 8.

-8 to +7

$-8 \Rightarrow 0000$

Imbalanced ranges

$-7 \Rightarrow 0001$

$+7 \Rightarrow 1111$

6 ways to represent integer in C :

|                    |   |                                 |
|--------------------|---|---------------------------------|
| short              | } | $n=16$ ( 16 bit representation) |
| unsigned short     | } |                                 |
| int                | } | $n=32$                          |
| unsigned int       | } |                                 |
| long long          | } | $n=64$                          |
| unsigned long long | } |                                 |

unsigned  $\rightarrow$  takes only non-negative integer.

by default  $\Rightarrow$  unsigned means unsigned int.

All computers store -ve integers in 2's complement form.

### Overflow

In 2's complement representation, overflow occurs iff carry in to MSB is different from carry out from MSB.

Cin Cout

|   |   |
|---|---|
| 0 | 0 |
| 0 | 1 |
| 1 | 0 |

1 1



$c_{in} = 0, cout = 1.$

$$\Rightarrow a_{n-1} = b_{n-1} = 1.$$

In this case  $a_{n-1} = 0 \Rightarrow \underline{+ve} \Rightarrow \text{wrong} \Rightarrow \text{overflow}.$

$c_{in} = 1, cout = 0.$

$$\Rightarrow a_{n-1} = 0, b_{n-1} = 0.$$

$$\Rightarrow a_{n-1} = 1. \Rightarrow -ve \Rightarrow \text{wrong} \Rightarrow \text{overflow}.$$

$c_{in} = 0, cout = 0$

atleast one of  $a_{n-1}$  &  $b_{n-1}$  is 1 and atleast one of  $a_{n-2}, b_{n-2}$  is 1.

► Here since carry is zero, so no ~~of~~ overflow is obvious.

$c_{in} = 1, cout = 1.$

atleast one of  $a_{n-1}$  &  $b_{n-1}$  is 1 and atleast one of  $a_{n-2}, b_{n-2}$  is 1

(from Lecture - slides \*)

### Floating-Point numbers:

computers  $\rightarrow$  ~~represent~~, floating-point binary number using normalized scientific notation.

$$1.b \times 2^e \quad b, e \rightarrow \text{binary numbers.}$$

### IEEE 754

$[1.b] \rightarrow \text{significand}$



$$(-1)^{\text{sign}} \cdot 2^{\text{Exponent}} \cdot (1.\text{mantissa})$$

Exponent fields  $\rightarrow$  true & -ve exponents.

A larger magnitude floating-point no. should have large 31-bit magnitude.

So, can't use 2's complement.

$\therefore$  1 at start of a -ve exponent makes it larger than +ve exponent.

So, Biased encoding

Bias = 127.

Encoded Exp = 127 + Actual Exp.

So, Biased Encoding preserves monotonicity.

Min +ve number =

Denormalized number:

Largest +ve denormalized number:

$$\begin{array}{c} 0 | 0 - - - 0 | 1 1 - - - 1 \\ (1 - 2^{-23}) \times 2^{-126} \end{array}$$

Smallest +ve denormalized number:

$$\begin{array}{c} 0 | 0 - - 0 | 0 - - 0 | \\ 0.0 - - 0_1 \times 2^{-126} = 2^{-149} \end{array}$$

Representable magnitudes :

0  $\rightarrow$  (Exponent = -127)

Denormalized :  $2^{-149}$  to  $(1 - 2^{-23}) \times 2^{-126}$

Normalized :  $2^{-126}$  to  $(2 - 2^{-23}) \times 2^{127}$

## Rounding :

$$x = \overbrace{1.11\ldots}^{24}$$

$$x' = \overbrace{1.11\ldots}^{-10}$$

$$x - x' = \underbrace{0.0 \dots}_{24} 011$$

$$x - x'' = \underbrace{0.0 - \dots}_{24} 001$$

$$x'' = \underbrace{1 \cdot 11 - \dots - 1}_{23}$$

$$x''' - x = \underbrace{0.00}_{24} - \dots - 01$$

$$\underline{x''' = 2.00 - - 0}$$

$$= \frac{1}{2^{24}}$$

## Instruction Set Architecture (Assembly Language).

Implementation of ISA should be simple.

Operations: Basic: +, -, \*, /

Source operands, Operation → Destination operand.

MIPS instruction "add \$a, \$b, \$c"  
processor family Destination

For  $a = b + c + d + e$

MIPS : add a, b, c

add a, a, d

- 11 - Page

*as* — , ,

Compiler :  $\frac{\text{HLL program}}{\downarrow}$   $\longrightarrow$  MIPS assembly language program  
High level lang.

Note: Eq. diff codes in HLL, lead to diff translations in MIPS assembly lang.

$$\text{Eg. } d = (b+c) - e \equiv \begin{array}{l} \text{add } d, b, c \\ \text{sub } d, d, e \end{array}$$

Spill: deallocate and store it in memory.

CLASSMATE  
Date \_\_\_\_\_  
Page \_\_\_\_\_

Operands come from register from register file or memory location (SRAM Cache or DRAM)

HLL program to Assembly language problem:

- Mapping HLL operators to computer instructions
- Mapping HLL operands to computer instruction operands.

In MIPS, many instructions allow only register operands

In x86, both register & memory operands are allowed.

Width of register file dictates width of register operand.  
(32/64 bit).

Mapping variables to operands:

- Register allocation of variables is compiler's responsibility  
Goal is to minimize no. of spills & spills.

Example: 4 registers: r1, r2, r3, r4.

C program:  $a = b + c;$  //  $a \rightarrow r1, b \rightarrow r2, c \rightarrow r3$

$d = e + f;$  //  $d \rightarrow r4$

$a = a + d;$

$b = a + e;$

$c$  is deallocated (without spill)

In step 2, deallocate  $b$ , because that would save a spill  
if we deallocate  $a$ , spill will be required. (to store  
updated value of  $a$  in memory).

Memory operands:

Direct memory addressing mode (requires 32 bit of address for 4 GB memory)

Computer doesn't always know address of all variables

Displacement mode of memory addressing:

$100(\$2) \rightarrow (100 + \text{contents of } \$2) \rightarrow \text{address of memory operand}$   
12 bits (7 for 100 & 5 for \$2).

100 → displacement

\$2 → base register

Careful regarding not affecting total access time by addition in this mode.

"load \$1, 100(\$2)" now takes 23 bits instead of 43 bits

$a = (\text{int} *)\text{malloc}(\text{size});$

$x = a[0] + r; \Rightarrow \begin{cases} \text{load } \$1, 0(\$2) \\ \text{add } \$1, \$1, 1 \\ \text{store } \$1, 0(\$2) \end{cases}$

$a[1] \Rightarrow 1(\$2).$