

# ECE 270 : Embedded Logic Design $\Rightarrow$ Co + DC



Resource for Quiz: [hobbits.0x3.net/wiky/Main\_Page](http://hobbits.0x3.net/wiky/Main_Page)  
Lab videos: YouTube

Programming: 1st half  $\rightarrow$  Verilog  
2nd half  $\rightarrow$  Embedded C

Theory: FPGA and SoC

Vivado 2019.1 (including SDK)  
*available on the Arch User Repository (AUR)*

**GRADES:**

| Component     | Weight |
|---------------|--------|
| Mid sem       | 30 %   |
| End sem       | 30 %   |
| Surprise quiz | 28 %   |
| Lab HW        | 15 %   |

## LECTURE: 1

\* Which is faster: analog or digital?

$\Rightarrow$  Depends on the use case.

\* No product is purely digital or purely analog. Why?

$\Rightarrow$  Signals in nature are analog; however, digital signals can be processed more easily and have more use cases.



HDL  $\rightarrow$  Hardware Description language  
 $\hookrightarrow$  eg  $\Rightarrow$  Verilog

## LECTURE : 2

\* **combinational circuit**: The output depends only on the present input (no memory of past inputs).

\* **sequential circuit**: The output depends on the present input and the present state of the circuit.  
what we get → output + next state  
because we go from one state to another

⇒ Note: sequential circuits take the clock as an input as well.

\* **D flip flop**: The input D is stored (captured) at the rising or the falling edge of the clock (edge triggered).



\* **Sequential circuit using combinational ckt**



\* **FSM (finite state machine)**

⇒ Up Counter



Note: if current state =  $S_n$ , the output is  $n$

2 bits (i.e. 2 flip flops) are required to store the 4 states
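A minimal Verilog sketch of this up-counter FSM, assuming a synchronous active-high reset. The module and port names (`up_counter_fsm`, `clk`, `rst`, `count`) are hypothetical. The only storage is the 2-bit state register; next state and output are combinational, and state $S_n$ outputs $n$ as in the note.

```verilog
// 2-bit up counter as an FSM: S0 -> S1 -> S2 -> S3 -> S0 ...
// Output equals the state index n (hypothetical names).
module up_counter_fsm (
    input            clk,
    input            rst,        // synchronous, active-high (assumed)
    output reg [1:0] count
);
    localparam S0 = 2'b00, S1 = 2'b01, S2 = 2'b10, S3 = 2'b11;

    reg [1:0] ps, ns;            // present state (2 flip-flops) and next state

    // state register: the only storage in the circuit
    always @(posedge clk)
        if (rst) ps <= S0;
        else     ps <= ns;

    // next-state logic (combinational)
    always @(*)
        case (ps)
            S0: ns = S1;
            S1: ns = S2;
            S2: ns = S3;
            default: ns = S0;    // S3 wraps around to S0
        endcase

    // output logic (combinational): state S_n outputs n
    always @(*)
        count = ps;
endmodule
```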



\* **Another example**



e.g. if  $PS = S_0$  ( $00$ )  $\Rightarrow$  output is  $3$  ( $11$ )  
i.e.  $NOT(0), NOT(0) \Rightarrow (11)$

{ Here we use 2 NOT gates, one for each of the 2 bits }

so, we are storing the current state + processing the inputs (using a comb ckt)

\* **Example : 3**

counter :  $2 \rightarrow 4 \rightarrow 6$



$S_0 = 00$

$S_1 = 01$

$S_2 = 10$

$S_3 = 11$

we are not storing  $2/4/6$  {output}  
we are storing  $S_0/S_1/S_2$  {state}

since we represent the states using 2 bits, we need 2 flip flops

use either a mux or a K-map to find a suitable comb ckt
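A minimal sketch of the $2 \rightarrow 4 \rightarrow 6$ counter, showing that only the state is stored and the values 2/4/6 are decoded from it. Assumptions (not from the lecture): synchronous active-high reset, a 3-bit output, the unused state $S_3$ returns to $S_0$, and the names `count_2_4_6`, `clk`, `rst`, `out` are hypothetical. Here a `case` statement stands in for the mux / K-map derivation.

```verilog
// Counter that outputs 2 -> 4 -> 6 -> 2 ...
// Only the state (S0/S1/S2) is stored; the output value is decoded from it.
module count_2_4_6 (
    input            clk,
    input            rst,          // synchronous, active-high (assumed)
    output reg [2:0] out
);
    localparam S0 = 2'b00, S1 = 2'b01, S2 = 2'b10;   // S3 = 2'b11 unused

    reg [1:0] ps;                  // 2 flip-flops hold the state

    // state register with the next-state logic folded in
    always @(posedge clk)
        if (rst) ps <= S0;
        else case (ps)
            S0:      ps <= S1;
            S1:      ps <= S2;
            default: ps <= S0;     // S2 and the unused S3 return to S0
        endcase

    // output decoder (combinational)
    always @(*)
        case (ps)
            S0:      out = 3'd2;
            S1:      out = 3'd4;
            S2:      out = 3'd6;
            default: out = 3'd0;   // unused state
        endcase
endmodule
```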

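ASIC / SoC vs FPGA

- ASIC / SoC: the design is described in terms of gates.
- FPGA: we use lookup tables (LUTs) to implement the logic.

Simulation is done using test benches; it doesn't guarantee functional correctness (only the tested cases are verified).

We can use the same test benches for the design's components.

(Place & route) the components are placed and connected with each other.

NOTE: no. of flip flops = $\lceil \log_2(\text{no. of states}) \rceil$ with binary state encoding (e.g. 4 states → 2 flip flops); the no. of flip flops equals the no. of states only with one-hot encoding.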

ASIC / SoC vs FPGA

Combinational ckt delay: here we need to run through the circuit with its placement requirement.

Post-place-and-route timing simulation: simulate again after routing in order to verify functionality and timing constraints.

If you identify problems in the timing report:

- increase the place effort level
- use re-entrant routing
- use multi-pass place and route
- re-design the logic paths to use fewer levels of logic, move to a faster device, or allocate more time for the paths.

Final Step: generating the bitstream (\*.bit), which is either

- downloaded directly to the FPGA, OR
- converted to a PROM file which contains the programming information.

- **ASIC design flow:** the routed design is used to generate the photomasks for producing the integrated circuits.  
 ASIC = Application Specific Integrated Circuit

FPGA

The application can be configured and changed even after fabrication.

HDL (e.g. VHDL) code can be converted to a .bit file and downloaded directly to the FPGA.

ASIC

The chip is fabricated for one specific application: we have to send our design to the fabrication organization (foundry).

~90% of devices are ASICs as of this moment.

## • APPLICATION SPECIFIC IC $\Rightarrow$ ASIC

- High non-recurring engineering (NRE) cost: the one-time cost to research, design, develop and test a new product {i.e. the cost required to fabricate the FIRST unit}
- High cost for engineering change orders, hence testing is critical
- Lowest price for high-volume production
- Fastest clock performance (high performance), because an ASIC is intended for a specific application
- Unlimited size and low power consumption
- Design and test tools are expensive
- Expensive IPs
- Steep learning curve

## \* MICRO-CONTROLLERS

A simple computer placed inside a single chip, with all the necessary components (memory, timers etc.) embedded inside; it performs a specific task.

sequential execution  
commands: one by one  
one task at a time

cannot carry out parallel operations

(For parallel operations, a GPU is designed instead.)

- consumes less power than an FPGA and is suitable for edge applications

## \* MICRO-PROCESSORS

ICs that contain a CPU and provide processing power.

- No on-chip peripherals like ROM or RAM (memory)

CPU → GPU → ASIC: flexibility decreases and efficiency increases from left to right.

Solution for the near future  $\equiv$  ARM + FPGA + GPU

microcontroller: time limited (executes one instruction at a time)

FPGA: space limited (limited logic resources)

## \* LAB : 1 Design of Full Adder

20/08/29



In hardware, execution happens in parallel, whereas in software we write programs that run sequentially.

Two major HDLs: Verilog & VHDL

Verilog:  
more popular  
syntax close to C  
easy to master  
more prominent in the Indian VLSI industry

1983: introduced by Gateway Design Automation

Invented as a SIMULATION language; SYNTHESIS was an afterthought.

1987: Verilog made synthesizable by Synopsys

1989: Gateway Design Automation acquired by Cadence

Latest Verilog version

= SystemVerilog

↳ much simpler to use than the initial versions

1981 - 1983: US Dept of Defense developed VHDL (VHSIC HDL)

very high speed integrated circuit hardware description language

open (public) standard, unlike Verilog  
(proprietary at the time)

Afraid of losing market share, Cadence opened Verilog to the public domain (1990)

1995: became IEEE standard 1364

Hardware : parallel processing  
Software : sequential processing

In Verilog, all statements (continuous assignments, always blocks, module instances) execute concurrently, unlike languages such as C, C++, etc.

Verilog looks like C but describes hardware

Understand the circuit and specifications then figure out the code

## \* VERILOG

- Verilog HDL is case sensitive
- all keywords are in lower case
- statements terminated by semi-colon ;
- Two data types : Net (wire) → default datatype  
variable (Reg, Integer, real, time, realtime)
- Primitive Logic Gates and Switch-Level gates are built in

## \* EXAMPLES



AND gate:

| in1 | in2 | out |
|-----|-----|-----|
| 0   | 0   | 0   |
| 0   | 1   | 0   |
| 1   | 0   | 0   |
| 1   | 1   | 1   |

```verilog
// module_name <ports>
module AND (out, in1, in2);
  input in1, in2;
  output wire out;
  // in1 and in2 are also of wire datatype,
  // since wire is the default type
  assign out = in1 & in2;
  // data flow - continuous assignment
endmodule
```

## \* 4BIT FULL-ADDER

⇒ Half Adder

| x | y | C | S |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 |

⇒ Full Adder

| C <sub>i</sub> | X <sub>i</sub> | Y <sub>i</sub> | C <sub>i+1</sub> | S <sub>i</sub> |
|----------------|----------------|----------------|------------------|----------------|
| 0              | 0              | 0              | 0                | 0              |
| 0              | 0              | 1              | 0                | 1              |
| 0              | 1              | 0              | 0                | 1              |
| 0              | 1              | 1              | 1                | 0              |
| 1              | 0              | 0              | 0                | 1              |
| 1              | 0              | 1              | 1                | 0              |
| 1              | 1              | 0              | 1                | 0              |
| 1              | 1              | 1              | 1                | 1              |

```verilog
module full_adder_1bit (
  input  FA1_InA,
  input  FA1_InB,
  input  FA1_InC,      // carry in
  output FA1_OutSum,
  output FA1_OutC      // carry out (no comma after the last port)
);
  assign FA1_OutSum = FA1_InA ^ FA1_InB ^ FA1_InC;
  // carry out = A.B + C_in.(A xor B)
  assign FA1_OutC = (FA1_InA & FA1_InB) | (FA1_InC & (FA1_InA ^ FA1_InB));
endmodule
```



Homework: currently the output = 5 bits

HW → make the output 4 bits, with 1 separate bit for overflow
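One possible sketch for this homework, assuming "overflow" means the carry out of bit 3 (unsigned overflow); it also shows the signed (2's complement) overflow flag, which is the XOR of the carries into and out of the MSB. The helper `fa_bit` and the module `adder4` are hypothetical names; `fa_bit` uses the same full-adder equations as the 1-bit adder above.

```verilog
// 1-bit full adder (hypothetical helper name)
module fa_bit (input a, b, cin, output sum, cout);
    assign sum  = a ^ b ^ cin;
    assign cout = (a & b) | (cin & (a ^ b));
endmodule

// 4-bit ripple-carry adder: 4-bit sum plus separate carry-out and overflow flags
module adder4 (
    input  [3:0] a, b,
    input        cin,
    output [3:0] sum,
    output       cout,   // carry out of bit 3: unsigned overflow
    output       ovf     // signed (2's complement) overflow
);
    wire [4:0] c;        // c[i] is the carry into bit i
    assign c[0] = cin;

    // one full adder per bit; each carry ripples into the next stage
    fa_bit fa0 (a[0], b[0], c[0], sum[0], c[1]);
    fa_bit fa1 (a[1], b[1], c[1], sum[1], c[2]);
    fa_bit fa2 (a[2], b[2], c[2], sum[2], c[3]);
    fa_bit fa3 (a[3], b[3], c[3], sum[3], c[4]);

    assign cout = c[4];
    assign ovf  = c[3] ^ c[4];   // carry into MSB differs from carry out of MSB
endmodule
```

For example, 9 + 8 gives `sum = 4'd1` with `cout = 1`, and 7 + 1 gives `sum = 4'd8` with `ovf = 1` (7 + 1 does not fit in 4-bit signed).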





(Illegible code fragment: a module with `input [..:0]` and `output reg [..:0]` ports and an `always @ (*)` block.)

[Figure: timing diagram of a D flip-flop showing the data input d, the clock clk and the output.]

- Asynchronous reset: takes effect at any time, irrespective of the clock.

- Synchronous reset: takes effect only at the active clock edge (the reset signal is sampled like any other input).

[Figure: D flip-flop with reset input rst, data input D and outputs Q and Q-bar.]

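```verilog
// D flip-flop with synchronous, active-low reset (module name assumed)
module dff_sync_rst (
    input      clk,
    input      rst,     // active low: q is cleared at the clock edge when rst = 0
    input      d,
    output reg q
);
    always @ (posedge clk)
    begin
        if (!rst)
            q <= 1'b0;
        else
            q <= d;
    end
endmodule
```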



- Note: is the `reg` datatype always linked to a register (flip flop) in the hardware?  
Not necessarily  
eg: not in mux code (a `reg` assigned in `always @ (*)` becomes combinational logic)  
but yes in register code and flip flop code
  - \* MODULE PORTS: provide the interface through which the module communicates with its environment  
 $\begin{matrix} \swarrow & \downarrow & \searrow \\ \text{input} & \text{output} & \text{inout} \end{matrix}$
  - Declaration: `<port direction> <width> <port_name>`
  - Port direction can be input / output / inout
- Port connection rules:
  - INPUT: driven from outside by a reg or a wire; inside the module it is a wire.
  - OUTPUT: inside the module it is a reg or a wire; outside it connects to a wire.
  - INOUT: a wire on both sides.
- An input port is driven by an external entity
  - An output port is driven by an internal entity
  - An inout port is driven by both internal and external entities
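A minimal sketch of the three port directions in one module. The names (`bidir_pin`, `oe`, `dout`, `din`, `pad`) are hypothetical. The inout `pad` is a wire on both sides: the module drives it only while `oe` is high and otherwise releases it to high impedance (`1'bz`) so an external entity can drive it.

```verilog
// Port directions in one module (hypothetical names).
module bidir_pin (
    input  oe,        // input: driven by the external entity
    input  dout,      // value this module wants to put on the pin
    output din,       // output: driven by this module
    inout  pad        // inout: driven by both sides (one at a time)
);
    assign pad = oe ? dout : 1'bz;   // tri-state driver: release the line when oe = 0
    assign din = pad;                // read back whatever is on the pin
endmodule
```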

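\* Module Interconnections
- ⇒ Named Association: ports are connected by name, so the order does not matter
- e.g. `FA fa_by_name (.cout(COUT), .sum(SUM), .b(B), .a(A), .cin(Cin));`
- ⇒ Order Association: signals are connected in the same order as the ports in the module declaration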
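A small sketch contrasting both styles, assuming a full adder `FA` whose ports are declared in the order `(a, b, cin, sum, cout)`; the wrapper `fa_connect_demo` and its signal names are hypothetical. Both instances compute the same thing; only the way signals are attached to ports differs.

```verilog
// Full adder with ports in the order (a, b, cin, sum, cout)
module FA (input a, b, cin, output sum, cout);
    assign {cout, sum} = a + b + cin;   // 2-bit result: carry and sum
endmodule

module fa_connect_demo (
    input  A, B, CIN,
    output SUM1, COUT1, SUM2, COUT2
);
    // named association: the order of the .port(signal) pairs does not matter
    FA fa_by_name (.cout(COUT1), .sum(SUM1), .b(B), .a(A), .cin(CIN));

    // order (positional) association: signals must follow the port order
    FA fa_by_order (A, B, CIN, SUM2, COUT2);
endmodule
```

Named association is usually safer: reordering the ports of `FA` later would silently break the positional instance.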

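\* Coding a 4:1 Mux using 3× 2:1 Muxes (first, the 2:1 mux)

```verilog
module mux2 (input a, b, sel, output out);
    // 2:1 mux: sel = 1 selects a, sel = 0 selects b
    assign out = (a & sel) | (b & ~sel);
endmodule
```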
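A sketch of the 4:1 mux built from three 2:1 muxes, assuming the select encoding `sel = 2'b00 → d0 … 2'b11 → d3` (the lecture does not fix one) and hypothetical names `mux4`, `d0..d3`. Two first-level muxes choose within each pair using `sel[0]`, and the third chooses between the pairs using `sel[1]`.

```verilog
// 2:1 mux: sel = 1 selects a, sel = 0 selects b
module mux2 (input a, b, sel, output out);
    assign out = (a & sel) | (b & ~sel);
endmodule

// 4:1 mux built from three 2:1 muxes
// sel = 2'b00 -> d0, 2'b01 -> d1, 2'b10 -> d2, 2'b11 -> d3
module mux4 (
    input        d0, d1, d2, d3,
    input  [1:0] sel,
    output       out
);
    wire lo, hi;
    mux2 m_lo  (.a(d1), .b(d0), .sel(sel[0]), .out(lo));   // picks d0 / d1
    mux2 m_hi  (.a(d3), .b(d2), .sel(sel[0]), .out(hi));   // picks d2 / d3
    mux2 m_out (.a(hi), .b(lo), .sel(sel[1]), .out(out));  // picks lo / hi
endmodule
```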

- Sized: the width is given explicitly, e.g. `4'b1011`
- Unsized: the width is at least 32 bits (32 in most tools)

- In a radix of binary (`'b`), hexadecimal (`'h`), octal (`'o`) or decimal (`'d`, the default)

| Literal  | Equivalent to                |
|----------|------------------------------|
| `549`    | `32'd549`                    |
| `'h8FF`  | `32'h8FF`                    |
| `'o765`  | `32'o765`                    |
| `4'b11`  | `4'b0011` (zero-extended)    |
| `8d9` ✗  | `8'd9` ✓ (= `8'h9`)          |

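- \* Negative Numbers
  - a minus sign goes before the size, e.g. `-8'd5` (stored in 2's complement); `s` marks a literal as signed, e.g. `8'sd5`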

## \* LAB: ELD

27/08/24 8:30AM

- For the counter to increment every second: design an 8 bit up counter ( $0 \rightarrow 255$ ) using behavioural modelling
- Design 1Hz clock from input 100MHz clock using clock divider
- Lab HW: Design up/down counter with maximum count of 85
- Write a verilog code where output is delayed version of input by 1 clk cycle

Ans) just make a D flip flop

Note: if we want 3 cycle delay, we pass the input through 3 D flip flops



```verilog
module delay_2(Din, CLK, out);
    input  Din, CLK;
    output out;          // driven by an instance output, so it is a wire here
    wire   Q;            // connects F1's output to F2's input

    D_FF F1(Din, CLK, Q);
    D_FF F2(Q, CLK, out);
endmodule
```

\*Note: Q and out are `reg` inside each D_FF block, but they are `wire` in this module because here they are driven by the output ports of the D_FF instances. Look up the previous lecture for the same.

Block diagram: `Din → [F1] → Q → [F2] → out` inside `delay_2` (Q and out are `reg` inside F1 and F2).
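The note above (3-cycle delay = 3 D flip flops in series) generalizes to N flip flops. A minimal sketch, assuming a hypothetical parameterized module `delay_n` (not from the lab) with N ≥ 2; the chain of flip flops is written as one N-bit register shifted with non-blocking assignments, which synthesizes to the same series of flip flops.

```verilog
// N D flip-flops in series give an N-clock-cycle delay (N = 3 by default).
// Hypothetical module; requires N >= 2 because of stage[N-2:0].
module delay_n #(parameter N = 3) (
    input  Din, CLK,
    output out
);
    reg [N-1:0] stage;                // stage[i] is the output of flip-flop i

    always @(posedge CLK)
        stage <= {stage[N-2:0], Din}; // each flip-flop takes its neighbour's old value

    assign out = stage[N-1];
endmodule
```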

- 8 bit up counter
  - (1) block diagram
  - (2) define all signals
  - (3) write code



number of flip flops = 8

because number of states = 256

Note: in the flip flop, we just store the state of the circuit

$$NS = PS + 1$$

```verilog
module counter(
    input        CLK, reset,
    output [7:0] count
);
    reg  [7:0] PS;           // present state (8 flip flops)
    wire [7:0] NS;           // next state

    assign NS = PS + 1'b1;   // NS = PS + 1

    // flip flop: synchronous, active-high reset
    always @ (posedge CLK)
    begin
        if (reset)
            PS <= 8'b00000000;   // same as PS <= 8'd0
        else
            PS <= NS;
    end

    // finally, assigning the output
    assign count = PS;       // if we take count as wire

    // or (same functionality), if count is declared as output reg [7:0]:
    // always @ (PS)
    // begin
    //     count = PS;
    // end
endmodule
```

- Note: we need to define NS and PS here because they should be 8bit each but by default size = 1bit

Synchronous active high reset  $\Rightarrow$  D flip flop

- Testbench:

The testbench Verilog file is higher than the source file in the hierarchy: the testbench instantiates the source (design) module to drive its inputs and observe its outputs.

## \* LECTURE : 6 (Architecture)

27/08/24

3 - 4:30PM

- Programmable Logic Device (PLD)
  - Devices whose internal architecture is predefined by the manufacturer, but which are created in a way that they can be configured in the field to perform a variety of functions
- Contrast: programmability at the software level  
eg: Arduino / RPi Pico  
But you cannot change the instruction set architecture of the CPU

### \* Fusible Link Technology



### \* PROM: programmable read-only memory (1970)



- blow the fuses as per your logic
  - one-time programmable
  - Single PROM instead of multiple chips
    - smaller
    - lighter
    - cheaper
    - less prone to errors (fewer solder joints)
    - easy to identify errors / correct errors
  - Designed for use as memories to store computer programs and constant data values
  - Also useful for implementing simple logic functions such as LUTs & state machines
- Very large number of fuses required for complex CKTs  
programmable once only  
a blown fuse cannot be restored

### \* EPROM: Erasable PROM (1971: by intel)

- can be erased (UV rays)
- multiple time programmable
- smaller in size than fusible linked devices
- the IC has to be removed from the board and put in a UV container; erasing takes minutes, then it is put back
- whole thing is erased all together
- Cannot erase sections of the device.
- expensive
- erasing process becomes complex as density of transistors increases

### \* EEPROM: Electrically Erasable PROM

### \* PLA: Programmable Logic Arrays (1975)

- High delay
- Makes the AND plane (the left side of the PROM) programmable as well, in addition to the OR plane
- Did not get adopted because people were more comfortable with SOF form

### \* Programmable Logic Device

→ SPLD: Simple PLD

→ CPLD: Complex PLD

- Complex PLDs (CPLD)

1984

→ Need for bigger (functionally), smaller (size), faster and cheaper technology

→ MegaPAL: interconnection of 4 PALs (Programmable Array Logic); high power consumption

→ 1984: Altera introduced the CPLD using CMOS (high density, low power) and EPROM / EEPROM (E<sup>2</sup>PROM) (programmability)

CPLD = multiple SPLDs

→ Added multiplexers to each SPLD so that only the necessary signals are processed, to combat the higher power consumption

[Figure: Altera CPLD: SPLD-like blocks communicating through a programmable interconnect matrix, with input / output pins.]

Signals from a logic block can travel through adjacent blocks only

LUT: lookup table

[Figure: programmable 3-input LUT: 8 SRAM cells, addressed by the inputs 000 to 111.]

Programmable logic blocks (EEPROM or SRAM based): SRAM is volatile, so the configuration data is erased as soon as the power is cut.


## • LUT as memory & LUT as ALU

↳ LUT = lookup table

→ Operations on memory: read or write

- read: we provide the address of the memory location and get the data stored there
- write: you provide the data and the address

[Figure: memory with 1-byte locations selected by the address.]



## \* FPGA

CLB: configurable logic block

BRAM: block RAM (used to store large amt of data)

input/output block

CMT: clock management tile

FIFO logic

BUFG: Global buffer

DSP: digital signal processing

BUFIO and BUFR: Input/output & Regional buffer

MGT: multi-gigabit transceiver

- we use one oscillator to generate one clock
- we use the CMT block to generate clocks other than the one generated by the oscillator

## \* Configurable Logic Block [CLB]



| Slices | LUT | Flip Flops | Arithmetic & carry chains |
|--------|-----|------------|---------------------------|
| 2      | 8   | 16         | 2                         |

(4 LUTs/slice) (8 FFs/slice)  
each LUT = 6-input LUT

each LUT stores $2^6 = 64$ bits of data (the 6 input bits select one of 64 locations)

$8 \text{ LUTs} \times 64 \text{ bits} = 512$ bits of LUT data in one CLB

## • CLB : SLICES

① SLICEM: full slice (M = memory)

- LUT can be used for logic (combinational circuits) and as memory (distributed RAM: read and write) / SRL (shift register)

② SLICEL: logic and arithmetic only (L = logic)

- LUT can only be used for logic (read only: not as memory / SRL)



## • Memory vs Vector

- Vector and memory declarations are not the same

- in a vector, all bits can be assigned a value in one statement

- in a memory, each location is assigned separately.

```verilog
reg [7:0] vect = 8'b10100011;  // vector: all 8 bits in one statement

reg array [7:0];               // memory: 8 locations of 1 bit

array[7] = ...;
array[6] = ...;
// ...
array[0] = ...;
```
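A small simulation-only sketch of the difference (the module name `vec_vs_mem` is hypothetical and the `initial` block is for illustration, not synthesis): the vector is set in one statement, while the 1-bit memory has to be filled and read one location at a time.

```verilog
// Vector vs memory (hypothetical, simulation-only illustration)
module vec_vs_mem (output [7:0] vect_out, output [7:0] mem_out);
    reg [7:0] vect;          // vector: one 8-bit value
    reg       array [7:0];   // memory: 8 locations of 1 bit each
    integer   i;

    initial begin
        vect = 8'b1010_0011;              // whole vector in one statement
        for (i = 0; i < 8; i = i + 1)
            array[i] = vect[i];           // memory: one location at a time
    end

    assign vect_out = vect;
    // a memory cannot be used as one value; read each location separately
    assign mem_out  = {array[7], array[6], array[5], array[4],
                       array[3], array[2], array[1], array[0]};
endmodule
```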



## \* LAB:3 (Running on hardware)

03/09/24

⇒ Let's say our program is an 8-bit upcounter  
the program will run on the hardware  
but how will you observe the output?

- ① One way could be using LEDs (8 of them for 8 bits) to represent the output physically on the board.
- ② Also, we need an additional circuit (frequency divider) to convert the 100MHz oscillator clock → 1Hz
- ③ Also, how will you implement reset signal?  
How about a physical button connected to the board.

Let's take example of 3 bit up counter



here, duty cycle is 50%



We need to find $\chi$ (the CMT output frequency) such that $\chi > 4 \text{ MHz}$, and the number of bits of the frequency divider

$$\text{here } \frac{\chi}{2^{23}} = 1 \text{ Hz} \Rightarrow \chi = 2^{23} \text{ Hz} = 8388608 \text{ Hz} \approx 8.388 \text{ MHz}$$

(= clock division)  
and hence the frequency divider / counter  
is 23 bits

To get 2Hz as output from above,

we can change the no. of bits of the counter to 22 ✓ ($2^{23} / 2^{22} = 2$ Hz)
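A minimal frequency-divider sketch for this calculation, assuming the divided clock is taken from the MSB of a free-running N-bit counter (the module `clk_divider` and its ports are hypothetical). The MSB stays low for $2^{N-1}$ input cycles and high for $2^{N-1}$, so $f_{out} = f_{in} / 2^N$ with a 50% duty cycle: $N = 23$ on an $8.388608$ MHz clock gives 1 Hz, and $N = 22$ gives 2 Hz.

```verilog
// Frequency divider: f_out = f_in / 2^N, 50% duty cycle (hypothetical module)
module clk_divider #(parameter N = 23) (
    input  clk_in,
    input  rst,             // synchronous, active-high (assumed)
    output clk_out
);
    reg [N-1:0] cnt;

    always @(posedge clk_in)
        if (rst) cnt <= {N{1'b0}};
        else     cnt <= cnt + 1'b1;

    assign clk_out = cnt[N-1];   // the counter MSB is the divided clock
endmodule
```

In a real design the divided signal is often used as a one-cycle enable pulse instead of a clock, to keep the whole design on one clock; the lab uses it directly as a slow clock.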



We need 4 modules (codes):

→ CMT (clocking wizard)

→ frequency divider

→ counter

→ top module to connect them all

Notes:

- ① How to get the CMT in Vivado?

IP Catalog → Clocking Wizard → Output Clocks  
input clock (clk_100m)  $\equiv 100\text{MHz}$  
output clock  $\equiv 8.388\text{MHz}$

Note: the "locked" signal goes high when the output clock is stable at the intended frequency

after generating the clocking wizard,  
we need to instantiate it

To instantiate the clocking wizard:

SOURCES  
↓  
IP Sources  
↓  
clk_wiz_0  
↓  
Instantiation Template  
↓  
clk_wiz_0.veo

copy the Verilog code  
from here into top-count.v



## \* Lecture : 8

03/09/21

To get stats about the project go to project summary tab on Vivado

⇒ vector & memory

`reg [7:0] my_reg [0:31];`

↳ memory with 32 locations of 8 bits each

`integer matrix[4:0][0:31];`

↳ 2-dimensional memory

`wire [1:0] reg1 [0:3];`

`wire [1:0] reg2 [3:0];`

`array2[100][7][31:24];`

↳ 4th byte (bits 31:24) of the element at index [100][7], i.e. the 101<sup>st</sup> column and 8<sup>th</sup> row

`reg [31:0] Data_RAM [0:255];`

read the 2nd byte from address 11 (index)

`Data_RAM[11][15:8]`

2nd and 3rd bytes from address 77

`Data_RAM[77][23:8]`
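A small simulation sketch of these two reads, using made-up contents for words 11 and 77 (the data values and the module name `ram_index_demo` are hypothetical). The first index picks the word, the following part-select picks bytes inside it.

```verilog
// Memory indexing: 256 words of 32 bits (hypothetical contents)
module ram_index_demo (
    output [7:0]  byte2_addr11,     // 2nd byte of word 11
    output [15:0] bytes23_addr77    // 2nd and 3rd bytes of word 77
);
    reg [31:0] Data_RAM [0:255];

    initial begin
        Data_RAM[11] = 32'hDDCC_BBAA;   // bytes (MSB..LSB): DD CC BB AA
        Data_RAM[77] = 32'h4433_2211;
    end

    assign byte2_addr11   = Data_RAM[11][15:8];   // -> 8'hBB
    assign bytes23_addr77 = Data_RAM[77][23:8];   // -> 16'h3322
endmodule
```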

printf : C :: \$display : Verilog

## \* Vector indexing

`reg [63:0] word;`

`reg [3:0] byte_num;`

`reg [7:0] byteN;`

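`byteN = word[byte_num * 8 +: 8];`

e.g. for `byte_num = 4`:

`word[4*8 +: 8]` = `word[32 +: 8]` (forward direction) = `word[39:32]`

`word[7 +: 16]` = `word[22:7]`

`a[31 -: 8]` = `a[31:24]`: 8 bits of data from index 31, in the backwards direction

```verilog
// assuming str = "abcde" (5 characters of 8 bits each)
for (i = 0; i < 5; i = i + 1)
    $display("%s", str[i*8 +: 8]);
```

⇒ edcba (one character per line, since the last character is stored in the lowest byte)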
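A combinational sketch of the indexed part-selects above; the module `part_select_demo` is hypothetical, and `byte_num` is narrowed to 3 bits (an assumption) so that it can only address the 8 bytes of a 64-bit word. `base +: width` selects `[base+width-1 : base]` and `base -: width` selects `[base : base-width+1]`; only the base may be a variable, the width must be a constant.

```verilog
// Indexed part-select: base +: width counts up, base -: width counts down
module part_select_demo (
    input  [63:0] word,
    input  [2:0]  byte_num,        // which byte (0 = least significant)
    output [7:0]  byteN,
    output [15:0] up16,            // word[7 +: 16]  == word[22:7]
    output [7:0]  down8            // word[31 -: 8]  == word[31:24]
);
    assign byteN = word[byte_num*8 +: 8];   // variable base, constant width
    assign up16  = word[7 +: 16];
    assign down8 = word[31 -: 8];
endmodule
```

For `word = 64'h0123_4567_89AB_CDEF` and `byte_num = 4`, `byteN` is `8'h67`, `up16` is `16'h579B` and `down8` is `8'h89`.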

## \* Verilog : Register vs Integer

- reg is by default a 1 bit wide data type. If more bits are required, we use a range declaration.

- integer is a 32 bit wide (signed) data type.

- integer cannot change its width. It is fixed.

- Not much utility as compared to reg / net

- Typically used for constants or loop variables

- Vivado automatically trims the unused bits of integers.

eg: `integer i = 255;`

→ then i is synthesized with 8 bits

## \* OPERATORS

{  
↳ Unary  
↳ Binary  
↳ Ternary } based on the number of operands

`a = ~b;`

`a = a && b;`

`a = b ? c : d;`

## \* BUS OPERATORS

[ ] Bit/Part Select  $A[0] = 1'b1$

{ } Concatenation  $\{A[5:2], A[7:6], 2'b01\}$

{x{y}} Replication  $\{3\{A[7:6]\}\}$

$$= 6'b101010 \quad \text{(if } A[7:6] = 2'b10\text{)}$$

<< x: shift left logical  $\times 2^x$

>> x: shift right logical  $\div 2^x$

shifting bits is very cheap (= signal rerouting)  
used to perform multiplication and division by powers of 2.

eg:  $6\ (=4'b0110) \xrightarrow{\ll 1} 4'b1100\ (=8+4=12)$

$\quad\;\; 6\ (=4'b0110) \xrightarrow{\gg 1} 4'b0011\ (=2+1=3)$

works perfectly only for unsigned numbers

when working with signed numbers,  
during a right shift (towards the LSB) you need to retain the sign bit

OR just use shift right arithmetic (>>>)  
works for signed 2's complement

no such problems when shifting towards the MSB (<<)

`assign {b[7:0], b[15:8]} = {a[15:8], a[7:0]};`

↳ byte swap

eg:  $a = 16'h00AB$ (i.e. $8'hAB$ zero-extended)

$\Rightarrow b = 16'hAB00$
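A combinational sketch collecting these bus operators in one module (the module `bus_ops_demo` and its port names are hypothetical). Note that `>>>` only shifts in copies of the sign bit when its left operand is declared `signed`.

```verilog
// Bus operators: replication, concatenation, shifts and the byte swap
module bus_ops_demo (
    input  [7:0]        A,
    input  [15:0]       a,
    input  signed [7:0] s,
    output [5:0]        rep,       // {3{A[7:6]}}
    output [7:0]        cat,       // {A[5:2], A[7:6], 2'b01}
    output [15:0]       b,         // a with its two bytes swapped
    output [7:0]        shl, shr,  // A << 1 and A >> 1 (logical, zeros shifted in)
    output signed [7:0] sra        // s >>> 1 (arithmetic, keeps the sign bit)
);
    assign rep = {3{A[7:6]}};
    assign cat = {A[5:2], A[7:6], 2'b01};
    assign {b[7:0], b[15:8]} = {a[15:8], a[7:0]};
    assign shl = A << 1;
    assign shr = A >> 1;
    assign sra = s >>> 1;
endmodule
```

With `A = 8'b1011_0110`, `rep` is `6'b101010`; with `s = -6`, `sra` is `-3` (`8'hFD`), whereas a logical `>>` would have given `8'h7D`.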

- Note: arithmetic shift left (<<<) is the same as logical shift left (<<)

- For multiplication, FPGAs have DSP48 slices dedicated to fast math.

However, if there are no free DSPs, the multiplier is built from general logic (LUTs), which is large and slow.

- For multiplication of two N bit numbers, the result is at most $2N$ bits wide
  - make sure to define your variables and their sizes explicitly.

Bitwise operators: operate on each bit individually; the output can be multi-bit

| Operator | Operation      |
|----------|----------------|
| `~`      | NOT (inverter) |
| `&`      | AND            |
| `\|`     | OR             |
| `^`      | XOR            |
| `~^`     | XNOR           |

  - if the two operands are of different lengths, the shorter one is extended on the MSB side: with zeros by default, with the sign bit only when both operands are signed  
 ↴ so the number of gates required:  $\max\{\text{len}(A), \text{len}(B)\}$
  - by default everything is unsigned
  - this is how we can tell the tool that we want a signed operation:

`assign out = ($signed(a)) < ($signed(b));`  
 or we can declare the values as signed (e.g. `wire signed [7:0] a;`) and do the operation normally.
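A small sketch showing why `$signed` matters: the same bit patterns compare differently depending on signedness (the module `signed_cmp_demo` is hypothetical). `8'hFF` is 255 as unsigned but −1 in 2's complement.

```verilog
// Same bits, different comparison depending on signedness (hypothetical module)
module signed_cmp_demo (
    input  [7:0] a, b,
    output lt_unsigned,   // a < b treating both as unsigned (the default)
    output lt_signed      // a < b treating both as 2's complement
);
    assign lt_unsigned = a < b;
    assign lt_signed   = $signed(a) < $signed(b);
endmodule
```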

Logical operators:  
 ! NOT  
 && AND  
 || OR  
 == EQUAL  
 != NOT EQUAL  
 $<, >, \leq, \geq$  COMPARISON

The output of these operators is 1 bit: 1 (TRUE) or 0 (FALSE). A multi-bit operand counts as TRUE if any of its bits is 1.

$\Rightarrow$  Reduction Operators: operate on all the bits of one operand; the output is also one bit

| Operator | Operation |
|----------|-----------|
| `&`      | AND       |
| `~&`     | NAND      |
| `\|`     | OR        |
| `~\|`    | NOR       |
| `^`      | XOR       |
| `~^`     | XNOR      |

notation: `<operator><operand>`, eg: `~&A` = `~(A[n-1] & ... & A[0])`
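A sketch of the reduction operators on a 4-bit input (the module `reduction_demo` is hypothetical). A common use of `^A` is a parity bit: it is 1 when A has an odd number of 1s.

```verilog
// Reduction operators collapse a vector to 1 bit (hypothetical module)
module reduction_demo (
    input  [3:0] A,
    output all_ones,   // &A  = A[3] & A[2] & A[1] & A[0]
    output any_one,    // |A  = A[3] | A[2] | A[1] | A[0]
    output parity,     // ^A  = 1 when A has an odd number of 1s
    output nand_r      // ~&A = ~(&A)
);
    assign all_ones = &A;
    assign any_one  = |A;
    assign parity   = ^A;
    assign nand_r   = ~&A;
endmodule
```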

$\Rightarrow$  Conditional Operators: condition ? true\_val : false\_val

2:1 mux  $\rightarrow$  sel ? a : b

\* PRACTICE:

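```verilog
module max (
    input  [7:0] a, b, c,   // 8-bit width assumed; 1-bit inputs make max trivial
    output [7:0] out
);
    // largest of the three inputs using nested conditional operators
    assign out = (a > b) ? ((a > c) ? a : c)
                         : ((b > c) ? b : c);
endmodule
```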

Hw: design a 4:1 mux using conditional operators  
design a 1 bit equality comparator using conditional operators
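One possible sketch for the homework, assuming the select encoding `sel = 2'b00 → a, 01 → b, 10 → c, 11 → d`; the module names `mux4_cond` and `eq1` are hypothetical. The mux nests conditional operators the same way the `max` example does: `sel[1]` picks the pair, `sel[0]` picks within the pair.

```verilog
// 4:1 mux using only conditional operators
module mux4_cond (
    input        a, b, c, d,    // a for sel = 00, b = 01, c = 10, d = 11
    input  [1:0] sel,
    output       out
);
    assign out = sel[1] ? (sel[0] ? d : c)
                        : (sel[0] ? b : a);
endmodule

// 1-bit equality comparator using a conditional operator
module eq1 (input x, y, output eq);
    assign eq = (x == y) ? 1'b1 : 1'b0;
endmodule
```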

used most in: synthesizable code & test benches

- 2 basic blocks: always & initial
  - in behavioural modelling
  - both run in parallel (one will not block the execution of the other blocks)
- All of them start at simulation time 0 (#0)
- INITIAL BLOCK
  - starts at #0 and executes only once.
  - we cannot use "always" inside "initial" & vice versa
  - used for initializing / setting global constants

```verilog
initial
begin
    // ...
    b = #50 c & d;   // c & d is calculated now, but assigned to b 50 units later
    #50 b = c & d;   // wait 50 units, then calculate & assign
    #70 $finish;     // wait 70 more units, then end the simulation
end
```

In the hardware, delay is created in context of clock cycles instead of seconds.

- **ALWAYS BLOCK**
  - Starts at  $t=0$
  - Executes statements continuously in a loop
  - Statements inside an always block are executed either sequentially (`=`, blocking assignment) or in parallel (`<=`, non-blocking assignment)
  - describes the functionality of the circuit
  - used for clock generation (in test benches)

Three forms: `always` (no sensitivity list, e.g. `always #5 clk = ~clk;`), `always @(*)` (combinational), `always @(posedge ...)` (sequential)

\* Combinational Circuits using always

- Common ERRORS
  - because of parallel execution, one variable cannot be the output of 2 blocks

⇒ ERROR: Multi-driver error  
the same variable is driven by two blocks  
Synthesis error

eg:

```verilog
always @ (posedge clk)
begin
    if (rst_n)
        Q <= D;
end

always @ (negedge rst_n)
begin
    if (!rst_n)
        Q <= 1'b0;
end
```

- Q is being updated by two blocks simultaneously
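A common fix for this multi-driver example is to drive Q from a single always block with an asynchronous active-low reset; the module name `dff_async_rst` is hypothetical. Because `negedge rst_n` is in the sensitivity list, Q clears as soon as `rst_n` falls, without waiting for a clock edge.

```verilog
// Q driven from ONE always block: asynchronous, active-low reset
module dff_async_rst (
    input      clk, rst_n, D,
    output reg Q
);
    always @(posedge clk or negedge rst_n)
        if (!rst_n) Q <= 1'b0;   // reset wins immediately, clock not needed
        else        Q <= D;      // otherwise capture D on the rising edge
endmodule
```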

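## \* Parameters

```verilog
module something #(
    parameter foo = 1'b0     // default value; can be overridden per instance
) (
    output out
);
    assign out = foo;
endmodule
```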
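A sketch of a parameterized mod-N up counter (the module `mod_counter` and its parameters `WIDTH` and `MAX` are hypothetical). The defaults give a 6-bit 0 → 59 counter, as needed for the seconds of a digital clock, and each instance can override them, e.g. `mod_counter #(.WIDTH(4), .MAX(9)) u_digit (.clk(clk), .rst(rst), .count(digit));` for one decimal digit.

```verilog
// Parameterized mod-N up counter (hypothetical); defaults count 0..59
module mod_counter #(
    parameter WIDTH = 6,          // 6 bits: enough for 0..59
    parameter MAX   = 59          // last value before wrapping to 0
) (
    input                  clk,
    input                  rst,   // synchronous, active-high (assumed)
    output reg [WIDTH-1:0] count
);
    always @(posedge clk)
        if (rst || count == MAX) count <= 0;
        else                     count <= count + 1'b1;
endmodule
```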

\* Digital clock (minutes : seconds)  
(behavioural modelling)

seconds: $0 \rightarrow 59$, then wrap to 0 and increment minutes ($0 \rightarrow 59$)

each is a 6 bit up counter ($2^6 = 64 \geq 60$)

HW: modify the digital clock so that the output of the CMT (clock management tile) block is 16.777 MHz  
⇒ 24 bit size of the frequency divider counter ($2^{24} = 16777216$)



sec\_reg: output of the sequential ckt (flip flop)

sec\_next: output of the combinational ckt, hence we cannot initialise it

eg: we do not initialise the output of an AND ckt

if we initialise it, the tool might ignore the 2nd always block (the one whose o/p is sec\_next)

ff: takes the next value and assigns it to the current value

note: we write the minutes logic and the seconds logic in separate always blocks due to parallel processing
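A minimal sketch of the minutes : seconds clock using the `_reg` / `_next` pattern described above, with one always block per variable. Assumptions (not from the lecture): `tick` is a one-clock-cycle pulse once per second (e.g. derived from a frequency divider), reset is synchronous active-high, and the module and port names (`mm_ss`, `clk`, `rst`, `tick`, `sec`, `min`) are hypothetical.

```verilog
// Minutes:seconds counter with the reg/next pattern (hypothetical names)
module mm_ss (
    input        clk, rst, tick,     // tick: 1-cycle pulse once per second (assumed)
    output [5:0] sec, min
);
    reg [5:0] sec_reg, sec_next;     // sec_reg: flip-flops, sec_next: comb. logic
    reg [5:0] min_reg, min_next;

    // sequential part: the flip-flops take the next values on each clock edge
    always @(posedge clk)
        if (rst) sec_reg <= 6'd0;
        else     sec_reg <= sec_next;

    always @(posedge clk)
        if (rst) min_reg <= 6'd0;
        else     min_reg <= min_next;

    // next-state logic for seconds: every branch assigns sec_next
    always @(*)
        if (!tick)                 sec_next = sec_reg;
        else if (sec_reg == 6'd59) sec_next = 6'd0;
        else                       sec_next = sec_reg + 1'b1;

    // next-state logic for minutes: advance when the seconds wrap 59 -> 0
    always @(*)
        if (!(tick && sec_reg == 6'd59)) min_next = min_reg;
        else if (min_reg == 6'd59)       min_next = 6'd0;
        else                             min_next = min_reg + 1'b1;

    assign sec = sec_reg;
    assign min = min_reg;
endmodule
```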

## • Behavioral modelling for comb. ckt

- (1) multi-driver error
- (2) Incomplete sensitivity list
- (3) Incomplete branch and incomplete o/p assignments

### SOLUTIONS ( $\equiv$ Guidelines)

① → have a separate always block for each identifier

→ do not update multiple identifiers in one always block

② for comb. ckts, list every input in the sensitivity list (or use `always @ (*)`), e.g. an AND gate:  
`always @ (in1, in2) begin out = in1 & in2; end`

Note: we are just explaining our logic to the tool through code. Some tools might give a warning but some might not, resulting in incorrect logic.

\* Note: leaving out an input trigger might result in a sequential circuit (the output holds its old value when that input changes)

③ intention: combinational circuit

```verilog
always @ *
    if (a > b)
        gt = 1'b1;
    else if (a == b)
        eq = 1'b1;
```

\* Problem 1: 2 outputs from one always block  
\* Problem 2: in each condition only one of the two variables is updated, so the other one must be stored in memory, which we don't want because that makes the ckt sequential

→ assign values to all variables in each condition

→ deal with all cases: in an if-else block using else, and in a case block using default

→ OR we can initialise the variables to a value at the start of the always block

another example:

```verilog
case (s)
    2'b00 : y = 1'b1;   // not considering
    2'b10 : y = 1'b0;   // the case where
    2'b11 : y = 1'b1;   // s = 2'b01
endcase
```

Solution:  
either define all cases or use the default keyword  
`default: y = 1'b0;`
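A sketch of the comparator from ③ written to stay purely combinational, using the "initialise at the start of the always block" fix: every output gets a default first, so no branch leaves one unassigned and nothing has to be remembered. The module and port names (`comparator`, `gt`, `eq`, `lt`) and the 4-bit width are hypothetical; the three flags share one block here for compactness, which guideline ① would split into separate blocks.

```verilog
// Comparator that stays combinational: defaults first, then the conditions
module comparator (
    input  [3:0] a, b,
    output reg   gt, eq, lt
);
    always @(*) begin
        gt = 1'b0;                // defaults: every output assigned on every pass
        eq = 1'b0;
        lt = 1'b0;
        if (a > b)       gt = 1'b1;
        else if (a == b) eq = 1'b1;
        else             lt = 1'b1;
    end
endmodule
```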

| CASE                                 | IF - ELSE      |
|--------------------------------------|----------------|
| All cases are checked simultaneously | Priority based |

\* CASE  
↳ full case: all possible outcomes are accounted for  
↳ parallel case: all stated alternatives are mutually exclusive

eg:

```verilog
case (sel)
    2'b11 : out <= a;
    2'b10 : out <= b;
    2'b01 : out <= c;
    default : out <= d;
endcase
```

full case ✓  
parallel case ✓

eg2:

```verilog
casez (sel)              // casez: ? matches 0 or 1
    2'b1? : out <= a;    // parallel ✗: both items match
    2'b?1 : out <= b;    // when sel = 2'b11
    default : out <= c;
endcase
```

note: for sel = 2'b11  $\Rightarrow$  out = a, because the first matching item has the higher priority

summary  
• If an always block executes and a variable is not assigned

→ variable has to be stored  
↳ not combinational ckt  
↳ unnecessary complex  
↳ might not be synthesizable

• USE BLOCKING ASSIGNMENT FOR COMBINATIONAL CIRCUITS

\* BLOCKING / NON-BLOCKING

→ Note: non-blocking assignment works only in behavioural modelling, i.e. inside always / initial blocks

### ① BLOCKING

statements are executed in the order they are specified in a sequential block

does not block execution in a parallel block

### ⇒ RULE

(1) Always @(\*) : use blocking

(2) Always @ (posedge clk) : use non-blocking

eg<sup>1</sup>)

```verilog
always @ (posedge clk)
begin
    reg1 <= #1 in1;
    reg2 <= @ (negedge clk) in2 ^ in3;
    reg3 <= reg2;
end
```

Note: the values of in1, in2 and in3 are sampled at posedge clk

hence reg3 will get the previous value of reg2, not the new in2 ^ in3

also, it does not matter if in2 & in3 changed when clk hit the negedge: reg2 is still calculated from the values sampled at the initial posedge

note: this code isn't synthesizable

eg2)

```verilog
always @(posedge clk)   // block (1)
  a = b;
always @(posedge clk)   // block (2)
  b = a;
```

note: both always blocks are triggered by the same clock edge, so in theory they execute at the same time since they are parallel blocks.

but in simulation the order is not defined: block (1) may execute before (2) or vice versa (a race condition)

Conditions:

(1) both execute at the same time  
a = b  
b = a  
→ a and b swap

(2) (1) then (2)  
a = b  
b = a  
→ both end up with the old value of b

(3) (2) then (1)  
b = a  
a = b  
→ both end up with the old value of a
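One common fix for this race, sketched below, is to use non-blocking assignments in both blocks: every right-hand side is sampled at the clock edge before any register updates, so the result is a clean swap no matter which block the simulator runs first. The `load`, `a_init` and `b_init` ports and the module name `swap_regs` are hypothetical additions so the registers start from known values.

```verilog
// Sketch: race-free version of eg2 using non-blocking assignments.
module swap_regs (
    input  wire       clk,
    input  wire       load,     // hypothetical: load initial values
    input  wire [3:0] a_init,
    input  wire [3:0] b_init,
    output reg  [3:0] a,
    output reg  [3:0] b
);
    // block (1)
    always @(posedge clk)
        if (load) a <= a_init;
        else      a <= b;   // RHS sampled at the edge, before any update
    // block (2)
    always @(posedge clk)
        if (load) b <= b_init;
        else      b <= a;   // still the OLD a, whichever block runs first
endmodule
```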


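eg3) blocking

```verilog
always @(posedge clk)
begin
  q1 = in;
  q2 = q1;
  out = q2;
end
```

all three get the value of `in` at the same edge → only 1 clk cycle delay (synthesizes to a single flip-flop)

non-blocking

```verilog
always @(posedge clk)
begin
  q1 <= in;
  q2 <= q1;
  out <= q2;
end
```

3 clk cycles delay ✓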
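Both versions of eg3 as complete modules (the names `pipe_blocking` and `pipe_nonblocking` are hypothetical). Feeding a single 1-cycle pulse on `in` should show it at `out` after 1 clock edge for the blocking version and after 3 edges for the non-blocking version.

```verilog
// Sketch: eg3 with blocking assignments (collapses to one register stage).
module pipe_blocking (input wire clk, input wire in, output reg out);
    reg q1, q2;
    always @(posedge clk) begin
        q1 = in;    // q1 updated immediately
        q2 = q1;    // already sees the new q1 (= in)
        out = q2;   // so out = in after a single edge
    end
endmodule

// Sketch: eg3 with non-blocking assignments (3-stage shift register).
module pipe_nonblocking (input wire clk, input wire in, output reg out);
    reg q1, q2;
    always @(posedge clk) begin
        q1 <= in;   // all RHS values are sampled first...
        q2 <= q1;   // ...so q2 gets the old q1
        out <= q2;  // and out gets the old q2
    end
endmodule
```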

for sequential ckts use non-blocking assignments only

but for comb. ckts use blocking assignments only

when doing both sequential & comb. logic in the same always block, we use non-blocking

## • FPGA Architecture

2 LUT outputs: $O_5, O_6$

overall slice outputs per LUT: $X, X_{MUX}, X_Q$, eg: $D, D_{MUX}, D_Q$ for LUT D

- $D$: direct LUT output
- $D_{MUX}$: routed through a multiplexer
- $D_Q$: passed through a flip-flop (synchronous), so delayed by 1 clock cycle

if we combine two 6-input LUTs (64 locations of 1 bit each), we get a 7-input LUT with $2 \times 64 = 128$ locations
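A behavioural sketch of this idea (not a vendor primitive; module and parameter names are hypothetical): two 64 × 1 tables addressed by the lower 6 inputs, with the 7th input driving a 2:1 mux that picks one of them, giving 128 addressable locations. On 7-series parts this mux is the dedicated F7MUX in the slice.

```verilog
// Behavioural sketch: 7-input LUT = two 6-input LUTs + a 2:1 mux.
module lut7_from_lut6 #(
    parameter [63:0] INIT_LO = 64'h0,   // LUT6 #0, used when I[6] = 0
    parameter [63:0] INIT_HI = 64'h0    // LUT6 #1, used when I[6] = 1
) (
    input  wire [6:0] I,
    output wire       O
);
    wire [63:0] lo = INIT_LO;
    wire [63:0] hi = INIT_HI;
    wire o_lo = lo[I[5:0]];          // first 64 locations
    wire o_hi = hi[I[5:0]];          // next 64 locations
    assign O = I[6] ? o_hi : o_lo;   // 7th input selects the LUT: 128 locations
endmodule
```

For example, `INIT_LO = 64'h1` and `INIT_HI = 64'h8000_0000_0000_0000` make `O` high only at addresses 0 and 127.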



Combining two 2-input LUTs to get a 3-input LUT

$$\text{eg: } I = 100 \Rightarrow \text{out} = E \\ I = 001 \Rightarrow \text{out} = F$$

Similarly, our 7-series FPGA can combine all four 6-input LUTs of a slice to get at most one 8-input LUT

## \* 3input LUT



a 3-input comb. ckt means 8 rows & 1 output column in its truth table

so, only one 3-input comb ckt can be implemented in one 3-input LUT (it uses all 8 locations)



this LUT 3 is built from 2 × 2-input LUTs; how many comb ckts of 2 inputs can be implemented?  
each 2-input LUT has 4 locations, so each can hold one 2-input function, but the mux passes only one of the two outputs at a time

## • LUT 3 architecture



how many comb ckt for 3bit input? ONE

how many comb ckt for 2bit input? TWO???

set sel line to 0/1 const

limitation

$$2 \times 2\text{-input functions: } f_1(A, B) = y_1, \quad f_2(C, D) = y_2$$

these need 4 distinct inputs, but we only have 3 inputs available at a time

do note that the no. of outputs matches (2), i.e. the problem is with the inputs only

but, we can implement this:

$$f_1(A, B), \quad f_2(A, B)$$

each 2-input LUT can store 4 locations

a function of all 3 inputs would need 8 locations, which a single 2-input LUT does not have

but since each function here has only 2 inputs, 4 locations per LUT are enough, so this works.

Conclusion: we can implement 2 comb ckts with 2-bit input in LUT 3 only if both functions use the same inputs

note: upper limit of no. of comb ckt is equal to the no. of outputs

max. no. of comb ckt that works with no constraint: 2

$$f_1(A) ; f_2(B)$$

total locations required:  $2^1 + 2^1 = 4$  ✓  
 inputs:  $2 < 3$  ✓  
 outputs: 2 ✓

Conclusion-2:

1 comb ckt with 3 inputs ✓

2 comb ckt with 2 inputs (with common inputs) ✓

2 comb ckt with 1 input ✓
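A behavioural sketch of the LUT 3 architecture discussed above, assuming (as the notes suggest) two 4 × 1 LUTs sharing inputs `I[1:0]`, a mux on `I[2]`, and two outputs: `O2` from the lower LUT and `O3` from the mux. Module, parameter and port names are hypothetical. Holding the sel line `I[2]` at a constant 1 turns it into two 2-input functions of the same inputs.

```verilog
// Behavioural sketch: LUT 3 = two 2-input LUTs sharing I[1:0] + a 2:1 mux.
module lut3_dual #(
    parameter [3:0] INIT_LO = 4'h0,  // lower 2-input LUT (4 x 1 bit)
    parameter [3:0] INIT_HI = 4'h0   // upper 2-input LUT (4 x 1 bit)
) (
    input  wire [2:0] I,   // I[1:0] feed both LUTs, I[2] is the mux select
    output wire       O2,  // lower LUT alone: any function of I[1:0]
    output wire       O3   // mux output: any function of I[2:0]
);
    wire [3:0] lo = INIT_LO;
    wire [3:0] hi = INIT_HI;
    assign O2 = lo[I[1:0]];
    assign O3 = I[2] ? hi[I[1:0]] : lo[I[1:0]];
endmodule
```

With `INIT_LO = 4'b1000`, `INIT_HI = 4'b0110` and `I[2]` tied to 1, `O2 = I[0] & I[1]` and `O3 = I[0] ^ I[1]`: two functions of the same two inputs. Driving `I[2]` as a real input instead gives one arbitrary 3-input function on `O3`.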

\* Now what about LUT 6?

| inputs per comb. ckt | no. of comb. ckts | constraint |
|----------------------|-------------------|------------|
| 6                    | 1                 | none |
| 5                    | 2                 | both functions share the same 5 inputs |
| 3 & 2 (or less)      | 2                 | separate inputs, at most 5 distinct inputs in total |

✗ $f_1(A, B, C, D, E), \; f_2(F, G, H, I, J)$: 10 distinct inputs, so two 5-input functions must share their inputs

✓ eg: $f_1(A, B, C), \; f_2(B, C, D)$: only 4 distinct inputs

✗ eg: $f_1(A, B, C), \; f_2(D, E, F)$: 6 distinct inputs, but we have only 5 inputs (the remaining 1 is sel): 1 for sel, 3 for the upper LUT, 2 for the lower LUT

note: $f_1, f_2, f_3$ won't work because we only have 2 outputs but they need 3 outputs
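The same structure one size up, as a behavioural sketch with hypothetical names (not a vendor primitive): two 32 × 1 LUT5s sharing `I[4:0]`, a mux on `I[5]`, and outputs `O5` (lower LUT5) and `O6` (mux). Packing two functions means holding the sel line `I[5]` at a constant.

```verilog
// Behavioural sketch: LUT 6 = two 5-input LUTs sharing I[4:0] + a 2:1 mux.
module lut6_dual #(
    parameter [31:0] INIT_LO = 32'h0,  // lower LUT5
    parameter [31:0] INIT_HI = 32'h0   // upper LUT5
) (
    input  wire [5:0] I,
    output wire       O5,   // lower LUT5 alone: any function of I[4:0]
    output wire       O6    // mux output: any function of I[5:0]
);
    wire [31:0] lo = INIT_LO;
    wire [31:0] hi = INIT_HI;
    assign O5 = lo[I[4:0]];
    assign O6 = I[5] ? hi[I[4:0]] : lo[I[4:0]];
endmodule
```

For the 3 & 2 case, e.g. $f_1(A, B, C) = A \cdot B \cdot C$ on `I[2:0]` and $f_2(D, E) = D + E$ on `I[4:3]`: with `INIT_LO = 32'h8080_8080`, `INIT_HI = 32'hFFFF_FF00` and `I[5]` held at 1, `O5` gives $f_1$ and `O6` gives $f_2$.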

\* Finite state machine

## ① MOORE machine



## ② MEALY machine



\* LAB: 5  $\Rightarrow$  design of sequence detector 17/09/24

## FSM

midsem code: expect to write FSM code for some given problem

FSM = sequential circuit = FFs + 2 comb. ckts  
↓  
output + next state

1011 sequence detector



types: overlapping / non-overlapping

⇒ FSM: finite state machine  
every state has a meaning

## MOORE

- $S_0$ : "" (nothing matched yet)
- $S_1$ : "1"
- $S_2$ : "11"
- $S_3$ : "110"
- $S_4$ : "1101" (sequence detected, output = 1)



non-overlapping 1111 seq. detector

overlapping case



every state should be unique

Verilog code for FSM (Moore), sequence 1101

① module definition

2 comb. ckts + 1 sequential ckt

② define variables for present and next state  
width $n$ must satisfy $2^n \geq$ number of states (5 states → 3 bits)

```verilog
reg [2:0] present_state, next_state;

parameter S0 = 3'b000,
          S1 = 3'b001,
          S2 = 3'b010,
          S3 = 3'b011,
          S4 = 3'b100;
```

\* Three always blocks

because we have three blocks: 1 seq + 2 comb
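Putting the steps together, a sketch of a complete Moore FSM for 1101 with the three always blocks. Assumptions: the overlapping variant (after 1101, a following 1 reuses the trailing "11"), a synchronous active-high `rst`, and a hypothetical module name `seq_1101_moore`.

```verilog
// Sketch: Moore FSM, overlapping "1101" sequence detector.
module seq_1101_moore (
    input  wire clk,
    input  wire rst,    // assumed synchronous, active-high
    input  wire in,
    output reg  y
);
    reg [2:0] present_state, next_state;

    parameter S0 = 3'b000,  // ""     : nothing matched yet
              S1 = 3'b001,  // "1"
              S2 = 3'b010,  // "11"
              S3 = 3'b011,  // "110"
              S4 = 3'b100;  // "1101" : sequence detected

    // (1) sequential block: state register, non-blocking
    always @(posedge clk) begin
        if (rst) present_state <= S0;
        else     present_state <= next_state;
    end

    // (2) combinational block: next-state logic, blocking
    always @(*) begin
        case (present_state)
            S0: next_state = in ? S1 : S0;
            S1: next_state = in ? S2 : S0;
            S2: next_state = in ? S2 : S3;
            S3: next_state = in ? S4 : S0;
            S4: next_state = in ? S2 : S0;  // overlap: "11011" ends in "11"
            default: next_state = S0;
        endcase
    end

    // (3) combinational block: Moore output depends on the present state only
    always @(*) begin
        y = (present_state == S4);
    end
endmodule
```

For a non-overlapping detector, the only change would be S4 going to S1 (instead of S2) on input 1.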

\* lecture: 12

17/09/24

⇒ SLICE ARCHITECTURE



### LUT

- ↳ combinational ckt
- ↳ memory
- ↳ shift register

#### ① LUT as memory

- only in SLICEM
- synchronous write operation, WEN (write enable) must be high
- asynchronous / synchronous read operation
  - ↳ synchronous when the read output is passed through a FF

#### \* Single Port

common address port for synchronous write and asynchronous read i.e. read and write addresses share the same address bus.
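A sketch of a 64 × 1 single-port RAM in a coding style that synthesis tools typically map onto a SLICEM LUT as distributed RAM (check the synthesis report to confirm; the module name is hypothetical): one `addr` bus for both operations, a write on the clock edge only when `we` is high, and an asynchronous read.

```verilog
// Sketch: 64 x 1 single-port distributed RAM.
module spram64x1 (
    input  wire       clk,
    input  wire       we,     // write enable
    input  wire [5:0] addr,   // shared by write and read
    input  wire       din,
    output wire       dout
);
    reg mem [0:63];
    always @(posedge clk)
        if (we) mem[addr] <= din;   // synchronous write, only when we is high
    assign dout = mem[addr];        // asynchronous read: no clock needed
endmodule
```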



#### \* Dual Port

one port for synchronous write + asynchronous read  
one port for asynchronous read only
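A sketch of the 64 × 1 dual-port case in the same inference-friendly style (hypothetical names): port A shares `addr_a` between the synchronous write and an asynchronous read, and port B only reads, asynchronously, from `addr_b`.

```verilog
// Sketch: 64 x 1 dual-port distributed RAM.
module dpram64x1 (
    input  wire       clk,
    input  wire       we,
    input  wire [5:0] addr_a,  // port A: synchronous write + asynchronous read
    input  wire [5:0] addr_b,  // port B: asynchronous read only
    input  wire       din,
    output wire       dout_a,
    output wire       dout_b
);
    reg mem [0:63];
    always @(posedge clk)
        if (we) mem[addr_a] <= din;  // only port A can write
    assign dout_a = mem[addr_a];
    assign dout_b = mem[addr_b];     // independent read address
endmodule
```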



#### \* Simple dual port



#### \* 64x1 : Dual Port



#### \* 64x2 : quad port

