

# ECE 270 : Embedded Logic Design $\Rightarrow$ Co + DC



Resource for Quiz: [hobbits.0x3.net/wiky/Main\\_Page](http://hobbits.0x3.net/wiky/Main_Page)  
Lab videos : youtube

Programming: 1st half  $\rightarrow$  Verilog  
2nd half  $\rightarrow$  Embedded C

Theory: FPGA and SoC

Virado 2019.1 (including SDK)  
*arch user repository*

|           |               |       |
|-----------|---------------|-------|
| * GRADES: | mid sem       | 30 %. |
|           | end sem       | 30 %. |
|           | Surprise quiz | 28 %. |
|           | lab hw        | 15 %. |

## LECTURE: 1

\* Which is faster : Analog vs digital ?

$\Rightarrow$  Depends on the use case

\* No product is purely digital/analog ?

$\Rightarrow$  Analog is present in nature however digital can be processed easily and has more use cases.



HDL  $\rightarrow$  Hardware Description language  
 $\hookrightarrow$  eg  $\Rightarrow$  Verilog

## LECTURE : 2

\* **combinational circuit**: The output depends upon the present input (same clock cycle)

\* **sequential circuit**: Output depends upon the current input and the current state of the circuit  
what we get → output + next state  
because we go from one state to another

⇒ Note: combinational circuits use clock as its input as well.

\* **D flip flop**: Input is stored at falling (edge triggered) or rising edge of the clock



\* **Sequential circuit using combinational ckt**



\* **FSM (finite state machine)**

⇒ Up Counter



Note: if curr state =  $S_n$ , the output is  $n$

2 bits required to store in mem



\* **Another example**



relation b/w present state & next state  $\Rightarrow NS = PS + 1$



e.g. if  $PS = S_0$  ( $00$ )  $\Rightarrow$  output is  $3$  ( $11$ )  
i.e.  $NOT(0), NOT(0) \Rightarrow (11)$

{ Here we should use 2 not gates for the 2 bits }

so, we are storing current state + processing inputs (using comb. ckt)

\* **Example : 3**

counter :  $2 \rightarrow 4 \rightarrow 6$

use either MUX or K-map to find suitable ckt



NOTE: no. of states is equal to no. of flip flops

since we represent states using 2 bits, we need 2 flip flops



NGD : Native Generic Database

NCD : Native circuit description

delay : Logic delay + Routing delay

= combinational ckt delay here we need to run through the post-place and route static timing simulator again after routing in order to verify functionality and timing constraint

If you identify problems in the timing report

- increase the buffer effort level

- using re-entering routing

- using multi-pass place and route

- redesign the logic paths to use fewer layers of logic, moving to a faster device, or allocate more time for the paths.

(\*.bit)  
Final Step: generating the bitstream

downloaded directly to the FPGA

OR

converted to a PROM file which contains programming information.

- **ASIC design flow:** routed design is used to generate photomask for producing integrated circuits  
 Application specific integrated chip

FPGA

The application can be configured and changed even after fabrication.

VHDL code can be converted to .bit and downloaded directly to FPGA

ASIC

fabricate the chip for one specific application we have to send our VHDL code to the fabrication organization

90% devices are ASIC as of this moment

## • APPLICATION SPECIFIC IC $\Rightarrow$ ASIC

- # High non-recurring engineering cost (NRE) {the cost required for fabrication of the FIRST product} one time cost to research, design, develop and test a new product
- # High cost for engineering change orders hence testing is critical
- # Lowest price for high volume production
- # fastest clock performance (high performance) because ASIC is intended for specific application
- # Unlimited size and low power consumption
- # Design and test tools are expensive
- # expensive IPs
- # steep learning curve

## \* MICRO-CONTROLLERS

simple computer placed inside a single chip with all the necessary components like memory, timers etc., embedded inside and performs a specific task

sequential execution  
commands: one by one  
one task at a time

cannot carry out parallel operation

A microcontroller designed  $\equiv$  GPU for parallel operations

- # consumes less power than FPGA and suitable for edge cases

## \* MICRO-PROCESSORS

ICs that come with a computer or CPU inside and are equipped with processing power.

- # No peripherals like ROM, memory

CPU      GPU      ASIC

flexibility      efficiency

Solution for the near future  $\equiv$  ARM + FPGA + GPU

microcontroller : time limited

FPGA : space limited

## \* LAB : 1 Design of Full Adder

20/08/29



In Hardware, execution occurs parallelly whereas in software, we write programs that run sequentially

Two major HDLs: Verilog & VHDL

more popular  
syntax close to C

easy to master

more prominent  
in Indian VLSI industry

1983 : introduced by Gateway Design System

Inverted as a SIMULATION language. SYNTHESIS was an afterthought.

1987 : Verilog

Synthesizer by Synopsys

1989 : Gateway DESIGN SYSTEM acquired by Cadence

Latest verilog version

= system Verilog

↳ much simpler than initial versions

1981 - 1983 : US Dept of defence developed VHDL (VHSIC HDL)

very high speed integrated circuit hardware description language

open source  
unlike Verilog  
(closed src)

Afraid of losing market share  
(Cadence made Verilog open sourced (1990))

1995 : became IEEE standard 1364

Hardware : parallel processing  
Software : sequential processing

In Verilog, all lines execute parallelly unlike languages like C, C++, etc.

Verilog looks like C but describes hardware

Understand the circuit and specifications then figure out the code

## \* VERILOG

- Verilog HDL is case sensitive
- all keywords are in lower case
- statements terminated by semi-colon ;
- Two data types : Net (wire) → default datatype  
variable (Reg, Integer, real, time, realtime)

- Primitive Logic Gates and Switch-Level gates are built in

## \* EXAMPLES



| in1 | in2 | out |
|-----|-----|-----|
| 0   | 0   | 0   |
| 0   | 1   | 0   |

// module\_name <ports>

module AND (out, in1, in2);

input in1, in2;

output wire out;

// in1 and in2 are also

// wire datatype since

// it is default type

assign out = in1 & in2;

// data flow - continuous assignment

endmodule

Homework: currently the output = 5 bits  
HW → make the output → 4 bit with 1 bit for overflow



assign FA1\_OutSum = FA1\_InA ^ FA1\_InB ^ FA1\_InC;

assign FA1\_OutC = (FA1\_InA ^ FA1\_InB) | (FA1\_InA & FA1\_InB)

endmodule







# \* LAB: ELD

27/08/24 8:30AM

- For counter to increment every second, Design 8 bit up counter ( $0 \rightarrow 255$ ) using behavioural modelling
- Design 1Hz clock from input 100MHz clock using clock divider
- Lab HW: Design up/down counter with maximum count of 85
- Write a verilog code where output is delayed version of input by 1 clk cycle

Ans) just make a D flip flop

Note: if we want 3 cycle delay, we pass the input through 3 D flip flops



```
module delay_2(Din, CLK, out);
    input Din, CLK;
    output reg* out;
```

```
D_FF F1(Din, CLK, Q);
D_FF F2(Q, CLK, out);
endmodule
```

\*Note: Q and out are reg inside each D-FF block but will be wire in this module because both are inputs from perspective of our module. Look up previous lecture for the same.

$\boxed{\text{delay 2}} \longrightarrow \text{out}$

$\downarrow \text{reg} \quad \downarrow \text{reg}$

$\boxed{F1} \quad \boxed{F2}$

- 8 bit up counter
  - (1) block diagram
  - (2) define all signals
  - (3) write code



number of flip flops = 8

because number of states = 256

Note: in the flip flop, we just store the state of the circuit

$$NS = PS + 1$$

```
module counter(
    input CLK, reset,
    output [7:0] count
);
```

```
// flip flop
```

```
always @ (posedge CLK)
```

```
begin
```

```
    if (reset)
```

```
        PS <= 8'b00000000
```

```
// same as PS <= 8'd0
```

```
    else
```

```
        PS <= NS
```

```
end
```

```
// finally, assigning the output
```

```
assign count = PS; // if we take count as wire
```

or \*: some functionality

```
always @ (PS) // if we take count as reg
```

```
begin
```

```
    count = PS;
```

```
end
```

```
endmodule
```

- Note: we need to define NS and PS here because they should be 8bit each but by default size = 1bit

Synchronous active high reset  $\Rightarrow$  D flip flop

- Testbench:

The test bench verilog file will be higher as compared to src file in context of hierarchy.

## \* LECTURE : 6 (Architecture)

27/08/24

3 - 4:30pm

- Programmable Logic Device (PLD)
  - Devices whose...  
Internal Architecture is predefined by manufacturer but are created in a way so that they can be configured in the field to perform variety of functions
- Programmability at the software level  
eg: Arduino / RPI Pico  
But you cannot change the instruction set architecture of the CPU

### \* Fusible Link Technology



### \* PROM: programmable read-only memory (1970)



- blow the fuses as per your logic
  - one-time programmable
  - Single PROM instead of multiple chips
    - smaller
    - lighter
    - cheaper
    - less prone to errors (fewer solder joints)
    - easy to identify errors / correct errors
  - Designed for use as memories to store computer programs and constant data values
  - Also useful for implementation for simple logic function such as LUT & state machines
- Very high amount of fuses required for complex CKTs  
programmable once only  
need to switch fuse once blown

### \* EPROM: Erasable PROM (1971: by intel)

- can be erased (UV rays)
- multiple time programmable
- smaller in size than fusible linked devices
- bring the IC back, takes minutes to erase by putting in UV container and then put it back
- whole thing is erased all together
- Cannot erase sections of the device.
- expensive
- erasing process becomes complex as density of transistors increases

### \* EEPROM: Electrically EPROM

### \* PLA: Programmable Logic Arrays (1975)

- High delay
- Makes the left side of PROM programmable as well
- Did not get adopted because people were more comfortable with SOF form

### \* Programmable Logic Device

→ SPLD : Simple

→ CPLD : complex

- Complex PLDs (CPLD)

1984

→ Need for bigger (functionally), smaller (size), faster and cheaper technology

→ MegaPAL: interconnection of 4 PAL

high power consumption

Programmable Array Logic

→ 1984: Altera introduced CPLD using CMOS (high density, low power) and EPROM / EEPROM (programmability)

E<sup>2</sup>PROM

= multiple SPLDs

→ Added multiplexers to each SPLD so that only the necessary stuff is processed

= to combat higher power consumption

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

(communicate)

Signals from a logic block can travel through adjacent blocks only

ALTERA

Programmable interconnect matrix

input / output pins

SPLD like blocks

## • LUT as memory & LUT as ALU

↳ lookup table

→ Operations on memory: read or write

we provide the address of the

memory → 9 bytes

here

you provide the data and the

1 byte ← address

9 bytes ←



## \* FPGA

CLB: configuration logic block

BRAM: block RAM (used to store large amt of data)

input/output block

CMT: clock management tile

FIFO logic

BUFG: Global buffer

DSP: digital signal processing

BUFIO and BUFR: Input/output & Regional buffer

MGT: multi-gigabit transceiver

- we use one oscillator for generating one clock
- we use CMT block for clocks other than the one generated by oscillator

## \* Configurable Logic Block [CLB]



Our flip flop now ↗

| Slices | LUT | Flip Flops | Arithmetic & carry chains |
|--------|-----|------------|---------------------------|
| 2      | 8   | 16         | 2                         |

(4 LUT/slice) (8 FF/slice)

= 6 input LUT

total: 64 bit of data stored in LUTs  
8 bits stored in each LUT (6 bits input to LUT)

= 512 bit of data in one CLB X see below

## • CLB : SLICES

① SLICEM: Full slice } read and write only  
↳ LUT can be used for logic and memory / SRL (shift register)



read address + write address

② SLICEL: logic and arithmetic only } read only

↳ LUT can only be used for logic (not memory/SRL)



• We store large amount of data in BRAM

→ in own FPGA (7-series), there are only 25% SLICEM and 75% SLICEL

CLB\_LL CLB\_LM



So, we can store either 0 or 128 bits in one CLB depending on its type ↗

4 LUTs/slice and each LUT contains 64 bits but in CLB\_LM half mem is taken by slice L

P/B/C/D ↗ is asynchronous ↘ is synchronous as it passes through FF

6 input LUT + 6 input LUT ↗ 64+64=128 bit input ↗

7 input LUT ↗

FPGA → CLB → 2x slices → 4x LUT + MUX + carry chain

↳ slice L ↗ slice M

↳ LUT used for logic

and arithmetic

↳ LUT used for memory

+ 4 flip flops or latches

+ 4 additional flip flops

## • Verilog: Vector and memory

• Only net or reg datatypes can be declared as vectors (multiple bit width)

• Specifying vectors for integer, real, realtime and time datatype is not allowed

• default: 1bit (scalar)

eg: wire [7:0] a-byte ; reg [31:0] a-word;

reg [11:0] counter;

reg a;

reg [2:0] b;

a = counter[7]; index 7

b = counter[4:2]; 4, 3, 2

eg: wire [-3:0]; 4 bits sign

## • memory

reg [3:0] mem [255:0], red;

↳ 4 bit vector

256 locations { 255 : 0 } ↗ each size: 4bits

with each 4bits

• Memory vs vector

• Vector and mem declarations are not same

• In a vector, all bits can be assigned a value in one statement

• In memory, assigned separately.

reg [7:0] vect = 8'b 10100011

reg array [7:0]; // 8 locations of 1bit

array [7] = ...;

array [6] = ...;

:

array [0] = ...;

## \* LAB:3 (Running on hardware)

03/09/24

⇒ Let's say our program is an 8-bit upcounter  
the program will run on the hardware  
but how will you observe the output?

- ① One way could be using LEDs (8 of them for 8 bits) to represent the output physically on the board.
- ② Also, we need an additional circuit to convert 100MHz → 1Hz from oscillator ↴
- ③ Also, how will you implement reset signal?  
How about a physical button connected to the board.

Let's take example of 3 bit up counter



here, duty cycle is 50%



We need to find  $\chi$  such that  $\chi > 4 \text{MHz}$  and number of bits of freq. divider

$$\text{here } \frac{\chi}{2^3} = 1 \Rightarrow \chi = 8388608 \text{ Hz} = 8.388 \text{ MHz}$$

(= clock division)  
and hence the frequency divider / counter  
is 23 bits

To get 2Hz as output from above,

We can change no. of bit of counter to 22 ✓



We need 4 codes

→ CMT

→ freq division

→ Counter

→ Top code to converge them all ↴

Notes:

- ① How to get CMT on Vivado?

IP Catalog → Clocking wizard → Clocking → output options  
clocks (clk-100m)  $\equiv 100\text{MHz}$   
 $\equiv 8.388\text{MHz}$

Note: the "locked" signal is high when the output signal reaches the intended frequency

after generating clock-wizard,  
we need to instantiate it

To instantiate clocking-wizard:

SOURCES

↓  
IP SOURCES

↓  
clk-wiz-0

...  
...  
Instantiation Template

L clk-wiz-0.v

copy verilog code from here to top-count.v



Virtual Input Output = VIO

= debugger

= exact replica of testbenches  
but now it is running on hardware



## \* Lecture : 8

03/09/21

To get stats about the project go to project summary tab on Vivado

⇒ vector & memory

`reg [7:0] my-reg [0:31];`

↳ memory with 32 positions of 8 bit size each

`integer matrix[4:0][0:31];`

↳ 2 dimensional memory

`wire [1:0] regL [0:3];`  
`wire [1:0] reg2 [3:0];`



`array2 [100][7][31:24];`

↳ 4th byte from  
101<sup>th</sup> column and 8<sup>th</sup> row

`reg [31:0] Data-RAM [0:255]`

read 2nd byte from Address 11  
(index)

`Data-RAM[11][15:8]`

2nd and 3rd byte from Address 77

`Data-RAM[77][23:8]`

printf : C :: \$display : Verilog

## \* Vector indexing

`reg [63:0] word;`

`reg [3:0] byte_num;`

`reg [7:0] byteN;`

`byteN = word [byte_num * 8 + : 8]`

$$= 4 \times 8 + : 8$$

= 32 + : 8 (forward direction)

$$= [39:32]$$

$$\text{word}[7+16] = [22:7]$$

$$a[31-:8] = a[31:24]$$

8 bits of data

from index 31

in backwards

direction

for(i=0; i<5; i=i+1)

\$display ("%s", str[i\*8:8]);

⇒ edcba

## \* Verilog : Register vs Integer

- Reg is by default 1 bit wide data type. If more bits are required, we use range declaration.

- Integer is a 32 bit wide datatype.

- Integer cannot change its width. It is fixed.

- Not much utility as compared to Reg / Net

- Typically used for constants or loop variables

- Vivado automatically trims unused bits of Integers.

eg: Integer i = 255;

→ then i = 8 bits

## \* OPERATORS

{  
↳ Unary  
↳ Binary  
↳ Ternary } based on the number of operands

a = ~b;

a = a && b;

a = b ? c : d;

## \* BUS OPERATORS



- Note:

left orith.  
<<<)

- For multiplication, FPGAs have DSP48 dedicated to fast math.

However, if there are no free DSPs, used which is large and slow.

- For multiplication of two N bit numbers, result is max  $2^N$  bits number
  - make sure to define your variables and their size explicitly.

Bitwise operators:  
 operates on &  
 each bit individually

|            |         |           |
|------------|---------|-----------|
| $\sim$     | unverse | Output    |
| $\&$       | And     | can be    |
| $ $        | Or      | multi-bit |
| $\wedge$   | not     |           |
| $\sim\sim$ | XNOR    |           |

  - if the two operands are of different lengths, the shorter one is padded with its MSB with signed bit  
 ↴ so number of gates required:  $\max\{\text{len}(A), \text{len}(B)\}$
  - by default everything is unsigned
  - this is how we can tell the tool that we want signed operation:  
 $\text{assign out} = (\$signed(a)) < (\$signed(b))$   
 or we can store the value in signed way using  $\$signed$  and do operation normally,  
 logical operators:

|                    |            |                   |
|--------------------|------------|-------------------|
| $!$                | NOT        | Output is         |
| $\&\&$             | AND        | one bit only      |
| $  $               | OR         | $0, 1 \leftarrow$ |
| $==$               | EQUAL      | OR TRUE/FALSE     |
| $!=$               | NOT EQUAL  |                   |
| $<, >, \leq, \geq$ | COMPARISON |                   |

| a   | b   | $a \& b$ | $a \oplus b$ | $a \& \& b$ | $a \oplus \oplus b$ |
|-----|-----|----------|--------------|-------------|---------------------|
| 0   | 1   | 0        | 1            | 0/F         | 1/T                 |
| 000 | 000 | 000      | 000          | 0/F         | 0/F                 |
| 000 | 001 | 000      | 001          | 0/F         | 1/T                 |
| 011 | 001 | 001      | 011          | 1/T         | 1/T                 |

- 2 basic blocks: always  $\rightarrow$  and initial  $\rightarrow$   
in behavioural

- All of them start at simulation time  $O(\#o)$
- INITIAL BLOCK
  - starts at  $\#o$  and executes only once.

initial  
begin ..

#70 \$finish after 100 units  
Wait to end  
more units  
 $b = \#50$   $c \& d$   
calculated at  $t=0$   
but assigned at  $t=50$

In the hardware, delay is created in context of clock cycles instead of seconds.

- note: "=" makes code run sequentially
- ALWAYS BLOCK

- statements inside always block are executed either sequentially (=) or parallelly ( $\in$ )  
blocking assignment      ↗      non-blocking assignment      ↗

|              |               |                        |
|--------------|---------------|------------------------|
| always       | always @ (*)  | always @ (posedge ...) |
| ...          | --.           | ....                   |
| not          | Synthesizable | Synthesizable          |
| sensitive to |               | Synthesizable          |

- \* Combinational Circuits using always
  - Common ERRORS
    - no variables are updated in parallel because of parallel execution, one variable be off of 2 block

⇒ ERROR: Multi driver error  
Some variable driving two blocks  
Synthesis error

eg: always @ (posedge clk)

begin

if (rst-n)

$Q \leftarrow D$ ;

end

always @ (negedge rst-n)

begin

if (!rst-n)

$Q \leftarrow 1'b0$ ;

end

- Q is being updated by two blocks simultaneously

## # Parameters

```
module something(
    parameter foo = 1'b0
)
```

\* Digital clock (minute : seconds)  
(behavioural modelling)

$0 \rightarrow 59 \downarrow \downarrow 0 \rightarrow 59$

6 bit up counter

always @ (posedge clk)

HW: modify the digital clk so that the output of CMT block is 16.777 MHz  
clock management time ↴  
24 bit size of the counter



sec\_reg: output of sequential ckt (flip flop)

sec\_next: output of combinational ckt

hence we cannot initialise it

eg: we do not initialize output of AND ckt

if we initialize it, the tool might ignore the 2nd always block (the one whose o/p is sec-next)

ff: takes the next value and assigns to the current value

note: we are making the minutes logic and seconds logic in separate always blocks due to parallel processing

## • Behavioral modelling for comb. ckt

- (1) multi-driver error
- (2) Incomplete sensitivity list
- (3) Incomplete branch and incomplete o/p assignments

### SOLUTIONS ( $\equiv$ Guidelines)

① → have separate always block for each identifier

→ do not update multiple identifiers in one ALWAYS block

② designing AND gate wrong. comb. ckt's  
always @ (in1, in2) begin  
out = in1 & in2; end

Note: we are just explaining our logic through code to the tool. Some tools might give warning but some might not. Hence resulting in incorrect logic.

\* Note: leaving out an input trigger might result in a sequential circuit

③ always @ \* → intention: combinational circuit  
if (a>b)  
gt = 1'b1  
else if (a=b)  
eq = 1'b1

\* Problem 1: 2 outputs of one always block  
\* Problem 2: for each condition, only one variable of the two variables are getting updated and hence the other variable is stored in memory which we don't want because sequential

another example:

→ OR we can initial vars to a value in the start of an always block

case (s)  
2'b00 : y = 1'b1; } not considering  
2'b10 : y = 1'b0; } case where  
2'b11 : y = 1'b1; } s = 2'b01  
endcase

Solution:  
either define all cases or use default keyword  
default: y = 1'b0;

All cases are checked simultaneously

\* CASE  
↳ full case: all possible outcomes are accounted  
↳ parallel case: all stated alternatives are mutually exclusive

eg: case (sel)  
2'b11 : out <= a; full case ✓  
2'b10 : out <= b; parallel case ✓  
2'b01 : out <= c;  
default : out <= d;

endcase

eg2: case (sel)  
2'b1? : out <= a; full case ✓  
2'b?1 : out <= b; parallel x  
default: out <= c; because of ambiguity when sel  
note: for sel = 2'b11  $\Rightarrow$  out = a is 2'b11  
because of higher priority

summary  
• If an always block executes and a variable is not assigned  
→ variable has to be stored  
↳ not combinational ckt  
↳ unnecessary complex  
↳ might not be synthesizable

• USE BLOCKING ASSIGNMENT FOR COMBINATIONAL CIRCUITS

### \* BLOCKING / NON-BLOCKING

→ Note: non blocking works only behavioural modelling i.e. always / initial block

#### ① BLOCKING

statements are executed in the order they are specified in a sequential block

does not blocks execution in a parallel block

#### ⇒ RULE

(1) Always @(\*): use blocking

(2) Always @ (posedge clk): use non-blocking

eg1) always @ (posedge clk)  
begin  
reg1 <= #1 in1;  
reg2 <= @ (negedge clk) in2 ^ in3;  
reg3 <= reg2;

note: the values in1, in2 and in3 are stored at posedge clk

hence reg3 will have the previous value of reg2 and not in1

also, it does not matter if in1 & in2 changed when clk hit neg edge, it will still take the value at initial pos edge to calculate reg2

note: this code isn't synthesizable

eg2) always @ (posedge clk)  
a = b;  
always @ (posedge clk)  
b = a;

note: both always block execute at the same exact time since they are parallel blocks theoretically.

but on the hardware, it could happen that block (1) executes before (2) or vice versa

Conditions: (1) both execute at same time  
a = b  
b = a

(2) (1) then (2)  
a = b  
b = a

(3) (2) then (1)  
b = a  
a = b

case: (1) both at same time  
a = b  
b = a

(2) (1) → (2)  
a = b<sub>old</sub>  
b = b<sub>old</sub>

(3) (2) → (1)  
b = a<sub>old</sub>  
a = b<sub>old</sub>

eg3) always @ (posedge clk)  
begin  
q1 = in;  
q2 = q1;  
out = q2;

non-blocking  
always @ (posedge clk)  
begin  
q1 <= in;

q2 <= q1;  
out <= q2;

end

3 clk cycles delay ✓

for sequential ckt's use non-blocking assignment only

but for comb. ckt's use blocking assignment only

when doing both sequential & comb. we do non blocking

## • FPGA Architecture

2 LUT outputs:  $O_5, O_6$

overall outputs:  $-, -MUX, -Q$   
 eg:  $D, D_{MUX}, D_Q$

through multiplexer  
 passed through flip flop  
 (synchronous)  
 and delayed by 1 clock cycle

if we combine 2 6bit LUTs: we get a  
 64 locations  $\leftarrow$  7bit LUT  
 of 1bit size each  $\hookrightarrow$  128 locations



Combining two 2bit LUTs to get a 3bit LUTs

$$\text{eg: } I = 100 \Rightarrow \text{out} = E \\ I = 001 \Rightarrow \text{out} = F$$

Similarly, our S-series FPGA, can combine all 4 6bit LUTs to get max one 8bit LUT

## \* 3input LUT



3input comb. ckt means 8 rows & 1 output cal in truth table

so, only 2 comb ckt can be implemented in one 3input LUT



this is 2input LUTs x 2 how many comb ckt of 2 inputs can be implemented?  
 because 2input = 4 locations and so, either of the output can be used at once

## • LUT 3 architecture



how many comb ckt for 3bit input? ONE

how many comb ckt for 2bit input? TWO?

set sel line to 0/1 const

limitation

$$2 \times 2\text{bit inputs: } f_1(A, B) = y_1 \\ f_2(C, D) = y_2$$

but we only have 3 inputs available at a time

but do note that no. of outputs match i.e. we have problem with input only

but, we can implement this:

$$f_1(A, B, C) \\ f_2(D, E, F)$$

each LUT can store 4 locations

but you need 8 locations in each LUT for implementing two comb ckt over LUT 3 with 2bit input

since we now have 2inputs only, we need 4 locations only with each LUT which works here.

Conclusion: we can implement 2 comb ckt with 2bit input in LUT 3 only if both the inputs are same

note: upper limit of no. of comb ckt is equal to the no. of outputs

max. no. of comb ckt that works with no constraint: 2

$$f_1(A) ; f_2(B)$$

total locations required:  $2^1 + 2^1 = 4$

inputs:  $2 < 3$

outputs: 2

Conclusion-2:

1 comb ckt with 3 inputs ✓

2 comb ckt with 2 inputs (with common inputs) ✓

2 comb ckt with 1 input ✓

## \* Now what about LUT 6?

| inputs               | comb. ckt. | constraint |
|----------------------|------------|------------|
| $f_1(A, B, C, D, E)$ | 6          | 1          |
| $f_2(F, G, H, I, J)$ | 5          | 2          |

eg:  $f_1(A, B, C)$   $f_2(D, E, F)$

eg:  $f_1(A, B, C)$   $f_2(B, C, D)$

note:  $f_1, f_2, f_3$  wont work because we only have 2 outputs but they give 3 outputs

inputs: 1 for sel 3 upper LUT 2 lower LUT

$f_1(A, B, C)$   $f_2(D, E, F)$

because we have only 5 inputs (rest 1 is sel)

## \* Finite State machine

### ① MOORE machine



### ② MEALY machine



\* LAB: 5  $\Rightarrow$  design of sequence detector 17/09/24

## FSM

midsem code: write a fsm code for something..

FSM: sequential circuit

" "

FFs + 2 comb. ckt

↓  
output + next state

1011 sequence detector



types: overlapping / non-overlapping

⇒ FSM: finite state machine  
every state has a meaning

## MOORE

$S_0$ : " "  
 $S_1$ : " 1"  
 $S_2$ : " 11"  
 $S_3$ : " 110"  
 $S_4$ : " 1101"



non-overlapping 1111 seq. detector



every state should be unique

Verilog code for FSM (moore) (1101)

① module definition

2 comb. ckt + 1 sequential ckt

② define variables for present and next state  
size  $\geq$  number of states

`reg[2:0] present-state,  
next-state;`

`parameter S0 = 3'b000  
S1 = 3'b001  
S2 = 3'b010  
S3 = 3'b011  
S4 = 3'b100`

\* Jhre always blocks because we have three blocks

1 seq + 2 comb

③ State Register always block

always @ (\*)  
begin  
if (present-state == S0)  
present-state <= S0;  
else  
present-state <= next-state;

end

Note: non-blocking assignment for sequential ckt and blocking assignment for comb. ckt

④ next state comb. ckt

always @ (\*)

begin

case(present-state)

S0: if (din == 1)

next-state = S1;

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

:

\* lecture: 12

17/09/24

⇒ SLICE ARCHITECTURE



### • LUT

- ↳ combinational CLK ✓
- ↳ memory ✓
- ↳ shift register

#### ① LUT as memory

- only in SLICEM
- synchronous write operation, WEN must be high
- asynchronous / synchronous read operation
  - ↳ when passed through a FF

#### \* Single Port

common address port for synchronous write and asynchronous read i.e. read and write addresses share the same address bus.



#### \* Dual Port

one port for async write + sync read  
one port for async read



#### \* Simple dual port



#### \* 64x1 : Dual Port



#### \* 64x2 : quad port



## \* Lecture: 13

19/09/24

⇒ LUT as Shift Register (SRL)  
(serial in serial out)

```
module SRL(
    input in,
    input clear,
    input clk,
    output QD;
)
    wire QA, QB, QC, QD;

    always @ (posedge clk or posedge clear)
        if(clear)
            QA <= 0;
        else
            QA <= in;

    always @ (posedge clk or posedge clear)
        if(clear)
            QB <= 0;
        else
            QB <= QA;

    always @ (posedge clk or posedge clear)
        if(clear)
            QC <= 0;
        else
            QC <= QB;

    always @ (posedge clk or posedge clear)
        if(clear)
            QD <= 0;
        else
            QD <= QC;
endmodule
```

D-FF( $in, QA$ );      } Some as  
 D-FF( $QA, QB$ );      } above  
 D-FF( $QB, QC$ );      } if D-FF defined  
 D-FF( $QC, QD$ );      already

### Uses of Shift Registers

① multiplying/dividing by power of 2  
\* OR / by  $2^x$

② Delaying output by some clock cycles



OP A, B, C are some operations that take some clk cycles to process input but to hold the functionality of the given circuit, i.e. both pathways reach max at same time, we need 17 clk cycle delay.

to reduce number of FFs needed we use MC31 on the FPGA

## \* LUT as shift Register

MC31 is delayed version of input by 32 cycles



for 96 bit shift = we need 3 registers

now going back to the delay example



so, 1 slice can give  $4 \times 17$  bit input delay

but we need  $64 \times 17$  bit input delay

so, 16 slices needed

136 → 16 slices needed (FF) (LUT)

68 → 16 : CLBs needed

in one CLB we have one SLICE M

project settings → synthesis

In one CLB, we have

2 slices

8 LUTs

16 FFs

$64 \times 4 \leq 286$  bit mem

128 bit shift register

## \* 64 bit shift register using LUTs

we use 2 LUTs



2 slice can only implement max 128 bit shift register because each slice has 4 LUTs and each LUT can implement 32 bit shift register

for 96 bit shift = we need 3 LUTs

now going back to the delay example



so, 1 slice can give  $4 \times 17$  bit input delay

but we need  $64 \times 17$  bit input delay

so, 16 slices needed

136 → 16 slices needed (FF) (LUT)

68 → 16 : CLBs needed

in one CLB we have one SLICE M

project settings → synthesis

In one CLB, we have

2 slices

8 LUTs

16 FFs

$64 \times 4 \leq 286$  bit mem

128 bit shift register

## \* Push Button

module counter (

input pb,

output [7:0] out

)

always @ (posedge pb)

begin

out <= out + 1

end

endmodule

but we get (1) → (5) → (7)

due to push button debounce



solution: if push button is high for 3 consecutive clock cycles, then only do the logic

implementation: pass values initially to 2 FFs and 1 IF-else and check with current value

switch debounce

FF → FF → FF → set out after a press

\* Lab 6  $\Rightarrow$  AXI Interface 24/09/24

- basics of AXI Interface
- design of floating point arithmetic

$$y = \frac{1}{\ln(x)}$$

- HW: ACCELERATOR:  $Z = \sqrt{x} + \frac{1}{\ln(x)} + 1.5$
- compare which is faster: processor vs FPGA  
 $(C/C++)$        $(Verilog)$   
for communication  
between the two  $\Rightarrow$  AXI  
interface

\* Advanced Extensible Interface (AXI)



Problem:

- B does not know when transmission should end and parse the data or how much data A should send
  - No feedback mechanism from B  $\rightarrow$  A if data received correctly or not
  - A might not have all data ready at each clock rising edge. A should have control to pause transmission.
- > Note: SPI solves all these problems



PROBLEM Now:

- Starting and ending point unknown to B

$\Rightarrow$  we add a "LAST" signal which goes high when the last byte is being sent whenever the last byte of data is being sent, last bit goes high alongside valid signal independent of ready (B) signal



\* Types of HANDSHAKES

- Valid before Ready
- Ready before Valid
- Valid with Ready

\* PROBLEM:

- Valid should not wait until Ready becomes one {stuck in deadlock}

- Once valid is set to high, it cannot be reseted until handshake happens

- DEST can set Ready independent of VALID it is allowed to go back to low after being set to high even before VALID is set to high

> Note: Clock is needed in order to know when transmission is occurring (indicated by the positive edge of the clock)

In today's lab  $\Rightarrow$  uni-sided data transfer from A to B

- AXI STREAM

$$y = \frac{1}{\ln(x)}$$



PROBLEM Now:

- Starting and ending point unknown to B

$\Rightarrow$  we add a "LAST" signal which goes high when the last byte is being sent whenever the last byte of data is being sent, last bit goes high alongside valid signal independent of ready (B) signal



PROBLEM Now:

- Once valid is set to high, it cannot be reseted until handshake happens

$\Rightarrow$  we add a "LAST" signal which goes high when the last byte is being sent whenever the last byte of data is being sent, last bit goes high alongside valid signal independent of ready (B) signal



\* Verilog:

latency = 23  $\Rightarrow$  means it takes 23 clock cycles to carry out log function

slave interface: data input  
Valid input  
Ready output

master interface: data output  
Valid output  
Ready input

latency for reciprocal: 30 cycles

total cycle delayed to find output:  
53 cycles  
our clk cycle = 10ns per

$\Rightarrow$  total delay: 530ns

## • BRAM

in the FPGA, we can store data using:

(1) LUT 64 bits

(2) FF

(3) BRAM 36 kb

for BRAM, we can use BRAM IP directly from vivado's IP catalog

OR we can code our logic for memory with specific conditions for synthesis with BRAM

ultraRAM 288 kb

in one LUT we can store 64 bits of memory

in one slice  $\Rightarrow$  256 bits

in one CLB  $\Rightarrow$  256 bits

in one BRAM  $\Rightarrow$   $36 \times 1024 \times 8$  bits

also, in some boards, we have ultra RAM

(can store max 288 kb mem) but ours doesn't have it.

- We had async read & sync write in LUT
- For BRAM, we have fully synchronous operations
- Configurations
  - True dual port
  - Simple dual port
  - Single port

• Each BRAM can be segmented as:

(1) 36 kb BRAM : we can access data at any place by providing address

(2) 36 kb FIFO : sequentially access only

(3) 18 kb BRAM x 2

(4) 18 kb FIFO + 18 kb BRAM



We can read the same data multiple times anytime

once the data is read, it is pushed out of the memory

needs internal counter to get from where to read / write data

Requires a separate controller alongside memory

\* Integrate cascade logic to build larger memories eg.: to get 72 kb BRAM

## • BRAM Configuration

two ports: A and B

each port can do read / write operation

multi-bit read / write addresses = configurable memory



both ports are symmetrical and totally independent, sharing only the shared data.

Each port can be configured in one of the available widths, independent of the other port

Read port width can be different from the write port width.

BRAM has a global enable signal  $\Rightarrow$  read / write operations are carried out only when EN = high  $\Rightarrow$  else NOP

power  $\downarrow$

efficient  $\checkmark$

$\neg$  data written is passed as opt to DQ<sub>A</sub>  
data is available at output first

$\neg$  DQ<sub>B</sub> (output) holds its previous values (power saver)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

$\neg$  we need to store this in separate memory (could be FF)

## \* Lecture : 15

26/09/29

### ⇒ BRAM

- fully synchronous operations
- memory access (read/write) is controlled by the clk
- to support read-first and no-change modes, we have a latch and register at the end.

### • PIPELINING



increase in hw cost? ↴ No  
increases efficiency / throughput of existing resources ✓  
Clock rate

- output latches will be loaded only when write mode = read-first and write-first not when mode = no-change
- first data gets loaded into output latch then write operation in READ-FIRST



- first write operation to memory and then reads updated data to output latch in WRITE-FIRST



- When WE is inactive and EN is active, read operation happens and the data referenced by the address bus appear on the data-out bus regardless of write mode

### • TYPES OF BRAM

#### ① Dual Port block BRAM

Eg: memory with 1K locations and each 8bits

→ we need 10 bits for address bus and 8 bits for data bus

now, for 2K → 9 bits

addr bus = 11 bits

data bus = 9 bits

now, for 4K → 12 bits

addr bus = 12 bits

data bus = 12 bits

now, for 8K → 13 bits

addr bus = 13 bits

data bus = 13 bits



USING parity bits we can detect for any errors in data

In our BRAM, it supports one parity for every byte of data

So, for 1K locations with each size 8bit, we can add an additional parity bit

So, 1K, 8bit + 1

512, 16bit + 2

256, 32bit + 4

⋮ parity bit

for 36kb of memory, we can have these configs :

1K → 32 + 4

2K → 16 + 2

4K → 8 + 1

8K → 4 + 0

16K → 2 + 0

32K → 1 + 0

maximum locations

so, address bus 2<sup>15</sup>

15 bits

allows 2 adjacent BRAMs to cascade hence ADPA is 16 bits and not 15 bits

why do we have 4 bits for WEN?

Eg: WEN = 1000 i.e. enabled for one byte

in 1K → 32+4 we can write one byte at a time using this

Eg: F X X X garbage value

for WEN = 0110

→ X A B X

Saves time ✓

This is called Byte-wide write enable

→ allows writing eight bit (one byte) portions of incoming data.

if last read data = ABCD new data-in = XFXFX

with mode = READ-first output = ABCD

but in WRITE-FIRST, output = AFCD

Combination of new and previous value ↴

so the same memory location

data corruption case

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

both ports are symmetrical in context of this table i.e. when A is in read mode and B has

one of the 3 write modes, the output will be similar as above but just reversed

&lt;p



## Fourier Transfer

Processor vs FPGA which will be faster?  
 ↳ depends on which algorithm used.



## • Objective

- basics of fourier transform
- design floating point FFT for input of size 8 elements

HW1: design f.p. FFT for input of size 32 elements

$$HW2: \text{design } Z = \frac{x}{y} + \sqrt{\frac{2 \ln(t)}{y}}$$

$x, y, t$  : type int

$t > y > x$

from one domain to another

## • Fourier Transform

↓  
decompose a signal into sinusoidal terms

time domain → freq. domain

in 4G : ft of size 2048

in 5G : ft of size 4096

FS: for periodic signals  
 FT: for non-periodic signals

B.Tech kitne duration ka hai?  
 = fourier

$$FT(\delta(t)) = 1$$

constant → contains all frequencies

FT : time → freq

$$FT^{-1} = IFT : freq \rightarrow time$$

FT

↳ Continuous

↳ Discrete : Sampled / present at only certain time instants

↓

We will work with this since we are using the ADC to get the digital signal

the ADC

$$\text{Discrete FT} \quad X(k) = \sum_{n=0}^{N-1} x[n] e^{-j k \omega_n} = \sum_{n=0}^{N-1} x[n] e^{-j k \frac{2\pi n}{N}}$$

freq. domain signal

discrete time domain signal

size of F.T.  $\Rightarrow N$

4G  $\Rightarrow 2048$

5G  $\Rightarrow 4096$

Lab  $\Rightarrow 8$

Lab KWS  $\Rightarrow 32$

$$x[n] \rightarrow \boxed{FFT} \rightarrow X(k)$$

configuration

$N \equiv$  transform length

We have to send  $N$  data

continuously

and similarly receive  $N$

data continuously

$x[n]$  : complex number

(64bit)

real (32)

imaginary (32)

↓

↓

to verify, use

matlab :

concatenation

$$x = [0 0 0 0 0 0 0 1];$$

$$fft(x);$$

## \* Lecture: 16

ARM and FPGA communication?

↳ AXI Protocol

Advanced extensible interface

Zynq = ARM

AXI is native

Some IP provide both

Some modern ones have only AXI

ARM  $\xleftarrow{\text{AXI}}$  FPGA

in order to have

efficient communication between multiple blocks of a SoC, we have a standard protocol eg: AXI

AXI is popular because it is the protocol which is used by the famous ARM processors

AXI: very popular and is implemented in almost all IPs

wishbone protocol  $\rightarrow$  hardware block famous in the past

## • AMBA: advanced microcontroller bus architecture

AMBA

↳ APB: advanced peripheral bus

↳ AHB: advanced high performance bus

↳ AXI: advanced extensible interface

↳ ATB: advanced trace bus

processor accesses memory for - persistent data storing - instruction storing

AHB used for processor  $\leftrightarrow$  memory communication because accessing memory is slow

APB used for UART communication

↳ slower than AHB

ATB used by debugger to communicate with processor eg. when breakpoint is hit, it needs to fetch register values etc.

## \* AXI

↳ memory mapped  
↳ stream  
↳ lite

memory map

ARM processor treats every other block as memory i.e. communication involves only read and write operation

no need to worry about separate communication logic, say, between USB, UART

start address

end address

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

↓

## \* Lecture 17

17/10/24

last lecture on AXI  
next quiz → AXI

### • Byte addressable memory ↴

every location is one byte  
in size  
every byte has a separate  
address

e.g.:



one byte  
each



2 bytes  
each



4 bytes  
each

- each of the independent channels → we have a set of information signals

e.g.: AW/AR: address bus

W/R: data bus

B: response bus

- R/W include the LAST signal

- AXI provides simultaneous, bidirectional data transfer

- Both read and write operations can happen simultaneously

**BURST TRANSFER:** we do not need to send separate address and control data to get each separate byte.  
we can send one address and fetch x amount of data from data.

**BURSTS** can vary from 1-256 data transfers per burst

- EACH BURST transfer size can be 1-128 bytes (must be of form  $2^n$  ofc)

- max data per burst →  $256 \times 128 = 32\text{kb}$  (i.e. for one address)

what if i want 16kb : two options  
no. of bursts:  $256 \rightarrow 128$   
or each burst size =  $128 \rightarrow 64$

how about 1byte of data?  
no. and each size = 1

4 bytes? →  $1 \times 4$  or  $4 \times 1$

3 bytes? → burst size = 3  
transfer size = 1

258 bytes? →  $128 \times 2$

259 bytes? → need NULL transfer

### Byte lane strobe signal

for every 8 bits of data, indication which bytes of data are valid.

BLSS size = burst transfer size

so, we have e.g. 8bits BLSS for 8bytes of data

now for 259 bytes data transfer →

130 data transfers per burst

2 burst transfer size

with BLSS = 01 for last transfer

for 9 bytes data transfer and with fixed 4 data transfers/burst

→ 4 burst transfer size with BLSS:

1111  
1111  
1000 zeros mean

0000 data null

not needed

AXI: Read

ACLK

ARID[3:0] identifier for when we have multiple slaves/masters in interconnect

ARADDR size of each transfer (1-128)

ARVALID

ARREADY

WSTRB[3:0]

eg: F = 111

all 3 bytes are valid

are valid

FIXED

INCR

WRAP

RESERVED

000 1  
001 2  
010 4  
011 8  
100 16  
101 32  
110 64  
111 128

### ALEN signal [7:0] $0 \rightarrow 255 / 1 \rightarrow 256$

size of data transfer in one burst

if AWLEN: 011 → 3 =  $3+1 = 4$  bytes transferred in one burst

Burst length: ALEN[7:0] + 1

signals: AWLEN / ARLEN

eg: for 10 bytes of data:

if we have 4bytes per burst ⇒

we can have any order like →

bits valid (not null) 1 4 4 1

Strobe signal → 1 F F 1

new signals: ID

SIZE

STROBE

BURST TYPE

### AXI memory mapped

Known for burst transfer

↳ efficiency ↑

↳ single address, multiple data

### • AXI : Lite

↳ single address, single data

↳ number of data transferred in one burst = one

↳ Bursting not supported

↳ subset of AXI: memory mapped

↳ useful for configuring hardware IP where only 2/3 bits of data required

↳ i.e. used for sending smaller sized data

for LITE → mem mapped in timing diagrams

↳ add specific addresses in ARADDR for every transaction

### • AXI : Stream

- unidirectional

- no address

- unlimited data burst size

→ write data: master → slave

complexity: STREAM < mem mapped

### Direct memory access

- AXI data moves ↓

does what it sounds like

DMA: used to switch between

STREAM ⇔ mem mapped

- CPU will config DMA its tell which photo will be stored where and hence we use AXI : lite there

so, AXI : memory mapped for large data

AXI : lite for small data

next week lab: FFT in C on ARM processor

\* CPU configures the DMA

# \* Lab 7 HW-2

objective :  $Z = \frac{x}{y} + \sqrt{\frac{2 \cdot \ln t}{y}}$

inputs :  $x, y, t$

output :  $Z$



## • LECTURE : 18

### AXI INTERCONNECT

- ↳ capable of width conversion
- ↳ multiple number of masters & slaves
- ↳ capable of conversion between different AXI standards (AXI3  $\Rightarrow$  AXI4)

↳ Clock domain transformations  $\Rightarrow$  since processor has higher clock frequency than FPGA



Can AXI master 2 communicate with AXI slave 3?

No, because only master can initiate communication but interconnect 1  $\rightarrow$  2 is connected Slave  $\rightarrow$  Master

Strobe signal is not that useful in case of read transaction because the data would be of sufficient length OR amount when returned from slave

$\rightarrow$  higher clock freq preferred for better performance but needs more power

$\rightarrow$  what if we connect one block which operates at 2Hz clock and other operating at 1Hz clock

the 2Hz block is producing 2 samples per second whereas the other is producing one/s so, 2Hz block is loosing data.

So, we need a buffer to capture the lost data

Hence we use AXI interface between the two blocks and

the interconnect needs to have common clock which can be

calculated so that there is no data loss btw the two blocks.

### • System

- ↳ A way of working, organizing or doing one/more tasks according to a fixed plan, program, set of rules

Embedded System: designed to run on its own

↳ without human intervention and

may be required to respond to events in real time

a larger machine

demands: higher integration, higher performance

low power consumption, low cost

flexibility, scalability

### • Hardware: ASIC vs ASSP

|                         | ASIC                                                    | ASSP | 2 chip soln                                                                                                                                                                     |
|-------------------------|---------------------------------------------------------|------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Performance             | +                                                       | +    |                                                                                                                                                                                 |
| Power                   | +                                                       | +    |                                                                                                                                                                                 |
| Unit Cost               | +                                                       | +    |                                                                                                                                                                                 |
| TCO                     | •                                                       | +    | +<br>↳ fabrication, someone else is doing it for you                                                                                                                            |
| total cost of ownership | very high<br>↳ because you are designing your own chip. |      |                                                                                                                                                                                 |
| Risk                    | -                                                       | +    | -<br>↳ because of delay in communication                                                                                                                                        |
| Time to market (TTM)    | -                                                       | +    | -<br>↳ Price: FPGA $\gg$ ASIC                                                                                                                                                   |
| Flexibility             | -                                                       | -    | •                                                                                                                                                                               |
| Scalability             | -                                                       | •    | +<br>↳ for a startup, ASSP is a better choice due to limited fabrication resources and lack of man power but you risk leaking your design since you are outsourcing fabrication |

How to improve SoC's price and performance?

fuse the processor & FPGA

= System on a chip SoC

### • Advantages of SoC

- Higher Performance
- Power Efficiency
- Lighter footprint
- Higher reliability
- Lower cost per unit cost is reduced
- Less flexibility
- Application Specific
- Complexity

less voltage required: < 2V

device size and weight reduced

so it can be upgraded easily after manufacture

cannot be easily adapted to other applications

requires advance skills

// Types: FPGA based SoCs or ASIC based SoCs

## • Lab 9

Handwritten notes

## • Lecture 20

12/11/24

pipelining ≠ parallel processing

utilizing existing hardware  
for getting higher performance

better

throughput

if pipelining not enough? → parallel processing

$$F \rightarrow D \rightarrow X \rightarrow M \rightarrow wB$$

nowadays we have 9 different stages

AMBA

- ↳ APB: advanced peripheral bus
- ↳ AHB: advanced high performance bus
- ↳ AXI: advanced extensible interface
- ↳ ATB: advanced trace bus

DMA: direct memory access

Stream ⇔ memory mapped

memory controller  
coprocessor



FPE: floating point engine

↳ a coprocessor specifically used for floating point operations

Eg of coprocessors: FPE / NEON / GPU / RTC

CNN: convolution neural network

## • Convolution

very high overhead

$$Y_{(k)} = \sum_{i=0}^{i=N-1} w_i X_{(k-i)} \quad \left. \right\} \text{MAC: multiplication & accumulate operation}$$

$$Y_k = w_0 X_k + w_1 X_{k-1} + \dots$$

$$Y_0 = w_0 X_0 + w_1 X_{-1} + w_2 X_{-2} + \dots = w_0 X_0$$

*we don't know past values*

$$Y_1 = w_0 X_1 + w_1 X_0 + w_2 X_{-1} + \dots = w_0 X_1 + w_1 X_0$$

$$Y_2 = w_0 X_2 + w_1 X_1 + w_2 X_0$$

$$Y_3 = w_0 X_3 + w_1 X_2 + w_2 X_1 + w_3 X_0$$

↓

if  $w_0, w_1, w_2 = 1$ ,

this is addition operation

(accumulation)

*because  $w_3 = 0$*

*(given)*

parallel computing process by nature

↓

works on processor but best carried out in FPGA

KSPS: Kilo samples per second

↳ clk rate × 1000 (to account for the kiloword)

clk cycle needed for 1 operation

there is another coprocessor which is an intermediate of PS, PL

## SIMD

single instruction, multiple data

= Neon Processor

only one op<sup>n</sup> at a time

we have this in our

ARM cortex A9.

memory mapped: each external unit has to have a starting address & end address

## • Address Allocations

making an IP memory mapped



↓

memory mapped

IP

load A -

load B -

add C A B

set status "done"

processor reads

"done"

then reads value of C

let each block = 32bit

then the mem requirement is = 128bit

now this IP is AXI compatible

for a mem requirement of 32bytes, we have a 5bit address

from 00000

↓

11111

for 1Kb → 10bit addr bus

1GB → 30bit addr bus

let mem reg = 1Kb,

start addr : 0x4000\_00FF

if it was all 0s, then

end addr would be 11111 11111

⇒ 0x3FF

now, end add = 0x4000\_03FF

let start add = 0x4000\_0800

end add = 0x4000\_0BFF

let mem reg. = 1MB,

then we need to add FFFFF

start: 0x1000\_0000

mem: 1GB

addr: 3FFF\_FFFF

ans: FFFF\_FFFF

Rule: Starting address should be multiple of mem requirement