

# CS411 Project 2 - FPGA



1



# Contents

---

- 1.Systolic Array operation**
- 2.Systolic Array System Module**
- 3.Bram Example ( + VIVADO Block Diagram )**
- 4.VIVADO simulation**
- 5.About Verilog**
- 6.Project Folder**
- 7.Submission**

# Systolic Array Operation

# Systolic Array Operation

- Operation mode {Weight Stationary, Output Stationary}



Samajdar, Ananda, et al. "Scale-sim: Systolic cnn accelerator simulator." *arXiv preprint arXiv:1811.02883* (2018). fig.2

# Systolic Array Operation

- **Operation mode {Weight Stationary, Output Stationary}**

PE and Systolic Array can operate in two modes {WS, OS}

The two modes can be switched using a control signal.



H. Genc, A. Haj-Ali, V. Iyer, A. Amid, H. Mao, J. Wright, C. Schmidt, J. Zhao, A. Ou, M. Banister, Y. S. Shao, B. Nikolić, I. Stoica, and K. Asanović. Gemmini: An agile systolic array generator enabling systematic evaluations of deep-learning architectures. arXiv:1911.09925 [cs.DC], 2019.

# Systolic Array Operation

## ▪ Operation mode {Weight Stationary}

- Pre-load weight matrix into the systolic array.
- Flow the activation matrix into the systolic array.
- Accumulate partial sums in the accumulator.
- Store the accumulated result to RAM.



<https://www.sciencedirect.com/science/article/pii/S0167926022000359>

# Systolic Array Operation

- **Operation mode {Output Stationary}**

- Stream skewed activation and weight data into the systolic array
- After all dot-product results are accumulated in the each PE, drain the results into RAM



<https://www.sciencedirect.com/science/article/pii/S0167926022000359>

# **Systolic Array Modules**

# Systolic Array Modules

- Entire Diagram

[Task 3] Draw this Block Diagram



## BRAM

- A/W\_Bram
  - store activation/weight and feed data to ram
- O\_Bram
  - store result (from systolic array)
- SP\_Bram (you must follow this rule)
  - addr 0 : start signal (1: start, 0: end)
  - addr 4 : mode (1: OS, 0: WS)
  - addr 8 : M
  - addr 12: K
  - addr 16: N
  - addr 100: end signal

## Verilog Module (Your Task 1,2)

Describe “matmul\_system” in the next pages.  
In verilog coding, your goal is making matmul.v

# Systolic Array Modules

## ▪ Example Diagram

Ideally, we would create an AXI interface for direct data transfer from DRAM to RAM using DMA.

However, since we haven't covered that yet, we're using BRAM to handle data transfer. Additionally, you can simply retrieve data from BRAM.



# Systolic Array Modules



# Systolic Array Modules

- PE



- I/O ports are fixed

- act\_data\_in  
INT8 activation data port
- wgt\_data\_in  
INT8 weight data port
- result\_in  
INT32
- result\_out  
INT32
- operation\_signal\_in  
control signal for PE data flow  
{WS : weight preload, activation flow}  
{OS : data flow, drain result}

# Systolic Array Modules

- **SYSTOLIC ARRAY (8x8 example)**



- **Operation**

- mode : {OS, WS}
- input data stream every one cycle.



# Systolic Array Modules

---

- **Index/Address Generator (optional)**

- This module serves the function of determining where to store incoming data in RAM and the accumulator, as well as indicating which data stream out to system.
- The input is provided by the enable signal sent from the FSM. When the enable signal is activated, it automatically generates the addresses required for the operation of the systolic array. The output consists of address and the enable signal for RAM and the accumulator.

# Bram Example + Block Diagram

# Bram Example

---

- Create the VIVADO project.

This link explain how to create VIVADO project with PYNQ-Z2 Board

<https://discuss.pynq.io/t/tutorial-creating-a-hardware-design-for-pynq/145>



Vivado 2023.1



This example was executed on  
the following version of VIVADO.

# Bram Example



1. Click the “+” button

2. select “Zynq7 PS”

- After you make new project, build the Block Diagram
- Add Zynq7 PS that is one part of PYNQ-Z2 board.

# Bram Example



## ▪ Run Block Automation

- Make DDR, FIXED\_IO external port
- There is no need to know detail about this.

# Bram Example



Search “AXI BRAM Controller” and click



- **Add AXI BRAM Controller.**

- This IP manages data transfer to/from the PS and stores/retrieves it from BRAM

# Bram Example



# Bram Example



Then, you can see this scheme.

# Bram Example

## ▪ Entire Scheme



22

# Bram Example

## Bram Port

- addr : address of Bram
- clk : clock signal
- din : data\_in
- dout : data\_out
- rst : reset
- wea[3:0] : write enable (1 bit → 1 byte)

(read)

addr = [address]

dout = (you can get data from here)

(write)

addr = [address]

wea = 4'b1111

din = [4 bytes DATA]



# Bram Example



**Make wrapper verilog file from your Block Diagram**

2. you can make this verilog file : verilog file of Block Diagram

1. click

# Bram Example



# Bram Example



In VIVADO project folder, please search for the ‘bit stream file’ and ‘hwh file’ using the keywords ‘bit’ and ‘hwh.’

Once located, select these two files.

Next, open the Jupyter notebook and move these selected files into the Jupyter notebook folder.

# Bram Example



Next, click the Upload button

Upload two file, and change file\_name as you want.  
In this case, name is “CS411\_bram\_example”

# Bram Example

---



Plus, make new Notebook

# Bram Example

```
In [1]: from pynq import Overlay  
ol = Overlay("CS411_bram_example.bit")
```

your bitstream file name

```
In [2]: ol?
```

```
In [3]: bram_in1 = ol.axi_bram_ctrl_0.mmio.array  
bram_in2 = ol.axi_bram_ctrl_1.mmio.array  
bram_out = ol.axi_bram_ctrl_2.mmio.array
```

```
In [4]: print(bram_in1[0:4])  
print(bram_in2[0:4])  
print(bram_out[0:4])
```

[0 0 0 0]  
[0 0 0 0]  
[0 0 0 0]

```
In [5]: bram_in1[0]=5  
bram_in2[0]=6
```

You can read value from BRAM

You can write value into BRAM

ol? generate this

```
Type:          Overlay  
String form: <pynq.overlay.Overlay object at 0xb3899da8>  
File:          /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/overlay.py  
Docstring:  
Default documentation for overlay CS411_bram_example.bit. The following  
attributes are available on this overlay:  
  
IP Blocks  
-----  
processing_system7_0 : pynq.overlay.DefaultIP  
  
Hierarchies  
-----  
adder_0 : pynq.overlay.DefaultHierarchy  
  
Interrupts  
-----  
None  
  
GPIO Outputs  
-----  
None  
  
Memories  
-----  
axi_bram_ctrl_0 : Memory  
axi_bram_ctrl_1 : Memory  
axi_bram_ctrl_2 : Memory  
PSDDR : Memory
```

# Bram Example

---

```
In [5]: bram_in1[0]=5  
       bram_in2[0]=6
```

```
In [6]: print(bram_out[0])
```

```
11
```

Our adder module read data from addr0 of bram\_in1 and bram\_in2, and store the result into addr0 of bram\_out

<https://www.youtube.com/watch?v=zbumITJQ2Z8>

[Using PYNQ MMIO (Memory Mapped IO)]

“Now you can communicate with Bram.”

# Vivado Simulation

# Vivado Simulation

- Set Design source file and Simulation source file



This button is for adding new source file (modules or testbenches.)

To set the proper file as the Design Source, right-click on it and select 'Move to Design Source.' This file will be used during the synthesis, implementation, and bitstream writing processes.

To set the proper file as the Simulation Source, right-click on it and select 'Move to Simulation Source'

# Vivado Simulation

## ▪ Simulation

- prepare test bench file (.v)



**(necessary)**

You can run the simulation using your 'module.v' and 'testbench.v' files. The simulation is based on the RTL files.

**(necessary)**

You can run the simulation after synthesis. First, synthesize the module and prepare 'testbench.v.' The simulation will be based on the netlist, which may produce different results from the behavior simulation.

**(not necessary)**

You can run the simulation after implementation. Please note that if your module has more than 125 I/O ports, you may not be able to proceed with the implementation step.

# Vivado Simulation

## ▪ Behavior Simulation

- prepare test bench file (.v)

100ns/1ns is recommended  
timescale: #1 = 100ns  
simulate every 1ns

declare ports :  
input → reg  
output → wire

Make the instance of your module  
that you want to simulate.

```
1  `timescale 100ns/1ns
2
3  module tb_bram_combine ();
4
5      wire [5:0] DEBUG_STATE;
6      reg[31:0]a_addr;
7      reg [31:0]a_data;
8      reg [3:0]a_wen;
9      reg clk;
10     reg reset;
11     reg [31:0]o_addr;
12     wire [31:0]o_data_out;
13     reg [31:0]sp_addr;
14     reg [31:0]sp_data_in;
15     reg [3:0]sp_web;
16     reg [3:0]w_web;
17
18
19 endmodule
20
21
22
23
24
25
26
27
28
29
30
31
32
```

design\_1\_wrapper aaa (
 .clk(clk),
 .reset(reset),
 .a\_addr(a\_addr),
 .a\_data(a\_data),
 .a\_wen(a\_wen),
 .sp\_addr(sp\_addr),
 .sp\_data\_in(sp\_data\_in),
 .sp\_web(sp\_web),
 .w\_web(w\_web),
 .STATE(DEBUG\_STATE),
 .O\_BRAM\_DATA(o\_data\_out),
 .O\_BRAM\_ADDR(o\_addr)
);

initial begin  
 clk = 1;  
 reset = 1;  
 a\_addr =0;  
 a\_data=0;  
 a\_wen=0;  
 o\_addr=0;  
 sp\_addr=0;  
 sp\_data\_in=0;  
 sp\_web=0;  
 w\_web=0;  
 #10  
 ...  
end

assign value to input port  
Wait 100x10 ns (10 cycle)  
Clock signal value change  
every 0.5x100ns  
 $T = 100\text{ns} = \#1$

# Vivado Simulation

---

- **Post-synthesis Simulation**

- You have to synthesize your module
- This simulation is bases on the synthesis result
- If you make changes to your Verilog code or block diagram, you'll need to synthesize it again.
- Ensure you have the testbench file, which should be the same as the one used for behavioral simulation.
- (Confirmed) The post-synthesis simulation result corresponds to the same operation as the behavior simulation.

# About Verilog

# Module

```
'timescale 100ns/1ns
module pe #(
    parameter integer ACT_WIDTH      = 8,
    parameter integer WGT_WIDTH      = 8,
    parameter integer MULT_OUT_WIDTH = ACT_WIDTH + WGT_WIDTH,
    parameter integer PE_OUT_WIDTH   = 32,
    parameter integer OP_SIG_WIDTH   = 3
) (
    input wire                               clk,
    input wire                               reset,
    input wire [ OP_SIG_WIDTH -1 : 0]         operation_signal_in,
    input wire [ ACT_WIDTH -1 : 0]             act_data_in,
    input wire [ WGT_WIDTH -1 : 0]             wgt_data_in,
    input wire [ PE_OUT_WIDTH -1 : 0]           result_in,
    output wire [ PE_OUT_WIDTH -1 : 0]          result_out
);
// MAKE YOUR LOGIC HERE
endmodule
```

**module name**

**parameters: you can declare them here and use them when you build logic**

**Here, You can declare I/O ports.**

**In this part, you can build your logic.**

**You can declare wires and regs, and build combinational or sequential logics.**

**(we recommend drawing diagram before writing code)**

# How to make instance of module

The diagram illustrates the Verilog code for creating an instance of a module named 'pe'. The code is annotated with labels pointing to specific parts:

- The name of module**: Points to the identifier `pe #(`.
- Name of parameter (same as names in module declaration)**: Points to the parameter declarations `.ACT_WIDTH`, `.WGT_WIDTH`, and `.PE_OUT_WIDTH`.
- The name of instance**: Points to the instance name `) pe_inst (`.
- Name of I/O (same as names in module declaration)**: Points to the port declarations `.clk`, `.reset`, `.operation_signal_in`, `.act_data_in`, `.wgt_data_in`, `.result_in`, and `.result_out`.

```
pe #(  
    .ACT_WIDTH,  
    .WGT_WIDTH,  
    .PE_OUT_WIDTH  
) pe_inst (  
    .clk,  
    .reset,  
    .operation_signal_in,  
    .act_data_in,  
    .wgt_data_in,  
    .result_in,  
    .result_out  
)
```

(ACT\_WIDTH),  
(WGT\_WIDTH),  
(PE\_OUT\_WIDTH)  
( !clk ), //input  
( reset ), //input  
( operation\_signal\_in ), //input  
( act\_in ), //input  
( wgt\_in ), //input  
( result\_in ), //input  
( result\_out ) //output

In items enclosed within parentheses represent parameters, as well as wires or registers that have been declared at the location where this instance is utilized.

# always @

---

```
reg [ WIDTH - 1 : 0] reg_inner;  
  
always @(posedge clk)  
begin  
    if(reset)  
        reg_inner <= 'b0;  
    else if (enable)  
        reg_inner <= in;  
end
```

**This example is simple register has reset and enable option.**

**Meaning of code: Every rising edge of clock, if reset is 1, then write 0 into reg\_inner, and if reset is 0 and enable is 1, then write 'in' into reg\_inner**

# FSM

---

- **FSM(Finite State Machine)**
  - Input + Current\_state  $\Rightarrow$  Next\_state
- **You can represent STATE in various ways !**

| STATE | binary encoding | Gray encoding | Johnson encoding | One-hot encoding |
|-------|-----------------|---------------|------------------|------------------|
| 0     | 00              | 00            | 000              | 0001             |
| 1     | 01              | 01            | 001              | 0010             |
| 2     | 10              | 11            | 011              | 0100             |
| 3     | 11              | 10            | 111              | 1000             |

# FSM (example)



example code ⇒



```
module fsm_ex1(clk,rst_n, go, ws, rd, ds);
    input clk, rst_n, go, ws;
    output rd, ds;
    parameter IDLE = 2'b00, READ = 2'b01, DLY = 2'b10, DONE = 2'b11;
    reg [1:0] state , next ;

    // always block for state register
    always@ (posedge clk or negedge rst_n) begin
        if (!rst_n) state <= IDLE;
        else state <= next;
    end

    // always combinational block to determine next state
    always@ (*) begin
        next = 2'bxx;
        case(state)
            IDLE : if (go) next = READ; else next = IDLE;
            READ : next = DLY;
            DLY : if (!ws) next = DONE; else next = READ;
            DONE : next = IDLE;
        endcase
    end

    //assign output
    assign rd = ((state==READ) || (state==DLY));
    assign ds = (state == DONE);
endmodule
```

<https://returnclass.tistory.com/197>

# FSM

---

If you are not familiar with FSM.  
The following slides will be helpful.

[https://inst.eecs.berkeley.edu/~eecs151/sp22/files/verilog/verilog\\_fsm.pdf](https://inst.eecs.berkeley.edu/~eecs151/sp22/files/verilog/verilog_fsm.pdf)

or Try searching on Google

# **Project Folder**

# BRAM\_example



That is for BRAM\_Example and includes the following components:

**“adder.v” : This module is utilized in the bram\_example.**

**“.hwh” “.bit” files : In your example, you need to generate a bitstream file that matches functionality if this file.**

**“.ipynb” file : You should use this file to verify functionality.**

# TASK 1



This is for TASK 1 ~ TASK 2

This folder contains a VIVADO project file (.xpr) specifically designed for running the “bram\_combine” test. You can open the VIVADO project by using this file.

There is a test bench file available. You can use this file to perform simulation.

You can find skeleton codes.  
You need to write additional module.v files to develop your matmul\_system more easily.

# TASK 4



This is the Jupyter notebook for TASK 4. Please ensure to relocate it to your Jupyter notebook folder. Within this notebook, you are tasked with creating the ‘matmul\_WS’ and ‘matmul\_OS’ functions. You may refer the example provided in the bram\_example case

# **Submission**

# Score

---

- **TASK 0: Design Document (30%)**
- **TASK 1,2: Verilog test bench (`tb_bram_combine.v`)  
{Behavior, Post-synthesis func + timing} (25%, 15%)**
- **TASK 3,4: Test functionality in jupyter notebook (30%)**

## (Extensible: recommendation)

- While it's not part of the scoring criteria, for Project 3, you do have the flexibility to adjust the size of the systolic array from 8x8 to 16x16 or 32x32 as needed. Modulating the systolic array size using parameters can indeed help reduce the workload and make the code more adaptable to different requirements.

# Reference

---

## 1. PYNQ HW Setting

- <https://blog.umer-farooq.com/a-pynq-z2-guide-for-absolute-dummies-part-i-fun-with-leds-and-switches-47dd76abf9a9>
- [https://pynq.readthedocs.io/en/latest/getting\\_started/pynq\\_z2\\_setup.html](https://pynq.readthedocs.io/en/latest/getting_started/pynq_z2_setup.html)

# Submission

---

- Due date : **2023.11.09**
- **Team\_{Team\_number}\_project2 (.zip file)**
  - └ Document
  - └ your all verilog file (.v)
  - └ VIVADO project folder
  - └ PYNQ jupyter notebook folder (.jpynb + .bit + .hwh)