

# Computer-Aided VLSI System Design

## Homework 3: Simple Convolution and Image Processing Engine

*Graduate Institute of Electronics Engineering, National Taiwan University*



NTU GIEE



# Goal

- In this homework, you will learn
  - How to synthesis your design
  - How to run gate-level simulation
  - How to use SRAM



# Introduction

- In this homework, you are going to implement a simplified convolution and image processing engine. An  $8 \times 8 \times 32$  feature map will be loaded first, and it will be processed with several functions.





# Block Diagram





# Input/Output

| Signal Name | I/O | Width | Simple Description                                                                          |
|-------------|-----|-------|---------------------------------------------------------------------------------------------|
| i_clk       | I   | 1     | Clock signal in the system.                                                                 |
| i_RST_N     | I   | 1     | Active <b>low</b> asynchronous reset.                                                       |
| i_OP_VALID  | I   | 1     | This signal is <b>high</b> if operation mode is valid                                       |
| i_OP_MODE   | I   | 4     | Operation mode for processing                                                               |
| o_OP_READY  | O   | 1     | Set <b>high</b> if ready to get next operation                                              |
| i_IN_VALID  | I   | 1     | This signal is <b>high</b> if input pixel data is valid                                     |
| i_IN_DATA   | I   | 8     | Input pixel data ( <b>unsigned</b> )                                                        |
| o_IN_READY  | O   | 1     | Set <b>high</b> if ready to get <b>next</b> input data (only valid for i_OP_MODE = 4'b0000) |
| o_OUT_VALID | O   | 1     | Set <b>high</b> with valid output data                                                      |
| o_OUT_DATA  | O   | 14    | Pixel data or image processing result ( <b>signed</b> )                                     |



# Specification (1)

- All inputs are synchronized with the **negative** edge clock
- All outputs should be synchronized at clock **rising** edge
- You should reset all your outputs when `i_rst_n` is **low**
  - Active low asynchronous reset is used and only once



# Specification (2)

- Operations are given by i\_op\_mode when i\_op\_valid is **high**
- i\_op\_valid stays only **1** cycle
- i\_in\_valid and o\_out\_valid can't be **high** in the same time
- i\_op\_valid and o\_out\_valid can't be **high** in the same time
- i\_in\_valid and o\_op\_ready can't be **high** in the same time
- i\_op\_valid and o\_op\_ready can't be **high** in the same time
- o\_op\_ready and o\_out\_valid can't be **high** in the same time



# Specification (3)

- Set `o_op_ready` to **high** to get next operation (only one cycle)
  - Raise `o_op_ready` only when the design is prepared for the next operation
- `o_out_valid` should be **high** for valid output results
- **At least one SRAM** is implemented in your design



# Specification (4)

- Only worst-case library is used for synthesis
- The synthesis result of data type should **NOT** include any **Latch**
- The slack for setup time should be **non-negative**
- **No timing violations or glitches** in the gate-level simulation **after reset**



# Waveform: Loading Image





# Waveform: Other Operations





# Operation Modes

| i_op_mode | Meaning                                          |
|-----------|--------------------------------------------------|
| 4'b0000   | Input feature map loading                        |
| 4'b0001   | Origin right shift                               |
| 4'b0010   | Origin left shift                                |
| 4'b0011   | Origin up shift                                  |
| 4'b0100   | Origin down shift                                |
| 4'b0101   | Reduce the channel depth of the display region   |
| 4'b0110   | Increase the channel depth of the display region |
| 4'b0111   | Output the pixels in the display region          |
| 4'b1000   | Perform convolution in the display region        |
| 4'b1001   | Median filter operation                          |
| 4'b1010   | Sobel gradient + non-maximum suppression (NMS)   |



# Input Image

- The input image is given in **raster-scan** order

|  |  | channel 31 |    |    |    |    |    |    |    |     |           |
|--|--|------------|----|----|----|----|----|----|----|-----|-----------|
|  |  | ..         |    |    |    |    |    |    |    |     |           |
|  |  | channel 1  |    |    |    |    |    |    |    |     |           |
|  |  | channel 0  |    |    |    |    |    |    |    |     |           |
|  |  | 0          | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 79  | 2014 2015 |
|  |  | 8          | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 87  | 2022 2023 |
|  |  | 16         | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 95  | 2030 2031 |
|  |  | 24         | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 103 | 2038 2039 |
|  |  | 32         | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 111 | 2046 2047 |
|  |  | 40         | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 119 | ..        |
|  |  | 48         | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 127 | ..        |
|  |  | 56         | 57 | 58 | 59 | 60 | 61 | 62 | 63 |     |           |



# Input Image Loading

- An 8x8x32 image is loaded for 2048 cycles in **raster-scan** order
- The size of each pixel is 8 bits (unsigned)
- Raise **o\_op\_ready** to 1 after loading all image pixels
- If **o\_in\_ready** is 0, stop input data until **o\_in\_ready** is 1
- The input feature map will be loaded only once at the beginning



# Origin

- The first pixel in the display region is **origin**
- The default coordinate of the origin is at 0
- The size of the display region is  $2 \times 2 \times \text{depth}$





# Origin Shifting

- Origin right shift



- If output of display exceeds the boundary, retain the same origin





# Channel Depth

- 3 depths are considered in this design
  - 32, 16, 8
  - Default depth is 32
- The display size will change according to different depth

| Depth | Display size           |
|-------|------------------------|
| 32    | $2 \times 2 \times 32$ |
| 16    | $2 \times 2 \times 16$ |
| 8     | $2 \times 2 \times 8$  |



# Scale-down

- Reduce the channel depth of the display region
  - Ex. For channel depth,  $32 \rightarrow 16 \rightarrow 8$
- If the depth is 8, retain the same depth





# Scale-up

- Increase the channel depth of the display region
  - Ex. For channel depth,  $8 \rightarrow 16 \rightarrow 32$
- If the depth is 32, retain the same depth





# Display

- You have to output the pixels in the display region
- Set **o\_out\_data [13:8]** to 0 and **o\_out\_data [7:0]** to pixel data
- The pixels are displayed in **raster-scan** order
  - For example: 0 → 1 → 8 → 9 → 64 → 65 → ... → 1992 → 1993

32  
16  
8

| channel 31 |    |    |    |    |    |    |    |
|------------|----|----|----|----|----|----|----|
| ..         |    |    |    |    |    |    |    |
| channel 1  |    |    |    |    |    |    |    |
| channel 0  |    |    |    |    |    |    |    |
| 0          | 1  | 2  | 3  | 4  | 5  | 6  | 7  |
| 8          | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
| 16         | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| 24         | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| 32         | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
| 40         | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
| 48         | 49 | 50 | 51 | 52 | 53 | 54 | 55 |
| 56         | 57 | 58 | 59 | 60 | 61 | 62 | 63 |



# Display

- For display, the display size changes according to the depth





# Convolution

- For this operation, you have to perform convolution **in the display region**
- The size of the kernel is a  $3 \times 3 \times$  depth and the weights in each channel are identical
- The feature map needs to be zero-padded for convolution
- The accumulation results should be **rounded to the nearest integer** [1]
  - Do not truncate temporary results during computation
- After the convolution, you have to output the **4** accumulation results in **raster-scan** order
- The values of original pixels will not be changed



# Example of Convolution





# Example of Convolution

- The number of channels that are accumulated during convolution is determined by the depth.
  - For example, accumulate 8 channels if the depth is 8.





# Median Filter Operation

- For this operation, you have to perform median filtering **in the first 4 channels of the display region**
- The kernel size of the median filter is  $3 \times 3$
- Perform median filtering on each channel **separately**
- The feature map needs to be zero-padded for median filter operation
- After median filtering, you have to output the  **$2 \times 2 \times 4$**  filtered results in **raster-scan** order
  - Set  **$o\_out\_data[13:8]$**  to 0 and  **$o\_out\_data[7:0]$**  to pixel data
- The values of original pixels will not be changed

*F*  
*chan*

# Example of Median Filter Operation





# Sobel Gradient + NMS

- Calculate gradient **in the first 4 channels of the display region** using the Sobel operator and retain only local maxima along the gradient direction
  - Conduct computations **separately** for each channel
- The kernel size of the Sobel operator is  $3 \times 3$
- The feature map needs to be zero-padded
- After computation, you have to output the  **$2 \times 2 \times 4$**  results in **raster-scan** order
- The values of original pixels will not be changed





# Sobel Gradient + NMS

- Gradient magnitude  $G(x, y) = |G_x(x, y)| + |G_y(x, y)|$



# Sobel Gradient + NMS

- Gradient magnitude  $\underline{G(x,y) = |G_x(x,y)| + |G_y(x,y)|}$





# Sobel Gradient + NMS

- Gradient direction



- Tangent approximation

| Tangent           | Approx. Value                           | Tangent            | Approx. Value      |
|-------------------|-----------------------------------------|--------------------|--------------------|
| $\tan 0^\circ$    | 0                                       | $\tan 112.5^\circ$ | $-\tan 67.5^\circ$ |
| $\tan 22.5^\circ$ | $2^{-2} + 2^{-3} + 2^{-5} + 2^{-7}$     | $\tan 135^\circ$   | $-\tan 45^\circ$   |
| $\tan 45^\circ$   | 1                                       | $\tan 157.5^\circ$ | $-\tan 22.5^\circ$ |
| $\tan 67.5^\circ$ | $2 + 2^{-2} + 2^{-3} + 2^{-5} + 2^{-7}$ | $\tan 180^\circ$   | $-\tan 0^\circ$    |

# Sobel Gradient + NMS



- Non-maximum suppression (NMS)
  - Find the direction  $d_k \in \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}$  that is closest to the gradient direction  $\theta(x, y)$
  - If the value of  $G(x, y)$  is less than any of its two neighbors along  $d_k$ , then set  $G_{NMS}(x, y)$  to 0 (suppression); otherwise, set  $G_{NMS}(x, y) = G(x, y)$
  - Output  $G_{NMS}(x, y)$  in **raster-scan** order



|   |             |             |   |
|---|-------------|-------------|---|
| 0 | 0           | 0           | 0 |
| 0 | $G_{(0,0)}$ | $G_{(1,0)}$ | 0 |
| 0 | $G_{(0,1)}$ | $G_{(1,1)}$ | 0 |
| 0 | 0           | 0           | 0 |

Example:

If  $22.5^\circ \leq \theta(0,0) \leq 67.5^\circ$ , then compare  $G(0,0)$  with its two neighbors along  $45^\circ$  direction, i. e. 0 and  $G(1,1)$ .

If  $112.5^\circ \leq \theta(0,1) \leq 157.5^\circ$ , then compare  $G(0,1)$  with its two neighbors along  $135^\circ$  direction, i. e. 0 and  $G(1,0)$ .



# Testbench

```
`timescale 1ns/1ps
`define CYCLE      5.0      // CLK period.
`define HCYCLE     (`CYCLE/2)
`define MAX_CYCLE  10000000
`define RST_DELAY   2

`ifdef tb1
  `define INFILE ".../00_TESTBED/PATTERN/indata1.dat"
  `define OPFILE ".../00_TESTBED/PATTERN/opmode1.dat"
  `define GOLDEN ".../00_TESTBED/PATTERN/golden1.dat"
`elsif tb2
  `define INFILE ".../00_TESTBED/PATTERN/indata2.dat"
  `define OPFILE ".../00_TESTBED/PATTERN/opmode2.dat"
  `define GOLDEN ".../00_TESTBED/PATTERN/golden2.dat"
`elsif tb3
  `define INFILE ".../00_TESTBED/PATTERN/indata3.dat"
  `define OPFILE ".../00_TESTBED/PATTERN/opmode3.dat"
  `define GOLDEN ".../00_TESTBED/PATTERN/golden3.dat"
```

```
`elsif tb4
  `define INFILE ".../00_TESTBED/PATTERN/indata4.dat"
  `define OPFILE ".../00_TESTBED/PATTERN/opmode4.dat"
  `define GOLDEN ".../00_TESTBED/PATTERN/golden4.dat"
`else
  `define INFILE ".../00_TESTBED/PATTERN/indata0.dat"
  `define OPFILE ".../00_TESTBED/PATTERN/opmode0.dat"
  `define GOLDEN ".../00_TESTBED/PATTERN/golden0.dat"
`endif

// Modify your sdf file name
`define SDFFILE ".../02_SYN/Netlist/core_syn.sdf"
```

```
// For gate-level simulation only
`ifdef SDF
  initial $sdf_annotate(`SDFFILE, u_core);
  initial #1 $display("SDF File %s were used for this simulation.", `SDFFILE);
`endif
```



# Pattern

P20說[13:8]設為0 那golden為什麼[13:8]還有值？

**indata\*.dat**

|          |
|----------|
| 11010000 |
| 10101101 |
| 10001011 |
| 01010110 |
| 00101110 |
| 11110001 |
| 11110110 |
| 11001101 |
| 01000110 |
| 11001110 |
| 00001011 |
| 10101111 |
| 10001000 |
| 00111111 |
| 00001100 |
| 11110100 |
| 11100100 |
| 00101100 |
| 11000100 |
| 11001000 |

**opmode\*.dat**

|      |
|------|
| 0000 |
| 0010 |
| 1000 |
| 0100 |
| 1000 |
| 0011 |
| 1000 |
| 0100 |
| 1000 |
| 0110 |
| 1000 |
| 0001 |
| 1000 |
| 0110 |
| 1000 |
| 0011 |
| 1000 |
| 0100 |

**golden\*.dat**

|                |
|----------------|
| 0010011111001  |
| 00110011000100 |
| 00110001011010 |
| 01000001010101 |
| 00110001011010 |
| 01000001010101 |
| 00101110110101 |
| 0011111110110  |
| 0010011111001  |
| 00110011000100 |
| 00110001011010 |
| 01000001010101 |
| 00110001011010 |
| 01000001010101 |
| 00101110110101 |
| 0011111110110  |
| 0010011111001  |
| 00110011000100 |
| 00110001011010 |
| 01000001010101 |
| 00110001011010 |
| 01000001010101 |
| 00101110110101 |
| 0011111110110  |



# 01\_RTL

## ■ core.v

```
module core (                                //Don't modify interface
    input          i_clk,
    input          i_RST_N,
    input          i_OP_VALID,
    input [ 3:0]   i_OP_MODE,
    output         o_OP_READY,
    input          i_IN_VALID,
    input [ 7:0]   i_IN_DATA,
    output         o_IN_READY,
    output         o_OUT_VALID,
    output [13:0]  o_OUT_DATA
);
```



# 01\_RTL

## ■ rtl\_01.f

```
// -----  
// Simulation: HW3  
// -----  
  
// testbench  
// -----  
..../00_TESTBED/testbench.v  
  
// memory file  
// -----  
//.../sram_256x8/sram_256x8.v  
//.../sram_512x8/sram_512x8.v  
//.../sram_4096x8/sram_4096x8.v  
  
// design files  
// -----  
../core.v
```



# 01\_RTL

- Run the RTL simulation under 01\_RTL folder

```
vcs -f rtl_01.f -full64 -R -debug_access+all +v2k  
+notimingcheck -sverilog +define+tb0
```

tb0, tb1, tb2, tb3, tb4

or

```
./01_run tb0 5.0
```

clock period



# 01\_RTL

- SpyGlass linting

```
./02_lint
```

- Command for cleaning temporary files

```
./99_clean_up
```

- Note that before executing the shell script, change the file permissions by

```
chmod +x ./01_run ./02_lint ./99_clean_up
```



## 02\_SYN

- **core\_dc.sdc**

```
# operating conditions and boundary conditions #
set cycle 5.0; # modify your clock cycle here #
```

- **flist.sv**

```
1 // list all paths to your design files
2 `include "../01_RTL/core.v"
```

- Run the command to do synthesis

```
dc_shell-t -f syn.tcl | tee syn.log
```



## 03\_GATE

- Run gate-level simulation under 03\_GATE folder

```
vcs -f rtl_03.f -full64 -R -debug_access+all +v2k  
+maxdelays -negdelay +neg_tchk +define+SDF+tb0
```

tb0, tb1, tb2, tb3, tb4

or

```
./03_run tb0 5.0
```

clock period



# sram\_256x8

## Pin Description

| Pin    | Description               |
|--------|---------------------------|
| A[7:0] | Addresses (A[0] = LSB)    |
| D[7:0] | Data Inputs (D[0] = LSB)  |
| CLK    | Clock Input               |
| CEN    | Chip Enable               |
| WEN    | Write Enable              |
| Q[7:0] | Data Outputs (Q[0] = LSB) |

## SRAM Logic Table

| CEN | WEN | Data Out  | Mode    | Function                                                                                                                                                         |
|-----|-----|-----------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| H   | X   | Last Data | Standby | Address inputs are disabled; data stored in the memory is retained, but the memory cannot be accessed for new reads or writes. Data outputs remain stable.       |
| L   | L   | Data In   | Write   | Data on the data input bus D[n-1:0] is written to the memory location specified on the address bus A[m-1:0], and driven through to the data output bus Q[n-1:0]. |
| L   | H   | SRAM Data | Read    | Data on the data output bus Q[n-1:0] is read from the memory location specified on the address bus A[m-1:0].                                                     |



# Submission

- Create a folder named **studentID\_hw3** and follow the hierarchy below

```
r11943006_hw3/
├── 01_RTL/
│   ├── core.v (and other Verilog files)
│   └── rtl_01.f
├── 02_SYN/
│   ├── syn.tcl
│   ├── flist.sv
│   ├── core_dc.sdc
│   └── Netlist/
│       ├── core_syn.v
│       ├── core_syn.sdf
│       └── core_syn.ddc
└── Report/
    ├── core_syn.area
    └── core_syn.timing
└── 03_GATE/
    └── rtl_03.f
report.txt
```

- Compress the folder **studentID\_hw3** in a tar file named **studentID\_hw3\_vk.tar** (*k* is the number of version, *k* =1,2,...)
  - Use lower case for the letter in your student ID  
(Ex. r11943006\_hw3\_v1.tar)



# Grading Policy

- Correctness of simulation: **70%** (follow our spec)

| Pattern | Description                    | RTL simulation | Gate-level simulation |
|---------|--------------------------------|----------------|-----------------------|
| tb0     | Load + shift + scale + display | 5%             | 5%                    |
| tb1     | Load + shift + scale + conv.   | 5%             | 10%                   |
| tb2     | Load + shift + median filter   | 5%             | 5%                    |
| tb3     | Load + shift + Sobel + NMS     | 5%             | 10%                   |
| tb4     | All operations (w/o display)   | 5%             | 5%                    |
| tbh     | <u>Hidden patterns</u>         | x              | 10%                   |

- Performance: **30%**
  - Performance = **Area \* Time ( $\mu\text{m}^2 * \text{ns}$ )**
    - **Time = total simulation time of tb4**
    - The lower the value, the better the performance
  - **Performance score only counts if your design passes all the test patterns**



# Grading Policy

- **No late submission is allowed**
  - Any submissions after the deadline will receive 0 point
- Lose **5 points** for any wrong naming rule or format for submission
  - Ensure that the submitted files can be decompressed and executed without issues
- **No plagiarism**
  - Plagiarism in any form is strictly prohibited, including copying from online sources or past assignments



# Grading Policy

- **Violations of any spec (p.6 – p.9) incur point penalties**
  - Negative slack
    - 0 point for gate-level simulations and performance
  - Design without SRAM
    - 0 point for gate-level simulations and performance
  - Violate other rules but pass all simulations
    - Performance score \* 0.7



# Area

## ■ core\_syn.area

|                                |               |
|--------------------------------|---------------|
| Number of ports:               | 883           |
| Number of nets:                | 6074          |
| Number of cells:               | 5142          |
| Number of combinational cells: | 4756          |
| Number of sequential cells:    | 306           |
| Number of macros/black boxes:  | 1             |
| Number of buf/inv:             | 1241          |
| Number of references:          | 269           |
|                                |               |
| Combinational area:            | 66609.370834  |
| Buf/Inv area:                  | 10982.178019  |
| Noncombinational area:         | 12156.778545  |
| Macro/Black Box area:          | 131906.968750 |
| Net Interconnect area:         | 565696.185242 |
|                                |               |
| Total cell area:               | 210673.118129 |
| Total area:                    | 776369.303371 |

Number of macros/black boxes  
should not be 0

**210673.118129  $\mu\text{m}^2$**



# Report

- Your design will be simulated using the clock period in report.txt
- **report.txt**

```
1 StudentID:  
2  
3 Clock period: 5.0 (ns)  
4  
5 Area : 210673.118129 (um^2)  
6
```

The clock period that can pass all gate-level simulations without any timing violations



# Discussion

- **NTU Cool Discussion Forum**
  - For any questions not related to assignment answers or privacy concerns, please use the NTU Cool discussion forum
  - **TAs will prioritize answering questions on the NTU Cool discussion forum**
- **Email: r11943006@ntu.edu.tw**
  - Title should start with **[CVSD 2024 Fall HW3]**
  - Email with wrong title will be moved to trash automatically



# References

- [1] Rounding to the nearest
  - [Rounding - MATLAB & Simulink \(mathworks.com\)](#)
- [2] R. C. Gonzalez and R. E. Woods, *Digital Image Processing*, 4th edition, Pearson, 2018.
- [3] Image gradients and Sobel kernels
  - [Image Gradients with OpenCV \(Sobel and Scharr\)](#)