

# **Microprocessor Applications**

## **(Term Project 2)**



# **TP (2)**

**2025. 6. 16.**

### **Students**

**202110410** 조민우

**202110365** 김상만

- **64pt FFT Accelerator with Multiple Cores**

- **Result of TP\_2**
- **Code Review**
- **How to Debug**
- **Ablation Study**
- **Roofline Model**
- **How to improve**
- **Summary**

# Result of TP\_2

## ■ Accuracy & Performance

Output count has reached the total number of samples.  
Validating bit-accurate output...

- Total mismatches: 0 / 192

Checking SRAM occupancy...

- SRAM 2: occupancy = 32

- SRAM 3: occupancy = 32

Test passed. Ending simulation.

\$finish called from file "[../../tb/tb\\_mem\\_IO.v](#)", line 132.

\$finish at simulation time [6634000](#)

Simulation complete, time is [6634000 ps](#).



### • Accuracy

- Total mismatches : 0/192

### • Performance

- Simulation Time : 6,634,000[ps]

### • Memory Capacity

- Used Memory: SRAM 0: 32(128B)

- Used Memory: SRAM 1: 32(128B)

- **Implement POV**

- Pipeline
- Effective Memory Capacity
- Synchronizing with dummy

- **Code Review**

# Code Review (1/7)

## ■ Implement POV

- Pipelining with 3 cores (HOW?)
  - Distribute the roles among the cores



Specify Cores roles

|               | A                           | B        | C         | D         | E         | F         | G         | H         | I         | J         | K         | L         | M         | N         | O         | P         | Q         | R         | S         | T         | U         | V         | W         | X         | Y         | Z         | AA        | AB        | AC        | AD        | AE        | AF        | AG        | AH        |
|---------------|-----------------------------|----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| <b>stage0</b> |                             |          |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |
| 1             | time->                      | 1        | 2         | 3         | 4         | 5         | 6         | 7         | 8         | 9         | 10        | 11        | 12        | 13        | 14        | 15        | 16        | 17        | 18        | 19        | 20        | 21        | 22        | 23        | 24        | 25        | 26        | 27        | 28        | 29        | 30        | 31        | 32        | 33        |
| 2             | risc-v 1                    | lw<br>d0 | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  |           |
| 3             | risc-v 2                    |          |           |           |           | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  |           |
| 4             | risc-v 0                    | lw<br>w( |           |           |           |           |           |           |           | sw<br>w(t | sw<br>w(o | sw<br>w(o | sw<br>w(t | sw<br>w(o |
| 5             |                             |          |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |
| 6             | <b>stage1,2,3,4,5</b>       |          |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |
| 7             | time->                      | 1        | 2         | 3         | 4         | 5         | 6         | 7         | 8         | 9         | 10        | 11        | 12        | 13        | 14        | 15        | 16        | 17        | 18        | 19        | 20        | 21        | 22        | 23        | 24        |           |           |           |           |           |           |           |           |           |
| 8             | risc-v 1                    |          |           |           |           | lw<br>o_  | sw<br>o_  |           |           |           |           |           |           |           |           |           |           |           |
| 9             | risc-v 2                    | lw<br>d0 | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  | lw<br>d0  | lw<br>d1  | sw<br>d0  | sw<br>d1  |           |
| 10            | risc-v 0                    | lw<br>w( | sw<br>w(t | lw<br>w(o | sw<br>w(o | sw<br>w(t | sw<br>w(o |           |
| 11            |                             |          |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |
| 12            | <b>stage 5 Output store</b> |          |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |
| 13            | time->                      | 1        | 2         | 3         | 4         | 5         | 6         | 7         | 8         | 9         | 10        | 11        | 12        | 13        | 14        | 15        | 16        | 17        | 18        | 19        | 20        | 21        | 22        | 23        | 24        |           |           |           |           |           |           |           |           |           |
| 14            | risc-v 1                    |          | lw<br>o_  | sw<br>o_  |           |           |
| 15            | risc-v 2                    | lw<br>o_ | sw<br>o_  | lw<br>o_  | sw<br>o_  | lw<br>o_  | sw<br>o_  | lw<br>o_  | sw<br>o_  | lw<br>o_  | sw<br>o_  | lw<br>o_  | sw<br>o_  | lw<br>o_  | sw<br>o_  | lw<br>o_  | sw<br>o_  | lw<br>o_  | sw<br>o_  | lw<br>o_  | sw<br>o_  | lw<br>o_  | sw<br>o_  | lw<br>o_  | sw<br>o_  | lw<br>o_  | sw<br>o_  | lw<br>o_  | sw<br>o_  | lw<br>o_  | sw<br>o_  |           |           |           |
| 16            | risc-v 0                    |          |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |
| 17            |                             |          |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |
| 18            |                             |          |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |           |

# Code Review (2/7)

## Implement POV-cont'd

- Pipelining with 3 cores (HOW?)

- Distribute the roles among the cores



Specify Cores roles



```
# Load twiddle factors for all stages (32 values)
# Butterfly 0-31: Load W[0] for stage 0 (all W[0])
```



```
# Butterfly 0
lw t0, 0(s0)      # d0(SRAM_I/O->t0)
lw t1, 128(s0)    # d1(SRAM_I/O->t1)
sw t0, 0(s1)      # d0(t0->SRAM_0)
sw t1, 0(s2)      # d1(t1->SRAM_1)
addi x0, x0, 1    # dummy
```



```
# Butterfly 0
lw t2, 0(s1)      # d0(SRAM_0->t2)
lw t3, 0(s2)      # d1(SRAM_1->t3)
sw t2, 0(s4)      # d0(t2->dpath0)
sw t3, 0(s5)      # d1(t3->dpath1)
addi x0, x0, 0x0000 # dummy
addi x0, x0, 0x0000 # dummy
```

Use 6 dummy codes before starts

# Code Review (3/7)

## ■ Implement POV-cont'd

- Pipelining with 3 cores (HOW?)
  - Distribute the roles among the cores



```
# Butterfly 0
lw t4, 0($3)      # W[0](SRAM_W->t4)
addi x0, x0, 1     # dummy
sw t4, 0($6)      # W[0](t4->dpath2)
lw t4, 64($3)     # W[16](SRAM_W->t4)_next twiddle factor
lw t5, 4($4)       # o_d0(dpath0->t5)
sw t5, 0($1)       # o_d0(t5->SRAM_0)
```

```
# Butterfly 1
sw t4, 0($6)      # W[16](t4->dpath2)
lw t4, 0($3)      # W[0](SRAM_W->t4)_next twiddle factor
lw t5, 4($4)       # o_d0(dpath0->t5)
sw t5, 0($2)       # o_d0(t5->SRAM_1)
```



```
# Butterfly operations (32 total)

# Butterfly 0
lw t6, 4($5)      # o_d1(dpath1->t6)
sw t6, 4($1)       # o_d1(t6->SRAM_1)
addi x0, x0, 1     # dummy
```



```
# Butterfly 0
lw t2, 0($1)       # d0(SRAM_0->t2)
lw t3, 4($1)       # d1(SRAM_0->t3)
sw t2, 0($4)       # d0(t2->dpath0)
sw t3, 0($5)       # d1(t3->dpath1)
```

# Code Review (4/7)

## ■ Implement POV-cont'd

- Effective Memory Capacity
  - Memory is reused after use

(The "reg" expression was omitted for simplicity)



|      |             |                    |
|------|-------------|--------------------|
| lw   | t0, 0(s0)   | # d0(SRAM I/O->t0) |
| lw   | t1, 128(s0) | # d1(SRAM I/O->t1) |
| sw   | t0, 0(s1)   | # d0(t0->SRAM_0)   |
| sw   | t1, 0(s2)   | # d1(t1->SRAM_1)   |
| addi | x0, x0, 1   | # dummy            |

<Store Input to SRAM 0, 1>

|      |                |                  |
|------|----------------|------------------|
| lw   | t2, 0(s1)      | # d0(SRAM_0->t2) |
| lw   | t3, 0(s2)      | # d1(SRAM_1->t3) |
| sw   | t2, 0(s4)      | # d0(t2->dpath0) |
| sw   | t3, 0(s5)      | # d1(t3->dpath1) |
| addi | x0, x0, 0x0000 | # dummy          |
| addi | x0, x0, 0x0000 | # dummy          |

<Store dpath output to SRAM 0, 1>

|    |           |                    |
|----|-----------|--------------------|
| sw | t4, 0(s6) | # W[0](t4->dpath2) |
| lw | t5, 4(s4) | # o_d0(dpath0->t5) |
| lw | t6, 4(s5) | # o_d0(dpath1->t6) |
| sw | t5, 0(s1) | # o_d0(t5->SRAM_0) |
| sw | t6, 0(s2) | # o_d0(t6->SRAM_1) |

<Store dpath output to SRAM 0, 1>

# Code Review (6/7)

## ▪ **Implement POV-cont'd**

- Synchronizing with dummy

- Dummy codes were inserted during IDLE state when a core had no tasks to perform

## Synchronizing with other cores!!

```

lw      t4, 0($3)          # W[0](SRAM_W->t4)

addi   x0, x0, 1           # dummy

# Load twiddle factors for all stages (32 values)
# Butterfly 0-31: Load W[0] for stage 0 (all W[0])

# Butterfly 0
sw t4, 0($6)          # W[0](t4->dpath2)
lw t5, 4($4)          # o_d0(dpath0->t5)
lw t6, 4($5)          # o_d0(dpath1->t6)
sw t5, 0($1)          # o_d0(t5->SRAM_0)
sw t6, 0($2)          # o_d0(t6->SRAM_1)

```

# Code Review (7/7)

## ■ Implement POV-cont'd

### • Synchronizing with dummy

- Dummy codes were inserted during IDLE state when a core had no tasks to perform

```
lw      t4, 0(s3)      # W[0](SRAM_W->t4)
addi   x0, x0, 1        # dummy

# Load twiddle factors for all stages (32 values)
# Butterfly 0-31: Load W[0] for stage 0 (all W[0])

# Butterfly 0
sw t4, 0(s6)      # W[0](t4->dpath2)
lw t5, 4(s4)        # o_d0(dpath0->t5)
lw t6, 4(s5)        # o_d0(dpath1->t6)
sw t5, 0(s1)        # o_d0(t5->SRAM_0)
sw t6, 0(s2)        # o_d0(t6->SRAM_1)
```



# How to Debug (1/7)

## Using C & Excel for debugging with Waveforms

```
if (set == 0 && l == 1)
{
    printf("State of SRAM_0 after 0 stage of first set\n");
    for (j = 0; j < L_FFT; j++)
        printf("SRAM_0[%2d] = \t0x%8x\n", j, SRAM_0[j]);
}

if (set == 0 && l == 2)
{
    printf("State of SRAM_0 after 1 stage of first set\n");
    for (j = 0; j < L_FFT; j++)
        printf("SRAM_0[%2d] = \t0x%8x\n", j, SRAM_0[j]);
}

if (set == 0 && l == 4)
{
    printf("State of SRAM_0 after 2 stage of first set\n");
    for (j = 0; j < L_FFT; j++)
        printf("SRAM_0[%2d] = \t0x%8x\n", j, SRAM_0[j]);
}

if (set == 0 && l == 8)
{
    printf("State of SRAM_0 after 3 stage of first set\n");
    for (j = 0; j < L_FFT; j++)
        printf("SRAM_0[%2d] = \t0x%8x\n", j, SRAM_0[j]);
}

if (set == 0 && l == 16)
{
    printf("State of SRAM_0 after 4 stage of first set\n");
    for (j = 0; j < L_FFT; j++)
        printf("SRAM_0[%2d] = \t0x%8x\n", j, SRAM_0[j]);
}
```

| stage0                  | stage1                   | stage2                   | stage3                   | stage4                   | stage5                   |
|-------------------------|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|
| SRAM_0[0] = 0xd004fe    | SRAM_0[0] = 0xee7c4abc   | SRAM_0[0] = 0xed3fb607   | SRAM_0[0] = 0xf515fe6e   | SRAM_0[0] = 0xfacaf646   | SRAM_0[0] = 0xfb0ff96    |
| SRAM_0[1] = 0xff841944  | SRAM_0[1] = 0xc54ea9a    | SRAM_0[1] = 0xe2280a22   | SRAM_0[1] = 0x9f9d103d   | SRAM_0[1] = 0x44a1692    | SRAM_0[1] = 0xdf6706e0   |
| SRAM_0[2] = 0x5d5e1278  | SRAM_0[2] = 0x5f5e0f3    | SRAM_0[2] = 0x0f68fc7    | SRAM_0[2] = 0x7f70043f   | SRAM_0[2] = 0x3cf9de2    | SRAM_0[2] = 0x35cfb3c    |
| SRAM_0[3] = 0xd5e1278   | SRAM_0[3] = 0x6f86051    | SRAM_0[3] = 0xd0406f7    | SRAM_0[3] = 0xfe7809ad   | SRAM_0[3] = 0x8200af     | SRAM_0[3] = 0xdfdbf023   |
| SRAM_0[4] = 0xfc2ffff   | SRAM_0[4] = 0xebf90152   | SRAM_0[4] = 0x1419b5     | SRAM_0[4] = 0x768d44     | SRAM_0[4] = 0x4adff93    | SRAM_0[4] = 0x1330140    |
| SRAM_0[5] = 0xd8f83fec  | SRAM_0[5] = 0xe3dfe60a   | SRAM_0[5] = 0x25fdb5     | SRAM_0[5] = 0x96eff9d    | SRAM_0[5] = 0x8a6ff70    | SRAM_0[5] = 0x4e20010    |
| SRAM_0[6] = 0xdbcb402a  | SRAM_0[6] = 0x1035fe8    | SRAM_0[6] = 0x1f1c0042b  | SRAM_0[6] = 0xf52e0a27   | SRAM_0[6] = 0x3b7054d    | SRAM_0[6] = 0x2e70157    |
| SRAM_0[7] = 0x296e62a   | SRAM_0[7] = 0xbfbac002a  | SRAM_0[7] = 0x9f620459   | SRAM_0[7] = 0x2c107a1    | SRAM_0[7] = 0x0fa9305e9  | SRAM_0[7] = 0x0fad4001e  |
| SRAM_0[8] = 0xf2b20ced  | SRAM_0[8] = 0xc0c0f69    | SRAM_0[8] = 0xfc001d7    | SRAM_0[8] = 0x825f68     | SRAM_0[8] = 0x0fd802b4   | SRAM_0[8] = 0x2f04411    |
| SRAM_0[9] = 0x5a022a3   | SRAM_0[9] = 0x5e2376     | SRAM_0[9] = 0x332209d    | SRAM_0[9] = 0xf5349aa    | SRAM_0[9] = 0xdaf000d8   | SRAM_0[9] = 0x0fc0f0357  |
| SRAM_0[10] = 0x46711e6  | SRAM_0[10] = 0x7af6d83   | SRAM_0[10] = 0x7f7007c8  | SRAM_0[10] = 0x8f78ff87  | SRAM_0[10] = 0x4fffa1    | SRAM_0[10] = 0x3ecfaee   |
| SRAM_0[11] = 0xbdb71b1c | SRAM_0[11] = 0x542f2c2   | SRAM_0[11] = 0xe45040aa  | SRAM_0[11] = 0xe8fcdf49  | SRAM_0[11] = 0x0feb9ff24 | SRAM_0[11] = 0x0fe73ff9e |
| SRAM_0[12] = 0xebc7eb91 | SRAM_0[12] = 0xffffd446  | SRAM_0[12] = 0x1f1c0d91  | SRAM_0[12] = 0x9f9d8f66  | SRAM_0[12] = 0x1fe16fe80 | SRAM_0[12] = 0xfcdd1ff34 |
| SRAM_0[13] = 0x55d14cd  | SRAM_0[13] = 0x3f36194e  | SRAM_0[13] = 0x0fd2c02d9 | SRAM_0[13] = 0x5b6fe17   | SRAM_0[13] = 0x3f0470    | SRAM_0[13] = 0x0fe5f02d5 |
| SRAM_0[14] = 0xe2fcfb   | SRAM_0[14] = 0x0edf2f74a | SRAM_0[14] = 0x2ef5b     | SRAM_0[14] = 0xfc920303  | SRAM_0[14] = 0x0fbabfd62 | SRAM_0[14] = 0x0ff5f050c |
| SRAM_0[15] = 0xe230f911 | SRAM_0[15] = 0xf626fb7e  | SRAM_0[15] = 0x0fb0dfa2  | SRAM_0[15] = 0x0f6c1fcb7 | SRAM_0[15] = 0x0fd9ff70  | SRAM_0[15] = 0x0f8a4fc40 |
| SRAM_0[16] = 0x1b101afa | SRAM_0[16] = 0xb5f1fd8b  | SRAM_0[16] = 0x0f7a7ff47 | SRAM_0[16] = 0x81fc5a    | SRAM_0[16] = 0x0fa4a010a | SRAM_0[16] = 0x0fcfe0fe  |
| SRAM_0[17] = 0xffffd27a | SRAM_0[17] = 0xb33fc45   | SRAM_0[17] = 0x0f63003b3 | SRAM_0[17] = 0x0ff3df55  | SRAM_0[17] = 0x2c30ba8   | SRAM_0[17] = 0x63a0516   |
| SRAM_0[18] = 0xbcb2e023 | SRAM_0[18] = 0x1fb1f67   | SRAM_0[18] = 0x1120f6c0  | SRAM_0[18] = 0x11b0fe25  | SRAM_0[18] = 0x0f427065c | SRAM_0[18] = 0x0f58104f7 |
| SRAM_0[19] = 0xb8b0697  | SRAM_0[19] = 0x49df595   | SRAM_0[19] = 0x6849f08   | SRAM_0[19] = 0x6bcfa82   | SRAM_0[19] = 0x0fd608fd  | SRAM_0[19] = 0x407b9     |
| SRAM_0[20] = 0x3b03a6   | SRAM_0[20] = 0x5f5e0105  | SRAM_0[20] = 0x3a9fe43   | SRAM_0[20] = 0x5b9d07    | SRAM_0[20] = 0x2b2bd1b9  | SRAM_0[20] = 0x0ff960103 |
| SRAM_0[21] = 0xea010566 | SRAM_0[21] = 0xe18fc4    | SRAM_0[21] = 0x502f931   | SRAM_0[21] = 0x6aa050e   | SRAM_0[21] = 0xc3012c    | SRAM_0[21] = 0x2a6ff3a   |
| SRAM_0[22] = 0x418f64   | SRAM_0[22] = 0xe602a1    | SRAM_0[22] = 0x6f816a6   | SRAM_0[22] = 0x691109    | SRAM_0[22] = 0x0f17704d9 | SRAM_0[22] = 0x0f76105d6 |
| SRAM_0[23] = 0xb7f190   | SRAM_0[23] = 0xfc808072  | SRAM_0[23] = 0x1fe18fc8  | SRAM_0[23] = 0x9f93b37b  | SRAM_0[23] = 0x82d01b7   | SRAM_0[23] = 0x205015e   |
| SRAM_0[24] = 0x1ac7fd9c | SRAM_0[24] = 0xd930138   | SRAM_0[24] = 0x95df96e   | SRAM_0[24] = 0x0f72502e  | SRAM_0[24] = 0x0fa9cf9de | SRAM_0[24] = 0x0fcff7f6a |
| SRAM_0[25] = 0x1029141e | SRAM_0[25] = 0x6fc5ff6   | SRAM_0[25] = 0x994fe86   | SRAM_0[25] = 0xf6f3045d  | SRAM_0[25] = 0x0f79348d2 | SRAM_0[25] = 0x0fa2efae8 |
| SRAM_0[26] = 0x6104d5   | SRAM_0[26] = 0x3d3fc64   | SRAM_0[26] = 0x14430579  | SRAM_0[26] = 0x0ff7e089b | SRAM_0[26] = 0x0f829fd5  | SRAM_0[26] = 0x0fc650537 |
| SRAM_0[27] = 0x14a7dd63 | SRAM_0[27] = 0x19631462  | SRAM_0[27] = 0x66004e4   | SRAM_0[27] = 0xffc8fe86  | SRAM_0[27] = 0x0ffd3fe4  | SRAM_0[27] = 0x0fae7fe0a |



# How to Debug (2/7)

## ■ Debugging Example (Timing analysis)

- See the output result (console)

Output 0~63(set0)correct, 64~ wrong..

Output count has reached the total number of samples.  
Validating bit-accurate output.  
- Mismatch @ output[0x0040]: expected 0xf551fc78, got 0xfd23fa07  
- Mismatch @ output[0x0041]: expected 0x0184f8db, got 0xfe450daf  
- Mismatch @ output[0x0042]: expected 0xfcce5fe76, got 0x00f7fbef  
- Mismatch @ output[0x0043]: expected 0xfe16f4f1, got 0x0169119d  
- Mismatch @ output[0x0044]: expected 0x03a4fd74, got 0xfe7a00ae  
- Mismatch @ output[0x0045]: expected 0x0561fd93, got 0xf522004b

See the set1\_stage 0



Set1\_stage 0

# How to Debug (4/7)

## ■ Debugging Example (Timing analysis)-cont'd

- See the d0, d1 & twiddle factor timing



# How to Debug (5/7)

## ■ Debugging Example (Timing analysis)-cont'd

- See the d0, d1 & twiddle factor timing



# How to Debug (6/7)

## ■ Debugging Example (Timing analysis)-cont'd

- See the input timing



<risv-2 waveform>

Right input is to late  
=> Compute core (risv-1) need stall

# How to Debug (7/7)

## ■ Debugging Example (Timing analysis)-cont'd

- Add 6 dummy(nop) to data\_m2.s and check

```
#butterfly 31
lw s11, 124(s1) #sram_0[31]->s11
sw s11, 248(s7) #s11->sram i/o
addi x0, x0, 0x0000
# dummy
addi x0, x0, 0x0000
```

```
# ===== set0 End =====
```

<data\_m2.s>



<risv-1 waveform>

Right timing! => correct result!!

```
Output count has reached the total number of samples.
Validating bit-accurate output...
- Total mismatches: 0 / 192
Checking SRAM occupancy...
- SRAM 2: occupancy = 32
- SRAM 3: occupancy = 32
Test passed. Ending simulation.
$finish called from file "./tb/tb_mem_ID.v", line 132.
$finish at simulation time 6634000
Simulation complete, time is 6634000 ps.
```

<console>

## Pipelining With Memory Flag

- Code for Flagging
- Execution time & Memory Capacity

```
# Clear sync flags initially
sw    x0,    0($2)      # Clear ready flag
sw    x0,    4($2)      # Clear done flag

main_loop:
bge   s5,    s4,    end_infinite

# Wait for Core 2 to be ready (check if previous output is done)
beq   s5,    x0,    skip_wait  # Skip wait for first set
wait_core2:
lw    t0,    4($2)      # Check done flag from Core 2
beq   t0,    x0,    wait_core2
sw    x0,    4($2)      # Clear done flag
```



```
Output count has reached the total number of samples.
Validating bit-accurate output...
- Total mismatches: 0 / 192
Checking SRAM occupancy...
- SRAM 2: occupancy = 64
- SRAM 3: occupancy = 3
Test passed. Ending simulation.
$finish called from file "../../tb/tb_mem_IO.v", line 132.
$finish at simulation time 60022000
Simulation complete, time is 60022000 ps.
```

### Performance

- Simulation Time : 60,022,000[ps]

### Memory Capacity

- Used Memory: SRAM 0: 64(128B)  
- Used Memory: SRAM 1: 3(128B)

SRAM 1 for Flags

## ■ Parallel Operations /w TP\_1 code

- Reuse TP\_1 code & dummy for Sync

```
# reg에 Twiddle Factor을 담아두고, 재사용하려한다.  
lw      a7, 0($3)          # a7 = twiddle[0]  
lw      a6, 64($3)         # a6 = twiddle[16]  
  
# i = 0  
# n = 0  
# stage[0][0]  
  
# stage[0][0]의 연산 결과를 stage[0][1]로 전달하는데  
# SRAM 0의 sw/lw 과정을 제하기위해 가용한 reg를 최대한 사용하였다.  
  
# Input  
# 아래는 DIT 형식에 맞추어 SRAM I/O으로부터의 Input을 받는 과정이다.  
# e.g. x[0], x[32], x[16]...순서...  
lw      t0, 0($0)  
lw      t1, 128($0)  
lw      t2, 64($0)  
lw      t3, 192($0)  
lw      t4, 32($0)  
lw      t5, 160($0)  
lw      t6, 96($0)  
lw      a0, 224($0)
```

- Execution time & Memory Capacity



```
Output count has reached the total number of samples.  
Validating bit-accurate output...  
- Total mismatches: 0 / 192  
Checking SRAM occupancy...  
- SRAM 2: occupancy = 64  
- SRAM 3: occupancy = 64  
- SRAM 5: occupancy = 64  
Test passed. Ending simulation.  
$finish called from file "../../../../tb/tb\_mem\_I0.v", line 132.  
$finish at simulation time 6044000  
Simulation complete, time is 6044000 ps.
```

## Performance X 10!!

### ■ Performance

- Simulation Time : 6,044,000[ps]

### ■ Memory Capacity

- Used Memory: SRAM 0: 64(128B)
- Used Memory: SRAM 1: 64(128B)
- Used Memory: SRAM 2: 64(128B)

## ■ Pipelining Cores

- code

```
stage0:  
    # Initial synchronization - wait for m1 to start loading  
  
    lw      t4, 0($3)      # W[0](SRAM_W->t4)  
  
    addi   x0, x0, 1       # dummy  
    addi   x0, x0, 1       # dummy  
  
    # Load twiddle factors for all stages (32 values)  
    # Butterfly 0-31: Load W[0] for stage 0 (all W[0])  
  
    # Butterfly 0  
    sw t4, 0($6)          # W[0](t4->dpah2)  
    lw t5, 4($4)          # o_d0(dpah0->t5)  
    lw t6, 4($5)          # o_d0(dpah1->t6)  
    sw t5, 0($1)          # o_d0(t5->SRAM_0)  
    sw t6, 0($2)          # o_d0(t6->SRAM_1)
```

Memory X 3!!

- Execution time & Memory Capacity



```
Output count has reached the total number of samples.  
Validating bit-accurate output...  
    - Total mismatches: 0 / 192  
Checking SRAM occupancy...  
    - SRAM 2: occupancy = 32  
    - SRAM 3: occupancy = 32  
Test passed. Ending simulation.  
$finish called from file "../../tb/tb\_mem\_IO.v", line 132.  
$finish at simulation time 6804000  
Simulation complete, time is 6804000 ps.
```

## ■ Performance

- Simulation Time : 6,804,000[ps]

## ■ Memory Capacity

- Used Memory: SRAM 0: 32(128B)
- Used Memory: SRAM 1: 32(128B)

# Ablation Study(4/4)

## ■ Pipeling Cores (Optimized)-final version

- code

```
stage0:  
    # Initial synchronization - wait for m1 to start loading  
  
    lw      t4, 0($3)      # W[0](SRAM_W->t4)  
  
    addi   x0, x0, 1       # dummy  
    addi   x0, x0, 1       # dummy  
  
    # Load twiddle factors for all stages (32 values)  
    # Butterfly 0-31: Load W[0] for stage 0 (all W[0])  
  
    # Butterfly 0  
    sw t4, 0($6)          # W[0](t4->dpah2)  
    lw t5, 4($4)          # o_d0(dpah0->t5)  
    lw t6, 4($5)          # o_d0(dpah1->t6)  
    sw t5, 0($1)          # o_d0(t5->SRAM_0)  
    sw t6, 0($2)          # o_d0(t6->SRAM_1)
```

Performance X 1.1 !!

- Execution time & Memory Capacity



```
Output count has reached the total number of samples.  
Validating bit-accurate output...  
- Total mismatches: 0 / 192  
Checking SRAM occupancy...  
- SRAM 2: occupancy = 32  
- SRAM 3: occupancy = 32  
Test passed. Ending simulation.  
$finish called from file "../../../../tb/tb\_mem\_IO.v", line 132.  
$finish at simulation time 6634000  
Simulation complete, time is 6634000 ps.
```

### ■ Performance

- Simulation Time : 6,634,000[ps]

### ■ Memory Capacity

- Used Memory: SRAM 0: 32(128B)  
- Used Memory: SRAM 1: 32(128B)

# Roofline Model(1/4)

## ■ System Analysis

- Roofline analysis



Y axis (perf)  
X axis (IO)

Performance limit  
BW\_m (Slope)

: BF/cycle  
: BF/BYTE  
  
: 1BF/cycle  
: 12Byte/cycle (3core)

# Roofline Model(2/4)

## ■ Compute Performance (BF/Cycle)



- Pipelining with memory flag  
 $(60022000 \text{ ps}) * (1\text{cycle}/2000\text{ps}) = 30,011\text{cycle}$   
576BF/30,011cycle = **0.019 [BF/cycle]**
- Parallel Operations /w TP\_1 code  
 $(6044000 \text{ ps}) * (1\text{cycle}/2000\text{ps}) = 3,022\text{cycle}$   
576BF/3,022cycle = **0.191 [BF/cycle]**
- Pipelining Cores  
 $(6804000 \text{ ps}) * (1\text{cycle}/2000\text{ps}) = 3,402\text{cycle}$   
576BF/3,402cycle = **0.169 [BF/cycle]**
- Final Ver.  
 $(6634000 \text{ ps}) * (1\text{cycle}/2000\text{ps}) = 3,317\text{cycle}$   
576BF/3,317cycle = **0.174 [BF/cycle]**

## ■ Compute IO (BF/Byte)-cont'd

$1\text{BF} \Rightarrow 3\text{LW} + 3\text{SW}$  (input),  $2\text{LW} + 2\text{SW}$ (output)

$$\text{IO} = 1\text{BF}/(10*4)\text{BYTE} = 0.025(\text{BF/BYTE})$$

- Pipelining with memory flag

-Total

Input load :  $192*4$  Byte  
Flag check :  $(6*3*4)*576$ Byte (plag poling)  
Compute :  $10*4*576$  Byte  
Output store :  $192*4$  Byte

$$\begin{aligned}\text{IO} &= 576\text{BF} / (192*4 + 6*3*4*576 + \\ &\quad 10*4*576 + 192*4) \\ &= 576\text{BF} / 66048\text{Byte} \\ &= 0.0087(\text{BF/BYTE})\end{aligned}$$

- Parallel Operations /w TP\_1 code

-Total

Input load :  $192*4$  Byte  
Compute( $d_1, d_2$ ) :  $8*4*576$  Byte  
Compute( $W$ ) :  $(1+1+2+4+8+24+576)*4$ Byte  
Output store :  $192*4$  Byte

$$\begin{aligned}\text{IO} &= 576\text{BF} / 22432\text{Byte} \\ &= 0.0257(\text{BF/BYTE})\end{aligned}$$

# Roofline Model(4/4)

## ■ Compute IO (BF/Byte)-cont'd

- Pipelining Cores & Final ver.

-Total

Input load :  $192 \times 4$  Byte  
Compute :  $10 \times 4 \times 576$  Byte  
Output store :  $192 \times 4$  Byte

$$\text{IO} = 576\text{BF} / 24576 \text{ Byte}$$
$$= 0.0234(\text{BF/BYTE})$$

## ■ Draw roofline model /w Excel

| A                                | B       | C                     | D           | E            | F          | G          |
|----------------------------------|---------|-----------------------|-------------|--------------|------------|------------|
|                                  | BF/byte | Performance(BF/cycle) | Memory line | Compute line | X_roofline | Y_roofline |
| 2 Pipelining with memory flag    |         | 0.0087                | 0.019       | 0.1044       | 1          | 0.001      |
| 3 Parallel Operations /w TP_1 cc |         | 0.0257                | 0.191       | 0.3084       | 1          | 0.005      |
| 4 Pipelining Cores               |         | 0.0234                | 0.169       | 0.2808       | 1          | 0.01       |
| 5 Final ver.                     |         | 0.0234                | 0.174       | 0.2808       | 1          | 0.05       |
|                                  |         |                       |             |              | 0.0833     | 1          |
|                                  |         |                       |             |              | 0.1        | 1          |
|                                  |         |                       |             |              | 0.5        | 1          |
|                                  |         |                       |             |              | 1          | 1          |



Our system is Memory Bound !!

# How to improve



Our system is Memory Bound !!



- Reuse twiddle factor
- Using register
- Etc..

# How to improve



: core 0 is wasted..

If core 0 do LW & SW Next set input data?



core 1, 2 are good!!



## ■ Compare with TP\_1

- 3 cores vs Single core

|                    | Performance[ps] | Memory Capacity[B/B] |
|--------------------|-----------------|----------------------|
| Single Core (TP_1) | 14,190,000[ps]  | 256B/256B            |
| 3 Cores (TP_2)     | 6,634,000[ps]   | 256B/768B            |