

# Experiment 1

## Vivado HLS synthesis report for 11-tap FIR Filter

Kaushik Balaji M S *M.Tech in System-on-chip Design*

*Indian Institute of Technology, Palakkad*

152502010@smail.iitpkd.ac.in

### Abstract

The objective of this experiment is to implement and compare multiple C-based realizations of a Finite Impulse Response (FIR) filter using High-Level Synthesis (HLS). Although all implementations realize the same FIR filtering operation, they differ in loop structure, data movement, and optimization directives. The impact of these variations on performance metrics such as latency, initiation interval, and hardware resource utilization is analyzed. This experiment demonstrates how algorithmically equivalent descriptions can result in different hardware architectures after synthesis.

### I. INTRODUCTION

This experiment compares multiple FIR code variants that implement the same filter coefficients and tap structure but differ in loop organization, degree of parallelism, and memory access patterns. The goal is to analyze how these implementation choices affect synthesis results while preserving identical functional output.

### II. IMPLEMENTATION

Five different implementations of FIR Filter was made to compute using the same inputs and the same filter coefficients. These models differ in how the shifting, multiplication and accumulation operations are done.

#### A. FIR Model 1

```

1
2 #define N 11
3 #include <ap_int.h>
4 void FIR1(int *y, int x) {
5     int C[N]={53,0,-91,313,500,313,0,-91,0,53};
6     static int shift_reg[N];
7     int acc,i;
8     acc=0;
9     SAL: for(i=N-1;i>=0;i--) {
10         if(i==0) {
11             acc+=x*C[0];
12             shift_reg[0]=x;
13         }
14         else {
15             shift_reg[i]=shift_reg[i-1];
16             acc += shift_reg[i]*C[i];
17         }
18     }
19     *y=acc;
20 }
```

Listing 1. All operations done inside single loop

## B. FIR Model 2

```

1
2 #define N 11
3 #include <ap_int.h>
4 void FIR2(int *y, int x) {
5 int C[N]={53,0,-91,313,500,313,0,-91,0,53};
6 static int shift_reg[N];
7 int acc,i;
8 acc=0;
9 SAL: for(i=N-1;i>0;i--) {
10     shift_reg[i]=shift_reg[i-1];
11     acc += shift_reg[i]*C[i];
12 }
13 acc += x*C[0];
14 *y=acc;
15 }
```

Listing 2. Input logic is handled separately

## C. FIR Model 3

```

1
2 #define N 11
3 #include <ap_int.h>
4 void FIR3(int *y, int x) {
5 int C[N]={53,0,-91,313,500,313,0,-91,0,53};
6 static int shift_reg[N];
7 int acc,i;
8 acc=0;
9 //loop fission
10 TDL: for(i=N-1;i>0;i--) {
11     shift_reg[i]=shift_reg[i-1];
12 }
13 shift_reg[0]=x;
14 MAC: for(i=N-1;i>=0;i--) {
15     acc += shift_reg[i]*C[i];
16 }
17 *y=acc;
18 }
```

Listing 3. Trap delay separated from other arithmetic operations

## D. FIR Model 4

```

1 #define N 11
2 #include <ap_int.h>
3 void FIR4(int *y, int x) {
4 int C[N]={53,0,-91,313,500,313,0,-91,0,53};
5 static int shift_reg[N];
6 int acc,i; acc=0;
7 //loop fission
8 TDL: for(i=N-1;i>1;i=i-2) {
9     shift_reg[i]=shift_reg[i-1];
10    shift_reg[i-1]=shift_reg[i-2];
11 }
12 if(i==0) shift_reg[1]=shift_reg[0];
13 shift_reg[0]=x;
14 MAC: for(i=N-1;i>=0;i--) {
15     acc += shift_reg[i]*C[i];
16 }
17 *y=acc;
18 }
```

Listing 4. Loop iterations reduced by half

## E. FIR Model 5

```

1
2 #define N 11
3 #include <ap_int.h>
4 void FIR5(int *y, int x) {
5     int C[N]={53,0,-91,313,500,313,0,-91,0,53};
6     static int shift_reg[N];
7     int acc,i;
8     acc=0;
9     #pragma HLS ARRAY_PARTITION variable=shift_reg complete
10    //loop fission
11    TDL: for(i=N-1;i>1;i=i-2){//TDL: Tapped Delay Line
12        shift_reg[i]=shift_reg[i-1];
13        shift_reg[i-1]=shift_reg[i-2];
14    }
15    if(i==0) shift_reg[1]=shift_reg[0];
16    shift_reg[0]=x;
17    MAC: for(i=N-1;i>=3;i=i-4){//MAC: Multiply and Accumulate
18        acc += shift_reg[i]*C[i]+shift_reg[i-1]*C[i-1]+shift_reg[i-2]*C[i-2]+shift_reg[i-3]*C
19            [i-4];
20    }
21    for(;i>=0;i--) acc +=shift_reg[i]*C[i];
22    *y=acc;
23 }
```

Listing 5. Pragma introduced, so maximum parallelism

## III. REPORT ANALYSIS

The synthesis report of every implementation was obtained and their performance and resource estimations are studied.

### A. FIR Model 1



Fig. 1. Base Implementation report

This implementation uses a single loop that performs shift register updates and multiply–accumulate operations together, along with a conditional check to handle the current input sample. **Result obtained:** Latency: 20 cycles , Interval: 21 , DSP: 4 , FF: 869 , LUTs: 368

- The conditional logic inside the loop introduces additional multiplexers and control paths in the datapath.

- The loop dependencies restrict pipelining by the HLS tool.

As a result, this model serves as the reference design with moderate latency and resource usage.

## B. FIR Model 2



Fig. 2. Input Logic Optimized synthesis

This implementation removes the conditional check from the main loop by handling the boundary condition separately, while still performing shift register updates and multiply–accumulate operations within a single loop.

**Result obtained:** Latency: 17 cycles , Interval: 18 , DSP: 4 , FF: 672 , LUTs: 315

- Eliminating the conditional logic simplifies the datapath and reduces the number of multiplexers required.
- The reduced control complexity allows tighter scheduling and improved pipelining by the HLS tool. As a result, this model achieves lower latency and reduced resource utilization compared to the baseline implementation.

## C. FIR Model 3



Fig. 3. Synthesis after Loop Fission

This implementation separates the shift register update and multiply–accumulate operations into two independent loops executed sequentially.

**Result obtained:** Latency: 33 cycles , Interval: 34 , DSP: 2 , FF: 487 , LUTs: 371

- Sequential execution of the two loops increases the overall latency.
- The separation of operations enables reuse of arithmetic resources across cycles, reducing DSP utilization.

As a result, this model provides improved area efficiency at the cost of increased latency.

## D. FIR Model 4



Fig. 4. Synthesis after Manual loop unrolling

This implementation manually unrolls the shift register update loop by a factor of two without applying array partitioning.

**Result obtained:** Latency: 40 cycles , Interval: 41 , DSP: 2 , FF: 1240 , LUTs: 970

- The unrolled shift logic increases control complexity and register usage.
- Multiple read and write operations to the shift register in a single iteration increase scheduling overhead.

As a result, this model exhibits increased latency and significantly higher resource usage, making it less efficient than previous implementations.

## E. FIR Model 5



Fig. 5. Synthesis after Pipelining

This implementation applies complete array partitioning to the shift register and introduces parallelism in the multiply–accumulate operations.

**Result obtained:** Latency: 27 cycles , Interval: 28 , DSP: 8 , FF: 1862 , LUTs: 991

- Array partitioning using *PRAGMA* enables simultaneous access to all filter taps, increasing parallel execution.
- Increased parallelism results in higher DSP and flip-flop utilization.

As a result, this model achieves improved performance compared to sequential designs, with a substantial increase in hardware resource usage.

| FIR Model | Latency (cycles) | Latency (ns) | Interval | DSP | FF   | LUTs |
|-----------|------------------|--------------|----------|-----|------|------|
| FIR1      | 20               | 200          | 21       | 4   | 869  | 368  |
| FIR2      | 17               | 170          | 18       | 4   | 672  | 315  |
| FIR3      | 33               | 330          | 34       | 2   | 487  | 371  |
| FIR4      | 40               | 400          | 41       | 2   | 1240 | 970  |
| FIR5      | 27               | 270          | 28       | 8   | 1862 | 991  |

TABLE I  
COMPARISON OF FIR CODE VARIANTS: LATENCY AND RESOURCE UTILIZATION

## CONCLUSION

This experiment shows that, although all FIR implementations compute the same filter function, differences in coding style significantly impact latency, throughput, and hardware resource utilization after HLS synthesis. Hence, careful selection of implementation style is essential to achieve the desired performance-area trade-off.