

# Experiment 3

## FIR Filter using Zynq PS, AXI DMA and HLS Accelerator

Kaushik Balaji M S *M.Tech in System-on-chip Design*  
*Indian Institute of Technology, Palakkad*  
152502010@smail.iitpkd.ac.in

### I. INTRODUCTION

In this experiment, an FIR filter accelerator is implemented using Vitis HLS, and synthesized into an RTL-based IP core. The generated IP is designed using AXI4-Stream interfaces for high-throughput data transfer. To transfer data efficiently between DDR memory and the accelerator, an AXI DMA engine is used. The ARM processor sends input samples stored in DDR memory to the FIR accelerator through the DMA's MM2S (Memory Mapped to Stream) channel, and receives the filtered output through the S2MM (Stream to Memory Mapped) channel.

The objective of this experiment is to understand PS–PL communication using the AXI protocol, integrate an HLS-generated streaming IP into a Zynq system, and validate real-time data transfer using AXI DMA in a standalone Vitis application.

### II. STEPS USED AND OBSERVATIONS MADE

The experiment was performed using the ZedBoard platform based on the Zynq-7000 SoC. The overall workflow consists of three major phases: (i) FIR accelerator creation using Vitis HLS, (ii) hardware integration using Vivado block design, and (iii) software development using Vitis IDE.

#### A. FIR Filter Implementation using Vitis HLS

The FIR filter was implemented in C/C++ using Vitis HLS. The design was written using streaming interfaces such that the input and output ports follow the AXI4-Stream protocol. This allows the accelerator to continuously process input samples without requiring register-based control.

The HLS function accepts an AXI stream input packet containing the input sample and generates an output packet containing the filtered result. Internally, a shift register is maintained to store past input samples. Multiply-and-accumulate operations are performed using predefined FIR coefficients. Optimizations such as loop unrolling and array partitioning are applied to increase parallelism and reduce latency.

```

1 #include <hls_stream.h>
2 #include <ap_axi_sdata.h>
3 #define N 10
4 //for keeping the necessary control signals
5 typedef ap_axiu<32,0,0,0> axis_t;
6
7 void FIR_ALL_NEW(hls::stream<axis_t> &s_axis,
8                  hls::stream<axis_t> &m_axis){
9     #pragma HLS interface axis port=s_axis
10    #pragma HLS interface axis port=m_axis
11    #pragma HLS interface ap_ctrl_none port=return
12
13    int C[N] = {53,0,-91,313,500,313,0,-91,0,53};
14    static int shift_reg[N];
15    #pragma HLS array_partition variable=shift_reg complete

```

```

16 // ---- Axis handshake was used here ----
17 axis_t in_pkt = s_axis.read();      // blocks until VALID&READY
18 int x = in_pkt.data;
19 int acc = 0;
20
21 // ---- FIR definition
22 FIR_LOOP:
23 for(int i=N-1;i>=0;i--) {
24     #pragma HLS unroll
25     if(i==0) {
26         shift_reg[0] = x;
27         acc += x * C[0];
28     } else {
29         shift_reg[i] = shift_reg[i-1];
30         acc += shift_reg[i] * C[i];
31     }
32 }
33
34 // ---- OUTPUT PACKET ----
35 axis_t out_pkt;
36 out_pkt.data = acc;
37
38 // Verilog based writing
39 out_pkt.last = in_pkt.last;        // TLAST pass-through
40 out_pkt.keep = -1;                // full word valid (1111)
41 m_axis.write(out_pkt);           // VALID asserted when written
42 }
43 }
```

Listing 1. FIR Filter C code



Fig. 1. Vitis HLS Resource Utilization

Once functional verification was completed in HLS simulation, the design was synthesized to generate RTL (Verilog/VHDL). The RTL was then packaged as a Vivado IP core.

## B. Vivado Block Design Integration

After generating the FIR IP core, a Vivado block design was created to integrate the Processing System, AXI DMA, and FIR accelerator. The Processing System was configured with DDR memory enabled, and the AXI interface was enabled to allow communication between PS and PL.

An AXI DMA block was added to the design to facilitate high-speed data transfer between DDR memory and the AXI-stream based FIR accelerator. The MM2S channel of DMA provides the input stream to the FIR accelerator, and the S2MM channel receives the filtered output stream. Connection automation was used to connect clocks, resets, and AXI interconnect paths.



Fig. 2. AXI DMA configs

The complete hardware design includes the following major blocks:

- 1) Zynq Processing System (PS7)
- 2) AXI DMA (Simple mode)
- 3) FIR Filter IP (HLS generated)
- 4) AXI Interconnect and Reset Controller

Figure 3 shows the complete hardware architecture consisting of the Zynq Processing System, AXI DMA, HLS-generated FIR IP, and AXI interconnect modules.

After successful synthesis and implementation, the bitstream was generated and the hardware design was exported as an .xsa file for software development in Vitis.



Fig. 3. Full Block Diagram



Fig. 4. Vivado Implementation done for FIR



Fig. 5. Export Hardware platform from Vivado

### C. Vitis Application Development

A standalone application was created in Vitis IDE using the exported hardware platform. The software runs on the ARM Cortex-A9 processor and is responsible for configuring the DMA engine, sending input samples to the FIR accelerator, and receiving filtered output samples. The sample `HelloWorld.c` code was replaced with the code as given below in Listing: 2.

```

1 #include "xparameters.h"
2 #include "xaxidma.h"
3 #include "xil_printf.h"
4 #include "xil_cache.h"
5 #include "xstatus.h"
6
7 #define DMA_DEV_ID      XPAR_AXI_DMA_0_DEVICE_ID
8
9 #define SIZE           12          // number of samples
10 #define BYTES         (SIZE * sizeof(u32))
11
12 u32 input_data[SIZE] __attribute__((aligned(64)));
13 u32 output_data[SIZE] __attribute__((aligned(64)));
14
15 int main()
16 {
17     XAxiDma dma;
18     XAxiDma_Config *cfg;
19     int status;
20
21     xil_printf("\n\r==== FIR DMA Test Start ====\n\r");
22
23     // -----

```

```

24 // DMA INIT
25 // -----
26 cfg = XAxiDma_LookupConfig(DMA_DEV_ID);
27 if (!cfg) {
28     xil_printf("DMA lookup failed\n\r");
29     return XST_FAILURE;
30 }
31
32 status = XAxiDma_CfgInitialize(&dma, cfg);
33 if (status != XST_SUCCESS) {
34     xil_printf("DMA init failed\n\r");
35     return XST_FAILURE;
36 }
37
38 // Ensure simple mode
39 if (XAxiDma_HasSg(&dma)) {
40     xil_printf("DMA is in SG mode ... expected simple mode\n\r");
41     return XST_FAILURE;
42 }
43
44 // -----
45 // DMA RESET (important after bitstream download)
46 // -----
47 XAxiDma_Reset(&dma);
48 while (!XAxiDma_ResetIsDone(&dma));
49
50 // Disable interrupts (polling mode)
51 XAxiDma_IntrDisable(&dma, XAXIDMA_IRQ_ALL_MASK,
52                     XAXIDMA_DMA_TO_DEVICE);
53 XAxiDma_IntrDisable(&dma, XAXIDMA_IRQ_ALL_MASK,
54                     XAXIDMA_DEVICE_TO_DMA);
55
56 // -----
57 // PREPARE INPUT DATA
58 // -----
59 for (int i = 0; i < SIZE; i++) {
60     input_data[i] = i + 1;
61     output_data[i] = 0;
62 }
63
64 // Cache maintenance
65 Xil_DCacheFlushRange((UINTPTR)input_data, BYTES);
66 Xil_DCacheFlushRange((UINTPTR)output_data, BYTES);
67
68 // -----
69 // START DMA TRANSFERS
70 // Order matters: start S2MM (receive) first
71 // -----
72
73 status = XAxiDma_SimpleTransfer(&dma,
74                                 (UINTPTR)output_data,
75                                 BYTES,
76                                 XAXIDMA_DEVICE_TO_DMA);
77 if (status != XST_SUCCESS) {
78     xil_printf("S2MM start failed\n\r");
79     return XST_FAILURE;
80 }
81
82 status = XAxiDma_SimpleTransfer(&dma,
83                                 (UINTPTR)input_data,
84                                 BYTES,
85                                 XAXIDMA_DMA_TO_DEVICE);
86 if (status != XST_SUCCESS) {
87     xil_printf("MM2S start failed\n\r");
88     return XST_FAILURE;

```

```

89 }
90
91 // -----
92 // WAIT FOR COMPLETION
93 // -----
94 while (XAxiDma_Busy(&dma, XAXIDMA_DMA_TO_DEVICE));
95 while (XAxiDma_Busy(&dma, XAXIDMA_DEVICE_TO_DMA));
96
97 // -----
98 // ERROR CHECK
99 // -----
100 u32 sr;
101
102 sr = XAxiDma_ReadReg(dma.RegBase, XAXIDMA_TX_OFFSET + XAXIDMA_SR_OFFSET);
103 if (sr & XAXIDMA_ERR_ALL_MASK) {
104     xil_printf("MM2S DMA error: 0x%08x\n\r", sr);
105     return XST_FAILURE;
106 }
107
108 sr = XAxiDma_ReadReg(dma.RegBase, XAXIDMA_RX_OFFSET + XAXIDMA_SR_OFFSET);
109 if (sr & XAXIDMA_ERR_ALL_MASK) {
110     xil_printf("S2MM DMA error: 0x%08x\n\r", sr);
111     return XST_FAILURE;
112 }
113
114 // -----
115 // INVALIDATE CACHE FOR OUTPUT
116 // -----
117 Xil_DCacheInvalidateRange((UINTPTR)output_data, BYTES);
118
119 // -----
120 // PRINT RESULTS
121 // -----
122 xil_printf("\n\rFiltered Output:\n\r");
123 for (int i = 0; i < SIZE; i++) {
124     xil_printf("y[%d] = %d\n\r", i, output_data[i]);
125 }
126
127 xil_printf("\n\r==== FIR DMA Test Complete ===\n\r");
128
129 return XST_SUCCESS;
130 }

```

Listing 2. Vitis code to connect with zedboard

The software initializes the AXI DMA using the Xilinx DMA driver library. Input samples are stored in a buffer located in DDR memory. Cache flush operations are performed to ensure that the updated input data is written back to memory before DMA reads it. Similarly, after DMA writes the output buffer, cache invalidation is performed to ensure that the processor reads the latest data from DDR memory.

The DMA transfer is performed using polling mode. The receive transfer (S2MM) is initiated first, followed by the transmit transfer (MM2S). Once both transfers are completed, the output buffer contains the FIR filtered results. These results are printed on the serial terminal using UART.

Figure 6 shows the UART output when the FIR filter was executed on the ZedBoard using the exported hardware platform.



Fig. 6. FIR filter running on Vitis

#### D. Observations

The output results obtained from the UART terminal confirm that the FIR accelerator correctly processes the streamed input samples and produces filtered output values.



Fig. 7. FIR output on Vitis via UART to the console

The successful execution validates the integration of an AXI4-Stream based hardware accelerator into a Zynq system and demonstrates high-throughput PS–PL communication using AXI DMA.



Fig. 8. Zedboard connections

As shown in Figure 8, one connection from the host system to the ZedBoard is through JTAG, which is used to program the FPGA fabric. Another connection is through UART, which enables the processor to transmit output data to the host system.



Fig. 9. Zedboard pins

### III. CONCLUSION

A hardware-accelerated FIR filter was successfully implemented on the Zynq-7000 platform using a Vitis HLS generated IP core integrated into the Programmable Logic. AXI4-Stream interfaces were used for input and output communication, and AXI DMA was used to transfer data efficiently between DDR memory and the FIR accelerator. The standalone Vitis application running on the ARM Cortex-A9 processor successfully configured the DMA engine, transmitted input samples, and received filtered output samples.

The results verified correct FIR filter operation, demonstrating the effectiveness of PS–PL hardware acceleration.