



---

# Bit-Serial Neural Computation Engine: Implementation and Resource Analysis

Sudipto Sarkar

Rickarya Das

Arka Das



# Contents

Indian Institute of Technology Kharagpur

- ❖ Introduction
- ❖ Why Neural Computation?
- ❖ What is a neuron?
- ❖ Proposed Architecture
- ❖ What is Input Buffer?
- ❖ Hidden Weight Memory
- ❖ Flattening and Latency
- ❖ MAC Engine Structure
- ❖ FSM-MAC Engine
- ❖ ReLU Activation
- ❖ Device Implementation
- ❖ Utilization Report
- ❖ Timing Report
- ❖ Power Report
- ❖ Conclusion
- ❖ References

# INTRODUCTION

## The Challenge with Neural Networks on FPGAs

1. Heavy Dependence on MAC/DSP Units
2. Not Suitable for Edge Devices
3. Traditional Approaches Don't Scale Down

## What We Built

- ✓ A neural network accelerator implemented entirely with LUTs, flip-flops, and shift-add logic.
- ✓ A bit-serial Processing Element that performs an 8-bit $\times$ 8-bit multiply-accumulate at one bit per cycle, dramatically reducing the hardware footprint.
- ✓ A modular, two-stage architecture that supports streamed inputs, multi-neuron parallelism, and configurable layers.
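
The shift-add idea behind the bit-serial Processing Element can be sketched in software. This is a minimal behavioural model, not the actual RTL: it assumes unsigned 8-bit operands and processes the weight LSB first, and the function names `bit_serial_mac` and `dot_product` are hypothetical.

```python
def bit_serial_mac(acc: int, x: int, w: int, bits: int = 8) -> int:
    """Accumulate x * w into acc with one shift-add step per weight bit.

    Assumption: unsigned operands in [0, 2**bits); signed handling is omitted.
    """
    assert 0 <= x < (1 << bits) and 0 <= w < (1 << bits)
    for cycle in range(bits):      # one modelled clock cycle per weight bit
        if (w >> cycle) & 1:       # inspect bit `cycle` of the weight, LSB first
            acc += x << cycle      # add the shifted input only when the bit is set
    return acc


def dot_product(xs, ws, bits: int = 8) -> int:
    """Chain bit-serial MACs over an input vector and a weight vector."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = bit_serial_mac(acc, x, w, bits)
    return acc


result = dot_product([3, 5, 7], [2, 4, 6])   # 3*2 + 5*4 + 7*6
print(result)
```

Each multiply costs `bits` cycles instead of one, which is the latency traded for replacing a full multiplier with a shifter and an adder.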

*"We prove that neural networks don't need heavy MAC units—just clever architecture."*

Can neural inference be done efficiently *without* using a single DSP or MAC unit?



## Why This Problem Matters

- Enables **ML deployment on low-cost FPGAs** where DSPs are scarce.
- Reduces **power consumption, area footprint, and hardware complexity**.
- Opens the door for **scalable, customizable neural hardware** using only basic logic (LUTs + registers).



## Why Not Just Use CPUs or GPUs for Neural Computation?

| Need             | CPU                    | Neural Engine / Accelerator             |
|------------------|------------------------|-----------------------------------------|
| Throughput       | Low                    | Very High                               |
| Power Efficiency | High power consumption | Low power, high efficiency              |
| Specialization   | General-purpose        | Purpose-built for AI (matrix / MAC ops) |
| Latency          | High                   | Low                                     |
| Cost             | High                   | Low                                     |

### Why Neural Engines Win

- Designed specifically for **massive parallel multiply-accumulate (MAC) operations.**
- **10×–100× better energy efficiency** compared to CPUs/GPUs.
- Optimized memory pipelines reduce data movement cost.
- Lower latency → suitable for real-time inference (speech, vision, robotics).

# What is a neuron?



*Mathematical definition of a single neuron!*
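
For reference, the standard textbook form of a single neuron with $N$ inputs is given here. The symbols $x_i$, $w_i$, $b$ and $f$ are the usual conventions rather than names taken from the slide, and whether this engine includes a bias term $b$ is an assumption.

$$
y = f\!\left(\sum_{i=1}^{N} w_i x_i + b\right), \qquad f(z) = \max(0, z) \ \text{for ReLU}
$$

The sum of products is the multiply-accumulate workload that the engine computes serially, and $f$ is the activation applied afterwards.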

# Proposed architecture



# What is Input Buffer?



*Converts streaming serial data into a wide parallel vector for high-throughput processing!*
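
A serial-to-parallel buffer of this kind can be modelled in a few lines. This is a sketch under stated assumptions: 8-bit words arrive one per cycle, the first word lands in the least-significant bits, and the class name `InputBuffer` and its parameters are hypothetical.

```python
from typing import Optional


class InputBuffer:
    """Collects streamed words and emits one wide packed vector when full."""

    def __init__(self, num_words: int, word_bits: int = 8) -> None:
        self.num_words = num_words
        self.word_bits = word_bits
        self.mask = (1 << word_bits) - 1
        self.count = 0      # words received so far
        self.vector = 0     # packed wide vector under construction

    def push(self, word: int) -> Optional[int]:
        """Store one streamed word; return the full vector on the last word."""
        self.vector |= (word & self.mask) << (self.count * self.word_bits)
        self.count += 1
        if self.count == self.num_words:
            full, self.vector, self.count = self.vector, 0, 0
            return full
        return None         # vector not complete yet


buf = InputBuffer(num_words=4)
results = [buf.push(w) for w in [0x01, 0x02, 0x03, 0x04]]
print(hex(results[-1]))
```

The buffer returns `None` until the last word arrives, so downstream logic sees the whole input vector at once.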



# Hidden Weight Memory

Structure and operation of the hidden-layer weight memory module.



# Flattening and Latency

Mathematical and timing aspects of reading from the flattened weight memory.
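
One common way to flatten a two-dimensional weight matrix into a linear memory is row-major order, where the weight for neuron `j` and input `i` sits at address `j * num_inputs + i`. The sketch below assumes that layout; the design's actual address mapping is not stated here, and the helper names are hypothetical.

```python
def flatten(weights):
    """Flatten a [neuron][input] weight matrix into one row-major list."""
    return [w for row in weights for w in row]


def flat_address(neuron: int, input_idx: int, num_inputs: int) -> int:
    """Row-major address of the weight connecting input_idx to neuron."""
    return neuron * num_inputs + input_idx


# Hypothetical 3-neuron, 4-input layer used only for illustration.
weights = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
]
memory = flatten(weights)
addr = flat_address(neuron=2, input_idx=1, num_inputs=4)
print(addr, memory[addr])
```

With this layout, all weights for one neuron occupy consecutive addresses, so a simple counter can step through them during accumulation.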







# ReLU Activation

This module implements the Rectified Linear Unit (ReLU) activation function with a single-clock-cycle latency.
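
A registered ReLU stage with one cycle of latency can be modelled as follows. This is a behavioural sketch that assumes a signed integer input and an output register reset to zero; the class name `ReluStage` is hypothetical.

```python
class ReluStage:
    """Registered ReLU: the result for an input appears one clock tick later."""

    def __init__(self) -> None:
        self.out_reg = 0    # output register, assumed to reset to 0

    def tick(self, x: int) -> int:
        """Advance one clock edge; return the output visible before this edge."""
        visible = self.out_reg
        self.out_reg = x if x > 0 else 0   # ReLU: clamp negatives to zero
        return visible


stage = ReluStage()
outputs = [stage.tick(x) for x in [-5, 3, 0, 7]]
print(outputs, stage.out_reg)
```

Each output lags its input by exactly one tick, which is what single-cycle latency means for a registered stage.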

# Device Implementation

Initially implemented with 128 neurons in the hidden layer and 4 parallel processing lanes.
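
With 128 hidden neurons and a parallelism of 4, the neurons can be processed in 32 groups. The sketch below shows one possible grouping and a rough cycle estimate. It assumes each lane spends 8 cycles per weight and ignores FSM and memory overhead; the input count of 16 is a hypothetical value chosen only for illustration.

```python
import math


def lane_schedule(num_neurons: int, lanes: int):
    """Group neuron indices into passes of at most `lanes` neurons each."""
    return [list(range(start, min(start + lanes, num_neurons)))
            for start in range(0, num_neurons, lanes)]


def estimated_mac_cycles(num_neurons: int, lanes: int,
                         num_inputs: int, bits: int = 8) -> int:
    """Rough cycle count: passes x inputs x bits, with no control overhead."""
    passes = math.ceil(num_neurons / lanes)
    return passes * num_inputs * bits


passes = lane_schedule(128, 4)
cycles = estimated_mac_cycles(128, 4, num_inputs=16)  # 16 inputs is hypothetical
print(len(passes), passes[0], cycles)
```

Doubling the lane count would halve the number of passes at the cost of more Processing Elements, which is the area-throughput trade-off of the design.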



Floorplan details: scaled version of our architecture on Artix UltraScale+



# Utilization Report


| Resource | Available | Used | Utilization |
|----------|-----------|------|-------------|
| LUT      | 41000     | 34   | 0.08%       |
| FF       | 82000     | 39   | 0.05%       |
| IO       | 300       | 68   | 22.67%      |
| DSP      | 240       | 0    | 0%          |

Utilization graph for the 4-3-2 neural computation engine (a simplified version)





# Timing Report


## Design Timing Summary

### Setup

|                              |          |
|------------------------------|----------|
| Worst Negative Slack (WNS):  | 7.248 ns |
| Total Negative Slack (TNS):  | 0.000 ns |
| Number of Failing Endpoints: | 0        |
| Total Number of Endpoints:   | 69       |

### Hold

|                              |          |
|------------------------------|----------|
| Worst Hold Slack (WHS):      | 0.173 ns |
| Total Hold Slack (THS):      | 0.000 ns |
| Number of Failing Endpoints: | 0        |
| Total Number of Endpoints:   | 69       |

### Pulse Width

|                                          |          |
|------------------------------------------|----------|
| Worst Pulse Width Slack (WPWS):          | 4.500 ns |
| Total Pulse Width Negative Slack (TPWS): | 0.000 ns |
| Number of Failing Endpoints:             | 0        |
| Total Number of Endpoints:               | 45       |

All user specified timing constraints are met.

The design meets all timing requirements at 100 MHz.

There are no setup, hold, or pulse-width violations.
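
The positive setup slack gives a rough indication of headroom. A common rule of thumb estimates the critical-path delay as the clock period minus the worst setup slack. The sketch below applies it to the reported numbers; the result is only an estimate, since the achievable frequency would have to be confirmed by re-running implementation with a tighter constraint.

```python
def estimated_fmax_mhz(target_mhz: float, wns_ns: float) -> float:
    """Estimate the maximum clock frequency from target frequency and setup slack."""
    period_ns = 1000.0 / target_mhz          # 100 MHz gives a 10 ns period
    critical_path_ns = period_ns - wns_ns    # delay of the slowest path
    return 1000.0 / critical_path_ns


fmax = estimated_fmax_mhz(target_mhz=100.0, wns_ns=7.248)
print(f"{fmax:.1f} MHz")
```

With a 10 ns period and 7.248 ns of slack, the slowest path takes about 2.752 ns, which corresponds to roughly 363 MHz.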

# Power Report

Power analysis from implemented netlist. Activity derived from constraints files, simulation files or vectorless analysis.

**Total On-Chip Power:** 0.34 W

**Design Power Budget:** Not Specified

**Process:** typical

**Power Budget Margin:** N/A

**Junction Temperature:** 25.6°C

**Thermal Margin:** 59.4°C (31.3 W)

**Ambient Temperature:** 25.0°C

**Effective θ<sub>JA</sub>:** 1.9°C/W

**Power supplied to off-chip devices:** 0 W

**Confidence level:** Low




The total on-chip power consumption is 0.34 W, with dynamic power being the major contributor. Most of the dynamic power comes from logic and signal activity, with static power accounting for the rest.
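
The thermal figures in the report are mutually consistent, which can be checked with the usual relation: junction temperature equals ambient temperature plus power times θ<sub>JA</sub>. The sketch below uses only the reported values; the variable names are hypothetical.

```python
AMBIENT_C = 25.0          # reported ambient temperature
THETA_JA_C_PER_W = 1.9    # reported effective thermal resistance
TOTAL_POWER_W = 0.34      # reported total on-chip power
THERMAL_MARGIN_C = 59.4   # reported thermal margin

# Junction temperature rise is power times thermal resistance.
junction_c = AMBIENT_C + TOTAL_POWER_W * THETA_JA_C_PER_W

# Extra power that would consume the whole thermal margin.
margin_w = THERMAL_MARGIN_C / THETA_JA_C_PER_W

print(f"junction = {junction_c:.1f} C, margin = {margin_w:.1f} W")
```

The computed values round to the reported 25.6°C junction temperature and 31.3 W margin.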

# Conclusion

## What We Achieved

- ✓ The architecture achieves correct results while consuming **extremely low FPGA resources** (<0.1% LUTs, 0 DSPs) and operating with **very low power**.
- ✓ This confirms that **serialized arithmetic** is a viable alternative to traditional parallel MAC-based accelerators, especially for **edge-class and resource-constrained hardware platforms**.
- ✓ Through its multi-lane **parallel processing feature**, the design accelerates hidden-layer computation by evaluating **multiple neurons simultaneously**, boosting throughput while preserving low area usage.

## Future Aspects: Bit-Serial Neural Computation Engine without DSP or MAC





## References

- [1] S. M. Kueh and T. J. Kazmierski, "Low-power and low-cost dedicated bit-serial hardware neural network for epileptic seizure prediction system."
- [2] A. F. Murray, A. V. W. Smith, and Z. F. Butler, "Bit-Serial Neural Networks," Department of Electrical Engineering, University of Edinburgh, The King's Buildings, Mayfield Road, Edinburgh, Scotland, EH9 3JL.
- [3] AMD Xilinx, *Vivado Design Suite User Guide*, 2024.
- [4] Introduction to Neural Computation, MIT.
- [5] G. Csordas, B. Feher, and T. Kovacshazy, "Application of bit-serial arithmetic units for FPGA implementation of convolutional neural networks," in *Proc. 19th Int. Conf. Appl. Electron. (AE)*, Pilsen, Czech Republic, 2018, pp. 23–28.



**THANK YOU !**