

# Towards Energy-Efficient, Memory-Centric Architectures for Graph Neural Network Acceleration on Edge Devices

## (1) Research theme

In the modern data ecosystem, there is a paradigm shift towards both decentralized computing (Edge AI) and the analysis of complex, interconnected data structures. The premier technology for the latter is **Graph Neural Networks (GNNs)**, a class of AI models uniquely capable of learning from relational data like social networks, molecular structures, and IoT sensor grids. Concurrently, **Edge AI** seeks to move computation from centralized clouds to resource-constrained devices at the data source, enabling low-latency, private, and autonomous intelligent systems. When combined, these trends promise to unlock sophisticated, real-time analytics directly within smart environments.

However, a fundamental conflict exists: GNNs are notoriously memory-bound and computationally intensive, characterized by irregular data access patterns that overwhelm conventional processors. Edge devices, in contrast, operate under stringent power, memory, and area constraints. This research proposes to bridge this critical gap by exploring and developing novel **energy-efficient, memory-centric hardware architectures** specifically designed for GNN acceleration. The goal is to create a new paradigm for on-device GNN processing that moves computation closer to data, thereby minimizing the primary bottleneck—data movement—and enabling the deployment of powerful graph-based AI on the next generation of edge devices.

## (2) Research plan

### 1. Background (Problem Statement)

The era of the Internet of Things (IoT) and Industry 4.0 is generating unprecedented volumes of graph-structured data from interconnected sensors, machines, and systems. GNNs are the state-of-the-art for extracting valuable insights from this data, but their deployment at the edge is severely hampered. Traditional computing architectures, designed for dense and regular computations, are fundamentally inefficient for GNNs. The irregular, data-dependent memory accesses required to traverse graph structures lead to poor cache utilization and frequent stalls, creating a "memory wall." This results in high latency and prohibitive energy consumption, making real-time GNN inference impractical on battery-powered edge devices.

As a global leader in robotics, electronics, and smart infrastructure, Japan faces a pressing need to embed advanced intelligence into its systems to maintain a competitive edge and address societal challenges like an aging workforce. Current edge AI solutions are often limited to simpler models like CNNs, leaving the vast potential of relational data untapped. There is a critical need for a new class of hardware accelerators that abandon the conventional compute-centric model and instead adopt a **memory-centric** approach, treating data locality and energy efficiency as first-order design principles. This is vital for unlocking applications in autonomous systems, predictive maintenance, and smart city management.

## 2. Thesis Statement

This research proposes the development of an energy-efficient, memory-centric hardware accelerator framework for GNNs, capable of performing complex graph-based inference on resource-constrained edge devices. The goal is to design a holistic system, based on algorithm-hardware co-design principles, that fundamentally mitigates the data movement bottleneck by integrating computation with on-chip memory systems. This initiative will enable a new class of powerful, low-latency, and privacy-preserving Edge AI applications, aligning directly with Japan's Society 5.0 vision to create a super-smart society where physical and cyberspace are tightly integrated.



- **Input:** GNN Model & Graph Data
- **Core Component: Memory-Centric Accelerator Core**
  - **Graph-Aware Memory Controller:** Intelligently manages data placement and access.
  - **On-Chip Memory Banks (SRAM/eNVM):** Stores graph structure and node features locally.
  - **Array of Processing Elements (PEs):** Physically located close to memory banks to perform computations (Aggregation & Update).
  - **Control Unit:** Orchestrates the dataflow between memory and PEs, minimizing off-chip access.

- **Output:** Inference Results (e.g., Node Labels, Graph Classification)
- **Key Principle:** A tight processing loop is shown between "On-Chip Memory" and "PEs", labeled "Minimized Data Movement". A much smaller, dashed arrow points from the core to "Off-Chip DRAM", labeled "Infrequent Access".

### **3. Methodology:**

The proposed research will be conducted through a structured, multi-phase approach, integrating algorithmic analysis with hardware architecture design and evaluation.

- **Workload Characterization & Bottleneck Analysis:** Initially, a comprehensive analysis of various GNN models (e.g., GCN, GAT, GraphSAGE) will be conducted on existing edge platforms (like NVIDIA Jetson or Raspberry Pi). This phase will precisely identify and quantify the performance and energy bottlenecks, focusing on memory access patterns and hardware underutilization.
- **Architectural Design & Simulation:** Design a novel, memory-centric accelerator architecture. This will involve designing a specialized on-chip memory hierarchy with graph-aware caching policies (e.g., degree-aware caching) and a parallel processing engine optimized for both the sparse aggregation and dense update phases of GNNs. The design will be modeled and simulated using hardware description languages (Verilog/VHDL) and high-level synthesis (HLS) tools.
- **Algorithm-Hardware Co-design Exploration:** Investigate techniques where the GNN model and hardware are optimized in tandem. This includes developing hardware-aware quantization and pruning methods that exploit the accelerator's architecture to maximize efficiency with minimal accuracy loss. The goal is to create models that are inherently "hardware-friendly."
- **FPGA-Based Prototyping & Edge Deployment:** Implement a prototype of the proposed accelerator on a Field-Programmable Gate Array (FPGA). This will allow for rapid, real-world testing and validation of the architecture's performance and energy efficiency before committing to a more rigid ASIC design.
- **Comprehensive Evaluation:** Rigorously benchmark the prototype against state-of-the-art CPUs, GPUs, and existing edge AI accelerators. Key evaluation metrics will include latency, throughput (inferences/second), energy efficiency (inferences/Joule), and area efficiency (performance/mm<sup>2</sup>), using standardized benchmark datasets like those from the Open Graph Benchmark (OGB).
- **Potential Validation:** If feasible, validate the final prototype using a real-world application scenario, such as real-time anomaly detection in a simulated smart factory sensor network, demonstrating its practical utility and performance benefits.

#### **4. Desired Outcome(s):**

This research aims to deliver a significant contribution to the fields of computer architecture and Edge AI. The primary desired outcome is a novel framework and a verifiable prototype of a memory-centric GNN accelerator that demonstrates superior energy efficiency and performance compared to conventional solutions.

Specific potential benefits for Japan include:

- **Enablement of Advanced Edge AI:** Allowing complex GNNs to run on millions of IoT devices, enabling smarter infrastructure in cities, factories, and homes.
- **Enhanced Energy Efficiency and Sustainability:** Drastically reducing the power consumption of edge devices, leading to longer battery life and a more sustainable computing ecosystem.
- **Improved Privacy and Real-Time Responsiveness:** Ensuring sensitive data is processed locally on-device, which enhances security and eliminates cloud communication latency for critical applications.
- **Contribution to Society 5.0:** Providing a key enabling technology for the real-time data fusion and intelligent decision-making at the heart of the Society 5.0 vision.
- **Advancing Japan's Semiconductor Leadership:** Contributing novel architectural designs and intellectual property for the next generation of domain-specific AI chips.

The research will yield theoretical insights into the co-design of AI algorithms and hardware, along with a practical, high-performance framework applicable to industrial challenges not only in Japan but also globally.

#### **5. Motivation for Research:**

My motivation stems from the conviction that the future of AI is not in the cloud, but pervasively and efficiently embedded in the world around us. Japan's Society 5.0 vision, which aims to solve societal problems by integrating cyberspace and physical space, perfectly encapsulates this future. I am particularly fascinated by the fundamental challenge of overcoming the "memory wall," which I believe is the single greatest obstacle to realizing this vision.

My goal is to explore the deep intersection of GNN algorithms, computer architecture, and low-power VLSI design. I aim to contribute to a paradigm shift from brute-force computation to intelligent, efficient hardware design. Studying in Japan, a global powerhouse in electronics, semiconductor manufacturing, and robotics, offers an unparalleled opportunity to work with leading experts and access cutting-edge technologies, providing the ideal environment to pursue this impactful research.

## **6. Key References:**

- T. N. Kipf and M. Welling, "Semi-Supervised Classification with Graph Convolutional Networks," in *ICLR*, 2017.
- Y. Ham et al., "Ape-GNN: A Unifying Framework for GNN Computing on Various Accelerators," in *ASPLOS*, 2022.
- Y. Geng et al., "I-GCN: A Scalable and High-Performance GCN Accelerator with Runtime Locality Enhancement," in *IEEE Transactions on Computers*, vol. 71, no. 11, pp. 2895-2909, 2022.
- Y. Zhu et al., "EnGN: A High-Throughput and Energy-Efficient Accelerator for Large-Scale Graph Neural Networks," in *HPCA*, 2021.
- W. A. Wulf and S. A. McKee, "Hitting the memory wall: implications of the obvious," *ACM SIGARCH Computer Architecture News*, vol. 23, no. 1, pp. 20–24, 1995. [Online]. Available: <https://doi.org/10.1145/218582.218585>
- Report on The 5th Science and Technology Basic Plan (Society 5.0), Available: [https://www8.cao.go.jp/cstp/kihonkeikaku/5basicplan\\_en.pdf](https://www8.cao.go.jp/cstp/kihonkeikaku/5basicplan_en.pdf)