



# Learning Cache Coherence Traffic for NoC Routing Design

Guochu Xiong

College of Computing and Data  
Science

Nanyang Technological University  
Singapore, Singapore  
guochu.xiong@ntu.edu.sg

Xiangzhong Luo

College of Computing and Data  
Science

Nanyang Technological University  
Singapore, Singapore  
xiangzhong.luo@ntu.edu.sg

Weichen Liu\*

College of Computing and Data  
Science

Nanyang Technological University  
Singapore, Singapore  
liu@ntu.edu.sg

## Abstract

The rapid growth of multi-core systems highlights the need for efficient Network-on-Chip (NoC) design to ensure seamless communication. Cache coherence, essential for data consistency, substantially reduces task computation time by enabling data sharing among caches. As a result, routing serves two roles: facilitating data sharing (influenced by topology) and managing NoC-level communication. However, cache coherence is often overlooked in routing, causing mismatches between design expectations and evaluation outcomes. Two main challenges are the lack of specialized tools to assess cache coherence's impact and the neglect of topology selection in routing. In this work, we propose a cache coherence-aware routing approach with integrated topology selection, guided by our Cache Coherence Traffic Analyzer (CCTA). Our method achieves up to 10.52% lower packet latency, 55.51% faster execution time, and 49.02% total energy savings, underscoring the critical role of cache coherence in NoC design and enabling effective co-design.

## CCS Concepts

- Networks → Network on chip.

## Keywords

Cache coherence, Network-on-Chips, routing algorithms, topology-routing co-optimization, Reinforcement learning.

### ACM Reference Format:

Guochu Xiong, Xiangzhong Luo, and Weichen Liu. 2025. Learning Cache Coherence Traffic for NoC Routing Design. In *Great Lakes Symposium on VLSI 2025 (GLSVLSI '25), June 30–July 02, 2025, New Orleans, LA, USA*. ACM, New York, NY, USA, 7 pages. <https://doi.org/10.1145/3716368.3735166>

## 1 Introduction

Advancements in computational power increasingly rely on multi-core system design, where efficient communication is essential. Network-on-Chip (NoC) serves as the backbone of data transfer and coordination, ensuring optimal performance and scalability. A well-designed NoC enhances throughput and minimizes latency, directly boosting computational efficiency. Over time, routing has emerged as a critical research area, bridging NoC architecture and

\*Corresponding author



This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

GLSVLSI '25, New Orleans, LA, USA

© 2025 Copyright held by the owner/author(s).

ACM ISBN 979-8-4007-1496-2/25/06

<https://doi.org/10.1145/3716368.3735166>

communication efficiency while driving major breakthroughs [5, 15, 16, 21, 25]. With AI-driven advancements, routing strategies have significantly improved dynamic NoC management and performance optimization, reinforcing their role in innovation.

In parallel, cache coherence is essential for multi-core communication, especially in NoC systems, where it ensures a consistent memory view across processors. Core requests generate coherence messages to update states or fetch data, enabling efficient parallel access via protocols like MSI, MESI, and MOESI [6]. Research has optimized coherence protocols to reduce traffic [7, 8], improve energy efficiency [9], and minimize latency [10]. However, existing tools overlook cache coherence performance, focusing only on NoC or system-level metrics, making it difficult to evaluate coherence optimizations and their impact on NoC design. This gap hinders a comprehensive assessment of their true effectiveness.

Despite its crucial role in NoC communication, cache coherence is often overlooked in NoC design. Consider the task graph in Figure 1a, where the number of cores matches the number of tasks. Assume Task  $t_0$  and  $t_1$  share some data initially stored in core 0, while  $t_1$  and  $t_3$  share other data stored in core 3. Most NoC designs [12, 27, 28] rely on synthetic traffic (Figure 1b), which ignores cache coherence. However, cache coherence traffic (Figure 1c) better reflects real-world applications, reducing computation time even with optimal mapping and routing. This improvement comes from direct cache-to-cache data sharing, avoiding the costly process of fetching data from main memory. Consequently, routing plays a dual role: (1) facilitating efficient data sharing during task computation through cache-to-cache communication (this portion of communication time is included in the computation time), and (2) managing purely NoC-level communication, counted as communication time. While topology influences mapping and routing effectiveness, optimizing communication time also requires accounting for congestion and network dynamics.

One persistent challenge is integrating topology selection as a core component of routing decisions. Most approaches either fix topologies [21, 25] or treat them as part of the environment [5, 14–16], often leading to suboptimal outcomes, as routing performance depends heavily on network structure, ignoring topology impact can significantly undermine a routing strategy's effectiveness.

Additionally, current routing designs overlook cache-level data sharing from cache coherence, focusing solely on the communication within NoC. As shown in Figure 2, existing routing strategies generally fall into two categories [2, 5, 15–17, 20, 21]: The first focuses solely on design and evaluation within synthetic traffic (Case 1), while the second designs using synthetic traffic but evaluates the system with cache coherence present (Case 2). In Case 1, excluding cache coherence fails to capture real-world traffic overhead, while



(a) An example of task graph



(b) Synthetic traffic with optimal mapping and routing



(c) Cache coherence traffic with optimal mapping and routing

**Figure 1: Motivation example:** (a) DAG Task Graph Representation; (b) Synthetic traffic without cache coherence protocol; (c) Cache coherence traffic with data sharing between cores.

in Case 2, evaluating performance with cache coherence after the design phase misses opportunities to optimize routing from the outset. Cache coherence adds significant traffic, increasing data transfer demands and straining the network. As core counts scale, challenges like energy consumption and congestion intensify, further impacting communication efficiency between caches. Ignoring cache coherence during design creates a mismatch between expected and actual performance. To address these shortcomings, Case 3 integrates cache coherence into both design and evaluation. Coherence traffic alters communication patterns by increasing cache-to-cache exchanges, adding overhead and potentially congesting conventional routing paths. By incorporating cache coherence from the outset, routing strategies can manage traffic more effectively, mitigating congestion and improving NoC performance.

**Figure 2: Difference of NoC design workflow.**

In this paper, we highlight three key observations: (i) NoC design overlooks cache coherence-induced data sharing in task computation, (ii) design objectives often misalign with actual performance when cache coherence is considered, and (iii) a lack of tools prevents a deeper understanding of cache coherence's impact on NoC design. To address these gaps, we introduce a novel analysis tool fully integrated with Gem5 [11] to evaluate cache coherence performance in realistic scenarios. Additionally, we propose a DRL-based approach that optimally selects topology and routing paths, dynamically adapting to network structure while accounting for cache coherence traffic, resulting in improved real-world performance. Our contributions include: (i) developing the Cache Coherence Traffic Analyzer (CCTA) to evaluate key metrics and provide insights into cache coherence effects, (ii) integrating cache coherence features into our framework, highlighting its critical role in NoC design and enabling new co-design opportunities, (iii) introducing a DRL-based framework that optimizes topology and routing by leveraging the interplay between NoC traffic and cache coherence. During design, the DRL agent learns the environment, enabling selection of optimal topologies and routing paths. For users with fixed topologies, we provide pretrained models as baselines for further optimization, benchmarking, and validation. Both the framework and models are fully integrated into Gem5, offering a comprehensive solution for NoC design and evaluation, and (iv) demonstrating the superiority of our approach by consistently outperforming existing algorithms, particularly in handling cache coherence complexities, leading to enhanced performance and efficiency.

## 2 Related Work

To the best of our knowledge, this is the first work to propose cache-coherence-aware DRL-based approach that jointly optimizes topology selection and routing, bridging NoC design and cache coherence into a comprehensive solution.

NoC design has advanced significantly. For instance, Garnet[1] provides a cycle-accurate CMP evaluation model, while ArSMART NoC[12] enhances routing through clustering and dynamic transmission, and neural networks offer faster, more accurate latency estimation[13]. However, existing routing algorithms struggle to integrate topology selection with routing decisions. While RL-based routing is used to reduce congestion and improve performance[22–24] and system feedback aids low-latency designs[25], multi-objective optimization remains confined to fixed topologies[17]. Some works treat topology as static[5, 21] or separate it from routing[14–16], overlooking that topology selection should be guided by routing since the latter inherently depends on network structure.



**Figure 3: Proposed DRL-based topology selection with neural network learning for dynamic routing.**

Recent cache coherence advancements target energy reduction and traffic optimization. For example, [7] improves directory-based coherence by distinguishing private from shared data to lower traffic and DRAM accesses, while [8] introduces DiCo-CMP, which outperforms Token-CMP in traffic and area efficiency. However, most NoC designs treat cache coherence as an external factor, missing opportunities for integrated performance gains. Similarly, cache coherence analysis tools primarily focus on time measurement. For instance, [28] estimates worst-case execution time by integrating isolated cache analysis, timer-based retention, and time-coherence—albeit limited to snooping-based protocols and requiring modifications—while [29] gathers time statistics by altering classical protocols.

Therefore, there remains a pressing need to integrate cache coherence into NoC design and analysis while adapting topology to routing decisions. Our approach fills this gap by developing a cache coherence analysis tool, combining it with NoC metrics during routing design, and optimizing topology and routing holistically, ultimately leading to more efficient traffic distribution, reduced energy consumption, and enhanced system performance.

### 3 Methodology

This section outlines our framework that integrates deep reinforcement learning (DRL) with cache coherence-aware decision-making for dynamic topology selection and routing in NoC systems. We focus on two main components: the Cache Coherence Traffic Analyzer (CCTA) for measuring and analyzing cache coherence metrics (Section 3.1), and the DRL-based topology selection with routing optimization model (Section 3.2), as shown in Figure 3.

#### 3.1 Cache Coherence Traffic Analysis

Cache coherence maintains data consistency and shareability in multi-core systems. Before memory operations propagate, the protocol ensures data copies in other caches are updated or invalidated,

making changes visible to all cores. Coherence messages, like invalidation or update requests, are sent across the NoC, contributing to traffic and affecting NoC performance and system efficiency.

Despite its crucial role in NoC and system performance, cache coherence introduces communication overhead that impacts packet transmission, making accurate measurement essential. However, existing methods lack dedicated evaluation of coherence traffic metrics in NoC design, hindering progress assessment and limiting verification of its impact. To address this gap, we developed the Cache Coherence Traffic Analyzer (CCTA) (Figure 4).



**Figure 4: The framework of CCTA.**

Cache coherence events fall into three categories: write hits in private caches with an S (Shared) state, read misses, and write misses in private or shared caches. For example, in the MESI directory-based protocol [6], the most widely used in current literature, when a private cache receives a write request in the S state, the cache coherence protocol initiates message transmissions between system components until the private cache controller receives acknowledgement (ACK) messages (shown in Figure 5). Without cache coherence (e.g., in synthetic traffic), these messages stay within the requestor, avoiding extra NoC traffic. Thus, cache coherence increases network traffic and CPU delay, which is the time from a write request in the S state to execution, entirely attributed to coherence overhead. Similarly, during read and write misses, the protocol manages message transmissions to maintain data consistency across caches.



**Figure 5: The process of cache coherence as private cache in S state in write hit case.**

The dynamic routing of diverse message types across multiple ports during NoC and cache coherence interactions makes metric

collection and analysis challenging. Specifically, it is difficult to (i) identify cache coherence messages, (ii) determine their starting points, and (iii) manage communication across multiple caches and levels. Additionally, the complexity of cache coherence protocols, where a single state can generate multiple message types and the same message can traverse different states, further complicates classification and collection. CCTA addresses these challenges by using intermediate states to capture transient events alongside primary theoretical states in each cache.

To evaluate timing metrics (CPU delay, average write miss time (including CPU delay), and average memory fetch time), CCTA marks the start and end points within the L1 private cache where the initial request and final response occur, using primary and intermediate states. Timestamps (orange markers in Figure 4) log each request’s duration, while monitors at private cache controller in-ports record the first and last message timestamps (blue markers in Figure 4). These intervals (orange and blue lines in Figure 5) are accumulated to capture the complete timeline of packet traversal through the NoC, from initiation to final state transitions.

CCTA analyzes cache coherence communication by tracking message transmissions across NoC. Each time a cache line’s state changes—from request initiation to completion—message collection is triggered and continues until the final transmission (green markers in Figure 4). As shown in Figure 5, every state change in the requestor is recorded as a message transmission event, capturing all protocol-generated traffic. This process starts when the first message signals a state change (e.g., a write request for a Shared-state cache line) and ends with the final message (e.g., upon receiving an ACK). By integrating CCTA, we achieve precise measurement and in-depth analysis of cache coherence communication.

### 3.2 Topology Selection and Routing Design

Our methodology leverages DRL to jointly optimize topology selection and routing in a dynamic NoC and cache coherence environment. We choose DRL over traditional RL, adaptive routing[4], and genetic algorithms[3] because it learns multi-objective rewards from both NoC and cache coherence metrics while adapting in real time. Unlike RL (single-objective reward), genetic algorithms (slow convergence), or adaptive routing (complex congestion management), our dual-network architecture—featuring a Q-network combined with an  $\epsilon$ -greedy strategy for decision making and a WeightPredictor for fine-tuning simulation parameters, dynamically adjusts to network conditions, significantly enhancing overall adaptability and performance.

**States.** A unique aspect of our approach is merging cache coherence states (e.g., number of coherence messages, CPU delay, average write miss time) with NoC states (e.g., average packet latency, average packet delay, average link utilization) into a single model state space. This integrated representation enhances context-awareness, enabling more informed decisions that consider both network and cache coherence metrics. The DRL agent and link-weight learning process both use this comprehensive state representation as input.

**Action.** Our DRL framework jointly optimizes topology selection and link-weight learning. For topology selection, the action space consists of candidate topologies. A Q-network computes Q-values based on NoC and cache coherence metrics, and

an  $\epsilon$ -greedy policy—where  $\epsilon$  decays as  $1/(episode+1)$ —chooses the highest-valued topology while still exploring alternatives. For example, if congestion is high, the agent may select a topology with more routing paths even if it’s not typically optimal. For routing, the Link Weight Update module employs a three-layer MLP (256 neurons per hidden layer, ReLU activations, and 50% dropout) whose output layer dynamically adjusts to the chosen topology and core count. Despite its simplicity, this network continually trains in real time to select the lowest-cost paths based on current conditions. Meanwhile, Garnet supports bidirectional links to further enhance routing flexibility.

**Reward.** Unlike traditional methods that optimize latency or energy in isolation, our approach balances NoC and cache coherence trade-offs, minimizing average packet latency ( $L_t$ ), CPU delay ( $H_t$ ), average packet delay ( $D_t$ ), and cache coherence messages ( $C_t$ ). Normalization values are empirically chosen based on simulated metric ranges to ensure proportional contributions and prevent dominance by any single factor:

$$R_t = -(\alpha_1 \times L_t + \alpha_2 \times H_t + \alpha_3 \times D_t + \alpha_4 \times C_t) \quad (1)$$

## 4 Experiment

The experiment, using the PARSEC 2.1 benchmark suite [18] in the Gem5 simulator, evaluates performance in real-world scenarios. Energy consumption is measured with McPAT [19]. Simulations are conducted on NoC configurations with 16 and 64 cores, as shown in Table 1. The system employs a directory-based MESI cache coherence protocol, which is both well-studied in academia and aligned with industry standards such as the Coherent Hub Interface (CHI), ensuring realistic modeling of coherence interactions.

**Table 1: Platform Parameters**

| Platform Parameters       | Values                                                     |
|---------------------------|------------------------------------------------------------|
| Virtual channels per port | 4                                                          |
| Flow control              | Credit-based                                               |
| Frequency                 | 2 GHz                                                      |
| Flit size                 | 128 bits                                                   |
| L1D Cache size            | 64 KB                                                      |
| L2 Cache size             | 2 MB                                                       |
| Memory size               | 512 MB                                                     |
| Cacheline size            | 64 B                                                       |
| Cache coherence protocol  | Directory-based MESI                                       |
| Topology types            | Crossbar, Mesh, Pt2Pt, Torus, Fat-Tree, FlattenedButterfly |

All CCTA traces extracted from Gem5’s output files are utilized as the state representations for the learning environment. The training process is structured into multiple episodes, where each episode corresponds to the execution of a complete application workload. At each decision step within an episode, the current state is passed to the Q-network, which evaluates the Q-values for candidate network topologies—including Crossbar, Mesh, Pt2Pt, Torus, Fat Tree, and Flattened Butterfly. An  $\epsilon$ -greedy policy then selects one of these

topologies and the Link Weight Update module (a three-layer MLP) refines routing. The updated configuration is simulated in the Gem5 environment, which produces updated metrics for NoC traffic and cache coherence behavior. Based on these results, a multi-objective reward is computed that drives backpropagation for both the Q-network and the MLP-based refinement module, with the state updated before the next step. Although experiments focus on one application at a time, the framework can handle multiple workloads by randomly selecting benchmarks per episode while leaving the rest of the training process unchanged.

We compare methods that account for cache coherence with those that do not. For the latter, we evaluate dimension-order routing (XY), Q-learning-based methods like BiLCQ [22] and CrQ [23], as well as the congestion-adaptive DyAD [24]. While BiLCQ and CrQ use Q-tables for updates, DyAD makes decisions based on queue lengths without Q-learning. To ensure a fair comparison, we integrate our Cache Coherence Traffic Analyzer (CCTA) with RL-based [20] and DRL-based [21] methods (i.e., non-cc-aware RL and non-cc-aware DRL) transforming them into cache coherence-aware versions, referring to as cc-aware RL and cc-aware DRL. This enables direct comparison within a cache-coherent environment.

#### 4.1 Analysis of Cache Coherence Necessity

To assess the necessity of incorporating cache coherence in routing, we compare XY, non-cc-aware RL, and cc-aware RL based on average packet delay. As shown in Figure 6, cc-aware RL reduces packet delay by up to 17.65% and 14.29% compared to XY and non-cc-aware RL. Since both RL-based methods share the same routing approach, the reduction in delay highlights the role of cache coherence in improving data transmission efficiency. Furthermore, the results confirm that CCTA accurately captures cache coherence performance without compromising evaluation accuracy.



Figure 6: Normalized Average Packet Delay for PARSEC.

#### 4.2 Analysis of Cache Coherence Impact

This section examines how cache coherence affects NoC traffic. First, we compare NoC energy consumption for non-cc-aware RL, non-cc-aware DRL, and our routing framework without cache coherence, showing our routing method's superior efficiency. Next, we evaluate average packet latency, total energy consumption, and execution time in both cache-coherent and non-cache-coherent scenarios, highlighting the influence of cache coherence on NoC design and the effectiveness of our approach.

**Results on NoC Energy.** As illustrated in Figure 7, our DRL-based routing framework (without cache coherence consideration) reduces total NoC energy consumption by up to 90.75%, 89.75%, and 49.68% compared to XY, non-cc-aware RL, and non-cc-aware DRL, respectively, demonstrating its efficiency in optimizing both path selection and traffic management.



Figure 7: Normalized Total NoC Energy for PARSEC.

**Results on Average Packet Latency.** Figure 8a shows that our method consistently outperforms other routing strategies, reducing average latency by 3.36%, 10.52%, 7.43%, 7.65%, and 0.63% compared to XY, DyAD, CrQ, cc-aware RL-based, and cc-aware DRL-based methods, respectively, while incurring 1.51% more latency than BiLCQ. Note that XY, DyAD, CrQ, and BiLCQ operate without cache coherence. For workloads such as X264, Flu, and Fer, BiLCQ achieves lower latency, as it employs the RL-based routing which only optimizes latency without considering cache coherence or related overhead, it can aggressively reduce delays in some workloads but at the expense of increased overall energy usage (Figure 8b). In contrast, our method achieves a better trade-off by maintaining competitive latency while substantially reducing total energy consumption, enhancing overall system performance and greater long-term sustainability.

**Results on Total Energy.** Figure 8b shows our method reduces energy consumption by 44.75%, 48.26%, 49.02%, 47.38%, 41.77%, and 40.67% relative to XY, DyAD, BiLCQ, CrQ, cc-aware RL, and cc-aware DRL. Non-cc-aware techniques consistently consume more energy. Under cache coherence, cc-aware RL and DRL match our energy and latency on all workloads except Vips by effectively managing congestion and coherence traffic. However, they rely on inflexible topologies and incremental updates, limiting synergy between topology selection, routing, and coherence. Our method dynamically integrates these elements, avoiding unnecessary overhead and achieving notably shorter execution time (Figure 8c).

**Results on Execution Time.** Our method (Figure 8c) reduces execution time by 55.51% and 31.20% compared to cc-aware RL and cc-aware DRL, respectively. Notably, cc-aware RL and DRL achieve nearly identical execution times on X264, Fer, and Bla—workloads whose predictable, loop-driven communication phases allow both methods to incur comparable overhead and performance. In contrast, Flu and Vips exhibit more bursty, dynamic data-sharing patterns, causing the two methods to diverge in traffic management and produce greater variation in execution times.



(a) Average packet latency



(b) Total energy



(c) Execution time

Figure 8: Normalized performance in PARSEC: (a) Average packet latency, (b) Total energy, and (c) Execution time.

**Results on Cache Coherence Performance.** Figure 9 shows our method reduces coherence-induced message transmissions by 8.33% and 0.95% compared to cc-aware RL and DRL, respectively, while remaining stable across all workloads. Although coherence protocols inherently increase traffic, our proactive approach throttles congestion and boosts communication efficiency. For workloads like Fer and Bla, this approach slightly increases coherence messages to ensure stable data exchange and balanced packet distribution. Coupled with improved performance (Figure 8), our method



Figure 9: Normalized total messages driven by cache coherence protocol for PARSEC.



Figure 10: Comparison of total NoC energy per packet and average link utilization across 16 and 64 cores.

delivers a superior balance of NoC efficiency, coherence traffic, and overall system performance. Moreover, RL- and DRL-based routing show similar performance under cache coherence for certain workloads, underscoring the need for more advanced cc-aware frameworks like ours to handle coherence more effectively.

### 4.3 Analysis of Scalability

We compare total NoC energy per packet and average link utilization in 16-core and 64-core systems. As shown in Figure 10, our method reduces NoC energy per packet by 16.03% and 13.67% compared to cc-aware RL and cc-aware DRL in the 16-core system, while boosting average link utilization by 30.82%. In the 64-core system, it achieves even greater gains—10.61% and 14.0% higher link utilization and 25.19% and 20.37% lower energy per packet

than cc-aware RL and cc-aware DRL, respectively—demonstrating superior scalability and efficiency as core counts grow.

## 5 Conclusion

In this work, we demonstrate that cache coherence significantly reduces task computation time by enabling efficient data sharing among caches, thereby enhancing overall system performance. Building on this insight, we identify and address three critical gaps in current NoC design: (i) misalignment between design objectives and actual performance when considering cache coherence, (ii) lack of analysis tools, limiting the understanding of cache coherence's impact, and (iii) unexploited potential of integrating topology selection into routing decisions. To address these challenges, we propose a DRL-based framework that jointly optimizes topology selection and routing by dynamically learning from both NoC and cache coherence metrics. Additionally, we develop a Gem5-compatible Cache Coherence Traffic Analyzer (CCTA) to enable precise, fine-grained evaluation of coherence behaviors, providing deeper insights into its role in NoC design. Our approach achieves up to 10.52% lower packet latency, 55.51% faster execution time, and 49.02% total energy savings, while maintaining minimal coherence overhead. These results highlight the importance of explicitly considering cache coherence in NoC design, paving the way for NoC-coherence co-design and demonstrating the broad potential of our methodology.

## Acknowledgments

This work is partially supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 1 (RG94/23).

## References

- [1] N. Agarwal, T. Krishna, L. -S. Peh and N. K. Jha, "GARNET: A detailed on-chip network model inside a full-system simulator," 2009 IEEE International Symposium on Performance Analysis of Systems and Software, Boston, MA, USA, 2009, pp. 33-42, doi: 10.1109/ISPASS.2009.4919636.
- [2] M. Li, W. Liu, L. H. K. Duong, P. Chen, L. Yang and C. Xiao, "Contention-Aware Routing for Thermal-Reliable Optical Networks-on-Chip," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 40, no. 2, pp. 260-273, Feb. 2021, doi: 10.1109/TCAD.2020.2994261.
- [3] G. Leary, K. Srinivasan, K. Mehta and K. S. Chatha, "Design of Network-on-Chip Architectures With a Genetic Algorithm-Based Technique," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 5, pp. 674-687, May 2009, doi: 10.1109/TVLSI.2008.2011205.
- [4] Zhiliang Qian, Paul Bogdan, Guopeng Wei, Chi-Ying Tsui, and Radu Marculescu. 2012. A traffic-aware adaptive routing algorithm on a highly reconfigurable network-on-chip architecture. In Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (CODES+ISSS '12). Association for Computing Machinery, New York, NY, USA, 161–170. <https://doi.org/10.1145/2380445.2380475>
- [5] S. Zeng, X. Xu and Y. Chen, "Multi-Agent Reinforcement Learning for Adaptive Routing: A Hybrid Method using Eligibility Traces," 2020 IEEE 16th International Conference on Control & Automation (ICCA), Singapore, 2020, pp. 1332-1339, doi: 10.1109/ICCA51439.2020.9264518.
- [6] Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, David A. Wood, and Natalie Enright Jerger. 2020. A Primer on Memory Consistency and Cache Coherence (2nd ed.). Morgan & Claypool Publishers.
- [7] E. Derebasoglu, I. Kadaiif and O. Ozturk, "Coherency Traffic Reduction in Many-core Systems," 2022 25th Euromicro Conference on Digital System Design (DSD), Maspalomas, Spain, 2022, pp. 262-267, doi: 10.1109/DSD57027.2022.00043.
- [8] A. Ros et al., "Dealing with traffic-area trade-off in direct coherence protocols for many-core CMPs," in APPT. Springer, 2009, pp. 11–27.
- [9] Tales Marchesan Chaves, Everton Alceu Carara, and Fernando Gehm Moraes. 2011. Energy-efficient cache coherence protocol for NoC-based MPSoCs. In Proceedings of the 24th symposium on Integrated circuits and systems design (SBCCI '11). Association for Computing Machinery, New York, NY, USA, 215–220. <https://doi.org/10.1145/2020876.2020925>
- [10] L. Masing, A. Srivatsa, F. Kreß, N. Anantharajah, A. Herkersdorf and J. Becker, "In-NoC Circuits for Low-Latency Cache Coherence in Distributed Shared-Memory Architectures," 2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Hanoi, Vietnam, 2018, pp. 138-145, doi: 10.1109/MCSoC2018.2018.00033.
- [11] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 simulator. SIGARCH Comput. Archit. News 39, 2 (May 2011), 1–7. <https://doi.org/10.1145/2024716.2024718>.
- [12] H. Chen, P. Chen, J. Zhou, L. H. K. Duong and W. Liu, "ArSMART: An Improved SMART NoC Design Supporting Arbitrary-Turn Transmission," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 41, no. 5, pp. 1316-1329, May 2022, doi: 10.1109/TCAD.2021.3091961.
- [13] Y. Li and P. Zhou, "Fast and Accurate NoC Latency Estimation for Application-Specific Traffics via Machine Learning," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 70, no. 9, pp. 3569-3573, Sept. 2023, doi: 10.1109/TC-SII.2023.3258700.
- [14] Q. Ijaz and E. B. Bourennane, "An improved hybrid network-on-chip with flexible topology and frugal routing," The Journal of Engineering, vol. 2024, no. 6, p. e12395, 2024.
- [15] S. S. Bhavanasi, L. Pappone and F. Esposito, "Routing with Graph Convolutional Networks and Multi-Agent Deep Reinforcement Learning," 2022 IEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN), Phoenix, AZ, USA, 2022, pp. 72-77, doi: 10.1109/NFV-SDN56302.2022.9974607.
- [16] A. Abrol, P. Murali Mohan, and T. Truong-Huu, "A Deep Reinforcement Learning Approach for Adaptive Traffic Routing in Next-gen Networks," 2024 IEEE International Conference on Communications (ICC), Denver, USA, 2024. [Online]. Available: <https://arxiv.org/abs/2402.04515>.
- [17] Shaocong Wang, Xiaoyun Zhang, Changhong Wang, Ke Wu, Cunlu Li, Dezun Dong, DRLAR: A deep reinforcement learning-based adaptive routing framework for network-on-chips, Computer Networks, Volume 246, 2024, 110419, ISSN 1389-1286, <https://doi.org/10.1016/j.comnet.2024.110419>.
- [18] J. C. Bienia, S. Kumar, J. P. Singh, K. Li, The PARSEC benchmark suite: Characterization and architectural implications, in: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, 2008, pp.
- [19] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), New York, NY, USA, 2009, pp. 469-480.
- [20] Sheng-Chun Kao, Chao-Han Huck Yang, Pin-Yu Chen, Xiaoli Ma, and Tushar Krishna. 2019. Reinforcement learning based interconnection routing for adaptive traffic optimization. In Proceedings of the 13th IEEE/ACM International Symposium on Networks-on-Chip (NOCS '19). Association for Computing Machinery, New York, NY, USA, Article 17, 1–2. <https://doi.org/10.1145/3313231.3352369>.
- [21] Paul Almasan, José Suárez-Varela, Krzysztof Rusek, Pere Barlet-Ros, and Albert Cabellos-Aparicio. 2022. Deep reinforcement learning meets graph neural networks: Exploring a routing optimization use case. Comput. Commun. 196, C (Dec 2022), 184–194. <https://doi.org/10.1016/j.comcom.2022.09.029>.
- [22] F. Farahnakian et al., "Bi-LCQ: A low-weight clustering-based Q-learning approach for NoCs," Microproc. Microsyst., vol. 38, no. 1, pp. 64–75, Feb. 2014.
- [23] N. Gupta et al., "Improved Route Selection Approaches using Q-learning framework for 2D NoCs," in Proc. of the 3rd Inter. Workshop on Many-core Embedded Systems, New York, USA, Jun. 13-14, 2015.
- [24] J. Hu and R. Marculescu, "DyAD: smart routing for networks-on-chip," in Proc. 41st Annu. Design Automation Conf. (DAC), New York, USA, Jun. 7-11, 2004.
- [25] Reshma Raj R.S., Rohit R., Mushrif Shaikh Shahreyar, Akash Raut, Pournami P.N., Saidalavi Kalady, Jayaraj P.B., DeepNR: An adaptive deep reinforcement learning based NoC routing algorithm, Microprocessors and Microsystems, Volume 90, 2022, 104485, ISSN 0141-9331, <https://doi.org/10.1016/j.micpro.2022.104485>.
- [26] Hui Chen, Zihao Zhang, Peng Chen, Xiangzhong Luo, Shiqing Li, and Weichen Liu. 2021. MARCO: A High-performance Task Mapping and Routing Co-optimization Framework for Point-to-Point NoC-based Heterogeneous Computing Systems. ACM Trans. Embed. Comput. Syst. 20, 5s, Article 54 (October 2021), 21 pages. <https://doi.org/10.1145/3476985>.
- [27] Hui Chen, Zihao Zhang, Peng Chen, Shien Zhu, and Weichen Liu. 2021. Parallel Multipath Transmission for Burst Traffic Optimization in Point-to-Point NoCs. In Proceedings of the 2021 Great Lakes Symposium on VLSI (GLSVLSI '21). Association for Computing Machinery, New York, NY, USA, 289–294. <https://doi.org/10.1145/3453688.3461521>.
- [28] S. Bayes, M. Hossam and M. Hassan, "Shared Data Kills Real-Time Cache Analysis. How to Resurrect It?", 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), Valencia, Spain, 2024, pp. 1-6, doi: 10.23919/DATEx58400.2024.10546818.
- [29] A. M. Kaushik, M. Hassan and H. Patel, "Designing Predictable Cache Coherence Protocols for Multi-Core Real-Time Systems," in IEEE Transactions on Computers, vol. 70, no. 12, pp. 2098-2111, 1 Dec. 2021, doi: 10.1109/TC.2020.3037747.