

---

# [AI System Design – Term Project] Optimizations of AI Accelerator Networks

Chester Sungchung Park (박성정)  
SoC Design Lab, Konkuk University  
Webpage: <http://soclab.konkuk.ac.kr>

# Teaching Assistants

---

- Shinyoung Kim ([shinyoungkim@konkuk.ac.kr](mailto:shinyoungkim@konkuk.ac.kr))
- Jaesuk Lee ([jaesuki98@konkuk.ac.kr](mailto:jaesuki98@konkuk.ac.kr))

# Outline

---

- ❑ Introduction
  - Target workload
  - NetTLMsim
- ❑ Problem to solve
- ❑ Reference project
- ❑ Evaluation
- ❑ Submission

# Target Workload

## □ BERT-medium *Layer group A*

- Layer group 0 + matmul ($QK^T$)



# NetTLMsim

## Virtual prototype

Assumed design parameters

| Category             | Parameter                   | Value |
|----------------------|-----------------------------|-------|
| Space-Time Mapping   | Target Hardware             | 16 chiplets per package [1],<br>16 PEs per chiplet [1, 2],<br>1024 MACs per PE [1, 3],<br>total 512 TOPS at 1 GHz,<br>NVDLA-like vector MAC array |
| Network Architecture | Loop Order                  | OS-LWS [4] |
|                      | Topology                    | Mesh [2] |
|                      | Routing                     | DOR YX [2] |
|                      | Flow Ctrl                   | Cut-through [2] |
|                      | Packet Len                  | 1 head flit, 16 body flits [2] |
| Memory Architecture  | DRAM Type                   | HBM2E, 2 devices,<br>total 1024 GB/s |
|                      | SRAM Type                   | Dual port [3] |
|                      | Global Buf. Size, Bandwidth | 640 KB activation storage [4],<br>128 GB/s per NoC router |
|                      | Local Buf. Size, Bandwidth  | 36 KB x2 (double buffered) [4],<br>128 GB/s at 1 GHz |
|                      | Precision                   | 8 bits (24 bits for partial sums) [4] |

[1] Cai et al. 2024  
[2] Zimmer et al. 2020  
[3] Keller et al. 2023  
[4] Venkatesan et al. 2019



System overview



For more details, check the following presentation:

F. S. Park and C. S. Park, *Pre-RTL simulation based design space exploration for multi-chiplet dataflows*, HiPChips (MICRO 2024).

# NetTLMsim

## Virtual prototype (cont'd)



|             | Sequence number |       |       |       |       |       |       |       |       |
|-------------|-----------------|-------|-------|-------|-------|-------|-------|-------|-------|
|             | Pass0           | Pass1 | Pass2 | Pass3 | Pass4 | Pass5 | Pass6 | Pass7 | Pass8 |
| WE          | Load 0          | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     |
| Compute     | 0               | 1     | 0     | 1     | 2     | 3     | 4     | 5     | 6     |
| Store       |                 |       |       |       |       |       |       |       |       |
| FC-Q        |                 |       |       | 1     | 2     | 3     | 4     | 5     | 6     |
| FC-V        |                 |       | 0     | 1     | 2     | 3     | 4     | 5     | 6     |
| Transpose K | 0               |       |       | 1     | 2     | 3     | 4     | 5     | 6     |
| Merge K     |                 |       |       | 0     | 1     | 2     | 3     | 4     | 5     |
| FC-K        |                 |       |       | 1     | 2     | 3     | 4     | 5     | 6     |
| KT          |                 |       |       | 0     | 1     | 2     | 3     | 4     | 5     |
| MK          |                 |       |       | 0     | 1     | 2     | 3     | 4     | 5     |



# Reference Project

- Build the project and run the simulation (see **Appendix A**)



# Problem to Solve

---

- ❑ Partitioning/placement configuration selection
  - Input: **NetTLMsim output** for the $n$-th iteration
  - Output: **NetTLMsim input** for the $(n+1)$-th iteration
    - ✓ Only partitioning/placement configuration *varied*
    - ✓ Hardware/layer group configuration *kept fixed*
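
The input/output relationship above can be pictured as a simple loop. The sketch below is only an illustration, not part of the reference project: `run_dse`, `fake_simulate`, and `fake_select` are hypothetical names, and the toy "simulator" just returns a number derived from the configuration.

```python
from typing import Any, Callable, Dict, List

Config = Dict[str, Any]
Result = Dict[str, Any]


def run_dse(
    initial_config: Config,
    simulate: Callable[[Config], Result],
    select_next: Callable[[Config, Result, int], Config],
    iterations: int,
) -> List[Result]:
    """Alternate one simulation run and one selection step per iteration."""
    config = initial_config
    results: List[Result] = []
    for n in range(1, iterations + 1):
        result = simulate(config)  # exactly one simulation per iteration
        results.append(result)
        if n < iterations:
            # The output of iteration n is the basis for the input of n + 1.
            config = select_next(config, result, n + 1)
    return results


# Toy stand-ins (hypothetical): "throughput" grows with a placement offset.
def fake_simulate(config: Config) -> Result:
    return {"throughput": 100 + config["partitions"]["offset"]}


def fake_select(config: Config, result: Result, next_iteration: int) -> Config:
    old = config["partitions"]
    new_partitions = dict(old, offset=old["offset"] + 1)
    return dict(config, partitions=new_partitions)  # hardware left untouched


history = run_dse(
    {"hardware": "fixed", "partitions": {"offset": 0}},
    fake_simulate,
    fake_select,
    iterations=3,
)
print([r["throughput"] for r in history])  # [100, 101, 102]
```

Only the `partitions` entry changes from one iteration to the next; everything else in the configuration is passed through unchanged.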

# Design Constraints

---

- ❑ Optimize with respect to partitioning and placement only
  - You are **not** allowed to modify any other settings, e.g., hardware or layer group configurations
    - ✓ Otherwise, your submission will not be considered for evaluation
- ❑ Run *no* additional simulations (e.g., those using NetTLMsim) *inside* the partitioning/placement configuration selection
  - Only a single simulation run is allowed per iteration.
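
One way to guard against accidental violations of the first constraint is to compare the configuration before and after the selection step. The helper below is a sketch under assumptions: `only_partitions_changed` is a hypothetical name, the `hardware` key is made up for the example, and only the `partitions` and `_simple_dse` keys are taken from the reference listing.

```python
import copy
from typing import Any, Dict


def only_partitions_changed(old: Dict[str, Any], new: Dict[str, Any]) -> bool:
    """Return True if old and new differ at most in 'partitions'.

    The '_simple_dse' bookkeeping entry is ignored as well.
    """
    ignored = {"partitions", "_simple_dse"}
    old_rest = {k: v for k, v in old.items() if k not in ignored}
    new_rest = {k: v for k, v in new.items() if k not in ignored}
    return old_rest == new_rest


base = {
    "hardware": {"chiplets": 16},  # hypothetical key, for illustration only
    "partitions": {"layer_group_A": {"compute_nodes": {}}},
}

ok = copy.deepcopy(base)
ok["partitions"]["layer_group_A"]["compute_nodes"]["fc_q"] = {"router_ids": [0, 1]}

bad = copy.deepcopy(base)
bad["hardware"]["chiplets"] = 32

print(only_partitions_changed(base, ok))   # True
print(only_partitions_changed(base, bad))  # False
```

Taking a `copy.deepcopy` of the configuration before the selection step matters here, because the reference code mutates the configuration in place.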

# Design Constraints

## □ Modify **only** (a part of) the body of *simple\_dse.py*

- You are allowed to change **only** the code between “*Edit code below*” and “*Edit code above*”



```
import random
from datetime import datetime
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional


def config_selector(
    config: Dict[str, Any],
    iteration: int,
    allowed_groups: Optional[List[str]] = None,
    last_iter_dir: Optional[Path] = None,
) -> Dict[str, Any]:
    partitions = config.get("partitions")
    if not isinstance(partitions, dict):
        return config
    # Load last iteration logs
    if last_iter_dir is not None and last_iter_dir.exists() and iteration != 1:
        """ Edit code below """
        def _rotate_router_ids(
            nodes: Dict[str, Any],
            skip_if: Optional[Callable[[str, Dict[str, Any]], bool]] = None,
        ) -> None:
            """Mutate nodes in-place by rotating their 'router_ids'."""
            for node_key, node in nodes.items():
                if skip_if is not None and skip_if(node_key, node):
                    continue
                ids = node.get("router_ids")
                if isinstance(ids, list) and ids:
                    if len(ids) == 1:
                        shift = 1
                    else:
                        shift = random.randint(1, len(ids) - 1)
                    node["router_ids"] = ids[shift:] + ids[:shift]

        for group_key, group in partitions.items():
            # The group index is the suffix after the last underscore
            try:
                group_idx = group_key.split("_")[-1]
            except (ValueError, IndexError):
                group_idx = None

            if allowed_groups is not None and group_idx not in allowed_groups:
                continue

            compute_nodes = group.get("compute_nodes") or {}
            _rotate_router_ids(compute_nodes)

            # Weight tensors keep their placement
            tensor_nodes = group.get("tensor_nodes") or {}
            _rotate_router_ids(
                tensor_nodes,
                skip_if=lambda node_key, node: "weight" in node_key,
            )

        """ Edit code above """

        # Ensure mutated partitions stay attached
        config["partitions"] = partitions

        # Record a breadcrumb so users can see what changed
        meta = config.setdefault("_simple_dse", {})
        history = meta.setdefault("history", [])
        history.append(
            {
                "iteration": iteration,
                "timestamp": datetime.now().isoformat(),
                "target_groups": allowed_groups if allowed_groups is not None else "all",
            }
        )
        meta["last_iteration"] = iteration

    return config
```
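
To see what the rotation in the reference code does to a single `router_ids` list, the following toy example may help. It is only a sketch: `rotate` is a hypothetical helper that reuses the same slicing expression with a fixed shift instead of a random one, and the router numbers are made up.

```python
from typing import List


def rotate(ids: List[int], shift: int) -> List[int]:
    """Rotate ids left by shift positions (same slicing as the reference code)."""
    return ids[shift:] + ids[:shift]


routers = [4, 5, 6, 7]
rotated = rotate(routers, 1)

print(rotated)                     # [5, 6, 7, 4]
print(rotate([9], 1))              # [9]: a single-entry list is unchanged
print(sorted(rotated) == routers)  # True: same routers, different order
```

A rotation never adds or removes routers; it only changes their order within the list, so the reference strategy explores a fairly small part of the placement space.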

```
import json
import re

import pandas as pd

# last_iter_dir is the output directory of the previous iteration (a pathlib.Path)

# Paths to the result summary and the per-component logs
log_dir_path = last_iter_dir / "logs/layer_group_A"
sim_result_path = last_iter_dir / "simulation_result.json"
log_alloc_path = log_dir_path / "allocation.csv"
log_exec_time_path = log_dir_path / "execution_time.csv"
log_flowID_path = log_dir_path / "flowID.csv"
log_link_load_path = log_dir_path / "link_load.csv"
log_routers_path = list(log_dir_path.glob("router_*.csv"))
log_router_path = [f for f in log_routers_path if re.match(r"^.+router_\d+.csv$", f.name)]
log_router_buf_path = [f for f in log_routers_path if re.match(r"^.+router_\d+_.buf.csv$", f.name)]
log_d2d_buf_path = list(log_dir_path.glob("d2d_*.buf.csv"))

# Summary metrics of the previous iteration
with open(sim_result_path, "rb") as f:
    sim_result = json.loads(f.read())
throughput = sim_result["data"]["summary"]["total_throughput_ops"]
power_effs = sim_result["data"]["summary"]["power_efficiency_ops_per_w"]

# Detailed logs as pandas DataFrames
alloc = pd.read_csv(log_alloc_path)
flowID = pd.read_csv(log_flowID_path)
routers = [pd.read_csv(file, skipinitialspace=True) for file in log_router_path]
router_bufs = [pd.read_csv(file, dtype=str, skipinitialspace=True) for file in log_router_buf_path]
d2d_bufs = [pd.read_csv(file, dtype=str, skipinitialspace=True) for file in log_d2d_buf_path]
```

Use these variables to access the results of the previous iteration.
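
As one possible use of the previous iteration's `throughput`, a selector could remember the best placement seen so far and generate the next candidate from it rather than from the latest one. The sketch below assumes this hill-climbing style; `keep_best` is a hypothetical helper, the throughput values are toy numbers rather than simulation results, and how the best record is persisted between iterations is left open.

```python
import copy
from typing import Any, Dict, Optional


def keep_best(
    best: Optional[Dict[str, Any]],
    partitions: Dict[str, Any],
    throughput: float,
) -> Dict[str, Any]:
    """Return the better of the stored best and the just-simulated candidate."""
    if best is None or throughput > best["throughput"]:
        # Deep copy so later in-place mutations do not alter the stored record.
        return {"throughput": throughput, "partitions": copy.deepcopy(partitions)}
    return best


best = None
# (partitions, throughput) pairs as they might come back from three iterations
observations = [
    ({"order": [0, 1, 2]}, 1.0e12),
    ({"order": [1, 2, 0]}, 1.4e12),
    ({"order": [2, 0, 1]}, 1.1e12),
]
for partitions, throughput in observations:
    best = keep_best(best, partitions, throughput)

print(best["partitions"])  # {'order': [1, 2, 0]}
```

The same pattern works for power efficiency, or for a weighted combination of both metrics, by changing the value that is compared.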

# Evaluation

---

- ❑ Submission completeness (10pt)
  - Reproducibility
- ❑ Optimality (40pt)
  - Final throughput and power efficiency
- ❑ DSE efficacy (30pt)
  - Sampling efficiency
  - General applicability
- ❑ Documentation (20pt)
  - Covering all of the items above

# Submission Completeness

---

## ❑ Make sure that your submission is **complete**

- In other words, it should be possible to **reproduce** your design together with the claimed throughput and power efficiency using **only** the files that you submitted by the submission deadline

# Optimality

---

- ❑ Evaluated both **absolutely** and **relatively**
  - In other words, both the absolute value and the ranking matter
- ❑ Evaluated **separately** for throughput and power efficiency

# DSE Efficacy

---

## ❑ Sampling efficiency

- Defined as the final quantity **divided by** the total number of iterations
  - ✓ Evaluates *how many simulation runs are required to reach the final throughput and power efficiency*
- Considered for **both** *training* (fine-tuning) and *inference* in the case of AI/ML (and separately, if necessary)
- The fewer the iterations, the better the sampling efficiency
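
The definition above amounts to a single division. The snippet below is a minimal sketch with a hypothetical function name and toy numbers; any additional weighting used in the actual grading is not specified here.

```python
def sampling_efficiency(final_value: float, iterations: int) -> float:
    """Final metric value divided by the total number of iterations."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return final_value / iterations


# Toy numbers: two runs reach the same final value with different budgets.
print(sampling_efficiency(120.0, 10))  # 12.0
print(sampling_efficiency(120.0, 40))  # 3.0
```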

## ❑ General applicability

- Evaluates whether the proposed idea is generally applicable
  - ✓ For example, different **layer group configurations** (including layer groups 0~2)
- May be further verified by extra hardware/layer group configurations

# Submission

---

Deadline

- **Dec. 12 (Fri), 10:00** GMT+9

Submit **only one zip file** per team, containing the following files:

- Source code (**all** files needed to reproduce your results)
  - ✓ Excluding the files provided by the TAs (e.g., the reference project)
- Documentation (PPT)
  - ✓ Including an explanation of the above source code

Upload the zip file to Ecampus

- Also send it to [chesterku2013@gmail.com](mailto:chesterku2013@gmail.com) as a *backup*

You can post questions on Ecampus (Q&A)

Late submissions will be penalized!

---

# **Appendix A: Running a DSE**

## □ Start the DSE

- Open the terminal with ‘**Ctrl + `**’
- Type the following command and press ‘**Enter**’
  - ✓ `./run_simple_dse.sh TP_AS25 --layer-groups A --iterations N`
  - ✓ Replace **N** with the desired number of iterations

