



# Advanced Packaging and Integrated Chips



**Lecture 11: Cost-Driven Partition**  
**Instructor: Chixiao Chen, Ph.D.**

# Course Presentation Schedule



| Date                    | Presenter    | Type/title |
|-------------------------|--------------|------------|
| 5.13<br>(+20%<br>Bonus) | XXXX0125 冯效贤 | Paper 2.3  |
|                         | XXXX0159 杨洁  | Paper 2.1  |
|                         | XXXX0163 曾昱满 | Paper 2.8  |
| 5.27                    | XXXX0030 张申  | Paper 1.4  |
|                         | XXXX0034 鞠佳琦 | Paper 1.6  |
|                         | XXXX0062 陈泽琦 | Paper 1.8  |
|                         | XXXX0063 陈泽兴 | Paper 1.9  |
|                         | XXXX0068 丁成成 | Paper 4.1  |
|                         | XXXX0110 李勇江 | Paper 3.1  |
|                         | XXXXX044 马宇杰 | Paper 3.2  |

| Date | Presenter     | Type/title |
|------|---------------|------------|
| 6.3  | XXXX0022 郑昊   | Paper 2.2  |
|      | XXXX0022 卢嘉骏  | Paper 2.4  |
|      | XXXX0165 张劲松  | Paper 2.5  |
|      | XXXX0018 孙发显  | Paper 2.6  |
|      | XXXX0088 黄至锐  | Paper 2.9  |
|      | XXXX0079 李沛哲  | Paper 3.3  |
|      | XXXX0155 王运正茂 | Paper 3.4  |
|      | XXXX0127 罗昀斌  | Project    |

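Grading criteria: accurate technical understanding (30%) + analysis of innovations (30%) + clear delivery and logical structure (25%) + answering questions (15%) + timeliness (-X%) = 100%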

# Overview



- Cost-Driven Chiplet Partition Strategy
  - Chiplet Economics with Yield
  - IO Die and Active Interposer
- 1<sup>st</sup> Part of Course Presentation

# Why not just make bigger chips?

If technology scaling only gives you (say) 1.5x more devices per 24 months, why not just make chips 1.33x bigger to get 2x transistors?



395 chips → 362 good die  
(8% yield loss)

(hypothetical/academic example [1], not real yield rates)



192 chips → 162 good die  
(16% yield loss)
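
The two wafer examples above can be checked with a few lines of arithmetic. The sketch below also tests whether the numbers are consistent with a simple Poisson yield model, $Y = e^{-A \cdot D_0}$. The Poisson model and the use of the die-count ratio as a stand-in for the area ratio are assumptions for illustration; the slide does not state which model produced its figures.

```python
import math

# Numbers from the hypothetical wafer example above.
small_total, small_good = 395, 362
large_total, large_good = 192, 162

small_yield = small_good / small_total
large_yield = large_good / large_total

# Assumption: Poisson yield model, Y = exp(-A * D0).
# Infer the expected defects per small die (A * D0) from its observed yield.
small_defects_per_die = -math.log(small_yield)

# Assumption: die area scales inversely with dies per wafer (ignores edge loss).
area_ratio = small_total / large_total
predicted_large_yield = math.exp(-small_defects_per_die * area_ratio)

print(f"small die yield:           {small_yield:.3f}")
print(f"large die yield:           {large_yield:.3f}")
print(f"area ratio (approx.):      {area_ratio:.2f}x")
print(f"Poisson-predicted yield:   {predicted_large_yield:.3f}")
```

Under these assumptions the predicted yield for the larger die lands within about one percentage point of the slide's figure, which illustrates why roughly doubling the die area roughly doubles the yield loss.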



# How does partitioning affect yield?

- Known-good-die (KGD) testing is applied before assembly/packaging to protect the overall package yield.
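
A minimal sketch of why known-good-die testing matters for a multi-die package. It assumes independent die defects, perfect test coverage, and a single lumped assembly yield; the function name `package_yield` and all numeric values are hypothetical.

```python
def package_yield(die_yield, num_dies, assembly_yield=1.0, known_good_die=False):
    """Fraction of assembled packages that work.

    Without KGD testing, untested dies are assembled blindly, so every die
    in the package must happen to be good. With KGD testing, bad dies are
    discarded before assembly and only assembly losses remain.
    """
    if known_good_die:
        return assembly_yield
    return (die_yield ** num_dies) * assembly_yield


# Hypothetical numbers: 4 chiplets at 90% die yield, 99% assembly yield.
blind = package_yield(0.90, 4, assembly_yield=0.99)
tested = package_yield(0.90, 4, assembly_yield=0.99, known_good_die=True)
print(f"without KGD test: {blind:.3f}")
print(f"with KGD test:    {tested:.3f}")
```

Without the test, one bad chiplet scraps the whole package, including the good dies and the packaging cost already spent on it.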



# Yield vs. Area and Cost

- The smaller the chip size, the higher the semiconductor manufacturing yield, which translates into lower cost.
- Building multiple chiplets is not a free lunch: die-to-die interfaces and duplicated logic add silicon area.



# Case Study: 1<sup>st</sup> Gen AMD EPYC Architecture

➤ Introduced 7-8 years ago, AMD's MCM product was very likely a cost-driven innovation

- MCM approach has many advantages
  - Higher yield, enables increased feature-set
  - Multi-product leverage
- AMD EPYC Processors
  - $4 \times 213\text{mm}^2 \text{ die/package} = 852\text{mm}^2 \text{ Si/package}^*$
- Hypothetical EPYC Monolithic processor
  - $\sim 777\text{mm}^2*$
  - Remove die-to-die Infinity Fabric™ PHYs and logic (4/die), duplicated logic, etc.
  - $852\text{mm}^2 / 777\text{mm}^2 \approx 1.10$, i.e. $\sim 10\%$ MCM area overhead
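
The area arithmetic above, together with a rough cost comparison, can be sketched as follows. The die areas come from the slide. The Poisson yield model ($Y = e^{-A \cdot D_0}$), the defect densities, and the `relative_mcm_cost` helper are assumptions for illustration only; this is not AMD's cost model and it ignores packaging, test, and wafer-edge effects.

```python
import math

DIE_AREA_MM2 = 213.0         # per chiplet, from the slide
DIES_PER_PACKAGE = 4         # from the slide
MONOLITHIC_AREA_MM2 = 777.0  # hypothetical monolithic die, from the slide

mcm_area = DIES_PER_PACKAGE * DIE_AREA_MM2
area_overhead = mcm_area / MONOLITHIC_AREA_MM2 - 1.0
print(f"MCM silicon: {mcm_area:.0f} mm^2, overhead: {area_overhead:.1%}")


def poisson_yield(area_mm2, defects_per_mm2):
    """Assumed yield model: Y = exp(-A * D0)."""
    return math.exp(-area_mm2 * defects_per_mm2)


def relative_mcm_cost(defects_per_mm2):
    """Silicon cost of the 4-chiplet package relative to the monolithic die.

    Cost per good die is taken as area / yield, so the wafer cost per mm^2
    cancels out. Assumes chiplets are tested before assembly.
    """
    mono = MONOLITHIC_AREA_MM2 / poisson_yield(MONOLITHIC_AREA_MM2, defects_per_mm2)
    mcm = mcm_area / poisson_yield(DIE_AREA_MM2, defects_per_mm2)
    return mcm / mono


# Hypothetical defect densities, in defects per mm^2.
for d0 in (0.0, 0.0005, 0.001, 0.002):
    print(f"D0 = {d0:.4f}/mm^2 -> relative MCM cost {relative_mcm_cost(d0):.2f}x")
```

With zero defects the MCM is simply about 10% more expensive because of the extra area. As the assumed defect density rises, the yield advantage of the smaller dies outweighs that overhead, which is the mechanism behind a die-cost reduction of the kind AMD reports.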

| Design                  | 32C die cost      |
|-------------------------|-------------------|
| Traditional monolithic  | 1.0X              |
| 1<sup>st</sup> Gen EPYC | 0.59X<sup>1</sup> |

# But homogeneous partitioning is not enough

- As more transistors are integrated, the dies must become smaller still, and duplicating IO-heavy blocks on every die is wasteful.
- Innovation: partition the SoC so that the CPU cores use the leading-edge technology node while the IOs use the N-1 node.



Prior Generation RYZEN™ Processor Die:

- CPU core + L3 on this die comprises ~56% of the area
  - These circuits see increased 7nm gains
- Remaining ~44% sees very little performance and density improvement from 7nm

7nm CCD is ~86% CPU + L3

# Case Study: 2<sup>nd</sup> Gen AMD EPYC CPU



Figure panels: Traditional Monolithic, 1<sup>st</sup> Gen EPYC CPU, 2<sup>nd</sup> Gen EPYC CPU

- Use an Advanced Technology Where it is Needed Most
- Each IP in its Optimal Technology, 2<sup>nd</sup> Gen Infinity Fabric™ Connected
- Centralized I/O Die Improves NUMA
- Superior Technology for CPU Performance and Power

# IO Die Architecture

- Central IOD enables a single NUMA domain per socket
- Improved average memory latency<sup>1</sup> by approximately 24ns (~19%)<sup>2</sup>
- Minimum (local) latency only increases approximately 4ns with chiplet architecture



**Single Domain**

CCD0, CCD1, IO0, CCD2, CCD3, IO1, CCD4, CCD5, IO2, CCD6, CCD7, IO3, MA/MB/MC/MD/ME/MF/MG/MH interleaved

1.46GHz / DDR2933 (coupled)<sup>1</sup>

- 1: Local 94ns
- 2: ~97ns
- 3: ~104ns
- 4: ~114ns

Measured Avg: ~104ns

Repeater: 1 FCLK (1.46GHz)

Switch: 2 FCLK (1.46GHz)  
(low-load bypass, best-case)

# Performance vs. Cost

- Higher core counts and performance than possible with a monolithic design
- Lower costs at all core count/performance points in product line
- Cost scales down with performance by depopulating chiplets
- 14nm technology for IOD reduces fixed cost
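
The way cost scales with chiplet count can be sketched with a simple linear model: one fixed IO die plus a variable number of CCDs. The function name and the cost values (arbitrary units) are hypothetical and chosen only to show the shape of the curve.

```python
def package_silicon_cost(num_ccds, ccd_cost, iod_cost):
    """Silicon cost of one package: a fixed IO die plus num_ccds CPU chiplets."""
    return iod_cost + num_ccds * ccd_cost


# Hypothetical costs in arbitrary units; the IOD is the fixed part.
CCD_COST = 1.0
IOD_COST = 2.0

for num_ccds in (2, 4, 6, 8):
    cost = package_silicon_cost(num_ccds, CCD_COST, IOD_COST)
    fixed_share = IOD_COST / cost
    print(f"{num_ccds} CCDs: cost {cost:.1f}, IOD share {fixed_share:.0%}")
```

The IO die is a fixed cost that weighs most heavily on the low-core-count parts, which is why building it in a cheaper, mature node helps across the whole product line.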



# Chiplet Modularity and Reuse

- The IO die opens up new options for modularity and scalability. Furthermore, the IO die itself can be partitioned or shrunk to reduce cost.
- AMD CPU families are built from only 3 dies



**AMD EPYC 2<sup>nd</sup>-Generation**  
**AMD EPYC Embedded**



# Active Interposer: 3D-ization of IO Dies

- Challenge for IO dies in AI systems: the memory access bandwidth requirements, and the power delivery needed to support them, are far higher than in CPUs.
- 3D packaging technology is therefore desirable. An active interposer can be regarded as a 3D-ization of the IO die.

## Bandwidth + Power<sup>1</sup> Requirements



# Core Die Reuse Across Different Packages



Same CCD for Genoa + MI300A  
3D Interface to IOD



- Same CCD adapted to work for 4<sup>th</sup> Gen EPYC™ CPUs and AMD Instinct™ MI300A 3D stack
  - EPYC™ MCM uses “GMI” SerDes interface through package substrate
  - AMD Instinct™ MI300A vertical stack uses dense TSV interface from IOD to CCD in two-link ‘wide’ mode
  - The dramatically higher 3D signal density, combined with simple interface multiplexing, meant virtually no die size increase