



## 3D Design & Roadmap

UFRGS, August 27<sup>th</sup> 2012

*P. Vivet*

[pascal.vivet@cea.fr](mailto:pascal.vivet@cea.fr)

# 3D Design – will the dream become reality?



■ Focus today on 3D interconnect for complex MPSoC

# Outline

- Introduction:
  - Available 3D-IC technology in LETI and design perspectives
- Our design roadmap
- Mag3D demonstrator implementation:
  - Memory-on-processor (WideIO)
  - Processor-on-processor (ANoC)
- 3D-IC compute node perspective
- Design Flow & perspectives
- Conclusion

# A complete toolset for 3D



- CMOS 300 + 3D 300
- CMOS 200 mm
- MEMS & 3D 200 mm
- Nanoscale Characterization

- Fully operational 300mm line dedicated to 3D – inaugurated in January 2011

# Available 3D-IC technology in LETI



# 3D IC interconnect: Through Silicon Via

## Through Silicon Via (TSV)

### Via First TSV (Polysilicon filled)

Processed before CMOS front-end steps

Pitch: ~10µm

Density: 10000 TSV/mm<sup>2</sup>



Trench AR 20,  
5x100µm

### Via Middle TSV (Copper filled)

Processed after CMOS front-end steps

Pitch: 40µm to 50µm

Density: 500 TSV/mm<sup>2</sup>



AR 10,  
10x100µm

### Via Last TSV (Copper liner)

Processed after metallization

Pitch: ~100µm

Density: 100 TSV/mm<sup>2</sup>



AR 1  
80x80µm



AR 2,  
60x120µm



AR 3,  
40x120µm
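The densities quoted above follow from the pitch if TSVs are assumed to sit on a regular grid, one per pitch cell, and the aspect ratios (AR) follow from depth over diameter. The sketch below is only a sanity check under those assumptions: the function names are illustrative, dimensions such as "5x100µm" are read as diameter x depth, and the via-middle case uses the 50µm x 40µm pitch of the Mag3D stack.

```python
# Sanity check of the TSV density and aspect-ratio figures.
# Assumption: one TSV per (pitch_x x pitch_y) cell of a regular grid.

def tsv_density_per_mm2(pitch_x_um: float, pitch_y_um: float) -> float:
    """Number of TSVs per mm^2 for a regular grid with the given pitch."""
    cell_area_mm2 = (pitch_x_um / 1000.0) * (pitch_y_um / 1000.0)
    return 1.0 / cell_area_mm2


def aspect_ratio(diameter_um: float, depth_um: float) -> float:
    """Aspect ratio of a via, defined here as depth over diameter."""
    return depth_um / diameter_um


if __name__ == "__main__":
    for name, (px, py) in {
        "via first": (10, 10),
        "via middle": (50, 40),
        "via last": (100, 100),
    }.items():
        print(f"{name}: {tsv_density_per_mm2(px, py):.0f} TSV/mm^2")

    # Dimensions read from the slides as diameter x depth, in micrometres.
    for diameter, depth in [(5, 100), (10, 100), (80, 80), (60, 120), (40, 120)]:
        print(f"{diameter}x{depth} um: AR {aspect_ratio(diameter, depth):.0f}")
```

With these inputs the grid model reproduces the orders of magnitude given for the three TSV families.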

# Fine pitch interconnect – technological Leti Roadmap

High volume manufacturability (HVM) (300mm compatibility, high speed P&P)



Pre-applied underfill



Current technologies

« A 10µm pitch interconnection technology using micro tube insertion into Al-Cu for 3D applications », De Brugiere et al., ECTC 2011

100-30 µm range

TLP(Cu/Sn)



Advanced technologies



30-10 µm range



Down to 2 µm

# Design perspectives

- What can we expect today from the technology ?
  - Less than 10 µm diameter TSV is challenging
  - A guard interval must be added around each TSV, which reduces the effective interconnect density
  - ⇒ ~500 interconnects/mm<sup>2</sup>
  - ⇒ Coarse grain to medium grain partitioning for 3D SoC and Silicon board

## 3D SoC



- 3D-stacked dies
- Memory-on-processor:
  - 3D memory hierarchy
- Processor-on-processor:
  - Many-core cluster
- High bandwidth
- Fine grain architecture partitioning
- High density for vertical interconnects
- Face-to-back

## Silicon board



- Dies stacked on a silicon interposer
- Heterogeneous integration:
  - Digital, analog, memory, input/output, power management
- Medium bandwidth
- System partitioning
- High density for horizontal interconnects
- Face-to-face
- Large size silicon interposer

# Our design roadmap



# Mag3D demonstrator

- Focus on 3D-SoC and on intra-chip interconnects
- Partnership between LETI, ST-Ericsson, STMicroelectronics and Cadence
- Same SoC addressing several schemes of 3D integration:

High-speed CMOS technology, 70 mm<sup>2</sup>  
3000 TSVs and micro-bumps  
1000 flip-chip bumps  
TFBGA 12x12x1.2, 581 balls



**Memory-on-Processor version :**  
**WIOMING = Mag3D + WideIO DRAM**



Differentiation with minimum number of masks



DRAM traffic:  
High bandwidth  
Low power

NoC traffic:  
Asynchronous  
Serialization

**Processor-on-Processor version :**  
**MAGtoMAG = Mag3D + Mag3D with 3D ANoC**


# Why do we need WideIO DRAM?

- Graphics and display performance of high-end smartphone and tablet devices will be limited by memory bandwidth in the 2013 time frame
- WideIO provides 2x power efficiency compared to LPDDR2/3
- The current WideIO JEDEC specification proposal reaches 17 GBytes/s. Moving to DDR mode and higher frequencies will eventually enable **WideIO to provide more than 50 GBytes/s**

Technology Transition



DRAM Bandwidth Requirements



# Memory Options and BW

## WIOMING

|                                                                      | LPDDR2<br>PoP | LPDDR3<br>PoP/Discrete  | WideIO<br>single die    | WideIO<br>Cube          | LPDDR4       | WideIO2          |
|----------------------------------------------------------------------|---------------|-------------------------|-------------------------|-------------------------|--------------|------------------|
| <b>BW (Gbyte/s)</b>                                                  | <b>8.5</b>    | <b>12.8</b>             | <b>12.8</b>             | <b>12.8</b>             | <b>~25.6</b> | <b>~34 → 136</b> |
| <b>possible BW evolution (Gbyte/s)</b>                               | -             | <b>17<sup>(1)</sup></b> | <b>17<sup>(2)</sup></b> | <b>17<sup>(2)</sup></b> | -            | -                |
| <b>max package density (Gbit)</b>                                    | <b>4x4</b>    | <b>4x4</b>              | <b>1x4</b>              | <b>4x4</b>              | <b>TBD</b>   | <b>TBD</b>       |
| <b>power efficiency (mW/Gbyte/s)</b>                                 | <b>78</b>     | <b>67</b>               | <b>42</b>               | <b>42</b>               | <b>TBD</b>   | <b>TBD</b>       |
| <b>Samples availability</b>                                          | <b>OK</b>     | <b>OK</b>               | <b>OK</b>               | <b>4Q '12</b>           | <b>2015?</b> | <b>2015?</b>     |
| <b>volume maturity</b>                                               | <b>2011</b>   | <b>2012</b>             | <b>2013</b>             | <b>2013</b>             | <b>2015?</b> | <b>2015?</b>     |
| <b>relative memory cost<br/>for equivalent density<sup>(3)</sup></b> | <b>1</b>      | <b>~1.1</b>             | <b>~1.2</b>             | <b>~1.4</b>             | <b>TBD</b>   | <b>TBD</b>       |

<sup>(1)</sup> LPDDR3E: clock from 800 to 1066 MHz. Discussion has just started at JEDEC and with memory vendors

<sup>(2)</sup> WideIO clock frequency from 200 MHz to 266 MHz: already specified at JEDEC

<sup>(3)</sup> Estimates based on memory supplier survey (memory cost only)
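The WideIO bandwidth entries can be reproduced from the interface parameters quoted in this deck (512 bits at 200 MHz, and 266 MHz for the evolution). The sketch below assumes single-data-rate transfers and that interface power at peak is simply the table's efficiency figure multiplied by the bandwidth; the function names are illustrative.

```python
# Peak bandwidth and interface power, from the figures in the table.
# Assumptions: single data rate (one transfer per clock) for WideIO,
# and power at peak = efficiency (mW per GByte/s) x bandwidth.

def peak_bandwidth_gbyte_s(bus_bits: int, clock_mhz: float,
                           transfers_per_clock: int = 1) -> float:
    """Peak bandwidth in GByte/s for a bus of bus_bits at clock_mhz."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


def interface_power_mw(efficiency_mw_per_gbyte_s: float,
                       bandwidth_gbyte_s: float) -> float:
    """Interface power in mW when running at the given bandwidth."""
    return efficiency_mw_per_gbyte_s * bandwidth_gbyte_s


wideio_200 = peak_bandwidth_gbyte_s(512, 200)   # table: 12.8 GByte/s
wideio_266 = peak_bandwidth_gbyte_s(512, 266)   # table: ~17 GByte/s

lpddr2_power = interface_power_mw(78, 8.5)         # LPDDR2 PoP at 8.5 GByte/s
wideio_power = interface_power_mw(42, wideio_200)  # WideIO at 12.8 GByte/s

if __name__ == "__main__":
    print(f"WideIO @200 MHz: {wideio_200:.1f} GByte/s")
    print(f"WideIO @266 MHz: {wideio_266:.1f} GByte/s")
    print(f"LPDDR2 power at peak: {lpddr2_power:.0f} mW")
    print(f"WideIO power at peak: {wideio_power:.0f} mW")
```

Under these assumptions WideIO delivers about 50% more bandwidth than LPDDR2 for less interface power.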

# Wide IO integration into Mag3D

- 3D test chip backbone is the LETI MAGALI SoC
- The NoC architecture has been extended to interface with Wide IO memory
- Four independent data traffic and memory controllers have been added
- Specific design for WideIO Testability



# Wide IO controller architecture

## Wide IO Memory Controller

- NoC Memory Controller and data transfer management (*LETI*)
- WideIO Memory Controller (*Denali/Cadence*)
- Physical interface & testability (*ST-Ericsson*)

ANoC output:  
550Mflit/s  
32-bit flit  
**2.2 GB/s peak**

**NoC interface with SoC**

Network On Chip

Reset

Reference clock

ANoC input:  
550Mflit/s, 32-bit flit  
**2.2 GB/s peak**
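As a quick check, the peak figure quoted for each ANoC port follows directly from the flit rate and flit width given above:

$$550 \times 10^{6}\ \text{flit/s} \times \frac{32\ \text{bit/flit}}{8\ \text{bit/byte}} = 2.2 \times 10^{9}\ \text{byte/s} = 2.2\ \text{GB/s}$$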



# Mag3D final GDSII



## Circuit Technology

- High speed CMOS TSV middle process
- Face2Back, Die2Die, Flip-Chip 3D assembly

## Main features

- WideIO memory controllers
- 3D ANoC
- 3GPP LTE multi core CPU backbone
- Host CPU

## Circuit numbers

- 125 Million Transistors
- 400 Macros
- 270 pads
- 1980 TSV for 3D NoC
- 1016 TSV for WideIO memory
- 933 Bumps for flip chip

## Circuit performances

- WideIO 200MHz / 512 bits
- Units in the [350 - 400] MHz range
- Asynchronous NoC ~ 550 MHz

# WIOMING stack



- 1016 backside micro-bumps / TSVs:
  - 50µm x 40 µm pitch
  - For signal, test and power
  - No backside redistribution layer
  - Mechanical bumps added



- Package:
  - 12 x 12 x 1.2 BGA
  - 0.4 mm ball pitch
  - 459 balls for signal, test and power



## Assembly technology

|                |               |
|----------------|---------------|
| Assembly       | Die-to-Die    |
| Stacking       | Face-to-Back  |
| TSV process    | Via Middle    |
| TSV diameter   | 10µm          |
| TSV xy pitch   | 50µm x 40 µm  |
| Copper Pillars | 20µm diameter |
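To give a feel for what the WideIO TSV matrix costs in silicon, the sketch below estimates its footprint from the TSV count and xy pitch above, and compares it with the 70 mm² die area quoted for Mag3D. It assumes a regular grid with one TSV per pitch cell and ignores the added mechanical bumps and any keep-out margin around the matrix.

```python
# Rough footprint of the WideIO TSV matrix on the Mag3D die.
# Assumption: regular grid, one TSV per (50 um x 40 um) cell; mechanical
# bumps and keep-out margins are not counted.

TSV_COUNT = 1016          # backside micro-bumps / TSVs for WideIO
PITCH_X_UM = 50
PITCH_Y_UM = 40
DIE_AREA_MM2 = 70         # Mag3D die area quoted in the deck


def matrix_footprint_mm2(count: int, pitch_x_um: float, pitch_y_um: float) -> float:
    """Area in mm^2 occupied by count TSVs on a regular grid."""
    return count * pitch_x_um * pitch_y_um / 1e6


footprint = matrix_footprint_mm2(TSV_COUNT, PITCH_X_UM, PITCH_Y_UM)
density = TSV_COUNT / footprint          # TSVs per mm^2
die_share = footprint / DIE_AREA_MM2     # fraction of the die

if __name__ == "__main__":
    print(f"footprint: {footprint:.3f} mm^2")
    print(f"density:   {density:.0f} TSV/mm^2")
    print(f"die share: {die_share:.1%}")
```

The resulting density matches the ~500 interconnects/mm² expected from today's via-middle technology.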

# Mag3D application board

- Same application environment for hosting the different 3D versions of Mag3D:

- Standalone (Mag3D only):
  - 3GPP-LTE Application perimeter
- Wioming (Mag3D + Wide IO):
  - Wide IO technology performance assessment
  - Thermal behavior analysis
- Mag-to-Mag:
  - 3D ANoC technology performance assessment
- Baseline is existing Magali prototyping board



Magali board



Mag3D daughter board




# 3D Asynchronous NoC for Multi-core Scalability

- For technology nodes < 32 nm
  - Many applications require performance, but mask cost and design time limit development possibilities  
⇒ High volume production is required
- Proposal: easily stackable simple “tiles”
  - No complex PHY: a set of tiles provides the performance required by the application  
⇒ Increases the number of applications served by a single die, reaching the required production volume
- Constraints ?
  - High bandwidth between dies,
  - Easy stacking, no clock distribution issues
  - Power distribution,
  - Testability,
  - Fault Tolerance
- *Proposal : 3D Asynchronous NoC*
  - ⇒ Fast serial link
  - ⇒ Full asynchronous logic
  - ⇒ Including 3D DFT and Fault Tolerance



# Quasi Delay Insensitive Asynchronous Logic

## Quasi Delay Insensitive (QDI) Logic

- Initiated by Caltech Univ. (1995)
- Provide robustness to PVT conditions
- Consume energy only for allowed transitions
- Self adapt to voltage supply
- ⇒ Well suited to 3D TSV connections

## Explicit asynchronous handshakes

- dual-rail or 4-rail encoding
- 4-phase Return to Zero protocol

## Fully implemented in standard-cell

- Using C-elements or Muller gates



QDI 4-Phase / 4-Rail Asynchronous Protocol



Asynchronous 4-rail pipeline stage

# 3D ANoC : Asynchronous NoC features

## ■ ANoC main features

- GALS template
- **2D mesh** based extended in **3D**
- Packet based, source routing
- 32 bits, 2 virtual or physical channels
- GALS interfaces to bridge between asynchronous and synchronous domain
- Local clock generators in each synchronous IP
- **Asynchronous** NoC achieves 550 MFlits/s



« A Fully-Asynchronous Low-Power Framework for GALS NoC Integration »  
Yvain Thonnart, Pascal Vivet, Fabien Clermidy, DATE'2010

## ■ 3D ANoC serial link ?

- ⇒ **serialization**, to reduce number of TSVs at 3D NoC interface,
- ⇒ NoC serial link is also fully implemented in asynchronous logic.
- ⇒ this is a compromise between throughput and number of TSVs



# Serial Link Circuit Implementation

- An n:p serializer is composed of p serializers of m:1
    - An m:1 serializer is a tree of “Self-Controlled Multiplexers”
- $m = \text{Serialization Ratio} = \frac{n}{p}$
- R, the Serialization Bandwidth Ratio, is the throughput cost factor, where:
    - f is the transfer rate of parallel input data
    - g is the transfer rate of serialized output data

$$R = \text{Serialization Bandwidth Ratio} = \frac{n \times f}{p \times g}$$

$$R = \frac{4 \times 550\ \text{Mflits/s}}{1 \times 1200\ \text{Mflits/s}} \approx 1.8, \text{ and not } 4$$
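The ratio can be evaluated directly from the definition. The short sketch below uses the values of the example (n = 4, p = 1, f = 550 Mflits/s, g = 1200 Mflits/s); the function names are illustrative.

```python
# Serialization ratio m and serialization bandwidth ratio R.

def serialization_ratio(n: int, p: int) -> float:
    """m = n / p: number of parallel inputs multiplexed onto each output."""
    return n / p


def serialization_bandwidth_ratio(n: int, p: int, f: float, g: float) -> float:
    """R = (n * f) / (p * g), with f and g in the same unit (e.g. Mflits/s)."""
    return (n * f) / (p * g)


m = serialization_ratio(4, 1)
R = serialization_bandwidth_ratio(n=4, p=1, f=550, g=1200)

if __name__ == "__main__":
    print(f"m = {m:.0f}, R = {R:.2f}")
```

Because the serial link runs faster than the parallel NoC port (1200 versus 550 Mflits/s), a 4:1 serialization costs a throughput factor of about 1.8 rather than 4.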



# Serialization Area Cost Analysis

|           | MD TSV               | HD TSV                | 65 nm                 | 32 nm                  |
|-----------|----------------------|-----------------------|-----------------------|------------------------|
| Parallel  | 0.4 mm <sup>2</sup>  | 0.016 mm <sup>2</sup> | 0 mm <sup>2</sup>     | 0 mm <sup>2</sup>      |
| Serial x2 | 0.2 mm <sup>2</sup>  | 0.008 mm <sup>2</sup> | 0.012 mm <sup>2</sup> | 0.0039 mm <sup>2</sup> |
| Serial x4 | 0.1 mm <sup>2</sup>  | 0.004 mm <sup>2</sup> | 0.016 mm <sup>2</sup> | 0.0056 mm <sup>2</sup> |
| Serial x8 | 0.05 mm <sup>2</sup> | 0.002 mm <sup>2</sup> | 0.019 mm <sup>2</sup> | 0.0067 mm <sup>2</sup> |



MD = Medium density (10 µm diameter)  
HD = High density (1 µm diameter)
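One way to read the table is to add the TSV area of a given TSV density to the serializer logic area of a given CMOS node, and look for the serialization ratio that minimizes the sum. The sketch below does this with the table values; pairing the columns this way is an assumption made for illustration, and the names are hypothetical.

```python
# Total area (TSV + serializer logic) per serialization ratio, in mm^2.
# Values are copied from the table; ratio 1 stands for the parallel link.
# Assumption: total cost = TSV column + logic column.

TSV_AREA = {
    "MD": {1: 0.4, 2: 0.2, 4: 0.1, 8: 0.05},
    "HD": {1: 0.016, 2: 0.008, 4: 0.004, 8: 0.002},
}
LOGIC_AREA = {
    "65nm": {1: 0.0, 2: 0.012, 4: 0.016, 8: 0.019},
    "32nm": {1: 0.0, 2: 0.0039, 4: 0.0056, 8: 0.0067},
}


def total_area(tsv: str, node: str, ratio: int) -> float:
    """TSV area plus serializer logic area for one configuration."""
    return TSV_AREA[tsv][ratio] + LOGIC_AREA[node][ratio]


def best_ratio(tsv: str, node: str) -> int:
    """Serialization ratio with the smallest total area."""
    return min(TSV_AREA[tsv], key=lambda ratio: total_area(tsv, node, ratio))


if __name__ == "__main__":
    for tsv in TSV_AREA:
        for node in LOGIC_AREA:
            areas = {r: round(total_area(tsv, node, r), 4) for r in TSV_AREA[tsv]}
            print(tsv, node, areas, "best: x%d" % best_ratio(tsv, node))
```

Under this reading, serialization pays off clearly with medium-density TSVs, while with high-density TSVs in 65 nm the serializer logic outweighs the TSV area it saves.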

# 3D ANoC integration into Mag3D

- 3D test chip based on the LETI Magali SoC backbone
- The NoC architecture has been extended to support four 3D NoC interfaces
- Includes JTAG based 3D DFT



**Four implementation flavors of 3D ANoC**

- **3D ANoC with fault tolerance:**
  - Add 15% spare TSV per ANoC serial link
  - Test & repair JTAG based architecture
- **3D ANoC with serial link:**
  - Serialization to reduce the number of TSVs: trade-off between throughput (performance) and number of TSVs (cost)
  - 2x redundancy
- **3D ANoC:**
  - Hierarchical 3D router
  - Fully asynchronous

# MAGtoMAG stack specification

- Partnership between LETI, STMicroelectronics and Cadence
- One Mag3D die is stacked on top of another Mag3D die in the same package
- Face to Back stacking
- Mag3D supplies and interconnect signals (3D-NoC) go through the SoC by means of TSVs

| Technology     |               |
|----------------|---------------|
| Assembly       | Die-to-Die    |
| Stacking       | Face-to-Back  |
| TSV process    | Via Middle    |
| TSV diameter   | 10µm          |
| TSV xy pitch   | 50µm x 40 µm  |
| Copper Pillars | 20µm diameter |




# Where is 3D in massively parallel computing ?

## Silicon board:

3D Integrated Circuit  
to fill the integration  
gap for massively  
parallel computing



# 3D-IC compute node architecture

- Several small and low power System-on-Chips:
  - Multi-processor SoC: MPCore CPU + FPU + GPU + processing fabric with 3D-cache hierarchy
  - Memory SoC: Wide IO SDRAM, Non Volatile Memory
  - Interconnect, peripheral and IO SoC: Interfaces (memory, PCIe...), peripheral interconnect and primary Inputs/outputs
- Energy efficient interconnects:
  - WideIO for memory connections
  - Asynchronous Network on Chip (3D NoC) for inter-processor communications



# Interposer based system integration

- Focus is miniaturization and energy efficient intra-chip communications.
- Silicon interposer technology benefits for system integration are:
  - High **horizontal interconnect density with metal layers**
  - Aggressive **vertical interconnect thanks to TSV technology**
  - Backbone for **heterogeneous integration** of small dies + **passives**
  - Backbone for integration of **IOs, shared peripherals , test, Power Management Units**
  - Better **thermal conductivity** with silicon



# 3D Design Flow & Challenges

# Collaborative definition of 3D Design Flow with EDA partners

**Yesterday: Survival kit...**

- Manual implementation of TSV
- Manual partitioning with 2D tools



**Multiple partnerships to prepare 3D design flow**



## 3D Stack Design Exploration

- Multiple techno nodes
- Die partitioning
- Architecture exploration
- Simultaneous floorplan and TSV location exploration



## 3D Stack/Package analysis

- 3D Thermal Profile analysis
- 3D Test & Defect analysis



## 3D Implementation

- 3D Floorplan
- 3D Power planning
- 3D Test
- 2D Place & CTS & Route
- 3D analysis (power/timing)
- 3D Verification

# 3D Design Flow : WIOMING example

- Target technology
  - Uses ST-Microelectronics high-speed CMOS library
  - Uses TSV middle ( $\varnothing 10\mu\text{m}$ ) + Copper Pillar ( $\varnothing 10\mu\text{m}$ )
  - Is a Flip-Chip packaging assembly
  - Is a Face2Back, Die to Die 3D stacking assembly
- Back End kit
  - Virtuoso tech file addon kit for 3D layers
  - EDI Techno file & captable
  - DRC & LVS « 3D » addon kit
- Specific cells
  - Flip Chip Bumps, Micro-bumps,
  - ESDs, Micro-buffers,



# Cadence EDI 3D-IC Stack Design Implementation & Analysis

- Cadence **Encounter 3D-IC** design implementation is developed in collaboration with major foundries and advanced system designers.
- Supports a comprehensive 3D-IC modeling for both implementation and analysis.
  - Different types of 3D Interconnect: TSV, micro-bump, copper pillar or direct bonding and backside metal layers.
  - Multiple set of manufacturing rules
- Supports multiple types of 3D-IC stacking in design implementation and analysis
  - Silicon interposer
  - Vertical stack.
  - Mixed stack.
- EDI Design methodology and design flow are proven with several 3D-IC tape-outs.



# Cadence EDI 3D-IC Analysis Methodology



# 3D-ANoC : TSV Floorplanning



## 3D-ANoC TSV design

- Symmetrical 3D NoC connection for face-up  $\Leftrightarrow$  face-down
- 3D NoC matrix also contains power supplies (gnd/vdd) to supply top die through bottom die

## Place & Route tool flow

- Use automated TSV creation + assignment + symmetric top  $\Leftrightarrow$  bottom die faces

```
- set die to bottom  
- Create the TSV matrix  
- Assign TSV + backBump for bottom die  
- save TSV & back Bump for bottom die connection  
- set die to top  
- create front Bump of top die, from bottom die file
```

- Then, use semi-automated FP commands for :
  - PG connections between TSV and flip-chip bumps,
  - $\mu$ - Buffer cell placement,
  - PG routing within the TSV matrix,
  - ESDs, etc ...



# 3D-IC : Power & IR drop analysis

- Using Encounter Power System (EPS tool)
  - Currently using ERA (Early Rail Analysis) mode
  - No sign-off mode yet, because sign-off library views are missing
- Power analysis
  - Top chip power consumption is in the **1-2 Watt** range, depending on target frequency and activity ratio.
- 3D IR-Drop analysis (of the same die)
  - **Bottom Die**  
=> supplied from the Flip-Chip IO ring + central matrix power supplies
    - 0.02 mV max IR drop
  - **Top Die**  
=> supplied from the TSV matrix, through the bottom die
    - 0.2 mV max IR drop



# 3D-IC : Thermal Analysis

## Power Map :

- SoC Die : 2 Watts
- Memory Die : 0.5 Watts

→ Using Encounter Thermal analysis



# Mag3D heaters and thermal sensors

- For thermal behavior analysis in a 3D package environment:
  - 8 heater blocks (ST-Ericsson) to emulate hot spots
  - Each can generate 1 Watt
    - Total ~ 8 Watt
  - Separate supplies by use of dedicated flip chip bumps for rich power profile emulation from application board
  - Thermal sensor for temperature measurement from the application software



Heater

Thermal Sensor

# 3D ANoC : DFT and fault tolerance

## ■ 3D ANoC DFT architecture → test individual TSVs

- Based on JTAG protocol IEEE 1149.1
- One DFT test wrapper + one TAP per die
- JTAG protocol propagation in the 3D stack

## ■ TSV fault tolerance

- Add 15% spare TSV per ANoC serial link
- ⇒ *Test & repair JTAG based architecture*
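The test & repair idea can be sketched in software: after the JTAG test has identified faulty TSVs, the signals of a serial link are remapped onto the remaining good ones. The sketch below only illustrates the principle. The shift-based remapping, the rounding of the 15% spare budget and all names are assumptions, not the Mag3D implementation.

```python
# Illustrative spare-TSV repair for one ANoC serial link.
# Assumptions: spares = ceil(15% of signal TSVs); repair shifts signals
# onto the next good physical TSV, in order.

def spare_count(signal_tsvs: int, spare_percent: int = 15) -> int:
    """Number of spare TSVs, rounded up, using integer arithmetic."""
    return (signal_tsvs * spare_percent + 99) // 100


def repair_map(signal_tsvs: int, faulty: set, spare_percent: int = 15):
    """Map each logical signal to a physical TSV index, skipping faulty TSVs.

    Returns a list where entry i is the physical TSV carrying signal i,
    or None when there are more faults than spares.
    """
    total = signal_tsvs + spare_count(signal_tsvs, spare_percent)
    good = [tsv for tsv in range(total) if tsv not in faulty]
    if len(good) < signal_tsvs:
        return None  # link cannot be repaired
    return good[:signal_tsvs]


if __name__ == "__main__":
    # Hypothetical link with 20 signal TSVs, two of them found faulty.
    print("spares:", spare_count(20))
    print("mapping:", repair_map(20, faulty={2, 5}))
    print("unrepairable:", repair_map(20, faulty={0, 1, 2, 3}))
```

In hardware, such a mapping would typically be held in a JTAG-loadable configuration register driving multiplexers on both sides of the link.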



# Wide IO Test Architecture

**WideIO if. with Memory**

## Specific design for WideIO Testability

- Use of IEEE 1500 Test Controller
- Use of OCC (On Chip Clock controller)
- 5 different test mode features

### WideIO memory test ?

- ➔ *Memory is delivered tested by the DRAM foundry,*
- ➔ *but through dedicated pads,*
- ➔ *not through its WideIO matrix signals ...*



| Test mode       | Test feature & coverage                                                                                    |
|-----------------|------------------------------------------------------------------------------------------------------------|
| Boundary Scan   | To test TSV connections between die & memory                                                               |
| Direct Access   | To generate direct ( <i>but partial</i> ) memory accesses from die (used for debug purpose mainly)         |
| Memory BIST     | Memory BIST, included in the die (DENALI controller), to test the whole memory, using the WideIO interface |
| Stuck-at        | Standard DFT of the WideIO memory Controller                                                               |
| PLL test & bist | To test the specific memory controller PLL                                                                 |



# Some Other Design Challenges

- 3D Design Tools
  - 3D-stack and package co-design
  - 3D System Level partitioning and early floorplanning analysis
  - More automation for final verification (DRC, LVS)
- 3D Testability
  - Ongoing standardization efforts
    - IEEE WG 3D Test (see <http://grouper.ieee.org/groups/3Dtest/>)
  - Optimize the overall 3D DFT architecture and 3D ATPG algorithm
  - Get more data on TSV defect & yield analysis
- 3D Analysis and Optimization
  - *Power Delivery Networks, for IRdrop and Thermal constraints*
  - *Thermal characterization & optimization*

# Conclusion & perspective

- Wide IO: **Memory-on-processor**
  - In mobile computing, off-package memory interfaces have reached their limit above ~10GByte/s
  - 3D stacking technology enables a power efficiency breakthrough in memory interconnect
  - Integrated and validated in a real 3D prototype
- Asynchronous NoC: **Processor-on-processor**
  - Template based design offering efficient communication infrastructure
  - Asynchronous logic tolerates timing deviations caused by unknown 3D TSV and bump characteristics
  - Integrated in a real 3D prototype
- Wide IO + ANoC + Interposer
  - Key technologies for next generation power efficient compute node

# Main publications

- « A Fully-Asynchronous Low-Power Framework for GALS NoC Integration » Yvain Thonnart, Pascal Vivet, Fabien Clermidy, **DATE'2010**
- « 3D Embedded multi-core: Some perspectives », F. Clermidy , F. Darve, D. Dutoit, W. Lafi, P. Vivet, **DATE'2011**
- « 3D Technologies : Some Perspectives for Memory », D. Dutoit, F. Clermidy, P. Vivet, CODESS, **ESWEEK 2011**
- « Physical Implementation of an Asynchronous 3D-NoC Router using Serial Vertical Links » - Florian Darve, Abbas Sheibanyrad, Pascal Vivet and Frédéric Petrot, **ISVLSI'2011**
- « 3D NoC Using Through Silicon Via: an Asynchronous Implementation »  
Pascal Vivet, Denis Dutoit, Yvain Thonnart and Fabien Clermidy, **VLSI-SOC'2011**
- « A Three-Layers 3D-IC Stack including Wide-IO and 3D NoC – Practical Design Perspective », P. Vivet, V. Guerin, Presentation at the 3D Architecture for Semiconductor Integration and Packaging, **2011 RTI 3D ASIP**, San Francisco, USA, Dec 2011.
- WideIO JEDEC standard, see <http://www.jedec.org/>

# Many Thanks ...

- To our partners in this project
  - STMicroelectronics, ST-Ericsson, CADENCE



- To LISAN LETI Design Team
  - D. Dutoit, F. Clermidy, Y. Thonnart, C. Bernard, F. Darve, T. Khandelwal,
- Work partly funded by the following European programs :
  - **COCOA** (*Chip-On-Chip technology to Open new Applications*)
  - **3DIM3** (*3D-TSV Integration for Multimedia and Mobile applications*)
  - **PRO3D** (*Programming for Future 3D Architecture with Many Cores*)



# leti

LABORATOIRE D'ÉLECTRONIQUE  
ET DE TECHNOLOGIES  
DE L'INFORMATION

CEA-Leti  
MINATEC Campus, 17 rue des Martyrs  
38054 GRENOBLE Cedex 9  
Tel. +33 4 38 78 36 25

[www.leti.fr](http://www.leti.fr)



Thank you for  
your attention

