

# Thermal Management of High Performance Microprocessors in Burn-in Environment

Arman Vassighi  
ECE Department  
Univ.of Waterloo  
Waterloo,Canada

Oleg Semenov  
ECE Department  
Univ.of Waterloo  
Waterloo,Canada

Manoj Sachdev  
ECE Department  
Univ.of Waterloo  
Waterloo,Canada

Ali Keshavarzi  
Research Labs  
Intel Corp.  
Hillsboro, OR

## Abstract

*In deep sub-micron technologies, increased standby current in high performance processors results in increased junction temperature. The elevated temperature, in turn, further increases the standby current, forming a positive feedback loop. If the temperature is not controlled, this feedback may lead to thermal runaway. In this paper we investigate the thermal management of high performance chips in the burn-in environment.*

## 1: Introduction

Increasing the performance of CMOS chips has been one of the primary reasons for technology scaling. With scaling, the supply voltage is also reduced in order to lower active power consumption and maintain device reliability. However, performance considerations require that the transistor threshold voltage be reduced proportionately as well. As a result of the reduced threshold voltage, leakage current increases exponentially, which is one of the major concerns in deep submicron technologies. Leakage management becomes even more complex in the burn-in environment, where an IC is subjected to voltage and temperature stresses [1]. The leakage current raises the junction temperature, and the increased junction temperature further escalates the leakage current because the sub-threshold slope flattens at higher temperatures.

IC manufacturers have traditionally used burn-in procedures to remove weak devices from the population before shipping them to customers. Weak devices often fail in the field, resulting in early-life failures or infant mortality (Figure 1). The stresses applied during burn-in cause weak devices to degrade while the normal device population remains unaffected. Standard test programs can then detect these degraded devices, which exhibit abnormal voltage or current levels or functional failures. In this paper we study the behavior of CMOS circuits in the burn-in environment as a function of temperature and supply voltage. We show that, for a given burn-in voltage, there is a burn-in temperature above which thermal runaway occurs, and we determine the maximum reliable burn-in temperature for that stress voltage.



**Figure 1.** Bathtub curve showing the failure rate of electronic devices over their lifetime.

## 2: Junction Temperature Estimation Procedure

Historically, the burn-in environment temperature and voltage have been $125^{\circ}C$ and $V_{DD} + 30\%$ to $V_{DD} + 40\%$, respectively. At that time, leakage power was a non-issue. However, in sub-$0.18\mu m$ technologies, leakage power is significantly high under burn-in conditions. Figure 2 illustrates the unabated increase in leakage current for 130 nm CMOS technology under burn-in conditions: leakage rises with both temperature and voltage. As can be seen from the graph, leakage increases by more than an order of magnitude going from nominal to burn-in conditions.

The junction temperature of an IC is defined as the temperature of the silicon substrate; it is a crucial parameter in reliability-prediction procedures and burn-in testing. The junction temperature, $T_j$, is given by [2]:

$$T_j = T_a + P \times \theta_{ja} \quad (1)$$

where  $T_a$  is the ambient or set point temperature,  $P$  is the device total power, and  $\theta_{ja}$  is the junction-to-ambient thermal resistance. The power dissipation can be subdivided into dynamic and leakage components, as

$$P = P_{dynamic} + P_{leakage} \quad (2)$$

$$P_{leakage} = I_{leak} \times V_{dd} \quad (3)$$

$$P_{dynamic} = C \times V_{dd}^2 \times f_{toggle} \quad (4)$$
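Once the leakage current is known, Eqs. (1) to (4) can be evaluated in a single pass, with $C$ the total IC capacitance and $f_{toggle}$ the node toggling frequency. The minimal sketch below does only that; the function names and all numeric inputs (40 A leakage, 1.6 V stress supply, 50 nF total capacitance, 2 MHz toggle rate, 0.45 °C/W) are hypothetical illustration values, not data from this paper. Because $I_{leak}$ itself depends on $T_j$, a single pass is only the first step of the iterative procedure described in this section.

```python
def total_power(i_leak: float, vdd: float, c_total: float, f_toggle: float) -> float:
    """Total power in W from Eqs. (2)-(4)."""
    p_leakage = i_leak * vdd                 # Eq. (3)
    p_dynamic = c_total * vdd**2 * f_toggle  # Eq. (4)
    return p_dynamic + p_leakage             # Eq. (2)


def junction_temperature(t_ambient: float, power: float, theta_ja: float) -> float:
    """Eq. (1): T_j = T_a + P * theta_ja (temperatures in C, theta_ja in C/W)."""
    return t_ambient + power * theta_ja


# Hypothetical inputs for illustration only (not measured data).
p = total_power(i_leak=40.0, vdd=1.6, c_total=50e-9, f_toggle=2e6)
tj = junction_temperature(t_ambient=25.0, power=p, theta_ja=0.45)
print(f"P = {p:.1f} W, Tj = {tj:.1f} C")  # P = 64.3 W, Tj = 53.9 C
```

With these numbers the dynamic term is only about 0.26 W of the 64 W total, which anticipates why dynamic power can be neglected at burn-in toggle rates.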



**Figure 2.** Off current of an NMOS transistor as a function of voltage and temperature for 0.35 $\mu$m CMOS technology (normalized to the off current at 0.6 V and 25°C).

In equation 4, $C$ is the total IC capacitance and $f_{toggle}$ is the frequency at which nodes toggle during burn-in, which can be expressed as:

$$f_{toggle} = \frac{I_{on}}{C_{gate} \times V_{dd} \times N} \quad (5)$$

where $C_{gate}$ is the gate capacitance of a single gate and $N$ is the number of logic stages in the critical path. To evaluate the junction temperature, $T_j$, under different environmental conditions, we developed a program whose flow chart is depicted in Fig. 3. Starting from an initial temperature, the program reads the current of a single transistor at that temperature. Based on the circuit implementation and architecture, the total power is computed using equations 2 to 4, and the junction temperature is updated using equation 1. This procedure is repeated for the given supply voltage and process technology, and the convergence of the resulting temperature is tested at each step [3]. After several iterations, the junction temperature either converges to a steady-state value or keeps increasing, driving the chip into thermal runaway.
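The iteration of Fig. 3 can be sketched as a fixed-point loop. Since the transistor data and calibration behind the program are not given here, the leakage model below is an assumption: a chip leakage current `i_leak_25c` at 25 °C that grows by a factor of e every 30 °C. The function names, the 250 °C cutoff used to declare runaway, and the numeric inputs are all hypothetical.

```python
import math
from typing import Optional


def leakage_current(t_j: float, i_leak_25c: float, t_slope: float = 30.0) -> float:
    """HYPOTHETICAL leakage model: grows by a factor of e every t_slope degrees C above 25 C."""
    return i_leak_25c * math.exp((t_j - 25.0) / t_slope)


def solve_junction_temperature(t_ambient: float, theta_ja: float, vdd: float,
                               i_leak_25c: float, p_dynamic: float = 0.0,
                               tol: float = 1e-3, max_iter: int = 1000,
                               t_limit: float = 250.0) -> Optional[float]:
    """Iterate Eqs. (1)-(3) until T_j settles, in the spirit of Fig. 3.

    Returns the converged T_j (in C), or None when T_j passes t_limit or does
    not settle within max_iter steps; both cases are treated as thermal runaway.
    """
    t_j = t_ambient  # start the loop at the oven set-point temperature
    for _ in range(max_iter):
        power = p_dynamic + leakage_current(t_j, i_leak_25c) * vdd  # Eqs. (2), (3)
        t_next = t_ambient + power * theta_ja                        # Eq. (1)
        if t_next > t_limit:
            return None      # temperature keeps climbing: runaway
        if abs(t_next - t_j) < tol:
            return t_next    # converged junction temperature
        t_j = t_next
    return None


# Hypothetical stress conditions: 1.6 V supply, 20 A leakage at 25 C, 0.45 C/W oven.
print(solve_junction_temperature(0.0, 0.45, 1.6, 20.0))   # converges near 8 C
print(solve_junction_temperature(40.0, 0.45, 1.6, 20.0))  # None (runaway)
```

Starting from the ambient temperature, the loop climbs toward the lowest steady-state temperature when one exists; when none exists, the temperature keeps rising until the cutoff is reached, mirroring the two outcomes described above.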

A 32-bit microprocessor in 0.13 $\mu$ m dual Vth CMOS technology was used to verify the program. The parameters of this program were calibrated to the experimental data from the microprocessor.

## 3: Burn-in Environment Setup for Thermal Runaway Avoidance

In a burn-in environment, the device operates at a much lower frequency, $f_{toggle}$ (often 0.001 of $f_{max}$, where $f_{max}$ is the nominal operating frequency). Therefore, dynamic power is a negligible portion of the total power. To find the optimum condition for burn-in, simulations were performed with the toggle frequency set to 1/1000 of $f_{max}$ and $V_{dd}$ set to 1.35 times the nominal $V_{dd}$. The ambient temperature was used as a parameter: it was swept from low to high values, and the resulting junction temperature was plotted. Fig. 4 shows the simulation results for the microprocessor model in $0.13\mu m$ technology with an air-cooled burn-in oven. At ambient temperatures of $0^{\circ}C$ and below, the junction temperature converges, while for ambient temperatures above $0^{\circ}C$ it increases monotonically without settling. As the $10^{\circ}C$ and $20^{\circ}C$ curves show, this behavior leads the chip to thermal runaway.



**Figure 3. Procedure for calculating the junction temperature for a given supply voltage.**
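The ambient-temperature sweep can be sketched in a few lines, again under an assumed exponential leakage model (leakage power `p_leak_25c` at 25 °C, growing by a factor of e every 30 °C, dynamic power neglected). The $\theta_{ja}$ values 0.45 and 0.3 °C/W are the liquid-cooled and refrigerated ovens quoted in this section, but the 32 W leakage figure, the 5 °C sweep step, and the function names are hypothetical, so the resulting temperatures are illustrative and do not reproduce Fig. 4.

```python
import math
from typing import Optional


def converges(t_ambient: float, theta_ja: float, p_leak_25c: float,
              t_slope: float = 30.0, max_iter: int = 2000,
              t_limit: float = 250.0) -> bool:
    """True if T_j = T_a + theta_ja * P_leak(T_j) settles under iteration.

    HYPOTHETICAL model: leakage power p_leak_25c at 25 C, growing by a
    factor of e every t_slope degrees C; dynamic power is neglected.
    """
    t_j = t_ambient
    for _ in range(max_iter):
        t_next = t_ambient + theta_ja * p_leak_25c * math.exp((t_j - 25.0) / t_slope)
        if t_next > t_limit:
            return False          # runaway
        if abs(t_next - t_j) < 1e-6:
            return True           # steady state reached
        t_j = t_next
    return False                  # did not settle: treat as runaway


def max_safe_ambient(theta_ja: float, p_leak_25c: float,
                     t_start: int = -40, t_stop: int = 120, step: int = 5) -> Optional[int]:
    """Sweep the ambient temperature upward; return the last value that converges."""
    best = None
    for t_a in range(t_start, t_stop + 1, step):
        if not converges(t_a, theta_ja, p_leak_25c):
            break                 # under this model, every hotter set point also runs away
        best = t_a
    return best


# Hypothetical 32 W of leakage at 25 C; oven resistances quoted in this section.
print(max_safe_ambient(0.45, 32.0))  # 15
print(max_safe_ambient(0.30, 32.0))  # 25
```

Stopping at the first non-converging set point is valid here because the right-hand side of Eq. (1) grows with $T_a$, so once a steady state disappears it stays gone at higher ambient temperatures.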

To avoid thermal runaway in the air-cooled oven, the ambient temperature must be kept at or below $0^{\circ}C$ at all times. Since an air-cooled oven cannot be cooled below room temperature, burn-in (BI) ovens with lower junction-to-ambient thermal resistance must be used instead. Liquid-cooled and refrigerated ovens, with junction-to-ambient thermal resistances of $0.45^{\circ}C/W$ and $0.3^{\circ}C/W$, respectively, are possible solutions for $0.13\mu m$ technology and beyond.

Fig. 5 shows the junction temperature of the same processor in the liquid-cooled oven. Because the junction-to-ambient thermal resistance of this oven is much lower than that of the air-cooled oven, burn-in can be performed at a much higher ambient temperature ($80^{\circ}C$).
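Why a lower $\theta_{ja}$ permits a higher ambient temperature can be made explicit under an assumed leakage model (an illustration, not the model used in this paper). Suppose dynamic power is neglected, as in burn-in, and leakage power grows exponentially with junction temperature, $P_{leakage}(T_j) = P_0\, e^{(T_j - T_0)/T_s}$, where $P_0$ is the leakage power at a reference temperature $T_0$ and $T_s$ is a hypothetical characteristic temperature. Substituting into Eq. (1), the runaway boundary is the point where the loop gain $\theta_{ja}\, dP_{leakage}/dT_j$ reaches 1:

$$
T_j = T_a + \theta_{ja} P_0\, e^{(T_j - T_0)/T_s}
$$

$$
\theta_{ja} \frac{dP_{leakage}}{dT_j}\bigg|_{T_j^*} = \frac{\theta_{ja} P_0}{T_s}\, e^{(T_j^* - T_0)/T_s} = 1
\;\;\Rightarrow\;\;
T_j^* = T_0 + T_s \ln\frac{T_s}{\theta_{ja} P_0}
$$

At this point $\theta_{ja} P_{leakage}(T_j^*) = T_s$, so Eq. (1) gives $T_j^* = T_{a,max} + T_s$, and the highest ambient temperature with a steady state is

$$
T_{a,max} = T_0 - T_s + T_s \ln\frac{T_s}{\theta_{ja} P_0}
$$

Under this model, halving $\theta_{ja}$ raises $T_{a,max}$ by $T_s \ln 2 \approx 0.69\,T_s$, while a leakier part (larger $P_0$) has a lower $T_{a,max}$.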

Processors from a production line exhibit a skewed normal leakage distribution. Leakier processors are more susceptible to thermal runaway because their leakage power dissipation is higher. Since leakier processors are also faster, losing them to thermal runaway is even more costly than losing processors with average leakage. Therefore, the burn-in procedure must be tailored to processor leakage: processors are categorized by leakage, and the burn-in procedure for each category is then optimized to minimize the probability of thermal runaway. Fig. 6 repeats the liquid-cooled oven simulation of Fig. 5 for a processor whose transistors are 30% leakier than nominal. In this case, to avoid thermal runaway, the ambient temperature must be reduced significantly (from $80^{\circ}C$ to $30^{\circ}C$).



**Figure 4. Thermal runaway conditions in static burn-in for an 80 W processor in $0.13\mu m$ CMOS technology.**
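The leakage-based categorization can be expressed as a simple lookup from a measured leakage ratio (transistor leakage relative to nominal) to an oven set point. The only set points available in the text are the liquid-cooled $0.13\mu m$ results ($80^{\circ}C$ for nominal leakage, $30^{\circ}C$ for 30% leakier transistors). The bin edges, the conservative rule of giving each bin the set point of its leakiest member, and the function name `burn_in_setpoint` are hypothetical.

```python
from bisect import bisect_left
from typing import Optional

# (upper bound on leakage ratio, ambient set point in C), sorted by ratio.
# Set points: liquid-cooled 0.13 um results from the text. Assigning each bin
# the set point of its leakiest member is a conservative ASSUMPTION.
LEAKAGE_BINS = [(1.0, 80.0), (1.3, 30.0)]


def burn_in_setpoint(leakage_ratio: float) -> Optional[float]:
    """Return the oven set point for a part, or None if no bin covers it."""
    edges = [edge for edge, _ in LEAKAGE_BINS]
    i = bisect_left(edges, leakage_ratio)  # first bin whose edge >= ratio
    if i == len(LEAKAGE_BINS):
        return None  # leakier than any characterized bin: needs its own setting
    return LEAKAGE_BINS[i][1]


for ratio in (0.9, 1.15, 1.3, 1.5):
    print(ratio, burn_in_setpoint(ratio))
```

A finer set of bins, each characterized with the junction temperature procedure above, would let moderately leaky parts be burned in closer to their own limit instead of at the most conservative set point.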

## 4: Burn-in Condition Trends with Scaling

As technology scales, the static power of the chip increases due to increased gate leakage and subthreshold leakage. This further raises the junction temperature under burn-in conditions. In $0.10\mu m$ technology, gate leakage power is 30% to 40% of the chip's total leakage. Although this component is not temperature sensitive, it depends strongly on the voltage across the gate oxide; under the burn-in stress voltage it therefore increases significantly and contributes to the rise in junction temperature. Nevertheless, it is the subthreshold current, which depends on both voltage and temperature, that drives the chip into thermal runaway. To study burn-in conditions for $0.10\mu m$ technology, we used the same tool developed for $0.13\mu m$ technology. The transistor widths and lengths were reduced by the appropriate scaling factor, and the number of transistors was doubled to keep the chip area the same as in $0.13\mu m$ technology. The supply voltage was reduced to that of $0.10\mu m$ technology. Simulations assumed the same chip architecture as in $0.13\mu m$ technology. Fig. 7 shows that, for a processor in $0.10\mu m$ technology in the air-cooled oven, any ambient temperature above $-20^{\circ}C$ leads the chip to thermal runaway. Fig. 8 shows that the liquid-cooled oven allows the same processor to be burned in at a higher ambient temperature ($75^\circ C$).



**Figure 5. Junction temperature of the processor in $0.13\mu m$ CMOS technology in the liquid cooled oven.**
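The different roles of the two leakage components can be illustrated with an assumed model (hypothetical, not the tool used in this paper): a temperature-independent gate leakage power $P_{gate}$ plus a subthreshold leakage power $P_{sub,0}\, e^{(T_j - T_0)/T_s}$, where $P_{sub,0}$ is the subthreshold leakage power at a reference temperature $T_0$, $T_s$ is a characteristic temperature, and dynamic power is neglected:

$$
T_j = T_a + \theta_{ja}\left(P_{gate} + P_{sub,0}\, e^{(T_j - T_0)/T_s}\right)
$$

Only the subthreshold term has a temperature derivative, so the runaway boundary $\theta_{ja}\, dP/dT_j = 1$ still gives $T_j^* = T_0 + T_s \ln\left(T_s / \theta_{ja} P_{sub,0}\right)$, while at that point Eq. (1) reads $T_j^* = T_a + \theta_{ja} P_{gate} + T_s$. Hence

$$
T_{a,max} = T_0 - T_s + T_s \ln\frac{T_s}{\theta_{ja} P_{sub,0}} - \theta_{ja} P_{gate}
$$

In this sketch gate leakage does not set the runaway condition but lowers the permissible ambient temperature by $\theta_{ja} P_{gate}$; for a hypothetical $P_{gate} = 30$ W, that is 13.5 °C at 0.45 °C/W and 9 °C at 0.3 °C/W.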

## 5: Conclusion

A thermo-electrical tool was developed for junction temperature estimation. Using this tool and industrial data, the behavior of high performance microprocessors under burn-in conditions was studied. It was shown that there is a strong possibility of thermal runaway due to the positive feedback between temperature and leakage power dissipation. It was also shown that leakage power has a large impact on junction temperature and, consequently, on the choice of ambient temperature; chips with different leakage must be burned in under different conditions to avoid thermal runaway. Finally, burn-in ovens with smaller junction-to-ambient thermal resistance are needed for future technologies.

## References

- [1] R.-P. Vollertsen, "Burn-In," IEEE International Integrated Reliability Workshop Final Report, 1999, pp. 167-173.
- [2] P. Tadayon, "Thermal Challenges During Microprocessor Testing," Intel Technology Journal, Q3 2000.
- [3] K. Kanda, K. Nose, H. Kawaguchi, and T. Sakurai, "Design Impact of Positive Temperature Dependence on Drain Current in Sub-1-V CMOS VLSIs," IEEE Journal of Solid-State Circuits, vol. 36, no. 10, Oct. 2001.



**Figure 6.** Junction temperature of the processor in  $0.13\mu\text{m}$  CMOS technology where transistors are 30% leakier than nominal (liquid cooled oven).



**Figure 7.** Junction temperature of the processor in  $0.10\mu\text{m}$  CMOS technology in the air cooled oven.



**Figure 8.** Junction temperature of the processor in  $0.10\mu\text{m}$  CMOS technology in the liquid cooled oven.