

# Characterization of System Level Single Event Upset (SEU) Responses using SEU Data, Classical Reliability Models, and Space Environment Data



Melanie Berg<sup>1</sup>, Kenneth LaBel<sup>2</sup>, Michael Campola<sup>2</sup>, Michael Xapsos<sup>2</sup>

[Melanie.D.Berg@NASA.gov](mailto:Melanie.D.Berg@NASA.gov)

1. AS&D in support of NASA/GSFC

2. NASA/GSFC



# Acronyms

- Combinatorial logic (CL)
- Commercial off the shelf (COTS)
- Complementary metal-oxide semiconductor (CMOS)
- Device under test (DUT)
- Edge-triggered flip-flops (DFFs)
- Error rate ( $\lambda$ )
- Error rate per bit ( $\lambda_{\text{bit}}$ )
- Error rate per system ( $\lambda_{\text{system}}$ )
- Field programmable gate array (FPGA)
- Global triple modular redundancy (GTMR)
- Hardware description language (HDL)
- Input – output (I/O)
- Intellectual Property (IP)
- Linear energy transfer (LET)
- Mean fluence to failure (MFTF)
- Mean time to failure (MTTF)
- Number of used bits (#UsedBits)
- Operational frequency (fs)
- Personal Computer (PC)
- Probability of configuration upsets ( $P_{\text{configuration}}$ )
- Probability of Functional Logic upsets ( $P_{\text{functionalLogic}}$ )
- Probability of single event functional interrupt ( $P_{\text{SEFI}}$ )
- Probability of system failure ( $P_{\text{system}}$ )
- Processor (PC)
- Radiation Effects and Analysis Group (REAG)
- Reliability over time ( $R(t)$ )
- Reliability over fluence ( $R(\Phi)$ )
- Single event effect (SEE)
- Single event functional interrupt (SEFI)
- Single event latch-up (SEL)
- Single event transient (SET)
- Single event upset (SEU)
- Single event upset cross-section ( $\sigma_{\text{SEU}}$ )
- System on a chip (SoC)
- Windowed Shift Register (WSR)
- Xilinx Virtex 5 field programmable gate array (V5)
- Xilinx Virtex 5 field programmable gate array radiation hardened (V5QV)

# Problem Statement

- Conventional methods of applying single event upset (SEU) data to complex systems need improvement.
- The problem boils down to extrapolation and application of SEU data to characterize system performance in radiation environments.



# Abstract – Impact to Community

- We are investigating the application of **classical reliability** performance metrics combined with standard **SEU analysis** data.
- We expect to relate SEU behavior to system performance requirements...
  - Our proposed methodology will provide better prediction of SEU responses in harsh radiation environments with confidence metrics.



# SEU System Analysis Is Not Simple Algebra



- When a system is targeted for space, single event effect (SEE) data are obtained for all devices that make up that system.
- Combining component data is not simple addition.
- Co-dependent susceptibilities exist and must be handled accordingly.



*Proposed method should target critical missions subjected to ionizing particles.*

# NEPP - Small Mission Efforts

*Proposed method fits into new approaches for small mission assurance.*





# Scope of Presentation

- **Full system analysis requires combining SEU data for a variety of devices across a variety of boxes/mediums.**
- **The scope of this presentation is for System-type SEU data analysis for a single device.**
- **System on a Chip (SoC) field programmable gate array (FPGA) analysis.**
- **Future presentations will expand to address full systems.**



# Background (1)

## FPGA SEU Susceptibility

SEU Cross Section ( $\sigma_{SEU}$ )

- $\sigma_{SEU}$ s (per category) are calculated from SEU test and analysis.
- $\sigma_{SEU}$ s are calculated with particles that vary in linear energy transfer (LET).
- FPGA architectures vary and so do their SEU responses.
- Most believe the dominant  $\sigma_{SEU}$ s are per bit (configuration or flip-flops (DFFs)). However, global routes are significant (more than DFFs).

$$P(fs)_{\text{system}} \propto P_{\text{configuration}} + P(fs)_{\text{functionalLogic}} + P_{\text{SEFI}}$$

*Figure annotations (contributions to the design  $\sigma_{SEU}$ ):*

- Configuration  $\sigma_{SEU}$ s are measured by bit.
- Sequential and combinatorial logic (CL) in the data path: are  $\sigma_{SEU}$ s measured by bit?
- Global routes and hidden logic.

For a system, should  $\sigma_{SEU}$ s be measured by bit?

## Background (2)

### Conventional Conversion of SEU Cross-Sections To Error Rates for Complex Systems: First Step

$$\sigma_{SEU} = \#errors/\text{fluence}$$

$$\lambda_{\text{system}} = \#errors/\text{time}$$

LET: Linear energy transfer

- Perform SEU accelerated radiation testing across ions with different linear energy transfers (LETs) to calculate  $\sigma_{SEU}$ s per LET.
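The per-LET calculation can be sketched in a few lines of Python. This is only an illustration of $\sigma_{SEU} = \#errors/\text{fluence}$; the function name `cross_section` and all run values below are hypothetical and are not measured data.

```python
def cross_section(num_errors: int, fluence: float) -> float:
    """SEU cross-section (cm^2): observed errors divided by fluence (particles/cm^2)."""
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    return num_errors / fluence


# Hypothetical test runs, NOT measured data:
# LET (MeV*cm^2/mg) -> (observed errors, fluence in particles/cm^2)
runs = {
    1.8: (2, 1.0e7),
    3.6: (10, 1.0e7),
    40.0: (50, 1.0e6),
}

# One cross-section per LET, as described in the bullet above.
sigma_per_let = {let: cross_section(n, phi) for let, (n, phi) in runs.items()}

for let, sigma in sorted(sigma_per_let.items()):
    print(f"LET {let:5.1f}: sigma_SEU = {sigma:.2e} cm^2")
```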



# Windowed Shift Register (WSR) Test Structures Are Used To Obtain SEU Data



- Shift registers are typical test structures (mapped into device under test (DUT)) used for accelerated radiation testing.
- Purpose is to analyze DFF and CL susceptibility.

# Windowed Shift Register (WSR) Microsemi – RTG4 Heavy Ion Data at 100MHz



# Background (3)

## Conventional Conversion of SEU Cross-Sections To Error Rates for Complex Systems: Next Step

- **Bottom-Up approach** (transistor level):
  - Given  $\sigma_{\text{SEU}}$  (per bit) use an error rate calculator (such as CREME96) to obtain an error rate per bit ( $\lambda_{\text{bit}}$ ).
  - Multiply  $\lambda_{\text{bit}}$  by the number of used memory bits (#UsedBits) in the target design to attain a system error rate ( $\lambda_{\text{system}}$ ). The used bits include configuration bits and DFFs.
- **Top-Down approach** (system level):
  - Given  $\sigma_{\text{SEU}}$  (per system) use an error rate calculator (such as CREME96) to obtain an error rate per system ( $\lambda_{\text{system}}$ ).



# Technical Problems with Current Methods of Error Rate Calculation

- For submission to CREME96,  $\sigma_{\text{SEU}}$  data (in log-linear form) are fitted to a Weibull curve.
  - The two main parameters for curve fitting are a shape factor and a slope factor.
  - During the curve fitting process, a large amount of error can be introduced.
  - Consequently, it is possible for resultant error rates (for the same design) to vary by decades.
- Because of the error rate calculation process,  $\sigma_{\text{SEU}}$  data are blended together and it is nearly impossible to home in on the problem spots.  
**This can become important for mitigation insertion.**
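A rough sketch of why the fit matters, assuming the four-parameter Weibull form commonly used for cross-section curves (onset LET, width, shape exponent, saturation cross-section). The parameter names and every numeric value here are assumptions made for illustration; they are not fits to any data in this presentation.

```python
import math


def weibull_sigma(let: float, onset: float, width: float,
                  shape: float, sigma_sat: float) -> float:
    """Weibull cross-section curve: zero at or below onset, rising toward sigma_sat."""
    if let <= onset:
        return 0.0
    return sigma_sat * (1.0 - math.exp(-(((let - onset) / width) ** shape)))


# Two hypothetical fits that share onset, width, and saturation
# and differ only in the shape exponent.
fit_a = dict(onset=0.5, width=20.0, shape=1.0, sigma_sat=1.0e-6)
fit_b = dict(onset=0.5, width=20.0, shape=3.0, sigma_sat=1.0e-6)

low_let = 2.0  # MeV*cm^2/mg, a point near the onset where particle flux is high
ratio = weibull_sigma(low_let, **fit_a) / weibull_sigma(low_let, **fit_b)

# Near the onset the two curves differ by more than two decades.
print(f"sigma ratio at LET={low_let}: {ratio:.0f}x")
```

Because low-LET particles dominate most space environments, a disagreement near the onset of the curve can carry through to the calculated error rate.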



# Technical Problems with Bottom-Up Analysis Method (1)



- Multiplying each bit within a design by  $\lambda_{bit}$  is not an efficient method of system error rate prediction.
  - This works well for memory structures, but complex systems do not operate or respond like memories.
  - If an SEU affects a bit, and the bit is either inactive, disabled, or masked, a system malfunction might not occur.
    - Using the same multiplication factor across DFFs will produce extreme over-estimates.
    - To date, there is no accurate method to predict DFF activity for complex systems.
    - Fault injection or simulation will not determine frequency of activity.



$$\lambda_{system} < \lambda_{bit} \times \#UsedBits$$
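The inequality can be illustrated with a toy calculation. All values are hypothetical, and `propagating_fraction` is an invented stand-in for the unknown share of upsets that reach a system output; as noted above, no accurate method exists to predict it.

```python
def bottom_up_rate(lambda_bit: float, used_bits: int) -> float:
    """Bottom-up estimate: every used bit contributes the full per-bit error rate."""
    return lambda_bit * used_bits


lambda_bit = 1.0e-8   # hypothetical errors per bit-day
used_bits = 200_000   # hypothetical count of configuration bits plus DFFs

upper_bound = bottom_up_rate(lambda_bit, used_bits)

# Hypothetical: only a fraction of upsets hit bits that are active and unmasked.
propagating_fraction = 0.1
observed_example = upper_bound * propagating_fraction

print(f"bottom-up estimate : {upper_bound:.2e} errors/day")
print(f"with masking (toy) : {observed_example:.2e} errors/day")
```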

# Technical Problems with Bottom-Up Analysis Method (2)

- There are a variety of components that are susceptible to SEUs (clocks, resets, combinatorial logic, flip-flops (DFFs), etc.).
  - Various component susceptibilities are not accurately characterized at a per bit level.
  - Design topology makes a significant difference in susceptibility and is not characterized in error rate calculators (e.g., CREME96).



Error rates calculated at the transistor-bit level are estimated at too fine a granularity for proper extrapolation to complex systems.

# What If Tests Do Not Investigate Test Structures Across A Variety of Parameters?



*Data might not reflect potential SEU responses!*



# Understand Goal of SEU Testing and Data Application



- Is the goal of SEU testing to analyze test circuits?
  - Efficacy of DFF mitigation.
  - Single event transient (SET) propagation strength.
  - SET width.
  - General test circuit evaluation.
- Or... is the goal of testing to obtain data for eventual system characterization?
- System characterization requires more than conventional test circuit analysis.
  - Test circuits are too simple.
  - Test circuits often do not follow formal design rules (e.g., synchronous, CMOS balancing, or place and route).
  - Design topology affects SEU response.
- Complex system test structures are important for SEU system characterization.
  - Top-down approach.
  - Multiple complex test structures and trend evaluation are essential.

# Let's Not Reinvent The Wheel... A Proven Solution Can Be Found in Classical Reliability Analysis

- Classical reliability models have been used as a standard metric for complex system performance.
- The analysis provides a more in-depth interpretation of system behavior over time by using system-level MTTF data for system performance metrics.



*Theory is already developed, proven, and should be in our hands!*

$$R(t)=e^{-t/MTTF} \text{ or } R(t)=e^{-\lambda t}$$
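As a quick worked illustration of the exponential model (general arithmetic, not mission data): reliability at $t = MTTF$ is only about 37%, and holding a 99.9% reliability requires an operating window that is roughly one thousandth of the MTTF.

$$R(MTTF) = e^{-MTTF/MTTF} = e^{-1} \approx 0.368$$

$$R(t) \ge 0.999 \;\Rightarrow\; t \le -MTTF \cdot \ln(0.999) \approx 1.0\times 10^{-3}\, MTTF$$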

# Weibull Failure Rate ( $\lambda(T)$ ) Bathtub Curve





# Mapping Classical Reliability Models from The Time Domain To The Fluence Domain

- The exponential model that relates reliability to MTTF assumes that during **useful-lifetime**:
  - Failures are independent.
  - Error rate is constant.
  - MTTF =  $1/\lambda$ .
$$R(t)=e^{-t/MTTF} \text{ or } R(t)=e^{-\lambda t}$$

*Weibull slope = 1... exponential.*
- For a given LET (across fluence):
  - SEUs are independent.
  - $\sigma_{SEU}$  is constant.
  - MFTF =  $1/\sigma_{SEU}$ .

*Parallel between time and fluence.*

$$\sigma_{SEU} = \#errors/fluence$$
$$\lambda_{system} = \#errors/time$$
- Hence, mapping from the time domain to the fluence domain (per LET) is straightforward:
  - $t \Leftrightarrow \Phi$
  - MTTF  $\Leftrightarrow$  MFTF
  - $\lambda \Leftrightarrow \sigma_{SEU}$

$$R(t)=e^{-t/MTTF} \Leftrightarrow R(\Phi)=e^{-\Phi/MFTF}$$
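A minimal Python sketch of the fluence-domain model, using the decaying exponent that mirrors $R(t)=e^{-t/MTTF}$. The function names are illustrative, and the $\sigma_{SEU}$ value is hypothetical.

```python
import math


def mftf_from_sigma(sigma_seu: float) -> float:
    """MFTF (particles/cm^2) is the reciprocal of a constant cross-section (cm^2)."""
    if sigma_seu <= 0:
        raise ValueError("sigma_seu must be positive")
    return 1.0 / sigma_seu


def reliability_over_fluence(fluence: float, mftf: float) -> float:
    """R(Phi) = exp(-Phi / MFTF) for a single LET with constant sigma_SEU."""
    if fluence < 0 or mftf <= 0:
        raise ValueError("fluence must be >= 0 and mftf must be > 0")
    return math.exp(-fluence / mftf)


mftf = mftf_from_sigma(1.0e-8)  # hypothetical sigma_SEU in cm^2
for phi in (0.0, 1.0e4, 1.0e6, 1.0e8):
    print(f"Phi = {phi:8.1e}  R = {reliability_over_fluence(phi, mftf):.6f}")
```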

# Creating Reliability Curves from $\sigma_{SEU}$ s

- $\sigma_{SEU}$  data are system level.
- A histogram of environment data is created. Bins are determined by LET values at each  $\sigma_{SEU}$  data point.
- For each data point at a given LET, a combination of binned environment data and upper-bound  $\sigma_{SEU}$  data are used to determine system reliability performance.
- A piecemeal approach is performed per data point to determine the weakest points of system performance.
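The binning step might look like the sketch below, assuming the environment is available as (LET, fluence) samples for the mission window. The sample values are invented so that the totals echo the bins used in the example later in this presentation; the helper name `bin_fluence_by_let` is hypothetical.

```python
from bisect import bisect_left


def bin_fluence_by_let(samples, upper_edges):
    """Sum fluence into bins whose upper edges are the LETs of the SEU data points.

    A sample goes into the first bin whose upper edge is >= its LET, so each
    bin can be bounded by the sigma_SEU measured at that upper edge.
    """
    totals = [0.0] * len(upper_edges)
    for let, fluence in samples:
        idx = bisect_left(upper_edges, let)
        if idx == len(upper_edges):
            raise ValueError(f"LET {let} is above the highest tested LET")
        totals[idx] += fluence
    return totals


# Upper bin edges in MeV*cm^2/mg (sorted ascending).
edges = [0.07, 0.14, 1.8, 3.6, 40.0]

# Invented samples: (LET, particles/cm^2 within the window of interest).
samples = [(0.01, 2000.0), (0.05, 1000.0), (0.1, 11.0),
           (1.0, 9.0), (2.5, 0.23), (20.0, 0.07)]

totals = bin_fluence_by_let(samples, edges)
for edge, total in zip(edges, totals):
    print(f"LET <= {edge:5.2f}: {total:8.2f} particles/cm^2")
```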



M. A. Xapsos, IEEE NSREC Short Course, Ponte Vedra Beach, FL, 2008.

# Example of Proposed Methodology

## Application



- Mission requirements:
  - The FPGA shall contain an embedded microprocessor.
  - Selection shall be made between a Xilinx V5QV (very expensive device) or a Xilinx V5 with embedded PowerPC (relatively cheap device).
  - FPGA operation shall have a reliability of 3-nines (99.9%) within a 10-minute window at Geosynchronous Equatorial Orbit (GEO).
- Proposed methodology:
  - Create a histogram of particle flux versus LET for a 10-minute window of time for your target environment.
  - Calculate MFTF per LET (obtain SEU data).
  - Graph  $R(\Phi)=e^{-\Phi/MFTF}$  for a variety of LET values and their associated MFTFs.
  - For selected ranges of LETs, use an upper bound of particle flux (number of particles/cm<sup>2</sup>•10-minutes), to determine if the system will meet the mission's reliability requirements.

# Environment Data: Flux versus LET Histogram for A 10-minute Window



*Geosynchronous Equatorial Orbit (GEO) 100-mils shielding*



# MFTF versus LET for the Xilinx V5 Embedded PowerPC Core and the Xilinx V5QV MicroBlaze Soft Processor Core

- **V5QV:** no system errors were observed below  $\text{LET}=1.8\text{MeV}\cdot\text{cm}^2/\text{mg}$ . Total fluence  $> 5.0\times 10^8$  particles/cm $^2$ .
- **PowerPC:**
  - No system errors were observed below  $\text{LET}=0.07\text{MeV}\cdot\text{cm}^2/\text{mg}$  with total fluence =  $1.0\times 10^8$  particles/cm $^2$ .
  - Hence, at  $\text{LET}=0.07\text{MeV}\cdot\text{cm}^2/\text{mg}$ , we will assume a conservative  $\text{MFTF} = 1.0\times 10^8$  particles/cm $^2$  (corresponding to an upper-bound  $\sigma_{SEU}$ ).
  - More tests would increase the MFTF for this bin.
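One way to read the bound above: when no errors are observed, the total test fluence itself is used as the MFTF, which is equivalent to counting at most one error. The sketch below encodes that reading; the function name is hypothetical, and the "at least one error" convention is an assumption drawn from the PowerPC bullet rather than a stated rule.

```python
def mftf_bound(total_fluence: float, num_errors: int) -> float:
    """MFTF estimate (particles/cm^2); zero observed errors is treated as one."""
    if total_fluence <= 0 or num_errors < 0:
        raise ValueError("total_fluence must be > 0 and num_errors must be >= 0")
    return total_fluence / max(num_errors, 1)


# PowerPC bin at LET = 0.07 MeV*cm^2/mg: no errors over 1.0e8 particles/cm^2.
print(mftf_bound(1.0e8, 0))

# Doubling the error-free fluence (hypothetical extra testing) doubles the bound.
print(mftf_bound(2.0e8, 0))
```

This also shows why more testing helps: every additional error-free particle raises the MFTF that can be claimed for the bin.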



# Reliability across Fluence up to LET=0.07MeV•cm<sup>2</sup>/mg

Binned GEO Environment data show approximately 3000 particles/(cm<sup>2</sup>•10-minutes), in the range of 0.0MeV•cm<sup>2</sup>/mg to 0.07MeV•cm<sup>2</sup>/mg. We are using MFTF for 0.07MeV•cm<sup>2</sup>/mg to upper bound this bin.



Reliability at 3000 particles/(cm<sup>2</sup>•10-minutes) > 99.99% for the PowerPC design implementation. “9’s” could be increased with more tests.
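The arithmetic behind this statement, using the PowerPC MFTF bound of $1.0\times 10^8$ particles/cm $^2$ and the decaying exponential form of $R(\Phi)$:

$$R(\Phi) = e^{-\Phi/MFTF} = e^{-3000/(1.0\times 10^{8})} = e^{-3.0\times 10^{-5}} \approx 0.99997$$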



# Reliability across Fluence up to LET=0.14MeV•cm<sup>2</sup>/mg

Binned GEO Environment data show approximately 11 particles/(cm<sup>2</sup>•10-minutes), in the range of 0.07MeV•cm<sup>2</sup>/mg to 0.14MeV•cm<sup>2</sup>/mg. We are using MFTF for 0.14MeV•cm<sup>2</sup>/mg to upper bound this bin.



Reliability at 5 particles/(cm<sup>2</sup>•10-minutes) > 99.999% for the PowerPC design implementation.

# Reliability across Fluence up to LET=1.8 MeV•cm<sup>2</sup>/mg



Binned GEO Environment data show approximately 9 particles/(cm<sup>2</sup>•10-minutes), in the range of 0.14MeV•cm<sup>2</sup>/mg to 1.8MeV•cm<sup>2</sup>/mg. We are using MFTF for 1.8MeV•cm<sup>2</sup>/mg to upper bound this bin.



Reliability at 9 particles/(cm<sup>2</sup>•10-minutes) > 99.9% for the PowerPC design implementation. This is the most susceptible bin for the system.

# Reliability across Fluence up to LET=3.6MeV•cm<sup>2</sup>/mg



Binned GEO Environment data show approximately 0.23 particles/(cm<sup>2</sup>•10-minutes), in the range of 1.8MeV•cm<sup>2</sup>/mg to 3.6MeV•cm<sup>2</sup>/mg.



Within this LET range, reliability at 0.23 particles/(cm<sup>2</sup>•10-minutes) > 99.999% for both design implementations.

# Reliability across Fluence up to LET=40MeV•cm<sup>2</sup>/mg

Binned GEO environment data show approximately 0.07 particles/(cm<sup>2</sup>•10-minutes), in the range of 3.6MeV•cm<sup>2</sup>/mg to 40.0MeV•cm<sup>2</sup>/mg.



Within this LET range, reliability at 0.07 particles/(cm<sup>2</sup>•10-minutes) > 99.9% for both design implementations. We can refine by analyzing smaller bins.
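To see how demanding each bin is, the reliability model can be inverted to give the smallest MFTF that still meets 99.9% at the binned fluence: $MFTF_{min} = \Phi / (-\ln R_{req})$. The sketch below uses the approximate per-bin fluences quoted on these slides; applying the 99.9% requirement to each bin separately is an assumption that follows the piecemeal approach, and the names are illustrative.

```python
import math


def min_mftf(fluence: float, required_reliability: float) -> float:
    """Smallest MFTF (particles/cm^2) with exp(-fluence/MFTF) >= required_reliability."""
    if not 0.0 < required_reliability < 1.0:
        raise ValueError("required_reliability must be strictly between 0 and 1")
    return fluence / -math.log(required_reliability)


# Approximate GEO fluence per 10-minute window, keyed by LET range (MeV*cm^2/mg).
bin_fluence = {
    "0.0 to 0.07": 3000.0,
    "0.07 to 0.14": 11.0,
    "0.14 to 1.8": 9.0,
    "1.8 to 3.6": 0.23,
    "3.6 to 40.0": 0.07,
}

required = 0.999  # mission requirement: 3-nines within the window
required_mftf = {rng: min_mftf(phi, required) for rng, phi in bin_fluence.items()}

for rng, mftf in required_mftf.items():
    print(f"LET {rng:>12}: MFTF must exceed {mftf:.3g} particles/cm^2")
```

A measured (or bounded) MFTF above the printed value means that bin meets the requirement on its own.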

# Example Conclusion

- Using the proposed methodology, the commercial Xilinx V5 device will meet project requirements.
- In this case, the project is able to save money by selecting the significantly cheaper FPGA device and gain performance because of the embedded PowerPC.





# Conclusions

- This study transforms proven classical reliability models into the SEU particle fluence domain. The intent is to better characterize SEU responses for complex systems.
- The method for reliability-model application is as follows:
  - SEU data are obtained as MFTF.
  - Reliability curves (in the fluence domain) are calculated using MFTF and are analyzed with a piecemeal approach.
  - Environment data are then used to determine particle flux exposure within required windows of mission operation.
- The proposed method does not rely on data-fitting and hence removes a significant source of error.
- The proposed method provides information for highly SEU-susceptible scenarios and hence enables a better choice of mitigation strategy.
- This is preliminary work. There is more to come.

*This methodology expresses SEU behavior and response in terms that missions understand via classical reliability metrics.*



# Acknowledgements

- *Some of this work has been sponsored by the NASA Electronic Parts and Packaging (NEPP) Program and the Defense Threat Reduction Agency (DTRA).*
- *Thanks are given to the NASA Goddard Radiation Effects and Analysis Group (REAG) for their technical assistance and support. REAG is led by Kenneth LaBel and Jonathan Pellish.*

## *Contact Information:*

*Melanie Berg: NASA Goddard REAG FPGA  
Principal Investigator:*

*Melanie.D.Berg@NASA.GOV*