



# Modernization of FPGA Risk Analysis for Critical Space Applications

Melanie Berg

Contractor in support of NASA/GSFC

[Melanie.D.Berg@NASA.gov](mailto:Melanie.D.Berg@NASA.gov)



# Acronyms

| Acronym | Description                                                 | Acronym              | Description                                            | Acronym  | Description                       |
|---------|-------------------------------------------------------------|----------------------|--------------------------------------------------------|----------|-----------------------------------|
| AI      | Artificial Intelligence                                     | IP                   | Intellectual property                                  | RPP      | Rectangular parallelepiped        |
| BRAM    | embedded static random-access memory                        | lb                   | lower bound                                            | SEE      | single event effect               |
| CCIX    | Interconnect consortium                                     | LBNL                 | Lawrence Berkeley National Laboratory                  | SEF      | single event failure              |
| CLB     | configurable logic block                                    | LET                  | linear energy transfer                                 | SEFI     | single event functional interrupt |
| CMOS    | Complementary MOSFET                                        | LUT                  | Look Up Table                                          | SERDES   | serializer-deserializer           |
| CXL     | Compute express link                                        | LVDS                 | low Voltage Differential Signaling                     | SET      | single event transient            |
| DDR4    | Double Data Rate 4 Synchronous Dynamic Random-Access Memory | MFTF                 | mean fluence to failure                                | SEU      | single event upset                |
| DFF     | Flip-flop                                                   | MIPI                 | mobile industry processor interface                    | SoC      | system on chip                    |
| DSP     | Digital signal processor                                    | n                    | number of events                                       | SRAM     | static random access memory       |
| DUT     | device under test                                           | NoC                  | network on chip                                        | T        | number of experiments             |
| FPGA    | Field programmable gate array                               | P                    | probability                                            | ub       | upper bound                       |
| FTF     | fluence to failure                                          | PCIe                 | Peripheral Component Interconnect Express              | wDMA     | Direct memory access              |
| G       | Giga                                                        | $P_{\text{effect}}$  | Probability an event can exist through system topology | $\mu$    | mean                              |
| Gb/s    | Gigabits/second                                             | $P_{\text{gen}}$     | Probability an event can occur from ionization         | $\sigma$ | cross section                     |
| GPIO    | general purpose input/output                                | $P_{\text{observe}}$ | Probability an event can be observed                   | $\Phi$   | fluence                           |
| GR      | global route                                                | RF                   | radio frequency                                        | $Q_{coll}$ | Collected charge                |
| HBM     | High Bandwidth Memory                                       | RHA                  | Radiation Hardness Assurance                           | $Q_{crit}$ | Critical charge                 |
| I/O     | input/output                                                | RTD                  | representative tactical design                         | $\tau_{width}$ | Transient width             |

# The Space Environment and Mission Operation



- ▶ The space environment consists of a variety of ionizing particles (radiation) that can disrupt mission operations.
- ▶ Risk analysis or failure analysis is performed to determine component-level susceptibilities and how they can potentially affect system operation in radiation environments.
- ▶ Goal: investigate susceptibilities and predict (calculate) the probability of an upset:
  - Requires space environment particle flux data.
  - Requires component susceptibility data





# Radiation Hardness Assurance (RHA)

- ▶ Radiation Hardness Assurance (RHA) is the process of:
  - Identifying possible susceptibilities, vulnerabilities and failure modes
  - Obtaining single event effects (SEE) data by performing radiation testing on components at a beam facility (accelerated testing).
  - Analyzing SEE data (calculating cross-sections).
  - Extrapolating SEE data (using target environment particle flux) to a tactical system:
    - **Transform** accelerated (beam) SEE data to target space environment.
    - **Transform** test structure susceptibilities to target system topology.
  - Calculating failure rates based on extrapolation information.
  - Inserting mitigation (fault tolerance) as dictated by mission requirements.



# Device Penetration of Heavy Ions and Linear Energy Transfer (LET)



## How Do Heavy Ions Affect Electronics?

- ▶ LET characterizes the energy deposition of charged particles passing through a device.
- ▶ LET is based on the average energy loss ( $dE$ ) per unit path length ( $dx$ ), i.e., the stopping power.
- ▶ Density ( $\rho$ ) is used to normalize LET to the target material.

$$LET = \frac{1}{\rho} \frac{dE}{dx}$$

where:

- $dE$: energy lost by the particle
- $dx$: linear path length
- $\rho$: density of the target material
- $\frac{dE}{dx}$: average energy deposited per unit path length
- Units: $\frac{MeV \cdot cm^2}{mg}$

Sensitive region: rectangular parallelepiped (RPP)
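The density normalization can be illustrated with a short calculation. This is a minimal sketch, assuming a silicon target (density of about 2.33 g/cm³) and a hypothetical stopping power; neither value comes from the presentation, and the function name is illustrative only.

```python
# Assumed illustrative values; not taken from the presentation.
SILICON_DENSITY_MG_PER_CM3 = 2330.0  # about 2.33 g/cm^3 for silicon


def let_mev_cm2_per_mg(stopping_power_mev_per_cm: float,
                       density_mg_per_cm3: float = SILICON_DENSITY_MG_PER_CM3) -> float:
    """Return LET = (1/rho) * dE/dx in MeV*cm^2/mg."""
    if density_mg_per_cm3 <= 0:
        raise ValueError("density must be positive")
    return stopping_power_mev_per_cm / density_mg_per_cm3


# Hypothetical ion losing 2.33 MeV per micrometre of silicon.
de_dx_mev_per_cm = 2.33 / 1.0e-4  # convert MeV/um to MeV/cm
let_value = let_mev_cm2_per_mg(de_dx_mev_per_cm)
print(f"LET = {let_value:.1f} MeV*cm^2/mg")
```

Dividing by density is what lets LET values be compared across target materials: the same stopping power in a denser material gives a smaller LET.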

# Energy Collection and SET Generation



- ▶ For CMOS, SET generation occurs due to an “off” gate turning “on”.
- ▶ For a CMOS SET: there is a push-pull between the “on” gate and the charge collected by the “off” gate ( $Q_{coll}$ ).
- ▶ SETs can have significant metastable states.
- ▶ An SET has an amplitude and width ( $\tau_{width}$ ) based on:
  - Amount of  $Q_{coll}$  (i.e., small LET → small SET)
  - The capacitance of the gate’s load
  - The strength (current) of its complementary “ON” gate
  - The dissipation strength of the process
- ▶ A captured SET is an SEU.

Collected charge ( $Q_{coll}$ ) greater than critical charge ( $Q_{crit}$ ):

$$Q_{coll} > Q_{crit}$$



# SEUs and SETs in Combinatorial Logic and Edge Triggered Flip Flops (DFF)



| Combinatorial (CL)                      | Sequential (DFF)                                               |
|-----------------------------------------|----------------------------------------------------------------|
| Logic function generation (computation) | Captures and holds state of data input at rising edge of clock |
| **SET**                                 | **SEU**                                                        |
| **Glitch in the CL** (double sided)     | **SET capture in DFF loop** (single sided)                     |

Edge-triggered DFF SEEs differ significantly from those of a latch because of the master-slave capture topology.

# Investigating Failure Modes: Radiation Testing and SEE Cross Sections



System failures due to SEEs are second order:

- Probability that a transistor will change state, and
- Probability the SEU or SET will cause system malfunction.

Cross sections are metrics derived from beam testing

$$\sigma_{SEU}(LET) = \frac{\#events}{\#ions/cm^2} = \frac{\#events}{Fluence}$$

$\sigma_{SEU}$ values are empirical data calculated at selected LET values (particle spectrum).

## Terminology:

- Flux: Particles/(sec-cm<sup>2</sup>)
- Fluence: Particles/cm<sup>2</sup>
- Linear energy transfer (LET): MeV·cm<sup>2</sup>/mg
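The relationship between flux, fluence, and cross section can be sketched in a few lines. The run values below (flux, exposure time, event count) are hypothetical and the function names are illustrative; only the formulas come from the definitions above.

```python
def fluence(flux_per_cm2_s: float, exposure_s: float) -> float:
    """Fluence (particles/cm^2) = flux (particles/(s*cm^2)) * exposure time (s)."""
    return flux_per_cm2_s * exposure_s


def cross_section(events: int, fluence_per_cm2: float) -> float:
    """Cross section (cm^2) = number of events / fluence."""
    if fluence_per_cm2 <= 0:
        raise ValueError("fluence must be positive")
    return events / fluence_per_cm2


# Hypothetical beam run at a single LET.
run_fluence = fluence(flux_per_cm2_s=1.0e4, exposure_s=100.0)
sigma_seu = cross_section(events=50, fluence_per_cm2=run_fluence)
print(f"fluence = {run_fluence:.2e} particles/cm^2")
print(f"sigma_SEU = {sigma_seu:.2e} cm^2")
```

One such value is computed per LET, which is why the cross section is written as a function of LET.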



SEE testing at LBNL 88in Cyclotron

LBNL: Lawrence Berkeley National Laboratory

# FPGA SEU Cross Section Model



**SEU Cross sections for a mapped design ( $\sigma_{SEF}$ ) are based on the FPGA's internal elements and the mapped design's topology.**

$$\sigma_{SEF} = f(\sigma_{configuration}, \sigma_{BRAM}, \sigma_{functionalLogic}, \sigma_{HiddenLogic})$$

**There are established testing techniques to study various FPGA elements**

Melanie Berg et al., "FPGA SEU Radiation Test Guidelines": [https://nepp.nasa.gov/files/23779/fpga_radiation_test_guidelines_2012.pdf](https://nepp.nasa.gov/files/23779/fpga_radiation_test_guidelines_2012.pdf)

# Challenges Using Conventional SEE Cross-Section Data for System Characterization



$$\sigma_{SEU} = \frac{\#events}{\#ions/cm^2} = \frac{\#events}{Fluence}$$

- ▶ Methods for calculating Single Event error rates rely on cross-sections.
  - Conventionally,  $\sigma_{SEU}$ s are metrics that describe a sensitive area (SEE susceptibility) of a device.
  - The concept of sensitive area/volume works well for transistor or bit-level component metrics.
  - This (old-school, conventional) concept is used to extrapolate SEE data to systems:
    - Fine-grain component cross-sections (bit-level/basic mechanisms) are obtained and are usually multiplied to characterize system SEE behavior.

RPP

$$\text{Error Rate} = \#(\text{fine-grain elements}) \times \text{error rate}(\text{fine-grain element})$$

This is linear bounding (presumed), not extrapolation: topology is ignored.

# Challenges for SEE Test and Analysis of New Generation SoC/FPGA



- ▶ Cannot test every fine-grain element (basic mechanism).
- ▶ Not all basic mechanisms are linearly extrapolatable (topology matters).
- ▶ SoCs contain a significant amount of embedded circuitry (hidden logic).
- ▶ Hidden circuits are extremely complex and require complex test methods.
- ▶ Increased focus on  $\sigma_{HiddenLogic}$

$$\sigma_{SEF} = f(\sigma_{configuration}, \sigma_{BRAM}, \sigma_{functionalLogic}, \sigma_{HiddenLogic})$$

# Fine-Grain Test Structures Should Not Be Used for SoC Extrapolation



- Conventional test structure: shift register.
- Shift register data (the conventional golden metric) contribute little to the characterization of an SoC.
- Instead, test using coarse-grain structures:
  - Test operation in similar modes to flight.
  - Flight-like high-speed I/O protocols
  - Flight-like state-based controls and functions

Move from counting events of basic mechanisms:

$$\sigma_{SEU} = \frac{\#events}{\#ions/cm^2} = \frac{\#events}{Fluence}$$

to obtaining the fluence until an event occurs (FTF: fluence-to-failure):

$$\sigma_{SEF} = \frac{1}{Fluence}$$



# Reimagine Cross Sections as Probabilities



$$\text{Single Event Failure} = \sigma_{SEF} = \frac{\#events}{\text{Fluence}}$$

## Failure Rate in the fluence domain



**For system analyses:** Step away from the conventional methods of cross-sections representing sensitive areas and the RPP method.

**Redefine the cross-section metric to be a probability.**

**The probability an event will occur when the target is subjected to a given number of particles (per area).**

**$\sigma_{SEF}$  is now a rate. However, the rate is in the fluence domain not the time domain.**

# Single Event Effects And The Binomial Distribution

Trial → Event → Effect (Response)



- ▶ Each ion can either cause an event or not:
  - Binomial distribution... over multiple Bernoulli trials
    - ... each ion is an independent random trial with two (2) possible outcomes
  - Trial outcomes:
    - event (1) or
    - no event (0)
- ▶ For this definition, cross-sections can never be greater than 1.
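- ▶ The law of rare events (Poisson limit theorem) states that, for many trials with a small per-trial event probability, these binomial experiments can be characterized by Poisson distributions.
- ▶ For systems, there will be times when the exponential distribution is a better model. The exponential distribution follows from the Poisson zero-event case,  $P(X=0)$ : the probability that no event has occurred up to a given fluence.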

Flipping a coin is the most common example of a binomial experiment



- Just like each coin toss, each particle is a Bernoulli trial
- An Event is an upset/failure

$$\sigma_{SEF} = \frac{\#events}{\#ions/cm^2} < 1$$

Makes sense if we are redefining a cross section as a probability
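The link between the binomial, Poisson, and exponential views can be checked numerically. This is a sketch with hypothetical numbers (one million ions, a per-ion event probability of 2 × 10⁻⁶); the helper names are illustrative.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k events in n independent Bernoulli trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events for a Poisson mean of lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Hypothetical exposure: many ions, small per-ion event probability.
n_ions = 1_000_000
p_event = 2.0e-6
lam = n_ions * p_event  # expected number of events

for k in range(5):
    print(k, f"{binomial_pmf(k, n_ions, p_event):.6f}", f"{poisson_pmf(k, lam):.6f}")

# The zero-event case: the probability that the system has survived the
# exposure, which is the survival function of the exponential model.
p_no_event = poisson_pmf(0, lam)
print(f"P(X=0) = {p_no_event:.6f}")
```

With many trials and a small per-trial probability the two columns agree closely, and the zero-event probability decays exponentially with the expected number of events.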



# Modeling System-Level Susceptibilities as They Pertain to Empirical Cross-Sections



**Be careful... your test system can greatly impact the quality of your cross-section data.**

# Cross-Section ( $\sigma_{SEF}$ ) As A Set of Transfer Functions of $P_{gen}$



Probability ionization + design topology will cause an effect



Poor test systems,  $P_{observe} \rightarrow 0$ :

- Test system adds noise to data (bad system design, dosimetry, flux control)
- Inability to reliably observe and report failures:
  - Missed events/upsets
  - Latency from event to observation
  - False events
  - Flux/fluence control

Empirical cross sections are not pure

$$\sigma_{SEF} = P_{gen} \times P_{effect} \times P_{observe}$$



Many assume $P_{observe} = 1$, and many assume they are measuring critical charge ( $P_{gen}$ ).

For a system, these assumptions are not true.
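A small Monte Carlo run illustrates how a test system that misses events biases the measured value low. This is a sketch only: the three probabilities are hypothetical per-ion values, the three stages are assumed independent, and the function name is illustrative.

```python
import random


def simulate_measured_probability(n_ions: int, p_gen: float, p_effect: float,
                                  p_observe: float, seed: int = 0) -> float:
    """Fraction of ions whose event is generated, has an effect, and is observed."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = 0
    for _ in range(n_ions):
        generated = rng.random() < p_gen
        if generated and rng.random() < p_effect and rng.random() < p_observe:
            observed += 1
    return observed / n_ions


# Hypothetical values: same DUT and design, two different test systems.
ideal = simulate_measured_probability(200_000, p_gen=0.05, p_effect=0.5, p_observe=1.0)
lossy = simulate_measured_probability(200_000, p_gen=0.05, p_effect=0.5, p_observe=0.6)

print(f"ideal test system: {ideal:.4f} (expected about {0.05 * 0.5 * 1.0:.4f})")
print(f"lossy test system: {lossy:.4f} (expected about {0.05 * 0.5 * 0.6:.4f})")
```

The DUT is identical in both runs; only the ability to observe events differs, yet the reported value changes.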

# Testing Homogenous Cells versus Complex Systems



- ▶ **Homogenous Cells ( $P_{effect} \rightarrow 1$ ):**
  - Copies of a simple structure (inverters, buffers, memory cells)
  - Each test has many identical targets, which improves statistics.
  - In most cases, FTF is not the best approach. Instead, use a countable metric.
- ▶ **Complex systems ( $0 < P_{effect} < 1$ ):**
  - Many variables, moving parts, and state space exploration paths
  - Difficult to test and requires strategic planning.
  - Planning includes taking advantage of dominant mechanisms of failure.
  - Alternatively, the tests are evaluating probabilities of failure with respect to fluence exposure.



Countable systems (how many events per ion) → Poisson distribution  
FTF (how many ions until event) → Exponential distribution

# An Example of When to Use Homogenous Testing and Linear Bounding



$$\sigma_{SEF} = f(\sigma_{configuration}, \sigma_{BRAM}, \sigma_{functionalLogic}, \sigma_{HiddenLogic})$$



# Configuration SEU and Functional Upsets



- Direct connections from configuration to user logic.

An affected active/used bit has the ability to instantaneously cause an unexpected effect



No Read-Write cycle required!

# Example: Routing Configuration Upsets in a Xilinx Virtex FPGA





# SRAM-Based FPGAs and SEU Cross-Sections

For SRAM-Based FPGAs, Configuration bits are the dominant mechanisms of failure.



We first obtain configuration-bit cross-sections.

We perform a linear transformation: (#essential\_bits × configuration cross-section).

We use the linear transformation as a bounding cross-section (error rate).

$$\sigma(LET)_{configuration\_Device} = \frac{\#events}{\#Particles/cm^2}$$

$$\sigma(LET)_{configuration\_bit} = \frac{\#events}{\left(\frac{\#Particles}{cm^2}\right) \times (\#unmasked\ configuration\ bits)}$$

$$\sigma(LET)_{Essential\_bit} = \#Essential\_bits \times \sigma(LET)_{configuration\_bit} \quad \text{(bound)}$$

$$\sigma(LET)_{SEF} = \frac{1}{FTF} = \frac{1}{(FailureTime - BeamStartTime) \times AverageFlux} \quad \text{(system extrapolation)}$$

Which cross-sections do we use for failure analysis? ... Must consider mission requirements.
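The four cross-sections above can be compared side by side with a short calculation. All inputs below (upset count, fluence, bit counts, failure time, flux) are hypothetical placeholders, and the function names are illustrative.

```python
def sigma_configuration_bit(events: int, fluence: float, unmasked_bits: int) -> float:
    """Per-bit configuration cross section (cm^2/bit)."""
    return events / (fluence * unmasked_bits)


def sigma_essential_bound(essential_bits: int, sigma_bit: float) -> float:
    """Linear bound: number of essential bits times the per-bit cross section."""
    return essential_bits * sigma_bit


def sigma_sef_from_ftf(failure_time_s: float, beam_start_time_s: float,
                       average_flux: float) -> float:
    """System cross section from one fluence-to-failure (FTF) experiment."""
    ftf = (failure_time_s - beam_start_time_s) * average_flux
    if ftf <= 0:
        raise ValueError("fluence to failure must be positive")
    return 1.0 / ftf


# Hypothetical configuration-memory run at one LET.
events, run_fluence, unmasked_bits = 2000, 1.0e6, 100_000_000
sigma_device = events / run_fluence
sigma_bit = sigma_configuration_bit(events, run_fluence, unmasked_bits)
sigma_bound = sigma_essential_bound(5_000_000, sigma_bit)

# Hypothetical FTF run of the mapped design at the same LET.
sigma_sef = sigma_sef_from_ftf(failure_time_s=120.0, beam_start_time_s=0.0,
                               average_flux=1.0e3)

print(f"device: {sigma_device:.2e} cm^2, per bit: {sigma_bit:.2e} cm^2/bit")
print(f"essential-bit bound: {sigma_bound:.2e} cm^2, FTF: {sigma_sef:.2e} cm^2")
```

In this made-up case the essential-bit bound sits above the FTF value, which is the behaviour a bound is supposed to have; whether it holds for a real device has to be shown with data.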



# Homogenous Cross Sections: FPGA Configuration Memory



- Single event functional interrupts (SEFIs) can occur; however, they have a low event probability during testing (depending on fluence).
- If the experiments go to a high enough fluence, it is highly likely that a SEFI will occur... yet its  $\sigma_{\text{SEFI}}$  will be low.

| LET (MeV·cm<sup>2</sup>/mg) | Number of Experiments (82 total tests) |
|------|----------------------------------------|
| 0.1  | 3                                      |
| 1.16 | 21                                     |
| 1.54 | 16                                     |
| 2.39 | 15                                     |
| 4.35 | 12                                     |
| 7.27 | 12                                     |
| 10.9 | 3                                      |

All 82 tests are represented in the graph. The results are so close that it is difficult to distinguish individual experiments at each LET.

# SRAM-Based FPGAs and Using Configuration Memory as An Upper-Bound



If Upper-bounds Satisfy Mission Reliability/Survivability Requirements, Then No FTF (Data Refinement) Necessary



# Error Bounding is Easy... Why Not Always Use It?



- ▶ Error bounding provides extreme upper bounds without knowledge of design topology.
  - Error rates calculated using error bounds might not meet requirements; the SEE data must then be refined by performing FTF-type SEE testing.
  - Can't be used to study the efficacy of mitigation.
- ▶ When using error bounding... prove it before you use it.
  - $\sigma(LET)_{\text{Essential\_bit}}$  should only be used if it is known/proven to be an upper-bound (or close enough depending on criticality).
  - **The proof of bounding has been the missing factor; and is now necessary.**
  - Why now? Device complexity includes a significant amount of hidden logic.
    - Hidden logic has components that are not included in the essential bit count.
    - It has been shown (in flight) to impact susceptibility (e.g., internal scrubbers).

# Fluence-to-Failure Experiments and The Exponential Model



**Classical reliability: transformation from the time domain to the fluence domain.**

|                                                      | Exponential Distribution Variables                                                                                                                     |
|------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|
| Fluence-to-failure (FTF)                             | $\Phi_i$ Random Variable: per experiment- <i>i</i> for a selected LET                                                                                  |
| SEF Cross-section<br>(rate w.r.t. fluence)           | $\sigma_{SEF_i} = \frac{1}{\Phi_i}$                                                                                                                    |
| Sample mean (MFTF)                                   | $\mu = \frac{1}{n} \sum_{i=1}^T \Phi_i$ Average of fluence-to-failure test results.<br><b>n = number of events</b><br><b>T = number of experiments</b> |
| Mean SEF                                             | $\sigma_{SEF\mu} = \frac{1}{\mu}$ Classical Reliability: Constant per LET                                                                              |
| Standard deviation                                   | $\mu = MFTF$ Use of exponential population standard deviation definition                                                                               |
| Standard error of the mean (SEM)                     | $\frac{\mu}{\sqrt{n}}$ Generally used for error bars                                                                                                   |
| Exponential PDF<br>Probability distribution function | $\sigma_{SEF\mu} e^{-\sigma_{SEF\mu} \Phi}$ or $\frac{1}{\mu} e^{-\frac{1}{\mu} \Phi}$                                                                 |



# FTF PDF Expected Empirical Data: How Can 5-10 Tests Be Sufficient?

$$\text{Experiment fluence to failure} = \Phi_i = \frac{1}{\sigma_{SEF_i}} \quad MFTF = \mu$$

$$f(\Phi) = \sigma_{SEF_\mu} e^{-\sigma_{SEF_\mu} \Phi}$$

ub: upper bound  
lb: lower bound

For each experiment, most FTF data points ( $\Phi_i$ ) will occur near the mean, for a well-made test system. The goal is to design a test system where  $\Phi_{lb}$  is close to  $\Phi_{ub}$

$$P(\Phi_{lb} < \Phi_i < \Phi_{ub}) = e^{-\frac{\Phi_{lb}}{\mu}} - e^{-\frac{\Phi_{ub}}{\mu}}$$
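The interval probability above is straightforward to evaluate. This sketch uses a hypothetical MFTF of 10⁶ particles/cm² and hypothetical bounds; the function names are illustrative.

```python
import math


def exponential_pdf(phi: float, mu: float) -> float:
    """f(phi) = (1/mu) * exp(-phi/mu), with mu the mean fluence to failure."""
    return math.exp(-phi / mu) / mu


def prob_between(phi_lb: float, phi_ub: float, mu: float) -> float:
    """P(phi_lb < Phi < phi_ub) = exp(-phi_lb/mu) - exp(-phi_ub/mu)."""
    if not 0.0 <= phi_lb <= phi_ub:
        raise ValueError("require 0 <= phi_lb <= phi_ub")
    return math.exp(-phi_lb / mu) - math.exp(-phi_ub / mu)


mu = 1.0e6  # hypothetical MFTF in particles/cm^2
p_window = prob_between(0.1 * mu, 3.0 * mu, mu)
print(f"P(0.1*mu < Phi < 3*mu) = {p_window:.4f}")
```

Evaluating a few candidate windows this way shows how much of the expected FTF data a given pair of bounds would contain under the exponential model.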

**Concerns: deviation from the mean depends on:**

- Mechanisms of SEF in the DUT (homogenous, multi-modal)
- Integrity and expediency to detect and report SEF
- Dosimetry
- Flux control



**The reality is: increasing the number of tests will not bring your empirical mean closer to the actual mean if concerns are not controlled.**



# Xilinx/AMD Kintex-Ultrascale...FTF Data for Complex Operations



- FTF cross section data are within a decade and are sufficient for calculating SEF cross-section means
- Calculate mean per LET analyzing each experiment  $i$  :
  - No event for experiment  $i$ :  $n=0$  and  $\Phi_i$  = fluence for experiment  $i$
  - Event for experiment  $i$ :  $n=1$  and  $\Phi_i$  = recorded fluence for event occurrence
- If  $n=0$  for a majority of tests, increase fluence (and check your test system).

$$\mu = \frac{1}{n} \sum_{i=1}^T \Phi_i$$
$$\sigma_{SEF\mu} = \frac{1}{\mu}$$
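The per-LET bookkeeping described above can be sketched as follows, using $\sigma_{SEF\mu} = 1/\mu$ and the standard error $\mu/\sqrt{n}$ from the exponential-model table. The experiment list is hypothetical and the function name is illustrative.

```python
import math
from typing import Dict, List, Tuple


def mftf_summary(experiments: List[Tuple[float, bool]]) -> Dict[str, float]:
    """Summarize FTF runs at one LET.

    Each entry is (fluence, failed): the recorded fluence at failure when
    failed is True (n=1 for that run), or the full run fluence when no
    event occurred (n=0 for that run).
    """
    total_fluence = sum(phi for phi, _ in experiments)  # sum over all T runs
    n_events = sum(1 for _, failed in experiments if failed)
    if n_events == 0:
        raise ValueError("no events: increase fluence and check the test system")
    mu = total_fluence / n_events
    return {
        "T": len(experiments),
        "n": n_events,
        "mu": mu,                         # mean fluence to failure (MFTF)
        "sigma_sef": 1.0 / mu,            # mean SEF cross section
        "sem": mu / math.sqrt(n_events),  # standard error of the mean
    }


# Hypothetical runs at one LET: four failures and one run with no event.
runs = [(2.0e5, True), (4.0e5, True), (1.0e5, True), (5.0e5, False), (3.0e5, True)]
summary = mftf_summary(runs)
print(summary)
```

Runs without an event still add their fluence to the sum but not to the event count, which is why the sum runs over $T$ experiments while the divisor is $n$.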

# NASA Mission Requiring Test-As-You-Fly (FTF) Radiation Data



$$\sigma_{SEF} = f(\sigma_{configuration}, \sigma_{BRAM}, \sigma_{functionalLogic}, \sigma_{HiddenLogic})$$

- ▶ DUT: Microchip RTProASIC3 (mission critical).
- ▶ Mission Requirement: work through worst-week with ground intervention restricted to 0.01/day.
- ▶ DUT area constraints limit mitigation.
- ▶ **Extrapolated (upper-bound) Error rates do not** meet requirements (use of shift register data).
- ▶ **Test-as-you-fly** heavy-ion testing required (**FTF data refinement**).



Texas A&M Cyclotron Facility

A robust complex system was developed:

Multi-use Test Platform enabled testing the DUT with the NASA flight image. DUT was controlled and operated (at speed) as it would be in flight. FTF data were successfully obtained.

# Microchip RTProASIC FTF Data versus Bounding Extrapolation Data



- FTF experiments were RTD-Test-as-you-fly.
- SEF data for a specified function within the NASA flight design are illustrated. Extrapolated data cannot be refined to a specific function.

RTD: representative tactical design

SEF: single event failure

LET: Linear energy transfer



Large number of low LET Particles per day during worst week.



| LET Range     | Fluence/Day       |
|---------------|-------------------|
| 0.1 ... 0.5   | $1.8 \times 10^6$ |
| 0.5 ... 1.0   | $7.6 \times 10^3$ |
| 1.0 ... 5.0   | $1.0 \times 10^3$ |
| 5.0 ... 10.0  | $4.2 \times 10^1$ |
| 10.0 ... 20.0 | $8.4 \times 10^0$ |

# Test-As-You-Fly FTF Refined Data Meet Requirements



Extrapolated data do not meet requirements while Test-as-you-fly data do.




| Weibull Parameter           | Description              |
|-----------------------------|--------------------------|
| $\text{LET}_{\text{onset}}$ | Onset LET                |
| $\sigma_{\text{SAT}}$       | Saturation cross-section |
| W                           | width                    |
| S                           | shape                    |

| Parameter                   | Extrapolated                                  | Test-As-You-Fly                             |
|-----------------------------|-----------------------------------------------|---------------------------------------------|
| $\text{LET}_{\text{onset}}$ | $0.5 \text{ MeV}\cdot\text{cm}^2/\text{mg}$   | $2.0 \text{ MeV}\cdot\text{cm}^2/\text{mg}$ |
| $\sigma_{\text{SAT}}$       | $60 \mu\text{m}^2$                            | $6000 \mu\text{m}^2$                        |
| W                           | $42.58 \text{ MeV}\cdot\text{cm}^2/\text{mg}$ | $30 \text{ MeV}\cdot\text{cm}^2/\text{mg}$  |
| S                           | 2.0                                           | 2.8                                         |
| Multiplier                  | 15200                                         | 1                                           |
| Error rate                  | $2.1 \times 10^{-1} \text{ errors/day}$       | $2.3 \times 10^{-3} \text{ errors/day}$     |

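The extrapolated error rate ( $2.1 \times 10^{-1}$ errors/day) does not meet requirements; the test-as-you-fly error rate ( $2.3 \times 10^{-3}$ errors/day) does.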
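The Weibull parameters describe a cross-section curve versus LET. The presentation lists the parameters but not the functional form, so the sketch below assumes the four-parameter Weibull form commonly used for SEE cross-section fits; the function name is illustrative, and the example uses the test-as-you-fly column of the table.

```python
import math


def weibull_cross_section(let: float, let_onset: float, sigma_sat: float,
                          width: float, shape: float) -> float:
    """Assumed form: sigma_sat * (1 - exp(-((LET - LET_onset) / W) ** S)).

    Returns 0 at or below the onset LET. The result has the units of sigma_sat.
    """
    if let <= let_onset:
        return 0.0
    return sigma_sat * (1.0 - math.exp(-(((let - let_onset) / width) ** shape)))


# Test-as-you-fly parameters from the table (sigma_sat in um^2).
tayf = {"let_onset": 2.0, "sigma_sat": 6000.0, "width": 30.0, "shape": 2.8}

UM2_TO_CM2 = 1.0e-8  # 1 um^2 = 1e-8 cm^2
for let in (1.0, 5.0, 10.0, 20.0, 40.0):
    sigma_um2 = weibull_cross_section(let, **tayf)
    print(f"LET {let:5.1f}: {sigma_um2:10.3f} um^2 = {sigma_um2 * UM2_TO_CM2:.3e} cm^2")
```

An error rate would then combine this curve with the environment's particle spectrum; that step depends on the environment model and is not reproduced here.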

# Summary

- ▶ Modernization of FPGA SEE Risk Analysis:
  - Test:
    - Application of coarse-grained experiments (test smart logic).
    - Test systems and DUT test structures become complex; in turn, data better characterize missions.
  - Analysis: Top-Down approach:
    - For cross-section analysis, use classical reliability models transformed to the fluence domain.
    - Cross-sections become probabilities (instead of areas).
  - Extrapolation:
    - Coarse-grain SEE data in the form of probabilities become easier to extrapolate to complex space applications.

