

# **Improving DRAM Performance, Security, and Reliability by Understanding and Exploiting DRAM Timing Parameter Margins**

*Submitted in partial fulfillment of the requirements  
for the degree of Doctor of Philosophy  
in Electrical and Computer Engineering*

**Jeremie S. Kim**

B.S., Electrical and Computer Engineering, Carnegie Mellon University  
M.S., Electrical and Computer Engineering, Carnegie Mellon University

## **Thesis Prospectus Committee**

Prof. Onur Mutlu (Chair)

Prof. James C. Hoe

Prof. Derek Chiou

Dr. Saugata Ghose

**August 2020**

Carnegie Mellon University  
Pittsburgh, PA

Copyright © 2021 Jeremie S. Kim

All Rights Reserved

## Abstract

Characterization of real DRAM devices has enabled findings about DRAM device properties, which have led to proposals that significantly improve overall system performance by reducing DRAM access latency and power consumption. In addition to improving system performance, a deeper understanding of DRAM technology via characterization can also improve device reliability and security. This can be seen in the recent discoveries of 1) DRAM-based true random number generators (TRNGs), a method for generating true random numbers using DRAM devices that can be used in many applications, 2) DRAM-based physical unclonable functions (PUFs), a method for generating unique device-dependent keys for identification and authentication, and 3) the RowHammer vulnerability, a phenomenon where repeatedly accessing a DRAM row can cause failures in unaccessed neighboring DRAM rows.

To advance DRAM-based discoveries and mechanisms, this dissertation rigorously characterizes many modern commodity DRAM devices and shows that by exploiting DRAM access timing margins within manufacturer-recommended DRAM timing specifications, we can significantly improve system performance, reduce power consumption, and improve device reliability and security. First, we characterize DRAM timing parameter margins and find that certain regions of DRAM can be accessed faster than other regions due to manufacturing process variation in DRAM cells. We exploit this by enabling variable access times depending on the DRAM cells being accessed, which not only improves overall system performance, but also decreases power consumption. Second, we find that we can uniquely identify DRAM devices by the locations of failures that result when we access DRAM with timing parameters reduced below specification values. Because we induce these failures with DRAM accesses, we can generate these unique identifiers significantly more quickly than prior work. Third, we propose a random number generator that is based on our observation that timing failures in certain DRAM cells are randomly induced and can thus be repeatedly polled to very quickly generate true random values. Finally, we characterize the RowHammer security vulnerability on a wide range of modern DRAM chips while violating the DRAM refresh requirement in order to directly characterize the underlying DRAM technology without the interference of refresh commands. Our characterization of real chips demonstrates that existing RowHammer mitigation mechanisms either are not scalable or suffer from prohibitively large performance overheads in projected future devices, and that it is critical to research more effective solutions to RowHammer. Overall, our studies build a new understanding of modern DRAM devices to improve computing system performance, reliability, and security all at the same time.

## Acknowledgments

I have many people to thank for their support during my PhD journey. First and foremost, I am extremely grateful to my advisor, Onur Mutlu, who has generously mentored and guided me since my sophomore year of college. His passion for Computer Architecture and his research initially caught my interest during a seminar course lecture, and his exemplary feedback, encouragement, and ongoing support helped me to adopt his passion into my own work as well. I am grateful to have experienced, first-hand, the thought and care Onur puts into both his teaching and his research, and I have learned greatly through many iterations of paper submissions, conference talks, course exams, and lectures with him. Onur has also provided countless opportunities for collaboration within the SAFARI research group and with industrial collaborators, which gave me many new experiences during my exciting and unique PhD and made this thesis possible.

I am grateful to the members of my PhD committee, James Hoe, Derek Chiou, and Saugata Ghose, for their valuable feedback and stimulating discussions.

I am immensely grateful to have found many great friends in the SAFARI research group. I could not have done it without Minesh Patel, a great friend that truly made my PhD experience enjoyable. He brought an immense wealth of knowledge, enforced high research standards, and fostered my enjoyment of hard liquors. I am very thankful for Can Firtina, an endless source of entertainment, unforgettable experiences, and shelter. I am grateful for Hasan Hassan who on many occasions kept me company in the office late into the night. I am thankful for Damla Senol Cali, my first friend in SAFARI with whom I had to navigate the new field of bioinformatics and learn Onur's research process. I also want to thank Giray Yaglikci for his friendship and company on hikes around the world.

I would like to acknowledge all past and current members of our research group for being both great friends and colleagues. I want to especially thank Yoongu Kim, Hongyi Xin, Donghyuk Lee, Rachata Ausavarungnirun, Yixin Luo, Saugata Ghose, Vivek Seshadri, and Kevin Chang, for their mentorship during my formative years in SAFARI. I thank Can Firtina, Giray Yaglikci, Nastaran Hajinazar, Geraldo De Oliveira, and Ivan Fernandez-Vega for gracefully welcoming my invasion of their office space and providing a fun working environment. I also thank all others for their discussions, feedback, collaboration, and support: Arash Tavakkol, Jawad Haj-Yahya, Mohammed Alser, Roknoddin Azizibarzoki, Nika Mansourighiasi, Lois Orosa, Juan Gomez Luna, Amirali Boroumand, Jisung Park, Nandita Vijaykumar, Ivan Puddu, Ataberk Olgun, Nisa Bostanci, Rahul Bera, Konstantinos Kanellopoulos, and Taha Shahroodi.

I am especially grateful to have met and worked with Tyler Huberty, Stephan Meier, Jared Zerbe, Jung-Sik Kim, Heonjae Ha, Seung Lee, Gihong Kim, Taehyun Kim, Augustin Hong, Can Alkan, and countless others during my PhD journey. They have all provided great insight, mentorship, and support.

I would also like to thank my internship mentors, Stefan Saroiu and Alec Wolman, during my time at Microsoft Research, who provided me with a stimulating environment and an industrial perspective on system security. I sincerely thank Microsoft for this opportunity.

I would like to thank the National Science Foundation (grants 1212962 and 1320531), the National Institutes of Health (grant HG006004), and the SAFARI Research Group's industrial partners for the financial support and gift funding, respectively, that contributed to the work done during my PhD.

To my many friends and family, your support and encouragement throughout my journey were worth more than I can express on paper. I want to particularly thank Jimmy Lee, Ho-Gyun Choi, Noelle Jung, Stephanie Chen, Justine Kim, and Matt Yin for always being there.

Finally, I want to thank my parents Hyong and Anita for their unwavering support, encouragement, and love.

# Contents

|          |                                                                                               |           |
|----------|-----------------------------------------------------------------------------------------------|-----------|
| <b>1</b> | <b>Introduction</b>                                                                           | <b>15</b> |
| 1.1      | Problem and Thesis Statement . . . . .                                                        | 15        |
| 1.2      | Overview of Our Approach . . . . .                                                            | 16        |
| 1.3      | Contributions . . . . .                                                                       | 18        |
| 1.4      | Dissertation Outline . . . . .                                                                | 22        |
| <b>2</b> | <b>Background</b>                                                                             | <b>23</b> |
| 2.1      | DRAM System Organization . . . . .                                                            | 23        |
| 2.2      | DRAM Chip Organization . . . . .                                                              | 24        |
| 2.3      | DRAM Commands . . . . .                                                                       | 25        |
| 2.4      | DRAM Cell Operation . . . . .                                                                 | 26        |
| 2.5      | DRAM Failure Modes . . . . .                                                                  | 28        |
| 2.6      | Violating Manufacturer-Specified Timing Parameters . . . . .                                  | 29        |
| <b>3</b> | <b>Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines</b> | <b>30</b> |
| 3.1      | Motivation and Goal . . . . .                                                                 | 31        |
| 3.2      | Testing Methodology . . . . .                                                                 | 32        |
| 3.3      | Activation Failure Characterization . . . . .                                                 | 33        |
| 3.3.1    | Spatial Distribution of Activation Failures . . . . .                                         | 33        |
| 3.3.2    | Data Pattern Dependence . . . . .                                                             | 36        |
| 3.3.3    | Temperature Effects . . . . .                                                                 | 38        |
| 3.3.4    | Latency Effects . . . . .                                                                     | 39        |

|          |                                                                                                                                                               |           |
|----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|
| 3.3.5    | Short-term Variation . . . . .                                                                                                                                | 40        |
| 3.3.6    | DRAM Write Operations . . . . .                                                                                                                               | 41        |
| 3.4      | Exploiting Activation Latency Variation . . . . .                                                                                                             | 42        |
| 3.4.1    | Solar-DRAM . . . . .                                                                                                                                          | 42        |
| 3.4.2    | Static Profile of Weak Subarray Columns . . . . .                                                                                                             | 44        |
| 3.5      | Solar-DRAM Evaluation . . . . .                                                                                                                               | 45        |
| 3.5.1    | Evaluation Methodology . . . . .                                                                                                                              | 45        |
| 3.5.2    | Multi-core Evaluation Results . . . . .                                                                                                                       | 47        |
| 3.6      | Related Work . . . . .                                                                                                                                        | 49        |
| 3.7      | Limitations . . . . .                                                                                                                                         | 50        |
| 3.8      | Summary . . . . .                                                                                                                                             | 51        |
| <b>4</b> | <b>The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices</b> | <b>52</b> |
| 4.1      | Physical Unclonable Functions . . . . .                                                                                                                       | 53        |
| 4.2      | Motivation and Goal . . . . .                                                                                                                                 | 55        |
| 4.3      | Properties of a Runtime-Accessible PUF . . . . .                                                                                                              | 56        |
| 4.3.1    | Characteristics of a Desirable PUF . . . . .                                                                                                                  | 57        |
| 4.3.2    | Characteristics of a <i>Runtime-Accessible</i> PUF . . . . .                                                                                                  | 57        |
| 4.4      | Testing Environment . . . . .                                                                                                                                 | 58        |
| 4.5      | DRAM Retention PUFs: Analysis . . . . .                                                                                                                       | 59        |
| 4.5.1    | Evaluating Retention PUFs . . . . .                                                                                                                           | 59        |
| 4.5.2    | Evaluation Times of Retention PUFs . . . . .                                                                                                                  | 60        |
| 4.5.3    | Optimizing Retention PUFs . . . . .                                                                                                                           | 62        |
| 4.6      | DRAM Latency PUFs . . . . .                                                                                                                                   | 65        |
| 4.6.1    | PUF Characteristics: Experimental Analysis . . . . .                                                                                                          | 66        |
| 4.6.2    | Runtime-Accessible PUF Metrics Evaluation . . . . .                                                                                                           | 73        |
| 4.7      | Design Considerations . . . . .                                                                                                                               | 77        |
| 4.7.1    | Repeatability of Cell Latency Failures . . . . .                                                                                                              | 78        |

|          |                                                                                                                   |           |
|----------|-------------------------------------------------------------------------------------------------------------------|-----------|
| 4.7.2    | DRAM Latency PUF Evaluation Algorithm . . . . .                                                                   | 79        |
| 4.7.3    | Variation Among PUF Memory Segments . . . . .                                                                     | 82        |
| 4.7.4    | Support for Changing Timing Parameters . . . . .                                                                  | 83        |
| 4.7.5    | Device Enrollment . . . . .                                                                                       | 83        |
| 4.7.6    | In-DRAM Error Correcting Codes . . . . .                                                                          | 84        |
| 4.7.7    | Effect of High-Temperature . . . . .                                                                              | 85        |
| 4.8      | Related Work . . . . .                                                                                            | 86        |
| 4.9      | Limitations . . . . .                                                                                             | 88        |
| 4.10     | Summary . . . . .                                                                                                 | 89        |
| <b>5</b> | <b>D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput</b> | <b>90</b> |
| 5.1      | True Random Number Generators (TRNGs) . . . . .                                                                   | 91        |
| 5.2      | Motivation and Goal . . . . .                                                                                     | 92        |
| 5.3      | Testing Environment . . . . .                                                                                     | 95        |
| 5.4      | Activation Failure Characterization . . . . .                                                                     | 96        |
| 5.4.1    | Spatial Distribution of Activation Failures . . . . .                                                             | 97        |
| 5.4.2    | Data Pattern Dependence . . . . .                                                                                 | 99        |
| 5.4.3    | Temperature Effects . . . . .                                                                                     | 101       |
| 5.4.4    | Entropy Variation over Time . . . . .                                                                             | 102       |
| 5.5      | D-RaNGe: A DRAM-based TRNG . . . . .                                                                              | 103       |
| 5.5.1    | RNG Cell Identification . . . . .                                                                                 | 104       |
| 5.5.2    | Sampling RNG Cells for Random Data . . . . .                                                                      | 105       |
| 5.5.3    | Full System Integration . . . . .                                                                                 | 107       |
| 5.6      | D-RaNGe Evaluation . . . . .                                                                                      | 108       |
| 5.6.1    | NIST Tests . . . . .                                                                                              | 108       |
| 5.6.2    | RNG Cell Distribution . . . . .                                                                                   | 109       |
| 5.6.3    | TRNG Key Characteristics Evaluation . . . . .                                                                     | 110       |
| 5.7      | Comparison with Prior DRAM TRNGs . . . . .                                                                        | 115       |
| 5.7.1    | DRAM Command Scheduling . . . . .                                                                                 | 116       |

|          |                                                                                                   |            |
|----------|---------------------------------------------------------------------------------------------------|------------|
| 5.7.2    | DRAM Data Retention . . . . .                                                                     | 117        |
| 5.7.3    | DRAM Startup Values . . . . .                                                                     | 118        |
| 5.7.4    | Combining DRAM-based TRNGs . . . . .                                                              | 119        |
| 5.8      | Other Related Works . . . . .                                                                     | 119        |
| 5.9      | Limitations . . . . .                                                                             | 120        |
| 5.10     | Summary . . . . .                                                                                 | 121        |
| <b>6</b> | <b>Revisiting RowHammer: An Experimental Analysis of Modern Devices and Mitigation Techniques</b> | <b>122</b> |
| 6.1      | RowHammer: DRAM Disturbance Errors . . . . .                                                      | 123        |
| 6.2      | Motivation and Goal . . . . .                                                                     | 124        |
| 6.3      | Experimental Methodology . . . . .                                                                | 125        |
| 6.3.1    | Testing Infrastructure . . . . .                                                                  | 125        |
| 6.3.2    | Characterized DRAM Chips . . . . .                                                                | 127        |
| 6.3.3    | Effectively Characterizing RowHammer . . . . .                                                    | 128        |
| 6.4      | RowHammer Characterization . . . . .                                                              | 132        |
| 6.4.1    | RowHammer Vulnerability . . . . .                                                                 | 132        |
| 6.4.2    | Data Pattern Dependence . . . . .                                                                 | 133        |
| 6.4.3    | Hammer Count ( <i>HC</i> ) Effects . . . . .                                                      | 135        |
| 6.4.4    | RowHammer Spatial Effects . . . . .                                                               | 136        |
| 6.4.5    | First RowHammer Bit Flips . . . . .                                                               | 140        |
| 6.4.6    | Single-Cell RowHammer Bit Flip Probability . . . . .                                              | 144        |
| 6.5      | Implications for Future Systems . . . . .                                                         | 145        |
| 6.5.1    | RowHammer Mitigation Mechanisms . . . . .                                                         | 146        |
| 6.5.2    | Evaluation of Viable Mitigation Mechanisms . . . . .                                              | 150        |
| 6.5.3    | RowHammer Mitigation Going Forward . . . . .                                                      | 154        |
| 6.6      | Related Work . . . . .                                                                            | 156        |
| 6.7      | Limitations . . . . .                                                                             | 157        |
| 6.8      | Summary . . . . .                                                                                 | 158        |

|                                                                                      |            |
|--------------------------------------------------------------------------------------|------------|
| <b>7 Putting It All Together</b>                                                     | <b>159</b> |
| 7.1 Implementing All Proposed Techniques on the Same System . . . . .                | 159        |
| 7.2 Cost-Benefit Analysis . . . . .                                                  | 160        |
| <b>8 Conclusions and Future Directions</b>                                           | <b>162</b> |
| 8.1 Conclusions . . . . .                                                            | 162        |
| 8.2 Future Research Directions . . . . .                                             | 164        |
| 8.2.1 Reducing DRAM Latency by Exploiting Different Timing Parameters . . . . .      | 164        |
| 8.2.2 Improving Security Primitives for DRAM Chips . . . . .                         | 165        |
| 8.2.3 RowHammer Mitigation Going Forward . . . . .                                   | 167        |
| 8.3 Final Concluding Remarks . . . . .                                               | 169        |

# List of Figures

|     |                                                                                                                                                                                                                                  |    |
|-----|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| 2-1 | A typical DRAM-based system [163]. . . . .                                                                                                                                                                                       | 24 |
| 2-2 | DRAM bank and cell architecture [163]. . . . .                                                                                                                                                                                   | 25 |
| 2-3 | Command sequence for reading data from DRAM and the state of a DRAM cell during each related step. . . . .                                                                                                                       | 26 |
| 3-1 | Activation failure bitmap in $1024 \times 1024$ cell array. . . . .                                                                                                                                                              | 34 |
| 3-2 | Probability of the first access to a newly-activated row going to a particular cache line offset within the row. . . . .                                                                                                         | 36 |
| 3-3 | Data pattern dependence of the proportion of local bitlines with activation failures found over 16 iterations. . . . .                                                                                                           | 37 |
| 3-4 | Temperature effects on a local bitline's $F_{prob}$ . . . . .                                                                                                                                                                    | 39 |
| 3-5 | $F_{prob}$ of local bitlines across time. . . . .                                                                                                                                                                                | 41 |
| 3-6 | Weighted speedup improvements of Solar-DRAM, its three individual components, and FLY-DRAM over baseline LPDDR4 DRAM, evaluated over various 4-core workload mixes from the SPEC CPU2006 benchmark suite. . . . .                | 47 |
| 4-1 | Average DRAM retention PUF evaluation time vs. temperature shown for three selected memory segment sizes for each manufacturer. Average DRAM latency PUF evaluation time (Section 4.6.2) is shown as a comparison point. . . . . | 61 |

|     |                                                                                                                                                                           |     |
|-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
| 4-2 | Distributions of Jaccard indices calculated across every possible pair of PUF responses across all tested PUF memory segments from each of 223 LPDDR4 DRAM chips. . . . . | 69  |
| 4-3 | Distributions of Jaccard indices calculated between PUF responses of DRAM chips from a single manufacturer. . . . .                                                       | 70  |
| 4-4 | Distribution of the Intra-Jaccard index range values calculated between many PUF responses that a PUF memory segment generates over a 30-day period. . . . .              | 72  |
| 4-5 | DRAM latency PUF repeatability vs. temperature. . . . .                                                                                                                   | 74  |
| 5-1 | Activation failure bitmap in $1024 \times 1024$ cell array. . . . .                                                                                                       | 98  |
| 5-2 | Data pattern dependence of DRAM cells prone to activation failure over 100 iterations . . . . .                                                                           | 99  |
| 5-3 | Effect of temperature variation on failure probability . . . . .                                                                                                          | 101 |
| 5-4 | Density of RNG cells in DRAM words per bank. . . . .                                                                                                                      | 110 |
| 5-5 | Distribution of TRNG throughput across chips. . . . .                                                                                                                     | 113 |
| 6-1 | Our SoftMC infrastructure [302, 120] for testing DDR4 DRAM chips. . . . .                                                                                                 | 126 |
| 6-2 | RowHammer bit flip coverage of different data patterns (described in Section 6.3.3) for a single representative DRAM chip of each type-node configuration. . . . .        | 134 |
| 6-3 | Hammer count ( $HC$ ) vs. RowHammer bit flip rate across DRAM type-node configurations. . . . .                                                                           | 135 |
| 6-4 | Distribution of RowHammer bit flips across row offsets from the victim row. . . . .                                                                                       | 137 |
| 6-5 | Distribution of the number of RowHammer bit flips per 64-bit word for each DRAM type-node configuration. . . . .                                                          | 139 |
| 6-6 | Number of hammers required to cause the first RowHammer bit flip ( $HC_{first}$ ) per chip across DRAM type-node configurations. . . . .                                  | 141 |

|     |                                                                                                                                                                           |     |
|-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
| 6-7 | Hammer Count (left y-axis) required to find the first 64-bit word containing one, two, and three RowHammer bit flips. Hammer Count Multiplier (right y-axis) quantifies the *HC* difference between every two points on the x-axis (as a multiplication factor of the left point to the right point). . . . . | 143 |
| 6-8 | Effect of RowHammer mitigation mechanisms on a) DRAM bandwidth overhead (note the inverted log-scale y-axis) and b) system performance, as DRAM chips become more vulnerable to RowHammer (from left to right). . . . . | 153 |

# List of Tables

|     |                                                                                                                                                                  |     |
|-----|------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
| 3.1 | Evaluated system configuration. . . . .                                                                                                                          | 46  |
| 4.1 | The number of tested PUF memory segments across the tested chips from each of the three manufacturers. . . . .                                                   | 67  |
| 4.2 | Number of PUF memory segments tested for 30 days. . . . .                                                                                                        | 71  |
| 4.3 | Percentage of PUF memory segments per chip with Intra-Jaccard index ranges <0.1 or 0.2 over a 30-day period. Median [minimum, maximum] values are shown. . . . . | 73  |
| 4.4 | Percentage of *good* memory segments per chip across manufacturers. Median [min, max] values are shown. . . . .                                                  | 83  |
| 5.1 | D-RaNGe results with NIST randomness test suite. . . . .                                                                                                         | 109 |
| 5.2 | Comparison to previous DRAM-based TRNG proposals. . . . .                                                                                                        | 115 |
| 6.1 | Summary of DRAM chips tested. . . . .                                                                                                                            | 127 |
| 6.2 | Fraction of DDR3 DRAM chips vulnerable to RowHammer when $\mathbf{HC} < 150\text{k}$ . . . . .                                                                   | 133 |
| 6.3 | Worst-case data pattern for each DRAM type-node configuration at **50°C** split into different manufacturers. . . . .                                            | 134 |
| 6.4 | Lowest $\mathbf{HC}_{\text{first}}$ values ( $\times 1000$ ) across all chips of each DRAM type-node configuration. . . . .                                      | 142 |
| 6.5 | Percentage of cells with monotonically increasing RowHammer bit flip probabilities as $\mathbf{HC}$ increases. . . . .                                           | 144 |
| 6.6 | System configuration for simulations. . . . .                                                                                                                    | 151 |

# Chapter 1

## Introduction

### 1.1 Problem and Thesis Statement

Characterization of real DRAM devices has enabled findings in DRAM device properties, which has led to proposals that significantly improve overall system performance by reducing DRAM access latency and power consumption. In addition to improving system performance, a deeper understanding of DRAM technology via characterization can also improve device reliability and security. These can be seen with the recent discoveries of 1) DRAM-based true random number generators (TRNGs) [154, 317, 117, 330, 77, 264, 320, 250], a method for generating true random numbers using DRAM devices which can be used in many applications, 2) DRAM-based physical unclonable functions (PUFs) [318, 154, 117, 328, 329, 268, 322, 319, 152, 235, 321, 250], a method for generating unique device-dependent keys for identification and authentication, and 3) the RowHammer vulnerability [170, 66, 104, 105, 204, 265, 270, 285, 324, 340, 360, 87, 145, 184, 239, 341], a phenomenon where repeatedly accessing a DRAM row can cause failures in unaccessed neighboring DRAM rows. To advance this collection of discoveries and mechanisms, we rigorously characterize many modern commodity DRAM devices and show that by exploiting DRAM access timing margins and specifications, we can significantly improve system performance, reduce power consumption, and improve device reliability and security. First, we characterize DRAM timing parameter margins and find that certain regions of DRAM can be accessed faster than other regions due to DRAM cell process manufacturing variation. We exploit this by enabling variable access times depending on the DRAM cells that are being accessed, which not only improves overall system performance, but also decreases power consumption. Second, with further characterization, we find that we can uniquely identify DRAM devices by the locations of failures that result when we access DRAM with timing parameters reduced below specification values. Because we induce these failures with DRAM accesses, we can generate these unique identifiers significantly more quickly than prior work. Third, we propose a random number generator that is based on our observation that timing failures in certain DRAM cells are randomly induced and can thus be repeatedly polled to very quickly generate true random values. Finally, we characterize the RowHammer security vulnerability on a wide range of modern DRAM devices while violating the DRAM refresh requirement in order to directly characterize the underlying DRAM technology without the interference of refresh commands.

Our thesis statement is as follows: *By rigorously understanding and exploiting DRAM device characteristics, we can significantly improve system performance and enhance system security and reliability.*

### 1.2 Overview of Our Approach

In line with our thesis statement, we use rigorous characterization of real DRAM chips to make novel observations on chip properties and use these observations to improve system performance and enhance system security and reliability. We demonstrate across four works that, by understanding per-chip error characteristics using a profiling mechanism, we can develop mechanisms that exploit chip-dependent error profiles to improve system performance or enhance system security and reliability.

The first mechanism that we develop based on our observations, *Subarray-optimized Access Latency Reduction DRAM (Solar-DRAM)*, builds on our detailed experimental characterization and exploits each of our novel observations to significantly and robustly reduce DRAM access latency. The key ideas of Solar-DRAM are to issue 1) DRAM reads with reduced  $t_{RCD}$  (i.e., by 39%) unless the requested DRAM cache line contains weak DRAM cells that are likely to fail under reduced  $t_{RCD}$ , and 2) all DRAM writes with reduced  $t_{RCD}$  (i.e., by 77%). Solar-DRAM determines whether a DRAM cell is weak using a per-chip *static profile* of local bitlines, which we experimentally find to be *reliable across time*. Compared to state-of-the-art LPDDR4 DRAM, Solar-DRAM provides significant system performance improvement while maintaining data correctness.
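The selection rule in the two key ideas above can be illustrated with a minimal sketch. It is not the Solar-DRAM implementation: the default $t_{RCD}$ of 18 ns, the function name `select_trcd_ns`, and the representation of the static profile as a set of weak cache-line addresses are assumptions made for illustration. Only the 39% and 77% reductions are taken from the text.

```python
DEFAULT_TRCD_NS = 18.0   # assumed default tRCD; the real value comes from the datasheet
READ_REDUCTION = 0.39    # reduction for reads to cache lines without weak cells
WRITE_REDUCTION = 0.77   # reduction applied to all writes


def select_trcd_ns(is_write, cache_line, weak_cache_lines):
    """Return the tRCD (in ns) to use when activating a row for this request.

    `weak_cache_lines` stands in for the per-chip static profile: a set of
    cache-line addresses known to contain weak cells (hypothetical encoding).
    """
    if is_write:
        # Writes use the most aggressive reduction regardless of the profile.
        return DEFAULT_TRCD_NS * (1.0 - WRITE_REDUCTION)
    if cache_line in weak_cache_lines:
        # Reads to weak cache lines fall back to the default latency.
        return DEFAULT_TRCD_NS
    return DEFAULT_TRCD_NS * (1.0 - READ_REDUCTION)


# Hypothetical (bank, row, cache line) tuples taken from a static profile.
weak = {(0, 5, 3)}
print(select_trcd_ns(False, (0, 5, 3), weak))  # read to a weak cache line
print(select_trcd_ns(False, (0, 5, 4), weak))  # read to a cache line with no weak cells
print(select_trcd_ns(True, (0, 5, 3), weak))   # write
```

Because the profile is static, the lookup in this sketch is a simple set-membership test on every request.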

The second mechanism, *the DRAM latency PUF*, exploits our novel observation that reducing DRAM read access latency below the reliable datasheet specifications using software-only system calls results in error patterns that can be used as unique identifiers. We demonstrate that users can further enhance the reliability of the unique identifiers using an error profile of the DRAM chip, which enables users to select regions of DRAM that are better suited for evaluating PUFs. We experimentally demonstrate, using 223 modern LPDDR4 DRAM chips, that the DRAM latency PUF satisfies all of the requirements of an effective runtime-accessible PUF. In particular, a DRAM latency PUF can be evaluated in 88.2ms on average across all devices at all operating temperatures. We show that, for a constant DRAM capacity overhead of 64KiB, the DRAM latency PUF’s average (minimum, maximum) evaluation time speedup over the DRAM retention PUF [318, 154, 210, 364] is 152x (109x, 181x) at 70°C and 1426x (868x, 1783x) at 55°C, with exponentially increasing speedups at even lower temperatures.

The third mechanism, D-RaNGe, exploits the novel observation from our characterization results that true random numbers can be extracted from access latency failures with high throughput. D-RaNGe consists of two steps: 1) identifying specific DRAM cells that are vulnerable to activation failures using a *low-latency* profiling step and 2) generating a continuous stream (i.e., constant rate) of random numbers by repeatedly inducing activation failures in the previously-identified vulnerable cells. D-RaNGe runs entirely in software and is capable of immediately running on any commodity system that provides the ability to manipulate DRAM timing parameters within the memory controller [12, 11]. For most other devices, a simple software API must be exposed without any hardware changes to the commodity DRAM device (e.g., similarly to SoftMC [120, 302]), which makes D-RaNGe suitable for implementation on most existing systems today.

Finally, we demonstrate, via characterization of many DRAM chips and technology node generations, that the DRAM-based security vulnerability, RowHammer, is getting worse as device feature size shrinks. That is, the number of activations needed to induce a RowHammer bit flip decreases, to as few as 9.6k in the most vulnerable chip we tested. We then use our characterization results to demonstrate that five state-of-the-art RowHammer mitigation mechanisms do not scale to support the degrees of RowHammer vulnerability that we expect to see in future devices. We conclude by discussing various methods for improving RowHammer mitigation for future DRAM devices.

### 1.3 Contributions

This dissertation makes the following **key contributions**:

1. Using 282 LPDDR4 DRAM modules from three major DRAM manufacturers, we extensively characterize the effects of multiple testing conditions (e.g., DRAM temperature, DRAM access latency parameters, data patterns written in DRAM) on activation failures. We demonstrate the viability of mechanisms that exploit variation in access latency of DRAM cells by showing that cells that operate correctly at reduced latency continue to operate correctly at the same latency over time. That is, a DRAM cell’s activation failure probability is *not* vulnerable to significant variation over short time intervals.
  - (a) We present data across our DRAM modules showing that activation failures exhibit high spatial locality and are tightly constrained to a small number of *columns* (i.e., on average 3.7%/2.5%/2.2% per bank for DRAM chips of manufacturers A/B/C) at the granularity of a DRAM subarray.

  - (b) We demonstrate that  $t_{RCD}$  can be greatly reduced (i.e., by 77%) for DRAM *write* requests while *still* maintaining data integrity. This is because  $t_{RCD}$  defines the amount of time required for data within DRAM cells to be amplified to a *readable* voltage level, which does *not* govern DRAM *write* operations.
  - (c) We find that across SPEC CPU2006 benchmarks, DRAM accesses to closed rows typically request the  $0^{th}$  *cache line* in the row, with a maximum (average) probability of 22.2% (6.6%). This is much greater than the expected probability (i.e., 3.1%) assuming that DRAM accesses to closed rows access each cache line with an equal probability. Since  $t_{RCD}$  affects only DRAM accesses to closed DRAM rows, we find that simply reducing  $t_{RCD}$  for all accesses to the  $0^{th}$  cache lines of all DRAM rows improves overall system performance by up to 6.54%.
  - (d) We propose Solar-DRAM, a mechanism that exploits our three key observations on reliably reducing the  $t_{RCD}$  timing parameter. Solar-DRAM selectively reduces  $t_{RCD}$  for 1) reads to DRAM cache lines containing “weak” or “strong” cells, and 2) writes to all of DRAM. We evaluate Solar-DRAM on a variety of multi-core workloads and show that compared to *state-of-the-art* LPDDR4 DRAM, Solar-DRAM improves performance by an average (maximum) of 4.97% (8.79%) on heterogeneous and 4.31% (10.87%) on homogeneous workloads.
2. We introduce the DRAM latency PUF, a new class of DRAM PUFs, that is based on the deliberate violation of manufacturer-specified DRAM latency parameters. DRAM latency PUFs can be implemented *with no additional hardware overhead* on any commodity off-the-shelf (COTS) system that permits software-controlled manipulation of DRAM access latencies at the memory controller (e.g., [191, 1, 12]).
  - (a) Using experimental data from 223 real LPDDR4 DRAM chips, we extensively analyze both DRAM latency PUFs and DRAM retention PUFs. We show that DRAM latency PUFs 1) satisfy all characteristics of an effective PUF, and 2) are suitable for use as runtime-accessible PUFs across a *wide range* of temperatures. We also present an extensive characterization of DRAM retention PUFs under a wide range of temperatures. We show that while DRAM retention PUFs can be evaluated faster at higher temperatures, their evaluation time at temperatures even as high as 70°C is *prohibitively* slow.

  - (b) We experimentally show that the DRAM latency PUF significantly outperforms DRAM retention PUFs, achieving an average speedup of 152x/1426x at 70°C/55°C when evaluating PUFs with a constant DRAM capacity overhead of 64KiB. We also find that while DRAM retention PUFs suffer from *temperature-dependent evaluation times*, the DRAM latency PUF provides a consistently low average evaluation time of 88.2ms at all operating temperatures.
3. We introduce D-RaNGe, a new methodology for extracting true random numbers from a commodity DRAM device at high throughput and low latency. The key idea of D-RaNGe is to use DRAM cells as entropy sources to generate true random numbers by accessing them with a latency that is lower than manufacturer-recommended specifications.
  - (a) Using experimental data from 282 state-of-the-art LPDDR4 DRAM devices from three major DRAM manufacturers, we present a rigorous characterization of randomness in errors induced by accessing DRAM with low latency. Our analysis demonstrates that D-RaNGe is able to maintain high-quality random number generation both over 15 days of testing and across the entire reliable testing temperature range of our infrastructure (55°C-70°C). We verify our observations from this study with prior works' observations on DDR3 DRAM devices [55, 190, 191, 163]. Furthermore, we experimentally demonstrate on four DDR3 DRAM devices, from a single manufacturer, that D-RaNGe is suitable for implementation in a wide range of commodity DRAM devices.

  - (b) We evaluate the quality of D-RaNGe’s output bitstream using the standard NIST statistical test suite for randomness [279] and find that it successfully passes every test. We also compare D-RaNGe’s performance to four previously proposed DRAM-based TRNG designs (Section 5.7) and show that D-RaNGe outperforms the best prior DRAM-based TRNG design by over two orders of magnitude in terms of maximum and average throughput.
4. We provide the first rigorous RowHammer failure characterization study of a broad range of real modern DRAM chips across different DRAM types, technology node generations, and manufacturers. We experimentally study 1580 DRAM chips ( $408 \times$  DDR3,  $652 \times$  DDR4, and  $520 \times$  LPDDR4) from 300 DRAM modules ( $60 \times$  DDR3,  $110 \times$  DDR4, and  $130 \times$  LPDDR4) and present our RowHammer characterization results for both aggregate RowHammer failure rates and the behavior of individual cells while sweeping the hammer count ( $HC$ ) and stored data pattern.
  - (a) Via our rigorous characterization studies, we definitively demonstrate that the RowHammer vulnerability significantly worsens (i.e., the number of hammers required to induce a RowHammer bit flip,  $HC_{\text{first}}$ , greatly reduces) in newer DRAM chips (e.g.,  $HC_{\text{first}}$  reduces from  $69.2k$  to  $22.4k$  in DDR3,  $17.5k$  to  $10k$  in DDR4, and  $16.8k$  to  $4.8k$  in LPDDR4 chips across multiple technology node generations).
  - (b) We demonstrate, based on our rigorous evaluation of five state-of-the-art RowHammer mitigation mechanisms, that even though existing RowHammer mitigation mechanisms are reasonably effective at mitigating RowHammer in today’s DRAM chips (e.g., 8% average performance loss on our workloads when  $HC_{\text{first}}$  is  $4.8k$ ), they will cause significant overhead in future DRAM chips with even lower  $HC_{\text{first}}$  values (e.g., 80% average performance loss with the most scalable mechanism when  $HC_{\text{first}}$  is 128).
  - (c) We evaluate an ideal refresh-based mitigation mechanism that selectively refreshes a row only just before it is about to experience a RowHammer bit flip, and find that in chips with high vulnerability to RowHammer, there is still significant opportunity for developing a refresh-based RowHammer mitigation mechanism with low performance overhead that scales to low  $HC_{\text{first}}$  values. We conclude that it is critical to research more effective solutions to RowHammer, and we provide promising directions for future research.

### 1.4 Dissertation Outline

This thesis is organized into 7 chapters. Chapter 2 describes necessary background on DRAM organization, operations, and failure mechanisms. Chapter 3 presents Solar-DRAM, a mechanism for reducing DRAM access latency. Chapter 4 presents the DRAM latency PUF, a fast and efficient method for generating unique identifiers in commodity DRAM. Chapter 5 presents D-RaNGe, a method for quickly and efficiently generating true random numbers in DRAM. Chapter 6 presents our experimental study of RowHammer on several technology node generations of DRAM chips. Finally, Chapter 7 presents conclusions and future research directions that are enabled by this dissertation.

# Chapter 2

## Background

We describe the DRAM organization and operation necessary for understanding our observations and mechanism for reducing DRAM access latencies. We refer the reader to prior works [191, 156, 289, 119, 120, 376, 192, 173, 292, 55, 188, 54, 57, 190, 294, 193, 174, 170, 35, 58, 206, 155, 266, 158, 260, 236, 56, 207, 164, 163, 165, 98, 97] for more detail.

### 2.1 DRAM System Organization

In a typical system configuration, a CPU chip includes a set of memory controllers, where each memory controller interfaces with a DRAM channel to perform read and write operations. As we show in Figure 2-1 (left), a DRAM channel has its own I/O bus and operates independently of other channels in the system. To achieve high memory capacity, a channel can host multiple DRAM modules by sharing the I/O bus between the modules. A DRAM module implements one or more DRAM ranks. Command and data transfers are serialized between ranks in the same channel due to the shared I/O bus. A DRAM rank consists of multiple DRAM chips that operate in lock-step, i.e., all chips simultaneously perform the same operation, but they do so on different bits. The number of DRAM chips per rank depends on the data bus width of the DRAM chips and the channel width. For example, a typical system has a 64-bit wide DRAM channel. Thus, four 16-bit or eight 8-bit DRAM chips are needed to build a DRAM rank.
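The relationship between channel width, chip data bus width, and chips per rank is simple arithmetic, sketched below. The function name `chips_per_rank` is a hypothetical helper introduced only for this example.

```python
def chips_per_rank(channel_width_bits, chip_width_bits):
    """Number of lock-step DRAM chips needed to fill one channel's data bus."""
    if chip_width_bits <= 0 or channel_width_bits % chip_width_bits != 0:
        raise ValueError("channel width must be a positive multiple of chip width")
    return channel_width_bits // chip_width_bits


# The two configurations from the example above, for a 64-bit channel.
print(chips_per_rank(64, 16))  # four 16-bit chips
print(chips_per_rank(64, 8))   # eight 8-bit chips
```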



Figure 2-1: A typical DRAM-based system [163].

### 2.2 DRAM Chip Organization

At a high level, a DRAM chip consists of billions of DRAM cells that are hierarchically organized to maximize storage density and performance. We describe each level of the hierarchy of a modern DRAM chip.

A modern DRAM chip is composed of multiple DRAM banks (shown in Figure 2-1, right). The chip communicates with the memory controller through the *I/O circuitry*. The I/O circuitry is connected to the *internal command and data bus* that is shared among all banks in the chip.

Figure 2-2a illustrates the organization of a DRAM bank. In a bank, the *global row decoder* partially decodes the address of the accessed *DRAM row* to select the corresponding *DRAM subarray*. A DRAM subarray is a 2D array of DRAM cells, where cells are horizontally organized into multiple DRAM rows. A DRAM row is a set of DRAM cells that share a wire called the *wordline*, which the *local row decoder* of the subarray drives after fully decoding the row address. In a subarray, a column of cells shares a wire, referred to as the *bitline*, that connects the column of cells to a *sense amplifier*. The sense amplifier is the circuitry used to read and modify the data of a DRAM cell. The row of sense amplifiers in the subarray is referred to as the *local row-buffer*. To access a DRAM cell, the corresponding DRAM row first needs to be copied into the local row-buffer, which connects to the internal I/O bus via the *global row-buffer*.



Figure 2-2: DRAM bank and cell architecture [163].

Figure 2-2b illustrates a DRAM cell, which is composed of a *storage capacitor* and an *access transistor*. A DRAM cell stores a single bit of information based on the charge level of the capacitor. The data stored in the cell is interpreted as a “1” or “0” depending on whether the charge stored in the cell is above or below a certain threshold. Unfortunately, the capacitor and the access transistor are not ideal circuit components and have *charge leakage paths*. Thus, to ensure that the cell does not leak charge to the point where the bit stored in the cell flips, the cell needs to be periodically *refreshed* to fully restore its original charge.

### 2.3 DRAM Commands

The memory controller issues a set of DRAM commands to access data in the DRAM chip. To perform a read or write operation, the memory controller first needs to *open* a row, i.e., copy the data of the cells in the row to the row-buffer. To open a row, the memory controller issues an *activate (ACT)* command to a bank by specifying the address of the row to open. The memory controller can issue *ACT* commands to different banks in consecutive DRAM bus cycles to operate on *multiple banks in parallel*. After opening a row in a bank, the memory controller issues either a *READ* or a *WRITE* command to read or write a DRAM word (which is typically equal to 64 bytes) within the open row. An open row can serve multiple *READ* and *WRITE* requests without incurring precharge and activation delays. A DRAM row typically contains 4-8 KiB of data. To access data from another DRAM row in the same bank, the memory controller must first close the currently open row by issuing a *precharge (PRE)* command. The memory controller also periodically issues *refresh (REF)* commands to prevent data loss due to charge leakage.
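The command flow described above can be sketched as a small translation routine that keeps rows open until a different row in the same bank is requested. This is an illustrative model only: the function name `commands_for_accesses` and the tuple encoding are assumptions, and refresh and timing parameters are deliberately not modeled.

```python
def commands_for_accesses(accesses):
    """Translate (bank, row, op) accesses into a list of DRAM commands.

    `op` is "READ" or "WRITE". Each bank tracks at most one open row; a request
    to a different row in the same bank first closes the open row with PRE.
    """
    open_rows = {}  # bank -> currently open row
    commands = []
    for bank, row, op in accesses:
        if op not in ("READ", "WRITE"):
            raise ValueError(f"unsupported operation: {op}")
        current = open_rows.get(bank)
        if current != row:
            if current is not None:
                commands.append(("PRE", bank, current))  # close the open row
            commands.append(("ACT", bank, row))          # open the requested row
            open_rows[bank] = row
        commands.append((op, bank, row))                 # served from the row-buffer
    return commands


accesses = [(0, 10, "READ"), (0, 10, "WRITE"), (0, 11, "READ"), (1, 10, "READ")]
for command in commands_for_accesses(accesses):
    print(command)
```

In this trace, the second access to row 10 of bank 0 needs no new *ACT*, the access to row 11 requires a *PRE* followed by an *ACT*, and the access to bank 1 proceeds without closing any row in bank 0.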

### 2.4 DRAM Cell Operation

We describe DRAM operation by explaining the steps involved in reading data from a DRAM cell.<sup>1</sup> The memory controller initiates each step by issuing a DRAM command. Each step takes a certain amount of time to complete, and thus, a DRAM command is typically associated with one or more timing constraints known as *timing parameters*. It is the responsibility of the memory controller to satisfy these timing parameters in order to ensure *correct* DRAM operation.

In Figure 2-3, we show how the state of a DRAM cell changes during the steps involved in a read operation. Each DRAM cell diagram corresponds to the state of the cell at exactly the tick mark on the time axis. Each command (shown in purple boxes below the time axis) is issued by the memory controller at the corresponding tick mark. Initially, the cell is in a *precharged* state ①. When precharged, the capacitor of the cell is disconnected from the bitline since the wordline is not asserted and thus the access transistor is off. The bitline voltage is stable at  $\frac{V_{dd}}{2}$  and is ready to be perturbed towards the voltage level of the cell capacitor upon enabling the access transistor.



Figure 2-3: Command sequence for reading data from DRAM and the state of a DRAM cell during each related step.

To read data from a cell, the memory controller first needs to perform *row activation* by issuing an *ACT* command. During row activation (②), the row decoder asserts the wordline that connects the storage capacitor of the cell to the bitline by enabling the access transistor. At this point, the capacitor charge perturbs the bitline via the *charge sharing* process. Charge sharing continues until the capacitor and bitline voltages reach an equal value of  $\frac{V_{dd}}{2} + \delta$ . After charge sharing (③), the sense amplifier begins driving the bitline towards either  $V_{dd}$  or  $0\text{ V}$  depending on the direction of the perturbation in the charge sharing step. This step, which amplifies the voltage level on the bitline as well as the cell, is called *charge restoration*. Although charge restoration continues until the original capacitor charge is fully replenished (④), the memory controller can issue a *READ* command to safely read data from the activated row before the capacitor charge is fully replenished. A *READ* command can reliably be issued when the bitline voltage reaches the voltage level  $V_{read}$ . To ensure that the read occurs after the bitline reaches  $V_{read}$ , the memory controller inserts a time interval  $t_{RCD}$  between the *ACT* and *READ* commands. It is the responsibility of the DRAM manufacturer to ensure that their DRAM chip operates safely as long as the memory controller obeys the  $t_{RCD}$  timing parameter, which is defined in the DRAM standard [141]. If the memory controller issues a *READ* command before  $t_{RCD}$  elapses, the bitline voltage may be below  $V_{read}$ , which can lead to the reading of a wrong value.

---

<sup>1</sup>Although we focus only on reading data, steps involved in a write operation are similar.

To return a cell to its precharged state, the voltage in the cell must first be fully restored. A cell is expected to be fully restored when the memory controller satisfies a time interval dictated by  $t_{RAS}$  after issuing the *ACT* command. Failing to satisfy  $t_{RAS}$  may result in an insufficient amount of charge being restored in the cells of the accessed row. A subsequent activation of the row can then result in the reading of incorrect data from the cells.

Once the cell is successfully *restored* (④), the memory controller can issue a *PRE* command to close the currently-open row to prepare the bank for an access to another row. The cell returns to the precharged state (⑤) after waiting for the timing parameter  $t_{RP}$  following the *PRE* command. Violating  $t_{RP}$  may prevent the sense amplifiers from fully driving the bitline back to  $\frac{V_{dd}}{2}$ , which may later result in the row being activated with too little charge in its cells, potentially preventing the sense amplifiers from reading the data correctly.

For correct DRAM operation, it is critical for the memory controller to ensure that the DRAM timing parameters defined in the DRAM specification are *not* violated. Violating the timing parameters may cause incorrect data to be read from DRAM, and thus cause unexpected program behavior [191, 156, 120, 55, 58, 54, 190]. In this work, we study the failure modes due to violating DRAM timing parameters and explore their application to reliably generating true random numbers.
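The three timing parameters discussed in this section can be viewed as minimum separations between pairs of commands to the same bank. The sketch below checks a command trace against them. The timing values, the trace format, and the name `find_violations` are hypothetical and chosen only for illustration; real values come from the DRAM standard or the device datasheet.

```python
# Hypothetical timing values in nanoseconds (not taken from any datasheet).
TIMING_NS = {"tRCD": 18.0, "tRAS": 42.0, "tRP": 18.0}

# (earlier command, later command) -> timing parameter that must separate them
CONSTRAINTS = {
    ("ACT", "READ"): "tRCD",
    ("ACT", "PRE"): "tRAS",
    ("PRE", "ACT"): "tRP",
}


def find_violations(trace, timing=TIMING_NS):
    """Return (time_ns, command, parameter) for every violated constraint.

    `trace` is a time-ordered list of (time_ns, command) pairs for one bank.
    """
    last_issued = {}  # command -> time it was most recently issued
    violations = []
    for time_ns, command in trace:
        for (earlier, later), parameter in CONSTRAINTS.items():
            if later == command and earlier in last_issued:
                elapsed = time_ns - last_issued[earlier]
                if elapsed < timing[parameter]:
                    violations.append((time_ns, command, parameter))
        last_issued[command] = time_ns
    return violations


ok_trace = [(0.0, "ACT"), (18.0, "READ"), (42.0, "PRE"), (60.0, "ACT")]
early_read = [(0.0, "ACT"), (10.0, "READ"), (42.0, "PRE"), (60.0, "ACT")]
print(find_violations(ok_trace))
print(find_violations(early_read))
```

The second trace issues the *READ* only 10 ns after the *ACT*, which is the kind of deliberate $t_{RCD}$ violation studied in the rest of this dissertation.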

### 2.5 DRAM Failure Modes

As we describe in Section 2.4, the memory controller must satisfy timing parameters associated with DRAM commands for correct operation. We define *access latency failures* as failures that occur due to accessing a DRAM module with *any* reduced timing parameter. In this dissertation, we focus on *activation failures*, which are a special case of access latency failures, caused by reducing the  $t_{RCD}$  timing parameter.

An *activation failure* occurs due to insufficient time for the sense amplifier to drive the bitline to  $V_{access}$  (the voltage level referred to as  $V_{read}$  in Section 2.4). Depending on the reduction in the  $t_{RCD}$  parameter, there are two modes of *activation failure*. The first mode of activation failure results in transient failures in the returned data, but no failures in the data stored in the DRAM cells. In this case, the *next access* to the same row that satisfies the timing parameters would return correct data. Such a failure may happen when the bitline does *not* reach  $V_{access}$  prior to the read operation but the sense amplifier continues to drive the bitline towards the same direction (i.e., full 0 or 1) as the charge-sharing phase has already started.

The second mode of activation failure *destroys* the data stored in a DRAM cell such that future accesses (with default timing parameters) return failures. Such a failure may happen when, at the time the *READ* is issued, the bitline voltage level is sufficiently low. In this case, the read operation could significantly disturb the bitline such that the sense amplifier starts driving the bitline towards the opposite of the original direction. We observe both of the *activation failure* modes in our experiments with real LPDDR4 DRAM modules.

### 2.6 Violating Manufacturer-Specified Timing Parameters

Different cells in the same DRAM chip have different reliable operation latencies (for each timing parameter) due to two major reasons: 1) design (architectural) differences [190], and 2) process variation [191]. For example, a cell located closer to the sense amplifiers than an otherwise-equivalent cell can operate correctly with a lower  $t_{RCD}$  constraint [190] because the inherent latency to access a cell close to the sense amplifiers is lower. Similarly, a cell that happens to have a larger capacitor (due to manufacturing process variation) can operate reliably with tighter timing constraints than a smaller cell elsewhere in the same chip [191].

Because manufacturing process variation occurs in random and unpredictable locations within and across chips [73, 379, 199, 168, 191, 190, 58, 55, 54, 188, 170], the manufacturer-published timing parameters are chosen to ensure reliable operation of the *worst-case* cell in any acceptable device at the *worst-case* operating conditions (e.g., highest supported temperature, lowest supported voltage). This results in a large *safety margin* (or, *guardband*) for each timing parameter, which prior work shows can often be reliably reduced at *typical* operating conditions [55, 191, 53].

Prior work also shows that decreasing the timing parameters *too aggressively* results in failures, with increasing error rates observed for larger reductions in timing parameter values [206, 266, 155, 191, 260, 55, 119, 58, 120, 156, 158, 157]. Errors occur because, with reduced timing parameters, the internal DRAM circuitry is *not* allowed *enough time* to properly perform its functions and stabilize outputs before the memory controller issues the next command (Section 2.4).

# Chapter 3

## Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines

DRAM latency is a major bottleneck for many applications in modern computing systems. In this chapter, we rigorously characterize the effects of reducing DRAM access latency on 282 state-of-the-art LPDDR4 DRAM modules. As found in prior work on older DRAM generations (DDR3), we show that regions of LPDDR4 DRAM modules can be accessed with latencies that are significantly lower than manufacturer-specified values *without* causing failures. We present novel data that 1) further supports the viability of such latency reduction mechanisms and 2) exposes a variety of new cases in which access latencies can be effectively reduced. Using our observations, we propose a new low-cost mechanism, Solar-DRAM, that 1) identifies failure-prone regions of DRAM at reduced latency and 2) robustly reduces average DRAM access latency while maintaining data correctness, by issuing DRAM requests with reduced access latencies to non-failure-prone DRAM regions. We evaluate Solar-DRAM on a wide variety of multi-core workloads and show that for 4-core homogeneous workloads, Solar-DRAM provides an average (maximum) system performance improvement of 4.31% (10.87%) compared to using the default fixed DRAM access latency.

### 3.1 Motivation and Goal

Many prior works [119, 172, 187, 237, 242, 243, 381, 173, 192, 190] show that various important workloads exhibit *low* access locality and thus are unable to effectively exploit *row-buffer locality*. In other words, these workloads issue a significant number of DRAM accesses that result in bank (i.e., row buffer) conflicts, which *negatively* impact overall system performance. Each access that causes a bank conflict requires activating a closed row, a process whose latency is dictated by the  $t_{RCD}$  timing parameter. The memory controller must wait for  $t_{RCD}$  before issuing any other command to that bank. To reduce the overhead of bank conflicts, we aim to reduce the  $t_{RCD}$  timing parameter while maintaining data correctness.

**Prior Observations.** In a recent publication, Chang et al. [55] observe that activation failures 1) are *highly* constrained to specific columns of DRAM cells across an entire DRAM bank, i.e., global bitlines, and regions of memory that are closer to the row decoders, 2) can *only* affect cells within the cache line granularity of bits that is first requested in a closed row, and 3) propagate back into DRAM cells and become *permanent* failures in the stored data.

Based on these observations, Chang et al. propose FLY-DRAM, which *statically* profiles DRAM *global bitlines* as *weak* or *strong* using a one-time profiling step. During execution, FLY-DRAM relies on this *static* profile to access *weak* or *strong* global bitlines with *default* or *reduced*  $t_{RCD}$ , respectively.

Unfortunately, [55] falls short in three aspects. First, the paper lacks analysis of whether a *strong* bitline will ever become a *weak* bitline or vice versa. This analysis is necessary to demonstrate the viability of relying on a static profile of global bitlines to guarantee data integrity. Second, the authors present a characterization of activation failures on an older generation of DRAM (DDR3). Third, the proposed mechanism, FLY-DRAM, does not *fully* take advantage of all opportunities to reduce $t_{RCD}$ in modern DRAM modules (as we show in Section 3.3).

Given the shortcomings of prior work [55], **our goal** is to 1) present a more rigorous characterization of activation failures on *state-of-the-art LPDDR4* DRAM modules, 2) demonstrate the viability of mechanisms that rely on a static profile of weak cells to reduce DRAM access latency, and 3) devise new mechanisms that exploit *more activation failure characteristics* on *state-of-the-art LPDDR4* DRAM modules to further reduce DRAM latency.

## 3.2 Testing Methodology

To analyze DRAM behavior under reduced  $t_{RCD}$  values, we developed an infrastructure to characterize state-of-the-art LPDDR4 DRAM chips [141] in a thermally-controlled chamber. Our testing environment gives us precise control over DRAM commands and  $t_{RCD}$ , as verified via a logic analyzer probing the command bus. In addition, we determined the address mapping for internal DRAM row scrambling so that we could study the spatial locality of activation failures in the physical DRAM chip. We test for activation failures across a DRAM module using Algorithm 1. The key idea is to access every cache line across DRAM, and open a closed row on each access. This *guarantees* that we test every DRAM cell’s propensity for activation failure.

---

**Algorithm 1:** DRAM Activation Failure Testing

---

```
1 DRAM_ACT_fail_testing(data_pattern, reduced_tRCD):
2   write data_pattern (e.g., solid 1s) into all DRAM cells
3   foreach col in DRAM module:
4     foreach row in DRAM module:
5       refresh(row)                // replenish cell voltage
6       precharge(row)              // ensure next access activates row
7       read(col) with reduced_tRCD // induce activation failures on col
8       find and record activation failures
```

---

We first write a known data pattern to DRAM (Line 2) for consistent testing conditions. The *for loops* (Lines 3-4) ensure that we test all DRAM cache lines. For each cache line, we 1) refresh the row containing it (Line 5) to induce activation failures in cells with similar levels of charge, 2) precharge the row (Line 6), and 3) activate the row again with a *reduced* $t_{RCD}$ (Line 7) to induce activation failures. We then find and record the activation failures in the row (Line 8), by comparing the read data to the data pattern the row was initialized with. We experimentally determine that Algorithm 1 takes approximately 200ms to test a single bank.
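The control flow of Algorithm 1 can be illustrated with a short Python sketch that runs against a toy software model. The `ToyDram` class, its method names, and its failure-injection rule are hypothetical stand-ins introduced only for illustration; they do not describe the real testing infrastructure, and the model does not capture effects such as failures being written back into the cells.

```python
import random


class ToyDram:
    """Hypothetical software stand-in for one DRAM bank (illustration only)."""

    def __init__(self, rows, cols, weak_cols, fail_prob, seed=0):
        self.rows, self.cols = rows, cols
        self.weak_cols = set(weak_cols)   # columns assumed to be failure-prone
        self.fail_prob = fail_prob        # assumed failure probability at reduced tRCD
        self.rng = random.Random(seed)
        self.data = [[0] * cols for _ in range(rows)]

    def write_all(self, value):
        for r in range(self.rows):
            for c in range(self.cols):
                self.data[r][c] = value

    def refresh(self, row):
        pass  # replenishing cell charge is a no-op in this toy model

    def precharge(self, row):
        pass  # closing the row is a no-op in this toy model

    def read(self, row, col, reduced_trcd):
        value = self.data[row][col]
        weak = col in self.weak_cols
        if reduced_trcd and weak and self.rng.random() < self.fail_prob:
            return value ^ 1  # model an activation failure as a flipped value
        return value


def act_fail_testing(dram, data_pattern, reduced_trcd=True):
    """Mirror the loop structure of Algorithm 1; return failing (row, col) pairs."""
    dram.write_all(data_pattern)                       # Line 2
    failures = set()
    for col in range(dram.cols):                       # Line 3
        for row in range(dram.rows):                   # Line 4
            dram.refresh(row)                          # Line 5
            dram.precharge(row)                        # Line 6
            observed = dram.read(row, col, reduced_trcd)  # Line 7
            if observed != data_pattern:               # Line 8
                failures.add((row, col))
    return failures
```

In this sketch, every (row, column) pair is read immediately after a precharge, so each read models the first access to a closed row, which is the only access that can exhibit an activation failure.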

Unless otherwise specified, we perform all tests using 2y-nm LPDDR4 DRAM chips from three major manufacturers in a thermally-controlled chamber held at 55°C. We control the ambient temperature precisely using heaters and fans. A microcontroller-based PID loop controls the heaters and fans to within an accuracy of 0.25°C and a reliable range of 40°C to 55°C. We keep the DRAM temperature at 15°C above ambient temperature using a separate local heating source. This local heating source probes local on-chip temperature sensors to smooth out temperature variations due to self-induced heating.

## 3.3 Activation Failure Characterization

We present our extensive characterization of activation failures in modern LPDDR4 DRAM modules from three major DRAM manufacturers. We make a number of key observations that 1) support the viability of a mechanism that uses a *static profile* of weak cells to exploit variation in access latencies of DRAM cells, and 2) enable us to devise new mechanisms that exploit *more activation failure characteristics* to further reduce DRAM latency.

### 3.3.1 Spatial Distribution of Activation Failures

We first analyze the spatial distribution of activation failures across DRAM modules by visually inspecting bitmaps of activation failures across many DRAM banks. A *representative* 1024x1024 array of DRAM cells with a significant number of activation failures is shown in Figure 3-1. Using these bitmaps, we make three key observations.

Figure 3-1: Activation failure bitmap in 1024x1024 cell array.

**Observation 1:** Activation failures are highly constrained to *local bitlines*. We infer that the granularity at which we see bitline-wide activation failures is a subarray. This is because the number of consecutive rows with activation failures on the same bitline falls within the range of expected modern subarray sizes of 512 to 1024 [173, 190]. We hypothesize that this occurs as a result of process manufacturing variation at the level of the local sense amplifiers. Some sense amplifiers are manufactured “weaker” and *cannot* amplify data on the local bitline as quickly. This results in a higher probability of activation failures in DRAM cells attached to the same “weak” local bitline. While manufacturing process variation dictates the local bitlines that contain errors, the manufacturer's design decisions for subarray size dictate the number of cells attached to the same local bitline, and thus, the number of consecutive rows that contain activation failures in the same local bitline. **Observation 2:** Subarrays from Vendor B and C’s DRAM modules consist of 512 DRAM rows, while subarrays from Vendor A’s DRAM modules consist of 1024 DRAM rows. **Observation 3:** We find that within a set of subarray rows, very few rows (<0.001%) exhibit a significantly different set of cells that experience activation failures compared to the expected set of cells. We hypothesize that the rows with significantly different failures are rows that are *remapped* to redundant rows (see [156, 206]) after the DRAM module was manufactured (indicated in Figure 3-1).

We next study the granularity at which activation failures can be induced when accessing a row. We make two observations (also seen in prior work [55]). **Observation 4:** When accessing a row with low $t_{RCD}$, the errors in the row are constrained to the DRAM cache line granularity (typically 32 or 64 bytes), and only occur in the aligned 32 bytes that are first accessed in a closed row (i.e., up to 32 bytes are affected by a single low $t_{RCD}$ access). Prior work [55] also observes that failures are constrained to cache lines on a system with 64 byte cache lines. **Observation 5:** The first cache line accessed in a closed DRAM row is the *only* cache line in the row that we observe to exhibit activation failures. We hypothesize that DRAM cells that are subsequently accessed in the same row have enough time to have their charge amplified and completely restored for correct sensing.

We next study the proportion of weak subarray columns per bank across many DRAM banks from all 282 of our DRAM modules. We collect the proportion of weak subarray columns per bank across two banks from each of our DRAM modules across all three manufacturers. For a given bank, we aggregate the subarray columns that contain activation failures when accessed with reduced  $t_{RCD}$  across our full range of temperatures. **Observation 6:** We observe that banks from manufacturers A, B, and C have an average/maximum (standard deviation) proportion of weak subarray columns of 3.7%/96% (12%), 2.5%/100% (6.5%), and 2.2%/37% (4.3%), respectively. We find that on average, banks have a *very low proportion of weak subarray columns*, which means that the memory controller can issue DRAM accesses to *most* subarray columns with reduced  $t_{RCD}$ .

We next study how a real workload might be affected by reducing  $t_{RCD}$ . We use Ramulator [174, 3] to analyze the spatial distribution of accesses immediately following an ACTIVATE (i.e., accesses that can induce activation failures) across 20 workloads from the SPEC CPU2006 benchmark suite [5]. Figure 3-2 shows the probability that the first access to a newly-activated row is to a particular cache line offset within the row. For a given cache line offset (x-axis value), the probability is presented as a distribution of probabilities, found across the SPEC CPU2006 workloads. Each distribution of probabilities is shown as a box-and-whisker plot<sup>1</sup> where the probability

---

<sup>1</sup>A box-and-whisker plot emphasizes the important metrics of a dataset’s distribution. The box is lower-bounded by the first quartile (i.e., the median of the first half of the ordered set of data points) and upper-bounded by the third quartile (i.e., the median of the second half of the ordered set of data points). The median falls within the box. The *inter-quartile range* (IQR) is defined as the distance between the first and third quartiles, or the size of the box. Whiskers extend an additional





Figure 3-3: Data pattern dependence of the proportion of local bitlines with activation failures found over 16 iterations.

to highlight the accumulation rate of local bitlines with failures in earlier iterations.

For a given iteration, we calculate the *coverage of each data pattern* as:

$$\frac{\sum_{n=1}^x \text{unique\_local\_bitlines}(\text{data\_pattern}, \text{iteration}_n)}{\text{total\_local\_bitlines\_with\_failures}} \quad (3.1)$$

where *unique\_local\_bitlines()* is the number of local bitlines observed to contain failures in a given iteration but *not* observed to contain failures in any prior iteration when using a specific data pattern, and *total\_local\_bitlines\_with\_failures* is the total number of unique local bitlines observed to contain failures at *any* iteration, with *any* data pattern. The *coverage* of a single data pattern indicates how effectively that data pattern identifies the full set of local bitlines containing activation-failure-prone DRAM cells. **Observation 8:** Each walking pattern in a set of WALK1s or WALK0s (i.e., 16 walking 1 patterns and their inverses) finds a similar coverage of local bitlines over many iterations. Given Observation 8, we have already simplified Figure 3-3 by grouping the set of 16 walking 1 patterns and plotting the distribution of coverages of the patterns as a box-and-whisker plot (WALK1). We have done the same for the set of 16 walking 0 patterns (WALK0). **Observation 9:** The random data pattern exhibits the highest coverage of activation-failure-prone local bitlines across all three DRAM manufacturers. We hypothesize that the random data results in, on average across DRAM cells, the worst-case coupling noise of a DRAM cell and its neighbors. This is consistent with prior works' experimental observations that the random data pattern causes the highest rate of charge leakage in cells [260, 206, 155].
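As a worked illustration of Equation 3.1, the following Python sketch computes the cumulative coverage of one data pattern after each iteration. The function names and the representation of each iteration's result as a set of local bitline identifiers are assumptions made for this example. Summing the newly observed bitlines over iterations 1 through x is equivalent to taking the size of the union of all bitlines seen so far, which is what the code tracks.

```python
def total_bitlines_with_failures(results_by_pattern):
    """Count unique failing local bitlines over all patterns and iterations.

    results_by_pattern maps a data pattern name to a list of sets, one set
    of failing local-bitline IDs per iteration (hypothetical representation).
    """
    all_failing = set()
    for iterations in results_by_pattern.values():
        for failing in iterations:
            all_failing |= failing
    return len(all_failing)


def coverage(failures_per_iteration, total_failing):
    """Return the coverage of one data pattern after each iteration (Eq. 3.1)."""
    seen = set()
    result = []
    for failing in failures_per_iteration:
        seen |= failing                       # adds only not-yet-seen bitlines
        result.append(len(seen) / total_failing)
    return result
```

For example, if a pattern finds bitlines {1, 2}, then {2, 3}, then {3} in three iterations, and four bitlines fail overall across all patterns, its coverage after each iteration is 0.5, 0.75, and 0.75.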

### 3.3.3 Temperature Effects

We next study the effect of DRAM temperature (at the granularity of  $5^{\circ}C$ ) on the number of activation failures across a DRAM module (at reduced  $t_{RCD}$ ). We make similar observations as prior work [55] and see no clear correlation between the *total number* of activation failures across a DRAM device and DRAM temperature. However, when we analyze the activation failure rates at the granularity of a *local bitline*, we observe correlations between DRAM temperature and the number of activation failures in a *local bitline*.

To determine the effect of temperature on a local bitline's probability to contain cells with activation failures, we study activation failures on a local bitline granularity with a range of temperatures. For a set of  $5^{\circ}C$  intervals of DRAM temperature between  $55^{\circ}C$  and  $70^{\circ}C$ , we run 100 iterations of Algorithm 1, recording each cell's probability of failure across all our DRAM modules. We indicate a *local bitline's probability of failure* ( $F_{prob}$ ) as:

$$F_{prob} = \sum_{n=1}^{cells\_in\_SA\_bitline} \frac{num\_iters\_failed_{cell_n}}{num\_iters \times cells\_in\_SA\_bitline} \quad (3.2)$$

where *cells\_in\_SA\_bitline* indicates the number of cells in a local bitline, *num\_iters\_failed<sub>cell<sub>n</sub></sub>* indicates the number of iterations out of the 100 tested iterations in which *cell<sub>n</sub>* fails, and *num\_iters* is the total number of iterations that the DRAM module is tested for.
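Equation 3.2 amounts to averaging the per-cell failure rates over all cells on a local bitline. A minimal Python sketch follows; the function name and the list-based input format are assumptions made for illustration.

```python
def f_prob(num_iters_failed, num_iters):
    """Compute a local bitline's probability of failure (Eq. 3.2).

    num_iters_failed: one entry per cell on the local bitline, giving the
                      number of iterations in which that cell failed.
    num_iters:        total number of tested iterations.
    """
    cells_in_sa_bitline = len(num_iters_failed)
    return sum(num_iters_failed) / (num_iters * cells_in_sa_bitline)
```

For a hypothetical four-cell bitline whose cells fail in 100, 50, 0, and 50 of 100 iterations, `f_prob([100, 50, 0, 50], 100)` evaluates to 0.5.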

Figure 3-4 aggregates our data across 30 DRAM modules from each DRAM manufacturer. Each point in the figure represents the $F_{prob}$ of a local bitline at temperature $T$ on the x-axis (i.e., the baseline temperature) and the $F_{prob}$ of the same local bitline at temperature $T + 5$ on the y-axis (i.e., $5^{\circ}C$ above the baseline temperature). The $F_{prob}$ values at the baseline temperature are binned at the granularity of 1% and represent the range of $F_{prob} \pm 0.5\%$. We aggregate the $F_{prob}$ values at temperature $T + 5$ for every local bitline whose $F_{prob}$ at temperature $T$ falls within the same bin on the x-axis. We aggregate each set of $F_{prob}$ values with box-and-whisker plots to show how the $F_{prob}$ is generally affected by increasing the temperature. We draw each box-and-whisker plot with a blue box, orange whiskers, black whisker ends, and red medians. **Observation 10:** We observe that $F_{prob}$ at temperature $T + 5$ tends to be higher than $F_{prob}$ at temperature $T$ (i.e., the blue region of the figure is above the $x = y$ line). Thus, $F_{prob}$ tends to increase with increased temperature. However, there are cases (i.e., <25% of all data points) where the $F_{prob}$ decreases with an increased temperature. We conclude that in order to find a comprehensive set of weak subarray columns, we must profile for activation failures with a range (e.g., 40°C to 55°C) of DRAM temperatures.

Figure 3-4: Temperature effects on a local bitline's $F_{prob}$.

### 3.3.4 Latency Effects

We next study the effects of changing the value of  $t_{RCD}$  on activation failures. We sweep  $t_{RCD}$  between 2ns and 18ns (default) at the coarse granularity of 2ns, and we study the correlation of  $t_{RCD}$  with the total number of activation failures. We make *two* observations analogous to those made by Chang et al. [55]. **Observation 11:** We observe *no* activation failures when using  $t_{RCD}$  values above 14ns regardless of the temperature. The first  $t_{RCD}$  at which activation failures occur is 4ns below manufacturer-recommended values. This demonstrates the additional *guardband* that manufacturers place to account for process variation. **Observation 12:** We observe that a small reduction (i.e., by 2ns) in  $t_{RCD}$  results in a significant increase (>10x) in the number of activation failures.

In addition to repeating analyses on older generation modules [55], we are the first to study the effects of changing the $t_{RCD}$ value on the failure probability of an individual cell. **Observation 13:** We observe that, if a DRAM cell fails 100% of the time when accessed with a reduced $t_{RCD}$ of $n$, the same cell will likely fail between 0% and 100% of the time when $t_{RCD}$ is set to $n + 2$, and 0% of the time when $t_{RCD}$ is set to $n + 4$. We hypothesize that the large changes in activation failure probability are due to the coarse granularity with which we can change $t_{RCD}$ (i.e., 2ns; due to experimental infrastructure limitations). For this reason, we cannot observe the gradual changes in activation failure probability that we expect would occur at smaller intervals of $t_{RCD}$. We leave the exploration of correlating finer granularity changes of $t_{RCD}$ with the probability of activation failure of a DRAM cell to future work.

### 3.3.5 Short-term Variation

Many previous DRAM retention characterization works [21, 155, 206, 266, 344, 120, 58, 190, 284, 306, 170, 147, 157, 156, 366, 271, 151, 233, 260, 158] have shown that there is a well-known phenomenon called variable retention time (VRT), where variation occurs *over time* in DRAM circuit elements that results in significant and sudden changes in the leakage rates of charge from a DRAM cell. This affects the retention time of a DRAM cell over short-term intervals, resulting in varying retention failure probabilities for a given DRAM cell over the span of minutes or hours. To see if a similar time-based variation phenomenon affects the probability of an *activation failure*, we sample the $F_{prob}$ of many local bitlines every six hours over 14 days and study how $F_{prob}$ changes across the samples for a given local bitline. Figure 3-5 plots the change in $F_{prob}$ of a given local bitline from one time sample to another. For a given local bitline, every pair of sample $F_{prob}$ values (across the 14 day study) is plotted as an (x,y) pair. We collect these data points across all local bitlines in 30 DRAM modules (10 from each DRAM manufacturer) and plot the points. All points sharing the same $F_{prob}$ on the x-axis are aggregated into box-and-whisker plots. **Observation 14:** We find that the box-and-whisker plots show a tight distribution around the diagonal axis (where x equals y). This indicates that the $F_{prob}$ of a given local bitline remains highly similar (*correlation* $r = 0.94$) across time. *This means that a weak local bitline is very likely to remain weak and a strong local bitline is very likely to remain strong across time.* Thus, we can identify the set of weak local bitlines *once* and that set would remain *constant* across time. To determine how many profiling iterations are needed to find a comprehensive set of weak local bitlines, we run iterations of Algorithm 1 for each bank until we only observe either zero or one failing bit in a local bitline that has never been observed to fail before in the tested bank. At this point, we say that we have found the *entire* set of local bitlines containing activation failures. **Observation 15:** We find that the required number of iterations to find the entire set of local bitlines containing activation failures differs significantly across chips and manufacturers. The average/maximum (standard deviation) number of iterations required to find the entire set of local bitlines for manufacturers A, B, and C is 843/1411 (284.28), 162/441 (174.86), and 1914/1944 (26.28), respectively.

Figure 3-5: $F_{prob}$ of local bitlines across time.

### 3.3.6 DRAM Write Operations

We next study the effects of reduced $t_{RCD}$ on *write* operations. We hypothesize that $t_{RCD}$ is *mostly unnecessary* for DRAM write operations, because $t_{RCD}$ dictates the time required for the sense amplifiers to amplify the data in DRAM cells to an *I/O readable value* ($V_{access}$) such that reads can be correctly serviced. To determine the effects of reducing $t_{RCD}$ on DRAM write operations, we run two experiments with our DRAM modules. First, we sweep the value of $t_{RCD}$ between 2ns and 18ns, and write a known data pattern across DRAM. We then read every value in the DRAM array with the default $t_{RCD}$ and compare each read value with the expected value. We repeat this process 100 times using the *random* data pattern for each of our DRAM modules. We observe activation failures only when $t_{RCD}$ is set below 4ns. We conclude that we can reliably issue DRAM write operations to our LPDDR4 DRAM modules with a *significantly* reduced $t_{RCD}$ (i.e., 4ns; a reduction of 77%) without loss of data integrity.

## 3.4 Exploiting Activation Latency Variation

Based on our key observations from our extensive characterization of activation latency failures in DRAM (Section 3.3), we propose *Subarray-optimized Access Latency Reduction DRAM (Solar-DRAM)*, a mechanism that robustly reduces  $t_{RCD}$  for both DRAM read and write requests.

### 3.4.1 Solar-DRAM

Solar-DRAM consists of three components that exploit various observations on activation failures and memory access patterns. These three components are pure hardware approaches implemented within the memory controller without any DRAM changes and are invisible to applications.

**Component I: Variable-latency cache lines (VLC).** The first key observation that we exploit is that activation failures are highly constrained to *some* (or few) *local bitlines* (i.e., only 3.7%/2.5%/2.2% of subarray columns per bank are weak on average for DRAM manufacturers A/B/C, respectively; see Section 3.3.1), and the local bitlines with activation-failure-prone cells are *randomly* distributed across the chip (not shown). Given the known spatial distribution of activation failures, the memory controller can issue memory requests with varying activation latency depending on whether or not the access is to data contained in a “weak” local bitline. To enable such a mechanism, Solar-DRAM requires the use of a *weak subarray column profile* that identifies local bitlines as either *weak* or *strong*. However, since activation failures affect DRAM only at the granularity of a cache line (Section 3.3.1), Solar-DRAM needs only to store whether or not a column of cache-line-aligned DRAM cells within a subarray, i.e., a *subarray column*, contains a weak local bitline.

The second key observation that we exploit is that the failure probability of a cell, when accessed with a reduced  $t_{RCD}$ , is *not* vulnerable to short-term time variation, i.e., we did not observe significant changes within 14 days of testing (Section 3.3.5). This novel observation is necessary to ensure that a profile of weak local bitlines will *not change over time* and thus allows Solar-DRAM to rely on a static profile.<sup>2</sup>

Given a static profile of weak subarray columns, we can safely access the weak subarray columns with the default  $t_{RCD}$ , and *all other* subarray columns with a reduced  $t_{RCD}$ . We observe that after finding the initial set of failing columns there is still a very low probability (i.e.,  $< 5 \times 10^{-7}$ ) that a strong column will result in a single error. Fortunately, we find this probability to be low enough such that employing error correction codes (ECC) [151, 246, 115, 48], which are already present in modern DRAM chips, would transparently mitigate low-probability activation failures in strong columns.

---

<sup>2</sup>We acknowledge that we do *not* consider long-term variation that may arise from aging or wearout of circuit components. We leave this exploration to future work. Such long-term effects can have implications for a static profile (as discussed in DIVA-DRAM [190]), but one can devise a mechanism that updates the profile at regular long time intervals with low overhead, e.g., as in prior work [260, 266].

**Component II: Reordered subarray columns (RSC).** We observe in Section 3.3.1 that the memory controller accesses the $0^{th}$ cache line of a newly-activated DRAM row with the highest probability compared to the rest of the cache lines. Thus, we would like to devise a mechanism that reduces access latency (i.e., $t_{RCD}$) *specifically to the $0^{th}$ cache line in each row* because the first accessed cache line in a newly-activated row is most affected by $t_{RCD}$. To this end, we propose a mechanism that scrambles column addresses such that the $0^{th}$ cache line in a row is *unlikely* to get mapped to weak subarray columns. Given a weak subarray column profile, we identify the *global column* (i.e., the column of cache-line-aligned DRAM cells across a full DRAM bank) containing the fewest weak subarray columns, called the *strongest global column*. We then scramble the column address bits such that the 0<sup>th</sup> cache line for each bank maps to the *strongest global column* in the bank. We perform this scrambling by changing the DRAM address mapping at the granularity of the global column, in order to reduce the overhead in address scrambling.
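The RSC idea can be sketched in a few lines of Python. The profile layout (`weak_profile[subarray][column]`), the function names, and in particular the use of an XOR with the strongest column index as the scrambling function are assumptions made for illustration; the text above does not specify the exact scrambling function. The XOR form assumes the number of columns per row is a power of two, so that the remapping stays in range and remains a one-to-one mapping.

```python
def strongest_global_column(weak_profile):
    """Return the global column index with the fewest weak subarray columns.

    weak_profile[subarray][column] is True if that subarray column is weak
    (hypothetical layout for one bank).
    """
    num_cols = len(weak_profile[0])
    weak_counts = [
        sum(1 for subarray in weak_profile if subarray[col])
        for col in range(num_cols)
    ]
    # min() returns the lowest index among ties
    return min(range(num_cols), key=lambda col: weak_counts[col])


def remap_column(col, strongest):
    """Scramble a column address so that cache line 0 maps to `strongest`.

    XOR is an assumed scrambling function: it is its own inverse and
    permutes the columns when the column count is a power of two.
    """
    return col ^ strongest
```

With this choice, a request to cache line 0 is always served from the strongest global column, and every other cache line is moved to a distinct column, so no two cache lines collide.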

**Component III: Reduced latency for writes (RLW).** The final observation that we exploit in Solar-DRAM is that write operations do *not* require the default  $t_{RCD}$  value (Section 3.3.6). To exploit this observation, we use a reliable, reduced  $t_{RCD}$  (i.e., 4ns, as measured with our experimental infrastructure) for *all* write operations to DRAM.

### 3.4.2 Static Profile of Weak Subarray Columns

To obtain the static profile of weak subarray columns, we run multiple iterations of Algorithm 1, recording all subarray columns containing observed activation failures. As we observe in Section 3.3, there are various factors that affect a local bitline’s probability of failure ($F_{prob}$). We use these factors to determine a method for identifying a comprehensive profile of weak subarray columns for a given DRAM module. First, we use our observation on the accumulation rate of finding weak local bitlines (Section 3.3.5) to determine the number of iterations we expect to test each DRAM module. However, since there is such high variation across each DRAM module (as seen in the standard deviations of the distributions in Observation 15), we can only provide the expected number of iterations needed to find a comprehensive profile for DRAM modules of a manufacturer, and the time to profile depends on the module. We show in Section 3.3.2 that no *single* data pattern alone finds a high coverage of weak local bitlines. This indicates that we must test each data pattern (40 data patterns) for the expected number of iterations needed to find a comprehensive profile of a DRAM module for a range of temperatures (Section 3.3.3). While this could result in many iterations of testing (on the order of a few thousand; see Section 3.3.5), this is a one-time process on the order of half a day per bank that results in a reliable profile of weak subarray columns. The required one-time profiling can be performed in two ways: 1) the system running Solar-DRAM can profile a DRAM module when the memory controller detects a new DRAM module at bootup, or 2) the DRAM manufacturer can profile each DRAM module and provide the profile within the *Serial Presence Detect* (SPD) circuitry (a Read-Only Memory present in each DIMM) [142].

To minimize the storage overhead of the weak subarray column profile in the memory controller, we encode each subarray column with a bit indicating whether or not to issue accesses to it with a reduced  $t_{RCD}$ . After profiling DRAM, the memory controller loads the weak subarray column profile once into a small lookup table in the DRAM channel’s memory controller.<sup>3</sup> For any DRAM request, the memory controller references the lookup table with the subarray column that is being accessed. The memory controller determines the  $t_{RCD}$  timing parameter according to the value of the bit found in the lookup table.
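The lookup described above can be sketched as a packed bit table with one bit per subarray column. The class and function names, the bit-packing order, and the indexing scheme are assumptions made for illustration; the cycle counts are the values used in our evaluation (Table 3.1). The write path follows Component III and always uses the reduced write latency.

```python
DEFAULT_TRCD_CYCLES = 29  # baseline tRCD
REDUCED_TRCD_CYCLES = 18  # reduced tRCD for strong subarray columns
WRITE_TRCD_CYCLES = 7     # reduced tRCD for all writes


class WeakColumnTable:
    """One bit per subarray column: 1 = weak (use default tRCD)."""

    def __init__(self, num_banks, subarrays_per_bank, columns_per_row):
        self.subarrays_per_bank = subarrays_per_bank
        self.columns_per_row = columns_per_row  # row_size / cacheline_size
        num_bits = num_banks * subarrays_per_bank * columns_per_row
        self.bits = bytearray((num_bits + 7) // 8)

    def _index(self, bank, subarray, column):
        per_bank = self.subarrays_per_bank * self.columns_per_row
        return bank * per_bank + subarray * self.columns_per_row + column

    def mark_weak(self, bank, subarray, column):
        i = self._index(bank, subarray, column)
        self.bits[i // 8] |= 1 << (i % 8)

    def is_weak(self, bank, subarray, column):
        i = self._index(bank, subarray, column)
        return bool((self.bits[i // 8] >> (i % 8)) & 1)


def select_trcd(table, bank, subarray, column, is_write):
    """Pick the tRCD (in cycles) for a request that activates a closed row."""
    if is_write:
        return WRITE_TRCD_CYCLES
    if table.is_weak(bank, subarray, column):
        return DEFAULT_TRCD_CYCLES
    return REDUCED_TRCD_CYCLES
```

For a configuration with 8 banks, 64 subarrays per bank, and 64 cache lines per row (2KB rows with 32-byte cache lines), `WeakColumnTable(8, 64, 64)` allocates 32,768 bits, i.e., 4KB of storage.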

## 3.5 Solar-DRAM Evaluation

We first discuss our evaluation methodology and evaluated system configurations. We then present our multi-core simulation results for our chosen system configurations.

### 3.5.1 Evaluation Methodology

**System Configurations.** We evaluate the performance of Solar-DRAM on a 4-core system using Ramulator [174, 3], an open-source cycle-accurate DRAM simulator, in CPU-trace-driven mode. We analyze various real workloads with traces from the SPEC CPU2006 benchmark [5] that we collect using Pintool [215]. Table 3.1 shows the configuration of our evaluated system. We use the standard LPDDR4-3200 [141] timing parameters as our baseline. To give a conservative estimate of Solar-DRAM’s performance improvement, we simulate with a 64B cache line and a subarray size of 1024 rows.<sup>4</sup>

---

<sup>3</sup>To store the lookup table for a DRAM channel, we require $num\_banks \times num\_subarrays\_per\_bank \times \frac{row\_size}{cacheline\_size}$ bits, where $num\_subarrays\_per\_bank$ is the number of subarrays in a bank, $row\_size$ is the size of a DRAM row in bits, and $cacheline\_size$ is the size of a cache line in bits. For a 4GB DRAM module with 8 banks, 64 subarrays per bank, 32-byte cache lines, and 2KB per row, the lookup table requires 4KB of storage.

| Component | Configuration |
|-----------|---------------|
| **Processor** | 4 cores, 4 GHz, 4-wide issue, 8 MSHRs/core, OoO 128-entry window |
| **LLC** | 8 MiB shared, 64B cache line, 8-way associative |
| **Memory Controller** | 64-entry R/W queue, FR-FCFS [275, 383] |
| **DRAM** | LPDDR4-3200 [141], 2 channels, 1 rank/channel, 8 banks/rank, 64K rows/bank, 1024 rows/subarray, 8 KiB row-buffer, Baseline: $t_{RCD}/t_{RAS}/t_{WR} = 29/67/29$ cycles (18.125/41.875/18.125 ns) |
| **Solar-DRAM** | reduced $t_{RCD}$ for requests to strong cache lines: 18 cycles (11.25ns)<br>reduced $t_{RCD}$ for write requests: 7 cycles (4.375ns) |

Table 3.1: Evaluated system configuration.
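The cycle and nanosecond values in Table 3.1 are related by the command clock period. The sketch below assumes a 1600 MHz clock (0.625 ns per cycle), which is the value implied by the table's own cycle/ns pairs for LPDDR4-3200; the constant and function names are illustrative.

```python
CLOCK_MHZ = 1600  # assumed clock: half the 3200 MT/s data rate


def cycles_to_ns(cycles):
    """Convert a timing parameter from clock cycles to nanoseconds."""
    return cycles * 1000 / CLOCK_MHZ


# Timing parameters from Table 3.1, in cycles
baseline_trcd_ns = cycles_to_ns(29)  # 18.125 ns
baseline_tras_ns = cycles_to_ns(67)  # 41.875 ns
reduced_trcd_ns = cycles_to_ns(18)   # 11.25 ns
write_trcd_ns = cycles_to_ns(7)      # 4.375 ns
```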

**Solar-DRAM Configuration.** To evaluate Solar-DRAM and FLY-DRAM [55] on a variety of different DRAM modules with unique properties, we simulate varying 1) the number of weak subarray columns per bank from $n = 1$ to 512, and 2) the chosen weak subarray columns in each bank. For a given $n$, i.e., weak subarray column count, we generate 10 *unique* profiles with $n$ randomly chosen weak subarray columns per bank. The profile indicates whether a subarray column should be accessed with the default $t_{RCD}$ (29 cycles; 18.13 ns) or the reduced $t_{RCD}$ (18 cycles; 11.25 ns). We use these profiles to evaluate 1) Solar-DRAM’s three components (described in Section 3.4.1) independently, 2) Solar-DRAM with all its three components, 3) FLY-DRAM [55], and 4) our baseline LPDDR4 DRAM.
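One way to generate such profiles is sketched below in Python. The function name, the representation of a profile as one set of weak subarray column indices per bank, and the fixed seed are assumptions made for illustration, not a description of the scripts used in our evaluation.

```python
import random


def generate_profiles(n_weak, columns_per_bank, num_banks,
                      num_profiles=10, seed=0):
    """Generate distinct random weak-subarray-column profiles.

    Each profile is a tuple with one frozenset per bank, holding the
    indices of the n_weak subarray columns marked weak in that bank.
    Note: loops forever if fewer than num_profiles distinct profiles exist.
    """
    rng = random.Random(seed)
    profiles = set()
    while len(profiles) < num_profiles:
        profile = tuple(
            frozenset(rng.sample(range(columns_per_bank), n_weak))
            for _ in range(num_banks)
        )
        profiles.add(profile)  # duplicates are dropped by the set
    return list(profiles)
```

For instance, `generate_profiles(4, 8192, 8)` returns 10 distinct profiles for an 8-bank configuration with 8,192 subarray columns per bank, each marking 4 columns per bank as weak.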

*Variable latency cache lines* (VLC) directly uses a weak subarray column profile to determine whether an access should be issued with a reduced or default $t_{RCD}$ value. *Reordered subarray columns* (RSC) takes a profile and maps the 0<sup>th</sup> cache line to the *strongest global column* in each bank. For a given profile, this maximizes the probability that any access to the 0<sup>th</sup> cache line of a row will be issued with a reduced $t_{RCD}$. *Reduced latency for writes* (RLW) reduces $t_{RCD}$ to 7 cycles (4.38 ns) (Section 3.3.6) for *all* write operations to DRAM. *Solar-DRAM* (Section 3.4.1) combines all three components (*VLC*, *RSC*, and *RLW*). Since *FLY-DRAM* [55] issues read requests at the granularity of the *global column* depending on whether a global column contains weak bits, we evaluate *FLY-DRAM* by taking a weak subarray column profile and extending each weak subarray column to the global column containing it. Baseline LPDDR4 uses a fixed $t_{RCD}$ of 29 cycles (18.13 ns) for all accesses. We present performance improvement of the different mechanisms over this LPDDR4 baseline.

---

<sup>4</sup>Using the typical upper-limit values for these configuration variables reduces the total number of subarray columns that comprise DRAM (to 8,192 subarray columns per bank). A smaller number of subarray columns reduces the granularity at which we can issue DRAM accesses with reduced $t_{RCD}$, which reduces Solar-DRAM’s potential for performance benefit. This is because a single activation failure requires the memory controller to access *larger* regions of DRAM with default $t_{RCD}$.
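The profile construction and the FLY-DRAM extension described in this subsection can be sketched as follows. This is a minimal illustration, not the simulator's implementation: the bank geometry (`SUBARRAYS`, `COLUMNS`) and all function names are assumptions chosen for the example.

```python
import random

# Illustrative geometry (assumptions of this sketch, not a statement of the
# simulated configuration): 64 subarrays per bank and 16 columns per
# subarray, i.e., 1024 subarray columns per bank.
SUBARRAYS = 64
COLUMNS = 16

def make_profile(n_weak, rng):
    """Randomly choose n_weak distinct weak (subarray, column) pairs for one bank."""
    all_columns = [(s, c) for s in range(SUBARRAYS) for c in range(COLUMNS)]
    return set(rng.sample(all_columns, n_weak))

def extend_to_global_columns(profile):
    """FLY-DRAM-style view: a global column is treated as weak in every
    subarray if it contains at least one weak subarray column."""
    weak_global = {c for (_, c) in profile}
    return {(s, c) for s in range(SUBARRAYS) for c in weak_global}

def uses_reduced_trcd(weak_set, subarray, column):
    """A read may use the reduced tRCD only if its location is not marked weak."""
    return (subarray, column) not in weak_set

rng = random.Random(0)                                  # fixed seed for reproducibility
profiles = [make_profile(16, rng) for _ in range(10)]   # 10 profiles for n = 16
fly_view = extend_to_global_columns(profiles[0])
print(len(profiles[0]), "weak subarray columns in the profile;",
      len(fly_view), "locations treated as weak in the FLY-DRAM view")
```

The second number is always at least as large as the first, which is the granularity effect that the evaluation attributes to FLY-DRAM's use of global columns.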

### 3.5.2 Multi-core Evaluation Results

Figure 3-6 plots the improvement in weighted speedup [301], which corresponds to system throughput [79], over the baseline on 20 homogeneous mixes of 4-core workloads and 20 heterogeneous mixes of 4-core workloads randomly combined from the set of workloads in the SPEC CPU2006 benchmark suite [5]. For each configuration of *<weak subarray column count, weak subarray column profile, mechanism, workload mix>*, we aggregate all weighted speedup improvement results into a box-and-whisker plot.
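For readers unfamiliar with the metric, the sketch below assumes the common definition of weighted speedup: the sum, over all applications in a mix, of each application's IPC when sharing the system divided by its IPC when running alone. The IPC values are made up for illustration and are not results from this evaluation.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-application IPCs in the shared run, each normalized to the
    IPC the same application achieves when running alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def improvement_percent(ws_mechanism, ws_baseline):
    """Relative weighted speedup improvement of a mechanism over the baseline."""
    return 100.0 * (ws_mechanism / ws_baseline - 1.0)

# Hypothetical 4-core mix (illustrative numbers only).
ipc_alone    = [1.20, 0.80, 1.50, 0.60]
ipc_baseline = [0.90, 0.50, 1.10, 0.40]
ipc_reduced  = [0.94, 0.52, 1.14, 0.42]   # same mix with lower access latency

ws_base = weighted_speedup(ipc_baseline, ipc_alone)
ws_new = weighted_speedup(ipc_reduced, ipc_alone)
print(f"improvement: {improvement_percent(ws_new, ws_base):.2f}%")
```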

Figure 3-6: Weighted speedup improvements of Solar-DRAM, its three individual components, and FLY-DRAM over baseline LPDDR4 DRAM, evaluated over various 4-core workload mixes from the SPEC CPU2006 benchmark suite.

We make five key observations. First, Solar-DRAM provides significant weighted speedup improvement. Even when half of the subarray columns are classified as weak (which is very unrealistic and conservative, as our experiments on real DRAM modules show), Solar-DRAM improves performance by 4.03% (7.71%) for heterogeneous and 3.36% (8.80%) for homogeneous workloads. In the ideal case, where there are 0 weak subarray columns per bank and thus, the memory controller issues *all* memory accesses with a reduced $t_{RCD}$, Solar-DRAM improves performance by 4.97% (8.79%) for heterogeneous and 4.31% (10.87%) for homogeneous workloads. Second, each individual component of Solar-DRAM improves system performance. $RLW$ is the best alone: it improves performance by 2.92% (5.90%) for heterogeneous and 2.25% (6.59%) for homogeneous workloads. Because $RLW$ is independent of the number of weak subarray columns in a bank, its weighted speedup improvement is constant regardless of the number of weak subarray columns per bank. Third, Solar-DRAM provides higher performance improvement than each of its components, demonstrating that the combination of $VLC$, $RSC$, and $RLW$ is *synergistic*. Fourth, Solar-DRAM provides much higher performance improvement than FLY-DRAM. This is because Solar-DRAM 1) exploits the observation that *all write requests* can be issued with a greatly reduced $t_{RCD}$ (i.e., by 77%), and 2) issues read requests with reduced $t_{RCD}$ at the granularity of the *local* bitline rather than the global bitline. This means that for a single weak cache line in a subarray, Solar-DRAM issues read requests with default $t_{RCD}$ *only* to cache lines in the *subarray column* containing the weak cache line, while FLY-DRAM would issue read requests with default $t_{RCD}$ to *all* cache lines in the column across the *full bank*. For this very same reason, we also observe that $VLC$ alone outperforms FLY-DRAM. Fifth, Solar-DRAM enables significantly higher performance improvement on DRAM modules with a high rate of activation failures, where FLY-DRAM provides no benefit. Because FLY-DRAM categorizes columns across the *entire bank* as strong or weak, even a low activation failure rate across the DRAM chip results in a high number of cache lines requiring the default $t_{RCD}$ timing parameter in FLY-DRAM.

We experimentally observe the average proportion of weak subarray columns per bank to be 3.7%/2.5%/2.2% for DRAM manufacturers A/B/C (Section 3.3.1). Even at such a low proportion of weak subarray columns (i.e., 38/26/23 subarray columns out of 1024 subarray columns in our evaluated DRAM configuration), we expect the performance benefit of FLY-DRAM to be well below 1.6% (i.e., the median performance benefit when we evaluate FLY-DRAM with 16 weak subarray columns in Figure 3-6 across all workload mixes) for DRAM manufacturers B and C, and 0% for DRAM manufacturer A. We conclude that Solar-DRAM's three components provide significant performance improvement on modern LPDDR4 DRAM modules over both baseline LPDDR4 DRAM and FLY-DRAM.

## 3.6 Related Work

Many works seek to improve DRAM access latency. They can be classified according to the mechanisms they take advantage of, as follows.

**Static Variation.** We have already described these works [55] in detail in Section 3.1 and compared Solar-DRAM to FLY-DRAM [55] in Section 3.5, where Solar-DRAM outperforms FLY-DRAM. Das et al. [70] propose a method to reduce *refresh latency*, which is orthogonal to Solar-DRAM.

**Operational Factors.** Prior works improve DRAM latency by controlling or taking advantage of changes in operational factors such as temperature [191] and voltage [58]. These works are orthogonal to Solar-DRAM since they reduce latency in response to changes in factors that are independent of latency variations inherent to the DRAM module.

**Access Locality.** Some work exploits locality in DRAM access patterns [119, 349, 299] and reorganizes DRAM accesses to allow for higher locality [243, 186, 290, 295] in order to reduce average DRAM access latency. These can be combined with Solar-DRAM for further latency reduction.

**Modifications to DRAM Architecture.** Various works [56, 57, 125, 173, 192, 213, 288, 292, 293, 294, 291, 305, 376, 216, 349] propose mechanisms that change the structure of DRAM to reduce latency. Solar-DRAM requires *no* changes to the DRAM chip.

**Software Support.** Several works [74, 144, 187, 253] propose using compile-time optimizations to improve DRAM access locality and thus, decrease overall DRAM access latency. Solar-DRAM reduces the latency of the average memory access and would provide added benefits to software optimizations. If the profile of weak subarray columns is exposed to the compiler or the system software, the software could potentially use this device-level information to allocate latency-critical data at *stronger* locations in DRAM, while decreasing the hardware overhead of storing weak subarray column profiles in the memory controller.

## 3.7 Limitations

Solar-DRAM has a few limitations that must be considered when analyzing the viability of such a mechanism on a real system.

First, our characterization in Solar-DRAM demonstrates that failures due to low-latency accesses are generally localized to specific local bitlines. However, low-probability bit flips may occur less predictably outside of the "weak" local bitlines. Solar-DRAM relies on Error Correcting Code (ECC) hardware to handle such low-probability bit flips that are not captured by the Solar-DRAM profile. On DRAM devices that lack ECC hardware, or whose ECC has no spare correction capability for additional bit flips, applications using Solar-DRAM will likely observe data errors. We expect future work to enhance Solar-DRAM by enabling the ability to either 1) comprehensively predict all weak subarray cache lines with a better profiling methodology, 2) predict low-probability bit flips and mitigate them before they occur, or 3) understand why these low-probability bit flips occur and eliminate them completely.

Second, a number of unknowns prevent us from providing sampling statistics for our data. These unknowns include 1) the percentage of DRAM modules that we characterize compared to the total number of available modules (billions), 2) the total number of DRAM chips available for a given DRAM type, technology, and process size, and 3) the inability to randomly sample DRAM modules. Without this knowledge, we are unable to provide confidence intervals or margins of sampling error. Therefore, the profile storage sizes and performance benefit distributions may not be representative of the average DRAM device. However, given that we were able to see the overarching trend of weak subarray bitlines in each DRAM chip that we tested, we are confident that this effect is prevalent across chips of the DRAM types and process nodes that we tested and all chips should be able to benefit from Solar-DRAM with low storage overhead.

Third, we did not account for long-term DRAM aging effects during characterization. Thus, we are not able to confidently determine how long a profile will remain viable, and how often re-profiling will need to occur in order to reliably issue low latency accesses according to the Solar-DRAM profile. However, we do show that a profile will likely remain viable after 14 days of usage. In the worst case, the system may require profiling the DRAM chip every 14 days in order to create a reliable profile. However, we believe this is a conservative estimate, and there may be more intelligent ways for identifying gradual changes to the profile during normal execution.

## 3.8 Summary

We introduced 1) a rigorous characterization of activation failures across 282 *real state-of-the-art LPDDR4* DRAM modules, 2) Solar-DRAM, whose key idea is to exploit our observations and issue DRAM accesses with variable latency depending on the target DRAM location’s propensity to fail with reduced access latency, and 3) an evaluation of Solar-DRAM and its three individual components, with comparisons to the state-of-the-art [55]. We find that Solar-DRAM provides significant performance improvement over the state-of-the-art DRAM latency reduction mechanism across a wide variety of workloads, *without* requiring any changes to DRAM chips or software.

## Chapter 4

# The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices

Physically Unclonable Functions (PUFs) are commonly used in cryptography to identify devices based on the uniqueness of their *physical microstructures*. DRAM-based PUFs have numerous advantages over PUF designs that exploit alternative substrates: DRAM is a major component of many modern systems, and a DRAM-based PUF can generate many unique identifiers. However, none of the prior DRAM PUF proposals provide implementations suitable for runtime-accessible PUF evaluation on commodity DRAM devices. Prior DRAM PUFs exhibit unacceptably high latencies, especially at low temperatures (e.g., >125.8s on average for a 64KiB memory segment below 55°C), and they cause high system interference by keeping part of DRAM unavailable during PUF evaluation.

In this chapter, we introduce the DRAM latency PUF, a new class of fast, reliable DRAM PUFs. The key idea is to reduce DRAM read access latency below the reliable datasheet specifications using software-only system calls. Doing so results in error patterns that reflect the compound effects of manufacturing variations in various DRAM structures (e.g., capacitors, wires, sense amplifiers). Based on a rigorous experimental characterization of 223 modern LPDDR4 DRAM chips, we demonstrate that these error patterns 1) satisfy runtime-accessible PUF requirements, and 2) are quickly generated (i.e., at 88.2ms) irrespective of operating temperature using a real system with no additional hardware modifications. We show that, for a constant DRAM capacity overhead of 64KiB, our implementation of the DRAM latency PUF enables an average (minimum, maximum) PUF evaluation time speedup of 152x (109x, 181x) at 70°C and 1426x (868x, 1783x) at 55°C when compared to a DRAM retention PUF and achieves greater speedups at even lower temperatures.

## 4.1 Physical Unclonable Functions

A Physically Unclonable Function (PUF) maps a set of *input parameters* to unique, device-specific signatures that can be generated *repeatably* and *reliably*. We refer to the process of generating a signature using a given set of input parameters as the *evaluation* of a PUF. The resulting signature reflects a device’s inherent, random physical variations introduced during manufacturing. This property guarantees that the signature is practically impossible to predict or replicate without access to the device itself [365, 91]. These characteristics enable PUFs to be frequently used in security applications such as low-cost authentication mechanisms against system security attacks and prevention of integrated circuit (IC) counterfeiting [318, 364].

PUFs are generally used in a challenge-response (CR) protocol [318], in which a trusted server gives a device a *challenge* (i.e., a set of input parameters and conditions with which to evaluate a PUF), and verifies the device’s *PUF response* (i.e., the signature generated by the PUF). A CR protocol generally consists of two phases: *enrollment* and *authentication*. Enrollment is a one-time setup phase in which a given device is analyzed, and all possible PUF responses are stored in the trusted server. Authentication occurs when an application running on the enrolled device requests escalated permissions from the trusted server to perform a secure action. The server provides a challenge to the application, which then evaluates the PUF with the requested parameters and returns the PUF response. If the response matches with the previously-enrolled response for the challenge, i.e., the *golden key*, authentication is successful. The CR can be done *statically*, where the PUF is evaluated only once before runtime (e.g., at bootup) or at *runtime*, where an application running on the enrolled device can evaluate a PUF on-demand [364].
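The two phases of the CR protocol can be sketched as follows. This is a toy model under stated assumptions: `SimulatedDevice` replaces the physical PUF with a hash of a hidden per-device seed, and the class and method names are hypothetical. It is meant only to show the enrollment and authentication flow, not a secure implementation.

```python
import hashlib

class SimulatedDevice:
    """Stand-in for a PUF-equipped device. A real PUF derives its response
    from physical variation; here a hidden per-device seed plays that role."""
    def __init__(self, seed: bytes):
        self._seed = seed

    def evaluate_puf(self, challenge: bytes) -> bytes:
        return hashlib.sha256(self._seed + challenge).digest()

class TrustedServer:
    def __init__(self):
        self._golden = {}   # (device_id, challenge) -> golden key

    def enroll(self, device_id, device, challenges):
        """One-time setup: record the device's response to every challenge."""
        for challenge in challenges:
            self._golden[(device_id, challenge)] = device.evaluate_puf(challenge)

    def authenticate(self, device_id, challenge, response):
        """Succeeds only if the response matches the enrolled golden key."""
        return self._golden.get((device_id, challenge)) == response

server = TrustedServer()
genuine = SimulatedDevice(b"device-A-variation")
counterfeit = SimulatedDevice(b"device-B-variation")
challenges = [b"segment-0", b"segment-1"]

server.enroll("A", genuine, challenges)                      # enrollment phase
ok = server.authenticate("A", challenges[0], genuine.evaluate_puf(challenges[0]))
bad = server.authenticate("A", challenges[0], counterfeit.evaluate_puf(challenges[0]))
print(ok, bad)
```

A real PUF response is noisy, so a practical server would compare the response to the golden key with a similarity threshold rather than exact equality.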

PUFs for silicon devices were first introduced as a method for integrated circuit (IC) identification, exploiting manufacturing process variation among devices for *disambiguating* different devices [212]. Since then, many prior works have proposed PUF evaluation techniques for different substrates (e.g., ASICs, FPGAs, memories), exploiting manufacturing variation in different components such as emerging memory technologies [277, 177, 136, 342], flash memory [353], Application Specific Integrated Circuit (ASIC) logic [123, 335, 108, 195, 200, 226, 278, 251, 252, 116, 92, 91, 314, 179, 309, 221, 338, 371], Static Random Access Memory (SRAM) [107, 126, 127, 67, 32, 382, 359, 19], and Dynamic Random Access Memory (DRAM) [318, 154, 117, 328, 329, 268, 322, 319].

PUFs must satisfy *five* key characteristics to be effective in security applications [328, 364, 117, 318, 222]. We describe these characteristics in detail in Section 4.3.1. PUFs satisfying these characteristics 1) guarantee a level of robustness for *disambiguating* many devices and 2) are practically impossible for an attacker to duplicate *without* access to the physical device itself. In addition to these properties, a *runtime-accessible* PUF, i.e., a PUF that is accessible online to an application running on an enrolled device, must 1) be easily evaluated with *low latency* to prevent unnecessary slowdown of the application requesting authentication, and 2) provide *low system interference*, i.e., minimize the disturbance PUF evaluation causes to other applications running on the same system. Section 4.3.2 describes the characteristics of *ideal runtime-accessible PUFs*.

## 4.2 Motivation and Goal

DRAM-based PUFs, henceforth called *DRAM PUFs*, have recently attracted significant interest for two key reasons: 1) DRAM is already used in a wide variety of modern systems [238, 244], ranging from embedded devices to servers, and 2) DRAM’s large address space, which is on the order of gigabytes or terabytes, makes it naturally suitable for CR applications by providing a greater CR space relative to smaller components (e.g., SRAMs) [107, 126, 108, 127, 67, 32, 382, 359, 19]. Prior DRAM PUF proposals exploit variations in DRAM start-up values [328], DRAM write access latencies [117], and DRAM cell retention failures [318, 154, 210, 364] to generate reliable PUF responses.

Unfortunately, these prior DRAM PUF proposals have significant drawbacks that make them unsuitable as *runtime-accessible* PUFs. PUFs that use DRAM start-up values [328] preclude runtime-accessible PUF evaluation by requiring a DRAM power cycle for *every* authentication. This requires either interrupting other applications using DRAM or restarting the entire system, which is likely infeasible at runtime. On the other hand, PUFs that exploit variation in write access latencies [117] *can be* evaluated at runtime. However, this proposal requires additional circuitry in a DRAM chip to allow fine-grained manipulation of write latency [117]. This requires changes to DRAM chips, rendering such proposals inapplicable to devices used in the field today. In this chapter, we aim to design a new runtime-accessible PUF *without* modifying commodity DRAM chips.

Using cell charge retention *failures* and their resulting *error patterns* [207, 113, 206, 155, 260, 266] is the best candidate for runtime-accessible DRAM PUF evaluation in commodity devices today, since it does *not* require a power cycle or any modifications to DRAM chips. Unfortunately, such *DRAM retention PUFs* impose two major overheads. First, due to the 1) wide distribution of charge retention times across DRAM cells [155, 206, 260, 266, 113] and 2) roughly-uniform spatial distribution of retention failures across a chip [300, 21], we find that evaluating a DRAM retention PUF takes on the order of *minutes* at 55°C to identify enough retention failures. The evaluation time increases exponentially as temperature decreases. Second, this means that DRAM refresh *must* be disabled for long periods of time. Because DRAM refresh can *only* be disabled for large regions of DRAM [56], evaluating a DRAM retention PUF on a small region of memory, i.e., a *PUF memory segment*, requires disabling refresh on the *entire* large memory region containing the PUF memory segment. However, to maintain the integrity of data inside the large region but outside of the PUF memory segment, *all* such data must be *continuously* refreshed with additional DRAM commands, which results in significant system interference [364]. Based on extensive experimental analysis using 223 state-of-the-art LPDDR4 DRAM devices, we find that DRAM retention PUFs are too slow for reasonable runtime operation, e.g., they have average evaluation times of 125.8s at 55°C and 13.4s at 70°C using a 64KiB memory segment (Section 4.5).

**Our goal** in this work is to develop a new runtime-accessible PUF that 1) uses *existing* commodity DRAM devices, 2) satisfies all characteristics of an effective *runtime-accessible* PUF, and 3) provides *low-latency* evaluation with *low system interference* across *all operating conditions*.

## 4.3 Properties of a Runtime-Accessible PUF

In this section, we examine the desired properties of an *effective runtime-accessible PUF*. Prior works present various different metrics for defining the effectiveness of a PUF [328, 364, 117, 318, 222]. We consolidate these metrics into five key properties below. We then discuss two properties that we consider necessary for an effective *runtime-accessible* PUF. We refer to these seven properties when analyzing DRAM PUFs (Sections 4.5 and 4.6). In Section 4.6, we show how DRAM latency PUFs overcome the weaknesses of DRAM retention PUFs based on a comparison of these properties between the two types of PUFs.

### 4.3.1 Characteristics of a Desirable PUF

The following five key properties must be provided by any effective *PUF* that can be evaluated across a set of devices:

1. *Diffuseness*: a single device is able to generate many unique and independent responses to different input parameters [19, 90, 69, 129].
2. *Uniqueness*: a single device can be uniquely identified among the set of devices [328, 364, 117, 355, 205, 129].
3. *Uniform Randomness*: all possible PUF responses must be equally different from each other [328, 364, 205, 129].
4. *Unclonability*: it should be practically impossible for an adversary to construct a device that exhibits the same properties as another [222, 153, 355].
5. *Repeatability*: given a set of input parameters, PUF evaluation results in the same PUF response regardless of internal and external conditions (e.g., temperature, aging) [328, 364, 117, 355, 205, 129].

These five properties ensure that a PUF can be used effectively for challenge-response authentication.
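As a hedged illustration of how two of these properties could be quantified, the sketch below treats each PUF response as a set of failing cell addresses and compares responses with the Jaccard index. The choice of metric and all response values are assumptions of this example.

```python
def jaccard(a: set, b: set) -> float:
    """|a ∩ b| / |a ∪ b|; defined as 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical responses, each a set of failing cell addresses.
device_a_run1 = {3, 17, 42, 99, 250}
device_a_run2 = {3, 17, 42, 99, 251}    # same device, one cell differs
device_b_run1 = {5, 17, 60, 130, 400}   # a different device

intra = jaccard(device_a_run1, device_a_run2)   # repeatability: want close to 1
inter = jaccard(device_a_run1, device_b_run1)   # uniqueness: want close to 0
print(f"intra-device similarity: {intra:.3f}, inter-device similarity: {inter:.3f}")
```

Under this view, a repeatable PUF keeps the intra-device similarity high across evaluations, while a unique PUF keeps the inter-device similarity low across every pair of devices.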

### 4.3.2 Characteristics of a *Runtime-Accessible* PUF

There are many important use cases for runtime-accessible PUFs. Examples include 1) systems that employ remote communication protocols to access devices via remote direct memory access (RDMA [4]) or to perform functions on remote devices (e.g., remote servers), and 2) systems that have interchangeable/broken system components (e.g., SSD drives, external sensors, peripheral devices). In each of these systems, a connection/component can be maliciously swapped out *during runtime* so that a *malicious* device can be swapped in. One way to avoid such an attack is to utilize a low-overhead runtime-accessible PUF-based challenge-response mechanism that frequently authenticates the communicating devices. This enables re-authentication of the system components during each step of communication rather than just once at bootup time. More generally, a fast *runtime-accessible* PUF enables the protection of a system from attacks that exploit the fact that the time of check is different from the time of use [364].

In order to be useful for *runtime* authentication, a PUF must be effectively *usable* while the system is running *without* significantly interfering with application execution and system operation. Thus, a runtime-accessible PUF must possess the following two key properties:

1. *Low Latency*: PUF evaluation must be fast so that the application requesting authentication stalls for the *smallest possible amount of time*.
2. *Low System Interference*: PUF evaluation must *not* significantly slow down concurrently-running applications.

## 4.4 Testing Environment

To analyze DRAM behavior with both reduced refresh rates and reduced timing parameters, we developed an infrastructure to characterize modern LPDDR4 [141] DRAM chips. Our testing environment gives us precise control over the DRAM commands and DRAM timing parameters as verified with a logic analyzer probing the command bus.

We perform all tests, unless otherwise specified, using a total of 223 2y-nm LPDDR4 DRAM chips from three major manufacturers in a thermally-controlled chamber held at 45°C. For consistency across results, we stabilize the ambient temperature precisely using heaters and fans controlled via a microcontroller-based proportional-integral-derivative (PID) loop to within an accuracy of 0.25°C and a reliable range of 40°C to 55°C. We maintain DRAM temperature at 15°C above ambient temperature using a separate local heating source. We utilize temperature sensors to smooth out temperature variations caused by self-induced heating.
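For readers unfamiliar with PID control, the following sketch shows the structure of such a loop on a toy first-order thermal model. It does not describe our actual controller: the plant constants, the gains, and the function name are all hypothetical values chosen for illustration.

```python
def simulate_pid(setpoint=45.0, t_env=25.0, steps=1000, dt=0.1,
                 kp=2.0, ki=0.5, kd=0.1):
    """Drive a toy first-order thermal model toward `setpoint` with a PID loop
    and return the final temperature in °C."""
    a, b = 0.1, 1.0                  # toy plant: heat-loss rate and actuator gain
    temp = t_env                     # chamber starts at room temperature
    integral = 0.0
    prev_error = setpoint - temp     # avoids a derivative spike on the first step
    for _ in range(steps):
        error = setpoint - temp
        integral += error * dt
        derivative = (error - prev_error) / dt
        # Control output: positive values heat (heaters), negative values cool (fans).
        u = kp * error + ki * integral + kd * derivative
        temp += dt * (-a * (temp - t_env) + b * u)
        prev_error = error
    return temp

final_temp = simulate_pid()
print(f"final temperature: {final_temp:.3f} °C")
```

The integral term is what removes the steady-state offset: it accumulates until the actuator output exactly balances the heat lost to the environment at the setpoint.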

## 4.5 DRAM Retention PUFs: Analysis

Recent works [318, 364, 268, 319, 329] propose DRAM retention PUFs, which require no modifications to commodity DRAM chips. These works evaluate their proposals using DDR3 DRAM modules and find that while the use of charge retention times in DRAM cells can result in repeatable PUFs, delays on the order of *minutes* are required to produce enough failures for uniquely identifying many devices.

In this section, we evaluate prior proposals using our own infrastructure with 223 modern LPDDR4 DRAM chips. Our experimental results (Section 4.5.2) confirm that DRAM retention PUFs can be effectively implemented with commodity LPDDR4 DRAM devices. However, similarly to prior work [318, 364], we find that the time required to evaluate retention PUFs is prohibitively long (e.g., on the order of minutes) *at temperatures* that are likely encountered under common-case operating conditions (e.g., 35°C-55°C) [62, 191, 78].

### 4.5.1 Evaluating Retention PUFs

We evaluate DRAM retention PUFs on our modern LPDDR4 devices, as shown in Algorithm 2. The DRAM retention PUF disables refresh for a period indicated by the *wait\_time* input parameter on a memory segment indicated by the segment ID (*seg\_id*) input parameter (line 3). In order to constrain retention failures to the PUF memory segment, the user must refresh the rows contained in the DRAM rank, but *not* in the PUF memory segment, during the *wait\_time* interval (lines 5-8). The resulting data in the memory segment after the *wait\_time* interval is the PUF response that is returned for authentication (line 10). This PUF response is uniquely represented by the pattern of DRAM cells that fail in the memory segment after *not* being refreshed during the *wait\_time* interval.

The memory controller can disable refresh only at the granularity of DRAM ranks or banks [141]. Therefore, in order to prevent potential data loss, evaluation of a *runtime-accessible* DRAM retention PUF using a given DRAM memory segment requires continuous refreshing of all rows that are within the same rank or bank but outside of the PUF memory segment. Doing so results in high system interference (see, e.g., [56, 207]) that is greatly exacerbated by the long refresh intervals (e.g., 60s vs. the standard 64ms) required for repeatable retention PUF evaluation at common-case temperatures (e.g., 35°C-55°C).

---

**Algorithm 2:** Evaluate Retention PUF [318, 364, 268, 319, 329]

---

```
1 evaluate_DRAM_retention_PUF(seg_id, wait_time):
2   rank_id ← DRAM rank containing seg_id
3   disable refresh for Rank[rank_id]
4   start_time ← current_time()
5   while current_time() - start_time < wait_time:
6     foreach row in Rank[rank_id]:
7       if row not in Segment[seg_id]:
8         issue refresh to row          // refresh all other rows
9   enable refresh for Rank[rank_id]
10  return data at Segment[seg_id]
```

---
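The behavior of Algorithm 2 can be mimicked in software with a toy model. The sketch below is not a hardware implementation: it assumes each cell has a fixed retention time drawn from a hypothetical log-normal distribution, and that rows outside the PUF memory segment are refreshed often enough to never fail. All names are ours.

```python
import random

class SimulatedRank:
    """Toy DRAM rank in which every cell has a fixed retention time (seconds)."""
    def __init__(self, rows, cells_per_row, rng):
        # Hypothetical retention-time distribution, chosen only for illustration.
        self.retention = [[rng.lognormvariate(4.0, 1.0) for _ in range(cells_per_row)]
                          for _ in range(rows)]

def evaluate_retention_puf(rank, segment_rows, wait_time):
    """Return the (row, cell) positions in the segment that lose their data when
    left unrefreshed for wait_time seconds. Rows outside the segment are assumed
    to be refreshed continuously, so they contribute no failures."""
    return {(row, cell)
            for row in segment_rows
            for cell, t_ret in enumerate(rank.retention[row])
            if t_ret < wait_time}

rank = SimulatedRank(rows=8, cells_per_row=64, rng=random.Random(1))
segment = [0, 1]                                        # rows forming the PUF segment
short = evaluate_retention_puf(rank, segment, wait_time=30.0)
longer = evaluate_retention_puf(rank, segment, wait_time=120.0)
print(len(short), "failures after 30 s;", len(longer), "failures after 120 s")
```

In this model a longer wait time can only add failures, which is why waiting longer (or, on real devices, raising the temperature) is the lever for collecting enough failures to form a response.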

### 4.5.2 Evaluation Times of Retention PUFs

In this section, we explore the effects of DRAM temperature during DRAM retention PUF evaluation on the DRAM retention PUF evaluation time. Based on extensive experimental data from 223 LPDDR4 DRAM chips, we find that the evaluation time of a DRAM retention PUF exhibits a strong dependence on DRAM temperature during evaluation. With even just a 10°C decrease in DRAM temperature, the evaluation time for the *same* PUF memory segment increases by 10x [318, 364]. This is due to the direct correlation between retention failure rate and temperature. We reproduce the *bit error rate* (BER) vs. temperature relationship studied for DDR3 [206] and LPDDR4 [260] chips using our own LPDDR4 chips. We find that at refresh intervals below 30s, there is an exponential dependence of BER on temperature with an average exponential growth factor of 0.23 per °C. This results in approximately a 10x decrease in the retention failure rate with every 10°C decrease in temperature and is consistent with prior work's findings with older DRAM chips [206, 318, 260]. Due to the sensitivity of DRAM retention PUFs to temperature, a *stable temperature* is required to generate a *repeatable* PUF response.
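The 10x figure can be recovered from the exponential model with a short calculation. As a sketch, assume that BER scales as $e^{kT}$ with $k \approx 0.23$ when temperature $T$ is measured in °C (this unit is our reading of the growth factor, since it is the one under which the stated 10x change follows):

$$\frac{\mathrm{BER}(T)}{\mathrm{BER}(T - 10)} = e^{k \cdot 10} = e^{2.3} \approx 10$$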

To find the evaluation time of DRAM retention PUFs, we use a similar methodology to prior works on DRAM retention PUFs, which disable DRAM refresh and wait for at least 512 retention failures to accumulate across a memory segment [318, 154]. Figure 4-1 shows the results of DRAM retention PUF evaluation times for three different memory segment sizes (8KiB, 64KiB, 64MiB) across our testable DRAM temperature range (i.e., 55°C-70°C). Results are shown for the average across all tested chips from each manufacturer in order to isolate manufacturer-specific variation [207, 206, 155, 260]. Figure 4-1 also shows, for comparison, the DRAM latency PUF evaluation time, which is experimentally determined to be 88.2ms on average for *any* DRAM device at *all* operating temperatures (see Section 4.6.2).



Figure 4-1: Average DRAM retention PUF evaluation time vs. temperature shown for three selected memory segment sizes for each manufacturer. Average DRAM latency PUF evaluation time (Section 4.6.2) is shown as a comparison point.

We find that at our maximum testing temperature of 70°C, a DRAM retention PUF can be evaluated, on average (minimum, maximum) across all manufacturers, in 40.6s (28.1s, 58.6s) using an 8KiB segment size. By increasing the memory segment size from 8KiB to 64KiB, we can evaluate a DRAM retention PUF in 13.4s (9.6s, 16.0s), and at 64MiB, in 1.05s (1.01s, 1.09s). However, at our lowest testable temperature (i.e., 55°C), DRAM retention PUF evaluation time increases to 2.9 *hours* (49.7 minutes, 5.6 *hours*) using an 8KiB segment, 125.8s (76.6s, 157.3s) using a 64KiB segment, and 3.0s (1.5s, 5.3s) using a 64MiB segment.<sup>1</sup>
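The speedups of the DRAM latency PUF over the DRAM retention PUF quoted at the start of this chapter follow directly from these evaluation times. The sketch below divides the 64KiB retention PUF times by the 88.2ms DRAM latency PUF evaluation time; the variable names are ours.

```python
LATENCY_PUF_S = 0.0882   # DRAM latency PUF evaluation time (88.2 ms)

# 64KiB DRAM retention PUF evaluation times in seconds: (average, minimum, maximum).
retention_64kib_s = {
    70: (13.4, 9.6, 16.0),
    55: (125.8, 76.6, 157.3),
}

speedups = {temp_c: tuple(round(t / LATENCY_PUF_S) for t in times)
            for temp_c, times in retention_64kib_s.items()}

for temp_c, (avg, low, high) in speedups.items():
    print(f"{temp_c}°C: {avg}x ({low}x, {high}x)")
# 70°C: 152x (109x, 181x)
# 55°C: 1426x (868x, 1783x)
```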

A DRAM retention PUF evaluation time on the order of even seconds or minutes is *prohibitively high* for at least three reasons: 1) such high latency leads to very long application stall times and very high system interference, 2) since DRAM refresh intervals can be modified only at a rank/bank granularity, the memory controller must continuously issue *extra accesses*, during PUF evaluation, to each row inside the rank/bank but outside of the PUF memory segment, which causes significant bandwidth, performance, and energy overhead, and 3) such a long evaluation time allows ample opportunity for temperature to fluctuate, which would result in a PUF response with low similarity to the golden key, and thus, an unreliable PUF.

In general, DRAM retention PUF evaluation time increases with *decreasing* temperature. This is due to the temperature dependence of charge leakage in DRAM cell capacitors, and is a *fundamental limitation* of using DRAM retention failures as a PUF mechanism. Therefore, any devices operating at common-case operating temperatures (35°C-55°C) [191, 78, 209] or below will have great difficulty adopting DRAM retention PUFs for runtime accessibility. In Sections 4.6.1 and 4.7.2, we describe the DRAM latency PUF in detail and show how it 1) provides a much lower evaluation time than the DRAM retention PUF, and 2) enables a reliably short evaluation time across *all* operating temperatures.

### 4.5.3 Optimizing Retention PUFs

We explore if it is possible to make DRAM retention PUFs runtime-accessible (i.e., significantly faster) at common-case operating temperatures by increasing the rate at which retention failures are induced. Given that ambient (i.e., environmental) temperature is fixed, we can increase the rate of induced retention failures in two ways: 1) using a larger PUF memory segment in DRAM, or 2) accelerating the rate of charge leakage using means other than increasing ambient temperature.

---

<sup>1</sup>These evaluation times are consistent with prior work on DRAM retention PUFs [318, 154, 364], which find that evaluation times on the order of minutes or longer are required to induce enough retention failures in a 128KiB memory segment to generate a PUF response at 20°C.

**Larger PUF memory segments.** Using a larger PUF memory segment results in additional DRAM capacity overhead that does *not* scale favorably with decreasing temperatures. As shown in Section 4.5.2, the number of retention failures drops exponentially as temperature decreases, so the PUF memory segment size required to compensate for the decreasing retention failure rate *increases exponentially*. Our experimental analysis in Figure 4-1 shows that at 55°C, even using a PUF memory segment size on the order of tens of megabytes, a DRAM retention PUF *cannot* be evaluated in under 1 second. Assuming the exponential growth factor of 0.23 for DRAM BER as a function of temperature (found in Section 4.5.2), a corresponding PUF evaluation time of $\sim$1s at 20°C would require a PUF memory segment over a thousand times larger (i.e., hundreds of gigabytes). Thus, it is not cost-effective (i.e., scalable) to naively increase the PUF memory segment size.
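The scaling argument can be checked with a back-of-the-envelope calculation. The sketch below assumes that 1) BER scales as exp(0.23 · T) with T in °C, 2) the number of retention failures grows linearly with segment size, and 3) the starting point is a 64MiB segment at 55°C. These are simplifying assumptions for illustration, not measured results.

```python
import math

K_PER_DEG_C = 0.23            # assumed exponential growth factor per °C
delta_t = 55 - 20             # temperature drop in °C

# Factor by which the retention failure rate drops between 55°C and 20°C,
# and therefore by which the segment must grow to see as many failures.
scale = math.exp(K_PER_DEG_C * delta_t)

segment_gib = 64 * scale / 1024   # 64 MiB segment scaled up, expressed in GiB
print(f"scale ≈ {scale:.0f}x, required segment ≈ {segment_gib:.0f} GiB")
```

Under these assumptions the required growth is well over a thousand times, which puts the segment size in the range of hundreds of gigabytes.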

**Accelerating charge leakage.** Accelerating charge leakage given a fixed temperature can be done by either 1) making hardware modifications or 2) exploiting factors other than temperature that affect charge leakage. Unfortunately, as we discuss in this section, there is no easy way to achieve these using commodity off-the-shelf (COTS) systems.

In-DRAM hardware modifications proposed in prior work can be leveraged to increase the number of retention failures observed at a fixed ambient temperature. For example, partial restoration of DRAM cells [191, 380] can be used to prepare the PUF memory segment with reduced charge levels in order to exacerbate the number of retention failures observed with a given refresh interval. Similarly, other mechanisms in prior work (e.g., [299, 120]) can be used to decrease DRAM retention PUF evaluation time at common-case temperatures where DRAM retention PUFs are otherwise infeasible. However, these approaches require modifications in DRAM or the memory controller, and thus, cannot be used in COTS DRAM.

System-level hardware modifications, such as adding a heating source local to the DRAM chip [101], could be used to exacerbate the occurrence of retention failures at low ambient temperatures. However, these approaches require custom system architectures, which contradicts our goal of designing a PUF for COTS systems. They may also open up system security and reliability concerns.

Experimental studies on DRAM have shown that charge leakage rates are dependent on factors such as supply voltage [58], data pattern effects [206, 196, 156, 157, 158, 260, 155], and random charge fluctuations known as *variable retention time (VRT)* [366, 271, 206, 266, 155, 260]. Analogously to temperature control, any of these quantities could be intelligently manipulated to exacerbate the number of retention failures observed. Unfortunately, these effects are either *too weak* to significantly increase the number of observed retention failures (e.g., data pattern dependence), require *system modifications* to implement (e.g., voltage control [72, 58]), or are inherently *difficult to control* (e.g., VRT effects).

In order to reduce the number of extra row refresh operations necessary to prevent data loss throughout retention PUF evaluation (Section 4.5.1), DRAM refresh optimizations proposed in prior work [352, 207, 206, 201, 245, 68, 21, 347, 266, 34, 260] can be used to increase the granularity of the refresh operation. While this approach could potentially eliminate the extra refresh operations altogether, these mechanisms come with their own hardware and runtime overheads that may diminish the benefits of *not* having to issue the extra refresh commands during PUF evaluation. Many such mechanisms also require hardware modifications to either DRAM chips or memory controllers or both.

We conclude that there is no good known way to optimize DRAM retention PUF evaluation time for COTS DRAM devices today. While many approaches to improve evaluation time exist, they are all impractical in COTS systems due to 1) lack of applicability and scalability to common-case temperatures, 2) need for DRAM modification, or 3) inherent difficulties in control. This motivates the need for a runtime-accessible PUF that is suitable across all temperature conditions and can be implemented on COTS DRAM devices today.

## 4.6 DRAM Latency PUFs

Our goal is to develop a DRAM PUF that can be evaluated 1) with low latency and low system interference across all operating temperatures, and 2) without any modification to DRAM chips. To this end, we present the DRAM latency PUF, a new class of DRAM PUFs with these characteristics. In particular, a DRAM latency PUF provides low evaluation time at a wide range of operating temperatures (0°C-70°C), which includes common-case temperatures (35°C-55°C) [191, 78, 209].

**Key Idea.** The key idea of the DRAM latency PUF is to provide unique device signatures using the error pattern resulting from accessing DRAM with *reduced* timing parameters. These *latency failures* are inherently related to chip-specific random process variation introduced during manufacturing (Section 2.6), which allows us to use the failures as unique identifiers for each DRAM chip. To evaluate a DRAM latency PUF, we write known data into a fixed-size *memory segment* (e.g., 4 DRAM rows  $\approx$  8KiB in our LPDDR4 DRAM chips) and read it back with reduced timing parameters. The resulting failures form a pattern of bits unique to the tested device.

**Probabilistic Nature.** Inducing latency failures is a stochastic process in which the probability of cell failure is based on random variations in both the cell itself and any peripheral circuitry used to access the cell. This is due to the probabilistic behavior of circuit elements when timing requirements are violated. To find a repeatable set of latency-failure-prone DRAM cells, each cell should be accessed *multiple* times with reduced timing parameters. In the case of reduced  $t_{RCD}$ , we require multiple *iterations* of reading each cell to accumulate a reliable set of latency failures. Fortunately, as we show in Section 4.6.2, finding a reliable set of latency failures is a relatively fast process (i.e., it takes 88.2ms on average).

**Key Variables.** We identify three key variables to optimize for when designing the DRAM latency PUF. These variables define the tradeoffs between the DRAM latency PUF’s evaluation time and its effectiveness.

1) *Memory segment ID.* DRAM PUFs can be evaluated using memory segments from different parts of DRAM. Each segment results in unique error patterns and can therefore be used for *different* challenge-response pairs. In Section 4.7.3, we discuss how variation in process manufacturing causes some chips to have fewer memory segments (where fewer is worse) that are viable for DRAM latency PUF evaluation than others.

2) *Memory segment size.* Larger memory segments allow more devices to be uniquely identified at the cost of higher PUF evaluation time because more memory accesses are required to induce latency failures across the memory segment. With an experimental analysis of memory segment size based on data from 223 real DRAM chips (Section 4.6.1), we find that a memory segment size of 8KiB is sufficient to find enough latency failures for an effective DRAM latency PUF.

3) *DRAM timing parameters.* Both using different timing parameters and changing the amount of reduction in the chosen timing parameter result in *different* error patterns (Section 2.6). This is because 1) different timing parameters guard against different underlying error mechanisms [58, 190, 191], and 2) different amounts of latency reduction exercise different failure-prone bits [191]. These two dimensions of control add more degrees of freedom to the DRAM latency PUF, further increasing its diffuseness (Section 4.6.1).

Throughout the rest of this section, we demonstrate that the DRAM latency PUF satisfies all requirements for 1) a reliable PUF (Section 4.3.1) and 2) runtime-accessible PUF evaluation (Section 4.3.2) across all temperatures. We focus on  $t_{RCD}$ -induced DRAM read errors in this work, but DRAM latency PUFs also work with any other timing parameter whose timing violation results in failures (e.g.,  $t_{RP}$ ,  $t_{RAS}$ ,  $t_{WR}$ ), thereby enabling a potentially larger challenge-response space than obtained by using a single timing parameter alone.

#### 4.6.1 PUF Characteristics: Experimental Analysis

This section shows, with experimental results from 223 state-of-the-art LPDDR4 DRAM chips, that the DRAM latency PUF satisfies each of the five characteristics of a desirable PUF discussed in Section 4.3.1.

## Diffuseness

Different memory segments within the same device result in different error patterns [191, 190, 58, 55]. Given the large address space provided by modern DRAM, different memory segments provide different challenge-response pairs. For example, our selected segment size of 8KiB (Section 4.7.3) in a 2GiB DRAM, offers up to 256K ( $\frac{2\text{GiB}}{8\text{KiB}}$ ) different challenge-response pairs, which is on the same order of magnitude as prior DRAM PUFs [364, 318].
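The challenge-response count quoted above follows from simple division. The sketch below reproduces it, assuming binary units (GiB, KiB) and non-overlapping, segment-aligned challenges. The helper name is hypothetical.

```python
KIB = 1024
GIB = 1024 ** 3


def num_challenge_segments(dram_bytes: int, segment_bytes: int) -> int:
    """Number of non-overlapping memory segments, each usable as one challenge."""
    return dram_bytes // segment_bytes


pairs = num_challenge_segments(2 * GIB, 8 * KIB)
print(pairs, "segments =", pairs // 1024, "K")
```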

## Uniqueness and Uniform Randomness

To show the uniqueness and uniform randomness of DRAM latency PUFs evaluated across different memory segments, we study a large number of different memory segments from each of our 223 LPDDR4 DRAM chips (as specified in Table 4.1).

|   | #Chips | #Tested Memory Segments |
|---|--------|-------------------------|
| A | 91     | 17,408                  |
| B | 65     | 12,544                  |
| C | 67     | 10,580                  |

Table 4.1: The number of tested PUF memory segments across the tested chips from each of the three manufacturers.

For each memory segment, we evaluate the PUF 50 times at 70°C. To measure the uniqueness of a PUF, we use the notion of a *Jaccard index* [137], as suggested by prior work [364, 282, 17]. We use the Jaccard index to measure the similarity of two PUF responses. The Jaccard index is determined by taking the two sets of latency failures ( $s_1, s_2$ ) from two PUF responses and computing the ratio of the size of the shared set of failures over the total number of unique errors in the two sets  $\frac{|s_1 \cap s_2|}{|s_1 \cup s_2|}$ . A Jaccard index value closer to 1 indicates a high similarity between the two PUF responses, and a value closer to 0 indicates uniqueness of the two. Thus, a unique PUF should have Jaccard index values close to 0 across all pairs of *distinct* memory segments.

We choose to employ the Jaccard index instead of the *Hamming distance* [115] as our metric for evaluating the similarity between PUF responses because the Jaccard index places a heavier emphasis on the differences between two large bitfields. This is especially true in the case of devices that exhibit inherently lower failure rates. In the case of Hamming distance, calculating similarity between two PUF responses depends heavily on the number of failures found, and we find this to be an unfair comparison due to the large variance in the number of failures across distinct memory segments. For example, consider the case where two memory segments each generate PUF responses consisting of a single failure in different locations of a bitfield comprised of 100 cells. The Hamming distance between these PUF responses would be 2 (the two responses differ in exactly two of the 100 bit positions), which could be mistaken for a match, but the Jaccard index would be 0, which would guarantee a mismatch. Because we are more interested in the locations *with* failures than without, we use the Jaccard index, which discounts locations without failures. Throughout the rest of this chapter, we use the terms 1) *Intra-Jaccard* [364, 282] to refer to the Jaccard index of two PUF responses from the *same* memory segment and 2) *Inter-Jaccard* [364, 282] to refer to the Jaccard index of two PUF responses from *different* memory segments.
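To make the comparison between the two metrics concrete, the following sketch computes both for the single-failure example. It represents a PUF response as the set of failing cell indices, which is an illustrative choice rather than the representation used in our infrastructure, and the failure locations (3 and 42) are arbitrary.

```python
def jaccard_index(s1: set, s2: set) -> float:
    """|s1 & s2| / |s1 | s2| over two sets of failing cell locations."""
    union = s1 | s2
    if not union:
        # Assumption: two responses with no failures are treated as identical.
        return 1.0
    return len(s1 & s2) / len(union)


def hamming_distance(s1: set, s2: set) -> int:
    """Number of bit positions at which the two responses differ."""
    return len(s1 ^ s2)


# Two responses over a 100-cell bitfield, each with one failure
# at a different (arbitrarily chosen) location.
resp_a = {3}
resp_b = {42}

print("Jaccard:", jaccard_index(resp_a, resp_b))
print("Hamming:", hamming_distance(resp_a, resp_b))
```

The Hamming distance is small relative to the 100-cell bitfield even though the two responses share no failures, whereas the Jaccard index of 0 reflects that they have nothing in common.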

A PUF must exhibit uniqueness and uniform randomness across any memory segment from any device from any manufacturer. To show that these characteristics hold for the DRAM latency PUF, we verify that the Inter-Jaccard indices are distributed near 0. This demonstrates that 1) the error patterns are unique such that no two distinct memory segments would generate PUF responses with high similarity, and 2) the error patterns are distributed uniformly randomly across the DRAM chip(s) such that the likelihood of two chips (or two memory segments) generating the same error pattern is exceedingly low.

Figure 4-2 plots, in blue, the distribution of Inter-Jaccard indices calculated between *all possible pairs* of PUF responses generated at the same operating temperature ( $70^{\circ}\text{C}$ ) from all tested memory segments across all chips from three manufacturers. The distribution of the Intra-Jaccard indices is also shown in red (discussed later in this section). The x-axis shows the Jaccard indices and the y-axis marks the probability of any pair of memory segments (either within the same device or across two different devices) resulting in the Jaccard index indicated by the x-axis. We observe that the distribution of the Inter-Jaccard indices is multimodal, but the Inter-Jaccard index *always* remains below 0.25 for *any pair* of distinct memory segments. This means that PUFs from different memory segments have low similarity. Thus, we conclude that latency-related error patterns approximate the behavior of a desirable PUF with regard to both uniqueness and uniform randomness.



Figure 4-2: Distributions of Jaccard indices calculated across every possible pair of PUF responses across all tested PUF memory segments from each of 223 LPDDR4 DRAM chips.

To understand manufacturer-related effects, Figure 4-3 separately plots the Intra- and Inter-Jaccard distributions of PUF responses from chips of a *single* manufacturer in subplots. Each subplot indicates the manufacturer encoding in the top left corner (A, B, C). From these per-manufacturer distributions, we make three major observations: 1) Inter-Jaccard values are quite low, per-manufacturer, which shows uniqueness and uniform randomness, 2) there is variation across manufacturers, as expected, and 3) Figure 4-2’s multimodal behavior for Inter- and Intra-Jaccard index distributions can be explained by the mixture of per-manufacturer distributions. We also find that the Inter-Jaccard indices calculated between two PUF responses from chips of distinct manufacturers are tightly distributed close to 0 (not shown).

## Unclonability

We attribute the probabilistic behavior of latency failures to physical variation inherent to the chip (discussed in Section 2.6). Chips of the same design contain physical differences due to manufacturing process variation which occurs as a result of imperfections in manufacturing [191, 190, 58, 55, 54, 188, 170]. The exact physical variations are inherent to each individual chip, as shown by previous work [191, 190, 58, 55, 54, 188, 170] and confirmed by our experiments (not shown), and the pattern of variations is very difficult to replicate as it is created entirely unintentionally.

Figure 4-3: Distributions of Jaccard indices calculated between PUF responses of DRAM chips from a single manufacturer.

## Repeatability

To demonstrate that the DRAM latency PUF exhibits repeatability, we show how well a PUF memory segment can result in the *same* PUF response 1) at different times or 2) under different operating temperatures. For each of many different memory segments, we evaluate a PUF multiple times and calculate all possible *Intra-Jaccard* indices (i.e., Jaccard indices between two PUF responses generated from the *same* exact memory segment). Because a highly-repeatable PUF generates very similar PUF responses during each evaluation, we expect the Intra-Jaccard indices between PUF responses of a highly-repeatable PUF to be tightly distributed near a value of 1. Figure 4-2 plots the distribution of Intra-Jaccard indices across every PUF memory segment we tested in red. We observe that while the distribution is multimodal, the Intra-Jaccard indices are clustered very close to 1.0 and *never* drop below 0.65.

Similarly to the Inter-Jaccard index distributions (discussed in Section 4.6.1), we find that the different modes of the Intra-Jaccard index distribution shown in Figure 4-2 arise from combining the Intra-Jaccard index distributions from all three manufacturers. We plot the Intra-Jaccard index distributions for each manufacturer alone in Figure 4-3 as indicated by (A),(B), and (C). We observe from the higher distribution mean of Intra-Jaccard indices in Figure 4-3 for manufacturer B that DRAM latency PUFs evaluated on chips from manufacturer B exhibit higher repeatability than those from manufacturers A or C. We conclude from the high Intra-Jaccard indices in Figures 4-2 and 4-3, that DRAM latency PUFs exhibit high repeatability.

**Long-term Repeatability.** We next study the repeatability of DRAM latency PUFs on a subset of chips over a 30-day period to show that the repeatability property holds for longer periods of time (i.e., a memory segment generates a PUF response similar to its previously-enrolled golden key irrespective of the time since its enrollment). We examine a total of more than a million 8KiB memory segments *across* many chips from each of the three manufacturers as shown in Table 4.2. Each row corresponds to one manufacturer and lists the number of chips tested (#Chips) and the total number of memory segments tested across those chips (#Total Memory Segments).

|   | #Chips | #Total Memory Segments |
|---|--------|------------------------|
| A | 19     | 589,824                |
| B | 12     | 442,879                |
| C | 14     | 437,990                |

Table 4.2: Number of PUF memory segments tested for 30 days.

In order to demonstrate the repeatability of evaluating a DRAM latency PUF over long periods of time, we continuously evaluate our DRAM latency PUF across a 30-day period using each of our chosen memory segments. For each memory segment, we calculate the Intra-Jaccard index between the first PUF response and each subsequent PUF response. We find the *Intra-Jaccard index range*, or the range of values ( $\max\_value - \min\_value$ ) found across the Jaccard indices calculated for every pair of PUF responses from a memory segment. If a memory segment exhibits a low Intra-Jaccard index range, the memory segment generates highly-similar PUF responses during each evaluation over our testing period. Thus, memory segments that exhibit low Intra-Jaccard index ranges demonstrate high repeatability.
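The Intra-Jaccard index range can be sketched as follows. This sketch assumes that each PUF response is represented as a set of failing cell locations and that every later response is compared against the first (enrolled) response. The function names and the toy responses are illustrative only.

```python
def jaccard_index(s1: set, s2: set) -> float:
    """|s1 & s2| / |s1 | s2| over two sets of failing cell locations."""
    union = s1 | s2
    if not union:
        return 1.0  # assumption: two empty responses count as identical
    return len(s1 & s2) / len(union)


def intra_jaccard_range(responses: list) -> float:
    """max - min of the Jaccard indices between the first response and
    each subsequent response from the same memory segment."""
    if len(responses) < 2:
        raise ValueError("need the enrolled response plus at least one more")
    golden = responses[0]
    indices = [jaccard_index(golden, r) for r in responses[1:]]
    return max(indices) - min(indices)


# Toy responses from one segment (hypothetical failure locations).
responses = [
    {1, 2, 3, 4},     # first (enrolled) response
    {1, 2, 3, 4},     # identical      -> Jaccard 1.0
    {1, 2, 3},        # one cell lost  -> Jaccard 0.75
    {1, 2, 3, 4, 5},  # one cell added -> Jaccard 0.8
]
print(intra_jaccard_range(responses))
```

A segment whose range stays small (e.g., below 0.1) produces nearly the same response on every evaluation, which is the property the 30-day study measures.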

Figure 4-4 shows the distribution of *Intra-Jaccard index ranges* across our memory segments with box-and-whisker plots<sup>2</sup> for each of the three manufacturers. We observe that the Intra-Jaccard index ranges are quite low, i.e., less than 0.1 on average for all manufacturers. Thus, we conclude that the vast majority of memory segments across all manufacturers exhibit very high repeatability over long periods of time.



Figure 4-4: Distribution of the Intra-Jaccard index range values calculated between many PUF responses that a PUF memory segment generates over a 30-day period.

In order to show that every chip has a significant proportion of memory segments that exhibit high reliability over time, we analyze per-chip Intra-Jaccard index range properties. Table 4.3 shows the *median [minimum, maximum]* of the fraction of memory segments per chip that are observed to have Intra-Jaccard index ranges below 0.1 and 0.2. For every manufacturer, the median chip has over 90% of its segments with Intra-Jaccard index ranges below 0.1, and *every* chip has over 95% of its segments with Intra-Jaccard index ranges below 0.2. This means that each chip has a significant number of memory segments that are viable for DRAM latency PUF evaluation. Furthermore, the distributions are very narrow, which indicates that different chips show similar behavior. We conclude that every chip has a significant number of PUF memory segments that exhibit high repeatability across time. We show in Section 4.7.5 how we can use a simple characterization step to identify these viable memory segments quickly and reliably.

|   | %Memory segments per chip with Intra-Jaccard index range <0.1 | %Memory segments per chip with Intra-Jaccard index range <0.2 |
|---|--------------------------------|--------------------------------|
| A | 100.00 [99.08, 100.00]         | 100.00 [100.00, 100.00]        |
| B | 90.39 [82.13, 99.96]           | 96.34 [95.37, 100.00]          |
| C | 95.74 [89.20, 100.00]          | 96.65 [95.48, 100.00]          |

Table 4.3: Percentage of PUF memory segments per chip with Intra-Jaccard index ranges <0.1 or 0.2 over a 30-day period. Median [minimum, maximum] values are shown.

---

<sup>2</sup>The box is bounded by the first quartile (i.e., the median of the first half of the ordered set of Intra-Jaccard index ranges) and third quartile (i.e., the median of the second half of the ordered set of Intra-Jaccard index ranges). The median is marked by a red line within the bounding box. The *inter-quartile range* (IQR) is defined as the difference between the third and first quartiles. The whiskers are drawn out to extend an additional  $1.5 \times IQR$  above the third quartile and  $1.5 \times IQR$  below the first quartile. Outliers are shown as orange crosses indicating data points outside of the range of whiskers.

**Temperature Effects.** To demonstrate how changes in temperature affect PUF evaluation, we evaluate the DRAM latency PUF 10 times for each of the memory segments in Table 4.2 at each 5°C increment throughout our testable temperature range (55°C-70°C). Figure 4-5 shows the distributions of Intra-Jaccard indices calculated between every possible pair of PUF responses generated by the *same* memory segment. The deltas between the operating temperatures at the time of PUF evaluation are denoted in the x-axis (*temperature delta*). Since we test at four evenly-spaced temperatures, we have four distinct temperature deltas. The y-axis marks the Jaccard indices calculated between the PUF responses. The distribution of Intra-Jaccard indices found for a given temperature delta is shown using a box-and-whisker plot.

Figure 4-5 subdivides the distributions for each of the three manufacturers as indicated by A, B, and C. Two observations are in order. 1) Across all three manufacturers, the distribution of Intra-Jaccard indices strictly shifts towards zero as the temperature delta increases. 2) The Intra-Jaccard distribution of PUF responses from chips of manufacturer C is the most sensitive to changes in temperature, as reflected in the large distribution shift in Figure 4-5(C). Both observations show that evaluating a PUF at a temperature different from the temperature during enrollment affects the quality of the PUF response and reduces repeatability. However, 1) for small temperature deltas (e.g., 5°C), PUF repeatability is not significantly affected, and 2) we discuss in Section 4.7.5 how we can ameliorate this effect during device enrollment.

#### 4.6.2 Runtime-Accessible PUF Metrics Evaluation

Throughout the remainder of this section, we show 1) how the DRAM latency PUF satisfies the characteristics of a *runtime-accessible* PUF (i.e., low latency and low system interference) discussed in Section 4.3.2, and 2) that the DRAM latency PUF significantly outperforms the DRAM retention PUF in terms of both evaluation time and system interference.

Figure 4-5: DRAM latency PUF repeatability vs. temperature.

### Low Latency

The DRAM latency PUF consists of two key phases: 1) inducing latency failures, and 2) filtering the PUF segment, which improves PUF repeatability (to be discussed in Section 4.7.1). During Phase 1, we induce latency failures multiple times (i.e., for multiple *iterations*) over the PUF memory segment and count the failures in a separate buffer for additional bookkeeping (we discuss this in further detail in Section 4.7.2). The execution time of this phase depends directly on three factors:

1. The value of the  $t_{RCD}$  timing parameter. A smaller  $t_{RCD}$  value causes each read to have a shorter latency.
2. The size of the PUF memory segment. A larger memory segment requires more DRAM read requests per iteration. In our devices, we observe that latency failures are induced at a granularity of 32 bytes with each read request, so we can find the total number of required DRAM reads by dividing the size of the memory segment by 32 bytes.
3. The number of iterations used to induce latency failures. More iterations lead to a longer evaluation time.

Increasing any one of these factors independently of the others directly results in an increase in PUF evaluation time. We experimentally find that a single low- $t_{RCD}$  access to DRAM, along with its associated bookkeeping and memory barrier, takes  $3.4\mu\text{s}$ . Because the value of  $t_{RCD}$  is on the scale of tens of nanoseconds [141], changing its value negligibly affects the time for each low- $t_{RCD}$  access. Thus, we use a constant  $3.4\mu\text{s}$  for each read regardless of the  $t_{RCD}$  value to find a good estimate of the PUF evaluation time in Equation 4.1. We experimentally show that Phase 2 has negligible runtime ( $< 0.1\%$  of total DRAM latency PUF evaluation time) compared with Phase 1, so we omit Phase 2 in our PUF evaluation time estimation. We express PUF evaluation time estimation as:

$$T_{PUF\_eval} = (N_{iters}) \times [(size_{mem\_seg})/(32 \text{ bytes})] \times 3.4\mu\text{s} \quad (4.1)$$

where  $N_{iters}$  is the number of times we induce latency failures on each 32 byte block of the memory segment, and  $size_{mem\_seg}$  is the size of the memory segment used to evaluate the PUF. For our final chosen configuration (discussed in detail in Section 4.7), we use the parameters  $size_{mem\_seg} = 8\text{KiB}$  (Section 4.7.3),  $t_{RCD} = 9.8\text{ns}$  (Section 4.7.4), and  $N_{iters} = 100$  (Section 4.7.1). Using Equation 4.1, we expect this configuration to result in an evaluation time of approximately 87ms.
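As a quick check of Equation 4.1, the sketch below evaluates it for the chosen configuration. The constant and function names are illustrative, while the 3.4µs per-access cost and the 32-byte failure granularity are the values reported above.

```python
ACCESS_TIME_US = 3.4  # measured cost of one low-tRCD access plus bookkeeping
BLOCK_BYTES = 32      # granularity at which latency failures are induced


def puf_eval_time_ms(n_iters: int, seg_bytes: int) -> float:
    """Equation 4.1: N_iters x (size_mem_seg / 32 bytes) x 3.4 us, in ms."""
    reads_per_iter = seg_bytes // BLOCK_BYTES
    return n_iters * reads_per_iter * ACCESS_TIME_US / 1000.0


# Chosen configuration: 100 iterations over an 8KiB memory segment,
# i.e., 100 x 256 reads x 3.4 us.
t_ms = puf_eval_time_ms(n_iters=100, seg_bytes=8 * 1024)
print(f"estimated evaluation time: {t_ms:.2f} ms")
```

Because the estimate is linear in both the number of iterations and the segment size, doubling either one doubles the estimated evaluation time.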

In order to experimentally verify Equation 4.1, we measure the evaluation time of the DRAM latency PUF for 10000 evaluations across chips from all three manufacturers at  $55^\circ\text{C}$ . We find that evaluation times are normally distributed per-manufacturer according to  $\mathcal{N}_A(\mu = 89.1\text{ms}, \sigma = 0.0132\text{ms})$ ,  $\mathcal{N}_B(\mu = 88.2\text{ms}, \sigma = 0.0135\text{ms})$ , and  $\mathcal{N}_C(\mu = 87.2\text{ms}, \sigma = 0.0102\text{ms})$ . These distribution parameters show that evaluation times have very similar means and are extremely tightly distributed (i.e.,  $< 0.0002$  relative standard deviation). This is expected because, for any particular configuration, DRAM latency PUF evaluation essentially requires a *constant* number of DRAM accesses. Therefore, any variation in PUF evaluation time comes from variations in code execution (e.g., multitasking, interrupts, DRAM refresh, etc.) rather than any characteristics of the PUF itself. In order to compare these runtime distributions with the result of Equation 4.1, we take the mean of the mixture distribution of the three per-manufacturer distributions (i.e.,  $\mathcal{N}_{ABC}(\mu = 88.2\text{ms}, \sigma = 0.716\text{ms})$ ) and find that the 87ms estimate from Equation 4.1 results in only 1.4% error.
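The comparison between the measured mean and the Equation 4.1 estimate can be reproduced as follows. The sketch assumes that the three manufacturers are weighted equally in the mixture, which the text does not state explicitly, and it only recomputes the mean and the relative error.

```python
# Per-manufacturer mean evaluation times reported above (ms).
means_ms = {"A": 89.1, "B": 88.2, "C": 87.2}

# Assumption: equal mixture weights for the three manufacturers.
mixture_mean_ms = sum(means_ms.values()) / len(means_ms)

reported_mean_ms = 88.2  # mixture mean as reported
estimate_ms = 87.0       # Equation 4.1 estimate

rel_error = abs(reported_mean_ms - estimate_ms) / reported_mean_ms
print(f"equal-weight mixture mean: {mixture_mean_ms:.1f} ms")
print(f"relative error of the estimate: {rel_error:.1%}")
```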

Figure 4-1 provides a comparison of DRAM latency PUF evaluation time with retention PUF evaluation time across our testable temperature range (i.e., 55°C-70°C). We find that the DRAM latency PUF significantly outperforms the DRAM retention PUF for an equivalent DRAM capacity overhead of 64KiB (i.e., 8KiB latency PUF memory segment + 56KiB counter buffer), providing an average (minimum, maximum) speedup of 152x (109x, 181x) at 70°C and 1426x (868x, 1783x) at 55°C. By increasing the memory segment size from 64KiB to 64MiB, we can evaluate a DRAM retention PUF in 1.05s (1.01s, 1.09s) at 70°C (Section 4.5.3). However, the DRAM latency PUF still outperforms this configuration *without* an increase in DRAM capacity overhead (i.e., still with an 8KiB memory segment), providing a speedup of 12.1x (11.6x, 12.5x).

Similarly to prior work on DRAM latency reduction [55, 191], we experimentally find that inducing latency failures is minimally affected by changes in temperature. Importantly, since our method of inducing latency failures does *not* change with temperature (Section 4.7.2), DRAM latency PUF evaluation time remains reliably short across *all* operating temperatures. We conclude that the DRAM latency PUF 1) can be evaluated at speeds that are orders of magnitude faster than the DRAM retention PUF, and 2) overcomes the temperature dependence of the DRAM retention PUF and maintains a low evaluation time across all temperatures.

### Low System Interference

The DRAM latency PUF exhibits two major sources of system interference: 1) requiring exclusive DRAM rank/bank access throughout PUF evaluation, and 2) using a region in a separate DRAM rank to count latency failures (Section 4.7.2).

First, because DRAM timing parameters can only be manipulated for the coarse granularity of a DRAM rank, any other access to the same rank containing the PUF memory segment must be blocked during PUF evaluation. Such blocking prevents other accesses from obeying the same reduced timing parameters and corrupting the data. For this reason, DRAM latency PUF evaluation requires exclusive access to a full DRAM rank for the entire duration of PUF evaluation. Fortunately, the DRAM latency PUF’s quick evaluation time (i.e., 88.2ms on average) guarantees that the DRAM rank will be unavailable only for a short period of time. This is in stark contrast with the DRAM retention PUF, which 1) blocks rank/bank access for much longer periods of time (e.g., on the order of *minutes* or seconds), and 2) requires the memory controller to issue a large number of refresh operations to rows in the rank/bank outside of the PUF memory segment for the same period of time [364].

Second, the DRAM latency PUF algorithm (described in detail in Section 4.7.2) requires a small *counter buffer* (e.g., a 56KiB buffer for an 8KiB PUF memory segment) which stores counters for each bit of the PUF memory segment. This comes at the cost of both DRAM capacity overhead and additional memory traffic penalty. However, given that the DRAM capacity overhead is small (e.g., <0.003% for a 2GB DRAM using an 8KiB memory segment) and the additional bandwidth consumed is extremely low (e.g., on the order of 100MB/s using an 8KiB memory segment) in the context of total DRAM bandwidth (e.g., 8GB/s), we conclude that the additional system interference induced by the counter buffer is insignificant. In practice, we expect system caches to (fully) hold the counter buffer, further reducing the required DRAM bandwidth.
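The counter buffer size and capacity overhead can be sanity-checked with the sketch below. It assumes one counter per bit of the PUF memory segment, with each counter just wide enough to count up to the number of iterations (7 bits for 100 iterations). The counter width is inferred here from the 56KiB figure rather than stated in the text, and the 2GB DRAM is treated as 2GiB.

```python
KIB = 1024
GIB = 1024 ** 3


def counter_buffer_bytes(seg_bytes: int, n_iters: int) -> int:
    """One counter per segment bit; each counter holds values 0..n_iters."""
    bits_per_counter = n_iters.bit_length()  # 100 -> 7 bits (assumed width)
    total_bits = seg_bytes * 8 * bits_per_counter
    return total_bits // 8


buf = counter_buffer_bytes(seg_bytes=8 * KIB, n_iters=100)
overhead_pct = 100.0 * buf / (2 * GIB)

print(f"counter buffer: {buf // KIB} KiB")
print(f"capacity overhead in a 2GiB DRAM: {overhead_pct:.5f}%")
```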

## 4.7 Design Considerations

As we experimentally showed in Section 4.6, utilizing DRAM latency failures is a viable method for evaluating runtime-accessible PUFs in commodity, unmodified DRAM chips. However, due to variation across DRAM cells and chips, there are various important design considerations that must be made in the implementation of the DRAM latency PUF. In this section, we discuss these considerations for implementing the DRAM latency PUF.

#### 4.7.1 Repeatability of Cell Latency Failures

Due to many underlying factors (e.g., process variation, temperature), each DRAM cell fails with a different probability when read with a timing parameter reduced beyond the manufacturer specification [190, 191, 55, 188, 54]. We define a *latency-weak cell* as a cell that has a significant probability of failure when read with a reduced timing parameter. Our DRAM latency PUFs are comprised of the locations of latency-weak cells because such cells can be repeatably found. In order to repeatably find the set of latency-weak cells, we employ *many* iterations (e.g., on the order of 100) of inducing latency failures at the PUF memory segment. This improves the chances of a PUF evaluation to find a significant proportion of the latency-weak cells. Because we assume that any given cell has a static probability ( $p$ ) to fail when accessed with reduced latency, we can model the number of times that a cell must be accessed before observing a latency failure as a *geometric random variable* with a success probability of  $p$ . The geometric distribution with parameter  $p$  has a mean value of  $\frac{1}{p}$ . By sampling cells over  $x$  iterations during DRAM latency PUF evaluation, we expect to find all cells that fail with a probability greater than or equal to  $\frac{1}{x}$ . There is a chance that a cell with a failure probability below the threshold fails during an instance of the PUF evaluation, reducing the similarity of the PUF responses across evaluations and thus the repeatability of the PUF. To mitigate this issue, we apply a *filter* (see Section 4.7.2) that removes cells that we observe to fail in only a small proportion of the  $x$  iterations. We empirically find that removing cells that fail in less than 10% of the iterations results in the highest Intra-Jaccard indices across PUF responses.
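The geometric model and the filtering step can be illustrated with a short sketch. It assumes that iterations are independent and that each cell has a static failure probability, as in the model above. The function names and the example failure counts are hypothetical.

```python
def detection_probability(p: float, x: int) -> float:
    """Probability that a cell with per-access failure probability p
    fails at least once in x independent iterations."""
    return 1.0 - (1.0 - p) ** x


def filter_latency_weak_cells(fail_counts: dict, n_iters: int,
                              min_percent: int = 10) -> set:
    """Keep only cells that failed in at least min_percent of the iterations
    (integer arithmetic avoids floating-point threshold issues)."""
    return {cell for cell, count in fail_counts.items()
            if count * 100 >= min_percent * n_iters}


# A cell at the 1/x threshold (p = 0.01, x = 100) is observed in most,
# but not all, evaluations, which motivates the filter.
print(f"{detection_probability(0.01, 100):.3f}")

# Hypothetical per-cell failure counts after 100 iterations.
counts = {0: 3, 1: 10, 2: 57}
print(filter_latency_weak_cells(counts, n_iters=100))
```

Cells that fail rarely (cell 0 in this example) appear in some evaluations and not in others, so removing them trades a slightly smaller response for higher Intra-Jaccard indices.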

In order to determine how many iterations to induce latency failures for during latency PUF evaluation, we generate PUF responses across our devices using a varying number of iterations between 1 and 1024. For each set of PUF responses generated with a given number of iterations, we calculate the box-and-whisker plots for both Inter- and Intra-Jaccard distributions (not shown). We find that for PUF responses from chips across all manufacturers, the Inter-Jaccard *and* Intra-Jaccard distributions have strictly the same or increasing medians, first and third quartiles, and whiskers for an increasing number of iterations.

Higher Intra-Jaccard index distribution values represent a more repeatable PUF since the distribution directly reflects the similarities of PUF responses from the same memory segment. We find that the Intra-Jaccard index distribution’s median and bottom whiskers increase by 0.0025 and 0.0054, respectively, for every doubling of the number of iterations. On the other hand, higher Inter-Jaccard index distribution values represent higher similarity across distinct memory segments. Such higher values would limit the PUF’s ability to identify many unique devices. We find that the Inter-Jaccard index distribution’s median and top whiskers increase by 0.0012 and 0.0011, respectively, for every doubling of the number of iterations. Based on our experimental analyses of these tradeoffs, we choose to induce latency failures for 100 iterations during each DRAM latency PUF evaluation. We next discuss in detail our algorithm for evaluating DRAM latency PUFs with high repeatability.
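As a reminder of what these distributions measure, the following Python sketch computes Jaccard indices over PUF responses. It assumes that a PUF response is represented as the set of failing bit locations and uses the standard definition $|A \cap B| / |A \cup B|$; the helper names and the toy responses are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Set


def jaccard_index(a: Iterable[int], b: Iterable[int]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets of failing bit locations."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention: two empty responses are identical
    return len(set_a & set_b) / len(union)


def intra_jaccard(responses: List[Set[int]]) -> List[float]:
    """Indices between all pairs of responses from the SAME memory segment."""
    return [jaccard_index(a, b) for a, b in combinations(responses, 2)]


def inter_jaccard(seg_a: List[Set[int]], seg_b: List[Set[int]]) -> List[float]:
    """Indices between responses from two DIFFERENT memory segments."""
    return [jaccard_index(a, b) for a in seg_a for b in seg_b]


if __name__ == "__main__":
    same_segment = [{1, 2, 3, 4}, {1, 2, 3, 4}, {1, 2, 3}]
    other_segment = [{7, 8, 9}, {3, 7, 8, 9}]
    print("intra:", intra_jaccard(same_segment))
    print("inter:", inter_jaccard(same_segment, other_segment))
```

A repeatable and unique PUF yields Intra-Jaccard values close to 1 and Inter-Jaccard values close to 0.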

#### 4.7.2 DRAM Latency PUF Evaluation Algorithm

We provide an implementable algorithm for evaluating a repeatable DRAM latency PUF at a given memory segment. While we focus on evaluating DRAM latency PUFs with $t_{RCD}$-induced failures, Algorithm 3 works with any other timing parameter capable of inducing failures. We first initialize the PUF memory segment indicated by $Segment[seg\_id]$ by setting every bit in the memory segment to “1” (line 2). We then attempt to find the reliable set of failures as fast as possible in the memory segment (lines 4-11). Because DRAM RD commands require a $t_{RCD}$ delay only after the activation of a previously closed DRAM row, $t_{RCD}$ failures can only be observed when issuing a read request to a *closed* DRAM row. The key idea is to iterate over each row sequentially such that each read request goes to a different row (i.e., perform column order accesses through the memory segment of interest) as shown in lines 7-9. Before inducing failures across the PUF memory segment, we must first obtain exclusive access to the rank containing the PUF memory segment (line 4), due to the rank-level granularity of changing DRAM timing parameters (Section 2.6). We then reduce the value of $t_{RCD}$ for the entire rank containing the PUF memory segment (line 5).

During the iterations of inducing $t_{RCD}$ failures (lines 6-11), we issue a memory barrier (line 10) after each read. This ensures that 1) *only one* memory instruction is in flight at a given time and, thus, improves repeatability by simplifying the logic required by the memory controller when issuing memory accesses, and 2) read requests do *not* get reordered by the memory controller to exploit row buffer locality [274, 273, 242, 171, 243, 311, 337, 172]. Instead, each access activates a new row, while obeying the $t_{RCD}$ timing parameter. We find that the instruction order indicated by lines 6-11 is the fastest method for finding a reliable set of latency failures in a memory segment. For every read, the $t_{RCD}$ failure locations are determined and their failures are counted in a separate rank for bookkeeping (line 11). After all iterations of inducing $t_{RCD}$ failures, we must reset the $t_{RCD}$ value to the default (line 12), filter the PUF segment (line 13; see *Filtering Mechanism*), release exclusive access to the rank containing the PUF memory segment (line 14), and finally return the PUF response, i.e., the resulting *error pattern* from the PUF evaluation at the PUF memory segment (line 15).

---

**Algorithm 3:** Evaluate DRAM latency PUF

---

```
 1  evaluate_DRAM_latency_PUF(seg_id):
 2    write known data (all 1's) to Segment[seg_id]
 3    rank_id ← DRAM rank containing seg_id
 4    obtain exclusive access to Rank[rank_id]
 5    set low tRCD for Rank[rank_id]
 6    for i = 1 to num_iterations:
 7      for all col in Segment[seg_id]:
 8        for all row in Segment[seg_id]:        // column-order reads
 9          read()                               // induce read failures
10          memory_barrier()                     // one access at a time
11          count_failures()                     // record in another rank
12    set default tRCD for Rank[rank_id]
13    filter the PUF memory segment              // See Filtering Mechanism
14    release exclusive access to Rank[rank_id]
15    return error pattern at Segment[seg_id]
```

---

**Filtering Mechanism.** In order to improve the repeatability of the DRAM latency PUF, we employ a *filtering mechanism* which removes the cells with low failure probability from the PUF response (as shown on line 13 in Algorithm 3). The key idea is to count, for each bit location in the PUF memory segment, the number of iterations in which the location fails and then use that count to determine whether the bit location should be *set* (“1”) or *cleared* (“0”) in the final PUF response. Every bit in the DRAM PUF memory segment has a corresponding counter that we store in the *counter buffer*, a data structure we allocate in a DRAM rank separate from the one containing the PUF memory segment. This is to ensure that read/write requests to the counter buffer follow manufacturer-specified timing parameters and do not induce latency failures.

After each reduced-latency read request in the PUF memory segment, we find all bit locations in the read data that resulted in a latency failure, and increment their corresponding counters in the counter buffer. After all iterations of inducing latency failures are completed, we compare every counter of each bit location in the PUF memory segment against a threshold. If a counter holds a value greater than the threshold (i.e., the counter’s corresponding bit location failed more than  $n$  times, where  $n$  is the threshold), we set the corresponding bit location. Otherwise, we clear it.
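The counting and thresholding steps can be sketched in software as follows. This is a simplified model, not the memory-controller implementation: the counter buffer is a plain Python list, the segment is assumed to be initialized to all 1s (so any bit read as 0 is a latency failure), and the function names and example threshold are illustrative.

```python
from typing import List


def accumulate_failures(counters: List[int], read_bits: List[int],
                        expected_bit: int = 1) -> None:
    """Increment the counter of every bit location that was read incorrectly."""
    if len(counters) != len(read_bits):
        raise ValueError("counter buffer and read data must have equal length")
    for location, bit in enumerate(read_bits):
        if bit != expected_bit:
            counters[location] += 1


def filter_response(counters: List[int], threshold: int) -> List[int]:
    """Set a bit only if its location failed more than `threshold` times."""
    return [1 if count > threshold else 0 for count in counters]


if __name__ == "__main__":
    counters = [0] * 4  # one counter per bit in a toy 4-bit segment
    reads = ([1, 0, 1, 0], [1, 0, 0, 1], [1, 0, 1, 1])  # three iterations
    for read_bits in reads:
        accumulate_failures(counters, read_bits)
    print("counters:", counters)
    print("response:", filter_response(counters, threshold=1))
```

In this toy run, only the location that fails in every iteration survives the filter; locations that fail once are cleared.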

**Memory Footprint.** Equation 4.2 provides the memory footprint required by PUF evaluation:

$$mem_{total} = (size_{mem\_seg}) + (size_{counter\_buffer}) \quad (4.2)$$

where  $size_{mem\_seg}$  is the size of the PUF memory segment and  $size_{counter\_buffer}$  is the size of the counter buffer. The size of the counter buffer can be calculated using Equation 4.3:

$$size_{counter\_buffer} = (size_{mem\_seg}) \times \lceil \log_2 N_{iters} \rceil \quad (4.3)$$

where  $size_{mem\_seg}$  is the size of the PUF memory segment and  $N_{iters}$  is the number of iterations that we want to induce latency failures for. Since we require one counter per bit in the memory segment, we must multiply this quantity by the size of each counter. Since the counter must be able to store up to the value of  $N_{iters}$  (e.g., in the case of a cell that fails every iteration), each counter must be  $\lceil \log_2 N_{iters} \rceil$  bits wide.

For a memory segment size of 8KiB, we find that the DRAM latency PUF’s total memory footprint is 64KiB. From this, we conclude that DRAM latency PUFs have insignificant DRAM capacity overhead.
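The 64KiB figure follows directly from Equations 4.2 and 4.3 when $N_{iters} = 100$, the iteration count chosen in Section 4.7.1. The short Python sketch below reproduces the arithmetic; the function names are illustrative.

```python
import math


def counter_width_bits(n_iters: int) -> int:
    """Bits per counter according to Equation 4.3: ceil(log2(N_iters))."""
    if n_iters < 1:
        raise ValueError("n_iters must be at least 1")
    return math.ceil(math.log2(n_iters))


def counter_buffer_size(segment_size: int, n_iters: int) -> int:
    """Equation 4.3: one counter per bit of the segment, so the buffer is
    the segment size multiplied by the counter width (same unit as input)."""
    return segment_size * counter_width_bits(n_iters)


def total_footprint(segment_size: int, n_iters: int) -> int:
    """Equation 4.2: memory segment plus counter buffer."""
    return segment_size + counter_buffer_size(segment_size, n_iters)


if __name__ == "__main__":
    KIB = 1024
    segment = 8 * KIB
    print("counter width (bits):", counter_width_bits(100))
    print("counter buffer (KiB):", counter_buffer_size(segment, 100) // KIB)
    print("total footprint (KiB):", total_footprint(segment, 100) // KIB)
```

With 100 iterations each counter is 7 bits wide, so an 8KiB segment needs a 56KiB counter buffer, for a total of 64KiB.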

#### 4.7.3 Variation Among PUF Memory Segments

We observe a variation in latency failure rates across different memory segments, which makes some DRAM memory segments more desirable than others for evaluating DRAM latency PUFs. Because we want to find 512 bits that fail per PUF memory segment (Section 4.5.2), we consider only those memory segments that have at least 512 failing bits as *good* memory segments. In order to determine the best size of the memory segment to evaluate the DRAM latency PUF on, we study the effect of varying memory segment size on 1) DRAM capacity overhead, 2) PUF evaluation time, and 3) fraction of good memory segments per device. As the memory segment size increases, both the DRAM capacity overhead and the PUF evaluation time increase linearly. The number of possible PUF memory segments for a DRAM device with a DRAM latency PUF is obtained by counting the number of contiguous PUF memory segments across all of DRAM (i.e., dividing the DRAM size by the PUF memory segment size). Thus, larger PUF memory segments result in fewer possible PUF memory segments for a DRAM device. From an experimental analysis of the associated tradeoffs of varying the PUF memory segment size (not shown), we choose a PUF memory segment size of 8KiB.<sup>3</sup>

In Table 4.4, we represent the distribution of the percentage of good memory segments per chip with a *median [minimum, maximum]* across each of the three manufacturers. The left column shows the number of chips tested, the right column shows the representation of the distribution, and the rows indicate the different manufacturers of the chips. We see that an overwhelming majority of memory segments from manufacturers A and B are good for PUF evaluation. Memory segments from chips of manufacturer C were observed to exhibit fewer latency failures, but across each of our chips we could find at least 19.37% of the memory segments to be good for PUF evaluation. Of the total number of PUF memory segments tested (shown in Table 4.2), we experimentally find that 100%, 64.06%, and 19.37% of memory segments are *good* (i.e., contain enough failures to be considered for PUF evaluation) in the worst-case chips from manufacturers A, B, and C, respectively. We conclude that there are plenty of PUF memory segments that are good enough for DRAM latency PUF evaluation.

---

<sup>3</sup>We will provide details in a technical report/extended version for all other results that we cannot provide detail for in the submission.

| Manufacturer | #Chips | Good Memory Segments per Chip (%) |
|--------------|--------|-----------------------------------|
| A | 19     | 100.00 [100.00, 100.00]           |
| B | 12     | 100.00 [64.06, 100.00]            |
| C | 14     | 30.86 [19.37, 95.31]              |

Table 4.4: Percentage of *good* memory segments per chip across manufacturers. Median [min, max] values are shown.

#### 4.7.4 Support for Changing Timing Parameters

In order to induce latency failures, the manufacturer-specified DRAM timing parameters must be changed. Some existing processors [191, 1, 12] enable software to directly manipulate DRAM timing parameters. These processors can trivially implement and evaluate a DRAM latency PUF with *minimal* changes to the software and no changes to hardware. However, for other processors that cannot directly manipulate DRAM timing parameters, we would need to simply enable software to programmatically modify memory controller registers which indicate the DRAM timing parameters that a memory access must observe.

We find that we can reliably induce latency failures when we reduce the value of  $t_{RCD}$  from a default value of 18ns to between 6ns and 13ns. Given this wide range of failure-inducing  $t_{RCD}$  values, most memory controllers should be able to issue read requests with a  $t_{RCD}$  value within this range.

#### 4.7.5 Device Enrollment

Device enrollment is a one-time process consisting of evaluating all possible PUFs from across the entire challenge-response space and securely storing the evaluated PUFs in a trusted database such that they can be later queried for authentication [153, 318, 364]. Since the goal of PUF authentication is to ensure that a challenge-response is difficult to replicate without access to the original device, enrollment must be done securely so that the full set of all possible challenge-response pairs is known only to the trusted database and can be created *only* by the device owner.

Similar to prior works’ approach to DRAM PUFs [364, 328, 318], we assume that a trusted third party (e.g., the DRAM manufacturer) performs device enrollment prior to making the system available to the end consumer. This ensures that the complete set of all possible challenge-response pairs is known only by the trusted third party. After the device is in the field, even the trusted third party *cannot* regenerate the enrollment data. Thus, allowing the trusted third party to both characterize and enroll the responses makes it extremely difficult for a malicious attacker to obtain the full set of possible challenge-response pairs without first compromising the trusted third party.

Because DRAM latency PUF responses vary depending on the temperature of DRAM during evaluation time (see Section 4.6.1), we must enroll multiple golden keys at varying temperature intervals. This enables a PUF response to match at least one golden key during authentication regardless of the temperature during evaluation time. We find that some chips generate PUF responses with less variation across a range of temperatures than other chips. Chips with less variation can enroll golden keys for temperatures at larger intervals than chips with more variation.

#### 4.7.6 In-DRAM Error Correcting Codes

Some new DRAM chips utilize in-DRAM *error-correcting codes* (ECC), which perform single-bit error correction per *word* (i.e., typically 64 data bits) invisibly to the system [246, 151, 248], to overcome the reliability challenges of DRAM technology scaling [238, 239, 170, 207, 244, 230, 284]. When such chips are used for DRAM latency PUFs, ECC words with only one error appear to be error-free to the memory controller, leaving fewer total errors available for the PUF response. We note that error correction *deterministically* transforms DRAM error patterns. Thus, a DRAM PUF (on a chip with in-DRAM ECC) that repeatably induces the same error pattern prior to ECC correction would repeatably result in a different but *consistent error pattern* after ECC correction.

In order to support PUF evaluation in a system using DRAM chips with in-DRAM ECC, we would need to evaluate PUFs with a higher *raw bit error rate* (i.e., the error rate before ECC is performed) relative to non-ECC DRAMs. A higher raw bit error rate would produce enough observable failures (after ECC is performed) for a PUF. The DRAM latency PUF can achieve this higher raw bit error rate by simply reducing the latency parameter value further. Such reduction would also ideally reduce PUF evaluation time and system interference. Therefore, we expect the DRAM latency PUF to be evaluated even faster in chips with built-in ECC.<sup>4</sup> As stronger ECC mechanisms are used, i.e., ECC can correct DRAM words containing more than 1 error, we expect even lower evaluation latencies and lower system interference with the DRAM latency PUF.
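The effect of single-error correction on the observable error pattern can be illustrated with a deliberately simplified Python model. It assumes 64-bit ECC words, assumes that a word with exactly one raw error is fully corrected, and assumes that words with two or more raw errors pass through unchanged. The last assumption is a simplification, since a real in-DRAM ECC implementation may behave differently on multi-bit errors. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

WORD_BITS = 64  # assumed ECC word size ("typically 64 data bits")


def observable_errors(raw_error_bits: Set[int],
                      word_bits: int = WORD_BITS) -> Set[int]:
    """Return the error locations visible after single-error correction.

    Simplified model: words with exactly one raw error are corrected;
    words with more errors are assumed to pass through unchanged.
    """
    errors_by_word: Dict[int, List[int]] = defaultdict(list)
    for bit in raw_error_bits:
        errors_by_word[bit // word_bits].append(bit)

    visible: Set[int] = set()
    for bits in errors_by_word.values():
        if len(bits) > 1:  # single-error words appear error-free
            visible.update(bits)
    return visible


if __name__ == "__main__":
    low_rate = {3, 70, 200}            # one raw error in each of three words
    high_rate = {3, 9, 70, 100, 200}   # two words now hold two raw errors
    print("visible at low raw error rate: ", sorted(observable_errors(low_rate)))
    print("visible at high raw error rate:", sorted(observable_errors(high_rate)))
```

In this model the transformation is a pure function of the raw error pattern, so a repeatable raw pattern yields a repeatable post-correction pattern, and raising the raw bit error rate is what makes errors observable again.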

#### 4.7.7 Effect of High Temperature

Our evaluation of the DRAM latency and retention PUFs is limited to the 70°C maximum DRAM temperature. We clearly show in Section 4.5.2 that DRAM latency PUFs are much faster than DRAM retention PUFs at 70°C and lower (Figure 4-1). However, at higher temperatures (e.g., > 85°C), DRAM retention PUFs could become faster than DRAM latency PUFs.<sup>5</sup>

If a DRAM retention PUF is faster than the DRAM latency PUF at a very high temperature, it is easy to envision a mechanism that dynamically switches between DRAM latency PUFs and DRAM retention PUFs based on the device operating temperature at the time of evaluation. This mechanism could exploit the strengths of each type of DRAM PUF in order to allow the fastest possible PUF evaluation time. By exploiting in-DRAM temperature sensors that already exist in modern DRAM chips [141, 140, 191, 139], this mechanism could potentially be implemented with no additional hardware overhead beyond what is already required for DRAM retention PUFs and DRAM latency PUFs individually.

---

<sup>4</sup>In contrast, DRAM retention PUF *must* be evaluated with a longer refresh interval to increase the raw bit error rate in a DRAM chip with in-DRAM ECC. This leads to a significantly longer DRAM retention PUF evaluation time when in-DRAM ECC is used.

<sup>5</sup>Note that the DDR protocol specifies that every cell must be refreshed at least every 32ms for LPDDR3/4 or 64ms for DDR3/4 below 85°C and at even higher rates at higher temperatures.

Such a mechanism would require challenge-response pairs from both the DRAM retention PUF and DRAM latency PUF to be enrolled. Furthermore, since retention failure rates vary significantly across different chips [206, 155, 260], each chip will have a different temperature at which the mechanism switches between the two DRAM PUFs. This could considerably impact enrollment time and complicate the device authentication process. Ultimately, it is up to the system architect to decide whether such a mechanism is worth the evaluation runtime benefits at very high temperatures (which is likely to be on the order of tens of milliseconds). We leave a full exploration of this hybrid DRAM latency-retention PUF mechanism to future work.

## 4.8 Related Work

To our knowledge, this is the first work to: 1) introduce the idea of violating DRAM read latency parameters to create a fast, runtime-accessible DRAM PUF without modifying commodity DRAM devices, 2) introduce an effective DRAM PUF that is runtime-accessible at all operating temperatures, 3) demonstrate a wide variety of tradeoffs in DRAM PUFs, based on extensive new experimental data from 223 state-of-the-art LPDDR4 DRAM chips, and 4) demonstrate the prohibitively slow evaluation times of the DRAM retention PUF, the previously fastest DRAM PUF suitable for commodity devices.

In this section, we discuss prior works that propose DRAM PUFs and PUFs based on other substrates. The proliferation of recent works on DRAM PUFs reflects the growing importance of DRAM PUFs given DRAM’s near ubiquity in modern systems and large address space.

**DRAM Retention PUFs.** We have already described the basics of DRAM retention PUFs and extensively evaluated them in Section 4.5. We briefly explain the differences between prior proposals. Keller et al. [154] are the first to propose using DRAM retention failures as unique identifiers, shortly followed by Xiong et al. [364] and D-PUF [318], both of which enable runtime-accessible DRAM retention PUFs. Other works propose further optimizations for improving the quality of DRAM retention PUFs [268, 322, 329]. As our experimental evaluations across 223 LPDDR4 DRAM chips show, DRAM retention PUFs take very long to evaluate at common-case operating temperatures; they are orders of magnitude slower than our proposal (see Section 4.5). A work [235] published after the DRAM Latency PUF demonstrates highly reliable debiasing techniques for PUFs generated with retention failures.

**Other DRAM PUFs.** Hashemian et al. [117] propose adding a delay generator to the DRAM write-circuitry to induce failures. However, this requires additional hardware and cannot be applied to existing DRAM designs. Tehranipoor et al. [328, 152] suggest using DRAM start-up values for PUFs, but this precludes runtime evaluation by requiring a DRAM power cycle for every authentication. In a work published after this work, Talukder et al. [321] propose evaluating a PUF by inducing precharge latency failures and show that they can provide PUFs orders of magnitude faster, since precharge latency failures occur at larger granularities than activation latency failures.

**PUFs Based on Other Substrates.** Many PUFs have been proposed for various other substrates, including other memory technologies and customized hardware designs. We categorize them into delay-based and memory-based PUFs.

**Delay-based PUFs** include 1) arbiter PUFs [195, 200, 226, 278, 251, 252, 116, 370], which rely on process variation to extract the unique behavior of two identical competing circuit paths, 2) ring oscillator PUFs [92, 91, 314, 205], which rely on frequencies of oscillating signals from chained inverters, and 3) Current Mirror Array (CMA) PUFs [355], which rely on the manufacturing process variation in a customized circuit typically used for machine learning tasks. These works rely on customized hardware *not* present in commodity systems. FPGA-based PUFs [179, 108, 224, 223, 107, 106, 331, 93] overcome the need for hardware changes. However, they are not as common as DRAM in computer systems.

**Memory-based PUFs** include SRAM PUFs, which rely on SRAM start up values [107, 126, 127, 67, 32, 382, 359, 319] and voltage reduction induced failures [19]; butterfly PUFs, which mimic the behavior of SRAM cells with cross-coupled data latches [179]; latch PUFs, which cross-couple two NOR-gates [309]; flip-flop PUFs, which exploit the power up behavior of regular flip-flops [221, 338]; and PUFs for emerging memory technologies [136, 177, 277, 342, 59, 60, 208, 71, 374, 375, 343, 31, 254, 149]. These prior works either require additional customized hardware or usage of SRAM, which has a small address space compared to DRAM, and thus cannot accommodate a large number of challenge-response pairs.

## 4.9 Limitations

Although the DRAM Latency PUF can be used immediately in certain systems, the DRAM Latency PUF has a few limitations that prevent its immediate deployment in all real systems at its full potential.

First, the DRAM Latency PUF has a linear challenge-response space which results in its categorization as a weak PUF. While weak PUFs have many use cases in the field, a strong PUF (i.e., a PUF with an exponential challenge-response space) is more practical and provides added value to a system. We believe that studying the interactions between many DRAM timing parameters may help to identify effects of parameters that can be combined to build a strong PUF with an exponential challenge-response space.

Second, to minimize its evaluation overhead, the DRAM Latency PUF requires a flexible memory controller that can interleave DRAM accesses with varying timing parameters. A standard system likely incurs a higher overhead when changing timing parameters, which increases the cost of evaluating this PUF.

Third, as the DRAM bus is considered an insecure communication channel, the current interface for evaluating the PUF over the DRAM bus is prone to interference and eavesdropping. As the DRAM Latency PUF can be evaluated purely on the DRAM chip itself, we believe that developing simple logic near memory to facilitate the evaluation of the DRAM latency PUF on DRAM, without relying on an insecure channel, may help to further improve the security of the PUF.

Fourth, our aging studies are limited to a 30-day period. While we demonstrate that PUF responses do not change significantly over 30 days, long-term aging may have stronger effects. If long-term effects change PUF responses such that they are incomparable to the golden PUF responses, occasional re-profiling may be required to reliably employ the DRAM Latency PUF.

## 4.10 Summary

We introduce and analyze the DRAM latency PUF, a new DRAM PUF suitable for runtime authentication. The DRAM latency PUF intentionally violates manufacturer-specified DRAM timing parameters in order to provide many highly repeatable, unique, and unclonable PUF responses with low latency. Through experimental evaluation using 223 state-of-the-art LPDDR4 DRAM devices, we show that the DRAM latency PUF reliably generates PUF responses at runtime-accessible speeds (i.e., 88.2ms on average) at all operating temperatures. We show that the DRAM latency PUF achieves an average speedup of 152x/1426x at 70°C/55°C when compared with a DRAM retention PUF of the same DRAM capacity overhead, and it achieves even greater speedups at lower temperatures. We conclude that the DRAM latency PUF enables a fast and effective substrate for runtime device authentication across all operating temperatures, and we hope that the advent of runtime-accessible PUFs like the DRAM latency PUF and the detailed experimental characterization data we provide on modern DRAM devices will enable security architects to develop even more secure systems for future devices.

# Chapter 5

# D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput

We propose a new DRAM-based true random number generator (TRNG) that leverages DRAM cells as an entropy source. The key idea is to intentionally violate the DRAM access timing parameters and use the resulting errors as the source of randomness. Our technique specifically decreases the DRAM row activation latency (timing parameter  $t_{RCD}$ ) below manufacturer-recommended specifications, to induce read errors, or activation failures, that exhibit true random behavior. We then aggregate the resulting data from multiple cells to obtain a TRNG capable of providing a high throughput of random numbers at low latency.

To demonstrate that our TRNG design is viable using commodity DRAM chips, we rigorously characterize the behavior of activation failures in 282 state-of-the-art LPDDR4 devices from three major DRAM manufacturers. We verify our observations using four additional DDR3 DRAM devices from the same manufacturers. Our results show that many cells in each device produce random data that remains robust over both time and temperature variation. We use our observations to develop D-RaNGe, a methodology for extracting true random numbers from commodity DRAM devices with high throughput and low latency by deliberately violating the read access timing parameters. We evaluate the quality of our TRNG using the commonly-used NIST statistical test suite for randomness and find that D-RaNGe: 1) successfully passes each test, and 2) generates true random numbers with over two orders of magnitude higher throughput than the previous highest-throughput DRAM-based TRNG.

## 5.1 True Random Number Generators (TRNGs)

A *true random number generator (TRNG)* requires physical processes (e.g., radioactive decay, thermal noise, Poisson noise) to construct a bitstream of random data. Unlike pseudo-random number generators, the random numbers generated by a TRNG do *not* depend on the previously-generated numbers and *only* depend on the random noise obtained from physical processes. TRNGs are usually validated using statistical tests such as NIST [279] or DIEHARD [227]. A TRNG typically consists of 1) an *entropy source*, 2) a *randomness extraction technique*, and sometimes 3) a *post-processor*, which improves the randomness of the extracted data often at the expense of throughput. These three components are typically used to reliably generate true random numbers [308, 315].

**Entropy Source.** The entropy source is a critical component of a random number generator, as its amount of entropy affects the unpredictability and the throughput of the generated random data. Various physical phenomena can be used as entropy sources. In the domain of electrical circuits, thermal and Poisson noise, jitter, and circuit metastability have been proposed as processes that have high entropy [353, 269, 126, 127, 339, 229, 42, 332, 33]. To ensure robustness, the entropy source should not be visible or modifiable by an adversary. Failing to satisfy that requirement would result in generating predictable data, and thus put the system into a state susceptible to security attacks.

**Randomness Extraction Technique.** The randomness extraction technique harvests random data from an entropy source. A good randomness extraction technique should have two key properties. First, it should have high throughput, i.e., extract as much randomness as possible in a short amount of time [176, 308], which is especially important for applications that require high-throughput random number generation (e.g., security applications [109, 346, 160, 75, 181, 61, 377, 185, 22, 276, 220, 308, 30, 323, 40, 229, 368], scientific simulation [220, 40]). Second, it should not disturb the physical process [176, 308]. Affecting the entropy source during the randomness extraction process would make the harvested data predictable, lowering the reliability of the TRNG.

**Post-processing.** Harvesting randomness from a physical phenomenon *may* produce bits that are biased or correlated [176, 267]. In such a case, a post-processing step, which is also known as *de-biasing*, is applied to eliminate the bias and correlation. The post-processing step also provides protection against environmental changes and adversary tampering [176, 267, 308]. Well-known post-processing techniques are the von Neumann corrector [146] and cryptographic hash functions such as SHA-1 [76] or MD5 [272]. These post-processing steps work well, but generally result in decreased throughput (e.g., up to 80% [182]).
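As an illustration of de-biasing, the following Python sketch implements the classic von Neumann corrector. It reads the input in non-overlapping pairs, emits one bit for each pair whose two bits differ, and discards pairs whose bits are equal. Emitting the first bit of a differing pair is one common convention and is an assumption here. The technique removes bias only if the input bits are independent and identically biased.

```python
from typing import List, Sequence


def von_neumann_corrector(bits: Sequence[int]) -> List[int]:
    """De-bias a bit sequence using the von Neumann corrector.

    Non-overlapping pairs (0, 1) map to 0 and (1, 0) map to 1;
    pairs (0, 0) and (1, 1) are discarded. A trailing unpaired bit is ignored.
    """
    output: List[int] = []
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            output.append(first)
    return output


if __name__ == "__main__":
    raw = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
    corrected = von_neumann_corrector(raw)
    print(f"{len(raw)} raw bits -> {len(corrected)} corrected bits: {corrected}")
```

Because every equal pair is dropped and each differing pair yields only one bit, the output is much shorter than the input, which illustrates the throughput cost of post-processing.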

## 5.2 Motivation and Goal

True random numbers sampled from physical phenomena have a number of real-world applications from system security [22, 276, 308] to recreational entertainment [308]. As user data privacy becomes a *highly-sought* commodity in Internet-of-Things (IoT) and mobile devices, enabling primitives that provide security on such systems becomes critically important [203, 263, 377]. Cryptography is one typical method for securing systems against various attacks by encrypting the system’s data with keys generated with true random values. Many cryptographic algorithms require random values to generate keys in many standard protocols (e.g., TLS/SSL/RSA/VPN keys) to either 1) encrypt network packets, file systems, and data, 2) select internet protocol sequence numbers (TCP), or 3) generate data padding values [109, 346, 160, 75, 181, 61, 377, 185]. TRNGs are also commonly used in authentication protocols and in countermeasures against hardware attacks [61], in which pseudo-random number generators (PRNGs) are shown to be insecure [346, 61]. To keep up with the *ever-increasing* rate of secure data creation, especially with the growing number of commodity data-harvesting devices (e.g., IoT and mobile devices), the ability to generate true random numbers with *high throughput and low latency* becomes ever more relevant to maintain user data privacy. In addition, *high-throughput* TRNGs are already *essential* components of various important applications such as scientific simulation [220, 40], industrial testing, statistical sampling, randomized algorithms, and recreational entertainment [22, 276, 220, 308, 30, 323, 40, 377, 229, 368].

A *widely-available, high-throughput, low-latency* TRNG will enable all previously mentioned applications that rely on TRNGs, including improved security and privacy in most systems that are known to be vulnerable to attacks [203, 263, 377], as well as enable research that we may not anticipate at the moment. One such direction is using a one-time pad (i.e., a private key used to encode and decode only a single message) with quantum key distribution, which requires at least  $4\text{Gb/s}$  of true random number generation throughput [354, 64, 214]. Many *high-throughput* TRNGs have been recently proposed [181, 334, 378, 61, 30, 377, 247, 185, 354, 229, 368, 20, 85, 110], and the availability of these high-throughput TRNGs can enable a wide range of new applications with improved security and privacy.

DRAM offers a promising substrate for developing an effective and widely-available TRNG due to the prevalence of DRAM throughout all modern computing systems ranging from microcontrollers to supercomputers. A high-throughput DRAM-based TRNG would help enable widespread adoption of applications that are today limited to only select architectures equipped with dedicated high-performance TRNG engines. Examples of such applications include high-performance scientific simulations and cryptographic applications for securing devices and communication protocols, both of which would run much more efficiently on mobile devices, embedded devices, or microcontrollers with the availability of higher-throughput TRNGs in the system.

In terms of the CPU architecture itself, a high-throughput DRAM-based TRNG could help the memory controller to improve scheduling decisions [337, 243, 15, 311, 310, 172, 242, 313, 312] and enable the implementation of a truly-randomized version of PARA [170] (i.e., a protection mechanism against the RowHammer vulnerability [170, 239]). Furthermore, a DRAM-based TRNG would likely have additional hardware and software applications as system designs become more capable and increasingly security-critical.

In addition to traditional computing paradigms, DRAM-based TRNGs can benefit processing-in-memory (PIM) architectures [96, 240, 296], which co-locate logic within or near memory to overcome the large bandwidth and energy bottleneck caused by the memory bus and leverage the *significant* data parallelism available within the DRAM chip itself. Many prior works provide primitives for PIM or exploit PIM-enabled systems for workload acceleration [7, 8, 189, 291, 292, 295, 294, 211, 296, 261, 18, 81, 88, 89, 121, 130, 232, 316, 373, 131, 38, 39, 57, 161, 96, 240, 37, 289]. A low-latency, high-throughput DRAM-based TRNG can enable PIM applications to source random values *directly within the memory itself*, thereby enhancing the overall potential, security, and privacy of PIM-enabled architectures. For example, in applications that require true random numbers, a DRAM-based TRNG can enable large contiguous code segments to execute in memory, which would reduce communication with the CPU, and thus improve system efficiency. A DRAM-based TRNG can also enable security tasks to run completely in memory. This would remove the dependence of PIM-based security tasks on an I/O channel and would increase overall system security.

We posit, based on analysis done in prior works [176, 146, 283], that an *effective* TRNG must satisfy *six* key properties: it must 1) have low implementation cost, 2) be fully non-deterministic such that it is impossible to predict the next output given complete information about how the mechanism operates, 3) provide a continuous stream of true random numbers with high throughput, 4) provide true random numbers with low latency, 5) exhibit low system interference, i.e., not significantly slow down concurrently-running applications, and 6) generate random values with low energy overhead.

To this end, our **goal** in this work is to provide a widely-available TRNG for DRAM devices that satisfies all six key properties of an effective TRNG.

## 5.3 Testing Environment

In order to test our hypothesis that DRAM cells are an effective source of entropy when accessed with reduced DRAM timing parameters, we developed an infrastructure to characterize modern LPDDR4 DRAM chips. We also use an infrastructure for DDR3 DRAM chips, SoftMC [120, 302], to demonstrate empirically that our proposal is applicable beyond the LPDDR4 technology. Both testing environments give us precise control over DRAM commands and DRAM timing parameters as verified with a logic analyzer probing the command bus.

We perform all tests, unless otherwise specified, using a total of 282 2y-nm LPDDR4 DRAM chips from three major manufacturers in a thermally-controlled chamber held at 45°C. For consistency across results, we precisely stabilize the ambient temperature using heaters and fans controlled via a microcontroller-based proportional-integral-derivative (PID) loop to within an accuracy of 0.25°C and a reliable range of 40°C to 55°C. We maintain DRAM temperature at 15°C above ambient temperature using a separate local heating source. We use temperature sensors to smooth out temperature variations caused by self-induced heating.

We also use a separate infrastructure, based on open-source SoftMC [120, 302], to validate our mechanism on 4 DDR3 DRAM chips from a single manufacturer. SoftMC enables precise control over timing parameters, and we house the DRAM chips inside another temperature chamber to maintain a stable ambient testing temperature (with the same temperature range as the temperature chamber used for the LPDDR4 devices).

To explore the various effects of temperature, short-term aging, and circuit-level interference (in Section 5.4) on activation failures, we reduce the $t_{RCD}$ parameter from the default $18\text{ns}$ to $10\text{ns}$ for all experiments, unless otherwise stated. Algorithm 4 explains the general testing methodology we use to induce activation failures. First, we write a data pattern to the region of DRAM under test (Line 2). Next, we reduce the $t_{RCD}$ parameter to begin inducing activation failures (Line 3). We then access the DRAM region in column order (Lines 4-5) in order to ensure that each DRAM access is to a closed DRAM row and thus requires an activation. This enables each access to induce activation failures in DRAM. Prior to each reduced-latency read, we first refresh the target row such that each cell has the same amount of charge each time it is accessed with a reduced-latency read. We effectively refresh a row by issuing an activate (Line 6) followed by a precharge (Line 7) to that row. We then induce the activation failures by issuing consecutive activate (Line 8), read (Line 9), and precharge (Line 10) commands. Afterwards, we record any activation failures that we observe (Line 11). We find that this methodology enables us to quickly induce activation failures across *all* of DRAM, and minimizes testing time.

---

**Algorithm 4:** DRAM Activation Failure Testing

---

```
1  DRAM_ACT_failure_testing(data_pattern, DRAM_region):
2    write data_pattern (e.g., solid 1s) into all cells in DRAM_region
3    set low tRCD for ranks containing DRAM_region
4    foreach col in DRAM_region:
5      foreach row in DRAM_region:
6        activate(row)     // fully refresh cells
7        precharge(row)    // ensure next access activates the row
8        activate(row)
9        read(col)         // induce activation failure on col
10       precharge(row)
11       record activation failures to storage
12   set default tRCD for DRAM ranks containing DRAM_region
```

---
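The command ordering of Algorithm 4 can be sketched in software against a toy model. The sketch below is illustrative only and is not our test infrastructure: `ToyDram`, its per-cell failure probabilities, and all method names are hypothetical, and the model simply assumes that only the first read after an activation can fail when $t_{RCD}$ is reduced.

```python
import random


class ToyDram:
    """Hypothetical stand-in for a DRAM region (not the real test platform)."""

    def __init__(self, rows, cols, fail_prob, seed=0):
        self.rows, self.cols = rows, cols
        self.fail_prob = fail_prob      # {(row, col): probability of a wrong read}
        self.rng = random.Random(seed)  # PRNG used only to mimic physical noise
        self.data = [[0] * cols for _ in range(rows)]
        self.open_row = None
        self.first_access = False       # True only right after an activate
        self.low_trcd = False

    def write_pattern(self, bit):
        self.data = [[bit] * self.cols for _ in range(self.rows)]

    def activate(self, row):
        self.open_row, self.first_access = row, True

    def precharge(self):
        self.open_row, self.first_access = None, False

    def read(self, col):
        value = self.data[self.open_row][col]
        # Assumption: only the first access after an activation may fail.
        if self.low_trcd and self.first_access:
            if self.rng.random() < self.fail_prob.get((self.open_row, col), 0.0):
                value ^= 1
        self.first_access = False
        return value


def act_failure_test(dram, bit):
    """Issue commands in the order of Algorithm 4 and count failures per cell."""
    failures = {}
    dram.write_pattern(bit)                 # Line 2
    dram.low_trcd = True                    # Line 3
    for col in range(dram.cols):            # Line 4
        for row in range(dram.rows):        # Line 5
            dram.activate(row)              # Line 6: refresh the row
            dram.precharge()                # Line 7
            dram.activate(row)              # Line 8
            observed = dram.read(col)       # Line 9
            dram.precharge()                # Line 10
            if observed != bit:             # Line 11
                failures[(row, col)] = failures.get((row, col), 0) + 1
    dram.low_trcd = False                   # Line 12
    return failures
```

Iterating over columns in the outer loop means that consecutive reads always target different rows, which is what forces an activation before every read.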

## 5.4 Activation Failure Characterization

To demonstrate the viability of using DRAM cells as an entropy source for random data, we explore and characterize DRAM failures when employing a reduced DRAM activation latency ($t_{RCD}$) across 282 LPDDR4 DRAM chips. We also compare our findings against those of prior works that study an older generation of DDR3 DRAM chips [55, 190, 191, 163] to cross-validate our infrastructure. To understand the effects of changing environmental conditions on a DRAM cell that is used as a source of entropy, we rigorously characterize DRAM cell behavior as we vary four environmental conditions. First, we study the effects of DRAM array design-induced variation (i.e., the spatial distribution of activation failures in DRAM). Second, we study data pattern dependence (DPD) effects on DRAM cells. Third, we study the effects of temperature variation on DRAM cells. Fourth, we study a DRAM cell’s activation failure probability over time. We present several key observations that support the viability of a mechanism that generates random numbers by accessing DRAM cells with a reduced $t_{RCD}$. In Section 5.5, we discuss a mechanism to effectively sample DRAM cells to extract true random numbers while minimizing the effects of environmental condition variation (presented in this section) on the DRAM cells.

### 5.4.1 Spatial Distribution of Activation Failures

To study which regions of DRAM are better suited to generating random data, we first visually inspect the spatial distributions of activation failures both across DRAM chips and within each chip individually. Figure 5-1 plots the spatial distribution of activation failures in a *representative* $1024 \times 1024$ array of DRAM cells taken from a single DRAM chip. Every observed activation failure is marked in black. We make two observations. First, we observe that each contiguous region of 512 DRAM rows<sup>1</sup> consists of repeating rows with the same set (or subset) of column bits that are prone to activation failures. As shown in the figure, rows 0 to 511 have the same 8 (or a subset of the 8) column bits failing in the row, and rows 512 to 1023 have the same 4 (or a subset of the 4) column bits failing in the row. We hypothesize that these contiguous regions reveal the DRAM subarray architecture as a result of variation across the local sense amplifiers in the subarray. We indicate the two subarrays in Figure 5-1 as Subarray A and Subarray B. A “weaker” local sense amplifier results in cells that share its respective *local bitline* in the subarray having an increased probability of failure. For this reason, we observe that activation failures are localized to a few columns within a DRAM subarray as shown in Figure 5-1. Second, we observe that within a subarray, the activation failure probability increases across rows (i.e., activation failures are *more* likely to occur in higher-numbered rows in the subarray and are *less* likely in lower-numbered rows in the subarray). This can be seen from the fact that more cells fail in higher-numbered rows in the subarray (i.e., there are more black marks higher in each subarray). We hypothesize that the failure probability of a cell attached to a local bitline correlates with the distance between the row and the local sense amplifiers, and that rows farther from the sense amplifiers have less time to amplify their data due to the signal propagation delay in a bitline. These observations are similar to those made in prior studies [190, 55, 191, 163] on DDR3 devices.

---

<sup>1</sup>We note that subarrays have either 512 or 1024 (not shown) rows depending on the manufacturer of the DRAM device.



Figure 5-1: Activation failure bitmap in  $1024 \times 1024$  cell array.

We next study the granularity at which we can induce activation failures when accessing a row. We observe (not shown) that activation failures occur *only* within the first cache line that is accessed immediately following an activation. No subsequent access to an already *open* row results in activation failures. This is because cells within the *same* row have a longer time to restore their cell charge (Figure 2-3) when they are accessed after the row has already been opened. We draw two key conclusions: 1) the region *and* bitline of DRAM being accessed affect the number of observable activation failures, and 2) different DRAM subarrays *and* different local bitlines exhibit varying levels of entropy.

### 5.4.2 Data Pattern Dependence

To understand the data pattern dependence of activation failures and DRAM cell entropy, we study how effectively we can discover failures using different data patterns across multiple rounds of testing. Our *goal* in this experiment is to determine which data pattern results in the highest entropy such that we can generate random values with high throughput. Similar to prior works [260, 206] that extensively describe the data patterns, we analyze a total of 40 unique data patterns: solid 1s, checkered, row stripe, column stripe, 16 walking 1s, and the inverses of all 20 aforementioned data patterns.

Figure 5-2 plots the ratio of activation failures discovered by a particular data pattern after 100 iterations of Algorithm 4 relative to the *total* number of failures discovered by *all* patterns for a representative chip from each manufacturer. We call this metric *coverage* because it indicates the effectiveness of a single data pattern to identify all possible DRAM cells that are prone to activation failure. We show results for each pattern individually except for the WALK1 and WALK0 patterns, for which we show the mean (bar) and minimum/maximum (error bars) coverage across all 16 iterations of each walking pattern.
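As a concrete reading of the coverage metric, the following minimal sketch computes, for each data pattern, the fraction of all observed failing cells that the pattern discovers on its own. The function name `coverage`, the pattern labels, and the cell coordinates are hypothetical and chosen only for illustration.

```python
def coverage(failures_by_pattern):
    """Return {pattern: |cells found by pattern| / |cells found by any pattern|}."""
    all_failures = set().union(*failures_by_pattern.values())
    if not all_failures:
        return {name: 0.0 for name in failures_by_pattern}
    return {
        name: len(cells) / len(all_failures)
        for name, cells in failures_by_pattern.items()
    }


# Made-up example: each set holds (row, column) locations of failing cells.
found = {
    "solid0": {(0, 1), (0, 2), (5, 1)},
    "solid1": {(0, 1)},
    "checkered": {(0, 2), (7, 3)},
}
print(coverage(found))
```

In this made-up example, four distinct cells fail overall, so a pattern that finds three of them has a coverage of 0.75 even though no single pattern finds all four.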



Figure 5-2: Data pattern dependence of DRAM cells prone to activation failure over 100 iterations

We make three key observations from this experiment. First, we find that testing with different data patterns identifies different subsets of the total set of possible activation failures. This indicates that 1) different data patterns cause different DRAM cells to fail and 2) specific data patterns induce more activation failures than others. Thus, certain data patterns may extract more entropy from a DRAM cell array than other data patterns. Second, we find that, of *all* 40 tested data patterns, each of the 16 *walking 1s*, for a given device, provides a similarly high coverage, regardless of the manufacturer. This high coverage is similarly provided by only one other data pattern per manufacturer: solid 0s for manufacturers A and B, and walking 0s for manufacturer C. Third, if we repeat this experiment (i.e., Figure 5-2) while varying the number of iterations of Algorithm 4, the *total failure count* across all data patterns *increases* as we increase the number of iterations of Algorithm 4. This indicates that not all DRAM cells fail deterministically when accessed with a reduced $t_{RCD}$, providing a potential source of entropy for random number generation.

We next analyze each cell's probability of failing when accessed with a reduced $t_{RCD}$ (i.e., its *activation failure probability*) to determine which data pattern most effectively identifies cells that provide high entropy. We note that DRAM cells with an activation failure probability $F_{prob}$ of 50% provide high entropy when accessed many times. With the same data used to produce Figure 5-2, we study the different data patterns with regard to the number of cells they cause to fail 50% of the time. Interestingly, we find that the data pattern that induces the most failures overall does not necessarily find the largest number of cells that fail 50% of the time. In fact, when searching for cells with an $F_{prob}$ between 40% and 60%, we observe that the data patterns that find the highest number of cells are solid 0s, checkered 0s, and solid 0s for manufacturers A, B, and C, respectively. We conclude that: 1) due to manufacturing and design variation across DRAM devices from different manufacturers, different data patterns result in different failure probabilities in our DRAM devices, and 2) to provide high entropy when accessing DRAM cells with a reduced $t_{RCD}$, we should use the respective data pattern that finds the largest number of cells with an $F_{prob}$ of 50% for DRAM devices from a given manufacturer.
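To illustrate why cells with an $F_{prob}$ near 50% are the most useful, the sketch below evaluates the Shannon entropy per sample of a cell under the simplifying assumption that each access is an independent Bernoulli trial with failure probability $p$. The helper names `binary_entropy` and `cells_in_band` are hypothetical.

```python
import math


def binary_entropy(p):
    """Shannon entropy (bits per sample) of a Bernoulli source with P(fail) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def cells_in_band(fprob_by_cell, low=0.40, high=0.60):
    """Return the cells whose measured F_prob lies in [low, high]."""
    return [cell for cell, p in fprob_by_cell.items() if low <= p <= high]


for p in (0.0, 0.1, 0.4, 0.5, 0.6, 1.0):
    print(f"F_prob = {p:.1f} -> entropy = {binary_entropy(p):.4f} bits")
```

Under this model, the entropy peaks at exactly 1 bit for $p = 0.5$ and is still about 0.97 bits at the edges of the 40% to 60% band, which is why that band is a reasonable search window.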

Unless otherwise stated, in the rest of this chapter, we use the solid 0s, checkered 0s, and solid 0s data patterns for manufacturers A, B, and C, respectively, to analyze  $F_{prob}$  at the granularity of a single cell and to study the effects of temperature and time on our sources of entropy.

### 5.4.3 Temperature Effects

In this section, we study whether temperature fluctuations affect a DRAM cell's activation failure probability and thus the entropy that can be extracted from the DRAM cell. To analyze temperature effects, we record the $F_{prob}$ of cells throughout our DRAM devices across 100 iterations of Algorithm 4 at 5°C increments between 55°C and 70°C. Figure 5-3 aggregates results across 30 DRAM modules from each DRAM manufacturer. Each point in the figure represents how the $F_{prob}$ of a DRAM cell changes as the temperature changes (i.e., $\Delta F_{prob}$). The x-axis shows the $F_{prob}$ of a single cell at temperature $T$ (i.e., the baseline temperature), and the y-axis shows the $F_{prob}$ of the same cell at temperature $T + 5$ (i.e., 5°C above the baseline temperature). Because we test each cell at each temperature across 100 iterations, the granularity of $F_{prob}$ on both the x- and y-axes is 1%. For a given $F_{prob}$ at temperature $T$ (x% on the x-axis), we aggregate *all* respective $F_{prob}$ points at temperature $T + 5$ (y% on the y-axis) with box-and-whiskers plots<sup>2</sup> to show how the given $F_{prob}$ is affected by the increased DRAM temperature. The *box* is drawn in blue and contains the *median* drawn in red. The *whiskers* are drawn in gray, and the *outliers* are indicated with orange pluses.



Figure 5-3: Effect of temperature variation on failure probability

We observe that $F_{prob}$ at temperature $T + 5$ tends to be higher than $F_{prob}$ at temperature $T$, as shown by the blue region of the figure (i.e., the boxes of the box-and-whiskers plots) lying above the $x = y$ line. However, fewer than 25% of all data points fall below the $x = y$ line, indicating that a portion of cells have a lower $F_{prob}$ as temperature is increased.

---

<sup>2</sup>A box-and-whiskers plot emphasizes the important metrics of a dataset's distribution. The box is lower-bounded by the first quartile (i.e., the median of the first half of the ordered set of data points) and upper-bounded by the third quartile (i.e., the median of the second half of the ordered set of data points). The median falls within the box. The *inter-quartile range* (IQR) is the distance between the first and third quartiles (i.e., box size). Whiskers extend an additional $1.5 \times IQR$ on either side of the box. We indicate outliers, or data points outside of the range of the whiskers, with pluses.
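The box-and-whiskers aggregation used in Figure 5-3 can be sketched as follows. This is a minimal illustration and not the plotting code used to produce the figure: it uses the median-of-halves quartile definition, assumes that the middle element is excluded from both halves when the number of points is odd, and uses the hypothetical function names `box_stats` and `delta_boxes`.

```python
import statistics


def box_stats(values):
    """Summarize one group of F_prob values (requires at least two points)."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])               # median of the first half
    q3 = statistics.median(data[len(data) - half:])   # median of the second half
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr        # whisker limits
    return {
        "q1": q1,
        "median": statistics.median(data),
        "q3": q3,
        "iqr": iqr,
        "whisker_low": low,
        "whisker_high": high,
        "outliers": [v for v in data if v < low or v > high],
    }


def delta_boxes(pairs):
    """Group F_prob at T+5 by the same cell's F_prob at T, then summarize."""
    groups = {}
    for at_t, at_t_plus_5 in pairs:
        groups.setdefault(at_t, []).append(at_t_plus_5)
    return {x: box_stats(ys) for x, ys in groups.items() if len(ys) >= 2}
```

Each key of the dictionary returned by `delta_boxes` corresponds to one x-axis position, and a box that sits above its key indicates that cells at that baseline $F_{prob}$ tend to fail more often at the higher temperature.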

We observe that DRAM devices from different manufacturers are affected by temperature differently. DRAM cells of manufacturer A have the *least* variation of  $\Delta F_{prob}$  when temperature is increased since the boxes of the box-and-whiskers plots are strongly correlated with the  $x = y$  line. How a DRAM cell’s activation failure probability changes in DRAM devices from *other* manufacturers is unfortunately *less* predictable under temperature change (i.e., a DRAM cell from manufacturers B or C has higher variation in  $F_{prob}$  change), but the data still shows a strong positive correlation between temperature and  $F_{prob}$ . We conclude that temperature affects cell failure probability ( $F_{prob}$ ) to different degrees depending on the manufacturer of the DRAM device, but increasing temperature generally increases the activation failure probability.

### 5.4.4 Entropy Variation over Time

To determine whether the failure probability of a DRAM cell changes over time, we complete 250 *rounds* of recording the activation failure probability of DRAM cells over the span of 15 days. Each round consists of accessing every cell in DRAM 100 times with a reduced $t_{RCD}$ value and recording the failure probability for each individual cell (out of 100 iterations). We find that a DRAM cell’s activation failure probability does *not* change significantly over time. This means that, once we identify a DRAM cell that exhibits high entropy, we can rely on the cell to maintain its high entropy over time. We hypothesize that this is because a DRAM cell fails with high entropy when process manufacturing variation in peripheral and DRAM cell circuit elements combine such that, when we read the cell using a reduced $t_{RCD}$ value, we induce a metastable state resulting from the cell voltage falling between the reliable sensing margins (i.e., falling close to $\frac{V_{dd}}{2}$) [55]. Since manufacturing variation is fully determined at manufacturing time, a DRAM cell’s activation failure probability is stable over time given the same experimental conditions. In Section 5.5.1, we discuss our methodology for selecting DRAM cells for extracting stable entropy, such that we can preemptively avoid longer-term aging effects that we do not study in this chapter.

## 5.5 D-RaNGe: A DRAM-based TRNG

Based on our rigorous analysis of DRAM activation failures (presented in Section 5.4), we propose D-RaNGe, a flexible mechanism that provides high-throughput DRAM-based true random number generation (TRNG) by sourcing entropy from a subset of DRAM cells and is built fully within the memory controller. D-RaNGe is based on the **key observation** that DRAM cells fail probabilistically when accessed with reduced DRAM timing parameters, and this probabilistic failure mechanism can be used as a source of true random numbers. While there are many other timing parameters that we could reduce to induce failures in DRAM [55, 190, 191, 163, 188, 54], we focus specifically on reducing  $t_{RCD}$  below manufacturer-recommended values to study the resulting activation failures.<sup>3</sup>

Activation failures occur as a result of reading the value from a DRAM cell *too soon* after sense amplification. This results in reading the value at the sense amplifiers before the bitline voltage is amplified to an I/O-readable voltage level. The probability of reading incorrect data from the DRAM cell therefore depends largely on the bitline's voltage at the time of reading the sense amplifiers. Because there is significant process variation across the DRAM cells and I/O circuitry [55, 190, 191, 163], we observe a wide variety of failure probabilities for different DRAM cells (as discussed in Section 5.4) for a given  $t_{RCD}$  value, ranging from 0% probability to 100% probability.

**We discover that in each DRAM chip, a subset of cells fail at $\sim 50\%$ probability, and a subset of these cells fail randomly with high entropy (shown in Section 5.6.2).** In this section, we first discuss our method of identifying such cells, which we refer to as *RNG cells* (in Section 5.5.1). Second, we describe the mechanism with which D-RaNGe *samples* RNG cells to extract random data (Section 5.5.2). Finally, we discuss a potential design for integrating D-RaNGe in a full system (Section 5.5.3).

---

<sup>3</sup>We believe that reducing other timing parameters could be used to generate true random values, but we leave their exploration to future work.

### 5.5.1 RNG Cell Identification

Prior to generating random data, we must first identify cells that are capable of producing truly random output (i.e., RNG cells). Our process of identifying RNG cells involves reading every cell in the DRAM array 1000 times with a *reduced*  $t_{RCD}$  and approximating each cell’s Shannon entropy [298] by counting the occurrences of 3-bit symbols across its 1000-bit stream. We identify cells that generate an approximately equal number of every possible 3-bit symbol ( $\pm 10\%$  of the number of expected symbols) as RNG cells.
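The symbol-counting filter can be sketched as follows. This is an illustrative sketch under stated assumptions: the text does not specify whether the 3-bit symbols overlap, so the code assumes non-overlapping symbols, and the names `is_rng_cell` and `symbol_entropy` are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def symbol_counts(bits, symbol_len=3):
    """Count non-overlapping symbols of symbol_len bits (assumed framing)."""
    n_symbols = len(bits) // symbol_len
    symbols = (tuple(bits[i * symbol_len:(i + 1) * symbol_len])
               for i in range(n_symbols))
    return Counter(symbols), n_symbols


def is_rng_cell(bits, symbol_len=3, tolerance=0.10):
    """True if every possible symbol occurs within +/- tolerance of its expectation."""
    counts, n_symbols = symbol_counts(bits, symbol_len)
    if n_symbols == 0:
        return False
    expected = n_symbols / (2 ** symbol_len)
    low, high = expected * (1 - tolerance), expected * (1 + tolerance)
    return all(low <= counts[sym] <= high
               for sym in product((0, 1), repeat=symbol_len))


def symbol_entropy(bits, symbol_len=3):
    """Approximate Shannon entropy per bit from the symbol histogram."""
    counts, n_symbols = symbol_counts(bits, symbol_len)
    if n_symbols == 0:
        return 0.0
    total = -sum((c / n_symbols) * math.log2(c / n_symbols)
                 for c in counts.values())
    return total / symbol_len
```

With a 1000-bit stream and non-overlapping framing, there are 333 symbols, so each of the 8 symbols is expected about 41.6 times and must occur between roughly 37.5 and 45.8 times for the cell to qualify.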

We find that RNG cells provide unbiased output, meaning that a post-processing step (described in Section 5.1) is *not* necessary to provide sufficiently high entropy for random number generation. We also find that RNG cells *maintain high entropy across system reboots*. In order to account for our observation that entropy from an RNG cell changes depending on the DRAM temperature (Section 5.4.3), we identify reliable RNG cells at each temperature and store their locations in the memory controller. Depending on the DRAM temperature at the time an application requests random values, D-RaNGe samples the appropriate RNG cells. To ensure that DRAM aging does not negatively impact the reliability of RNG cells, we require re-identifying the set of RNG cells at regular intervals. From our observation that entropy does not change significantly over a tested 15-day period of sampling RNG cells (Section 5.4.4), we expect the interval of re-identifying RNG cells to be at least 15 days long. Our RNG cell identification process is effective at identifying cells that are reliable entropy sources for random number generation, and we quantify their randomness using the NIST test suite for randomness [279] in Section 5.6.1.

### 5.5.2 Sampling RNG Cells for Random Data

Given the availability of these RNG cells, we use our observations in Section 5.4 to design a high-throughput TRNG that quickly and repeatedly samples RNG cells with reduced DRAM timing parameters. Algorithm 5 demonstrates the key components of D-RaNGe that enable us to generate random numbers with high throughput. D-RaNGe takes in *num\_bits* as an argument, which is defined as the number of random bits desired (Line 1). D-RaNGe then prepares to generate random numbers in Lines 2-6 by first selecting DRAM words (i.e., the granularity at which a DRAM module is accessed) containing known RNG cells for generating random data (Line 3). To maximize the throughput of random number generation, D-RaNGe chooses DRAM words with the highest density of RNG cells in each bank (to exploit DRAM parallelism). Since each DRAM access can induce activation failures *only* in the accessed DRAM word, the density of RNG cells per DRAM word determines the number of random bits D-RaNGe can generate per access. For each available DRAM bank, D-RaNGe selects two DRAM words (in distinct DRAM rows) containing RNG cells. The purpose of selecting two DRAM words in *different* rows is to *repeatedly* cause *bank conflicts*, or issue requests to *closed* DRAM rows so that every read request will *immediately* follow an activation. This is done by alternating accesses to the chosen DRAM words in different DRAM rows. After selecting DRAM words for generating random values, D-RaNGe writes a known data pattern that results in high entropy to each chosen DRAM word and its neighboring cells (Line 4) and gains exclusive access to rows containing the two chosen DRAM words as well as their neighboring cells (Line 5).<sup>4</sup> This ensures that the data pattern surrounding the RNG cell and the original value of the RNG cell stay constant prior to each access such that the failure probability of each RNG cell remains reliable (as observed to be necessary in Section 5.4.2). To begin generating random data (i.e., sampling RNG cells), D-RaNGe reduces the value of $t_{RCD}$ (Line 6). From every available bank (Line 7), D-RaNGe generates random values in parallel (Lines 8-15). Lines 8 and 12 indicate the commands to alternate accesses to two DRAM words in distinct rows of a bank to both 1) induce activation failures and 2) precharge the recently-accessed row. After inducing activation failures in a DRAM word, D-RaNGe extracts the value of the RNG cells within the DRAM word (Lines 9 and 13) to use as random data and restores the DRAM word to its original data value (Lines 10 and 14) to maintain the original data pattern. Line 15 ensures that writing the original data value is complete before attempting to sample the DRAM words again. Lines 16 and 17 simply end the loop if enough random bits of data have been harvested. Line 18 sets the $t_{RCD}$ timing parameter back to its default value, so other applications can access DRAM without corrupting data. Line 19 releases exclusive access to the rows containing the chosen DRAM words and their neighboring rows.

---

**Algorithm 5:** D-RaNGe: A DRAM-based TRNG

---

```
1  D-RaNGe(num_bits): // num_bits: number of random bits requested
2    DP: a known data pattern that results in high entropy
3    select 2 DRAM words with RNG cells in distinct rows in each bank
4    write DP to chosen DRAM words and their neighboring cells
5    get exclusive access to rows of chosen DRAM words and nearby cells
6    set low tRCD for DRAM ranks containing chosen DRAM words
7    for each bank:
8      read data in DW1 // induce activation failure
9      write the read value of DW1's RNG cells to bitstream
10     write original data value back into DW1
11     memory barrier // ensure completion of write to DW1
12     read data in DW2 // induce activation failure
13     write the read value of DW2's RNG cells to bitstream
14     write original data value back into DW2
15     memory barrier // ensure completion of write to DW2
16     if bitstream_size ≥ num_bits:
17       break
18   set default tRCD for DRAM ranks of the chosen DRAM words
19   release exclusive access to rows of chosen words and nearby cells
```

---

We find that this methodology maximizes the opportunity for activation failures in DRAM, thereby maximizing the rate of generating random data from RNG cells.
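The sampling loop of Algorithm 5 can be illustrated with a small software model. This sketch is not the memory controller implementation: `ToyBank` and `d_range` are hypothetical names, Python's `random` module is a pseudo-random stand-in for the physical activation failures (so the output here is not truly random), and the outer `while` loop that repeats the per-bank pass until enough bits are collected is an assumption about how the loop is re-entered.

```python
import random


class ToyBank:
    """Hypothetical bank holding two DRAM words (DW1, DW2) in distinct rows."""

    def __init__(self, rng_cells_dw1, rng_cells_dw2, word_bits=64,
                 pattern_bit=0, rng=None):
        self.pattern_bit = pattern_bit
        self.words = [[pattern_bit] * word_bits, [pattern_bit] * word_bits]
        self.rng_cells = [list(rng_cells_dw1), list(rng_cells_dw2)]
        self.rng = rng or random.Random()

    def read_low_trcd(self, which):
        """Read a word right after activation; RNG cells flip with ~50% chance."""
        word = list(self.words[which])
        for pos in self.rng_cells[which]:
            if self.rng.random() < 0.5:
                word[pos] ^= 1
        return word

    def restore(self, which):
        """Write the original data pattern back into the word."""
        self.words[which] = [self.pattern_bit] * len(self.words[which])


def d_range(banks, num_bits):
    """Collect num_bits by alternating between the two words of every bank."""
    if not any(cells for bank in banks for cells in bank.rng_cells):
        raise ValueError("no RNG cells available")
    bitstream = []
    while len(bitstream) < num_bits:              # assumed repetition of the pass
        for bank in banks:                        # Line 7
            for which in (0, 1):                  # DW1 (Lines 8-11), DW2 (12-15)
                word = bank.read_low_trcd(which)
                bitstream.extend(word[pos] for pos in bank.rng_cells[which])
                bank.restore(which)
            if len(bitstream) >= num_bits:        # Lines 16-17
                break
    return bitstream[:num_bits]
```

Alternating between the two words of a bank mirrors the bank-conflict trick in the text: each read targets a row other than the one most recently opened, so every read follows a fresh activation.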

---

<sup>4</sup>Ensuring exclusive access to DRAM rows can be done by remapping rows to 1) redundant DRAM rows or 2) buffers in the memory controller so that these rows are hidden from the system software and only accessible by the memory controller for generating random numbers.

### 5.5.3 Full System Integration

In this work, we focus on developing a flexible substrate for sampling RNG cells fully from within the memory controller. D-RaNGe generates random numbers using a simple firmware routine running entirely within the memory controller. The firmware executes the sampling algorithm (Algorithm 5) whenever an application requests random samples and there is available DRAM bandwidth (i.e., DRAM is not servicing other requests or maintenance commands). In order to minimize latency between requests for samples and their corresponding responses, a small queue of already-harvested random data may be maintained in the memory controller for use by the system. Overall performance overhead can be minimized by tuning both 1) the queue size and 2) how the memory controller prioritizes requests for random numbers relative to normal memory requests.
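One possible shape for the small queue of already-harvested random data is sketched below. The class `RandomBitQueue`, its methods, and the `harvest` callback are hypothetical and only illustrate the buffering idea; the sketch does not model how the memory controller arbitrates between random number requests and normal memory requests.

```python
from collections import deque


class RandomBitQueue:
    """Hypothetical buffer of pre-harvested bits kept in the memory controller."""

    def __init__(self, capacity_bits, harvest):
        self.capacity = capacity_bits
        self.harvest = harvest          # callable(n) -> list of n random bits
        self.buf = deque()

    def refill_when_idle(self):
        """Top up the queue; intended to run when DRAM bandwidth is available."""
        missing = self.capacity - len(self.buf)
        if missing > 0:
            self.buf.extend(self.harvest(missing))

    def request(self, n):
        """Serve n bits, harvesting on demand only if the queue runs short."""
        shortfall = n - len(self.buf)
        if shortfall > 0:
            self.buf.extend(self.harvest(shortfall))
        return [self.buf.popleft() for _ in range(n)]
```

A larger `capacity_bits` hides more of the harvesting latency from applications at the cost of more storage in the memory controller, which is the trade-off behind tuning the queue size.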

In order to integrate D-RaNGe with the rest of the system, the system designer needs to decide how to best expose an interface by which an application can leverage D-RaNGe to generate true random numbers on their system. There are many ways to achieve this, including, but not limited to:

- Providing a simple `REQUEST` and `RECEIVE` interface for applications to request and receive the random numbers using memory-mapped configuration status registers (CSRs) [357] or other existing I/O datapaths (e.g., x86 `IN` and `OUT` opcodes, Local Advanced Programmable Interrupt Controller (LAPIC) configuration [133]).
- Adding a new ISA instruction (e.g., Intel `RDRAND` [114]) that retrieves random numbers from the memory controller and stores them into processor registers.

The operating system may then expose one or more of these interfaces to user applications through standard kernel-user interfaces (e.g., system calls, file I/O, operating system APIs). The system designer has complete freedom to choose between these (and other) mechanisms that expose an interface for user applications to interact with D-RaNGe. We expect that the best option will be system specific, depending both on the desired D-RaNGe use cases and the ease with which the design can be implemented.

## 5.6 D-RaNGe Evaluation

We evaluate three key aspects of D-RaNGe. First, we show that the random data obtained from RNG cells identified by D-RaNGe passes all of the tests in the NIST test suite for randomness (Section 5.6.1). Second, we analyze the existence of RNG cells across 59 LPDDR4 and 4 DDR3 DRAM chips (due to long testing time) randomly sampled from the overall population of DRAM chips across all three major DRAM manufacturers (Section 5.6.2). Third, we evaluate D-RaNGe in terms of the six key properties of an ideal TRNG as explained in Section 5.2 (Section 5.6.3).

### 5.6.1 NIST Tests

First, we identify RNG cells using our RNG cell identification process (Section 5.5.1). Second, we sample *each* identified RNG cell one million times to generate large amounts of random data (i.e., 1 Mb *bitstreams*). Third, we evaluate the entropy of the bitstreams from the identified RNG cells with the NIST test suite for randomness [279]. Table 5.1 shows the average results of 236 1 Mb bitstreams<sup>5</sup> across the 15 tests of the full NIST test suite for randomness. P-values are calculated for each test,<sup>6</sup> where the null hypothesis for each test is that a perfect random number generator would *not* have produced random data with *better* characteristics for the given test than the tested sequence [228]. Since the resulting P-values for each test in the suite are greater than our chosen level of significance, $\alpha = 0.0001$, we accept our null hypothesis for each test. We note that all 236 bitstreams pass all 15 tests with similar P-values. Given our $\alpha = 0.0001$, our proportion of passing sequences (1.0) falls within the range of acceptable proportions of sequences that pass each test ([0.998, 1], calculated by the NIST statistical test suite using $(1 - \alpha) \pm 3\sqrt{\frac{\alpha(1-\alpha)}{k}}$, where $k$ is the number of tested sequences). This *strongly* indicates that D-RaNGe can generate unpredictable, truly random values. Using the proportion of 1s and 0s generated from each RNG cell, we calculate Shannon entropy [298] and find the *minimum* entropy across all RNG cells to be 0.9507.

---

<sup>5</sup>We test data obtained from 4 RNG cells from each of 59 DRAM chips, to maintain a reasonable NIST testing time and thus show that RNG cells across all tested DRAM chips reliably generate random values.

<sup>6</sup>A p-value close to 1 indicates that we must accept the null hypothesis, while a p-value close to 0 and below a small threshold, e.g., $\alpha = 0.0001$ (recommended by the NIST Statistical Test Suite documentation [279]), indicates that we must reject the null hypothesis.

| NIST Test Name                    | P-value | Status |
|-----------------------------------|---------|--------|
| monobit                           | 0.675   | PASS   |
| frequency_within_block            | 0.096   | PASS   |
| runs                              | 0.501   | PASS   |
| longest_run_ones_in_a_block       | 0.256   | PASS   |
| binary_matrix_rank                | 0.914   | PASS   |
| dft                               | 0.424   | PASS   |
| non_overlapping_template_matching | >0.999  | PASS   |
| overlapping_template_matching     | 0.624   | PASS   |
| maurers_universal                 | 0.999   | PASS   |
| linear_complexity                 | 0.663   | PASS   |
| serial                            | 0.405   | PASS   |
| approximate_entropy               | 0.735   | PASS   |
| cumulative_sums                   | 0.588   | PASS   |
| random_excursion                  | 0.200   | PASS   |
| random_excursion_variant          | 0.066   | PASS   |

Table 5.1: D-RaNGe results with NIST randomness test suite.
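As a minimal sketch of the two calculations described in this section, the following Python snippet evaluates the acceptable-proportion range $(1 - \alpha) \pm 3\sqrt{\alpha(1-\alpha)/k}$ for $\alpha = 0.0001$ and $k = 236$, and computes the Shannon entropy of a cell from its proportion of 1s. The function names and the clipping of the range to $[0, 1]$ are illustrative assumptions; this is not the NIST suite's own code.

```python
import math


def acceptable_proportion_range(alpha: float, k: int) -> tuple[float, float]:
    """Range (1 - alpha) +/- 3*sqrt(alpha*(1 - alpha)/k), clipped to [0, 1]."""
    center = 1.0 - alpha
    half_width = 3.0 * math.sqrt(alpha * (1.0 - alpha) / k)
    return max(0.0, center - half_width), min(1.0, center + half_width)


def shannon_entropy(p_one: float) -> float:
    """Shannon entropy (bits per sample) of a cell that reads 1 with probability p_one."""
    if p_one <= 0.0 or p_one >= 1.0:
        return 0.0  # a cell that always returns the same value carries no entropy
    p_zero = 1.0 - p_one
    return -(p_one * math.log2(p_one) + p_zero * math.log2(p_zero))


# alpha and k come from the text: alpha = 0.0001, k = 236 bitstreams.
low, high = acceptable_proportion_range(alpha=0.0001, k=236)
print(f"acceptable proportion range: [{low:.3f}, {high:.3f}]")
print(f"entropy of an unbiased cell: {shannon_entropy(0.5):.4f}")
```

With these inputs the lower bound rounds to 0.998, so an observed pass proportion of 1.0 lies inside the range.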

### 5.6.2 RNG Cell Distribution

The throughput at which D-RaNGe generates random numbers is a function of the 1) density of RNG cells per DRAM word and 2) bandwidth with which we can access DRAM words when using our methodology for inducing activation failures. Since each DRAM access can induce activation failures *only* in the accessed DRAM word, the density of RNG cells per DRAM word indicates the number of random bits D-RaNGe can sample per access. We first study the density of RNG cells per word across DRAM chips. Figure 5-4 plots the distribution of the number of words containing $x$ RNG cells (indicated by the value on the x-axis) per *bank* across 472 banks from 59 DRAM devices from all manufacturers. The distribution is presented as a box-and-whiskers plot where the y-axis has a logarithmic scale with a 0 point. The three plots respectively show the distributions for DRAM devices from the three manufacturers (indicated at the bottom left corner of each plot).



Figure 5-4: Density of RNG cells in DRAM words per bank.

We make three key observations. First, RNG cells are *widely available in every bank* across many chips. This means that we can use the available DRAM access parallelism that multiple banks offer and sample RNG cells from each DRAM bank in parallel to improve random number generation throughput. Second, *every* bank that we analyze has *multiple DRAM words* containing at least one RNG cell. The DRAM bank with the smallest occurrence of RNG cells has 100 DRAM words containing only 1 RNG cell (manufacturer B). Discounting this point, the distribution of the number of DRAM words containing only 1 RNG cell is tight with a high number of RNG cells (e.g., tens of thousands) in each bank, regardless of the manufacturer. Given our random sample of DRAM chips, we expect that the existence of RNG cells in DRAM banks will hold true for all DRAM chips. Third, we observe that a single DRAM word can contain as many as 4 RNG cells. Because the throughput of accesses to DRAM is fixed, the number of RNG cells in the accessed words essentially acts as a multiplier for the throughput of random numbers generated (e.g., accessing DRAM words containing 4 RNG cells results in  $4x$  the throughput of random numbers compared to accessing DRAM words containing 1 RNG cell).

### 5.6.3 TRNG Key Characteristics Evaluation

We now evaluate D-RaNGe in terms of the six key properties of an effective TRNG as explained in Section 5.2.

**Low Implementation Cost.** To induce activation failures, we must be able to reduce the DRAM timing parameters below manufacturer-specified values. Because memory controllers issue memory accesses according to the timing parameters specified in a set of internal registers, D-RaNGe requires simple software support to be able to programmatically modify the memory controller’s registers. Fortunately, there exist some processors [191, 11, 12, 80] that *already* enable software to directly change memory controller register values, i.e., the DRAM timing parameters. These processors can easily generate random numbers with D-RaNGe.

All other processors that do *not* currently support direct changes to memory controller registers require *minimal* software changes to expose an interface for changing the memory controller registers [14, 281, 120, 302]. To enable a more efficient implementation, the memory controller could be programmed such that it issues DRAM accesses with distinct timing parameters on a per-access granularity, which would 1) reduce the overhead of changing the DRAM timing parameters and 2) allow concurrent DRAM accesses by other applications. In the rare case where these registers are unmodifiable by even the hardware, the hardware changes necessary to enable register modification are minimal and are simple to implement [191, 120, 302].

We experimentally find that we can induce activation failures with  $t_{RCD}$  between 6ns and 13ns (reduced from the default of 18ns). Given this wide range of failure-inducing  $t_{RCD}$  values, most memory controllers should be able to adjust their timing parameter registers to a value within this range.

**Fully Non-deterministic.** As we have shown in Section 5.6.1, the bitstreams extracted from the D-RaNGe-identified RNG cells pass *all* 15 NIST tests. We have every reason to believe that we are inducing a metastable state of the sense amplifiers (as hypothesized by [55]) such that we are effectively sampling random physical phenomena to extract unpredictable random values.

**High Throughput of Random Data.** Due to the various use cases of random number generation discussed in Section 5.2, different applications have different throughput requirements for random number generation, and applications may tolerate a reduction in performance so that D-RaNGe can quickly generate true random numbers. Fortunately, D-RaNGe provides flexibility to trade off between the *system interference* it causes, i.e., the slowdown experienced by concurrently running applications, and the random number generation throughput it provides. To demonstrate this flexibility, Figure 5-5 plots the TRNG throughput of D-RaNGe when using varying numbers of banks ($x$ banks on the x-axis) across the three DRAM manufacturers (indicated at the top left corner of each plot). For each number of banks used, we plot the distribution of TRNG throughput that we observe *real* DRAM devices to provide. The available density of RNG cells in a DRAM device (provided in Figure 5-4) dictates the TRNG throughput that the DRAM device can provide. We plot each distribution as a box-and-whiskers plot. For each number of banks used, we select $x$ banks with the greatest sum of RNG cells across each bank's two DRAM words with the highest density of RNG cells (that are *not* in the same DRAM row). We select two DRAM words per bank because we must alternate accesses between two DRAM rows (as shown in Lines 8 and 12 of Algorithm 5). The sum of the RNG cells available across the two selected DRAM words for each bank is considered each bank's *TRNG data rate*, and we use this value to obtain D-RaNGe's throughput. We use Ramulator [3, 174] to obtain the rate at which we can execute the core loop of Algorithm 5 with varying numbers of banks. We obtain the random number generation throughput for $x$ banks with the following equation:

$$TRNG\_Throughput_{x\_Banks} = \sum_{n=1}^x \frac{TRNG\_data\_rate_{Bank\_n}}{Alg5\_Runtime_{x\_banks}} \quad (5.1)$$

where $TRNG\_data\_rate_{Bank\_n}$ is the TRNG data rate for the selected bank, and $Alg5\_Runtime_{x\_banks}$ is the runtime of the core loop of Algorithm 5 when using $x$ banks. We note that because we observe small variation in the density of RNG cells per word (between 0 and 4), we see that TRNG throughput across different chips is generally very similar. For this reason, we see that the box and whiskers are condensed into a single point for distributions of manufacturers B and C. We find that when *fully* using *all* 8 banks in a single DRAM channel, every device can produce *at least* 40 Mb/s of random data regardless of manufacturer. The highest throughput we observe from devices of manufacturers A/B/C respectively are 179.4/134.5/179.4 Mb/s. On average, across all manufacturers, we find that D-RaNGe can provide a throughput of 108.9 Mb/s.
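Equation 5.1 can be sketched in a few lines of Python. The snippet below is only an illustration under stated assumptions: the per-bank RNG cell counts and the loop runtime are hypothetical values (not measurements from the tested devices), and the function names are ours.

```python
def bank_data_rate(word_a_rng_cells: int, word_b_rng_cells: int) -> int:
    """Random bits one bank contributes per core-loop iteration (its TRNG data rate)."""
    return word_a_rng_cells + word_b_rng_cells


def trng_throughput(bank_data_rates, loop_runtime_s: float) -> float:
    """Equation 5.1: sum over the selected banks of data_rate / loop runtime, in bits/s."""
    if loop_runtime_s <= 0:
        raise ValueError("loop runtime must be positive")
    return sum(rate / loop_runtime_s for rate in bank_data_rates)


# Hypothetical inputs, chosen only for illustration (not measured values):
# each tuple holds the RNG cell counts of a bank's two densest words.
all_banks = [bank_data_rate(a, b) for a, b in [(4, 3), (1, 1), (2, 2), (3, 1)]]
selected = sorted(all_banks, reverse=True)[:2]  # pick the x = 2 densest banks
throughput = trng_throughput(selected, loop_runtime_s=200e-9)
print(f"{throughput / 1e6:.1f} Mb/s")
```

Sorting by data rate mirrors the bank selection described in the text, where the $x$ banks with the greatest sum of RNG cells are used.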



Figure 5-5: Distribution of TRNG throughput across chips.

We draw two key conclusions. First, due to the *parallelism* of multiple banks, the throughput of random number generation increases linearly as we use more banks. Second, there is variation of TRNG throughput across different DRAM devices, but the medians across manufacturers are very similar.

We note that any throughput sample point on this figure can be multiplied by the number of available channels in a memory hierarchy for a better TRNG throughput estimate for a system with multiple DRAM channels. For an example memory hierarchy comprised of 4 DRAM channels, D-RaNGe results in a maximum (average) throughput of 717.4 Mb/s (435.7 Mb/s).

**Low Latency.** Since D-RaNGe’s sampling mechanism consists of a single DRAM access, the latency of generating random values is directly related to the DRAM access latency. Using the timing parameters specified in the JEDEC LPDDR4 specification [141], we calculate D-RaNGe’s latency to generate a 64-bit random value. To calculate the *maximum latency* for D-RaNGe, we assume that 1) each DRAM access provides only 1 bit of random data (i.e., each DRAM word contains *only* 1 RNG cell) and 2) we can use only a single bank within a single channel to generate random data. We find that D-RaNGe can generate 64 bits of random data with a *maximum latency* of 960ns. If D-RaNGe takes full advantage of DRAM’s channel- and bank-level parallelism in a system with 4 DRAM channels and 8 banks per channel, D-RaNGe can generate 64 bits of random data by issuing 16 DRAM accesses per channel in parallel. This results in a latency of 220ns. To calculate the *empirical minimum latency* for D-RaNGe, we fully parallelize D-RaNGe across banks in all 4 channels while also assuming that each DRAM access provides 4 bits of random data, since we find a maximum density of 4 RNG cells per DRAM word in the LPDDR4 DRAM devices that we characterize (Figure 5-4). We find the empirical minimum latency to be *only* 100ns in our tested devices.
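The three latency scenarios differ only in how many DRAM accesses must be serialized. The following sketch counts accesses for each scenario; it deliberately does not model the LPDDR4 timing parameters, so it does not reproduce the 960ns/220ns/100ns figures themselves. The function name and the even distribution of accesses across channels and banks are assumptions made for illustration.

```python
import math


def accesses_needed(bits_wanted: int, bits_per_access: int,
                    channels: int, banks_per_channel: int):
    """Return (total, per-channel, per-bank) access counts, spreading work evenly."""
    total = math.ceil(bits_wanted / bits_per_access)
    per_channel = math.ceil(total / channels)
    per_bank = math.ceil(per_channel / banks_per_channel)
    return total, per_channel, per_bank


# Maximum-latency case: 1 bit per access, one bank in one channel.
print(accesses_needed(64, 1, channels=1, banks_per_channel=1))
# Parallel case: 1 bit per access, 4 channels with 8 banks each.
print(accesses_needed(64, 1, channels=4, banks_per_channel=8))
# Empirical-minimum case: 4 bits per access, 4 channels with 8 banks each.
print(accesses_needed(64, 4, channels=4, banks_per_channel=8))
```

In the parallel case the count per channel is 16, matching the 16 DRAM accesses per channel stated in the text.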

**Low System Interference.** The flexibility of using a different number of banks across the available channels in a system’s memory hierarchy allows D-RaNGe to cause varying levels of system interference at the expense of TRNG throughput. This enables application developers to generate random values with D-RaNGe at varying tradeoff points depending on the running applications’ memory access requirements. We analyze D-RaNGe’s system interference with respect to DRAM storage overhead and DRAM latency.

In terms of storage overhead, D-RaNGe simply requires exclusive access rights to six DRAM rows per bank, consisting of the two rows containing the RNG cells and each row’s two physically-adjacent DRAM rows containing the chosen data pattern.<sup>7</sup> This results in an insignificant 0.018% DRAM storage overhead cost.
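The storage overhead is the ratio of reserved rows to total rows in a bank. The sketch below illustrates the arithmetic; the bank size of 32,768 rows is an assumption (the text does not state the number of rows per bank), chosen because it is consistent with the reported 0.018%.

```python
def storage_overhead_percent(reserved_rows_per_bank: int, rows_per_bank: int) -> float:
    """Percentage of a bank's rows reserved exclusively for D-RaNGe."""
    return 100.0 * reserved_rows_per_bank / rows_per_bank


# Two rows with RNG cells, each with its two physically-adjacent rows: 2 * (1 + 2) = 6.
RESERVED_ROWS = 2 * (1 + 2)
# Assumption: 32,768 rows per bank (hypothetical; not given in the text).
ASSUMED_ROWS_PER_BANK = 32768

overhead = storage_overhead_percent(RESERVED_ROWS, ASSUMED_ROWS_PER_BANK)
print(f"{overhead:.3f}%")
```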

To evaluate D-RaNGe’s effect on the DRAM access latency of regular memory requests, we present one implementation of D-RaNGe. For a single DRAM channel, which is the granularity at which DRAM timing parameters are applied, D-RaNGe can alternate between using a reduced $t_{RCD}$ and the default $t_{RCD}$. When using a reduced $t_{RCD}$, D-RaNGe generates random numbers across every bank in the channel. On the other hand, when using the default $t_{RCD}$, memory requests from running applications are serviced to ensure application progress. The length of these time intervals (with default/reduced $t_{RCD}$) can both be adjusted according to the applications’ random number generation requirements. Overall, D-RaNGe provides significant flexibility in trading off its system overhead with its TRNG throughput. However, it is up to the system designer to use and exploit the flexibility for their requirements. To show the potential throughput of D-RaNGe without impacting concurrently-running applications, we run simulations with the SPEC CPU2006 [5] workloads, and calculate the idle DRAM bandwidth available that we can use to issue D-RaNGe commands. We find that, across all workloads, we can obtain an average (maximum, minimum) random-value throughput of 83.1 (98.3, 49.1) Mb/s with *no* significant impact on overall system performance.

---

<sup>7</sup>As in prior work [170, 27, 239], we argue that manufacturers can disclose which rows are physically adjacent to each other.

**Low Energy Consumption.** To evaluate the energy consumption of D-RaNGe, we use DRAMPower [2] to analyze the output traces of Ramulator [3, 174] when DRAM is (1) generating random numbers (Algorithm 5), and (2) idling and not servicing memory requests. We subtract quantity (2) from (1) to obtain the estimated energy consumption of D-RaNGe. We then divide the value by the total number of random bits found during execution and find that, on average, D-RaNGe finds random bits at the cost of 4.4 nJ/bit.

## 5.7 Comparison with Prior DRAM TRNGs

To our knowledge, D-RaNGe is the highest-throughput TRNG *for commodity DRAM devices* that works by exploiting activation failures as a sampling mechanism for observing entropy in DRAM cells. There are a number of proposals to construct TRNGs using commodity DRAM devices, which we summarize in Table 5.2 based on their entropy sources. In this section, we compare each of these works with D-RaNGe. We show how D-RaNGe fulfills the six key properties of an ideal TRNG (Section 5.2) better than any prior DRAM-based TRNG proposal. We group our comparisons by the entropy source of each prior DRAM-based TRNG proposal.

| Proposal           | Year | Entropy Source      | True Random | Streaming Capable | 64-bit TRNG Latency | Energy Consumption         | Peak Throughput |
|--------------------|------|---------------------|-------------|-------------------|---------------------|----------------------------|-----------------|
| Pyo+ [264]         | 2009 | Command Schedule    | ✗           | ✓                 | 18 $\mu$s           | N/A                        | 3.40 Mb/s       |
| Keller+ [154]      | 2014 | Data Retention      | ✓           | ✓                 | 40s                 | 6.8mJ/bit                  | 0.05 Mb/s       |
| Tehranipoor+ [330] | 2016 | Startup Values      | ✓           | ✗                 | > 60ns (optimistic) | > 245.9pJ/bit (optimistic) | N/A             |
| Sutar+ [317]       | 2018 | Data Retention      | ✓           | ✓                 | 40s                 | 6.8mJ/bit                  | 0.05 Mb/s       |
| <b>D-RaNGe</b>     | 2018 | Activation Failures | ✓           | ✓                 | 100ns < x < 960ns   | 4.4nJ/bit                  | 717.4 Mb/s      |

Table 5.2: Comparison to previous DRAM-based TRNG proposals.

### 5.7.1 DRAM Command Scheduling

Prior work [264] proposes using non-determinism in DRAM command scheduling for true random number generation. In particular, since pending access commands contend with regular refresh operations, the latency of a DRAM access is hard to predict and is useful for random number generation.

Unfortunately, this method fails to satisfy two important properties of an ideal TRNG. First, it harvests random numbers from the instruction and DRAM command scheduling decisions made by the processor and memory controller, which does *not* constitute a fully non-deterministic entropy source. Since the quality of the harvested random numbers depends directly on the quality of the processor and memory controller implementations, the entropy source is visible to and potentially modifiable by an adversary (e.g., by simultaneously running a memory-intensive workload on another processor core [234]). Therefore, this method does not meet our design goals as it does not securely generate random numbers.

Second, although this technique has a higher throughput than those based on DRAM data retention (Table 5.2), D-RaNGe still outperforms this method in terms of throughput by 211x (maximum) and 128x (average) because a single byte of random data requires a *significant* amount of time to generate. Even if we scale the throughput results provided by [264] to a modern day system (e.g., 5GHz processor, 4 DRAM channels<sup>8</sup>), the theoretical maximum throughput of Pyo et al.'s approach<sup>9</sup> is *only* 3.40Mb/s as compared with the maximum (average) throughput of 717.4Mb/s (435.7Mb/s) for D-RaNGe. To calculate the latency of generating random values, we assume the same system configuration and [264]'s claimed 45000 cycles to generate one byte of random data. To provide 64 bits of random data, [264] takes 18 $\mu$s, which is significantly higher than D-RaNGe’s minimum/maximum latency of 100ns/960ns. Energy consumption for [264] depends heavily on the entire system that it is running on, so we do not compare against this metric.

---

<sup>8</sup>The authors do not provide their DRAM configuration, so we optimistically assume that they evaluate their proposal using one DRAM channel. We also assume that by utilizing 4 DRAM channels, the authors can harvest four times the entropy, which gives the benefit of the doubt to [264].

<sup>9</sup>We base our estimations on [264]'s claim that they can harvest one byte of random data every 45000 cycles. However, using these numbers along with the authors' stated processor configuration (i.e., 2.8GHz) leads to a discrepancy between our calculated maximum throughput ($\approx 0.5$ Mb/s) and that reported in [264] ($\approx 5$ Mb/s). We believe our estimation methodology and calculations are sound. In our work, we compare D-RaNGe's peak throughput against that of [264] using a more modern system configuration (i.e., 5GHz processor, 4 DRAM channels) than used in the original work, which gives the benefit of the doubt to [264].
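The latency and single-channel throughput estimates for [264] follow from the claim of one random byte per 45000 cycles. The sketch below reproduces the 18 $\mu$s latency (5GHz, 4 channels) and the $\approx 0.5$ Mb/s figure for the original 2.8GHz configuration. It assumes, as the comparison does, that each DRAM channel independently yields one byte per 45000 cycles; the function names are illustrative.

```python
import math


def bits_per_second(clock_hz: float, cycles_per_byte: int, channels: int = 1) -> float:
    """Throughput if every channel yields one random byte per cycles_per_byte cycles."""
    return 8 * channels * clock_hz / cycles_per_byte


def latency_seconds(bits: int, clock_hz: float, cycles_per_byte: int,
                    channels: int = 1) -> float:
    """Time to collect `bits` random bits, harvesting one byte per channel per round."""
    bytes_needed = math.ceil(bits / 8)
    rounds = math.ceil(bytes_needed / channels)
    return rounds * cycles_per_byte / clock_hz


# 64-bit latency on the scaled system (5 GHz processor, 4 DRAM channels).
latency = latency_seconds(64, clock_hz=5e9, cycles_per_byte=45000, channels=4)
print(f"{latency * 1e6:.1f} us")

# Throughput with the originally stated 2.8 GHz processor and one channel.
original = bits_per_second(clock_hz=2.8e9, cycles_per_byte=45000, channels=1)
print(f"{original / 1e6:.2f} Mb/s")
```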

### 5.7.2 DRAM Data Retention

Prior works [154, 317] propose using DRAM data retention failures to generate random numbers. Unfortunately, this approach is *inherently too slow* for high-throughput operation due to the long wait times required to induce data retention failures in DRAM. While the failure rate can be increased by increasing the operating temperature, a wait time on the order of seconds is required to induce enough failures [206, 155, 266, 260, 164] to achieve high-throughput random number generation, which is orders of magnitude slower than D-RaNGe.

Sutar et al. [317] report that they are able to generate 256-bit random numbers using a hashing algorithm (e.g., SHA-256) on a 4 MiB DRAM block that contains data retention errors resulting from having disabled DRAM refresh for 40 seconds. Optimistically assuming a large DRAM capacity of 32 GiB and ignoring the time required to read out and hash the erroneous data, a waiting time of 40 seconds to induce data retention errors allows for an estimated maximum random number throughput of 0.05 Mb/s. This throughput is already far smaller than D-RaNGe’s measured maximum throughput of 717.4 Mb/s, and it would decrease linearly for smaller DRAM capacities. Even if we were able to induce a large number of data retention errors by waiting only 1 second, the maximum random number generation throughput would be 2 Mb/s, i.e., orders of magnitude smaller than that of D-RaNGe.
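The retention-based throughput estimate is simple arithmetic over the numbers given above: one 256-bit value per 4 MiB block, 32 GiB of DRAM, and one wait period per batch. A minimal sketch, with an illustrative function name and (as in the text) ignoring read-out and hashing time:

```python
MIB = 2**20
GIB = 2**30


def retention_trng_throughput(capacity_bytes: int, block_bytes: int,
                              bits_per_block: int, wait_s: float) -> float:
    """Bits per second when every block yields bits_per_block after one wait period."""
    blocks = capacity_bytes // block_bytes
    return blocks * bits_per_block / wait_s


# 40-second wait, as reported for [317].
slow = retention_trng_throughput(32 * GIB, 4 * MIB, 256, wait_s=40.0)
# Hypothetical 1-second wait considered in the text.
fast = retention_trng_throughput(32 * GIB, 4 * MIB, 256, wait_s=1.0)
print(f"{slow / 1e6:.2f} Mb/s, {fast / 1e6:.1f} Mb/s")
```

Because the number of blocks scales with capacity, halving the DRAM capacity halves the estimate.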

Because [317] requires a wait time of 40 seconds before producing any random values, its latency for random number generation is extremely high (40s). In contrast, D-RaNGe can produce random values very quickly since it generates random values potentially with each DRAM access (10s of nanoseconds). D-RaNGe therefore has a latency many orders of magnitude lower than Sutar et al.’s mechanism [317].

We estimate the energy consumption of retention-time based TRNG mechanisms with Ramulator [174, 3] and DRAMPower [2, 52]. We model first writing data to a 4MiB DRAM region (to constrain the energy consumption estimate to the region of interest), waiting for 40 seconds, and then reading from that region. We then divide the energy consumption of these operations by the number of bits found (256 bits). We find that the energy consumption is around 6.8mJ per bit, which is orders of magnitude more costly than that of D-RaNGe, which provides random numbers at 4.4nJ per bit.

### 5.7.3 DRAM Startup Values

Prior work [330, 77] proposes using DRAM startup values as random numbers. Unfortunately, this method is unsuitable for continuous high-throughput operation since it requires a DRAM power cycle in order to obtain random data. We are unable to accurately model the latency of this mechanism since it relies on the startup time of DRAM (i.e., bus frequency calibration, temperature calibration, timing register initialization [132]). This heavily depends on the implementation of the system and the DRAM device in use. Ignoring these components, we estimate the throughput of generating random numbers using startup values by taking into account only the latency of a single DRAM read (*after* all initialization is complete), which is 60ns. We model energy consumption ignoring the initialization phase as well, by modeling the energy to read a MiB of DRAM and dividing that quantity by [330]'s claimed number of random bits found in that region (420Kbit). Based on this calculation, we estimate energy consumption as 245.9pJ per bit. While the energy consumption of [330] is smaller than the energy cost of D-RaNGe, we note that our energy estimation for [330] does *not* account for the energy consumption required for initializing DRAM to be able to read out the random values. Additionally, [330] requires a full system reboot which is often impractical for applications and for effectively providing a *steady stream of random values*. [77] suffers from the same issues since it uses the same mechanism as [330] to generate random numbers and is strictly worse since [77] results in 31.8x less entropy.

### 5.7.4 Combining DRAM-based TRNGs

We note that D-RaNGe’s method for sampling random values from DRAM is entirely distinct from prior DRAM-based TRNGs that we have discussed in this section. This makes it possible to combine D-RaNGe with prior work to produce random values at an even higher throughput.

## 5.8 Other Related Works

In this work, we focus on the design of a DRAM-based hardware mechanism to implement a TRNG, which makes the focus of our work orthogonal to those that design PRNGs. In contrast to prior DRAM-based TRNGs discussed in Section 5.7, we propose using *activation failures* as an entropy source. Prior works characterize activation failures in order to exploit the resulting error patterns for overall DRAM latency reduction [55, 163, 190, 191] and to implement physical unclonable functions (PUFs) [164]. However, none of these works measure the randomness inherent in activation failures or propose using them to generate random numbers.

Many TRNG designs have been proposed that exploit sources of entropy that are *not* based on DRAM. Unfortunately, these proposals either 1) require custom hardware modifications that preclude their application to commodity devices, or 2) do not sustain continuous (i.e., constant-rate) high-throughput operation. We briefly discuss different entropy sources with examples.

**Flash Memory Read Noise.** Prior proposals use random telegraph noise in flash memory devices as an entropy source (up to 1 Mbit/s) [353, 269]. Unfortunately, flash memory is orders of magnitude slower than DRAM, making flash unsuitable for high-throughput and low-latency operation. A more recent work [50] demonstrates that random numbers can be extracted from the variability of write and erase latency in flash memory devices. However, this technique suffers from low throughput (i.e., up to 0.25 Kb/s) as it depends on long flash memory latencies.

**SRAM-based Designs.** SRAM-based TRNG designs exploit randomness in startup values [126, 127, 339]. Unfortunately, these proposals are unsuitable for continuous, high-throughput operation since they require a power cycle.

**GPU- and FPGA-Based Designs.** Several works harvest random numbers from GPU-based (up to 447.83 Mbit/s) [51, 336, 327] and FPGA-based (up to 12.5 Mbit/s) [225, 356, 63, 122, 84] entropy sources. These proposals do not require modifications to commodity GPUs or FPGAs. Yet, GPUs and FPGAs are not as prevalent as DRAM in commodity devices today.

**Custom Hardware.** Various works propose TRNGs based in part or fully on non-determinism provided by custom hardware designs (with TRNG throughput up to 2.4 Gbit/s) [10, 367, 44, 33, 262, 229, 42, 332, 175, 128, 127, 255, 307]. Unfortunately, the need for custom hardware limits the widespread use of such proposals in commodity hardware devices (today).

## 5.9 Limitations

While D-RaNGe can be immediately deployed in certain available systems today, D-RaNGe has implementation limitations that limit its full potential even in these systems.

First, D-RaNGe requires a flexible memory controller that has the ability to issue DRAM commands with varying timing parameters on the fly. Without such a memory controller, D-RaNGe is limited to existing systems with tunable timing parameters and may have high overhead in switching latencies for D-RaNGe accesses and regular memory accesses. We believe that a flexible memory controller has many use cases for the system, including Solar-DRAM, the DRAM Latency PUF, and D-RaNGe, and we expect future work to develop such a memory controller.

Second, D-RaNGe requires an intelligent memory access scheduler that can interleave D-RaNGe accesses with regular accesses. The scheduler may have to predict and account for many parameters including 1) future memory idle time, 2) priority levels for D-RaNGe accesses and concurrently running applications, 3) anticipated random number throughput and latency requirements, and 4) amount of random numbers saved in the buffer. Developing such a scheduler would enable systems to satisfy random number requirements of running applications with minimal interference from D-RaNGe.

## 5.10 Summary

We propose D-RaNGe, a mechanism for extracting true random numbers with high throughput from unmodified commodity DRAM devices on any system that allows manipulation of DRAM timing parameters in the memory controller. D-RaNGe harvests fully non-deterministic random numbers from DRAM row activation failures, which are bit errors induced by intentionally accessing DRAM with lower latency than required for correct row activation. Our TRNG is based on two key observations: 1) activation failures can be induced quickly and 2) repeatedly accessing certain DRAM cells with reduced activation latency results in reading true random data. We validate the quality of our TRNG with the commonly-used NIST statistical test suite for randomness. Our evaluations show that D-RaNGe significantly outperforms the previous highest-throughput DRAM-based TRNG by up to 211x (128x on average). We conclude that DRAM row activation failures can be effectively exploited to efficiently generate true random numbers with high throughput on a wide range of devices that use commodity DRAM chips.

## Chapter 6

# Revisiting RowHammer: An Experimental Analysis of Modern Devices and Mitigation Techniques

RowHammer is a circuit-level DRAM vulnerability, first rigorously analyzed and introduced in 2014, where repeatedly accessing data in a DRAM row can cause bit flips in nearby rows. The RowHammer vulnerability has since garnered significant interest in both computer architecture and computer security research communities because it stems from physical circuit-level interference effects that worsen with continued DRAM density scaling. As DRAM manufacturers primarily depend on density scaling to increase DRAM capacity, future DRAM chips will likely be more vulnerable to RowHammer than those of the past. Many RowHammer mitigation mechanisms have been proposed by both industry and academia, but it is unclear whether these mechanisms will remain viable solutions for future devices, as their overheads increase with DRAM's vulnerability to RowHammer.

In order to shed more light on how RowHammer affects modern and future devices at the circuit-level, we first present an experimental characterization of RowHammer on 1580 DRAM chips (408× DDR3, 652× DDR4, and 520× LPDDR4) from 300 DRAM modules (60× DDR3, 110× DDR4, and 130× LPDDR4) with RowHammer protection mechanisms disabled, spanning multiple different technology nodes from across each of the three major DRAM manufacturers. Our studies definitively show that newer DRAM chips are more vulnerable to RowHammer: as device feature size reduces, the number of activations needed to induce a RowHammer bit flip also reduces, to as few as 9.6k (4.8k to two rows each) in the most vulnerable chip we tested.

We evaluate five state-of-the-art RowHammer mitigation mechanisms using cycle-accurate simulation in the context of real data taken from our chips to study how the mitigation mechanisms scale with chip vulnerability. We find that existing mechanisms either are not scalable or suffer from prohibitively large performance overheads in projected future devices given our observed trends of RowHammer vulnerability. Thus, it is critical to research more effective solutions to RowHammer.

## 6.1 RowHammer: DRAM Disturbance Errors

Modern DRAM devices suffer from *disturbance errors* that occur when a high rate of accesses to a single DRAM row unintentionally flips the values of cells in nearby rows. This phenomenon is known as *RowHammer* [170]. It inherently stems from electromagnetic interference between nearby cells. RowHammer is exacerbated by reduction in process technology node size because adjacent DRAM cells become both smaller and closer to each other. Therefore, as DRAM manufacturers continue to increase DRAM storage density, a chip’s vulnerability to RowHammer bit flips increases [170, 239, 241].

RowHammer exposes a system-level security vulnerability that has been studied by many prior works both from the attack and defense perspectives. Prior works demonstrate that RowHammer can be used to mount system-level attacks for privilege escalation (e.g., [66, 104, 105, 204, 265, 270, 285, 324, 340, 360, 87, 145]), leaking confidential data (e.g., [184]), and denial of service (e.g., [104, 204]). These works effectively demonstrate that a system must provide protection against RowHammer to ensure robust (i.e., reliable and secure) execution.

Prior works propose defenses against RowHammer attacks both at the hardware (e.g., [170, 194, 280, 303, 94, 372, 297, 150, 159, 99, 23, 25, 28, 103, 26, 24, 118, 83]) and software (e.g., [13, 16, 41, 170, 178, 198, 341, 197, 124, 135, 43, 86, 358, 36, 169, 351, 49, 350]) levels. DRAM manufacturers themselves employ in-DRAM RowHammer prevention mechanisms such as *Target Row Refresh (TRR)* [139], which internally performs proprietary operations to reduce the vulnerability of a DRAM chip against potential RowHammer attacks, although these solutions have been recently shown to be vulnerable [87]. Memory controller and system manufacturers have also included defenses such as increasing the refresh rate [13, 16, 197] and Hardware RHP [134, 249, 333, 345]. For a detailed survey of the RowHammer problem, its underlying causes, characteristics, exploits building on it, and mitigation techniques, we refer the reader to [241].

## 6.2 Motivation and Goal

Despite the considerable research effort expended towards understanding and mitigating RowHammer, scientific literature still lacks rigorous experimental data on how the RowHammer vulnerability is changing with the advancement of DRAM designs and process technologies. In general, important practical concerns are difficult to address with existing data in literature. For example:

- How vulnerable to RowHammer are future DRAM chips expected to be at the circuit level?
- How well would RowHammer mitigation mechanisms prevent or mitigate RowHammer in future devices?
- What types of RowHammer solutions would cope best with increased circuit-level vulnerability due to continued technology node scaling?

While existing experimental characterization studies [170, 257, 256] take important steps towards building an overall understanding of the RowHammer vulnerability, they are too scarce and collectively do not provide a holistic view of RowHammer evolution into the modern day. To help overcome this lack of understanding, we need a unifying study of the RowHammer vulnerability of a broad range of DRAM chips spanning the time since the original RowHammer paper was published in 2014 [170].

To this end, **our goal** in this chapter is to evaluate and understand how the RowHammer vulnerability of real DRAM chips at the circuit level changes across different chip types, manufacturers, and process technology node generations. Doing so enables us to predict how the RowHammer vulnerability in DRAM chips will scale as the industry continues to increase storage density and reduce technology node size for future chip designs. To achieve this goal, we perform a rigorous experimental characterization study of DRAM chips from three different DRAM types (i.e., DDR3, DDR4, and LPDDR4), three major DRAM manufacturers, and at least two different process technology nodes from each DRAM type. We show how different chips from different DRAM types and technology nodes (abbreviated as “type-node” configurations) have varying levels of vulnerability to RowHammer. We compare the chips’ vulnerabilities against each other and project how they will likely scale when reducing the technology node size even further (Section 6.4). Finally, we study how effective existing RowHammer mitigation mechanisms will be, based on our observed and projected experimental data on the RowHammer vulnerability (Section 6.5).

## 6.3 Experimental Methodology

We describe our methodology for characterizing DRAM chips for RowHammer.

### 6.3.1 Testing Infrastructure

In order to characterize the effects of RowHammer across a broad range of modern DRAM chips, we experimentally study DDR3, DDR4, and LPDDR4 DRAM chips across a wide range of testing conditions. To achieve this, we use two different testing infrastructures: (1) the SoftMC framework [302, 120] capable of testing DDR3 and DDR4 DRAM modules in a temperature-controlled chamber and (2) an in-house temperature-controlled testing chamber capable of testing LPDDR4 DRAM chips.

**SoftMC.** Figure 6-1 shows our SoftMC setup for testing DDR4 chips. In this setup, we use an FPGA board with a Xilinx Virtex UltraScale 95 FPGA [362], two DDR4 SODIMM slots, and a PCIe interface. To open up space around the DDR4 chips for temperature control, we use a vertical DDR4 SODIMM riser board to plug a DDR4 module into the FPGA board. We heat the DDR4 chips to a target temperature using silicone rubber heaters pressed to both sides of the DDR4 module. We control the temperature using a thermocouple, which we place between the rubber heaters and the DDR4 chips, and a temperature controller. To enable fast data transfer between the FPGA and a host machine, we connect the FPGA to the host machine using PCIe via a 30 cm PCIe extender. We use the host machine to program the SoftMC hardware and collect the test results. Our SoftMC setup for testing DDR3 chips is similar but uses a Xilinx ML605 FPGA board [361]. Both infrastructures provide fine-grained control over the types and timings of DRAM commands sent to the chips under test and provide precise temperature control at typical operating conditions.



Figure 6-1: Our SoftMC infrastructure [302, 120] for testing DDR4 DRAM chips.

**LPDDR4 Infrastructure.** Our LPDDR4 DRAM testing infrastructure uses industry-developed in-house testing hardware for package-on-package LPDDR4 chips. The LPDDR4 testing infrastructure is further equipped with cooling and heating capabilities that also provide us with precise temperature control at typical operating conditions.

### 6.3.2 Characterized DRAM Chips

Table 6.1 summarizes the DRAM chips that we test using both infrastructures. We have chips from all of the three major DRAM manufacturers spanning DDR3, DDR4, and two known technology nodes of LPDDR4. We refer to the DRAM type (e.g., LPDDR4) and technology node of a DRAM chip as a *DRAM type-node configuration* (e.g., LPDDR4-1x). For DRAM chips whose technology node we do not exactly know, we identify their node as *old* or *new*.

Table 6.1: Summary of DRAM chips tested. Each entry gives the number of chips tested, with the number of modules in parentheses.

| DRAM type-node | Mfr. A   | Mfr. B   | Mfr. C   | Total           |
|----------------|----------|----------|----------|-----------------|
| DDR3-old       | 56 (10)  | 88 (11)  | 28 (7)   | <b>172 (28)</b> |
| DDR3-new       | 80 (10)  | 52 (9)   | 104 (13) | <b>236 (32)</b> |
| DDR4-old       | 112 (16) | 24 (3)   | 128 (18) | <b>264 (37)</b> |
| DDR4-new       | 264 (43) | 16 (2)   | 108 (28) | <b>388 (73)</b> |
| LPDDR4-1x      | 12 (3)   | 180 (45) | N/A      | <b>192 (48)</b> |
| LPDDR4-1y      | 184 (46) | N/A      | 144 (36) | <b>328 (82)</b> |

**DDR3 and DDR4.** Among our tested DDR3 modules, we identify two distinct batches of chips based on their manufacturing date, datasheet publication date, purchase date, and RowHammer characteristics. We categorize DDR3 devices with a manufacturing date earlier than 2014 as DDR3-old chips, and devices with a manufacturing date including and after 2014 as DDR3-new chips. Using the same set of properties, we identify two distinct batches of devices among the DDR4 devices. We categorize DDR4 devices with a manufacturing date before 2018 or a datasheet publication date of 2015 as DDR4-old chips and devices with a manufacturing date including and after 2018 or a datasheet publication date of 2016 or 2017 as DDR4-new chips. Based on our observations on RowHammer characteristics from these chips, we expect that DDR3-old/DDR4-old chips are manufactured at an older date with an older process technology compared to DDR3-new/DDR4-new chips, respectively.

This enables us to directly study the effects of shrinking process technology node sizes in DDR3 and DDR4 DRAM chips.

**LPDDR4.** For our LPDDR4 chips, we have two known distinct generations manufactured with different technology node sizes, 1x-nm and 1y-nm, where 1y-nm is smaller than 1x-nm. Unfortunately, we are missing data from some generations of DRAM from specific manufacturers (i.e., LPDDR4-1x from manufacturer C and LPDDR4-1y from manufacturer B) since we did not have access to chips of these manufacturer-technology node combinations due to confidentiality issues. Note that while we know the external technology node values for the chips we characterize (e.g., 1x-nm, 1y-nm), these values are *not* standardized across different DRAM manufacturers and the actual values are confidential. This means that a 1x chip from one manufacturer is not necessarily manufactured with the same process technology node as a 1x chip from another manufacturer. However, since we do know relative process node sizes of chips from the *same* manufacturer, we can directly observe how technology node size affects RowHammer on LPDDR4 DRAM chips.

### 6.3.3 Effectively Characterizing RowHammer

To characterize RowHammer effects on our DRAM chips at the circuit level, we want to test our chips at worst-case RowHammer conditions. We identify two conditions that our testing routines must satisfy to do so effectively: they must 1) run without interference (e.g., without DRAM refresh or RowHammer mitigation mechanisms) and 2) systematically test each DRAM row’s vulnerability to RowHammer by issuing the *worst-case sequence of DRAM accesses* for that particular row.

**Disabling Sources of Interference.** To directly observe RowHammer effects at the circuit level, we want to minimize the external factors that may limit 1) the effectiveness of our tests or 2) our ability to effectively characterize/observe circuit-level effects of RowHammer on our DRAM chips. First, we want to ensure that we have control over how our RowHammer tests behave without disturbing the desired access pattern in any way. Therefore, during the core loop of each RowHammer test (i.e., when activations are issued at a high rate to induce RowHammer bit flips), we disable all DRAM self-regulation events such as refresh and calibration, using control registers in the memory controller. This guarantees consistent testing without confounding factors due to intermittent events (e.g., to avoid the possibility that a victim row is refreshed during a RowHammer test routine such that we observe fewer RowHammer bit flips). Second, we want to directly observe the circuit-level bit flips such that we can make conclusions about DRAM’s vulnerability to RowHammer at the circuit technology level rather than the system level. To this end, we disable, to the best of our knowledge, all DRAM-level (e.g., TRR [141, 139, 87]) and system-level (e.g., pTRR [9]) RowHammer mitigation mechanisms, along with all forms of rank-level error-correction codes (ECC), which could obscure RowHammer bit flips. Unfortunately, all of our LPDDR4-1x and LPDDR4-1y chips use on-die ECC [6, 180, 151, 259, 183] (i.e., an error correcting mechanism that corrects single-bit failures entirely within the DRAM chip [259]), which we cannot disable. Third, we ensure that the core loop of our RowHammer test runs for less than 32 ms (i.e., the lowest refresh interval specified by manufacturers to prevent DRAM data retention failures across our tested chips [260, 206, 155, 138, 139, 141]) so that we do not conflate retention failures with RowHammer bit flips.

**Worst-case RowHammer Access Sequence.** We leverage *three* key observations from prior work [170, 16, 104, 360, 65] in order to craft a worst-case RowHammer test pattern. First, a repeatedly accessed row (i.e., *aggressor row*) has the greatest impact on its immediate physically-adjacent rows (i.e., repeatedly accessing physical row $N$ will cause the highest number of RowHammer bit flips in physical rows $N + 1$ and $N - 1$). Second, a *double-sided hammer* targeting physical victim row $N$ (i.e., repeatedly accessing physical rows $N - 1$ and $N + 1$) causes the *highest* number of RowHammer bit flips in row $N$ compared to any other access pattern. Third, increasing the rate of DRAM activations (i.e., issuing the same number of activations within shorter time periods) results in an increasing number of RowHammer bit flips. This rate of activations is limited by the DRAM timing parameter $t_{RC}$ (i.e., the time between two successive activations), which depends on the DRAM clock frequency and the DRAM type: DDR3 (52.5 ns) [138], DDR4 (50 ns) [139], LPDDR4 (60 ns) [141]. Using these observations, we test each row’s worst-case vulnerability to RowHammer by repeatedly accessing the two directly physically-adjacent rows as fast as possible.
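As a rough illustration of the activation budget these $t_{RC}$ values imply, the sketch below computes an upper bound on how many back-to-back activations fit into the 32 ms window discussed above. It is simple arithmetic under the assumption that activations are issued exactly $t_{RC}$ apart with no other commands in between; the names `T_RC_NS`, `max_activations`, and `max_double_sided_passes` are illustrative and not part of our testing infrastructure.

```python
# Illustrative arithmetic only; all names here are hypothetical.
T_RC_NS = {"DDR3": 52.5, "DDR4": 50.0, "LPDDR4": 60.0}  # tRC values quoted above
WINDOW_MS = 32  # upper bound on the duration of the core test loop


def max_activations(t_rc_ns: float, window_ms: float = WINDOW_MS) -> int:
    """Upper bound on activations spaced exactly tRC apart within the window."""
    window_ns = window_ms * 1_000_000
    return int(window_ns // t_rc_ns)


def max_double_sided_passes(t_rc_ns: float, window_ms: float = WINDOW_MS) -> int:
    """Each double-sided pass activates both aggressor rows once."""
    return max_activations(t_rc_ns, window_ms) // 2


for dram_type, t_rc in T_RC_NS.items():
    print(dram_type, max_activations(t_rc), max_double_sided_passes(t_rc))
```

Under these assumptions, each of the three bounds exceeds the 300k activations used at the top of the hammer count sweep described below.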

To enable the quick identification of physical rows  $N - 1$  and  $N + 1$  for a given row  $N$ , we reverse-engineer the *undocumented* and *confidential* logical-to-physical DRAM-internal row address remapping. To do this, we exploit RowHammer’s key observation that repeatedly accessing an arbitrary row causes the two directly physically-adjacent rows to contain the *highest* number of RowHammer bit flips [170]. By repeating this analysis across rows throughout the DRAM chip, we can deduce the address mappings for each type of chip that we test. We can then use this mapping information to quickly test RowHammer effects at worst-case conditions. We note that for our LPDDR4-1x chips from Manufacturer B, when we repeatedly access a single row within two consecutive rows such that the first row is an even row (e.g., rows 2 and 3) in the logical row address space as seen by the memory controller, we observe 1) no RowHammer bit flips in either of the two consecutive rows and 2) a near equivalent number of RowHammer bit flips in each of the four immediately adjacent rows: the two previous consecutive rows (e.g., rows 0 and 1) and the two subsequent consecutive rows (e.g., rows 4 and 5). This indicates a row address remapping that is internal to the DRAM chip such that every pair of consecutive rows share the same internal wordline. To account for this DRAM-internal row address remapping, we test each row  $N$  in LPDDR4-1x chips from manufacturer B by repeatedly accessing physical rows  $N - 2$  and  $N + 2$ .
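The neighbor-identification step can be sketched as follows. This is a minimal illustration, assuming we already have a per-row count of bit flips observed after hammering a single logical row; the function name, its parameters, and the example counts are hypothetical and do not come from our measurements.

```python
from typing import Dict, List


def likely_physical_neighbors(flips_per_row: Dict[int, int],
                              hammered_row: int,
                              k: int = 2) -> List[int]:
    """Return the k logical rows with the most bit flips after hammering
    `hammered_row`; these are the candidates for physical adjacency."""
    candidates = {row: count for row, count in flips_per_row.items()
                  if row != hammered_row and count > 0}
    # Sort by descending flip count, breaking ties by row address.
    ranked = sorted(candidates, key=lambda row: (-candidates[row], row))
    return sorted(ranked[:k])


# Made-up counts: hammering logical row 10 in a chip whose internal
# remapping places logical rows 9 and 12 physically next to it.
observed = {8: 0, 9: 412, 10: 0, 11: 3, 12: 398, 13: 0}
print(likely_physical_neighbors(observed, hammered_row=10))
```

For the LPDDR4-1x chips from manufacturer B, where four rows show a near-equivalent number of bit flips, a call with `k=4` would be the analogous query.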

**Additional Testing Parameters.** To investigate RowHammer characteristics, we explore two testing parameters at a stable ambient temperature of  $50^\circ C$ :

1. **Hammer count (HC).** We test the effects of changing the number of times we access (i.e., activate) a victim row’s physically-adjacent rows (i.e., aggressor rows). We count each pair of activations to the two neighboring rows as one *hammer* (e.g., one activation each to rows  $N - 1$  and  $N + 1$  counts as one hammer). We sweep the hammer count from 2k to 150k (i.e., 4k to 300k activations) across our chips so that the hammer test runs for less than 32ms.

2. **Data pattern ( $DP$ ).** We test several commonly-used DRAM data patterns where every byte is written with the same data: Solid0 (SO0: 0x00), Solid1 (SO1: 0xFF), Colstripe0 (CO0: 0x55), Colstripe1 (CO1: 0xAA) [206, 260, 155]. In addition, we test data patterns where each byte in every other row, including the row being hammered, is written with the same data, Checkered0 (CH0: 0x55) or Rowstripe0 (RS0: 0x00), and all other rows are written with the inverse data, Checkered1 (CH1: 0xAA) or Rowstripe1 (RS1: 0xFF), respectively.
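The data patterns above can be summarized in a small sketch that returns the byte written to a given row. The table and function names are hypothetical. We also assume here that Checkered1 and Rowstripe1, when used as the tested pattern, are the inverse assignments of Checkered0 and Rowstripe0 (i.e., the victim row holds 0xAA or 0xFF, respectively).

```python
# (byte for rows at an even distance from the victim, byte for all other rows)
# Assumption: the *1 variants of Checkered/Rowstripe invert the *0 variants.
PATTERNS = {
    "Solid0": (0x00, 0x00),
    "Solid1": (0xFF, 0xFF),
    "Colstripe0": (0x55, 0x55),
    "Colstripe1": (0xAA, 0xAA),
    "Checkered0": (0x55, 0xAA),
    "Checkered1": (0xAA, 0x55),
    "Rowstripe0": (0x00, 0xFF),
    "Rowstripe1": (0xFF, 0x00),
}


def row_byte(pattern: str, row: int, victim_row: int) -> int:
    """Byte value written to every byte of `row` when testing `victim_row`."""
    victim_byte, other_byte = PATTERNS[pattern]
    same_parity_as_victim = (row - victim_row) % 2 == 0
    return victim_byte if same_parity_as_victim else other_byte


# Rowstripe0: the victim row holds 0x00 and its two neighbors hold 0xFF.
print([hex(row_byte("Rowstripe0", r, victim_row=7)) for r in (6, 7, 8)])
```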

**RowHammer Testing Routine.** Algorithm 6 presents the general testing methodology we use to characterize RowHammer on DRAM chips. For different data patterns ($DP$) (line 2) and hammer counts ($HC$) (line 8), the test individually targets each row in DRAM (line 4) as a victim row (line 5). For each victim row, we identify the two physically-adjacent rows ($aggressor\_row1$ and $aggressor\_row2$) as aggressor rows (lines 6 and 7). Before beginning the core loop of our RowHammer test (lines 11-13), two things happen: 1) the memory controller disables DRAM refresh (line 9) to ensure no interruptions in the core loop of our test due to refresh operations, and 2) we refresh the victim row (line 10) so that we begin inducing RowHammer bit flips on a fully-charged row, which ensures that bit flips we observe are not due to retention time violations. The core loop of our RowHammer test (lines 11-13) induces RowHammer bit flips in the victim row by first activating $aggressor\_row1$ then $aggressor\_row2$, $HC$ times. After the core loop of our RowHammer test, we re-enable DRAM refresh (line 14) to prevent retention failures and record the observed bit flips to secondary storage (line 15) for analysis (presented in Section 6.4). Finally, we prepare to test the next $HC$ value in the sweep by restoring the observed bit flips to their original values (line 16) depending on the data pattern ($DP$) being tested.

---

**Algorithm 6:** DRAM RowHammer Characterization

---

```
 1  DRAM_RowHammer_Characterization():
 2    foreach DP in [Data Patterns]:
 3      write DP into all cells in DRAM
 4      foreach row in DRAM:
 5        set victim_row to row
 6        set aggressor_row1 to victim_row - 1
 7        set aggressor_row2 to victim_row + 1
 8        foreach HC in [HC sweep]:
 9          Disable DRAM refresh
10          Refresh victim_row
11          for n = 1 → HC:  // core test loop
12            activate aggressor_row1
13            activate aggressor_row2
14          Enable DRAM refresh
15          Record RowHammer bit flips to storage
16          Restore bit flips to original values
```

---
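To make the control flow of Algorithm 6 concrete, the following is a minimal runnable sketch against a toy in-memory stand-in for a chip. The `ToyDram` class and its per-cell threshold model are hypothetical and are not a model of real device behavior: data-pattern dependence is ignored, and the first and last rows are skipped so that both aggressors exist. Only the loop structure mirrors the algorithm.

```python
import random
from typing import Dict, List, Set, Tuple


class ToyDram:
    """Hypothetical stand-in for a chip under test (not a real device model)."""

    def __init__(self, rows: int, bits_per_row: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.rows = rows
        # Made-up per-cell tolerance: neighbor activations before a bit flips.
        self.threshold = [[rng.randint(50, 500) for _ in range(bits_per_row)]
                          for _ in range(rows)]
        self.disturb = [0] * rows      # neighbor activations since last restore
        self.refresh_enabled = True    # bookkeeping only in this toy model
        self.pattern = ""

    def write_pattern(self, pattern: str) -> None:
        self.pattern = pattern         # the toy model ignores data dependence
        self.disturb = [0] * self.rows

    def refresh_row(self, row: int) -> None:
        self.disturb[row] = 0

    def activate(self, row: int) -> None:
        self.disturb[row] = 0          # an activation restores the row itself
        for neighbor in (row - 1, row + 1):
            if 0 <= neighbor < self.rows:
                self.disturb[neighbor] += 1

    def flipped_bits(self, row: int) -> Set[int]:
        return {bit for bit, limit in enumerate(self.threshold[row])
                if self.disturb[row] >= limit}


def characterize(dram: ToyDram, patterns: List[str],
                 hc_sweep: List[int]) -> Dict[Tuple[str, int, int], Set[int]]:
    results: Dict[Tuple[str, int, int], Set[int]] = {}
    for dp in patterns:                                  # line 2
        dram.write_pattern(dp)                           # line 3
        for victim in range(1, dram.rows - 1):           # lines 4-5
            aggressor1, aggressor2 = victim - 1, victim + 1  # lines 6-7
            for hc in hc_sweep:                          # line 8
                dram.refresh_enabled = False             # line 9
                dram.refresh_row(victim)                 # line 10
                for _ in range(hc):                      # lines 11-13
                    dram.activate(aggressor1)
                    dram.activate(aggressor2)
                dram.refresh_enabled = True              # line 14
                results[(dp, victim, hc)] = dram.flipped_bits(victim)  # line 15
                dram.refresh_row(victim)                 # line 16 (restore)
    return results


toy = ToyDram(rows=8, bits_per_row=16, seed=1)
toy_results = characterize(toy, ["Rowstripe0"], [10, 100, 300])
print(len(toy_results[("Rowstripe0", 3, 300)]))
```

In this toy model the victim row accumulates two disturbances per hammer, so the set of flipped bits can only grow as the hammer count increases.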

**Fairly Comparing Data Across Infrastructures.** Our carefully-crafted RowHammer test routine allows us to compare our test results between the two different testing infrastructures. This is because, as we described earlier, we 1) reverse engineer the row address mappings of each DRAM configuration such that we effectively test double-sided RowHammer on every single row, 2) issue activations as fast as possible for each chip, such that the activation rates are similar across infrastructures, and 3) disable all sources of interference in our RowHammer tests.

## 6.4 RowHammer Characterization

In this section, we present our comprehensive characterization of RowHammer on the 1580 DRAM chips we test.<sup>1</sup>

### 6.4.1 RowHammer Vulnerability

We first examine which of the chips that we test are susceptible to RowHammer. Across all of our chips, we sweep the hammer count ($HC$) between 2k and 150k (i.e., 4k to 300k activations for our double-sided RowHammer test) and observe whether we can induce any RowHammer bit flips at all in each chip. We find that we can induce RowHammer bit flips in all chips except many DDR3 chips. Table 6.2 shows the fraction of DDR3 chips in which we *can* induce RowHammer bit flips (i.e., *RowHammerable* chips).

**Observation 1.** *Newer DRAM chips appear to be more vulnerable to RowHammer based on the increasing fraction of RowHammerable chips from DDR3-old to DDR3-new DRAM chips of manufacturers B and C.*

---

<sup>1</sup>We list our full set of chips in Appendix A of our extended technical report [166].

Table 6.2: Fraction of DDR3 DRAM chips vulnerable to RowHammer when  $HC < 150k$ .

| DRAM type-node | Mfr. A | Mfr. B | Mfr. C |
|----------------|--------|--------|--------|
| DDR3-old       | 24/88  | 0/88   | 0/28   |
| DDR3-new       | 8/72   | 44/52  | 96/104 |

We find that the fraction of manufacturer A’s chips that are RowHammerable decreases from DDR3-old to DDR3-new chips, but we also note that the number of RowHammer bit flips that we observe across each of manufacturer A’s chips is very low ($< 20$ on average across RowHammerable chips) compared to the number of bit flips found in the DDR3-new chips of manufacturers B and C (87k on average across RowHammerable chips) when $HC = 150k$. Since DDR3-old chips of all manufacturers and DDR3-new chips of manufacturer A have very few to no bit flips, we refrain from analyzing and presenting their characteristics in many plots in Section 6.4.

### 6.4.2 Data Pattern Dependence

To study data pattern effects on observable RowHammer bit flips, we test our chips using Algorithm 6 with $HC = 150k$ at $50^\circ C$, sweeping 1) the victim row (*victim\_row*) and 2) the data pattern ($DP$) (as described in Section 6.3.3).<sup>2</sup>

---

<sup>2</sup>Note that for a given data pattern (*DP*), the same data is always written to *victim\_row*. For example, when testing Rowstripe0, every byte in *victim\_row* is always written with 0x00 and every byte in the two physically-adjacent rows is written with 0xFF.

We first examine the set of all RowHammer bit flips that we observe when testing with different data patterns for a given $HC$. For each data pattern, we run our RowHammer test routine ten times. We then aggregate all unique RowHammer bit flips per data pattern. We combine all unique RowHammer bit flips found by all data patterns and iterations into a full set of observable bit flips. Using the combined data, we calculate the fraction of the full set of observable bit flips that each data pattern identifies (i.e., the data pattern’s *coverage*). Figure 6-2 plots the coverage (y-axis) per individual data pattern (shared x-axis) for a single representative DRAM chip from each DRAM type-node configuration that we test. Each row of subplots shows the coverages for chips of the same manufacturer (indicated on the right y-axis), and the columns show the coverages for chips of the same DRAM type-node configuration (e.g., DDR3-new).
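The coverage metric can be sketched in a few lines. This is a minimal illustration, assuming each bit flip is identified by a hypothetical `(row, bit)` tuple and that the results of each test iteration are available as a set; the example data is made up.

```python
from typing import Dict, List, Set, Tuple

BitFlip = Tuple[int, int]  # hypothetical (row, bit) identifier of a flipped cell


def coverage(flips_by_pattern: Dict[str, List[Set[BitFlip]]]) -> Dict[str, float]:
    """Fraction of all observed unique bit flips that each data pattern finds."""
    # Aggregate unique bit flips across the iterations of each data pattern.
    per_pattern = {dp: set().union(*runs) for dp, runs in flips_by_pattern.items()}
    # Combine all patterns into the full set of observable bit flips.
    full_set = set().union(*per_pattern.values())
    if not full_set:
        return {dp: 0.0 for dp in per_pattern}
    return {dp: len(found) / len(full_set) for dp, found in per_pattern.items()}


# Made-up example with two patterns and two iterations each.
example = {
    "Rowstripe0": [{(1, 3), (1, 9)}, {(1, 3), (2, 0)}],
    "Checkered0": [{(1, 3)}, {(4, 7)}],
}
print(coverage(example))
```

In this example neither pattern reaches a coverage of 1.0, since each one misses at least one bit flip that the other finds.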



Figure 6-2: RowHammer bit flip coverage of different data patterns (described in Section 6.3.3) for a single representative DRAM chip of each type-node configuration.

**Observation 2.** *Testing with different data patterns is essential for comprehensively identifying RowHammer bit flips because no individual data pattern achieves full coverage alone.*

**Observation 3.** *The worst-case data pattern (shown in Table 6.3) is consistent across chips of the same manufacturer and DRAM type-node configuration.<sup>3</sup>*

Table 6.3: Worst-case data pattern for each DRAM type-node configuration at 50°C split into different manufacturers.

| DRAM type-node | Mfr. A     | Mfr. B     | Mfr. C     |
|----------------|------------|------------|------------|
| DDR3-new       | N/A        | Checkered0 | Checkered0 |
| DDR4-old       | Rowstripe1 | Rowstripe1 | Rowstripe0 |
| DDR4-new       | Rowstripe0 | Rowstripe0 | Checkered1 |
| LPDDR4-1x      | Checkered1 | Checkered0 | N/A        |
| LPDDR4-1y      | Rowstripe1 | N/A        | Rowstripe1 |

<sup>3</sup>We do not consider the true-/anti-cell pattern of a chip [206, 170, 87]; we program each data pattern into the DRAM array agnostic to it. More RowHammer bit flips can be induced by considering the true-/anti-cell pattern of each chip and devising corresponding data patterns to exploit this knowledge [87].

We believe that different data patterns induce the most RowHammer bit flips in different chips because DRAM manufacturers apply a variety of proprietary techniques for DRAM cell layouts to maximize the cell density for different DRAM type-node configurations. For the remainder of this chapter, we characterize each chip using *only* its worst-case data pattern.<sup>4</sup>

### 6.4.3 Hammer Count (*HC*) Effects

We next study the effects of increasing the hammer count on the number of observed RowHammer bit flips across our chips. Figure 6-3 plots the effects of increasing the number of hammers on the RowHammer bit flip rate<sup>5</sup> for our tested DRAM chips of various DRAM type-node configurations across the three major DRAM manufacturers. For all chips, we hammer each row, sweeping *HC* between 10,000 and 150,000. For each *HC* value, we plot the average rate of observed RowHammer bit flips across all chips of a DRAM type-node configuration.



Figure 6-3: Hammer count (*HC*) vs. RowHammer bit flip rate across DRAM type-node configurations.

**Observation 4.** *The log of the number of RowHammer bit flips has a linear relationship with the log of $HC$.*<sup>6</sup>

<sup>4</sup>We use the worst-case data pattern to 1) minimize the extensive testing time, 2) induce many RowHammer bit flips, and 3) experiment at worst-case conditions. A diligent attacker would also try to find the worst-case data pattern to maximize the probability of a successful RowHammer attack.

<sup>5</sup>We define the RowHammer bit flip rate as the ratio of the number of observed RowHammer bit flips to the total number of bits in the tested DRAM rows.

<sup>6</sup>Our observation is consistent with prior work [257].

We observe this relationship between *HC* and RowHammer bit flip rate because more accesses to a single row result in more cell-to-cell interference, and therefore more charge is lost in victim cells of nearby rows.
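A linear relationship between the two logarithms is equivalent to a power law of the form rate $\approx 10^{b} \cdot HC^{a}$, where the slope $a$ and intercept $b$ are symbols we introduce here for illustration. The sketch below fits such a line by ordinary least squares. It uses synthetic data generated from an assumed power law, not our measured values, and requires every rate to be nonzero.

```python
import math
from typing import Sequence, Tuple


def fit_log_log(hc: Sequence[float], rate: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of log10(rate) = slope * log10(hc) + intercept."""
    xs = [math.log10(h) for h in hc]
    ys = [math.log10(r) for r in rate]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic data following rate = 1e-18 * HC**2.5 exactly (not measured values).
hc_values = [10_000, 20_000, 50_000, 100_000, 150_000]
rates = [1e-18 * h ** 2.5 for h in hc_values]
slope, intercept = fit_log_log(hc_values, rates)
print(round(slope, 6), round(intercept, 6))
```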

We examine the effects of DRAM technology node on the RowHammer bit flip rate in Figure 6-3. We observe that the bit flip rate curve shifts *upward* and *leftward* when going from DDR4-old to DDR4-new chips, indicating respectively, 1) a higher rate of bit flips for the same  $HC$  value and 2) occurrence of bit flips at lower  $HC$  values, as technology node size reduces from DDR4-old to DDR4-new.

**Observation 5.** *Newer DDR4 DRAM technology nodes show a clear trend of increasing RowHammer bit flip rates: the same  $HC$  value causes an increased average RowHammer bit flip rate from DDR4-old to DDR4-new DRAM chips of all DRAM manufacturers.*

We believe that due to increased density of DRAM chips from older to newer technology node generations, cell-to-cell interference increases and results in DRAM chips that are more vulnerable to RowHammer bit flips.

### 6.4.4 RowHammer Spatial Effects

We next experimentally study the spatial distribution of RowHammer bit flips across our tested chips. In order to normalize the RowHammer effects that we observe across our tested chips, we first take each DRAM chip and use a hammer count specific to that chip to result in a RowHammer bit flip rate of $10^{-6}$.<sup>7</sup> For each chip, we analyze the spatial distribution of bit flips throughout the chip. Figure 6-4 plots the fraction of RowHammer bit flips that occur in a given row offset from the *victim\_row* out of all observed RowHammer bit flips. Each column of subplots shows the distributions for chips of different manufacturers and each row of subplots shows the distribution for a different DRAM type-node configuration. The error bars show the standard deviation of the distribution across our tested chips. Note that the repeatedly-accessed rows (i.e., *aggressor rows*) are at $x = 1$ and $x = -1$ for all plots except in LPDDR4-1x chips from manufacturer B, where they are at $x = -2$ and $x = 2$ (due to the internal address remapping that occurs in these chips as we describe in Section 6.3.3). Because an access to a row essentially refreshes the data in the row, repeatedly accessing aggressor rows during the core loop of the RowHammer test prevents any bit flips from happening in the aggressor rows. Therefore, there are no RowHammer bit flips in the aggressor rows across each DRAM chip in our plots (i.e., $y = 0$ for $x = [-2, -1, 2, 3]$ for LPDDR4-1x chips from manufacturer B and for $x = 1$ and $x = -1$ for all other chips).

---

<sup>7</sup>We choose a RowHammer bit flip rate of $10^{-6}$ since we are able to observe this bit flip rate in most chips that we characterize with $HC < 150k$.
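The per-offset fractions plotted in Figure 6-4 can be computed along the following lines. This is a minimal sketch, assuming each observed bit flip is recorded as a hypothetical `(victim_row, flipped_row)` pair; the example records are made up.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def offset_distribution(flips: Iterable[Tuple[int, int]]) -> Dict[int, float]:
    """Fraction of all bit flips that fall at each row offset from the victim.

    `flips` holds one (victim_row, flipped_row) pair per observed bit flip.
    """
    counts = Counter(flipped - victim for victim, flipped in flips)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {offset: n / total for offset, n in sorted(counts.items())}


# Made-up records: most flips in the victim row, a few two rows away.
records = [(100, 100)] * 6 + [(100, 98)] + [(100, 102)]
print(offset_distribution(records))
```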



Figure 6-4: Distribution of RowHammer bit flips across row offsets from the victim row.

We make three observations from Figure 6-4. First, we observe a general trend across DRAM type-node configurations of a given DRAM manufacturer where newer DRAM technology nodes have an increasing number of rows that are susceptible to RowHammer bit flips that are *farther* from the victim row. For example, in LPDDR4-1y chips, we observe RowHammer bit flips in rows as far as 6 rows away from the victim row (i.e., $x = -6$), whereas in DDR3 and DDR4 chips, RowHammer bit flips only occur in rows as far as 2 rows away from the victim row (i.e., $x = -2$). We believe that this effect could be due to 1) an increase in DRAM cell density, which leads to cell-to-cell interference extending farther than a single row, with RowHammer bit flips occurring in rows increasingly farther away from the aggressor rows (e.g., 5 rows away) for higher-density chips, and 2) more shared structures internal to the DRAM chip, which causes farther (and multiple) rows to be affected by circuit-level interference.

**Observation 6.** *For a given DRAM manufacturer, chips of newer DRAM technology nodes can exhibit RowHammer bit flips 1) in more rows and 2) farther away from the victim row.*

Second, we observe that rows containing RowHammer bit flips that are farther from the victim row have fewer RowHammer bit flips than rows closer to the victim row. Non-victim rows adjacent to the aggressor rows ($x = 2$ and $x = -2$) contain RowHammer bit flips, and these bit flips demonstrate the effectiveness of a single-sided RowHammer attack, as only one of their adjacent rows is repeatedly accessed. As discussed earlier (Section 6.3.3), the single-sided RowHammer attack is not as effective as the double-sided RowHammer attack, and therefore we find fewer bit flips in these rows. In rows farther away from the victim row, we attribute the diminishing number of RowHammer bit flips to the diminishing effects of cell-to-cell interference with distance.

**Observation 7.** *The number of RowHammer bit flips that occur in a given row decreases as the distance from the victim row increases.*

Third, we observe that only even-numbered offsets from the victim row contain RowHammer bit flips in all chips except LPDDR4-1x chips from Manufacturer B. However, the rows containing RowHammer bit flips in Manufacturer B's LPDDR4-1x chips would be even-numbered offsets if we translate all rows to physical rows based on our observation in Section 6.3.3 (i.e., divide each row number by 2 and round down). While we are uncertain why we observe RowHammer bit flips only in physical even-numbered offsets from the victim row, we believe that it may be due to the internal circuitry layout of DRAM rows.
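For manufacturer B's LPDDR4-1x chips, the translation from logical to physical rows mentioned above can be sketched as follows. The function names are hypothetical, and the sketch assumes that logical rows $2k$ and $2k + 1$ share physical wordline $k$, as suggested by our observation in Section 6.3.3.

```python
def physical_row(logical_row: int) -> int:
    """Consecutive logical row pairs share a wordline: divide by 2, round down."""
    return logical_row // 2


def physical_offset(victim_logical: int, other_logical: int) -> int:
    """Physical row distance of `other_logical` from the victim row."""
    return physical_row(other_logical) - physical_row(victim_logical)


# With logical victim row 4, the aggressors at logical rows 2 and 6
# sit at physical offsets -1 and +1.
print(physical_offset(4, 2), physical_offset(4, 6))
```

Under this assumption, logical rows 0 and 1 both map to physical offset $-2$ from logical victim row 4, which matches the even-numbered physical offsets described above.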

We next study the spatial distribution of RowHammer-vulnerable DRAM cells in a DRAM array using the same set of RowHammer bit flips. Figure 6-5 shows the distribution of 64-bit words containing $x$ RowHammer bit flips across our tested DRAM chips. We find the proportion of 64-bit words containing $x$ RowHammer bit flips out of all 64-bit words in each chip containing any RowHammer bit flip and plot the distribution as a bar chart with error bars for each $x$ value.
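The per-word distribution can be sketched as follows. This is a minimal illustration, assuming bit flips are given as linear bit indices within one chip so that integer division by 64 identifies the containing word; the function name and the example addresses are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

WORD_BITS = 64


def flips_per_word_distribution(flipped_bits: Iterable[int]) -> Dict[int, float]:
    """Map x to the fraction of words with exactly x bit flips,
    among all words that contain at least one bit flip."""
    # Count unique flipped bits per 64-bit word.
    flips_in_word = Counter(addr // WORD_BITS for addr in set(flipped_bits))
    # Count how many words contain x flips, for each x.
    words_with_x = Counter(flips_in_word.values())
    total_words = sum(words_with_x.values())
    return {x: n / total_words for x, n in sorted(words_with_x.items())}


# Made-up bit indices: words 0 and 4 hold one flip, word 1 two, word 2 three.
example_bits = [3, 70, 75, 130, 131, 190, 300]
print(flips_per_word_distribution(example_bits))
```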



Figure 6-5: Distribution of the number of RowHammer bit flips per 64-bit word for each DRAM type-node configuration.

**Observation 8.** *At a RowHammer bit flip rate of  $10^{-6}$ , a single 64-bit value can contain up to four RowHammer bit flips.*

Because ECC [151, 6, 248, 259] is typically implemented for DRAM at a 64-bit granularity (e.g., a single-error correcting code would only protect a 64-bit word if it contains at most one error), Observation 8 indicates that even at a relatively low bit flip rate of $10^{-6}$, a DRAM chip can only be protected from RowHammer bit flips with a strong ECC (e.g., a 4-bit error correcting code), which has high hardware overhead.

**Observation 9.** *The distribution of RowHammer bit flip density per word changes significantly in LPDDR4 chips compared to other DRAM types.*

We find DDR3 and DDR4 chips across all manufacturers to exhibit an exponential decay curve for increasing RowHammer bit flip densities with most words containing only one RowHammer bit flip. However, LPDDR4 chips across all manufacturers exhibit a much smaller fraction of words containing a single RowHammer bit flip and significantly larger fractions of words containing two and three RowHammer bit flips compared to DDR3 and DDR4 chips. We believe this change in the bit flip density distribution is due to the on-die ECC that manufacturers have included in LPDDR4 chips [151, 6, 248, 259], which is a 128-bit single-error correcting code that corrects and hides *most* single-bit failures within a 128-bit ECC word using redundant bits (i.e., *parity-check bits*) that are hidden from the system.

With the failure rates at which we test, many ECC words contain several bit flips. This exceeds the ECC’s correction strength and causes the ECC logic to behave in an undefined way. The ECC logic may 1) correct one of the bit flips, 2) do nothing, or 3) introduce an *additional* bit flip by corrupting an error-free data bit [304, 259]. On-die ECC makes single-bit errors rare because 1) any true single-bit error is immediately corrected and 2) a multi-bit error can *only* be reduced to a single-bit error when there are no more than two bit flips within the data bits *and* the ECC logic’s undefined action happens to change the bit flip count to exactly one. In contrast, there are many more scenarios that yield two or three bit flips within the data bits. A detailed experimental analysis of how on-die ECC affects DRAM failure rates in LPDDR4 DRAM chips can be found in [259].

#### 6.4.5 First RowHammer Bit Flips

We next study the vulnerability of each chip to RowHammer. One critical component of vulnerability to the double-sided RowHammer attack [170] is identifying the weakest cell, i.e., the DRAM cell that fails with the fewest accesses to physically-adjacent rows. In order to perform this study, we sweep  $HC$  at a fine granularity and record the  $HC$  that results in the first RowHammer bit flip in the chip ( $HC_{\text{first}}$ ). Figure 6-6 plots the distribution of  $HC_{\text{first}}$  across all tested chips as box-and-whisker plots.<sup>8</sup> The subplots contain the distributions of each tested DRAM type-node configuration for the different DRAM manufacturers. The x-axis organizes the distributions by DRAM type-node configuration in order of age (older on the left to younger on the right). We further subdivide the subplots for chips of the same DRAM type (e.g., DDR3, DDR4, LPDDR4) with vertical lines. Chips of the same DRAM type are colored with the same color for easier visual comparison across DRAM manufacturers.

---

<sup>8</sup>A box-and-whisker plot emphasizes the important metrics of a dataset’s distribution. The box is lower-bounded by the first quartile (i.e., the median of the first half of the ordered set of data points) and upper-bounded by the third quartile (i.e., the median of the second half of the ordered set of data points). The median falls within the box. The *inter-quartile range* (IQR) is the distance between the first and third quartiles (i.e., box size). Whiskers extend an additional  $1.5 \times IQR$  on either side of the box. We indicate outliers, or data points outside of the range of the whiskers, with pluses.



Figure 6-6: Number of hammers required to cause the first RowHammer bit flip ( $HC_{\text{first}}$ ) per chip across DRAM type-node configurations.

**Observation 10.** *Newer chips from a given DRAM manufacturer appear to be more vulnerable to RowHammer bit flips. This is demonstrated by the clear reduction in  $HC_{\text{first}}$  values from old to new DRAM generations (e.g., LPDDR4-1x to LPDDR4-1y in manufacturer A, or DDR4-old to DDR4-new in manufacturers A and C).*

We believe this observation is due to DRAM technology process scaling wherein both 1) DRAM cell capacitance reduces and 2) DRAM cell density increases as technology node size reduces. Both factors together lead to more interference between cells and likely faster charge leakage from the DRAM cells' smaller capacitors, leading to a higher vulnerability to RowHammer. We find two exceptions to this trend (i.e., a general increase in  $HC_{\text{first}}$  from DDR3-old to DDR3-new chips of manufacturer A and from DDR4-old to DDR4-new chips of manufacturer B), but we believe these potential anomalies may be due to our inability to identify explicit manufacturing dates and correctly categorize these particular chips.

**Observation 11.** *In LPDDR4-1y chips from manufacturer A, there are chips whose weakest cells fail after only 4800 hammers.*

This observation has serious implications for the future as DRAM technology node sizes will continue to reduce and  $HC_{\text{first}}$  will only get smaller. We discuss these implications further in Section 6.5. Table 6.4 shows the lowest observed  $HC_{\text{first}}$  value for any chip within a DRAM type-node configuration (i.e., the minimum values of each distribution in Figure 6-6).

Table 6.4: Lowest  $HC_{\text{first}}$  values (hammers until the first bit flip,  $\times 1000$ ) across all chips of each DRAM type-node configuration.

| DRAM type-node | Mfr. A | Mfr. B | Mfr. C |
|----------------|--------|--------|--------|
| DDR3-old       | 69.2   | 157    | 155    |
| DDR3-new       | 85     | 22.4   | 24     |
| DDR4-old       | 17.5   | 30     | 87     |
| DDR4-new       | 10     | 25     | 40     |
| LPDDR4-1x      | 43.2   | 16.8   | N/A    |
| LPDDR4-1y      | 4.8    | N/A    | 9.6    |

**Effects of ECC.** The use of error correcting codes (ECC) to improve the reliability of a DRAM chip is common practice, with most system-level [29, 66, 100, 167] or on-die [6, 180, 151, 259, 183] ECC mechanisms providing single error correction capabilities at the granularity of 64- or 128-bit words. We examine 64-bit ECCs since, for the same correction capability (e.g., single-error correcting), they are stronger than 128-bit ECCs. In order to determine the efficacy with which ECC can mitigate RowHammer effects on real DRAM chips, we carefully study three metrics across each of our chips: 1) the lowest  $HC$  required to cause the first RowHammer bit flip (i.e.,  $HC_{\text{first}}$ ) for a given chip (shown in Figure 6-6), 2) the lowest  $HC$  required to cause at least two RowHammer bit flips (i.e.,  $HC_{\text{second}}$ ) within any 64-bit word, and 3) the lowest  $HC$  required to cause at least three RowHammer bit flips (i.e.,  $HC_{\text{third}}$ ) within any 64-bit word. These quantities tell us, for ECCs of varying strengths (e.g., single-error correction code, double-error correction code), at which  $HC$  values the ECC can 1) mitigate RowHammer bit flips and 2) no longer reliably mitigate RowHammer bit flips for that particular chip.
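The three metrics can be illustrated with a short sketch. It assumes the result of an  $HC$  sweep is available as a mapping from each tested  $HC$  to the linear indices of the bits that flipped at that  $HC$ ; the function name `lowest_hc_with_k_flips` and the example data are hypothetical and are not taken from our measurements.

```python
from collections import Counter

WORD_BITS = 64  # ECC word size examined in the text

def lowest_hc_with_k_flips(flips_by_hc, k):
    """Return the lowest HC at which some 64-bit word holds >= k flips, else None."""
    for hc in sorted(flips_by_hc):
        per_word = Counter(bit // WORD_BITS for bit in set(flips_by_hc[hc]))
        if per_word and max(per_word.values()) >= k:
            return hc
    return None

# Hypothetical sweep results: HC -> flipped bit indices observed at that HC.
sweep = {
    20_000: [],
    30_000: [5],
    50_000: [5, 9],
    90_000: [5, 9, 40, 700],
}

hc_first = lowest_hc_with_k_flips(sweep, 1)
hc_second = lowest_hc_with_k_flips(sweep, 2)
hc_third = lowest_hc_with_k_flips(sweep, 3)

# HC multipliers between consecutive ECC strengths.
mult_first_to_second = hc_second / hc_first
mult_second_to_third = hc_third / hc_second
print(hc_first, hc_second, hc_third, mult_first_to_second, mult_second_to_third)
```

A single-error correcting code tolerates one flip per word, so under this simplified view the effective first-failure point moves from `hc_first` to `hc_second`.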

Figure 6-7 plots, as a bar graph, the  $HC$  (left y-axis) required to find the first 64-bit word containing one, two, and three RowHammer bit flips (x-axis) across each DRAM type-node configuration. The error bars represent the standard deviation of  $HC$  values across all chips tested. On the same figure, we also plot, with red boxplots, the increase in  $HC$  (right y-axis) between the  $HC$  values required to find the first 64-bit word containing one and two RowHammer bit flips, and two and three RowHammer bit flips. These multipliers indicate how  $HC_{\text{first}}$  would change in a chip if the chip uses single-error correcting ECC or moves from a single-error correcting to a double-error correcting ECC. Note that we 1) leave two plots (i.e., Mfr. A DDR3-new and Mfr. C DDR4-old) empty since we are unable to induce enough RowHammer bit flips to find 64-bit words containing more than one bit flip in the chips and 2) do not include data from our LPDDR4 chips because they already include on-die ECC [6, 180, 151, 259, 183], which obfuscates errors potentially exposed to any other ECC mechanisms [259].



Figure 6-7: Hammer Count (left y-axis) required to find the first 64-bit word containing one, two, and three RowHammer bit flips. Hammer Count Multiplier (right y-axis) quantifies the  $HC$  difference between every two points on the x-axis (as a multiplication factor of the left point to the right point).

**Observation 12.** *A single-error correcting code can significantly improve  $HC_{\text{first}}$  by up to  $2.78\times$  in DDR4-old and DDR4-new DRAM chips, and  $1.65\times$  in DDR3-new DRAM chips.*

**Observation 13.** *Moving from a double-error correcting code to a triple-error correcting code has diminishing returns in DDR4-old and DDR4-new DRAM chips (as indicated by the reduction in the HC multiplier) compared to when moving from a single-error correcting code to a double-error correcting code. However, using a triple-error correcting code in DDR3-new DRAM chips continues to further improve the  $HC_{\text{first}}$  and thus reduce the DRAM chips' vulnerability to RowHammer.*

#### 6.4.6 Single-Cell RowHammer Bit Flip Probability

We examine how the failure probability of a single RowHammer bit flip changes as  $HC$  increases. We sweep  $HC$  from 25k to 150k with a step size of 5k and hammer each DRAM row over 20 iterations. For each  $HC$  value, we identify each cell's bit flip probability (i.e., the number of times we observe a RowHammer bit flip in that cell out of all 20 iterations). We then observe how each cell's bit flip probability changes as  $HC$  increases. We expect that by exacerbating the RowHammer conditions (e.g., increasing the hammer count), the exacerbated circuit-level interference effects should result in an increasing RowHammer bit flip probability for each individual cell. Out of the full set of cells in which we observe *any* RowHammer bit flip, Table 6.5 lists the percentage of cells that have a strictly monotonically increasing bit flip probability as we increase  $HC$ .

Table 6.5: Percentage of cells with monotonically increasing RowHammer bit flip probabilities as  $HC$  increases.

| DRAM type-node | Mfr. A (%)     | Mfr. B (%)     | Mfr. C (%)     |
|----------------|----------------|----------------|----------------|
| DDR3-new       | $97.6 \pm 0.2$ | 100            | 100            |
| DDR4-old       | $98.4 \pm 0.1$ | 100            | 100            |
| DDR4-new       | $99.6 \pm 0.1$ | 100            | 100            |
| LPDDR4-1x      | $50.3 \pm 1.2$ | $52.4 \pm 1.4$ | N/A            |
| LPDDR4-1y      | $47.0 \pm 0.8$ | N/A            | $54.3 \pm 5.7$ |
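The per-cell monotonicity check behind Table 6.5 could be computed along the following lines. This is a sketch under stated assumptions rather than the analysis code itself: the input layout (flip counts per cell, ordered by increasing  $HC$ ), the function names, and the example cells are all hypothetical. The `strict` flag is an addition of this sketch that selects between the strict reading used in the text and a non-decreasing reading that tolerates ties between adjacent  $HC$  values.

```python
def is_monotonic(probs, strict=True):
    """True if probs increases along the HC sweep (ties allowed if not strict)."""
    pairs = list(zip(probs, probs[1:]))
    if strict:
        return all(a < b for a, b in pairs)
    return all(a <= b for a, b in pairs)

def percent_monotonic(flip_counts_by_cell, iterations=20, strict=True):
    """flip_counts_by_cell: {cell: [flips seen at each HC step, in sweep order]}."""
    if not flip_counts_by_cell:
        return 0.0
    hits = 0
    for counts in flip_counts_by_cell.values():
        # Bit flip probability = observed flips / number of iterations.
        probs = [c / iterations for c in counts]
        hits += is_monotonic(probs, strict)
    return 100.0 * hits / len(flip_counts_by_cell)

# Hypothetical cells: cell_b dips (8 -> 6), cell_c has a tie (1 -> 1).
cells = {
    "cell_a": [0, 5, 12, 20],
    "cell_b": [0, 8, 6, 20],
    "cell_c": [1, 1, 9, 20],
}
strict_pct = percent_monotonic(cells)
relaxed_pct = percent_monotonic(cells, strict=False)
print(strict_pct, relaxed_pct)
```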

**Observation 14.** *For DDR3 and DDR4 chips, an overwhelming majority (i.e., more than 97%) of the cells tested have monotonically increasing RowHammer bit flip probabilities.*

This observation indicates that exacerbating the RowHammer conditions by increasing  $HC$  increases the probability that a DRAM cell experiences a RowHammer bit flip. However, we find that the proportion of cells with monotonically increasing RowHammer bit flip probabilities as  $HC$  increases is only around 50% in the LPDDR4 chips that we test. We believe that this decrease is due to the addition of on-die ECC in LPDDR4 chips, which can obscure the probability of observing a RowHammer bit flip from the system’s perspective in two ways. First, a RowHammer bit flip at bit X may no longer be observable from the system’s perspective if another RowHammer bit flip at bit Y occurs within the same ECC word as a result of increasing  $HC$ , and the error correction logic corrects the RowHammer bit flip at bit X. Second, the system may temporarily observe a bit flip at bit X at a specific  $HC$  if the set of real RowHammer bit flips within an ECC word results in a miscorrection at bit X. Since this bit flip is a result of the ECC logic misbehaving rather than circuit-level interference, we do not observe the expected trends for these transient miscorrected bits.

## 6.5 Implications for Future Systems

Our characterization results have major implications for continued DRAM technology scaling since DRAM’s increased vulnerability to RowHammer means that systems employing future DRAM devices will likely need to handle significantly elevated failure rates. While prior works propose a wide variety of RowHammer failure mitigation techniques (described in Sections 6.5.1 and 6.6), these mechanisms will need to manage increasing failure rates going forward and will likely suffer from high overhead (as we show in Section 6.5.2).

While DRAM and system designers currently implement several RowHammer mitigation mechanisms (e.g., *pseudo Target Row Refresh* (pTRR) [148], *Target Row Refresh* (TRR) [202])<sup>9</sup>, the designers make a number of unknown implementation choices in these RowHammer mitigation mechanisms that are not discussed in public documentation. Therefore, we cannot fairly evaluate how their performance overheads scale as DRAM chips become more vulnerable to RowHammer. Instead, we evaluate five state-of-the-art academic proposals for RowHammer mitigation mechanisms [170, 194, 303, 372] as well as an ideal refresh-based mitigation mechanism.

We evaluate each RowHammer mitigation mechanism in terms of two major challenges that they will face going forward as they will need to support DRAM chips more vulnerable to RowHammer: design scalability and system performance overhead. We first qualitatively explain and discuss the five state-of-the-art mitigation mechanisms and how they can potentially scale to support DRAM chips that are more vulnerable to RowHammer. We then quantitatively evaluate their performance overheads in simulation as  $HC_{\text{first}}$  decreases. In order to show the opportunity for reducing performance overhead in RowHammer mitigation, we also implement and study an *ideal refresh-based mechanism* that prevents RowHammer by refreshing a DRAM row *only* immediately before it is about to experience a bit flip.

### 6.5.1 RowHammer Mitigation Mechanisms

There is a large body of work (e.g., [16, 341, 178, 41, 358, 36, 169, 351, 49, 198, 94, 104, 135]) that proposes software-based RowHammer mitigation mechanisms. Unfortunately, many of these works have critical weaknesses (e.g., inability to track all DRAM activations) that make them vulnerable to carefully-crafted RowHammer attacks, as demonstrated in some follow-up works (e.g., [104]). Therefore, we focus on evaluating six mechanisms (i.e., five state-of-the-art hardware proposals and one *ideal* refresh-based mitigation mechanism), which address a strong threat model that assumes an attacker can cause row activations with precise memory location and timing information. We briefly explain each mitigation mechanism and how its design scales for DRAM chips with increased vulnerability to RowHammer (i.e., lower  $HC_{\text{first}}$  values).

---

<sup>9</sup>Frigo et al. [87] recently demonstrated that these mechanisms do *not* prevent *all* RowHammer bit flips from being exposed to the system, and an attacker can still take over a system even with these mechanisms in place.

**Increased Refresh Rate [170].** The original RowHammer study [170] describes increasing the overall DRAM refresh rate such that it is impossible to issue enough activations to any single DRAM row within one refresh window (i.e., the time between two consecutive refresh commands to a single DRAM row) to induce a RowHammer bit flip. The study notes that this is an undesirable mitigation mechanism due to its associated performance and energy overheads. In order to reliably mitigate RowHammer bit flips with this mechanism, we scale the refresh rate such that the refresh window ( $t_{REFW}$ ) equals the number of hammers until the first RowHammer bit flip (i.e.,  $HC_{\text{first}}$ ) multiplied by the activation latency  $t_{RC}$ . Due to the large number of rows that must be refreshed within a refresh window, this mechanism inherently does not scale to  $HC_{\text{first}}$  values below 32k.
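To make the scaling rule  $t_{REFW} = HC_{\text{first}} \times t_{RC}$  concrete, the following sketch computes the scaled refresh window for a few  $HC_{\text{first}}$  values. The  $t_{RC}$  of 46.25 ns and the nominal 64 ms refresh window are illustrative assumptions, not values stated in this section, and the function names are hypothetical.

```python
# Sketch of scaling the refresh window as t_REFW = HC_first * t_RC.
# ASSUMPTIONS (not from the text): t_RC = 46.25 ns, nominal t_REFW = 64 ms.
T_RC_NS = 46.25
NOMINAL_T_REFW_MS = 64.0

def scaled_refresh_window_ms(hc_first, t_rc_ns=T_RC_NS):
    """Refresh window (ms) in which at most hc_first activations fit."""
    return hc_first * t_rc_ns / 1e6

def refresh_rate_multiplier(hc_first, t_rc_ns=T_RC_NS,
                            nominal_ms=NOMINAL_T_REFW_MS):
    """How many times more often each row is refreshed than nominally."""
    return nominal_ms / scaled_refresh_window_ms(hc_first, t_rc_ns)

for hc_first in (200_000, 69_200, 32_768):
    window = scaled_refresh_window_ms(hc_first)
    factor = refresh_rate_multiplier(hc_first)
    print(f"HC_first={hc_first}: t_REFW={window:.3f} ms, refresh rate x{factor:.1f}")
```

Under these assumed timings, the required refresh rate grows inversely with  $HC_{\text{first}}$ , which is the source of the performance and energy overheads noted above.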

**PARA [170].** Every time a row is opened and closed, PARA (Probabilistic Adjacent Row Activation) refreshes one or more of the row's adjacent rows with a low probability  $p$ . Due to PARA's simple approach, it is possible to easily tune  $p$  when PARA must protect a DRAM chip with a lower  $HC_{\text{first}}$  value. In our evaluation of PARA, we scale  $p$  for different values of  $HC_{\text{first}}$  such that the bit error rate (BER) does not exceed 1e-15 per hour of continuous hammering.<sup>10</sup>
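PARA's per-activation decision can be sketched as follows. This is a simplified illustration, not the evaluated implementation: the class name `PARASketch` and its interface are hypothetical, and the sketch assumes that both in-range neighbors are refreshed whenever the probabilistic check succeeds.

```python
import random

class PARASketch:
    """Simplified PARA: refresh neighbors of a closed row with probability p."""

    def __init__(self, p, num_rows, seed=None):
        self.p = p
        self.num_rows = num_rows
        self.rng = random.Random(seed)  # seeded for reproducible experiments

    def on_row_close(self, row):
        """Return the adjacent rows to refresh for this activation (maybe none)."""
        # random() lies in [0, 1), so p = 1 always refreshes and p = 0 never does.
        if self.rng.random() >= self.p:
            return []
        # ASSUMPTION: refresh both neighbors that exist in the bank.
        return [r for r in (row - 1, row + 1) if 0 <= r < self.num_rows]

# Usage: count the refreshes triggered by hammering one row.
para = PARASketch(p=0.001, num_rows=16384, seed=42)
refreshes = sum(len(para.on_row_close(100)) for _ in range(100_000))
print("extra row refreshes:", refreshes)
```

Because the mechanism keeps no per-row state, supporting a lower  $HC_{\text{first}}$  only requires raising `p`, at the cost of more refreshes per activation.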

---

<sup>10</sup>We adopt this BER from typical consumer memory reliability targets [231, 143, 260, 47, 218, 219, 45, 217, 46].

**ProHIT [303].** ProHIT maintains a history of DRAM activations in a set of tables to identify any row that may be activated  $HC_{\text{first}}$  times. ProHIT manages the tables probabilistically to minimize the overhead of tracking frequently-activated DRAM rows. ProHIT [303] uses a pair of tables labeled "Hot" and "Cold" to track the victim rows. When a row is activated, ProHIT checks whether each adjacent row is already in either of the tables. If a row is not in either table, it is inserted into the cold table with a probability  $p_i$ . If the table is full, the least recently inserted entry in the cold table is then evicted with a probability  $(1 - p_e) + p_e / (\#\text{cold\_entries})$  and the other entries are evicted with a probability  $p_e / (\#\text{cold\_entries})$ . If the row already exists in the cold table, the row is promoted to the highest-priority entry in the hot table with a probability  $(1 - p_t) + p_t / (\#\text{hot\_entries})$  and to other entries with a probability  $p_t / (\#\text{hot\_entries})$ . If the row already exists in the hot table, the entry is upgraded to a higher priority position. During each refresh command, ProHIT simultaneously refreshes the row at the top entry of the hot table, since this row has likely experienced the largest number of activations, and then removes the entry from the table.
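As a quick consistency check on the eviction rule, and assuming that exactly one cold-table entry is evicted per insertion into a full table, the stated probabilities sum to one. Writing  $N_c$  for  $\#\text{cold\_entries}$  (a shorthand introduced here for illustration):

$$
\underbrace{(1 - p_e) + \frac{p_e}{N_c}}_{\text{least recently inserted entry}} + \underbrace{(N_c - 1)\,\frac{p_e}{N_c}}_{\text{remaining } N_c - 1 \text{ entries}} = (1 - p_e) + p_e = 1
$$

The same argument applies to the promotion rule with  $p_t$  and  $\#\text{hot\_entries}$ , under the analogous assumption that a promoted row lands in exactly one hot-table position.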

For ProHIT [303] to effectively mitigate RowHammer with decreasing  $HC_{\text{first}}$  values, the size of the tables and the probabilities for managing the tables (e.g.,  $p_i$ ,  $p_e$ ,  $p_t$ ) must be adjusted. Even though Son et al. show a low-cost mitigation mechanism for a specific  $HC_{\text{first}}$  value (i.e., 2000), they do *not* provide models for appropriately setting these values for arbitrary  $HC_{\text{first}}$  values and how to do so is not intuitive. Therefore, we evaluate ProHIT only when  $HC_{\text{first}} = 2000$ .

**MRLoc [372].** MRLoc refreshes a victim row using a probability that is dynamically adjusted based on each row’s access history. This way, according to memory access locality, the rows that have been recorded as a victim more recently have a higher chance of being refreshed. MRLoc uses a queue to store victim row addresses on each activation. Depending on the time between two insertions of a given victim row into the queue, MRLoc adjusts the probability with which it issues a refresh to the victim row that is present in the queue.

MRLoc’s parameters (the queue size and the parameters used to calculate the probability of refresh) are tuned for  $HC_{\text{first}} = 2000$ . You et al. [372] choose the values for these parameters empirically, and there is no concrete discussion on how to adjust these parameters as  $HC_{\text{first}}$  changes. Therefore, we evaluate MRLoc for only  $HC_{\text{first}} = 2000$ .

As such, even though we quantitatively evaluate both ProHIT [303] and MRLoc [372] for completeness and they may seem to have good overhead results at one data point, we are unable to demonstrate how their overheads scale as DRAM chips become more vulnerable to RowHammer.

**TWiCe [194].** TWiCe tracks the number of times a victim row’s aggressor rows are activated using a table of counters and refreshes a victim row when its count is above a threshold such that RowHammer bit flips cannot occur. TWiCe uses two counters per entry: 1) a *lifetime counter*, which tracks the length of time the entry has been in the table, and 2) an *activation counter*, which tracks the number of times an aggressor row is activated. The key idea is that TWiCe can use these two counters to determine the rate at which a row is being hammered and can quickly prune entries that have a low rate of being hammered. TWiCe also minimizes its table size based on the observation that the number of rows that can be activated enough times to induce RowHammer failures within a refresh window is bounded by the DRAM chip's vulnerability to RowHammer.

When a row is activated, TWiCe checks whether its adjacent rows are already in the table. If so, the activation count for each row is incremented. Otherwise, new entries are allocated in the table for each row. Whenever a row's activation count surpasses a threshold  $t_{RH}$  defined as  $HC_{\text{first}}/4$ , TWiCe refreshes the row. TWiCe also defines a pruning stage that 1) increments each lifetime counter, 2) checks each row's hammer rate based on both counters, and 3) prunes entries that have a lifetime hammer rate lower than a *pruning threshold*, which is defined as  $t_{RH}$  divided by the number of refresh operations per refresh window (i.e.,  $t_{RH}/(t_{REFW}/t_{REFI})$ ). TWiCe performs pruning operations during refresh commands so that the latency of a pruning operation is hidden behind the DRAM refresh commands.
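The table update and pruning steps described above can be sketched as follows. This is an illustrative model only, not the TWiCe hardware design: the class name `TWiCeSketch` and its methods are hypothetical, a newly allocated entry is assumed to start with an activation count of one, an entry is assumed to be removed once its row is refreshed, and the rate comparison uses cross-multiplication as a software convenience that says nothing about how hardware would implement it.

```python
class TWiCeSketch:
    """Illustrative counter table: refresh at t_RH, prune slow-growing entries."""

    def __init__(self, hc_first, refreshes_per_window=8192, num_rows=16384):
        self.t_rh = hc_first // 4              # refresh threshold, HC_first / 4
        self.refreshes_per_window = refreshes_per_window  # t_REFW / t_REFI
        self.num_rows = num_rows
        self.table = {}                        # row -> [act_cnt, life]

    def on_activate(self, row):
        """Update entries of adjacent rows; return rows that must be refreshed."""
        to_refresh = []
        for victim in (row - 1, row + 1):
            if not 0 <= victim < self.num_rows:
                continue
            entry = self.table.setdefault(victim, [0, 0])  # allocate if absent
            entry[0] += 1
            if entry[0] > self.t_rh:           # count surpasses t_RH
                to_refresh.append(victim)
                del self.table[victim]         # ASSUMPTION: drop after refresh
        return to_refresh

    def on_refresh_command(self):
        """Pruning stage, run once per refresh command."""
        for row in list(self.table):
            entry = self.table[row]
            entry[1] += 1                      # increment lifetime counter
            # Prune if act_cnt / life < t_RH / refreshes_per_window.
            if entry[0] * self.refreshes_per_window < self.t_rh * entry[1]:
                del self.table[row]

# Usage with small hypothetical parameters: t_RH = 4, pruning threshold = 1.
twice = TWiCeSketch(hc_first=16, refreshes_per_window=4, num_rows=32)
results = [twice.on_activate(10) for _ in range(5)]
print(results)
```

With these small parameters, the first four activations of row 10 trigger nothing and the fifth returns both neighbors for refresh.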

If  $t_{RH}$  is lower than the number of refresh intervals in a refresh window (i.e., 8192), a couple of complications arise in the design. TWiCe either 1) cannot prune its table, resulting in a very large table size since every row that is accessed at least once will remain in the table until the end of the refresh window or 2) requires floating point operations in order to calculate thresholds for pruning, which would significantly increase the latency of the pruning stage. Either way, the pruning stage latency would increase significantly since a larger table also requires more time to check each entry, and the latency may no longer be hidden by the refresh command.

As a consequence, TWiCe does *not* support  $t_{RH}$  values lower than the number of refresh intervals in a refresh window ( $\sim 8k$  in several DRAM standards, e.g., DDR3, DDR4, LPDDR4). This means that in its current form, we *cannot* fairly evaluate TWiCe for  $HC_{\text{first}}$  values below  $32k$ , as  $t_{RH} = HC_{\text{first}}/4$ . However, we do evaluate an ideal version of TWiCe (i.e., *TWiCe-ideal*) for  $HC_{\text{first}}$  values below  $32k$  assuming that TWiCe-ideal solves *both* issues of the large table size and the high-latency pruning stage at lower  $HC_{\text{first}}$  values.
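A short worked example shows where the  $32k$  limit comes from. Using the definitions above with 8192 refresh intervals per refresh window, and writing  $th_{\text{prune}}$  for the pruning threshold (a symbol introduced here for illustration):

$$
th_{\text{prune}} = \frac{t_{RH}}{t_{REFW}/t_{REFI}} = \frac{HC_{\text{first}}/4}{8192}
$$

$$
HC_{\text{first}} = 32768:\quad th_{\text{prune}} = \frac{8192}{8192} = 1
\qquad\qquad
HC_{\text{first}} = 4800:\quad th_{\text{prune}} = \frac{1200}{8192} \approx 0.146
$$

Here  $32k$  is taken to mean 32768, which is an assumption of this example, and 4800 is the lowest  $HC_{\text{first}}$  in Table 6.4. Once  $HC_{\text{first}}$  drops below  $32k$ , the pruning threshold falls below one activation per refresh interval and is no longer an integer, which corresponds to the two complications described above.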

**Ideal Refresh-based Mitigation Mechanism.** We implement an ideal refresh-based mitigation mechanism that tracks all activations to every row in DRAM and issues a refresh command to a row only right before it can potentially experience a RowHammer bit flip (i.e., when a physically-adjacent row has been activated  $HC_{\text{first}}$  times).
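A minimal model of this ideal mechanism is sketched below. It is a simplification under stated assumptions, not the simulated implementation: the class name `IdealRefreshSketch` is hypothetical, activations are counted per aggressor row, and the counter of a row is reset once its neighbors have been refreshed.

```python
from collections import defaultdict

class IdealRefreshSketch:
    """Refresh a row's neighbors only when that row reaches HC_first activations."""

    def __init__(self, hc_first, num_rows):
        self.hc_first = hc_first
        self.num_rows = num_rows
        # Activations of each row since its neighbors were last refreshed.
        self.acts = defaultdict(int)

    def on_activate(self, row):
        """Return the rows to refresh as a result of this activation."""
        self.acts[row] += 1
        if self.acts[row] < self.hc_first:
            return []
        self.acts[row] = 0  # ASSUMPTION: restart counting after the refresh
        return [r for r in (row - 1, row + 1) if 0 <= r < self.num_rows]

# Usage with a small hypothetical HC_first of 3.
ideal = IdealRefreshSketch(hc_first=3, num_rows=8)
trace = [ideal.on_activate(4) for _ in range(6)]
print(trace)
```

Tracking every row exactly is what makes the mechanism ideal: it issues no refresh that is not strictly required, which a practical design cannot do without a counter per row.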

### 6.5.2 Evaluation of Viable Mitigation Mechanisms

We first describe our methodology for evaluating the five state-of-the-art RowHammer mitigation mechanisms (i.e., increased refresh rate [170], PARA [170], ProHIT [303], MRLoc [372], TWiCe [194]) and the ideal refresh-based mitigation mechanism.

#### Evaluation Methodology

We use Ramulator [3, 174], a cycle-accurate DRAM simulator with a simple core model and a system configuration as listed in Table 6.6, to implement and evaluate the RowHammer mitigation mechanisms. To demonstrate how the performance overhead of each mechanism would scale to future devices, we implement, to the best of our ability, parameterizable methods for scaling the mitigation mechanisms to DRAM chips with varying degrees of vulnerability to RowHammer (as described in Section 6.5.1).

Table 6.6: System configuration for simulations.

| Parameter         | Configuration                                                                     |
|-------------------|-----------------------------------------------------------------------------------|
| Processor         | 4GHz, 8-core, 4-wide issue, 128-entry instr. window                               |
| Last-level Cache  | 64-Byte cache line, 8-way set-associative, 16MB                                   |
| Memory Controller | 64 read/write request queue, FR-FCFS [274, 383]                                   |
| Main Memory       | DDR4, 1-channel, 1-rank, 4-bank groups, 4-banks per bank group, 16k rows per bank |

**Workloads.** We evaluate 48 8-core workload mixes drawn randomly from the full SPEC CPU2006 benchmark suite [5] to demonstrate the effects of the RowHammer mitigation mechanisms on systems during typical use (and *not* when a RowHammer attack is being mounted). The set of workloads exhibits a wide range of memory intensities. The workloads' MPKI values (i.e., last-level cache misses per kilo-instruction) range from 10 to 740. This wide range enables us to study the effects of RowHammer mitigation on workloads with widely varying degrees of memory intensity. We note that there could be other workloads with which mitigation mechanisms exhibit higher performance overheads, but we did not try to maximize the overhead experienced by workloads by biasing the workload construction in any way. We simulate each workload until each core executes at least 200 million instructions. For all configurations, we initially warm up the caches by fast-forwarding 100 million instructions.

**Metrics.** Because state-of-the-art RowHammer mitigation mechanisms rely on additional DRAM refresh operations to prevent RowHammer, we use two different metrics to evaluate their impact on system performance. First, we measure *DRAM bandwidth overhead*, which quantifies the fraction of the total system DRAM bandwidth consumption coming from the RowHammer mitigation mechanism. Second, we measure overall workload performance using the *weighted speedup* metric [79, 301], which effectively measures job throughput for multi-core workloads [79]. We normalize the weighted speedup to its baseline value, which we denote as 100%, and find that when using RowHammer mitigation mechanisms, most values appear below the baseline. Therefore, for clarity, we refer to normalized weighted speedup as *normalized system performance* in our evaluations.
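The two metrics can be written down compactly. The sketch below uses the standard definition of weighted speedup (the sum over cores of each application's IPC when sharing the system divided by its IPC when running alone); the function names, the generic bandwidth units, and the example values are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over cores of IPC in the multi-core mix divided by IPC when run alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def normalized_system_performance(ws_with_mitigation, ws_baseline):
    """Weighted speedup relative to the no-mitigation baseline, in percent."""
    return 100.0 * ws_with_mitigation / ws_baseline

def dram_bandwidth_overhead(mitigation_traffic, total_traffic):
    """Fraction of consumed DRAM bandwidth caused by the mitigation mechanism."""
    return mitigation_traffic / total_traffic

# Hypothetical two-core example.
ipc_alone = [2.0, 1.0]
ws_base = weighted_speedup([1.0, 0.5], ipc_alone)    # without mitigation
ws_mit = weighted_speedup([0.9, 0.45], ipc_alone)    # with mitigation
perf = normalized_system_performance(ws_mit, ws_base)
overhead = dram_bandwidth_overhead(mitigation_traffic=5, total_traffic=100)
print(ws_base, ws_mit, perf, overhead)
```

Normalizing to the baseline means that a value of 100% corresponds to no slowdown, so lower values directly express the cost of the mitigation mechanism.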

#### Evaluation of Mitigation Mechanisms

Figure 6-8 shows the results of our evaluation of the RowHammer mitigation mechanisms (as described in Section 6.5.1) for chips of varying degrees of RowHammer vulnerability (i.e.,  $200k \geq HC_{\text{first}} \geq 64$ ) for our two metrics: 1) DRAM bandwidth overhead in Figure 6-8a and 2) normalized system performance in Figure 6-8b. Each data point shows the average value across 48 workloads with minimum and maximum values drawn as error bars.

For each DRAM type-node configuration that we characterize, we plot the minimum  $HC_{\text{first}}$  value found across chips within the configuration (from Table 6.4) as a vertical line to show how each RowHammer mitigation mechanism would impact the overall system when using a DRAM chip of a particular configuration. Above the figures (sharing the x-axis with Figure 6-8), we draw horizontal lines representing the ranges of  $HC_{\text{first}}$  values that we observe for every tested DRAM chip per DRAM type-node configuration across manufacturers. We color the ranges according to DRAM type-node configuration colors in the figure, and indicate the average value with a gray point. Note that these lines directly correspond to the box-and-whisker plot ranges in Figure 6-6.

We make *five* key observations from this figure. First, DRAM bandwidth overhead is highly correlated with normalized system performance, as DRAM bandwidth consumption is the main source of system interference caused by RowHammer mitigation mechanisms. We note that several points (i.e., ProHIT, MRLoc, and TWiCe and Ideal evaluated at higher  $HC_{\text{first}}$  values) are not visible in Figure 6-8a since we are plotting an inverted log graph and these points are very close to zero. Second, in the latest DRAM chips (i.e., the LPDDR4-1y chips), only PARA, ProHIT, and MRLoc are viable options for mitigating RowHammer bit flips with reasonable average normalized system performance: 92%, 100%, and 100%, respectively. Increased Refresh Rate and TWiCe do not scale to such degrees of RowHammer vulnerability (i.e.,  $HC_{\text{first}} = 4.8k$ ), as discussed in Section 6.5.1. Third, only PARA’s design scales to low  $HC_{\text{first}}$  values that we may see in future DRAM chips, but has very low average normalized system performance (e.g., 72% when  $HC_{\text{first}} = 1024$ ; 47% when  $HC_{\text{first}} = 256$ ; 20% when  $HC_{\text{first}} = 128$ ). While TWiCe-ideal has higher normalized system performance than PARA (e.g., 98% when  $HC_{\text{first}} = 1024$ ; 86% when  $HC_{\text{first}} = 256$ ; 73% when  $HC_{\text{first}} = 128$ ), there are significant practical limitations in enabling TWiCe-ideal for such low  $HC_{\text{first}}$  values (discussed in Section 6.5.1). Fourth, ProHIT and MRLoc both exhibit high normalized system performance at their single data point (i.e., 95% and 100%, respectively, when  $HC_{\text{first}} = 2000$ ), but these works do not provide models for scaling their mechanisms to lower  $HC_{\text{first}}$  values and how to do so is not intuitive (as described in Section 6.5.1). Fifth, the ideal refresh-based mitigation mechanism is *significantly* and increasingly better than any existing mechanism as  $HC_{\text{first}}$  reduces below 1024. This indicates that there is still significant opportunity for developing a refresh-based RowHammer mitigation mechanism with low performance overhead that scales to low  $HC_{\text{first}}$  values. However, the ideal mechanism affects system performance at very low  $HC_{\text{first}}$  values (e.g., 99.96% when  $HC_{\text{first}} = 1024$ ; 97.91% when  $HC_{\text{first}} = 256$ ; 93.53% when  $HC_{\text{first}} = 128$ ), indicating the potential need for a better approach to solving RowHammer in future ultra-dense DRAM chips.



Figure 6-8: Effect of RowHammer mitigation mechanisms on a) DRAM bandwidth overhead (note the inverted log-scale y-axis) and b) system performance, as DRAM chips become more vulnerable to RowHammer (from left to right).

We conclude that while existing mitigation mechanisms may exhibit reasonably small performance overheads for mitigating RowHammer bit flips in modern DRAM chips, their overheads do *not* scale well in future DRAM chips that will likely exhibit higher vulnerability to RowHammer. Thus, we need new mechanisms and approaches to RowHammer mitigation that will scale to DRAM chips that are highly vulnerable to RowHammer bit flips.

### 6.5.3 RowHammer Mitigation Going Forward

DRAM manufacturers continue to adopt smaller technology nodes to improve DRAM storage density and are forecasted to reach 1z and 1a technology nodes within the next couple of years [326]. Unfortunately, our findings show that future DRAM chips will likely be increasingly vulnerable to RowHammer. This means that, to maintain market competitiveness without suffering factory yield loss, manufacturers will need to develop effective RowHammer mitigations for coping with increasingly vulnerable DRAM chips.

## Future Directions in RowHammer Mitigation

RowHammer mitigation mechanisms have been proposed across the computing stack ranging from circuit-level mechanisms built into the DRAM chip itself to system-level mechanisms that are agnostic to the particular DRAM chip that the system uses. Of these solutions, our evaluations in Section 6.5.1 show that, while the ideal refresh-based RowHammer mitigation mechanism, which inserts the minimum possible number of additional refreshes to prevent RowHammer bit flips, scales reasonably well to very low  $HC_{\text{first}}$  values (e.g., only 6% performance loss when  $HC_{\text{first}}$  is 128), existing RowHammer mitigation mechanisms either *cannot* scale or cause severe system performance penalties when they scale.

To develop a scalable and low-overhead mechanism that can prevent RowHammer bit flips in DRAM chips with a high degree of RowHammer vulnerability (i.e., with a low  $HC_{\text{first}}$  value), we believe it is essential to explore all possible avenues for RowHammer mitigation. Going forward, we identify two promising research directions that can potentially lead to new RowHammer solutions that can reach or exceed the scalability of the ideal refresh-based mitigation mechanism: (1) DRAM-system cooperation and (2) profile-guided mechanisms. The remainder of this section briefly discusses our vision for each of these directions.

**DRAM-System Cooperation.** Considering either DRAM-based or system-level mechanisms alone ignores the potential benefits of addressing the RowHammer vulnerability from both perspectives together. While the root causes of RowHammer bit flips lie within DRAM, their negative effects are observed at the system-level. Prior work [244, 238] stresses the importance of tackling these challenges at all levels of the stack, and we believe that a holistic solution can achieve a high degree of protection at relatively low cost compared to solutions contained within either domain alone.

**Profile-Guided Mechanisms.** The ability to accurately profile for RowHammer-susceptible DRAM cells or memory regions can provide a powerful substrate for building targeted RowHammer solutions that efficiently mitigate RowHammer bit flips at low cost. Knowing (or effectively predicting) the locations of bit flips before they occur in practice could lead to a large reduction in RowHammer mitigation overhead, providing new information that no known RowHammer mitigation mechanism exploits today. For example, within the scope of known RowHammer mitigation solutions, increasing the refresh rate can be made far cheaper by only increasing the refresh rate for known-vulnerable DRAM rows. Similarly, ECC or DRAM access counters can be used only for known-vulnerable cells, and even a software-based mechanism can be adapted to target only known-vulnerable rows (e.g., by disabling them or remapping them to reliable memory).
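To make the potential saving concrete, the following is a minimal sketch of the refresh-rate example, assuming a hypothetical profile that lists the vulnerable rows. The function name, the row counts, and the 4x refresh-rate multiplier are illustrative placeholders of ours, not values measured in this study.

```python
def extra_refreshes_per_window(total_rows, vulnerable_rows, rate_multiplier, profile_guided):
    """Row refreshes added per nominal refresh window, beyond the baseline
    of one refresh per row, when the refresh rate is multiplied."""
    # Without a profile, every row must be refreshed at the higher rate.
    # With a profile, only the known-vulnerable rows are.
    targeted = len(vulnerable_rows) if profile_guided else total_rows
    return targeted * (rate_multiplier - 1)


# Hypothetical module: 2**20 rows, of which 1 in 1024 is profiled as vulnerable.
total_rows = 2**20
vulnerable_rows = set(range(0, total_rows, 1024))

blanket = extra_refreshes_per_window(total_rows, vulnerable_rows, 4, profile_guided=False)
guided = extra_refreshes_per_window(total_rows, vulnerable_rows, 4, profile_guided=True)
print(f"blanket: {blanket} extra refreshes, profile-guided: {guided} extra refreshes")
```

Under these assumptions, the number of additional refreshes shrinks in proportion to the fraction of rows that the profile marks as vulnerable. The benefit therefore depends entirely on how accurate and how sparse the profile is.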

Unfortunately, there exists no such effective RowHammer error profiling methodology today. Our characterization in this work essentially follows the naïve approach of individually testing each row by attempting to induce the worst-case testing conditions (e.g., *HC*, data pattern, ambient temperature, etc.). However, this approach is extremely time-consuming because each row must be tested individually (potentially multiple times with various testing conditions). Even for a relatively small DRAM module of 8GB with 8KB rows, hammering each row only once for only one refresh window of 64ms requires over 17 hours of continuous testing, which means that the naïve approach to profiling is infeasible for a general mechanism that may be used in a production environment or for online operation. We believe that developing a fast and effective RowHammer profiling mechanism is a key research challenge, and we hope that future work will use the observations made in this study and other RowHammer characterization studies to find a solution.
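The profiling-time estimate can be checked with a short calculation. The sketch below is our own back-of-the-envelope arithmetic, assuming rows are tested strictly one after another for exactly one 64 ms window each. The function name and the choice between decimal and binary capacity units are assumptions made for illustration.

```python
def naive_profiling_hours(module_bytes: int, row_bytes: int, window_s: float) -> float:
    """Hours needed to hammer every row once, one refresh window per row."""
    num_rows = module_bytes // row_bytes
    return num_rows * window_s / 3600.0


REFRESH_WINDOW_S = 0.064  # 64 ms refresh window

# Decimal units: 8 GB = 8 * 10**9 bytes, 8 KB = 8 * 10**3 bytes.
hours_decimal = naive_profiling_hours(8 * 10**9, 8 * 10**3, REFRESH_WINDOW_S)

# Binary units: 8 GB = 8 * 2**30 bytes, 8 KB = 8 * 2**10 bytes.
hours_binary = naive_profiling_hours(8 * 2**30, 8 * 2**10, REFRESH_WINDOW_S)

print(f"decimal units: {hours_decimal:.1f} h, binary units: {hours_binary:.1f} h")
```

Either convention yields roughly one million rows and a total above 17 hours for a single pass. Repeating the pass for several data patterns or temperatures multiplies that time accordingly.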

## 6.6 Related Work

Although many works propose RowHammer attacks and mitigation mechanisms, only three works [170, 257, 256] provide detailed failure-characterization studies that examine how RowHammer failures manifest in real DRAM chips. However, none of these studies show how the number of activations to induce RowHammer bit flips is changing across modern DRAM types and generations, and the original RowHammer study [170] is already six years old and limited to DDR3 DRAM chips only. This section highlights the most closely related prior works that study the RowHammer vulnerability of older generation chips or examine other aspects of RowHammer.

**Real Chip Studies.** Three key studies (i.e., the pioneering RowHammer study [170] and two subsequent studies [257, 256]) perform extensive experimental RowHammer failure characterization using older DDR3 devices. However, these studies are restricted to only DDR3 devices and do not provide a scaling study of hammer counts across DRAM types and generations. In contrast, our work provides the first rigorous experimental study showing how RowHammer characteristics scale across different DRAM generations and how DRAM chips designed with newer technology nodes are increasingly vulnerable to RowHammer. Our work complements and furthers the analyses provided in prior studies.

**Simulation Studies.** Yang et al. [369] use device-level simulations to explore the root cause of the RowHammer vulnerability. While their analysis identifies a likely explanation for the failure mechanism responsible for RowHammer, they do not present experimental data taken from real devices to support their conclusions.

**RowHammer Mitigation Mechanisms.** Many prior works [372, 303, 16, 178, 341, 41, 170, 159, 135, 99, 194, 43, 23, 25, 28, 103, 26, 24, 13, 124, 197, 86, 118, 358, 36, 169, 351, 83, 49, 198, 350] propose RowHammer mitigation techniques. Additionally, several patents for RowHammer prevention mechanisms have been filed [24, 25, 28, 26, 23, 102]. However, these works do not analyze how their solutions will scale to future DRAM generations and do not provide detailed failure characterization data from modern DRAM devices. Similar and other related works on RowHammer can be found in a recent retrospective [241].

## 6.7 Limitations

Due to limited resources and available testing time, we were unable to study several aspects of RowHammer that could help to further understand it as a phenomenon.

First, while we demonstrate a scaling study on RowHammer as DRAM process technology node size scales, we are unsure of the exact process technology node sizes for several chip generations. Therefore, we could only provide a relative study on how the vulnerability to RowHammer in DRAM chips across technology node generations changes. Given the exact sizes of the technology nodes, we could provide an analytical model such that we could predict the progression of RowHammer vulnerability in future chips.

Second, we did not account for chip aging in our RowHammer study. We believe that understanding how aging affects the RowHammer vulnerability in chips is critical to understand whether a system will become more vulnerable over time as its DRAM chip ages in use. Furthermore, if aging can affect the RowHammer vulnerability of a chip to the point that mitigation mechanisms can no longer effectively mitigate RowHammer bit flips, it is important to understand when and how often a DRAM chip should be swapped out to maintain a RowHammer-free system.

## 6.8 Summary

We provide the first rigorous experimental RowHammer failure characterization study that demonstrates how the RowHammer vulnerability of modern DDR3, DDR4, and LPDDR4 DRAM chips scales across DRAM generations and technology nodes. Using experimental data from 1580 real DRAM chips produced by the three major DRAM manufacturers, we show that modern DRAM chips that use smaller process technology node sizes are significantly more vulnerable to RowHammer than older chips. Using simulation, we show that existing RowHammer mitigation mechanisms 1) suffer from prohibitively large performance overheads at projected future hammer counts and 2) are still far from an *ideal* selective-refresh-based RowHammer mitigation mechanism. Based on our study, we motivate the need for a scalable and low-overhead solution to RowHammer and provide two promising research directions to this end. We hope that the results of our study will inspire and aid future work to develop efficient solutions for the RowHammer bit flip rates we are likely to see in DRAM chips in the near future.

# Chapter 7

## Putting It All Together

Each of these four works individually provides mechanisms that improve system performance, security, or reliability. However, these works are all orthogonal in that a subset or all of them can be implemented simultaneously on the same system to provide each individual benefit with little interference. Furthermore, since none of the mechanisms requires any changes to DRAM, the implementation can be confined to the memory controller. There are many ways to potentially combine the individual mechanisms, but we provide a simple high-level example.

## 7.1 Implementing All Proposed Techniques on the Same System

Solar-DRAM, The DRAM Latency PUF, and D-RaNGe all rely on latency failure profiles for their specific cases on a given DRAM chip. Once the profiles are generated via characterization, the storage overhead of the profiles is simply the combined size of each profile. Depending on which mechanism is issuing a DRAM access, the access latency is set according to the respective profile.

As Solar-DRAM improves system performance with its faster DRAM accesses, the latency of regular DRAM accesses should be set according to the Solar-DRAM profile. In the relatively infrequent event that the system requests a PUF response or a true random value, firmware for The DRAM Latency PUF or D-RaNGe mechanism will override regular DRAM accesses and issue DRAM accesses with latencies adjusted according to their respective profiles.
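The profile selection described above can be sketched as a small software model. This is an illustrative sketch only: the class names, the per-region granularity, and all latency values are hypothetical placeholders of ours, and a real memory controller would implement the lookup in hardware or firmware.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict


class Requester(Enum):
    """Which mechanism is issuing the DRAM access (hypothetical labels)."""
    REGULAR = auto()  # ordinary program accesses, served with the Solar-DRAM profile
    PUF = auto()      # DRAM Latency PUF evaluation
    TRNG = auto()     # D-RaNGe random number generation


@dataclass
class LatencyProfile:
    """Activation latency (tRCD) per DRAM region, produced by characterization."""
    default_trcd_ns: float
    region_trcd_ns: Dict[int, float] = field(default_factory=dict)

    def trcd_for(self, region: int) -> float:
        # Regions absent from the profile fall back to the default latency.
        return self.region_trcd_ns.get(region, self.default_trcd_ns)


class ControllerModel:
    """Chooses the access latency from the profile of the requesting mechanism."""

    def __init__(self, profiles: Dict[Requester, LatencyProfile]):
        self.profiles = profiles

    def trcd_for_access(self, requester: Requester, region: int) -> float:
        return self.profiles[requester].trcd_for(region)

    def total_profile_entries(self) -> int:
        # Storage overhead is the combined size of all stored profiles.
        return sum(len(p.region_trcd_ns) for p in self.profiles.values())


# Placeholder numbers for illustration, not measured values.
controller = ControllerModel({
    Requester.REGULAR: LatencyProfile(18.0, {0: 10.0, 1: 10.0}),
    Requester.PUF: LatencyProfile(18.0, {7: 6.0}),
    Requester.TRNG: LatencyProfile(18.0, {9: 8.0}),
})

print(controller.trcd_for_access(Requester.REGULAR, 0))
print(controller.trcd_for_access(Requester.PUF, 7))
```

Keeping one profile per mechanism means that a PUF or TRNG request changes only which table is consulted, while regular accesses continue to use the Solar-DRAM profile.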

While Solar-DRAM purely provides performance benefits, evaluating PUFs (via DRAM Latency PUF) and generating true random values (via D-RaNGe) come at the cost of additional DRAM accesses interleaved within regular DRAM accesses. These additional DRAM accesses likely will increase the system performance overhead depending on the implementations (as discussed in Sections 5.6.3 and 4.6.2). Fortunately, D-RaNGe and The DRAM Latency PUF both enable flexible implementations and can minimize their overheads depending on the user needs and the importance of and the need for timely PUF evaluations and true random values.

While utilizing these mechanisms in conjunction may have compound overheads in terms of both performance and storage, we believe that enabling these mechanisms on a system simultaneously will provide system performance, security, and reliability benefits that outweigh the combined overheads of the mechanisms.

## 7.2 Cost-Benefit Analysis

While each individual mechanism provides benefits to the system, whether performance, security, or reliability, each mechanism has associated costs in deployment. These costs come in the form of 1) performance or energy overheads, 2) profiling time prior to deployment (or during online operation), or 3) minor system changes. These costs must be considered when deploying each mechanism on various systems according to the manufacturer and user constraints (e.g., fiscal, physical, time, and other resource limitations) and requirements (e.g., performance, reliability, and security service-level agreements). However, due to the flexibility and orthogonality of these mechanisms, costs can be reduced by implementing any subset of these works on a system according to the constraints and requirements.

Since our works mainly focus on demonstrating the new fundamental ideas rather than presenting an optimal implementation of each, we cannot provide a quantitative cost-benefit analysis of each of our mechanisms. However, given the benefits that each mechanism provides with the relatively low overheads in the relatively non-optimized solutions offered in the works, we do foresee future work in supporting near-optimal deployment strategies and implementations for each mechanism that would significantly reduce the associated costs and improve the ease of implementation in real systems.

# Chapter 8

## Conclusions and Future Directions

## 8.1 Conclusions

In this dissertation, we present a number of novel observations on DRAM, via characterization studies of real DRAM chips, that we exploit to develop mechanisms that improve system performance and enhance system security and reliability.

First, we introduce 1) a rigorous characterization of activation failures across 282 *real state-of-the-art LPDDR4* DRAM modules, 2) Solar-DRAM, whose key idea is to exploit our observations and issue DRAM accesses with variable latency depending on the target DRAM location’s propensity to fail with reduced access latency, and 3) an evaluation of Solar-DRAM and its three individual components, with comparisons to the state-of-the-art [55]. We find that Solar-DRAM provides significant performance improvement over the state-of-the-art DRAM latency reduction mechanism across a wide variety of workloads, *without* requiring any changes to DRAM chips or software.

Second, we propose D-RaNGe, a mechanism for extracting true random numbers with high throughput from unmodified commodity DRAM devices on any system that allows manipulation of DRAM timing parameters in the memory controller. D-RaNGe harvests fully non-deterministic random numbers from DRAM row activation failures, which are bit errors induced by intentionally accessing DRAM with lower latency than required for correct row activation. Our TRNG is based on two key observations: 1) activation failures can be induced quickly and 2) repeatedly accessing certain DRAM cells with reduced activation latency results in reading true random data. We validate the quality of our TRNG with the commonly-used NIST statistical test suite for randomness. Our evaluations show that D-RaNGe significantly outperforms the previous highest-throughput DRAM-based TRNG by up to 211x (128x on average). We conclude that DRAM row activation failures can be effectively exploited to improve the security of a wide range of systems that use commodity DRAM chips via a high-throughput true random number generator, which can enable a number of security applications such as cryptography.

Third, we introduce and analyze the DRAM latency PUF, a new DRAM PUF suitable for runtime authentication. The DRAM latency PUF intentionally violates manufacturer-specified DRAM timing parameters in order to provide many highly repeatable, unique, and unclonable PUF responses with low latency. Through experimental evaluation using 223 state-of-the-art LPDDR4 DRAM devices, we show that the DRAM latency PUF reliably generates PUF responses at runtime-accessible speeds (i.e., 88.2ms on average) at all operating temperatures. We show that the DRAM latency PUF achieves an average speedup of 152x/1426x at 70°C/55°C when compared with a DRAM retention PUF of the same DRAM capacity overhead, and it achieves even greater speedups at lower temperatures. We conclude that the DRAM latency PUF enables a fast and effective substrate for runtime device authentication across all operating temperatures, and we hope that the advent of runtime-accessible PUFs like the DRAM latency PUF and the detailed experimental characterization data we provide on modern DRAM devices will enable security architects to develop even more secure systems for future devices.

Finally, we provide the first rigorous experimental RowHammer failure characterization study that demonstrates how the RowHammer vulnerability of modern DDR3, DDR4, and LPDDR4 DRAM chips scales across DRAM generations and technology nodes. Using experimental data from 1580 real DRAM chips produced by the three major DRAM manufacturers, we show that modern DRAM chips that use smaller process technology node sizes are significantly more vulnerable to RowHammer than older chips. Using simulation, we show that existing RowHammer mitigation mechanisms 1) suffer from prohibitively large performance overheads at projected future hammer counts and 2) are still far from an *ideal* selective-refresh-based RowHammer mitigation mechanism. Based on our study, we motivate the need for a scalable and low-overhead solution to RowHammer and provide two promising research directions to this end. We hope that the results of our study will inspire and aid future work to develop efficient solutions for the RowHammer bit flip rates we are likely to see in DRAM chips in the near future.

## 8.2 Future Research Directions

Our four preliminary works, Solar-DRAM, the DRAM Latency PUF, D-RaNGe, and Revisiting RowHammer, demonstrate that novel observations via DRAM characterization can be exploited to improve latency, security, and reliability aspects of DRAM by developing mechanisms that exploit the DRAM characteristics when they are accessed with reduced DRAM timing parameters. To explore other methods for improving these aspects of DRAM, our proposal for future work is to 1) characterize various additional timing parameters and circuit-level aspects, and 2) propose mechanisms based on our prior observations.

### 8.2.1 Reducing DRAM Latency by Exploiting Different Timing Parameters

In Chapter 3, we have shown that we can reduce the DRAM timing parameter, tRCD, to substantially improve DRAM access latency. However, we believe that there are a number of other DRAM timing parameters (e.g., tRP, tWR, tRTP) that could also have substantial impact on overall system performance when reduced below manufacturer-specified values.

The key challenge in demonstrating a viable mechanism for reliably reducing DRAM timing parameters lies in rigorously characterizing DRAM to demonstrate exploitable trends for efficiently reducing DRAM timing parameters. While Chapter 3 discusses exploitable spatial distributions for reducing tRCD (i.e., failures constrained to local bitlines), we expect that each timing parameter, when reduced, will result in various exploitable spatial distributions. Using this knowledge gained from characterization, we can then safely and reliably reduce DRAM timing parameters for system performance improvement depending on the region of DRAM being accessed. To further exploit findings from this type of characterization, we believe the next step would be to determine the behavior of interactions when reducing multiple DRAM timing parameters simultaneously, such that we can further reduce DRAM access latencies reliably.

We identify two further challenges that must be overcome to enable this direction. First, it is critical to determine an *efficient* method for profiling DRAM chips (e.g., determining the set of reliable timing parameters). Due to the large search space in parameters (e.g., temperature, data pattern, timing parameter, timing parameter value), a comprehensive characterization for every chip is far too expensive. We believe that searching for correlations between failures of cells with different parameters could help to shorten the characterization phase in such a mechanism. For example, if we find that tRCD failures are correlated with tWR failures, then we can simply characterize one of the parameters and set the other value accordingly. Second, it is essential to develop a memory controller and scheme that determines when an access can be issued with reduced timing parameters and dynamically adjusts timing parameters to minimize DRAM access latency while maintaining data correctness during operation.
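One way to quantify such a correlation is sketched below, assuming that characterization yields, for each timing parameter, the set of cell indices that fail when that parameter is reduced. The phi coefficient (the Pearson correlation of two binary indicators) is our choice of metric for illustration, and the failure sets are synthetic. Neither comes from the measurements in this dissertation.

```python
import math


def phi_coefficient(fails_a: set, fails_b: set, num_cells: int) -> float:
    """Pearson correlation between two per-cell binary failure indicators."""
    n11 = len(fails_a & fails_b)         # cells failing under both parameters
    n10 = len(fails_a - fails_b)         # cells failing only under parameter A
    n01 = len(fails_b - fails_a)         # cells failing only under parameter B
    n00 = num_cells - n11 - n10 - n01    # cells failing under neither
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # correlation is undefined if either indicator is constant
    return (n11 * n00 - n10 * n01) / denom


# Synthetic failure sets over 1000 cells (placeholders, not measured data).
NUM_CELLS = 1000
trcd_fails = set(range(0, 100))
twr_fails_same = set(range(0, 100))        # identical failure locations
twr_fails_overlap = set(range(50, 150))    # half of the locations shared
twr_fails_disjoint = set(range(500, 600))  # no shared locations

for name, twr in [("same", twr_fails_same),
                  ("overlap", twr_fails_overlap),
                  ("disjoint", twr_fails_disjoint)]:
    print(name, round(phi_coefficient(trcd_fails, twr, NUM_CELLS), 3))
```

A coefficient close to 1 would suggest that profiling one parameter largely predicts the failures of the other. A value near or below zero would indicate that each parameter needs its own characterization.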

### 8.2.2 Improving Security Primitives for DRAM Chips

In Chapters 4 and 5, we introduced two security primitives, a Physical Unclonable Function (PUF) and a True Random Number Generator (TRNG), for commodity DRAM chips. We believe that there is significant room to improve the security and reliability guarantees of each of our proposed mechanisms: the DRAM Latency PUF and D-RaNGe.

First, as discussed in Chapter 4, a PUF with a larger challenge-response space can uniquely identify a larger set of devices and improve the authentication process to minimize the probability of misidentification. In the ideal case, the challenge-response space would grow exponentially with its input parameters. A PUF with an exponential challenge-response space is classified as a strong PUF, which has substantially more use cases than a weak PUF. By rigorous characterization of varying DRAM timing parameters, we can determine whether it is feasible to increase the challenge-response space simply by using DRAM timing parameters as an additional dimension to the input challenges. Ideally, reducing multiple DRAM timing parameters would result in uncorrelated failure locations such that the input space would grow exponentially with the number of additional DRAM timing parameters used. We believe that following an experimental methodology similar to the one described in Chapter 4 can demonstrate whether this direction is viable.
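As a rough illustration of this ideal growth, consider the following counting argument. It is our own idealized model rather than a measured result. Suppose a challenge consists of one of $S$ memory segments together with a setting for each of $k$ timing parameters, where each parameter is either left at its nominal value or set to one of $v$ reduced values. The symbols $S$, $k$, and $v$ are introduced here only for this example. If every distinct setting produced an uncorrelated response, the number of usable challenges would be

$$N_{\text{challenges}} = S \cdot \left( (v+1)^{k} - 1 \right),$$

where the subtracted term excludes the all-nominal setting, which induces no failures. For $v = 1$, this reduces to $S \cdot (2^{k} - 1)$, which grows exponentially in $k$. In practice, any correlation between the failure locations of different timing parameters would shrink the effective space below this bound, which is why the characterization described above is needed.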

Second, we believe it is important to demonstrate practical end-to-end implementations of both the DRAM Latency PUF and D-RaNGe for adoption in real devices. We identify various challenges for the DRAM Latency PUF and D-RaNGe. For a practical implementation of the DRAM Latency PUF, it is critical to investigate 1) methods for supporting dynamically changing temperatures at runtime such that a PUF response's variations due to ambient temperature do not prohibit correct authentication, and 2) a DRAM memory controller that can schedule PUF evaluation accesses with custom DRAM timing parameters and minimize its overhead on running workloads. For a practical implementation of the random number generator, D-RaNGe, we identify three directions that are critical to investigate. First, it is critical to demonstrate a process for quickly identifying RNG cells that work under varying conditions (e.g., temperature). Second, it is critical to investigate a smarter DRAM memory controller that can intelligently schedule D-RaNGe accesses with custom DRAM timing parameters for minimizing the overhead that would come with directly implementing D-RaNGe on any existing memory controller today. Third, it is important to further increase the throughput and decrease the latency. We believe that by combining the various existing DRAM-based TRNGs (discussed in Chapter 4.8), we can invoke the different TRNGs (or a combination of them) depending on the running workloads to maximize the utility of idle DRAM time.

### 8.2.3 RowHammer Mitigation Going Forward

DRAM manufacturers continue to adopt smaller technology nodes to improve DRAM storage density and are forecasted to reach 1z and 1a technology nodes within the next couple of years [326]. Unfortunately, our findings show that future DRAM chips will likely be increasingly vulnerable to RowHammer. This means that, to maintain market competitiveness without suffering factory yield loss, manufacturers will need to develop effective RowHammer mitigation solutions for coping with increasingly vulnerable DRAM chips.

RowHammer-mitigation solutions range from circuit-level mechanisms built into the DRAM chip itself to system-level mechanisms that are agnostic to the particular DRAM chip that the system uses. Of these solutions, our evaluations in Section 6.5.1 show that, while the ideal refresh-based RowHammer-mitigation mechanism that inserts the minimum possible number of additional refreshes to prevent RowHammer bit flips scales well to relatively low  $HC_{\text{first}}$  values (e.g., only 6% performance degradation when  $HC_{\text{first}}$  is 128), currently existing refresh-based mechanisms fall short of the ideal mechanism’s benefits (e.g., at least 28% performance degradation when  $HC_{\text{first}}$  is 128).

To achieve the scalability of the ideal refresh-based mechanism while maintaining an efficient implementation, we believe it is essential to explore all possible avenues for RowHammer mitigation. Going forward, we identify two promising research directions that can potentially lead to new RowHammer-mitigation solutions that meet or exceed the scalability of the ideal refresh-based mitigation mechanism: (1) DRAM-system cooperation and (2) a profile-guided mechanism. The remainder of this section briefly discusses our vision for each of these directions.

**DRAM-System Cooperation.** Considering either DRAM-based or system-level mechanisms alone ignores the potential benefits of addressing the RowHammer vulnerability from both perspectives together. While the root causes of RowHammer bit flips lie within DRAM, their negative effects are observed at the system-level. Prior work [244] stresses the importance of tackling these challenges at all levels of the stack, and we believe that a holistic solution can achieve a high degree of protection at relatively low cost compared to solutions contained within either domain alone.

**Profile-Guided Mechanisms.** The ability to accurately profile for RowHammer-susceptible DRAM cells or memory regions can provide a powerful substrate for building targeted RowHammer-mitigation solutions that efficiently mitigate RowHammer bit flips at low cost. Knowing (or effectively predicting) the locations of bit flips before they occur in practice could lead to a large reduction in RowHammer-mitigation overhead, providing new information that no known RowHammer-mitigation mechanism exploits today. For example, within the scope of known RowHammer-mitigation solutions, increasing the refresh rate can be made far cheaper by only increasing the refresh rate for known vulnerable DRAM rows. Similarly, ECC and DRAM access counters can be used only for cells that are known to be vulnerable, and even a software-based mechanism can be adapted to target only known vulnerable rows (e.g., by disabling them or remapping them to reliable memory).

Unfortunately, there exists no such effective RowHammer error profiling methodology today. Our characterization in this work essentially follows the naïve approach of individually testing each row by attempting to induce the worst-case testing conditions (e.g., *HC*, data pattern, ambient temperature, etc.). However, this approach is extremely time-consuming because each row must be tested individually (potentially multiple times with various testing conditions). Even for a relatively small DRAM module of 8GB with 8KB rows, hammering each row only once for only one refresh window of 64ms requires over 17 hours of continuous testing, which means that the naïve approach to profiling is infeasible for a general mechanism that may be used in a production environment or for online operation. We believe that developing a fast and effective RowHammer profiling mechanism is a key research challenge, and we hope that future work will use the observations made in this and other RowHammer characterization studies to find a solution.

## 8.3 Final Concluding Remarks

In this dissertation, we demonstrated that by rigorously understanding and exploiting DRAM device characteristics, we can significantly improve system performance and enhance system security and reliability. We have presented four characterization-based works that show that, by understanding per-chip error characteristics using a profiling mechanism, we can develop mechanisms that exploit the chip-dependent error profiles to improve system performance or enhance system security and reliability: 1) Solar-DRAM, which exploits our experimental characterization on latency variation within a chip to reduce latency with a profile that identifies regions that can be reliably accessed with lower latencies, 2) The DRAM Latency PUF, which exploits our observation from characterization that latency failures are unique to a DRAM chip due to process manufacturing variation and that a profile of these error characteristics can be used to generate reliable and unique identifiers, 3) D-RaNGe, which demonstrates via characterization that a profile of error characteristics can identify specific cells in DRAM that can be repeatedly accessed with reduced latency to result in random failures and can be used as an efficient true random number generator, a feature often used in security applications, and 4) Revisiting RowHammer, which demonstrates via characterization that the DRAM-based RowHammer vulnerability is getting worse as technology node size scales and that existing RowHammer mitigation mechanisms either do not scale or have prohibitively high overheads to mitigate RowHammer bit flips in future DRAM chips. We conclude and hope that the proposed characterization-based studies of DRAM chips and novel observations will pave the way for new research that can develop new mechanisms to improve system performance, energy efficiency, system security, or reliability of future memory systems.

## Other Works of the Author

Throughout the course of my Ph.D. study, I have worked on several different topics with many fellow graduate students from Carnegie Mellon University, ETH Zurich, and other institutions. In this chapter, I would like to acknowledge these works.

I have worked on a number of other projects on DRAM. In collaboration with Minesh Patel, we developed REAPER [260], a profiling mechanism for retention failures to maintain DRAM reliability while reducing the refresh overhead. We also characterized and studied DRAM chips with on-die ECC [259], demonstrating how on-die ECC affects the DRAM's properties, and we developed Error-correction Inference (EIN), a statistical inference methodology that infers pre-correction error rates of DRAM with on-die ECC. In a follow-up work called BEER [258], we proposed a new methodology for determining the full DRAM on-die ECC function (i.e., its parity-check matrix) without hardware support, hardware intrusion, or access to error syndromes and parity information. In collaboration with Hasan Hassan, we developed CROW [118], a flexible substrate that we use to lower DRAM activation latency to frequently accessed rows and reduce the overhead of DRAM refresh operations. In collaboration with Yaohua Wang, we developed CAL [349], a mechanism that predicts the next access time of a given row, and only partially restores charge to a row that will be accessed soon, and a substrate which can perform cache-block level data relocation within a DRAM bank at a distance-independent latency [348]. I have also contributed to a retrospective survey [241] on the DRAM-based security vulnerability, RowHammer, which was first rigorously analyzed in a paper [170] that I contributed to before my Ph.D. study. In collaboration with Lucian Cojocar, Stefan Saroiu, and Alec Wolman at Microsoft Research and others, we developed a methodology for testing the RowHammer vulnerability in server nodes and identified an instruction sequence that results in the highest rate of DRAM activations (i.e., hammers).
I have also contributed to Ambit [294], an accelerator for bulk bitwise operations in memory, and a positioning paper describing important domains of work toward the practical construction and widespread adoption of Processing-in-Memory (PIM) architectures.

I have also contributed to a positioning paper [95], which describes the challenges that remain for the widespread adoption of PIM.

I have also contributed to FLIN [325], a lightweight I/O request scheduling mechanism that provides fairness among requests from different applications, SysScale [111], a multi-domain power management technique that improves the energy efficiency of mobile SoCs, and FlexWatts [112], a hybrid adaptive Power Delivery Network (PDN) for modern high-end client processors whose goal is to maintain high energy-efficiency across the processor's wide spectrum of power consumption and workloads.

Another topic that I have developed an interest in and worked on is bioinformatics. I authored GRIM-Filter [161], a fast seed location filtering algorithm for DNA read mapping, and AirLift [162], a methodology for quickly and comprehensively mapping a set of reads from one reference to another reference. I collaborated with Damla Senol on a survey of Nanopore sequencing technologies [287] and on GenASM [286], a flexible approximate string matching acceleration framework. I also worked with Can Firtina on Apollo [82], a sequencing-technology-independent assembly polishing algorithm, and with Hongyi Xin on LEAP [363], an algorithm for sequence alignment.

# Bibliography

- [1] AMD Opteron 4300 Series Processors. <http://www.amd.com/en-us/products/server/4000/4300>.
- [2] DRAMPower Source Code. <https://github.com/tukl-msd/DRAMPower>.
- [3] Ramulator Source Code. <https://github.com/CMU-SAFARI/ramulator>.
- [4] RDMA Protocol Specification. <http://www.rdmaconsortium.org/>.
- [5] Standard Performance Evaluation Corporation. <http://www.spec.org/cpu2006>.
- [6] ECC Brings Reliability and Power Efficiency to Mobile Devices. Technical report, Micron Technology Inc., 2017.
- [7] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing. In *ISCA*, 2015.
- [8] Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. PIM-Enabled Instructions: a Low-overhead, Locality-aware Processing-in-Memory Architecture. In *ISCA*, 2015.
- [9] Barbara Aichinger. DDR Memory Errors Caused by Row Hammer. In *2015 IEEE High Performance Extreme Computing Conference (HPEC)*, pages 1–5. IEEE, 2015.
- [10] Takehiko Amaki, Masanori Hashimoto, and Takao Onoye. An Oscillator-based True Random Number Generator with Process and Temperature Tolerance. In *DAC*, 2015.
- [11] AMD. AMD Opteron 4300 Series Processors. 2012.
- [12] AMD. BKDG for AMD Family 16h Models 00h-0Fh Processors. 2013.
- [13] Apple Inc. About the Security Content of Mac EFI Security Update 2015-001. <https://support.apple.com/en-us/HT204934>, 2015.
- [14] ARM. ARM CoreLink DMC-520 Dynamic Memory Controller Technical Reference Manual. 2016.
- [15] Rachata Ausavarungnirun, Kevin Kai-Wei Chang, Lavanya Subramanian, Gabriel H Loh, and Onur Mutlu. Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems. In *ISCA*, 2012.

- [16] Zelalem Birhanu Aweke, Salessawi Ferede Yitbarek, Rui Qiao, Reetuparna Das, Matthew Hicks, Yossi Oren, and Todd Austin. ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks. In *ASPLOS*, 2016.
- [17] Aydin Aysu, Ye Wang, Patrick Schaumont, and Michael Orshansky. A New Maskless Debiasing Method for Lightweight Physical Unclonable Functions. In *HOST*, 2017.
- [18] Oreoluwatomiwa O Babarinsa and Stratos Idreos. JAFAR: Near-Data Processing for Databases. In *SIGMOD*, 2015.
- [19] Anys Bacha and Radu Teodorescu. Authenticache: Harnessing Cache ECC for System Authentication. In *MICRO*, 2015.
- [20] SangGeun Bae, Yongtae Kim, Yunsoo Park, and Chulwoo Kim. 3-Gb/s High-speed True Random Number Generator using Common-mode Operating Comparator and Sampling Uncertainty of D Flip-flop. In *IEEE Journal of Solid-State Circuits*, volume 52, pages 605–610, 2017.
- [21] Seungjae Baek, Sangyeun Cho, and Rami Melhem. Refresh Now and Then. In *IEEE Transactions on Computers*, volume 63, pages 3114–3126, 2014.
- [22] Vittorio Bagini and Marco Bucci. A Design of Reliable True Random Number Generator for Cryptographic Applications. In *CHES*, 1999.
- [23] Kuljit Bains, John Halbert, Christopher Mozak, Theodore Schoenborn, and Zvika Greenfield. Row Hammer Refresh Command, 2015. US Patent 9,117,544.
- [24] Kuljit S Bains and John B Halbert. Row Hammer Monitoring Based on Stored Row Hammer Threshold Value, May 12 2015. US Patent 9,032,141.
- [25] Kuljit S Bains and John B Halbert. Distributed Row Hammer Tracking, March 29 2016. US Patent 9,299,400.
- [26] Kuljit S Bains, John B Halbert, Christopher P Mozak, Theodore Z Schoenborn, and Zvika Greenfield. Row Hammer Refresh Command. US Patent 9,236,110, 2016.
- [27] Kuljit S Bains, John B Halbert, Suneeta Sah, and Zvika Greenfield. Method, Apparatus and System for Providing a Memory Refresh, May 12 2015. US Patent 9,030,903.
- [28] Kuljit S Bains, John B Halbert, Suneeta Sah, and Zvika Greenfield. Method, Apparatus and System for Providing a Memory Refresh, May 12 2015. US Patent 9,030,903.
- [29] Rajeev Balasubramonian. Innovations in the Memory System. *Synthesis Lectures on Computer Architecture*, 2019.
- [30] Mahmood Barangi, Joseph S Chang, and Pinaki Mazumder. Straintronics-Based True Random Number Generator for High-Speed and Energy-Limited Applications. In *IEEE Transactions on Magnetics*, volume 52, pages 1–9, 2016.
- [31] Karsten Beckmann, Harika Manem, and Nathaniel Cady. Performance Enhancement of a Time-Delay PUF Design by Utilizing Integrated Nanoscale ReRAM Devices. In *IEEE Transactions on Emerging Topics in Computing*, volume 5, pages 304–316, 2017.

- [32] Mudit Bhargava, Cagla Cakir, and Ken Mai. Reliability Enhancement of Bi-Stable PUFs in 65nm Bulk CMOS. In *HOST*, 2012.
- [33] Mudit Bhargava, Kaship Sheikh, and Ken Mai. Robust True Random Number Generator using Hot-carrier Injection Balanced Metastable Sense Amplifiers. In *HOST*, 2015.
- [34] Ishwar Bhati, Mu-Tien Chang, Zeshan Chishti, Shih-Lien Lu, and Bruce Jacob. DRAM Refresh Mechanisms, Penalties, and Trade-offs. In *IEEE Transactions on Computers*, volume 65, pages 108–121, 2016.
- [35] Ishwar Bhati, Zeshan Chishti, Shih-Lien Lu, and Bruce Jacob. Flexible Auto-Refresh: Enabling Scalable and Energy-Efficient DRAM Refresh Reductions. In *ISCA*, 2015.
- [36] Carsten Bock, Ferdinand Brasser, David Gens, Christopher Liebchen, and Ahmad Reza Sadeghi. RIP-RH: Preventing Rowhammer-Based Inter-Process Attacks. In *ASIACCS*, 2019.
- [37] Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu. Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks. In *ASPLOS*, 2018.
- [38] Amirali Boroumand, Saugata Ghose, Brandon Lucia, Kevin Hsieh, Krishna Malladi, Hongzhong Zheng, and Onur Mutlu. LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory. In *IEEE Computer Architecture Letters*, volume 16, pages 46–50, 2017.
- [39] Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Rachata Ausavarungnirun, Kevin Hsieh, Nastaran Hajinazar, Krishna T Malladi, Hongzhong Zheng, et al. CoNDA: Efficient Cache Coherence Support for Near-data Accelerators. In *ISCA*, 2019.
- [40] Roelof Cornelis Botha. *The Development of a Hardware Random Number Generator for Gamma-ray Astronomy*. PhD thesis, North-West University, 2005.
- [41] Ferdinand Brasser, Lucas Davi, David Gens, Christopher Liebchen, and Ahmad-Reza Sadeghi. Can't Touch This: Practical and Generic Software-only Defenses Against RowHammer Attacks. *USENIX Security*, 2017.
- [42] Ralf Brederlow, Ramesh Prakash, Christian Paulus, and Roland Thewes. A Low-power True Random Number Generator using Random Telegraph Noise of Single Oxide-traps. In *ISSCC*, 2006.
- [43] Lake Bu, Jaya Dofe, Qiaoyan Yu, and Michel A Kinsy. SRASA: a Generalized Theoretical Framework for Security and Reliability Analysis in Computing Systems. *Journal of Hardware and Systems Security*, 3(3):200–218, 2018.
- [44] Marco Bucci, Lucia Germani, Raimondo Luzzi, Alessandro Trifiletti, and Mario Varanonuovo. A High-speed Oscillator-based Truly Random Number Source for Cryptographic Applications on a Smart Card IC. In *IEEE Transactions on Computers*, volume 52, pages 403–409, 2003.

- [45] Yu Cai. Flash Memory SSD Errors, Mitigation, and Recovery. In *Proceedings of the IEEE*, 2017.
- [46] Yu Cai, Saugata Ghose, Erich F Haratsch, Yixin Luo, and Onur Mutlu. Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery. *Invited Book Chapter in Inside Solid State Drives*, 2018.
- [47] Yu Cai, Erich F Haratsch, Onur Mutlu, and Ken Mai. Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis. In *DATE*, 2012.
- [48] Sanguhn Cha, O Seongil, Hyunsung Shin, Sangjoon Hwang, Kwangil Park, Seong Jin Jang, Joo Sun Choi, Gyo Young Jin, Young Hoon Son, Hyunyoon Cho, et al. Defect Analysis and Cost-effective Resilience Architecture for Future DRAM Devices. In *HPCA*, 2017.
- [49] Anirban Chakraborty, Manaar Alam, and Debdeep Mukhopadhyay. Deep Learning Based Diagnostics for Rowhammer Protection of DRAM Chips. In *ATS*, 2019.
- [50] Supriya Chakraborty, Abhilash Garg, and Manan Suri. True Random Number Generation From Commodity NVM Chips. *IEEE Transactions on Electron Devices*, 67(3):888–894, 2020.
- [51] Jose Juan Mijares Chan, Bhanu Sharma, Jiaqing Lv, Gabriel Thomas, Ruppa Thulasiram, and Parimala Thulasiraman. True Random Number Generator using GPUs and Histogram Equalization Techniques. In *HPCC*, 2011.
- [52] Karthik Chandrasekar, Benny Akesson, and Kees Goossens. Improved Power Modeling of DDR SDRAMs. In *DSD*, 2011.
- [53] Karthik Chandrasekar, Sven Goossens, Christian Weis, Martijn Koedam, Benny Akesson, Norbert Wehn, and Kees Goossens. Exploiting Expendable Process-Margins in DRAMs for Run-Time Performance Optimization. In *DATE*, 2014.
- [54] Kevin K. Chang. *Understanding and Improving Latency of DRAM-Based Memory Systems*. PhD thesis, Carnegie Mellon University, 2017.
- [55] Kevin K Chang, Abhijith Kashyap, Hasan Hassan, Saugata Ghose, Kevin Hsieh, Donghyuk Lee, Tianshi Li, Gennady Pekhimenko, Samira Khan, and Onur Mutlu. Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization. In *SIGMETRICS*, 2016.
- [56] Kevin K Chang, Donghyuk Lee, Zeshan Chishti, Alaa R Alameldeen, Chris Wilkerson, Yoongu Kim, and Onur Mutlu. Improving DRAM Performance by Parallelizing Refreshes with Accesses. In *HPCA*, 2014.
- [57] Kevin K Chang, Prashant J Nair, Donghyuk Lee, Saugata Ghose, Moinuddin K Qureshi, and Onur Mutlu. Low-cost Inter-linked Subarrays (LISA): Enabling Fast Inter-subarray Data Movement in DRAM. In *HPCA*, 2016.
- [58] Kevin K Chang, Abdullah Yaglikci, Saugata Ghose, Aditya Agrawal, Niladrish Chatterjee, Abhijith Kashyap, Donghyuk Lee, Mike O'Connor, Hasan Hassan, and Onur Mutlu. Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms. In *SIGMETRICS*, 2017.

- [59] Wenjie Che, Jim Plusquellic, and Swarup Bhunia. A Non-Volatile Memory Based Physically Unclonable Function without Helper Data. In *ICCAD*, 2014.
- [60] Pai-Yu Chen, Runchen Fang, Rui Liu, Chaitali Chakrabarti, Yu Cao, and Shimeng Yu. Exploiting Resistive Cross-Point Array for Compact Design of Physical Unclonable Function. In *HOST*, 2015.
- [61] Abdelkarim Cherkaoui, Viktor Fischer, Laurent Fesquet, and Alain Aubert. A Very High Speed True Random Number Generator with Entropy Assessment. In *CHES*, 2013.
- [62] Jeonghwan Choi, Youngjae Kim, Anand Sivasubramaniam, Jelena Srebric, Qian Wang, and Joonwon Lee. Modeling and Managing Thermal Profiles of Rack-mounted Servers with Thermostat. In *HPCA*, 2007.
- [63] Pong P Chu and Robert E Jones. Design Techniques of FPGA Based Random Number Generator. In *MAPLD*, 1999.
- [64] Patrick J Clarke, Robert J Collins, Philip A Hiskett, Paul D Townsend, and Gerald S Buller. Robust Gigahertz Fiber Quantum Key Distribution. *Applied Physics Letters*, 2011.
- [65] Lucian Cojocar, Jeremie Kim, Minesh Patel, Lillian Tsai, Stefan Saroiu, Alec Wolman, and Onur Mutlu. Are We Susceptible to Rowhammer? An End-to-End Methodology for Cloud Providers. *SP*, 2020.
- [66] Lucian Cojocar, Kaveh Razavi, Cristiano Giuffrida, and Herbert Bos. Exploiting Correcting Codes: On the Effectiveness of ECC Memory Against RowHammer Attacks. In *SP*, 2019.
- [67] Mafalda Cortez, Said Hamdioui, Vincent van der Leest, Roel Maes, and Geert-Jan Schrijen. Adapting Voltage Ramp-up Time for Temperature Noise Reduction on Memory-based PUFs. In *HOST*, 2013.
- [68] Zehan Cui, Sally A McKee, Zhongbin Zha, Yungang Bao, and Mingyu Chen. DTail: A Flexible Approach to DRAM Refresh Management. In *SC*, 2014.
- [69] J L Danger, Sylvain Guillet, Philippe Nguyen, and Olivier Rioul. PUFs: Standardization and Evaluation. In *MST*, 2016.
- [70] Anup Das, Hasan Hassan, and Onur Mutlu. VRL-DRAM: Improving DRAM Performance via Variable Refresh Latency. *DAC*, 2018.
- [71] Jayita Das, Kevin Scott, Srinath Rajaram, Drew Burgett, and Sanjukta Bhanja. MRAM PUF: A Novel Geometry Based Magnetic PUF with Integrated CMOS. In *IEEE Transactions on Nanotechnology*, volume 14, pages 436–443, 2015.
- [72] Howard David, Chris Fallin, Eugene Gorbatov, Ulf R Hanebutte, and Onur Mutlu. Memory Power Management via Dynamic Voltage/Frequency Scaling. In *ICAC*, 2011.
- [73] Satyajit Desai. *Process Variation Aware DRAM (Dynamic Random Access Memory) Design Using Block-based Adaptive Body Biasing Algorithm*. PhD thesis, Utah State University, 2012.

- [74] Wei Ding, Diana Guttman, and Mahmut Kandemir. Compiler Support for Optimizing Memory Bank-level Parallelism. In *MICRO*, 2014.
- [75] Milos Drutarovsky and Pavol Galajda. A Robust Chaos-based True Random Number Generator Embedded in Reconfigurable Switched-Capacitor Hardware. In *Radioelektronika*, 2007.
- [76] Donald Eastlake and Paul Jones. US Secure Hash Algorithm 1 (SHA1). Technical report, 2001.
- [77] Charles Eckert, Fatemeh Tehranipoor, and John A Chandy. DRNG: DRAM-based Random Number Generation Using its Startup Value Behavior. In *MWSCAS*, 2017.
- [78] Nosayba El-Sayed, Ioan A Stefanovici, George Amvrosiadis, Andy A Hwang, and Bianca Schroeder. Temperature Management in Data Centers: Why Some (Might) Like it Hot. In *SIGMETRICS*, 2012.
- [79] Stijn Eyerman and Lieven Eeckhout. System-level Performance Metrics for Multiprogram Workloads. In *IEEE Micro*, 2008.
- [80] Gene Fabron. RAM Overclocking Guide: How (and Why) to Tweak Your Memory. <https://www.tomshardware.com/reviews/ram-overclocking-guide_4693.html>.
- [81] Amin Farmahini-Farahani, Jung Ho Ahn, Katherine Morrow, and Nam Sung Kim. NDA: Near-DRAM Acceleration Architecture Leveraging Commodity DRAM Devices and Standard Memory Modules. In *HPCA*, 2015.
- [82] Can Firtina, Jeremie S Kim, Mohammed Alser, Damla Senol Cali, A Ercument Cicek, Can Alkan, and Onur Mutlu. Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm. *Bioinformatics*, 36(12):3669–3679, 2020.
- [83] David Edward Fisch and William C Plants. DRAM Adjacent Row Disturb Mitigation, 2017. US Patent 9,812,185.
- [84] Viktor Fischer and Miloš Drutarovský. True Random Number Generator Embedded in Reconfigurable Hardware. In *CHES*, 2002.
- [85] Viktor Fischer, Miloš Drutarovský, Martin Šimka, and Nathalie Bochard. High Performance True Random Number Generator in Altera Stratix FPLDs. In *FPL*, 2004.
- [86] Troy Fridley and Omar Santos. Mitigations Available for the DRAM Row Hammer Vulnerability. <http://blogs.cisco.com/security/mitigations-available-for-the-dram-row-hammer-vulnerability>, 2015.
- [87] Pietro Frigo, Emanuele Vannacci, Hasan Hassan, Victor van der Veen, Onur Mutlu, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. TRRespass: Exploiting the Many Sides of Target Row Refresh. In *SP*, 2020.
- [88] Mingyu Gao, Grant Ayers, and Christos Kozyrakis. Practical Near-Data Processing for In-Memory Analytics Frameworks. In *PACT*, 2015.
- [89] Mingyu Gao and Christos Kozyrakis. HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing. In *HPCA*, 2016.

- [90] Yansong Gao, Damith C Ranasinghe, Said F Al-Sarawi, Omid Kavehei, and Derek Abbott. Memristive Crypto Primitive for Building Highly Secure Physical Unclonable Functions. In *Scientific Reports*, volume 5, 2015.
- [91] Blaise Gassend, Dwaine Clarke, Marten Van Dijk, and Srinivas Devadas. Silicon Physical Random Functions. In *CCS*, 2002.
- [92] Blaise LP Gassend. *Physical Random Functions*. PhD thesis, Massachusetts Institute of Technology, 2003.
- [93] Wei Ge, Shenxin Hu, Jiquan Huang, Bo Liu, and Min Zhu. FPGA Implementation of a Challenge Pre-processing Structure Arbiter PUF Designed for Machine Learning Attack Resistance. *IEICE Electronics Express*, 2019.
- [94] Mohsen Ghasempour, Mikel Lujan, and Jim Garside. ARMOR: A Run-Time Memory Hot-Row Detector, 2015.
- [95] Saugata Ghose, Amirali Boroumand, Jeremie S Kim, Juan Gómez-Luna, and Onur Mutlu. Processing-in-memory: A Workload-driven Perspective. *IBM J. Res. & Dev.*, 2019.
- [96] Saugata Ghose, Kevin Hsieh, Amirali Boroumand, Rachata Ausavarungnirun, and Onur Mutlu. Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions. *arXiv preprint arXiv:1802.00320*, 2018.
- [97] Saugata Ghose, Tianshi Li, Nastaran Hajinazar, Damla Senol Cali, and Onur Mutlu. Demystifying Complex Workload-DRAM Interactions: An Experimental Study. *SIGMETRICS*, 2019.
- [98] Saugata Ghose, Giray Yaglikci, Raghav Gupta, Donghyuk Lee, Kais Kudrolli, William Liu, Hasan Hassan, Kevin Chang, Niladrish Chatterjee, Aditya Agrawal, Mike O'Connor, and Onur Mutlu. What Your DRAM Power Models Are Not Telling You: Lessons from a Detailed Experimental Study. In *SIGMETRICS*, 2018.
- [99] Hector Gomez, Andres Amaya, and Elkim Roa. DRAM Row-hammer Attack Reduction using Dummy Cells. In *NORCAS*, 2016.
- [100] Seong-Lyong Gong, Jungrae Kim, Sangkug Lym, Michael Sullivan, Howard David, and Mattan Erez. Duo: Exposing on-chip Redundancy to Rank-level ECC for High Reliability. In *HPCA*, 2018.
- [101] Sudhakar Govindavajhala and Andrew W Appel. Using Memory Errors to Attack a Virtual Machine. In *SP*, 2003.
- [102] Zvika Greenfield, Kuljit S Bains, Theodore Z Schoenborn, Christopher P Mozak, and John B Halbert. Row Hammer Condition Monitoring. US Patent App. 13/539,417, January 2, 2014.
- [103] Zvika Greenfield, John B Halbert, and Kuljit S Bains. Method, Apparatus and System for Determining a Count of Accesses to a Row of Memory. US Patent App. 13/626,479, March 27 2014.

- [104] Daniel Gruss, Moritz Lipp, Michael Schwarz, Daniel Genkin, Jonas Juffinger, Sioli O’Connell, Wolfgang Schoechl, and Yuval Yarom. Another Flip in the Wall of RowHammer Defenses. In *SP*, 2018.
- [105] Daniel Gruss, Clémentine Maurice, and Stefan Mangard. Rowhammer.js: A Remote Software-Induced Fault Attack in Javascript. In *CoRR*, 2016.
- [106] Chongyan Gu and Maire O’Neill. Ultra-Compact and Robust FPGA-Based PUF Identification Generator. In *ISCAS*, 2015.
- [107] Jorge Guajardo, Sandeep S Kumar, Geert-Jan Schrijen, and Pim Tuyls. FPGA Intrinsic PUFs and Their Use for IP Protection. In *CHES*, 2007.
- [108] Jorge Guajardo, Sandeep S Kumar, Geert-Jan Schrijen, and Pim Tuyls. Physical Unclonable Functions and Public-key Crypto for FPGA IP Protection. In *FPL*, 2007.
- [109] Zvi Guterman, Benny Pinkas, and Tzachy Reinman. Analysis of the Linux Random Number Generator. In *SP*, 2006.
- [110] Tamas Gyorfi, Octavian Cret, and Alin Suciu. High Performance True Random Number Generator Based on FPGA Block RAMs. In *IPDPS*, 2009.
- [111] Jawad Haj-Yahya, Mohammed Alser, Jeremie Kim, A. Giray Yaglikci, Nandita Vijaykumar, Efraim Rotem, and Onur Mutlu. SysScale: Exploiting Multi-domain Dynamic Voltage and Frequency Scaling for Energy Efficient Mobile Processors. In *ISCA*, 2020.
- [112] Jawad Haj-Yahya, Mohammed Alser, Lois Orosa, Jeremie Kim, Efraim Rotem, Avi Mendelson, Anupam Chattopadhyay, and Onur Mutlu. A Power- and Workload-aware Hybrid Power Delivery Network for Energy-efficient High-end Client Processors. In *MICRO*, 2020.
- [113] Takeshi Hamamoto, Soichi Sugiura, and Shizuo Sawada. On the Retention Time Distribution of Dynamic Random Access Memory (DRAM). In *IEEE Transactions on Electron Devices*, volume 45, pages 1300–1309, 1998.
- [114] Mike Hamburg, Paul Kocher, and Mark E Marson. Analysis of Intel’s Ivy Bridge Digital Random Number Generator. <http://www.cryptography.com/public/pdf/Intel_TRNG_Report_20120312.pdf>, 2012.
- [115] Richard W Hamming. Error Detecting and Error Correcting Codes. In *Bell Labs Technical Journal*, 1950.
- [116] Ghaith Hammouri, Erdinç Öztürk, Berk Birand, and Berk Sunar. Unclonable Lightweight Authentication Scheme. In *ICICS*, 2008.
- [117] Maryam S Hashemian, Bhanu Singh, Francis Wolff, Daniel Weyer, Steve Clay, and Christos Papachristou. A Robust Authentication Methodology Using Physically Unclonable Functions in DRAM Arrays. In *DATE*, 2015.
- [118] Hasan Hassan, Minesh Patel, Jeremie S Kim, A Giray Yağlıkçı, Nandita Vijaykumar, Nika Mansouri Ghiasi, Saugata Ghose, and Onur Mutlu. CROW: A Low-Cost Substrate for Improving DRAM Performance, Energy Efficiency, and Reliability. In *ISCA*, 2019.

- [119] Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, and Onur Mutlu. ChargeCache: Reducing DRAM Latency by Exploiting Row Access Locality. In *HPCA*, 2016.
- [120] Hasan Hassan, Nandita Vijaykumar, Samira Khan, Saugata Ghose, Kevin Chang, Gennady Pekhimenko, Donghyuk Lee, Oguz Ergin, and Onur Mutlu. SoftMC: A Flexible and Practical Open-source Infrastructure for Enabling Experimental DRAM Studies. In *HPCA*, 2017.
- [121] Syed Minhaj Hassan, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. Near Data Processing: Impact and Optimization of 3D Memory System Architecture on the Uncore. In *MEMSYS*, 2015.
- [122] Hisashi Hata and Shuichi Ichikawa. FPGA Implementation of Metastability-based True Random Number Generator. *IEICE Transactions on Information and Systems*, 95(2):426–436, 2012.
- [123] Ryan Helinski, Dhruva Acharyya, and Jim Plusquellic. A Physical Unclonable Function Defined Using Power Distribution System Equivalent Resistance Variations. In *DAC*, 2009.
- [124] Hewlett-Packard Enterprise. HP Moonshot Component Pack Version 2015.05.0. <http://h17007.www1.hp.com/us/en/enterprise/servers/products/moonshot/component-pack/index.aspx>, 2015.
- [125] Hideto Hidaka, Yoshio Matsuda, Mikio Asakura, and Kazuyasu Fujishima. The Cache DRAM Architecture: A DRAM with an on-chip Cache Memory. *MICRO*, 1990.
- [126] Daniel E Holcomb, Wayne P Burleson, and Kevin Fu. Initial SRAM State as a Fingerprint and Source of True Random Numbers for RFID Tags. In *RFID*, 2007.
- [127] Daniel E Holcomb, Wayne P Burleson, and Kevin Fu. Power-Up SRAM State as an Identifying Fingerprint and Source of True Random Numbers. In *IEEE Transactions on Computers*, volume 58, pages 1198–1210, 2009.
- [128] Jeremy Holleman, Seth Bridges, Brian P Otis, and Chris Diorio. A 3 μW CMOS True Random Number Generator with Adaptive Floating-Gate Offset Cancellation. *IEEE Journal of Solid-State Circuits*, 43(5):1324–1336, 2008.
- [129] Yohei Hori, Takahiro Yoshida, Toshihiro Katashita, and Akashi Satoh. Quantitative and Statistical Performance Evaluation of Arbiter Physical Unclonable Functions on FPGAs. In *ReConfig*, 2010.
- [130] Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O’Connor, Nandita Vijaykumar, Onur Mutlu, and Stephen W Keckler. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems. In *ISCA*, 2016.
- [131] Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K Chang, Amirali Boroumand, Saugata Ghose, and Onur Mutlu. Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation. In *ICCD*, 2016.

- [132] SK Hynix. DDR4 SDRAM Device Operation.
- [133] Intel. Intel Architecture Software Developer’s Manual, 2018.
- [134] Intel Corporation. CannonLake Intel Firmware Support Package (FSP) Integration Guide. <https://usermanual.wiki/Pdf/CannonLakeFSPIntegrationGuide.58784693.pdf>, 2017.
- [135] Gorka Irazoqui, Thomas Eisenbarth, and Berk Sunar. MASCAT: Stopping Microarchitectural Attacks Before Execution. *IACR Cryptology ePrint Archive*, 2016.
- [136] Anirudh Iyengar, Kenneth Ramclam, and Swaroop Ghosh. DWM-PUF: A Low-Overhead, Memory-Based Security Primitive. In *HOST*, 2014.
- [137] Paul Jaccard. Étude Comparative de la Distribution Florale dans une Portion des Alpes et des Jura. In *Bull Soc Vaudoise Sci Nat*, 1901.
- [138] JEDEC. Double Data Rate 3 (DDR3) SDRAM Specification. 2012.
- [139] JEDEC. Double Data Rate 4 (DDR4) SDRAM Standard. 2012.
- [140] JEDEC. Low Power Double Data Rate 3 (LPDDR3). 2012.
- [141] JEDEC. Low Power Double Data Rate 4 (LPDDR4) SDRAM Specification. 2014.
- [142] JEDEC. Annex L: Serial Presence Detect (SPD) for DDR4 SDRAM Modules. 2015.
- [143] JEDEC Solid State Technology Association. Failure Mechanisms and Models for Semiconductor Devices. *JEDEC Publication JEP122G*, 2011.
- [144] Min Kyu Jeong, Doe Hyun Yoon, Dam Sunwoo, Mike Sullivan, Ikhwan Lee, and Mattan Erez. Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems. In *HPCA*, 2012.
- [145] Sangwoo Ji, Youngjoo Ko, Saeyoung Oh, and Jong Kim. Pinpoint Rowhammer: Suppressing Unwanted Bit Flips on Rowhammer Attacks. In *ASIACCS*, 2019.
- [146] Benjamin Jun and Paul Kocher. The Intel Random Number Generator. *Cryptography Research Inc. white paper*, 1999.
- [147] Matthias Jung, Carl C Rheinländer, Christian Weis, and Norbert Wehn. Reverse Engineering of DRAMs: Row Hammer with Crosshair. In *MEMSYS*, 2016.
- [148] Marcin Kaczmarski. Thoughts on Intel Xeon E5-2600 v2 Product Family Performance Optimisation: Component Selection Guidelines, 2014.
- [149] Kamal Y Kamal and Radu Muresan. Mixed-signal Physically Unclonable Function with CMOS Capacitive Cells. *IEEE Access*, 7:130977–130998, 2019.
- [150] Ingab Kang, Eojin Lee, and Jung Ho Ahn. CAT-TWO: Counter-Based Adaptive Tree, Time Window Optimized for DRAM Row-Hammer Prevention. *IEEE Access*, 8:17366–17377, 2020.

- [151] Uksong Kang, Hak-soo Yu, Churoo Park, Hongzhong Zheng, John Halbert, Kuljit Bains, S Jang, and Joo Sun Choi. Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling. In *The Memory Forum*, 2014.
- [152] Nima Karimian and Fatemeh Tehranipoor. How to Generate Robust Keys from Noisy DRAMs? In *GLSVLSI*, 2019.
- [153] Stefan Katzenbeisser, Ünal Kocabas, Vladimir Rožić, Ahmad-Reza Sadeghi, Ingrid Verbauwhede, and Christian Wachsmann. PUFs: Myth, Fact or Busted? A Security Evaluation of Physically Unclonable Functions (PUFs) Cast in Silicon. In *CHES*, 2012.
- [154] Christoph Keller, Frank Gurkaynak, Hubert Kaeslin, and Norbert Felber. Dynamic Memory-based Physically Unclonable Function for the Generation of Unique Identifiers and True Random Numbers. In *ISCAS*, 2014.
- [155] Samira Khan, Donghyuk Lee, Yoongu Kim, Alaa R Alameldeen, Chris Wilkerson, and Onur Mutlu. The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study. In *SIGMETRICS*, 2014.
- [156] Samira Khan, Donghyuk Lee, and Onur Mutlu. PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM. In *DSN*, 2016.
- [157] Samira Khan, Chris Wilkerson, Donghyuk Lee, Alaa R Alameldeen, and Onur Mutlu. A Case for Memory Content-Based Detection and Mitigation of Data-Dependent Failures in DRAM. In *IEEE Computer Architecture Letters*, volume 16, pages 88–93, 2016.
- [158] Samira Khan, Chris Wilkerson, Zhe Wang, Alaa R Alameldeen, Donghyuk Lee, and Onur Mutlu. Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content. In *MICRO*, 2017.
- [159] Dae-Hyun Kim, Prashant J Nair, and Moinuddin K Qureshi. Architectural Support for Mitigating Row Hammering in DRAM Memories. *IEEE Computer Architecture Letters*, 14(1):9–12, 2014.
- [160] Jeeson Kim, Taimur Ahmed, Hussein Nili, Nhan Duy Truong, Jiawei Yang, Doo Seok Jeong, Sharath Sriram, Damith C Ranasinghe, and Omid Kavehei. Nano-Intrinsic True Random Number Generation. *arXiv preprint arXiv:1701.06020*, 2017.
- [161] Jeremie S Kim, Damla Senol Cali, Hongyi Xin, Donghyuk Lee, Saugata Ghose, Mohammed Alser, Hasan Hassan, Oguz Ergin, Can Alkan, and Onur Mutlu. GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-memory Technologies. *BMC Genomics*, 19(2):23–40, 2018.
- [162] Jeremie S Kim, Can Firtina, Damla Senol Cali, Mohammed Alser, Nastaran Hajinazar, Can Alkan, and Onur Mutlu. AirLift: A Fast and Comprehensive Technique for Translating Alignments between Reference Genomes. *arXiv preprint arXiv:1912.08735*, 2019.
- [163] Jeremie S Kim, Minesh Patel, Hasan Hassan, and Onur Mutlu. Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines. In *ICCD*, 2018.

- [164] Jeremie S. Kim, Minesh Patel, Hasan Hassan, and Onur Mutlu. The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices. In *HPCA*, 2018.
- [165] Jeremie S Kim, Minesh Patel, Hasan Hassan, Lois Orosa, and Onur Mutlu. D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput. In *HPCA*, 2019.
- [166] Jeremie S. Kim, Minesh Patel, Abdullah G. Yaglikci, Hasan Hassan, Roknoddin Azizi, Lois Orosa, and Onur Mutlu. Revisiting RowHammer: An Experimental Analysis of Modern DRAM Devices and Mitigation Techniques. In *ISCA*, 2020.
- [167] Jungrae Kim, Michael Sullivan, and Mattan Erez. Bamboo ECC: Strong, Safe, and Flexible Codes for Reliable Computer Memory. In *HPCA*, 2015.
- [168] Kinam Kim and Jooyoung Lee. A New Investigation of Data Retention Time in Truly Nanoscaled DRAMs. In *IEEE Electron Device Letters*, volume 30, pages 846–848, 2009.
- [169] Moonsoo Kim, Jungwoo Choi, Hyun Kim, and Hyuk-Jae Lee. An Effective DRAM Address Remapping for Mitigating Rowhammer Errors. *IEEE Transactions on Computers*, 68(10):1428–1441, 2019.
- [170] Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu. Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors. In *ISCA*, 2014.
- [171] Yoongu Kim, Dongsu Han, Onur Mutlu, and Mor Harchol-Balter. ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers. In *HPCA*, 2010.
- [172] Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In *MICRO*, 2010.
- [173] Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu. A Case for Exploiting Subarray-level Parallelism (SALP) in DRAM. In *ISCA*, 2012.
- [174] Yoongu Kim, Weikun Yang, and Onur Mutlu. Ramulator: A Fast and Extensible DRAM Simulator. In *IEEE Computer Architecture Letters*, volume 15, pages 45–49, 2016.
- [175] DJ Kinniment and EG Chester. Design of an On-chip Random Number Generator using Metastability. In *ESSCIRC*, 2002.
- [176] Çetin Kaya Koç. About Cryptographic Engineering. In *Cryptographic Engineering*. 2009.
- [177] Patrick Koeberl, Ünal Kocabas, and Ahmad-Reza Sadeghi. Memristor PUFs: A New Generation of Memory-based Physically Unclonable Functions. In *DATE*, 2013.

- [178] Radhesh Krishnan Konoth, Marco Oliverio, Andrei Tatar, Dennis Andriesse, Herbert Bos, Cristiano Giuffrida, and Kaveh Razavi. ZebRAM: Comprehensive and Compatible Software Protection Against Rowhammer Attacks. In *OSDI*, 2018.
- [179] Sandeep S Kumar, Jorge Guajardo, Roel Maes, Geert-Jan Schrijen, and Pim Tuyls. The Butterfly PUF Protecting IP on Every FPGA. In *HOST*, 2008.
- [180] Nohhyup Kwak, Saeng-Hwan Kim, Kyong Ha Lee, Chang-Ki Baek, Mun Seon Jang, Yongsuk Joo, Seung-Hun Lee, Woo Young Lee, Eunryeong Lee, Donghee Han, et al. A 4.8 Gb/s/pin 2Gb LPDDR4 SDRAM with Sub-100μA Self-Refresh Current for IoT Applications. In *ISSCC*, 2017.
- [181] Sammy HM Kwok and Edmund Y Lam. FPGA-based High-speed True Random Number Generator for Cryptographic Applications. In *TENCON*, 2006.
- [182] SiewHwee Kwok, YenLing Ee, Guanhan Chew, Kanghong Zheng, Khoongming Khoo, and ChikHow Tan. A Comparison of Post-processing Techniques for Biased Random Number Generators. In *WISTP*, 2011.
- [183] Hye-Jung Kwon, Eunsung Seo, Chan-Yong Lee, Young-Hun Seo, Gong-Heum Han, Hye-Ran Kim, Jong-Ho Lee, Min-Su Jang, Sung-Geun Do, Seung-Hyun Cho, et al. An Extremely Low-Standby-Power 3.733 Gb/s/pin 2Gb LPDDR4 SDRAM for Wearable Devices. In *ISSCC*, 2017.
- [184] Andrew Kwong, Daniel Genkin, Daniel Gruss, and Yuval Yarom. RAMBleed: Reading Bits in Memory Without Accessing Them. In *SP*, 2020.
- [185] Quintessence Labs. Random Number Generators White Paper, 2015.
- [186] Chang Joo Lee, Veynu Narasiman, Eiman Ebrahimi, Onur Mutlu, and Yale N Patt. DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems. *HPS Technical Report*, 2010.
- [187] Chang Joo Lee, Veynu Narasiman, Onur Mutlu, and Yale N Patt. Improving Memory Bank-level Parallelism in the Presence of Prefetching. In *MICRO*, 2009.
- [188] Donghyuk Lee. *Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity*. PhD thesis, Carnegie Mellon University, 2016.
- [189] Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Samira Khan, and Onur Mutlu. Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost. In *ACM Transactions on Architecture and Code Optimization (TACO)*, volume 12, pages 1–29, 2016.
- [190] Donghyuk Lee, Samira Khan, Lavanya Subramanian, Saugata Ghose, Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri, and Onur Mutlu. Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms. In *SIGMETRICS*, 2017.
- [191] Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu. Adaptive-latency DRAM: Optimizing DRAM Timing for the Common-case. In *HPCA*, 2015.

- [192] Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur Mutlu. Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture. In *HPCA*, 2013.
- [193] Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jongmoo Choi, and Onur Mutlu. Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM. In *PACT*, 2015.
- [194] Eojin Lee, Ingab Kang, Sukhan Lee, G Edward Suh, and Jung Ho Ahn. TWiCe: Preventing Row-Hammering by Exploiting Time Window Counters. In *ISCA*, 2019.
- [195] Jae W Lee, Daihyun Lim, Blaise Gassend, G Edward Suh, Marten Van Dijk, and Srinivas Devadas. A Technique to Build a Secret Key in Integrated Circuits for Identification and Authentication Applications. In *Symposium on VLSIC*, 2004.
- [196] Myoung Jin Lee and Kun Woo Park. A Mechanism for Dependence of Refresh Time on Data Pattern in DRAM. In *IEEE Electron Device Letters*, volume 31, pages 168–170, 2010.
- [197] Lenovo. Row Hammer Privilege Escalation. <https://support.lenovo.com/us/en/product_security/row_hammer>, 2015.
- [198] Congmiao Li and Jean-Luc Gaudiot. Detecting Malicious Attacks Exploiting Hardware Vulnerabilities Using Performance Counters. In *COMPSAC*, 2019.
- [199] Yan Li, Helmut Schneider, Florian Schnabel, Roland Thewes, and Doris Schmitt-Landsiedel. DRAM Yield Analysis and Optimization by a Statistical Design Approach. In *CSI*, 2011.
- [200] Daihyun Lim, Jae W Lee, Blaise Gassend, G Edward Suh, Marten Van Dijk, and Srinivas Devadas. Extracting Secret Keys from Integrated Circuits. *VLSI*, 2005.
- [201] Chung Hsiang Lin, De-Yu Shen, Yi-Jung Chen, Chia-Lin Yang, and Michael Wang. SECRET: Selective Error Correction for Refresh Energy Reduction in DRAMs. In *ICCD*, 2012.
- [202] Jiang Lin and Matthew Garrett. Handling Maximum Activation Count Limit and Target Row Refresh in DDR4 SDRAM, 2017. US Patent 9,589,606.
- [203] Jie Lin, Wei Yu, Nan Zhang, Xinyu Yang, Hanlin Zhang, and Wei Zhao. A Survey on Internet of Things: Architecture, Enabling Technologies, Security and Privacy, and Applications. In *IoT*, 2017.
- [204] Moritz Lipp, Misiker Tadesse Aga, Michael Schwarz, Daniel Gruss, Clémentine Maurice, Lukas Raab, and Lukas Lamster. Nethammer: Inducing RowHammer Faults Through Network Requests. *arXiv preprint arXiv:1805.04956*, 2018.
- [205] Chao Qun Liu, Yuan Cao, and Chip Hong Chang. ACRO-PUF: A Low-power, Reliable and Aging-Resilient Current Starved Inverter-Based Ring Oscillator Physical Unclonable Function. In *TCS*, volume 64, pages 3138–3149, 2017.

- [206] Jamie Liu, Ben Jaiyen, Yoongu Kim, Chris Wilkerson, and Onur Mutlu. An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms. In *ISCA*, 2013.
- [207] Jamie Liu, Ben Jaiyen, Richard Veras, and Onur Mutlu. RAIDR: Retention-Aware Intelligent DRAM Refresh. In *ISCA*, 2012.
- [208] Rui Liu, Huaqiang Wu, Yachuan Pang, He Qian, and Shimeng Yu. Experimental Characterization of Physical Unclonable Function Based on 1 KB Resistive Random Access Memory Arrays. In *IEEE Electron Device Letters*, volume 36, pages 1380–1383, 2015.
- [209] Song Liu, Brian Leung, Alexander Neckar, Seda Ogrenci Memik, Gokhan Memik, and Nikos Hardavellas. Hardware/Software Techniques for DRAM Thermal Management. In *HPCA*, 2011.
- [210] Wenchao Liu, Zhenhua Zhang, Miaoxin Li, and Zhenglin Liu. A Trustworthy Key Generation Prototype Based on DDR3 PUF for Wireless Sensor Networks. In *Sensors*, volume 14, pages 11542–11556, 2014.
- [211] Zhiyu Liu, Irina Calciu, Maurice Herlihy, and Onur Mutlu. Concurrent Data Structures for Near-Memory Computing. In *SPAA*, 2017.
- [212] Keith Lofstrom, W Robert Daasch, and Donald Taylor. IC Identification Circuit Using Device Mismatch. In *ISSCC*, 2000.
- [213] Shih-Lien Lu, Ying-Chen Lin, and Chia-Lin Yang. Improving DRAM Latency with Dynamic Asymmetric Subarray. In *MICRO*, 2015.
- [214] XiaoMing Lu, LiJun Zhang, YongGang Wang, Wei Chen, DaJun Huang, Deng Li, Shuang Wang, DeYong He, ZhenQiang Yin, Yu Zhou, et al. FPGA Based Digital Phase-coding Quantum Key Distribution System. *Science China Physics, Mechanics & Astronomy*, 2015.
- [215] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. In *PLDI*, 2005.
- [216] Haocong Luo, Taha Shahroodi, Hasan Hassan, Minesh Patel, Abdullah Giray Yaglikci, Lois Orosa, Jisung Park, and Onur Mutlu. CLR-DRAM: A Low-Cost DRAM Architecture Enabling Dynamic Capacity-Latency Trade-Off. *ISCA*, 2020.
- [217] Yixin Luo, Saugata Ghose, Yu Cai, Erich F Haratsch, and Onur Mutlu. Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory. In *IEEE Journal on Selected Areas in Communications*, volume 34, pages 2294–2311, 2016.
- [218] Yixin Luo, Saugata Ghose, Yu Cai, Erich F Haratsch, and Onur Mutlu. HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-recovery and Temperature Awareness. In *HPCA*, 2018.

- [219] Yixin Luo, Saugata Ghose, Yu Cai, Erich F Haratsch, and Onur Mutlu. Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation. *SIGMETRICS*, 2018.
- [220] Xiongfeng Ma, Xiao Yuan, Zhu Cao, Bing Qi, and Zhen Zhang. Quantum Random Number Generation. *Quantum Inf.*, 2016.
- [221] Roel Maes, Pim Tuyls, and Ingrid Verbauwhede. Intrinsic PUFs from Flip-flops on Reconfigurable Devices. In *WISSec*, 2008.
- [222] Roel Maes and Ingrid Verbauwhede. Physically Unclonable Functions: A Study on the State of the Art and Future Research Directions. In *Towards Hardware-Intrinsic Security*. 2010.
- [223] Abhranil Maiti, Logan McDougall, and Patrick Schaumont. The Impact of Aging on an FPGA-Based Physical Unclonable Function. In *FPL*, 2011.
- [224] Mehrdad Majzoobi, Farinaz Koushanfar, and Srinivas Devadas. FPGA PUF Using Programmable Delay Lines. In *WIFS*, 2010.
- [225] Mehrdad Majzoobi, Farinaz Koushanfar, and Srinivas Devadas. FPGA-based True Random Number Generation using Circuit Metastability with Adaptive Feedback Control. In *CHES*, 2011.
- [226] Mehrdad Majzoobi, Farinaz Koushanfar, and Miodrag Potkonjak. Testing Techniques for Hardware Security. In *ITC*, 2008.
- [227] George Marsaglia. The Marsaglia Random Number CDROM Including the Diehard Battery of Tests of Randomness. 2008. <http://www.stat.fsu.edu/pub/diehard/>.
- [228] Kinga Marton and Alin Suciu. On the Interpretation of Results from the NIST Statistical Test Suite. *Science and Technology*, 2015.
- [229] Sanu K Mathew, Suresh Srinivasan, Mark A Anders, Himanshu Kaul, Steven K Hsu, Farhana Sheikh, Amit Agarwal, Sudhir Satpathy, and Ram K Krishnamurthy. 2.4 Gbps, 7 mW All-digital PVT-variation Tolerant True Random Number Generator for 45 nm CMOS High-performance Microprocessors. In *IEEE Journal of Solid-State Circuits*, volume 47, pages 2807–2821, 2012.
- [230] Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu. Revisiting Memory Errors in Large-scale Production Data Centers: Analysis and Modeling of New Trends from the Field. In *DSN*, 2015.
- [231] Rino Micheloni, Peter Z Onufryk, Alessia Marelli, Christopher IW Norrie, and Ihab Jaser. Apparatus and Method Based on LDPC Codes for Adjusting a Correctable Raw Bit Error Rate Limit in a Memory System, 2015. US Patent 9,092,353.
- [232] Amir Morad, Leonid Yavits, and Ran Ginosar. GP-SIMD Processing-in-Memory. In *ACM Transactions on Architecture and Code Optimization (TACO)*, volume 11, pages 1–26, 2015.
- [233] Yuki Mori, Kiyonori Ohyu, Kensuke Okonogi, and R-I Yamada. The Origin of Variable Retention Time in DRAM. In *IEDM*, 2005.

- [234] Thomas Moscibroda and Onur Mutlu. Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems. In *USENIX Security*, 2007.
- [235] Sven Müelich, Sebastian Bitzer, Chirag Sudarshan, Christian Weis, Norbert Wehn, Martin Bossert, and Robert FH Fischer. Channel Models for Physical Unclonable Functions based on DRAM Retention Measurements. In *REDUNDANCY*, 2019.
- [236] Janani Mukundan, Hillery Hunter, Kyu-hyoun Kim, Jeffrey Stuecheli, and José F Martínez. Understanding and Mitigating Refresh Overheads in High-density DDR4 DRAM Systems. In *ISCA*, 2013.
- [237] Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda. Reducing Memory Interference in Multicore Systems via Application-aware Memory Channel Partitioning. In *MICRO*, 2011.
- [238] Onur Mutlu. Memory Scaling: A Systems Architecture Perspective. In *IMW*, 2013.
- [239] Onur Mutlu. The RowHammer Problem and Other Issues we may Face as Memory Becomes Denser. In *DATE*, 2017.
- [240] Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun. Processing Data Where it Makes Sense: Enabling In-memory Computation. *MicPro*, 2019.
- [241] Onur Mutlu and Jeremie S Kim. RowHammer: A Retrospective. *TCAD*, 2019.
- [242] Onur Mutlu and Thomas Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In *MICRO*, 2007.
- [243] Onur Mutlu and Thomas Moscibroda. Parallelism-aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In *ISCA*, 2008.
- [244] Onur Mutlu and Lavanya Subramanian. Research Problems and Opportunities in Memory Systems. In *SUPERFRI*, 2014.
- [245] Prashant J Nair, Dae-Hyun Kim, and Moinuddin K Qureshi. ArchShield: Architectural Framework for Assisting DRAM Scaling by Tolerating High Error Rates. In *ISCA*, 2013.
- [246] Prashant J Nair, Vilas Sridharan, and Moinuddin K Qureshi. XED: Exposing On-Die Error Detection Information for Strong Memory Reliability. In *ISCA*, 2016.
- [247] Liao Ning, Jiang Ding, Bai Chuang, and Zou Xuecheng. Design and Validation of High Speed True Random Number Generators Based on Prime-length Ring Oscillators. *The Journal of China Universities of Posts and Telecommunications*, 22(4):1–6, 2015.
- [248] Tae-Young Oh, Hoeju Chung, Jun-Young Park, Ki-Won Lee, Seunghoon Oh, Su-Yeon Doo, Hyoung-Joo Kim, ChangYong Lee, Hye-Ran Kim, Jong-Ho Lee, et al. A 3.2Gbps/pin 8Gb 1.0V LPDDR4 SDRAM with Integrated ECC Engine for sub-1V DRAM Core Operation. In *ISSCC*, 2014.
- [249] Omron. NY-series Industrial Box PC - Hardware User's Manual. <https://assets.omron.eu/downloads/manual/en/v6/w553_ny-series_industrial_box_pc_users_manual_en.pdf>, 2019.

- [250] Lois Orosa, Yaohua Wang, Ivan Puddu, Mohammad Sadrosadati, Kaveh Razavi, Juan Gómez-Luna, Hasan Hassan, Nika Mansouri-Ghiasi, Arash Tavakkol, Minesh Patel, Jeremie Kim, et al. Dataplant: Enhancing System Security with Low-cost in-DRAM Value Generation Primitives. *arXiv preprint arXiv:1902.07344*, 2019.
- [251] Erdinç Öztürk, Ghaith Hammouri, and Berk Sunar. Physical Unclonable Function with Tristate Buffers. In *ISCAS*, 2008.
- [252] Erdinç Öztürk, Ghaith Hammouri, and Berk Sunar. Towards Robust Low Cost Authentication for Pervasive Devices. In *PerCom*, 2008.
- [253] Vijay S Pai and Sarita Adve. Code Transformations to Improve Memory Parallelism. In *MICRO*, 1999.
- [254] Yachuan Pang, Huaqiang Wu, Bin Gao, Ning Deng, Dong Wu, Rui Liu, Shimeng Yu, An Chen, and He Qian. Optimization of RRAM-based Physical Unclonable Function with a Novel Differential Read-Out Method. In *IEEE Electron Device Letters*, volume 38, pages 168–171, 2017.
- [255] Fabio Pareschi, Gianluca Setti, and Riccardo Rovatti. A Fast Chaos-based True Random Number Generator for Cryptographic Applications. In *ESSCIRC*, 2006.
- [256] Kyungbae Park, Chulseung Lim, Donghyuk Yun, and Sanghyeon Baeg. Experiments and Root Cause Analysis for Active-Precharge Hammering Fault in DDR3 SDRAM under 3× nm Technology. *Microelectronics Reliability*, 57:39–46, 2016.
- [257] Kyungbae Park, Donghyuk Yun, and Sanghyeon Baeg. Statistical Distributions of Row-hammering Induced Failures in DDR3 Components. *Microelectronics Reliability*, 67:143–149, 2016.
- [258] Minesh Patel, Jeremie Kim, Taha Shahroodi, Hasan Hassan, and Onur Mutlu. Bit-Exact ECC Recovery (BEER): Determining DRAM On-Die ECC Functions by Exploiting DRAM Data Retention Characteristics. *MICRO*, 2020.
- [259] Minesh Patel, Jeremie S Kim, Hasan Hassan, and Onur Mutlu. Understanding and Modeling On-Die Error Correction in Modern DRAM: An Experimental Study Using Real Devices. In *DSN*, 2019.
- [260] Minesh Patel, Jeremie S Kim, and Onur Mutlu. The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions. In *ISCA*, 2017.
- [261] Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K Mishra, Mahmut T Kandemir, Onur Mutlu, and Chita R Das. Scheduling Techniques for GPU Architectures with Processing-in-Memory Capabilities. In *PACT*, 2016.
- [262] Craig S Petrie and J Alvin Connelly. A Noise-based IC Random Number Generator for Applications in Cryptography. In *IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications*, volume 47, pages 615–621, 2000.
- [263] Ponemon Institute LLC. Study on Mobile and IoT Application Security, 2017.

- [264] Changwoo Pyo, Sungil Pae, and Gyungho Lee. DRAM as Source of Randomness. In *Electronics Letters*, volume 45, pages 26–27, 2009.
- [265] Rui Qiao and Mark Seaborn. A New Approach for RowHammer Attacks. In *HOST*, 2016.
- [266] Moinuddin K Qureshi, DaeHyun Kim, Samira Khan, Prashant J Nair, and Onur Mutlu. AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems. In *DSN*, 2015.
- [267] Md Tauhidur Rahman, Kan Xiao, Domenic Forte, Xuhei Zhang, Jerry Shi, and Mohammad Tehranipoor. TI-TRNG: Technology Independent True Random Number Generator. In *DAC*, 2014.
- [268] Amir Rahmati, Matthew Hicks, Daniel E Holcomb, and Kevin Fu. Probable Cause: The Deanonymizing Effects of Approximate DRAM. In *ISCA*, 2016.
- [269] Biswajit Ray and Aleksandar Milenković. True Random Number Generation Using Read Noise of Flash Memory Cells. In *IEEE Transactions on Electron Devices*, volume 65, pages 963–969, 2018.
- [270] Kaveh Razavi, Ben Gras, Erik Bosman, Bart Preneel, Cristiano Giuffrida, and Herbert Bos. Flip Feng Shui: Hammering a Needle in the Software Stack. In *USENIX Security*, 2016.
- [271] Phillip J Restle, JW Park, and Brian F Lloyd. DRAM Variable Retention Time. In *IEDM*, pages 807–810, 1992.
- [272] Ronald Rivest. The MD5 Message-Digest Algorithm. In *RFC*, 1992.
- [273] Scott Rixner. Memory Controller Optimizations For Web Servers. In *MICRO*, 2004.
- [274] Scott Rixner, William J Dally, Ujval J Kapasi, Peter Mattson, and John D Owens. Memory Access Scheduling. In *ISCA*, 2000.
- [275] Scott Rixner, John D Owens, Peter Mattson, Ujval J Kapasi, and William J Dally. Memory Access Scheduling. In *ISCA*, 2000.
- [276] Andrea Röck. *Pseudorandom Number Generators for Cryptographic Applications*. 2005.
- [277] Garrett S Rose, Nathan McDonald, Lok-Kwong Yan, Bryant Wysocki, and Karen Xu. Foundations of Memristor Based PUF Architectures. In *NANOARCH*, 2013.
- [278] Ulrich Rührmair, Jan Sölter, and Frank Sehnke. On the Foundations of Physical Unclonable Functions. In *IACR Cryptology ePrint Archive*, 2009.
- [279] Andrew Rukhin, Juan Soto, James Nechvatal, Miles Smid, and Elaine Barker. A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications. Technical report, Booz-Allen and Hamilton Inc., McLean, VA, 2001.

- [280] Seong-Wan Ryu, Kyungkyu Min, Jungho Shin, Heimi Kwon, Donghoon Nam, Taekyung Oh, Tae-Su Jang, Minsoo Yoo, Yongtaik Kim, and Sungjoo Hong. Overcoming the Reliability Limitation in the Ultimately Scaled DRAM using Silicon Migration Technique by Hydrogen Annealing. In *IEEE International Electron Devices Meeting*, 2017.
- [281] Samsung. S5P4418 Application Processor Revision 0.10. 2014.
- [282] André Schaller, Wenjie Xiong, Nikolaos Athanasios Anagnostopoulos, Muhammad Umair Saleem, Sebastian Gabmeyer, Stefan Katzenbeisser, and Jakub Szefer. Intrinsic Rowhammer PUFs: Leveraging the Rowhammer Effect for Improved Security. In *HOST*, 2017.
- [283] Werner Schindler and Wolfgang Killmann. Evaluation Criteria for True (Physical) Random Number Generators Used in Cryptographic Applications. In *CHES*, 2002.
- [284] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. DRAM Errors in the Wild: a Large-scale Field Study. In *SIGMETRICS*, 2009.
- [285] Mark Seaborn and Thomas Dullien. Exploiting the DRAM RowHammer Bug to Gain Kernel Privileges. *Black Hat*, 2015.
- [286] Damla Senol Cali, Gurpreet Kalsi, Zulal Bingöl, Lavanya Subramanian, Can Firtina, Jeremie Kim, Rachata Ausavarungnirun, Mohammed Alser, Anant Nori, Juan Gómez-Luna, Amirali Boroumand, Allison Scibisz, Sreenivas Subramoney, Can Alkan, Saugata Ghose, and Onur Mutlu. GenASM: A Low-Power, Memory-Efficient Approximate String Matching Acceleration Framework for Genome Sequence Analysis. *MICRO*, 2020.
- [287] Damla Senol Cali, Jeremie S Kim, Saugata Ghose, Can Alkan, and Onur Mutlu. Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions. *Briefings in Bioinformatics*, 20(4):1542–1559, 2019.
- [288] O Seongil, Young Hoon Son, Nam Sung Kim, and Jung Ho Ahn. Row-buffer Decoupling: A Case for Low-latency DRAM Microarchitecture. In *ISCA*, 2014.
- [289] Vivek Seshadri. *Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems*. PhD thesis, Carnegie Mellon University, 2016.
- [290] Vivek Seshadri, Abhishek Bhowmick, Onur Mutlu, Phillip B Gibbons, Michael A Kozuch, and Todd C Mowry. The Dirty-block Index. In *ISCA*, 2014.
- [291] Vivek Seshadri, Kevin Hsieh, Amirali Boroumand, Donghyuk Lee, Michael A Kozuch, Onur Mutlu, Phillip B Gibbons, and Todd C Mowry. Fast Bulk Bitwise AND and OR in DRAM. *IEEE Computer Architecture Letters*, 14(2):127–131, 2015.
- [292] Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B Gibbons, Michael A Kozuch, and Todd Mowry. RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization. In *MICRO*, 2013.

- [293] Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A Kozuch, Onur Mutlu, Phillip B Gibbons, and Todd C Mowry. Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM. In *arXiv preprint arXiv:1611.09988*, 2016.
- [294] Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A Kozuch, Onur Mutlu, Phillip B Gibbons, and Todd C Mowry. Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology. In *MICRO*, 2017.
- [295] Vivek Seshadri, Thomas Mullins, Amirali Boroumand, Onur Mutlu, Phillip B Gibbons, Michael A Kozuch, and Todd C Mowry. Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-Unit Strided Accesses. In *MICRO*, 2015.
- [296] Vivek Seshadri and Onur Mutlu. Simple Operations in Memory to Reduce Data Movement. In *Advances in Computers*. 2017.
- [297] Seyed Mohammad Seyedzadeh, Alex K Jones, and Rami Melhem. Mitigating Wordline Crosstalk Using Adaptive Trees of Counters. In *ISCA*, 2018.
- [298] Claude Elwood Shannon. A Mathematical Theory of Communication. *Bell System Technical Journal*, 1948.
- [299] Wongyu Shin, Jeongmin Yang, Jungwhan Choi, and Lee-Sup Kim. NUAT: A Non-Uniform Access Time Memory Controller. In *HPCA*, 2014.
- [300] C Glenn Shirley and W Robert Daasch. Copula Models of Correlation: A DRAM Case Study. In *IEEE Transactions on Computers*, volume 63, pages 2389–2401, 2014.
- [301] Allan Snavely and Dean M. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreading Processor. In *ASPLOS*, 2000.
- [302] SoftMC Source Code. <https://github.com/CMU-SAFARI/SoftMC>.
- [303] Mungyu Son, Hyunsun Park, Junwhan Ahn, and Sungjoo Yoo. Making DRAM Stronger Against Row Hammering. In *DAC*, 2017.
- [304] Young Hoon Son, Sukhan Lee, O Seongil, Sanghyuk Kwon, Nam Sung Kim, and Jung Ho Ahn. CiDRA: A Cache-Inspired DRAM Resilience Architecture. In *HPCA*, 2015.
- [305] Young Hoon Son, O Seongil, Yuhwan Ro, Jae W Lee, and Jung Ho Ahn. Reducing Memory Access Latency with Asymmetric DRAM Bank Organizations. In *ISCA*, 2013.
- [306] Vilas Sridharan and Dean Liberty. A Study of DRAM Failures in the Field. In *SC*, 2012.
- [307] André Stefanov, Nicolas Gisin, Olivier Guinnard, Laurent Guinnard, and Hugo Zbinden. Optical Quantum Random Number Generator. In *Journal of Modern Optics*, volume 47, pages 595–598, 2000.

- [308] Mario Stipčević and Çetin Kaya Koç. True Random Number Generators. In *Open Problems in Mathematics and Computational Science*. 2014.
- [309] Ying Su, Jeremy Holleman, and Brian Otis. A 1.6 pJ/bit 96% Stable Chip-ID Generating Circuit Using Process Variations. In *ISSCC*, 2007.
- [310] Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu. The Blacklisting Memory Scheduler: Achieving High Performance And Fairness At Low Cost. In *ICCD*, 2014.
- [311] Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu. BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling. In *TPDS*, 2016.
- [312] Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu. The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-application Interference at Shared Caches and Main Memory. In *MICRO*, 2015.
- [313] Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu. MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems. In *HPCA*, 2013.
- [314] G Edward Suh and Srinivas Devadas. Physical Unclonable Functions for Device Authentication and Secret Key Generation. In *DAC*, 2007.
- [315] Berk Sunar, William J Martin, and Douglas R Stinson. A Provably Secure True Random Number Generator with Built-in Tolerance to Active Attacks. In *IEEE Transactions on Computers*, volume 56, pages 109–119, 2007.
- [316] Zehra Sura, Arpit Jacob, Tong Chen, Bryan Rosenburg, Olivier Sallenave, Carlo Bertolli, Samuel Antao, Jose Brunheroto, Yoonho Park, Kevin O’Brien, et al. Data Access Optimization in a Processing-in-Memory System. In *CF*, 2015.
- [317] Soubhagya Sutar, Arnab Raha, Devadatta Kulkarni, Rajeev Shorey, Jeffrey Tew, and Vijay Raghunathan. D-PUF: An Intrinsically Reconfigurable DRAM PUF for Device Authentication and Random Number Generation. In *TECS*, 2018.
- [318] Soubhagya Sutar, Arnab Raha, and Vijay Raghunathan. D-PUF: An Intrinsically Reconfigurable DRAM PUF for Device Authentication in Embedded Systems. In *CASES*, 2016.
- [319] Soubhagya Sutar, Arnab Raha, and Vijay Raghunathan. Memory-based Combination PUFs for Device Authentication in Embedded Systems. In *IEEE Transactions on Multi-Scale Computing Systems*, volume 4, pages 793–810, 2017.
- [320] BMS Bahar Talukder, Joseph Kerns, Biswajit Ray, Thomas Morris, and Md Tauhidur Rahman. Exploiting DRAM Latency Variations for Generating True Random Numbers. In *ICCE*, 2019.
- [321] BMS Bahar Talukder, Biswajit Ray, Domenic Forte, and Md Tauhidur Rahman. PreLatPUF: Exploiting DRAM Latency Variations for Generating Robust Device Signatures. *IEEE Access*, 2019.

- [322] Qianying Tang, Chen Zhou, Woong Choi, Gyuseong Kang, Jongsun Park, Keshab K Parhi, and Chris H Kim. A DRAM Based Physical Unclonable Function Capable of Generating $>10^{32}$ Challenge Response Pairs per 1Kbit Array for Secure Chip Authentication. In *CICC*, 2017.
- [323] Sha Tao and Elena Dubrova. TVL-TRNG: Sub-Microwatt True Random Number Generator Exploiting Metastability in Ternary Valued Latches. In *ISMVL*, 2017.
- [324] Andrei Tatar, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Defeating Software Mitigations Against Rowhammer: A Surgical Precision Hammer. In *RAID*, 2018.
- [325] Arash Tavakkol, Mohammad Sadrosadati, Saugata Ghose, Jeremie Kim, Yixin Luo, Yaohua Wang, Nika Mansouri Ghiasi, Lois Orosa, Juan Gómez-Luna, and Onur Mutlu. FLIN: Enabling Fairness and Enhancing Performance in Modern NVMe Solid State Drives. In *ISCA*, 2018.
- [326] Tech Insights. DRAM Technology/Products Roadmap, 2019.
- [327] Je Sen Teh, Azman Samsudin, Mishal Al-Mazrooei, and Amir Akhavan. GPUs and Chaos: A New True Random Number Generator. In *Nonlinear Dynamics*, 2015.
- [328] Fatemeh Tehranipoor, Nima Karimian, Kan Xiao, and John Chandy. DRAM Based Intrinsic Physical Unclonable Functions for System Level Security. In *GLVLSI*, 2015.
- [329] Fatemeh Tehranipoor, Nima Karimian, Wei Yan, and John A Chandy. Investigation of DRAM PUFs Reliability Under Device Accelerated Aging Effects. In *ISCAS*, 2017.
- [330] Fatemeh Tehranipoor, Wei Yan, and John A Chandy. Robust Hardware True Random Number Generators using DRAM Remanence Effects. In *HOST*, 2016.
- [331] Shanquan Tian, Wenjie Xiong, Ilias Giechaskiel, Kasper Rasmussen, and Jakub Szefer. Fingerprinting Cloud FPGA Infrastructures. In *FPGA*, 2020.
- [332] Carlos Tokunaga, David Blaauw, and Trevor Mudge. True Random Number Generator with a Metastability-based Quality Control. In *IEEE Journal of Solid-State Circuits*, volume 43, pages 78–85, 2008.
- [333] TQ-Systems. TQMx80UC User’s Manual. <https://www.tq-group.com/filedownloads/files/products/embedded/manuals/x86/embedded-modul/COM-Express-Compact/TQMx80UC/TQMx80UC.UM.0102.pdf>, 2020.
- [334] Kuen Hung Tsoi, Ka Ho Leung, and Philip Heng Wai Leong. High Performance Physical Random Number Generator. In *IET Computers & Digital Techniques*, volume 1, pages 349–352, 2007.
- [335] Pim Tuyls, GeertJan Schrijen, Boris Škorić, Jan Van Geloven, Nynke Verhaegh, and Rob Wolters. Read-proof Hardware From Protective Coatings. In *CHES*, 2006.
- [336] Stanley Tzeng and LiYi Wei. Parallel White Noise Generation on a GPU via Cryptographic Hash. In *I3D*, 2008.

- [337] Hiroyuki Usui, Lavanya Subramanian, Kevin KaiWei Chang, and Onur Mutlu. DASH: Deadline-Aware High-performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators. In *ACM Transactions on Architecture and Code Optimization (TACO)*, volume 12, pages 1–28, 2016.
- [338] Vincent Van der Leest, Geert-Jan Schrijen, Helena Handschuh, and Pim Tuyls. Hardware Intrinsic Security from D Flip-flops. In *ACM STC*, 2010.
- [339] Vincent Van der Leest, Erik Van der Sluis, Geert-Jan Schrijen, Pim Tuyls, and Helena Handschuh. Efficient Implementation of True Random Number Generator Based on SRAM PUFs. In *Cryptography and Security: From Theory to Applications*. 2012.
- [340] Victor Van der Veen, Yanick Fratantonio, Martina Lindorfer, Daniel Gruss, Clémentine Maurice, Giovanni Vigna, Herbert Bos, Kaveh Razavi, and Cristiano Giuffrida. Drammer: Deterministic Rowhammer Attacks on Mobile Platforms. In *CCS*, 2016.
- [341] Victor Van der Veen, Martina Lindorfer, Yanick Fratantonio, Harikrishnan Padmanabha Pillai, Giovanni Vigna, Christopher Kruegel, Herbert Bos, and Kaveh Razavi. GuardION: Practical Mitigation of DMA-Based Rowhammer Attacks on ARM. In *DIMVA*, 2018.
- [342] Elena Ioana Vatajelu, Giorgio Di Natale, Marco Indaco, and Paolo Prinetto. STT MRAM-Based PUFs. In *DATE*, 2015.
- [343] Elena Ioana Vatajelu, Giorgio Di Natale, Mario Barbareschi, Lionel Torres, Marco Indaco, and Paolo Prinetto. STT-MRAM-Based PUF Architecture Exploiting Magnetic Tunnel Junction Fabrication-Induced Variability. In *ACM Journal on Emerging Technologies in Computing Systems (JETC)*, volume 13, pages 1–21, 2016.
- [344] Ravi K Venkatesan, Stephen Herr, and Eric Rotenberg. Retention-aware Placement in DRAM (RAPID): Software Methods for Quasi-non-volatile DRAM. In *HPCA*, 2006.
- [345] VersaLogic Corporation. Blackbird BIOS Reference Manual. <https://www.versalogic.com/wp-content/themes/vsl-new/assets/pdf/manuals/MEPU44624562BRM.pdf>, 2019.
- [346] Vincent von Kaenel and Toshinari Takayanagi. Dual True Random Number Generators for Cryptographic Applications Embedded on a 200 Million Device Dual CPU SOC. In *CICC*, 2007.
- [347] Jue Wang, Xiangyu Dong, and Yuan Xie. ProactiveDRAM: A DRAM-initiated Retention Management Scheme. In *ICCD*, 2014.
- [348] Yaohua Wang, Lois Orosa, Xiangjun Peng, Yang Guo, Saugata Ghose, Minesh Patel, Jeremie Kim, Juan Gómez-Luna, Mohammad Sadrosadati, Nika Mansouri Ghiasi, and Onur Mutlu. Reducing DRAM Latency via Fine-grained in-DRAM Cache. *MICRO*, 2020.
- [349] Yaohua Wang, Arash Tavakkol, Lois Orosa, Saugata Ghose, Nika Mansouri Ghiasi, Minesh Patel, Jeremie S. Kim, Hasan Hassan, Mohammad Sadrosadati, and Onur Mutlu. Reducing DRAM Latency via Charge-Level-Aware Look-Ahead Partial Restoration. *MICRO*, 2018.

- [350] Yicheng Wang, Yang Liu, Peiyun Wu, and Zhao Zhang. Detect DRAM Disturbance Error by Using Disturbance Bin Counters. *IEEE Computer Architecture Letters*, 18(1):35–38, 2019.
- [351] Yicheng Wang, Yang Liu, Peiyun Wu, and Zhao Zhang. Reinforce Memory Error Protection by Breaking DRAM Disturbance Correlation Within ECC Words. In *ICCD*, 2019.
- [352] Ying Wang, Yinhe Han, Cheng Wang, Huawei Li, and Xiaowei Li. RADAR: A Case for Retention-aware DRAM Assembly and Repair in Future FGR DRAM Memory. In *DAC*, 2015.
- [353] Yinglei Wang, WingKei Yu, Shuo Wu, Greg Malysa, G Edward Suh, and Edwin C Kan. Flash Memory for Ubiquitous Hardware Security Functions: True Random Number Generation and Device Fingerprints. In *SP*, 2012.
- [354] Yonggang Wang, Cong Hui, Chong Liu, and Chao Xu. Theory and Implementation of a Very High Throughput True Random Number Generator in Field Programmable Gate Array. *Review of Scientific Instruments*, 2016.
- [355] Zheng Wang, Yi Chen, Aakash Patil, Jayasanker Jayabalan, Xueyong Zhang, Chip-Hong Chang, and Arindam Basu. Current Mirror Array: A Novel Circuit Topology for Combining Physical Unclonable Function and Machine Learning. In *IEEE Transactions on Circuits and Systems I: Regular Papers*, volume 65, pages 1314–1326, 2017.
- [356] Piotr Zbigniew Wieczorek. An FPGA Implementation of the Resolve Time-based True Random Number Generator with Quality Control. In *IEEE Transactions on Circuits and Systems I: Regular Papers*, 2014.
- [357] Gilbert Wolrich, Debra Bernstein, Daniel Cutter, Christopher Dolan, and Matthew J Adiletta. Mapping Requests from a Processing Unit That Uses Memory-Mapped Input-Output Space, 2004. US Patent 6,694,380.
- [358] Xin-Chuan Wu, Timothy Sherwood, Frederic T Chong, and Yanjing Li. Protecting Page Tables from RowHammer Attacks using Monotonic Pointers in DRAM True-Cells. In *ASPLOS*, 2019.
- [359] Kan Xiao, Md Tauhidur Rahman, Domenic Forte, Yu Huang, Mei Su, and Mohammad Tehranipoor. Bit Selection Algorithm Suitable for High-volume Production of SRAM-PUF. In *HOST*, 2014.
- [360] Yuan Xiao, Xiaokuan Zhang, Yinqian Zhang, and Radu Teodorescu. One Bit Flips, One Cloud Flops: Cross-VM Row Hammer Attacks and Privilege Escalation. In *USENIX Security*, 2016.
- [361] Xilinx. ML605 Hardware User Guide. [https://www.xilinx.com/support/documentation/boards_and_kits/ug534.pdf](https://www.xilinx.com/support/documentation/boards_and_kits/ug534.pdf).
- [362] Xilinx. Virtex UltraScale FPGAs.
- [363] Hongyi Xin, Jeremie Kim, Sunny Nahar, Carl Kingsford, Can Alkan, and Onur Mutlu. LEAP: A Generalization of the Landau-Vishkin Algorithm with Custom Gap Penalties. In *RECOMB-Seq*, 2017.

- [364] Wenjie Xiong, André Schaller, Nikolaos A Anagnostopoulos, Muhammad Umair Saleem, Sebastian Gabmeyer, Stefan Katzenbeisser, and Jakub Szefer. Run-time Accessible DRAM PUFs in Commodity Devices. In *CHES*, 2016.
- [365] Wei Yan, Fatemeh Tehranipoor, and John A Chandy. A Novel Way to Authenticate Untrusted Integrated Circuits. In *ICCAD*, 2015.
- [366] David S Yaney, ChihYuan Lu, Ross A Kohler, Michael J Kelly, and James T Nelson. A Meta-stable Leakage Phenomenon in DRAM Charge Storage-Variable Hold Time. In *International Electron Devices Meeting*, pages 336–339, 1987.
- [367] Kaiyuan Yang, David Blaauw, and Dennis Sylvester. An All-digital Edge Racing True Random Number Generator Robust Against PVT Variations. In *IEEE Journal of Solid-State Circuits*, volume 51, pages 1022–1031, 2016.
- [368] Kaiyuan Yang, David Fick, Michael B Henry, Yoonmyung Lee, David Blaauw, and Dennis Sylvester. 16.3 A 23Mb/s 23pJ/b Fully Synthesized True-random-number Generator in 28nm and 65nm CMOS. In *ISSCC*, 2014.
- [369] Thomas Yang and Xi-Wei Lin. Trap-assisted DRAM Row Hammer Effect. *IEEE Electron Device Letters*, 40(3):391–394, 2019.
- [370] Jing Ye, Yu Hu, and Xiaowei Li. OPUF: Obfuscation Logic Based Physical Unclonable Function. In *IOLTS*, 2015.
- [371] Chi En Yin, Gang Qu, and Qiang Zhou. Design and Implementation of a Group-Based RO PUF. In *DATE*, 2013.
- [372] Jung Min You and Joon-Sung Yang. MRLoc: Mitigating Row-hammering Based on Memory Locality. In *DAC*, 2019.
- [373] Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L Greathouse, Lifan Xu, and Michael Ignatowski. TOP-PIM: Throughput-Oriented Programmable Processing in Memory. In *HPDC*, 2014.
- [374] Le Zhang, Xuanyao Fong, Chip-Hong Chang, Zhi Hui Kong, and Kaushik Roy. Highly Reliable Spin-Transfer Torque Magnetic RAM-Based Physical Unclonable Function with Multi-Response-Bits per Cell. In *TIFS*, 2015.
- [375] Le Zhang, Xuanyao Fong, Chip-Hong Chang, Zhi Hui Kong, and Kaushik Roy. Optimizing Emerging Nonvolatile Memories for Dual-Mode Applications: Data Storage and Key Generator. In *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, volume 34, pages 1176–1187, 2015.
- [376] Tao Zhang, Ke Chen, Cong Xu, Guangyu Sun, Tao Wang, and Yuan Xie. Half-DRAM: A High-bandwidth and Low-power DRAM Architecture from the Rethinking of Fine-grained Activation. In *ISCA*, 2014.
- [377] Teng Zhang, Minghui Yin, Changmin Xu, Xiayan Lu, Xinhao Sun, Yuchao Yang, and Ru Huang. High-speed True Random Number Generation Based on Paired Memristors for Security Electronics. *Nanotechnology*, 28(45), 2017.

- [378] XG Zhang, YQ Nie, H Zhou, H Liang, X Ma, J Zhang, and JW Pan. 68 Gbps Quantum Random Number Generation by Measuring Laser Phase Fluctuations. In *Review of Scientific Instruments*, 2015.
- [379] Xianwei Zhang, Youtao Zhang, Bruce R Childers, and Jun Yang. Exploiting DRAM Restore Time Variations In Deep Sub-micron Scaling. In *DATE*, 2015.
- [380] Xianwei Zhang, Youtao Zhang, Bruce R Childers, and Jun Yang. Restore Truncation for Performance Improvement in Future DRAM Systems. In *HPCA*, 2016.
- [381] Zhao Zhang, Zhichun Zhu, and Xiaodong Zhang. A Permutation-based Page Interleaving Scheme to Reduce Row-buffer Conflicts and Exploit Data Locality. In *MICRO*, 2000.
- [382] Yu Zheng, Maryam S Hashemian, and Swarup Bhunia. RESP: A Robust Physical Unclonable Function Retrofitted into Embedded SRAM Array. In *DAC*, 2013.
- [383] William K Zuravleff and Timothy Robinson. Controller for a Synchronous DRAM that Maximizes Throughput by Allowing Memory Requests and Commands to be Issued Out of Order, 1997. US Patent 5,630,096.



# European Design and Automation Association

## Outstanding Dissertation Award 2020

Category "New directions in safety, reliability and security-aware hardware design, validation and test for systems and circuits"

*Jeremie Kim, Ph.D.*

*Improving DRAM Performance, Security and Reliability by Understanding and Exploiting DRAM Timing Parameter Margins*


---

Lorena Anghel  
EDAA Vice Chair

---

4 February 2021

---

Norbert Wehn  
EDAA Chair