

# IRONHIDE: A Secure Multicore that Efficiently Mitigates Microarchitecture State Attacks for Interactive Applications

Hamza Omar

University of Connecticut, Storrs, CT, USA  
hamza.omar@uconn.edu

Omer Khan

University of Connecticut, Storrs, CT, USA  
khan@uconn.edu

## ABSTRACT

Microprocessors enable aggressive hardware virtualization, by means of which multiple processes temporally execute on the system. These security-critical and ordinary processes interact with each other to ensure application progress. However, temporal sharing of hardware resources exposes the processor to various microarchitecture state attack vectors. State-of-the-art secure processors, such as MI6, adopt Intel's SGX enclave execution model. MI6 architects strong isolation against these vulnerabilities by isolating large memory components and purging the private microarchitecture state on every enclave entry/exit. The purging overhead significantly impacts performance as the interactivity across the secure and insecure processes increases. This paper proposes IRONHIDE, which extends the MI6 architecture to multicores by forming spatially isolated secure and insecure clusters of cores. For a given *secure–insecure* process tuple of an interactive application, IRONHIDE pins the secure process to the secure cluster, where it executes and interacts with the insecure process(es) without incurring the overheads of purging microarchitecture state on each interaction event. For a set of interactive applications, IRONHIDE improves performance by  $\sim 32\%$  over MI6, while ensuring similar security guarantees against microarchitecture state attacks.

## 1. INTRODUCTION

Hardware virtualization enables multiple processes to co-locate and temporally execute on processor hardware. These security-critical and ordinary processes interact over the course of their execution for an application to progress. Traditionally, process-level isolation (e.g., Intel's SMAP) is assumed across these processes to guarantee memory isolation. However, it falls short in providing processor security [1, 2, 3, 4]. Intel's SGX [5] therefore goes a step further and executes secure processes in containers called *enclaves*, which are isolated from temporally executing ordinary (insecure) processes. For each enclave entry, the processor switches to the enclave mode, where it first authenticates and attests the enclave and later decrypts the data for processing. Conversely, on every secure enclave exit, the enclave data is encrypted, the core's pipeline is flushed, and the processor is switched back to the normal mode. However, Intel's SGX is known to be vulnerable to several speculative [6, 7, 8, 9] and non-speculative [10, 11, 12, 13, 14] microarchitecture state attacks. Virtualization enables hardware sharing of processor resources, such as caches, translation look-aside buffers, the on-chip network, and even memory controllers. Therefore, the execution footprint of a process leaves microarchitecture state in shared hardware resources, through which an attacker process can infer secret data value(s). In a nutshell, process-level isolation and strong cryptographic primitives alone are insufficient for mitigating microarchitecture state attacks.

Strong isolation [15] must be provided across secure and insecure processes to ensure that no insecure process is able to infer a secure process's data through the shared hardware resources. Recent academic secure processors [16, 17, 18] consider Intel's SGX enclave execution model and provide *strong isolation*. Large state structures, such as last-level caches, are statically partitioned across processes [16, 17, 18], while main memory region(s) are physically isolated via *page coloring* and distributed across processes. However, per-core private (small) hardware resources, such as private caches and TLBs, are temporally partitioned and thus flushed (referred to as *purge* in [16, 17]) on each secure enclave entry and exit. In a realistic setting, the secure and insecure processes frequently interact with each other to ensure application progress [19]. For instance, consider a machine learning model (trained on secure data) that periodically interacts with an ordinary (insecure) vision pipeline to classify a stream of images [20]. Each interaction to communicate image inputs across processes invokes the enclave entry or exit protocol, resulting in microarchitecture state purging overheads. The MI6 secure processor [17] reports an average execution time overhead of  $\sim 5.4\%$  due to purging. The purged microarchitecture state must then be brought back for the next input, further degrading performance. This overhead is expected to accumulate for interactive applications, as it is directly proportional to the interactivity across secure and insecure processes: the higher the interactivity, the greater the impact of microarchitecture state purging on overall performance. Indeed, such performance implications are unacceptable, and therefore there is a need to rethink secure processor architecture from the ground up to provide strong isolation, yet enable high performance.

All prior secure processor works [5, 16, 17, 18] consider the processor as a single monolithic entity, where secure and insecure processes temporally execute. This work proposes to take a step further in the context of multicore architectures that incorporate tens or even hundreds of cores on a chip [21, 22]. Unlike traditional secure processors where applications temporally execute on core-level resources, multicores allow *spatial* sharing of cores as well. Figure 1 shows the envisioned IRONHIDE architecture, where two clusters of cores are formed to enable strong isolation between the secure and insecure processes. When a secure–insecure process tuple is scheduled on the multicore, the secure process is pinned to the secure cluster, and interacts with the insecure process(es) using a secure communication protocol. Since the secure process never context switches to interact with insecure process(es), the purging overheads to mitigate microarchitecture state attacks are not accumulated in IRONHIDE.



**Figure 1: An envisioned secure multicore architecture.**

State-of-the-art MI6 [17] is considered as the baseline secure processor<sup>1</sup>. To model MI6 on a multicore processor, all available cores and private resources (such as private caches and TLBs) are time multiplexed across the secure enclave and insecure processes. For mitigating non-speculative microarchitecture state attacks, these time-shared resources are flushed/purged on every secure enclave entry and exit. For shared caches and TLBs, rather than adopting static cache-set partitioning from MI6, the multicore model spatially distributes per-core shared cache slices (and TLBs) across processes. Moreover, the multicore baseline statically distributes main memory across the secure and insecure processes to form isolated DRAM regions. All on-chip memory controllers are flushed/purged on each secure enclave entry/exit to ensure strong isolation. To attest and authenticate the secure process, a *secure kernel* (similar to the security monitor in MI6 [17, 23]) is considered. The MI6 architecture also mitigates speculative microarchitecture state attacks, where a victim (insecure) process is tricked into speculatively accessing secure data and polluting shared hardware resources that are accessible to an attacker (again insecure) process [24]. A lightweight hardware check is adopted from MI6 that tracks the progress of a memory request initiated by an insecure process to access the secure enclave’s memory regions. Such requests are stalled until they are resolved. If resolved as speculative, the request is discarded; otherwise, when resolved as non-speculative, it is processed for exception handling. Adopting these mechanisms from MI6 ensures strong isolation within the multicore model (cf. Figure 1:①).
Lastly, interactions across the secure enclave and insecure processes are carried out via a shared inter-process communication (IPC) buffer, which resides in the shared cache slices (or memory regions) of the insecure process (similar to the protocol adopted in [17, 19]), as shown in Figure 1:②. However, frequent interactions across processes incur repeated flushing/purging overheads in the MI6 baseline, degrading performance.

To minimize the frequent microarchitecture state flushing overheads, this work proposes IRONHIDE, which extends the MI6 multicore baseline and spatially partitions all available cores across the secure and insecure processes of a given application. IRONHIDE forms two *strongly isolated* clusters of cores, one secure and one insecure. The per-core resources (including private caches and TLBs) are spatially partitioned across these clusters. The deterministic on-chip network is also isolated across clusters, ensuring that no packets originating from and destined to the same process drift to another cluster; only the network packets for interaction purposes are allowed to cross cluster boundaries. Moreover, the spatial partitioning of shared cache slices and DRAM regions is adopted from the multicore MI6 baseline. However, IRONHIDE statically allocates memory controller(s) by spatially pinning them to their respective clusters to ensure strong isolation. A key insight here is that IRONHIDE allows both clusters to fully exploit their allocated hardware resources by *temporally* executing their respective processes. The secure process (attested by the *secure kernel*) is pinned to the secure cluster (cf. Figure 1:③), where it spatially interacts with insecure processes via the shared IPC buffer. Hence, no secure process exits are necessary for interactive applications, avoiding the need for microarchitecture state flushes without violating strong isolation. Note that the IRONHIDE execution model builds on the MI6 baseline temporal model: a set of *secure-insecure* process tuples is temporally executed on the processor. However, during the execution of a given secure process, the secure cluster maintains a spatial binding of core-level resources to avoid frequent penalties due to enclave entries/exits.

As the core-level resource requirements vary for any given *secure-insecure* process tuple, it is imperative to ensure near-optimal hardware resource utilization. IRONHIDE proposes *dynamic hardware isolation*, which guarantees strong isolation while allowing the core-level hardware resources of the clusters to be reconfigured to balance load between them. The secure kernel employs a *core re-allocation predictor* that finds a single resource binding for each interactive *secure-insecure* process tuple. Whenever an application tuple is scheduled for execution, the predictor reconfigures the core-level resources among the two clusters. This cluster reconfiguration is done once for each interactive tuple invocation to ensure bounded leakage [25, 26]. However, reconfiguring the secure cluster’s resources exposes the private resources of the reallocated cores (such as private caches/TLBs) to microarchitecture state vulnerabilities. Moreover, the shared cache slices (and shared TLBs) of these reallocated cores are also exposed due to sharing of the on-chip network routers. To mitigate these vulnerabilities, on every dynamic hardware re-allocation event, the system is stalled and the private resources of the reallocated cores are *flushed-and-invalidated*, followed by the *re-allocation* of memory pages (data structures) mapped to the shared cache slices (shared TLBs) of the respective cores. Note that this *flushing and re-allocation* overhead in IRONHIDE is incurred only once per *secure-insecure* process tuple invocation, and is executed in parallel on each reallocated core. In a nutshell, *dynamic hardware isolation* keeps strong isolation intact while maximizing each cluster’s performance.

This work offers the following *novel contributions*:

- IRONHIDE extends the MI6 architecture on multicores by forming spatially isolated secure and insecure clusters of cores. It pins the secure process to execute in the secure cluster. Consequently, for interactive applications, IRONHIDE minimizes performance overheads of purging the microarchitecture state, while retaining strong isolation guarantees.
- IRONHIDE is prototyped on a real *Tilera® Tile-Gx72™* multicore. For a set of interactive applications, average performance improvements of  $\sim 32\%$  are observed over the multicore MI6 baseline, because IRONHIDE avoids the purging overheads associated with enclave entry/exit.

<sup>1</sup>The MI6 [17] processor is built on top of DAWG [16] and Sanctum [18].

## 2. THREAT MODEL ASSUMPTIONS

The threat model assumptions are adopted from MI6<sup>2</sup> [17], where all possible microarchitecture state attacks that rely on covert/side channels are considered. It is assumed that the OS and user applications are untrusted; however, the processor chip (hardware), main memory (i.e., DRAM), and a small security monitor (adopted from MI6) are trusted. Formally, the adversarial model considers that an attacker process can co-locate with a victim process on shared microarchitecture structures, i.e., the core pipeline buffers, the per-core private and shared caches/TLBs, the on-chip network, and shared memory controllers in the multicore processor. Through these structures, the attacker process can flush, reload, prime, and probe hardware resources to conduct non-speculative state attacks, such as cache timing/access based attacks [10, 11, 12, 13, 14] and/or on-chip network exploits [27, 28]. Moreover, the attacker can manipulate/train the hardware resources dedicated to speculative execution, such as branch predictors, to launch attacks that rely on leaking the speculative microarchitecture state of the shared resources [6, 8, 9, 29]. The objective of IRONHIDE is to provide holistic protection against all microarchitecture state attacks, while enabling high performance.

IRONHIDE exclusively addresses software-based microarchitecture state attacks, and assumes the absence of any adversary with physical access. Thus, physical channels dependent on power [30], thermal imaging [31], and EM [32] are considered orthogonal attack vectors. This also includes physical attacks on memory, which can be efficiently mitigated by incorporating mechanisms such as integrity checking [33] and ORAM [34]. Moreover, attacks by compromised system software, e.g., the OS refusing to allocate resources to a secure application, fall outside the proposed threat model. Lastly, hardware attacks outside the microarchitecture state, such as exploiting hardware bugs to conduct fault-injection attacks [35] and employing trojan applications to leak information [36], are orthogonal to this threat model’s scope.

## 3. MOTIVATION FOR IRONHIDE

Intel’s SGX [5, 37] introduces processor extensions that allow processes to execute in isolated containers, called *enclaves*. These enclaves are used to execute security-critical processes and temporally co-execute with ordinary (insecure) processes, such as the untrusted OS and off-the-shelf processes. As described in [38], for each enclave entry, the processor is switched to *enclave mode*, where the secure process is first attested and authenticated via strong cryptographic primitives. Upon gaining access, the secure process’s data is decrypted for processing in the enclave. Conversely, for every secure enclave exit, the processor encrypts all enclave-related data and flushes the core pipeline. Lastly, the processor is switched to the *normal mode* to execute ordinary (insecure) processes. Due to the temporal execution of secure enclaves with insecure processes, an attacker process can either directly monitor accesses made by the enclave [11], or trick the system into making speculative accesses [9] to leak the secure enclave’s data. In a nutshell, Intel’s SGX is vulnerable to various microarchitecture state attacks, as it falls short in providing strong isolation guarantees [15].

State-of-the-art secure processor MI6 [17] (considered as the baseline) adopts Intel's SGX [5] enclave execution model and enables strong isolation to protect against *all* microarchitecture state attacks. Similar to Intel's SGX, the secure and insecure processes are time multiplexed on a single monolithic processor, where these processes time-share hardware resources. As shown in Figure 2, to mitigate sharing across large stateful resources, such as the shared last-level cache and main memory, MI6 statically partitions these resources across the secure enclave and insecure processes. Each process utilizes statically partitioned cache sets and DRAM regions. Moreover, to mitigate sharing across time-shared resources, such as core pipeline buffers, private caches/TLBs, and memory controller queues, MI6 purges/flushes these resources on each secure enclave entry/exit.

**Figure 2: Strong isolation in MI6 secure processor.**

**Shortcomings of MI6:** In MI6, each secure enclave entry and exit requires flushing/purging of certain microarchitecture state, which has performance implications. The state that is purged also needs to be brought back into the hardware caches/TLBs, which further degrades performance. According to MI6, purging the private microarchitecture state incurs an average overhead of  $\sim 5.4\%$  of the total completion time of applications. This translates to  $\sim 54$ ms for every *second* of application execution. This overhead is expected to accumulate for interactive applications, as it is directly proportional to the interactivity across secure and insecure processes: the higher the interactivity, the greater the impact of microarchitecture state purging on overall performance. This key observation motivates rethinking secure processor architecture to ensure strong isolation, yet enable high performance for interactive applications.
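To make the proportionality argument concrete, the following Python sketch models purging overhead as a function of the number of secure–insecure interactions. It assumes (a simplification of ours, not a model from MI6) that every interaction causes exactly one enclave entry and one exit, and that each transition pays a fixed purge cost plus a fixed cost to reload the purged state; the function name and all per-transition costs are hypothetical. The first computation reproduces the 5.4% to 54 ms conversion stated above.

```python
# Hypothetical first-order model of MI6-style purging overhead.
# Per-transition costs are illustrative assumptions, not measurements.

MI6_AVG_PURGE_FRACTION = 0.054  # average purge overhead reported by MI6 [17]

# 5.4% of one second of execution, expressed in milliseconds (~54 ms).
purge_ms_per_second = MI6_AVG_PURGE_FRACTION * 1000.0


def purge_overhead_s(interactions: int, purge_s: float, reload_s: float) -> float:
    """Overhead in seconds when each interaction triggers one entry and one exit.

    Every transition flushes private state (purge_s) and later refetches the
    flushed state (reload_s), so the total grows linearly with interactivity.
    """
    transitions = 2 * interactions  # one enclave entry + one enclave exit
    return transitions * (purge_s + reload_s)


if __name__ == "__main__":
    # Assumed costs: 10 us to purge and 20 us to reload per transition.
    for n in (100, 1_000, 10_000):
        overhead_ms = purge_overhead_s(n, 10e-6, 20e-6) * 1e3
        print(f"{n} interactions -> {overhead_ms:.1f} ms")
```

Under these assumed costs, ten times more interactions produce ten times more overhead, which is the linear scaling argued above; IRONHIDE targets the per-interaction transitions themselves rather than the cost of each purge.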

**Key Idea of IRONHIDE:** All prior works [16, 17, 18] consider the processor as a single monolithic entity, where secure and insecure processes temporally execute. However, this work proposes to take a step further in the context of multicores [21, 22], where all possible speculative and non-speculative microarchitecture state attacks can be conducted. The goal of IRONHIDE is to spatially partition core-level resources to construct secure and insecure clusters of cores, and to pin the secure process to the secure cluster, where it can execute and interact with insecure processes executing in the insecure cluster. By doing so, if an application consists of frequent interactions between secure and insecure processes, these interactions can happen without context switching the secure process, avoiding enclave entry/exit purging overheads.

## 4. STRONG MULTICORE ISOLATION

Section 4.1 provides details on how the strong isolation mechanisms proposed by the MI6 [17] baseline secure processor are adopted in the context of multicores. Section 4.2 describes IRONHIDE and its formulation of strongly isolated spatial secure and insecure clusters of cores. Lastly, Section 4.3 discusses various system implications of IRONHIDE.

**Figure 3: A strongly isolated multicore MI6 baseline.**

<sup>2</sup>MI6 builds on Intel’s SGX [5] and prior academic works, DAWG and SANCTUM [16, 18].

### 4.1 The Baseline MI6 Architecture

The MI6 [17] secure processor is adopted in the context of multicores. The security monitor in MI6 attests and authenticates secure processes before allowing them to execute in the secure enclave (details in Section 4.3). Figure 3 shows the multicore MI6 setup, where secure and insecure processes are time-multiplexed on the system. Providing all cores to temporally executing processes exposes the microarchitecture state of the secure process to insecure (attacker) processes. Therefore, MI6's strong isolation is implemented for all shared hardware resources in the multicore.

#### 4.1.1 Protecting the Non-Speculative State

To enable strong isolation against non-speculative microarchitecture state attacks (Figure 3:①), the *temporally shared* per-core private resources, such as private caches, TLBs, and core pipeline buffers, are *purged* (flushed) on every secure enclave (process) entry and exit. The *purge* operation performs a *flush-and-invalidate* routine on each core concurrently to clean up the per-core private microarchitecture state. Moreover, each temporally executing process on the multicore MI6 is provided with spatially partitioned large stateful resources, i.e., shared cache slices and TLBs, and DRAM memory regions (Figure 3:①).

**a) Partitioning On-Chip Shared Cache:** Multicores deploy a last-level cache that is logically shared but physically distributed as cache slices across all cores. By default, an entire memory page is hashed across all shared cache slices at cache-line granularity. However, hashing data among all shared cache slices violates strong isolation, as the data of one process may be mapped to the shared cache slices of another process, forming an information leakage channel. To avoid leakage through such a channel, each process’s data must be kept within its own set of shared cache slices (clustered together). Therefore, a *local homing* policy is adopted, where an entire memory page (or data structure) is mapped to a single shared cache slice (by the programmer). Data replication in the last-level cache is disabled to ensure that each shared cache slice is accessed by a single process only. This prevents an insecure process from accessing the secure process’s shared cache slices. Similar static partitioning schemes have recently been proposed in Intel’s Cache Allocation Technology (CAT) [39] and MIT DAWG [16].
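The difference between the default hashed homing and the local homing policy can be illustrated with a small Python sketch. A modulo over the cache-line index stands in for the hardware hash function, and the 64-byte line size, 4 KiB page size, slice numbering, and all function names are assumptions for illustration only.

```python
from typing import Callable, Dict, Iterable, Set

LINE_BYTES = 64    # assumed cache-line size
PAGE_BYTES = 4096  # assumed page size


def hashed_home(addr: int, num_slices: int) -> int:
    """Default policy (modulo stand-in for the hardware hash): successive
    cache lines of a page are homed on different slices."""
    return (addr // LINE_BYTES) % num_slices


def local_home(addr: int, page_to_slice: Dict[int, int]) -> int:
    """Local homing: every line of a page is homed on one chosen slice."""
    return page_to_slice[addr // PAGE_BYTES]


def slices_touched(pages: Iterable[int], home: Callable[[int], int]) -> Set[int]:
    """Set of shared cache slices that hold any line of the given pages."""
    return {home(page * PAGE_BYTES + off)
            for page in pages
            for off in range(0, PAGE_BYTES, LINE_BYTES)}


if __name__ == "__main__":
    NUM_SLICES = 8
    secure_slices = {0, 1, 2, 3}  # slices assumed to belong to the secure process
    secure_pages = [0, 1]
    hashed = slices_touched(secure_pages, lambda a: hashed_home(a, NUM_SLICES))
    local = slices_touched(secure_pages, lambda a: local_home(a, {0: 0, 1: 1}))
    print("hashed homing spills into:", sorted(hashed - secure_slices))
    print("local homing spills into:", sorted(local - secure_slices))
```

In this toy setup, hashed homing spreads the two secure pages over all eight slices, including slices 4 to 7 that the sketch treats as insecure, while local homing keeps them on slices 0 and 1.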

**b) Partitioning Off-Chip Memory:** MI6 partitions the main memory into multiple physically isolated DRAM regions, which are statically distributed across secure and insecure processes. The last-level cache misses of a process are routed to the memory controller(s) that map its respective DRAM region(s). Multicores deploy multiple memory controllers, and DRAM regions are interleaved across all memory controllers to optimize memory bandwidth. However, shared buffers/queues in the memory controllers are an open vector for microarchitecture state attacks. MI6 ensures strong isolation by assuming constant-latency memory controllers, and leaves the exploration of variable-latency controllers to future work. Since commercial multicores deploy variable-latency memory controllers, the multicore MI6 purges all memory controller queues/buffers at each secure process entry and exit. This approach ensures no off-chip memory interference and prevents timing leakage across processes.
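Why the multicore baseline must purge every memory controller can be seen from a sketch of interleaved DRAM mapping. It assumes page-granularity round-robin interleaving across four controllers and 1 MiB per-process regions; these values and the helper names are illustrative assumptions, not the Tile-Gx72 address map.

```python
from typing import Set

INTERLEAVE_BYTES = 4096  # assumed interleaving granularity
NUM_CONTROLLERS = 4      # assumed number of memory controllers


def controller_for(addr: int) -> int:
    """Round-robin interleaving of consecutive chunks across all controllers."""
    return (addr // INTERLEAVE_BYTES) % NUM_CONTROLLERS


def controllers_used(base: int, size: int) -> Set[int]:
    """Controllers whose queues can hold requests for a DRAM region."""
    return {controller_for(a) for a in range(base, base + size, INTERLEAVE_BYTES)}


MiB = 1 << 20
secure_ctrls = controllers_used(0 * MiB, MiB)    # secure process's DRAM region
insecure_ctrls = controllers_used(1 * MiB, MiB)  # insecure process's DRAM region

# Controllers reachable by both processes are shared timing channels. With full
# interleaving that is every controller, so the multicore MI6 baseline purges
# all of them on each enclave entry/exit.
shared_ctrls = secure_ctrls & insecure_ctrls
print("must purge on every transition:", sorted(shared_ctrls))
```

Even though the two DRAM regions are physically disjoint, interleaving places both behind the same controller queues, which is what the purge-all policy addresses.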

#### 4.1.2 Protecting the Speculative State

Speculative microarchitecture state attacks (e.g., Spectre [8, 40]) have shown that a victim (insecure) process can be tricked by an attacker (insecure) process into speculatively accessing secret data by manipulating hardware structures, such as branch predictors and return stack buffers. Later, the victim process performs a second memory request with an address based on the secret data. This evicts the attacker’s primed data from a shared hardware resource. Hence, the attacker process infers information about the secret data by observing the timing difference in accessing primed cache entries.

To mitigate such speculative microarchitecture state attacks, a solution proposed by MI6 is adopted, where the physical address range of the secure process is checked in hardware for each access made by an insecure process (cf. Section 3). In the multicore MI6 baseline, a lightweight hardware check is employed in the core pipeline that tracks memory accesses destined to data mapped in the secure process's DRAM region(s). This is done by checking whether the home location of the data is physically mapped to the memory regions of the secure or the insecure process. If an insecure process initiates a (speculative) request to access the DRAM region(s) of a secure process, the progress of such a request is stalled until it is resolved. Consequently, the memory request is discarded if it is resolved to be on the speculative path, thus incurring no performance overhead. However, if it is resolved as non-speculative, the exception handler detects the request due to the protection check enabled under MI6 strong isolation. In this situation, the memory request is also discarded without performance impact.
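The decision logic of this check can be summarized in a short Python model. It is a behavioral sketch only: the region representation, the `mis_speculated` flag standing in for branch resolution, and all names are assumptions, and the real check lives in the core pipeline rather than in software.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Sequence


class Outcome(Enum):
    ISSUED = auto()                 # request proceeds to the memory system
    DISCARDED_SPECULATIVE = auto()  # squashed on a mis-speculated path
    PROTECTION_FAULT = auto()       # committed illegal access: exception handler


@dataclass(frozen=True)
class Region:
    base: int
    size: int

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


def check_access(addr: int, from_secure: bool,
                 secure_regions: Sequence[Region], mis_speculated: bool) -> Outcome:
    """Model of the MI6-style check on requests that target secure DRAM regions.

    Secure-process requests are not restricted here (they may read the shared
    IPC buffer). An insecure request whose target lies in a secure region is
    held until it resolves and never reaches caches or memory either way.
    """
    targets_secure = any(r.contains(addr) for r in secure_regions)
    if from_secure or not targets_secure:
        return Outcome.ISSUED
    # Stalled here until resolution (modelled by the mis_speculated flag).
    if mis_speculated:
        return Outcome.DISCARDED_SPECULATIVE
    return Outcome.PROTECTION_FAULT
```

Because the offending request is stalled rather than issued, neither outcome leaves a footprint in shared caches, which is the property the check relies on.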

#### 4.1.3 Communication Across Interactive Processes

Similar to MI6 and HotCalls [17, 19], the multicore MI6 baseline adopts shared memory inter-process communication across secure and insecure processes. This allows (1) processes to exchange their respective output states, and (2) the secure enclave kernel to safely communicate with the insecure OS (Figure 3:②). This is achieved using a shared memory region (referred to as the *shared IPC buffer*) to which both processes are granted access. Strong isolation for the shared IPC buffer is assured by allocating it in the dedicated DRAM region(s) of the insecure process. This prevents insecure processes from ever accessing the secure process’s data, which would otherwise violate *strong isolation* guarantees. However, the secure process (enclave) is allowed to access the shared IPC buffer (and thus DRAM regions of insecure processes), which does not violate *strong isolation* because (1) the shared data is considered insecure, and (2) no secure data crosses the DRAM regions dedicated to the secure processes. Indeed, a microarchitecture state attack cannot commence without the insecure process accessing secure data.
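A shared IPC buffer of this kind is commonly organized as a single-producer/single-consumer ring. The Python sketch below models such a ring under that assumption; the class and method names are hypothetical. In the described system the buffer would reside in the insecure process's DRAM region(s), and head/tail updates would need atomic shared-memory operations with appropriate ordering, which this single-process model does not capture.

```python
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class IpcRing(Generic[T]):
    """Single-producer/single-consumer ring modelling one direction of the
    shared IPC buffer. Only non-secret messages (inputs/outputs) go in it."""

    def __init__(self, capacity: int) -> None:
        # One extra slot distinguishes "full" from "empty".
        self._slots: List[Optional[T]] = [None] * (capacity + 1)
        self._head = 0  # next slot to read; advanced only by the consumer
        self._tail = 0  # next slot to write; advanced only by the producer

    def try_send(self, msg: T) -> bool:
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False  # full: the producer retries later
        self._slots[self._tail] = msg
        self._tail = nxt
        return True

    def try_recv(self) -> Optional[T]:
        if self._head == self._tail:
            return None  # empty
        msg = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return msg


if __name__ == "__main__":
    # Hypothetical interaction: insecure vision pipeline -> secure classifier.
    requests: IpcRing[str] = IpcRing(4)
    replies: IpcRing[str] = IpcRing(4)
    requests.try_send("image-0")
    img = requests.try_recv()         # secure side polls the request ring
    replies.try_send(f"{img}:label")  # only the classification result is returned
    print(replies.try_recv())
```

Using one ring per direction keeps each ring single-producer/single-consumer, so each index has exactly one writer; the secure side only ever places non-secret results into memory the insecure process can read.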

#### 4.1.4 Performance Limitations of the Baseline

As discussed in Sections 3 and 4.1.1, the microarchitecture state of time-shared private resources is purged on every secure enclave entry and exit, which further escalates the state reload latency when the same process is temporally switched back in later. Alongside purging, static partitioning of the shared cache slices prevents processes from exploiting data locality in shared cache resources. Lastly, the multicore MI6 baseline does not account for the core-level parallelism of different parallelized processes. All these factors degrade the performance of co-executing processes, and these overheads accumulate as the interactivity across the secure and insecure processes increases.

### 4.2 The Proposed IRONHIDE Architecture

This paper proposes a novel mechanism to overcome the performance limitations of the MI6 architecture, while keeping strong isolation intact. IRONHIDE creates two strongly isolated clusters of cores, where secure and insecure processes are temporally executed within their respective clusters. IRONHIDE adopts the spatial partitioning mechanism for shared cache slices and DRAM regions from the multicore MI6 baseline. However, instead of time multiplexing per-core resources (private caches and TLBs) across secure and insecure processes, IRONHIDE spatially distributes these per-core resources across the secure and insecure clusters. Moreover, the on-chip network is also isolated across clusters to ensure that no packet originating from and destined to the same cluster drifts outside the cluster boundary. Only the network packets intended for application interaction purposes are allowed to cross from one cluster to the other. Lastly, the memory controllers are statically partitioned across secure and insecure clusters to enable strong isolation. These cluster formation mechanisms are discussed next.

**a) Forming Isolated Core Clusters:** IRONHIDE forms two clusters of cores that temporally execute their respective secure and insecure processes. Each cluster is assigned a set of cores, and the respective process threads are pinned to its assigned cluster. For strong isolation, cores assigned to clusters must never overlap each other.

**b) Isolating the On-Chip Network:** For each cluster, the network traffic must be routed such that all request and data packets remain within the boundary of the cluster. Thus, a deterministic network routing protocol (such as X-Y routing) is envisioned in the target multicore, since it enables isolation of network traffic. For example, X-Y routing in a 2-D mesh network topology identifies each router by its coordinates ( $X, Y$ ), and transmits packets first in the  $X$  direction followed by the  $Y$  direction. In a square floor plan, rows of cores are assigned to each cluster with their respective memory controllers on the outside edges, such that X-Y routing never drifts across the clusters. However, with only X-Y routing in place, an entire row of cores must be allocated to any given cluster. If cores within a row are allocated among the two clusters, X-Y routing can drift packets across cores allocated to different clusters, violating strong isolation. Employing Y-X routing mitigates this scenario, since packets are routed in the  $Y$  direction first to ensure they safely traverse to their respective row of cores. Hence, the deterministic routing algorithm supports bidirectional routing [41] (allowing both X-Y and Y-X routing) of packets in the on-chip network.
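The routing argument can be checked with a small Python sketch of dimension-ordered routing on a 2-D mesh. It assumes, for illustration, that the router can select X-Y or Y-X order per packet and that an intra-cluster packet is acceptable only if every hop stays inside its source cluster; the mesh size, the cluster layout, and the function names are hypothetical, and cross-cluster interaction packets are not modelled.

```python
from typing import Dict, List, Optional, Set, Tuple

Coord = Tuple[int, int]  # (x, y) router coordinates


def route(src: Coord, dst: Coord, x_first: bool) -> List[Coord]:
    """Dimension-ordered path: X-Y routing if x_first, otherwise Y-X."""
    pos = list(src)
    path = [src]
    for axis in ((0, 1) if x_first else (1, 0)):
        while pos[axis] != dst[axis]:
            pos[axis] += 1 if dst[axis] > pos[axis] else -1
            path.append((pos[0], pos[1]))
    return path


def build_clusters(width: int, height: int, secure: Set[Coord]) -> Dict[Coord, str]:
    return {(x, y): "secure" if (x, y) in secure else "insecure"
            for x in range(width) for y in range(height)}


def stays_in_cluster(path: List[Coord], cluster_of: Dict[Coord, str]) -> bool:
    owner = cluster_of[path[0]]
    return all(cluster_of[hop] == owner for hop in path)


def pick_route(src: Coord, dst: Coord,
               cluster_of: Dict[Coord, str]) -> Optional[List[Coord]]:
    """Bidirectional routing: try X-Y first, fall back to Y-X."""
    for x_first in (True, False):
        path = route(src, dst, x_first)
        if stays_in_cluster(path, cluster_of):
            return path
    return None  # neither order keeps the packet inside the cluster


if __name__ == "__main__":
    # 4x4 mesh: row y=0 is secure; row y=1 is split (x<2 secure, x>=2 insecure).
    secure = {(x, 0) for x in range(4)} | {(0, 1), (1, 1)}
    clusters = build_clusters(4, 4, secure)
    print(stays_in_cluster(route((0, 1), (3, 0), True), clusters))  # X-Y crosses (2, 1)
    print(pick_route((0, 1), (3, 0), clusters))                     # Y-X stays secure
```

With whole rows per cluster, X-Y alone suffices; once a row is split, the example packet from (0, 1) to (3, 0) would cross the insecure router at (2, 1) under X-Y but stays inside the secure cluster under Y-X.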

**c) Isolating DRAM Regions:** For each cluster, the memory controllers must be strongly isolated such that the respective DRAM region(s) of the process being executed in that cluster are accessible. Unlike the multicore MI6 baseline, the memory controllers are statically partitioned among the two clusters<sup>3</sup>. The respective DRAM region(s) are mapped in such a way that they are accessible from their dedicated memory controller(s). For strong isolation guarantees, memory controller(s) assigned to clusters must never overlap each other. Specifically, the secure cluster dedicates the DRAM region(s) of all secure processes to the memory controller(s) that allow any given secure process to access its respective physical memory channels, banks, and rows. At each secure process context switch, the queues/buffers of memory controller(s) assigned to the secure cluster are purged to ensure strong isolation. Moreover, the insecure cluster has its own dedicated memory controller/channels, and it is free to context switch without any purging overheads.
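A minimal Python sketch of this static controller assignment follows. It assumes four controllers split two and two between the clusters, with each cluster's DRAM region(s) interleaved round-robin across only its own controllers; the class, its methods, and these parameters are hypothetical illustrations of the policy described above, not IRONHIDE's implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet

INTERLEAVE_BYTES = 4096  # assumed interleaving granularity


@dataclass(frozen=True)
class ControllerPlan:
    secure: FrozenSet[int]    # controllers pinned to the secure cluster
    insecure: FrozenSet[int]  # controllers pinned to the insecure cluster

    def __post_init__(self) -> None:
        if self.secure & self.insecure:
            raise ValueError("memory controllers must not overlap across clusters")

    def controller_for(self, addr: int, cluster: str) -> int:
        """Interleave a cluster's DRAM region(s) over that cluster's controllers only."""
        ctrls = sorted(self.secure if cluster == "secure" else self.insecure)
        return ctrls[(addr // INTERLEAVE_BYTES) % len(ctrls)]

    def purge_on_context_switch(self, cluster: str) -> FrozenSet[int]:
        """Only a secure-cluster context switch purges, and only its own controllers."""
        return self.secure if cluster == "secure" else frozenset()


plan = ControllerPlan(secure=frozenset({0, 1}), insecure=frozenset({2, 3}))
```

Compared with the interleaved baseline, each cluster trades some peak bandwidth (fewer controllers) for controller queues that are never shared with the other cluster, so the insecure cluster can context switch without purging.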

As shown in Figure 4:(a), the formation of spatially isolated secure and insecure clusters enables each cluster to temporally execute its respective processes while fully utilizing its dedicated hardware resources, i.e., private caches and TLBs, shared cache slices and TLBs, and memory controllers/channels. The key insight here is that the secure process is pinned to the secure cluster, where it spatially interacts with insecure processes via the shared IPC buffer adopted from the multicore MI6 baseline. IRONHIDE does not require the secure process (enclave) to purge its private microarchitecture state on each entry/exit for interactive applications. However, statically partitioning core-level hardware resources across secure and insecure clusters leads to under-utilization of hardware cache and core resources. To tackle this challenge, IRONHIDE proposes dynamic hardware isolation.

#### 4.2.1 Dynamic Hardware Isolation in IRONHIDE

To adapt to the performance variations among the processes of a given secure–insecure process tuple, IRONHIDE implements dynamic hardware isolation, which allows the secure cluster to give up or gain cores while still guaranteeing strong isolation. The secure cluster deploys a *core re-allocation predictor* that statically finds a *single* core-level resource binding for each interactive secure–insecure process tuple. When an interactive application is scheduled, the pre-computed number of cores for each cluster is configured for execution (details in Section 4.3). Note that this cluster reconfiguration is done once for each interactive secure–insecure process tuple invocation, which ensures bounded leakage [25, 26].
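The predictor's actual algorithm is deferred to Section 4.3 and is not reproduced here. Purely as an illustration of a single, once-per-invocation binding, the Python sketch below uses one simple heuristic of our own: given assumed execution-time estimates as a function of core count, it picks the secure-cluster size that minimizes the slower cluster's time. The function name, the cost model, and the example curves are hypothetical; only the 72-core total mirrors the Tile-Gx72 prototype.

```python
from typing import Callable

TimeModel = Callable[[int], float]  # estimated execution time given a core count


def pick_secure_cores(total_cores: int, secure_time: TimeModel,
                      insecure_time: TimeModel) -> int:
    """Hypothetical predictor: choose the secure-cluster core count that
    minimizes the slower cluster's estimated time. Run once per
    secure-insecure tuple invocation, so only one binding is ever chosen."""
    best_k, best_cost = 1, float("inf")
    for k in range(1, total_cores):  # each cluster keeps at least one core
        cost = max(secure_time(k), insecure_time(total_cores - k))
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


if __name__ == "__main__":
    # Assumed, perfectly parallel workloads: the secure process has 3x the work.
    k = pick_secure_cores(72, lambda c: 300.0 / c, lambda c: 100.0 / c)
    print("secure cores:", k, "insecure cores:", 72 - k)
```

With these assumed curves, the heuristic gives the secure cluster three quarters of the chip (54 of 72 cores), balancing the two clusters' estimated completion times.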

**Strong Isolation Implications:** If the secure and insecure clusters are already balanced, no cluster reconfiguration is required and strong isolation stays intact. However, when the clusters' core-level resources are reconfigured, the private caches/TLBs and network routers of the cores reallocated from one cluster to the other become shared across processes, which violates strong isolation guarantees.

<sup>3</sup>An alternative is to adopt static memory bandwidth partitioning [42, 43]. However, for such a mechanism to ensure strong isolation, the on-chip network must still guarantee strong isolation between clusters.

**Figure 4: The proposed IRONHIDE architecture. (a) shows strongly isolated static clusters of cores. (b) shows the case where core-level resources are reconfigured under dynamic hardware isolation.**

Figure 4:(b)-(1) depicts a scenario where the secure cluster gives away a set of shaded cores to the insecure cluster. Each core given up by the secure cluster retains the secure process's state in its private pipeline buffers, caches, and TLBs, which the insecure process now temporally shares. The insecure process can monitor the private resources of these reallocated cores to leak the microarchitecture state of the secure process [14]. Moreover, the secure cluster's data remains pinned to the shared cache slice of each reallocated core. The secure process's accesses to this data contend with the insecure process's accesses on the shared network routers, potentially leaking secure data. Similarly, Figure 4:(b)-(2) shows a scenario where the secure cluster gains a set of cores from the insecure cluster. The insecure cluster can prime the private caches and TLBs of each such core, and probe them after acquiring the core back via a later cluster reconfiguration. Furthermore, the insecure process's data remains pinned to the shared cache slices of the cores gained by the secure cluster, where it is still accessible by the insecure cluster. The insecure cluster can contend for these network routers and create covert timing channels.

**Guaranteeing Strong Isolation:** Clearly, cluster reconfiguration via dynamic hardware isolation partially violates strong isolation, as the private memory resources and network routers (indirectly exposing shared cache slices) of the reallocated cores are shared across clusters. To regain strong isolation, the following mechanisms are adopted.

*a) Private Cache and TLB Flush-and-Invalidate:* To prevent the directly exposed private resources from leaking the secure cluster's data, IRONHIDE *flushes and invalidates* the core pipeline buffers and private caches (and TLBs) of all reallocated cores. This is done in the same way as in the multicore MI6 baseline, but only once per *secure-insecure* process tuple invocation.

*b) Data Re-Allocation in Shared Caches and TLBs:* The shared cache (and TLB) resources of the re-allocated cores are indirectly exposed through the shared network routers. To enforce strong isolation, IRONHIDE *re-allocates* the process's data structures (memory pages) homed on the shared cache slices of the dynamically re-allocated cores. This mechanism first unmaps each data structure from its current home (cache slice), which propagates all dirty data to off-chip memory. Finally, the data structure is re-mapped to the shared cache slice(s) of the reconfigured cluster that owns it (e.g., secure data to the reconfigured secure cluster's slices). Consequently, strong isolation for the on-chip network is regained, as the network routers are no longer shared across clusters.
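The data re-allocation step can be sketched as follows. The `Page` record, the `reallocate_pages` helper, and the round-robin choice of a new home slice are assumptions made for illustration (the text does not specify a placement policy); on the prototype, the unmap, set-home, and remap steps correspond to the TMC calls described in Section 5. Pages homed on a reallocated core have their dirty data written back and are re-homed within the cluster that owns them, so no page remains homed across the cluster boundary.

```python
from dataclasses import dataclass


@dataclass
class Page:
    owner: str          # "secure" or "insecure"
    home: int           # core whose shared L2 slice homes this page
    dirty: bool = False


def reallocate_pages(pages, reallocated, cluster_cores, written_back):
    """Hypothetical sketch: re-home every page whose L2 home is a reallocated core.

    pages: dict page_id -> Page
    reallocated: set of core ids that changed clusters
    cluster_cores: dict owner -> list of the owner's cores after reconfiguration
    written_back: list collecting page ids whose dirty data went to DRAM
    """
    next_slot = {owner: 0 for owner in cluster_cores}
    moved = []
    for pid in sorted(pages):
        page = pages[pid]
        if page.home not in reallocated:
            continue
        # Unmap: propagate dirty data to off-chip memory.
        if page.dirty:
            written_back.append(pid)
            page.dirty = False
        # Re-map: choose a new home round-robin (assumed policy) in the owning cluster.
        cores = cluster_cores[page.owner]
        page.home = cores[next_slot[page.owner] % len(cores)]
        next_slot[page.owner] += 1
        moved.append(pid)
    return moved
```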

*c) Cluster Reconfiguration Procedure:* On every dynamic hardware re-allocation event, IRONHIDE first stalls all cores in the system. The re-allocated cores then concurrently execute the *flush-and-invalidate* and *memory-fence* routines, which flush the data held in private resources to its respective home shared cache slices. The *re-allocation* routine is then invoked. Finally, both clusters resume execution with the new distribution of thread work.
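The ordering constraints of this procedure can be sketched in Python. This is a hypothetical model: private caches and shared L2 slices are plain dictionaries, a thread pool stands in for the reallocated cores flushing concurrently, and applying every write-back before calling the re-allocation routine plays the role of the memory fence. The function name and arguments are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor


def reconfigure(reallocated, private_lines, home_of, l2, reallocate):
    """Hypothetical sketch of the order: stall, flush+fence, re-allocate, resume.

    private_lines: dict core -> dict addr -> value (dirty lines in private caches)
    home_of: function mapping an address to the core whose L2 slice homes it
    l2: dict core -> dict addr -> value (shared cache slices)
    reallocate: callback performing the shared-cache data re-allocation
    """
    log = ["stall"]  # 1. every core in the system is stalled

    def flush_and_invalidate(core):
        lines = private_lines[core]
        writebacks = [(home_of(addr), addr, value) for addr, value in lines.items()]
        lines.clear()  # invalidate the private copies
        return writebacks

    # 2. Reallocated cores flush concurrently.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(flush_and_invalidate, sorted(reallocated)))
    # Fence: all write-backs reach their home L2 slices before step 3 begins.
    for writebacks in results:
        for home, addr, value in writebacks:
            l2[home][addr] = value
    log.append("flush+fence")

    reallocate()  # 3. only now are shared-cache pages re-homed
    log.append("reallocate")
    log.append("resume")  # 4. clusters resume with the new thread distribution
    return log
```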

#### 4.2.2 The Execution Model of IRONHIDE

Similar to multicore MI6, IRONHIDE allows the execution of multiple secure and insecure processes, where these processes temporally execute in their respective clusters. Like any other processor, IRONHIDE executes multiple insecure (ordinary) processes temporally without flushing the microarchitecture state of the resources allocated to the spatially isolated insecure cluster. In an application with no secure process, the system is reconfigured to a single cluster utilizing all available core-level resources. In such a scenario, the data for the secure cluster resides in its statically isolated DRAM regions that are never accessed by the insecure cluster.

When multiple secure applications execute, IRONHIDE adopts the time-sharing strategy for secure processes proposed by MI6. The private resources (such as private caches, TLBs, and the core pipeline) are flushed/purged on every secure process (enclave) entry and exit, since these secure processes are mutually distrusting. The shared cache slices (and TLBs) belonging to the secure cluster are either (1) partitioned (set-partitioning) across all or a subset of the secure enclaves, or (2) flushed/invalidated on every secure process context switch. Since neither approach leaves enclave state behind in these resources on a secure process context switch, both provide a similar level of security guarantees. Lastly, the memory of the secure processes is statically partitioned into physically isolated DRAM regions. For strong isolation, the memory controller(s) of the secure cluster are purged on every secure process context switch.
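Option (1) can be illustrated with a simple set-partitioning sketch that gives each enclave a disjoint, contiguous range of sets within a shared cache slice. The 256-set slice, 64-byte lines, and contiguous assignment are illustrative assumptions rather than Tile-Gx72 parameters, and `partition_sets` and `set_index` are hypothetical helpers. Because an enclave's addresses can only index its own sets in this model, no enclave can evict or observe lines in another enclave's sets.

```python
def partition_sets(num_sets, enclaves):
    """Hypothetical helper: give each enclave a disjoint, contiguous range of sets."""
    if not enclaves or len(enclaves) > num_sets:
        raise ValueError("need between 1 and num_sets enclaves")
    share, extra = divmod(num_sets, len(enclaves))
    ranges, start = {}, 0
    for i, enclave in enumerate(enclaves):
        size = share + (1 if i < extra else 0)  # spread leftover sets over the first enclaves
        ranges[enclave] = range(start, start + size)
        start += size
    return ranges


def set_index(addr, enclave, ranges, line_size=64):
    """Map an address only into the enclave's own sets (64-byte lines assumed)."""
    sets = ranges[enclave]
    return sets.start + (addr // line_size) % len(sets)
```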

IRONHIDE's key performance advantage arises when a *secure-insecure* process tuple of an interactive application is mapped on the system. Since IRONHIDE spatially pins the secure process to the secure cluster, the secure process executes and interacts with the insecure process(es) without incurring frequent purging overheads. Moreover, with dynamic hardware isolation, the secure and insecure clusters execute their respective processes with near-optimal core-level resources, resulting in better cache utilization, improved data locality (reuse), and load-balanced execution.

### 4.3 System Implications of IRONHIDE

**Implications of the Secure Kernel:** Similar to the security monitor in MI6, a secure kernel in IRONHIDE ensures strong isolation across the secure and insecure clusters. The secure kernel executes alongside the secure processes in the secure cluster. To ensure that only secure applications temporally execute in the secure cluster, it implements signature checking and attestation mechanisms. Additionally, all secure processes within the secure cluster operate by proxying their OS services to the OS executing in the insecure cluster. For strong isolation, the secure kernel verifies all decisions made by the untrusted OS, e.g., that its resource management decisions leave no DRAM regions overlapping among processes. If a check fails, the secure kernel raises an exception and disallows execution of the secure process on the system. Similar to [17, 18, 23], the secure kernel runs in *machine mode*, managing its own memory and hardware resources. A secure boot protocol is also enabled to ensure that the secure kernel has not been compromised. Furthermore, the secure kernel is expected to intervene on page faults and interrupts to preserve strong isolation.
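One of the checks the secure kernel performs, rejecting overlapping DRAM regions, can be sketched as below. The `(base, size)` region format and the `verify_dram_regions` name are hypothetical, and raising `PermissionError` stands in for the secure kernel's exception that disallows execution. A pairwise comparison is used for clarity rather than efficiency.

```python
from itertools import combinations


def verify_dram_regions(regions):
    """Hypothetical check: reject the OS decision if two processes' regions overlap.

    regions: dict process -> list of (base, size) byte ranges.
    """
    intervals = [(proc, base, base + size)
                 for proc, ranges in regions.items() for base, size in ranges]
    for (p1, b1, e1), (p2, b2, e2) in combinations(intervals, 2):
        # Half-open ranges [b, e) overlap iff each starts before the other ends.
        if p1 != p2 and b1 < e2 and b2 < e1:
            raise PermissionError(f"DRAM regions of {p1} and {p2} overlap")
    return True
```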

**Implications of the Core Re-allocation Predictor:** As described in Section 4.2.1, the secure kernel implements a predictor that load-balances the allocation of core resources between the secure and insecure clusters. The predictor statically finds a (one-time) resource binding for each interactive *secure-insecure* process tuple, and the cluster reconfiguration is done once per process tuple based on this binding. Although an application tuple may be sensitive to varying core allocations during its execution, IRONHIDE deliberately limits cluster reconfiguration to ensure bounded leakage [25, 26]. This single resource binding is pre-computed manually by observing the performance counters and core-level resource requirements of each *secure-insecure* process tuple. Automating the computation of the right core-level resource binding for each process tuple is left as future work.
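The paper computes the resource binding manually and leaves automation as future work. Purely as a hypothetical illustration of what an automated predictor could do, the sketch below picks the secure-cluster size that minimizes the slower stage of the two-stage secure/insecure pipeline, given per-input times profiled at different core counts. The function name, its inputs, and the bottleneck criterion are assumptions, not IRONHIDE's predictor.

```python
def choose_binding(secure_time, insecure_time, total=64):
    """Hypothetical predictor: return (secure_cores, bottleneck) for the best split.

    secure_time / insecure_time map a core count to a profiled time per input.
    In a two-stage pipeline the steady-state rate is set by the slower stage,
    so the best split minimizes max(secure stage, insecure stage).
    """
    best = None
    for n in sorted(secure_time):
        m = total - n
        if m not in insecure_time:
            continue  # no profile for the complementary insecure cluster size
        bottleneck = max(secure_time[n], insecure_time[m])
        if best is None or bottleneck < best[1]:
            best = (n, bottleneck)
    if best is None:
        raise ValueError("no profiled split covers all cores")
    return best
```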

**Implications of Highly Interactive Applications:** This work considers a variety of interactive applications in which the interactivity across secure and insecure processes lies in the range of 380 to 400 interactions per second. For these applications, IRONHIDE improves performance over the multicore MI6 baseline by minimizing the purging of microarchitecture state. Hot-Calls [19] quantifies the overheads of Intel's SGX enclave entry (ECALL) and exit (OCALL) when the secure enclave interacts with the untrusted OS. In Hot-Calls, each secure enclave entry and exit incurs an overhead of  $\sim 2.5$  to  $5 \mu s$  in Intel's SGX, which does not include the overheads associated with microarchitecture state purging. Even with such small overheads, Hot-Calls reports a performance degradation of  $\sim 40\%$  when a database application generates  $\sim 200,000$  requests per second that are served by the insecure OS. IRONHIDE is therefore expected to mitigate these overheads and achieve significant performance benefits over MI6 for such highly interactive applications.

## 5. METHODS

IRONHIDE is prototyped on a real multicore *Tilera® Tile-Gx72™* processor. *Tile-Gx72* provides all the hardware reconfiguration capabilities needed for the proposed temporal and spatial strong isolation mechanisms. Its API library, Tilera Multicore Components (TMC), includes facilities that are used to form clusters of cores, manage network traffic across clusters, regulate on-chip and off-chip memory accesses, and manage shared cache data placement.

**Architectural Specifications:** *Tilera® Tile-Gx72™* is a tiled multicore architecture comprising 72 tiles, where each tile consists of a 64-bit multi-issue in-order core, private level-1 (L1) data and instruction caches of 32 KB each, private instruction and data TLBs of 32 entries each, and a 256 KB slice of the shared level-2 (L2) cache (an aggregate LLC capacity of 18 MB). Moreover, it includes five independent 2-D mesh networks with *X-Y routing*: one for on-chip cache coherence traffic, one for memory controller traffic, and the others for core-to-core and I/O traffic. The off-chip memory (DRAM) is accessible through four on-chip 72-bit ECC-protected DDR memory controllers that are attached to independent physical memory channels (DIMMs). The on-chip networks enable communication between neighboring tiles and the memory controllers.
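The parameters above can be captured in a small configuration record; the class name is hypothetical and only the values stated in this section are used. It also shows where the capacities used in this section come from: 72 slices of 256 KB give the 18 MB LLC, and the 32 slices given to each process in the MI6 baseline model amount to 8 MB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileGx72:
    """Hypothetical record of the Tile-Gx72 parameters stated in the text."""
    tiles: int = 72
    l1d_kb: int = 32
    l1i_kb: int = 32
    tlb_entries: int = 32
    l2_slice_kb: int = 256
    memory_controllers: int = 4

    def llc_mb(self, slices=None):
        """Aggregate shared L2 capacity in MB for a number of slices (default: all)."""
        slices = self.tiles if slices is None else slices
        return slices * self.l2_slice_kb / 1024
```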

**Modeling Multicore MI6 (BASELINE):** The *secure-insecure* process tuple executes temporally, where 64 of the available 72 cores are time-shared across processes based on application interactions. Each process is given a static partition of 32 L2 slices and half of the DRAM regions. The default *hash-for-homing* scheme is overridden with the *local homing* scheme, which maps each process's data structures to specific L2 slices using `tmc_alloc_set_home(&alloc, core_id)`. Moreover, *L2-replication* is disabled so that only one process can access any given L2 cache slice.

All cores, their L1 caches and TLBs, as well as the memory controllers are available to any process that executes on the processor. Hence, they must be purged/flushed on each secure process entry and exit to ensure the strong isolation of MI6. To *flush-and-invalidate* the private L1, a dummy buffer of size equal to the cache size is read into each L1 cache (*in parallel*). Reading this buffer evicts all of the secure process's data from the private L1 caches. Then, a memory fence operation (`tmc_mem_fence()` call) is performed to ensure that dirty data propagates to the respective L2 slices. Similarly, the TLBs are flushed using Tilera-specific user commands; the L1s and TLBs are purged in parallel. Finally, the queues/buffers of all memory controllers are purged using the `tmc_mem_fence_node(controller_id)` call.
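Why reading a cache-sized dummy buffer purges the L1 can be seen in a toy cache model. The sketch assumes a 32 KB cache with 64-byte lines, 2-way associativity, modulo set indexing, and true LRU replacement; these are illustrative assumptions, not documented Tile-Gx72 L1 parameters, and other replacement policies may require a larger or differently shaped buffer. The model tracks only which lines are resident; write-back of dirty data, which the memory fence handles on the prototype, is not modeled.

```python
from collections import OrderedDict


class LRUCache:
    """Toy set-associative cache with true LRU replacement (residency only)."""

    def __init__(self, size_bytes=32 * 1024, line=64, ways=2):
        self.line, self.ways = line, ways
        self.num_sets = size_bytes // (line * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        tag = addr // self.line                  # line address
        lines = self.sets[tag % self.num_sets]   # modulo set indexing
        if tag in lines:
            lines.move_to_end(tag)               # hit: now most recently used
            return
        if len(lines) == self.ways:
            lines.popitem(last=False)            # full set: evict the LRU line
        lines[tag] = True

    def resident(self):
        return {tag for lines in self.sets for tag in lines}


def flush_with_dummy_buffer(cache, base, size_bytes):
    """Read a contiguous, line-aligned dummy buffer the size of the cache."""
    for addr in range(base, base + size_bytes, cache.line):
        cache.access(addr)
```

After the dummy read, every set holds exactly `ways` dummy lines, so none of the previously resident lines survive in this model.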

**Modeling IRONHIDE:** The secure and insecure clusters of cores are formed by pinning each process's threads to its respective cores via `tmc_cpus_set_my_cpu(tid)`. The L2 cache slices are allocated to their respective clusters using the *local homing* scheme. A cluster's accesses to its physically isolated DRAM regions are realized by forwarding its L2 miss traffic to dedicated memory controllers via `tmc_alloc_set_nodes_interleaved(&alloc, pos)`, where `pos`<sup>4</sup> is the bit-mask of the memory controllers to be selected. Note that `alloc` in these API calls denotes a *Tilera®*-specific memory allocation type. *Tile-Gx72™* implements a smart *X-Y routing* with a 2-D mesh network topology, which isolates the network traffic by routing each packet to/from the allocated clusters' memory resources.

<sup>4</sup>For the secure cluster, `pos = 0b0011` to enable  $MC_0$  and  $MC_1$ , whereas `pos = 0b1100` ( $MC_2$  and  $MC_3$ ) for the insecure cluster.
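The `pos` masks in footnote 4 can be decoded with ordinary bit operations, assuming bit i selects $MC_i$ as the footnote implies. The helper below is plain Python for illustration, not part of TMC; the disjointness assertion expresses the requirement that the two clusters never share a memory controller.

```python
def controllers_from_mask(pos, num_controllers=4):
    """Decode a controller bit-mask: bit i set selects memory controller MC_i."""
    if pos < 0 or pos >> num_controllers:
        raise ValueError("mask selects a non-existent controller")
    return [i for i in range(num_controllers) if (pos >> i) & 1]


SECURE_POS = 0b0011    # MC_0 and MC_1
INSECURE_POS = 0b1100  # MC_2 and MC_3

# Strong isolation requires the two clusters' controller sets to be disjoint.
assert SECURE_POS & INSECURE_POS == 0
```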

The dynamic hardware isolation capabilities of IRONHIDE are also supported on the prototype. At each *secure-insecure* process tuple invocation, the private L1 and TLB *flush-and-invalidate* mechanism from the multicore MI6 baseline is invoked for the re-allocated cores. To *re-allocate* data structures (pages) in L2s, the pages are first un-mapped from their current L2 home cache slices using the *Tilera®*-specific API call `tmc_alloc_unmap (*addr, size)`. Next, the new L2 home for each page is set using the `tmc_alloc_set_home (&alloc, core_id)` API call. Finally, each page is mapped to its new L2 home using the `tmc_alloc_remap (&alloc, size, new_size)` API call. Note that the prototype contains only private TLBs (which are flushed as explained above), so only the shared L2 cache slices need to be *re-allocated*. If present, shared TLBs would also be re-allocated for strong isolation.

**Benchmarks & Execution Settings:** Four different classes of safety-critical secure algorithms are evaluated, namely: (1) Graph Algorithms [44, 45]: Single Source Shortest Path (SSSP), PageRank (PR), and Triangle Counting (TC), executing on the California road network graph [46]; (2) Mission Planning: Artificial Bee Colony [47] (ABC), adopted from an advanced driver-assistance system, with an input of a real-world road scenario; (3) Perception Algorithms: ALEXNET and Squeeze-Net (SQZ-NET), processing inputs from ImageNet [48]; and (4) Encryption Algorithm: Advanced Encryption Standard (AES). Additionally, four different sets of algorithms are considered as insecure processes, namely: (1) Matrix Computation Algorithms [49]: FFT, RADIX, CHOLESKY, RAYTRACE, and LU; (2) Image Processing Vision Pipeline [50] (VISION), which performs image processing kernels on RAW images; (3) Graph Generation Algorithm [51] (GRAPH), which reads values at various time intervals from sensors deployed on the roads and generates temporal graph updates to the California road network graph; and (4) Query Generation [52] (QUERY), which periodically generates database queries for systems (e.g., ATM) to process.

Three different classes of interactive *secure-insecure* process tuples are evaluated, namely: (1) *Real-time Graph Processing*: each of the three secure graph algorithms interacts with the insecure graph generation algorithm, which generates temporal graph inputs for the secure graph algorithm process; (2) *Real-Time Perception and Mission Planning*: the insecure VISION pipeline interacts with each of the two secure perception processes, as well as the mission planning process, by feeding them processed images as inputs; and (3) *Query Encryption*: the insecure QUERY process generates queries for the secure AES process to encrypt, which enables secure query processing [53]. Interactions across these processes are carried out via the *shared IPC buffer*, and in general the interactivity between the secure and insecure processes is measured at  $\sim 400$  interactions per second. Note that this interactivity is well below that observed in Hot-Calls [19], where the evaluated interactive applications measure in excess of 200K interactions per second. Therefore, IRONHIDE is expected to deliver even higher performance gains over the multicore MI6 baseline for the Hot-Calls applications, which we leave for future work.
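The evaluated interactive tuples can be summarized as a small configuration table. The dictionary layout and the `all_tuples` helper are illustrative; the process labels follow the notation used in the figures (e.g., $\langle TC, GRAPH \rangle$, where GRAPH is the graph generation algorithm).

```python
# Interactive (secure, insecure) process tuples, grouped by class.
INTERACTIVE_TUPLES = {
    "Real-time Graph Processing": [("SSSP", "GRAPH"), ("PR", "GRAPH"), ("TC", "GRAPH")],
    "Real-Time Perception and Mission Planning": [
        ("ALEXNET", "VISION"), ("SQZ-NET", "VISION"), ("ABC", "VISION"),
    ],
    "Query Encryption": [("AES", "QUERY")],
}


def all_tuples(groups=INTERACTIVE_TUPLES):
    """Flatten the grouped configuration into a list of (secure, insecure) tuples."""
    return [pair for pairs in groups.values() for pair in pairs]
```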



**Figure 5: Comparison of IRONHIDE against the BASELINE scheme for various interactive applications. Geometric mean completion times are reported.**

**Measurements:** Both multicore MI6 and IRONHIDE handle interactions as a pipeline, where the insecure process first reads and processes the input and then feeds it to the secure process. The baseline MI6 executes these processes temporally on the multicore. In contrast, IRONHIDE pipelines execution: the secure and insecure processes are pinned to their respective clusters of cores and process inputs concurrently. For both setups, the input pipeline is first filled with 20 inputs to reach steady state, and then the completion time is measured for a fixed number of inputs. For IRONHIDE, each *secure-insecure* process tuple is first executed with an initial cluster configuration of 32 cores per cluster. The system is then reconfigured to a pre-computed core-level resource binding after purging and re-allocating the hardware resources of the re-allocated cores. These overheads are measured and added to the completion time. Similarly, for multicore MI6, the completion time includes the time-sharing of both processes, as well as the time taken to purge microarchitecture state on each secure process entry and exit.
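The difference between the two execution styles can be captured by an idealized steady-state model. It assumes constant per-input stage times, excludes the pipeline fill (the 20-input warm-up), and ignores the fact that stage times differ between the two schemes because their cluster sizes and cache behavior differ; the function names are hypothetical and this is not the paper's measurement harness.

```python
def mi6_completion(n_inputs, t_insecure, t_secure, purge_per_event):
    """Temporal execution: stages run back to back, with a purge on entry and on exit."""
    return n_inputs * (t_insecure + t_secure + 2 * purge_per_event)


def ironhide_completion(n_inputs, t_insecure, t_secure, reconfig_overhead):
    """Pipelined steady state: a one-time overhead, then the slower stage sets the pace."""
    return reconfig_overhead + n_inputs * max(t_insecure, t_secure)
```

With illustrative stage times of 1 and 2 time units, a purge cost of 0.5 units per event, and a one-time overhead of 5 units, 100 inputs take 400 units under the temporal model and 205 units under the pipelined one.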

## 6. EVALUATION

### 6.1 Interactive Application Analysis

Figure 5 shows the completion time comparison of the MI6 BASELINE against IRONHIDE. The reported numbers show the geometric mean completion time (left y-axis) for each interactive application tuple (x-axis). For each application, the secure process interacts with the insecure process for an average of 13.3K inputs<sup>5</sup> over an average of 70 seconds of MI6 baseline execution, yielding an interactivity rate of  $\sim 400$  secure process entry and exit events per second.

IRONHIDE improves the geometric mean completion time by  $\sim 32\%$  over the MI6 BASELINE. MI6 purges per-core private resources and memory controller queues on each secure process interaction. This overhead is measured as  $\sim 0.24ms$  per event; since each of the 13.3K inputs incurs it on both secure process entry and exit, the total purging overhead is  $\sim 6.45$  seconds (or  $\sim 6.4\%$  of the average per-input execution time). This overhead appears as the BASELINE flushing component, while the remaining completion time component is broken down into the execution times of the secure and insecure processes. In contrast, IRONHIDE spatially pins the execution of the *secure-insecure* process tuple and does not context switch on each interaction event. Therefore, it does not incur the purging overhead observed in MI6. Instead, it incurs an average one-time overhead of  $\sim 15ms$  per application tuple invocation for L1/TLB purging and L2 cache slice re-allocation on the cores that are re-allocated (given up or gained) by the secure cluster. The marker on top of each interactive application bar (right y-axis) shows the geometric mean number of cores that are given up (below 32) or gained (above 32) by the secure cluster under IRONHIDE. The results from Figure 5 indicate that IRONHIDE reduces the purging component of completion time by  $430\times$  relative to the MI6 BASELINE. However, purging is only one component of the total completion time, so the overall benefit is a 32% improvement.

<sup>5</sup>A tuple executes 500, 1K, 5K, 10K, and 50K inputs, and the reported completion time is the average across these runs.
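The reported purging figures can be checked with back-of-the-envelope arithmetic using only numbers from this section. The estimate of about 6.38 s lands close to the reported ~6.45 s, and dividing the reported total by the ~15 ms one-time overhead reproduces the 430× reduction.

```python
inputs = 13_300                  # average inputs per tuple
purge_per_event_s = 0.24e-3      # ~0.24 ms MI6 purge per entry or exit
events = 2 * inputs              # one secure entry and one exit per input
mi6_purge_s = events * purge_per_event_s

reported_mi6_purge_s = 6.45      # reported total MI6 purging overhead
ironhide_overhead_s = 15e-3      # ~15 ms one-time IRONHIDE overhead
reduction = reported_mi6_purge_s / ironhide_overhead_s

print(f"estimated MI6 purging: {mi6_purge_s:.2f} s")  # 6.38 s
print(f"purging reduction: {reduction:.0f}x")         # 430x
```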

The geometric mean results also show that the completion time of the secure and insecure processes (excluding the purging overheads) is  $\sim 62.6$  seconds under the MI6 BASELINE, while IRONHIDE executes its interactive processes (excluding the purging and re-allocation overheads) in  $\sim 52.3$  seconds. This improvement is due to better utilization of on-chip resources under IRONHIDE. First, purging the private microarchitecture state under the MI6 BASELINE prevents each process from exploiting private cache locality, essentially thrashing the L1 cache and TLBs on each purge event. IRONHIDE avoids this overhead, since it lets each secure and insecure process exploit its private resources better. Second, statically partitioned L2 slices limit the shared cache usage of processes, as a process may demand larger cache capacity for improved performance. The MI6 BASELINE operates with a fixed static partition, while IRONHIDE implements dynamic hardware isolation to improve the load-balancing of last-level cache capacity by re-allocating L2 cache slice resources at the invocation of a secure-insecure process tuple. Caching behavior is analyzed in further detail in Section 6.1.1. Third, the MI6 BASELINE provides each process with all available cores to enable core-level parallelism. However, nearly all processes considered in this work are memory and/or synchronization bound to some degree, and hence do not scale linearly at the high core count of the prototype multicore. IRONHIDE takes into account the core-level parallelism exposed by each process in a tuple and makes its core-level resource allocation decision accordingly.
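Combining the reported geometric-mean components gives a rough check of the overall result. This is an approximation (sums of geometric means are not geometric means of sums), but it yields a speedup of about 1.32×, consistent with the 32% improvement when that figure is read as a speedup ratio; under that reading, completion time drops by roughly 24%.

```python
mi6_exec_s, mi6_purge_s = 62.6, 6.45                 # MI6: execution + purging
ironhide_exec_s, ironhide_overhead_s = 52.3, 0.015   # IRONHIDE: execution + one-time overhead

mi6_total = mi6_exec_s + mi6_purge_s
ironhide_total = ironhide_exec_s + ironhide_overhead_s
speedup = mi6_total / ironhide_total                  # completion-time ratio
time_saved = 1 - ironhide_total / mi6_total           # fractional reduction

print(f"speedup {speedup:.2f}x, completion time reduced by {time_saved:.0%}")
```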

In summary, IRONHIDE improves completion time over the MI6 BASELINE by 32% due to (1) a dramatic decrease in microarchitecture state purging overhead, (2) improved private cache/TLB utilization, (3) improved last-level cache capacity allocation, and (4) improved utilization of core-level resources.

#### 6.1.1 Analyzing the Cache Miss Behavior

Figure 6:(a) depicts the private L1 cache miss rates for each interactive application under the MI6 BASELINE and IRONHIDE. Compared to the MI6 BASELINE, IRONHIDE reduces the private L1 cache miss rates by  $2\times$  to  $4.5\times$ , with the exception of the  $\langle TC, GRAPH \rangle$  tuple. The MI6 BASELINE experiences L1 cache thrashing as a consequence of frequent L1 cache purging. In contrast, the spatial execution of the processes under IRONHIDE pins the respective threads to each cluster's cores, which dramatically improves private cache utilization. In the  $\langle TC, GRAPH \rangle$  tuple, the TC process exhibits little L1 cache locality, while the GRAPH process has a small private working set. Therefore, the MI6 BASELINE purge does not significantly impact the L1 cache miss behavior. Moreover, as observed in Figure 5, TC executes in a secure cluster configured with only two cores, while GRAPH executes on the remaining 62 cores allocated to the insecure cluster. The TC process incurs significant thread synchronization overheads and is therefore allocated a small number of cores, while the GRAPH process benefits primarily from core-level parallelism. Thus, the overall L1 cache behavior of this tuple improves only slightly under IRONHIDE, since the performance of neither process is primarily sensitive to the L1 cache.

Figure 6: Private L1 and shared L2 cache miss rates for each interactive application. Geometric mean miss rates are shown for  $\sim 13K$  process interactions on average.

Figure 6:(b) depicts the shared L2 miss rates for each interactive application under the MI6 BASELINE and IRONHIDE. Again, IRONHIDE improves the L2 miss rates by  $1.5\times$  to  $2\times$ , with the exception of the  $\langle TC, GRAPH \rangle$  tuple. However, unlike for the L1 cache, the benefits from IRONHIDE arise primarily from its dynamic hardware isolation capability, which enables the processor to load-balance the allocation of L2 cache slices at a per-tuple granularity. The MI6 BASELINE, on the other hand, configures the last-level cache with a static allocation of 32 L2 cache slices per secure and insecure process. By making better use of the available last-level cache resources, IRONHIDE delivers improved L2 cache miss rates.

In summary, as seen in Figure 5, the completion time of the secure and insecure processes (excluding the purging overheads) improves for all tuples except  $\langle TC, GRAPH \rangle$ . This correlates directly with the L1 and L2 cache miss rate behaviors in Figure 6. Therefore, we conclude that IRONHIDE benefits greatly from (1) the improved private caching that results from not purging the microarchitecture state at each interaction, and (2) better shared cache resource utilization.

#### 6.1.2 Impact of Increasing Input Interactions

Figure 7 shows the impact of increasing the number of *secure-insecure* process tuple interactions from 500 to 50,000 processed inputs. The geometric mean completion time of all tuples is shown (left y-axis) for various numbers of inputs (x-axis). The marker on top of each input count represents the percentage performance gain of IRONHIDE over the MI6 BASELINE.

Each input interaction induces a secure process entry and exit in the MI6 BASELINE scheme, which in turn results in purging the microarchitecture state of the per-core private resources and the memory controllers' queues/buffers. IRONHIDE, however, incurs only a single purging event at each *secure-insecure* process tuple invocation. The results show that the purging overhead of the MI6 BASELINE alone grows from  $21\times$  to  $1500\times$  that of IRONHIDE as the input interactions increase from 500 to 50,000. In addition to purging the microarchitecture state, the MI6 BASELINE also suffers from diminished cache hierarchy utilization, as observed in Figure 6. These cache effects accumulate as the multicore executes an increasing number of input interactions. IRONHIDE, on the other hand, sets up a core-level resource allocation at the start of each tuple and optimizes the execution of all inputs as they are processed through the system. The improvement in the completion time of the secure and insecure processes (excluding the purging overheads) relative to the MI6 BASELINE grows from  $\sim 8\%$  to  $\sim 29\%$  as inputs increase from 500 to 50,000. Overall, these two effects combine to deliver better performance for IRONHIDE over the MI6 BASELINE as the number of inputs processed by a *secure-insecure* process tuple increases<sup>6</sup>.

**Figure 7: Impact of increasing input interactions across secure and insecure processes on the performance.**

#### 6.1.3 Impact of Core Re-allocation Prediction

The performance improvements provided by IRONHIDE depend heavily on the resource binding computed by the core re-allocation predictor for load-balanced execution. This resource binding is pre-computed statically for each *secure-insecure* process tuple. Figure 8 shows how variations in the predictor's decisions affect the performance of IRONHIDE, reporting the geometric mean completion time (y-axis) for all interactive applications across a variety of decision variations (x-axis). A positive (+) variation means that the secure cluster is provided with X% more cores than in the optimal configuration. Conversely, a negative (-) variation means that X% more cores than in the optimal configuration are given away from the secure cluster to the insecure cluster. The evaluated variations are  $\pm 5\%$ ,  $\pm 10\%$ ,  $\pm 20\%$ , and  $\pm 25\%$ .

At optimal load balancing of the clusters' core-level resources, IRONHIDE improves performance by 32% over the MI6 BASELINE scheme. As the variation increases, IRONHIDE fails to provide both clusters with the desired core-level resources, specifically the L1 and L2 caches. This in turn degrades cache utilization and data locality (reuse), leading to inferior performance compared to the optimal resource binding. However, even with a variation window of  $\pm 25\%$ , IRONHIDE provides performance improvements of  $\sim 6$  to  $\sim 15\%$  over the MI6 BASELINE. Thus, a near-optimal (if not optimal) core re-allocation predictor is essential for IRONHIDE to maximize its performance benefits.

<sup>6</sup>Note that Figure 5 shows an average across the numbers of inputs evaluated in Figure 7.

**Figure 8: Impact of the variations in decisions made by the predictor on the performance of IRONHIDE.**

**Figure 9: Comparing baseline multicore MI6 to IRONHIDE and an Intel-SGX-like setup.**

#### 6.1.4 Quantifying the Impact of Strong Isolation

To quantify the real impact of IRONHIDE's performance gains over the MI6 BASELINE, Figure 9 presents the geometric mean completion times for each interactive application tuple. A third configuration, INTEL-SGX, is added to this comparison. INTEL-SGX models a secure enclave execution model in which enclave entry and exit do not purge the microarchitecture state of the private hardware resources. However, to model the ECALL and OCALL overheads of SGX as observed in Hot-Calls [19], a constant  $5\ \mu s$  latency is added for each event. INTEL-SGX also does not statically partition the shared caches and DRAM regions. Thus, it fully benefits from the data locality and memory bandwidth of the system, and also avoids purging the queues/buffers in the memory controllers.
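The INTEL-SGX model's call overhead can be put in perspective with the per-tuple averages from Section 6.1, assuming (as for MI6) one entry and one exit per input: at 5 µs per event, about 13.3K inputs add roughly 0.13 s, and each 5 µs call is about 48× cheaper than a ~0.24 ms MI6 purge event.

```python
inputs = 13_300                   # average inputs per tuple
events = 2 * inputs               # assumed: one ECALL (entry) and one OCALL (exit) per input
sgx_call_s = 5e-6                 # constant 5 us per event in the INTEL-SGX model
mi6_purge_per_event_s = 0.24e-3   # ~0.24 ms per MI6 purge event

sgx_total_s = events * sgx_call_s
cost_ratio = mi6_purge_per_event_s / sgx_call_s

print(f"INTEL-SGX call overhead: {sgx_total_s:.3f} s")
print(f"MI6 purge vs. SGX call cost per event: {cost_ratio:.0f}x")
```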

For the reasons discussed in previous sections, IRONHIDE improves by 32% over the MI6 BASELINE. However, INTEL-SGX delivers an average performance improvement of 44% over the MI6 BASELINE, so IRONHIDE shows an 8.7% performance decrease relative to INTEL-SGX. This gap is primarily attributed to the shared cache and memory controller partitioning that IRONHIDE requires to deliver strong isolation guarantees, which are not available in the INTEL-SGX system. For application tuples that are not sensitive to these large resource partitions, i.e.,  $\langle TC, GRAPH \rangle$ , the results show no performance degradation under IRONHIDE. The results also show that both INTEL-SGX and IRONHIDE incur negligible flushing/purging overheads. However, unlike INTEL-SGX, IRONHIDE ensures protection against all microarchitecture state attacks.

**Figure 10: Comparing IRONHIDE with the BASELINE scheme for non-interactive applications. Impact of predictor decision variations on IRONHIDE is also shown.**

### 6.2 Non-Interactive Application Analysis

To observe how IRONHIDE performs in the absence of interactive applications, this section evaluates non-interactive applications, where each secure process (executing in the secure cluster) is co-executed with various independent insecure processes in the insecure cluster. The non-interactive analysis is performed for 56 different non-interactive process combinations that are formed by taking each secure process and co-executing it with each insecure process discussed in Section 5. Figure 10 shows the geometric mean completion time (y-axis) for executing all these applications on the MI6 BASELINE and IRONHIDE schemes. Moreover, the impact of variations (i.e.,  $\pm 5\%$ ,  $\pm 10\%$ ,  $\pm 20\%$ , and  $\pm 25\%$ ) in predictor decisions on the performance of IRONHIDE is also shown alongside the performance at optimal resource binding.
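The count of 56 combinations follows directly from the process lists in Section 5: seven secure processes paired with eight insecure ones. The sketch enumerates them; GRAPH denotes the graph generation algorithm, and the list names are illustrative.

```python
from itertools import product

SECURE = ["SSSP", "PR", "TC", "ABC", "ALEXNET", "SQZ-NET", "AES"]
INSECURE = ["FFT", "RADIX", "CHOLESKY", "RAYTRACE", "LU", "VISION", "GRAPH", "QUERY"]

# Every secure process is co-executed with every insecure process.
pairs = list(product(SECURE, INSECURE))
print(len(pairs))  # 56
```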

Because the two processes do not interact, IRONHIDE's flushing and re-allocation overhead ( $\sim 15ms$ ) exceeds that of the single purge of private resources by the MI6 BASELINE ( $\sim 0.24ms$ ). However, IRONHIDE is able to overcome these overheads by providing optimal core-level resources to each process via dynamic hardware isolation, which results in better cache utilization and data locality (reuse). Overall, IRONHIDE gains  $\sim 8\%$  in performance over the MI6 BASELINE scheme. These improvements hold only when the clusters are optimally balanced by the core re-allocation predictor; when the predictor's decisions vary, the performance of IRONHIDE suffers. IRONHIDE can only sustain a predictor decision variation window of  $\pm 5\%$ , where the improvements are  $\sim 1.3\%$  and  $\sim 3\%$  for  $-5\%$  and  $+5\%$ , respectively. Variations beyond  $\pm 5\%$ , i.e.,  $\pm 10\%$ ,  $\pm 20\%$ , and  $\pm 25\%$ , lead to performance degradation compared to the MI6 BASELINE.

In conclusion, IRONHIDE is slightly better than (or on par with) MI6 BASELINE for applications where the secure process does not interact with the outside world. However, when secure and insecure processes interact at some interactivity rate, IRONHIDE offers high performance while assuring strong isolation guarantees similar to those of MI6 BASELINE. Notably, the performance advantage of IRONHIDE grows as the interactivity increases.
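The trade-off above can be illustrated with a back-of-the-envelope overhead model built only from the approximate costs quoted in this section ($\sim 15ms$ per IRONHIDE flush-and-re-allocation, $\sim 0.24ms$ per MI6 BASELINE purge). It assumes, hypothetically, that each secure enclave entry or exit costs MI6 exactly one such purge, and it ignores the execution-time effects of core allocation, so it isolates only the purging side of the comparison.

```python
import math

# Approximate costs quoted in Section 6.2, treated here as fixed per-event costs.
MI6_PURGE_MS = 0.24          # one purge of private resources in MI6 BASELINE
IRONHIDE_RECONFIG_MS = 15.0  # one IRONHIDE flush-and-re-allocation

def mi6_purge_overhead_ms(purges: int) -> float:
    """Total purge cost, assuming one purge per enclave entry or exit."""
    return purges * MI6_PURGE_MS

def ironhide_reconfig_overhead_ms(reconfigs: int = 1) -> float:
    """Total cost of IRONHIDE cluster reconfigurations."""
    return reconfigs * IRONHIDE_RECONFIG_MS

def break_even_purges(reconfigs: int = 1) -> int:
    """Smallest purge count at which MI6's purging costs at least as much as IRONHIDE's reconfigurations."""
    return math.ceil(ironhide_reconfig_overhead_ms(reconfigs) / MI6_PURGE_MS)

print(break_even_purges())  # 63
```

Under these assumptions, a single IRONHIDE reconfiguration costs about as much as 63 MI6 purges; beyond that point, every additional interaction widens IRONHIDE's advantage, consistent with the trend stated above.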

## 7. RELATED WORK

**Secure Processor Works:** Academic works such as XOM [54] and Aegis [55] reduce the trusted computing base (TCB) to a secure processor chip. The TCB assumes a program running on the processor to be trusted such that its memory accesses do not leak sensitive information. Industry developed NGSCB [56], TrustZone [57], and TPM [58] as fixed-function units with a limited set of capabilities. To secure arbitrary computation, TPM was extended with SVM [59] or TXT [60] to implement an integrity-checking boot process that attests to the software stack. Intel's SGX [5] platform maintains on-chip enclaves, which isolate processes from the untrusted OS via key management and address partitioning. However, various microarchitecture state leakage channels in SGX have led to attacks [61, 62]. Recent secure processor works [17, 18, 37, 63, 64] extend the idea of enclaves in SGX to alleviate microarchitecture state attacks.
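The integrity-checking boot that TPM-based platforms implement rests on the TPM's *extend* operation, which folds each boot component's measurement into a Platform Configuration Register (PCR). The sketch below models TPM 1.2-style extension, $PCR_{new} = \mathrm{SHA1}(PCR_{old} \,\|\, digest)$, in software; the helper names are illustrative, and a real TPM performs the extend internally and additionally signs PCR values for attestation, which is omitted here.

```python
import hashlib

PCR_SIZE = 20  # TPM 1.2 PCRs hold a 20-byte SHA-1 value

def pcr_extend(pcr: bytes, component: bytes) -> bytes:
    """Measure a component and fold it into the PCR: SHA-1(PCR_old || SHA-1(component))."""
    digest = hashlib.sha1(component).digest()
    return hashlib.sha1(pcr + digest).digest()

def measure_boot(components) -> bytes:
    """Chain measurements in boot order, starting from a zeroed PCR."""
    pcr = bytes(PCR_SIZE)
    for component in components:
        pcr = pcr_extend(pcr, component)
    return pcr

print(measure_boot([b"bootloader", b"kernel", b"os"]).hex())
```

Because each value depends on all previous ones, reordering or tampering with any component yields a different final PCR, which is what allows a verifier to attest to the whole software stack.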

All prior works [5, 16, 18], such as MI6 [17], consider the *temporal* execution of secure and insecure processes. MI6 [17] introduces the concept of strong isolation, which requires purging the microarchitecture state of time-shared resources at every secure enclave entry/exit, leading to performance degradation. HotCalls [19] quantifies the overheads ($\sim 2.5$ to $5\mu s$ for each ECALL/OCALL) for processes that interact with each other to assure application progress; however, these overheads do not include microarchitecture state purging. Even with such small overheads, a performance degradation of $\sim 40\%$ is reported for a database application generating $\sim 200,000$ requests per second to the untrusted OS. Once purging overheads are included, the performance of MI6 is expected to degrade significantly further. Thus, this paper re-thinks secure processor designs in the context of multicores, where *spatially* isolated secure/insecure clusters are formed. The secure process is pinned to the secure cluster, which allows it to execute without frequent context switching and thereby limits the purging overheads.
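To see why transition costs matter at this request rate, the following back-of-the-envelope calculation assumes (hypothetically) that every request incurs exactly one ECALL/OCALL at the 2.5 to 5 μs cost reported by HotCalls, with no batching and a single core. It is not meant to reproduce the measured $\sim 40\%$ degradation, and the optional per-transition `purge_cost_us` is a placeholder parameter, not an MI6 measurement.

```python
def transition_time_fraction(requests_per_sec: float, call_cost_us: float,
                             purge_cost_us: float = 0.0) -> float:
    """Fraction of one core-second spent on enclave transitions (plus optional purges)."""
    per_request_us = call_cost_us + purge_cost_us
    return requests_per_sec * per_request_us / 1e6  # microseconds -> seconds

for cost_us in (2.5, 5.0):
    print(cost_us, transition_time_fraction(200_000, cost_us))
```

Under this assumption, transitions alone would occupy between half and all of a core-second at $\sim 200,000$ requests per second, and any non-zero purge cost only raises that fraction.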

**Protecting Non-Speculative Microarchitecture State:** Cache side-channel attacks [10, 12, 13] have been studied extensively, such as *Prime+Probe* [14], where the attacker's goal is to determine which cache sets the victim application has accessed by observing the latency difference between a cache hit and a miss. Page translation caches (TLBs) have also been attacked [65] using similar schemes under Intel's SGX. Various cache partitioning works either isolate caches [16, 66, 67] or scramble address accesses [68, 69, 70] to diminish information leakage. Research has also shown that routers in *on-chip networks* expose application traffic traces [27, 28], which leads to information leakage. Furthermore, information can also be leaked via *off-chip memory*-based timing channels, where the adversary monitors memory latencies of the victim application [71]. Prior works [42, 43, 72] have explored various mitigation mechanisms, such as employing time-multiplexed memory bandwidth reservation [42] or adopting a non-interference memory controller scheme [43].
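The Prime+Probe principle described above can be demonstrated on a toy model. The sketch below simulates a set-associative LRU cache in Python and substitutes a hit/miss flag for the timing measurement a real attacker would take; the cache geometry, address layout, and the victim's secret-dependent access are all hypothetical.

```python
from collections import OrderedDict

class ToyCache:
    """Set-associative cache with LRU replacement (a simulation, not real hardware)."""

    def __init__(self, num_sets: int, ways: int, line_size: int = 64):
        self.num_sets, self.ways, self.line_size = num_sets, ways, line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr: int) -> bool:
        """Access addr; return True on a hit, False on a miss."""
        line = addr // self.line_size
        index, tag = line % self.num_sets, line // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)     # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[tag] = True
        return False

def addr_for(cache: ToyCache, set_index: int, tag: int) -> int:
    """Build an address that maps to set_index with the given tag."""
    return (tag * cache.num_sets + set_index) * cache.line_size

def prime(cache: ToyCache) -> None:
    """Attacker fills every way of every set with its own lines (tags 0..ways-1)."""
    for s in range(cache.num_sets):
        for way in range(cache.ways):
            cache.access(addr_for(cache, s, way))

def probe(cache: ToyCache) -> list:
    """Re-access the attacker's lines; any miss means the victim evicted one of them."""
    touched = []
    for s in range(cache.num_sets):
        hits = [cache.access(addr_for(cache, s, way)) for way in range(cache.ways)]
        if not all(hits):
            touched.append(s)
    return touched

VICTIM_TAG = 1000  # hypothetical: the victim's lines never share a tag with the attacker's

def victim(cache: ToyCache, secret: int) -> None:
    """Victim touches a cache set that depends on a secret value."""
    cache.access(addr_for(cache, secret % cache.num_sets, VICTIM_TAG))

cache = ToyCache(num_sets=16, ways=4)
prime(cache)
victim(cache, secret=11)
print(probe(cache))  # [11]
```

Only the set touched by the victim shows misses during the probe, so the attacker learns the secret-dependent set index without ever reading the victim's data. Partitioning or isolating the cache removes the shared sets this inference relies on.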

The aforementioned works focus on certain covert channels and do not protect against all microarchitecture state attacks. In contrast, this paper takes a holistic approach to efficiently prevent all potential microarchitecture state attacks in the context of multicores, while introducing minimal hardware modifications. IRONHIDE spatially isolates core-level resources across secure and insecure clusters of cores, where the secure and insecure processes of an interactive application execute. These clusters are later reconfigured to a single pre-computed resource binding to maximize system throughput and performance. Efficient *flushing* and *re-allocation* mitigation mechanisms (that exploit multicore parallelism) are employed to clean the microarchitecture state of reallocated cores on every dynamic reconfiguration.
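A minimal sketch of the bookkeeping behind such a dynamic reconfiguration follows: given the current clusters and a target secure-cluster size, it moves cores across the cluster boundary and returns the cores whose private microarchitecture state must be cleaned. The `Clusters` type, the choice of which cores move, and the one-core minimum per cluster are hypothetical assumptions, not IRONHIDE's actual policy; the flush itself is left to hardware.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Clusters:
    """Hypothetical view of the chip: core ids assigned to each cluster."""
    secure: Tuple[int, ...]
    insecure: Tuple[int, ...]

def reconfigure(clusters: Clusters, target_secure: int) -> Tuple[Clusters, List[int]]:
    """Resize the secure cluster to target_secure cores.

    Returns the new clusters and the cores that changed sides; in this sketch
    those cores must have their private state (e.g., L1 caches, TLBs) flushed
    before the other cluster may use them.
    """
    total = len(clusters.secure) + len(clusters.insecure)
    if not 1 <= target_secure <= total - 1:
        raise ValueError("each cluster must keep at least one core")
    delta = target_secure - len(clusters.secure)
    if delta > 0:
        # Grow the secure cluster with the first `delta` insecure cores (arbitrary policy).
        moved = list(clusters.insecure[:delta])
        new = Clusters(clusters.secure + tuple(moved), clusters.insecure[delta:])
    elif delta < 0:
        # Shrink the secure cluster by handing its last `-delta` cores back.
        moved = list(clusters.secure[delta:])
        new = Clusters(clusters.secure[:delta], tuple(moved) + clusters.insecure)
    else:
        moved, new = [], clusters
    return new, moved

chip = Clusters(secure=tuple(range(4)), insecure=tuple(range(4, 16)))
grown, to_flush = reconfigure(chip, target_secure=6)
print(grown.secure, to_flush)
```

Because the returned cores are independent of one another, their private state can be flushed concurrently, which is the kind of multicore parallelism the mechanism above exploits.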

**Protecting Speculative Microarchitecture State:** DAWG [16] utilizes protection (or security) domains to isolate secure data from malicious insecure applications. Both caches and DRAM are partitioned to ensure that secure data is physically isolated from insecure data. Therefore, speculative microarchitecture state attacks [8, 9, 29] do not materialize due to strong isolation. However, since cache performance is sensitive to capacity and conflicts, the performance penalties of partitioning stack up in DAWG-like approaches. InvisiSpec [73] does not assume security domains, and handles speculative microarchitecture state in both private and shared caches by only committing non-speculative data. It adds hardware that temporarily holds unresolved load data in an isolated buffer, invisible at each level of the cache hierarchy, as well as hardware that performs data consistency checks before committing loads that resolve as non-speculative. However, this incurs invasive hardware changes and performance losses (reported as >20% for honest code) due to diminished benefits from speculative execution. In IRONHIDE, the victim and attacker process pairing for Spectre-like attacks is only possible in the insecure cluster. The secure process is strongly isolated from the attacker since secure data is only allowed to map inside the secure cluster. However, private caches/TLBs of the insecure cluster are allowed to access *any* type of data speculatively. Similar to MI6, IRONHIDE envisions a lightweight hardware check on each memory access that ensures the insecure cluster does not access the secure cluster's data.
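Such a per-access check can be pictured as a lookup in a bitmap of secure memory regions, in the spirit of the DRAM-region isolation used by Sanctum [18] and MI6 [17]. The sketch below assumes hypothetical 32 MiB regions and models only the rule stated above, namely that an insecure-cluster access to a secure region is denied; the policy for secure-cluster cores is not modeled.

```python
REGION_SHIFT = 25  # hypothetical region size: 2**25 bytes = 32 MiB

def secure_bitmap(region_ids) -> int:
    """Build a bitmap with one bit set per secure region index."""
    bitmap = 0
    for r in region_ids:
        bitmap |= 1 << r
    return bitmap

def insecure_access_allowed(paddr: int, secure_regions: int) -> bool:
    """Return False if an insecure-cluster core targets a secure region.

    In this sketch the check runs before the access reaches any cache, so it
    applies equally to speculative loads issued by the insecure cluster.
    """
    region = paddr >> REGION_SHIFT
    return ((secure_regions >> region) & 1) == 0

bitmap = secure_bitmap([2, 5])
print(insecure_access_allowed(3 << REGION_SHIFT, bitmap))  # True
print(insecure_access_allowed(2 << REGION_SHIFT, bitmap))  # False
```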

## 8. CONCLUSION

To enable secure processor execution, Intel's SGX introduces the concept of enclaves that temporally execute alongside ordinary processes on the processor. However, it has been shown to be vulnerable to various speculative and non-speculative microarchitecture state attacks. The state-of-the-art MI6 secure processor adopts the idea of strong isolation to mitigate all such vulnerabilities. However, it suffers from performance degradation due to purging the microarchitecture state of private resources on every secure enclave entry and exit. This paper proposes IRONHIDE, which extends the MI6 architecture in the context of multicores and forms spatially isolated secure and insecure clusters of cores. For a given *secure-insecure* process tuple, IRONHIDE pins the secure process to the secure cluster, where it interacts with the insecure process without the need to purge microarchitecture state on each secure enclave entry and exit. Additionally, IRONHIDE implements dynamic hardware isolation, where core-level resources of clusters are load-balanced to optimize the core-level resource utilization for the concurrently executing secure and insecure processes. The evaluation shows that IRONHIDE improves performance by 32% for a set of interactive applications over the multicore MI6 baseline.

## Acknowledgments

This research was supported by the National Science Foundation under Grant No. CNS-1929261.

## 9. REFERENCES

- [1] V. Roberge, M. Tarbouchi, and G. Labonte, "Comparison of parallel genetic algorithm and particle swarm optimization for real-time uav path planning," *IEEE Trans. on Industrial Informatics*, vol. 9, no. 1, pp. 132–141, 2013.
- [2] T. Ungerer, F. Cazorla, P. Sainrat, M. Houston, F. Kluge, S. Metzlaff, and J. Mische, "Merasa: Multicore execution of hard real-time applications supporting analyzability," *IEEE Micro'10*.
- [3] M. Wolf and D. Serpanos, "Safety and security of cyber-physical and internet of things systems [point of view]," *Proceedings of the IEEE*, vol. 105, pp. 983–984, June 2017.
- [4] J. Son and J. Alves-Foss, "Covert timing channel analysis of rate monotonic real-time scheduling algorithm in mls systems," in *2006 IEEE Information Assurance Workshop*, pp. 361–368, June 2006.
- [5] F. McKeen, I. Alexandrovich, A. Berenzon, C. V. Rozas, H. Shafi, V. Shanbhogue, and U. R. Savagaonkar, "Innovative instructions and software model for isolated execution," in *HASP@ ISCA*, p. 10, 2013.
- [6] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, A. Fogh, J. Horn, S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg, "Meltdown: Reading kernel memory from user space," in *27th USENIX Security Symposium (USENIX Security 18)*, 2018.
- [7] J. V. Bulck, M. Minkin, O. Weisse, D. Genkin, B. Kasikci, F. Piessens, M. Silberstein, T. F. Wenisch, Y. Yarom, and R. Strackx, "Foreshadow: Extracting the keys to the intel SGX kingdom with transient out-of-order execution," in *27th USENIX Security Symposium (USENIX Security 18)*, (Baltimore, MD), p. 991–1008, USENIX Association, 2018.
- [8] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, "Spectre attacks: Exploiting speculative execution," in *40th IEEE Symposium on Security and Privacy (S&P'19)*, 2019.
- [9] G. Chen, S. Chen, Y. Xiao, Y. Zhang, Z. Lin, and T. H. Lai, "SgxPectre Attacks: Stealing Intel Secrets from SGX Enclaves via Speculative Execution," *arXiv e-prints*, p. arXiv:1802.09085, Feb 2018.
- [10] Z. He and R. B. Lee, "How secure is your cache against side-channel attacks?," in *Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture*, MICRO-50 '17, (New York, NY, USA), pp. 341–353, ACM, 2017.
- [11] J. Götzfried, M. Eckert, S. Schinzel, and T. Müller, "Cache attacks on intel sgx," in *Proceedings of the 10th European Workshop on Systems Security*, EuroSec'17, (New York, NY, USA), pp. 2:1–2:6, ACM, 2017.
- [12] M. Lipp, D. Gruss, R. Spreitzer, C. Maurice, and S. Mangard, "Armageddon: Cache attacks on mobile devices," in *25th USENIX Security Symposium (USENIX Security 16)*, (Austin, TX), pp. 549–564, USENIX Association, 2016.
- [13] J. Bonneau and I. Mironov, "Cache-collision timing attacks against aes," in *Cryptographic Hardware and Embedded Systems - CHES 2006* (L. Goubin and M. Matsui, eds.), (Berlin, Heidelberg), pp. 201–215, Springer Berlin Heidelberg, 2006.
- [14] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, "Last-level cache side-channel attacks are practical," in *Proceedings of the 2015 IEEE Symposium on Security and Privacy*, SP '15, (Washington, DC, USA), pp. 605–622, IEEE Computer Society, 2015.
- [15] P. Subramanyan, R. Sinha, I. Lebedev, S. Devadas, and S. A. Seshia, "A formal foundation for secure remote execution of enclaves," in *Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security*, CCS '17, (New York, NY, USA), pp. 2435–2450, ACM, 2017.
- [16] V. Kiriansky, I. Lebedev, S. Amarasinghe, S. Devadas, and J. Emer, "Dawg: A defense against cache timing attacks in speculative execution processors," in *51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)*, 2018.
- [17] T. Bourgeat, I. A. Lebedev, A. Wright, S. Zhang, Arvind, and S. Devadas, "MI6: secure enclaves in a speculative out-of-order processor," *CoRR*, vol. abs/1812.09822, 2018.
- [18] V. Costan, I. Lebedev, and S. Devadas, "Sanctum: Minimal hardware extensions for strong software isolation," in *25th USENIX Security Symposium (USENIX Security 16)*, 2016.
- [19] O. Weisse, V. Bertacco, and T. Austin, "Regaining lost cycles with hotcalls: A fast interface for sgx secure enclaves," in *Proceedings of the 44th Annual International Symposium on Computer Architecture*, ISCA '17, (New York, NY, USA), pp. 81–93, ACM, 2017.
- [20] M. Yan, C. W. Fletcher, and J. Torrellas, “Cache telepathy: Leveraging shared resource attacks to learn DNN architectures,” *CoRR*, vol. abs/1808.04761, 2018.
- [21] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C. C. Miao, J. F. Brown III, and A. Agarwal, “On-chip interconnection architecture of the tile processor,” *IEEE Micro*, vol. 27, pp. 15–31, Sept 2007.
- [22] P. Hammarlund, A. Martinez, A. Bajwa, D. Hill, E. Hallnor, H. Jiang, M. Dixon, M. Derr, M. Hunsaker, R. Kumar, R. Osborne, R. Rajwar, R. Singhal, R. D'Sa, R. Chappell, S. Kaushik, S. Chennupaty, S. Jourdan, S. Gunther, T. Piazza, and T. Burton, “Haswell: The fourth-generation intel core processor,” *Micro, IEEE*, vol. 34, pp. 6–20, Mar 2014.
- [23] I. A. Lebedev, K. Hogan, J. Drean, D. Kohlbrenner, D. Lee, K. Asanovic, D. X. Song, and S. Devadas, “Sanctorum: A lightweight security monitor for secure enclaves,” *IACR Cryptology ePrint Archive*, vol. 2019, p. 1, 2018.
- [24] N. Abu-Ghazaleh, D. Ponomarev, and D. Evtyushkin, “How the spectre and meltdown hacks really worked,” *IEEE Spectrum*.
- [25] C. W. Fletcher, L. Ren, X. Yu, M. Van Dijk, O. Khan, and S. Devadas, “Suppressing the oblivious ram timing channel while making information leakage and program efficiency trade-offs,” in *2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)*, pp. 213–224, Feb 2014.
- [26] S. K. Haider and M. van Dijk, “Revisiting definitional foundations of oblivious RAM for secure processor implementations,” *CoRR*, vol. abs/1706.03852, 2017.
- [27] H. M. G. Wassel, Y. Gao, J. K. Oberg, T. Huffmire, R. Kastner, F. T. Chong, and T. Sherwood, “Surfnoc: A low latency and provably non-interfering approach to secure networks-on-chip,” in *Proceedings of the 40th Annual International Symposium on Computer Architecture*, ISCA '13, (New York, NY, USA), pp. 583–594, ACM, 2013.
- [28] Y. Wang and G. E. Suh, “Efficient timing channel protection for on-chip networks,” in *2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip*, pp. 142–151, May 2012.
- [29] V. Kiriansky and C. Waldspurger, “Speculative Buffer Overflows: Attacks and Defenses,” *arXiv e-prints*, p. arXiv:1807.03757, Jul 2018.
- [30] R. J. Masti, D. Rai, A. Ranganathan, C. Müller, L. Thiele, and S. Capkun, “Thermal covert channels on multi-core platforms,” in *24th USENIX Security Symposium (USENIX Security 15)*, (Washington, D.C.), pp. 865–880, USENIX Association, 2015.
- [31] R. J. Masti, D. Rai, A. Ranganathan, C. Müller, L. Thiele, and S. Capkun, “Thermal covert channels on multi-core platforms,” in *24th USENIX Security Symposium (USENIX Security 15)*, (Washington, D.C.), pp. 865–880, USENIX Association, 2015.
- [32] A. Nazari, N. Sehatbakhsh, M. Alam, A. Zajic, and M. Prvulovic, “Eddie: Em-based detection of deviations in program execution,” in *2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)*, pp. 333–346, June 2017.
- [33] B. Gassend, G. E. Suh, D. Clarke, M. van Dijk, and S. Devadas, “Caches and hash trees for efficient memory integrity verification,” in *The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.*, pp. 295–306, Feb 2003.
- [34] H. Omar, S. K. Haider, L. Ren, M. van Dijk, and O. Khan, “Breaking the oblivious-ram bandwidth wall,” in *36th IEEE International Conference on Computer Design*, 2018.
- [35] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, “Flipping bits in memory without accessing them: An experimental study of dram disturbance errors,” in *2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA)*, pp. 361–372, June 2014.
- [36] S. Bhunia, M. S. Hsiao, M. Banga, and S. Narasimhan, “Hardware trojan attacks: Threat analysis and countermeasures,” *Proceedings of the IEEE*, vol. 102, pp. 1229–1247, Aug 2014.
- [37] F. McKeen, I. Alexandrovich, A. Berenzon, C. Rozas, H. Shafi, V. Shanbhogue, and U. Savagaonkar, “Innovative Instructions and Software Model for Isolated Execution,” in *Hardware and Architectural Support for Security and Privacy*, 2013.
- [38] V. Costan and S. Devadas, “Intel sgx explained.” Cryptology ePrint Archive, Report 2016/086, 2016. <https://eprint.iacr.org/2016/086>.
- [39] A. Herdrich, E. Verplanke, P. Autee, R. Illikkal, C. Ganos, R. Singhal, and R. Iyer, “Cache qos: From concept to reality in the intel xeon processor e5-2600 v3 product family,” in *2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)*, pp. 657–668, March 2016.
- [40] D. Feldman, “Timing is everything: Understanding the meltdown and spectre attacks.” <https://hackernoon.com/timing-is-everything-understanding-the-meltdown-and-spectre-attacks-5e1946e44f9f>, 2018.
- [41] D. Seo, A. Ali, W.-T. Lim, N. Rafique, and M. Thottethodi, “Near-optimal worst-case throughput routing for two-dimensional mesh networks,” in *Proceedings of the 32nd Annual International Symposium on Computer Architecture*, ISCA '05, (Washington, DC, USA), pp. 432–443, IEEE Computer Society, 2005.
- [42] A. Gundu, G. Sreekumar, A. Shafee, S. Pugsley, H. Jain, R. Balasubramonian, and M. Tiwari, “Memory bandwidth reservation in the cloud to avoid information leakage in the memory controller,” in *Proceedings of the Third Workshop on Hardware and Architectural Support for Security and Privacy*, HASP '14, (New York, NY, USA), pp. 11:1–11:5, ACM, 2014.
- [43] Y. Wang, A. Ferraiuolo, and G. E. Suh, “Timing channel protection for a shared memory controller,” in *2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)*, pp. 225–236, Feb 2014.
- [44] “Speculative Task Parallel Algorithm for Single Source Shortest Path.” <https://khan.engr.uconn.edu/pubs/sssp-spec.pdf>, 2019.
- [45] “CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores.” <https://github.com/masabahmad/CRONO>.
- [46] C. Demetrescu, A. V. Goldberg, and D. S. Johnson, eds., *The Shortest Path Problem, Proceedings of a DIMACS Workshop, Piscataway, New Jersey, USA, November 13–14, 2006*, vol. 74 of *DIMACS Series in Discrete Mathematics and Theoretical Computer Science*, DIMACS/AMS, 2009.
- [47] Y. Xue, J. Jiang, B. Zhao, and T. Ma, “A self-adaptive artificial bee colony algorithm based on global best for global optimization,” *Soft Computing*, vol. 22, pp. 2935–2952, May 2018.
- [48] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in *Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on*, pp. 248–255, June 2009.
- [49] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta, “The SPLASH-2 programs: characterization and methodological considerations,” in *ISCA*, pp. 24–36, 1995.
- [50] M. Buckler, S. Jayasuriya, and A. Sampson, “Reconfiguring the imaging pipeline for computer vision,” in *The IEEE International Conference on Computer Vision (ICCV)*, 2017.
- [51] U. Demiryurek, B. Pan, F. Banaei-Kashani, and C. Shahabi, “Towards modeling the traffic data on road networks,” in *Proceedings of the Second International Workshop on Computational Transportation Science*, IWCTS '09, (New York, NY, USA), pp. 13–18, ACM, 2009.
- [52] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking cloud serving systems with YCSB,” in *SoCC'10*.
- [53] D. Dunning, “What encryption is used on an atm machine?”
- [54] D. Lie, C. Thekkath, M. Mitchell, P. Lincoln, D. Boneh, J. Mitchell, and M. Horowitz, “Architectural Support for Copy and Tamper Resistant Software,” in *Proceedings of the 9th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX)*, pp. 168–177, November 2000.
- [55] G. E. Suh, D. Clarke, B. Gassend, M. van Dijk, and S. Devadas, “AEGIS: Architecture for Tamper-Evident and Tamper-Resistant Processing,” in *Proceedings of the 17th ICS (MIT-CSAIL-CSG-Memo-474 is an updated version)*, (New-York), ACM, June 2003.
- [56] Microsoft, “Next-Generation Secure Computing Base.” <http://www.microsoft.com/resources/ngscb/default.mspx>.
- [57] T. Alves and D. Felton, “Trustzone: Integrated hardware and software security.” ARM white paper, 2004.
- [58] Trusted Computing Group, “TCG Specification Architecture Overview Revision 1.2.” <http://www.trustedcomputinggroup.com/home>, 2004.
- [59] Advanced Micro Devices, “Amd64 virtualization: Secure virtual machine architecture reference manual.” AMD Publication no. 33047 rev. 3.01, May 2005.
- [60] D. Grawrock, *The Intel Safer Computing Initiative: Building Blocks for Trusted Computing*. Intel Press, 2006.
- [61] A. Biondo, M. Conti, L. Davi, T. Frassetto, and A.-R. Sadeghi, “The guard’s dilemma: Efficient code-reuse attacks against intel SGX,” in *27th USENIX Security Symposium (USENIX Security 18)*, (Baltimore, MD), pp. 1213–1227, USENIX Association, 2018.
- [62] J. Götzfried, M. Eckert, S. Schinzel, and T. Müller, “Cache attacks on intel sgx,” in *Proceedings of the 10th European Workshop on Systems Security*, EuroSec’17, (New York, NY, USA), pp. 2:1–2:6, ACM, 2017.
- [63] M.-W. Shih, S. Lee, T. Kim, and M. Peinado, “T-sgx: Eradicating controlled-channel attacks against enclave programs,” Internet Society, February 2017.
- [64] S. Shinde, D. L. Tien, S. Tople, and P. Saxena, “Panoply: Low-tcb linux applications with SGX enclaves,” in *NDSS*, The Internet Society, 2017.
- [65] J. V. Bulck, N. Weichbrodt, R. Kapitza, F. Piessens, and R. Strackx, “Telling your secrets without page faults: Stealthy page table-based attacks on enclaved execution,” in *26th USENIX Security Symposium (USENIX Security 17)*, (Vancouver, BC), pp. 1041–1056, USENIX Association, 2017.
- [66] F. Liu, Q. Ge, Y. Yarom, F. McKeen, C. Rozas, G. Heiser, and R. B. Lee, “Catalyst: Defeating last-level cache side channel attacks in cloud computing,” in *2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)*, pp. 406–418, March 2016.
- [67] Z. Zhou, M. K. Reiter, and Y. Zhang, “A software approach to defeating side channels in last-level caches,” in *Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security*, CCS ’16, (New York, NY, USA), pp. 871–882, ACM, 2016.
- [68] M. K. Qureshi, “Ceaser : Mitigating conflict-based cache attacks via encrypted-address and remapping,” in *51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)*, 2018.
- [69] M. K. Qureshi, “New attacks and defense for encrypted-address cache,” in *Proceedings of the 46th International Symposium on Computer Architecture*, ISCA ’19, (New York, NY, USA), pp. 360–371, ACM, 2019.
- [70] F. Liu, H. Wu, K. Mai, and R. B. Lee, “Newcache: Secure cache architecture thwarting cache side-channel attacks,” *IEEE Micro*, vol. 36, pp. 8–16, Sept 2016.
- [71] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage, “Hey, you, get off of my cloud: Exploring information leakage in third-party compute clouds,” in *Proceedings of the 16th ACM Conference on Computer and Communications Security*, CCS ’09, (New York, NY, USA), pp. 199–212, ACM, 2009.
- [72] A. Shafiee, A. Gundu, M. Shevgoor, R. Balasubramonian, and M. Tiwari, “Avoiding information leakage in the memory controller with fixed service policies,” in *Proceedings of the 48th International Symposium on Microarchitecture*, MICRO-48, (New York, NY, USA), pp. 89–101, ACM, 2015.
- [73] M. Yan, J. Choi, D. Skarlatos, A. Morrison, C. W. Fletcher, and J. Torrellas, “Invisispec: Making speculative execution invisible in the cache hierarchy,” *MICRO*, 2018.