

# NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering

Zhe Zhou<sup>\*1,2</sup>, Yiqi Chen<sup>\*1</sup>, Tao Zhang<sup>4</sup>, Yang Wang<sup>4</sup>, Ran Shu<sup>4</sup>, Shuotao Xu<sup>4</sup>, Peng Cheng<sup>4</sup>, Lei Qu<sup>4</sup>, Yongqiang Xiong<sup>4</sup>, Jie Zhang<sup>2,5</sup>, Guangyu Sun<sup>†1,3</sup>

<sup>1</sup>School of Integrated Circuits, <sup>2</sup>School of Computer Science, Peking University

<sup>3</sup> Beijing Advanced Innovation Center for Integrated Circuits

<sup>4</sup> Microsoft Research, <sup>5</sup> Zhongguancun Laboratory

{zhou.zhe, yiqi.chen, jiez, gsun}@pku.edu.cn

{zhangt, yang.wang, ran.shu, shuotaoxu, pengc, lei.qu, yongqiang.xiong}@microsoft.com

**Abstract**—The Compute Express Link (CXL) interconnect makes it feasible to integrate diverse types of memory into servers via its byte-addressable SerDes links. Because these memories differ in access latency, harnessing the full potential of CXL-based heterogeneous memory systems requires efficient memory tiering. However, prior work has made little fundamental progress owing to low-resolution and high-overhead memory access profiling techniques. To address this critical challenge, we propose a novel memory tiering solution called *NeoMem*, which features a hardware/software co-design. NeoMem offloads memory profiling functions to CXL device-side controllers, integrating a dedicated hardware unit called *NeoProf*. *NeoProf* readily monitors memory accesses and provides the OS with crucial page hotness statistics and other useful system state information. On the OS kernel side, we design a revamped memory-tiering strategy, enabling accurate and timely hot page promotion based on *NeoProf* statistics. We implement *NeoMem* on a real FPGA-based CXL memory platform and Linux kernel v6.3. Comprehensive evaluations demonstrate that *NeoMem* achieves a 32% to 67% geomean speedup over several existing memory tiering solutions.

## I. INTRODUCTION

The Compute Express Link (CXL) technology provides a coherent and byte-addressable interconnect between host CPUs and various peripheral devices [23]. Among its multiple use cases [7], [8], [14], [33], [40], [65], [81], CXL-based memory extension (CXL memory for short) has risen as a focal point [11], [26], [29], [45], [51]. As illustrated in Figure 1-(a), CXL enables servers to incorporate diverse memory devices to expand the memory capacity and bandwidth without necessitating hardware modifications on the CPU side. As depicted in Figure 1-(b), CXL memories are usually exposed to the OS as CPU-less NUMA nodes [45], [51]. CPUs directly access these address-mapped NUMA nodes without invoking page faults or swaps.

Fig. 1. A Typical CXL-based Tiered Memory System.

However, CXL memory exhibits elevated access latency, which can be more than twice that of DDR DRAM [38], [45], [67]. The latency gap widens further when DRAM is replaced with slower memory media such as PCM and ReRAM in the CXL memory devices [63]. Therefore, in a system with local DDR DRAM and diverse CXL memories, the disparities in memory latency, bandwidth and capacity result in the formation of a tiered memory system [42], [45], [51], [63]. Typically, accesses to faster memory tiers are characterized by lower latency and higher bandwidth, whereas the opposite holds true for slower tiers. Given this differential, the OS should place frequently accessed “hot” pages in fast memory tiers and “cold” pages in slow tiers to maximize system performance, a technique referred to as *Memory Tiering* [5], [21], [36], [37], [56], [78]. Efficient hot page detection is therefore crucial to memory tiering.

However, realizing efficient hot page detection in CXL-based tiered memory systems is challenging, mainly because existing memory access profiling techniques are low-resolution and high-overhead. Unlike RDMA-based memory disaggregation, where the OS monitors external memory accesses [6], [28], the CPU directly accesses CXL memory via traditional load/store instructions, so the OS has no awareness of the access patterns. Therefore, existing memory tiering techniques rely on special profiling methods such as PTE-scan [48], hint-fault monitoring [25], and PMU sampling [73], each with inherent limitations, to gain visibility into memory accesses.

To be specific, PTE-scan periodically clears the *Accessed* bits in Page-Table Entries (PTEs) and then identifies accessed pages by scanning the page table. PTE-scan captures at most one access per page in each scanning epoch, leading to low resolution. In hint-fault monitoring [25], the OS “poisons” some PTEs by setting special bits. Subsequent accesses to these pages trigger protection faults, immediately notifying the OS of the page access. However, because it frequently triggers page faults, hint-fault monitoring incurs severe overhead [59]. Moreover, both techniques operate at the TLB level and cannot detect true LLC misses. If a page is frequently accessed but always resides in cache, it is redundant to migrate it to fast memory.

\* Co-first authors.

† Corresponding author.

Fig. 2. Illustration of NeoProf-based Memory Profiling

Performance Monitoring Units (PMUs) in CPUs, such as Intel's PEBS and AMD's IBS, can be employed to track LLC misses via sampling events. While PMU sampling directly tracks LLC misses, it typically operates at a low sampling frequency to control overhead [21], [42], which limits hot page detection recall in practical scenarios [21]. Additionally, PMU-based methods are CPU vendor-specific, which limits their generality. A detailed analysis of these memory profiling techniques is presented in Section II-C.

Given these challenges, Linux developers have reached a consensus [53] that “*The biggest problem for memory tiering still appears to be page promotion*” because “*It is difficult to determine when a page has become hot*”. It is urgent to devise efficient and practical memory access profiling and hot page detection techniques for CXL-based memory tiering.

**Design goals:** We claim that an ideal memory access profiling mechanism should aim to fulfill the following goals.

**G1: High Resolution** – First of all, the profiler should accurately and promptly track the frequency and location of memory accesses with high time and space resolution.

**G2: Low Overhead** – It is also crucial for the profiler to consume minimal CPU cycles, ensuring that the system’s performance is not detrimentally impacted.

**G3: Cache Awareness** – The profiler should be aware of the CPU’s caches and capture only LLC misses to ensure high accuracy.

**G4: Universal Compatibility** – The design should be versatile, ensuring compatibility across various platforms, provided they support CXL-based memory expansion.

**G5: Comprehensive Profiling** – Additionally, if the profiler can also capture crucial runtime information such as system bandwidth utilization, read/write distribution, and access frequency distribution, the OS can make better scheduling decisions by leveraging this information.

**Our Work:** We capitalize on the special architecture of CXL memory and propose *NeoMem*, a *CXL-native*<sup>1</sup> memory tiering solution featuring a hardware-software co-design. *NeoMem* integrates memory access profiling units, named *NeoProf*, into the device-side controllers of CXL memories, as illustrated in Figure 2. *NeoProf* readily analyzes LLC misses to CXL memory and provides the OS with crucial information such as page hotness, memory bandwidth utilization, read/write ratio, and access frequency distribution. Meanwhile, on the OS side, we design an advanced memory-tiering strategy that leverages *NeoProf*’s insights for efficient hot page promotion. We demonstrate how *NeoMem* meets the five design goals:

<sup>1</sup>We call *NeoMem* “*CXL-native*” because it fully utilizes the special architecture of CXL memory that has device-side controllers.



Fig. 3. Characterizing CXL-enabled Commodity Hardware.

For goal **G1**, *NeoProf* uses a customized Sketch-based [18] hot-page detector to efficiently analyze *each* physical page access and identify hot pages at a fine granularity of 4KB. By offloading hot page detection to dedicated hardware, *NeoMem* saves precious CPU cycles, which inherently satisfies **G2**. As *NeoProf* resides at the CXL memory side, it directly monitors true LLC misses, which naturally fulfills **G3**. To ensure broad compatibility, as outlined in **G4**, we limit hardware modifications, namely *NeoProf*, to the CXL device side. This guarantees seamless integration with any host CPU. Beyond tracking page hotness, *NeoProf* offers insights into other vital metrics, such as bandwidth utilization, read/write ratio, and access frequency distribution. These statistics empower the OS to dynamically control page migration aggressiveness for optimal performance, which fulfills **G5**.

In contrast to previous studies using emulation to prototype their designs [45], [51], [63], *NeoMem* is validated on a real FPGA-based CXL memory platform. We implement *NeoMem*’s driver and memory tiering daemon in the Linux kernel v6.3. Both the software and hardware components are publicly available in the provided repository<sup>2</sup>. To summarize, we have made the following key contributions:

- **Limitation Analysis.** We investigate the limitations of existing memory tiering methods and demonstrate the necessity of an efficient memory access profiling technique in emerging CXL-based tiered memory systems (Sec. II).
- ***NeoMem* Solution.** We propose a novel *NeoMem* solution, which leverages a dedicated hardware profiler, *NeoProf*, in the device-side memory controller to realize efficient memory profiling (Sec. III). We carefully design the architecture of *NeoProf* to ensure high profiling accuracy and low overhead (Sec. IV). Based on *NeoProf*, we introduce *NeoMem*’s software design and dynamic migration policy (Sec. V).
- **Real-Platform Prototyping.** We conduct real-platform prototyping of *NeoMem* based on a CXL-enabled FPGA platform and Linux kernel v6.3 with our *NeoMem* patch (Sec. VI).

According to our evaluation, *NeoMem* achieves 32% to 67% geomean speedup on eight representative benchmarks compared to several existing memory-tiering solutions.

## II. BACKGROUND AND MOTIVATION

### A. CXL-based Tiered Memory System

CXL (Compute Express Link) [23], built on the PCIe 5.0 physical layer, creates a cache-coherent, byte-addressable interconnect through efficient SerDes links. It comprises three sub-protocols: CXL.io, CXL.cache, and CXL.mem. The CXL.mem protocol specifically allows the CPU to directly access CXL memory devices with load/store instructions. Unlike traditional DDR-based memory, where the memory controller is embedded in the host CPU and has difficulty supporting new types of memory, CXL memory employs device-side controllers and interacts with the host CPU via the standard CXL protocol. This decoupled architecture facilitates the integration of various types of memory into servers, catering to specific capacity, performance, and cost needs [32], [45], [51], [66].

<sup>2</sup> <https://github.com/PKUZHOU/NeoMem-MICRO-2024>

Fig. 4. Evaluating Different Memory Profiling Mechanisms.

However, compared to CPU-attached DDR-DRAMs, CXL-connected memories are considered to exhibit higher latency. On the one hand, the control and transmission overhead of CXL is non-negligible [38], [45], [51], [67]. On the other hand, the latency is more pronounced when integrating high-density but slower memory media like PCM and ReRAM as CXL memory [33], [63]. To delve deeper into this difference, we conduct a performance characterization on Intel’s FPGA-based CXL memory prototype [68], with its detailed configurations provided in Section VI-A.

As shown in Figure 3-(a), Intel’s CXL memory prototype exhibits a latency of approximately 430 ns, around $3.6\times$ that of the faster CPU-attached DDR-DRAM. Prior studies [45], [51] tend to assume a CXL-memory latency of 170-250 ns through NUMA-based emulation, which is lower than that of the available prototype but still up to twice that of fast-tier DRAM. Since this paper’s goal is to propose efficient memory access profiling and hot page detection mechanisms rather than to reduce CXL memory latency, the sub-optimal latency of Intel’s prototype does not affect our conclusions.

We also compare the end-to-end performance using several benchmarks (introduced in Section VI-A). We bind CPU threads to either the fast or slow memory tier to assess the resultant slowdown. As indicated in Figure 3-(b), solely utilizing the CXL memory results in a slowdown ranging from 64% (in Redis) to 295% (in Page-Rank). These observations align with prior findings [67].

### B. Tiered Memory Management

In a system with heterogeneous types of memories, memory tiering techniques are usually adopted to maximize system performance. Memory tiering fundamentally hinges on the data locality of workloads and strategically places the frequently accessed “hot” pages, which are performance critical, in faster memory tiers. The iterative memory tiering process typically unfolds in three key stages:

- **Memory-Access Profiling.** The system first gathers page access statistics via specific memory profiling techniques.

- **Page Classification.** Profiled pages are then classified as either “hot” or “cold” based on their access frequency.
- **Page Migration.** A process termed *Promotion* migrates the hot pages identified in slow memory tiers to fast tiers. Conversely, a process known as *Demotion* relocates the cold pages to slower memory tiers.
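As a minimal sketch of how the classification and migration stages fit together, the Python snippet below classifies profiled pages against a threshold and plans the resulting promotions and demotions. The function name `tiering_step`, the dictionary-based placement map, and the fixed threshold are illustrative assumptions rather than part of any system discussed here, and fast-tier capacity limits are ignored.

```python
def tiering_step(access_counts, placement, hot_threshold):
    """One tiering iteration over already-profiled pages (hypothetical helper).

    access_counts: page -> accesses observed in the last profiling window
    placement:     page -> "fast" or "slow" (updated in place)
    """
    promote, demote = [], []
    # Page classification: hot if the profiled count exceeds the threshold.
    for page, tier in placement.items():
        hot = access_counts.get(page, 0) > hot_threshold
        if hot and tier == "slow":
            promote.append(page)
        elif not hot and tier == "fast":
            demote.append(page)
    # Page migration: promotion moves hot pages up, demotion moves cold ones down.
    for page in promote:
        placement[page] = "fast"
    for page in demote:
        placement[page] = "slow"
    return promote, demote


placement = {"a": "slow", "b": "fast", "c": "fast"}
promoted, demoted = tiering_step({"a": 10, "c": 7}, placement, hot_threshold=5)
```

In this toy run, page `a` is promoted, the unprofiled page `b` is demoted, and page `c` stays in the fast tier.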

Among the three steps, memory-access profiling plays a fundamental role [15], [48], [56], [60], [63]. A high-resolution and low-overhead memory profiling method is critical for promptly identifying performance-critical “hot” pages. Since physical memory accesses are not visible to the operating system, several special profiling techniques have been proposed, which are analyzed as follows.

### C. Memory Access Profiling Techniques

**PTE-scan.** Page-Table-Entry scanning (PTE-scan) is a widely used method for page access tracking [5], [21], [30], [36], [78]. In this approach, a daemon thread in the OS kernel periodically resets the *Accessed* bits in PTEs. When a page is accessed by the CPU, the corresponding bit in the PTE is set. After a certain interval following the reset, the daemon thread scans the PTEs to check which pages have been accessed. Therefore, PTE-scan tracks memory access in distinct epochs.

However, PTE-scan suffers from low time resolution and high overhead. For instance, scanning PTEs in a large-scale memory system can take several seconds [56]. Furthermore, PTE-scan can only detect a single access per page in each scanning epoch, necessitating multiple epochs to identify frequently accessed pages, which escalates costs and reduces timeliness. Efforts like DAMON [48] and AMP [30] employ region sampling or huge pages to mitigate PTE scanning overhead. However, they compromise on space resolution.
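The one-access-per-epoch limitation can be illustrated with a small simulation. The sketch below is a simplified model, not kernel code: the function names and the list-of-epochs trace format are assumptions made for illustration. It compares what an epoch-based Accessed-bit scan observes against the true access counts.

```python
from collections import Counter


def pte_scan_view(accesses_per_epoch):
    """Hotness as seen by an epoch-based Accessed-bit scan (simplified model)."""
    seen = Counter()
    for epoch in accesses_per_epoch:
        # The Accessed bit saturates after the first access in an epoch,
        # so the scan learns only *whether* a page was touched.
        accessed_bits = set(epoch)
        for page in accessed_bits:
            seen[page] += 1
    return seen


def true_view(accesses_per_epoch):
    """Exact per-page access counts over the same trace."""
    return Counter(page for epoch in accesses_per_epoch for page in epoch)


# Page 1 is accessed 1000 times per epoch, page 2 only once per epoch.
epochs = [[1] * 1000 + [2], [1] * 1000 + [2]]
scan_view = pte_scan_view(epochs)
exact_view = true_view(epochs)
```

After two epochs the scan assigns both pages the same score, although their true access counts differ by three orders of magnitude.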

We analyze the trade-offs in PTE-scan using DAMON [48] in Linux. DAMON allows customization of both time resolution (interval between scans, in milliseconds) and space resolution (number of monitored regions, where a higher number equates to finer granularity). As depicted in Figure 4-(a), our analysis reveals that achieving prompt memory access tracking with manageable overhead in PTE-scan requires a significant compromise in space resolution, and vice versa.

**Challenge#1:** PTE-scan methods cannot achieve high time and space resolution while maintaining a low overhead.

**Hint-fault Monitoring.** In contrast to PTE-scan, which gathers page-access information in distinct epochs, hint-fault monitoring can achieve more immediate page-access tracking [5],

TABLE I  
MEMORY-ACCESS PROFILING TECHNIQUES COMPARISON

|                      | PTE-Scan             | Hint-fault Monitoring       | PMU Sampling     | <b>NeoProf</b>             |
|----------------------|----------------------|-----------------------------|------------------|----------------------------|
| Profiling Location   | TLB                  | TLB                         | PMU Monitor      | Device-side CXL Controller |
| Profiling Resolution | One Access Per Epoch | One Access to Sampled Pages | Sampled Accesses | Each Access                |
| Cache Aware?         | X                    | X                           | ✓                | ✓                          |
| Overhead             | High                 | High                        | Medium           | Low                        |

[10], [37], [47], [51]. For example, Thermostat [5] periodically samples a subset of pages and “poisons” the corresponding PTEs by setting protection bits. Subsequent accesses to these poisoned pages invoke protection faults immediately, signaling to the OS which pages have been accessed. However, as each page tracking operation initiates a costly TLB shootdown and page fault, hint-fault monitoring necessitates a sampling approach to temper overheads, which results in low coverage in practice [60].

It is worth noting that both PTE-scan and hint-fault monitoring track TLB misses rather than LLC misses (i.e., true CXL memory accesses). Our detailed profiling reveals that TLB misses and LLC misses may not exhibit a strong correlation across various workload traces. This observation is illustrated in Figure 4-(b), where we plot the total number of TLB accesses (on the Y-axis) against the LLC accesses (on the X-axis) for sampled pages from a Redis [57] trace. We utilize the KCacheSim [12] simulator for this analysis. The memory accesses are filtered by 32KB/core L1D/I caches and a 2MB/core L2 cache. The scatter plot in the figure clearly demonstrates a high level of dispersion, indicating that a page with frequent TLB accesses does not necessarily have a high number of LLC misses.

**Challenge#2:** TLB-based memory profiling methods may not capture the actual CXL memory access situation.

**PMU Sampling.** PMU (Performance Monitoring Unit) sampling utilizes hardware monitors in CPUs for direct LLC miss tracking. Intel’s PEBS, for instance, supports sampling LLC misses and storing them in a dedicated memory buffer. When this buffer reaches capacity, it triggers an interrupt, prompting the kernel to process these samples.

This direct tracking method is prevalent in memory tiering systems [21], [42], [56]. However, as illustrated in Figure 4-(c), the overhead of PEBS increases with sampling frequency. Decreasing the sampling interval from one sample per 10,000 LLC misses to one per 10 LLC misses can cause a workload slowdown of more than 50%. Consequently, existing systems usually adopt a low sampling frequency to minimize performance impact, which, however, sacrifices resolution [21], [42]. Furthermore, PMU sampling is often closely tied to specific CPU platforms. As a result, a system designed for Intel CPU servers may not seamlessly transition to AMD or ARM servers, which contradicts the open and versatile principles of CXL.

**Challenge#3:** The PMU-sampling methods are CPU-vendor specific and can hardly achieve a high profiling resolution.



Fig. 5. Overview of NeoMem Solution.

## III. NEOMEM SOLUTION

To overcome these limitations, we propose a novel *CXL-native* memory tiering solution named NeoMem. NeoMem’s key design philosophy is to offload the memory profiling functions from the CPU to a dedicated hardware unit in the CXL memory’s controller, which is named NeoProf in our design. NeoProf readily monitors *each* memory access to CXL memory at page granularity and directly provides the OS with crucial page hotness statistics and other useful runtime state information, avoiding wasting precious CPU cycles on profiling. Based on NeoProf’s statistics, the OS performs accurate and timely hot page promotion with a revamped memory-tiering strategy. As depicted in Figure 4-(a) and Figure 4-(c) and compared in Table I, NeoProf achieves exceptionally high profiling resolution while imposing minimal CPU overhead.

### A. Overview of NeoMem Solution

Figure 5 presents the overview of NeoMem. In the system, we currently assume two memory tiers: the CPU-attached DDR DRAM serves as the fast memory tier, while CXL-connected memory is the slow memory tier. Different memory tiers are managed via the NUMA APIs of Linux. NeoMem introduces the following key components:

**Memory Access Profiling.** A pivotal feature of NeoMem is its efficient memory access profiling capability. NeoMem facilitates two core profiling functions: (1) hot page detection in CXL memory and (2) monitoring of runtime states.

- **Hot Page Detection.** As shown in Figure 5, NeoProf (❶) resides in CXL memory’s device-side controllers. It snoops memory-access requests sent via the CXL channel, analyzes them and generates page hotness information as well as other useful statistics (❸). The host CPU controls NeoProf’s execution and reads out the profiled statistics periodically by sending commands (❷) through an MMIO (Memory-Mapped I/O) interface. In the OS kernel space, we implement the driver (❹) to interact with NeoProf hardware. Since NeoProf does not require any modifications to the host CPU, it is drop-in compatible with any CXL-enabled server platform.

Note that cold pages in fast memory tiers should also be detected and demoted to slow memory to create space for hot page promotion. Since the detection of cold pages does not need a high resolution, as highlighted in [21], NeoMem employs the well-established LRU 2Q mechanism [35] in the Linux kernel for the detection of cold pages (❻).

Fig. 6. The Block Diagram of NeoProf.

- **State Monitoring.** In a tiered memory system, monitoring runtime states, such as bandwidth utilization, read/write ratios, and page access frequency distribution, is crucial for effective memory tiering. For example, when the slow CXL memory experiences increased bandwidth usage, it becomes advantageous to migrate more hot pages to fast memory. Additionally, in cases where CXL-side memory employs devices with asymmetrical read/write bandwidths, the migration scheduler should better consider their distinct performance characteristics [63]. Furthermore, the distribution of page access frequency also counts in hot page classification [42]. Therefore, we incorporate state monitors into NeoProf. The OS retrieves these critical runtime states through NeoProf commands.

**NeoMem Daemon.** Within the OS kernel, we implement a NeoMem daemon (❺) that interfaces with NeoProf. This daemon is responsible for managing hot page promotions, adhering to a migration policy specified in user space (❸). Hot pages are migrated by invoking the kernel’s page migration functions (❼), at intervals set by the `migration_interval` parameter. Additionally, the daemon resets NeoProf’s states at regular intervals, as determined by the `clear_interval` setting.
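The interplay of the two intervals can be sketched as a simple polling loop. This is a hedged illustration only: `FakeProf`, `read_hot_pages`, `reset`, and `run_daemon` are hypothetical names, time is counted in abstract ticks, and the real daemon runs inside the kernel rather than in Python.

```python
class FakeProf:
    """Stand-in for the NeoProf driver; method names are hypothetical."""

    def __init__(self, hot_pages_per_poll):
        self._pending = list(hot_pages_per_poll)
        self.resets = 0

    def read_hot_pages(self):
        # Return the hot pages detected since the previous poll.
        return self._pending.pop(0) if self._pending else []

    def reset(self):
        self.resets += 1


def run_daemon(prof, migrate, total_ticks, migration_interval, clear_interval):
    """Promote hot pages every migration_interval ticks and reset the
    profiler state every clear_interval ticks."""
    for now in range(1, total_ticks + 1):
        if now % migration_interval == 0:
            for page in prof.read_hot_pages():
                migrate(page)
        if now % clear_interval == 0:
            prof.reset()


promoted = []
prof = FakeProf([[0x10, 0x20], [], [0x30]])
run_daemon(prof, promoted.append, total_ticks=30,
           migration_interval=10, clear_interval=15)
```

Keeping the two intervals independent lets promotion react quickly while the profiler state is cleared on a separate, typically longer, period.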

**Migration Policy.** Based on the rich information provided by NeoProf, NeoMem’s migration policy establishes the guidelines for the NeoMem daemon to orchestrate memory profiling and migration (❼). This policy decision is made in user space, enabling customization and tuning by the user. We will introduce the implementation of the NeoMem daemon and migration policy in Section V.

## IV. NEOPROF DETAILS

### A. Architecture Overview

Figure 6 illustrates the block diagram of NeoProf hardware, consisting of three custom units: the State Monitor, Page Monitor, and NeoProf Core. The Page Monitor snoops requests from device-side CXL controller, identifies physical page addresses, and forks them to the NeoProf Core for analysis. The State Monitor tracks Read/Write transactions, estimates bandwidth utilization and read/write ratios, while the NeoProf Core collaborates with both monitors, identifying hot pages and responding to host CPU control commands.

As shown in the left part, the registers of NeoProf core are memory-mapped to a predefined address space, enabling the OS kernel to access them via Memory-Mapped IO (MMIO).



Fig. 7. Hot-Page Detector Architecture.

The host CPU sends commands to NeoProf by writing data to specific addresses with different offsets. Additionally, the CXL memory is mapped to another address space, allowing the OS to manage it as a CPU-less NUMA node [45], [51].
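To make the command interface concrete, the sketch below models the register window in software, using the offsets listed in Table II. It is an assumption-laden illustration: `SimulatedNeoProf` and `drain_hot_pages` are hypothetical names, and the model assumes that each read of `GetHotPage` returns and consumes one buffered address, which the source does not state explicitly.

```python
# Register offsets taken from Table II.
RESET, SET_THRESHOLD, GET_NR_HOT_PAGE, GET_HOT_PAGE = 0x100, 0x200, 0x300, 0x400


class SimulatedNeoProf:
    """Software model of the MMIO register window; not the real device."""

    def __init__(self):
        self.threshold = 0
        self.hot_fifo = []

    def write(self, offset, value):
        if offset == RESET and value == 1:
            self.hot_fifo.clear()
        elif offset == SET_THRESHOLD:
            self.threshold = value
        else:
            raise ValueError(f"unsupported write to offset {offset:#x}")

    def read(self, offset):
        if offset == GET_NR_HOT_PAGE:
            return len(self.hot_fifo)
        if offset == GET_HOT_PAGE:
            # Assumption: one address is consumed per read.
            return self.hot_fifo.pop(0)
        raise ValueError(f"unsupported read from offset {offset:#x}")


def drain_hot_pages(dev):
    """Read the hot-page count, then read that many addresses."""
    count = dev.read(GET_NR_HOT_PAGE)
    return [dev.read(GET_HOT_PAGE) for _ in range(count)]


dev = SimulatedNeoProf()
dev.write(SET_THRESHOLD, 8)
dev.hot_fifo.extend([0x1000, 0x2000])  # pretend the hardware detected two pages
hot_pages = drain_hot_pages(dev)
```

A kernel driver would issue the same sequence through MMIO reads and writes at the mapped base address plus each offset.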

Notably, the current NeoProf prototype is implemented on an FPGA platform. As a result, we place the NeoProf Core in a low-frequency domain, while the State Monitor and Page Monitor operate at a high frequency to match the memory controller’s requirements. We use asynchronous FIFOs for data transmission from the monitors to the NeoProf Core.

### B. Hot Page Detector

**Challenges.** Given a physical page access stream represented as  $S = \{P_1, P_2, \dots, P_n\}$ , where  $P_i$  denotes the  $i$ -th accessed page, we define a page as “hot” if its access frequency surpasses a certain threshold  $\theta$ . A straw-man approach for hardware-based hot page detection involves utilizing counters to monitor the access frequency of individual physical pages. However, consider a 512GB CXL memory expander [66], housing a total of 128 million 4KB pages. Assigning a 32-bit counter to each page would require 512MB of buffer to store counters. Moreover, updating these counters with every page access would impose a significant burden on DRAM bandwidth. Additionally, reading and processing these counters to identify hot pages could introduce considerable latency, severely affecting the timeliness of profiling.
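The storage cost of the straw-man design follows directly from the numbers above. The short calculation below reproduces it, assuming binary units (1 GB = 2^30 bytes) as the figures in the text imply.

```python
GIB = 1 << 30
MIB = 1 << 20

capacity_bytes = 512 * GIB   # CXL memory expander capacity
page_bytes = 4 << 10         # 4KB pages
counter_bytes = 4            # one 32-bit counter per page

num_pages = capacity_bytes // page_bytes
counter_storage_bytes = num_pages * counter_bytes

print(num_pages // MIB, "Mi pages")
print(counter_storage_bytes // MIB, "MiB of counters")
```

The result is 128 Mi pages and 512 MiB of counter storage, matching the estimate in the text.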

**Efficient Hot Page Detection.** To address the challenges outlined, we propose an efficient hot-page detection architecture. Our approach hinges on the Count-Min (CM) Sketch algorithm [18], a hash-based technique designed for estimating item frequencies within a data stream. We enhance the CM-Sketch with two capabilities: (1) an efficient hot page filtering mechanism that prevents duplicate reports, and (2) an error-bound control mechanism in the hot page detector that guarantees high detection accuracy.

Fig. 8. Hot Page Detector Pipeline.

As illustrated in Figure 7, a CM-Sketch with parameters $(\epsilon, \delta)$ is represented by a two-dimensional array of counters $A$ (①) with a width of $W$ and a depth of $D$. Given parameters $(\epsilon, \delta)$, we set $W = \lceil 2/\epsilon \rceil$ and $D = \lceil \log_2(1/\delta) \rceil$. Each entry in the array consists of a counter, a hot bit and a valid bit (②). $D$ different hash functions (③) map the input page address to one of $W$ entries in each row (lane). When a page address $P$ arrives, the hash functions calculate offsets in each row of the sketch array, denoted as $\Delta_i = h_i(P)$ for the $i$-th row, where $i \in [1, D]$. Subsequently, the counters of the hashed entries are indexed and incremented:

$$\Delta_i = h_i(P), \quad A[i][\Delta_i] \leftarrow A[i][\Delta_i] + 1 \quad (1)$$

Then the access frequency of page $P$, represented as $a(P)$, is approximated by the minimum of the hashed counters across the $D$ rows:

$$\hat{a}(P) = \min_{1 \leq i \leq D} A[i][\Delta_i] \quad (2)$$

According to the theory [18], the estimated access frequency $\hat{a}(P)$ falls within the following range with probability at least $1 - \delta$:

$$a(P) \leq \hat{a}(P) \leq a(P) + \epsilon N \quad (3)$$

Here,  $N$  is the total number of accesses seen by the sketch. Given a threshold  $\theta$ , the hot page detector (④) assesses whether the approximated access count  $\hat{a}(P)$  exceeds  $\theta$ :

$$\text{isHotPage}(P, \theta) = \begin{cases} \text{True}, & \hat{a}(P) > \theta \\ \text{False}, & \hat{a}(P) \leq \theta \end{cases} \quad (4)$$
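Equations 1 to 4 can be exercised with a small software model. The sketch below is illustrative only: the class name `CountMinSketch` is an assumption, and Python's built-in hash over a salted tuple stands in for the hardware hash functions $h_i$.

```python
import math
import random


class CountMinSketch:
    """Software model of Eq. 1-4; the hashing scheme is a stand-in."""

    def __init__(self, epsilon, delta, seed=0):
        self.width = math.ceil(2 / epsilon)            # W
        self.depth = math.ceil(math.log2(1 / delta))   # D
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(self.depth)]
        self.counts = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, page):
        # Delta_i = h_i(P): one offset per row.
        return hash((self.salts[row], page)) % self.width

    def access(self, page):
        # Eq. 1: increment the hashed counter in every row.
        for i in range(self.depth):
            self.counts[i][self._index(i, page)] += 1

    def estimate(self, page):
        # Eq. 2: the minimum over the rows never underestimates a(P).
        return min(self.counts[i][self._index(i, page)]
                   for i in range(self.depth))

    def is_hot(self, page, theta):
        # Eq. 4: compare the estimate against the threshold.
        return self.estimate(page) > theta


sketch = CountMinSketch(epsilon=1 / 64, delta=0.25)
for _ in range(50):
    sketch.access(7)
for page in range(100, 120):
    sketch.access(page)
```

With $\epsilon = 1/64$ and $\delta = 0.25$ the model allocates a $2 \times 128$ array, and page 7 is reported hot for any threshold below its 50 accesses.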

NeoProf clears these counters after each hot page detection period, which is implemented via resetting each entry's `Valid` bit. In each sketch lane, the `Valid` bits are physically arranged in a contiguous manner, allowing for rapid resetting.

**Hot-Page Filtering.** In NeoProf's design, the addresses of detected hot pages are put into the hot page buffer (⑥ in Figure 7). However, within a detection period, a page would be reported as hot on every access once its access frequency exceeds the threshold $\theta$, which quickly fills up the hot-page buffer and is redundant for hot page migration. We avoid this problem by introducing a `Hot` bit in each sketch entry (②).

Before transferring a page address to the output buffer, the hot page filter (⑤) examines the `Hot` bits in the hashed entries. If all the `Hot` bits are set to `True`, this suggests that the page might have been recorded previously, leading us to dismiss it. Conversely, if any `Hot` bit is `False`, this indicates a newly detected hot page. We then set the `Hot` bits in the corresponding entries to `True`. Such a design can be thought of as equivalent to adding a Bloom filter [75] after the CM-Sketch unit to probabilistically determine the presence of a hot page [34]. Our design is more efficient as it reuses the hashing results and introduces only a minimal number of additional hot bits.
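The filtering rule can be modeled in a few lines. The sketch below is a simplified software model under stated assumptions: `FilteredDetector` is a hypothetical name, Python's tuple hash replaces the hardware hash functions, and periodic resets are omitted.

```python
class FilteredDetector:
    """Count-Min counters plus per-entry Hot bits (simplified model)."""

    def __init__(self, width, depth, theta):
        self.width, self.depth, self.theta = width, depth, theta
        self.counts = [[0] * width for _ in range(depth)]
        self.hot = [[False] * width for _ in range(depth)]
        self.hot_buffer = []

    def _index(self, row, page):
        return hash((row, page)) % self.width

    def access(self, page):
        idx = [self._index(i, page) for i in range(self.depth)]
        for i, j in enumerate(idx):
            self.counts[i][j] += 1
        estimate = min(self.counts[i][j] for i, j in enumerate(idx))
        if estimate <= self.theta:
            return  # not hot yet
        if all(self.hot[i][j] for i, j in enumerate(idx)):
            return  # all Hot bits set: probably reported already
        for i, j in enumerate(idx):
            self.hot[i][j] = True  # mark the hashed entries, reusing the indices
        self.hot_buffer.append(page)


detector = FilteredDetector(width=64, depth=2, theta=3)
for _ in range(10):
    detector.access(42)
```

Page 42 crosses the threshold on its fourth access and enters the buffer once; the remaining six accesses are filtered out.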

Fig. 9. Histogram-based Error-Bound Estimation

**Accurate Error-Bound Estimation.** One significant challenge in Sketch-based hotness estimation is that the approximation error increases as the number of streamed-in page addresses grows ($N$ in Eq. 3). In the extreme case where all counters in the sketch array exceed the threshold $\theta$, even an unseen page is incorrectly labeled as “hot”. Equation 3 provides a worst-case error bound, which has been criticized as overly “loose” for practical use [13]. Chen et al. [13] introduced a technique to estimate a “near-optimal” approximation error by sorting the counters within any row of the sketch array in descending order, denoted as $\{A[1][1], A[1][2], \dots, A[1][W]\}$. The tight error bound, referred to as $e$, is then determined as the $\lceil W \cdot \delta^{1/D} \rceil$-th value of these sorted counters.

With probability $1 - \delta$, we can say that $\hat{a}(P) \leq e + a(P)$. If the approximated page access count $\hat{a}(P)$ exceeds the threshold $\theta$, we can assert that $a(P) > \theta - e$. For example, given $D = 2$ and $\delta = 0.25$, we have $\delta^{1/D} = 0.5$, so the median value of a row serves as the error estimate $e$. When $e$ grows large (e.g., when $e > \theta$), the sketch array may have reached saturation. NeoMem relies on the estimated error bound to ensure accurate hot page detection.
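The bound can be computed directly from one sketch row. The sketch below is a minimal illustration that takes the rank as $\lceil W \cdot \delta^{1/D} \rceil$ in descending order, which reproduces the median in the $D = 2$, $\delta = 0.25$ example; the function names are assumptions, and the rounding step only guards against floating-point noise.

```python
import math


def tight_error_bound(row, delta, depth):
    """Pick the ceil(W * delta**(1/D))-th largest counter of one row."""
    ranked = sorted(row, reverse=True)
    # round() absorbs floating-point noise before taking the ceiling.
    rank = math.ceil(round(len(ranked) * delta ** (1 / depth), 9))
    rank = min(max(rank, 1), len(ranked))
    return ranked[rank - 1]


def is_saturated(error_bound, theta):
    """Heuristic from the text: the sketch is unreliable once e > theta."""
    return error_bound > theta


row = [9, 1, 7, 3, 5, 2, 8, 4]          # one sketch row with W = 8
e = tight_error_bound(row, delta=0.25, depth=2)
```

For this row the rank is 4, so `e` is the fourth-largest counter, 5; with a threshold of 4 the sketch would be considered saturated.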

**Hardware Implementation.** Based on these algorithms, we introduce the hardware implementation details of the proposed hot-page detector and the error-bound estimation logic:

- **Pipelined Hot Page Detection.** In Figure 8, we break down the detector's pipeline into three primary stages: (1) hash index computation, (2) hot page checking, and (3) hot page filtering. Each stage comprises finer pipeline stages. We utilize the $H_3$ [55] hash function for hash index computation. This function calculates an $m$-bit hash value based on an $n$-bit input value $x$ and an $n \cdot m$-bit seed $\pi$:

$$h_\pi(x) = x(0) \cdot \pi(0) \oplus x(1) \cdot \pi(1) \dots \oplus x(n-1) \cdot \pi(n-1) \quad (5)$$

where the input  $x$  is the  $n$ -bit page address.  $x(i) \cdot \pi(i)$  performs logic AND between each bit  $x(i)$  and every bit of  $\pi(i)$ . The  $\oplus$  operator performs logic XOR operation on vectors. The resulting vector  $h_\pi(x)$  is an  $m$ -bit hashed index. To ensure efficient and pipelined processing, we divide this reduction tree into  $M$  stages. A total of  $D$  pipeline units handle the hash functions for each sketch row in parallel.

Another challenge arises when the sketch array width  $W$  increases. This makes it difficult to achieve a single-cycle index and update for any counter. To address this, we follow prior practices [69] and partition the memory into  $K$  sub-blocks and implement the sketch array in a pipelined manner.
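The $H_3$ hash of Eq. 5 maps naturally to a few bitwise operations. The sketch below is a sequential software rendering, not the pipelined hardware design; the function names and the representation of the seed as a list of $n$ integers of $m$ bits each are assumptions.

```python
import random


def make_h3_seed(n_bits, m_bits, rng):
    """Draw the n*m-bit seed as n random m-bit words pi(0..n-1)."""
    return [rng.getrandbits(m_bits) for _ in range(n_bits)]


def h3_hash(x, seed):
    """Eq. 5: XOR together pi(i) for every set bit x(i) of the input."""
    h = 0
    for i, pi_i in enumerate(seed):
        if (x >> i) & 1:   # x(i) AND pi(i) keeps pi(i) only when x(i) = 1
            h ^= pi_i
    return h


small_seed = [0b01, 0b10, 0b11]          # n = 3, m = 2
digest = h3_hash(0b101, small_seed)      # pi(0) XOR pi(2)

page_seed = make_h3_seed(n_bits=36, m_bits=10, rng=random.Random(1))
index = h3_hash(0x12345, page_seed)      # a 10-bit row index
```

Because the function uses only AND and XOR, it is linear over GF(2), so `h3_hash(a ^ b, seed)` equals `h3_hash(a, seed) ^ h3_hash(b, seed)`, which is what makes a staged reduction tree straightforward.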

- **Histogram-based Error-Bound Estimation.** According to the accurate error-bound estimation algorithm introduced above, we should read out a row of sketch counters, sort them, and select the $p$-percentile value as the error bound. To reduce CXL channel occupation and save host CPU cycles, we propose a histogram-based error-bound estimation mechanism.

TABLE II  
NEOPROF COMMANDS

| Command      | Offset | Operation      | Description                        |
|--------------|--------|----------------|------------------------------------|
| Reset        | 0x100  | Write 1        | Reset NeoProf                      |
| SetThreshold | 0x200  | Write $\theta$ | Set hot page threshold to $\theta$ |
| GetNrHotPage | 0x300  | Read           | Readout # profiled hotpages        |
| GetHotPage   | 0x400  | Read           | Readout a hot page address         |
| GetNrSample  | 0x500  | Read           | Readout # sampled cycles           |
| GetRdCnt     | 0x600  | Read           | Readout # sampled read             |
| GetWrCnt     | 0x700  | Read           | Readout # sampled write            |
| SetHistEn    | 0x800  | Write 1        | Trigger the histogram calculation  |
| GetNrHistBin | 0x900  | Read           | Readout # histogram bins           |
| GetHist      | 0xA00  | Read           | Readout the histogram bins         |

As depicted in Figure 9, NeoProf Core incorporates a histogram unit with 64 bins. Triggered by specific NeoProf commands, the histogram unit reads the counters in the first row of the sketch array and estimates the frequency distribution. The host CPU only needs to read out the histogram and estimate the $p$-percentile frequency using a straightforward algorithm. This approach greatly reduces the overhead compared to naively reading out and sorting entire sketch rows.
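The host-side percentile step is not spelled out in the text. One plausible reading is a cumulative walk over the bins, sketched below. The function name `hist_percentile`, the equal-width binning, and the `bin_width` parameter are assumptions made for illustration and are not part of NeoProf's interface.

```python
def hist_percentile(bins, p, bin_width=1):
    """Estimate the p-quantile (0 < p <= 1) of counter values from a histogram.

    bins[i] is the number of sketch counters whose value falls in
    [i * bin_width, (i + 1) * bin_width). Equal-width bins are an assumption.
    Returns the upper edge of the bin in which the cumulative count first
    reaches p * total, i.e. a conservative (rounded-up) estimate.
    """
    total = sum(bins)
    if total == 0:
        return 0
    target = p * total
    running = 0
    for i, count in enumerate(bins):
        running += count
        if running >= target:
            return (i + 1) * bin_width
    return len(bins) * bin_width


# Toy 4-bin histogram (made-up values): 70 counters in [0, 8), 20 in [8, 16),
# 8 in [16, 24) and 2 in [24, 32).
toy_bins = [70, 20, 8, 2]
median_bound = hist_percentile(toy_bins, 0.5, bin_width=8)
tail_bound = hist_percentile(toy_bins, 0.95, bin_width=8)
```

Compared with sorting a full sketch row, the host only touches as many values as there are bins (64 in the prototype), at the cost of bin-width granularity in the estimate.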

Besides facilitating error-bound estimation, the histogram also approximates the page access frequency distribution in CXL memory. In Section V-A, we will demonstrate how the NeoMem migration policy conducts dynamic hotness threshold adjustment according to the histogram information.
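To make the detector's data path concrete, the following is a minimal software model of the three stages described above: $H_3$ hashing as in Eq. (5), sketch update with a threshold check, and filtering of already-reported pages. It is a behavioural sketch only. The class names, the use of a Python `set` for the filtering stage, and the small sizes in the usage lines are assumptions, and the model says nothing about the pipelining or sub-block partitioning of the real hardware.

```python
import random


class H3Hash:
    """H3 hash: XOR of the seed words pi(i) selected by the set bits of x."""

    def __init__(self, n_bits, m_bits, rng):
        # n seed words of m bits each together form the n*m-bit seed pi.
        self.rows = [rng.getrandbits(m_bits) for _ in range(n_bits)]

    def __call__(self, x):
        h = 0
        for i, row in enumerate(self.rows):
            if (x >> i) & 1:   # x(i) AND pi(i)
                h ^= row       # XOR reduction over the selected words
        return h


class HotPageSketch:
    """Model of a D-row Count-Min sketch with 2**width_bits counters per row."""

    def __init__(self, depth, width_bits, addr_bits, theta, seed=0):
        rng = random.Random(seed)
        self.hashes = [H3Hash(addr_bits, width_bits, rng) for _ in range(depth)]
        self.rows = [[0] * (1 << width_bits) for _ in range(depth)]
        self.theta = theta
        self.reported = set()   # stands in for the hot-page filtering stage
        self.hot_pages = []     # stands in for the hot-page buffer

    def access(self, page):
        counts = []
        for h, row in zip(self.hashes, self.rows):
            idx = h(page)
            row[idx] += 1
            counts.append(row[idx])
        estimate = min(counts)  # Count-Min estimate, never below the true count
        if estimate > self.theta and page not in self.reported:
            self.reported.add(page)
            self.hot_pages.append(page)
        return estimate


sketch = HotPageSketch(depth=2, width_bits=10, addr_bits=32, theta=5)
for _ in range(10):
    last_estimate = sketch.access(0x1234)
```

Because every row counter can only over-count, taking the minimum across rows gives an estimate that is at least the true access count, which is what the error bound $e$ above quantifies.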

**NeoProf Commands.** NeoProf is controlled by the host CPU through a set of commands, which are encoded by varying offsets in NeoProf’s MMIO region. Some core commands are listed in Table II. The Reset command clears all the counters and buffers within NeoProf. The SetThreshold command is used to adjust the hot page threshold  $\theta$ . Subsequently, the GetNrHotPage and GetHotPage commands are employed to retrieve the addresses of hot pages. Additionally, we design the GetNrSample, GetRdCnt and GetWrCnt commands to retrieve the total sampled cycles, as well as the breakdown of cycles attributed to read and write operations. Finally, the Hist-related commands trigger the histogram calculation and retrieve the histogram data.
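The command set can be exercised against a small in-memory stand-in, as sketched below. The offsets are copied from Table II. Everything else is hypothetical: `FakeNeoProf` is not the real device, and the assumption that each GetHotPage read pops one address from the buffer is an illustrative guess rather than documented behaviour.

```python
from enum import IntEnum


class NeoProfCmd(IntEnum):
    """MMIO offsets copied from Table II."""
    RESET = 0x100
    SET_THRESHOLD = 0x200
    GET_NR_HOT_PAGE = 0x300
    GET_HOT_PAGE = 0x400
    GET_NR_SAMPLE = 0x500
    GET_RD_CNT = 0x600
    GET_WR_CNT = 0x700
    SET_HIST_EN = 0x800
    GET_NR_HIST_BIN = 0x900
    GET_HIST = 0xA00


class FakeNeoProf:
    """In-memory stand-in for the device (hypothetical, for illustration)."""

    def __init__(self, hot_pages, rd, wr, samples):
        self._hot = list(hot_pages)
        self._regs = {NeoProfCmd.GET_RD_CNT: rd,
                      NeoProfCmd.GET_WR_CNT: wr,
                      NeoProfCmd.GET_NR_SAMPLE: samples}
        self.theta = None

    def write(self, offset, value):
        if offset == NeoProfCmd.RESET:
            self._hot.clear()
        elif offset == NeoProfCmd.SET_THRESHOLD:
            self.theta = value

    def read(self, offset):
        if offset == NeoProfCmd.GET_NR_HOT_PAGE:
            return len(self._hot)
        if offset == NeoProfCmd.GET_HOT_PAGE:
            return self._hot.pop(0)   # assumed: each read pops one address
        return self._regs[offset]


def drain_hot_pages(dev):
    """Read the hot-page count, then read that many addresses."""
    n = dev.read(NeoProfCmd.GET_NR_HOT_PAGE)
    return [dev.read(NeoProfCmd.GET_HOT_PAGE) for _ in range(n)]


def sampled_rw_fraction(dev):
    """(read cycles + write cycles) / sampled cycles from the three counters."""
    rd = dev.read(NeoProfCmd.GET_RD_CNT)
    wr = dev.read(NeoProfCmd.GET_WR_CNT)
    total = dev.read(NeoProfCmd.GET_NR_SAMPLE)
    return (rd + wr) / total if total else 0.0


dev = FakeNeoProf(hot_pages=[0x10, 0x20], rd=30, wr=10, samples=100)
dev.write(NeoProfCmd.SET_THRESHOLD, 200)
pages = drain_hot_pages(dev)
```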

## V. NEOMEM SOFTWARE DESIGN

### A. NeoMem Migration Policy

Setting the hotness threshold is a critical challenge in memory tiering. Traditional methods, constrained by limited insight into memory access patterns, often rely on static thresholds for classifying hot pages [17], [37], [51], [78], which are sub-optimal. With rich and timely memory access information, NeoMem dynamically adjusts the hotness threshold (described in Algorithm 1) based on the following statistics:

- **Access Frequency Distribution.** We utilize the page access frequency distribution to dynamically determine the hotness threshold $\theta$. This involves using NeoProf’s histogram of sketch counters as a proxy for actual access frequencies (line 4 in the algorithm). The threshold is determined by setting $\theta$ to the $p$-percentile of this distribution. Specifically, we define $Q_{\mathcal{F}}$ as the histogram’s quantile function, where $Q_{\mathcal{F}}(x) = y$ implies that a fraction $x$ of pages have fewer than $y$ accesses. Thus, $\theta$ is set to $Q_{\mathcal{F}}(1-p)$ (outlined in line 16), aligning the threshold with the top-$p$ access frequency.

**Algorithm 1: Dynamic Hotness Threshold Adjustment**

```
1 Input: Migration Quota $m_{quota}$; Percentile bounds $p_{min}, p_{max}$; Default percentile $p_{init}$;
2 $\triangleright$ $p \leftarrow p_{init}$;
3 while dynamic threshold adjustment is enabled do
4   $\mathcal{F} \leftarrow get\_neoprof\_hist()$;
5   $\mathcal{B} \leftarrow get\_bandwidth\_util()$;
6   $\mathcal{P} \leftarrow get\_ping\_pong\_count()$;
7   $\mathcal{E} \leftarrow get\_error\_bound(\mathcal{F})$;
8   $\mathcal{M} \leftarrow get\_migrate\_pages\_count()$;
9   if $\mathcal{M} < m_{quota}$ then
10     $p \leftarrow p \cdot \frac{(1+\mathcal{B})^\alpha}{(1+\mathcal{P})^\beta}$;
11     $p \leftarrow bound(p_{min}, p_{max}, p)$;
12   else
13     $p \leftarrow max(p_{min}, \frac{p}{2})$; /* Migration quota constraint */
14   if $Q_{\mathcal{F}}(1-p) < \mathcal{E}$ then
15     $p \leftarrow max(p_{min}, \frac{p}{2})$; /* Error-bound checking */
16   $\theta \leftarrow Q_{\mathcal{F}}(1-p)$, update_hotness_threshold($\theta$);
17   $\triangleright$ Wait for the next threshold update period;
```

- **Bandwidth Utilization.** We aim for maximum utilization of fast memory in the system. Therefore, when we observe heavy use of the slower CXL memory’s bandwidth, it prompts the migration of more pages to the fast memory tier. We define bandwidth utilization ($\mathcal{B}$) as the ratio of memory reads and writes to total sampled cycles, given by $\mathcal{B} = \frac{read+write}{total~cycles}$ (line 5). Here, *read* and *write* denote the cycles in which the device transfers read and write data, as monitored by NeoProf during the last threshold update period, and *total cycles* denotes the sampled cycles in that period. The hotness threshold should be inversely proportional to $\mathcal{B}$, denoted as $\theta \propto \frac{1}{\mathcal{B}}$.

- **Ping-Pong Severity.** An improperly low hotness threshold may lead to a situation where infrequently accessed pages are prematurely promoted to fast memory, only to be swiftly demoted back to slower memory, which is referred to as the ping-pong phenomenon [51]. To measure ping-pong severity, we introduce the *PG\_demoted* page flag in the Linux kernel, which is set when a page is demoted and cleared when it is promoted. A page that is promoted while its *PG\_demoted* flag is set is counted as a ping-pong event. Ping-pong severity is the ratio of ping-pong events to promoted pages in the previous period, calculated as $\mathcal{P} = \frac{\#ping~pong~events}{\#promoted~pages}$ (line 6). The hotness threshold should be proportional to $\mathcal{P}$, denoted as $\theta \propto \mathcal{P}$.

- **Approximation Error.** Ensuring precise hot page classification requires accounting for the approximation error of the sketch-based hot page detector (line 7 of the algorithm). We assume that if the estimated error bound $\mathcal{E}$ exceeds the threshold $\theta$, hot page detection is considerably inaccurate. To mitigate this, we increase the threshold by halving the percentile $p$ (as outlined in lines 14 and 15) to enhance the confidence in hot-page detection.

- **Migration Quota.** Lastly, to prevent excessive CPU resource and memory bandwidth consumption due to page migration, we establish a migration quota designated as $m_{quota}$. If the number of migrated pages during the previous period exceeds this quota, we also halve the parameter $p$ to raise the threshold $\theta$, as outlined in line 13.

TABLE III  
EVALUATION SYSTEM CONFIGURATION

| Component  | Configuration                                                                                                          |
|------------|------------------------------------------------------------------------------------------------------------------------|
| Host CPU   | Single socket Intel® Xeon 6430 CPU @ 2.10GHz<br>32 Cores, hyperthreading disabled.<br>60MB Shared LLC                  |
| DDR Memory | 32GB DDR5 4800MHz x 4                                                                                                  |
| CXL Memory | One Intel® Agilex™ I-Series FPGA Dev Kit @400 MHz<br>Hard CXL 1.1 IP on PCIe Gen5 x16<br>16GB 2-Channel DDR4-2666 DRAM |

TABLE IV  
HARDWARE PARAMETERS OF NEOPROF

| Addr Bits | Counter Bits | Sketch Width (W) | Sketch Lanes (D) | # Memory Segments | Hot Buffer Entries |
|-----------|--------------|------------------|------------------|-------------------|--------------------|
| 32        | 16           | 512K             | 2                | 128               | 16K                |

Taking these factors into consideration, we calibrate the parameter $p$ to $p \cdot \frac{(1+\mathcal{B})^\alpha}{(1+\mathcal{P})^\beta}$ (line 10), where $\alpha$ and $\beta$ are adjustable hyper-parameters. As described in the algorithm, in every threshold update period, we dynamically choose the top $p$ fraction of pages as hot and make sure that the decision meets the constraints from the error bound and the migration quota.
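One iteration of the loop body in Algorithm 1 can be sketched in Python as follows. This is an illustrative reading rather than the kernel implementation: the quantile helper assumes equal-width histogram bins, the function and parameter names are hypothetical, and the defaults `alpha=1.0`, `beta=2.0` reflect one interpretation of the "1/2" entry for $\alpha / \beta$ in Table V. The histogram values in the usage lines are made up.

```python
def quantile(hist, frac, bin_width=1):
    """Q_F(frac): upper edge of the first bin at which the cumulative count
    reaches frac of all counters (equal-width bins are an assumption)."""
    total = sum(hist)
    if total == 0:
        return 0
    running = 0
    for i, count in enumerate(hist):
        running += count
        if running >= frac * total:
            return (i + 1) * bin_width
    return len(hist) * bin_width


def clamp(lo, hi, x):
    return max(lo, min(hi, x))


def update_percentile(p, hist, bw_util, ping_pong, err_bound, migrated,
                      m_quota, p_min, p_max, alpha=1.0, beta=2.0, bin_width=1):
    """One pass over lines 9-16 of Algorithm 1; returns the new (p, theta)."""
    if migrated < m_quota:
        p = p * (1 + bw_util) ** alpha / (1 + ping_pong) ** beta   # line 10
        p = clamp(p_min, p_max, p)                                 # line 11
    else:
        p = max(p_min, p / 2)                                      # line 13
    if quantile(hist, 1 - p, bin_width) < err_bound:               # line 14
        p = max(p_min, p / 2)                                      # line 15
    theta = quantile(hist, 1 - p, bin_width)                       # line 16
    return p, theta


toy_hist = [900, 90, 9, 1]   # made-up counter distribution, 1000 counters
common = dict(hist=toy_hist, ping_pong=0.0, err_bound=0,
              m_quota=100, p_min=0.0001, p_max=0.0156)
# Busy CXL link, under quota: p grows, so the threshold stays low.
p_busy, theta_busy = update_percentile(0.01, bw_util=0.5, migrated=0, **common)
# Over quota: p is halved, so the threshold rises.
p_over, theta_over = update_percentile(0.01, bw_util=0.5, migrated=200, **common)
```

The two calls show the intended direction of the feedback: higher slow-tier bandwidth utilization widens the hot set, while exceeding the migration quota narrows it.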

### B. User-Space Interface

To facilitate the configuration of runtime parameters in NeoMem, we introduce a set of user-space interfaces, which are accessible through the `/sys/kernel/mm/neomem` directory. These interfaces are linked to various functions implemented in the kernel space, which allow users to retrieve essential information from NeoProf and adjust parameters such as the hotness threshold and the migration interval. The migration policy is then implemented within the user space, utilizing these interfaces as its communication channel. Users also have the flexibility to implement their own custom scheduling policies via these interfaces.
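A user-space policy would interact with such a directory through ordinary file reads and writes. The helpers below are a minimal sketch of that pattern. Only the directory path comes from the text: the attribute name `hot_threshold` is hypothetical, and the demonstration runs against a temporary directory instead of a real NeoMem kernel.

```python
import tempfile
from pathlib import Path

NEOMEM_SYSFS = Path("/sys/kernel/mm/neomem")   # directory named in the text


def read_knob(name, root=NEOMEM_SYSFS):
    """Read one attribute file and return its stripped text."""
    return (Path(root) / name).read_text().strip()


def write_knob(name, value, root=NEOMEM_SYSFS):
    """Write a value to one attribute file (needs privileges on a real system)."""
    (Path(root) / name).write_text(f"{value}\n")


# Demonstration against a scratch directory; "hot_threshold" is a made-up name.
with tempfile.TemporaryDirectory() as tmp:
    write_knob("hot_threshold", 200, root=tmp)
    demo_value = read_knob("hot_threshold", root=tmp)
```

Passing `root` explicitly keeps the policy logic testable on machines without the modified kernel.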

## VI. EVALUATION

### A. Experimental Setup

**Prototyping Platform.** We evaluate NeoMem’s practicality and performance on an FPGA-based CXL memory system, detailed in Table III. The setup includes a single-socket Intel® Sapphire-Rapids™ CPU and a CXL-enabled Intel® Agilex™-7 I-Series FPGA acting as CXL memory (CXL 1.1, Type-3 device). Intel enables `cxl.mem` on this FPGA, where the CXL- and memory-related IP cores are implemented on the chiplet [19]. The FPGA has dual-channel DDR4-2666 memory with 16GB capacity. The host CPU is equipped with 32GB $\times$ 4 DDR5-4800 memory. For different fast-slow memory ratios, we adjust the host memory size by reserving a specific amount of physical memory within the Linux kernel [49]. The default fast-slow ratio is 1:2. We disable the CPU’s SMT, fix the CPU clock frequency, and clear the page cache before running workloads to ensure consistent performance.

**Benchmarks.** Our evaluation utilizes eight representative benchmarks that have been widely used in previous memory system studies: DeathStarBench [24], a representative data-center benchmark; Page-Rank (PR) [9], a classic graph processing workload; XSBench [70] and GUPS<sup>3</sup> [1], both HPC workloads characterized by skewed hot memory regions; Silo [71], an in-memory database for which we employ the YCSB-C workload; Btree [4], an in-memory index lookup workload; and two scientific computing applications from SPEC-2017, namely 603.bwaves and 654.roms, selected for their substantial Resident Set Size (RSS). The RSS values for these benchmarks range from 10.3 GB to 19.7 GB. Workloads are executed with 32 threads to fully stress the CPU cores.

Fig. 10. FPGA-based Prototyping System.

**Baselines.** We select five baselines for comparative analysis, covering the memory access profiling techniques introduced in Section II-C. Specifically, to compare NeoMem with hint-fault monitoring methods, TPP [51] and AutoNUMA [17] are chosen. TPP enhances hint-fault monitoring by introducing several new features. AutoNUMA, part of Linux kernel v6.3, incorporates some of TPP’s features and introduces a configurable hotness threshold. In addition, to compare with PTE-scan and PMU sampling methods, we integrate these profiling techniques into NeoMem, replacing its native memory profiling functions. We call these two systems PTE-scan and PEBS for short. Lastly, we include First-touch NUMA as a baseline, a widely used memory allocation policy that assigns pages to the fast memory tier until it is full, without subsequent migration.

### B. Implementation

We implement the NeoProf hardware in Verilog, connect it to Intel’s Type-3 CXL IP, and then synthesize the design using Quartus 22.3. Figure 10-(c) shows the block diagram of the implemented prototype. NeoMem’s software components, along with the other baseline methods, are all developed on Linux kernel v6.3 for a fair comparison.

<sup>3</sup>The original GUPS has random memory access; we follow HeMem’s practice [56] and make some memory access regions hotter than the others.

Fig. 11. End-to-end Performance Comparison.

Fig. 12. Performance with Different Fast-Slow Memory Ratios.

TABLE V  
DEFAULT SOFTWARE PARAMETERS

| Parameter                | Value    | Description                                 |
|--------------------------|----------|---------------------------------------------|
| <i>m<sub>quota</sub></i> | 256MB/s  | The maximum page migration rate             |
| <i>p<sub>min</sub></i>   | 0.01%    | The lower percentile bound                  |
| <i>p<sub>max</sub></i>   | 1.56%    | The upper percentile bound                  |
| <i>p<sub>init</sub></i>  | 0.1%     | The initial value of <i>p</i>               |
| $\alpha / \beta$         | 1/2      | Adjustable hyper-parameters                 |
| migration_interval       | 10ms     | The interval of page migration in NeoMem    |
| clear_interval           | 5s       | The interval of resetting NeoProf counters  |
| thr_update_interval      | 1s       | The interval of updating hot-page threshold |
| pebs_sampling_rate       | 200-5000 | The sampling rate of PEBS                   |
| pte_sampling_rate        | 1-3s     | PTE-sampling rate of TPP & AutoNUMA         |
| page_scanning_rate       | 5s       | Page table scanning rate of PTE-scan        |

**Hardware Parameters.** Table IV lists the default hardware parameters employed by NeoProf. We configure two sketch lanes ($D = 2$), each equipped with 512K counters ($W = 512K$), where each counter is 16 bits in size. The sketch counter array is divided into 128 pipeline stages. We allocate a hot-page buffer with 16K entries to accommodate detected hot pages. Additionally, we utilize 32 bits to index the device-side page address (4KB page), allowing us to address up to 16TB of memory for each memory controller.
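As a quick check on these parameters, the addressable range and the raw sketch counter storage can be worked out directly. This assumes binary prefixes (512K $= 2^{19}$, 4KB $= 2^{12}$ bytes); the counter-storage figure is derived here from Table IV and is not stated in the text.

$$2^{32}\ \text{pages} \times 2^{12}\ \text{B/page} = 2^{44}\ \text{B} = 16\ \text{TB}$$

$$D \times W \times 16\ \text{bit} = 2 \times 2^{19} \times 2\ \text{B} = 2^{21}\ \text{B} = 2\ \text{MB}$$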

**Software Parameters.** The default software parameters of NeoMem and the baseline methods are listed in Table V. For NeoMem, we carefully set the parameters in Algorithm 1. For the baselines, the parameters are also tuned on each benchmark to ensure high performance.

**FPGA Resource Utilization.** Our NeoProf implementation consumes 93.8K ALMs (10%) and 1.5K BRAMs (M20K, 12%), and no DSPs. The FPGA post-synthesis layout is shown in Figure 10-(a). The light blue parts are consumed by NeoProf, and the remaining parts are mainly consumed by Intel’s FPGA support logic (Type-3 device) for the CXL hard IP.

### C. Main Results

**Performance Comparison.** Figure 11 shows the performance comparison of our proposed NeoMem system against baseline systems. All performance numbers are normalized against the PEBS system. As depicted by the blue bars representing NeoMem, our approach consistently demonstrates superior performance across all benchmarks, achieving geomean speedups ranging from 32% (over PEBS) to 67% (over First-touch NUMA). On the representative data-center benchmark, DeathStarBench, NeoMem achieves 1.19× to 1.67× speedup over baseline methods. These results demonstrate NeoMem’s efficiency in tiered memory management.

In certain benchmarks, NeoMem delivers especially large improvements over the baseline systems. For instance, NeoMem outperforms First-touch NUMA by factors of 3.5× and 4.7× in the XSBench and GUPS benchmarks, respectively. NeoMem also achieves 2.8× and 3.2× speedup over AutoNUMA and PTE-scan in XSBench, respectively. This performance gain is attributed to the skewed hot memory regions present in GUPS and XSBench, as discussed in [42]. NeoMem promptly and accurately identifies these hot regions based on NeoProf and efficiently migrates them to the fast memory, thus significantly enhancing system performance. A more detailed analysis of the slow-tier traffic reduction of different solutions is presented in Sec. VI-D.

**Performance with Different Memory Configs.** To illustrate NeoMem’s performance under various memory setups, we maintain a constant CXL memory size and investigate three fast-to-slow memory ratios: 1:2, 1:4, and 1:8. Our evaluation compares NeoMem with PEBS, the second-best memory-tiering system according to Figure 11. As depicted in Figure 12, NeoMem consistently outperforms PEBS. Notably, in Page-Rank and Btree, the performance gap between NeoMem and PEBS widens as fast memory shrinks, indicating NeoMem’s higher accuracy in hot page classification. Conversely, in GUPS and XSBench, the performance of both NeoMem and PEBS remains relatively stable. This is because the hot sets always fit within the fast memory.

### D. Analysis of NeoMem

**Memory Traffic and Page Migration Analysis.** To better understand per-application behaviors, we profile slow-tier (CXL memory) access and page migration using NeoProf’s state monitor and the Linux kernel’s counters. As shown in Fig. 13, NeoMem exhibits significantly lower slow-tier traffic across all benchmarks, which explains its superior performance. Note that the slow-tier traffic reduction is not strictly proportional to end-to-end performance in some cases. For example, on XSBench, PEBS has higher slow-tier access than TPP but better end-to-end performance in Fig. 11. This is due to other system-level factors; for example, false page promotion also incurs slow-tier access, and its extent varies among solutions.

From the figure, NeoMem’s promotion count (normalized to PEBS) is significantly lower than AutoNUMA’s, and is on par with PTE-scan. This implies NeoMem’s superior ability to identify hot pages accurately and promptly. TPP exhibits the fewest migrations in most cases, as it promotes pages only after two consecutive hint-faults. First-touch NUMA performs the worst among all baseline solutions, primarily because it performs no promotion. PEBS demonstrates fewer promotions than NeoMem in the majority of cases, which suggests that its sampling-based tracking has low coverage and is prone to missing a large number of hot pages.

Fig. 13. Slow-Tier (CXL Memory) Traffic and # of Promotions/Demotions Comparison.

Fig. 14. Profiling of NeoMem on Page-Rank Benchmark.

**CPU Overhead of NeoMem.** As NeoMem offloads memory profiling to dedicated hardware, the host CPU only has to retrieve data from NeoProf, which incurs minimal overhead. To verify this, we measure the slowdown on the GUPS benchmark relative to a baseline system where NeoProf is disabled. After several trials, we observe a mere 0.021% slowdown.

**Effectiveness of the Migration Policy.** To demonstrate the effectiveness of our NeoMem policy introduced in Section V-A, we compare it to the naive fixed-threshold policy. We consider the Page-Rank workload processing a graph through sixteen iterations. In each iteration, the execution time is recorded. We compare NeoMem’s dynamic threshold policy with fixed thresholds ( $\theta = \{100, 200, 300, 400\}$ ).

As depicted in Figure 14-(a), the dynamic threshold policy employed by NeoMem (the dark blue line) consistently results in the shortest execution times across these iterations. Fixed thresholds, for example $\theta = 200$, suffer from an obvious slowdown after the ninth iteration. This outcome demonstrates the effectiveness and necessity of a dynamic scheduling policy.

Figure 14-(b) illustrates the evolving hotness threshold during Page-Rank testing. The threshold is dynamically adjusted from 1 to 1000, with initial low settings and rapid increases (50s to 100s) in response to runtime conditions.

In Figure 14-(c), we plot the runtime read/write bandwidth profiled by NeoProf. We can observe that during the initial graph processing phase (around 40s), high bandwidth utilization prompts NeoMem to set a low threshold (as seen in Figure 14-(b)), promoting more pages and effectively reducing CXL memory bandwidth utilization.

Fig. 15. Sensitivity to System and NeoProf Parameters.

Figure 14-(d) visualizes the evolving access frequency histogram profiled by NeoProf. Every 5 seconds, the profiled histogram is plotted as a vertical strip. Darker regions indicate that more pages have the corresponding access frequency. The distribution of dark regions appears to correspond closely with the fluctuations observed in Figure 14-(b). This suggests that the hotness threshold is properly set according to the runtime page access frequency distribution.

**Sensitivity to System Parameters.** We investigate the impact of migration\_interval (the period at which NeoMem retrieves hot pages from NeoProf and performs promotion) and the migration quota ($m_{quota}$ in Sec. V-A) on system performance.

- **Migration Interval.** In Figure 15-(a), we vary the migration interval from 10ms to 5000ms and assess its effect on the Page-Rank benchmark. A shorter migration interval generally results in better performance, as it enables more timely detection and migration of hot pages. Achieving a short migration interval requires a memory profiling technique with high time resolution and low overhead, highlighting NeoProf’s advantages. In comparison, PTE-scan based methods can only support second-level hot page detection and migration [30], [56].

- **Migration Quota.** In Figure 15-(b), we vary the migration quota from 64MB/s to 8192MB/s and evaluate its impact on performance. We find that a 64MB/s migration quota results in a 10% lower performance compared to 128MB/s or 256MB/s. Increasing the migration quota further slightly hampers overall performance due to heightened migration aggressiveness.

**Sensitivity to NeoProf Parameters.** We evaluate NeoMem’s sensitivity to NeoProf’s hardware parameters, specifically sketch width ($W$) and sketch lanes ($D$). We find that using a single lane ($D=1$) results in a decrease in end-to-end performance. However, increasing $D$ beyond 2 does not noticeably improve performance. Consequently, we empirically choose to maintain $D$ at 2 in our prototype. We then vary the parameter $W$ from 32K to 512K and plot both the error-bound curve, calculated using the algorithm in Section IV, and the system performance using the Page-Rank benchmark. As shown in Figure 15-(c), increasing $W$ dramatically reduces the error bound, which is constantly zero when the sketch width reaches 512K. Concurrently, system performance improves as the sketch width increases, peaking when $W = 256K$, as shown in Figure 15-(d). Our prototype sets $W$ at 512K to ensure a sufficiently low error bound and high performance.

Fig. 16. Comparison Among Different Profiling Methods.

TABLE VI  
TRANSPARENT HUGE PAGE VS. BASE PAGE ON PAGE-RANK

| Memory-Tiering Technique | NeoMem THP | TPP THP | NeoMem Base | TPP Base |
|--------------------------|------------|---------|-------------|----------|
| Generate (s)             | 7.61       | 10.96   | 8.63        | 10.67    |
| Build (s)                | 23.90      | 36.90   | 25.58       | 35.67    |
| Avg. Trial (s)           | 2.80       | 3.59    | 2.95        | 3.32     |
| Total Time (s)           | 76.28      | 105.31  | 81.39       | 99.39    |
| Promoted Base Pages (GB) | 11.53      | 2.70    | 14.90       | 2.01     |
| Promoted Huge Pages (GB) | 7.02       | 0.74    | /           | /        |

**Convergence Analysis on GUPS.** To examine how the low overhead and high resolution/accuracy advantages of NeoProf contribute to improved performance, we perform a convergence analysis using the GUPS microbenchmark. In this experiment, we confine 90% of memory accesses to a fixed memory region, while the remaining 10% of memory accesses fall uniformly across the whole working set. For each method, we warm up the system for 600s to reach convergence. Then we suddenly change the location of the hot set to evaluate the convergence speed of different methods.
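The access pattern used in this experiment can be modelled with a small generator, shown below as a sketch. The function names, region sizes, and trace length are illustrative assumptions and are not the parameters of the actual GUPS runs. The hot-set change is modelled simply by generating a second phase with a different `hot_start`.

```python
import random


def skewed_accesses(n, working_set, hot_start, hot_size, hot_frac=0.9, seed=0):
    """Yield n page indices: hot_frac of them land in the hot region,
    the rest are uniform over the whole working set."""
    rng = random.Random(seed)
    for _ in range(n):
        if rng.random() < hot_frac:
            yield hot_start + rng.randrange(hot_size)
        else:
            yield rng.randrange(working_set)


def hot_share(trace, hot_start, hot_size):
    """Fraction of accesses in the trace that fall inside the given region."""
    hits = sum(hot_start <= page < hot_start + hot_size for page in trace)
    return hits / len(trace)


WORKING_SET = 1 << 20   # pages (illustrative size)
HOT_SIZE = 1 << 10      # pages (illustrative size)

phase1 = list(skewed_accesses(20000, WORKING_SET, hot_start=0, hot_size=HOT_SIZE))
# Second phase: the hot set suddenly moves to a different location.
phase2 = list(skewed_accesses(20000, WORKING_SET, hot_start=1 << 19,
                              hot_size=HOT_SIZE, seed=1))
```

After the move, pages from the old hot region receive only their share of the uniform 10%, so a profiler has to notice both that the old region cooled down and that a new one heated up.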

As shown in Figure 16, we plot the GUPS (giga updates per second, higher is better) of NeoProf and other baselines over time. The `pebs_sampling_rate` for the PEBS-based method is set to 397 in this experiment. NeoProf shows the highest GUPS in the converged state (0-50s), indicating that it accurately classifies hot and cold pages, avoiding unnecessary page migration. The smooth curve of NeoProf also reveals that NeoProf has a low overhead. After the hot set change at about 50s, NeoProf converges the fastest, indicating that it quickly identifies hot pages and migrates them to local memory, thanks to its high temporal and spatial resolution.

## VII. DISCUSSION & FUTURE WORK

Fig. 17. End-to-end Comparison with Memtis.

Fig. 18. Layout and Parameters of NeoProf Under TSMC 22nm Process.

**Huge Page Support.** In the experiments discussed earlier, NeoMem and all baselines manage and migrate pages at the base page (4KB) level. However, NeoMem can also support huge pages since it utilizes Linux’s huge-page-compatible page migration functions. NeoProf still reports hot 4KB pages, and the host can migrate huge pages, provided the profiled hot 4KB pages are part of huge pages. To evaluate the performance, we enable the widely adopted Transparent Huge Page (THP) feature in the Linux kernel, which automatically consolidates base pages into 2MB huge pages. As detailed in Table VI, NeoMem, when equipped with THP, demonstrates superior performance over the base-page-only configuration on Page-Rank. It efficiently migrates 7.02GB of huge pages into faster memory. In contrast, TPP experiences a performance decline with THP enabled, migrating only a small amount of huge pages. This is mainly due to its low time resolution for hot page detection.
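The mapping from reported hot 4KB pages to candidate huge pages is simple address arithmetic, sketched below. The function name and the `min_hot_subpages` knob are hypothetical additions for illustration. The sketch only computes 2MB-aligned regions and does not check whether a region is actually backed by a huge page, which the kernel would have to do.

```python
from collections import Counter

PAGES_PER_HUGE_PAGE = (2 << 20) // (4 << 10)   # 2MB / 4KB = 512


def huge_page_candidates(hot_pfns, min_hot_subpages=1):
    """Group hot 4KB page frame numbers by their 2MB-aligned region and
    return the indices of regions with enough distinct hot subpages."""
    per_region = Counter(pfn // PAGES_PER_HUGE_PAGE for pfn in set(hot_pfns))
    return sorted(region for region, n in per_region.items()
                  if n >= min_hot_subpages)


reported = [0, 1, 511, 512, 1024, 1025]   # made-up hot page frame numbers
any_hot = huge_page_candidates(reported)
at_least_two = huge_page_candidates(reported, min_hot_subpages=2)
```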

NeoMem is also orthogonal to earlier memory tiering methods optimized for huge pages, which detect hot pages and migrate them at base-page granularity to improve resolution and reduce overhead [5], [42].

**End-to-end Comparison with Memtis.** Memtis [42] is a recent CXL/NVM memory tiering solution. Memtis adopts PEBS to profile memory access, and incorporates dynamic hot set classification based on memory access distribution. We port their released code<sup>4</sup> to our hardware platform (with some bugs fixed) and perform an end-to-end comparison with NeoMem. As shown in Fig. 17, Memtis closely matches NeoMem’s performance on 603.bwaves but significantly underperforms on GUPS. NeoMem outperforms Memtis by a $1.58\times$ geomean speedup. Analysis reveals that Memtis promotes only 1.1% as many pages as NeoMem, likely due to its limited ability to adapt to rapidly changing memory access patterns through PEBS and histogram-based hot-page classification.

**Hardware Overhead Estimation.** To estimate the hardware cost of integrating NeoProf into CXL controllers, we also evaluate the area and power overheads of NeoProf using EDA tools. Figure 18 illustrates the layout and parameters of NeoProf. This implementation utilizes the TSMC 22nm technology node, with sketch parameters set at $W=256K$, $D=2$. The resulting design occupies an area of approximately $5.3\text{ mm}^2$ and consumes $152.2\text{ mW}$ of power, which is lightweight for integration into device-side controllers. The floorplanning results demonstrate that the SRAM macros, which are used to implement the sketch array, hot-bit array, and hot-page buffer, occupy about 54% of the chip area. The remaining on-chip area is consumed by NeoProf’s compute and control logic.

<sup>4</sup><https://github.com/cosmoss-jigu/memtis>

**Virtualization Support.** In cloud environments based on virtual machines, NeoMem can be integrated into the host OS. The host OS identifies hot physical pages through the NeoMem daemon and executes hot page promotion. Following hot page migration, the Extended Page Table (EPT) of the guest virtual machines will undergo remapping [63]. Evaluation in virtualized environments is planned for our future work.

**Scalability of NeoMem.** In our current prototyping system, we are constrained to a single 16GB CXL memory device due to hardware limitations. However, unlike the PTE-scan baseline, whose profiling overhead grows linearly with memory size because it scans all pages, NeoProf maintains consistent performance regardless of memory size. This is because NeoProf directly tracks CXL.mem requests, and the maximum request rate is constrained by channel bandwidth, not memory size. Additionally, according to sketching theory, profiling accuracy depends on the volume of incoming requests, which is also bandwidth-related, not memory-size-related. Also, the profiling throughput should scale linearly with the addition of more CXL memory devices equipped with NeoProf. Given the negligible software overhead of host-NeoProf interaction in a single-device scenario (0.021% slowdown reported in Section VI-D), adding multiple devices should not burden the host. We leave the evaluation in multi-device scenarios to our future work.

**Memory Interleaving.** In multi-device scenarios, a single physical page can be interleaved among multiple devices. Under such circumstances, the NeoProf in each device only profiles a specific fraction of a page. How interleaving affects the overall memory-tiering performance is yet to be explored. The host OS may also need to gather fragmented page hotness information from all NeoProfs and conduct additional post-processing tasks such as hot-page de-duplication.
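As an illustration of the de-duplication step mentioned above, the host could merge the per-device hot-page lists as sketched below. This is a hypothetical post-processing helper, not something evaluated in this work, and it assumes each device's report has already been translated to host physical page numbers.

```python
def merge_hot_reports(per_device_reports):
    """Union hot-page lists from several devices, keeping first-seen order
    and dropping pages that more than one device reported."""
    seen = set()
    merged = []
    for report in per_device_reports:
        for page in report:
            if page not in seen:
                seen.add(page)
                merged.append(page)
    return merged


# Made-up reports from three devices sharing interleaved pages.
merged_pages = merge_hot_reports([[1, 2, 3], [3, 4], [2, 5]])
```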

## VIII. RELATED WORK

### A. Probabilistic Algorithms for Data Stream Analysis

Probabilistic algorithms have been extensively employed to solve various tasks such as identifying the presence of specific items in data streams, exemplified by the Bloom Filters [75], [76]. They are also used to identify unique elements, as demonstrated by the HyperLogLog algorithm [22], and to estimate item frequencies, as shown by the Sketch algorithms [18], [27], [44], [79], [80]. In this study, we treat hot page detection as a problem of identifying “heavy hitters” in memory access streams, a task for which the Count-Min Sketch algorithms are particularly well-suited [18].

### B. Software-based Tiered Memory System

Software-based tiered memory systems have seen extensive research in techniques related to page access profiling [5], [15], [21], [30], [41], [63], page classification [30], [39], [42], [50], [56], and efficient page migration [61], [78]. These methods, however, are hindered by limited memory access profiling capabilities, as demonstrated in our work. An alternative approach involves managing memory objects directly at the application [2], [3], [31], [46], [58], [72] or library [20], [38], [52], [74], [77] level, but these require modifications to users’ applications or libraries. Our NeoMem solution offers a practical resolution to these challenges.

### C. Architecture Support for Memory Tiering

Besides software-based approaches, previous works have also optimized heterogeneous memory systems from a pure architectural standpoint. MemPod [54] uses the Majority Element Algorithm (MEA) to identify hot pages but assumes management of both slow and fast memory by the same hardware, which differs from the current CXL memory system. Similar approaches like CAMEO [16], PoM [64], and SILCFM [62] treat fast memory as a hardware-managed cache for slow memory. A recent work, HoPP [43], suggests modifying CPU memory controllers to track memory accesses and provide information to the OS. However, these approaches necessitate costly CPU-side modifications. NeoMem is a “CXL-native” solution, limiting hardware modifications to the device side and avoiding expensive CPU-side upgrades.

## IX. CONCLUSION

This paper introduces NeoMem, a novel CXL-native memory-tiering technique. NeoMem follows a hardware/software co-design philosophy, integrating a dedicated hardware profiler called NeoProf into the controllers of CXL memory. This enables the OS to access profiled information and perform efficient hot page migration based on a customized migration policy. Comprehensive evaluation on a real CXL memory platform demonstrates that NeoMem achieves a 32% to 67% geomean speedup over several existing memory tiering solutions.

## ACKNOWLEDGMENTS

We thank all the reviewers from ISCA 2024 and MICRO 2024 for their valuable comments. We thank Yijin Guan for his kind help. This work is supported by the National Natural Science Foundation of China (NSFC) (Grant No. 62032001) and the 111 Project (B18001). This work is also supported in part by the NSFC under Grant No. U21B2017. Dr. Jie Zhang is supported in part by the National Key Research and Development Program of China (Grant No. 2023YFB4502702) and the Natural Science Foundation of China (Grant No. 62332021).

## REFERENCES

- [1] “Gups,” <http://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/>.
- [2] “Memkind,” <https://memkind.github.io/memkind/>, 2021.
- [3] “Persistent memory programming,” <http://pmem.io/>, 2017.
- [4] R. Achermann and A. Panwar, “Mitosis workload btree,” <http://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/>, 2019.

- [5] N. Agarwal and T. F. Wenisch, "Thermostat: Application-transparent page management for two-tiered main memory," in *Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems*, 2017, pp. 631–644.
- [6] H. Al Maruf and M. Chowdhury, "Effectively prefetching remote memory with leap," in *2020 USENIX Annual Technical Conference (USENIX ATC 20)*, 2020, pp. 843–857.
- [7] M. Arif, A. Maurya, and M. M. Rafique, "Accelerating performance of gpu-based workloads using cxl," in *Proceedings of the 13th Workshop on AI and Scientific Computing at Scale using Flexible Computing*, 2023, pp. 27–31.
- [8] V. Banakar, K. Wu, Y. Patel, K. Keeton, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "WiscSort: External sorting for byte-addressable storage," *arXiv preprint arXiv:2307.06476*, 2023.
- [9] S. Beamer, K. Asanović, and D. Patterson, "The gap benchmark suite," *arXiv preprint arXiv:1508.03619*, 2015.
- [10] S. Bergman, P. Faldu, B. Grot, L. Vilanova, and M. Silberstein, "Reconsidering os memory optimizations in the presence of disaggregated memory," in *Proceedings of the 2022 ACM SIGPLAN International Symposium on Memory Management*, 2022, pp. 1–14.
- [11] D. Boles, D. Waddington, and D. A. Roberts, "Cxl-enabled enhanced memory functions," *IEEE Micro*, vol. 43, no. 2, pp. 58–65, 2023.
- [12] I. Calciu, M. T. Imran, I. Puddu, S. Kashyap, H. A. Maruf, O. Mutlu, and A. Kolli, "Rethinking software runtimes for disaggregated memory," in *Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems*, 2021, pp. 79–92.
- [13] P. Chen, Y. Wu, T. Yang, J. Jiang, and Z. Liu, "Precise error estimation for sketch-based flow measurement," in *Proceedings of the 21st ACM Internet Measurement Conference*, 2021, pp. 113–121.
- [14] A. Cho, A. Saxena, M. Qureshi, and A. Daglis, "A case for cxl-centric server processors," *arXiv preprint arXiv:2305.05033*, 2023.
- [15] J. Choi, S. Blagodurov, and H.-W. Tseng, "Dancing in the dark: Profiling for tiered memory," in *2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)*. IEEE, 2021, pp. 13–22.
- [16] C. C. Chou, A. Jaleel, and M. K. Qureshi, "Cameo: A two-level memory organization with capacity of main memory and flexibility of hardware-managed cache," in *2014 47th Annual IEEE/ACM International Symposium on Microarchitecture*. IEEE, 2014, pp. 1–12.
- [17] J. Corbet, "Autonuma: the other approach to numa scheduling," *LWN.net*, 2012.
- [18] G. Cormode and S. Muthukrishnan, "An improved data stream summary: the count-min sketch and its applications," *Journal of Algorithms*, vol. 55, no. 1, pp. 58–75, 2005.
- [19] I. Corporation, "Intel® fpga compute express link (cxl) ip," <https://www.intel.com/content/www/us/en/products/details/fpga/intellectual-property/interface-protocols/cxl-ip.html>, 2024.
- [20] S. R. Dulloor, A. Roy, Z. Zhao, N. Sundaram, N. Satish, R. Sankaran, J. Jackson, and K. Schwan, "Data tiering in heterogeneous memory systems," in *Proceedings of the Eleventh European Conference on Computer Systems*, 2016, pp. 1–16.
- [21] P. Duraisamy, W. Xu, S. Hare, R. Rajwar, D. Culler, Z. Xu, J. Fan, C. Kennelly, B. McCloskey, D. Mijailovic, B. Morris, C. Mukherjee, J. Ren, G. Thelen, P. Turner, C. Villavieja, P. Ranganathan, and A. Vahdat, "Towards an adaptable systems architecture for memory tiering at warehouse-scale," in *Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3*, ser. ASPLOS 2023. New York, NY, USA: Association for Computing Machinery, 2023, pp. 727–741. [Online]. Available: <https://doi.org/10.1145/3582016.3582031>
- [22] P. Flajolet et al., "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm," in *Discrete mathematics & theoretical computer science Proceedings*, 2007.
- [23] C. foundation, "Cxl 3.0 specification," <https://www.computeexpresslink.org/download-the-specification>, 2022.9.
- [24] Y. Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno, J. Hu, B. Ritchken, B. Jackson et al., "An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems," in *Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems*, 2019, pp. 3–18.
- [25] J. Gandhi, A. Basu, M. D. Hill, and M. M. Swift, "Badgertrap: A tool to instrument x86-64 tlb misses," *ACM SIGARCH Computer Architecture News*, vol. 42, no. 2, pp. 20–23, 2014.
- [26] D. Gouk, S. Lee, M. Kwon, and M. Jung, "Direct access, high-performance memory disaggregation with DirectCXL," in *2022 USENIX Annual Technical Conference (USENIX ATC 22)*, 2022, pp. 287–294.
- [27] A. Goyal and H. Daumé, "Lossy conservative update (lcu) sketch: Succinct approximate count storage," in *Proceedings of the AAAI Conference on Artificial Intelligence*, vol. 25, no. 1, 2011, pp. 878–883.
- [28] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. G. Shin, "Efficient memory disaggregation with infiniswap," in *14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17)*, 2017, pp. 649–667.
- [29] M. Ha, J. Ryu, J. Choi, K. Ko, S. Kim, S. Hyun, D. Moon, B. Koh, H. Lee, M. Kim, H. Kim, and K. Park, "Dynamic capacity service for improving cxl pooled memory efficiency," *IEEE Micro*, vol. 43, no. 2, pp. 39–47, 2023.
- [30] T. Heo, Y. Wang, W. Cui, J. Huh, and L. Zhang, "Adaptive page migration policy with huge pages in tiered memory systems," *IEEE Transactions on Computers*, vol. 71, no. 1, pp. 53–68, 2022.
- [31] M. Hildebrand, J. Khan, S. Trika, J. Lowe-Power, and V. Akella, "Autotom: Automatic tensor movement in heterogeneous memory systems using integer linear programming," in *Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems*, 2020, pp. 875–890.
- [32] S. Hynix, "Sk hynix cxl memory," <https://news.skhynix.com/sk-hynix-develops-ddr5-dram-cxltm-memory-to-expand-the-cxl-memory-ecosystem>, 2022.9.
- [33] J. Jang, H. Choi, H. Bae, S. Lee, M. Kwon, and M. Jung, "CXL-ANNS: Software-hardware collaborative memory disaggregation and computation for billion-scale approximate nearest neighbor search," in *2023 USENIX Annual Technical Conference (USENIX ATC 23)*, 2023, pp. 585–600.
- [34] X. Jin, X. Li, H. Zhang, R. Soulé, J. Lee, N. Foster, C. Kim, and I. Stoica, "Netcache: Balancing key-value stores with fast in-network caching," in *Proceedings of the 26th Symposium on Operating Systems Principles*, 2017, pp. 121–136.
- [35] T. Johnson and D. Shasha, "2q: A low overhead high performance buffer management replacement algorithm," in *Proceedings of the 20th International Conference on Very Large Data Bases*, ser. VLDB '94. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1994, pp. 439–450.
- [36] S. Kannan, A. Gavrilovska, V. Gupta, and K. Schwan, "Heteroos: Os design for heterogeneous memory management in datacenter," *SIGARCH Comput. Archit. News*, vol. 45, no. 2, pp. 521–534, jun 2017. [Online]. Available: <https://doi.org/10.1145/3140659.3080245>
- [37] J. Kim, W. Choe, and J. Ahn, "Exploring the design space of page management for multi-tiered memory systems," in *2021 USENIX Annual Technical Conference (USENIX ATC 21)*, 2021, pp. 715–728.
- [38] K. Kim, H. Kim, J. So, W. Lee, J. Im, S. Park, J. Cho, and H. Song, "SMT: software-defined memory tiering for heterogeneous computing systems with CXL memory expander," *IEEE Micro*, vol. 43, no. 2, pp. 20–29, 2023. [Online]. Available: <https://doi.org/10.1109/MM.2023.3240774>
- [39] K. Koh, K. Kim, S. Jeon, and J. Huh, "Disaggregated cloud memory with elastic block management," *IEEE Transactions on Computers*, vol. 68, no. 1, pp. 39–52, 2019.
- [40] M. Kwon, S. Lee, and M. Jung, "Cache in hand: Expander-driven cxl prefetcher for next generation cxl-ssd," in *Proceedings of the 15th ACM Workshop on Hot Topics in Storage and File Systems*, 2023, pp. 24–30.
- [41] T. Lee and Y. I. Eom, "Optimizing the page hotness measurement with re-fault latency for tiered memory systems," in *2022 IEEE International Conference on Big Data and Smart Computing (BigComp)*, 2022, pp. 275–279.
- [42] T. Lee, S. K. Monga, C. Min, and Y. I. Eom, "Memtis: Efficient memory tiering with dynamic page classification and page size determination," in *Proceedings of the 29th Symposium on Operating Systems Principles*, 2023, pp. 17–34.
- [43] H. Li, K. Liu, T. Liang, Z. Li, T. Lu, H. Yuan, Y. Xia, Y. Bao, M. Chen, and Y. Shan, "Hopp: Hardware-software co-designed page prefetching for disaggregated memory," in *2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*. IEEE, 2023, pp. 1168–1181.
- [44] H. Li, Q. Chen, Y. Zhang, T. Yang, and B. Cui, "Stingy sketch: a sketch framework for accurate and fast frequency estimation," *Proceedings of the VLDB Endowment*, vol. 15, no. 7, pp. 1426–1438, 2022.

- [45] H. Li, D. S. Berger, L. Hsu, D. Ernst, P. Zardoshti, S. Novakovic, M. Shah, S. Rajadnya, S. Lee, I. Agarwal, M. D. Hill, M. Fontoura, and R. Bianchini, "Pond: Cxl-based memory pooling systems for cloud platforms," in *Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2023, Vancouver, BC, Canada, March 25-29, 2023*, T. M. Aamodt, N. D. E. Jerger, and M. M. Swift, Eds., ACM, 2023, pp. 574–587. [Online]. Available: <https://doi.org/10.1145/3575693.3578835>
- [46] Z. Li and M. Wu, "Transparent and lightweight object placement for managed workloads atop hybrid memories," in *Proceedings of the 18th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments*, ser. VEE 2022. New York, NY, USA: Association for Computing Machinery, 2022, pp. 72–80. [Online]. Available: <https://doi.org/10.1145/3516807.3516822>
- [47] Linux, "Automatic numa balancing," <https://www.linux-kvm.org/images/7/75/01x07b-NumaAutobalancing.pdf>.
- [48] Linux, "Damon: Data access monitor," <https://docs.kernel.org/mm/damon/index.html>.
- [49] Linux, "Linux memmap command for reserving physical memory," <https://www.kernel.org/doc/html/v5.16/admin-guide/kernel-parameters.html>.
- [50] A. Maruf, A. Ghosh, J. Bhimani, D. Campello, A. Rudoff, and R. Rangaswami, "Multi-clock: Dynamic tiering for hybrid memory systems," in *2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*, 2022, pp. 925–937.
- [51] H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhattacharya, C. Petersen, M. Chowdhury, S. O. Kanaujia, and P. Chauhan, "TPP: transparent page placement for cxl-enabled tiered-memory," in *Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ASPLOS 2023, Vancouver, BC, Canada, March 25-29, 2023*, T. M. Aamodt, N. D. E. Jerger, and M. M. Swift, Eds., ACM, 2023, pp. 742–755. [Online]. Available: <https://doi.org/10.1145/3582016.3582063>
- [52] D.-J. Oh, Y. Moon, D. K. Ham, T. J. Ham, Y. Park, J. W. Lee, J. H. Ahn, and E. Lee, "Maphea: A framework for lightweight memory hierarchy-aware profile-guided heap allocation," *ACM Trans. Embed. Comput. Syst.*, vol. 22, no. 1, dec 2022. [Online]. Available: <https://doi.org/10.1145/3527853>
- [53] J. Corbet, "The future of memory tiering," <https://lwn.net/Articles/931421/>.
- [54] A. Prodromou, M. Meswani, N. Jayasena, G. Loh, and D. M. Tullsen, "Mempod: A clustered architecture for efficient and scalable migration in flat address space multi-level memories," in *2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)*, 2017, pp. 433–444.
- [55] M. Ramakrishna, E. Fu, and E. Bahcekapili, "Efficient hardware hashing functions for high performance computers," *IEEE Transactions on Computers*, vol. 46, no. 12, pp. 1378–1381, 1997.
- [56] A. Raybuck, T. Stamler, W. Zhang, M. Erez, and S. Peter, "Hemem: Scalable tiered memory management for big data applications and real nvm," in *Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles*, 2021, pp. 392–407.
- [57] Redis, "Redis data base," <https://github.com/redis/redis>, 2023.10.
- [58] J. Ren, J. Luo, K. Wu, M. Zhang, H. Jeon, and D. Li, "Sentinel: Efficient tensor migration and allocation on heterogeneous memory systems for deep learning," in *2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*, 2021, pp. 598–611.
- [59] J. Ren, D. Xu, I. Peng, J. Ryu, K. Shin, D. Kim, and D. Li, "Hm-keeper: Scalable page management for multi-tiered large memory systems," *arXiv preprint arXiv:2302.09468*, 2023.
- [60] J. Ren, D. Xu, I. Peng, J. Ryu, K. Shin, D. Kim, and D. Li, "Rethinking memory profiling and migration for multi-tiered large memory systems," 2023.
- [61] J. H. Ryoo, L. K. John, and A. Basu, "A case for granularity aware page migration," in *Proceedings of the 2018 International Conference on Supercomputing*, 2018, pp. 352–362.
- [62] J. H. Ryoo, M. R. Meswani, A. Prodromou, and L. K. John, "Silcfm: Subblocked interleaved cache-like flat memory organization," in *2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)*. IEEE, 2017, pp. 349–360.
- [63] S. Sha, C. Li, Y. Luo, X. Wang, and Z. Wang, "vtmm: Tiered memory management for virtual machines," in *Proceedings of the Eighteenth European Conference on Computer Systems*, 2023, pp. 283–297.
- [64] J. Sim, A. R. Alameldeen, Z. Chishti, C. Wilkerson, and H. Kim, "Transparent hardware management of stacked dram as part of memory," in *2014 47th Annual IEEE/ACM International Symposium on Microarchitecture*. IEEE, 2014, pp. 13–24.
- [65] J. Sim, S. Ahn, T. Ahn, S. Lee, M. Rhee, J. Kim, K. Shin, D. Moon, E. Kim, and K. Park, "Computational cxl-memory solution for accelerating memory-intensive applications," *IEEE Computer Architecture Letters*, vol. 22, no. 1, pp. 5–8, 2022.
- [66] Samsung, "Expanding the limits of memory bandwidth and density: Samsung's cxl dram memory expander," <https://semiconductor.samsung.com/newsroom/tech-blog/expanding-the-limits-of-memory-bandwidth-and-density-samsungs-cxl-dram-memory-expander/>, 2022.9.
- [67] Y. Sun, Y. Yuan, Z. Yu, R. Kuper, I. Jeong, R. Wang, and N. S. Kim, "Demystifying CXL memory with genuine cxl-ready systems and devices," *CoRR*, vol. abs/2303.15375, 2023. [Online]. Available: <https://doi.org/10.48550/arXiv.2303.15375>
- [68] Y. Sun, Y. Yuan, Z. Yu, R. Kuper, I. Jeong, R. Wang, and N. S. Kim, "Demystifying cxl memory with genuine cxl-ready systems and devices," *arXiv preprint arXiv:2303.15375*, 2023.
- [69] D. Tong and V. K. Prasanna, "Sketch acceleration on fpga and its applications in network anomaly detection," *IEEE Transactions on Parallel and Distributed Systems*, vol. 29, no. 4, pp. 929–942, 2017.
- [70] J. R. Tramm, A. R. Siegel, T. Islam, and M. Schulz, "XSbench - the development and verification of a performance abstraction for Monte Carlo reactor analysis," in *PHYSOR 2014 - The Role of Reactor Physics toward a Sustainable Future*, Kyoto, 2014. [Online]. Available: <https://www.mcs.anl.gov/papers/P5064-0114.pdf>
- [71] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden, "Speedy transactions in multicore in-memory databases," in *Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles*, 2013, pp. 18–32.
- [72] C. Wang, H. Cui, T. Cao, J. Zigman, H. Volos, O. Mutlu, F. Lv, X. Feng, and G. H. Xu, "Panthera: Holistic memory management for big data processing over hybrid memories," in *Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation*, ser. PLDI 2019. New York, NY, USA: Association for Computing Machinery, 2019, pp. 347–362. [Online]. Available: <https://doi.org/10.1145/3314221.3314650>
- [73] V. M. Weaver *et al.*, "Advanced hardware profiling and sampling (pebs, ibs, etc.): creating a new papi sampling interface," *Technical Report UMAINE-VMWTR-PEBS-SAMPLING-2016-08*. University of Maine, Tech. Rep., 2016.
- [74] W. Wei, D. Jiang, S. A. McKee, J. Xiong, and M. Chen, "Exploiting program semantics to place data in hybrid memory," in *2015 International Conference on Parallel Architecture and Compilation (PACT)*, 2015, pp. 163–173.
- [75] Wikipedia, "Bloom filter," <https://en.wikipedia.org/wiki/Bloom_filter>.
- [76] Wikipedia, "Counting bloom filter," <https://en.wikipedia.org/wiki/Counting_Bloom_filter>.
- [77] K. Wu, Y. Huang, and D. Li, "Unimem: Runtime data management on non-volatile memory-based heterogeneous main memory," in *Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis*, ser. SC '17. New York, NY, USA: Association for Computing Machinery, 2017. [Online]. Available: <https://doi.org/10.1145/3126908.3126923>
- [78] Z. Yan, D. Lustig, D. Nellans, and A. Bhattacharjee, "Nimble page management for tiered memory systems," in *Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems*, 2019, pp. 331–345.
- [79] T. Yang, J. Jiang, P. Liu, Q. Huang, J. Gong, Y. Zhou, R. Miao, X. Li, and S. Uhlig, "Elastic sketch: Adaptive and fast network-wide measurements," in *Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication*, 2018, pp. 561–575.
- [80] T. Yang, Y. Zhou, H. Jin, S. Chen, and X. Li, "Pyramid sketch: A sketch framework for frequency estimation of data streams," *Proceedings of the VLDB Endowment*, vol. 10, no. 11, pp. 1442–1453, 2017.
- [81] X. Zhang, Y. Chang, T. Lu, K. Zhang, and M. Chen, "Rethinking design paradigm of graph processing system with a cxl-like memory semantic fabric," in *2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)*. IEEE, 2023, pp. 25–35.