# Storage Systems (StoSys) XM\_0092

# Lecture 11: CXL and io\_uring

Animesh Trivedi

https://stonet-research.github.io/

Autumn 2023, Period 1



# **Syllabus outline**

- 1. Welcome and introduction to NVM
- 2. Host interfacing and software implications
- 3. Flash Translation Layer (FTL) and Garbage Collection (GC)
- 4. NVM Block Storage File systems
- 5. NVM Block Storage Key-Value Stores
- 6. Emerging Byte-addressable Storage
- 7. Networked NVM Storage
- 8. Trends: Specialization and Programmability
- 9. Distributed Storage / Systems I
- 10. Distributed Storage / Systems II
- 11. Emerging Topics

### **Today is the last course lecture**

We survived! It has been quite fun to teach this course.

I hope you also had fun and learned a lot about the advancements happening in the area of storage research.

In coming days and weeks

- Next Tuesday: Milestone 5 interview sign up!
- Next Wednesday: Guest Lecture from Nikolas
- Afterwards: Prepare for the exam. Good luck!
- In the End: We will ask for some feedback on the course
  - Me as a teacher
  - Broadly about the course *you can be frank!*
  - Want to be the TA next year?



# If you are interested in such research ...

Individual research projects (XM\_405088)

• 6 or 12 ECTS credits

Master projects / literature study

- Benchmarking the storage benchmarks
- io\_uring/CXL research (today's lecture)
- Integrating NVM(e)/NVMoF storage in ML runtime to train large models (Swapping Tensors)
- Building computation storage device prototype in QEMU
- Virtualizing ZNS/NVMe devices
- Scheduling I/O operations for workload-specific optimizations
- Your favorite idea ... I am broadly open to ideas from your side; pick a paper and let's discuss





### **Recap:** From HDDs to Persistent Memories (PMem)





http://pages.cs.wisc.edu/~remzi/OSTEP/file-disks.pdf https://www.partitionwizard.com/help/what-is-chs.html

## The (new) triangle of storage hierarchy



## Multiple Emerging Topics (non-exhaustive)

Domain-specific/specialized storage solutions

Storage virtualization, Disaggregation (end-to-end software-defined-\*)

Quality-of-service in Storage Ecosystems (scheduling, multi-tenancy)

Energy Considerations

**CPU-free Computing** (re-thinking the computing architecture)

<u>CPU-free Computing: A Vision with a Blueprint | Proceedings of the 19th Workshop on Hot Topics in Operating Systems</u>

#### Hardware changes: Compute Express Link (CXL)

• Brief motivation and capabilities (without getting into too much hw/PCIe details)

#### New software APIs: io\_uring (Linux, also being ported to other OSes)

• How it differs from other APIs, which options it provides, and its performance implications

#### The CPU is the center of computing

- direct memory access
- center of coherency
- controller of the devices

and the final coordinator and arbiter


The CPU performance was fast!







#### CPU cache management is non-trivial and complex (even with same/similar homogeneous CPU architectures)









Cost-effective, Energy-efficient, and Scalable Storage Computing for Large-scale AI Applications. ACM Trans. Storage 16, 4, Article 21 (November 2020), 37 pages. https://doi.org/10.1145/3415580




#### These accelerators can have:

- Compute elements (specialized FPGA, or general ARM)
- Memory elements
  - Storage chips
  - Multi-level caches
  - Outside connectivity

Who manages "coherency", "data flow", "configuration", "management" of memories/caches/devices here? Software, hardware? Performance? Cost of development of new APIs, protocols?




- What happens to the remaining 1.5 GB DRAM?
- Do applications use all the DRAM that they ask for?

- 1. Cannot mix and match different DRAM technologies and generations
- 2. More performance means more capacity (need to buy more DIMMs)
- 3. Limit to how much DRAM can be packed in a single machine



# Very close coupling of CPU-DRAM: (1) DRAM technology; (2) Density, capacity; and (3) Performance





Figure 2: Memory stranding (§3.1). Stranding increases significantly as more CPU cores are scheduled. Error bars indicate the  $5^{th}$  and  $95^{th}$  percentiles (outliers in dots).

- DRAM is a big power and cost factor in data centers (up to ~40%)
- A big part of it can remain underutilized
- Azure with VMs: on average ~10% (but as high as ~30%)

Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, ASPLOS 2023, <u>https://doi.org/10.1145/3575693.3578835</u>

TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory, ASPLOS 2023, <u>https://doi.org/10.1145/3582016.3582063</u>





Figure 7: Application memory usage over last N mins.

Figure 11: Fraction of pages re-accessed at different intervals.

#### Not all allocated pages are used **uniformly**:

- (1) Only a small fraction of memory is <u>accessed</u> within a 1-2 minute window
- (2) For Web, almost 80% of the pages are <u>re-accessed</u> within a ten-minute interval, but for warehouse it is only 20%

#### (do they all have to be in DRAM?)


## **Summary Problem**

There has to be a better way to

- Manage non-CPU memories and caches (accelerators)
- Manage CPU-attached memories (allocation, disaggregate from the CPU)
- Expand beyond the CPU-attached memories

#### + Think of non-volatile memories ...

- Persistent memories
- Fast storage

Solution: **Compute Express Link (CXL)** (the last protocol we will ever need)

# **Compute Express Link (CXL)**

#### A cache coherent Interconnect between

- The CPU
- Accelerators
- Memory expansion cards

#### Asymmetric protocol

A set of standardized protocols defined on the top of PCIe 5.0 (PHY)

- Runs in the standard PCIe slots
- 32 GT/s per lane, or ~4 GB/s per lane  $\Rightarrow$  x32 card = **128 GB/sec**
- Latencies approaching those of a remote NUMA CPU socket (with PCIe 6.0)



| PCIe Specification | Data Rate per Lane (GT/s) | Encoding  | x16 Unidirectional Bandwidth (GB/s) | Specification Ratification Year |
|--------------------|---------------------------|-----------|-------------------------------------|---------------------------------|
| 1.x                | 2.5                       | 8b/10b    | 4                                   | 2003                            |
| 2.x                | 5                         | 8b/10b    | 8                                   | 2007                            |
| 3.x                | 8                         | 128b/130b | 15.75                               | 2010                            |
| 4.0                | 16                        | 128b/130b | 31.5                                | 2017                            |
| 5.0                | 32                        | 128b/130b | 63                                  | 2019                            |
| 6.0                | 64                        | PAM4/FLIT | 128                                 | 2022                            |

https://www.electronicdesign.com/technologies/embedded/article/21162617/cxl-coherency-memory-and-io-semantics-on-pcie-infrastructure

https://www.xda-developers.com/pcie-5/

https://www.rambus.com/blogs/pcie-6/
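The bandwidth column of the table can be approximated from the per-lane data rate and the line encoding. The sketch below is a back-of-the-envelope calculation only: the function names are made up for illustration, and the PCIe 6.0 entry deliberately ignores FLIT/FEC overhead, so it reports the raw figure.

```python
def lane_bandwidth_gb_s(gt_per_s: float, payload_bits: int, total_bits: int) -> float:
    """Approximate one-directional bandwidth of a single lane in GB/s."""
    # One transfer moves one bit per lane; the line code uses
    # (total_bits - payload_bits) out of every total_bits as overhead.
    return gt_per_s * payload_bits / total_bits / 8


def link_bandwidth_gb_s(gt_per_s: float, payload_bits: int, total_bits: int,
                        lanes: int = 16) -> float:
    """Approximate one-directional bandwidth of a multi-lane link in GB/s."""
    return lanes * lane_bandwidth_gb_s(gt_per_s, payload_bits, total_bits)


# (data rate in GT/s, payload bits, total bits) per generation, from the table.
GENERATIONS = {
    "1.x": (2.5, 8, 10),
    "2.x": (5.0, 8, 10),
    "3.x": (8.0, 128, 130),
    "4.0": (16.0, 128, 130),
    "5.0": (32.0, 128, 130),
    "6.0": (64.0, 1, 1),  # assumption: raw rate, FLIT/FEC overhead ignored
}

for gen, (rate, payload, total) in GENERATIONS.items():
    x16 = link_bandwidth_gb_s(rate, payload, total)
    print(f"PCIe {gen}: x16 is roughly {x16:.2f} GB/s per direction")
```

With these assumptions, a PCIe 5.0 lane comes out slightly below 4 GB/s, which is why the rounded "4 GB/s per lane" figure is an approximation.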

# **<u>Three</u>** CXL Protocols

#### CXL.io

- Mandatory for all hosts, and CXL supported devices
- Discovery, enumerations, capabilities (DMA, interrupts, IOV), and host physical address configuration
- Same in spirit to what any basic PCIe device would support

#### CXL.mem

- Enables (only) CPU to access device/accelerator memory in a cacheable manner
- Useful in DRAM expansion
- Device is not initiating any communication

#### CXL.cache

- Like CXL.mem, but in the other direction: devices can now also access (and cache) the CPU memory/caches
- Additional commands/requests for maintaining coherence among <u>all</u> copies

## **<u>Three</u>** Classes of Devices



https://www.computeexpresslink.org/\_files/ugd/0c1418\_a8713008916044ae9604405d10a7773b.pdf

https://www.computeexpresslink.org/\_files/ugd/0c1418\_998df4f459734f319e7a12cc2163b943.pdf
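As a rough summary, the three device classes are usually described as combinations of the three protocols: Type 1 uses CXL.io + CXL.cache, Type 2 uses all three, and Type 3 uses CXL.io + CXL.mem. The sketch below only encodes that mapping; the Python names are illustrative and not part of any CXL software interface.

```python
from enum import Flag, auto


class Protocol(Flag):
    IO = auto()     # CXL.io: discovery, enumeration, configuration
    CACHE = auto()  # CXL.cache: device coherently caches host memory
    MEM = auto()    # CXL.mem: host accesses device-attached memory


# Protocol combination per device class.
DEVICE_TYPES = {
    "Type 1": Protocol.IO | Protocol.CACHE,                 # caching device without host-managed memory
    "Type 2": Protocol.IO | Protocol.CACHE | Protocol.MEM,  # accelerator with its own memory
    "Type 3": Protocol.IO | Protocol.MEM,                   # memory expander
}


def host_can_load_store(device_type: str) -> bool:
    """True if the host can reach the device's memory through CXL.mem."""
    return Protocol.MEM in DEVICE_TYPES[device_type]


def device_can_cache_host_memory(device_type: str) -> bool:
    """True if the device can cache host memory through CXL.cache."""
    return Protocol.CACHE in DEVICE_TYPES[device_type]


for name in DEVICE_TYPES:
    print(name, host_can_load_store(name), device_can_cache_host_memory(name))
```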

## **<u>Three</u>** Generations of CXL Protocols

| Features                                     | CXL 1.0 / 1.1 | CXL 2.0 | CXL 3.0 |
|----------------------------------------------|---------------|---------|---------|
| Release date                                 | 2019          | 2020    | 1H 2022 |
| Max link rate                                | 32 GT/s       | 32 GT/s | 64 GT/s |
| Flit 68 byte (up to 32 GT/s)                 | ✓             | ✓       | ✓       |
| Flit 256 byte (up to 64 GT/s)                |               |         | ✓       |
| Type 1, Type 2 and Type 3 Devices            | ✓             | ✓       | ✓       |
| Memory Pooling w/ MLDs                       |               | ✓       | ✓       |
| Global Persistent Flush                      |               | ✓       | ✓       |
| CXL IDE                                      |               | ✓       | ✓       |
| Switching (Single-level)                     |               | ✓       | ✓       |
| Switching (Multi-level)                      |               |         | ✓       |
| Direct memory access for peer-to-peer        |               |         | ✓       |
| Enhanced coherency (256 byte flit)           |               |         | ✓       |
| Memory sharing (256 byte flit)               |               |         | ✓       |
| Multiple Type 1/Type 2 devices per root port |               |         | ✓       |
| Fabric capabilities (256 byte flit)          |               |         | ✓       |

- CXL 3.0: Enabling composable systems with expanded fabric capabilities, October 6, 2022, <u>https://www.computeexpresslink.org/\_files/ugd/0c1418\_998df4f459734f319e7a12cc2163b943.pdf</u>
- Good overview, <u>https://community.cadence.com/cadence\_blogs\_8/b/breakfast-bytes/posts/hot-chips-cxl-tutorial</u>



What can we do?

- Expansion of DRAM
- CPU-memory decoupling (multiple generations of devices)
- Memory pooling and sharing
- From Single Logical Device (SLD, exclusive to one CXL root) to Multiple Logical Device (MLD, connected to multiple CXL roots)
- Memory hot swapping ...

# **Design a Distributed Cluster Running CXL**



Multiple types of devices, Global Fabric Attached Memory (GFAM)

CXL 3.0 Fabric Architecture

- Interconnected Spine Switch System
- Leaf Switch NIC Enclosure
- Leaf Switch CPU Enclosure
- Leaf Switch Accelerator Enclosure
- Leaf Switch Memory Enclosure



CXL 3.0: Enabling composable systems with expanded fabric capabilities, October 6, 2022, https://www.computeexpresslink.org/\_files/ugd/0c1418\_998df4f459734f319e7a12cc2163b943.pdf



## **CXL.mem Expansion Device Example**



- 1. PCIe enumeration and BAR mapping, with Host-Managed Device Memory (HDM) areas
- 2. Set up the MMU and allocate DRAM physical addresses from this area (software support)
- 3. An access happens, and the request is routed to the PCIe/CXL root

### **CXL.mem Expansion Device Example**



Multiple configurations (1) striping across multiple devices, ports, roots; (2) allocation units...
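To make the routing and striping idea concrete, here is a simplified sketch of how a host physical address inside an HDM window could be mapped to a device and a device-local address. It is a toy model: `HdmWindow`, the 256-byte granularity, and the device names are assumptions for illustration, and real HDM decoders are programmed through registers with their own constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HdmWindow:
    """Toy model of a host-physical address window striped over CXL devices."""
    base: int               # first host physical address (HPA) of the window
    size: int               # window size in bytes
    devices: tuple          # hypothetical device names, in interleave order
    granularity: int = 256  # bytes placed on one device before moving to the next

    def decode(self, hpa: int):
        """Return (device, device-local address) for a host physical address."""
        if not (self.base <= hpa < self.base + self.size):
            raise ValueError("address is outside this CXL window")
        offset = hpa - self.base
        chunk = offset // self.granularity   # index of the interleave chunk
        way = chunk % len(self.devices)      # which device holds the chunk
        dpa = (chunk // len(self.devices)) * self.granularity + offset % self.granularity
        return self.devices[way], dpa


window = HdmWindow(base=0x1_0000_0000, size=1 << 30, devices=("cxl0", "cxl1"))
for delta in (0, 256, 512 + 5):
    print(hex(window.base + delta), "->", window.decode(window.base + delta))
```

A smaller granularity spreads consecutive cache lines over more devices (more parallelism), while a larger one keeps each page on a single device; which is better depends on the workload.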

## **Transparent Page Placement (TPP)**



PCIe 6.0 latencies and bandwidth are approaching access to a remote NUMA CPU socket

**Challenge**: How to profile pages (at low overhead) and put them in the right level of the CXL-enabled memory hierarchy
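To illustrate the placement problem (not the mechanism from the TPP paper), the following toy model keeps recently used pages in a small DRAM tier and demotes the coldest page to a CXL tier when DRAM is full. The class name and the promote-on-first-access policy are assumptions chosen for brevity.

```python
from collections import OrderedDict


class TwoTierMemory:
    """Toy DRAM + CXL tiering: LRU demotion, promotion on access."""

    def __init__(self, dram_pages: int):
        self.dram_capacity = dram_pages
        self.dram = OrderedDict()  # page -> None, ordered coldest to hottest
        self.cxl = set()           # pages currently placed in CXL memory

    def allocate(self, page):
        """New pages start in DRAM; the coldest DRAM page is demoted if needed."""
        self._insert_dram(page)

    def access(self, page) -> str:
        """Touch a page and return the tier that served the access."""
        if page in self.dram:
            self.dram.move_to_end(page)  # mark as most recently used
            return "dram"
        if page in self.cxl:
            self.cxl.remove(page)        # promote the page on access
            self._insert_dram(page)
            return "cxl"
        raise KeyError(page)

    def _insert_dram(self, page):
        while len(self.dram) >= self.dram_capacity:
            cold, _ = self.dram.popitem(last=False)  # evict the coldest page
            self.cxl.add(cold)
        self.dram[page] = None


mem = TwoTierMemory(dram_pages=2)
for p in ("a", "b", "c"):
    mem.allocate(p)
print(mem.access("a"), sorted(mem.dram), sorted(mem.cxl))
```

Promoting on every first access can thrash when the hot set exceeds DRAM, which is one reason low-overhead profiling of page hotness is the hard part.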

Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, ASPLOS 2023, <u>https://doi.org/10.1145/3575693.3578835</u> TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory. ASPLOS 2023, <u>https://doi.org/10.1145/3582016.3582063</u>

### **POND (ASPLOS'23): How to Disaggregate VM Memory**



Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, ASPLOS 2023, <u>https://doi.org/10.1145/3575693.3578835</u>

## Where does Storage Come into Play?

Any device can implement the CXL protocol

- Use SSD as large capacity RAM
- <u>Byte\*-addressable</u>
- Persistent

\*64B addressable



https://news.samsung.com/global/samsung-electronics-unveils-far-reaching-next-generation-memory-solutions-at-flash-memory-summit-2022

#### **Emerging work: Quantifying and Hiding Flash Latencies**

#### Hello Bytes, Bye Blocks: PCIe Storage Meets Compute Express Link for Memory Expansion (CXL-SSD)

Myoungsoo Jung Computer Architecture and Memory Systems Laboratory, Korea Advanced Institute of Science and Technology (KAIST) http://camelab.org

#### ABSTRACT

Compute express link (CXL) is the first open multi-protocol method to support cache coherent interconnect for different processors, accelerators, and memory device types. Even though CXL manages data coherency mainly between CPU memory spaces and memory on attached devices, we argue that it can also be useful to reform existing block storage as cost-efficient, large-scale working memory. Specifically, this paper examines three different sub-protocols of CXL from a memory expander viewpoint. It then suggests which device type can be the best option for PCIe storage to bridge its block semantics to memory-compatible, byte semantics. We then discuss how to integrate a storage-integrated memory expander into an existing system and speculate how much effect it does have on the system performance. Lastly, we visit various CXL network topologies and explore a new opportunity to efficiently manage the storage-integrated, CXL-based memory expansion.

#### 1 INTRODUCTION

Cache coherence interconnects are recently emerged to integrate different CPUs, accelerators, and memory components into a heterogeneous, single computing domain. Specifically, the interconnect technologies maintain data coherency between CPU memory and private memory attached to devices, defining a new type of globally shared memory and network space. While there have been several efforts to coherently connect different hardware components, such as Gen-Z [11] and CCIX [2], Compute Express Link (CXL) is the first open interconnect protocol supporting various types of processors and device endpoints [3]. CXL has absorbed Gen-Z [41] and become one of the most promising interconnect interfaces thanks to its high-speed coherence control and full compatibility with the existing bus standard, PCIe. A broad spectrum of datacenter-scale hardware such as CPU, GPU, FPGA, and domain-specific ASIC is thus expected to take significant advantage of CXL [5–7]. CXL consortium announces that it can also disaggregate memory by pooling DRAM and byte-addressable persistent memory (PMEM).

While CXL can handle diverse computing resources and memory components, it sets block storage aside and leaves a question on whether the storage can reap the benefits of CXL: i) why and what can the block storage benefit from CXL? If there is an advantage, we should be able to answer the following questions: ii) how can we connect the underlying block storage to the host's system memory bus?, iii) what kind of CXL device type should be used for the block storage and memory expander?, and iv) what does CXL need to improve for better utilization of the block storage?

In this paper, we argue that CXL is helpful in leveraging PCIe-based block storage to incarnate a large, scalable working memory by answering all the four questions mentioned above. We believe CXL is a cost-effective and practical interconnect technology that can bridge PCIe storage's block semantics to memory-compatible, byte semantics. To this end, we should carefully integrate the block storage into its interconnect network by being aware of the diversity of device types and protocols that CXL supports. This paper first discusses what mechanism makes the PCIe storage impractical and unable to be used for a memory expander (§2). Then, we explore all the CXL device types and their protocol interfaces to answer which configuration would be the best for the PCIe storage to expand the host's CPU memory (§3).

Even though CXL can be the most promising interface for the block storage in getting closer to CPU, it is non-trivial to speculate how much effect a storage-integrated memory expander does have on system performance. As there is no CPU and fabric for CXL yet, it is also unclear for the storage designers and system architects to see how CXL-enabled storage can be implemented and interact with CPU. To answer this, we discuss what a PCIe storage device needs to change, how it can be connected to the host over CXL, and how users can access the device through load/store instructions (§4). We then project the performance of the storage-integrated memory expander by prototyping CXL agents and controllers.

#### Cache in Hand: Expander-Driven CXL Prefetcher for Next Generation CXL-SSDs

Miryeong Kwon<sup>\*†</sup>, Sangwon Lee<sup>\*†</sup>, Myoungsoo Jung<sup>\*†</sup> \*Computer Architecture and Memory Systems Laboratory, KAIST <sup>†</sup>Panmnesia, inc.

#### ABSTRACT

Integrating compute express link (CXL) with SSDs allows scalable access to large memory but has slower speeds than DRAMs. We present ExPAND, an expander-driven CXL prefetcher that offloads last-level cache (LLC) prefetching from host CPU to CXL-SSDs. ExPAND uses a heterogeneous prediction algorithm for prefetching and ensures data consistency with CXL.mem's back-invalidation. We examine prefetch timeliness for accurate latency estimation. ExPAND, being aware of CXL multi-tiered switching, provides end-to-end latency for each CXL-SSD and precise prefetch timeliness estimations. Our method reduces CXL-SSD reliance and enables direct host cache access for most data. ExPAND enhances graph application performance by 3.5x, surpassing CXL-SSD pools with diverse prefetching strategies.

#### 1 INTRODUCTION

Compute Express Link (CXL) is receiving considerable attention as an emerging interface that separates memory resources from computing servers, allowing users to access large-capacity memory scalably. In terms of capacity, storage class memory (SCM) technologies such as PRAM [1], Z-NAND [2], and XL-Flash [3] offer greater advantages over DRAMs. As a result, both industry and academia strive to introduce byte-addressable solid-state drives (SSDs) using the CXL protocol and SCM's memory instruction semantics. For instance, one method integrates CXL into Optane SSDs for hierarchical memory expansion, while several proof-of-concepts (PoCs) employ new flash like Z-NAND and XL-Flash to develop CXL-SSDs [4–6].

HotStorage '23, July 9, 2023, Boston, MA, USA. ACM ISBN 979-8-4007-0224-2/23/07. https://doi.org/10.1145/3599691.3603406

While CXL-SSDs target capacity needs for memory disaggregation, their backend media remain slower than DRAMs. Specifically, PRAMs are 7× slower than DRAMs [7], and the new flash technologies exhibit latencies 30× slower [2]. To address this, industrial PoCs employ SSD-side DRAM buffers as internal caches, resembling high-performance NVMe storage with larger internal DRAMs. Although these buffers effectively handle write latency issues, they struggle to mask the long read latency caused by SCM backend media. Unlike file system-managed block devices, CXL-SSDs should serve memory requests (load/store) without relying on the host-side storage stack. Concealing long read latency necessitates understanding execution behaviors of host applications and managing the corresponding CPU cache hierarchy, appropriately. Regrettably, these aspects are neglected by existing SSD technologies, as they have solely handled block requests thus far.

When CXL-SSDs are placed in the system memory space as host-managed device memory, existing CPU-side cache prefetching mechanisms can still be beneficial. However, two main unaddressed challenges prevent current prefetchers within the cache hierarchy from fully utilizing the advantages of LLC with CXL-SSDs: i) hardware logic size constraints in handling a wide range of memory access patterns possibly encountered in the extensive CXL memory pooling space, and ii) latency variations experienced by different CXL-SSDs located in diverse positions within the CXL switch network.

In particular, rule-based cache prefetchers, such as spatial [8–10] and temporal prefetching algorithms [11–13], require tens of MB storage that is similar to the actual last-level cache (LLC) of a CPU [9]. As a result, modern CPUs employ a simpler stream cache prefetching algorithm [14], which unfortunately is unable to mask the increased latency introduced by CXL-SSDs. Another contributing factor is the interconnect network topology used in CXL-based memory disaggregation. To boost memory capacity in a scalable way, CXL introduces a multi-level switch architecture where each level can potentially increase memory expander latency, depending on the target's position within the network. This is because the processing time taken by CXL switches at different levels cannot be overlooked. Consequently, existing prefetchers are

#### **Overcoming the Memory Wall with CXL-Enabled SSDs**

| Author         | Affiliation         |
|----------------|---------------------|
| Shao-Peng Yang | Syracuse University |
| Minjae Kim     | DGIST               |
| Sanghyun Nam   | Soongsil University |
| Juhyung Park   | DGIST               |
| Jin-yong Choi  | FADU Inc.           |
| Eyee Hyun Nam  | FADU Inc.           |
| Eunji Lee      | Soongsil University |
| Sungjin Lee    | DGIST               |
| Bryan S. Kim   | Syracuse University |

#### Abstract

This paper investigates the feasibility of using inexpensive flash memory on new interconnect technologies such as CXL (Compute Express Link) to overcome the memory wall. We explore the design space of a CXL-enabled flash device and show that techniques such as caching and prefetching can help mitigate the concerns regarding flash memory's performance and lifetime. We demonstrate using real-world application traces that these techniques enable the CXL device to have an estimated lifetime of at least 3.1 years and serve 68–91% of the memory requests under a microsecond. We analyze the limitations of existing techniques and suggest system-level changes to achieve a DRAM-level performance using flash.

#### 1 Introduction

The growing imbalance between computing power and memory capacity requirement in computing systems has developed into a challenge known as the memory wall [23, 34, 52]. Figure 1, based on the data from Gholami et al. [34] and expanded with more recent data [11, 30, 43], illustrates the rapid growth in NLP (natural language processing) models (14.1× per year), which far outpaces that of memory capacity (1.3× per year). The memory wall forces modern data-intensive applications such as databases [8, 10, 14, 20], data analytics [1, 35], and machine learning (ML) [45, 48, 66] to either be aware of their memory usage [61] or implement user-level memory management [66] to avoid expensive page swaps [37, 53]. As a result, overcoming the memory wall in an application-transparent manner is an active research avenue; approaches such as creating an ML-centric system [45, 48, 61], building a memory disaggregation framework [36, 37, 52, 69], and designing new memory architecture [23, 42] are actively pursued.

We question whether it is possible to overcome the memory wall using flash memory — a memory technology that is typically used in storage due to its high density and capacity scaling [59]. While DRAM can only scale to gigabytes in capacity, a flash memory-based solid-state drive (SSD) is



Figure 1: The trend in memory requirements for NLP applications [11, 30, 34, 43]. The number of parameters increases by a factor of  $14.1 \times$  per year, while the memory capacity in GPUs only grows by a factor of  $1.3 \times$  every year.

in the terabyte scale [23], a sufficiently large capacity to address the memory wall challenge. The use of flash memory as main memory is enabled by the recent emergence of interconnect technologies such as CXL [3], Gen-Z [7], CCIX [2], and OpenCAPI [12], which allow PCIe (Peripheral Component Interconnect Express) devices to be accessed directly by the CPU through load/store instructions. Furthermore, these technologies promise excellent scalability as more PCIe devices can be attached across switches [13], unlike DIMM (Dual Inline Memory Module) used for DRAM.

However, there are three main challenges to using flash memory as CPU-accessible main memory. First, there is a granularity mismatch between memory requests and flash memory. This results in a significant traffic amplification on top of the existing need for indirection in flash [23, 33]: for example, a 64B cache line flush to the CXL-enabled flash would result in a 16KB flash memory page read, 64B update, and 16KB flash program to a different location (assuming a 16KB page-level mapping). Second, flash memory is still orders of magnitude slower than DRAM (tens of microseconds vs. tens of nanoseconds) [5, 24]. As a consequence, while the peak data transfer rate between the two technologies is similar [4, 15], the long flash memory latency hinders sustained performance as data-intensive applications can only endure


# **Putting SSDs with CXL Memory Expander**

Which type of device to use? Type-1, Type-2, or Type-3 when using SSD as memory expander?

#### Type-3:

• *(in CXL 1.0, 2.0)*: Only one Type-1 or Type-2 device is allowed per CXL root, hence Type-3 devices are more scalable.

• Type-1/2 devices are more complex: they have caches, so all load/store requests require checking the cache states of the PCIe storage's computing complex

#### Hence, Type-3 is the ideal CXL device type for a "memory expander"

Hello bytes, bye blocks: PCIe storage meets compute express link for memory expansion (CXL-SSD). <u>https://doi.org/10.1145/3538643.3539745</u>

Cache in Hand: Expander-Driven CXL Prefetcher for Next Generation CXL-SSD. <u>https://doi.org/10.1145/3599691.3603406</u>

## CXL + Flash SSDs: Can Flash do it?



(a) LocalDRAM.

(b) CXL-SSD.

#### Can we use NAND flash SSDs as memory expander?

- What latencies does one get with the granularity mismatch?
  - Cache line : 64B, flash pages : 8-16 KiB
  - DRAM: 100s of nanoseconds, vs. flash in 10-100 microseconds
- What is the access pattern for common workloads?
- Can we optimize latencies in any manner? Prefetching, buffering, caching?
- How about flash P/E limitations? Can it endure small 64B writes?
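The granularity mismatch can be quantified with simple arithmetic. The sketch below assumes the read-modify-write behavior described in the quoted paper excerpt (a 64B cache-line write causes a full flash page read and a full flash page program); the function names and the example latencies are illustrative picks from the ranges on this slide.

```python
CACHE_LINE_BYTES = 64


def flash_traffic_for_cache_line_write(page_bytes: int,
                                       line_bytes: int = CACHE_LINE_BYTES) -> dict:
    """Flash traffic caused by one cache-line write under read-modify-write."""
    return {
        "read_bytes": page_bytes,        # whole flash page is read
        "programmed_bytes": page_bytes,  # whole page is programmed elsewhere
        "write_amplification": page_bytes / line_bytes,
    }


def latency_gap(flash_ns: float, dram_ns: float) -> float:
    """How many times slower a flash access is than a DRAM access."""
    return flash_ns / dram_ns


for page_kib in (8, 16):
    stats = flash_traffic_for_cache_line_write(page_kib * 1024)
    print(f"{page_kib} KiB page: {stats['write_amplification']:.0f}x write amplification")

# Assumed example points: 10 us and 100 us flash reads vs. a 100 ns DRAM access.
print(latency_gap(10_000, 100), latency_gap(100_000, 100))
```

Even before considering endurance, a naive design programs hundreds of times more bytes than the application wrote, which motivates the buffering and caching questions above.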

### **CXL-Enabled SSDs - Virtual vs. Physical Addresses**

⇒ Shows that the access pattern at the virtual address level does not correspond to the one at the physical address level.

#### Why?

Just basic prefetching is not effective to hide latencies



Shao-Peng Yang and others. Overcoming the Memory Wall with CXL-Enabled SSDs, USENIX ATC 2023, <u>https://www.usenix.org/conference/atc23/presentation/yang-shao-peng</u>

# Impact of Caching

Inter-arrival time of 64B requests has a huge impact

- Queuing delays w/o cache





Lots of repeated accesses for the same page!

Multiple 64B requests go into the same flash page (Keep track of it)

Figure 6: Flash memory read count for physical memory frames. The solid bar represents the total number of reads, while the shaded bar, the number of repeated reads. A repeated read is a read request to an outstanding read request.
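One way to "keep track" of requests that fall into the same flash page is to coalesce them onto a single outstanding flash read, similar in spirit to miss-status handling registers in CPU caches. The sketch below is an assumption-laden illustration, not the design from the paper; `FlashReadCoalescer` and its counters are hypothetical.

```python
class FlashReadCoalescer:
    """Merge 64B requests that target a flash page with a read already in flight."""

    def __init__(self, page_size: int = 16 * 1024):
        self.page_size = page_size
        self.outstanding = {}  # flash page number -> addresses waiting on it
        self.flash_reads = 0   # reads actually issued to flash
        self.repeated = 0      # requests absorbed by an outstanding read

    def request(self, addr: int) -> bool:
        """Register a request; return True if a new flash read had to be issued."""
        page = addr // self.page_size
        waiters = self.outstanding.get(page)
        if waiters is None:
            self.outstanding[page] = [addr]
            self.flash_reads += 1
            return True
        waiters.append(addr)   # piggyback on the read already in flight
        self.repeated += 1
        return False

    def complete(self, page: int) -> list:
        """Flash read finished: return every request it satisfies."""
        return self.outstanding.pop(page)


c = FlashReadCoalescer()
for addr in (0, 64, 128, 16 * 1024):
    c.request(addr)
print(c.flash_reads, c.repeated, c.complete(0))
```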

### **Workload-level Performance**



#### The New(er) Triangle of Storage-Memory Continuum



Instead of discrete steps, it is a continuous spectrum now: Continuum

## io\_uring: What is it and why should you care?



## The Long Debate: How to get Concurrency?

Threads versus Events (Asynchronous)
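The two styles can be contrasted with a small example: the same set of "I/O" tasks run once with one thread per task and once on a single-threaded event loop. The sleeps stand in for blocking I/O, and the function names are made up for this sketch.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_task(i: int, delay: float = 0.05) -> int:
    time.sleep(delay)           # stands in for a blocking I/O call
    return i * 2


async def async_task(i: int, delay: float = 0.05) -> int:
    await asyncio.sleep(delay)  # yields to the event loop instead of blocking
    return i * 2


def run_with_threads(n: int) -> list:
    """Thread style: each task blocks its own thread."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(blocking_task, range(n)))


def run_with_events(n: int) -> list:
    """Event style: one thread multiplexes all tasks."""
    async def main():
        return await asyncio.gather(*(async_task(i) for i in range(n)))
    return asyncio.run(main())


print(run_with_threads(4))
print(run_with_events(4))
```

Both variants overlap the waiting time; they differ in who does the scheduling (the kernel for threads, the application's event loop for events) and in how the program's control flow has to be written.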



# **Linux I/O Options**

Standard POSIX I/O **blocking** read/write calls:

- https://man7.org/linux/man-pages/man2/read.2.html
- https://man7.org/linux/man-pages/man2/write.2.html



<u>https://man7.org/linux/man-pages/man2/fcntl.2.html</u> (O\_NONBLOCK)
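A minimal sketch of the non-blocking style, using a pipe so it is self-contained: a read on an empty non-blocking descriptor fails immediately instead of sleeping, and readiness is then polled with a selector. The helper name is hypothetical; `os.set_blocking(fd, False)` sets the `O_NONBLOCK` flag on the descriptor.

```python
import os
import selectors


def read_nonblocking_demo():
    """Show EAGAIN on an empty pipe, then a readiness-driven read."""
    r, w = os.pipe()
    os.set_blocking(r, False)       # sets O_NONBLOCK on the read end
    sel = selectors.DefaultSelector()
    try:
        try:
            os.read(r, 16)          # nothing written yet
            would_block = False
        except BlockingIOError:     # EAGAIN/EWOULDBLOCK instead of sleeping
            would_block = True

        sel.register(r, selectors.EVENT_READ)
        os.write(w, b"hello")
        ready = sel.select(timeout=1.0)  # wait until the pipe is readable
        data = os.read(r, 16) if ready else b""
        return would_block, data
    finally:
        sel.close()
        os.close(r)
        os.close(w)


print(read_nonblocking_demo())
```

Note that this pattern needs at least two system calls per I/O (readiness check plus the read itself), which is one source of the extra cost of the non-blocking category.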

#### Asynchronous I/O on Linux: libaio and POSIX AIO

- <u>https://github.com/littledan/linux-aio</u>
- Example of how to use libaio: <u>https://github.com/axboe/fio/blob/master/engines/libaio.c</u>



### **AIO Issues**

Signal-based delivery of completions

- Preemption and context switch
- Needs care for signal-safe function execution

```
On Mon, Jan 11, 2016 at 2:07 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> Another blocking operation used by applications that want aio
> functionality is that of opening files that are not resident in memory.
> Using the thread based aio helper, add support for IOCB_CMD_OPENAT.

So I think this is ridiculously ugly.

AIO is a horrible ad-hoc design, with the main excuse being "other,
less gifted people, made that design, and we are implementing it for
compatibility because database people - who seldom have any shred of
taste - actually use it".

But AIO was always really really ugly.
```

Linux AIO works truly "asynchronously" only under very restricted conditions:

- works only with O\_DIRECT (alignment and size restrictions)
- works only when the file's metadata is available (otherwise it blocks until the metadata is fetched)
- can block depending on the device's queue capacity
- needs a memcpy of I/O metadata (~100 bytes)
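The O\_DIRECT restriction means that the buffer address, the file offset, and the transfer length all have to be aligned. The sketch below checks these three conditions; the 512-byte block size is an assumption (the real requirement depends on the device and file system), and both helper names are hypothetical.

```python
import ctypes
import mmap


def direct_io_problems(buf_addr: int, offset: int, length: int,
                       block_size: int = 512) -> list:
    """List which O_DIRECT alignment rules a request would violate."""
    problems = []
    if buf_addr % block_size:
        problems.append("buffer address")
    if offset % block_size:
        problems.append("file offset")
    if length % block_size:
        problems.append("length")
    return problems


def page_aligned_buffer(size: int):
    """Anonymous mmap gives a page-aligned buffer; return it with its address."""
    buf = mmap.mmap(-1, size)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    return buf, addr


buf, addr = page_aligned_buffer(4096)
print(direct_io_problems(addr, 0, 4096))        # aligned request
print(direct_io_problems(addr + 1, 100, 4096))  # misaligned buffer and offset
```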

Good introduction: <u>https://unixism.net/loti/async\_intro.html</u> and <u>https://kernel.dk/io\_uring.pdf</u>

### **Cost of these Interfaces**

### TABLE I: Categories of system-call techniques

| Kind  | Mechanism    | Examples                    | traps (per sys request) | csw (per sys request) | cost [ns] (per sys request) |
|-------|--------------|-----------------------------|-------------------------|-----------------------|-----------------------------|
| Sync  | Blocking     | `read()`, `write()`         | 1                       | 2                     | 955 ± 1069                  |
| Sync  | Non-Blocking | `SOCK_NONBLOCK` & `epoll()` | [1, 3]                  | [2, 6]                | 1656 ± 1318                 |
| Async | Callback     | POSIX AIO [13]              | 1                       | 2, 3                  | 6224 ± 12232                |
| Async | Queue-based  | Linux AIO                   | ]0, 2]                  | ]1, 4]                | 1922 ± 1467                 |

Modern Concurrency Platforms Require Modern System-Call Techniques, Florian Schmaus, Florian Fischer, Timo Hönig, Wolfgang Schröder-Preikschat, 2021. <u>https://opus4.kobv.de/opus4-fau/frontdoor/index/index/docId/17655</u>
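To get a feeling for what these costs mean, the mean values from TABLE I can be turned into a rough upper bound on requests per second for a single core. This is a back-of-the-envelope sketch, not a result from the paper: it assumes that requests are issued back to back, that each one costs exactly the mean, and it ignores the (large) standard deviations. The function name is hypothetical.

```python
# Mean per-request cost in nanoseconds, copied from TABLE I.
MEAN_COST_NS = {
    "blocking read()/write()": 955,
    "non-blocking + epoll()": 1656,
    "POSIX AIO": 6224,
    "Linux AIO": 1922,
}


def max_requests_per_second(cost_ns: float) -> float:
    """Upper bound on requests/s for one core that does nothing else.
    ASSUMPTION: back-to-back requests, each costing exactly the mean."""
    return 1e9 / cost_ns


if __name__ == "__main__":
    for name, cost in MEAN_COST_NS.items():
        rate = max_requests_per_second(cost) / 1e6
        print(f"{name:26s} {rate:.2f} M requests/s")
```

Under these assumptions, even the cheapest mechanism tops out at roughly one million requests per second per core, which is below what a few modern NVMe devices can deliver.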

# Skip the OS Complexity: The SPDK Stack



- A user-space I/O framework for NVMe devices (only)
- Block-level abstraction (no file system, but there are research prototypes)
- Has user-space mapped drivers (<u>https://spdk.io/doc/userspace.html</u>)
- Designed for light-weight I/O, best performance (eschews many core OS features)

## **SPDK can have the Highest Performance**



2 CPU sockets, Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz

22x Kioxia® KCM61VUL3T20 3.2TB (FW: 0105) (10 on CPU NUMA Node 0, 12 on CPU NUMA Node 1)

SPDK NVMe BDEV Performance Report Release 23.05, June 2023, https://ci.spdk.io/download/performance-reports/SPDK\_nvme\_bdev\_perf\_report\_2305.pdf

### **Intricately Linked Issues**

What is the system call interface

What is the kernel threading model

Signal vs queuing

What is the cost of scheduling, context switching

Management of concurrency

Programming languages (error handling)



Zebin Ren and Animesh Trivedi. 2023. Performance Characterization of Modern Storage Stacks: POSIX I/O, libaio, SPDK, and io\_uring. In Proceedings of the 3rd CHEOPS'23 workshop. <u>https://doi.org/10.1145/3578353.3589545</u>

Theoretical Computer Science 410 (2009) 202-220 Contents lists available at ScienceDirect

### **Background Reading on this Topic**

Because the original of the following paper by Lauer and Needham is not widely available, we are reprinting it here. If the paper is referenced in published work, the citation should read: "Lauer, H.C., Needham, R.M., "On the Duality of Operating Systems Structures," in Proc. Second International Symposium on Operating Systems, IRIA, Oct. 1978, reprinted in Operating Systems Review, 13.2 April 1979, pp. 3-19."

#### On the Duality of Operating System Structures

Hugh C. Lauer Xerox Corporation Palo Alto, California

Roger M. Needham Cambridge University Cambridge, England

Abstract

Many operating system designs can be placed into one of two very rough categories, depending upon how they implement and use the notions of process and synchronization. One category, the "Message-oriented System," is characterized by a relatively small, static number of processes with an explicit message system for communicating among them. The other category, the "Procedure-oriented System," is characterized by a large, rapidly changing number of small processes and a process synchronization mechanism based on shared data.

In this paper, it is demonstrated that these two categories are duals of each other and that a system which is constructed according to one model has a direct counterpart in the other. The principal conclusion is that neither model is inherently preferable, and the main consideration for choosing between them is the nature of the machine architecture upon which the system is being built, not the application which the system will ultimately support.

This is an empirical paper, in the sense of empirical studies in the natural sciences. We have observed a number of samples from a class of objects and identified a classification of some of their properties. We have then generalized our classification and constructed abstract models to describe these properties. With the aid of these models, we were able to make some observations about the nature of the objects themselves, observations which are supported by other experimental evidence. Finally, we have drawn some conclusions about the class of objects which better aid our understanding of that class and the decisions which affect the design of members of that class.

The universe in this investigation is the class of operating systems, and the properties in which we are interested are the ways in which the concepts of process, synchronization, and interprocess communication occur within these systems and among their clients. There appear to be two general categories in this respect, which we designate the Message-oriented Systems and the Procedure-oriented Systems. Most systems which we have observed tend to be biased fairly strongly in favour of one or the other, rather than being neutral or indeterminate. Moreover,

\* This work was done while the author was on sabbatical leave at the Xerox Palo Alto Research Center during the summer of 1977.

#### Why Threads Are A Bad Idea (for most purposes)

#### John Ousterhout

Sun Microsystems Laboratories

john.ousterhout@eng.sun.com http://www.sunlabs.com/~ouster

#### Introduction

- Threads:
  - Grew up in OS world (processes).
  - Evolved into user-level tool.
  - Proposed as solution for a variety of problems.
  - Every programmer should be a threads programmer?
- Problem: threads are very hard to program.
- Alternative: events.
- Claims:
  - For most purposes proposed for threads, events are better.
  - Threads should be used only when true CPU concurrency is needed.

Why Threads Are A Bad Idea, September 28, 1995, slide 2

**Theoretical Computer Science** journal homepage: www.elsevier.com/locate/tcs

#### Scala Actors: Unifying thread-based and event-based programming\*

Philipp Haller\*, Martin Odersky

EPFL, Switzerland

ARTICLE INFO

Keywords: Concurrent programming, Actors

There is an impedance mismatch between message-passing concurrency and virtual machines, such as the JVM. VMs usually map their threads to heavyweight OS processes. Without a lightweight process abstraction, users are often forced to write parts of concurrent applications in an event-driven style which obscures control flow, and increases the burden on the programmer. In this paper we show how thread-based and event-based programming can be unified under a single actor abstraction. Using advanced abstraction mechanisms of the Scala programming language, we implement our approach on unmodified JVMs. Our programming model integrates well with the threading model of the underlying VM. © 2008 Elsevier B.V. All rights reserved.

#### 1. Introduction

Concurrency issues have lately received enormous interest because of two converging trends: first, multi-core processors make concurrency an essential ingredient of efficient program execution. Second, distributed computing and web services are inherently concurrent. Message-based concurrency is attractive because it might provide a way to address the two challenges at the same time. It can be seen as a higher-level model for threads with the potential to generalize to distributed computation. Many message passing systems used in practice are instantiations of the actor model [28,2]. A popular implementation of this form of concurrency is the Erlang programming language [4]. Erlang supports massively concurrent systems such as telephone exchanges by using a very lightweight implementation of concurrent processes [3,36].

On mainstream platforms such as the JVM [34], an equally attractive implementation was, as yet, missing. Their standard concurrency constructs, shared-memory threads with locks, suffer from high memory consumption and context-switching overhead. Therefore, the interleaving of independent computations is often modeled in an event-driven style on these platforms. However, programming in an explicitly event-driven style is complicated and error-prone, because it involves an inversion of control [41,13].

In previous work [24], we developed event-based actors which let one program event-driven systems without inversion of control. Event-based actors support the same operations as thread-based actors, except that the receive operation cannot return normally to the thread that invoked it. Instead the entire continuation of such an actor has to be a part of the receive operation. This makes it possible to model a suspended actor by a continuation closure, which is usually much cheaper than suspending a thread.

In this paper we present a unification of thread-based and event-based actors. An actor can suspend with a full thread stack (receive) or it can suspend with just a continuation closure (react). The first form of suspension corresponds to thread-based, the second form to event-based programming. The new system combines the benefits of both models,

A preliminary version of the paper appears in the proceedings of COORDINATION 2007, LNCS 4467, June 2007. \* Corresponding address: EPFL, Station 14, 1015 Lausanne, Switzerland. Tel.: +41 21 693 6483; fax: +41 21 693 6650. E-mail address: philipp.haller@epfl.ch (P. Haller).

0304-3975/\$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.tcs.2008.09.019

#### Why Events Are A Bad Idea (for high-concurrency servers)

Rob von Behren, Jeremy Condit and Eric Brewer

Computer Science Division, University of California at Berkeley

{jrvb, jcondit, brewer}@cs.berkeley.edu

http://capriccio.cs.berkeley.edu/

HotOS IX: The 9th Workshop on Hot Topics in Operating Systems

#### 2 Threads vs. Events

The debate between threads and events is a very old one. Lauer and Needham attempted to end the discussion in 1978 by showing that message-passing systems and process-based systems are duals, both in terms of program structure and performance characteristics [10]. Nonetheless, in recent years many authors have declared the need for event-driven programming for highly concurrent systems [11, 12, 17].

## **Storage APIs: Recap**



#### Libaio:

- + Async I/O
- + Any files/FSes
- + Any device: HDD, NVMe
- − Async only with direct I/O
- − Performance
- − Metadata management

#### SPDK:

- + Performance
- + Close application integration
- + No syscall or interrupts
- − Only NVMe
- − No kernel assistance
- − Scalability and brittleness



### io\_uring: A Structured Approach to Asynchronous I/O



Completion Queue




## **The three new Syscalls**

- 1. **io\_uring\_setup:** Creates the ring structure (queue depth, I/O completion and notification modes)
  - a. <u>Completion</u> polling by the kernel on the device (IORING\_SETUP\_**IOPOLL**)
  - b. Kernel polling for <u>submission</u> (IORING\_SETUP\_**SQPOLL**, zero system calls)
- 2. **io\_uring\_enter:** Enters the kernel and tells it to process I/O requests (any type and extensible, not just storage I/O)
  - a. Networking, ZNS, programmable storage, and more
  - b. Replacement for the ioctl() call: a private interface between a device driver and an application
- 3. **io\_uring\_register:** Registers frequently used file descriptors, buffers, and file ranges to put them on an optimized fast path
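The ring structure created by io\_uring\_setup can be pictured with a toy model of a submission queue (SQ) and a completion queue (CQ). This is a minimal single-threaded sketch, not the kernel's implementation: the class `Ring`, the function `toy_kernel`, and the dictionary fields are hypothetical stand-ins. It only illustrates the indexing scheme: a power-of-two number of entries, free-running head and tail counters, and slot index = counter & mask.

```python
class Ring:
    """Single-producer/single-consumer ring with free-running head and tail
    counters, indexed by `counter & mask` (entries must be a power of two)."""

    def __init__(self, entries: int):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.slots = [None] * entries
        self.mask = entries - 1
        self.head = 0  # advanced by the consumer
        self.tail = 0  # advanced by the producer

    def push(self, item) -> bool:
        if self.tail - self.head == len(self.slots):
            return False  # ring is full
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring is empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


def toy_kernel(sq: Ring, cq: Ring) -> int:
    """Stand-in for the kernel side: drain the SQ, post one CQE per SQE."""
    done = 0
    while (sqe := sq.pop()) is not None:
        cq.push({"user_data": sqe["user_data"], "res": sqe["len"]})
        done += 1
    return done


if __name__ == "__main__":
    sq, cq = Ring(4), Ring(8)
    for i in range(3):
        sq.push({"op": "read", "len": 4096, "user_data": i})
    toy_kernel(sq, cq)  # roughly what io_uring_enter triggers
    while (cqe := cq.pop()) is not None:
        print(cqe)
```

In the model, the application is the producer of the SQ and the consumer of the CQ, while the kernel plays the opposite roles. Because both sides only advance their own counter, the queues can live in memory shared between the application and the kernel.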

## **Three Modes of Operations**



#### Systor'22

#### CHEOPS'23

#### Understanding Modern Storage APIs: A systematic study of libaio, SPDK, and io\_uring

Diego Didona, Jonas Pfefferle, Nikolas Ioannou, Bernard Metzler IBM Research Europe Zurich, Switzerland {ddi,jpf,nio,bmt}@zurich.ibm.com

Animesh Trivedi VU Amsterdam Amsterdam, Netherlands a.trivedi@vu.nl

#### ABSTRACT

Recent high-performance storage devices have exposed software inefficiencies in existing storage stacks, leading to a new breed of I/O stacks. The newest storage API of the Linux kernel is io\_uring. We perform one of the first in-depth studies of io\_uring, and compare its performance and dis/advantages with the established libaio and SPDK APIs. Our key findings reveal that (i) polling design significantly impacts performance; (ii) with enough CPU cores io\_uring can deliver performance close to that of SPDK; and (iii) performance scalability over multiple CPU cores and devices requires careful consideration and necessitates a hybrid approach. Last, we provide design guidelines for developers of storage intensive applications.

#### ACM Reference Format:

Diego Didona, Jonas Pfefferle, Nikolas Ioannou, Bernard Metzler and Animesh Trivedi. 2022. Understanding Modern Storage APIs: A systematic study of libaio, SPDK, and io\_uring. In The 15th ACM International Systems and Storage Conference (SYSTOR '22), June 13-15, 2022, Haifa, Israel. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3534056.3534945

#### 1 INTRODUCTION

Modern non-volatile memory (NVM) storage technologies, like Flash and Optane SSDs, can support down to single digit μsecond latencies, and up to multi GB/s bandwidth with millions of I/O operations per second (IOPS). CPU performance improvements have stalled over the past years due to various manufacturing and technical limitations [8].

As a result, researchers have put considerable effort into identifying new CPU-efficient storage APIs, abstractions, designs, and optimizations [2, 3, 11, 13, 15, 19, 22, 25, 20, 30, 11]. One specific API, io\_uring, has drawn much attention from the community due to its versatile and high performance interface [5, 15, 16, 13, 27, 34]. io\_uring was introduced in 2019 and has been merged in Linux v5.1. It brings together many well established ideas from high-performance I/O, such as asynchronous I/O, shared memory-mapped queues, and polling (Section 2).

With the addition of io\_uring, Linux now has multiple ways of accessing a storage device. In this paper, we look at Linux Asynchronous I/O (libaio) [6, 24], the Storage Performance Development Kit (SPDK) from Intel® [13], and io\_uring [15, 17, 18]. These APIs have different parameters, deployment models, and characteristics, which make understanding their performance and limitations a challenging task. The use of the io\_uring API and its performance has been the focus of recent studies [7, 28, 33, 36]. However, to the best of our knowledge, there is no systematic study of these APIs that provides design guidelines for the developers of I/O intensive applications. There has also been an extensive body of work in studying system call overhead [29], implementing better interrupt management for I/O devices [30], leveraging polling for fast storage devices [38], using I/O speculation for µsecond-scale devices such as NVMe drives [35], and improving the performance of the Linux block layer in general [3, 39, 40]. These works are orthogonal to ours, since they explore designing new storage stacks, while we focus on the performance characteristics of state-of-the-art APIs that are readily available in Linux.

Our main contributions include (i) a systematic comparison of libaio, io\_uring, and SPDK, that evaluates their latency, IOPS, and scalability behaviors; (ii) a first-of-its-kind detailed evaluation of the different io\_uring configurations; and (iii) design guidelines for high-performance applications using modern storage APIs. Our key findings reveal that:

#### Performance Characterization of Modern Storage Stacks: POSIX I/O, libaio, SPDK, and io\_uring

Zebin Ren z.ren@vu.nl Vrije Universiteit Amsterdam Amsterdam, Netherlands

Animesh Trivedi a.trivedi@vu.nl Vrije Universiteit Amsterdam Amsterdam, Netherlands

#### Abstract

Linux storage stack offers a variety of storage I/O stacks and APIs such as POSIX I/O, asynchronous I/O (libaio), high-performance asynchronous I/O (emerging io\_uring) or SPDK, the last of which completely bypasses the kernel. Despite their availability, there has not been a systematic study of their performance and overheads. In order to aid our understanding, in this work we systematically characterize performance, scalability and microarchitectural properties of popular Linux I/O APIs on high-performance storage hardware (Intel Optane SSDs). Our characterization reveals that: (1) at low I/O loads, all APIs perform competitively with each other, with polling helping the performance by 1.7×, but consuming 2.3× CPU instructions; (2) at high loads and scale, io\_uring is more than an order of magnitude slower than SPDK; (3) at high loads and scale, the benchmarking tool (fio) itself becomes a bottleneck; (4) state-of-practice Linux block I/O schedulers (BFQ, mq-deadline, and Kyber) introduce significant (up to 50%) overheads, and their use of global locks hinders their scalability. All artifacts from this work are available at https://github.com/atlargeresearch/Performance-Characterization-Storage-Stacks.

CCS Concepts: • Software and its engineering → Secondary storage; Operating systems.

Keywords: Linux storage stack, io\_uring, SPDK, Efficiency, Measurements

#### ACM Reference Format:

Zebin Ren and Animesh Trivedi. 2023. Performance Characterization of Modern Storage Stacks: POSIX I/O, libaio, SPDK, and io\_uring. In 3rd Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems (CHEOPS '23), May 8, 2023, Rome, Italy. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3578353.3589545

#### 1 Introduction

Modern storage devices such as Intel Optane SSDs can deliver millions of IOPS (I/O operations per second) with single-digit microseconds (μsecs) I/O access latencies [7, 17]. Meanwhile, the CPU performance has remained relatively stable as Moore's Law driven performance gains stall [29]. Consequently, the stalled CPU performance with high-performance storage hardware has exposed many previously hidden software overheads in the storage stack implementations, thus leading to a series of efforts to redesign and optimize the storage stack focusing on lock contentions, polling, copy elimination, new interfaces, scheduling, context switches, asynchronous I/O paths, interrupt and system call eliminations [3, 18, 20, 25, 30, 36, 37, 39, 40, 45, 56, 59, 66, 68]. Beyond these optimizations, there have been many efforts to improve the user-kernel and user-storage APIs and abstractions. Linux supports two popular and widely used APIs called (synchronous) POSIX file I/O calls [12, 13] and an asynchronous API called libaio [3]. Both of these APIs interact via system calls (syscalls) with the Linux kernel which can have high overheads [22, 38, 55]. More recently, Linux developers have introduced a new high-performance I/O API called io\_uring [8]. It takes many established ideas from the high-performance networking domain (shared-memory queues, asynchronous I/O, polling, shared I/O contexts) and applies them to storage in a unified manner [61, 62]. These advancements are now merged in the Linux storage stack (since v5.1 kernel version), and have shown to deliver high performance and CPU efficiency [22]. All of these APIs (POSIX, libaio, io\_uring) work within the kernel.

The Linux kernel with its generic code execution, functionalities, and features can also introduce significant overheads [31], thus leading to the design of kernel-bypassing userspace storage stack [23, 34, 00, 74]. The Storage Performance Development Kit (SPDK) is one of the most popular and widely used user-space I/O libraries, which can deliver up to 10 million IOPS using a single CPU core [2]. However, user-space I/O libraries lack many kernel-supported features such as fine-grained isolation, access control, file systems, multi-tenancy, and QoS support [46, 64].

In summary, over the past decade, the in-kernel and userspace I/O stacks have undergone a significant development phase. Despite sharing a common functional goal

Diego Didona, Jonas Pfefferle, Nikolas Ioannou, Bernard Metzler, and Animesh Trivedi. 2022. Understanding modern storage APIs: a systematic study of libaio, SPDK, and io\_uring. In Proceedings of the 15th ACM International Conference on Systems and Storage (SYSTOR '22). https://doi.org/10.1145/3534056.3534945

Zebin Ren and Animesh Trivedi. 2023. Performance Characterization of Modern Storage Stacks: POSIX I/O, libaio, SPDK, and io\_uring. In Proceedings of the 3rd Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems (CHEOPS '23). <u>https://doi.org/10.1145/3578353.3589545</u>

## **Benchmarking Setup**

#### Setup 1 [Systor'22]:

- 2x Intel<sup>®</sup> Xeon<sup>®</sup> E5-2630 (Sandy Bridge), 10 cores/socket ⇒ 20 CPU cores
- 20 Intel® DC P3600 400GB NVMe <u>Flash SSDs</u> ⇒ ~6 Million IOPS

#### Setup 2 [CHEOPS'23]:

- 2x Intel<sup>®</sup> Xeon<sup>®</sup> Silver 4210R (Cascade Lake), 10 cores/socket ⇒ 20 CPU cores
- 7× Intel Corporation 900P NVMe <u>Optane SSD</u>  $\Rightarrow$  4.2 Million IOPS

### **Number of System Calls**



Doing I/O with zero system calls!
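How the syscall count changes between the APIs can be sketched with a simple counting model. This is a simplification under stated assumptions, not the measurement methodology of the papers: every batch is full; libaio needs one io\_submit() plus one io\_getevents() per batch; io\_uring needs a single io\_uring\_enter() that both submits and waits; and with IORING\_SETUP\_SQPOLL the kernel poller thread never goes idle, so submissions and completions go through the shared rings only. The function name and the API labels are hypothetical.

```python
import math


def syscalls(n_ios: int, batch: int, api: str) -> int:
    """Count syscalls for n_ios requests issued in batches of `batch`
    (simplified model, see the assumptions in the text above)."""
    batches = math.ceil(n_ios / batch)
    if api == "libaio":
        return 2 * batches  # io_submit() + io_getevents() per batch
    if api == "io_uring":
        return batches      # one io_uring_enter() submits and waits
    if api == "io_uring+sqpoll":
        return 0            # kernel thread polls the SQ; CQ is read from shared memory
    raise ValueError(f"unknown api: {api}")


if __name__ == "__main__":
    for api in ("libaio", "io_uring", "io_uring+sqpoll"):
        for batch in (1, 16):
            print(f"{api:16s} batch={batch:2d} -> {syscalls(1000, batch, api)} syscalls")
```

In practice the SQPOLL thread goes to sleep after an idle period, and the application then needs one io\_uring\_enter() call to wake it up, so "zero system calls" holds only while I/O keeps flowing.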

## **Results: Efficiency (<u>single</u> CPU core)**



#### **Analysis**

Figure (Systor'22 and CHEOPS'23 panels): CPU utilization (% CPU, split into user and kernel) and instructions per I/O (K) over queue depths 1 to 128, for libaio, iou, iou+p, iou+k, and SPDK.

[Interesting] 8 milliseconds constant latency for all queue depths!

SPDK is still 5x more efficient

Poor scheduling, and CPU sharing - Careful!

## **Result: Efficiency with <u>TWO</u> CPU cores**



[aio < iou < iou with polling < iou with kernel poll < SPDK] Normal service order can be resumed (**but** at the cost of 2x CPU cores)!



**io\_uring kernel polling:** Performance collapses when the number of kernel poller threads exceeds the number of available CPU cores

#### CPU efficiency is still bad: 10x more CPU cores needed

# io\_uring : Programming Ecosystem

- liburing : <u>https://github.com/axboe/liburing</u>
  - Programming directly against the three syscalls can be tricky, hence a high(er)-level library

#### List of manual pages

| | | | | | | |
|---|---|---|---|---|---|---|
| IO_URING_CHECK_VERSION(3) | io_uring_get_events(3) | io_uring_prep_linkat(3) | io_uring_prep_readv(3) | io_uring_prep_symlink(3) | io_uring_register(2) | io_uring_sq_space_left(3) |
| IO_URING_VERSION_MAJOR(3) | io_uring_get_probe(3) | io_uring_prep_madvise(3) | io_uring_prep_readv2(3) | io_uring_prep_symlinkat(3) | io_uring_register_buf_ring(3) | io_uring_sqe_set_data(3) |
| IO_URING_VERSION_MINOR(3) | io_uring_get_sqe(3) | io_uring_prep_mkdir(3) | io_uring_prep_recv(3) | io_uring_prep_sync_file_range(3) | io_uring_register_buffers(3) | io_uring_sqe_set_data64 |
| io_uring_buf_ring_cq_advance(3) | io_uring_major_version(3) | io_uring_prep_mkdirat(3) | io_uring_prep_recv_multishot(3) | io_uring_prep_tee(3) | io_uring_register_buffers_sparse(3) | io_uring_sqe_set_flags(3) |
| io_uring(7) | io_uring_minor_version(3) | io_uring_prep_msg_ring(3) | io_uring_prep_recvmsg(3) | io_uring_prep_timeout(3) | io_uring_register_buffers_tags(3) | io_uring_sqring_wait(3) |
| io_uring_buf_ring_add(3) | io_uring_opcode_supported(3) | io_uring_prep_msg_ring_cqe_flags(3) | io_uring_prep_recvmsg_multishot(3) | io_uring_prep_timeout_remove(3) | io_uring_register_buffers_update_tag(3) | io_uring_submit(3) |
| io_uring_buf_ring_advance(3) | io_uring_peek_cqe(3) | io_uring_prep_msg_ring_fd(3) | io_uring_prep_remove_buffers(3) | io_uring_prep_timeout_update(3) | | io_uring_submit_and_get |
| io_uring_buf_ring_cq_advance(3) | io_uring_prep_accept(3) | io_uring_prep_msg_ring_fd_alloc(3) | io_uring_prep_rename(3) | io_uring_prep_unlink(3) | io_uring_register_eventfd(3) | io_uring_submit_and_wai |
| io_uring_buf_ring_init(3) | io_uring_prep_accept_direct(3) | io_uring_prep_multishot_accept(3) | io_uring_prep_renameat(3) | io_uring_prep_unlinkat(3) | io_uring_register_eventfd_async(3) | io_uring_submit_and_wai |
| io_uring_buf_ring_mask(3) | io_uring_prep_cancel(3) | io_uring_prep_multishot_accept_direct(3) | io_uring_prep_send(3) | io_uring_prep_write(3) | io_uring_register_file_alloc_range(3) | io_uring_unregister_buf_r |
| io_uring_check_version(3) | io_uring_prep_cancel64(3) | | io_uring_prep_send_set_addr(3) | io_uring_prep_write_fixed(3) | io_uring_register_files(3) | io_uring_unregister_buffe |
| io_uring_close_ring_fd(3) | io_uring_prep_close(3) | io_uring_prep_nop(3) | io_uring_prep_send_zc(3) | io_uring_prep_writev(3) | io_uring_register_files_sparse(3) | io_uring_unregister_event |
| io_uring_cq_advance(3) | io_uring_prep_close_direct(3) | io_uring_prep_openat(3) | io_uring_prep_send_zc_fixed(3) | io_uring_prep_writev2(3) | io_uring_register_files_tags(3) | io_uring_unregister_files(: |
| io_uring_cq_has_overflow(3) | io_uring_prep_connect(3) | io_uring_prep_openat2(3) | io_uring_prep_sendmsg(3) | io_uring_queue_exit(3) | io_uring_register_files_update(3) | io_uring_unregister_iowq |
| io_uring_cq_ready(3) | io_uring_prep_fadvise(3) | io_uring_prep_openat2_direct(3) | io_uring_prep_sendmsg_zc(3) | io_uring_queue_init(3) | io_uring_register_files_update_tag(3) | io_uring_unregister_ring_ |
| [en] io_uring_cqe_get_data(3)                                             | <ul> <li>[en] io_uring_prep_fallocate(3)</li> </ul>                                        | [en] io_uring_prep_openat_direct(3)                                                      | [en] io_uring_prep_sendto(3)                                                          | <ul> <li>[en] io_uring_queue_init_params(3)</li> </ul>                                                 | [en] io_uring_register_iowq_aff(3)                                                     | <ul> <li>[en] io_uring_wait_cqe(3)</li> </ul>     |
| [en] io_uring_cqe_get_data64(3)                                           | [en] io_uring_prep_fgetxattr(3)                                                            | [en] io_uring_prep_poll_add(3)                                                           | [en] io_uring_prep_setxattr(3)                                                        | [en] io_uring_recvmsg_cmsg_firsthdr(3)                                                                 | <ul> <li>[en]</li> </ul>                                                               | [en] io_uring_wait_cqe_nr(3)                      |
| [en] io_uring_cqe_seen(3)                                                 | <ul> <li>[en] io_uring_prep_files_update(3)</li> </ul>                                     | [en] io_uring_prep_poll_multishot(3)                                                     | [en] io_uring_prep_shutdown(3)                                                        | <ul> <li>[en] io_uring_recvmsg_cmsg_nexthdr(3)</li> </ul>                                              | io_uring_register_iowq_max_workers(3)                                                  | [en] io uring wait cge timeou                     |
| [en] io_uring_enter(2)                                                    | [en] io_uring_prep_fsetxattr(3)                                                            | [en] io_uring_prep_poll_remove(3)                                                        | [en] io_uring_prep_socket(3)                                                          | [en] io_uring_recvmsg_name(3)                                                                          | [en] io_uring_register_ring_fd(3)                                                      | [en] io_uring_wait_cqes(3)                        |
| <ul> <li>[en] io uring enter2(2)</li> </ul>                               | [en] io uring prep fsync(3)                                                                | [en] io uring prep poll update(3)                                                        | [en] io uring prep socket direct(3)                                                   | [en] io uring recymsg out(3)                                                                           | [en] io uring register sync cancel(3)                                                  |                                                   |
| [en] io uring for each cqe(3)                                             | [en] io uring prep_getxattr(3)                                                             | <ul> <li>[en] io uring prep provide buffers(3)</li> </ul>                                | <ul> <li>[en] io uring prep socket direct alloc(3)</li> </ul>                         | [en] io_uring_recvmsg_payload(3)                                                                       | [en] io uring setup(2)                                                                 |                                                   |
| [en] io_uring_free_buf_ring(3)                                            | [en] io_uring_prep_link(3)                                                                 | [en] io_uring_prep_read(3)                                                               | [en] io_uring_prep_splice(3)                                                          | [en] io_uring_recvmsg_payload_length(3)                                                                | [en] io_uring_setup_buf_ring(3)                                                        |                                                   |
| [en] io uring free probe(3)                                               | [en] io uring prep link timeout(3)                                                         | [en] io uring prep read fixed(3)                                                         | [en] io uring prep statx(3)                                                           | [en] io uring recymsg validate(3)                                                                      | [en] io uring sq ready(3)                                                              |                                                   |

- Active research on leveraging io\_uring in databases, key-value stores, etc.
- Applicability beyond storage: a candidate for the "core" kernel-application interfacing API
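To make the submission/completion idea behind this API family concrete, here is a toy Python model of a submission queue and a completion queue. It is only a sketch under stated assumptions: `ToyRing`, `Sqe`, `Cqe`, and `demo` are hypothetical names invented for illustration, not liburing or kernel APIs, and the "kernel" is simulated in user space. The real interface uses rings shared between the application and the kernel and calls such as io_uring_enter(2).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sqe:
    """Toy submission queue entry (hypothetical, not the kernel struct)."""
    opcode: str
    payload: bytes
    user_data: int  # opaque tag, copied into the matching completion


@dataclass
class Cqe:
    """Toy completion queue entry (hypothetical, not the kernel struct)."""
    user_data: int
    res: int  # result of the operation, here the number of bytes "written"


class ToyRing:
    """User-space simulation of a submission queue and a completion queue."""

    def __init__(self, entries: int):
        self.entries = entries
        self.sq = deque()
        self.cq = deque()

    def prep(self, opcode: str, payload: bytes, user_data: int) -> None:
        # Queue a request without "entering the kernel".
        if len(self.sq) >= self.entries:
            raise BufferError("submission queue full")
        self.sq.append(Sqe(opcode, payload, user_data))

    def submit(self) -> int:
        # Hand every queued request over in a single call (batching).
        batch = list(self.sq)
        self.sq.clear()
        # Complete in reverse order to illustrate that completions need not
        # arrive in submission order; user_data is what pairs them up.
        for sqe in reversed(batch):
            self.cq.append(Cqe(sqe.user_data, len(sqe.payload)))
        return len(batch)

    def reap(self):
        # Drain the completions that are currently available.
        while self.cq:
            yield self.cq.popleft()


def demo():
    ring = ToyRing(entries=4)
    ring.prep("write", b"abc", user_data=1)
    ring.prep("write", b"hello", user_data=2)
    submitted = ring.submit()
    completions = {cqe.user_data: cqe.res for cqe in ring.reap()}
    return submitted, completions
```

Calling `demo()` submits two requests in one batch and matches each completion back to its request through `user_data`, which is the pattern an application follows regardless of whether the operation is a storage or a network request.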


## What you should know from this lecture

What is CXL, and what key problems does it solve?

What are the different CXL protocols, device types, and generational features?

What does flash + CXL allow us to do?

What are asynchronous and non-blocking I/O, and which APIs support them?

What is io\_uring? What are the different operation completion modes it supports?

What are the performance implications of these modes?

The New(er) Triangle of Storage-Memory Continuum
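As a reminder for the non-blocking I/O question above, the following Python sketch shows readiness-based non-blocking I/O on a pipe: a read on an empty non-blocking descriptor fails immediately instead of sleeping, and a selector reports when data can be read. The function name `nonblocking_pipe_demo` is hypothetical, and the example assumes a Unix-like system. It does not use io\_uring or libaio; those are completion-based, whereas this is the readiness-based style.

```python
import os
import selectors


def nonblocking_pipe_demo():
    """Return the sequence of outcomes observed on a non-blocking pipe."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)  # reads now fail fast instead of blocking
    outcomes = []
    selector = selectors.DefaultSelector()
    try:
        # 1. Nothing has been written yet: the read would block, so the
        #    kernel returns EAGAIN, surfaced as BlockingIOError.
        try:
            os.read(read_fd, 16)
        except BlockingIOError:
            outcomes.append("would-block")

        # 2. Make data available, then wait until the descriptor is ready.
        os.write(write_fd, b"hello")
        selector.register(read_fd, selectors.EVENT_READ)
        ready = selector.select(timeout=1.0)

        # 3. The descriptor is reported readable, so this read succeeds.
        if ready:
            outcomes.append(os.read(read_fd, 16))
    finally:
        selector.close()
        os.close(read_fd)
        os.close(write_fd)
    return outcomes
```

The application still issues the `read` itself once the descriptor is ready. With an asynchronous, completion-based API the request is submitted up front and the application is later notified that the data has already been transferred.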

## **To Conclude**

#### Storage research is fundamentally changing and reshaping the kinds of systems we can build tomorrow

- Performance
- Abstractions
- Efficiency
- Programmability
- Cost
- Scalability

This course came out of this report ;)

#### **Data Storage Research Vision 2025**

Report on NSF Visioning Workshop held May 30-June 1, 2018

George Amvrosiadis<sup>†</sup>, Ali R. Butt<sup>¶</sup>, Vasily Tarasov<sup>‡</sup>, Erez Zadok<sup>\*</sup>, Ming Zhao<sup>§</sup>

Irfan Ahmad, Remzi H. Arpaci-Dusseau, Feng Chen, Yiran Chen, Yong Chen, Yue Cheng, Vijay Chidambaram, Dilma Da Silva, Angela Demke-Brown, Peter Desnoyers, Jason Flinn, Xubin He, Song Jiang, Geoff Kuenning, Min Li, Carlos Maltzahn, Ethan L. Miller, Kathryn Mohror, Raju Rangaswami, Narasimha Reddy, David Rosenthal, Ali Saman Tosun, Nisha Talagala, Peter Varman, Sudharshan Vazhkudai, Avani Wildani, Xiaodong Zhang, Yiying Zhang, and Mai Zheng.

<sup>†</sup>Carnegie Mellon University, <sup>¶</sup>Virginia Tech, <sup>‡</sup>IBM Research, <sup>\*</sup>Stony Brook University, <sup>§</sup>Arizona State University

February 2019

#### **Executive Summary**

With the emergence of new computing paradigms (e.g., cloud and edge computing, big data, Internet of Things (IoT), deep learning, etc.) and new storage hardware (e.g., non-volatile memory (NVM), shingled-magnetic recording (SMR) disks, and kinetic drives, etc.), a number of open challenges and research issues need to be addressed to ensure sustained storage systems efficacy and performance. The wide variety of applications demand that the fundamental design of storage systems should be revisited to support application-specific and application-defined semantics. Existing standards and abstractions need to be reevaluated; new sustainable data representations need to be designed to support emerging applications. To take advantage of hardware advancements, new storage software designs are also necessary in order to maximize overall system efficiency and performance.

Therefore, there is an urgent need for a consolidated effort to identify and establish a vision for storage systems research and comprehensive techniques that provide practical solutions to the storage issues facing the information technology community. To address this need, the National Science Foundation's (NSF) "Visioning Workshop on Data Storage Research 2025" brought together a number of storage researchers from academia, industry, national laboratories, and federal agencies to develop a collective vision for future storage research, as well as to prioritize

#### The New(er) Triangle of Storage-Memory Continuum



Instead of discrete steps, the storage-memory hierarchy is now a continuous spectrum: a continuum

# Further Reading - CXL (1 of 2)

- CXL Consortium, <u>https://www.computeexpresslink.org/</u>
- CXL resources, <u>https://www.computeexpresslink.org/resource-library</u>
- Linux CXL driver code: <u>https://elixir.bootlin.com/linux/latest/source/drivers/cxl</u>
- Debendra Das Sharma, and others, An Introduction to the Compute Express Link (CXL) Interconnect, **2023**, <u>https://arxiv.org/abs/2306.11227</u>
- Hasan Al Maruf, and others. TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory. In Proceedings of the 28th ACM ASPLOS **2023**. <u>https://doi.org/10.1145/3582016.3582063</u>
- Myoungsoo Jung. **2022**. Hello bytes, bye blocks: PCIe storage meets compute express link for memory expansion (CXL-SSD). In Proceedings of the 14th ACM HotStorage '22, <u>https://doi.org/10.1145/3538643.3539745</u>
- Miryeong Kwon, Sangwon Lee, and Myoungsoo Jung. 2023. Cache in Hand: Expander-Driven CXL Prefetcher for Next Generation CXL-SSD. In Proceedings of the 15th ACM HotStorage '23, <u>https://doi.org/10.1145/3599691.3603406</u>
- Huaicheng Li, and others. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. In Proceedings of the 28th ACM ASPLOS 2023, <u>https://doi.org/10.1145/3575693.3578835</u>
- Shao-Peng Yang and others. Overcoming the Memory Wall with CXL-Enabled SSDs, USENIX ATC **2023**, <u>https://www.usenix.org/conference/atc23/presentation/yang-shao-peng</u>
- Donghyun Gouk and others, Direct Access, High-Performance Memory Disaggregation with DirectCXL, USENIX ATC **2022**, <u>https://www.usenix.org/conference/atc22/presentation/gouk</u>

# Further Reading - CXL (2 of 2)

- CXL-ANNS: Software-Hardware Collaborative Memory Disaggregation and Computation for Billion-Scale Approximate Nearest Neighbor Search, USENIX ATC 2023, <u>https://www.usenix.org/conference/atc23/presentation/jang</u>
- Marcos K. Aguilera, and others. 2023. Memory disaggregation: why now and what are the challenges. SIGOPS Oper. Syst. Rev. 57, 1 (June **2023**), 38–46. <u>https://doi.org/10.1145/3606557.3606563</u>
- Hasan Al Maruf and Mosharaf Chowdhury. 2023. Memory Disaggregation: Advances and Open Challenges. SIGOPS Oper. Syst. Rev. 57, 1 (June **2023**), 29–37. <u>https://doi.org/10.1145/3606557.3606562</u>
- Jianguo Wang and Qizhen Zhang. **2023**. Disaggregated Database Systems. In Companion of the **2023** International Conference on Management of Data (SIGMOD '23). <u>https://doi.org/10.1145/3555041.3589403</u>
- Wenjing Jin, and others. DRAM Translation Layer: Software-Transparent DRAM Power Savings for Disaggregated Memory. In Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA '23). <u>https://doi.org/10.1145/3579371.3589051</u>
- What's the Difference Between CXL 1.1 and CXL 2.0?
   <u>https://www.electronicdesign.com/technologies/embedded/article/21249351/cxl-consortium-whats-the-difference-between-cxl-11-and-cxl-20</u>
- QEMU CXL setup, <u>https://www.qemu.org/docs/master/system/devices/cxl.html</u>
- How To Map a CXL Endpoint to a CPU Socket in Linux, <u>https://stevescargall.com/blog/2022/12/27/how-to-map-a-cxl-endpoint-to-a-cpu-socket-in-linux/</u>

# Further Reading - io\_uring (1 of 2)

- Efficient IO with io\_uring, <u>https://kernel.dk/io\_uring.pdf</u>
- What's new with io\_uring, <u>https://kernel.dk/axboe-kr2022.pdf</u>
- An Introduction to the io\_uring Asynchronous I/O Framework, <u>https://blogs.oracle.com/linux/post/an-introduction-to-the-io-uring-asynchronous-io-framework</u>
- Zebin Ren and Animesh Trivedi. 2023. Performance Characterization of Modern Storage Stacks: POSIX I/O, libaio, SPDK, and io\_uring. In Proceedings of the 3rd Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems (CHEOPS '23). Association for Computing Machinery, New York, NY, USA, 35–45. https://doi.org/10.1145/3578353.3589545
- Diego Didona, Jonas Pfefferle, Nikolas Ioannou, Bernard Metzler, and Animesh Trivedi. 2022. Understanding modern storage APIs: a systematic study of libaio, SPDK, and io\_uring. In Proceedings of the 15th ACM International Conference on Systems and Storage (SYSTOR '22). Association for Computing Machinery, New York, NY, USA, 120–127. <u>https://doi.org/10.1145/3534056.3534945</u>
- Simon A. F. Lund, Philippe Bonnet, Klaus B. A. Jensen, and Javier Gonzalez. 2022. I/O interface independence with xNVMe. In Proceedings of the 15th ACM International Conference on Systems and Storage (SYSTOR '22). Association for Computing Machinery, New York, NY, USA, 108–119. <u>https://doi.org/10.1145/3534056.3534936</u>
- Sidharth Sundar, William Simpson, Jacob Higdon, Caeden Whitaker, Bryan Harris, and Nihat Altiparmak. 2023. Energy Implications of IO Interface Design Choices. In Proceedings of the 15th ACM Workshop on Hot Topics in Storage and File Systems (HotStorage '23). Association for Computing Machinery, New York, NY, USA, 58–64. <u>https://doi.org/10.1145/3599691.3603411</u>

# Further Reading - io\_uring (2 of 2)

- Ringing in a new asynchronous I/O API, <u>https://lwn.net/Articles/776703/</u>
- [PATCHSET v5] io\_uring IO interface, <u>https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/</u>
- Gabriel Haas and Viktor Leis. 2023. What Modern NVMe Storage Can Do, and How to Exploit it: High-Performance I/O for High-Performance Storage Engines. Proc. VLDB Endow. 16, 9 (May 2023), 2090–2102. https://doi.org/10.14778/3598581.3598584
- Hugh C. Lauer and Roger M. Needham. 1979. On the duality of operating system structures. SIGOPS Oper. Syst. Rev. 13, 2 (April 1979), 3–19. <u>https://doi.org/10.1145/850657.850658</u>
- John Ousterhout, Why Threads Are A Bad Idea (for most purposes), <u>https://web.stanford.edu/~ouster/cgi-bin/papers/threads.pdf</u>
- Rob von Behren, Jeremy Condit, and Eric Brewer. 2003. Why events are a bad idea (for high-concurrency servers). In Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9 (HOTOS'03). USENIX Association, USA, 4. <u>https://dl.acm.org/doi/10.5555/1251054.1251058</u>
- Philipp Haller, Martin Odersky, Scala Actors: Unifying thread-based and event-based programming, 2008, <u>https://doi.org/10.1016/j.tcs.2008.09.019</u>.
- A 5 part series on the asynchronous nature of I/O, OS, and concurrency: <u>https://blog.acolyer.org/2014/12/08/on-the-duality-of-operating-system-structures/</u>
- µTune: Auto-Tuned Threading for OLDI Microservices, <u>https://www.usenix.org/conference/osdi18/presentation/sriraman</u>
- Linux Asynchronous I/O, <u>https://oxnz.github.io/2016/10/13/linux-aio/</u>
- Linux-aio, <u>https://github.com/littledan/linux-aio</u>