Tuesday, May 5, 2026

Maxing Random Read IOPS on a Datacenter SSD with a Single Desktop E-Core


Sitting under my desk is a three year old system based on an Intel i9-13900KS CPU. This is a consumer CPU with 8 "performance cores" and 16 "efficient cores" (p-cores and e-cores). It's been a fine personal test system, but with the mix of core types, I've always pinned my benchmarking workloads to the p-cores, leaving the e-cores to whatever else the scheduler had in store for them. I knew that an e-core could not drive as many IOPS as a p-core, but I always wondered about their performance limits for IO work. There is no way around their lower frequency, but it’s not clear whether their other architectural tradeoffs strongly affect the task of moving data to and from an SSD. This blog documents my exploration process as I tried to saturate the random read performance of a Gen5 datacenter-class NVMe drive using a single e-core.

For this testing, I used a ScaleFlux CSD5310. This model advertises 2.8M 4kB random read IOPS. I would have liked to use a CSD5320, which can reach 3.2M IOPS, but I didn't have one handy [1]. I’m suffering from the SSD shortage too... The drive was originally connected via one of our engineering adapters to the second Gen5 slot on my ASUS motherboard.


Over the course of testing, I noticed an occasional correctable PCIe error in the kernel log (I use Arch, BTW). These were not frequent or severe enough to be concerned about, but I decided to move the drive to the first PCIe slot (closer to the PCIe root complex). This was a bit of a tight squeeze, but everything worked out with the help of some post-it notes to get the fan in place.
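For anyone checking their own setup, these events show up as correctable AER messages in the kernel log; a quick scan (assuming a systemd journal) looks like:

$ sudo journalctl -k | grep -iE 'aer|corrected'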


With clean kernel logs, it was time to move on. Two system configuration changes were in place from the start. First, I switched to the "performance" CPU governor. Second, I added nvme.poll_queues=24 to the kernel boot parameters (since I knew I would want to experiment with polled IO); this ensures that each core has a CPU-local polling queue available. With that in place, the first order of business was simply to verify the maximum 4kB random read performance of the SSD with a very typical set of FIO parameters:

$ sudo fio --name=baseline --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring --bs=4k --numjobs=8 --iodepth=64 --rw=randread --group_reporting --time_based --runtime=20 --cpus_allowed=0-7
...
read: IOPS=2837k, BW=10.8GiB/s (11.6GB/s)(216GiB/20001msec)
...
clat percentiles (usec):
| 1.00th=[ 71], 5.00th=[ 79], 10.00th=[ 84], 20.00th=[ 94],
| 30.00th=[ 106], 40.00th=[ 120], 50.00th=[ 139], 60.00th=[ 163],
| 70.00th=[ 194], 80.00th=[ 243], 90.00th=[ 330], 95.00th=[ 420],
| 99.00th=[ 644], 99.50th=[ 750], 99.90th=[ 988], 99.95th=[ 1090],
| 99.99th=[ 1336]
...
cpu : usr=11.88%, sys=16.51%, ctx=14359063, majf=0, minf=580


Here we reached 2.8M IOPS, exactly as expected. With 8 jobs, the workload is not CPU bound and the e-cores could replicate this result (at about double the CPU utilization).

$ sudo fio --name=baseline --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring --bs=4k --numjobs=8 --iodepth=64 --rw=randread --group_reporting --time_based --runtime=20 --cpus_allowed=8-15
...
read: IOPS=2844k, BW=10.8GiB/s (11.6GB/s)(217GiB/20001msec)
...
clat percentiles (usec):
| 1.00th=[ 70], 5.00th=[ 76], 10.00th=[ 82], 20.00th=[ 91],
| 30.00th=[ 101], 40.00th=[ 118], 50.00th=[ 137], 60.00th=[ 161],
| 70.00th=[ 194], 80.00th=[ 243], 90.00th=[ 330], 95.00th=[ 424],
| 99.00th=[ 652], 99.50th=[ 750], 99.90th=[ 996], 99.95th=[ 1106],
| 99.99th=[ 1352]
...
cpu : usr=22.10%, sys=35.69%, ctx=9247836, majf=0, minf=574


Running the same workload on a single core exposed the difference in IOPS between p-cores and e-cores. I picked CPU 3 (a p-core) and CPU 17 (an e-core). The e-core clocked 2.4GHz lower than the p-core (e-cores can apparently reach 4GHz in the best case, but I wasn't so lucky on this system).

$ cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq
5603265
$ cat /sys/devices/system/cpu/cpu17/cpufreq/scaling_cur_freq
3201000
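If you're not sure which CPU indices are p-cores and which are e-cores, recent kernels expose the hybrid topology in sysfs; on this box the first command should report 0-7 and the second 8-23 (paths may differ on other platforms):

$ cat /sys/devices/cpu_core/cpus
$ cat /sys/devices/cpu_atom/cpus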


With a single p-core, we could extract 888k IOPS.

$ sudo taskset -c 3 fio --name=baseline --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring --bs=4k --iodepth=64 --rw=randread --group_reporting --time_based --runtime=20
...
read: IOPS=888k, BW=3468MiB/s (3636MB/s)(67.7GiB/20001msec)
...
clat percentiles (usec):
| 1.00th=[ 56], 5.00th=[ 60], 10.00th=[ 61], 20.00th=[ 64],
| 30.00th=[ 66], 40.00th=[ 68], 50.00th=[ 70], 60.00th=[ 72],
| 70.00th=[ 74], 80.00th=[ 77], 90.00th=[ 82], 95.00th=[ 94],
| 99.00th=[ 118], 99.50th=[ 126], 99.90th=[ 151], 99.95th=[ 161],
| 99.99th=[ 186]


The e-core achieved 471k IOPS with the same FIO parameters, but the latency was significantly right-shifted, indicating that we were heavily CPU bound.

$ sudo taskset -c 17 fio --name=baseline --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring --bs=4k --iodepth=64 --rw=randread --group_reporting --time_based --runtime=20
...
read: IOPS=471k, BW=1839MiB/s (1928MB/s)(35.9GiB/20001msec)
...
clat percentiles (usec):
| 1.00th=[ 119], 5.00th=[ 124], 10.00th=[ 126], 20.00th=[ 128],
| 30.00th=[ 130], 40.00th=[ 133], 50.00th=[ 133], 60.00th=[ 135],
| 70.00th=[ 137], 80.00th=[ 139], 90.00th=[ 143], 95.00th=[ 147],
| 99.00th=[ 169], 99.50th=[ 176], 99.90th=[ 190], 99.95th=[ 200],
| 99.99th=[ 229]


Backing the queue depth down to 32 improved the latency profile significantly without reducing IOPS.

$ sudo taskset -c 17 fio --name=baseline --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring --bs=4k --iodepth=32 --rw=randread --group_reporting --time_based --runtime=20
...
read: IOPS=472k, BW=1845MiB/s (1934MB/s)(36.0GiB/20001msec)
...
clat percentiles (usec):
| 1.00th=[ 52], 5.00th=[ 56], 10.00th=[ 58], 20.00th=[ 60],
| 30.00th=[ 62], 40.00th=[ 63], 50.00th=[ 65], 60.00th=[ 67],
| 70.00th=[ 69], 80.00th=[ 72], 90.00th=[ 75], 95.00th=[ 79],
| 99.00th=[ 103], 99.50th=[ 111], 99.90th=[ 127], 99.95th=[ 137],
| 99.99th=[ 163]


Setting the queue depth too low leaves IOPS on the table; setting it too high saturates the CPU. Once the CPU is saturated, increasing the queue depth only adds host-side latency without increasing IOPS.
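To see where that tipping point sits on the e-core, I swept the queue depth. A sketch of one way to script such a sweep (the job name and output paths are illustrative; the JSON output can then be parsed for IOPS and latency percentiles):

$ for qd in 1 2 4 8 16 32 64 128 256; do sudo taskset -c 17 fio --name=qd_sweep --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring --bs=4k --iodepth=$qd --rw=randread --time_based --runtime=20 --output-format=json --output=qd_${qd}.json; done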

The following plot shows the latency and IOPS response to increasing queue depth on the e-core. Beyond a queue depth of 32, IOPS no longer improve and latency begins to degrade.


The next plot shows that the IOPS plateau is caused by CPU saturation.
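Saturation of the pinned core can also be confirmed outside of FIO by sampling it with mpstat (from the sysstat package) while a run is in progress:

$ mpstat -P 17 1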


If we make IO more CPU-efficient, the saturation point moves to a higher queue depth, so we'll need to raise the depth to keep operating at the edge of saturation: driving the most IOPS without adding excessive host-side latency.

The first major change I made was switching to polled IO (--hipri), which is more efficient than interrupt-driven completions for high-IOPS workloads.
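Polled completions rely on the per-CPU poll queues set up by the nvme.poll_queues boot parameter mentioned earlier. Two quick checks (the first should report 24 here; the second should be non-zero when polling is usable on the device):

$ cat /sys/module/nvme/parameters/poll_queues
$ cat /sys/block/nvme0n1/queue/io_poll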

$ sudo taskset -c 17 fio --name=baseline --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring --bs=4k --iodepth=32 --rw=randread --group_reporting --time_based --runtime=20 --hipri
...
read: IOPS=546k, BW=2134MiB/s (2238MB/s)(41.7GiB/20001msec)
...
clat percentiles (usec):
| 1.00th=[ 47], 5.00th=[ 49], 10.00th=[ 50], 20.00th=[ 51],
| 30.00th=[ 53], 40.00th=[ 54], 50.00th=[ 56], 60.00th=[ 58],
| 70.00th=[ 60], 80.00th=[ 61], 90.00th=[ 64], 95.00th=[ 68],
| 99.00th=[ 95], 99.50th=[ 101], 99.90th=[ 118], 99.95th=[ 130],
| 99.99th=[ 153]


Staying at a queue depth of 32, we picked up an additional 74k IOPS above the baseline and slightly improved the latency profile.

In polling mode, the latency impact of increasing the queue depth shows up well before IOPS are completely saturated. This means leaving a significant amount of IOPS on the table to preserve the best possible latency distribution.


Next I switched to the relatively new NVMe pass-through interface (the io_uring_cmd engine driving the /dev/ng0n1 character device) to reduce the work done in the Linux block layer [2].

$ sudo taskset -c 17 fio --name=baseline --filename=/dev/ng0n1 --ioengine=io_uring_cmd --cmd_type=nvme --bs=4k --iodepth=48 --rw=randread --group_reporting --time_based --runtime=20 --hipri
...
read: IOPS=766k, BW=2990MiB/s (3136MB/s)(58.4GiB/20001msec)
...
clat percentiles (usec):
| 1.00th=[ 51], 5.00th=[ 53], 10.00th=[ 55], 20.00th=[ 56],
| 30.00th=[ 58], 40.00th=[ 59], 50.00th=[ 60], 60.00th=[ 62],
| 70.00th=[ 63], 80.00th=[ 65], 90.00th=[ 69], 95.00th=[ 79],
| 99.00th=[ 102], 99.50th=[ 110], 99.90th=[ 133], 99.95th=[ 143],
| 99.99th=[ 165]


This change improved IOPS by a further 220k. The queue depth was bumped up from 32 to 48, which was at the knee of the latency curve.

At this point, userspace (FIO) was taking a significant fraction of the CPU time. There are four FIO options we can use to cut down the per-IO overhead (see the HOWTO in FIO for details [3]; they are collected into a job-file sketch after this list):
  • fixedbufs=1
  • norandommap=1
  • gtod_reduce=1
  • registerfiles=1
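Collected into a job file, the configuration so far looks something like this (a sketch; the file name is hypothetical and the options mirror the command lines above and below):

$ cat e-core-opt.fio
[global]
filename=/dev/ng0n1
ioengine=io_uring_cmd
cmd_type=nvme
bs=4k
rw=randread
time_based
runtime=20
hipri
fixedbufs
registerfiles
norandommap
gtod_reduce=1

[randread]
iodepth=64

$ sudo taskset -c 17 fio e-core-opt.fio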

After applying these FIO options and moving the queue depth up to 64, we gained another 262k IOPS, reaching just over 1M IOPS in total.

$ sudo taskset -c 17 fio --name=baseline --filename=/dev/ng0n1 --ioengine=io_uring_cmd --cmd_type=nvme --bs=4k --iodepth=64 --rw=randread --group_reporting --time_based --runtime=20 --hipri --norandommap=1 --fixedbufs=1 --registerfiles=1 --gtod_reduce=1
...
read: IOPS=1028k, BW=4017MiB/s (4212MB/s)(78.5GiB/20001msec)


Unfortunately, by using gtod_reduce we sacrifice latency measurements.

With the low-hanging fruit picked, I turned to perf to look for further optimizations. A perf profile showed that nearly 80% of CPU time was spent in td_io_commit() and its syscalls into io_uring.
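A profile like this can be captured by pinning perf to the same CPU while the job runs in another terminal (a sketch):

$ sudo perf record -C 17 -g -- sleep 10
$ sudo perf report --stdio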


By default, FIO queues, commits, and reaps each IO one at a time, resulting in very high syscall overhead. FIO provides options to batch both submissions and completions to alleviate this. With some experimentation, I arrived at batch sizes of 8 for completions and 32 for submissions. The queue depth was increased again to take advantage of the reduced CPU overhead. We gained an incredible 676k IOPS.

$ sudo taskset -c 17 fio --name=baseline --filename=/dev/ng0n1 --ioengine=io_uring_cmd --cmd_type=nvme --bs=4k --iodepth=128 --rw=randread --group_reporting --time_based --runtime=20 --hipri --norandommap=1 --fixedbufs=1 --registerfiles=1 --gtod_reduce=1 --iodepth_batch_complete=8 --iodepth_batch_submit=32
...
read: IOPS=1704k, BW=6658MiB/s (6981MB/s)(130GiB/20001msec)


With batching, the concern is higher latency. Turning gtod_reduce back off (and nudging the queue depth and batch sizes up further, as shown below) revealed an excellent latency profile, at the expense of giving up 37k IOPS.

$ sudo taskset -c 17 fio --name=baseline --filename=/dev/ng0n1 --ioengine=io_uring_cmd --cmd_type=nvme --bs=4k --iodepth=320 --rw=randread --group_reporting --time_based --runtime=20 --hipri --norandommap=1 --fixedbufs=1 --registerfiles=1 --iodepth_batch_complete=32 --iodepth_batch_submit=64
...
read: IOPS=1667k, BW=6512MiB/s (6829MB/s)(127GiB/20001msec)
...
clat percentiles (usec):
| 1.00th=[ 57], 5.00th=[ 59], 10.00th=[ 61], 20.00th=[ 63],
| 30.00th=[ 65], 40.00th=[ 67], 50.00th=[ 69], 60.00th=[ 71],
| 70.00th=[ 74], 80.00th=[ 80], 90.00th=[ 97], 95.00th=[ 111],
| 99.00th=[ 143], 99.50th=[ 157], 99.90th=[ 192], 99.95th=[ 206],
| 99.99th=[ 247]


The tail latency above is driven by the much higher IOPS demand on the SSD (the probability of blocking operations within the SSD has increased).

Now it was time for a side quest. Keeping all of the optimizations (with gtod_reduce left off so latency could still be measured) but moving back to interrupt-driven completions dropped performance to about 1.1M IOPS.

$ sudo taskset -c 17 fio --name=baseline --filename=/dev/ng0n1 --ioengine=io_uring_cmd --cmd_type=nvme --bs=4k --iodepth=96 --rw=randread --group_reporting --time_based --runtime=20 --norandommap=1 --fixedbufs=1 --registerfiles=1 --gtod_reduce=0 --iodepth_batch_complete=8 --iodepth_batch_submit=32
...
read: IOPS=1088k, BW=4249MiB/s (4456MB/s)(83.0GiB/20001msec)
...
clat percentiles (usec):
| 1.00th=[ 60], 5.00th=[ 65], 10.00th=[ 69], 20.00th=[ 72],
| 30.00th=[ 75], 40.00th=[ 77], 50.00th=[ 79], 60.00th=[ 82],
| 70.00th=[ 85], 80.00th=[ 88], 90.00th=[ 96], 95.00th=[ 111],
| 99.00th=[ 135], 99.50th=[ 145], 99.90th=[ 172], 99.95th=[ 182],
| 99.99th=[ 210]
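The interrupt burden of this run lands almost entirely on the pinned core, and it can be observed directly in /proc/interrupts (queue interrupt naming varies a bit between kernel versions):

$ grep nvme0q /proc/interrupts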


We can reduce the interrupt burden on the host by using interrupt coalescing (an NVMe feature). With interrupt coalescing enabled, the drive fires an interrupt only after either a configured number of completion queue entries have accumulated or a maximum aggregation delay has elapsed. Here we set the CSD5310 to fire an interrupt after 32 completions, with a maximum delay of 300 microseconds. This brought performance back up to nearly 1.6M IOPS, at the expense of higher tail latency compared to polled IO.

$ sudo nvme set-feature /dev/nvme0 -f 0x8 -V 0x320
set-feature:0x08 (Interrupt Coalescing), value:0x00000320, cdw12:00000000, save:0
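The value 0x320 packs both knobs into one dword: per the NVMe Interrupt Coalescing feature layout, the low byte is the aggregation threshold in completion queue entries (0x20 = 32) and the next byte is the aggregation time in 100 microsecond units (0x03 = 300µs). The current setting can be read back with:

$ sudo nvme get-feature /dev/nvme0 -f 0x8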


$ sudo taskset -c 17 fio --name=baseline --filename=/dev/ng0n1 --ioengine=io_uring_cmd --cmd_type=nvme --bs=4k --iodepth=128 --rw=randread --group_reporting --time_based --runtime=20 --norandommap=1 --fixedbufs=1 --registerfiles=1 --gtod_reduce=0 --iodepth_batch_complete=8 --iodepth_batch_submit=32
...
read: IOPS=1576k, BW=6155MiB/s (6454MB/s)(120GiB/20001msec)
...
clat percentiles (usec):
| 1.00th=[ 56], 5.00th=[ 59], 10.00th=[ 61], 20.00th=[ 64],
| 30.00th=[ 66], 40.00th=[ 69], 50.00th=[ 71], 60.00th=[ 74],
| 70.00th=[ 77], 80.00th=[ 85], 90.00th=[ 103], 95.00th=[ 118],
| 99.00th=[ 157], 99.50th=[ 176], 99.90th=[ 251], 99.95th=[ 293],
| 99.99th=[ 355]


Up to this point, we had been carefully selecting the queue depth to preserve the best possible latency profile, but the goal was to achieve the maximum random read performance of the SSD. I had to throw caution to the wind and see how close we could get, no matter the latency impact. With polled mode re-enabled, the queue depth dialed up to 380, and the larger batch sizes retained, we topped out just shy of 2.4M IOPS.

$ sudo taskset -c 17 fio --name=baseline --filename=/dev/ng0n1 --ioengine=io_uring_cmd --cmd_type=nvme --bs=4k --iodepth=380 --rw=randread --group_reporting --time_based --runtime=20 --norandommap=1 --fixedbufs=1 --registerfiles=1 --gtod_reduce=1 --iodepth_batch_complete=32 --iodepth_batch_submit=64 --hipri
...
read: IOPS=2367k, BW=9245MiB/s (9694MB/s)(181GiB/20001msec)


With my quiver empty and still short of the 2.8M IOPS goal, the only thing left to try was the io_uring micro-benchmark included with FIO (t/io_uring). I ported most of the same parameters over from the FIO runs and boosted the queue depth.

$ sudo taskset -c 17 ./Projects/fio/t/io_uring -u 1 -O 0 -b 4096 -d 512 -B 1 -F 1 -r 10 -p 1 -c 32 -s 64 /dev/ng0n1
submitter=0, tid=380209, file=/dev/ng0n1, nfiles=1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=1, QD=512
Engine=io_uring, sq_ring=512, cq_ring=512
IOPS=2.81M, BW=10.96GiB/s, IOS/call=43/43
IOPS=2.81M, BW=10.98GiB/s, IOS/call=43/43
IOPS=2.81M, BW=10.97GiB/s, IOS/call=43/43
IOPS=2.81M, BW=10.98GiB/s, IOS/call=43/43
IOPS=2.81M, BW=10.97GiB/s, IOS/call=43/43
IOPS=2.81M, BW=10.98GiB/s, IOS/call=43/43
IOPS=2.81M, BW=10.97GiB/s, IOS/call=43/43
IOPS=2.81M, BW=10.98GiB/s, IOS/call=44/44
IOPS=2.81M, BW=10.97GiB/s, IOS/call=44/44
Exiting on timeout
Maximum IOPS=2.81M


Voila! We hit the magic number of 2.8M IOPS! We extracted the full random read performance of the CSD5310 from a single efficiency core!
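For reference, the t/io_uring flags map onto the FIO options used earlier roughly as follows (my reading of the tool's usage text; verify with ./t/io_uring -h):
  • -d 512: queue depth (iodepth)
  • -s 64 / -c 32: submission and completion batch sizes (iodepth_batch_submit / iodepth_batch_complete)
  • -b 4096: block size (bs=4k)
  • -p 1: polled IO (hipri)
  • -B 1 / -F 1: fixed buffers and registered files (fixedbufs / registerfiles)
  • -u 1: NVMe pass-through (io_uring_cmd with cmd_type=nvme)
  • -O 0: O_DIRECT off (not applicable to the pass-through character device)
  • -r 10: runtime in seconds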



Closing Thoughts

While consumer-grade e-cores have little to do with datacenter workloads, the tuning needed to maximize their IO performance is broadly relevant. As we look to Gen6 SSDs and the next generation of IOPS-dense devices, the number of IOPS available in a single server will be astounding. The Linux kernel community continues to stay ahead of the curve with innovations like io_uring and NVMe pass-through, which preserve as much of the traditional IO infrastructure as possible without compromising on performance.

[1] https://scaleflux.com/products/csd-5000/
[2] https://www.usenix.org/conference/fast24/presentation/joshi
[3] https://github.com/axboe/fio/blob/master/HOWTO.rst


System Details

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i9-13900KS
CPU family: 6
Model: 183
Thread(s) per core: 1
Core(s) per socket: 24
Socket(s): 1
Stepping: 1
Microcode version: 0x133
Frequency boost: enabled
CPU(s) scaling MHz: 132%
CPU max MHz: 3201.0000
CPU min MHz: 800.0000
BogoMIPS: 6374.40
...

$ sudo dmidecode | grep -i asus -A 2
...
Manufacturer: ASUSTeK COMPUTER INC.
Product Name: Pro WS W680-ACE IPMI
Version: Rev 1.xx

$ fio --version
fio-3.41-11-gf2c1d

$ uname -r
7.0.3-arch1-1

$ cat /etc/os-release
NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
...