Longhorn Performance Benchmarking

Longhorn is the cloud native distributed block storage in the Kubernetes ecosystem. It is an open source software defined storage that synchronously replicates the volumes across multiple Kubernetes nodes to achieve high availability and high degree of resiliency. Longhorn volume replicas are hosted on separate nodes to prevent single point of failure. In a nutshell, Longhorn volume is used as the fault-tolerant persistent volume for the stateful pods in Kubernetes.

The objective of running these performance benchmarking tests is to gauge the output in terms of IOPS, latency, bandwidth and CPU usage when the stateful pods use the Longhorn storage on a particular hardware specification in the Kubernetes cluster.

Hardware
Architecture
Storage Performance Tool
Some Storage Facts
Longhorn vs Local Attached Storage
Longhorn Replica Size 2 vs Size 3

Hardware

The performance benchmarking tests are carried out using the following hardware specification.

CPU	Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz
Memory	DIMM DDR4 Synchronous Registered (Buffered) 2933 MHz (0.3 ns)
Disk	SSD P4610 1.6TB SFF

The datasheet for the storage disk SSD P4610 can be obtained here. The performance specifications stated in this datasheet can be used to benchmark against the result of the tests.

Architecture

The Longhorn performance tests are carried out in a Kubernetes cluster with 3 physical nodes connected to each other on the same subnet. Each test run inside the stateful pod that is attached to the storage class backed by the Longhorn persistent volume with replica size 2 and 3.
The local attached storage performance test is carried out directly on the operating system of the physical server.

Storage Performance Tool

fio is used as the storage performance tool using different block size (bs) and IOdepth as illustrated in the following table. Part of the tests are performed based on random 50% read and 50% write (files scattered around the disk). The rest is based on sequential 100% read and 100% write operations.

Block Size (bs)	IOdepth
4k	32
8k	32
64k	16
1024k	8

The full fio command with random 50% read and 50% write operation is shown as follows.

fio --name=fiotest --filename=test --size=10Gb --numjobs=8 --ioengine=libaio --group_reporting --runtime=60 --startdelay=60 --bs=8k --iodepth=32 --rw=randrw --direct=1 --rwmixread=50

The full fio command for write only is shown as follows.

fio --name=fiotest --filename=test --size=10Gb --direct=1 --numjobs=8 --ioengine=libaio --group_reporting --runtime=60 --startdelay=60 --bs=8k --iodepth=32 --rw=write

When testing the Longhorn performance, the abovementioned commands are executed inside the Kubernetes pod as illustrated in the following example.

# kubectl exec -ti fio-0 -n test -- /bin/bash
[root@fio-0 /]# fio --name=fiotest --filename=test --size=10Gb --rw=randrw --direct=1 --rwmixread=50 --numjobs=8 --ioengine=libaio --group_reporting --runtime=60 --startdelay=60 --bs=1M --iodepth=8
fiotest: (g=0): rw=randrw, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=8
...
fio-3.7
Starting 8 processes
fiotest: Laying out IO file (1 file / 10240MiB)
^Cbs: 8 (f=8): [m(8)][86.7%][r=266MiB/s,w=256MiB/s][r=266,w=256 IOPS][eta 00m:16s]
fio: terminating on signal 2
Jobs: 8 (f=8): [m(8)][87.5%][r=238MiB/s,w=250MiB/s][r=238,w=250 IOPS][eta 00m:15s]
fiotest: (groupid=0, jobs=8): err= 0: pid=119: Wed May  4 01:59:49 2022
   read: IOPS=246, BW=246MiB/s (258MB/s)(10.6GiB/44238msec)
    slat (usec): min=10, max=110765, avg=3027.79, stdev=10297.34
    clat (msec): min=8, max=343, avg=92.82, stdev=35.80
     lat (msec): min=8, max=343, avg=95.85, stdev=37.22
    clat percentiles (msec):
     |  1.00th=[   26],  5.00th=[   44], 10.00th=[   51], 20.00th=[   66],
     | 30.00th=[   73], 40.00th=[   80], 50.00th=[   91], 60.00th=[  100],
     | 70.00th=[  106], 80.00th=[  121], 90.00th=[  138], 95.00th=[  157],
     | 99.00th=[  201], 99.50th=[  224], 99.90th=[  284], 99.95th=[  305],
     | 99.99th=[  326]
   bw (  KiB/s): min=10240, max=57344, per=12.51%, avg=31559.36, stdev=7650.58, samples=704
   iops        : min=   10, max=   56, avg=30.76, stdev= 7.46, samples=704
  write: IOPS=252, BW=252MiB/s (264MB/s)(10.9GiB/44238msec)
    slat (usec): min=60, max=196760, avg=14803.03, stdev=18096.68
    clat (msec): min=27, max=444, avg=145.26, stdev=53.50
     lat (msec): min=31, max=445, avg=160.07, stdev=55.94
    clat percentiles (msec):
     |  1.00th=[   61],  5.00th=[   75], 10.00th=[   86], 20.00th=[  101],
     | 30.00th=[  110], 40.00th=[  124], 50.00th=[  136], 60.00th=[  150],
     | 70.00th=[  167], 80.00th=[  186], 90.00th=[  220], 95.00th=[  249],
     | 99.00th=[  305], 99.50th=[  321], 99.90th=[  372], 99.95th=[  388],
     | 99.99th=[  414]
   bw (  KiB/s): min=18432, max=49152, per=12.48%, avg=32225.72, stdev=5107.40, samples=704
   iops        : min=   18, max=   48, avg=31.41, stdev= 5.00, samples=704
  lat (msec)   : 10=0.01%, 20=0.16%, 50=4.78%, 100=35.66%, 250=56.93%
  lat (msec)   : 500=2.46%
  cpu          : usr=0.42%, sys=0.69%, ctx=15497, majf=0, minf=81
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.7%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10900,11151,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
   READ: bw=246MiB/s (258MB/s), 246MiB/s-246MiB/s (258MB/s-258MB/s), io=10.6GiB (11.4GB), run=44238-44238msec
  WRITE: bw=252MiB/s (264MB/s), 252MiB/s-252MiB/s (264MB/s-264MB/s), io=10.9GiB (11.7GB), run=44238-44238msec

Some Storage Facts

IOPS, bandwidth and latency are the common performance indicators from the storage perspective.
IOPS (Input/output Operations Per Second) represents the number of requests being sent to the storage disk per one second. IOPS could be measured in read or write operation - random or sequential.
Bandwidth or throughput is a metric measuring the amount of data that the application sends to the disk in a specified interval.
Latency measures the time taken to send and receive the data bits to and from the storage disk respectively.
The graph below shows that both bandwidth and latency increase whereas IOPS and CPU usage decrease when the block size is higher.
Generally, the block size for data warehouse is typically higher than OLTP (Online Transaction Processing) systems. This is because data warehouse tends to require high bandwidth whereas OLTP system requires high IOPS. High IOPS can be achieved at the expense of low bandwidth. In short, block size should be configured appropriately based on the characteristics of the workloads.

Sequential data with either 100% read or 100% write produces higher IOPS significantly in comparison to operation with random 50% read and 50% write as shown below.

Longhorn vs Local Attached Storage

Any software defined storage is expected to perform lower than the local attached storage. This is because a typical software defined storage like Longhorn replicates the data synchronously to achieve high availability. The question is, to what extent the performance penalty is at stake.
The following graph shows the performance difference between these two storage variants as the result of a series of tests using different block size, IOdepth and read/write operation.

Note that the above result is generated using SSD as the storage disk. Spinning disk could produce much lower performance in comparison to the SSD.

Longhorn Replica Size 2 vs Size 3

The following graph illustrates the performance output as a result of using Longhorn volume with replica size 2 and replica size 3.

The result seemingly implies that there is no significant performance difference between replica size 2 and size 3 in a small Kubernetes cluster. Replica size 3 ensures higher redundancy but at the expense of higher cost by involving more node to store the data. Replica size 2, on the other hand, reduces hardware cost but at the expense of lower redundancy.
Read operation yields higher IOPS compared to write operation.
Block size with 64k and 1024k reduce IOPS significantly.