Btrfs vs ZFS Performance
I have been using Btrfs on my server for a few years, but recently ran into performance issues with databases and snapshots. ZFS seemed to be a popular alternative, so I decided to compare the two in several respects.
Both ZFS and Btrfs are copy-on-write (CoW) filesystems. These types of filesystems do not overwrite data in-place. Instead, they write file data to a new block, then update file metadata to point to it. As a result, their performance characteristics differ quite a bit from a traditional non-CoW filesystem like XFS or ext4.
For this article, I am using test scripts I have written which use fio, a highly customizable I/O testing tool supporting a variety of testing patterns.
The test setup for this article is a spare hard drive I have:
❯ sudo parted /dev/sda print
Model: ATA TOSHIBA DT01ACA1 (scsi)
Disk /dev/sda: 1000GB
Sector size (logical/physical): 512B/4096B
All tests are done without caches/buffers (direct=1 in fio).
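The scripts themselves are not reproduced here, but a SEQ1M_Q8T1-style or RND4K_Q32T1-style run maps onto fio roughly as follows. This is an illustrative sketch, not the exact job files used for the results below; file names and sizes are placeholders.

```
# Sequential read, 1 MiB blocks, queue depth 8, one job (SEQ1M_Q8T1-style)
fio --name=seq1m-q8t1-read --filename=/mnt/test/fio.bin --size=8G \
    --ioengine=libaio --direct=1 --rw=read --bs=1M --iodepth=8 --numjobs=1 \
    --time_based --runtime=60

# Random write, 4 KiB blocks, queue depth 32, one job (RND4K_Q32T1-style)
fio --name=rnd4k-q32t1-write --filename=/mnt/test/fio.bin --size=8G \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
    --time_based --runtime=60
```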
ZFS supports some tunable parameters, some of which are:
- recordsize: The minimum amount of data that must be read from disk (or cache) when data in a file is modified. Writes smaller than this require the whole record to be read and rewritten. Only applicable to datasets.
- volblocksize: The analogous property to recordsize, only applicable for volumes.
- ashift: The block alignment that ZFS uses per vdev, expressed as a power of two (e.g. ashift=12 means 4 KiB), or equivalently, the smallest possible I/O on a vdev.
- primarycache: Controls what the ARC (Adaptive Replacement Cache, an improvement over plain LRU caches) is allowed to cache: all, metadata, or none.
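As a rough sketch of how these are set (the pool, device, and dataset names here are made up, not the exact ones used for the tests):

```
# ashift is fixed per vdev at pool creation; 12 means 2^12 = 4 KiB blocks
zpool create -o ashift=12 tank /dev/sdX

# recordsize and primarycache are per-dataset properties
zfs create -o recordsize=128k -o primarycache=metadata -o compression=off tank/bench

# recordsize can also be changed later; it only affects newly written blocks
zfs set recordsize=4k tank/bench
```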
Test Results
Raw disk vs CoW filesystems
First, let's compare the performance of the raw disk versus these filesystems on top.
- Btrfs mount options:
datacow,noatime,defaults
- ZFS dataset parameters:
atime=off, ashift=0, primarycache=metadata, compression=off
Sequential:
Test (in MB/s) | Raw | Btrfs | ZFS 128k | ZFS 4k |
---|---|---|---|---|
SEQ1M_Q8T1 Read | 164 | 185 | 126 | 55 |
SEQ1M_Q8T1 Write | 188 | 194 | 162 | 63 |
SEQ1M_Q1T1 Read | 190 | 194 | 153 | 69 |
SEQ1M_Q1T1 Write | 188 | 184 | 143 | 64 |
Btrfs is the fastest here, even beating the raw disk in most runs, while ZFS (especially at a 4k recordsize) is the slowest. This might be due to read-ahead or other optimizations in Btrfs that I'm not aware of.
Random:
Test (in IOPS) | Raw | Btrfs | ZFS 128k | ZFS 4k |
---|---|---|---|---|
RND4K_Q32T1 Read | 220 | 226 | 118 | 119 |
RND4K_Q32T1 Write | 396 | 16977 | 116 | 8787 |
RND4K_Q1T1 Read | 140 | 147 | 109 | 121 |
RND4K_Q1T1 Write | 410 | 6139 | 112 | 8262 |
Btrfs wins again. For random writes, ZFS at a 4k recordsize comes in second, well ahead of the 128k dataset; for reads, neither ZFS configuration matches the raw disk.
On CoW filesystems, random write performance at high queue depths can exceed that of the raw disk, since the writes can be laid out sequentially on disk; the filesystem then maps them back to the correct offsets in the file.
Btrfs datacow vs nodatacow
We can directly evaluate the performance characteristics of Btrfs when CoW is turned off (via the nodatacow mount option) and writes are done in-place.
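A sketch of how CoW can be turned off; both forms are standard Btrfs, though the mount point and path are illustrative:

```
# Whole filesystem: remount with nodatacow (affects newly created files)
mount -o remount,nodatacow,noatime /mnt/btrfs-test

# Per directory: the C attribute disables CoW (and checksums) for files created afterwards
chattr +C /mnt/btrfs-test/nocow-dir
```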
Test (MB/s) | datacow | nodatacow | Raw |
---|---|---|---|
SEQ1M_Q8T1 Read | 185 | 195 | 164 |
SEQ1M_Q8T1 Write | 194 | 187 | 188 |
SEQ1M_Q1T1 Read | 194 | 195 | 190 |
SEQ1M_Q1T1 Write | 184 | 195 | 188 |
No measurable difference in sequential performance.
Test (IOPS) | datacow | nodatacow | Raw |
---|---|---|---|
RND4K_Q32T1 Read | 226 | 217 | 220 |
RND4K_Q32T1 Write | 16977 | 402 | 396 |
RND4K_Q1T1 Read | 147 | 155 | 140 |
RND4K_Q1T1 Write | 6139 | 350 | 410 |
With nodatacow, random write speeds drop back to that of the raw disk.
As CoW is turned off, random writes happen in-place on disk, and so suffer the same penalties as the raw disk.
nodatacow and checksums
When nodatacow is turned on, checksums (and compression) are disabled. Data corruption cannot be detected, and in a RAID1 setup it will not be possible to determine which disk has the correct block, so correction is not possible either.
For this reason, despite the performance benefit, nodatacow should not be used.
In-place writes
A common database workload involves reading/writing repeatedly to the same file at random locations. This causes fragmentation of the file in CoW filesystems, since the writes are not done in-place. Subsequent sequential reading of the file is going to be slower, possibly close to that of random read performance.
We can simulate this by first writing a large file, doing random writes for a while, and then reading the same file sequentially.
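In fio terms, the simulation is roughly a three-phase job like this (sizes and runtimes are illustrative, not the exact parameters used):

```
# 1) Lay the file out sequentially
fio --name=layout --filename=/mnt/test/db.bin --size=8G \
    --ioengine=libaio --direct=1 --rw=write --bs=1M --iodepth=8

# 2) Fragment it with sustained random 4 KiB writes
fio --name=fragment --filename=/mnt/test/db.bin --size=8G \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=32 \
    --time_based --runtime=300

# 3) Read the same file back sequentially and compare against phase 1
fio --name=post-read --filename=/mnt/test/db.bin --size=8G \
    --ioengine=libaio --direct=1 --rw=read --bs=1M --iodepth=8
```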
- ZFS dataset parameters:
atime=off, ashift=0, primarycache=metadata, compression=off
- datacow and nodatacow refer to Btrfs with the respective mount option
Test | datacow | ZFS 128k | ZFS 4k | nodatacow |
---|---|---|---|---|
SEQ1M_Q8T1 Write, MB/s | 188 | 182 | 65 | 187 |
SEQ1M_Q8T1 Read, MB/s (Pre) | 187 | 183 | 82 | 192 |
RND4K_Q32T1 Write, IOPS | 20024 | 92 | 4933 | 271 |
SEQ1M_Q8T1 Read, MB/s (Post) | 1 | 73 | 3 | 188 |
Here, we see that the performance of Btrfs with CoW is abysmal.
At a recordsize of 128k, ZFS does much better, coming in at around half the performance of writing in-place (simulated via nodatacow). However, at a 4k recordsize, the performance drops significantly.
For fragmented files, or workloads that cause fragmentation, a large recordsize can help with sequential reads, at the price of slower random writes (I/O amplification due to read-modify-write cycles).
Note: Btrfs does not offer any way to set the default extent size (the equivalent of recordsize). At present the minimum size is 4kB, with no limit on the maximum.
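A common piece of tuning advice is to match recordsize to the workload's write size, for example a database's page size. A hedged example (the dataset name is made up; check your database's actual page size, e.g. 16 KiB for InnoDB or 8 KiB for PostgreSQL):

```
# Match recordsize to the database page size for this dataset
zfs set recordsize=16k tank/db
# Note: the new recordsize only applies to blocks written after the change
```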
Snapshot Performance
Taking repeated snapshots of a filesystem can affect performance, since defragmentation of any sort can't really happen while snapshotted blocks are still being referenced.
In this test, we take repeated snapshots (~3/s) while the random writes are being done, which should theoretically hamper optimization attempts to sequentially align data.
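One way to generate snapshots at that rate is a simple loop alongside the fio run; a sketch, with an illustrative pool/dataset name:

```
# Take a ZFS snapshot roughly three times per second while the writes run
while true; do
    zfs snapshot tank/bench@bench-$(date +%s%N)
    sleep 0.3
done
```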
- ZFS dataset parameters:
atime=off, ashift=0, primarycache=metadata, compression=off
Test | 128k | 128k, snaps | 4k | 4k, snaps |
---|---|---|---|---|
SEQ1M_Q8T1 Write, MB/s | 182 | 180 | 73 | 73 |
SEQ1M_Q8T1 Read, MB/s (Post) | 73 | 90 | 7 | 10 |
Btrfs autodefrag
The Btrfs autodefrag documentation reads: "When enabled, small random writes into files (in a range of tens of kilobytes, currently it’s 64KiB) are detected and queued up for the defragmentation process."
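It is controlled by a mount option; a minimal sketch, with an illustrative mount point:

```
# autodefrag can be enabled at mount time or toggled on a remount
mount -o remount,autodefrag /mnt/btrfs-test
```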
Let's see how turning this on changes things:
- Btrfs mount options:
datacow,noatime,defaults
Test | noautodefrag | autodefrag |
---|---|---|
SEQ1M_Q8T1 Write | 188 MB/s | 186 MB/s |
SEQ1M_Q8T1 Read (Pre) | 187 MB/s | 187 MB/s |
RND4K_Q32T1 Write | 20024 IOPS | 19708 IOPS |
SEQ1M_Q8T1 Read (Post) | 1 MB/s | 1 MB/s |
No measurable difference.
ZFS datasets vs volumes
ZFS also supports volumes (zvols), which are datasets that represent a block device. These can be passed as a raw disk to a VM, while still retaining checksums and the ability to snapshot the volume.
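A zvol is created with a fixed size and a volblocksize, which must be chosen at creation time. A sketch (pool name, volume name, and size are illustrative):

```
# Create a 32 GiB volume with 16 KiB blocks; it appears as /dev/zvol/tank/benchvol
zfs create -V 32G -o volblocksize=16k -o compression=off tank/benchvol
```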
- ZFS dataset parameters:
atime=off, ashift=0, primarycache=metadata, compression=off
- ZFS zvol parameters:
ashift=12, compression=off
- Despite turning off caches, results for the SEQ1M_Q8T1 read for zvols were unrealistically high and so they are omitted.
Test (MB/s) | Dataset, 16k | zvol, 16k |
---|---|---|
SEQ1M_Q8T1 Write | 158 | 113 |
SEQ1M_Q1T1 Read | 156 | 264 |
SEQ1M_Q1T1 Write | 151 | 43 |
Test (IOPS) | Dataset, 16k | zvol, 16k |
---|---|---|
RND4K_Q32T1 Read | 127 | 458 |
RND4K_Q32T1 Write | 121 | 176 |
RND4K_Q1T1 Read | 112 | 210 |
RND4K_Q1T1 Write | 124 | 63 |
RAID1
We would also like to see if there are any performance gains when using a RAID1 setup, because in theory, reads can be fulfilled by both devices simultaneously.
To do this, we can run benchmarks on both Btrfs and ZFS, with and without RAID1 setup.
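The planned setups would look roughly like this (device names are placeholders):

```
# Btrfs RAID1 for both data and metadata
mkfs.btrfs -d raid1 -m raid1 /dev/sdX /dev/sdY

# ZFS two-way mirror
zpool create tank mirror /dev/sdX /dev/sdY
```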
In progress
Of note: when using RAIDZ, use fewer disks per vdev (or better, mirrors) to improve random IOPS.
Conclusion
CoW filesystems allow for efficient snapshotting, and come with a variety of useful features such as checksumming, compression and encryption (only for ZFS). The main drawbacks are due to the CoW mechanism, most clearly observed with heavy random write workloads which cause fragmentation. To some extent, this can be mitigated by increasing the extent/record size, sacrificing random write IOPS for sequential read speeds. At the time of this writing, only ZFS allows for that.