Btrfs vs ZFS Performance
I have been using Btrfs on my server for a few years, but recently ran into performance issues with databases and snapshots. ZFS seemed to be a popular alternative, so I decided to compare the two in several respects.
Both ZFS and Btrfs are copy-on-write (CoW) filesystems. These types of filesystems do not overwrite data in-place. Instead, they write file data to a new block, then update file metadata to point to it. As a result, their performance characteristics differ quite a bit from a traditional non-CoW filesystem like XFS or ext4.
For this article, I am using test scripts I have written which use fio, a highly customizable I/O testing tool supporting a variety of testing patterns.
The test setup for this article is a spare hard drive I have:
❯ sudo parted /dev/sda print
Model: ATA TOSHIBA DT01ACA1 (scsi)
Disk /dev/sda: 1000GB
Sector size (logical/physical): 512B/4096B
All tests are done without caches/buffers (direct=1 in fio).
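The scripts themselves are not reproduced here, but a SEQ1M_Q8T1-style or RND4K_Q32T1-style run maps onto fio roughly as follows. This is an illustrative sketch, not the exact job files used for the results below; file names and sizes are placeholders.

```
# Sequential read, 1 MiB blocks, queue depth 8, one job (SEQ1M_Q8T1-style)
fio --name=seq1m-q8t1-read --filename=/mnt/test/fio.bin --size=8G \
    --ioengine=libaio --direct=1 --rw=read --bs=1M --iodepth=8 --numjobs=1 \
    --time_based --runtime=60

# Random write, 4 KiB blocks, queue depth 32, one job (RND4K_Q32T1-style)
fio --name=rnd4k-q32t1-write --filename=/mnt/test/fio.bin --size=8G \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
    --time_based --runtime=60
```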
ZFS supports some tunable parameters, some of which are:
- recordsize: The minimum amount of data that must be read from disk (or cache) when data in a file is modified. Writes smaller than this require the whole record to be read and rewritten. Only applicable to datasets.
- volblocksize: The analogous property to recordsize, only applicable for volumes.
- ashift: The block alignment that ZFS uses per vdev, expressed as a power of two (e.g. ashift=12 means 4 KiB), or equivalently, the smallest possible I/O on a vdev.
- primarycache: Controls what the ARC (Adaptive Replacement Cache, an improvement over plain LRU caches) is allowed to cache: all, metadata, or none.
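As a rough sketch of how these are set (the pool, device, and dataset names here are made up, not the exact ones used for the tests):

```
# ashift is fixed per vdev at pool creation; 12 means 2^12 = 4 KiB blocks
zpool create -o ashift=12 tank /dev/sdX

# recordsize and primarycache are per-dataset properties
zfs create -o recordsize=128k -o primarycache=metadata -o compression=off tank/bench

# recordsize can also be changed later; it only affects newly written blocks
zfs set recordsize=4k tank/bench
```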
Test Results
Raw disk vs CoW filesystems
First, let's compare the performance of the raw disk versus these filesystems on top.
- Btrfs mount options:
datacow,noatime,defaults
- ZFS dataset parameters:
atime=off, ashift=0, primarycache=metadata, compression=off
Sequential:
Test (in MB/s) | Raw | Btrfs | ZFS 128k | ZFS 4k |
---|---|---|---|---|
SEQ1M_Q8T1 Read | 164 | 185 | 126 | 55 |
SEQ1M_Q8T1 Write | 188 | 194 | 162 | 63 |
SEQ1M_Q1T1 Read | 190 | 194 | 153 | 69 |
SEQ1M_Q1T1 Write | 188 | 184 | 143 | 64 |
Btrfs is the fastest here, even beating the raw disk in most runs, while ZFS (especially at a 4k recordsize) is the slowest. This might be due to read-ahead or other optimizations in Btrfs that I'm not aware of.
Random:
Test (in IOPS) | Raw | Btrfs | ZFS 128k | ZFS 4k |
---|---|---|---|---|
RND4K_Q32T1 Read | 220 | 226 | 118 | 119 |
RND4K_Q32T1 Write | 396 | 16977 | 116 | 8787 |
RND4K_Q1T1 Read | 140 | 147 | 109 | 121 |
RND4K_Q1T1 Write | 410 | 6139 | 112 | 8262 |
Btrfs wins again. For random writes, ZFS at a 4k recordsize comes in second, well ahead of the 128k dataset; for reads, neither ZFS configuration matches the raw disk.
On CoW filesystems, random write performance at high queue depths can exceed that of the raw disk, since the writes can be laid out sequentially on disk; the filesystem then maps them back to the correct offsets in the file.
Btrfs datacow vs nodatacow
We can directly evaluate the performance characteristics of Btrfs when CoW is turned off (via the nodatacow mount option) and writes are done in-place.
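A sketch of how CoW can be turned off; both forms are standard Btrfs, though the mount point and path are illustrative:

```
# Whole filesystem: remount with nodatacow (affects newly created files)
mount -o remount,nodatacow,noatime /mnt/btrfs-test

# Per directory: the C attribute disables CoW (and checksums) for files created afterwards
chattr +C /mnt/btrfs-test/nocow-dir
```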
Test (MB/s) | datacow | nodatacow | Raw |
---|---|---|---|
SEQ1M_Q8T1 Read | 185 | 195 | 164 |
SEQ1M_Q8T1 Write | 194 | 187 | 188 |
SEQ1M_Q1T1 Read | 194 | 195 | 190 |
SEQ1M_Q1T1 Write | 184 | 195 | 188 |
No measurable difference in sequential performance.
Test (IOPS) | datacow | nodatacow | Raw |
---|---|---|---|
RND4K_Q32T1 Read | 226 | 217 | 220 |
RND4K_Q32T1 Write | 16977 | 402 | 396 |
RND4K_Q1T1 Read | 147 | 155 | 140 |
RND4K_Q1T1 Write | 6139 | 350 | 410 |
With nodatacow, random write speeds drop back to that of the raw disk.
As CoW is turned off, random writes happen in-place on disk, and so suffer the same penalties as the raw disk.
nodatacow and checksums
When nodatacow is turned on, checksums (and compression) are disabled. Data corruption cannot be detected, and in a RAID1 setup it will not be possible to determine which disk has the correct block, so correction is not possible either.
For this reason, despite the performance benefit, nodatacow should not be used.
In-place writes
A common database workload involves reading/writing repeatedly to the same file at random locations. This causes fragmentation of the file in CoW filesystems, since the writes are not done in-place. Subsequent sequential reading of the file is going to be slower, possibly close to that of random read performance.
We can simulate this by first writing a large file, doing random writes for a while, and then reading the same file sequentially.
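In fio terms, the simulation is roughly a three-phase job like this (sizes and runtimes are illustrative, not the exact parameters used):

```
# 1) Lay the file out sequentially
fio --name=layout --filename=/mnt/test/db.bin --size=8G \
    --ioengine=libaio --direct=1 --rw=write --bs=1M --iodepth=8

# 2) Fragment it with sustained random 4 KiB writes
fio --name=fragment --filename=/mnt/test/db.bin --size=8G \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=32 \
    --time_based --runtime=300

# 3) Read the same file back sequentially and compare against phase 1
fio --name=post-read --filename=/mnt/test/db.bin --size=8G \
    --ioengine=libaio --direct=1 --rw=read --bs=1M --iodepth=8
```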
- ZFS dataset parameters:
atime=off, ashift=0, primarycache=metadata, compression=off
- datacow and nodatacow refer to Btrfs with the respective mount option
Test | datacow | ZFS 128k | ZFS 4k | nodatacow |
---|---|---|---|---|
SEQ1M_Q8T1 Write, MB/s | 188 | 182 | 65 | 187 |
SEQ1M_Q8T1 Read, MB/s (Pre) | 187 | 183 | 82 | 192 |
RND4K_Q32T1 Write, IOPS | 20024 | 92 | 4933 | 271 |
SEQ1M_Q8T1 Read, MB/s (Post) | 1 | 73 | 3 | 188 |
Here, we see that the performance of Btrfs with CoW is abysmal.
At a recordsize of 128k, ZFS does much better, coming in at around half the performance of writing in-place (simulated via nodatacow). However, at a 4k recordsize, the performance drops significantly.
For fragmented files, or workloads that cause fragmentation, a large recordsize can help with sequential reads, at the price of slower random writes (I/O amplification due to read-modify-write cycles).
Note: Btrfs does not offer any way to set the default extent size (the equivalent of recordsize). At present the minimum size is 4kB, with no limit on the maximum.
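A common piece of tuning advice is to match recordsize to the workload's write size, for example a database's page size. A hedged example (the dataset name is made up; check your database's actual page size, e.g. 16 KiB for InnoDB or 8 KiB for PostgreSQL):

```
# Match recordsize to the database page size for this dataset
zfs set recordsize=16k tank/db
# Note: the new recordsize only applies to blocks written after the change
```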
Snapshot Performance
Taking repeated snapshots of a filesystem can affect performance, since defragmentation of any sort can't really happen while snapshotted blocks are still being referenced.
In this test, we take repeated snapshots (~3/s) while the random writes are being done, which should theoretically hamper optimization attempts to sequentially align data.
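One way to generate snapshots at that rate is a simple loop alongside the fio run; a sketch, with an illustrative pool/dataset name:

```
# Take a ZFS snapshot roughly three times per second while the writes run
while true; do
    zfs snapshot tank/bench@bench-$(date +%s%N)
    sleep 0.3
done
```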
- ZFS dataset parameters:
atime=off, ashift=0, primarycache=metadata, compression=off
Test | 128k | 128k, snaps | 4k | 4k, snaps |
---|---|---|---|---|
SEQ1M_Q8T1 Write, MB/s | 182 | 180 | 73 | 73 |
SEQ1M_Q8T1 Read, MB/s (Post) | 73 | 90 | 7 | 10 |
Btrfs autodefrag
The Btrfs autodefrag documentation reads: "When enabled, small random writes into files (in a range of tens of kilobytes, currently it’s 64KiB) are detected and queued up for the defragmentation process."
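It is controlled by a mount option; a minimal sketch, with an illustrative mount point:

```
# autodefrag can be enabled at mount time or toggled on a remount
mount -o remount,autodefrag /mnt/btrfs-test
```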
Let's see how turning this on changes things:
- Btrfs mount options:
datacow,noatime,defaults
Test | noautodefrag | autodefrag |
---|---|---|
SEQ1M_Q8T1 Write | 188 MB/s | 186 MB/s |
SEQ1M_Q8T1 Read (Pre) | 187 MB/s | 187 MB/s |
RND4K_Q32T1 Write | 20024 IOPS | 19708 IOPS |
SEQ1M_Q8T1 Read (Post) | 1 MB/s | 1 MB/s |
No measurable difference.
ZFS datasets vs volumes
ZFS also supports volumes (zvols), which are datasets that represent a block device. These can be passed as a raw disk to a VM, while still retaining checksums and the ability to snapshot the volume.
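A zvol is created with a fixed size and a volblocksize, which must be chosen at creation time. A sketch (pool name, volume name, and size are illustrative):

```
# Create a 32 GiB volume with 16 KiB blocks; it appears as /dev/zvol/tank/benchvol
zfs create -V 32G -o volblocksize=16k -o compression=off tank/benchvol
```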
- ZFS dataset parameters:
atime=off, ashift=0, primarycache=metadata, compression=off
- ZFS zvol parameters:
ashift=12, compression=off
- Despite turning off caches, results for the SEQ1M_Q8T1 read for zvols were unrealistically high and so they are omitted.
Test (MB/s) | Dataset, 16k | zvol, 16k |
---|---|---|
SEQ1M_Q8T1 Write | 158 | 113 |
SEQ1M_Q1T1 Read | 156 | 264 |
SEQ1M_Q1T1 Write | 151 | 43 |
Test (IOPS) | Dataset, 16k | zvol, 16k |
---|---|---|
RND4K_Q32T1 Read | 127 | 458 |
RND4K_Q32T1 Write | 121 | 176 |
RND4K_Q1T1 Read | 112 | 210 |
RND4K_Q1T1 Write | 124 | 63 |
RAID1
We would also like to see if there are any performance gains when using a RAID1 setup, because in theory, reads can be fulfilled by both devices simultaneously.
To do this, we can run benchmarks on both Btrfs and ZFS, with and without RAID1 setup.
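The planned setups would look roughly like this (device names are placeholders):

```
# Btrfs RAID1 for both data and metadata
mkfs.btrfs -d raid1 -m raid1 /dev/sdX /dev/sdY

# ZFS two-way mirror
zpool create tank mirror /dev/sdX /dev/sdY
```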
In progress
Of note: when using RAIDZ, use fewer disks per vdev (or better, mirrors) to improve random IOPS.
Conclusion
CoW filesystems allow for efficient snapshotting, and come with a variety of useful features such as checksumming, compression and encryption (only for ZFS). The main drawbacks are due to the CoW mechanism, most clearly observed with heavy random write workloads which cause fragmentation. To some extent, this can be mitigated by increasing the extent/record size, sacrificing random write IOPS for sequential read speeds. At the time of this writing, only ZFS allows for that.