Migrating from Btrfs to ZFS
In my article on self-hosting, I initially wrote about how I was using Btrfs to store my data. However, I ran into issues with app database reads/writes becoming very slow due to fragmentation. Additionally, I wanted better space efficiency than RAID1 offered, as I was expanding my hard drive collection.
I ran some performance tests to compare Btrfs and ZFS, and the results showed that ZFS with a recordsize of 128k was pretty resilient to fragmentation (simulated by random in-place writes into a file).
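To give a rough idea of the methodology, an in-place random-write workload like this can be reproduced with fio; the job below is only a sketch with illustrative sizes and paths, not my exact test parameters:

# Fragment a test file with small random in-place writes
❯ fio --name=fragment --filename=/tank/test/bigfile --size=10G --rw=randwrite --bs=16k --ioengine=psync --fsync=1 --runtime=300 --time_based
# Then measure sequential read throughput over the now-fragmented file
# (drop caches or re-import the pool first so reads actually hit disk)
❯ fio --name=seqread --filename=/tank/test/bigfile --size=10G --rw=read --bs=1M --ioengine=psync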
I finally decided on ZFS with RAIDZ2, for the following reasons:
- RAIDZ's better space efficiency compared to RAID1 in Btrfs
- Ability to tune the recordsize to reduce the impact of fragmentation (see the example after this list)
- RAIDZ2 tolerates up to 2 drive failures, avoiding the danger of a second disk failing during the intensive resilvering of a RAIDZ1 rebuild
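For example, recordsize can be set per dataset (128K is the default), so database-heavy and media-heavy data can be tuned separately. The dataset names below are placeholders:

# Smaller records for database-style random rewrites
❯ zfs set recordsize=16K storage/databases
# Larger records for big, mostly-sequential media files
❯ zfs set recordsize=1M storage/media
❯ zfs get recordsize storage/databases storage/media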
Setup and Initial Data Migration
I wrote some scripts to move my data from Btrfs to ZFS.
I started with the following:
- 12TB: Btrfs RAID1
- 12TB: Btrfs RAID1
- 8TB: Btrfs RAID1
- 14TB: Off site archive
- 14TB: Spare
- 10TB: Spare
Usable space: 16TB in RAID1 (max 1 drive failure)
Initially, I attempted to migrate my Btrfs snapshots to ZFS (via rsync with scripts). However, I realized that this was extremely space inefficient: reflink-ed blocks in Btrfs would take up additional space, and while deduplication in ZFS could claw that space back, it is memory intensive and cannot be fully disabled until all deduplicated data is rewritten.
Deduplication
I deliberated about whether to turn on deduplication. However, the space savings were minuscule:
❯ sudo zdb -S storage
Simulated DDT histogram:
bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
    1    33.2M   4.14T   4.07T   4.07T    33.2M   4.14T   4.07T   4.07T
    2     376K   46.5G   44.6G   44.6G     760K   94.0G   90.1G   90.1G
    4    1.85K    190M    125M    126M    8.31K    840M    541M    543M
    8      162   10.2M   2.73M   2.87M    1.66K    114M   27.7M   29.0M
    16       36   3.04M   1.29M   1.30M      853   74.6M   34.5M   34.9M
    32       24   2.76M   1.35M   1.35M      855   95.7M   45.0M     45M
    64        6     82K   77.5K     88K      480   6.25M   5.80M   6.58M
   128        3     41K   8.50K     12K      595   9.70M   1.76M   2.32M
   256        3      4K      4K     12K      892   1.18M   1.18M   3.48M
Total    33.6M   4.19T   4.12T   4.12T    34.0M   4.24T   4.16T   4.16T
dedup = 1.01, compress = 1.02, copies = 1.00, dedup * compress / copies = 1.03
So, I decided to rsync my Btrfs array to a ZFS dataset I created on the 10TB drive (you can see the rsync options I used here).
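For illustration (not necessarily the exact flags behind that link), an archive-style copy that preserves hard links, ACLs, xattrs and sparse files looks something like this; the paths are placeholders:

❯ rsync -aHAX --sparse --info=progress2 /mnt/btrfs-array/ /storage-temp/data/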
ZFS Dataset Options
I created a pool with a child dataset, with the following properties:
- compression=zstd-3
- atime=off
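As a sketch, creating the temporary pool on the 10TB drive and a child dataset with those properties looks roughly like this (the device path is a placeholder):

# Single-disk temporary pool on the 10TB drive
❯ zpool create storage-temp /dev/disk/by-id/ata-EXAMPLE-10TB
# Child dataset carrying the properties above
❯ zfs create -o compression=zstd-3 -o atime=off storage-temp/data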
Warning
My data did not have any redundancy during the transfer to/from the 10TB drive. In fact, I encountered one checksum error on ZFS while reading back from the drive (fortunately, I had a backup of that file).
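For reference, checksum errors show up in the per-device error counters, and a scrub forces ZFS to re-read and verify everything; the pool name below is the temporary one from earlier:

# Per-device read/write/checksum error counters, plus any affected files
❯ zpool status -v storage-temp
# Re-read all data and verify checksums
❯ zpool scrub storage-temp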
After that was done, I wanted to create a RAIDZ2 pool. However, I couldn't use one of the 14TB drives, as it still had my old Btrfs archives on it. So, I created a RAIDZ2 pool using the sparse file trick (put together in the sketch after this list):
- First, I created a 14TB sparse file with truncate -s 14T sparse.tmp
- Then, I created a RAIDZ2 zpool with the 2x12TB drives and one 14TB drive, plus the sparse file
- I took the sparse file offline
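Put together, the sequence looked roughly like the following (device paths, pool name and the sparse file location are illustrative):

# 14TB sparse file standing in for the drive that still holds the Btrfs archives
❯ truncate -s 14T /var/tmp/sparse.tmp
# 4-wide RAIDZ2: 2x12TB + 1x14TB + the sparse file
❯ zpool create storage raidz2 \
      /dev/disk/by-id/ata-12TB-A \
      /dev/disk/by-id/ata-12TB-B \
      /dev/disk/by-id/ata-14TB-A \
      /var/tmp/sparse.tmp
# Offline the sparse file immediately so nothing is ever written to it
❯ zpool offline storage /var/tmp/sparse.tmp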
I then started the transfer:
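One way to do the transfer is a recursive zfs send/receive from the temporary pool into the new one; the snapshot and dataset names below are illustrative:

❯ zfs snapshot -r storage-temp/data@migrate
❯ zfs send -R storage-temp/data@migrate | zfs receive -u storage/data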
Finally, I added the second 14TB drive (after offloading my backups) and ran zpool replace to swap it in for the offlined sparse file.
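The replace step points ZFS at the offlined sparse file and the new physical disk, which triggers a resilver onto the real drive (paths illustrative):

❯ zpool replace storage /var/tmp/sparse.tmp /dev/disk/by-id/ata-14TB-B
# Watch the resilver progress
❯ zpool status storage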
My final setup:
- 10TB: Off site archive
- 12TB: RAIDZ2
- 12TB: RAIDZ2
- 14TB: RAIDZ2
- 14TB: RAIDZ2
Usable space: 24TB in RAIDZ2 (max 2 drive failures)
ZFS Limitations
ZFS is not without limitations. In particular:
- RAIDZ vdevs cannot be shrunk
- You cannot add a device which is smaller than the smallest disk in a RAIDZ vdev
- No rebalance after expanding RAIDZ vdev (data-to-parity ratio for old blocks is unchanged)
- Cannot convert a mirror to RAIDZ in-place
- Top-level vdevs can only be removed if the primary pool storage does not contain a top-level raidz vdev, all top-level vdevs have the same sector size, and the keys for all encrypted datasets are loaded.
RAIDZ expansion notes:
- Old blocks keep their old data-to-parity ratio. In the worst case, expanding a full 4x1TB RAIDZ2 to 8x1TB RAIDZ2 would result in the final array being 50% utilized. However, this can be mitigated somewhat by re-copying the data, such as with zfs send -R (sketched below) or by using sparse files and offlining them.
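A sketch of the send/receive rewrite (dataset names are placeholders; only destroy the original after verifying the copy):

❯ zfs snapshot -r storage/data@rebalance
# Rewriting the data makes it use the new, wider data-to-parity ratio
❯ zfs send -R storage/data@rebalance | zfs receive storage/data-new
❯ zfs destroy -r storage/data
❯ zfs rename storage/data-new storage/data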
Monitoring:
ZFS has nice utilities to monitor drive and pool status.
Of note, zpool iostat is really useful for identifying pre-failing drives and diagnosing poorly performing arrays.
For example, with the following command, we can see the distribution of reads and writes in a pool, the slowest drives, and more:
❯ zpool iostat -vly <pool> 5 1
                capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild
pool          alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait
------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
storage-temp  1.27T  7.83T      0    161      0   148M      -   95ms      -   12ms      -    1us      -   82ms      -      -      -
  sde         1.27T  7.83T      0    161      0   148M      -   95ms      -   12ms      -    1us      -   82ms      -      -      -
------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
More details here.