ZFS Best Practices and Caveats
Dating from 2005, the Zettabyte File System (ZFS) is a complex and reliable volume manager and filesystem. Over the years it has received numerous improvements and additions, so much so that it attracted widespread attention.
In fact, it became so popular that the OpenZFS community created a well-developed port (zfsonlinux) for various Linux distributions. Canonical considered it an advantage and even offered the possibility of installing Ubuntu onto a ZFS filesystem. Scroll down for more info on ZFS and how to best tweak it.
ZFS is a complex and reliable filesystem so there are a lot of best practices and caveats when using it. I’m going to cover some important aspects and share some useful information from my experience.
When creating the pool, try not to combine disks of different speeds and sizes in the same vdev. However, do try to mix the manufacturing dates (and batches) of the disks, so that they are less likely to fail at the same time and cause a data loss disaster.
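As a minimal sketch of that advice (the pool name and disk paths are hypothetical), a raidz vdev would be built from disks of the same size and speed, referenced by their stable by-id paths:

```shell
# Hypothetical pool "tank" from three identical disks; use /dev/disk/by-id
# paths so the pool survives device renumbering across reboots.
zpool create tank raidz \
    /dev/disk/by-id/ata-DISK_SERIAL_A \
    /dev/disk/by-id/ata-DISK_SERIAL_B \
    /dev/disk/by-id/ata-DISK_SERIAL_C
zpool status tank
```

Ideally the three serials above would come from different manufacturing batches.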
ZFS is a highly reliable filesystem which uses checksumming to verify data and metadata integrity, with on-the-fly repairs. It uses fletcher4 as the default algorithm for non-deduped data and sha256 for deduped data. Later implementations added sha512, skein, and edon-R. At the time, the official page circulated numbers such as sha512 being about 50% more performant than sha256, skein about 80%, and edon-R showing more than a 350% increase in performance.
The need for better system performance is understandable and you might think that the raid configuration and a backup are enough, so you'll disable checksumming. Don't. Try monitoring your system performance while using different types of checksumming algorithms and choose the proper one.
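A sketch of how you might switch algorithms while benchmarking (dataset name hypothetical; note that a property change only affects newly written blocks):

```shell
# Inspect the current algorithm (fletcher4 unless changed)
zfs get checksum tank/data
# Try a different algorithm for newly written data, then measure your workload
zfs set checksum=sha512 tank/data
```

The newer algorithms require a pool with the corresponding feature flags enabled.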
Access time logging (atime) is a property that can also be turned off for improved pool performance.
zfs set atime=off <pool_name>
zfs set atime=on <pool_name>
zfs set relatime=on <pool_name>
This is similar to ext4's relatime (the access time is only updated if the modified time or changed time changes). It can be used as a compromise between atime=off and having it set to on.
Set the compression algorithm to lz4; the default is lzjb.
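For example (dataset name hypothetical; like other properties, this only affects newly written blocks):

```shell
zfs get compression tank/data       # shows lzjb if never changed
zfs set compression=lz4 tank/data   # faster and generally better ratios
```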
If the ZFS filesystem is the backend for a storage system, then the performance of the filesystem can be increased by tuning its use of physical memory. ZFS is quite a memory hog, so allocate it more memory. To do so on-the-fly, increase the /sys/module/zfs/parameters/zfs_arc_max parameter value. To make it persistent after a reboot, add the zfs_arc_max value to /etc/modprobe.d/zfs.conf file.
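A sketch of both steps (the 8 GiB cap is an illustrative value; size it for your own machine):

```shell
# Raise the ARC cap on the fly; the value is in bytes (here 8 GiB)
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
# Make the same value persistent across reboots
echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf
```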
Keep the used capacity under 80% for best performance.
When creating a pool, use disks with the same blocksize. The correlation between the ZFS "blocksize" and the disk blocksize is the ashift parameter (which cannot be modified after pool creation). Either leave the value at "0" for automatic blocksize detection, or set it manually: ashift=12 for 4K-sector disks or ashift=9 for 512-byte-sector disks. Mixing disks with different blocksizes in the same pool can lead to caveats like performance loss or inefficient space utilization.
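Since ashift is permanent, it is worth setting it explicitly at creation time. A sketch with hypothetical names:

```shell
# Force 4K sectors (2^12 = 4096 bytes) for a pool of 4K-sector disks;
# this cannot be changed after the pool is created.
zpool create -o ashift=12 tank raidz \
    /dev/disk/by-id/ata-DISK_A \
    /dev/disk/by-id/ata-DISK_B \
    /dev/disk/by-id/ata-DISK_C
```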
You can interact more effectively with ZFS using the zdb tool. To check the vdev ashift used for the zpool, check the ZFS MOS configuration:
zdb -C <pool_name>
If the pool is not imported or it lacks the zpool cache file, use:
zdb -e <pool_name>
zdb -C <pool_name>
version: 5000
name: '<pool_name>'
state: 0
txg: 7222258
pool_guid: 1158898053409524037
errata: 0
hostname: 'host1'
vdev_children: 1
vdev_tree:
    type: 'root'
    id: 0
    guid: 1158898053409524037
    create_txg: 4
    children[0]:
        type: 'raidz'
        id: 0
        guid: 5533485938024525468
        nparity: 1
        metaslab_array: 34
        metaslab_shift: 33
        ashift: 9
        asize: 1500309356544
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 15862811554655025429
            path: '/dev/disk/by-id/ata-WDC_WD5003ABYX-01WERA1_WD-WMAYP2925667'
            whole_disk: 1
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 17748791980113647982
            path: '/dev/disk/by-id/ata-WDC_WD5003ABYX-01WERA1_WD-WMAYP2924231'
            whole_disk: 1
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 3803198689712549478
            path: '/dev/disk/by-id/ata-WDC_WD5003ABYX-01WERA1_WD-WMAYP2924347'
            whole_disk: 1
            create_txg: 4
features_for_read:
    com.delphix:hole_birth
    com.delphix:embedded_data
Inspect in-depth details of a dataset (you can add more d’s for more verbosity):
zdb -dd <dataset_name>
Display information about every spacemap record:
zdb -MMM <pool_name>
For a much more detailed pool history than the output of zpool history -i command use:
zdb -h <pool_name>
zdb can also be used to take an in-depth look at other interesting things, like dumping the current uberblock, displaying dedup statistics, inspecting ARC and ZIL stats, and so on.
Deduplication: Yay or Nay?
If you're not sure whether you should use deduplication, you can do a dry-run test to see how much duplicated data you really have.
zdb -S <pool_name> displays the predicted effect of enabling deduplication on the pool. Run the command during a low-IO period, because it reads the whole zpool. It simulates a DDT histogram as below; based on the results, you can decide whether you'd really save space by deduping your data.
zdb -S <pool_name>
Simulated DDT histogram:
bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    4.23M   33.8G   20.7G   21.4G    4.23M   33.8G   20.7G   21.4G
     2    4.37M   35.0G   23.4G   24.1G    9.33M   74.6G   49.5G   51.2G
     4     872K   6.82G   4.46G   4.61G    3.91M   31.3G   20.3G   21.0G
     8     181K   1.42G    919M    951M    1.71M   13.7G   8.64G   8.94G
    16    36.6K    293M    186M    193M     763K   5.96G   3.77G   3.91G
    32    4.33K   34.6M   21.3M   22.1M     177K   1.38G    880M    913M
    64     1016   7.94M   4.51M   4.71M    81.2K    650M    369M    386M
   128      360   2.81M   2.57M   2.59M    68.5K    548M    505M    508M
   256       21    168K    128K    131K    6.35K   50.8M   38.8M   39.6M
   512        4     32K      4K   5.33K    3.07K   24.5M   3.08M   4.10M
    1K        1      8K     512     682    1.63K   13.0M    834K   1.09M
    2K        2     16K      1K   1.33K    5.74K   45.9M   2.87M   3.82M
 Total    9.67M   77.3G   49.6G   51.3G    20.3M    162G    105G    108G

dedup = 2.11, compress = 1.55, copies = 1.03, dedup * compress / copies = 3.16
If the dedup ratio is greater than 2 (2.11 in the example above), then you should see some improvement in free pool space after enabling it. But keep in mind that each DDT entry occupies roughly 320 bytes of memory. So to estimate how much memory deduplication will need, take the total number of allocated blocks (9.67M) and multiply it by 320 bytes, which comes out at around 3GB.
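The arithmetic, sketched in shell with the figures from the simulated histogram above (the 320-byte entry size is an approximation):

```shell
# ~9.67M allocated blocks from the histogram, ~320 bytes of RAM per DDT entry
blocks=9670000
entry_bytes=320
total_bytes=$((blocks * entry_bytes))
echo "Estimated DDT memory: $((total_bytes / 1024 / 1024)) MiB"   # ~2951 MiB, i.e. roughly 3GB
```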
Although it sounds appealing, deduplication is rarely worth it. It usually creates a lot of memory pressure and sometimes stalls IO.
Pay attention to the recordsize as well when creating a dataset. For best performance, it should match the application's blocksize.
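For instance, a database that does 16K I/O would get a matching recordsize (the dataset name is hypothetical):

```shell
# Set the recordsize at creation time...
zfs create -o recordsize=16K tank/db
# ...or adjust it later; only newly written files are affected
zfs set recordsize=16K tank/db
```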
When using slow disks in a pool, there's a workaround: use a read-intensive solid state drive as an L2ARC cache device and a mirror of two write-intensive SSDs as the intent log (SLOG) device.
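A sketch of adding both to an existing pool (pool and device names hypothetical):

```shell
# Read-intensive SSD as an L2ARC cache device
zpool add tank cache /dev/disk/by-id/ata-SSD_READ_1
# Mirrored pair of write-intensive SSDs as the SLOG
zpool add tank log mirror \
    /dev/disk/by-id/ata-SSD_WRITE_1 \
    /dev/disk/by-id/ata-SSD_WRITE_2
```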
If you want the zed daemon to notify you about the state of the pool, add the lines below to the /etc/zfs/zed.d/zed.rc file. You can choose to be notified even when the pool is healthy by setting the notify verbosity to 1.
ZED_EMAIL_ADDR="<email_address>"
ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"
ZED_NOTIFY_INTERVAL_SECS=120
ZED_NOTIFY_VERBOSE=1
Have a disk from an old zpool and want to repurpose it? Just issue zpool labelclear <device> to remove the old label.
When in need of a snappy recovery from errors or disk failures, you'll want to increase scrub or resilver speed. For that, you can modify these settings:
zfs_resilver_min_time_ms defaults to 3000 and zfs_resilver_delay to 2. You can test the resilver process with increased values for zfs_resilver_min_time_ms (say, 5000 or 6000) and zfs_resilver_delay set to 1.
zfs_scrub_delay defaults to 4, so you can lower it and also increase the scan min time parameter (default 1000).
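These are runtime module parameters, so a sketch of tuning them on the fly might look like this (the values are the illustrative ones from above; revert them once the resilver or scrub completes):

```shell
# Prioritize resilver I/O over normal I/O
echo 6000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
echo 1    > /sys/module/zfs/parameters/zfs_resilver_delay
# Likewise for scrubs
echo 2    > /sys/module/zfs/parameters/zfs_scrub_delay
echo 2000 > /sys/module/zfs/parameters/zfs_scan_min_time_ms
```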
You can check the module parameters documentation on GitHub for details about each parameter.
When setting these values, you're in fact prioritizing resilver/scrub IO over normal IO, so don't play around with them too much. After the resilvering is over, set the parameters back to their default values.
No matter how robust ZFS is or what precautions you take, you should back up your ZFS pool. ZFS backups are easy to keep up to date thanks to snapshots, which can also be sent incrementally. But when first migrating all the data from one place to another, you'll need all the speed you can get. A good trick to boost replication performance is to pipe the data through mbuffer.
It can be used as below. The chunk size and buffer size can vary, and you should experiment to see what suits you best.
zfs send testpool/testdataset@snapshot | mbuffer -s 128k -m 1G | ssh user@backup_host 'mbuffer -s 128k -m 1G | zfs receive testpool_backup/testdataset_backup'
However, if you didn’t back up and the pool somehow got corrupted, there are several ways you can try to access the data on your datasets.
First, try to import the pool in read-only mode. If that doesn’t work, you can try to find an older but still recent txg and import the pool at that checkpoint. You won't have the latest file modifications, but you will be able to recover some of your data.
To list older txgs, use:
zdb -e <pool_name> -ul
and choose a txg from the most recent uberblocks.
The pool can then be imported with the -T parameter followed by the txg number.
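Putting the recovery steps together, a sketch (using the same placeholders as above; importing read-only keeps the rewind from writing anything):

```shell
# List uberblocks and pick a recent txg number from the output
zdb -e <pool_name> -ul
# Import the pool read-only at that txg
zpool import -o readonly=on -T <txg_number> <pool_name>
```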
Want to learn more about ZFS? Curious about other settings? We’re here to help. Contact us anytime for powerful dedicated bare metal servers and top-notch service.
About the Author
Catalin Maita is a Storage Engineer at Bigstep. He is a tech enthusiast with a focus on open-source storage technologies.