ZFS snapshots gone wild

Date: 25 Jun 2025


For some years I’ve been using Sanoid as my snapshot management/replication tool.

My snapshot policy was the “default/recommended” one and had been in place for some years. But recently my snapshots started to clog up, with a huge accumulation of historic blocks, so the setup was very much in need of some love.

For that reason I’ve been fine-tuning my system for the past week or so, and this is my journey through snapshot tuning: curing the disease and not the symptoms.

Snapshots

I changed my Sanoid snapshot policy for my two production zpools to this:

[template_production]
frequently = 0
hourly = 36
daily = 14
monthly = 0
yearly = 0
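
This template only takes effect once a dataset section in sanoid.conf points at it with use_template. A minimal sketch of that hookup (the dataset path here is just an example):

[zpool1/libvirt]
use_template = production
recursive = yes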

I have become wiser and realised that a long-running snapshot policy isn’t worth it on the production pools. I replicate to cold storage anyway, and that is where my archive of long-running snapshots lives.

pfSense usage

Not many days later, some of my datasets had already started taking up a large amount of space. In this case it was pfsense2, with a REFER of 7.13G:

NAME                                                             USED  AVAIL  REFER  MOUNTPOINT
zpool1/libvirt/pfsense2@autosnap_2025-06-10_15:12:32_daily      2.24G      -  6.87G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-11_00:00:10_daily      2.00G      -  6.97G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-12_00:00:12_daily      2.90G      -  7.14G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-13_00:00:06_daily      2.62G      -  6.87G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-14_00:00:08_daily      2.85G      -  6.93G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-15_00:00:09_daily      2.68G      -  6.99G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-16_00:00:11_daily      2.97G      -  7.06G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-17_00:00:05_daily      2.60G      -  7.09G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-18_00:00:12_daily      2.95G      -  7.18G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-19_00:00:05_daily      2.78G      -  7.17G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-19_21:15:04_hourly      203M      -  7.13G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-19_22:00:01_hourly     49.5M      -  7.13G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-19_23:00:04_hourly     10.7M      -  7.15G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-19_23:15:03_hourly     6.37M      -  7.16G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-20_00:00:02_daily         0B      -  7.18G  -
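
For reference, a snapshot listing like the one above comes straight from zfs list on the host:

zfs list -r -t snapshot zpool1/libvirt/pfsense2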

In my effort to try and cure the disease, I went through these troubleshooting steps.

Local logging

Historic blocks reserved by ZFS are often caused by small continuous writes, such as logs. So the first thing I tried was going into the pfSense WebUI: Status > System Logs > Settings > Disable writing log files to the local disk.

This action came with a small warning attached (“WARNING: This will also disable Login Protection!”), but I’m okay with that temporarily.

I performed this action on 21-06-2025, and this is how the change affected the snapshots up to 25-06-2025, not counting hourly snapshots.

NAME                        USED  AVAIL  REFER  MOUNTPOINT
zpool1/libvirt/pfsense2    53.0G   116G  7.45G  /mnt/zpool1/libvirt/pfsense2
NAME                                                                USED  AVAIL  REFER  MOUNTPOINT
zpool1/libvirt/pfsense2@autosnap_2025-06-12_00:00:12_daily         3.56G      -  7.14G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-13_00:00:06_daily         2.62G      -  6.87G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-14_00:00:08_daily         2.85G      -  6.93G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-15_00:00:09_daily         2.68G      -  6.99G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-16_00:00:11_daily         2.97G      -  7.06G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-17_00:00:05_daily         2.60G      -  7.09G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-18_00:00:12_daily         2.95G      -  7.18G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-19_00:00:05_daily         2.85G      -  7.17G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-20_00:00:02_daily         2.95G      -  7.18G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-21_00:00:09_daily         2.58G      -  7.20G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-22_00:00:08_daily         2.25G      -  7.40G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-23_00:00:12_daily         2.14G      -  7.50G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-23_23:00:04_hourly        41.5M      -  7.50G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-23_23:15:01_hourly        3.79M      -  7.50G  -
zpool1/libvirt/pfsense2@autosnap_2025-06-24_00:00:01_daily            0B      -  7.50G  -

The big change can be observed by comparing the daily snapshot from the day before the change:

zpool1/libvirt/pfsense2@autosnap_2025-06-20_00:00:02_daily 2.95G - 7.18G -

and up to this newest daily snapshot:

zpool1/libvirt/pfsense2@autosnap_2025-06-23_00:00:12_daily 2.14G - 7.50G -

The snapshot size has gone down from 2.95G to 2.14G, so around 800M smaller snapshots.

Now that the accumulation of historic blocks had been resolved, I needed the USED size to go down as well.
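
A handy way to see how much of a dataset’s USED is pinned by snapshots alone is the usedbysnapshots property:

zfs get usedbysnapshots zpool1/libvirt/pfsense2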

First I deleted the oldest snapshots, up to the day I disabled local logging (21-06-2025), and this is the result:

NAME                        USED  AVAIL  REFER  MOUNTPOINT
zpool1/libvirt/pfsense2    19.7G   150G  7.45G  /mnt/zpool1/libvirt/pfsense2

That’s over 33G freed up.
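
For anyone doing the same cleanup, ZFS can destroy a whole range of snapshots in one command using the % range syntax. A sketch based on the snapshot names above (the -n flag makes it a dry run first):

# dry run first, then destroy the range of old daily snapshots
zfs destroy -nv zpool1/libvirt/pfsense2@autosnap_2025-06-12_00:00:12_daily%autosnap_2025-06-20_00:00:02_daily
zfs destroy -v zpool1/libvirt/pfsense2@autosnap_2025-06-12_00:00:12_daily%autosnap_2025-06-20_00:00:02_daily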

But the REFER size hasn’t gone down; in fact, it’s still increasing.

The real disease causing the large snapshots hasn’t been resolved yet. But I now know where I should look.

ZFS Autotrim

After turning down the volume of writes performed by the VM and cleaning up the old, heavy snapshots, the REFER hasn’t moved one bit.

This is an indication that the host ZFS filesystem doesn’t know whether any of the old reserved blocks have been freed within the VM itself. For the host ZFS to know which blocks are free, the VM needs access to TRIM functionality. Under QEMU/KVM this is granted by setting discard='unmap' in the VM’s XML configuration.
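
For reference, that corresponds to a disk definition along these lines in the libvirt domain XML (the image path and target device here are placeholders, not my exact config):

<disk type='file' device='disk'>
  <!-- discard='unmap' forwards the guest's TRIM/UNMAP requests to the host storage -->
  <driver name='qemu' type='raw' discard='unmap'/>
  <source file='/mnt/zpool1/libvirt/pfsense2/pfsense2.img'/>
  <target dev='vda' bus='virtio'/>
</disk>

A quick virsh dumpxml pfsense2 | grep discard shows whether it is already set.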

But since pfSense also uses ZFS, we need to enable autotrim for the zpool inside the VM before it actually starts telling the host which blocks are free.

[2.7.2-RELEASE][admin@pfsense2]/root: zpool get autotrim pfsense2
NAME      PROPERTY  VALUE     SOURCE
pfsense2  autotrim  off        local

[2.7.2-RELEASE][admin@pfsense2]/root: zpool set autotrim=on pfsense2

[2.7.2-RELEASE][admin@pfsense2]/root: zpool get autotrim pfsense2
NAME      PROPERTY  VALUE     SOURCE
pfsense2  autotrim  on        local

I issued a manual zpool trim pfsense2 command to run a trim on the disk right away. The command doesn’t give any output, but running zpool status pfsense2 showed the device was trimming.

[2.7.2-RELEASE][admin@pfsense2]/root: zpool status pfsense2
NAME        STATE     READ WRITE CKSUM
pfsense2    ONLINE       0     0     0
  vtbd0p3   ONLINE       0     0     0 (trimming)
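
If you want actual progress rather than just the (trimming) marker, zpool status also takes a -t flag on recent OpenZFS releases, which should cover pfSense 2.7.x:

zpool status -t pfsense2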

Result

After the disk trim inside the VM had finished, the following output confirmed that the real reason my snapshots were taking up this much space over time was indeed the lack of TRIM on the ZFS datastore.

NAME                        USED  AVAIL  REFER  MOUNTPOINT
zpool1/libvirt/pfsense2    19.7G   149G   981M  /mnt/zpool1/libvirt/pfsense2

The REFER has gone from ~7.5G down to ~1G. That’s a remarkable change, and I’m very satisfied.
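
Going forward, a quick way to keep an eye on how the usage splits between live data and snapshots is the space breakdown view; the USEDSNAP and USEDDS columns show it directly:

zfs list -o space zpool1/libvirt/pfsense2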

Next up is letting a couple of days pass before evaluating, and then re-enabling local logging if everything is still fine.

Tags: ZFS, Homelab