For Science: Does ZFS deduplication work on intros of TV shows?

Today from the “What am I doing with my life?”-department: I finally set out to find a definitive answer to something I’ve wondered about ever since hearing about the deduplication feature of ZFS – does it work on the intros of TV shows? TL;DR: Nope!

I never even expected it to work. Plus, you’re generally advised against using deduplication anyway – the infamous “1GB of RAM per 1TB of storage in the pool” rule, which is often incorrectly applied to ZFS in general, stems from it. So even if it had turned out to work, I probably couldn’t have benefited from it. But still, not knowing for sure always bugged me.

As I’m currently building a new NAS and will be switching my home storage from ext4 to ZFS once it is fully operational, it was time to simply run some tests and be done with the matter once and for all. Establishing the test setup: one season of Dexter in 1080p from the iTunes Store weighs roughly 26GB – exactly 28171821903 bytes in the case of my test data. The episodes of said season run for 54:27 on average, while the brilliant intro of the hit-turned-shit show lasts for a whopping 01:45 – i.e. 3.213957759% of each episode. That means we could hope to save around 800MB per season in an ideal scenario.
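
If you want the back-of-the-envelope math spelled out, a quick awk one-liner does it – note that it assumes the season has twelve episodes, and that in the ideal case all intros but one would deduplicate away completely:

awk 'BEGIN {
  intro   = 105                # 01:45 in seconds
  episode = 54*60 + 27         # 54:27 in seconds
  season  = 28171821903        # total season size in bytes
  share   = intro / episode
  printf "intro share per episode: %.9f%%\n", share * 100
  # assumes 12 episodes; only 11 of the 12 intros could ever be saved
  printf "ideal dedup savings:     ~%.0f MB\n", season * share * 11/12 / 1e6
}'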

First I created four different ZFS pools:

  • zfs_blank – neither compression nor deduplication turned on
  • zfs_dedup – deduplication turned on
  • zfs_compr – compression turned on
  • zfs_both – both compression and deduplication turned on
truncate -s 32G /var/lib/zfs_img/zfs_blank.img
truncate -s 32G /var/lib/zfs_img/zfs_dedup.img
truncate -s 32G /var/lib/zfs_img/zfs_compr.img
truncate -s 32G /var/lib/zfs_img/zfs_both.img

zpool create zfs_blank /var/lib/zfs_img/zfs_blank.img

zpool create zfs_dedup /var/lib/zfs_img/zfs_dedup.img
zfs set dedup=on zfs_dedup

zpool create zfs_compr /var/lib/zfs_img/zfs_compr.img
zfs set compression=on zfs_compr

zpool create zfs_both /var/lib/zfs_img/zfs_both.img
zfs set compression=on zfs_both
zfs set dedup=on zfs_both
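
(As an aside, the same can presumably be done in a single step by passing -O to zpool create, which sets filesystem properties on the pool’s root dataset at creation time – something like this, though I only used the two-step variant above:)

# untested here, but should be equivalent to the two-step variant above
zpool create -O dedup=on zfs_dedup /var/lib/zfs_img/zfs_dedup.img
zpool create -O compression=on zfs_compr /var/lib/zfs_img/zfs_compr.img
zpool create -O compression=on -O dedup=on zfs_both /var/lib/zfs_img/zfs_both.img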

After creating the pools, there is the exact same amount of free space on each of them:

df /zfs_*

Filesystem     1K-blocks  Used Available Use% Mounted on
zfs_blank       32771968     0  32771968   0% /zfs_blank
zfs_dedup       32771968     0  32771968   0% /zfs_dedup
zfs_compr       32771968     0  32771968   0% /zfs_compr
zfs_both        32771968     0  32771968   0% /zfs_both
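
The copy itself was nothing fancy – a plain cp of the episode files into each mountpoint, roughly like this:

# /path/to/dexter-season is a placeholder for wherever the episodes actually live
for pool in zfs_blank zfs_dedup zfs_compr zfs_both; do
    cp /path/to/dexter-season/*.m4v "/$pool/"
done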

After copying the files into each pool, let’s see what we got:

df /zfs_*

Filesystem     1K-blocks     Used Available Use% Mounted on
zfs_blank       32771712 27531008   5240704  85% /zfs_blank
zfs_dedup       32709632 27532672   5176960  85% /zfs_dedup
zfs_compr       32771712 27527424   5244288  84% /zfs_compr
zfs_both        32708480 27529088   5179392  85% /zfs_both

Well, this is odd. Suddenly each of the filesystems reports a different number of (total, not just free) 1K-blocks. I have no idea why that happens; please let me know if you can explain it. (I did stumble upon these df/ZFS troubles while researching, but either this was fixed in the meantime or it was never an issue with the ZoL implementation, as the script there gave me the same numbers as df/du.) To make sure this doesn’t influence the results for the purposes of this test, I also tried it with a set of highly compressible files and a set of highly dedupable files. I ran into the same 1K-blocks issue, but still got exactly the results I would expect.
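
If you want to reproduce that sanity check, something along these lines will do – zeros compress extremely well, and identical copies of one chunk of random data are the ideal case for dedup:

# highly compressible: a few gigabytes of zeros
dd if=/dev/zero of=/zfs_compr/zeros.bin bs=1M count=4096

# highly dedupable: identical copies of the same random data
dd if=/dev/urandom of=/tmp/random.bin bs=1M count=1024
for i in 1 2 3 4; do cp /tmp/random.bin "/zfs_dedup/copy_$i.bin"; done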

So let’s compare how much space is used in each scenario:

du /zfs_*

27531052	/zfs_blank/
27532697	/zfs_dedup/
27527385	/zfs_compr/
27529012	/zfs_both/

With deduplication turned on, the files actually use up more space than with it turned off. Even though these are H.264-encoded videos, turning compression on saves a little space. Adding deduplication on top of compression increases the required space again, just as it did without compression. Between the most (dedup on) and the least (compression on) space the files could occupy there is a difference of 5312 1K-blocks, roughly 5MB. The gain from compression compared to no compression is 3667 1K-blocks, roughly 3.5MB. You would have to store more than 750 such seasons before the savings added up to even a single episode’s file size.
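
Instead of squinting at df and du, you can also ask ZFS directly how well (or rather how badly) it is doing – the dedupratio and compressratio properties should both be stuck at (or vanishingly close to) 1.00x here:

zpool get dedupratio zfs_dedup zfs_both     # pool-wide dedup ratio
zfs get compressratio zfs_compr zfs_both    # per-dataset compression ratio
zdb -D zfs_dedup                            # dedup table (DDT) statistics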

Just as I always expected, deduplication does not work on TV show intros, even though they are “just the same”. Due to the nature of modern video encoding, the underlying data is rarely the same: in an episode with lots of explosions, a large share of the bitrate will be dedicated to those scenes and less of it will be left for the intro, so the resulting data differs from that of another episode. And since ZFS deduplicates whole blocks, even genuinely identical intro data would still have to end up in identically sized, identically aligned records to be detected. I’m guessing the gains from compression come from compressible metadata of the container format (and possibly subtitles), but that’s just a wild guess. As others have written before: compression never hurts you, dedup almost certainly does.
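
You can see this for yourself with a quick and admittedly crude check: losslessly cut out roughly the first 1:45 of two episodes and compare checksums (the filenames are placeholders, and this assumes the intro sits right at the start – adjust with -ss if there’s a recap or cold open first):

# losslessly cut out (roughly) the first 105 seconds of two episodes
ffmpeg -i episode01.m4v -t 105 -c copy intro01.m4v
ffmpeg -i episode02.m4v -t 105 -c copy intro02.m4v
md5sum intro01.m4v intro02.m4v

The checksums won’t match – most likely not even the file sizes will – which is exactly why block-level deduplication has nothing to grab onto.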

Even on a show with an incredibly long intro like Dexter, you’ll gain nothing from ZFS’ deduplication feature. On the bright side: usually you won’t be “wasting” more than 3% of a file on the intro – an episode of The Simpsons (average length 22:49) only spends 1.826150475% on it. You can calculate that percentage for Lost on your own, I guess.

Now you know.