Comparing the contents of gzipped tarballs

Sunday, 18 Dec 2022

Some time ago I had a pile of tarballs, created periodically by a cron job, that were eating up the disk space on the machine they lived on. I needed a quick way to identify which of them contained changes relative to its predecessor.

I was hoping I could just compare the tarballs themselves rather than doing anything more complicated that involved actually extracting their contents (and then, I don’t know, doing some kind of fingerprinting on top). Unfortunately some naïve attempts using cmp failed and seemed to indicate that I was going to have to do something complicated like that.

Now there is quite a bit of discussion online about how to make Tar generate reproducible (so in a sense, canonical) tarballs. Of course that isn’t much use in hindsight (such as in my case), when a pile of tarballs is already sitting around on disk.

But in my case, the source tree was a part of the filesystem where absolutely nothing had happened. And I don’t just mean logical non-changes like creating and then deleting a temporary file. I mean that nothing was writing to that part of the filesystem in any capacity. Therefore Tar should have encountered files and directories in the exact same order each time it walked that tree. Why then should it be generating non-identical tarballs?

Exasperated, I dug into the question of archive reproducibility for quite a while.

Spoiler: that was a waste of time. I finally discovered that Tar is not the culprit at all – Gzip is! Namely, the Gzip file header includes a timestamp. My instincts were right: Tar should have been creating the exact same archive over and over – and it was. I just hadn’t thought to suspect Gzip at all.
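To see the effect in isolation, here is a minimal Python sketch (an illustration only, assuming Python 3.8 or later for the mtime argument of gzip.compress). It compresses the same bytes twice with different timestamps and reads the timestamp back out of each header. The helper name gzip_mtime and the sample payload are made up for the example; the payload merely stands in for an unchanged tar stream.

```python
import gzip
import struct


def gzip_mtime(data: bytes) -> int:
    """Return the MTIME header field (seconds since the Unix epoch) of gzip data."""
    if data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    # Header layout: bytes 0-1 magic, 2 compression method, 3 flags,
    # 4-7 MTIME as a little-endian unsigned 32-bit integer.
    (mtime,) = struct.unpack_from("<I", data, 4)
    return mtime


# Stand-in for a tar stream that did not change between two cron runs.
payload = b"identical tar stream" * 100

first = gzip.compress(payload, mtime=1000)
second = gzip.compress(payload, mtime=2000)

print(gzip_mtime(first), gzip_mtime(second))  # the two timestamps differ
print(first == second)                        # so the files differ as a whole
print(first[8:] == second[8:])                # but agree from offset 8 onward
```

The last line is the whole trick: once the first 8 bytes are out of the picture, identical input yields identical output.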

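Luckily, the timestamp is found at a fixed location in a gzipped file: it is the 32-bit value at offset 4, so it occupies bytes 4 through 7. And handily, cmp (at least the GNU version) has a switch to tell it to skip a given number of bytes at the start of the file(s) it is comparing. Skipping the first 8 bytes covers the timestamp along with the four header bytes in front of it. To make a long story short: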

cmp -i8 file1.tar.gz file2.tar.gz
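For a whole pile of tarballs, the same check might be scripted. What follows is a minimal Python sketch under stated assumptions, not a definitive tool: the function names same_payload and changed are hypothetical, the backup-N.tar.gz file names are invented, and the demo writes small gzip files into a temporary directory as stand-ins for real tarballs.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_SKIP = 8  # magic (2) + method (1) + flags (1) + MTIME (4)


def same_payload(a: Path, b: Path, skip: int = GZIP_SKIP, chunk: int = 1 << 16) -> bool:
    """Compare two files byte for byte, ignoring the first `skip` bytes of each."""
    if a.stat().st_size != b.stat().st_size:
        return False
    with a.open("rb") as fa, b.open("rb") as fb:
        fa.seek(skip)
        fb.seek(skip)
        while True:
            block_a, block_b = fa.read(chunk), fb.read(chunk)
            if block_a != block_b:
                return False
            if not block_a:  # both files reached EOF together
                return True


def changed(paths):
    """Yield each path whose contents differ from the path just before it."""
    paths = list(paths)
    for previous, current in zip(paths, paths[1:]):
        if not same_payload(previous, current):
            yield current


# Demo with made-up data: three "backups", of which only the last one changed.
with tempfile.TemporaryDirectory() as tmp:
    contents = [b"unchanged tree", b"unchanged tree", b"something changed"]
    paths = []
    for i, content in enumerate(contents):
        path = Path(tmp) / f"backup-{i}.tar.gz"
        # A different timestamp per file, as successive cron runs would produce.
        path.write_bytes(gzip.compress(content, mtime=1000 + i))
        paths.append(path)
    changed_names = [p.name for p in changed(paths)]

print(changed_names)
```

The paths need to be passed in chronological order, since each file is only compared against the one before it.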

(This entry is brought to you by the hope of not having to figure this all out a third time in my life. Some time after the events described above, it came back to me that I had already figured this out years before but had lost all memory of it by the next time I needed the knowledge.)