Comparing the contents of gzipped tarballs

Sunday, 18 Dec 2022

Some time ago I had a pile of tarballs which were created periodically by a cron job on a machine, regardless of whether anything had changed between runs, and they were eating up all the storage space. To free up space I wanted to get rid of the redundant ones, so I needed a quick way to identify which of them were unchanged relative to the tarball immediately preceding them.

I was hoping I could just compare the tarballs themselves rather than doing anything more complicated that involved actually extracting their contents (and then, I don’t know, doing some kind of fingerprinting on top). Unfortunately some naïve attempts using cmp failed and seemed to indicate that I was going to have to take the kind of more complicated approach I was hoping to avoid.

Now, there is quite a bit of discussion online about how to make Tar generate reproducible (so in a sense, canonical) tarballs. Of course that isn’t much use in hindsight (such as in my case), when a pile of tarballs is already sitting around on disk.

But in my case, the part of the filesystem that all of the tarballs were made from was an area where, in the case of the redundant tarballs, absolutely nothing would have happened. By that I don’t just mean logical non-changes like creating and then deleting a temporary file. I mean that nothing was writing to that part of the filesystem in any capacity. Therefore Tar should have encountered files and directories in the exact same order each time it walked that directory tree. Why then should it ever generate non-identical tarballs? Exasperated, I dug into the question of archive reproducibility for quite a while.

Spoiler: that was a waste of time.

I finally discovered that Tar is not the culprit at all…

Gzip is! Namely, the Gzip file header includes a timestamp.

So my instincts were right: Tar should have been creating the exact same archive over and over, and in fact it was. I just hadn’t thought to suspect Gzip at all.

Luckily, the timestamp is found at a fixed location in a gzipped file: it is the 32-bit value at offset 4, meaning bytes 4 through 7. And handily, cmp has a switch to tell it to seek past the start of the file(s) it is comparing, so skipping the first 8 bytes takes the timestamp out of the comparison.
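If you want to see the field for yourself, it can be read back with od. This is only a minimal sketch, assuming a POSIX shell and od; gzip_mtime is a name made up for this example, and the bytes are combined least-significant first, which is how the gzip format stores multi-byte numbers.

```shell
# Hypothetical helper: print the timestamp stored in a gzip header.
gzip_mtime() {
    # od emits the four bytes at offsets 4..7 as unsigned decimal numbers;
    # "set --" turns them into $1..$4 (replacing the file name argument).
    set -- $(od -A n -t u1 -j 4 -N 4 "$1")
    # Assemble the little-endian 32-bit value.
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}
```

Calling gzip_mtime file1.tar.gz prints the value as seconds since the Unix epoch. Two tarballs with identical contents but created at different times should print different numbers here.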

So to make a long story short:

cmp -i8 file1.tar.gz file2.tar.gz
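To run that over a whole pile, something like the following might do. It is a rough sketch under a few assumptions: cmp is GNU cmp (for the -i switch), the tarballs all sit in one directory, and their file names sort in chronological order. list_redundant is just a name invented for this example.

```shell
# Hypothetical helper: print every tarball in directory $1 that is
# identical (ignoring the first 8 bytes) to the one sorted just before it.
list_redundant() {
    prev=
    for cur in "$1"/*.tar.gz; do
        # Skip the unexpanded pattern if the directory has no tarballs.
        [ -e "$cur" ] || continue
        # -s keeps cmp quiet; only its exit status is used.
        if [ -n "$prev" ] && cmp -s -i 8 "$prev" "$cur"; then
            echo "$cur"
        fi
        prev=$cur
    done
}
```

This only prints names and deletes nothing, so the list can be checked by eye before anything is removed.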

(This entry is brought to you by the hope of not having to figure this all out a third time in my life. Some time after the events described above, it came back to me that I had already figured this out years before but had lost all memory of it by the next time I needed the knowledge.)