Good Mail Sorting

Monday, 21 Dec 2015

When I started writing this, 13 months ago or so, I had been frenetically hacking my mail rig for a couple of days. I sat and wrote, because it had been a revelation.

Setting

For years I used a setup on my home server where cron would kick off fetchmail (since replaced by mpop), which would in turn invoke procmail to deliver each received mail, putting it somewhere in my set of folders. Then I read that mail in mutt over an SSH connection.

I love mutt.

Act Ⅰ

My first impetus to do something was frustration over my VoIP telephony voicebox notifications. Those come as mail with the voicemail attached as a sound file. Because I can’t well play them in a mutt running on another machine, I have to get the attachment out and onto my laptop.

Previously I tried to automate this by having procmail send these mails to a script that would extract the attachment and discard the mail itself. I never quite got this working right, so I never actually used it.

This time I succeeded – by the choice of having the script scan my inbox for new mail after delivery. (It looks for voicebox mail, plucks out the sound file, then deletes the mail.) Since my inbox is a Maildir, this is easy code: a readdir to list the mails, an open to scan them, an unlink to delete the matching ones. So now I never deal with voicebox mails directly any more. I just run this script against the network share when I’ve missed calls, and up pop some files on my laptop, which I can play. Convenient.
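To make the readdir/open/unlink shape concrete, here is a minimal sketch in Python. It is not the script described above. The function name, the sender address, and the fallback `.wav` extension are all assumptions made up for illustration.

```python
import email
import os
from email import policy


def extract_voicemails(maildir, outdir, sender="voicebox@example.invalid"):
    """Save audio attachments from voicebox mails, then delete those mails.

    `sender` is a hypothetical address; a real rule would match whatever
    the VoIP provider actually puts in its notifications.
    Returns the list of attachment paths that were written.
    """
    saved = []
    os.makedirs(outdir, exist_ok=True)
    for sub in ("new", "cur"):
        folder = os.path.join(maildir, sub)
        for name in sorted(os.listdir(folder)):          # readdir
            path = os.path.join(folder, name)
            with open(path, "rb") as fh:                 # open
                msg = email.message_from_binary_file(fh, policy=policy.default)
            if sender not in str(msg.get("From", "")):
                continue
            found = False
            for part in msg.iter_attachments():
                if part.get_content_maintype() != "audio":
                    continue
                # Never trust a path inside an attachment's filename.
                fname = os.path.basename(part.get_filename() or "") or name + ".wav"
                target = os.path.join(outdir, fname)
                with open(target, "wb") as out:
                    out.write(part.get_payload(decode=True))
                saved.append(target)
                found = True
            if found:
                os.unlink(path)                          # unlink
    return saved
```

The mail is deleted only after its attachment has been written out, so a crash halfway through leaves the mail in the inbox.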

(Not as convenient as clicking a play button in a GUI or web mail client, granted. But using one of those would be far more inconvenient on other counts. Tradeoffs. (Have I mentioned I love mutt?))

Act Ⅱ

So I then realised that I could use the exact same approach to write another script I’ve unsuccessfully attempted once before – namely, to mark some mail as read before I’ve even seen it.

I used to have a program that I wrote for this – again, invoked from my procmail configuration, delivering mail in procmail’s stead, marked the way I wanted – but I stopped using it when I noticed that it would drop mail, rarely, under circumstances I never managed to pinpoint.

Now I have this functionality again, and now I can completely trust the code never to lose mail again, because all it does this time around is readdir and rename.
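For illustration, a sketch of what "readdir and rename" can look like in Python. In a Maildir, a message counts as read when its file sits in `cur/` and its name ends in `:2,` followed by a flag list containing `S`. The names `mark_seen` and `subject_matches` are hypothetical, and matching on the subject line is only an assumed example of a rule.

```python
import os
import re
from email import policy
from email.parser import BytesParser


def subject_matches(pattern):
    """Build a rule that is true when the Subject header matches `pattern`."""
    rx = re.compile(pattern)

    def rule(path):
        with open(path, "rb") as fh:
            headers = BytesParser(policy=policy.default).parse(fh, headersonly=True)
        return bool(rx.search(str(headers.get("Subject", ""))))

    return rule


def mark_seen(maildir, should_mark):
    """Rename matching unread messages so they carry the Maildir 'S' flag."""
    renamed = []
    for sub in ("new", "cur"):
        folder = os.path.join(maildir, sub)
        for name in sorted(os.listdir(folder)):          # readdir
            base, _, flags = name.partition(":2,")
            if "S" in flags:
                continue                                 # already read
            path = os.path.join(folder, name)
            if not should_mark(path):
                continue
            # Flags are kept in ASCII order, as the Maildir convention asks.
            new_flags = "".join(sorted(set(flags) | {"S"}))
            target = os.path.join(maildir, "cur", base + ":2," + new_flags)
            os.rename(path, target)                      # rename
            renamed.append(target)
    return renamed
```

The only thing that ever happens to a message here is a rename within the same Maildir, so if something goes wrong the file is simply still there under its old name.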

I forgot how nice this setup is when it works. I have mutt set up to show only threads with new mail when I open a mailing list folder, so if the noisy notification traffic I don’t care about is already marked as read, it is essentially invisible to me. (I don’t want to killfile the chaff. I like correct threading so much, and there are sometimes replies to these mails, long chains of them even. I don’t want those floating about with no context.)

Anagnorisis — When mail processing is hard

Here’s the problem: the period during which a mail exists only in main memory, before it is delivered and on disk, is quite high-stakes. Code that handles mail during this period must be absolutely solid, reliable at all times, without exception, without fail, in order to avoid dropping mail on the floor. You can afford no errors. Any typo that aborts program execution will cost you mail. Any bug that manifests someday in the future will cost you mail. Editing the code and saving it in half-finished form before you’re done will cost you mail if cron kicks off a fetch at that moment. In short, a mistake of any kind tends to default to costing you mail. And so every possible error condition must be covered by failsafes and rescue measures. This is a very hard environment, grim and unforgiving.

However. Once the mail is on disk, and once it’s in a Maildir in particular, things get incomparably relaxed: the likelihood of losing the mail to a small mistake drops precipitously. You have to screw up very hard to manage to lose data. At that point, scripts processing mail just deal with files to move around. It’s not by any means impossible to write data loss bugs then too – but mistakes default to leaving your mail where it was. That’s a stress-free situation.

(Indeed with the Maildir-based scan-and-move approach, the mark-as-read program was basically trivial.)

Act Ⅲ

Now I had to figure out when to run this program. It had to have a way to specify rules for what messages to mark as read, and so it took rules as command line arguments. Then, because different folders need different rules, I wrote a shell script that invoked the program first on one folder with some rules, then another with others, etc.

Now when would I invoke that shell script, though? By hand, like the voicemail extractor? One thing I realised while writing the script was, I could also put the mpop invocation inside it, and then kick off that script from cron instead of invoking just mpop itself. Because of course the best time to run the script is right after new mail has (potentially) arrived.
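The wrapper described here was a shell script. The following is only a sketch of the same shape in Python: fetch first, then walk a table of folders and their rules. The function name `fetch_then_process`, the folder names, and the patterns in `FOLDER_RULES` are invented for illustration, and the per-folder work is passed in as a callable rather than spelled out.

```python
import subprocess

# Hypothetical rule table: folder -> patterns for mail to mark as read.
FOLDER_RULES = {
    "Maildir/.lists.project": [r"^\[commit\]", r"^\[ci\]"],
    "Maildir/.lists.other": [r"^Digest"],
}


def fetch_then_process(fetch_cmd, folder_rules, process_folder):
    """Run the fetch command, then process each folder with its rules.

    The folders are processed even when the fetch fails, since mail
    delivered by an earlier run may still be waiting on disk.
    Returns the exit status of the fetch command.
    """
    result = subprocess.run(fetch_cmd)
    for folder, rules in folder_rules.items():
        process_folder(folder, rules)
    return result.returncode
```

Cron would then run one script that calls something like `fetch_then_process(["mpop"], FOLDER_RULES, some_function)`, where `some_function` stands in for the mark-as-read program.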

It took a day until it dawned on me what I had just done.

If I have cron kick off a script that fetches mail and then only afterwards moves some of it around…

… then why do I need procmail at all?

Synthesis — Mail filtering can be easier and better

See, over the years, I have forever wanted to get rid of procmail, because its configuration syntax is so awful. But I have been putting it off ever since I read about Mail::Audit back in… yes, 2001. (Later along came Email::Filter, and later still some aid for dealing with the harsh environment in the form of Email::Pipemailer::DieHandler.)

But I knew the data loss default cost of mistakes, so I stuck with procmail grudgingly instead of trying to replace it with some custom code, because I trusted procmail with my mail’s arrival on disk – if not fully then at least to an extent that I wouldn’t trust my own attempts. I’m only a dilettante at mail-handling code (not being familiar with the ground it has to have covered), and in any case procmail is battle-tested in a way my own attempts will never ever be.

What I didn’t realise is that I don’t need to write code to do what procmail does. I can just have mpop deliver all of my mail to a transit folder, then invoke a script that scans this transit folder and kicks any mail it finds in there over to its destination.

This has some very attractive properties. Unconditional delivery to disk makes it very robust against losing mail, dwarfing even procmail. Code that enumerates files and then moves them around is much harder to screw up catastrophically than code that is responsible for data that only exists in main memory. And thus I get to write the code that does the rule-checking and file-moving in a real programming language, rather than crappy procmail syntax.
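A rough sketch of such a transit-folder sorter in Python, under some loud assumptions: the function name `sort_transit`, the rule format (header name, substring, destination Maildir), and the idea of a default destination are all invented here, and every Maildir is assumed to live on the same filesystem so that a plain rename works.

```python
import os
from email import policy
from email.parser import BytesParser


def sort_transit(transit, rules, default):
    """Move each message in transit/new to the first matching destination.

    `rules` is a list of (header, substring, destination_maildir) tuples;
    mail that matches no rule goes to the `default` Maildir.
    Returns the list of new paths.
    """
    moved = []
    src = os.path.join(transit, "new")
    for name in sorted(os.listdir(src)):
        path = os.path.join(src, name)
        with open(path, "rb") as fh:
            headers = BytesParser(policy=policy.default).parse(fh, headersonly=True)
        dest = default
        for header, needle, maildir in rules:
            if needle.lower() in str(headers.get(header, "")).lower():
                dest = maildir
                break
        target = os.path.join(dest, "new", name)
        # If the rename fails (missing folder, other filesystem), it raises
        # and the message stays in the transit folder.
        os.rename(path, target)
        moved.append(target)
    return moved
```

A typo in a rule or a missing destination makes this stop with an exception, and the worst outcome is mail sitting unsorted in the transit folder until the script is fixed and run again.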

Reflection

For over a decade, I was so fixated on doing the mail processing during delivery that I was blind to the much simpler approach that was staring me right in the face. After all, every tool that I considered using is designed this way: Mail::Audit; Email::Filter; Courier’s maildrop; common Sieve implementations; etc. So I never even made the connection that another approach was possible. It seemed like the obvious, natural, and sole way of doing it.

Until I finally did realise it. And kicked myself for days. To think: all the automation I could have set up years ago (cf. voicemail extractor); all the mail I could’ve not lost (not very much, all told, but all of it avoidable). Just the friction of trying to use the misshapen language of an ill-fitting tool, for years on end.

Don’t pick the wrong problem to fight with. Life is better that way.