Prehistoric filesystem iteration

Sunday, 26 Dec 2004

I was just watching mutt chomp through a large Maildir for the umpteenth time, but noticed something unexpected. This Maildir happens to contain a lot of new mail – over 100 messages –, and as you may or may not know, new vs read messages are stored in separate directories in the Maildir format (one file per mail). It took mutt much less time to read 100 messages from the “new” directory than it took it to read the same number of files from the “cur” directory.

And it suddenly occurred to me: isn’t this a Shlemiel the painter algorithm? Isn’t the directory scanned by the filesystem code all over for every single call to open(2)? Isn’t this pointless, considering that mutt is going to open every single file in the directory anyway?

This is obviously not mutt’s fault: browsing manpages, I can’t see a way to iterate through a directory and open each file in turn that doesn’t involve a lookup by path for opening files. mutt’s job would be far easier if there was a version of readdir(2) that returned inode numbers along with the filenames, and versions of open(2) and stat(2) that work directly on inodes rather than requiring a path – which have to be resolved to an inode anyway.

For the specific case of mutt reading Maildirs, this would in fact be faster even without the use of directory hashes than the naïve method is with them. In fact, judging by how quickly mutt sped through the new messages directory, it would actually make Maildirs with huge numbers of messages painless to use.

To be sure, B-trees have many things going for them besides the capacity to cope with humongous directories, but it makes me wonder if we don’t need more powerful filesystem iteration abilities in the file access APIs at least as much as ever more clever and tricky filesystem implementations.