Why I counted characters in UTF-8

Saturday, 2 Aug 2008

A while ago, Kragen Sitaker and I spent a day batting back and forth a thesis of mine: I had mused out loud that the common suggestion that processing UTF-8 is so much slower than the good old fixed-width single-byte encodings of yore does not seem truthy. He wrote up our conclusions, which write-up was subsequently posted to reddit. (Also, George Pollard and Colin Percival took the optimisation to an obsessive level – very cool.)

The comments on reddit overwhelmingly focused on the fact that using strlen in the first place is stupid. Of course, I agree with that! Yet I still chose it for that exercise because I reasoned that pretty much every interesting operation on strings does one of two things:

  1. Put strings together in some way, in which case you only deal with the string as a whole (copying bytes or shuffling pointers without ever examining any one character), or
  2. Iterate over a single string (such as when doing a pattern match), in which case you deal with each character in sequence.

In neither of these cases do you need to arbitrarily index into a string – if you are looking at a particular character, you will look at it after having looked at the previous one also. Therefore, even if the character representation makes indexing slow, in practice, no one will care. That was the crux of my thesis: arbitrarily indexing into a string is irrelevant and iterating over a string in a variable-width encoding does not appear to be terribly costly. If so, it follows that variable-width character encodings should not impose a significant performance penalty.

From there, I chose counting characters as the example case for iteration – not because C strlen is a useful operation (in practice, you’d want length-prefixed strings at minimum, not null-terminated ones; or better yet, ropes); rather, simply because it is the most trivial example of iterating over a sequence of characters. To test my thesis it suffices just fine, even as I wouldn’t use it in practice. In fact it is probably the best choice for that particular question, as any real-world algorithm that iterates over a string will perform a lot more per-character processing which is likely to dwarf the cost of iteration alone.