Repairing broken documents that mix UTF-8 and ISO-8859-1

Thursday, 6 Apr 2006 [Thursday, 4 Feb 2010]

A perpetual (if thankfully not too frequent) problem on the web is documents claiming to be encoded in either UTF-8 or ISO-8859-1, but containing characters encoded in the respective other charset. Such documents display incorrectly no matter which way you look at them. Worse, if the document in question is XML (such as, say, a newsfeed) and claims to be encoded in UTF-8, the XML parser will halt and catch fire as soon as it encounters the first invalid byte.

How does the parser know? It can tell because UTF-8 encodes non-ASCII characters as multi-byte sequences that follow a very specific pattern. Non-ASCII characters encoded as ISO-8859-1 are single bytes that will almost never happen to form such sequences, so their presence can be detected with a very high degree of confidence.
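
To make that concrete, here is a minimal sketch of such a check (for illustration only, separate from the script below; the helper name looks_like_utf8 is made up): a byte string that survives a strict UTF-8 decode is almost certainly UTF-8, while raw ISO-8859-1 bytes for non-ASCII characters make it fail.

use strict;
use warnings;

use Encode qw( decode FB_CROAK );

# Returns true if the given byte string is well-formed UTF-8.
sub looks_like_utf8 {
    # work on a copy, since decode may modify its input when a CHECK value is given
    my ( $octets ) = @_;
    return eval { decode( 'UTF-8', $octets, FB_CROAK ); 1 } ? 1 : 0;
}

print looks_like_utf8( "caf\xC3\xA9" ) ? "UTF-8\n" : "not UTF-8\n";  # "é" as UTF-8: valid
print looks_like_utf8( "caf\xE9" )     ? "UTF-8\n" : "not UTF-8\n";  # "é" as Latin-1: invalid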

Of course, the same fact can just as well be turned to good advantage. If you start from the working assumption that the primary encoding of such a mixed-up document is UTF-8 and simply decode and re-encode the byte stream, you can salvage the misencoded data by catching any decoding errors and decoding the offending bytes as ISO-8859-1 instead.

Here’s a Perl script, cleverly called repair-utf8, which implements this approach:

#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( decode FB_QUIET );

binmode STDIN, ':bytes';
binmode STDOUT, ':encoding(UTF-8)';

my $out;

while ( <> ) {
  $out = '';
  while ( length ) {
    # consume input string up to the first UTF-8 decode error
    $out .= decode( "utf-8", $_, FB_QUIET );
    # consume one character; all octets are valid Latin-1
    $out .= decode( "iso-8859-1", substr( $_, 0, 1 ), FB_QUIET ) if length;
  }
  print $out;
}

The only non-obvious bit to be aware of here is that in the FB_QUIET fallback mode, Encode will remove any successfully processed data from the input buffer. The entire script revolves around this behaviour. After the first decode, $_ will be empty if it was decoded successfully; if not, the successfully decoded part at the start of $_ is returned, and $_ is truncated from the front up to the offending byte. The second decode is then free to process that: since decode’s input is a true in-out parameter and substr passed as a subroutine argument acts as an lvalue, the single byte it decodes is likewise snipped off the front of $_. The inner loop thus keeps running as long as any undecoded input is left, decoding it, if need be, one byte at a time as ISO-8859-1.
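
To see that in-out behaviour in isolation, here is a quick demonstration (again just a sketch with made-up sample bytes, not part of the script):

use strict;
use warnings;

use Encode qw( decode FB_QUIET );

# "é" once as UTF-8 (0xC3 0xA9) and once as a raw Latin-1 byte (0xE9):
my $buf = "caf\xC3\xA9 ol\xE9";

my $decoded = decode( 'UTF-8', $buf, FB_QUIET );
# $decoded now holds "café ol", everything up to the invalid byte,
# while $buf has been truncated from the front and contains just "\xE9".
# Decoding that remainder as ISO-8859-1 yields the missing "é".

Saved as repair-utf8 and made executable, the script itself can then be used as a simple filter, e.g. repair-utf8 broken.xml > fixed.xml.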

See also Sam Ruby’s just-posted clean_utf8_for_xml.c.