Many ways to skin a char
So my Ratings and Reviews for CPAN feed is broken. It validates alright, but along the way, content that is already UTF-8 gets interpreted as Latin1 and subsequently re-transcoded to broken UTF-8.
This happens because my scraper script downloads files using wget, then passes the result to HTML Tidy. During this step, any charset information that was present in the HTTP header is lost, prompting Tidy to treat the document as Latin1 in the absence of a meta tag or some such.
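If that sounds abstract, the failure is easy to reproduce. Here is a minimal Python illustration of the double-encoding (not part of my toolchain, purely a demonstration):

```python
# Take text that is already UTF-8, misread it as Latin1 (which is what
# Tidy ends up doing), then re-encode it: classic double-encoded mojibake.
original = "Résumé".encode("utf-8")    # correct UTF-8 bytes
misread = original.decode("latin-1")   # wrongly interpreted as Latin1
broken = misread.encode("utf-8")       # re-transcoded to "UTF-8"
print(broken.decode("utf-8"))          # prints RÃ©sumÃ© instead of Résumé
```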
There are a lot of ways to foul up this encoding business, and even with all the care I’ve taken, I’ve still managed to step right into a trap. Sam Ruby is absolutely right.
This blows particularly badly because, after reviewing my options, I’m not even sure how to address the problem.
Tidy can cope with only a handful of encodings, and it uses non-standard abbreviations for them, so I cannot simply dig the encoding out of the HTTP headers and feed it to Tidy. (Seeing as this was the most obvious solution, I tried it immediately.)
LibXSLT’s built-in download capabilities don’t help either. Although it has access to the response headers and can accept HTML tag soup as input, it doesn’t do what you’d sensibly expect: the charset parameter in the header is ignored and the output is just as broken as with my current Tidy-based toolchain. Bah.
It looks like there are only two choices, both of which seriously suck:
Translate the charset parameter from the headers into a form that Tidy accepts (ugh)
Try to intuit the document’s encoding on my own following the proper procedures, and transcode it to UTF-8 before passing it off to Tidy (yuck!); a rough sketch of what that would entail follows below.
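To show the shape of that second option, here is a rough Python sketch. It is an assumption-laden illustration, not my actual scraper: the fallback order, the meta tag regex and the Tidy flags are stand-ins, and my real script shells out to wget rather than using urllib.

```python
import re
import subprocess
import urllib.request

def fetch_as_utf8(url):
    """Download a page, work out its encoding, and hand Tidy clean UTF-8."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # Prefer the charset parameter from the HTTP Content-Type header.
        charset = response.headers.get_content_charset()

    if charset is None:
        # Fall back to a charset declared in a meta tag, and failing that
        # to Latin1, the default HTTP assumes for text/* anyway.
        match = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', raw[:4096])
        charset = match.group(1).decode("ascii") if match else "latin-1"

    text = raw.decode(charset, errors="replace")

    # Now Tidy no longer has to guess: tell it the input is UTF-8.
    tidy = subprocess.run(
        ["tidy", "-quiet", "-asxhtml", "-utf8"],
        input=text.encode("utf-8"),
        capture_output=True,
    )
    return tidy.stdout
```

Which would probably work, but hand-rolling charset detection is exactly the kind of guesswork I was hoping the tools would spare me.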
Sigh.