Many ways to skin a char

Sunday, 10 Apr 2005

So my Ratings and Reviews for CPAN feed is broken. It validates all right, but along the way, content that is already UTF-8 gets interpreted as Latin1 and then re-encoded, which leaves me with broken, double-encoded UTF-8.
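To make the failure concrete, here is a minimal Python sketch of that double encoding. The sample string is made up for illustration, and the scraper itself is not written this way; the snippet only reproduces the byte-level effect. Note that the result is still perfectly valid UTF-8, which is why a validator has nothing to complain about.

```python
# Hypothetical sample text; any non-ASCII character shows the same effect.
original = "Caf\u00e9"  # "Café"

utf8_bytes = original.encode("utf-8")      # correct UTF-8: b'Caf\xc3\xa9'
misread = utf8_bytes.decode("latin-1")     # wrongly read as Latin1: two chars per byte pair
double_encoded = misread.encode("utf-8")   # re-encoded: b'Caf\xc3\x83\xc2\xa9'

# The broken output is well-formed UTF-8, it just says the wrong thing.
assert double_encoded.decode("utf-8") == misread

# As long as nothing else touched the text, the damage is reversible.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == original)
```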

This happens because my scraper script downloads files using wget, then passes the result to HTML Tidy. During this step, any charset information that was present in the HTTP header is lost, so Tidy, finding no meta tag or similar declaration in the document, treats it as Latin1.
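For illustration, this is roughly what it would take to hang on to the header's charset, sketched in Python with only the standard library. It assumes the `Content-Type` header value has been saved somewhere alongside the body, which the current wget-then-Tidy pipeline does not do; `charset_from_content_type` and `decode_body` are hypothetical helper names, and the Latin1 fallback simply mirrors the behaviour described above.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type value, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def decode_body(body, header_value):
    """Decode raw bytes using the header's charset, falling back to Latin1."""
    charset = charset_from_content_type(header_value, default="latin-1")
    return body.decode(charset)


# With the header available, UTF-8 bytes come out intact.
text = decode_body(b"Caf\xc3\xa9", "text/html; charset=UTF-8")

# Without a charset parameter, the same bytes are misread as Latin1.
garbled = decode_body(b"Caf\xc3\xa9", "text/html")
```

This is only a sketch of the idea, not a fix for the scraper: the decoded text would still have to reach Tidy with the right encoding declared.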

There are a lot of ways to foul up this encoding business. Even with all this care, I’ve still managed to step right into a trap. Sam Ruby is absolutely right.

This blows particularly badly because after reviewing my options, I’m not even sure how to address the problem.

It looks like there are only two choices, both of which seriously suck:

Sigh.