Decoding URI-escaped characters… with sed

Thursday, 7 Dec 2006

Funny how life moves in circles. Remember the last time I came across a similar problem? Back then, there was a human-readable Unicode codepoint that had to be decoded to the corresponding character, in XSLT. Someone on a forum I frequent just posted a very similar question:

How do you convert a human-readable hexadecimal value from a URI-escaped string into the corresponding byte?

Only this time, the tool has to be sed.

Now, XSLT required an appalling hack to solve the problem (plus EXSLT, because XSLT has no bitwise operators), but at least the combination provides enough primitives that it’s doable. But sed? Well, it’s actually Turing-complete, so an even more appalling hack could solve the problem in principle. However, the sed proto-language also lacks bitwise operators, and there is nothing to make up for their absence, so computing the byte value is just not practical. But then, there are only 256 byte values (whereas there are over a million codepoints in Unicode).

So what was my answer?


I opted for the other way out:

sed "$( printf 's/%%%02X/\\x%02X/ig;' $( ( seq 38 255 ; seq 0 37 ) | sed p ) )"

Did I mention I’m a hack value addict?

If you don’t want to puzzle it out for yourself:

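  1. seq provides a sequence of numbers from 38 to 255 and then from 0 to 37, so that 37 (hex 25, the % sign itself) comes last and a freshly decoded % cannot be mistaken for the start of another escape;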
  2. … in which sed p doubles each number into a pair of identical numbers;
  3. … which are converted via printf to 256 sed statements of the form s/%0A/\x0A/ig; i.e. roughly 3.5KB of simple-minded sed code in total;
  4. … and that is then fed back to sed for execution.
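
If you would rather see the steps above in action, here is a minimal sketch that builds the generated program once and wraps it in a function. It assumes GNU sed (the \xHH escape in the replacement and the i flag are GNU extensions, as far as I know) and a seq utility; the names urldecode_script and urldecode are made up for this illustration and appear nowhere else.

```shell
# Steps 1-3: generate the 256 substitution statements once.
# Assumption: GNU sed and a seq utility are available.
# urldecode_script is a hypothetical variable name used only in this sketch.
urldecode_script=$(printf 's/%%%02X/\\x%02X/ig;' $( ( seq 38 255; seq 0 37 ) | sed p ))

# Step 4: feed the generated program back to sed.
# urldecode is a hypothetical helper name, not a standard command.
urldecode() {
    sed "$urldecode_script"
}

# Plain escapes, upper or lower case:
printf '%s\n' 'Hello%20World%21' | urldecode   # prints: Hello World!
printf '%s\n' 'x%7ey' | urldecode              # prints: x~y

# Because %25 is handled last, an escaped percent sign is decoded only once:
printf '%s\n' '100%2541' | urldecode           # prints: 100%41
```

The last line is the reason for the odd ordering in step 1: had %25 been decoded first, the resulting %41 would have been decoded again into an A.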
