Decoding URI-escaped characters… with `sed`

Thursday, 7 Dec 2006

Funny how life moves in circles. Remember the last time I came across a similar problem? Back then, there was a human-readable Unicode codepoint that had to be decoded to the corresponding character, in XSLT. Someone on a forum I frequent just posted a very similar question:

How do you convert a human-readable hexadecimal value from a URI-escaped string into the corresponding byte?

Only this time, the tool has to be sed.

Now, XSLT required an appalling hack to solve the problem; well, and there was the EXSLT requirement because there are no bitwise operators; but at least the combination provides enough primitives that it’s doable. But sed? Uh, it’s actually Turing-complete, so in principle an even more appalling hack would be able to solve the problem. However, the sed proto-language also lacks bitwise operators and there is nothing to make up for their absence. So computing the byte value is just not feasible. But then, there are only 256 byte values (whereas there are over a million codepoints in Unicode).

So what was my answer?

Mu.

I opted for the other way out:

sed "$( printf 's/%%%02X/\\x%02X/ig;' $( ( seq 38 255 ; seq 0 37 ) | sed p ) )"

Did I mention I’m a hack value addict?

If you don’t want to puzzle it out for yourself:

seq provides a sequence of numbers from 38 to 255 and from 0 to 37;
… in which sed p doubles each number into a pair of identical numbers;
… which are converted via printf to 256 sed statements of the form s/%0A/\x0A/ig; – i.e. roughly 3.5KB of simple-minded sed code in total;
… and that is then fed back to sed for execution.
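To see what the steps above actually generate, here is a small sketch that captures the generated script in a variable instead of handing it straight to sed. The variable names (`script`, `first`, `last`, `count`) are made up for this illustration; the `printf` and `seq` invocation is the one from the one-liner.

```shell
# Build the sed script the same way the one-liner's command substitution does.
script=$(printf 's/%%%02X/\\x%02X/ig;' $( ( seq 38 255 ; seq 0 37 ) | sed p ))

# Each statement is 14 characters long, so the first and last can be cut out by position.
first=$(printf '%s' "$script" | cut -c1-14)      # statement for byte 38 (0x26)
last=$(printf '%s' "$script" | cut -c3571-3584)  # statement for byte 37 (0x25)

# Count the statements by counting their terminating semicolons.
count=$(( $(printf '%s' "$script" | tr -cd ';' | wc -c) ))

printf 'first: %s\nlast:  %s\n' "$first" "$last"
printf 'statements: %s, bytes: %s\n' "$count" "${#script}"
```

This should report `s/%26/\x26/ig;` as the first statement, `s/%25/\x25/ig;` as the last, and 256 statements in 3584 bytes. Piping `Hello%20World%21` through `sed "$script"` should then print `Hello World!`, assuming GNU sed, since the `\xHH` escapes and the `i` flag are GNU extensions.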

Notes:

The odd choice of a broken-up number sequence is necessary because the byte value 37, which is 0x25 in hexadecimal and the character “%” in ASCII, must be decoded last. If it isn’t, then a sequence like “%25%66%66”, which becomes “%ff” after decoding, will wrongly be decoded a second time by the later statement for %FF.
Doubling up each number for printf is necessary because there are two placeholders in its format string that have to have the same value.
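The ordering note can be checked with a cut-down sketch. Three hand-written statements stand in for the 256 generated ones, and `<FF>` is a visible placeholder I am using in place of the raw byte 0xFF; the variable names are likewise only for illustration. It sticks to plain sed features, so it does not need the GNU extensions.

```shell
# Three hand-written rules stand in for the 256 generated ones.
# '<FF>' is a visible placeholder for the raw byte 0xFF.
input='%25%66%66'

# Ascending order: %25 is decoded before %66 and %ff,
# so the freshly produced "%" gets picked up by later rules.
wrong=$(printf '%s\n' "$input" | sed 's/%25/%/g; s/%66/f/g; s/%ff/<FF>/g')

# %25 decoded last, as in the one-liner: the produced "%" is never re-read.
right=$(printf '%s\n' "$input" | sed 's/%66/f/g; s/%ff/<FF>/g; s/%25/%/g')

printf 'ascending order:   %s\n' "$wrong"   # <FF>
printf '%%25 decoded last: %s\n' "$right"   # %ff
```

In ascending order the input goes through `%%66%66` and `%ff` before ending up as `<FF>`; with `%25` last it goes through `%25ff` and ends up as the intended `%ff`.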
