Decoding URI-escaped characters… with sed
Funny how life moves in circles. Remember the last time I came across a similar problem? Back then, there was a human-readable Unicode codepoint that had to be decoded to the corresponding character, in XSLT. Someone on a forum I frequent just posted a very similar question:
How do you convert a human-readable hexadecimal value from a URI-escaped string into the corresponding byte?
Only this time, the tool has to be sed.
Now, XSLT required an appalling hack to solve the problem; well, and there was the EXSLT requirement because there are no bitwise operators; but at least the combination provides enough primitives that it’s doable. But sed? Uh, it’s actually Turing-complete. An even more appalling hack would be able to solve the problem. However, the sed proto-language also lacks bitwise operators and there is nothing to make up for their absence. So, in practice, it’s just not possible. But then, there are only 256 byte values (whereas there are over a million codepoints in Unicode).
So what was my answer?
Mu.
I opted for the other way out:
sed "$( printf 's/%%%02X/\\x%02X/ig;' $( ( seq 38 255 ; seq 0 37 ) | sed p ) )"
Did I mention I’m a hack value addict?
If you don’t want to puzzle it out for yourself:
- seq provides a sequence of numbers from 38 to 255 and from 0 to 37;
- … in which sed p doubles each number into a pair of identical numbers;
- … which are converted via printf to 256 sed statements of the form s/%0A/\x0A/ig; (i.e. roughly 3.5KB of simple-minded sed code in total);
- … and that is then fed back to sed for execution.
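To watch the generator at work, here is a minimal sketch restricted to three byte values (65 to 67, picked arbitrarily for illustration). It assumes GNU sed, since the \xHH escape and the i flag are GNU extensions, and the variable names are made up for this example. The full script is only generated and measured here, not run.

```shell
#!/bin/sh
# Generate the sed script for just three byte values (65..67, i.e. A, B, C).
# "seq 65 67 | sed p" prints every number twice, one copy per %02X placeholder.
script=$( printf 's/%%%02X/\\x%02X/ig;' $( seq 65 67 | sed p ) )
printf '%s\n' "$script"
# → s/%41/\x41/ig;s/%42/\x42/ig;s/%43/\x43/ig;

# Feed the generated script back to sed (GNU sed assumed for \xHH and the i flag).
decoded=$( printf '%s\n' '%41%42%43' | sed "$script" )
printf '%s\n' "$decoded"
# → ABC

# The full 256-entry script, generated but not executed here, just to measure it.
full_script=$( printf 's/%%%02X/\\x%02X/ig;' $( ( seq 38 255 ; seq 0 37 ) | sed p ) )
printf '%s bytes\n' "${#full_script}"
# → 3584 bytes
```

Each statement is 14 characters long, so 256 of them come to 3584 bytes, which is where the 3.5KB figure comes from.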
Notes:
- The odd choice of broken up number sequence is necessary because the byte value 37, which is 0x25 in hexadecimal and the character “%” in ASCII, must be decoded last. If it isn’t, then sequences like “%25%66%66”, which becomes “%ff” after decoding, will wrongly be double-decoded.
- Doubling up each number for printf is necessary because there are two placeholders in its format string that have to have the same value.
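The first note can be checked with a reduced script. This is only a sketch, again assuming GNU sed: mkscript is a hypothetical helper introduced for this illustration, and the three byte values (37, 102 and 255) are simply the ones the example string needs.

```shell
#!/bin/sh
# mkscript (hypothetical helper): emit one s/%HH/\xHH/ig; statement per byte
# value given, in the order given. "sed p" doubles each value for printf.
mkscript() {
    printf 's/%%%02X/\\x%02X/ig;' $( printf '%s\n' "$@" | sed p )
}

input='%25%66%66'

# Byte 37 ("%") decoded first: the freshly produced "%ff" gets decoded again.
wrong=$( printf '%s\n' "$input" | LC_ALL=C sed "$( mkscript 37 102 255 )" )
# Byte 37 decoded last: the result stays the literal text "%ff".
right=$( printf '%s\n' "$input" | LC_ALL=C sed "$( mkscript 102 255 37 )" )

printf '%s\n' "$right"
# → %ff
# The wrong result is a single raw byte, so show it as hex instead.
printf '%s' "$wrong" | od -An -tx1
```

With “%” handled first, the intermediate text “%ff” is still ahead of the %FF rule and collapses into the single byte 0xFF. With “%” handled last, no rule is left to touch it.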