Decoding URI-escaped characters… with sed
Funny how life moves in circles. Remember the last time I came across a similar problem? Back then, there was a human-readable Unicode codepoint that had to be decoded to the corresponding character, in XSLT. Someone on a forum I frequent just posted a very similar question:
How do you convert a human-readable hexadecimal value from a URI-escaped string into the corresponding byte?
Only this time, the tool has to be sed.
Now, XSLT required an appalling hack to solve the problem (and there was the EXSLT requirement, because XSLT has no bitwise operators), but at least the combination provides enough primitives that it’s doable. But sed? Uh, it’s actually Turing-complete, so in principle an even more appalling hack would be able to solve the problem. However, the sed proto-language also lacks bitwise operators, and there is nothing to make up for their absence. So in practice, computing the byte value is just not possible. But then, there are only 256 byte values (whereas there are over a million codepoints in Unicode).
So what was my answer?
I opted for the other way out:
sed "$( printf 's/%%%02X/\\x%02X/ig;' $( ( seq 38 255 ; seq 0 37 ) | sed p ) )"
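For anyone who wants to try it, here is a minimal sketch that wraps the one-liner in a shell function and runs it on two sample strings. It assumes GNU sed (for the \xHH escapes and the i flag), seq, and a POSIX-style shell that word-splits unquoted command substitutions. The function name urldecode is made up for this illustration, and the LC_ALL=C prefix is an added precaution so that sed treats its input as plain bytes.

```shell
# Hypothetical wrapper around the one-liner above; assumes GNU sed and seq.
urldecode() {
    LC_ALL=C sed "$( printf 's/%%%02X/\\x%02X/ig;' $( ( seq 38 255 ; seq 0 37 ) | sed p ) )"
}

printf '%s\n' 'Hello%20World%21' | urldecode   # prints: Hello World!
printf '%s\n' 'one%2btwo' | urldecode          # prints: one+two (lowercase hex works via the i flag)
```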
Did I mention I’m a hack value addict?
If you don’t want to puzzle it out for yourself:
- seq provides a sequence of numbers from 38 to 255 and from 0 to 37;
- … in which sed p doubles each number into a pair of identical numbers;
- … which are converted via printf to sed statements of the form s/%0A/\x0A/ig; (roughly 3.5KB of simple-minded sed code in total);
- … and that is then fed back to sed.
- The odd choice of broken-up number sequence is necessary because the byte value 37, which is 0x25 in hexadecimal and the character “%” in ASCII, must be decoded last. If it isn’t, then sequences like “%25%66%66”, which become “%ff” after decoding, will wrongly be double-decoded.
- Doubling up each number for printf is necessary because there are two placeholders in its format string that have to have the same value.
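The steps above can be taken apart and inspected. The sketch below is only an illustration under a few assumptions: GNU sed, seq, and a POSIX-style shell. build_script is a hypothetical helper, and the names good and naive are likewise invented here. It generates the sed program twice, once with byte 37 last and once in plain ascending order, and shows what goes wrong with the latter.

```shell
# Hypothetical helper: reads numbers on stdin, emits one sed statement per number.
build_script() {
    # printf reuses its format string until the arguments run out, and each
    # statement needs its number twice, hence the doubling with sed p.
    printf 's/%%%02X/\\x%02X/ig;' $( sed p )
}

good=$( ( seq 38 255 ; seq 0 37 ) | build_script )   # byte 37 ("%") decoded last
naive=$( seq 0 255 | build_script )                  # plain ascending order

# First and last statement of the generated program, then its size in bytes.
printf '%s' "$good" | tr ';' '\n' | sed -n '1p;$p'   # s/%26/\x26/ig and s/%25/\x25/ig
printf '%s' "$good" | wc -c                          # 3584, i.e. 256 statements of 14 bytes

# "%2541" encodes the literal text "%41"; only the reordered script preserves it.
printf '%s\n' '%2541' | LC_ALL=C sed "$good"    # prints: %41
printf '%s\n' '%2541' | LC_ALL=C sed "$naive"   # prints: A (decoded twice)
```

In the ascending version, the statement for 0x25 runs before the one for 0x41, so the freshly produced “%” gets paired with the following “41” and decoded a second time.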