Decoding Unicode codepoints into the corresponding character in XSLT

Saturday, Feb 4, 2006, 06:37 (updated Monday, Feb 6, 2006, 20:59)

The following is either fun or madness, depending on your point of view. I just wasted entirely too much time hacking XSLT for fun and profit.

The question was:

The following XML fragment is given:

<symbol unicode="2013"/>

Find a way to emit the actual U+2013 character. Simply emitting &#x2013; with the aid of disable-output-escaping is insufficient due to additional constraints.

First things first: without EXSLT, you don’t. Plain XPath 1.0 is too limited. Thus, the following elaborations assume a transform with the following document element:

<xsl:stylesheet
  version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:fn="http://exslt.org/functions"
  xmlns:str="http://exslt.org/strings"
  xmlns:my="urn:foo"
  extension-element-prefixes="fn my str"
>

My first hunch was to be devious and take the simple way out:

<xsl:template match="symbol">
  <xsl:value-of select="document( str:encode-uri( concat(
    'data:application/xml,&lt;chr>&amp;#x',
    @unicode,
    ';&lt;/chr>'
  ), false() ) )"/>
</xsl:template>

In some XSLT processors this may in fact work. In LibXSLT, unfortunately, it fails – which sent me on a very long goose chase…

So how do you decode a codepoint to a character? Having come back from that goose chase, the simple answer is that you just use str:decode-uri on the URI-encoded UTF-8 byte sequence representation of the Unicode codepoint in question. How do you do that? Oh, nothing could be simpler:

  1. Since XPath does not have bit operators, bit wrangling must be done using integer math – but as XPath has no notion of hexadecimals either, the first thing required is some conversion functions for that:

    <fn:function name="my:hex2num">
      <xsl:param name="hexstr" />
      <xsl:variable
        name="head"
        select="substring( $hexstr, 1, string-length( $hexstr ) - 1 )"
      />
      <xsl:variable
        name="nybble"
        select="substring( $hexstr, string-length( $hexstr ) )"
      />
      <xsl:choose>
        <xsl:when test="string-length( $hexstr ) = 0">
          <fn:result select="0" />
        </xsl:when>
        <xsl:when test="string( number( $nybble ) ) = 'NaN'">
          <fn:result select="
            my:hex2num( $head ) * 16
            + number( concat( 1, translate( $nybble, 'ABCDEF', '012345' ) ) )
            "/>
        </xsl:when>
        <xsl:otherwise>
          <fn:result select="my:hex2num( $head ) * 16 + number( $nybble )" />
        </xsl:otherwise>
      </xsl:choose>
    </fn:function>

    This function takes a hexadecimal numeral and converts it to a number. If the string is empty, it returns zero. Otherwise, it converts the last character of the string, and adds it to the recursively converted front of the string multiplied by 16. It converts the character by checking whether it is a number; if not, it assumes a character from A-F and maps it to 0-5, then prepends a 1, thereby getting 10-15.

    <fn:function name="my:num2hex">
      <xsl:param name="num" />
      <xsl:variable name="nybble" select="$num mod 16" />
      <xsl:variable name="head" select="floor( $num div 16 )" />
      <xsl:variable name="rest">
        <xsl:if test="not( $head = 0 )">
          <xsl:value-of select="my:num2hex( $head )"/>
        </xsl:if>
      </xsl:variable>
      <xsl:choose>
        <xsl:when test="$nybble > 9">
          <fn:result select="concat(
            $rest,
            translate( substring( $nybble, 2 ), '012345', 'ABCDEF' )
          )"/>
        </xsl:when>
        <xsl:otherwise>
          <fn:result select="concat( $rest, $nybble )" />
        </xsl:otherwise>
      </xsl:choose>
    </fn:function>

    This function converts a number to its hexadecimal string representation in roughly the reverse way. It converts the number modulo 16 by checking if it’s greater than 9; if so, it hacks off the 1 at the front and converts 0-5 into A-F. In either case, it recursively converts the integer part of the number divided by 16, and prepends that to the converted digit.
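
    If you want to sanity-check the two conversions outside an XSLT processor, here is a rough Python sketch that follows the same recursion. It is not part of the stylesheet; the Python names are mine and merely mirror my:hex2num and my:num2hex. Like the XSLT, it assumes uppercase hex digits only.

```python
HEX_TO_DIGIT = str.maketrans("ABCDEF", "012345")
DIGIT_TO_HEX = str.maketrans("012345", "ABCDEF")


def hex2num(hexstr: str) -> int:
    """Recursive hex-to-number conversion, shaped like my:hex2num."""
    if hexstr == "":
        return 0
    head, nybble = hexstr[:-1], hexstr[-1]
    if nybble in "0123456789":
        digit = int(nybble)
    else:
        # Same trick as the stylesheet: map A-F to 0-5, then prepend a 1.
        digit = int("1" + nybble.translate(HEX_TO_DIGIT))
    return hex2num(head) * 16 + digit


def num2hex(num: int) -> str:
    """Recursive number-to-hex conversion, shaped like my:num2hex."""
    nybble, head = num % 16, num // 16
    rest = num2hex(head) if head != 0 else ""
    if nybble > 9:
        # Drop the leading 1 of 10..15, then map 0-5 to A-F.
        return rest + str(nybble)[1:].translate(DIGIT_TO_HEX)
    return rest + str(nybble)


print(hex2num("2013"))  # 8211
print(num2hex(8211))    # 2013
```

    Comparing against Python's built-in int(s, 16) and format(n, "X") for a range of values is a cheap way to convince yourself the recursion is right.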

  2. Alright, now we can convert the codepoint to a number. We can also convert a number representing a byte sequence to a hexadecimal string. What’s missing is the step in-between – the actual UTF-8 encoding. In case you don’t already know how that works, Wikipedia explains it. So let’s just write the code to do that:

    <fn:function name="my:char-to-utf8bytes">
      <xsl:param name="codepoint" />
      <xsl:choose>
        <xsl:when test="$codepoint > 65535">
          <fn:result select="
              ( ( floor( $codepoint div 262144 ) mod  8 + 240 ) * 16777216 )
            + ( ( floor( $codepoint div   4096 ) mod 64 + 128 ) *    65536 )
            + ( ( floor( $codepoint div     64 ) mod 64 + 128 ) *      256 )
            + ( ( floor( $codepoint div      1 ) mod 64 + 128 ) *        1 )
          " />
        </xsl:when>
        <xsl:when test="$codepoint > 2047">
          <fn:result select="
              ( ( floor( $codepoint div   4096 ) mod 16 + 224 ) *    65536 )
            + ( ( floor( $codepoint div     64 ) mod 64 + 128 ) *      256 )
            + ( ( floor( $codepoint div      1 ) mod 64 + 128 ) *        1 )
          " />
        </xsl:when>
        <xsl:when test="$codepoint > 127">
          <fn:result select="
              ( ( floor( $codepoint div     64 ) mod 32 + 192 ) *      256 )
            + ( ( floor( $codepoint div      1 ) mod 64 + 128 ) *        1 )
          " />
        </xsl:when>
        <xsl:otherwise>
          <fn:result select="$codepoint" />
        </xsl:otherwise>
      </xsl:choose>
    </fn:function>

    Easy-peasy! Divisions, modulos, additions and multiplications take the place of right shifts, binary ANDs, binary ORs, and left shifts, respectively. Apart from the unfamiliarity of doing bit-twiddling with regular math, this looks very much like it would in any other language.
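
    To back up that last claim, here is a rough Python sketch of the same arithmetic, using only division, modulo, addition and multiplication. It is a cross-check rather than part of the stylesheet, and the function name is mine. It uses the inclusive UTF-8 range boundaries (U+0080, U+0800 and U+10000 are the first codepoints needing two, three and four bytes) and compares the result with Python's own encoder.

```python
def char_to_utf8_number(codepoint: int) -> int:
    """Pack the UTF-8 bytes of a codepoint into one integer, no bit operators."""
    if codepoint >= 65536:   # U+10000 and up: four bytes
        return ((codepoint // 262144 % 8 + 240) * 16777216
                + (codepoint // 4096 % 64 + 128) * 65536
                + (codepoint // 64 % 64 + 128) * 256
                + (codepoint % 64 + 128))
    if codepoint >= 2048:    # U+0800 and up: three bytes
        return ((codepoint // 4096 % 16 + 224) * 65536
                + (codepoint // 64 % 64 + 128) * 256
                + (codepoint % 64 + 128))
    if codepoint >= 128:     # U+0080 and up: two bytes
        return ((codepoint // 64 % 32 + 192) * 256
                + (codepoint % 64 + 128))
    return codepoint         # ASCII: one byte, unchanged


def reference(codepoint: int) -> int:
    """The same packed integer, obtained from Python's built-in UTF-8 encoder."""
    return int.from_bytes(chr(codepoint).encode("utf-8"), "big")


# Boundary values on both sides of each range, plus U+2013 itself.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0x2013, 0xFFFF, 0x10000, 0x10FFFF):
    assert char_to_utf8_number(cp) == reference(cp)

print(format(char_to_utf8_number(0x2013), "X"))  # E28093
```

    For U+2013 (8211 in decimal) the three bytes work out as 2 + 224 = E2, 0 + 128 = 80 and 19 + 128 = 93.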

  3. But we’re only most of the way there. For 2013 we now get a string that looks like E28093 – whereas what we need is a URL-encoded string that looks like %E2%80%93. Let’s go again!

    <fn:function name="my:percentify">
      <xsl:param name="str" />
      <xsl:choose>
        <xsl:when test="string-length( $str ) > 2">
          <fn:result select="concat(
            '%',
            substring( $str, 1, 2 ),
            my:percentify( substring( $str, 3 ) )
          )" />
        </xsl:when>
        <xsl:otherwise>
          <fn:result select="concat( '%', $str )" />
        </xsl:otherwise>
      </xsl:choose>
    </fn:function>

    Phew; that really wasn’t bad – just way verbose. It prepends a percent sign to the first two characters of a passed-in string and, if the string is longer than that, processes the rest recursively and appends it.
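
    The same step, plus the final decode, can be sketched in Python as a cross-check. This is not part of the stylesheet. The function name is mine, and urllib.parse.unquote stands in for str:decode-uri on the assumption that both decode percent-escapes as UTF-8.

```python
from urllib.parse import unquote


def percentify(hexstr: str) -> str:
    """Put a percent sign before every pair of hex digits, recursively."""
    if len(hexstr) > 2:
        return "%" + hexstr[:2] + percentify(hexstr[2:])
    return "%" + hexstr


encoded = percentify("E28093")
print(encoded)                       # %E2%80%93
print(unquote(encoded) == "\u2013")  # True
```

    One thing the sketch makes visible: a one-digit input such as 9 comes out as %9, which is not a valid escape. So codepoints below U+0010 (a tab, say) would need a leading zero added before this step.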

  4. Finally we can string the beads together; this is completely straightforward:

    <fn:function name="my:decode-codepoint">
      <xsl:param name="codepoint" />
      <fn:result
        select="str:decode-uri( my:percentify(
          my:num2hex( my:char-to-utf8bytes(
            my:hex2num( $codepoint )
          ) )
        ) )"
      />
    </fn:function>

  5. After all that, we finally arrived where we wanted to be:

    <xsl:template match="symbol">
      <xsl:value-of select="my:decode-codepoint( @unicode )"/>
    </xsl:template>

    Wouldn’t you agree that it is simple now?

Sigh. I am a hopeless hack value addict.