Wrapping up XSLT word wrapping

Thursday, 15 Jul 2010 [Thursday, 29 Jul 2010]

At the turn of the year, Dave Brotherstone wrote me to let me know that my implementation of word wrapping in pure XSLT is buggy. He included a fixed version which I did not feel comfortable with because it was much longer than the original and had some obvious remnants of experimentation. However, his approach was conceptually correct, as I discovered after independently examining and fixing my code and thus gaining the understanding I needed to assess his version. I wound up with an equivalent to his code, though with one significant control flow refactor. This is what I got:

<!-- Copyright 2010 Aristotle Pagaltzis; under the MIT licence -->
<!-- http://www.opensource.org/licenses/mit-license.php -->
<xsl:template name="wrap-string" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="str" />
  <xsl:param name="wrap-col" />
  <xsl:param name="break-mark" />
  <xsl:param name="pos" select="0" />
  <xsl:choose>

    <xsl:when test="contains( $str, ' ' )">
      <xsl:variable name="first-word" select="substring-before( $str, ' ' )" />
      <xsl:variable name="pos-now" select="$pos + 1 + string-length( $first-word )" />
      <xsl:choose>

        <xsl:when test="$pos > 0 and $pos-now >= $wrap-col">
          <xsl:copy-of select="$break-mark" />
          <xsl:call-template name="wrap-string">
            <xsl:with-param name="str" select="$str" />
            <xsl:with-param name="wrap-col" select="$wrap-col" />
            <xsl:with-param name="break-mark" select="$break-mark" />
            <xsl:with-param name="pos" select="0" />
          </xsl:call-template>
        </xsl:when>

        <xsl:otherwise>
          <xsl:value-of select="$first-word" />
          <xsl:text> </xsl:text>
          <xsl:call-template name="wrap-string">
            <xsl:with-param name="str" select="substring-after( $str, ' ' )" />
            <xsl:with-param name="wrap-col" select="$wrap-col" />
            <xsl:with-param name="break-mark" select="$break-mark" />
            <xsl:with-param name="pos" select="$pos-now" />
          </xsl:call-template>
        </xsl:otherwise>

      </xsl:choose>
    </xsl:when>

    <xsl:otherwise>
      <xsl:if test="$pos + string-length( $str ) >= $wrap-col">
        <xsl:copy-of select="$break-mark" />
      </xsl:if>
      <xsl:value-of select="$str" />
    </xsl:otherwise>

  </xsl:choose>
</xsl:template>

The original code had two problems, both involving its use of the pos parameter, which keeps track of how many characters from the string have already been consumed, in order to decide when to wrap. The mistakes:

  1. When the wrapper outputs a space between words, the space is not accounted for in pos; an off-by-one error. This is the bug I spotted.

  2. The value of pos is not reset when breaking a line, so it is relative to the start of the overall string, not relative to the start of the current line; a logic bug. This is what Dave spotted. I didn’t even realise this was the case – it is so obvious that pos must be line-relative that when I started reading my code, I simply assumed it was.

In the old code, the wrapper checks whether a wrapping point has been reached at the start of an iteration, and if so, outputs break-mark, otherwise a space. Then it always outputs first-word (the non-whitespace at the start of the remainder of the string) and finally it recurses, passing the remainder of the string, giving $pos + length( $first-word ) as the new pos.

To fix the first bug, the value passed as pos would have to depend on whether a space or break-mark was output. Due to the purely functional nature of XSLT, I found no good way to express this without code duplication. (In an imperative language, I would just increment pos in the conditional block that outputs a space. In XSLT I could see no other way than to test the same condition once in order to know whether to output a space or break-mark, and another time in order to figure out whether to add 1 or 0 to pos.) It seemed that fixing the error would be significantly detrimental to the clarity of the code.

I got stuck on this for quite a while.

Eventually, I found a sleight of hand that also allows some conceptual simplification, so thankfully, neither brevity nor clarity suffered too much for the result. The trick is to check whether the line is about to overflow, and if so, output break-mark but not first-word, passing down the string to the next recursion unchanged, effectively spinning the loop in place for one iteration. This folds the start-of-line case into the code path for the start-of-string case, and allows unconditionally passing 0 as the value for pos for both of them, which avoids duplication of conditionals – and just so happens to change the semantics of pos such that Dave’s bug is inadvertently also fixed. The form of the conditionals required by this semantic change is also nicely simple compared to those I had before.

In the end, the new code turned out slightly longer than the original and also harder to understand, though with the decided advantage of, y’know, working correctly. Fortunately it is nowhere near as bad as I feared it might become when I started on it, and in terms of conditionals, actually got much clearer.

Update: fixed another bug. Previously the condition was just “$pos-now >= $wrap-col”, so words wider than $wrap-col would send the code into infinite recursion: the loop would spin in place at the start of the line without ever advancing through the source string. Now the condition is “$pos > 0 and $pos-now >= $wrap-col”, which means the word which starts a line will always be emitted no matter its length.

I need a test suite…