Concise XPath

Saturday, 14 Jan 2012

I get the impression that not many people know XPath, or know it very well, which is a shame. For one, it’s a beautifully concise notation (as you’ll see shortly). For another, it may be the difference between whether you hate XML or not. (I won’t claim it’ll make you like XML, though it may. It did for me.)

XPath is really very simple: you just string together conditions. At any point during evaluation there is a set of nodes so far. For each condition, a new set of nodes is selected based on those, and the condition is checked against this new set.

If a condition is appended with /, that means to then select the matching nodes for the next step. If the appended condition was enclosed in [], that means to continue on with the original set, but to discard those nodes for which there were no matching new nodes.

So to illustrate this / vs [] – /foo/bar means this:

Start with the root node.
Then /foo: for each node (which is just the root node, so far), fetch its child nodes (among which the root node always has exactly one element), check which ones are foo elements, and take those as the new set.
Then /bar: for each node, fetch its child nodes, check which ones are bar elements, and take those as the new set.

These conditions appended with / are known as steps.

And then /foo[bar] means this:

Start with the root node.
Then /foo: for each node, fetch its child nodes, check which ones are foo elements, and take those as the new set.
Then [bar]: for each node, fetch its child nodes, check if any are bar elements, and if you come up empty then discard that node.

Such a condition inside [] is known as a predicate. Each predicate can itself be just as complex as any expression: it can itself contain steps and predicates of its own.
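To make the / vs [] distinction concrete, here is a toy model of the two evaluations in Python. It is only a sketch of the description above, not a real XPath engine: the function names child_step and child_predicate are made up for this illustration, and the sample document is invented.

```python
from xml.dom import minidom

def child_step(nodes, name):
    """A step: replace each node by its child elements called `name`."""
    return [c for n in nodes for c in n.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == name]

def child_predicate(nodes, name):
    """A predicate: keep a node only if it has a child element called `name`."""
    return [n for n in nodes if child_step([n], name)]

doc = minidom.parseString("<foo><bar/><baz/><bar/></foo>")
root = [doc]  # evaluation starts with the root node as the only node so far

foo_bar = child_step(child_step(root, "foo"), "bar")            # /foo/bar
foo_with_bar = child_predicate(child_step(root, "foo"), "bar")  # /foo[bar]

print([n.tagName for n in foo_bar])       # ['bar', 'bar']
print([n.tagName for n in foo_with_bar])  # ['foo']
```

The step moves on to the bar elements; the predicate stays on foo and merely uses the bar elements as a filter.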

Finally, there are axes, written as prefixes separated with a ::. Axes specify which set of nodes to select before checking the condition – it doesn’t have to be the child nodes of the current set, that’s just the default axis (which you don’t need to write) called child::. So you can write e.g. /foo/following-sibling::bar:

Start with the root node.
Then /foo: for each node, fetch its child nodes, check which ones are foo elements, and take those as the new set.
Then /following-sibling::bar: for each node, fetch all the siblings that follow it, check which are bar elements, and then take those as the new set.

(Thus /foo/bar and /foo[bar] really mean /child::foo/child::bar and /child::foo[child::bar] respectively. Therefore each condition also includes a selection rule, often implicitly.)
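The same idea as a toy model in Python, with the axis made an explicit argument of each step. Again this is a sketch under assumptions, not a real engine: step, child and following_sibling are names invented here, and the sample document is made up. It wraps everything in a doc element, because the single top-level element of a document never has sibling elements.

```python
from xml.dom import minidom

def child(node):
    """The child:: axis (the default one)."""
    return list(node.childNodes)

def following_sibling(node):
    """The following-sibling:: axis: every later node with the same parent."""
    found = []
    sibling = node.nextSibling
    while sibling is not None:
        found.append(sibling)
        sibling = sibling.nextSibling
    return found

def step(nodes, axis, name):
    """Select along `axis` from each node, keeping the elements called `name`."""
    return [c for n in nodes for c in axis(n)
            if c.nodeType == c.ELEMENT_NODE and c.tagName == name]

doc = minidom.parseString(
    '<doc><bar id="1"/><foo/><bar id="2"/><baz/><bar id="3"/></doc>')

# /child::doc/child::foo/following-sibling::bar
nodes = step([doc], child, "doc")
nodes = step(nodes, child, "foo")
nodes = step(nodes, following_sibling, "bar")

print([n.getAttribute("id") for n in nodes])  # ['2', '3']
```

The bar element with id 1 is a sibling of foo too, but it comes before foo, so the following-sibling:: axis never selects it.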

Compare expressions and explanations and you see what I said about concision and beauty.

Now, with those principles given to you, just string together conditions. There are a few syntactic shortcuts other than not needing to write child::, e.g. you can write attribute::foo as @foo, and // is short for /descendant-or-self::node()/, so that //foo selects every foo element in the document. But there is no magic to those: they are just sugar. For the details – lists of possible axes, syntactic shortcuts, etc. – just refer to the standard. Lousy though it may be as an introduction, it makes a good reference.
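To show that the shortcuts really are just sugar, here is a toy expansion of //foo in Python, following the XPath 1.0 definition of // as /descendant-or-self::node()/. It is a sketch only: the helper names and the sample document are invented for the illustration, and a real engine would additionally sort the result into document order and drop duplicates.

```python
from xml.dom import minidom

def descendant_or_self(node):
    """The descendant-or-self:: axis: the node itself, then everything below it."""
    found = [node]
    for c in node.childNodes:
        found.extend(descendant_or_self(c))
    return found

def child_elements_named(node, name):
    """The child:: axis with a name test."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == name]

doc = minidom.parseString(
    '<doc><foo id="1"><foo id="2"/></foo><bar><foo id="3"/></bar></doc>')

# //foo  is  /descendant-or-self::node()/child::foo
everything = descendant_or_self(doc)  # /descendant-or-self::node()
all_foo = [f for n in everything
           for f in child_elements_named(n, "foo")]  # /child::foo

# @id is attribute::id, read here through the DOM instead of an axis function
print([f.getAttribute("id") for f in all_foo])  # ['1', '2', '3']
```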

That’s XPath.

Some practical notes:

With the various axes such as following-sibling::, you always get a whole set (e.g. all following siblings in this example). If you want a specific one from that set based on position (usually the first), you have to discard the ones you aren’t interested in by using a predicate that checks the position. In that case it is [1], which is another shortcut notation, standing for [position() = 1]. The position() function evaluates to the index of a node within the set that was selected for one particular node of the previous step, so the counting starts over for each of those nodes.

So a common construction is following-sibling::*[1], which amounts to “the element whose start-tag is right after this one’s end-tag.” A somewhat likely case is to further combine this with a [self::foo] predicate to say “but only as long as that is a foo element.”
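As a toy model in Python, following-sibling::*[1] with an optional [self::foo] on top looks roughly like this. It is a sketch with made-up helper names and a made-up document, not how any real engine does it.

```python
from xml.dom import minidom

def following_sibling_elements(node):
    """following-sibling::* for a single context node."""
    found = []
    sibling = node.nextSibling
    while sibling is not None:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            found.append(sibling)
        sibling = sibling.nextSibling
    return found

def first_following_element(node, name=None):
    """following-sibling::*[1], optionally narrowed by [self::name]."""
    # [1]: positions are counted within the set selected for this one node
    candidates = following_sibling_elements(node)[:1]
    if name is not None:
        # [self::name]: keep the survivor only if it has the right name
        candidates = [c for c in candidates if c.tagName == name]
    return candidates

doc = minidom.parseString("<doc><a/><foo/><b/><bar/></doc>")
a, foo, b, bar = doc.documentElement.childNodes

print([n.tagName for n in first_following_element(a)])           # ['foo']
print([n.tagName for n in first_following_element(a, "foo")])    # ['foo']
print([n.tagName for n in first_following_element(foo, "foo")])  # []
```

The last call comes up empty because the element right after foo is b, and the [self::foo] check then discards it.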

Observe that the order of predicates matters.

If you write *[self::foo][1], you get all child elements, then narrow them down to the foo elements, then to the first of them – so it amounts to “select the first foo child element,” which is identical to the much simpler expression foo[1]. This is very different from *[1][self::foo], which first narrows down “every child element” to “the first one” and only then checks “but only if it’s a foo.”
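A toy model in Python shows the difference: each predicate is a filter applied to whatever the previous one left over, so swapping two filters changes the result. The helper names and the sample document are invented for this sketch.

```python
from xml.dom import minidom

def child_elements(node):
    """The * step: all child elements of one context node."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]

def only_foo(nodes):
    """The [self::foo] predicate."""
    return [n for n in nodes if n.tagName == "foo"]

def first(nodes):
    """The [1] predicate: position within what the previous filter left."""
    return nodes[:1]

def ids(nodes):
    return [n.getAttribute("id") for n in nodes]

doc = minidom.parseString('<doc><bar id="1"/><foo id="2"/><foo id="3"/></doc>')
kids = child_elements(doc.documentElement)

foo_then_first = first(only_foo(kids))  # *[self::foo][1]
first_then_foo = only_foo(first(kids))  # *[1][self::foo]

print(ids(foo_then_first))  # ['2']
print(ids(first_then_foo))  # []
```

Here the first child is a bar, so *[1][self::foo] selects nothing, while *[self::foo][1] skips past the bar and finds the first foo.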