XPath vs the default namespace: easy things should be easy
While tweaking my feed scrapers’ process, I tried to write a short Perl script that introspects a stylesheet and extracts a URL from it. Reduced to the parts we’re interested in, the XML I was trying to pry the value out of looks like this:
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns="http://purl.org/atom/ns#"
>
  <xsl:template match="/">
    <feed version="0.3">
      <link href="http://example.org/"/>
    </feed>
  </xsl:template>
</xsl:stylesheet>
The Perl code, of course, is fairly straightforward:
use XML::LibXML;
my $doc = XML::LibXML
    ->new
    ->parse_file( $ARGV[ 0 ] );
print $doc->findvalue( q{
    /xsl:stylesheet
    /xsl:template[ @match = '/' ]
    /feed
    /link[ not( @rel ) or @rel = 'alternate' ]
    /@href
} );
And doesn’t work.
The story is that you can’t match on the default namespace in XPath. In XPath 1.0, which is what libxml2 implements, an element name without a prefix always matches the null namespace, never the document’s default namespace, even when that default is bound to a URI. Says Daniel Veillard on the GNOME XML mailing list:
You cannot define a default namespace for XPath, period, don’t try you can’t, the XPath spec does not allow it. This can’t work and trying to add it to libxml2 would simply make it non-conformant to the spec. In a nutshell forget about using default namespace within XPath expressions, this will never work, you can’t!
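To make the behaviour concrete, here is a minimal sketch that runs the same unprefixed XPath expression against two tiny documents: one whose elements are in no namespace, and one that declares the Atom URI as its default namespace. The inline documents and variable names are made up for illustration; only the namespace URI comes from the stylesheet above.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;

# The same markup twice: once in no namespace, once in a default namespace.
my $plain = $parser->parse_string(
    '<feed><link href="http://example.org/"/></feed>'
);
my $atom = $parser->parse_string(
    '<feed xmlns="http://purl.org/atom/ns#"><link href="http://example.org/"/></feed>'
);

# Unprefixed names in XPath 1.0 only match elements in no namespace.
my $xpath = '/feed/link/@href';

my $from_plain = $plain->findvalue( $xpath );
my $from_atom  = $atom->findvalue( $xpath );

print "no namespace:      '$from_plain'\n";
print "default namespace: '$from_atom'\n";
```

The first lookup finds the attribute; the second comes back as an empty string, because in that document `feed` and `link` live in the Atom namespace and the unprefixed names in the expression do not.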
It turns out you need to make the XPath evaluator aware of a mapping from the URI in question to a prefix so that you can use this explicit prefix in your XPath expression when matching elements which, in the document, have no namespace prefix. But this is not easy, as Daniel explains:
XPath was *NOT* designed to be used in isolation, as a result there is no way to provide namespace bindings within the XPath syntax. There are APIs to provide this bindings in libxml2 XPath module.
Unfortunately, XML::LibXML does not bind these APIs. You end up needing to pull in an extra module in Perl, XML::LibXML::XPathContext.
So then the code looks like this:
use XML::LibXML;
use XML::LibXML::XPathContext;
my $doc = XML::LibXML
    ->new
    ->parse_file( $ARGV[ 0 ] );
# Evaluate XPath relative to a context that can carry namespace bindings.
my $xc = XML::LibXML::XPathContext
    ->new( $doc->documentElement() );
# Bind a prefix of our own choosing to the Atom namespace URI.
$xc->registerNs( atom => 'http://purl.org/atom/ns#' );
print $xc->findvalue( q{
    /xsl:stylesheet
    /xsl:template[ @match = '/' ]
    /atom:feed
    /atom:link[ not( @rel ) or @rel = 'alternate' ]
    /@href
} );
Feh!
I ended up doing this with XSLT instead.
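For anyone who does stay in Perl, one detail is worth spelling out: the prefix handed to `registerNs` is only a local alias for the namespace URI and need not match anything in the document. The sketch below is a self-contained variant of the script that parses the stylesheet from an inline string (an assumption made so it runs without a file) and registers both namespaces under the deliberately different, made-up prefixes `t` and `a`.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Inline copy of the stylesheet skeleton, so the example needs no input file.
my $xml = <<'XML';
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns="http://purl.org/atom/ns#"
>
  <xsl:template match="/">
    <feed version="0.3">
      <link href="http://example.org/"/>
    </feed>
  </xsl:template>
</xsl:stylesheet>
XML

my $doc = XML::LibXML->new->parse_string( $xml );
my $xc  = XML::LibXML::XPathContext->new( $doc );

# Hypothetical prefixes: what matters is the URI each one is bound to.
$xc->registerNs( t => 'http://www.w3.org/1999/XSL/Transform' );
$xc->registerNs( a => 'http://purl.org/atom/ns#' );

my $href = $xc->findvalue( q{
    /t:stylesheet
    /t:template[ @match = '/' ]
    /a:feed
    /a:link[ not( @rel ) or @rel = 'alternate' ]
    /@href
} );

print "$href\n";
```

Registering the XSLT namespace explicitly also means the expression no longer depends on which prefixes the document itself happens to use.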