The O’Reilly Network suck at newsfeeds
Thursday, 23 Mar 2006 [Friday, 31 Mar 2006]
Update: Tim O’Reilly himself and Justin Watt, O’Reilly Network webmaster, just got in touch to let me know the issue is being looked at. This is great news!
Update again: it’s fixed – the feeds use type="html" now, but plans remain in effect to enforce that the content be well-formed XHTML so that type="xhtml" can eventually be used. Thanks, everyone.
I finally gave up fighting.
A while ago, the O’Reilly Network started migrating their backend to a new system which would support a better newsfeed infrastructure, letting articles from webloggers appear in the feeds for the respective specialised subsites such as Perl.com, ONLamp.com and XML.com. Part of this move was a motion to standardise on Atom 1.0 feeds (with cruft-free URIs, no less), offering other formats as a compatibility option.
Unfortunately, they chose to use type="xhtml". I say unfortunately even though xhtml is the best choice of type (as it allows the content payload to be a first-class citizen of the feed document), because the O’Reilly CMS is an XML-unaware legacy system. This means that every time one of their webloggers would forget to close a tag, the feed would become non-well-formed, leading to variously comical failures depending on the number of unclosed tags and the aggregator in question. This is the kind of situation for which type="html" was conceived: it allows tag soup to be transported in Atom as a double-encoded string, opaque to the XML parser that reads the feed document.
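To make the difference concrete, here is a minimal sketch in Python. The post body and the helper names are made up for illustration; none of this is the O’Reilly CMS. It only shows that the same unclosed tag breaks a type="xhtml" entry while a type="html" entry stays well-formed and hands the markup back unchanged.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

ATOM_NS = "http://www.w3.org/2005/Atom"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical post body: the <em> tag is never closed (tag soup).
tag_soup = "<p>Forgot to close <em>this tag.</p>"


def entry_with_xhtml(body):
    # type="xhtml": the markup is inlined into the feed document,
    # so it has to be well-formed XML itself.
    return (
        f'<entry xmlns="{ATOM_NS}"><content type="xhtml">'
        f'<div xmlns="{XHTML_NS}">{body}</div>'
        f'</content></entry>'
    )


def entry_with_html(body):
    # type="html": the markup is escaped, so the XML parser
    # only ever sees an opaque text node.
    return (
        f'<entry xmlns="{ATOM_NS}"><content type="html">'
        f'{escape(body)}</content></entry>'
    )


def is_well_formed(xml_text):
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


xhtml_ok = is_well_formed(entry_with_xhtml(tag_soup))
html_ok = is_well_formed(entry_with_html(tag_soup))

# The aggregator gets the original tag soup back after XML unescaping.
content = ET.fromstring(entry_with_html(tag_soup)).find(f"{{{ATOM_NS}}}content")
recovered = content.text
```

Here xhtml_ok comes out False and html_ok comes out True, and recovered is identical to the original string. What the aggregator then does with the tag soup is its own problem, but at least the feed document survives.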
I emailed their support address on three or four occasions of breakage when the new feeds were being introduced, explaining each respective problem and how to solve it. Lately, I left comments on a number of weblog entries, letting the author in question know that their unclosed tags were breaking the feed. On each occasion, I asked that the team consider using type="html" if they could not ensure that the feed contain only well-formed XHTML; or alternatively, and preferably, consider implementing sanitisation of article HTML prior to inclusion in the feed.
Neither ever happened.
I don’t really understand why, because using type="html" would have been an acceptable choice and trivial to implement – after all, they’re already doing the same work for other formats. Sanitisation takes much more implementation effort, but the team seem intent on using type="xhtml", so the necessary determination might be there for that too.
At this point, I have given up hope either way. Last night, I went to the site and looked up the URIs to alternative formats. It turns out you just have to append ?format=rss1 to the address to receive the feeds as RSS 1.0; no need to care about XML parser errors and posts which swallow the content of following posts any more.
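If you would rather script the switch than edit subscriptions by hand, something like the following sketch would do. The feed address is a placeholder, not a real O’Reilly URI, and with_format is a name I made up; only the format=rss1 parameter comes from the site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_format(feed_url, fmt="rss1"):
    """Return feed_url with its format query parameter set to fmt."""
    parts = urlsplit(feed_url)
    # Keep any existing query parameters, overriding only "format".
    query = dict(parse_qsl(parts.query))
    query["format"] = fmt
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder address, for illustration only.
rss1_url = with_format("http://example.com/feed/")
```

Going through the query string parser rather than gluing the suffix on means an address that already carries parameters still comes out valid.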
Except that now I see this in my aggregator:
“We strongly support Microsoft’
In the source for that entry I see this:
“We strongly support Microsoft’
The same “UTF-8 as ISO-8859-1 as UTF-8” double-encoding problem is apparent in the RSS 2.0 feed as well, so there’s no horse left to change to.
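For anyone who has not met this failure before, here is a small Python sketch of how I assume it comes about: UTF-8 bytes get read as ISO-8859-1 somewhere in the pipeline and are then written out as UTF-8 again. The sample string and the function names are mine; I have no idea where in the O’Reilly backend the mix-up actually happens.

```python
def double_encode(text):
    # Step 1: the text is correctly serialised as UTF-8 bytes.
    raw = text.encode("utf-8")
    # Step 2: something misreads those bytes as ISO-8859-1, turning
    # every byte of a multi-byte sequence into its own character.
    # Writing the result out as UTF-8 again bakes the damage in.
    return raw.decode("iso-8859-1")


def repair(mangled):
    # Undo the misreading: map each character back to a single byte,
    # then decode the byte sequence as the UTF-8 it originally was.
    return mangled.encode("iso-8859-1").decode("utf-8")


# Sample text with curly quotes, each three bytes long in UTF-8.
sample = "\u201cXML is tough.\u201d"
mangled = double_encode(sample)
restored = repair(mangled)
```

Each curly quote turns into three characters, the first of which is an â, and repair only works if the damage happened exactly once and nothing was dropped along the way. That is a fix for the consumer's side at best; the feed itself would still be wrong.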
Sigh. You know what I want for Christmas? Markup Barbie. You pull a string and she says “XML is tough.”