The O’Reilly Network suck at newsfeeds

Thursday, 23 Mar 2006 [Friday, 31 Mar 2006]

Update: Tim O’Reilly himself and Justin Watt, O’Reilly Network webmaster, just got in touch to let me know the issue is being looked at. This is great news!

Update again: it’s fixed – the feeds use type="html" now, but plans remain in effect to enforce that the content be well-formed XHTML so that type="xhtml" can eventually be used. Thanks, everyone.

I finally gave up fighting.

A while ago, the O’Reilly Network started migrating their backend to a new system which would support a better newsfeed infrastructure, letting articles from webloggers appear in the feeds for the respective specialised subsites such as Perl.com, ONLamp.com and XML.com. Part of this move was a motion to standardise on Atom 1.0 feeds (with cruft-free URIs, no less), offering other formats as a compatibility option.

Unfortunately, they chose to use type="xhtml". I say unfortunately even though xhtml is the best choice of type (as it allows the content payload to be a first-class citizen of the feed document), because the O’Reilly CMS is an XML-unaware legacy system. This means that every time one of their webloggers would forget to close a tag, the feed would become non-well-formed, leading to variously comical failures depending on the number of unclosed tags and the aggregator in question. This is the kind of situation for which type="html" was conceived: it allows tag soup to be transported in Atom as a double-encoded string, opaque to the XML parser that reads the feed document.
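To make the difference concrete, here is a minimal sketch in Python using only the standard library. The sample markup and the helper names (entry_xhtml, entry_html, is_well_formed) are made up for illustration and have nothing to do with how the O’Reilly CMS actually builds its feeds; the point is only that the same tag soup breaks the document when inlined as type="xhtml" and survives when escaped as type="html".

```python
import xml.etree.ElementTree as ET
from html import escape

ATOM = "http://www.w3.org/2005/Atom"
XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical weblog body with unclosed <p> and <em> tags.
tag_soup = "<p>First paragraph<p>Second, with an <em>unclosed tag"


def entry_xhtml(body: str) -> str:
    """Inline the body as real XML child elements (type="xhtml")."""
    return (
        f'<entry xmlns="{ATOM}"><content type="xhtml">'
        f'<div xmlns="{XHTML}">{body}</div></content></entry>'
    )


def entry_html(body: str) -> str:
    """Escape the body so it travels as an opaque string (type="html")."""
    return (
        f'<entry xmlns="{ATOM}"><content type="html">'
        f"{escape(body)}</content></entry>"
    )


def is_well_formed(document: str) -> bool:
    """Return True if an XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


def html_payload(document: str) -> str:
    """Recover the original markup string from a type="html" entry."""
    content = ET.fromstring(document).find(f"{{{ATOM}}}content")
    return content.text


if __name__ == "__main__":
    print(is_well_formed(entry_xhtml(tag_soup)))
    print(is_well_formed(entry_html(tag_soup)))
    print(html_payload(entry_html(tag_soup)) == tag_soup)
```

With type="html" the XML parser never sees the broken tags as markup, so the worst a sloppy post can do is look odd in the aggregator's HTML renderer; it cannot take the rest of the feed down with it.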

I emailed their support address on three or four occasions of breakage when the new feeds were being introduced, explaining each respective problem and how to solve it. Lately, I left comments on a number of weblog entries, letting the author in question know that their unclosed tags were breaking the feed. On each occasion, I asked that the team consider using type="html" if they could not ensure that the feed contain only well-formed XHTML; or alternatively, and preferably, consider implementing sanitisation of article HTML prior to inclusion in the feed.

Neither ever happened.

I don’t really understand why, because using type="html" would have been an acceptable choice and trivial to implement – after all, they’re already doing the same work for other formats. Sanitisation takes much more implementation effort, but the team seem intent on using type="xhtml", so the necessary determination might be there for that too.

At this point, I have given up hope either way. Last night, I went to the site and looked up the URIs to alternative formats. It turns out you just have to append ?format=rss1 to the address to receive the feeds as RSS 1.0; no need to care about XML parser errors and posts which swallow the content of following posts any more.

Except that now I see this in my aggregator:

â€œWe strongly support Microsoftâ€™

In the source for that entry I see this:

â€œWe strongly support Microsoftâ€™

The same “UTF-8 as ISO-8859-1 as UTF-8” double-encoding problem is apparent in the RSS 2.0 feed as well, so there’s no horse left to change to.
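For the curious, the garbage is easy to reproduce. The following is a rough Python sketch, not a description of what the O’Reilly backend actually does: it encodes the text as UTF-8 and then misreads those bytes as a single-byte encoding. I am assuming Windows-1252 for the middle step, since that is how software commonly treats content labelled ISO-8859-1 and it is what produces the visible € and œ characters; the function names are my own.

```python
def double_encode(text: str) -> str:
    """Encode as UTF-8, then misread the bytes as Windows-1252.

    Assumption: cp1252 stands in for the "ISO-8859-1" step, because
    that is the variant that yields printable characters such as the
    euro sign for bytes in the 0x80-0x9F range.
    """
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Undo one round of the misreading above.

    Raises UnicodeError if the input was not produced by that
    particular mistake.
    """
    return garbled.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    original = "\u201cWe strongly support Microsoft\u2019"
    garbled = double_encode(original)
    print(garbled)
    print(repair(garbled) == original)
```

Each curly quote is three bytes in UTF-8, and each of those bytes gets turned into its own character, which is why one quotation mark becomes three characters of junk. The repair only works on the consumer's side if you know exactly which wrong turn was taken, which is not something an aggregator should have to guess.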

Sigh. You know what I want for Christmas? Markup Barbie. You pull a string and she says “XML is tough.”