All XML, all the time
There has been a lot of noise over the HTML vs. XHTML issue. Big names are saying that the latter offers nothing over the former, that HTML is no less semantic than XHTML (which is correct), and so on. There is loud spec thumping over doctypes and MIME types, and over the rule that you should only serve XHTML as application/xhtml+xml (indeed you should). Lachlan Hunt says HTML is dead. Anne van Kesteren, for whom I generally have the utmost respect, loudly advises people to stick to HTML 4.01 Strict.
A chief argument of XHTML opponents like van Kesteren (besides the fact that IE does not support it at all) is that draconian error handling in XML will break your page, and make your users suffer for it, if you make even the slightest mistake. Well, I don’t care. This complaint is just another legacy of HTML that we’ve gotten used to: our toolchains tend to be of the “glue strings together” variety. (We get fancy and call these things “templates.”)
Which is utterly, utterly broken.
The tools should be XML all the way, from bottom to top. The right way to build something like a CMS is to never stick user input straight into the output. Input should either come in a form such as wiki markup, which can always, always be translated to valid XHTML, or, if it’s tag soup, be corrected before it is stored.
There should never be any markup-related part in your publishing toolchain just gluing strings together. Ever.
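To make the contrast with string gluing concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The render_comment function and the comment markup it produces are hypothetical, made up for illustration; the point is only that the output is assembled as a tree and serialized once, so user input cannot break well-formedness.

```python
import xml.etree.ElementTree as ET


def render_comment(author: str, text: str) -> str:
    """Build a comment fragment as a tree, then serialize it in one step.

    Hypothetical helper: the element names and the "comment" class are
    illustrative assumptions, not part of any particular CMS.
    """
    div = ET.Element("div", {"class": "comment"})
    cite = ET.SubElement(div, "cite")
    cite.text = author  # stored as text, never parsed as markup
    para = ET.SubElement(div, "p")
    para.text = text
    # The serializer escapes <, > and & in text nodes for us.
    return ET.tostring(div, encoding="unicode")


# Input that would wreck a glued-together template is harmless here.
fragment = render_comment("Bob <b>", "1 < 2 & 3")

# The result parses back as XML, and the original text survives intact.
roundtrip = ET.fromstring(fragment)
print(roundtrip.find("p").text)
```

A real page would also need the XHTML namespace and a full document around the fragment; that is left out here to keep the sketch short.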
Anyone who says otherwise is just noise in the background.
Regardless of what you do, it’s not necessary to teach the users to write valid markup. If you let them enter markup, because they’re used to it, fix it for them and offer a forced preview of the result. It’s not a hard problem. In fact, this solution is so painfully obvious that I have a hard time listening to much of the pitter-patter surrounding a discussion we’re only having because we’ve gotten used to utterly inadequate tools.
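As a rough illustration of "fix it for them", the sketch below repairs tag soup with nothing but Python's standard library: html.parser reads whatever the user typed, and the result is rebuilt as an ElementTree tree, which can only serialize to well-formed markup. The SoupFixer class, the fix_markup function, and the ALLOWED tag list are my own assumptions for this example; a production tool would want a proper HTML5 parser and a more careful whitelist.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical whitelist; anything else is dropped but its text is kept.
ALLOWED = {"p", "em", "strong", "a", "ul", "ol", "li", "blockquote", "code", "br"}
VOID = {"br"}  # elements that never have content


class SoupFixer(HTMLParser):
    """Rebuild arbitrary user markup as a well-formed element tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("div")
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return
        # Keep only href on links; drop every other attribute.
        attrib = {k: v or "" for k, v in attrs if tag == "a" and k == "href"}
        element = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close everything up to the matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text after a child element goes in its tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def fix_markup(soup: str) -> str:
    fixer = SoupFixer()
    fixer.feed(soup)
    fixer.close()
    return ET.tostring(fixer.root, encoding="unicode")


# Unclosed <em>, unknown <b>, bare ampersand: all repaired on the way in.
cleaned = fix_markup("<p>Hello <b>world <em>again</p> & more")
print(cleaned)
```

The cleaned string is what gets stored and what the forced preview shows, so the user sees exactly what will be published and never has to learn what well-formed means.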