Home Baked XML Content Management - Part 1

by Balazs Fejes

Abstract

This article will describe a fairly straightforward way to create and publish articles in XML - and you get to download free software!

Why do I need this?

I've been using CityDesk Free Edition to manage the contents of this web site. One of the reasons I selected it instead of a custom-written tool is that I really wanted to focus on the content of the site first. If I started to write my own tool, I would certainly spend more time tinkering with its source code than using it. There's a catch with CityDesk though: the Free Edition only allows you to manage and publish 50 items. That sounds like enough for a small site, but once you start to add images, style sheets, or, in my case, included HTML files within each article, 4-5 articles are enough to reach the limit. Even if the tool were perfect, the price of the full edition, 300 dollars, is way too much. I've looked at several similar desktop content management tools, but none of them came close to what I need, or to what CityDesk can provide. I've also found several annoyances in CityDesk, so I decided to give in to the itching programmer in me, but with a very strong focus: I should try to reuse existing tools, and I should only work on tasks that cannot be carried out efficiently with pre-existing tools.

Basic Requirements

The three articles I've done with CityDesk gave me enough information to realize what my requirements are.

CM Requirements

  • The look and feel, navigation, header, footer, and similar elements must be separated from the article contents, so that I can quickly regenerate the whole site if I want to change the design, navigation, or structure. I am a terrible HTML designer; it will take a while (and a lot of help) to get the site into a shape I will be happy with. The only way to do this is to separate the formatting from the articles.

  • I should be able to work off-line, on multiple computers, so I have to be able to put the site contents under revision control. One thing I've realized with CityDesk is that it's NOT good that it uses a single binary file (an Access database) for all the site contents, because I have no chance to work on multiple PCs, even on different articles. If I want to merge my changes, I have to do it manually, even if I worked on completely different artifacts.

  • The Content Editor must have a spellchecker included, which can ignore the markup - it'd be great to have a grammar checker as well...

The Content Model

While writing my first couple of articles, I realized that HTML markup is simply not quite right for this kind of document. I spent a lot of time and effort on getting the layout right for figures, illustrations, and tables within the article itself. I wanted to define a section title, maybe a table of contents, code sample sections, figures, illustrations, or a list of items, but HTML is really there to define generic web page elements. The markup is very similar, but just different enough to cause inconvenience. I decided to write my new articles in a markup that is closer to my intentions for structured articles. Instead of rolling my own schema, I looked into the available specifications.

I knew about DocBook, which is a specification and a set of DTD files. It can be used to produce (primarily technical) documentation, but I really did not want to work with such a heavy markup. I wanted to concentrate on getting the article done, instead of getting my head around the 400 or so elements defined in the latest DocBook standard. Luckily for me, other people realized that most authors only need a subset. OASIS maintains the Simplified DocBook standard. Its main aim is to provide a simplified alternative with a vocabulary about the size of HTML's, which has proven to be a "manageable" size for a specification. Its root element is <article>, instead of the complex book-based structure of the full DocBook specification.

One of the additional benefits of using an existing specification is that I already have the XSLT scripts to generate HTML out of my articles; I just need to customize the output. I'd have to do it from scratch for a custom schema.

Pretty mature XSLT scripts are available to transform DocBook documents (and, since it is only a subset, Simplified DocBook documents) into HTML, XHTML, or PDF. PDF output also sounds very exciting; it would definitely be useful to provide downloadable, printable versions of my articles. The transformation scripts are highly parameter driven, so there's a good chance I will be able to achieve my intended output with them.

Here's a Simplified DocBook document:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V4.1.2.5//EN"
   "http://www.oasis-open.org/docbook/xml/simple/4.1.2.5/sdocbook.dtd">
<article>
  <title>A Short Example</title>
  <section>
    <title>Section #1</title>
    <para>A short example of a Simplified DocBook file.</para>
  </section>
</article>
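
Because the format is plain XML, even a few lines of script can pull the structure back out of an article. The following is only a sketch, in Python, which is my choice for illustration and not part of the toolset described in this article. The function name outline is hypothetical. The standard-library parser used here checks well-formedness only; it does not validate the document against the Simplified DocBook DTD.

```python
import xml.etree.ElementTree as ET

# The sample article from above, as bytes so the encoding declaration is honoured.
SAMPLE = b"""<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V4.1.2.5//EN"
   "http://www.oasis-open.org/docbook/xml/simple/4.1.2.5/sdocbook.dtd">
<article><title>A Short Example</title>
  <section><title>Section #1</title>
    <para>A short example of a Simplified DocBook file.</para>
  </section>
</article>
"""


def outline(xml_bytes):
    """Return (article title, list of section titles) for an <article> document.

    Hypothetical helper: it only parses the document (well-formedness check),
    it does NOT validate against the DTD.
    """
    root = ET.fromstring(xml_bytes)
    if root.tag != "article":
        raise ValueError("expected <article> as the root element")
    article_title = root.findtext("title", default="")
    # iter() walks nested sections too, in document order.
    section_titles = [s.findtext("title", default="") for s in root.iter("section")]
    return article_title, section_titles


title, sections = outline(SAMPLE)
print(title)
print(sections)
```

Something along these lines could later feed a table of contents or an index page, but a validating editor remains the right place to catch DTD errors.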
    

Site Model

One DocBook-based document and a set of transformation scripts would certainly be enough to publish one web page. However, I wanted to produce a full web site from my articles. So my next task was to define a way to describe the whole web site, including the articles, the page templates with the navigational elements, and standard headers and footers. I looked into the site map definitions of some existing CM tools, and I found a DTD in the DocBook.sf.net Website project, which was enough for most of my needs. I will just need to add a few attributes of my own.

The Publishing Workflow

Here's an activity diagram that describes the basic workflows I have to carry out to publish the web site:

Figure 1. Publishing workflow activity diagram

The diagram above describes the following workflow:

  • First I need a content editor to write the article. I also need a mechanism to maintain a site map, which would describe the content pages, index pages, and additional artifacts like images and binary files - essentially the structure of the web site.

  • From these two artifact types (the articles themselves and the site map), I have to be able to generate HTML documents, containing the proper headers, links, etc.

  • The next task is to create a local version of the web site, with the same directory structure as the uploaded site will have. Since I don't want to upload all the artifacts every time I change something in one small file, I want to be able to quickly create a checksum for each file, and then I can compare that with the list of files and checksums on the already published web site.

  • After the successful upload of the changed files, I need to update the checksum list on the uploaded site as well.
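
The checksum comparison in the last two steps can be sketched roughly as follows. This is a minimal illustration in Python under my own assumptions, not the actual publishing scripts: SHA-256 as the checksum, a manifest held as a dictionary of site-relative paths, and the function names build_manifest and diff_manifests are all hypothetical. Fetching the published checksum list and uploading the files are left out.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(site_dir):
    """Map each file's site-relative path to the SHA-256 digest of its contents.

    Assumption: SHA-256 stands in for the unspecified "checksum".
    """
    root = Path(site_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def diff_manifests(local, published):
    """Return (paths to upload, paths to delete from the published site)."""
    # Upload anything that is new or whose digest differs from the published one.
    to_upload = sorted(p for p, digest in local.items() if published.get(p) != digest)
    # Anything published that no longer exists locally can be removed.
    to_delete = sorted(p for p in published if p not in local)
    return to_upload, to_delete


# Small demonstration on a throwaway directory standing in for the local site.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp)
    (site / "index.html").write_text("<h1>Home</h1>", encoding="utf-8")
    (site / "articles").mkdir()
    (site / "articles" / "cm1.html").write_text("<p>Part 1</p>", encoding="utf-8")
    local = build_manifest(site)

# Pretend the published site has a stale index.html and a page that is now gone.
published = {**local, "index.html": "0" * 64, "old.html": "0" * 64}

to_upload, to_delete = diff_manifests(local, published)
print(to_upload)  # ['index.html']
print(to_delete)  # ['old.html']
```

After a successful upload, the local manifest would simply replace the published checksum list, so the next run starts from the new state.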

From WYSIWYG to Emacs...

See? This is the problem! I've been playing around with specifications, schemas, and UML modeling, without actually working on the content. The whole point of getting content management sorted out is to let me write articles efficiently! So let's open up an editor, and write an article.

Theoretically I could open up any validating XML editor and get started on editing the Simplified DocBook XML document, but I am a visual type - it helps if I can see an approximation of how the text will look in the browser, without the XML tags cluttering up my screen. On the other hand, DocBook is not trivial to configure, and I need to be able to work with the tags, so a completely WYSIWYG editor would be of no use. I will need some help on what elements are available, based on the DTD. The editor should be free, as I think I'm spending enough already on this blogging habit :-).

I found out that Altova, the maker of XMLSPY, has just such a tool! It's called Authentic 2004, and it's free (as in free beer). It is essentially a WYSIWYG XML editor based on the XMLSPY engine. Every document must be based on an .sps file, which is basically a style sheet that describes how each element should be rendered in the editor. Unfortunately, this .sps file can only be created in the Stylesheet Designer tool, which comes with XMLSPY.

I downloaded a trial copy, and designed a custom SPS file based on the included DocBook sample. I created a form-based header, where the meta information about the article can be filled out quickly. I was able to define drop-down lists, for example for the language selection. One slightly frustrating thing was that the date picker UI element only works with XSD-based schemas, not with DTDs, so I was not able to use it in my form. I also slightly tweaked the existing formatting rules to remove the unnecessary elements, since I'm just working with a subset of the DocBook elements.

Figure 2. Stylesheet Designer

Stylesheet Designer screenshot

Once I had completed the SPS file, I was able to select my new style sheet within the Authentic editor. The editor is not a replacement for XMLSPY: for example, it will not allow you to edit the XML file in Text mode. It only displays the Browser view, and an "Authentic" view, which is basically the WYSIWYG editable layout. In this view, you can hide/unhide the tag markers. It took me a while to get comfortable using the palette and the other editor helpers, but this seems to be a great way to edit structured content.

Figure 3. Authentic view

Authentic screenshot

For a while I found the lack of a raw text view very annoying, but the "Show Large Markup" mode now seems to be almost as good. During the actual content writing, I'm constantly switching between the no-markup and markup-display modes. When I just edit paragraph text, I don't need to see the document structure and the tags, but when I want to add a new section or a figure, for example, I need to be able to click on specific tags to put the new elements in the proper place.

A note on Text View - I've realized that if I somehow create invalid markup, and then try to switch to Browser mode, Authentic will display the good old XMLSPY validation error, with the Text View open, so that I can resolve the validation problem. This could be used as a trick to get into Text mode...

Before fully committing myself to Authentic, I wanted to verify my concern that using a "normal" editor would be a less productive process. The most efficient solution seemed to be Emacs in NXML mode. I was able to find the RELAX NG schema for Simplified DocBook, which is needed to get the best out of NXML mode. Even though Emacs is definitely not a visual editor, with the NXML extension it has some advanced capabilities for my tasks, like instant as-you-type schema-based validation, tag completion, and syntax highlighting.

Figure 4. Emacs NXML mode

Emacs NXML mode screenshot

After the initial culture shock (I'm a Windows/GVIM guy), I found the editing very quick and comfortable. However, I kept switching back to Authentic to see how my article looks with the proper formatting, and without the XML tags. I guess I could configure Emacs to execute my XSL transformations for the HTML output, so that I could get a rendered preview on demand. I think I will investigate this further.

Anyway, after spending considerable time playing with some additional editors, my preferred method right now is to use Authentic to write the articles.

Rendering HTML

Now that I have the unformatted, but well-structured DocBook XML article, I need to transform it into HTML documents. I've been using the pre-built XSLT scripts from the docbook.sf.net project, but I needed to change them a bit to make the rendered pages look like the rest of my web pages. There are configurable parameters which gave me some control over the transformation process, but I definitely had to modify some small things within the scripts themselves, which is unfortunate, because I can't just update my copy if a new version comes out. There is some how-to documentation on customization, but the instructions seemed a bit complex to me. I think this is an area that I have to revisit at some point - maybe when there's a significant new release of the XSLT package...

This Was A Self-documenting Article

There are many things I still need to figure out, like how to implement the <iframe> based code snippets I used in the previous articles. In my next article, I will focus on the scripts which will build the site - I still need to migrate the existing articles to DocBook. I am having fun using this toolset! I certainly enjoy controlling the publishing process, rather than just relying on a closed-source tool. CityDesk is a superb product, and I can recommend it as a quick publishing solution, but for hands-on people, a DocBook-driven approach may be a good alternative.