The Gilbane Report: Volume 10, Number 8

The Role of XML in Content Management

October 2002


The management of semi-structured or unstructured data has always depended on markup languages. Before the Web it was SGML or proprietary markup languages; now it is XML. This dependence was mutual: in practice, unstructured information management (mainly for publishing) was the only use of markup. However, the wild success of XML is due to its acceptance as a way to encode and share all kinds of structured and unstructured data, including code. Ironically, advocating XML for content or document management was actually disparaged by many early XML evangelists because they were afraid XML would be seen as being limited to publishing-oriented applications. This was in spite of the fact that, while most XML development was targeting application integration, most deployment was for content applications.

 

Today, you wouldn't implement a content management solution without thinking very carefully about what role XML should play. Should it be used for content, for metadata, for application integration, for information integration? Where in the create/manage/deliver cycle should it be used? Where do Web Services fit in? What about WebDAV? Contributor Lauren Wood returns this month with a look at how businesses are actually using XML in content management implementations, and how they view XML's role in the future. Lauren's report will provide you with an outline to help you organize your thoughts about the role XML should play in your content management implementation.

 

 

Executive Summary

 

XML is an extremely flexible technology that can fulfill several roles in any software application. Content management is no exception to this. This survey discusses some of the roles that XML can play in a content management system (CMS) and whether there is much industry support or customer demand for such support.

 

On speaking with several people representing companies and customers in this area, I found that there is increasing demand and good support for XML for content; less support or demand for XML metadata support; increasing support but little current demand for Web Services, and some support but more demand for WebDAV.

 

Introduction

 

One of the interesting things about XML is that the principles behind it are so simple that it can be used in many different ways. If we ignore all the other specifications and concentrate for a moment on the simple XML 1.0 specification (1), we see that what XML does is give us a way of labeling information. This labeled information is relatively easy to process, and is readable by humans (depending on the choice of the labels). Most of the 30-page specification is taken up with defining the syntax to make these two important facets possible (along with allowing for graphics, internationalization, and robust error-handling). Since XML can be used for so many different things, it's not surprising that it is used in many different roles in the content management world as well.
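
To make the idea of "labeled information" concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The document and its element names (manual, title, author, step) are invented for illustration and do not come from any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element names are invented for illustration.
DOC = """<manual>
  <title>Router Installation Guide</title>
  <author>J. Smith</author>
  <step>Unpack the router.</step>
  <step>Connect the power cable.</step>
</manual>"""


def labeled_items(xml_text):
    """Return (label, text) pairs for each child of the root element."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]


if __name__ == "__main__":
    for label, text in labeled_items(DOC):
        print(f"{label}: {text}")
```

The same labels that a person can read in the source are what the program uses to pick out each piece of information, which is the pairing of easy processing and readability described above.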

 

This article will talk about content management or CMS (content management systems) and include all the variations that are appropriate, such as information management, knowledge management, or document management. Yes, these are all different. In terms of where XML can be used, however, they are similar enough to justify lumping them all together under one label.

 

XML use in a CMS can be divided into two main categories:

 

  • XML for content (including metadata)
  • XML for plumbing (including Web Services)

 

Several people in the CMS business were interviewed for this article and I asked them about the current use of XML in both of these categories. The results were interesting and show that XML is starting to push past the hype into the mainstream. Opinions varied wildly as to how widely XML will be used in the near future, and for what; the synthesis presented in this article is my own and should not be attributed to any of the people I spoke to.

 

XML for Content

 

The origins of XML are well known: it is a streamlined version of SGML (Standard Generalized Markup Language). SGML was particularly well suited to being used for technical documentation and publishing. The concepts that led to SGML being used for hard documentation problems are still present in XML. For example, the airline industry developed methods for coping with the fact that every airplane is individual and needs a maintenance manual that includes all the work that has been carried out on that particular plane. Such methods require a sophisticated view of the documents incorporating relevant metadata (which airplane it is) and content (what needs to be done), as well as information as to workflow (due date or time the job needs to be done by, and which team does the maintenance) and integration to other systems (who gets billed for the work). XML is the only common content format that readily allows for such sophistication.

 

Obviously, such sophistication is not needed for every document or for every company. But even smaller companies with less extreme needs still want to be able to repurpose content for print, web, or other formats and many are turning to XML for this. This content needs to be managed and so the demand is rising from customers for a CMS that can handle XML well enough for their needs.

 

Usually customer requirements are a mixture of three basic needs: reusing content, repurposing content, and keeping their content independent of the applications used to create and manage it.

 

  • Content is reused when it appears in more than one context. A common example is a copyright statement that may appear in hundreds of separate documents. If the statement is updated, the change will immediately appear in each of the documents that contain the copyright.
  • Repurposing content means delivering that content in more than one format or medium. The most common repurposing need is to deliver information in both HTML and print (often PDF).
  • Application independence has a number of different implications. Most often, it means that an organization will not be locked into a particular vendor. In addition, different departments within the same enterprise can adopt an XML model even though they may have differing systems in place.
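
The first two needs in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how any CMS does it: the include element, its ref attribute, the SHARED dictionary, and the sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# One shared fragment, maintained in a single place (reuse).
SHARED = {"copyright": "Copyright 2002 Example Corp. All rights reserved."}

# Two hypothetical documents that reference the fragment instead of copying it.
DOCS = {
    "guide": "<doc><title>User Guide</title><para>Plug it in.</para>"
             "<include ref='copyright'/></doc>",
    "faq": "<doc><title>FAQ</title><para>Is it waterproof? No.</para>"
           "<include ref='copyright'/></doc>",
}


def resolve(xml_text, shared):
    """Parse a document and turn each <include> into a paragraph of shared text."""
    root = ET.fromstring(xml_text)
    for node in list(root.iter("include")):
        node.tag = "para"
        node.text = shared[node.attrib.pop("ref")]
    return root


def to_html(root):
    """Repurpose the content as a simple HTML fragment."""
    parts = [f"<h1>{escape(root.findtext('title'))}</h1>"]
    parts += [f"<p>{escape(p.text)}</p>" for p in root.iter("para")]
    return "\n".join(parts)


def to_text(root):
    """Repurpose the same content as plain text."""
    title = root.findtext("title")
    lines = [title.upper(), "=" * len(title)]
    lines += [p.text for p in root.iter("para")]
    return "\n".join(lines)


if __name__ == "__main__":
    for xml_text in DOCS.values():
        print(to_html(resolve(xml_text, SHARED)))
        print(to_text(resolve(xml_text, SHARED)))
```

Changing the single entry in SHARED changes every document that references it, and neither output function needs to know which tool created the source XML.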

 

In the current economic climate, companies are being much more careful about where they put their money, and much more cognizant of the need for a technology strategy plan. This means they will probably implement systems that better suit their needs. There appears to be an upturn, with companies doing feasibility studies and pilot projects, ready for implementing in the next 6-12 months. And many of these projects will be using XML for content. Not many of these projects are in the large enterprise content management space; we're seeing more departmental projects, or projects for particular types of documents. HP exemplifies the type of company implementing the latter: HP has many different product groups that all produce documents for technical support or product catalogs. It makes sense to use one strategy and one type of system for all of those documents, no matter which department produces them. True enterprise-wide content management is still some time off, though there are some Fortune 200 companies looking at centralizing their information flows to allow for enterprise-wide access.

 

XML for content is often thought of principally as a technology used in publishing. The traditional publishing industry that started with SGML is moving to XML because of the cheaper tools, and often incurring some expense in moving their content to obey the stricter XML syntax rules. In general, however, they understand what XML is good for and have for some years. What is interesting now is that many other industries are also moving to XML without having a background in SGML. For example, Web content management systems that use XML content and then transform on the server to HTML or PDF are increasingly popular with smaller companies from a multitude of industries seeking an easier way to maintain their web sites.

 

One of the biggest areas of growth for XML is e-learning. Demand for e-learning is growing fast, and from multiple directions. Students at colleges and universities are increasingly expecting material to be available online to supplement their lectures. Adults are upgrading their qualifications in online courses, or expect online support for those courses they take in evening school. And companies are running training for their employees and their customers online to avoid travel costs and disruptions.

 

Cisco uses XML for an e-learning system that they use for employees and for customers. There are two major reasons for using XML.

 

  1. The engineer who knows how the new switch or router works only has to write it all down once. The content can then be used to create derivative works, such as marketing materials, without having to go back to the engineer. Prior to using XML, the engineer was a bottleneck, because everything had to be authored by that person (which also meant s/he couldn't do anything else!).
  2. The content can be tailored to the needs of the person receiving the training. Adults in a 4-day course with 10 years of experience have different needs to college students who have 4 months to learn the same material, but have no experience in the area.

 

The companies using XML together with a CMS range across the spectrum, from Fortune 200 to small. Companies are in publishing, in finance, in manufacturing. Consumer products companies such as Kohler and Procter & Gamble are implementing XML systems as part of their business processes, realizing that the documentation related to what they are selling to consumers now has to be delivered in a variety of formats, be accurate, and be timely. As companies increasingly sell into markets outside of their home territory, they are also finding they need to produce different documents to go with those products. There may be different products in different countries, different names, or different marketing approaches, not to mention different languages! The recent announcement of significant XML support in Microsoft Office 11, along with the XML support already available in Corel WordPerfect and Sun StarOffice, shows that XML for content is reaching the mainstream.

 

In fact, the number of customers using XML for content has now reached the stage where any CMS vendor that hasn't already implemented XML support probably can't, presumably because of some problem in their underlying technology. For the customer shopping for a CMS, that means figuring out where XML will be used in the business process, and what other data formats will be stored as well. The days are past when a silo mentality of storing XML in an XML content store, and other documents in some other store, made sense. This is no longer necessary. All the larger CMS vendors support multiple data formats, including XML, so you can store your Word files, XML files, and multimedia in the same facility. Whether you should store everything in one repository, or whether you should have multiple repositories, depends on a number of factors that no longer need to have anything to do with the format the content is stored in. Over the last year or so we've seen a lot of movement in this space, with the more traditional CMS vendors adding XML support (sometimes you need to get an optional module to get all the features, such as chunking), the relational database vendors adding some degree of check-in and check-out, and the Web CMS vendors broadening their format support to include XML and office document formats.

 

With all this competition, the prices are coming down and the vendors and consultants are keen to get business. This makes it all the more important for companies thinking of installing a new CMS, or updating an existing one, to know what sort of content they wish to store. When they know that, then they can figure out how much XML support they need, and they can shop that around to the vendors and consultants. Knowing what you need is necessary: for example, one big variation between products is in how efficient they are at finding and checking out a chunk (portion) of an XML document when it's stored in the CMS. Some products are much slower than others at finding or checking out the chunk when there are many very small chunks; this should only worry those who need such high granularity for their XML.
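
For readers unfamiliar with chunking, the following toy Python sketch shows what finding and checking out a chunk involves. It is an assumption-laden illustration, not any vendor's implementation: the ChunkStore class, the use of an id attribute to identify chunks, and the sample document are all hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


class ChunkStore:
    """Toy in-memory store that treats any element with an id as a chunk."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)
        self.locks = {}  # chunk id -> user holding the checkout

    def find(self, chunk_id):
        # Linear scan of the whole tree: cost grows with the number of chunks.
        for elem in self.root.iter():
            if elem.get("id") == chunk_id:
                return elem
        raise KeyError(chunk_id)

    def check_out(self, chunk_id, user):
        if chunk_id in self.locks:
            raise RuntimeError(f"{chunk_id} is checked out by {self.locks[chunk_id]}")
        chunk = copy.copy(self.find(chunk_id))
        chunk.tail = None  # text following the chunk belongs to its parent
        self.locks[chunk_id] = user
        return ET.tostring(chunk, encoding="unicode")

    def check_in(self, chunk_id, user, new_xml):
        if self.locks.get(chunk_id) != user:
            raise RuntimeError(f"{user} does not hold the checkout for {chunk_id}")
        old = self.find(chunk_id)
        new = ET.fromstring(new_xml)
        tail = old.tail
        old.clear()  # drops attributes, text, tail, and children
        old.tag, old.text, old.tail = new.tag, new.text, tail
        old.attrib.update(new.attrib)
        old.extend(list(new))
        del self.locks[chunk_id]


DOC = ('<manual><section id="s1"><para>Unpack.</para></section>'
       '<section id="s2"><para>Connect power.</para></section></manual>')
```

The find method here scans the entire document for every request, which is exactly the sort of design that becomes slow when a document contains many very small chunks; a product that indexes its chunks would not pay that cost.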

 

Authoring Content

So what's the biggest problem with XML content? Authoring it. The authoring tools are becoming more capable and people are starting to figure out that the ease of processing XML content can outweigh the pain of creating it, but there is still some way to go. Since XML is so flexible, any XML authoring tool needs to be configured to match the schema and should also be configured to match the author's needs and knowledge. This, in a sense, is the last-mile issue for the XML content industry. Frequently, the last issue considered in a well thought out XML system is the content creation process. However, a lot of good work and otherwise admirable effort can be undermined if the ease of use of the system isn't carefully considered. Small changes to the data model and authoring tool user interface or configuration can often produce dramatic improvements in productivity and quality.

 

Metadata

Metadata is the connecting tissue for all CMSs. It tells the CMS what the content is, who created it, who may read it, who may change it, where it fits in the workflow, and what sorts of operations may be performed on it. Metadata can do more, however. If the Semantic Web ever becomes reality (even if it never quite reaches the grandiose dreams some people have) it will be because sufficient metadata has been added to each bit of relevant content.

 

Metadata can be stored as XML, as indexes in a relational database, or in some CMS-specific storage format. For some purposes, the format it is stored in is irrelevant. Metadata that is more volatile than the underlying content, such as stage of a workflow process, or date the item moved from one stage to the other, is often stored outside of the XML. An XML format becomes useful in other scenarios, such as integrating different systems, or if the metadata is complicated enough to warrant storing it in a rich hierarchical format. In particular, a rich taxonomy provides a way to navigate through content following different navigation paths.
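
The split described above can be sketched in Python as follows. This is only an illustration: the taxonomy, topic, and doc element names, the document identifiers, and the WORKFLOW dictionary are all invented, and a real system would more likely follow an established metadata standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical hierarchical taxonomy stored as XML. A document may be
# filed under more than one topic, giving several navigation paths to it.
TAXONOMY = """<taxonomy>
  <topic name="products">
    <topic name="routers">
      <doc ref="install-guide"/>
      <doc ref="router-faq"/>
    </topic>
    <topic name="switches">
      <doc ref="switch-spec"/>
    </topic>
  </topic>
  <topic name="support">
    <doc ref="router-faq"/>
  </topic>
</taxonomy>"""

# Volatile workflow state is kept outside the XML, as it changes far
# more often than the classification does.
WORKFLOW = {
    "install-guide": "published",
    "router-faq": "in review",
    "switch-spec": "draft",
}


def docs_under(root, path):
    """Return the sorted ids of all documents at or below a topic path."""
    node = root
    for name in path:
        node = node.find(f"topic[@name='{name}']")
        if node is None:
            raise KeyError(name)
    return sorted({d.get("ref") for d in node.iter("doc")})


def status_report(root, path):
    """Combine the XML taxonomy with the externally stored workflow state."""
    return [(ref, WORKFLOW[ref]) for ref in docs_under(root, path)]


if __name__ == "__main__":
    root = ET.fromstring(TAXONOMY)
    print(status_report(root, ["products", "routers"]))
```

Here router-faq can be reached by way of either products/routers or support, which is the kind of multiple navigation path a rich taxonomy makes possible.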

 

Since integration of different CMSs, passing around content complete with the metadata, and the requirements for rich, hierarchically structured metadata are just starting to become important for many people, the various metadata standards (in which I include topic maps and RDF) have not yet experienced the updraft that XML for content has. Metadata is the second layer of a complete content management system and requires at least as much thought as the design of a document schema for authoring in XML does. At this stage in the technology cycle, there isn't yet the experience in metadata system design that there is in document modeling; the best practices (which depend on the industry) are still being worked on. Metadata is hard: Mark Hale estimates that to fully classify a single document requires 60-90 minutes of human thought. Automatic metadata generation can help, but it will be some time before it's satisfactory.

 

Thus metadata is another area where the customer requirements document must be fully fleshed out. Is the metadata required simply for workflow and basic search? Or will the content be passed around between divisions, or even between companies? If the latter, an XML format may be the right answer. If so, is there an applicable metadata standard or ontology that could be used?

 

XML for Plumbing

 

Web Services

Web Services has a hype factor that rivals that of XML a couple of years ago. The number of articles proclaiming the virtues of XML, and the number of products proudly claiming XML prowess have decreased, simply because XML is now mainstream and all CMSs are expected to support it. The number of articles about the virtues and problems of Web Services has increased to fill that void.

 

Web Services (2) is an example of XML plumbing. The configuration files that determine how a piece of content is passed from one system to another are written in XML (actually a subset of XML). Web Services at the moment seem to be more hype than reality, but the economics of technology are such that there's a good chance that Web Services will become a basic part of systems infrastructure in the next two years or so. They will be used for passing around information between systems and thus for integration. Web Services are a relatively easy addition to most CMSs, so there is a push to implement them from the vendors as well as from the analysts who are writing all those articles mentioned in the paragraph above. The standards development isn't quite ready for primetime yet; some of the important pieces such as security are still being worked on, but the basic shape is emerging. There are still some technical hurdles as well, such as the fact that SOAP only supports a subset of XML; various ways to solve this problem are also being worked on.
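
As a rough illustration of what passes between two systems, the Python sketch below builds and reads a SOAP 1.1 message using only the standard library. The envelope namespace is the one defined by SOAP 1.1; the GetDocument operation, the documentId element, and the urn:example:cms namespace are hypothetical and stand in for whatever interface a particular CMS might expose.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
CMS_NS = "urn:example:cms"  # hypothetical namespace for an imaginary CMS


def build_request(doc_id):
    """Sending side: wrap a document request in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{CMS_NS}}}GetDocument")
    ET.SubElement(request, f"{{{CMS_NS}}}documentId").text = doc_id
    return ET.tostring(envelope, encoding="unicode")


def read_request(message):
    """Receiving side: pull the requested document id out of the envelope."""
    root = ET.fromstring(message)
    node = root.find(
        f"{{{SOAP_NS}}}Body/{{{CMS_NS}}}GetDocument/{{{CMS_NS}}}documentId"
    )
    if node is None:
        raise ValueError("not a GetDocument request")
    return node.text


if __name__ == "__main__":
    message = build_request("install-guide")
    print(message)
    print(read_request(message))
```

Neither side needs to know what platform the other runs on; both only need to agree on the XML vocabulary inside the envelope, which is why this approach is attractive for integration.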

 

Do customers really want Web Services? Some do, depending on their corporate tolerance for risk or the technical vision of the person in the CTO office. I'm hearing far more about companies looking at adding Web Services support to their technical strategy over the next two years or so than wanting to add it immediately, though some who enjoy being on the bleeding edge are implementing it already. A large part of this planning is because companies need to integrate various systems. At the moment many (mostly the larger companies) are using J2EE for integration while many others (smaller to mid-size companies) are looking at migrating to .Net from COM. Web Services will be an important part of both of these platforms and so it makes sense to make sure components of an overall strategy, such as the CMS, also support the appropriate methods for integration. Web Services should enable integration between the J2EE and the .Net worlds; it remains to be seen just how robust and with what performance this integration can be carried out in the real world.

 

WebDAV (Web-based Distributed Authoring and Versioning)

Another piece of the puzzle that uses XML as plumbing, WebDAV is a relatively unknown specification that enables lightweight content management. It functions as a set of extensions to the web protocol HTTP (unlike Web Services, which can also function via other protocols such as email). These extensions are defined using XML. WebDAV (often called DAV for short) allows for basic CM functionality such as locking and metadata assignment; versioning is still being developed. It is not sufficient for a full-blown, all-the-bells-and-whistles CMS, but adequate for a lot of smaller uses where all the features of a large, expensive CMS are not needed. Once versioning has been added to WebDAV so that the basic check-in and check-out is supported, it will do much of what small groups of people need. There appears to be some customer demand for WebDAV support in various tools, such as XML authoring tools, so that customers can implement their own basic CMS. The larger CMS vendors are also implementing WebDAV (though the implementation isn't always supported) to enable a basic level of automatic integration with other tools without having to write special custom integrations for every tool on the market that a customer might want to use with the CMS. For many vendors, of course, there isn't the same level of urgency to implement WebDAV as they already have integrations with their favored third-party tools using their own methods. Customer demand seems to be having the desired effect, however.
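
To show what "extensions defined using XML" looks like in practice, the Python sketch below builds the XML bodies a client would send with two WebDAV methods: PROPFIND, which asks for metadata (properties), and LOCK, which requests an exclusive write lock. The element names and the DAV: namespace come from the WebDAV specification; the helper function names and the owner value are assumptions made for this example, and no request is actually sent.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"  # namespace used by the WebDAV specification


def propfind_body(*prop_names):
    """Build the body of a PROPFIND request asking for the named properties."""
    ET.register_namespace("D", DAV_NS)
    root = ET.Element(f"{{{DAV_NS}}}propfind")
    prop = ET.SubElement(root, f"{{{DAV_NS}}}prop")
    for name in prop_names:
        ET.SubElement(prop, f"{{{DAV_NS}}}{name}")
    return ET.tostring(root, encoding="unicode")


def lock_body(owner):
    """Build the body of a LOCK request for an exclusive write lock."""
    ET.register_namespace("D", DAV_NS)
    root = ET.Element(f"{{{DAV_NS}}}lockinfo")
    scope = ET.SubElement(root, f"{{{DAV_NS}}}lockscope")
    ET.SubElement(scope, f"{{{DAV_NS}}}exclusive")
    lock_type = ET.SubElement(root, f"{{{DAV_NS}}}locktype")
    ET.SubElement(lock_type, f"{{{DAV_NS}}}write")
    ET.SubElement(root, f"{{{DAV_NS}}}owner").text = owner
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(propfind_body("displayname", "getlastmodified"))
    print(lock_body("lauren"))
```

A client would send each body over ordinary HTTP with the corresponding method name, which is why an authoring tool that already speaks HTTP can add basic locking and metadata support with relatively little work.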

 

More Information

 

Many of the topics discussed in this article will be discussed in much more depth at the forthcoming XML 2002 Conference and Exposition, to be held in Baltimore, Maryland in the week of December 8-13. Many of the people I spoke to in researching this article will be speaking at the conference on content management, metadata, and Web Services. There are also Town Hall meetings on these topics that give a forum for in-depth questions and discussions. The exhibit space includes many CMS vendors who will be showing their XML support. More information is at http://www.xmlconference.org .

 

(Note that full Gilbane Report subscribers save $300 off the cost of a Conference Gold Pass. Log in to the Gilbane subscribers section at www.gilbane.com to get the discount priority code to use on the registration form. Discounts cannot be combined. Ed.)

 

Acknowledgements

 

I would like to thank everyone who spent time talking to me about the role XML plays in the content management world. I very much appreciate the input and the insights that they gave me. I spoke with Brian Buehling, Dakota Systems; Chris Wolff, Thomson; Jay di Silvestri (3), Corel; Jay Todtenbier, Cisco; Jon Parsons and Rich Pasewark, XyEnterprise; Lubor Ptacek, Documentum; Mark Hale, Interwoven; Mike Champion, Software AG; Ron Daniel, Taxonomy Strategies; Sebastian Holst, Artesia; Todd Price, Stellent.

Lauren Wood
Lauren@textuality.com

 

(1) Found at http://www.w3.org/TR/REC-xml

(2) We are talking about Web Services based on the W3C standards (SOAP etc.). Sometimes the term is used in a much broader way.

(3) Thanks also to Jay for proofreading.


The Gilbane Report is published by Gilbane Group, Inc. © 1993 - 2005 The Gilbane Report. All Rights Reserved.