The Gilbane Report: Volume 10, Number 8
October 2002
The Role of XML in Content Management
The management of semi-structured or unstructured data has always depended on markup
languages. Before the Web it was SGML or proprietary markup languages; now it
is XML. This dependence was mutual: in practice, unstructured information management
(mainly for publishing) was the only use of markup. However,
the wild success of XML is due to its acceptance as a way to encode and share
all kinds of structured and unstructured data, including
code. Ironically, advocating XML for content or document management was actually
disparaged by many early XML evangelists because they were afraid XML would
be seen as being limited to publishing-oriented applications. This was in spite
of the fact that, while most XML development was targeting application integration,
most deployment was for content applications.
Today, you wouldn't implement a content management solution without thinking very carefully
about what role XML should play. Should it be used for content, for metadata,
for application integration, for information integration? Where in the create/manage/deliver
cycle should it be used? Where do Web Services fit in? What about WebDAV? Contributor
Lauren Wood returns this month with a look at how businesses are actually using
XML in content management implementations, and how they view XML's role in the
future. Lauren's report will provide you with an outline to help you organize
your thoughts about the role XML should play in your content management implementation.
Executive Summary
XML is an extremely flexible technology that can fulfill several roles in any software
application. Content management is no exception to this. This survey discusses
some of the roles that XML can play in a content management system (CMS) and
whether there is much industry support or customer demand for such support.
On speaking with several people representing companies and customers in this area,
I found that there is increasing demand and good support for XML for content;
less support or demand for XML metadata; increasing support but little
current demand for Web Services; and some support but more demand for WebDAV.
Introduction
One of the interesting things about XML is that the principles behind it are so
simple that it can be used in many different ways. If we ignore all the other
specifications and concentrate for a moment on the simple XML 1.0 specification (1),
we see that what XML does is give us
a way of labeling information. This labeled information is relatively easy to
process, and is readable by humans (depending on the choice of the labels).
Most of the 30-page specification is taken up with defining the syntax to make
these two important facets possible (along with allowing for graphics, internationalization,
and robust error-handling). Since XML can be used for so many different things,
it's not surprising that it is used in many different roles in the content management
world as well.
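As a small illustration of the labeling idea, the following Python sketch parses a made-up document with the standard library's xml.etree.ElementTree module. The element and attribute names (`article`, `title`, `author`, `lang`) are invented for this example and do not come from any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the labels (element names) are our own invention.
DOC = """<article lang="en">
  <title>The Role of XML in Content Management</title>
  <author>Lauren Wood</author>
  <para>XML gives us a way of labeling information.</para>
</article>"""


def read_article(xml_text):
    """Return the labeled pieces of the document as a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "lang": root.get("lang"),
        "title": root.findtext("title"),
        "author": root.findtext("author"),
        "paragraphs": [p.text for p in root.findall("para")],
    }


article = read_article(DOC)
print(article["title"], "by", article["author"])
```

The labels are readable to a person looking at the raw text, and a few lines of code are enough for a program to pick out each labeled piece.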
This article will talk about content management or CMS (content management systems)
and include all the variations that are appropriate, such as information management,
knowledge management, or document management. Yes, these are all different.
In terms of where XML can be used, however, they are similar enough to justify
lumping them all together under one label.
XML use in a CMS can be divided into two main categories:
- XML for content (including metadata)
- XML for plumbing (including Web Services)
Several people in the CMS business were interviewed for this article and I asked them
about the current use of XML in both of these categories. The results were interesting
and show that XML is starting to push past the hype into the mainstream. Opinions
varied wildly as to how widely XML will be used in the near future, and for
what; the synthesis presented in this article is my own and should not be attributed
to any of the people I spoke to.
XML for Content
The origins of XML are well known: it is a streamlined version of SGML (Standard
Generalized Markup Language). SGML was particularly well suited to being used
for technical documentation and publishing. The concepts that led to SGML being
used for hard documentation problems are still present in XML. For example,
the airline industry developed methods for coping with the fact that every airplane
is individual and needs a maintenance manual that includes all the work that
has been carried out on that particular plane. Such methods require a sophisticated
view of the documents incorporating relevant metadata (which airplane it is)
and content (what needs to be done), as well as information as to workflow (due
date or time the job needs to be done by, and which team does the maintenance)
and integration to other systems (who gets billed for the work). XML is the
only common content format that readily allows for such sophistication.
Obviously, such sophistication is not needed for every document or for every company. But
even smaller companies with less extreme needs still want to be able to repurpose
content for print, web, or other formats and many are turning to XML for this.
This content needs to be managed and so the demand is rising from customers
for a CMS that can handle XML well enough for their needs.
Usually customer requirements are a mixture of three basic needs: reusing content, repurposing
content, and keeping their content independent of the applications used to create
and manage it.
- Content is reused when it appears in more than one context.
A common example is a copyright statement that may appear in hundreds of separate
documents. If the statement is updated, the change will immediately appear
in each of the documents that contain the copyright.
- Repurposing content means delivering that content in more than one format or
medium. The most common repurposing need is to deliver information in both
HTML and print (often PDF).
- Application independence has a number of different implications. Most often,
it means that an organization will not be locked into a particular vendor.
In addition, different departments within the same enterprise can adopt an
XML model even though they may have differing systems in place.
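Reuse and repurposing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `<include ref="..."/>` element, the shared-chunk dictionary, and the document content are all invented here, and a real CMS would handle chunk storage and rendering in its own way.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical shared chunk, stored once and referenced by many documents.
SHARED = {"copyright": "Copyright 2002 Example Corp. All rights reserved."}

# Hypothetical source document; <include ref="..."/> is an invented convention.
SOURCE = """<doc>
  <title>Router Setup Guide</title>
  <para>Connect the power cable before the network cable.</para>
  <include ref="copyright"/>
</doc>"""


def resolve(xml_text, shared):
    """Reuse: replace each <include> with the text of the shared chunk."""
    root = ET.fromstring(xml_text)
    for node in list(root.iter("include")):
        node.tag = "para"
        node.text = shared[node.attrib.pop("ref")]
    return root


def to_html(root):
    """Repurposing, format one: a simple HTML rendering."""
    parts = ["<h1>%s</h1>" % escape(root.findtext("title"))]
    parts += ["<p>%s</p>" % escape(p.text) for p in root.findall("para")]
    return "\n".join(parts)


def to_text(root):
    """Repurposing, format two: plain text, e.g. as input to a print process."""
    title = root.findtext("title")
    lines = [title, "=" * len(title)]
    lines += [p.text for p in root.findall("para")]
    return "\n".join(lines)


resolved = resolve(SOURCE, SHARED)
print(to_html(resolved))
print(to_text(resolved))
```

Updating the single entry in `SHARED` changes every document that includes it, and neither output function depends on the tool that was used to write the source.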
In the current economic climate, companies are being much more careful about where
they put their money, and much more cognizant of the need for a technology strategy
plan. This means they will probably implement systems that better suit their
needs. There appears to be an upturn with companies doing feasibility studies
and pilot projects, ready for implementing in the next 6-12 months. And many
of these projects will be using XML for content. Not many of these projects
are in the large enterprise content management space; we're seeing more departmental
projects, or projects for particular types of documents. HP exemplifies the
type of company implementing the latter: HP has many different product groups
that all produce documents for technical support or product catalogs. It makes
sense to use one strategy and one type of system for all of those documents,
no matter which department produces them. True enterprise-wide content management
is still some time off, though there are some Fortune 200 companies looking
at centralizing their information flows to allow for enterprise-wide access.
XML for content is often thought of principally as a technology used in publishing.
The traditional publishing industry that started with SGML is moving to XML
because of the cheaper tools, often incurring some expense in moving their
content to obey the stricter XML syntax rules. In general, however, they understand
what XML is good for and have for some years. What is interesting now is that
many other industries are also moving to XML without having a background in
SGML. For example, Web content management systems that use XML content and then
transform on the server to HTML or PDF are increasingly popular with smaller
companies from a multitude of industries seeking an easier way to maintain their
web sites.
One of the biggest areas of growth for XML is e-learning. Demand for e-learning
is growing fast, and from multiple directions. Students at colleges and universities
are increasingly expecting material to be available online to supplement their
lectures. Adults are upgrading their qualifications in online courses, or expect
online support for those courses they take in evening school. And companies
are running training for their employees and their customers online to avoid
travel costs and disruptions.
Cisco uses XML for an e-learning system that they use for employees and for customers.
There are two major reasons for using XML.
- The engineer who knows how the new switch or router works only has to write it
all down once. The content can then be used to create derivative works, such
as for marketing materials, without having to go back to the engineer. Prior
to using XML, the engineer was a bottleneck, because everything had to be
authored by that person (which also meant s/he couldn't do anything else!).
- The content can be tailored to the needs of the person receiving the training.
Adults in a 4-day course with 10 years of experience have different needs
to college students who have 4 months to learn the same material, but have
no experience in the area.
The companies using XML together with a CMS range across the spectrum, from Fortune
200 to small. Companies are in publishing, in finance, in manufacturing. Consumer
products companies such as Kohler and Procter and Gamble are implementing XML
systems as part of their business processes, realizing that the documentation
related to what they are selling to consumers now has to be delivered in a variety
of formats, be accurate, and be timely. As companies increasingly sell into
markets outside of their home territory they are also finding they need to produce
different documents to go with those products. There may be different products
in different countries, different names, or different marketing approaches,
not to mention different languages! The recent announcement of significant XML
support in Microsoft Office 11, along with the XML support already available
in Corel WordPerfect and Sun StarOffice, shows that XML for content is reaching
the mainstream.
In fact, the number of customers using XML for content now has reached the stage
where any CMS vendor that hasn't already implemented XML support probably can't,
presumably because of some problem in their underlying technology. For the
customer shopping for a CMS, that means figuring out where XML will be used
in the business process, and what other data formats will be stored as well.
The days are past when a silo mentality of storing XML in an XML content store,
and other documents in some other store, made sense. This is no longer necessary.
All the larger CMS vendors support multiple data formats, including XML, so
you can store your Word files, XML files, and multi-media in the same facility.
Whether you should store everything in one repository, or whether you should
have multiple repositories, depends on a number of factors that no longer need
to have anything to do with the format the content is stored in. Over the last
year or so we've seen a lot of movement in this space, between the more traditional
CMS vendors adding XML support (sometimes you need to get an optional module
to get all the features, such as chunking), the relational database vendors
adding some degree of check-in and check-out, and the Web CMS vendors broadening
their format support to include XML and office document formats.
With all this competition, the prices are coming down and the vendors and consultants
are keen to get business. This makes it all the more important for companies
thinking of installing a new CMS, or updating an existing one, to know what
sort of content they wish to store. When they know that, then they can figure
out how much XML support they need, and they can shop that around to the vendors
and consultants. Knowing what you need is necessary; for example, one big variation
between products is in how efficient they are at finding and checking out a
chunk (portion) of an XML document when it's stored in the CMS. Some products
are much slower than others at finding or checking out the chunk when there
are many very small chunks; this should only worry those who need such high
granularity for their XML.
Authoring Content
So what's the biggest problem with XML content? Authoring it. The authoring tools
are becoming more capable and people are starting to figure out that the ease
of processing XML content can outweigh the pain of creating it, but there is
still some way to go. Since XML is so flexible, any XML authoring tool needs
to be configured to match the schema and should also be configured to match
the authors' needs and knowledge. This, in a sense, is the "last mile" issue
for the XML content industry. Frequently, the last issue considered in a well
thought out XML system is the content creation process. However, a lot of good
work and otherwise admirable effort can be undermined if the ease of use of
the system isn't carefully considered. Small changes to the data model and authoring
tool user interface or configuration can often produce dramatic improvements
in productivity and quality.
Metadata
Metadata is the connecting tissue for all CMSs. It tells the CMS what the content is,
who created it, who may read it, who may change it, where it fits in the workflow,
and what sorts of operations may be performed on it. Metadata can do more, however.
If the Semantic Web ever becomes reality (even if it never quite reaches the
grandiose dreams some people have) it will be because sufficient metadata has
been added to each bit of relevant content.
Metadata can be stored as XML, as indexes in a relational database, or in some CMS-specific
storage format. For some purposes, the format it is stored in is irrelevant.
Metadata that is more volatile than the underlying content, such as stage of
a workflow process, or date the item moved from one stage to the other, is often
stored outside of the XML. An XML format becomes useful in other scenarios,
such as integrating different systems, or if the metadata is complicated enough
to warrant storing it in a rich hierarchical format. In particular, a rich taxonomy
provides a way to navigate through content following different navigation paths.
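A minimal Python sketch of these ideas follows, assuming an invented record format: the element names, taxonomy paths, and document ids are all hypothetical. Stable descriptive metadata is held as XML, the volatile workflow stage is held outside it, and a small index lets the same documents be reached along different taxonomy paths.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical metadata records; element names and taxonomy paths are invented.
METADATA = """<records>
  <record id="doc-1">
    <creator>J. Smith</creator>
    <category>Products/Routers</category>
    <category>Audience/Engineers</category>
  </record>
  <record id="doc-2">
    <creator>A. Jones</creator>
    <category>Products/Routers</category>
    <category>Audience/Customers</category>
  </record>
</records>"""

# Volatile workflow state kept outside the XML (here, just a dictionary).
WORKFLOW_STAGE = {"doc-1": "review", "doc-2": "published"}


def build_index(xml_text):
    """Map every taxonomy node, including ancestors, to a set of document ids."""
    index = defaultdict(set)
    for record in ET.fromstring(xml_text).iter("record"):
        for category in record.findall("category"):
            steps = category.text.split("/")
            for depth in range(1, len(steps) + 1):
                index["/".join(steps[:depth])].add(record.get("id"))
    return index


index = build_index(METADATA)
print(sorted(index["Products/Routers"]))
print(sorted(index["Audience/Customers"]))
```

Moving a document from one workflow stage to the next only touches `WORKFLOW_STAGE`, so the XML metadata does not need to be rewritten each time.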
Since integration of different CMSs, passing around content complete with the metadata,
and the requirements for rich, hierarchically structured metadata are just starting
to become important for many people, the various metadata standards (in which
I include topic maps and RDF) have not yet experienced the updraft that XML
for content has. Metadata is the second layer of a complete content management
system and requires at least as much thought as the design of a document schema
for authoring in XML does. At this stage in the technology cycle, there isn't
yet the experience in metadata system design that there is in document modeling;
the best practices (which depend on the industry) are still being worked on.
Metadata is hard: Mark Hale estimates that to fully classify a single document
requires 60-90 minutes of human thought. Automatic metadata generation can help,
but it will be some time before it's satisfactory.
Thus metadata is another area where the customer requirements document must be fully
fleshed out. Is the metadata required simply for workflow and basic search?
Or will the content be passed around between divisions, or even between companies?
If the latter, an XML format may be the right answer. If so, is there an applicable
metadata standard or ontology that could be used?
XML for Plumbing
Web Services
Web Services has a hype factor that rivals that of XML a couple of years ago. The
number of articles proclaiming the virtues of XML, and the number of products
proudly claiming XML prowess have decreased, simply because XML is now mainstream
and all CMSs are expected to support it. The number of articles
about the virtues and problems of Web Services has increased to fill that void.
Web Services (2) is an example of XML plumbing.
The configuration files that determine how a piece of content is passed from
one system to another are written in XML (actually a subset of XML). Web Services
at the moment seem to be more hype than reality, but the economics of technology
are such that there's a good chance that Web Services will become a basic part
of systems infrastructure in the next two years or so. It will be used for passing
around information between systems and thus for integration. Web Services are
a relatively easy addition to most CMSs so there is push to implement from the
vendors as well as the analysts who are writing all those articles mentioned
in the paragraph above. The standards development isn't quite ready for prime time
yet; some of the important pieces such as security are still being worked on,
but the basic shape is emerging. There are still some technical hurdles
as well, such as the fact that SOAP only supports a subset of XML; various ways
to solve this problem are also being worked on.
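To make the plumbing a little more concrete, here is a minimal Python sketch that builds a SOAP 1.1 message of the kind one system might send to another to ask for a piece of content. The envelope namespace is the standard SOAP 1.1 one; the payload namespace (`urn:example:cms`), the `GetDocument` operation, and the document id are hypothetical and do not describe any vendor's actual interface.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
CMS_NS = "urn:example:cms"  # hypothetical namespace for the payload


def build_request(document_id):
    """Build a SOAP envelope asking a (hypothetical) CMS for one document."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element("{%s}Envelope" % SOAP_NS)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_NS)
    call = ET.SubElement(body, "{%s}GetDocument" % CMS_NS)
    ET.SubElement(call, "{%s}id" % CMS_NS).text = document_id
    return ET.tostring(envelope, encoding="unicode")


print(build_request("doc-42"))
```

The sketch only constructs the message; sending it over HTTP or another transport, and the security pieces mentioned above, are left out.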
Do customers really want Web Services? Some do, depending on their corporate tolerance
for risk or the technical vision of the person in the CTO office. I'm hearing
far more about companies looking at adding Web Services support to their technical
strategy over the next two years or so than wanting to add it immediately, though
some who enjoy being on the bleeding edge are implementing it already. A large
part of this planning is because companies need to integrate various systems.
At the moment many (mostly the larger companies) are using J2EE for integration
while many others (smaller to mid-size companies) are looking at migrating to
.Net from COM. Web Services will be an important part of both of these platforms
and so it makes sense to make sure components of an overall strategy, such as
the CMS, also support the appropriate methods for integration. Web Services
should enable integration between the J2EE and the .Net worlds; it remains to
be seen just how robust and with what performance this integration can be carried
out in the real world.
WebDAV (Web-based Distributed Authoring and Versioning)
Another piece of the puzzle that uses XML as plumbing, WebDAV is a relatively unknown
specification that enables lightweight content management. It functions as a
set of extensions to the web protocol HTTP (unlike Web Services, which can also
function via other protocols such as email). These extensions are defined using
XML. WebDAV (often called DAV for short) allows for basic CM functionality such
as locking and metadata assignment; versioning is still being developed. It
is not sufficient for a full-blown, all-the-bells-and-whistles CMS, but adequate
for a lot of smaller uses where all the features of a large, expensive CMS are
not needed. Once versioning has been added to WebDAV so that the basic check-in
and check-out is supported, it will do much of what small groups of people need.
There appears to be some customer demand for WebDAV support in various tools, such as
XML authoring tools, so that customers can implement their own basic CMS.
The larger CMS vendors are also implementing WebDAV (though the implementation
isn't always supported) to enable a basic level of automatic integration with
other tools without having to write special custom integrations for every tool
on the market that a customer might want to use with the CMS. For many vendors,
of course, there isn't the same level of urgency to implement WebDAV as they
already have integrations with their favored third-party tools using their own
methods. Customer demand seems to be having the desired effect, however.
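As a rough illustration of WebDAV's XML plumbing, the Python sketch below prepares a PROPFIND request, the WebDAV extension method for reading a resource's metadata (properties). The `DAV:` namespace and the `getlastmodified` and `getcontentlength` properties come from the WebDAV specification; the resource path is hypothetical, and the sketch only builds the request rather than sending it to a server.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"  # namespace used by the WebDAV specification


def propfind_request(path, properties):
    """Return (method, path, headers, body) for a WebDAV PROPFIND request."""
    ET.register_namespace("D", DAV_NS)
    root = ET.Element("{%s}propfind" % DAV_NS)
    prop = ET.SubElement(root, "{%s}prop" % DAV_NS)
    for name in properties:
        ET.SubElement(prop, "{%s}%s" % (DAV_NS, name))
    body = ET.tostring(root, encoding="unicode")
    # Depth: 0 asks about the resource itself, not its children.
    headers = {"Depth": "0", "Content-Type": 'text/xml; charset="utf-8"'}
    return "PROPFIND", path, headers, body


# Hypothetical resource path, for illustration only.
method, path, headers, body = propfind_request(
    "/docs/manual.xml", ["getlastmodified", "getcontentlength"]
)
print(method, path)
print(body)
```

The method name is an extension to HTTP, while the question being asked (which properties to return) travels as a small XML document in the request body.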
More Information
Many of the topics discussed in this article will be discussed in much more depth
at the forthcoming XML 2002 Conference and Exposition, to be held in Baltimore,
Maryland in the week of December 8-13. Many of the people I spoke to in researching
this article will be speaking at the conference on content management, metadata,
and Web Services. There are also Town Hall meetings on these topics that give
a forum for in-depth questions and discussions. The exhibit space includes many
CMS vendors who will be showing their XML support. More information is at http://www.xmlconference.org.
(Note that full Gilbane Report subscribers save $300 off the cost
of a Conference Gold Pass. Log in to the Gilbane subscribers
section at www.gilbane.com to get the discount priority
code to use on the registration form. Discounts cannot be combined. ed.)
Acknowledgements
I would like to thank everyone who spent time talking to me about the role XML
plays in the content management world. I very much appreciate the input and
the insights that they gave me. I spoke with Brian Buehling, Dakota Systems;
Chris Wolff, Thomson; Jay di Silvestri (3),
Corel; Jay Todtenbier, Cisco; Jon Parsons and Rich Pasewark, XyEnterprise; Lubor
Ptacek, Documentum; Mark Hale, Interwoven; Mike Champion, Software AG; Ron Daniel,
Taxonomy Strategies; Sebastian Holst, Artesia; Todd Price, Stellent.
Lauren Wood
Lauren@textuality.com
(1) Found at http://www.w3.org/TR/REC-xml
(2) We are talking about Web Services based on the W3C standards (SOAP etc.). Sometimes the term is used in a much broader way.
(3) Thanks also to Jay for proofreading.