We've published a new paper on addressing large-scale integration, storage, and access of complex information. As Dale mentions in his entry over on our main blog, the paper frames the discussion in terms of challenges to Open Government initiatives. We note, though, that the exploration of obstacles to effective, efficient processing of high volumes of data and content is relevant across many industries.

We're cross-posting here on the XML blog because the paper deals with XML content and the XML family of standards, including XQuery and XPath.

The Gilbane Beacon is available as a free download from Gilbane and from Mark Logic, sponsor of the paper.

The growth in web-centric communication has created a major focus on content management, web content management, component content management, and so on. This interest is driven primarily by increasing demand for rich, interactive, accessible information products delivered via the Web. The focus is not misplaced but may be missing part of the point. To be specific, in our focus on the "management" part of CM, we may be missing the first word in the phrase: "Content."

It's true that the application of increasing amounts of computer and brain power to the processes associated with preparing and delivering the kind of information demanded by today's users can improve those products. But it does so within limits set by, and at costs generated by, the content "raw material" it gets from the content providers. In many cases, the content available to web product development processes is so structurally crude that it requires major clean-up and enhancement in order to adequately participate in the classification and delivery process. As the focus on elegant Web delivery increases, barring real changes in the condition of this raw content, the cost of enhancement is likely to grow proportionally, straining the involved organizations' ability to support it.

The answer may be in an increased focus on the processes and tools used to create the original content. We know that the original creator of most content knows the most about how it should be logically structured and the most about the best way to classify it for search and retrieval. Trouble is, in most cases, we provide no means of capturing what the creator knows about his or her intellectual product. Moreover, because many creators have never been able to fully populate the metadata needed to classify and deliver their content, in past eras, professional catalogers were employed to complete this final step. In today's world, however, we have virtually eliminated the cataloger, assuming instead that the prodigious computer power available to us could develop the needed classification and structure from the content itself. That approach can and does work, but it will require better raw material if it is to achieve the level of effectiveness needed to keep the Web from becoming a virtual haystack in which finding the needle is more good luck than good measure. Native XML editors in place of today's visually oriented word processors; spreadsheets, graphics, and other media forms with content-specific XML under them; increased use of native XML databases; and a host of rich content-centric resources are all part of this content evolution.

Most important, however, may be promulgation of the realization across society that creating content includes more than just making it look good on the screen, and that the creator shares in that responsibility. This won't be an easy or quick process, requiring more likely generations than years, but if we don't begin soon, we may end up with a Web 3 or 4 or 5.0 trying to deliver content that isn't even yet 1.0.

As the world of search becomes more and more sophisticated (and that process has been underway for decades), we may be approaching the limits of how far software can improve at finding what a searcher wants. If that is true, and I suspect that it is, we will finally be forced to follow the trail of crumbs up the content life cycle... to its source. Indeed, most of the challenges inherent in today's search strategies and products appear to grow from the fact that while we continually increase our demands for intelligence on the back end, we have done little if anything to address the chaos that exists on the front end. You name it: different word processing formats, spreadsheets, HTML-tagged text, delimited database files, and so on are all dumped into what we think of as a coherent, easily searchable body of intellectual property. It isn't, and it isn't likely to become so any time soon unless we address the source. Having spent some time in the library automation world, I can remember the sometimes bitter controversies over having just two major foundations for cataloging source material (Dewey and LC; add a third if you include the NICEM A/V scheme). Had we known back then that the process of finding intellectual property would devolve into the chaos we now confront, with every search engine and database product essentially rolling its own approach to rational search, we would have considered ourselves blessed. In the end, it seems, we must begin to see the source material, its physical formats, its logical organization, and its inclusion of rational cataloging and taxonomy elements as the conceptual raw material for its own location.

As long as the word processing world teaches that anyone creating anything can make it look like it should in a dozen different ways, ignoring any semblance of finding-aid inclusion, we probably won't have a truly workable ability to find what we want without reworking the content or wading through a haystack of misses to find our desired hits. Unfortunately, the solutions of yesteryear, including after-creation cataloging by a professional cataloger, probably won't work now either, for cost if no other reason. We will be forced to approach the creators of valuable content, asking them for a minimum of preparation for searching their product, and providing the necessary software tools to make that possible. We can't act too soon because, despite the growth of software elegance and raw computer power, this situation will likely get worse as the sheer volume of valuable content grows.

Regards, Barry

Read more: Enterprise Search Practice Blog: http://gilbane.com/search_blog/

What is Smart Content?

At Gilbane we talk of "Smart Content," "Structured Content," and "Unstructured Content." We will be discussing these ideas in a seminar entitled "Managing Smart Content" at the Gilbane Conference next week in Boston. Below I share some ideas about these types of content and what they enable and require in terms of processes and systems.

When you add meaning to content you make it "smart" enough for computers to do some interesting things. Organizing, searching, processing, and discovery are greatly improved, which also increases the value of the data. Structured content allows some, but fewer, processes to be automated or simplified, and unstructured content enables very little to be streamlined and requires the most ongoing human intervention.

Most content is not very smart. In fact, most content is unstructured and usually more difficult to process automatically. Think flat text files, HTML without all the end tags, etc. Unstructured content is more difficult for computers to interpret and understand than structured content due to incompleteness and ambiguity inherent in the content. Unstructured content usually requires humans to decipher the structure and the meaning, or even to apply formatting for display rendering.

The next level up toward smart content is structured content. This includes well-formed XML documents, content compliant with a schema, or even RDBMS databases. Some of the intelligence is included in the content, such as the boundaries of each element (or field) being clearly demarcated, and element names that mean something to the users and systems that consume the information. Automatic processing of structured content includes reorganizing, breaking into components, rendering for print or display, and other processes streamlined by the structured content data models in use.
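To make the difference between unstructured and structured content concrete, here is a minimal Python sketch using only the standard library. The sample strings and the helper names `is_well_formed` and `field` are invented for illustration; they are not from any Gilbane tool or report.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical samples, made up for this illustration.
UNSTRUCTURED = "<p>Acme Widget<br>Price: 19.99"  # HTML without all the end tags
STRUCTURED = "<product><name>Acme Widget</name><price>19.99</price></product>"


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def field(xml_text: str, name: str) -> Optional[str]:
    """Return the text of the first child element called `name`, if any."""
    return ET.fromstring(xml_text).findtext(name)


if __name__ == "__main__":
    print(is_well_formed(UNSTRUCTURED))
    print(is_well_formed(STRUCTURED))
    print(field(STRUCTURED, "price"))
```

Because the element boundaries are explicit in the structured sample, a program can pull out the price without guessing; the unstructured sample would need a person (or heuristics) to work out where the price starts and ends.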

Finally, smart content is structured content that also includes the semantic meaning of the information. The semantics can be in a variety of forms, such as RDFa attributes applied to structured elements, or even semantically named elements. However it is done, the meaning is available to both humans and computers to process.
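As a rough illustration of semantics carried in the content, the Python sketch below reads RDFa-style `property` attributes from a small sample document. The sample markup, the author name, and the `extract_properties` helper are all invented for this example, and the code is not a conforming RDFa processor; it only shows that the meaning travels with the content.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: structured markup annotated with RDFa-style
# `property` attributes against the Dublin Core terms vocabulary.
SMART = (
    '<article vocab="http://purl.org/dc/terms/">'
    '<h1 property="title">Managing Smart Content</h1>'
    '<p>Written by <span property="creator">Jane Example</span>.</p>'
    "</article>"
)


def extract_properties(xml_text: str) -> dict:
    """Map each `property` attribute value to the text of its element."""
    root = ET.fromstring(xml_text)
    return {
        el.get("property"): "".join(el.itertext()).strip()
        for el in root.iter()
        if el.get("property") is not None
    }


if __name__ == "__main__":
    for name, value in extract_properties(SMART).items():
        print(f"{name}: {value}")
```

A search or assembly tool could use such a mapping to answer "who created this?" directly, rather than inferring it from the surrounding prose.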

Smart content enables highly reusable content components and powerful automated dynamic document assembly. Searching can be enhanced with the inclusion of metadata and buried semantics in the content providing more clues as to what the data is about, where it came from, and how it is related to other content. Smart content enables very robust, valuable content ecosystems.

Deciding which level of rigor is needed for a specific set of content requires understanding the business drivers intended to be met. The more structure and intelligence you add to content, the more complicated and expensive the system development and content creation and management processes may become. More intelligence requires more investment, but may be justified through benefits achieved.

I think it is useful if the XML and CMS communities use consistent terms when talking about the rigor of their data models and the benefits they hope to achieve with them. Hopefully, these three terms (smart content, structured content, and unstructured content) ring true and can be used productively to differentiate content and application types.

In a world that seems increasingly about technology itself, it has become tempting to assume that the questions and challenges of new and better information products are about the technology. While it is true that technology is the key enabler of the new information world we are building, it is also true that the decision making and judgment involved in how that technology is to be organized and deployed is of equal--and not decreasing--importance. Indeed, as the products move toward increasing sophistication and flexibility--smart content, you might say--the human and organizational parts of the information life cycle become even more important.

It is a truism that you cannot deliver information products you can't create and manage, and with the circle of participants in that creation and management ever widening, we must be sensitive to the limits of the creators. Moreover, while just "getting it up on the web" used to be at least sufficient to justify deployment of information products, today's information consumer has a much more extensive and demanding list of features required before he will accept web-based information. The publisher who forgets or ignores that list is in for trouble.

In a half-day session preceding the Gilbane conference next week, the Gilbane consulting team will tackle some of the real-world challenges inherent in this rapidly changing information world, providing both signposts for issues likely to come up and "in the trenches" suggestions for how to deal with them. The goal of the session, scheduled for the afternoon of December 1, is that the attendees leave with a better handle on how to proceed in the quest for better information products and the role "smart content" should play.

The presenters, in addition to their expertise in the technology and tools of information, bring a unique resource to their efforts: years of design, implementation, and evaluation work with real organizations facing real challenges.

As part of next week's Gilbane Boston Conference, the XML practice will be delivering a pre-conference workshop, "Managing Smart Content: How to Deploy XML Technologies across Your Organization." The instructors will be Geoff Bock, Dale Waldt, Bill Trippe, Barry Schaeffer and Neal Hannon--a group of experts that represents decades of technical and management experience on XML initiatives.

A tip of the virtual hat to Senior Analyst Geoff Bock for organizing this.

Once Upon a Time...

... there was SVG. People were excited about it. Adobe and others supported it. Pundits saw a whole new graphical web that would leverage SVG heavily. Heck, I even wrote a book about it. 

Then things got quiet for a long time...

However, there are some signs that SVG might be experiencing a bit of a renaissance, if the quality of presentations at a recent conference is a strong indication. It's notable that Google hosted the conference and even more notable that Google is trying to bigfoot Microsoft into supporting SVG in IE, a move that would substantially boost SVG as an option for Web developers.

So a question for those out there interested in SVG. Where are some big projects out there? Are there organizations creating large bases of illustrations and other graphical content with SVG? I would love to talk to you and learn about your projects. You can email me or comment below.

UPDATE: Brad Neuberg of Google, who is quoted in the InfoWorld article linked above, sent along a link to a project at Google, SVG Web, a JavaScript library that supports SVG on many browsers, including Internet Explorer, Firefox, and Safari. According to the tool's website, using the library plus native SVG support, you can instantly target ~95% of the existing installed web base.

UPDATE: Ruud Steltenpool, the organizer for SVG Open 2009, sent a link to an incredibly useful compendium of links to SVG projects, tools, and other resources, though he warns it is a little outdated.

Over at TeleRead, David Rothman has a really fine writeup discussing our digital publishing report. He summarizes some of our key points about asset management and flexibility, but also raises some interesting related issues about DRM and the risks of "publishers as mixmasters."

My thanks to David for his thoughtful response.

I have a new post over at EMC's Community site, "Preserving Electronic Public Records: Lessons from the Washington State Digital Archives." This is part of our ongoing series for EMC on the use of ECM and XML in the public sector.

We have been very pleased with the interest in our new report, Digital Platforms and Technologies for Publishers: Implementations Beyond "eBook." We have had hundreds of downloads already, the vast majority of them by senior people in the publishing industry. This tells us that the timing for the research is good and that interest is strong, and we are thinking about what to do next with this topic.

One idea we have thought about is helping publishers think through their eBook strategy. If our research (and other recent research) is correct, many larger publishers are jumping in with both feet, but some larger publishers, many medium-sized, and perhaps most smaller publishers are staying on the sidelines or testing the waters with pilots and low-cost and low-impact tests with third parties. Perhaps these efforts are part of developing a strategy? Perhaps some of you think the market is too nascent?

An eBook strategy would necessarily be multi-faceted, and would include input from sales, marketing, editorial, production, fulfillment, and others with a stake in the process. It would need to be informed by good market data, and with good understanding of what technology and channel partners can truly offer publishers. It would also need to be pragmatic, balancing the capabilities of your organization with a realistic assessment of the market opportunities you have.

We'd like to gauge interest in this kind of offering through the following simple poll. Just one question, and no requirement to log in or register. If you would like to talk in more detail about this idea, please email me with any questions.
