September 2004

We have made tremendous progress in managing corporate content in the past few years. The good news is that information technology is dramatically better at managing the vast and growing amount of unstructured information that is the lifeblood and fuel for knowledge workers. Organizations now have growing collections of repositories for managing content, documents, digital assets, electronic records and forms, all in addition to the structured databases they already had in place. The bad news is that there are so many of these databases and repositories. Most business applications need to access or integrate information from multiple sources, sometimes even from outside firewalls, and this has turned out to be an awful lot more difficult than most people expected, even when there is a well-thought-out information architecture.

This explains the excitement that surrounded EAI (Enterprise Application Integration) in the mid-to-late nineties, and today’s growing interest in integration technologies that go beyond what pure EAI was capable of by using XML transformation technology in sophisticated ways.

In this issue we look at where EII (Enterprise Information Integration) came from and how it has evolved, and we provide some guidance on how you should think about it in the context of your overall information strategy.

Frank Gilbane



Our last extended discussion of EII (Vol 11, Num 1) was in early 2003. A lot has changed since then in terms of the market landscape, and the problem of information integration is even bigger than it was then.

No matter what business you are in, many, if not most, of your important business applications need to include information that resides in multiple databases or content repositories. There are lots of good reasons for this:

  • Distributed information is a by-product of decentralized organizations that need to be able to scale.
  • The relative value of information is dependent on its accuracy, quality, and timeliness. High-quality/value information can only be maintained when its control (creation and maintenance) is in the hands of local domain experts – those who understand what the information is, and what it means.
  • Knowledge worker productivity depends on their being able to get the information they need, when they need it. Even more importantly, being able to aggregate information allows for previously unknown connections to become apparent. This kind of emergent knowledge is as critical for business as it is for science.

Our favorite example of the requirement for information integration (as always!) is an electronic product catalog, which has to aggregate from, and remain actively integrated with, a wide variety of applications and information resources from ERP, CRM, ECM, DAM and other systems. The systems not only have to be connected, but the information in them has to be integrated, aligned, and understood. To make informed product decisions and provide acceptable customer service, a view that stretches from the customer inquiry all the way back through their history to inventory is required.

Other examples where information integration is critical are business intelligence (BI) and executive dashboard applications. Long gone are the days when most important decisions could be made based only on single or multiple sources of relational data.

The need for EII is easy to understand. What is more difficult is determining both the individual business requirements for the collection of information that needs to be integrated and the best technical approach to achieving it.

Terms

Another popular acronym for EII is ECI, for enterprise content integration. ECI is a perfectly reasonable alternative, and we consider them interchangeable. Some vendors have chosen to use ECI simply to emphasize that their solutions are more focused on integrating unstructured information, and to differentiate themselves from EII vendors who remain more focused on integrating structured information. We stick with EII because we have been using it for years and think “information” is still more neutral, and therefore more useful, than “content”.

When we talk about different information types we use the following terms: structured data (typically relational records), unstructured data (unmarked-up text and graphics), and semi-structured data (marked-up, or tagged, data). Semi-structured data is a fuzzy category and will only become more so, since most useful business information is a combination of structured and unstructured data, so it is not helpful to try to be too rigid about its definition.

Where EII came from

One way to understand the evolution of EII is to consider it in the context of:

  • the growth of integration technologies in the pure data world,
  • the growth of unstructured and semi-structured repositories, and
  • the convergence of the data and content worlds.

At some point two to four decades ago, companies installed their first database application (probably a customer list or a part list). Obviously integration was not a problem then. Perhaps it wasn’t even a problem after a few databases had been installed in different areas of the company, but at some point it became clear that better business decisions could be made more quickly if there was a way to analyze this data in consolidated form. After many expensive custom solutions had been implemented (successful or not), some products and standards emerged to help with this integration. The development and acceptance of SQL gave an enormous boost to companies’ ability to build applications that depended on data from multiple databases at a reasonable cost. This is probably not news to most readers.

An important characteristic of the integration challenge then was that the data, being mostly relational records, was structured and relatively easy to understand and manipulate, so the focus was on simply getting the data from one application to another. SQL handled the data specifics, and APIs were where the challenge was. Over the past few years EAI technologies have made it increasingly easy to integrate structured data applications. Today the challenge is not the APIs (although there is still a lot needed there), but sharing and making use of the sometimes very complex structure and metadata that provide the context and behavior guidance that accompany the information.

Our world view was formed largely by the state of IT circa 1980, when the very few content repositories that could be found resided outside of mainstream IT and were buried in “in plant” or “corporate” publishing and printing departments. There was seldom any database discipline applied, and integration was ignored by all except for the poor souls who had to figure out how to, e.g., incorporate the latest product description or price in the electronic files before they were sent to the printer. It wasn’t until the growing demand for electronic documents reached critical mass in the early 90s and then exploded with the emergence of the Web that the integration of unstructured content from various repositories and with databases gained any serious recognition from IT and CIO staffs. Since then, repositories of unstructured and semi-structured data have sprouted up in huge numbers in all areas of organizations.

What is not surprising, but is still underestimated, is that a very similar evolutionary path has taken place with unstructured data. This kind of data was mostly ignored in the past just because it was unstructured – after all, it was simply unmanageable by its very nature. Of course, this was a bit short-sighted; unstructured data was significantly more difficult to manage, but it was also where most business information was. Companies are now realizing how critical it is to be able to make use of the information in all their unstructured and semi-structured repositories. They are also realizing that they need to integrate all three types of information. This is what has fueled the interest in EII technologies.

So what is EII?

Since we are biased towards simplifying technology issues, we often describe the difference between EAI and EII as two halves of the same problem: EAI addressing the somewhat straightforward application connection problem, and EII addressing the more complex integration of the information structures that the applications process. In reality, it is more complicated than that, but the bifurcation is helpful in clarifying differing approaches. EII solutions today should address both application and information integration. In fact, the EAI Consortium has been renamed the Integration Consortium in recognition of this, and you should expect the majority of EAI vendors to come up with an information integration story.

In addition to EAI, there are other software categories and analyst concepts, old and new, that are relevant to integrating information, such as Business Activity Monitoring (BAM), the “Real Time Enterprise”, Extract, Transform, and Load (ETL), Enterprise Portals, etc. These relate to some of the business problems organizations have in specific or grand ways, but we think it is more useful for IT to look at the more fundamental information management issues that underlie all of these. Only then can an information infrastructure be built that can meet the unique needs of the wide variety of existing and future enterprise requirements.

In general, EII needs to

  • support all information types, structured, unstructured and semi-structured
  • provide for context, i.e., where does the information fit in the schema/taxonomy of the receiving repository/application, and what are the relevant behavioral constraints.

EII Components

Aside from specific applications where information integration requirements are limited to a very small number of compatible information sources, aggregating information from various sources into a central repository is not a viable strategy. Thinking about the potential combination of maintenance, scalability, and information ownership issues should convince you that a different strategy is required.
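One way to picture the alternative to a central repository is a view that is assembled at request time from sources that stay where they are. The sketch below is ours and is only illustrative: the two dictionaries stand in for an ERP system and a content repository, and every name in it (`ERP_INVENTORY`, `CMS_DESCRIPTIONS`, `federated_view`) is a hypothetical placeholder, not part of any product.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-ins for two independently owned systems.
ERP_INVENTORY = {"P-100": {"on_hand": 42}, "P-200": {"on_hand": 0}}
CMS_DESCRIPTIONS = {"P-100": {"title": "Cordless drill"}}

# A source is any function that maps a product id to a record (or None).
Source = Callable[[str], Optional[dict]]


def erp_lookup(product_id: str) -> Optional[dict]:
    return ERP_INVENTORY.get(product_id)


def cms_lookup(product_id: str) -> Optional[dict]:
    return CMS_DESCRIPTIONS.get(product_id)


def federated_view(product_id: str, sources: Dict[str, Source]) -> dict:
    """Build a combined view at request time; nothing is copied centrally."""
    view: dict = {"product_id": product_id}
    for name, lookup in sources.items():
        # A source with no matching record contributes None
        # instead of failing the whole request.
        view[name] = lookup(product_id)
    return view


SOURCES = {"erp": erp_lookup, "cms": cms_lookup}
print(federated_view("P-100", SOURCES))
```

Because each lookup goes back to its source, a change made by the owners of the inventory data shows up in the next view without any synchronization step, which is the maintenance and ownership point made above.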

Most EII solutions have three things in common:

  • they rely heavily on XML technology in some form,
  • they involve the use of a metadata repository, and
  • they include some specific connectors for various repositories.

XML

The success of XML is directly related to the information integration problem. It was only a few years ago that the majority of IT shops were looking at XML primarily as a tool for sharing data and messages for EAI applications. It was also recognized as being equally suitable for structured and unstructured data, and as a way to bridge islands of legacy repositories with the future growth of more sophisticated repositories either based on XML or easily encoded in XML.

The immediate popularity of XSLT spurred lots of EAI development, and some use of XSLT is probably a requirement in all EII solutions today. Why? Any exchange of XML information is going to involve some mapping of information objects, and in most cases this mapping will involve structural transformations to account for different contexts, i.e., different uses of the information. There is simply no reason to use anything but XSL for this.
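To make “structural transformation” concrete, here is a minimal sketch of the kind of mapping involved. It is written in plain Python with the standard library rather than in XSLT, purely to keep the example self-contained; in practice this is the job a stylesheet would do. The source and target element names (`item`, `desc`, `cost`, `product`, `name`, `price`) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record as one system might export it.
SOURCE_XML = (
    '<item sku="P-100">'
    "<desc>Cordless drill</desc>"
    '<cost currency="USD">79.00</cost>'
    "</item>"
)


def to_catalog_entry(source_xml: str) -> str:
    """Map an <item> record onto the <product> structure a catalog expects."""
    item = ET.fromstring(source_xml)

    # Rename the root element and its identifying attribute.
    product = ET.Element("product", {"id": item.get("sku", "")})

    # Rename <desc> to <name>, keeping the text.
    ET.SubElement(product, "name").text = item.findtext("desc", default="")

    # Rename <cost> to <price>, carrying the currency attribute across.
    price = ET.SubElement(product, "price")
    cost = item.find("cost")
    if cost is not None:
        price.set("currency", cost.get("currency", ""))
        price.text = cost.text

    return ET.tostring(product, encoding="unicode")


print(to_catalog_entry(SOURCE_XML))
```

The information is the same on both sides; only the structure and naming change to suit the receiving context.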

As more and more information becomes available in XML, XQuery is emerging as another key XML technology for EII. What is XQuery? The simple answer is that it is a language for accessing information in XML structures that incorporates many of the query features familiar from SQL. Obviously, finding information is a prerequisite to integrating it. (Some of the vendors with XQuery functionality include IBM, BEA, Microsoft, Software AG, Ipedo, Oracle, Nimble/Actuate, and Mark Logic.)
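For readers who have not seen this style of query, the sketch below shows the idea: select elements from an XML structure, filter them, order them, and return part of each match. The comment gives roughly how the query might be phrased in XQuery; the Python underneath is our own standard-library approximation of the same logic, not an XQuery engine, and the catalog content is invented.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical catalog document.
CATALOG_XML = """
<catalog>
  <product id="P-100"><name>Cordless drill</name><price>79.00</price></product>
  <product id="P-200"><name>Hammer</name><price>12.50</price></product>
  <product id="P-300"><name>Work light</name><price>24.00</price></product>
</catalog>
"""

# Roughly the same request as an XQuery expression (illustrative only):
#   for $p in doc("catalog.xml")//product
#   where $p/price > 20
#   order by $p/name
#   return $p/name


def names_priced_above(catalog_xml: str, threshold: float) -> List[str]:
    """Return product names, sorted, for products priced above threshold."""
    root = ET.fromstring(catalog_xml.strip())
    matches = [
        p for p in root.iter("product")
        if float(p.findtext("price", "0")) > threshold
    ]
    return sorted(p.findtext("name", "") for p in matches)


print(names_priced_above(CATALOG_XML, 20))
```

The resemblance to a SQL SELECT with WHERE and ORDER BY clauses is the point: the same kind of question is being asked of a document structure rather than of tables.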

Metadata repositories

How EII solutions implement their metadata repository is a key question to ask when investigating products. You need to understand how the repository gets built, how it gets updated, how its structure gets traversed, and how it aggregates and interacts with the source repositories. You should assume there are more ways to implement a metadata repository than you can think of. A metadata repository could be anything from a simple list in a file or database, to a complex schema modeling all potential information (MetaMatrix), to a sophisticated collection of RDF triples (Metatomix). Understanding how the metadata repository works is the only way to make sure an EII solution will meet your requirements for scalability, time-to-refresh information, time-to-add another source, and maintenance resources.
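As a rough illustration of the triple-based end of that range, the sketch below keeps metadata as (subject, predicate, object) statements and answers pattern queries over them. It is a deliberately simplified toy of our own, not RDF proper and not how any named vendor implements its repository; the identifiers and predicate names (`sameAs`, `storedIn`) are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class MetadataRepository:
    """Toy store of (subject, predicate, object) statements about sources."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add((subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> List[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


repo = MetadataRepository()
# Record where each object lives and which objects describe the same thing.
repo.add("erp:item/P-100", "storedIn", "erp")
repo.add("cms:doc/drill-overview", "storedIn", "cms")
repo.add("erp:item/P-100", "sameAs", "cms:doc/drill-overview")

print(repo.match(predicate="storedIn"))
print(repo.match(subject="erp:item/P-100"))
```

Adding another source here means adding statements, not changing a schema, which is one reason the questions above about build, update, and traversal matter when comparing approaches.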

Connectors

Most EII solutions will have a collection of tools for accessing some set of repositories that, more or less, come “out-of-the-box”. No matter how complete or sophisticated a particular collection may be, because integration is almost by definition a matter of customization, you should expect to either modify or create new connectors.

The types of connectors will also vary by product depending on the types of markets and applications they have been developed for. For example, some EII solutions focus more on uni-directional aggregation and distribution applications, while others focus on bi-directional information interaction. Also, vendors have domain expertise in different types of content applications, such as rich media, transactional documents, records, technical data, etc., and will have a correspondingly different set of off-the-shelf connectors.
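The distinction between uni-directional and bi-directional connectors can be sketched as two interfaces, one that only reads and one that also writes back. This is our own minimal illustration under assumed names (`Connector`, `WritableConnector`, `copy_record`); real products define their own connector APIs, and the in-memory classes are stand-ins for actual repositories.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class Connector(ABC):
    """Uni-directional: information can only be read from the source."""

    @abstractmethod
    def read(self, key: str) -> Optional[dict]:
        """Return a copy of the record for key, or None if there is none."""


class WritableConnector(Connector):
    """Bi-directional: information can also be written back to the source."""

    @abstractmethod
    def write(self, key: str, record: dict) -> None:
        """Store record under key in the source repository."""


class ReadOnlyFeedConnector(Connector):
    """Hypothetical connector over a source that only publishes data."""

    def __init__(self, records: Dict[str, dict]) -> None:
        self._records = records

    def read(self, key: str) -> Optional[dict]:
        record = self._records.get(key)
        return dict(record) if record is not None else None


class InMemoryConnector(WritableConnector):
    """Hypothetical connector over a repository that accepts updates."""

    def __init__(self) -> None:
        self._records: Dict[str, dict] = {}

    def read(self, key: str) -> Optional[dict]:
        record = self._records.get(key)
        return dict(record) if record is not None else None

    def write(self, key: str, record: dict) -> None:
        self._records[key] = dict(record)


def copy_record(source: Connector, target: Connector, key: str) -> bool:
    """Copy one record; return False if target is read-only or key is absent."""
    if not isinstance(target, WritableConnector):
        return False
    record = source.read(key)
    if record is None:
        return False
    target.write(key, record)
    return True


feed = ReadOnlyFeedConnector({"P-100": {"title": "Cordless drill"}})
store = InMemoryConnector()
print(copy_record(feed, store, "P-100"), copy_record(store, feed, "P-100"))
```

A custom connector for a repository the vendor does not cover would be one more subclass, which is the kind of modification the section says to expect.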

The Market

Today there is a wide range of players providing various pieces of the integration puzzle. There are companies who market themselves as EII or ECI suppliers (Context Media, MetaMatrix). There are ECM vendors who offer their own EII software (Vignette, Documentum). There are EAI vendors who are paying a lot more attention to EII (SeeBeyond, webMethods). There are XML database vendors who market their platforms as EII solutions (Software AG, Ipedo). There are vendors that market EII as part of a broader integration solution (Metatomix, Snapbridge). And there are infrastructure players that have specific EII products (BEA, IBM).

IBM clearly takes this area very seriously and their activity is instructive. In addition to their own developed solutions they have a partnership with Context Media and recently acquired Venetica. IBM recognized EII was not a simple problem, but a broad multi-faceted set of complex requirements that included multiple degrees of integration, multiple types of content, multiple types of transformations and significant metadata management.

An important question is whether EII is a market of its own, or whether EII technology will simply be part of larger applications or infrastructures. Long term there will certainly be EII capability built into both – the more difficult question is what that capability will look like.

The Future

The future of EII will be determined to a large degree by the continued convergence of unstructured and structured data. With Microsoft, Oracle and IBM all morphing their database platforms into repositories for all types of data, it is only natural to expect that basic integration capabilities will increasingly be included. As this occurs the main challenges to integration will be less technical and more informational. That is, information integration will still be very difficult, but the center of effort will shift towards information architecture, including structure and metadata management.

Web services [1] are also especially relevant to the future of EII because they provide a way to meet fundamental needs associated with both the application and information integration requirements. Web services are part of the solution to each half of the problem because they:

  • simplify the application integration to largely a one-sided problem
  • allow for the scaling necessary for the large volumes of information
  • get around some of the information complexity by allowing for the information to reside in environments where it is understood – where it can be managed by the humans and applications that understand it.

Information integration will continue to be a challenge. As with many content technologies, consensus on technical and architectural approaches to EII won’t come close to being reached until a majority of IT staff and developers understand more about unstructured and semi-structured information, and such understanding won’t happen until there has been a critical mass of industry experience building integrated applications. Twenty-something developers are faced with this integration challenge, and while they may not fully grasp the complexity of managing unstructured and semi-structured content, neither are they handicapped by the data- and API-centric views of previous generations.

Not all integration issues are equal – the number of applications that need to be integrated is far smaller than the number of documents, or content elements, that need to be integrated. There are a lot of similar requirements across industries and among companies who need to connect information piping between, e.g., an ERP system and a catalog internally, and even when they need to connect with partner systems. But the diversity of the content and related contextual processing once the information has gotten from one system to the other is infinite by comparison.

This means that there will continue to be a need for a wide variety of EII technologies and approaches for the foreseeable future. Even though many content management functions are becoming commoditized, the value added by those who understand the content will ensure the market for multiple content technologies will remain healthy. The same is true for EII. You can argue that EII technology needs to be everywhere, and in particular, included in platforms, but the need for expertise on how to effectively use it guarantees that integration specialists and product companies with domain expertise will be around for a while.

Frank Gilbane, frank@gilbane.com

1. We won’t digress into a discussion on the different definitions or political views on web services. We believe our point is general enough to be consistent with different definitions and true no matter what you think of particular standards.