The Gilbane Report: Volume 11, Number 7

Content Management Strategies: Integrating Search

September 2003

Given how much most of us depend on search technology, it is surprising that so many enterprises lack a coherent strategy for ensuring search technology is meeting the needs of their constituents. In fact, most organizations would be hard-pressed to even list all of the search technologies they have deployed. This is largely due to the incorporation of search functionality in the multiple enterprise applications, especially content management and portal solutions, that companies have in place. But the multiplicity of technologies and varied range of capabilities is precisely why businesses need to pay attention: this is one of those cases where 'more' does not equate to 'better', but only to 'more confusion', and probably 'more cost.'

Search technology is usually considered a critical component of a content management or portal initiative. But there are many subtle issues concerning how search functionality should be integrated within a content management system or portal, as well as how it should be integrated across applications. There are technical, user-interface, and licensing issues that need to be analyzed. This month Kathleen Reidy returns with a look at different approaches to integrating search and content management products, and provides some valuable advice on what to watch for. If you are in the process of starting or revamping a content management or portal project, her article will save you frustration, if not money.

Content Management Strategies: Integrating Search

Managing content is largely about making it easier for users to access and find the information they need. Yet too often search is an afterthought in a content management project. It may be one of a long list of features provided by a CMS vendor, or the CMS may not provide it as a feature at all, leaving the customer to determine how best to provide users the ability to search content. Search, along with an effective and tightly coupled browsing mechanism, is the primary path that site users take to find content. This is why understanding the implications of integrating search functionality into a content management initiative is critical. A useful and accurate search engine should be a desired outcome of a successful content management project.

There are as many variations in how search can be applied to information as there are search and content management products on the market. Some systems build search functionality tightly into core features, making it seamless to users and often inaccessible to other systems. Others focus only on creating and delivering content and do not address the search issue at all. This often leaves organizations with a wide variety of search technologies at play within one environment with multiple search interfaces confronting the user.

As organizations grapple with implementing content management systems specifically, it is critical to consider how this content will be made available for searching. It is also important to think about how the search functionality applied to a CMS will be integrated, or not, with existing search engines already at work within the organization. Many organizations today are standardizing on content management products to help centralize the management, publishing, categorization, and security functions that they provide. Similarly, organizations are looking at search infrastructures that will be able to meet the needs of a wide number of users who access all types of content. This may mean a centralized search and indexing service that can integrate with the content management system as well as other data and information stores within the organization.

This article looks at search in the broadest sense. Most search products today rely on a full-text index of content. All kinds of value-added services and technologies can be layered on top of this index to cluster like documents, automatically expand queries, intelligently pinpoint specific answers; the list goes on. The many flavors of search technology are nicely sampled by Sebastian Holst in the Gilbane Report article Searching for Value in Search Technology (Vol. 10, No. 7). In this article we'll look more generally at how search engines, regardless of the sophisticated retrieval features they may or may not offer, can be integrated with managed content as well as other disparate content that resides within the organization.

The Ideal Solution

There are many considerations to keep in mind when looking at the ways in which a search engine can integrate with a CMS. The requirements in each organization are different as the content sources vary and users have different needs. For this reason, saying there is an 'ideal solution' is hyperbole, but this article is intended to provide a framework for thinking about the often complex issues that arise when synchronizing a search engine with a CMS, and other information sources in the enterprise.

An ideal coupling of a search engine and a CMS will result in:

  • A full-featured search engine that can access content, structured and unstructured, originating from many sources and systems.
  • A search engine that is able to search the full-text of documents as well as any metadata associated with those documents regardless of where that data is stored (i.e., on the document or not).
  • Search results that are secure at the document-level without the need to duplicate access control information in multiple stores.
  • Content available for searching as soon as it is published.
  • Unpublished content also available via search to those CMS users who are privileged to access it.

Figure 1 identifies some of the technologies and integration points that come into play in addressing the above issues.

Figure 1.

There are a number of issues identified in the above graphic. The intent is to show the possible technologies that might be used, not to say that all of these are required in each solution. For example, to search an external news feed, it may make sense to federate results if that feed maintains its own search engine, to import its data to the search index (which may cause size concerns), or to crawl some version of the feed that is available in-house or made available by the news feed provider. One of these is likely to be the chosen solution, not all.

Looking specifically at the integration points with a CMS, let's identify some of the key components articulated above:

  • A central search engine is able to provide indexing and search services for content that is coming from one CMS, multiple content management systems, content that is not currently managed by a CMS, external content, structured data and email, or other collaboration stores. This service provides all the search features that users expect (Boolean, fuzzy search, parametric search, etc.). It may also include auto-classify or clustering features.
  • This index is created and maintained using a number of technologies based on the type of content or data, its location, its security, and the frequency with which it changes.

    •  Most search engines today are still built upon a crawler-based architecture. This means the search engine is able to crawl a wide variety of document types to build a full-text index of these documents. Crawlers can typically be scheduled and can run incrementally so that only new content is added to the index. Crawling pages that have been published by a CMS and thus pushed to a web server is probably the most common integration point, though in cases where the search engine is tightly linked with the repository and all content is stored there, crawling of published pages may not be necessary.

    •  Crawlers must be supplemented by import features. This allows data to be loaded directly into the index. In cases where the search engine has no direct access to a repository (and so no access to content that is not published), import features allow the system to send its data to the index and synchronize it with any data that the search engine already has about a particular piece of content. This can be the most effective way to ensure that metadata and content that are stored separately are included together in the search engine's index.

    •  Both crawlers and importing features may need to be triggered by agents. Agents can alert the search engine when there is new content available. So, for example, the CMS may notify the search engine that a piece of content has been published. This will tell the search engine to crawl that content and it may also initiate the importing of the metadata about that document to the search engine.

    •  As many systems include their own search engines, federated search capabilities can often be the most straightforward way of providing users with a unified search across multiple sources. Federated search allows one search engine to query and retrieve results from another, filter and de-duplicate results, and deliver one coherent set of results to the searcher. It's important that the federation features be cross-product and cross-vendor.

    •  Many content management systems today promote the idea of a 'virtual repository' where the CMS is managing just the metadata about a piece of content and that piece of content remains in its original location. This concept is amenable to multi-repository search requirements and can enhance the integration of the two technologies. However, it is important to ensure that a) the search engine can access the source document and b) the metadata stored by the CMS is synced with the full-text information the search engine has gathered.
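
The interplay of crawling, importing, and agents described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SearchIndex and PublishAgent classes, their method names, and the sample document are all hypothetical and do not correspond to any vendor's API. The point is simply that a publish notification from the CMS can trigger both a crawl of the published page and an import of its metadata, so that the two end up in one index record.

```python
from dataclasses import dataclass, field


@dataclass
class IndexRecord:
    """One entry in the search index: full text plus CMS metadata."""
    doc_id: str
    text: str = ""
    metadata: dict = field(default_factory=dict)


class SearchIndex:
    """Toy in-memory index keyed by document id (hypothetical, not a vendor API)."""

    def __init__(self):
        self.records = {}

    def _record(self, doc_id):
        # Create the record on first sight so crawl and import can arrive in either order.
        return self.records.setdefault(doc_id, IndexRecord(doc_id))

    def crawl(self, doc_id, fetch_text):
        """Crawler path: fetch the published page and store its full text."""
        self._record(doc_id).text = fetch_text(doc_id)

    def import_metadata(self, doc_id, metadata):
        """Import path: load CMS metadata directly, merging with what is already indexed."""
        self._record(doc_id).metadata.update(metadata)

    def search(self, term):
        """Match the term against full text and metadata values."""
        term = term.lower()
        return sorted(
            r.doc_id
            for r in self.records.values()
            if term in r.text.lower()
            or any(term in str(v).lower() for v in r.metadata.values())
        )


class PublishAgent:
    """Agent path: the CMS calls on_publish, which triggers crawl and import."""

    def __init__(self, index, fetch_text):
        self.index = index
        self.fetch_text = fetch_text

    def on_publish(self, doc_id, metadata):
        self.index.crawl(doc_id, self.fetch_text)
        self.index.import_metadata(doc_id, metadata)


# Stand-in for a web server holding published pages.
published_pages = {"press-42": "Quarterly results were announced today."}

index = SearchIndex()
agent = PublishAgent(index, published_pages.__getitem__)
agent.on_publish("press-42", {"author": "J. Smith", "category": "Investor Relations"})

print(index.search("quarterly"))  # matched through the crawled full text
print(index.search("investor"))   # matched through the imported CMS metadata
```

Because the record is keyed by document id, it does not matter whether the crawl or the import arrives first; whichever comes second is merged into the existing entry.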

Security

Security has long been, and continues to be, a sticky situation with search and is such a big issue it deserves more lengthy coverage. In the past, organizations often chose to only index publicly available documents so as to avoid the security issue altogether. This is not a viable solution for many organizations as they move towards centralizing content management and search services. Ensuring that search results are secure is an increasingly important concern and one that has several possible and partial solutions.

To be secure, search results need to be filtered so that the results page only shows links for documents to which that user has access. This means the index must understand who the user is and what she can see. Showing all results and leaving authentication to the source repository so that a user is challenged after he clicks on a search result link is not sufficient. Users may see private information in the search results page, even though they are not able to access the actual document when they click through the result link.
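
A minimal sketch of this kind of results filtering follows. It assumes, purely for illustration, that each hit carries the set of groups permitted to read the document; the Hit class, the trim_results function, and the sample documents are hypothetical. What matters is that inaccessible hits are removed before the results page is built, so that neither titles nor snippets are ever shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    title: str
    snippet: str
    allowed_groups: frozenset  # assumption: groups permitted to read the document


def trim_results(hits, user_groups):
    """Return only the hits the user may see, dropping the rest before rendering."""
    user_groups = set(user_groups)
    return [hit for hit in hits if hit.allowed_groups & user_groups]


hits = [
    Hit("doc-1", "Holiday schedule", "Offices close on...", frozenset({"staff", "hr"})),
    Hit("doc-2", "Salary bands", "Band 4 ranges from...", frozenset({"hr"})),
]

visible = trim_results(hits, user_groups={"staff"})
print([hit.title for hit in visible])
```

A user in the staff group alone never sees the title or snippet of the salary document, which is exactly the leak that click-through authentication fails to prevent.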

There are a number of ways that this secure results filtering can be accomplished. The search engine can store access control information associated with each record in its index. This provides a solution but can be a laborious process to set up and requires that information be stored in more than one location. Better are systems that can work from a centralized authentication scheme (like a Unix, MS, or LDAP login) to identify which sources a user can access: if you can't access Lotus Notes, the system will not even look at those results. This can be a fast way to solve one portion of the problem but doesn't address the more granular security issues within a particular system. To accomplish this, the search engine must be able to filter the search request through the authorization mechanism of the source system. This may slow the search results or may require further duplication of security data stored in the search engine.
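
The two levels of filtering can be sketched as follows. Everything here is an assumption made for illustration: the secure_search function, the source_access table standing in for a central login, and the per-source authorizer callables standing in for each system's own authorization mechanism. Real integrations would call the directory and the source systems rather than in-memory dictionaries.

```python
def secure_search(hits, user, source_access, authorizers):
    """Filter hits in two stages: by source, then by document.

    hits          -- list of (source, doc_id) pairs returned by the index
    source_access -- maps a user to the set of sources the central login permits
    authorizers   -- maps a source to a callable(user, doc_id) -> bool that asks
                     the source system's own authorization mechanism
    """
    allowed_sources = source_access.get(user, set())
    visible = []
    for source, doc_id in hits:
        # Stage 1: skip whole sources the user cannot log in to.
        if source not in allowed_sources:
            continue
        # Stage 2: defer the document-level decision to the source system.
        if authorizers[source](user, doc_id):
            visible.append((source, doc_id))
    return visible


# Hypothetical data standing in for a directory and a CMS access control list.
source_access = {"alice": {"cms", "notes"}, "bob": {"cms"}}
cms_acl = {"cms-1": {"alice", "bob"}, "cms-2": {"alice"}}
authorizers = {
    "cms": lambda user, doc_id: user in cms_acl.get(doc_id, set()),
    "notes": lambda user, doc_id: True,
}
hits = [("cms", "cms-1"), ("cms", "cms-2"), ("notes", "n-7")]

print(secure_search(hits, "bob", source_access, authorizers))
print(secure_search(hits, "alice", source_access, authorizers))
```

The first stage is cheap and needs no duplicated security data; the second stage is where the speed cost mentioned above arises, since every remaining hit requires a call to the source system.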

Many organizations today are starting to move towards centralized policy or 'identity management' solutions that layer authorization, policy enforcement, and single sign-on on top of standard LDAP directories. As these identity management solutions are integrated with search engines, they may offer the most efficient way to provide secure search results without a lot of duplication, provided they are able to do so while maintaining adequate search engine speed. This is still an emerging concept, however, and not well advanced in most organizations. Figure 1 shows both a centralized authentication and authorization layer along with specific integrations that may need to be done to ensure the document-level security of content coming from specific systems.

Approaches to Integrating Search & CMS Products

Figure 1 clearly represents a complex environment and this complexity is why effectively integrating search technology with a CMS is not always a straightforward task. There are different approaches that an organization can take. Many CMS vendors today include search technology as a feature of their products and this represents the first possible approach. The other is to work with two stand-alone products for CMS and search engine technology.

CMS vendors clearly recognize the importance of effective search in making content management successful. This is certainly something for which customers consistently clamor. The scope and scale of this functionality can vary quite a bit product by product, depending on the origin of the search technology. Some CMS vendors have built it, others have bought it, and others OEM some version of a search engine from a search vendor such as Verity or Autonomy. For example, Documentum OEMs Verity, Vignette OEMs Autonomy, FatWire OEMs AltaVista, Autonomy, and Verity, Stellent OEMs Convera, and Interwoven resells iPhrase.

There are a number of potential benefits in taking this approach.

  • The search engine may be tightly integrated and able to leverage CMS metadata. Tight integration could enable searching of published and unpublished content.
  • The search engine may natively respect the access privileges managed by the CMS.
  • No additional license / integration costs required.

These are only identified as potential benefits as the actual search features provided and the level of integration with the CMS can vary substantially. Similarly, depending on the specifics of the product and the integration, this solution may also have the following drawbacks.

  • CMS (and other) vendors often OEM a search engine from a search engine vendor, as seen in Table 1. OEM versions of products can be limited both in terms of the features they provide and in terms of the level of integration that is available. An OEM product is not generally the full product that would be provided by the search vendor if purchased independently.
  • The CMS product license (particularly if it is an OEM) may only include the ability to search content managed by that CMS and perhaps may only be intended for system users accessing the repository, not site users searching published content.
  • A CMS vendor's search may not be able to crawl other systems or repositories.
  • The included search engine may not offer the most sophisticated or cutting edge search features that are available from independent search vendors (the Gilbane Report article referenced above is a good source for more information on some of these advanced features).

Despite these potential pitfalls, using the search features provided by a CMS vendor, whether an OEM of another product or native product features, is a common approach among customers today. This approach can solve the search problem for a particular site or set of sites that are running a CMS and are not looking to provide a unified search across multiple systems or sources. For organizations standardizing on a single CMS, this approach may also make more sense, provided the search engine is able to access other data or information types (like email) if required.

The other primary approach is to work with separate products from search and CMS vendors. Some well-known enterprise search vendors include Autonomy, Convera, Google, and Verity.

This is just a sampling of some of the more mainstream search engine vendors. There are many vendors offering new and different twists to help solve the search problem. See the Gilbane Report article In Search of Search Solutions (Vol. 10, No. 3) for a more comprehensive list of these vendors.

It should be noted that even when a CMS vendor OEMs a particular search engine, most maintain relationships with the other leading search vendors as well. Customers can generally choose not to use an OEMed product and to go with another search product without too much difficulty.

Taking this best-of-breed approach offers a number of benefits.

  • These search engines are content and system agnostic.
  • They provide a centralized index that can be built from many content and data sources.
  • Independent products are typically full-featured and sophisticated.
  • For organizations with search engines already in place, it's likely that one of these is already the enterprise search provider.

Yet to achieve the 'ideal solution' articulated above, a significant amount of integration work between the two systems would be likely. Without extensive integration work, the search engine may:

  • not easily leverage CMS metadata.
  • only crawl published pages from the CMS.
  • not be able to search content as soon as it is published.
  • require a lot of duplication of access privileges to secure search results.

Addressing these issues will require the use of a number of the technologies depicted in Figure 1: federation, agents, gateways, and import features. The specifics depend on the requirements, the available integration between the two products today, and the capabilities of those products.
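
Of these technologies, federation is perhaps the simplest to sketch. The example below assumes each underlying engine can be called with a query and returns hits with a URL, a title, and a score; the federated_search function, the two stand-in engines, and the can_view check are hypothetical names introduced for illustration. It also assumes scores from different engines are directly comparable, which is a simplification that real products have to work around.

```python
def federated_search(query, engines, can_view):
    """Query every engine, drop hidden and duplicate hits, and rank the rest.

    engines  -- maps an engine name to a callable(query) -> list of hit dicts
                with "url", "title", and "score" keys
    can_view -- callable(url) -> bool used to filter results for the searcher
    """
    best_by_url = {}
    for name, search in engines.items():
        for hit in search(query):
            if not can_view(hit["url"]):
                continue
            current = best_by_url.get(hit["url"])
            # De-duplicate on URL, keeping the highest-scoring copy.
            if current is None or hit["score"] > current["score"]:
                best_by_url[hit["url"]] = {**hit, "engine": name}
    return sorted(best_by_url.values(), key=lambda h: h["score"], reverse=True)


def cms_search(query):
    """Stand-in for the search engine bundled with a CMS."""
    return [
        {"url": "/policies/travel", "title": "Travel policy", "score": 0.9},
        {"url": "/drafts/budget", "title": "Budget draft", "score": 0.8},
    ]


def portal_search(query):
    """Stand-in for a portal's search engine."""
    return [
        {"url": "/policies/travel", "title": "Travel policy", "score": 0.7},
        {"url": "/news/offsite", "title": "Offsite announced", "score": 0.6},
    ]


results = federated_search(
    "travel",
    {"cms": cms_search, "portal": portal_search},
    can_view=lambda url: not url.startswith("/drafts/"),
)
print([(r["url"], r["engine"]) for r in results])
```

The searcher receives one list in which the travel policy appears once, credited to the engine that scored it highest, and the unpublished draft is filtered out before merging.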

This approach has the best chance of coming close to the ideal solution if the requisite integration work is well thought out and complete. With this in hand, this approach can solve large enterprise-scale search needs to provide a central indexing service, along with tight CMS integration.

Where Do Portals Fit In?

The line between portal products and content management systems continues to blur, as we explored in a previous Gilbane Report article, Portals & Content Management Systems: Have Two Markets Become One? (Vol. 11, No. 4). Search has been a service portal products have provided since the early days. As with CMS vendors, this search functionality can have different origins and different capabilities. Table 1 looks at some portal vendors and the search capabilities they provide.

Vendor Name    Search Included?    Origin
BEA            Yes                 OEM - Autonomy
IBM            Yes                 Lotus
Oracle         Yes                 Oracle
Plumtree       Yes                 RipFire acquisition
Sun            Yes                 Netscape

Table 1: Portal Vendors and Search Features

Adding portal technology to the CMS and search engine mix has the potential to both muddy the waters and to offer a solution. Integrating portal technology that also includes a search engine presents several potential benefits.

  • CMS and portal vendors have done a lot of pre-integration work that is available to customers. Sometimes this includes search integration.
  • Portals are increasingly addressing the need for centralized access control (or identity management), which may be leveraged by the portal, the search engine, and hopefully the CMS.
  • Portal search features will almost always be multi-repository.
  • Search engine will be included in portal license.

However, portal products in many cases have similar relationships to search vendors as the CMS vendors do, so potential cons to this approach are also similar.

  • Portals also often use OEM versions of other vendors' search products. These can have limited licenses, scope, and integration.
  • A portal's search engine integration with a CMS may be no different than if the two products were purchased stand-alone; the search engine may not be in sync with the publishing process and may only be able to crawl published pages.

Conclusions & Recommendations

Knowing that a product 'comes with search' or that 'we already have a search engine' is never enough. Be sure that you understand the technology underlying the search engine, whether it is bundled in a CMS product or purchased stand-alone. Questions to consider are:

  • Is it crawler-based only?
  • Does it have import features?
  • Can these be triggered by agents that understand when new content is available?
  • Can it crawl structured repositories and email systems or just web pages and documents?

It is also important to understand the specifics of the CMS integration, if it is already available. Think about the following:

  • Will metadata stored in the CMS be indexed by the search engine along with the document's full text?
  • Does the search engine do any auto-categorization and, if so, how will this merge with existing, manually applied metadata?
  • Will the search engine leverage CMS access controls or does this information have to be duplicated to provide secure search?
  • Will authors and publishers use the same search to find in-process documents in the repository?

In cases where the CMS vendor does provide search, make sure you understand the search engine's license structure.

  • What features does an OEM search engine provide? How do these compare to the full product available from the search vendor?
  • Is it only licensed to search content managed by this CMS?
  • What's involved in extending the license? Is the CMS vendor authorized to resell additional licenses or does it require working directly with the search vendor?

Perhaps most important in beginning an initiative in this vein is to identify the search experts in-house and at the vendors you're working with. Search is fairly specialized, and the folks who understand intricately how it works or how it will integrate with the CMS may not be the same folks who typically sell or implement the CMS. Ask the really tough questions, even the ones that don't make sense to you; most likely they don't make sense to the person on the other side of the table either.

Kathleen Reidy, kathleenoreidy@yahoo.com

The Gilbane Report is published by Bluebill Advisors, Inc. © 1993 - 2005 The Gilbane Report. All Rights Reserved.