The Gilbane Report: Volume 11, Number 7: Content Management Strategies: Integrating Search
September 2003
Content Management Strategies: Integrating Search
Given how much most of us depend on search technology, it is surprising that
so many enterprises lack a coherent strategy for ensuring search technology
is meeting the needs of their constituents. In fact, most organizations would
be hard-pressed to even list all of the search technologies they have deployed.
This is largely due to the incorporation of search functionality in the multiple
enterprise applications, especially content management and portal solutions,
companies have in place. But the multiplicity of technologies and varied range
of capabilities is precisely why businesses need to pay attention: this is
one of those cases where 'more' does not equate to 'better', but only to 'more
confusion', and probably 'more cost.'
Search technology is usually considered a critical component of a content
management or portal initiative. But there are many subtle issues concerning
how search functionality should be integrated within a content management system
or portal, as well as how it should be integrated across applications. There
are technical, user-interface, and licensing issues that need to be analyzed.
This month Kathleen Reidy returns with a look at different approaches to integrating
search and content management products, and provides some valuable advice on
what to watch for. If you are in the process of starting or revamping a content
management or portal project, her article will save you frustration, if not
money.
Content Management Strategies: Integrating Search
Managing content is largely about making it easier for users to access and
find the information they need. Yet too often search is an afterthought in
a content management project. It may be one of a long list of features provided
by a CMS vendor or the CMS may not provide it as a feature at all, leaving
the customer to determine how best to provide users the ability to search content.
Search and an effective, tightly coupled browsing mechanism are
the primary paths that site users take to find content. This is why understanding
the implications of integrating search functionality into a content management
initiative is critical. A useful and accurate search engine should be a desired
outcome of a successful content management project.
There are as many variations in how search can be applied to information as
there are search and content management products on the market. Some systems
build search functionality tightly into core features, making it seamless to
users and often inaccessible to other systems. Others focus only on creating
and delivering content and do not address the search issue at all. This often
leaves organizations with a wide variety of search technologies at play within
one environment with multiple search interfaces confronting the user.
As organizations grapple with implementing content management systems specifically,
it is critical to consider how this content will be made available for
searching. It is also important to think about how the search
functionality applied to a CMS will be integrated (or not) with existing search
engines already at work within the organization. Many organizations today are
standardizing on content management products to help centralize the management,
publishing, categorization, and security functions that they provide. Similarly,
organizations are looking at search infrastructures that will be able to meet
the needs of a wide number of users who access all types of content. This may
mean a centralized search and indexing service that can integrate with the
content management system as well as other data and information stores within
the organization.
This article looks at search in the broadest sense. Most search products today
rely on a full-text index of content. All kinds of value-added services and
technologies can be layered on top of this index to cluster like documents,
automatically expand queries, intelligently pinpoint specific answers; the
list goes on. The many flavors of search technology were nicely sampled by Sebastian
Holst in the Gilbane Report article Searching for Value in Search Technology
(Vol. 10, No. 7). In this article we'll look more generally at how search
engines, regardless of the sophisticated retrieval features they may or may
not offer, can be integrated with managed content as well as other disparate
content that resides within the organization.
The Ideal Solution
There are many considerations to keep in mind when looking at the ways in
which a search engine can integrate with a CMS. The requirements in each organization
are different as the content sources vary and users have different needs. For
this reason, saying there is an 'ideal solution' is hyperbole, but this article
is intended to provide a framework for thinking about the often complex issues
that arise when synchronizing a search engine with a CMS, and other information
sources in the enterprise.
An ideal coupling of a search engine and a CMS will result in:
- A full-featured search engine that can access content, structured and
unstructured, originating from many sources and systems.
- A search engine that is able to search the full-text of documents as well
as any metadata associated with those documents regardless of where that
data is stored (i.e., on the document or not).
- Search results that are secure at the document-level without the need to
duplicate access control information in multiple stores.
- Content available for searching as soon as it is published.
- Unpublished content also available via search to those CMS users who are
privileged to access it.
Figure 1 identifies some of the technologies and integration points that come
into play in addressing the above issues.

Figure 1.
There are a number of issues identified in the above graphic. The intent is
to show the possible technologies that might be used, not to say that all of
these are required in each solution. For example, to search an external news
feed, it may make sense to federate results if that feed maintains its own
search engine, import its data to the search index (which may cause size concerns),
or crawl some version of the feed that is available in-house or made available
by the news feed provider. One of these is likely to be the chosen solution,
not all.
Looking specifically at the integration points with a CMS, let's identify
some of the key components articulated above:
- A central search engine is able to provide indexing and search services
for content that is coming from one CMS, multiple content management systems,
content that is not currently managed by a CMS, external content, structured
data and email, or other collaboration stores. This service provides all
the search features that users expect (Boolean, fuzzy search, parametric
search, etc.). It may also include auto-classify or clustering features.
- This index is created and maintained using a number of technologies based
on the type of content or data, its location, its security, and the frequency
with which it changes.
Most search engines today are still built
upon a crawler-based architecture. This means the search engine is able
to crawl a wide variety of document types to build a full-text index
of these documents. Crawlers can typically be scheduled and can run incrementally
so that only new content is added to the index. Crawling pages that have
been published by a CMS and thus pushed to a web server is probably the
most common integration point, though in cases where the search engine
is tightly linked with the repository and all content is stored there,
crawling of published pages may not be necessary.
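The incremental behavior described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the in-memory dictionaries, and the use of a content digest to detect change are all hypothetical, and a real crawler might rely on modification dates or vendor-specific mechanisms instead.

```python
import hashlib


def incremental_crawl(pages, index):
    """Re-index only pages that are new or whose content has changed.

    pages: dict mapping URL -> current page text (a stand-in for fetched pages).
    index: dict mapping URL -> {"digest": str, "text": str}; updated in place.
    Returns the sorted list of URLs that were (re)indexed on this pass.
    """
    changed = []
    for url, text in pages.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        entry = index.get(url)
        # Skip pages whose stored digest matches the freshly computed one.
        if entry is None or entry["digest"] != digest:
            index[url] = {"digest": digest, "text": text}
            changed.append(url)
    return sorted(changed)


# Hypothetical example: two scheduled crawler runs over a small site.
index = {}
first = incremental_crawl({"/a.html": "alpha", "/b.html": "beta"}, index)
second = incremental_crawl(
    {"/a.html": "alpha", "/b.html": "beta v2", "/c.html": "gamma"}, index
)
```

On the second run only the changed page and the new page are indexed again; the unchanged page is skipped.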
Crawlers must be supplemented by import
features. This allows data to be loaded directly into the index. In cases
where the search engine has no direct access to a repository (and so
no access to content that is not published), import features allow the
system to send its data to the index and synchronize it with any data
that the search engine already has about a particular piece of content.
This can be the most effective way to ensure that metadata and content
that are stored separately are included together in the search engine's
index.
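As a rough sketch of what such an import might look like, the snippet below merges metadata supplied by a CMS into the index record that a crawler has already created for the same document. The record layout, the function name, and the use of the URL as the shared key are assumptions made for illustration, not any product's actual interface.

```python
def import_metadata(index, doc_id, metadata):
    """Merge CMS-supplied metadata into the index record for doc_id.

    If the crawler has not seen the document yet (for example, because it
    is unpublished), a record with empty full text is created so that the
    metadata is still searchable.
    """
    record = index.setdefault(doc_id, {"text": "", "metadata": {}})
    record["metadata"].update(metadata)
    return record


# Hypothetical index: one page already crawled, one draft never crawled.
index = {"/news/42.html": {"text": "Quarterly results announced", "metadata": {}}}
import_metadata(index, "/news/42.html", {"author": "J. Smith", "status": "published"})
import_metadata(index, "/drafts/43.html", {"author": "A. Jones", "status": "draft"})
```

Keying both the crawled text and the imported metadata on one identifier is what keeps the two in sync; choosing that identifier is usually the hard part of the integration.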
Both crawlers and importing features may
need to be triggered by agents. Agents can alert the search engine when
there is new content available. So, for example, the CMS may notify the
search engine that a piece of content has been published. This will tell
the search engine to crawl that content and it may also initiate the
importing of the metadata about that document to the search engine.
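The notification flow might be sketched as follows. The `PublishAgent` class and its callbacks are hypothetical; they simply show a CMS queuing a publish event and an agent relaying it to the search engine as a crawl followed by a metadata import.

```python
from collections import deque


class PublishAgent:
    """Hypothetical agent that relays CMS publish events to a search engine."""

    def __init__(self, crawl, import_metadata):
        self._crawl = crawl                      # callable(url)
        self._import_metadata = import_metadata  # callable(url, metadata)
        self._events = deque()

    def notify(self, url, metadata):
        """Called by the CMS when a piece of content has been published."""
        self._events.append((url, metadata))

    def process(self):
        """Handle queued events in arrival order; return how many were handled."""
        handled = 0
        while self._events:
            url, metadata = self._events.popleft()
            self._crawl(url)
            self._import_metadata(url, metadata)
            handled += 1
        return handled


# Record the calls instead of contacting a real search engine.
calls = []
agent = PublishAgent(
    crawl=lambda url: calls.append(("crawl", url)),
    import_metadata=lambda url, md: calls.append(("import", url, md)),
)
agent.notify("/press/2003-09.html", {"category": "press"})
handled = agent.process()
```

Queuing events rather than calling the search engine directly from the publish step is one possible design choice; it keeps publishing from blocking on indexing.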
As many systems include their own search
engines, federated search capabilities can often be the most straightforward
way of providing users with a unified search across multiple sources.
Federated search allows one search engine to query and retrieve results
from another, filter and de-duplicate results, and deliver one coherent
set of results to the searcher. It's important that the federation features
be cross-product and cross-vendor.
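A minimal sketch of the merge and de-duplication step is shown below. It assumes, hypothetically, that every source returns hits with a URL and a relevance score on a comparable scale; in practice scores from different engines are rarely comparable without normalization, and the source names here are invented.

```python
def federated_search(query, sources):
    """Query each source, drop duplicates by URL, and return one ranked list.

    sources: dict mapping a source name to a callable(query) -> list of hits,
             where each hit is a dict with "url", "title", and "score" keys.
    """
    best = {}
    for name, search in sources.items():
        for hit in search(query):
            current = best.get(hit["url"])
            # Keep only the highest-scoring copy of each document.
            if current is None or hit["score"] > current["score"]:
                best[hit["url"]] = {**hit, "source": name}
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


# Hypothetical source engines returning canned results.
def cms_search(query):
    return [{"url": "/a", "title": "A", "score": 0.9},
            {"url": "/b", "title": "B", "score": 0.4}]


def portal_search(query):
    return [{"url": "/b", "title": "B (portal copy)", "score": 0.7},
            {"url": "/c", "title": "C", "score": 0.5}]


results = federated_search("pricing", {"cms": cms_search, "portal": portal_search})
```

The document that both engines return appears once in the merged list, attributed to the source that scored it higher.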
Many content management systems today promote the idea of a 'virtual
repository' where the CMS is managing just the metadata about a piece of
content and that piece of content remains in its original location. This
concept is amenable to multi-repository search requirements and can enhance
the integration of the two technologies. However, it is important to ensure
that a) the search engine can access the source document and b) the metadata
stored by the CMS is synced with the full-text information the search engine
has gathered.
Security
Security has long been, and continues to be, a sticky situation with search
and is such a big issue it deserves more lengthy coverage. In the past, organizations
often chose to only index publicly available documents so as to avoid the security
issue altogether. This is not a viable solution for many organizations as they
move towards centralizing content management and search services. Ensuring
that search results are secure is an increasingly important concern and one
that has several possible and partial solutions.
To be secure, search results need to be filtered so that the results page
only shows links for documents to which that user has access. This means the
index must understand who the user is and what she can see. Showing all results
and leaving authentication to the source repository so that a user is challenged after she
clicks on a search result link is not sufficient. Users may see private information
in the search results page, even though they are not able to access the actual
document when they click through the result link.
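The filtering requirement can be illustrated with a short sketch. It assumes, purely for illustration, that the search layer has a mapping from each document to the groups allowed to read it; the names and data are hypothetical, and documents without an access entry are hidden rather than shown.

```python
def secure_results(hits, user_groups, acl):
    """Return only the hits the user is permitted to read.

    hits: list of dicts with at least an "id" key.
    user_groups: set of groups the current user belongs to.
    acl: dict mapping document id -> set of groups allowed to read it.
    Documents with no ACL entry are treated as unreadable (fail closed).
    """
    visible = []
    for hit in hits:
        permitted = acl.get(hit["id"], set())
        if permitted & user_groups:
            visible.append(hit)
    return visible


# Hypothetical result set and access control data.
hits = [
    {"id": "hr/salaries.doc", "title": "Salary bands"},
    {"id": "pub/faq.html", "title": "FAQ"},
    {"id": "tmp/unknown.txt", "title": "Untracked file"},
]
acl = {"hr/salaries.doc": {"hr"}, "pub/faq.html": {"everyone", "hr"}}
visible = secure_results(hits, {"everyone"}, acl)
```

Because filtering happens before the results page is built, the title of the restricted document never reaches a user who cannot open it.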
There are a number of ways that this secure results filtering can be accomplished.
The search engine can store access control information associated with each
record in its index. This provides a solution but can be a laborious process
to set up and requires that information be stored in more than one location.
Better are systems that can work from a centralized authentication scheme (like
a Unix, MS, or LDAP login) to identify which sources a user can access: if
you can't access Lotus Notes, the system will not even look at those results.
This can be a fast way to solve one portion of the problem but doesn't address
the more granular security issues within a particular system. To accomplish
this, the search engine must be able to filter the search request through the
authorization mechanism of the source system. This may slow the search results
or may require further duplication of security data stored in the search engine.
Many organizations today are starting to move towards centralized policy or 'identity
management' solutions that layer authorization, policy enforcement, and single
sign-on on top of standard LDAP directories. As these identity management solutions
are integrated with search engines, they may offer the most efficient way to
provide secure search results without a lot of duplication, provided they are
able to do so while maintaining adequate search engine speed. This is still
an emerging concept, however, and not well advanced in most organizations. Figure
1 shows both a centralized authentication and authorization layer along with
specific integrations that may need to be done to ensure the document-level
security of content coming from specific systems.
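The combination of a coarse source-level check with document-level authorization delegated to each source system can be sketched roughly as follows. Every name here is hypothetical: `source_access` stands in for what a centralized login would reveal, and each entry in `authorizers` stands in for a call to the source system's own authorization mechanism.

```python
def secure_search(query, user, sources, source_access, authorizers):
    """Two-stage filtering: skip sources the user cannot reach, then ask each
    remaining source system whether the user may see each individual hit."""
    results = []
    for name, search in sources.items():
        if name not in source_access.get(user, set()):
            continue  # coarse check: the source is never queried for this user
        can_read = authorizers[name]  # fine-grained check owned by the source
        results.extend(hit for hit in search(query) if can_read(user, hit))
    return results


# Hypothetical data: two sources and their per-document rules.
sources = {
    "notes": lambda q: ["notes/budget", "notes/roadmap"],
    "cms": lambda q: ["cms/press-release", "cms/draft-memo"],
}
source_access = {"pat": {"cms"}, "lee": {"cms", "notes"}}
authorizers = {
    "notes": lambda user, hit: True,
    "cms": lambda user, hit: hit != "cms/draft-memo" or user == "lee",
}

pat_hits = secure_search("plan", "pat", sources, source_access, authorizers)
lee_hits = secure_search("plan", "lee", sources, source_access, authorizers)
```

The first stage is cheap and avoids querying sources at all; the second stage is where the speed cost mentioned above tends to appear, since it involves one authorization decision per hit.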
Approaches to Integrating Search & CMS Products
Figure 1 clearly represents a complex environment and this complexity is why
effectively integrating search technology with a CMS is not always a straightforward
task. There are different approaches that an organization can take. Many CMS
vendors today include search technology as a feature of their products and
this represents the first possible approach. The other is to work with two
stand-alone products for CMS and search engine technology.
CMS vendors clearly recognize the importance of effective search in making
content management successful. This is certainly something for which customers
consistently clamor. The scope and scale of this functionality can vary quite
a bit product by product, depending on the origin of the search technology.
Some CMS vendors have built it, others have bought it, and others OEM some
version of a search engine from a search vendor such as Verity or Autonomy.
For example, Documentum OEMs Verity; Vignette OEMs Autonomy; FatWire OEMs AltaVista,
Autonomy, and Verity; Stellent OEMs Convera; and Interwoven resells iPhrase.
There are a number of potential benefits in taking this approach.
- The search engine may be tightly integrated and able to leverage CMS
metadata. Tight integration could enable searching of published and
unpublished content.
- The search engine may natively respect the access privileges managed by
the CMS.
- No additional license or integration costs are required.
These are only identified as potential benefits as the actual search features
provided and the level of integration with the CMS can vary substantially.
Similarly, depending on the specifics of the product and the integration, this
solution may also have the following drawbacks.
- CMS (and other) vendors often OEM a search engine from a search engine
vendor, as seen in Table 1. OEM versions of products can be limited both
in terms of the features they provide and in terms of the level of integration
that is available. An OEM product is not generally the full product that
would be provided by the search vendor if purchased independently.
- The CMS product license (particularly if it is an OEM) may only include
the ability to search content managed by that CMS and perhaps may only
be intended for system users accessing the repository, not site users searching
published content.
- A CMS vendor's search may not be able to crawl other systems or repositories.
- The included search engine may not offer the most sophisticated or cutting
edge search features that are available from independent search vendors (the
Gilbane Report article referenced above is a good source for more information
on some of these advanced features).
Despite these potential pitfalls, using the search features provided by a
CMS vendor, whether an OEM of another product or native product features, is
a common approach among customers today. This approach can solve the search
problem for a particular site or set of sites that are running a CMS and are
not looking to provide a unified search across multiple systems or sources.
For organizations standardizing on a single CMS, this approach may also make
more sense, provided the search engine is able to access other data or information
types (like email) if required.
The other primary approach is to work with separate products from search and
CMS vendors. Some well-known enterprise search vendors include Autonomy, Convera,
Google, and Verity.
This is just a sampling of some of the more mainstream search engine vendors.
There are many vendors offering new and different twists to help solve the
search problem. See the Gilbane Report article In Search of Search Solutions
(Vol. 10, No. 3) for a more comprehensive list of these vendors.
It should be noted that even when a CMS vendor OEMs a particular search engine,
most maintain relationships with the other leading search vendors as well.
Customers can generally choose not to use an OEMed product and to go with another
search product without too much difficulty.
Taking this best-of-breed approach offers a number of benefits.
- These search engines are content and system agnostic.
- They provide a centralized index that can be built from many content
and data sources.
- Independent products are typically full-featured and sophisticated.
- For organizations with search engines already in place, it's likely that
one of these is already the enterprise search provider.
Yet to achieve the 'ideal solution' articulated above, a significant amount
of integration work between the two systems is likely to be needed. Without extensive
integration work, the search engine may:
- not easily leverage CMS metadata.
- only crawl published pages from the CMS.
- not be able to search content as soon as it is published.
- require a lot of duplication of access privileges to secure search results.
Addressing these issues will require the use of a number of the technologies
depicted in Figure 1: federation, agents, gateways, and import features. The
specifics depend on the requirements, the available integration between the
two products today, and the capabilities of those products.
This approach has the best chance of coming close to the ideal solution if
the requisite integration work is well thought out and complete. With that
work done, it can meet large enterprise-scale search needs by providing
a central indexing service along with tight CMS integration.
Where Do Portals Fit In?
The line between portal products and content management systems continues
to blur, as we explored in a previous Gilbane Report article, Portals & Content
Management Systems: Have Two Markets Become One? (Vol. 11, No. 4). Search
has been a service portal products have provided since the early days. As with
CMS vendors, this search functionality can have different origins and different
capabilities. Table 1 looks at some portal vendors and the search capabilities
they provide.
Vendor Name | Search Included? | Origin
BEA | Yes | OEM - Autonomy
IBM | Yes | Lotus
Oracle | Yes | Oracle
Plumtree | Yes | RipFire acquisition
Sun | Yes | Netscape
Table 1: Portal Vendors and Search Features
Adding portal technology to the CMS and search engine mix has the potential
to both muddy the waters and to offer a solution. Integrating portal technology
that also includes a search engine presents several potential benefits.
- CMS and portal vendors have done a lot of pre-integration work that is
available to customers. Sometimes this includes search integration.
- Portals are increasingly addressing the need for centralized access control
(or identity management) which may be leveraged by the portal, the search
engine, and hopefully the CMS.
- Portal search features will almost always be multi-repository.
- The search engine will be included in the portal license.
However, portal products in many cases have similar relationships to search
vendors as the CMS vendors do, so potential cons to this approach are also
similar.
- Portals also often use OEM versions of other vendors' search products.
These can have limited licenses, scope, and integration.
- A portal's search engine integration with a CMS may be no different than
if the two products were purchased stand-alone; the search engine may not
be in sync with the publishing process and may only be able to crawl
published pages.
Conclusions & Recommendations
Knowing that a product 'comes with search' or that 'we already have a search
engine' is never enough. Be sure that you understand the technology underlying
the search engine, whether it is bundled in a CMS product or purchased stand-alone.
Questions to consider are:
- Is it crawler-based only?
- Does it have import features?
- Can these be triggered by agents that understand when new content is available?
- Can it crawl structured repositories and email systems or just web pages
and documents?
It is also important to understand the specifics of the CMS integration, if
it is already available. Think about the following:
- Will metadata stored in the CMS be indexed by the search engine along with
the document's full text?
- Does the search engine do any auto-categorization and, if so, how will this
merge with existing, manually applied metadata?
- Will the search engine leverage CMS access controls or does this information
have to be duplicated to provide secure search?
- Will authors and publishers use the same search to find in-process documents
in the repository?
In cases where the CMS vendor does provide search, make sure you understand
the search engine's license structure.
- What features does an OEM search engine provide? How do these compare to
the full product available from the search vendor?
- Is it only licensed to search content managed by this CMS?
- What's involved in extending the license? Is the CMS vendor authorized
to resell additional licenses or does it require working directly with the
search vendor?
Perhaps most important in beginning an initiative in this vein is to identify
the search experts in-house and at the vendors you're working with. Search
is fairly specialized and the folks who understand intricately how it works
or how it will integrate with the CMS may not be the same folks who typically
sell or implement the CMS. Ask the really tough questions that don't make sense
to you; most likely they don't make sense to the guy on the other side of
the table either.
Kathleen Reidy, kathleenoreidy@yahoo.com