The Gilbane Report: Volume 10, Number 3

In Search of Search Solutions

April 2002

Search, and search-related, technology is enjoying renewed interest these days. There are dozens of products, analysts are busy selling reports, and businesses are realizing they could benefit from even incrementally better search capabilities. It is time to take a look at what's available.

We are constantly being told, as if we needed reminding, that we are so overwhelmed with information we often can't find the specific knowledge nuggets we need. The problem is not just the volume of information, but also the variety of information types, and the lack of information organization. Much of the current wave of development is focused either on the variety problem (structured, unstructured, and rich media) or on the organization problem (categorizing and taxonomy tools). Indeed, vendors often differentiate themselves based on which of these they specialize in. Businesses, however, need to look at all three aspects of the problem. Organized information is both easier to find and more useful when (re-)organized for specific uses once found. While there are many situations where a Google-like search is just what you need, many business applications require at least the ability to store what has been found for further use without having to recreate imaginative search queries. Organization (of which categorization is one aspect) and search should be considered together when building IT strategies.

This month, Sebastian provides you with a way to get started by laying out a high-level taxonomy of the market, and some guidance on what to think about as you consider investing in new search technologies.

In Search of Search Solutions

If you can't find it - it isn't there.

Prediction: You are often frustrated and ultimately fail when looking for information. Sometimes this happens on the Internet and sometimes you are working within a specific application.

Do I have ESP? Is this an example of some sort of personalization where each Gilbane Report is customized just for you? Sadly, it is neither of these. The simple truth is that if you use a computer, search technology has let you down on more than one occasion. There are numerous reasons for this, and not all of the blame can be placed upon search technology. In fact, intelligent search technology is deceptively complex. This month, we will look at the market dynamics that are pushing search technology to new heights, categorize the various flavors of search technology that have emerged, and review a cross section of the software vendors that are hoping to solve your search problems once and for all.

Market Drivers

Our use of technology in the workplace and at home continues to evolve rapidly. Each new application, device, and media type brings with it new technology and usability requirements that the search technology of only a few years ago simply cannot support. The result is that there has been a revival for search technology vendors. The major factors driving this growth in search technology include the following.

Explosive increase in the volume of content

  • Raw content: It has been estimated that the total volume of information on the planet is doubling every three years. Much of this content is in digital form.
  • Published content: The volume of content becoming available publicly (on the Internet) is multiplying at an even greater rate.
  • Connected content: As applications become integrated across broader sections of business and society, each individual application has access to greater amounts of content that had previously been contained within one or more "stove pipe" applications.

Increased variety of content

  • New Media Types: Rich media formats including video, audio and images for traditional distribution and emerging digital channels have completely changed the rules of the search game.
  • Metadata: The increased use of metadata (data about content) to capture rights, usage, ownership, etc. in both general and industry-specific ways has radically altered how search criteria need to be applied. Often metadata is stored and managed independently from the content it describes. This alters not only the search algorithms, but also how results need to be returned and managed.
  • Structured Content: The rapid adoption of the XML family to capture domain semantics (meaning for particular uses), presentation rules as well as metadata has resulted in further requirements for search algorithms, result management and presentation.

Greater variety and numbers of users

  • Uses and roles: The proliferation of the Internet into virtually every facet of our daily lives, from daycare to automobile shopping, has brought with it an equal number of new use cases and categories of users. The kinds of searches, user expectations of accuracy and completeness, and assumptions about privacy and reuse often change quite dramatically across these new user profiles.
  • Skill levels: As web-based applications find themselves in increasingly specialized uses, the skill levels of users are also becoming increasingly diverse. Novice, elderly, toddler and special needs users all bring unique requirements, as do domain experts such as medical doctors, lawyers, chemists and engineers, all of whom have unique search use cases and varying degrees of willingness and ability to be trained in specialized search techniques.
  • Languages: Global communication and information sharing as well as increased access to the Internet play havoc with information management in general and search techniques and their underlying assumptions in particular. Sorting, indexing and organizing content expressed in multiple languages requires special technical and operational considerations that a single language environment simply does not need to consider.

What Does Searching Include?

Searching is the process of matching a user's request with a set of results that meet that specific request. Beyond that, the definition gets very complex very quickly. Databases, document management systems, video archives, digital libraries, web sites and every other kind of information store act as incubators for evolving search technologies. As the Internet infrastructure connects all of these various data stores, there will be increasing pressure not only to improve search technologies but also to integrate them. The following is intended to provide a holistic overview of basic search functionality that can be used to evaluate the diverse array of search products in the market today. The objective here is to optimize for accuracy while clearly skimping on precision.

Ingestion

In order to provide the response times that users have come to expect, search engines must build indexes and collect statistics on the content before they can process their first query. The following functions are often, although not always, present in search engines.

Content extraction functions include:

  • Filters look inside content to extract information that is ordinarily hidden within a proprietary format such as PowerPoint, PDF, etc.
  • Loggers analyze, deconstruct and extract information from time-based content such as video and audio.
  • Parsers analyze structure and extract information and raw content from structured content such as XML.
  • Transformers and encoders generate proxy and alternate versions of content for simplified searching and previewing of content.
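
As a rough illustration of the parser role described above, the following sketch pulls descriptive fields and raw text out of a small XML document using Python's standard library. The element names (article, title, author, body) and the parse_document helper are assumptions made for this example and do not describe any particular vendor's product.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are assumptions for this sketch.
SAMPLE = """<article>
  <title>In Search of Search Solutions</title>
  <author>Sebastian Holst</author>
  <body>
    <p>Search technology is enjoying renewed interest.</p>
    <p>This is a crowded market.</p>
  </body>
</article>"""

def parse_document(xml_text):
    """Split an XML document into structured fields and raw body text."""
    root = ET.fromstring(xml_text)
    # Every top-level element except <body> is treated as a descriptive field.
    fields = {
        child.tag: (child.text or "").strip()
        for child in root
        if child.tag != "body"
    }
    body = root.find("body")
    if body is None:
        return fields, ""
    # Gather all text under <body> and collapse runs of whitespace.
    raw_text = " ".join(" ".join(body.itertext()).split())
    return fields, raw_text

fields, raw_text = parse_document(SAMPLE)
```

The fields could feed a fielded index (search by author or title), while the raw text could feed a full-text index.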

Figure 1. Ingestion and Indexing

Content organization and categorization functions include:

  • Key words: These are terms that users would associate with content, e.g., news, celebrities, software, etc.
  • Structure: Semantic (meaning) and formatting structure can be captured, e.g., owner versus page numbers.
  • Associations: This is a very broad category that includes relationships (versions), topic maps (states to cities), linguistic (monkey is an animal), etc.

These functions can be provided by users/administrators but are increasingly being offered as intelligent inference services within the search engine.
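
To make the linguistic association concrete ("monkey is an animal"), here is a minimal sketch of how a stored is-a relationship might be used to widen a search. The IS_A table and both helper functions are hypothetical, and the sketch assumes the relationships contain no cycles.

```python
# Hypothetical is-a associations: each term points to its broader term.
IS_A = {"monkey": "primate", "primate": "animal", "dog": "animal"}

def ancestors(term):
    """Follow is-a links upward, e.g. monkey -> primate -> animal."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain

def expand_query(term):
    """Return the term plus every narrower term filed beneath it."""
    narrower = [t for t in IS_A if term in ancestors(t)]
    return sorted([term] + narrower)
```

With this table, a search for "animal" would also look for "dog", "monkey" and "primate", so content tagged only with the narrower term can still be found.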

The shift from manual to automatic ingestion is at the heart of much of the innovation in today's search technology. The cost in person-power and time required to manually tag content with key words and associations often precludes the use of search technology in all but the most critical applications. As automatic ingestion becomes increasingly accurate and sophisticated, the market for advanced search technology can be expected to explode.

Query Processing


Figure 2. Query Processing

Once content has been analyzed and an index has been built, searches can be submitted and processed, and results returned. This section describes functions that are often, although not always, present in search engines.

Users must articulate their request. SQL queries, natural language interpreters, navigation a la Yahoo!, form-based, inference, and query by example (including cut and paste of images, audio and video) are all available to users today.

At this point, the search engine takes over and starts crunching statistics and navigating various links. Expressions are resolved, patterns are matched, categories and associations are traversed and the resulting answer set is often sorted by content value (date, last name) or by relevance (proximity or nearness of match).
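
The two kinds of ordering mentioned above can be sketched as follows. The relevance measure here is a deliberately crude assumption (a count of how often the query words occur) and not the proximity-based ranking a commercial engine would use. The record layout and function names are also hypothetical.

```python
from datetime import date

def relevance(text, query):
    """Crude relevance score: how many times the query words occur in the text."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())

def sort_by_relevance(matches, query):
    """Order matches from most to least relevant to the query."""
    return sorted(matches, key=lambda m: relevance(m["text"], query), reverse=True)

def sort_by_date(matches):
    """Order matches by a content value, here newest first."""
    return sorted(matches, key=lambda m: m["date"], reverse=True)

# Hypothetical answer set.
matches = [
    {"id": "a", "date": date(2002, 1, 15), "text": "search tools for video search"},
    {"id": "b", "date": date(2002, 3, 2), "text": "database tools"},
    {"id": "c", "date": date(2001, 11, 30), "text": "video search"},
]
```

The same answer set produces a different first page depending on which ordering is applied, which is why the choice of sort matters as much as the match itself.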

The resulting matches are ordered, formatted and presented to the user for inspection, with direct access to the specific content referenced in the result set. Voilà!

In an ideal world, the user is then presented with a result set that includes:

  • Meaningful references, i.e., the summary or preview of each match accurately represents the actual match. Examples include thumbnails of a PDF page, a proxy of a video clip, or a gist of a piece of text.
  • No false positives, i.e., each result returned satisfies the user's search criteria.
  • No false negatives, i.e., no content has been excluded from the result set that the user would have been interested to review.
  • Accurate sorting and relevancy ranking, i.e., of the twenty thousand potential matches to a query, the 20 results formatted on the first page truly are the most relevant.
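
False positives and false negatives can be counted when a test collection with known relevant documents is available. The sketch below uses the standard information retrieval measures of precision (the share of returned results that are relevant) and recall (the share of relevant content that was returned). The function name and document ids are hypothetical.

```python
def evaluate(returned, relevant):
    """Compare a result set against the documents known to be relevant."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    return {
        # Returned but not relevant.
        "false_positives": returned - relevant,
        # Relevant but not returned.
        "false_negatives": relevant - returned,
        "precision": len(hits) / len(returned) if returned else 0.0,
        "recall": len(hits) / len(relevant) if relevant else 0.0,
    }

# Hypothetical test: the engine returned four documents, three are truly relevant.
report = evaluate(["d1", "d2", "d3", "d4"], ["d2", "d3", "d5"])
```

An ideal result set in the sense described above has empty false-positive and false-negative sets, which corresponds to precision and recall both equal to 1.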

Of course, in the real world users are often imprecise in the expression of their search criteria, search engines are limited in the variety of searches they can perform, and content is often poorly categorized or completely unavailable to the search application. This is the problem that enterprising search technology vendors are trying to solve today.

Some commercial solutions focus on the automation of categorization, metadata creation, and keyword generation. One challenge is how to generate this information in a precise enough way to map into the distinct and often contradictory models across industries and use cases. Academic institutions, intelligence agencies and corporations see the world through very different lenses and therefore expect content to be organized accordingly.

Another challenge is that different media types require very different technology to peer into and analyze content. The algorithms required to extract the semantic meaning from an audio track are quite distinct from those used to analyze a novel.

This is a seemingly impossible hurdle to clear in a completely general way for all use cases, but vendors are able to deliver impressive results by reducing the variety of content and contexts to be supported at any given point in time. Increasingly, cost savings and productivity enhancements are clearly validating this approach of increased capability over specific classes or categories of content.

Issues to be wary of include:

  • Automatically generated metadata and categories are not likely to match industry standards that your organization and your trading partners may be considering, e.g., PRISM, ICE, etc.
  • While content may have many uses and be of interest to many different user communities, the metadata and categories are not likely to have the same transferability.
  • If the ingestion components of a search engine cannot process specific media types, significant amounts of content may be inaccessible.

One approach to compensating for divergent and distributed indexes, categories and locations is the mixed search.


Figure 3. Mixed Search

A mixed search engine is one that accepts a search request and then dispatches localized versions of that search to multiple search engines. The results are then normalized, aggregated and returned as a single result set. Dependence on multiple search strategies that are often supplied by third party search technology can lead to some unexpected results, but it is the best and only way to search across all data regardless of location, format, or use. Mixed search solutions can be homegrown, be part of a content or digital asset management system, or be sold by search vendors.
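
The dispatch, normalize and aggregate steps can be sketched as follows. The two engines are hypothetical stand-ins that return hard-coded results. A real mixed search would also translate the query into each engine's own syntax, which is omitted here. Rescaling each engine's scores against its own top score is one simple normalization choice among many, not a standard.

```python
def normalize(results):
    """Rescale one engine's raw scores to the range 0..1."""
    if not results:
        return []
    top = max(score for _, score in results)
    if top <= 0:
        return [(doc_id, 0.0) for doc_id, _ in results]
    return [(doc_id, score / top) for doc_id, score in results]

def mixed_search(query, engines):
    """Dispatch the query to every engine and merge into a single result set."""
    merged = {}
    for engine in engines:
        for doc_id, score in normalize(engine(query)):
            # A document found by several engines keeps its best score.
            merged[doc_id] = max(score, merged.get(doc_id, 0.0))
    # Highest score first; ties are broken by document id for stable output.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical engines with different scoring scales.
def text_engine(query):
    return [("report.pdf", 8.0), ("memo.doc", 4.0)]

def video_engine(query):
    return [("clip.mpg", 0.9), ("report.pdf", 0.3)]

results = mixed_search("budget", [text_engine, video_engine])
```

Because each engine scores on its own scale, the normalization step largely decides the final order, which is one source of the unexpected results mentioned above.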

Classifying Search Technologies

There is no individual search technology that can cover the full spectrum of content types and search algorithms nor is there (yet) a single company that has integrated the search techniques to provide a one-stop-shop. It is therefore important to consider the strengths and weaknesses of each approach. The following framework is intended to provide some degree of order when evaluating the myriad of search products and technologies that are currently available.


Figure 4: Classifying Search Technologies

Figure 4 illustrates how four families of search categories have covered the entire spectrum of searches. As we review vendor examples of each category, there are three important points to keep in mind.

  • Individual products and the companies that offer them are constantly expanding their vision and scope. As such, this diagram is not intended to imply permanent limitations, rather to emphasize centers of excellence and historical success.
  • The companies mentioned are not intended to provide an exhaustive list of search companies and their products. These represent a subset taken from a larger group that the author was quickly able to identify. This is a very crowded field.
  • The author has not personally evaluated each product offering and cannot therefore warrant any product's quality or suitability for a particular purpose.

DBMS

Search technology in DBMSs has developed from query optimization engines that crunched out specific search strategies for relational queries to hybrid search engines that include category and cross-media search. The most advanced of these now include references to content outside of the DBMS in question. This is part of a larger trend in which the large DBMS providers attempt to flip the content world upside down by making the DBMS the file manager for the enterprise (rather than having the DBMS sit in the file system). As long as this paradigm is not uncomfortable and there is no issue with running all content through a single DBMS product, this technology can be quite comprehensive.

Traditional Web

Web-oriented search technology has been delivered primarily through public portals (e.g., AltaVista, Excite, Lycos, HotBot and, recently, Google). These products are relatively unsophisticated in terms of their search algorithms: they build indexes of all significant words and use those to look up documents. The quality of results depends on the range, sophistication and frequency of updating page links (the "crawling" process), and on differences in the algorithms used to rank the relevance of the results returned.
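
The index-of-significant-words approach is commonly implemented as an inverted index. The following is a minimal sketch under simplifying assumptions: a tiny hypothetical stop-word list, no crawling, and no relevance ranking.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; production engines use far larger ones.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and keep only the significant words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_index(documents):
    """Map each significant word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def lookup(index, query):
    """Return ids of documents that contain every significant query word."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical document collection.
documents = {
    "doc1": "The history of search engines",
    "doc2": "Search technology and the enterprise",
    "doc3": "A history of the enterprise database",
}
index = build_index(documents)
```

Building the index is the expensive step, and it is done once at ingestion. Each query then reduces to a few set lookups, which is what makes fast response times possible over large collections.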

Premium Search

Premium search solutions are based upon advanced search technology and focus primarily on large, strategic collections of content whose owners are willing to pay top dollar for the most robust search capabilities possible.

In 2002, the state of the art typically includes some combination of statistical, semantic, syntactic, and contextual methods to understand key concepts for the organization, enhancement, and utilization of relevant information. In other words, solutions use complex algorithms that use every possible computational method available to replace the need to manually define taxonomies, keywords, and metadata, and to properly assign content to these categories. Most of these algorithms will not be understood by business or IT managers, and it is difficult to determine how well they work. However, there are some very impressive demonstrations available and there is a lot of serious development going on. (We may delve further into the different algorithm types in an upcoming article.)

Rich Media

This typically includes specific fluency in logging and indexing time-based media (video and audio) to extract and synchronize clips, closed captioning, speech to text and other data extraction utilities.

Rich Media Extended

Extended rich media search includes advanced pattern matching across individual images as well as across time-based sequences. Facial image recognition and action recognition, e.g., the scoring of a goal in sports, are both examples of this expanded media search capability.

Mixed Search

This is a pluggable architecture that includes the dispatch, aggregation and normalization of results across multiple, heterogeneous search engines.


Columns (search categories): Traditional Web, DBMS, Premium Search, Rich Media, Rich Media Extended, Mixed Search

Rows (vendors): Albert, Altavista, Answerfriend, Applied Semantics, Autonomy, ClearForest, Convera, DreMedia, eVision, Fast Search & Transfer, Fast-Talk, Google, H5, IBM, Inktomi, Insightful, InXight, Iphrase, LingoMotors, LTU Technologies, Mohomine, Mondosoft, Oracle, Primus, Quiver, Sageware, Semio, Smartlogik, Stratify, Unifind, Verity, Virage, Wherewithal

Table 1. Mapping selected search vendors to the search techniques graph in Figure 4.

Observations

  • The first obvious conclusion is that this is a crowded market.
  • The majority of activity is currently focused on automating the ingestion, indexing and categorization of content. Both the Premium Search and the Rich Media Extended technologies invest heavily in automating the definition and population of information models. The reasonable premise behind this is that if it is too expensive and time consuming to organize and mark-up content, the vast majority of that content will never become searchable. While this is in fact true, it is also the case that auto-generated metadata and taxonomy models cannot be relied upon to facilitate the interchange of valuable content between organizations or to optimally preserve digital content for very long periods of time across multiple uses. Automating search ingestion and taxonomy generation is certainly a market-widening approach, but it can never fully displace careful and deliberate information modeling and content archiving.
  • Web search is rapidly becoming a low cost commodity.
  • Much of the cutting-edge work is being done in the rich and extended rich media space. While it is not typical form in these articles, here is some homework for the interested reader. Visit www.dremedia.com. One of the features of this search technology is that one can edit the XML-tagged textual transcripts of video. The index is used to automatically edit the original video to correspond with the cut-and-pasted text - this is a poor man's desktop broadcasting requiring no new editing skills beyond what one needs to edit email today. If the software works as advertised, this could get very interesting.
  • Premium Search is the most populated category in this high-level view of the market. In fact, there are a number of differences between some of these vendors, and some would prefer to be thought of as offering categorization products rather than search. We'll sub-divide this category in a future issue.

Conclusions & Recommendations

There is a "principle of least surprise" that is based upon the premise that predictable software holds more value than software whose performance might occasionally be spectacular but cannot be relied upon to provide consistent and expected results. For all but the most dogged researcher who has plenty of time on their hands and a great deal of research expertise of their own, the principle of least surprise should probably sit at or near the top of the priority list when selecting and deploying search technologies.

Know the strengths and weaknesses of the search technology you use. If blind spots and particular strengths are well understood, then users can appropriately compensate and can assess the likelihood that they may have "false positives" in their result set or may have left "false negatives" behind. Don't count solely on your ability to understand the relative merits of the sophisticated algorithms and linguistic, mathematical, and statistical theories they are based on. You need to test with your own content, including content from repositories you do not control, but need access to.

Assess user expectations and know how much search is enough. Casual and basic search needs call for only inexpensive and simple search tools. Advanced search requirements demand a more careful analysis of user expectations, user expertise, the state of the content being searched (existence of metadata, etc.), and the suitability of available search technology.

Begin good housekeeping practices immediately to be best prepared to take advantage of emerging technologies. Wherever possible, capture as much descriptive metadata as you can that may serve as useful search criteria in the future, e.g., creation dates, authors, subject matter, rights and permissions, etc. Much of this information simply cannot be inferred and must therefore be captured somewhere. Develop and utilize categorization and archival "best practices." Best practices for digital preservation are emerging, and library science is assimilating the XML family, making living digital archives a reality. This should greatly simplify search requirements for large organizations and leave every category of information-intensive organization best prepared to take advantage of the many interesting and powerful search technologies that continue to emerge and mature.

Sebastian Holst
sebastian@gilbane.com

 


The Gilbane Report is published by Bluebill Advisors, Inc. © 1993 - 2005 The Gilbane Report. All Rights Reserved.