The Gilbane Report: Volume 10, Number 3
April 2002
In Search of Search Solutions
Search, and search-related,
technology is enjoying renewed interest these days. There are dozens of products,
analysts are busy selling reports, and businesses are realizing they could benefit
from even incrementally better search capabilities. It is time to take a look
at what's available.
We are constantly being
told, as if we needed reminding, that we are so overwhelmed with information
we often can't find the specific knowledge nuggets we need. The problem is not
just the volume of information, but also the variety of information
types, and the lack of information organization. Much of the current
wave of development is focused either on the variety problem (structured, unstructured,
and rich media) or on the organization problem (categorizing and taxonomy tools).
Indeed, vendors often differentiate themselves based on which of these they
specialize in. Businesses, however, need to look at all three aspects of the
problem. Organized information is both easier to find and more useful when (re-)organized
for specific uses once found. While there are many situations where a Google-like
search is just what you need, many business applications require at least the
ability to store what has been found for further use without having to recreate
imaginative search queries. Organization (of which categorization is one aspect)
and search should be considered together when building IT strategies.
This month, Sebastian provides
you with a way to get started by laying out a high-level taxonomy of the market,
and some guidance on what to think about as you consider investing in new search
technologies.
In Search of Search Solutions
If you can't find it - it isn't there.
Prediction: You are
often frustrated and ultimately fail when looking for information. Sometimes
this happens on the Internet and sometimes you are working within a specific
application.
Do I have ESP? Is this an
example of some sort of personalization where each Gilbane Report is customized
just for you? Sadly, it is neither of these. The simple truth is that if you
use a computer, search technology has let you down on more than one occasion.
There are numerous reasons for this, and not all of the blame can be placed
upon search technology. In fact, intelligent search technology is deceptively
complex. This month, we will look at the market dynamics that are pushing search
technology to new heights, categorize the various flavors of search technology
that have emerged, and review a cross section of the software vendors that are
hoping to solve your search problems once and for all.
Market Drivers
Our use of technology in
the workplace and at home continues to evolve rapidly. Each new application,
device, and media type brings with it new technology and usability requirements
that the search technology of only a few years ago simply cannot support. The
result is that there has been a revival for search technology vendors. The major
factors driving this growth in search technology include the following.
Explosive increase in the volume of content
- Raw content: It
has been estimated that the total volume of information on the planet is doubling
every three years. Much of this content is in digital form.
- Published content:
The volume of content becoming available publicly (on the Internet) is
multiplying at an even greater rate.
- Connected content:
As applications become integrated across broader sections of business and
society, each individual application has access to greater amounts of content
that had previously been contained within one or more "stove pipe"
applications.
Increased variety of content
- New Media Types:
Rich media formats including video, audio and images for traditional distribution
and emerging digital channels have completely changed the rules of the search
game.
- Metadata: The
increased use of metadata (data about content) to capture rights, usage, ownership,
etc. in both general and industry-specific ways has radically altered how
search criteria need to be applied. Often metadata is stored and managed
independently from the content it describes. This alters not only the search
algorithms, but also how results need to be returned and managed.
- Structured Content:
The rapid adoption of the XML family to capture domain semantics (meaning
for particular uses), presentation rules, and metadata has resulted
in further requirements for search algorithms, result management and presentation.
Greater variety and numbers of users
- Uses and roles:
The proliferation of the Internet into virtually every facet of our daily
lives from daycare to automobile shopping has brought with it an equal number
of new use cases and categories of users. The kinds of searches, user expectations
of accuracy and completeness, and assumptions about privacy and reuse often
change quite dramatically across these new user profiles.
- Skill levels:
As web-based applications find themselves in increasingly specialized uses,
the skill levels of users are also becoming increasingly diverse. Novice,
elderly, toddler and special needs users all bring unique requirements, as
do domain experts such as medical doctors, lawyers, chemists and engineers,
all of whom have unique search use cases and varying degrees of willingness
and ability to be trained in specialized search techniques.
- Languages: Global
communication and information sharing as well as increased access to the Internet
play havoc with information management in general and search techniques and
their underlying assumptions in particular. Sorting, indexing and organizing
content expressed in multiple languages requires special technical and operational
considerations that a single language environment simply does not need to
consider.
What Does Searching Include?
Searching is the process
of matching a user's request with a set of results that meet that specific request.
Beyond that, the definition gets very complex very quickly. Database, document
management, video archives, digital libraries, web sites and every other kind
of information store act as incubators for evolving search technologies. As
the Internet infrastructure connects all of these various data stores, there
will be increasing pressure to not only improve search technologies but also
to integrate them. The following is intended to provide a holistic
overview of basic search functionality that can be used to evaluate the diverse
array of search products in the market today. The objective here is to optimize
for accuracy while clearly skimping on precision.
Ingestion
In order to provide the
response times that users have come to expect, search engines must build indexes
and collect statistics on the content before they can process their first query.
The following are functions that are often, although not always, present in
search engines.
Content extraction functions
include:
- Filters look inside
content to extract information that is ordinarily hidden within a proprietary
format such as PowerPoint, PDF, etc.
- Loggers analyze, deconstruct
and extract information from time-based content such as video and audio.
- Parsers analyze structure
and extract information and raw content from structured content such as XML.
- Transformers and encoders
generate proxy and alternate versions of content for simplified searching
and previewing of content.
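As a rough illustration of the parser function in the list above, the following Python sketch uses the standard library to pull per-element text and the raw content out of a small XML document. The sample document, its element names, and the function name extract_from_xml are invented for illustration. A production ingestion component would also have to deal with namespaces, attributes and malformed input.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are made up for this sketch.
SAMPLE = """<article>
  <title>In Search of Search Solutions</title>
  <author>Sebastian Holst</author>
  <body><p>Search is hard.</p><p>Indexes help.</p></body>
</article>"""


def extract_from_xml(xml_text):
    """Return (fields, raw_text) for one XML document.

    fields maps each element name to the text found directly inside it,
    which an indexer could treat as structure-aware search criteria.
    raw_text is all text content with whitespace collapsed, suitable
    for plain full-text indexing.
    """
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root.iter():
        text = (element.text or "").strip()
        if text:  # skip elements that only contain child elements
            fields.setdefault(element.tag, []).append(text)
    raw_text = " ".join(" ".join(root.itertext()).split())
    return fields, raw_text


fields, raw_text = extract_from_xml(SAMPLE)
print(fields["author"])  # text captured under the <author> element
print(raw_text)          # flattened content for full-text indexing
```

Keeping the per-element fields separate from the flattened text is what lets a search engine distinguish, say, a match in the title from a match in the body.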

Figure 1. Ingestion and Indexing
Content organization and
categorization functions include:
- Key words: These are
terms that users would associate with content, e.g., news, celebrities,
software, etc.
- Structure: Semantic (meaning)
and formatting structure can be captured, e.g. owner versus page numbers.
- Associations: This is
a very broad category that includes relationships (versions), topic maps (states
to cities), linguistic (monkey is an animal), etc.
These functions can be provided
by users/administrators but are increasingly being offered as intelligent inference
services within the search engine.
The shift from manual to
automatic ingestion is at the heart of much of the innovation in today's search
technology. The cost in person-power and time required to manually tag content
with key words and associations often precludes the use of search technology
in all but the most critical applications. As automatic ingestion becomes
increasingly accurate and sophisticated, the market for advanced search technology
can be expected to explode.
Query Processing

Figure 2. Query Processing
Once content has been analyzed
and an index has been built, searches can now be submitted, processed and results
returned. This section describes functions that are often, although not always,
present in search engines.
Users must articulate their
request. SQL queries, natural language interpreters, navigation a la
Yahoo!, form-based, inference, and query by example (including cut and paste
of images, audio and video) are all available to users today.
At this point, the search
engine takes over and starts crunching statistics and navigating various links.
Expressions are resolved, patterns are matched, categories and associations
are traversed and the resulting answer set is often sorted by content value
(date, last name) or by relevance (proximity or nearness of match).
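The two sort orders mentioned above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the sample records are invented, and proximity_score is a hypothetical, deliberately naive measure (the inverse of the smallest word distance between two query terms), not the ranking method of any particular product.

```python
def proximity_score(text, term_a, term_b):
    """Score how close two terms appear in text.

    Returns 1.0 when the terms are adjacent, smaller values as they
    drift apart, and 0.0 when either term is missing.
    """
    words = text.lower().split()
    positions_a = [i for i, word in enumerate(words) if word == term_a]
    positions_b = [i for i, word in enumerate(words) if word == term_b]
    if not positions_a or not positions_b:
        return 0.0
    gap = min(abs(a - b) for a in positions_a for b in positions_b)
    return 1.0 / gap if gap else 1.0


# Invented answer set for the query terms "search" and "engine".
matches = [
    {"id": "memo", "date": "2001-11-05", "text": "the search engine builds indexes"},
    {"id": "report", "date": "2002-04-01", "text": "search for a better rules engine"},
    {"id": "note", "date": "2000-06-30", "text": "engine maintenance and search tips"},
]

# Sort by content value (here, the date field).
by_date = sorted(matches, key=lambda m: m["date"])

# Sort by relevance (here, nearness of the two query terms), best first.
by_relevance = sorted(
    matches,
    key=lambda m: proximity_score(m["text"], "search", "engine"),
    reverse=True,
)

print([m["id"] for m in by_date])
print([m["id"] for m in by_relevance])
```

The same answer set comes back in a different order depending on which key is used, which is why the choice of default sort matters so much to what users see first.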
The resulting matches are
ordered, formatted and presented to the user for inspection, with direct
access to the specific content referenced in the result set. Voila!
In an ideal world, the user
is then presented with a result set that includes:
- Meaningful references,
i.e., the summary or preview of each match accurately represents the
actual match. Examples include thumbnails of a PDF page, a proxy of a video
clip, or a gist of a piece of text.
- No false positives, i.e.,
each result returned satisfies the user's search criteria.
- No false negatives,
i.e., no content that the user would have been interested to review
has been excluded from the result set.
- Accurate sorting and
relevancy ranking, e.g., of the twenty thousand potential matches to
a query, the 20 results presented on the first page truly are the most
relevant.
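The false positive and false negative criteria in this list correspond to the two standard retrieval measures, precision and recall. The Python sketch below shows how they could be computed for a single test query, assuming you already know which documents the user would have wanted. The function name and the document identifiers are hypothetical.

```python
def precision_recall(returned, relevant):
    """Measure one query's result set against a known set of wanted documents.

    returned: ids the search engine actually returned
    relevant: ids the user would have wanted (assumed known for a test query)
    Returns (precision, recall, false_positives, false_negatives).
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    false_positives = returned - relevant  # returned, but not wanted
    false_negatives = relevant - returned  # wanted, but left behind
    # Precision: share of returned results that were wanted.
    precision = len(hits) / len(returned) if returned else 1.0
    # Recall: share of wanted documents that were returned.
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall, false_positives, false_negatives


precision, recall, false_pos, false_neg = precision_recall(
    returned=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d3", "d5"],
)
print(precision, recall)
print(sorted(false_pos), sorted(false_neg))
```

An ideal result set scores 1.0 on both measures. In practice the two pull against each other: returning more results tends to reduce false negatives while adding false positives.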
Of course, in the real world
users are often imprecise in the expression of their search criteria, search
engines are limited in the variety of searches they can perform, and content
is often poorly categorized or completely unavailable to the search application.
This is the problem that enterprising search technology vendors are trying to
solve today.
Some commercial solutions
focus on the automation of categorization, metadata creation, and keyword
generation. One challenge is how to generate this information in a precise enough
way to map into the distinct and often contradictory models across industries
and use cases. Academic institutions, intelligence agencies and corporations
see the world through very different lenses and therefore expect content to
be organized accordingly.
Another challenge is that
different media types require very different technology to peer into and analyze
content. The algorithms required to extract the semantic meaning from an audio
track are quite distinct from those used to analyze a novel.
This is a seemingly impossible
hurdle to clear in a completely general way for all use cases, but vendors are
able to deliver impressive results by reducing the variety of content and contexts
to be supported at any given point in time. Increasingly, cost savings and productivity
enhancements are clearly validating this approach of increased capability over
specific classes or categories of content.
Issues to be wary of include:
- Automatically generated
metadata and categories are not likely to match industry standards that your
organization and your trading partners may be considering, e.g., PRISM,
ICE, etc.
- While content may have
many uses and be of interest to many different user communities, the metadata
and categories are not likely to have the same transferability.
- If the ingestion components
of a search engine cannot process specific media types, significant amounts
of content may be inaccessible.
One approach to compensate
for divergence and distributed indexes, categories and locations is the mixed
search.

Figure 3. Mixed Search
A mixed search engine is
one that accepts a search request and then dispatches localized versions of
that search to multiple search engines. The results are then normalized, aggregated
and returned as a single result set. Dependence on multiple search strategies
that are often supplied by third party search technology can lead to some unexpected
results, but it is the only way to search across all data regardless
of location, format, or use. Mixed search solutions can be homegrown, be part
of a content or digital asset management system, or be sold by search vendors.
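The dispatch, normalize and aggregate steps described above can be sketched as follows. This is a minimal Python illustration, not any vendor's design: the two engines are hypothetical stand-ins that return canned (document id, raw score) pairs, each engine is assumed to translate the query into its own syntax internally, and min-max rescaling is just one simple way to make scores from different engines comparable.

```python
def normalize(hits):
    """Rescale one engine's raw scores to the range 0..1 (min-max)."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    low, high = min(scores), max(scores)
    span = high - low
    return [
        (doc, 1.0 if span == 0 else (score - low) / span)
        for doc, score in hits
    ]


def mixed_search(query, engines):
    """Dispatch query to every engine and merge into one ranked result set."""
    best = {}
    for engine in engines:
        for doc, score in normalize(engine(query)):
            # A document found by several engines keeps its best score.
            best[doc] = max(score, best.get(doc, 0.0))
    # Highest score first; ties broken by document id for stable output.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical engines with incompatible scoring scales.
def web_engine(query):
    return [("press-release.html", 12.0), ("faq.html", 3.0)]


def dbms_engine(query):
    return [("order-1042", 90.0), ("faq.html", 60.0), ("order-0007", 30.0)]


for doc, score in mixed_search("faq", [web_engine, dbms_engine]):
    print(doc, score)
```

Note how the normalization step shapes the outcome: each engine's top hit is rescaled to 1.0 regardless of how good it really is, which is one source of the unexpected results mentioned above.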
Classifying Search Technologies
There is no individual search
technology that can cover the full spectrum of content types and search algorithms
nor is there (yet) a single company that has integrated the search techniques
to provide a one-stop-shop. It is therefore important to consider the strengths
and weaknesses of each approach. The following framework is intended to provide
some degree of order when evaluating the myriad of search products and technologies
that are currently available.

Figure 4: Classifying Search Technologies
Figure 4 illustrates how
four families of search categories have covered the entire spectrum of searches.
As we review vendor examples of each category, there are three important points
to keep in mind.
- Individual products
and the companies that offer them are constantly expanding their vision and
scope. As such, this diagram is not intended to imply permanent limitations,
rather to emphasize centers of excellence and historical success.
- The companies mentioned
are not intended to provide an exhaustive list of search companies and their
products. These represent a subset taken from a larger group that the author
was quickly able to identify. This is a very crowded field.
- The author has not personally
evaluated each product offering and cannot therefore warrant any product's
quality or suitability for a particular purpose.
DBMS
Search technology in DBMSs
has developed from query optimization engines that crunched out specific search
strategies for relational queries to hybrid search engines that include category
and cross-media search. The most advanced of these now include references to
content outside of the DBMS in question. This is part of a larger trend in which
the large DBMS providers attempt to flip the content world upside down by making
the DBMS the file manager for the enterprise (rather than having the DBMS sit
in the file system). As long as this paradigm is not uncomfortable and there
is no issue with running all content through a single DBMS product, this technology
can be quite comprehensive.
Traditional Web
Web-oriented search technology
has been delivered primarily through public portals (e.g., AltaVista, Excite, Lycos,
HotBot and, recently, Google). These products are relatively unsophisticated
in terms of their search algorithms: they build indexes of all significant words
and use those to look up documents. The quality of results depends on the range,
sophistication and frequency of updating page links (the "crawling"
process), and on the differences in the algorithms used to rank the relevance of
the results returned.
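The "index of all significant words" idea can be illustrated with a small inverted index in Python. This is a sketch under simplifying assumptions: the stop-word list, the sample documents and the count-based ranking are all invented for illustration, and real web engines add crawling, link analysis and far more elaborate relevance ranking.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; real engines use much larger ones.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lower-case the text and keep only the significant words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def build_index(documents):
    """Map each significant word to the documents containing it, with counts."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word][doc_id] += 1
    return index


def search(index, query):
    """Return ids of documents matching any query word, best match first."""
    scores = defaultdict(int)
    for word in tokenize(query):
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count  # naive relevance: total occurrences
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Invented sample collection.
documents = {
    "a": "Search engines build an index of words",
    "b": "The index maps words to documents and the documents to words",
    "c": "Rich media search covers video and audio",
}
index = build_index(documents)
print(search(index, "index words"))
print(search(index, "video"))
```

Because the index is built ahead of time, answering a query is a matter of a few dictionary lookups rather than a scan of every document, which is what makes the expected response times possible.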
Premium Search
Premium search solutions
are based upon advanced search technology and are focused primarily
on large, strategic collections of content. Their customers are willing to
pay top dollar for the most robust search capabilities possible.
In 2002, the state of the
art typically includes some combination of statistical, semantic, syntactic,
and contextual methods to understand key concepts for the organization, enhancement,
and utilization of relevant information. In other words, these solutions apply complex
algorithms that draw on every available computational method to replace
the need to manually define taxonomies, keywords, and metadata, and to properly
assign content to these categories. Most of these algorithms will not be understood
by business or IT managers, and it is difficult to determine how well they work.
However, there are some very impressive demonstrations available and there is
a lot of serious development going on. (We may delve further into the different
algorithm types in an upcoming article.)
Rich Media
This typically includes
specific fluency in logging and indexing time-based media (video and audio)
to extract and synchronize clips, closed captioning, speech to text and other
data extraction utilities.
Rich Media Extended
Extended rich media search
includes advanced pattern matching across individual images as well as across
time-based sequences. Facial image recognition and action recognition, e.g.,
the scoring of a goal in sports, are both examples of this expanded media search
capability.
Mixed Search
This is a pluggable architecture
that includes the dispatch, aggregation and normalization of results across
multiple, heterogeneous search engines.
| Vendor | Traditional Web | DBMS | Premium Search | Rich Media | Rich Media Extended | Mixed Search |
| Albert | | | | | | |
| Altavista | | | | | | |
| Answerfriend | | | | | | |
| Applied Semantics | | | | | | |
| Automony | | | | | | |
| ClearForest | | | | | | |
| Convera | | | | | | |
| DreMedia | | | | | | |
| eVision | | | | | | |
| Fast Search & Transfer | | | | | | |
| Fast-Talk | | | | | | |
| Google | | | | | | |
| H5 | | | | | | |
| IBM | | | | | | |
| Inktomi | | | | | | |
| Insightful | | | | | | |
| InXight | | | | | | |
| Iphrase | | | | | | |
| LingoMotors | | | | | | |
| LTU Technologies | | | | | | |
| Mohomine | | | | | | |
| Mondosoft | | | | | | |
| Oracle | | | | | | |
| Primus | | | | | | |
| Quiver | | | | | | |
| Sageware | | | | | | |
| Semio | | | | | | |
| Smartlogik | | | | | | |
| Stratify | | | | | | |
| Unifind | | | | | | |
| Verity | | | | | | |
| Virage | | | | | | |
| Wherewithal | | | | | | |
Table 1. Mapping selected search vendors to the search techniques graph in Figure 4.
Observations
- The first obvious conclusion
is that this is a crowded market.
- The majority of activity
is currently focused on automating the ingestion, indexing and categorization
of content. Both the Premium Search and the Rich Media Extended technologies
invest heavily in automating the definition and population of information
models. The reasonable premise behind this is that if it is too expensive
and time consuming to organize and mark-up content, the vast majority of that
content will never become searchable. While this is in fact true, it is also
the case that auto-generated metadata and taxonomy models cannot be relied
upon to facilitate the interchange of valuable content between organizations
or to optimally preserve digital content for very long periods of time across
multiple uses. Automating search ingestion and taxonomy generation is certainly
a market-widening approach, but it can never fully displace careful and deliberate
information modeling and content archiving.
- Web search is rapidly
becoming a low cost commodity.
- Much of the cutting edge
work is being done in the rich and extended rich media space. While it is
not typical form in these articles, here is some homework for the interested
reader. Visit www.dremedia.com. One
of the features of this search technology is that one can edit the XML-tagged
textual transcripts of video. The index is used to automatically edit the
original video to correspond with the cut-and-pasted text. This is a poor man's
desktop broadcasting, requiring no new editing skills beyond what one needs
to edit email today. If the software works as advertised, this could get very
interesting.
- Premium Search is
the most popular category in this high-level view of the market. In fact,
there are a number of differences between some of these vendors, and some would
prefer to be thought of as offering categorization products rather than search.
We'll subdivide this category in a future issue.
Conclusions & Recommendations
There is a "principle
of least surprise" that is based upon the premise that predictable software
holds more value than software whose performance might occasionally be spectacular
but cannot be relied upon to provide consistent and expected results. For all
but the most dogged researcher who has plenty of time on their hands and a
great deal of research expertise of their own, the principle of least surprise
should probably sit at or near the top of the priority list when selecting and
deploying search technologies.
Know the strengths and weaknesses
of the search technology you use. If blind spots and particular strengths are
well understood, then users can appropriately compensate and can assess the
likelihood that they may have "false positives" in their result set
or may have left "false negatives" behind. Don't count solely on your
ability to understand the relative merits of the sophisticated algorithms and
linguistic, mathematical, and statistical theories they are based on. You need
to test with your own content, including content from repositories you do not
control, but need access to.
Assess user expectations
and know how much search is enough. Casual and basic search needs call for
only inexpensive and simple search tools. Advanced search needs call for
a more careful analysis of user expectations, user expertise, the state of the
content being searched (existence of metadata, etc.), and the suitability
of available search technology.
Begin good housekeeping
practices immediately to be best prepared to take advantage of emerging technologies.
Wherever possible, capture as much descriptive metadata as you can that may serve as useful
search criteria in the future, e.g., creation dates, authors, subject
matter, rights and permissions, etc. Much of this information simply cannot be
inferred and must therefore be captured somewhere. Develop and utilize categorization
and archival "best practices." Best practices for
digital preservation are emerging, and library science is assimilating the XML family,
making living digital archives a reality. This should greatly simplify search
requirements for large organizations and leave every category of information-intensive
organization best prepared to take advantage of the many interesting and powerful
search technologies that continue to emerge and mature.
Sebastian Holst
sebastian@gilbane.com