The Gilbane Report: Volume 10, Number 7
Searching for Value in Search Technology
September 2002
Search Technology Value Propositions
In our April issue, Sebastian
provided a high-level taxonomy of search technology solutions, and some guidance
on how to think about the search problem from an enterprise point of view. The
largest, and most interesting, category of search solutions was labeled "premium
search" solutions that "are based upon advanced search technology
and are focused primarily on large, strategic collections of content and whose
customers are willing to pay top dollar to provide the most robust search capabilities
possible." These are sophisticated solutions with long lists of features,
and are based on difficult-to-understand algorithms. The mostly-common feature
lists don't help much in differentiating between vendors. Unfortunately, even
making the effort to fully comprehend the algorithms is not enough to inspire
confidence that they will work as expected with your content.
It is worth the effort
to learn something about how these advanced search technologies work, but you
need to supplement this knowledge with an understanding of the subtle differences
in the strategic value propositions between different vendors. This month Sebastian
takes a look at three categories of value propositions, identifies some vendors
focused on each, and provides additional advice to help you take the next step
in determining which type of premium search solution is right for you.
Searching for Value in
Search Technology
This article takes a closer
look at the category of "premium search" that was introduced in an
earlier Gilbane Report article (Vol. 10, Num. 3), In
Search of Search Solutions. For the purposes of this article, let's
say that the average transaction size for a premium search vendor is greater
than $100K USD. We explore how these vendors build their value proposition today
and how they see their offerings evolve to preserve that value. Constraints
of time and space dictate that specific vendors cannot be reviewed. Rather,
(sub) categories of premium search will be introduced with a small subset of
suppliers used as illustration.
Search: It's an Ugly Business
but Everyone has to do it.
Everyone has to do "it"
It is often said that the
greater the number of terms a community has for a given object, the more important
that object is to that culture. Inuit have numerous names for ice, Pacific Islanders
do the same with coconuts and most every culture has a multitude of terms for
money and for sex. Search, navigate, retrieve, report and query are just a few
of the tags we use on a daily basis. Search is clearly important. In fact, I
challenge anyone to name an operation that is more universal than search.
I would also like to introduce
a corollary. The more meanings a community has for a single term, the more poorly
that community understands that term. In other words, if a term can have many
meanings, then it essentially means nothing. Every organization and individual
requires and/or provides search services. We search on the web, across our enterprise,
through our desk, and for matching socks in the morning. Search is almost unique
in its universality. This virtually assures that buying, selling or using a
search tool will always mean many things at different times. Said another way,
search will likely continue to devolve into a meaningless term for a variety
of technology applications.
Premium search vendors agree.
I could not find a single vendor that was comfortable being categorized as a
search vendor and all were quick to characterize search as a commodity.
It's ugly
Search technology cannot
meet end-user expectations.
Combine every advanced search
algorithm smoothly across all media and it will not approach the sophistication
that we employ when searching for (and finding) matching socks in the morning.
Categorization (dresser drawers), predicate (where matching), visual search
(color and pattern), profiling (plans for the day), clustering (formal/casual),
media-specific retrieval (remove from drawer and put on feet) - are seamlessly
prioritized, integrated, and flawlessly executed and all before a first cup
of coffee!
Cognitively, search is one
of the seemingly effortless activities that are in fact marvels of the mind.
Everyone searches, yet very few people take the time to appreciate how complex
and sophisticated our most basic searches are. The result is that when technologists
tout their powerful search capabilities, user expectations are heightened and
then somewhat dashed as even the most advanced search technologies rapidly reveal
obvious limitations and inappropriate behaviors.
It's a business
If it hurts when you do
that - don't do that. This vaudevillian cliché has not been lost on the
multitude of software vendors whose considerable value propositions rely heavily
upon one or more aspects of search technology. (Search) vendors have wisely
opted to position (or reposition) their offerings in a variety of ways, but
none of them have chosen to make leadership in search technology the cornerstone
of their business or sales strategy. Vendors have taken a number of strategies
that will be reviewed here.
Understanding a vendor's
self-perception is the best leading indicator of where that vendor will likely
be investing their development resources, the kinds of partnerships that will
be prioritized and ultimately, its ongoing viability.
Premium Search Value Drivers
The basic processes of ingesting,
processing and staging content to be searched were covered in the previous
article. Also covered were the various flavors of search technology. It
is recommended that the reader refresh themselves as the following descriptions
are intended to be incremental.
Figure 1 represents the
essential components that one can now expect to find embedded in most DBMS,
Internet and content management solutions. Consider this a baseline. It is the
functionality beyond this baseline that vendors must use to justify their premium
value propositions and fees. The following key provides context.
- The ability to search
content that is both stored in a DBMS and across a file system.
- The "Y" axis
represents the integration of generated or supplemental information that the
search engine uses to speed retrieval or to better evaluate and prioritize
potential results. Both metadata and key words can be entered manually, but
typically some metadata fields, such as format information, are automatically
generated, and virtually all keywords are generated through a syntactic parser.
Indexes are always 100% generated and exclusively machine-readable.
- The "X" axis
illustrates the move to extend searching beyond literal pattern matching to
capture meaning or the semantics in fulfilling a search request.
- The two boxes just below
the user represent a moderate degree of query pre-processing and result formatting
and prioritization that are present in virtually every search function.

Figure 1: Commodity Search in 2002
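As a rough illustration of the baseline in the key above, the following Python sketch generates keywords with a trivial parser, builds a machine-generated index, and filters on a metadata field. The document ids, the `format` field and all function names are hypothetical, and real products handle far more than literal keyword matching.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each generated keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for token in tokenize(doc["text"]):
            index[token].add(doc_id)
    return index


def search(index, documents, query, **metadata):
    """Return ids of documents that contain every query keyword
    and whose metadata matches every given filter."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(
        doc_id
        for doc_id in hits
        if all(documents[doc_id]["meta"].get(k) == v for k, v in metadata.items())
    )


# Hypothetical content: text plus one automatically generated metadata field.
documents = {
    "d1": {"text": "Quarterly sales report for the coffee division", "meta": {"format": "pdf"}},
    "d2": {"text": "Coffee supplier contract", "meta": {"format": "doc"}},
    "d3": {"text": "Sales training video transcript", "meta": {"format": "pdf"}},
}
index = build_index(documents)

print(search(index, documents, "coffee"))               # ['d1', 'd2']
print(search(index, documents, "sales", format="pdf"))  # ['d1', 'd3']
```

Everything here is literal pattern matching plus generated metadata, which is why it counts as the commodity baseline rather than a premium feature.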
Figure 2 illustrates where
and how vendors have pushed beyond these basic functions to provide significant
improvements in quality, performance and breadth of functionality. After a walk
through of these enhancements, a more detailed set of explanations and examples
will be provided under the umbrella of the actual value propositions themselves,
i.e., emphasizing the value to the consumer rather than the technology
for its own sake. Here is an updated key for Figure 2.
- The ability to ingest,
traverse, retrieve, preview and return rich media including audio, video and
high-end publishing materials is emerging as an optional set of functionality
for a number of premium vendors.
- The algorithms to automatically
generate greater varieties of metadata, categories and key concepts are now
a standard premium feature.
- The extraction of concepts
automatically from content has pushed the degree of abstraction to even greater
lengths permitting world views and subject matter expertise to influence the
categorization of content.
- Just as vendors have
pushed the ability to abstract semantics, or meaning, from content - they
have pushed supplemental information to new levels by providing their own
original content as well. This is primarily in the form of deep knowledge
of syntax across multiple languages ensuring that all forms and declensions
of a word can be found (concrete), and proprietary ontologies (1)
that add significant value to automatic categorization, metadata generation
and search pre-processing.
- A great deal more attention
has been placed upon the interpretation and internal representation of a user's
query. There are approaches that apply all of the conceptual and computational
categories in 1-4 to a query before it is ever submitted to a search engine.
The rationale is a good one - the ability to simplify the experience and improve
the results hits right to the heart of most users' frustrations with typical
search tools. This also permits queries to be mapped to different search engines
over time or simultaneously (see "mixed searches" in the previous
article).
- Processing results, particularly
large volumes of results and rich media-based results, requires specialized
preview, drill-down, navigation, and visualization techniques. This family
of functionality is also rapidly taking on a product lifecycle of its own
quite apart from (although obviously co-dependent upon) the underlying search
engine.

Figure 2: Premium Search
Attributes
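To make the query pre-processing idea from the Figure 2 key more concrete, here is a minimal Python sketch that expands each query word with its other word forms and with related concepts before anything is matched. The tiny lexicon and ontology fragment are invented for illustration; vendors' linguistic knowledge bases and ontologies are proprietary and far richer.

```python
# Hypothetical lexicon: all surface forms of a base word.
WORD_FORMS = {
    "mouse": {"mouse", "mice"},
    "buy": {"buy", "buys", "bought", "buying"},
}

# Hypothetical ontology fragment: concepts related to a base word.
RELATED_CONCEPTS = {
    "mouse": {"rodent"},
    "buy": {"purchase"},
}


def expand_query(query):
    """Return one set of acceptable terms per query word."""
    expanded = []
    for word in query.lower().split():
        terms = {word}
        for base, forms in WORD_FORMS.items():
            if word in forms:
                # Add every form of the word plus its related concepts.
                terms |= forms
                terms |= RELATED_CONCEPTS.get(base, set())
        expanded.append(terms)
    return expanded


def matches(expanded, text):
    """True if the text contains at least one acceptable term per query word."""
    words = set(text.lower().split())
    return all(terms & words for terms in expanded)


expanded = expand_query("bought mice")
print(matches(expanded, "she will purchase a rodent"))  # True
print(matches(expanded, "she sold a cat"))              # False
```

Because the expansion happens before the search itself, the same expanded query could in principle be handed to more than one search engine.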
Search Concepts
Given the clear emphasis
on the automation and enhanced abstraction of search technology, a very short
selection of these concepts follows. While lengthy tomes have been written on
each of these search concepts, here is the minimum required to establish enough
of a context to follow the rest of the article.
- Categorization:
The process of organizing information into related groups (buckets). Yahoo!
is an example of a popular, manually maintained category-based search engine.
It is not uncommon for large manufacturing companies to organize their content
into well-defined categories that number in the many tens of thousands (of
categories - not entries). Some classification products will
attempt to classify data automatically, while others assist human catalogers.
- Clustering: Clustering
is a technique for organizing documents/words into subsets of similar documents/words
based on common elements between the documents/words. This is similar to
categorization, although clustering lacks the hierarchical depth of categorization.
There are also a multitude of clustering algorithms, whereas categorization typically
is based upon semantic meaning/relationships.
- Clustering by Example:
Another approach to classifying unstructured data is to develop a subset of
documents that pre-establish categories defined by a set of reference content.
These "training sets" can be automatic or supervised. The software
analyzes new documents in comparison to the training set and searches for
similar concepts and ideas. This approach is also referred to as "machine
learning."
- Linguistic Clustering:
This technology observes and measures co-occurrences of words. For example,
"Java" used in connection with Starbucks probably relates to a document
about coffee instead of a programming language. Relative placement of words
is important. Words in the first lines of a document are likely more important
than information contained in the copyright section. Statistical analysis
and clustering also look for word frequency, placement, and grouping, as well
as the distance between words in a document.
- Semantic Clustering:
Semantic analysis depends on a particular language and dialect. Documents
are clustered or grouped depending on meaning of words using thesauri, custom
dictionaries (e.g., a dictionary of abbreviations), parts-of-speech
analyzers, rule based and probabilistic grammar, recognition of idioms, verb
chain recognition, and noun phrase identifiers (e.g., "business
unit manager"). Linguistic software also analyzes the structure of the
sentences identifying the subject, verbs and objects, like you did when you
first studied grammar in grade school. Then sentence structure analysis is
applied to extract the meaning. Stemming or reducing a word to its root also
helps linguistic or semantic clustering.
- Ontology: Used
in Information Retrieval and Artificial Intelligence, an ontology defines
concepts, providing a way to move towards consistency in vocabulary. It provides
a working model of the entities and interactions of a particular topic, such
as dentistry or anthropology.
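The "by example" and co-occurrence ideas above can be sketched in a few lines of Python. This is only a toy, assuming a bag-of-words profile per category and cosine similarity as the comparison; the training sentences and category names are invented, and commercial products use considerably more sophisticated statistics and linguistics.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count the lowercase words in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical supervised training set: reference content per category.
TRAINING_SET = {
    "coffee": [
        "Starbucks serves java and espresso in every cafe",
        "A cup of java from freshly roasted beans",
    ],
    "programming": [
        "Java is a programming language with classes and a compiler",
        "Compile the Java code and run the program",
    ],
}


def train(training_set):
    """Merge each category's example documents into one word-count profile."""
    return {
        category: sum((bag_of_words(doc) for doc in docs), Counter())
        for category, docs in training_set.items()
    }


def classify(profiles, text):
    """Assign the text to the category whose profile it most resembles."""
    vector = bag_of_words(text)
    return max(profiles, key=lambda category: cosine(vector, profiles[category]))


profiles = train(TRAINING_SET)
print(classify(profiles, "I grabbed a java at Starbucks"))         # coffee
print(classify(profiles, "The compiler rejected my Java classes"))  # programming
```

The word "Java" appears in both categories; it is the surrounding words (Starbucks, compiler, classes) that tip the decision, which is the co-occurrence effect described under linguistic clustering.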
Unfortunately, but not surprisingly,
no one vendor has been able to push their offerings in all of these directions
simultaneously. Rather than focus on a list of features, it is more meaningful
to organize the various pockets of extended functionality under broader value
propositions. This should provide a rationale as to how vendors can still
offer material value without a complete suite of the latest search capabilities
and, more importantly, provide a lens through which to view your own requirements.
Premium Value Propositions
Search technology can be
counted on to work as long as the following conditions are met:
The content
- exists
- has been thoroughly ingested
such that all metadata, categorization, indexes, etc. have been built
- the taxonomy, metadata,
etc. accurately reflects the usage of the content
The users
- have been properly trained
in the search tool/application they are using
- are fluent in whatever
search concepts are required in order to both express their requests and interpret
their results.
- use the search tool,
i.e., are not frustrated or "turned-off" by the user interface
This is a short list but
a long haul. In fact, there are only two essential ingredients to achieving
this near-perfect state: unlimited time and exceptional expertise. Of course,
no organization has either of these and this is at the heart of all search value
propositions - to permit organizations to approximate timely, regular, accurate
and useful search results without investing the time or training required. We
look at three types of value propositions:
- Simple is hard
- Time-to-value
- Solution sell.
Simple is hard
Add value to search by simplifying or hiding complexity
The following table lists
the areas of complexity that most often frustrate or defeat users and the techniques
that are used to overcome these obstacles.
| Area of complexity | Simplifying Technique | Comments |
| --- | --- | --- |
| Expressing a query | Natural Language Processing | Techniques that learn from specific users are most likely to provide best results over time. |
| | Profiling | Can work well assuming privacy issues are not violated or conversely, that individual behaviors are permitted to be tracked. |
| | Fact Extraction | This applies the same automated categorization to the query as was applied to the content. It compares apples to apples, but does not guarantee that the user will comprehend the full scope of the query they have just submitted. |
| | Navigation | This provides more control to the user and eliminates the most general query interpretations, which are often the most difficult for a computer to do well. |
| | By example | Like fact extraction, it applies apples to apples, e.g., give me more results like this one - the only area of surprise is that users and software may not agree on what the original sample's essential elements are. |
| Multi-lingual searches | Linguistic and semantic knowledgebase | The syntactic piece has been fairly well worked out by the top tier search engines and there are techniques for dealing with the structural peculiarities of German, Japanese, English etc. However, applying a query specified in one language to content authored in another is still a black art that can amuse and horrify as often as it satisfies. |
| Defining a taxonomy | Pre-fabricated taxonomies or automatically generated taxonomies | Defining and maintaining taxonomies is hard. Extensible, pre-existing taxonomies are ideal if they are a fit for your needs. Automatically generated taxonomies are also simple, but come with their own set of caveats covered in the next section (Time-to-value). |
| Browsing results | Graphical visualization | There are a variety of visualization techniques from charting results against concepts and "topic maps" to the generation of webs where each node is a related piece of information. |
| | Thumbnail and Gists | This is the abstraction of content to essential concepts (for text) and low resolution formats for production print, audio and video. Users can browse results without having to pay the price of downloading what are typically large objects. |
Time-to-value
Automation of ingestion, indexing, categorization, etc. through
algorithms and content
The debate over automation
versus manual processes has no potential winner. From suits to software and
from cars to content - the advantages and drawbacks are actually fairly generic.
If one can afford (in time and money) a hand tailored suit or a handmade car
(Rolls Royce is still made by hand I believe), there is typically little downside.
If you need a fleet of cars or have to clothe an entire organization quickly
and cheaply - mass production is your only option. As you can see in the following
list, there is little difference when it comes to content and information modeling.
Automatic
Advantages
- Logically consistent
over time
- Scales to accommodate
virtually any volume of content
- Can be centrally managed
Limitations
- Not synchronized with
current business practices and behaviors of users
- Often difficult to
extend or train
Manual Processes
Advantages
- Highly precise
- Can be carefully mapped
to user expectations and orientations
Limitations
- Inconsistent over time:
information modelers change, as do their perspectives
- Does not scale well
requiring significant staffing
- High degree of skill
required to develop and populate information structures
- Difficult to manage
centrally or to audit behaviors
The theoretical ideal is
an automatic process that can be rapidly taught and can be manually extended
over time to account for any unique site requirements. However, these kinds
of extensions are difficult to maintain whenever the automated systems need
to be rerun. In other words, automated components can rarely, if ever, assimilate
manual changes to generated content/index/taxonomies.
Solution Sell
Providing turnkey solutions that benefit from, but are not exclusively,
search-based
If organizations are not
buying search, sell them something else. Virtually every premium search vendor
that I spoke with claimed that they were offering "solutions" not
technology. However, only a very few actually offered fully formed solutions
that solved business problems.
In my view, only the vendors
that successfully make the transition from technology function vendor to solution
provider will survive (independently) over the next two years. Examples of "solutions"
include:
- Contextual Advertising:
semantically analyze web-based queries to match them up with appropriate
advertisements. This is essentially an advertising/personalization solution
that leverages search technology behind the scenes.
- Competitive Business
Intelligence: Analyzing, filtering and presenting competitive information
as it comes over news wires, becomes available through public disclosure or
by revisiting existing archives and repositories can be dramatically enhanced
through semantic analysis, prioritization and navigation.
- Domain Name Suggestions:
Ever try to select a domain name on the Internet? All of the good ones
seem to be taken! There is an application that will suggest semantically related
terms that are available when your first selection is already spoken for.
This service combines the semantic and syntactic analysis with straight textual
queries against the domain database.
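The domain name example lends itself to a small sketch: look up semantically related terms, combine them, and check each candidate with a plain textual lookup against the registry. The related-terms table and the set of registered names below are made up for illustration and do not reflect any vendor's actual ontology or data.

```python
from itertools import product

# Hypothetical ontology fragment: terms semantically related to a word.
RELATED_TERMS = {
    "coffee": ["java", "espresso", "brew"],
    "shop": ["store", "market"],
}

# Hypothetical stand-in for the domain database.
REGISTERED = {"coffeeshop.com", "javashop.com", "coffeestore.com"}


def suggest_domains(words, tld=".com"):
    """Return available names built from the words or their related terms."""
    # For each word, the candidates are the word itself plus its related terms.
    options = [[word] + RELATED_TERMS.get(word, []) for word in words]
    suggestions = []
    for combination in product(*options):
        name = "".join(combination) + tld
        if name not in REGISTERED:  # straight textual query against the registry
            suggestions.append(name)
    return suggestions


suggestions = suggest_domains(["coffee", "shop"])
print(suggestions[0])    # coffeemarket.com
print(len(suggestions))  # 9
```

The person looking for a name sees only a list of available alternatives; the semantic lookup and the registry query stay behind the scenes.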
The important element in
each of these examples is that users and the technology consumers (not the same
in these examples) don't know or care how search plays a role in providing these
services.
Some Vendor Examples
Rather than provide a large
matrix, which would do little to differentiate vendors from one another, this
section uses a few select vendors to illustrate how the functionality outlined
above can be effectively combined and positioned to provide premium value to
customers. Thinking of vendors in this way can help you crystallize your own
thinking about appropriate solution providers. This is not a complete list (see
In Search of Search Solutions for a more complete list), nor is there any implied
endorsement of the vendors included here. Further, this article reflects only
what is available today and does not account for any future development.
Having inserted the requisite
caveats, each of these vendors appears to have found a credible way to distinguish
their offerings by solving large and expensive problems for the markets they
serve.
To Simplify and Accelerate
Time to Value
These select vendors do
not offer turnkey solutions, although each takes great pride in the simplification
and time to value that they offer. These vendors offer significant value-add
to the traditional or baseline search experience. Each aligns proprietary, and
often patented, technology behind the two essential roadblocks to successful
information retrieval; inherent complexity and required investments in time
and resources.
Albert www.albert.com
Value Proposition:
A relatively small investment that dramatically improves the value of your
existing intranet and enterprise investments in infrastructure, content and
processes.
What they offer
- Query simplification:
Natural language processing, automatic generation and maintenance of taxonomies,
syntactic and semantic clustering, multilingual support and learning capabilities.
- Document/concept-centric
indexing: A modular (optional) index that generates both surface (e.g.,
statistical analysis of direct relationships between words and concepts)
and deep (ontological mapping of conceptually related topics) indexes.
Why this works:
Albert has chosen to focus on one of the greatest areas of frustration for
all users and the IT management that is chartered to serve them - if users
do not "use", then 100% of all dollars and time invested are, in
fact, wasted. Albert is focused on driving adoption rates (2)
for existing intranet and enterprise content investments. Albert's motto could
be - "everything you know - remember" since users express their
queries as they would normally, existing infrastructure and even existing
indexes can be plugged in to Albert's query processor and the complex and
time consuming process of developing taxonomies is both automated and hidden.
ClearForest
www.clearforest.com
Value proposition:
ClearForest extracts and transfers knowledge to users from existing and real-time
content resulting in more intelligent and more rapid decision making, increased
productivity and a reduction in expenses.
What they offer
- Graphical visualization:
a broad set of visualization and navigation mechanisms that lend themselves
extremely well to monitoring competitive activity and correlating seemingly
distinct events such as product releases and stock prices.
- Concept-centric indexing:
automatic semantic tagging, real-time content ingestion and learning logic.
Why this works: ClearForest
has in many ways eliminated the query entirely and replaced it with interactive,
graphical results. Users simply drill-down on interesting facets of a visually
intuitive representation/slice of their content to see a more detailed or
correlated view of their content. They have democratized the old, manual and
expensive Executive Information System to the point where entire communities
of users can explore, discover and take action on important events and relationships
that have been historically obscured or hidden.
Solution Providers
These select vendors have
defined their business and revenue models in such a way that the majority of
their customers (and revenue) are coming from customers that don't really care
about search/information retrieval/whatever - they value specific business applications
where to the user, search may not even be an obvious component (but, in point
of fact, it is key).
Autonomy www.autonomy.com
Value proposition:
Autonomy provides infrastructure that processes and organizes all of an enterprise's
information into personally relevant experiences. This horizontal platform
drives Customer Relationship Management (CRM), Business Intelligence, Human
Resource and other business functions that are the backbone of an enterprise.
What they offer:
Autonomy has the most comprehensive collection of algorithms, applications,
interfaces and partnerships making it the 800lb Gorilla in the Enterprise
Information Ecosystem.
Why this works: Search
is complex, information modeling is complex, application integration is complex,
evaluating a strategic supplier's viability is complex, and negotiating licensing
deals and managing multiple releases of multiple product components is complex.
Autonomy proposes to solve all of these issues in a one-stop fashion (3).
Applied Semantics
www.appliedsemantics.com
Value proposition:
Applied Semantics has embedded their core technology into specific solutions
for publishing, domain name registration and online advertising. These solutions,
and particularly the latter two, completely hide the categorization and searching
to provide simple solutions to business problems. The domain name application
uses their ontology to find semantically related domain name suggestions that
are available when the initial request is denied. The advertising application
matches appropriate/related advertising to an Internet user's location and
navigation.
What they offer:
A proprietary ontology, automated categorization, automatic metadata creation
and the indexing to retrieve content via these relationships.
Why this works:
Rather than leave the application of this sophisticated ontology and auto-modeling
technology to the consumer's imagination, Applied Semantics is (yes) applying
their technology to solve real problems and then going to market with those
solutions.
DreMedia www.dremedia.com
Value proposition:
DreMedia focuses on simplifying many of the production processes for broadcasters.
The premise is that by speeding and democratizing access to high quality video
content, their customers will significantly reduce time-to-air and reduce
the cost of their operations.
What they offer:
DreMedia combines advanced video, audio and unstructured textual ingestion,
categorization and retrieval with applications specifically targeting broadcasters.
Interestingly, this is incremental to Autonomy's offerings as the unstructured
text component is Autonomy.
Why it works: This
is one of the few - if not the only - commercial offerings that marry speech to text
with advanced semantic and syntactic analysis. With the connection made between
time-based content, the content and the concepts therein, DreMedia is able
to build applications that permit users to cut and paste text to build new
video and to put video and audio categorization, search and retrieval on par
with structured text.
What is a Consumer to Do?
As with all IT investments,
let your business and end-user requirements drive the process. Do not be distracted
by technological feats of magic, regardless of how impressive they may be.
- Define the lens through
which you will prioritize the features and functionality you will require.
Since no vendor has it all - optimizations (trade-offs) must be made. Assess
your customers' needs, their training levels, their willingness (and ability)
to incorporate new behaviors and technology and do nothing that violates
these essential requirements.
- Estimate the potential
for your content. Depending on the current scope, scale and state of your
content, some applications may not be practical or may require significant
upgrades to your content (via metadata, categorization, accuracy, format,
etc.). There may be hidden costs in dollars, time and quality if this
step is not properly concluded.
- Never Assume.
In today's economy, you should have no trouble trying before you buy. Test
the algorithms, learning modules, visual interfaces, response times, etc.
Heuristics are a funny thing - they do not always operate on your content
in ways that you or your vendor may have expected.
- Drive adoption.
Capture and socialize technology, best practices associated with that technology
and the rationale for what's in it for the user.
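One way to act on the "Never Assume" advice during a trial is to score a candidate tool against a handful of queries whose relevant documents you have judged by hand. The sketch below assumes precision and recall as the measures, which this article does not prescribe, and uses a canned stand-in for the vendor's search function; the query and document ids are hypothetical.

```python
def precision_recall(returned, relevant):
    """Precision: share of returned items that are relevant.
    Recall: share of relevant items that were returned."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(search_fn, judged_queries):
    """Score search_fn on each query against hand-labeled relevant ids."""
    return {
        query: precision_recall(search_fn(query), relevant)
        for query, relevant in judged_queries.items()
    }


# Hypothetical stand-in for a trial installation of a vendor's engine.
CANNED_RESULTS = {"expense policy": ["d1", "d2", "d3", "d4"]}


def trial_engine(query):
    return CANNED_RESULTS.get(query, [])


# Hand-judged relevant documents from your own content.
judged = {"expense policy": ["d1", "d2"]}

print(evaluate(trial_engine, judged))  # {'expense policy': (0.5, 1.0)}
```

Running the same judged queries against each shortlisted product gives a like-for-like comparison on your own content rather than on the vendor's demonstration set.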
Search is Dead - Long Live
Search
Search will vanish from
premium vendor lexicon, but the need for Time-to-value, simplification, and
complete solutions that happen to leverage search will not only persist - it
should continue to grow dramatically. Vendors that can stay one step ahead
of the large infrastructure and application providers should remain viable and
there is never a shortage of demand for a turn-key solution to a pressing business
problem. However, the transition from horizontal technology vendor to solution
provider is a very difficult one because it calls for completely different skill
sets and investment profiles. The corporate re-alignment is often more challenging
than the development of the application.
Search, and related technology,
is getting better and, as we continue to integrate unstructured data into enterprise
computing applications, it is becoming more critical - it will have a long,
if often hidden, life.
Notes:
(1)
An ontology is a "world
view" as expressed through carefully selected terms, definitions and complex
relationships. An ontology should be a complete system of concepts and
is typically modeled using formal and precise modeling methodologies.
(2)
Adoption rate is defined
as the use of technology in the context of the best practice that justified
the initial investment and the abandonment of old, obsolete practices.
(3)
Read my July 2002 column
on www.gilbane.com/columns.html
for this author's own views on the pluses and minuses of end-to-end solutions.
Sebastian
Holst
sebastian@gilbane.com