The Gilbane Report: Volume 10, Number 7

Searching for Value in Search Technology

September 2002

Search Technology Value Propositions

In our April issue, Sebastian provided a high-level taxonomy of search technology solutions, and some guidance on how to think about the search problem from an enterprise point of view. The largest, and most interesting, category of search solutions was labeled "premium search": solutions that "…are based upon advanced search technology and are focused primarily on large, strategic collections of content and whose customers are willing to pay top dollar to provide the most robust search capabilities possible." These are sophisticated solutions with long lists of features, and are based on difficult-to-understand algorithms. Because the feature lists are mostly the same from vendor to vendor, they don't help much in differentiating between them. Unfortunately, even making the effort to fully comprehend the algorithms is not enough to inspire confidence that they will work as expected with your content.

It is worth the effort to learn something about how these advanced search technologies work, but you need to supplement this knowledge with an understanding of the subtle differences in the strategic value propositions between different vendors. This month Sebastian takes a look at three categories of value propositions, identifies some vendors focused on each, and provides additional advice to help you take the next step in determining which type of premium search solution is right for you.

Searching for Value in Search Technology

This article takes a closer look at the category of "premium search" that was introduced in an earlier Gilbane Report article (Vol. 10, Num. 3), In Search of Search Solutions. For the purposes of this article, let's say that the average transaction size for a premium search vendor is greater than $100K USD. We explore how these vendors build their value proposition today and how they see their offerings evolving to preserve that value. Constraints of time and space dictate that specific vendors cannot be reviewed. Rather, (sub) categories of premium search will be introduced with a small subset of suppliers used as illustration.

Search: It's an Ugly Business but Everyone has to do it.

Everyone has to do "it"

It is often said that the greater the number of terms a community has for a given object, the more important that object is to that culture. Inuit have numerous names for ice, Pacific Islanders do the same with coconuts and nearly every culture has a multitude of terms for money and for sex. Search, navigate, retrieve, report and query are just a few of the tags we use on a daily basis. Search is clearly important. In fact, I challenge anyone to name an operation that is more universal than search.

I would also like to introduce a corollary. The more meanings a community has for a single term, the more poorly that community understands that term. In other words, if a term can have many meanings, then it essentially means nothing. Every organization and individual requires and/or provides search services. We search on the web, across our enterprise, through our desk, and for matching socks in the morning. Search is almost unique in its universality. This virtually assures that buying, selling or using a search tool will always mean many things at different times. Said another way, search will likely continue to devolve into a meaningless term for a variety of technology applications.

Premium search vendors agree. I could not find a single vendor that was comfortable being categorized as a search vendor and all were quick to characterize search as a commodity.

It's ugly

Search technology cannot meet end-user expectations.

Combine every advanced search algorithm smoothly across all media and it will not approach the sophistication that we employ when searching for (and finding) matching socks in the morning. Categorization (dresser drawers), predicate (where matching), visual search (color and pattern), profiling (plans for the day), clustering (formal/casual), media-specific retrieval (remove from drawer and put on feet) - are seamlessly prioritized, integrated, and flawlessly executed and all before a first cup of coffee!

Cognitively, search is one of the seemingly effortless activities that are in fact marvels of the mind. Everyone searches, yet very few people take the time to appreciate how complex and sophisticated our most basic searches are. The result is that when technologists tout their powerful search capabilities, user expectations are heightened and then somewhat dashed as even the most advanced search technologies rapidly reveal obvious limitations and inappropriate behaviors.

It's a business

If it hurts when you do that - don't do that. This vaudevillian cliché has not been lost on the multitude of software vendors whose considerable value propositions rely heavily upon one or more aspects of search technology. (Search) vendors have wisely opted to position (or reposition) their offerings in a variety of ways, but none of them have chosen to make leadership in search technology the cornerstone of their business or sales strategy. Vendors have taken a number of strategies that will be reviewed here.

Understanding a vendor's self-perception is the best leading indicator of where that vendor will likely be investing their development resources, the kinds of partnerships that will be prioritized and ultimately, its ongoing viability.

Premium Search Value Drivers

The basic processes of ingesting, processing and staging content to be searched were covered in the previous article. Also covered were the various flavors of search technology. It is recommended that the reader refresh themselves as the following descriptions are intended to be incremental.

Figure 1 represents the essential components that one can now expect to find embedded in most DBMS, Internet and content management solutions. Consider this a baseline. It is the functionality beyond this baseline that vendors must use to justify their premium value propositions and fees. The following key provides context.

  1. The ability to search content that is both stored in a DBMS and across a file system.
  2. The "Y" axis represents the integration of generated or supplemental information that the search engine uses to speed retrieval or to better evaluate and prioritize potential results. Both metadata and key words can be entered manually, but there are typically some metadata fields that are automatically generated, such as format information, and virtually all keywords are generated through a syntactic parser. Indexes are always 100% generated and exclusively machine-readable.
  3. The "X" axis illustrates the move to extend searching beyond literal pattern matching to capture meaning or the semantics in fulfilling a search request.
  4. The two boxes just below the user represent a moderate degree of query pre-processing and result formatting and prioritization that are present in virtually every search function.


Figure 1: Commodity Search in 2002
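
To make the baseline concrete, here is a minimal sketch of the kind of machine-generated keyword index described in the key above. The tokenizer, function names and sample documents are all assumptions made for illustration; no vendor's implementation is being described.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every keyword in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample content, invented for this sketch.
documents = {
    "memo-1": "Quarterly sales report for the coffee division",
    "memo-2": "Java programming guidelines for the web team",
    "memo-3": "Coffee supplier contract and sales terms",
}
index = build_index(documents)
print(sorted(search(index, "coffee sales")))  # ['memo-1', 'memo-3']
```

This is literal pattern matching only: a query for "espresso" finds nothing here, which is the gap the semantic extensions along the "X" axis are meant to close.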

Figure 2 illustrates where and how vendors have pushed beyond these basic functions to provide significant improvements in quality, performance and breadth of functionality. After a walk-through of these enhancements, a more detailed set of explanations and examples will be provided under the umbrella of the actual value propositions themselves, i.e., emphasizing the value to the consumer rather than the technology for its own sake. Here is an updated key for Figure 2.

  1. The ability to ingest, traverse, retrieve, preview and return rich media including audio, video and high-end publishing materials is emerging as an optional set of functionality for a number of premium vendors.
  2. The algorithms to automatically generate greater varieties of metadata, categories and key concepts are now a standard premium feature.
  3. The extraction of concepts automatically from content has pushed the degree of abstraction to even greater lengths permitting world views and subject matter expertise to influence the categorization of content.
  4. Just as vendors have pushed the ability to abstract semantics, or meaning, from content, they have pushed supplemental information to new levels by providing their own original content as well. This is primarily in the form of deep knowledge of syntax across multiple languages, ensuring that all forms and declensions of a word can be found (concrete), and proprietary ontologies (1) that add significant value to automatic categorization, metadata generation and search pre-processing.
  5. A great deal more attention has been placed upon the interpretation and internal representation of a user's query. There are approaches that apply all of the conceptual and computational categories in 1-4 to a query before it is ever submitted to a search engine. The rationale is a good one - the ability to simplify the experience and improve the results hits right to the heart of most users' frustrations with typical search tools. This also permits queries to be mapped to different search engines over time or simultaneously (see "mixed searches" in the previous article).
  6. Processing results, particularly large volumes of results and rich media-based results, requires specialized preview, drill-down, navigation, and visualization techniques. This family of functionality is also rapidly taking on a product lifecycle of its own quite apart from (although obviously co-dependent upon) the underlying search engine.

Figure 2: Premium Search Attributes

Search Concepts

Given the clear emphasis on the automation and enhanced abstraction of search technology, a very short selection of these concepts follows. While lengthy tomes have been written on each of these search concepts, here is the minimum required to establish enough of a context to follow the rest of the article.

  • Categorization: The process of organizing information into related groups (buckets). Yahoo! is an example of a popular, manually maintained category-based search engine. It is not uncommon for large manufacturing companies to organize their content into well-defined categories that number in the many tens of thousands (of categories - not entries). Some classification products will attempt to classify data automatically, while others assist human catalogers.
  • Clustering: Clustering is a technique for organizing documents/words into subsets of similar documents/words based on common elements between the documents/words. This is similar to categorization, although clustering lacks the hierarchical depth of categorization, and there are a multitude of clustering algorithms, whereas categorization is typically based upon semantic meaning/relationships.
  • Clustering by Example: Another approach to classifying unstructured data is to develop a subset of documents that pre-establish categories defined by a set of reference content. These "training sets" can be automatic or supervised. The software analyzes new documents in comparison to the training set and searches for similar concepts and ideas. This approach is also referred to as "machine learning."
  • Linguistic Clustering: This technology observes and measures co-occurrences of words. For example, "Java" used in connection with Starbucks probably relates to a document about coffee instead of a programming language. Relative placement of words is important. Words in the first lines of a document are likely more important than information contained in the copyright section. Statistical analysis and clustering also look for word frequency, placement, and grouping, as well as the distance between words in a document.
  • Semantic Clustering: Semantic analysis depends on a particular language and dialect. Documents are clustered or grouped depending on meaning of words using thesauri, custom dictionaries (e.g., a dictionary of abbreviations), parts-of-speech analyzers, rule based and probabilistic grammar, recognition of idioms, verb chain recognition, and noun phrase identifiers (e.g., "business unit manager"). Linguistic software also analyzes the structure of the sentences identifying the subject, verbs and objects, like you did when you first studied grammar in grade school. Then sentence structure analysis is applied to extract the meaning. Stemming or reducing a word to its root also helps linguistic or semantic clustering.
  • Ontology: Used in Information Retrieval and Artificial Intelligence, an ontology defines concepts, providing a way to move towards consistency in vocabulary. It provides a working model of the entities and interactions of a particular topic, such as dentistry or anthropology.
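
The "Java" example under Linguistic Clustering can be sketched in a few lines. This is only an illustration of the co-occurrence idea: the sense vocabularies, the five-word window and the function names are assumptions made for this sketch, and real products rely on far larger statistical models.

```python
import re

# Hypothetical context vocabularies for two senses of "java".
SENSE_CONTEXTS = {
    "coffee": {"starbucks", "espresso", "roast", "cup", "beans", "latte"},
    "programming": {"class", "compiler", "code", "software", "applet", "jvm"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def guess_sense(text, target="java", window=5):
    """Pick the sense whose context words co-occur most often near the target.

    Only words within `window` positions of each occurrence of the target
    are counted, reflecting the point that distance between words matters.
    Returns None when no context word is found nearby.
    """
    tokens = tokenize(text)
    scores = {sense: 0 for sense in SENSE_CONTEXTS}
    for i, token in enumerate(tokens):
        if token != target:
            continue
        nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for sense, vocabulary in SENSE_CONTEXTS.items():
            scores[sense] += sum(1 for word in nearby if word in vocabulary)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_sense("Starbucks sells a dark roast Java by the cup"))  # coffee
print(guess_sense("The Java compiler turns code into class files"))  # programming
```

When no context word appears near the target, the function returns None rather than guessing, which mirrors the limits of purely statistical approaches on short or ambiguous text.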

Unfortunately, but not surprisingly, no one vendor has been able to push their offerings in all of these directions simultaneously. Rather than focus on a list of features, it is more meaningful to organize the various pockets of extended functionality under broader value propositions. This should provide a rationale as to how vendors can still offer material value without a complete suite of the latest search capabilities and, more importantly, provide a lens through which to view your own requirements.

Premium Value Propositions

Search technology can be counted on to work as long as the following conditions are met:

The content

  • exists
  • has been thoroughly ingested such that all metadata, categorization, indexes, etc. have been built
  • is described by a taxonomy, metadata, etc. that accurately reflect the usage of the content

The users

  • have been properly trained in the search tool/application they are using
  • are fluent in whatever search concepts are required in order to both express their requests and interpret their results.
  • use the search tool, e.g., are not frustrated or "turned-off" by the user interface

This is a short list but a long haul. In fact, there are only two essential ingredients to achieving this near-perfect state: unlimited time and exceptional expertise. Of course, no organization has either of these, and this is at the heart of all search value propositions - to permit organizations to approximate timely, regular, accurate and useful search results without investing the time or training required. We look at three types of value propositions:

  • Simple is hard
  • Time-to-value
  • Solution sell.

Simple is hard — Add-value to search by simplifying or hiding complexity

The following table lists the areas of complexity that most often frustrate or defeat users and the techniques that are used to overcome these obstacles.

| Area of complexity | Simplifying technique | Comments |
|---|---|---|
| Expressing a query | Natural Language Processing | Techniques that learn from specific users are most likely to provide best results over time. |
| | Profiling | Can work well assuming privacy issues are not violated or conversely, that individual behaviors are permitted to be tracked. |
| | Fact Extraction | This applies the same automated categorization to the query as was applied to the content. It compares apples to apples, but does not guarantee that the user will comprehend the full scope of the query they have just submitted. |
| | Navigation | This provides more control to the user and eliminates the most general query interpretations, which are often the most difficult for a computer to do well. |
| | By example | Like fact extraction, it applies apples to apples, e.g., give me more results like this one - the only area of surprise is that users and software may not agree on what the original sample's essential elements are. |
| Multi-lingual searches | Linguistic and semantic knowledgebase | The syntactic piece has been fairly well worked out by the top tier search engines and there are techniques for dealing with the structural peculiarities of German, Japanese, English etc. However, applying a query specified in one language to content authored in another is still a black art that can amuse and horrify as often as it satisfies. |
| Defining a taxonomy | Pre-fabricated taxonomies or automatically generated taxonomies | Defining and maintaining taxonomies is hard. Extensible, pre-existing taxonomies are ideal if they are a fit for your needs. Automatically generated taxonomies are also simple, but come with their own set of caveats covered in the next section (Time-to-value). |
| Browsing results | Graphical visualization | There are a variety of visualization techniques from charting results against concepts and "topic maps" to the generation of webs where each node is a related piece of information. |
| | Thumbnail and Gists | This is the abstraction of content to essential concepts (for text) and low resolution formats for production print, audio and video. Users can browse results without having to pay the price of downloading what are typically large objects. |
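
The "By example" technique in the table can be illustrated with a small "more like this" ranking. The sketch below assumes a plain bag-of-words representation and cosine similarity, which is one common choice and not necessarily what any premium vendor uses; the function names and sample documents are invented.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def more_like_this(example, documents, top_n=2):
    """Rank documents by similarity to the example, dropping non-matches."""
    example_vector = vectorize(example)
    scored = [
        (cosine(example_vector, vectorize(text)), doc_id)
        for doc_id, text in documents.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:top_n] if score > 0]


# Hypothetical content collection for the sketch.
documents = {
    "a": "video archive search for broadcasters",
    "b": "annual budget spreadsheet",
    "c": "broadcasters need fast video retrieval",
}
print(more_like_this("video search tools for broadcasters", documents))  # ['a', 'c']
```

Note that in this naive version the word "for" counts as much as "video". That is a small instance of the surprise mentioned in the table: the software's idea of the sample's essential elements may differ from the user's.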

Time-to-value — Automation of ingestion, indexing, categorization, etc. through algorithms and content

The debate over automation versus manual processes has no potential winner. From suits to software and from cars to content - the advantages and drawbacks are actually fairly generic. If one can afford (in time and money) a hand tailored suit or a handmade car (Rolls Royce is still made by hand I believe), there is typically little downside. If you need a fleet of cars or have to clothe an entire organization quickly and cheaply - mass production is your only option. As you can see in the following list, there is little difference when it comes to content and information modeling.

Automatic

Advantages

  • Logically consistent over time
  • Scales to accommodate virtually any volume of content
  • Can be centrally managed

Limitations

  • Not synchronized with current business practices and behaviors of users
  • Often difficult to extend or train

Manual Processes

Advantages

  • Highly precise
  • Can be carefully mapped to user expectations and orientations

Limitations

  • Inconsistent over time - information modelers change as do their perspectives
  • Does not scale well requiring significant staffing
  • High degree of skill required to develop and populate information structures
  • Difficult to manage centrally or to audit behaviors

The theoretical ideal is an automatic process that can be rapidly taught and can be manually extended over time to account for any unique site requirements. However, these kinds of extensions are difficult to maintain whenever the automated systems need to be rerun. In other words, automated components can rarely, if ever, assimilate manual changes to generated content/index/taxonomies.
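
The maintenance problem can be seen in miniature below. The sketch assumes a trivial keyword-rule categorizer and keeps manual corrections in a separate overlay that is reapplied after every automated run; both the categorizer and the overlay are hypothetical illustrations, not a description of how any product handles this.

```python
def auto_categorize(documents, rules):
    """Assign each document the first category whose keyword appears in its text."""
    assignments = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        assignments[doc_id] = next(
            (category for category, keyword in rules.items() if keyword in lowered),
            "uncategorized",
        )
    return assignments


def apply_overrides(generated, overrides):
    """Layer manual corrections on top of generated output without editing it."""
    merged = dict(generated)
    merged.update(
        {doc_id: category for doc_id, category in overrides.items() if doc_id in generated}
    )
    return merged


# Hypothetical content and rules, invented for this sketch.
documents = {
    "d1": "Java roast from Starbucks",
    "d2": "Press release on a new product",
}
rules = {"programming": "java", "news": "press release"}

generated = auto_categorize(documents, rules)  # d1 is wrongly filed under "programming"
overrides = {"d1": "coffee"}                   # a cataloger's manual correction

# Rerunning the automated step reproduces the original mistake...
rerun = auto_categorize(documents, rules)
# ...so the correction survives only if it is stored separately and reapplied.
final = apply_overrides(rerun, overrides)
print(rerun["d1"], "->", final["d1"])  # programming -> coffee
```

Even this simple overlay hints at why the problem is hard in practice: once the generated taxonomy itself changes between runs, a stored override may point at a category that no longer exists.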

Solution Sell - Providing turnkey solutions that benefit from, but are not exclusively, search-based

If organizations are not buying search, sell them something else. Virtually every premium search vendor that I spoke with claimed that they were offering "solutions" not technology. However, only a very few actually offered fully formed solutions that solved business problems.

In my view, only the vendors that successfully make the transition from technology function vendor to solution provider will survive (independently) over the next two years. Examples of "solutions" include:

  • Contextual Advertising: semantically analyze web-based queries to match them up with appropriate advertisements. This is essentially an advertising/personalization solution that leverages search technology behind the scenes.
  • Competitive Business Intelligence: Analyzing, filtering and presenting competitive information as it comes over news wires, becomes available through public disclosure or by revisiting existing archives and repositories can be dramatically enhanced through semantic analysis, prioritization and navigation.
  • Domain Name Suggestions: Ever try to select a domain name on the Internet? All of the good ones seem to be taken! There is an application that will suggest semantically related terms that are available when your first selection is already spoken for. This service combines the semantic and syntactic analysis with straight textual queries against the domain database.
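
As a rough illustration of the domain name example, the sketch below combines a lookup of semantically related terms with a straight textual check against a registry. The thesaurus, the registry contents, the `.example` suffix and the function name are all invented for this sketch; a real service would query an ontology and the actual domain database.

```python
# Hypothetical thesaurus standing in for a vendor's ontology.
RELATED_TERMS = {
    "coffee": ["java", "espresso", "brew"],
    "car": ["auto", "vehicle", "motor"],
}

# Hypothetical set of names that are already taken.
REGISTERED = {"coffee.example", "java.example", "car.example"}


def suggest_domains(term, suffix=".example"):
    """Return the requested name if it is free, else free names for related terms."""
    requested = term.lower() + suffix
    if requested not in REGISTERED:
        return [requested]
    candidates = [related + suffix for related in RELATED_TERMS.get(term.lower(), [])]
    return [name for name in candidates if name not in REGISTERED]


print(suggest_domains("coffee"))  # ['espresso.example', 'brew.example']
print(suggest_domains("tea"))     # ['tea.example']
```

The person registering the name sees only a list of available alternatives; the semantic lookup that produced them stays behind the scenes.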

The important element in each of these examples is that users and the technology consumers (not the same in these examples) don't know or care how search plays a role in providing these services.

Some Vendor Examples

Rather than provide a large matrix, which would do little to differentiate vendors from one another, this section uses a few select vendors to illustrate how the functionality outlined above can be effectively combined and positioned to provide premium value to customers. Thinking of vendors in this way can help you crystallize your own thinking about appropriate solution providers. This is not a complete list (see In Search of Search Solutions for a more complete list), nor is there any implied endorsement of the vendors included here. Further, this article reflects only what is available today and does not account for any future development.

Having inserted the requisite caveats, each of these vendors appears to have found a credible way to distinguish their offerings by solving large and expensive problems for the markets they serve.

To Simplify and Accelerate Time to Value

These select vendors do not offer turnkey solutions, although each takes great pride in the simplification and time to value that they offer. These vendors offer significant value-add to the traditional or baseline search experience. Each aligns proprietary, and often patented, technology behind the two essential roadblocks to successful information retrieval: inherent complexity and required investments in time and resources.

Albert www.albert.com

Value Proposition: A relatively small investment that dramatically improves the value of your existing intranet and enterprise investments in infrastructure, content and processes.

What they offer

  • Query simplification: Natural language processing, automatic generation and maintenance of taxonomies, syntactic and semantic clustering, multilingual support and learning capabilities.
  • Document/concept-centric indexing: A modular (optional) index that generates both surface (e.g., statistical analysis of direct relationships between words and concepts) and deep (ontological mapping of conceptually related topics) indexes.

Why this works: Albert has chosen to focus on one of the greatest areas of frustration for all users and the IT management that is chartered to serve them - if users do not "use", then 100% of all dollars and time invested are, in fact, wasted. Albert is focused on driving adoption rates (2) for existing intranet and enterprise content investments. Albert's motto could be "everything you know - remember": users express their queries as they normally would, existing infrastructure and even existing indexes can be plugged in to Albert's query processor, and the complex and time-consuming process of developing taxonomies is both automated and hidden.

ClearForest www.clearforest.com

Value proposition: ClearForest extracts and transfers knowledge to users from existing and real-time content resulting in more intelligent and more rapid decision making, increased productivity and a reduction in expenses.

What they offer

  • Graphical visualization: a broad set of visualization and navigation mechanisms that lend themselves extremely well to monitoring competitive activity and correlating seemingly distinct events such as product releases and stock prices.
  • Concept-centric indexing: automatic semantic tagging, real-time content ingestion and learning logic.

Why this works: ClearForest has in many ways eliminated the query entirely and replaced it with interactive, graphical results. Users simply drill-down on interesting facets of a visually intuitive representation/slice of their content to see a more detailed or correlated view of their content. They have democratized the old, manual and expensive Executive Information System to the point where entire communities of users can explore, discover and take action on important events and relationships that have been historically obscured or hidden.

Solution Providers

These select vendors have defined their business and revenue models in such a way that the majority of their customers (and revenue) are coming from customers that don't really care about search/information retrieval/whatever - they value specific business applications where to the user, search may not even be an obvious component (but, in point of fact, it is key).

Autonomy www.autonomy.com

Value proposition: Autonomy provides infrastructure that processes and organizes all of an enterprise's information into personally relevant experiences. This horizontal platform drives Customer Relationship Management (CRM), Business Intelligence, Human Resource and other business functions that are the backbone of an enterprise.

What they offer: Autonomy has the most comprehensive collection of algorithms, applications, interfaces and partnerships making it the 800lb Gorilla in the Enterprise Information Ecosystem.

Why this works: Search is complex, information modeling is complex, application integration is complex, evaluating a strategic supplier's viability is complex, and negotiating licensing deals and managing multiple releases of multiple product components is complex. Autonomy proposes to solve all of these issues in a one-stop fashion (3).

Applied Semantics www.appliedsemantics.com

Value proposition: Applied Semantics has embedded their core technology into specific solutions for publishing, domain name registration and online advertising. These solutions, and particularly the latter two, completely hide the categorization and searching to provide simple solutions to business problems. The domain name application uses their ontology to find semantically related domain name suggestions that are available when the initial request is denied. The advertising application matches appropriate/related advertising to an Internet user's location and navigation.

What they offer: A proprietary ontology, automated categorization, automatic metadata creation and the indexing to retrieve content via these relationships.

Why this works: Rather than leave the application of this sophisticated ontology and auto-modeling technology to the consumer's imagination, Applied Semantics is (yes) applying their technology to solve real problems and then going to market with those solutions.

DreMedia www.dremedia.com

Value proposition: DreMedia focuses on simplifying many of the production processes for broadcasters. The premise is that by speeding and democratizing access to high quality video content, their customers will significantly reduce time-to-air and reduce the cost of their operations.

What they offer: DreMedia combines advanced video, audio and unstructured textual ingestion, categorization and retrieval with applications specifically targeting broadcasters. Interestingly, this is incremental to Autonomy's offerings as the unstructured text component is Autonomy.

Why it works: This is one of the few, if not the only, commercial offerings that marry speech-to-text with advanced semantic and syntactic analysis. With the connection made between time-based content, the content and the concepts therein, DreMedia is able to build applications that permit users to cut and paste text to build new video and to put video and audio categorization, search and retrieval on par with structured text.

What is a Consumer to Do?

As with all IT investments, let your business and end-user requirements drive the process. Do not be distracted by technological feats of magic, regardless of how impressive they may be.

  • Define the lens through which you will prioritize the features and functionality you will require. Since no vendor has it all - optimizations (trade-offs) must be made. Assess your customers' needs, their training levels, their willingness (and ability) to incorporate new behaviors and technology and do nothing that violates these essential requirements.
  • Estimate the potential for your content. Depending on the current scope, scale and state of your content, some applications may not be practical or may require significant upgrades to your content (via metadata, categorization, accuracy, format, etc.). There may be hidden costs in dollars, time and quality if this step is not properly concluded.
  • Never Assume. In today's economy, you should have no trouble trying before you buy. Test the algorithms, learning modules, visual interfaces, response times, etc. Heuristics are a funny thing - they do not always operate on your content in ways that you or your vendor may have expected.
  • Drive adoption. Capture and socialize technology, best practices associated with that technology and the rationale for what's in it for the user.

Search is Dead - Long Live Search

Search will vanish from the premium vendor lexicon, but the need for time-to-value, simplification, and complete solutions that happen to leverage search will not only persist - it should continue to grow dramatically. Vendors that can stay one step ahead of the large infrastructure and application providers should remain viable and there is never a shortage of demand for a turn-key solution to a pressing business problem. However, the transition from horizontal technology vendor to solution provider is a very difficult one because it calls for completely different skill sets and investment profiles. The corporate re-alignment is often more challenging than the development of the application.

Search, and related technology, is getting better and, as we continue to integrate unstructured data into enterprise computing applications, it is becoming more critical - it will have a long, if often hidden, life.

Notes:

(1) An ontology is a "world view" as expressed through carefully selected terms, definitions and complex relationships. An ontology should be a complete system of concepts and is typically modeled using formal and precise modeling methodologies.

(2) Adoption rate is defined as the use of technology in the context of the best practice that justified the initial investment and the abandonment of old, obsolete practices.

(3) Read my July 2002 column on www.gilbane.com/columns.html for this author's own views on the pluses and minuses of end-to-end solutions.

Sebastian Holst
sebastian@gilbane.com

 

The Gilbane Report is published by Bluebill Advisors, Inc. © 1993 - 2005 The Gilbane Report. All Rights Reserved.