We have written about search technology for enterprise applications a number of times, but the interest in this space continues to accelerate, and the sophistication of the inquiries keeps on increasing. As companies grapple with ever-expanding amounts of (especially unstructured and semi-structured) content and the resulting difficulty of even finding information they know they have somewhere, they become more willing to consider the effort of organizing information so that it can be found, and found quickly.
This means that IT strategists and many business managers now need to understand what taxonomies are, what their value to search is, how they get developed, what is involved in their design and use, what technology can do vs. what humans still have to do, and what they need to consider before they get started.
This month contributor Lynda Moulton joins us to provide an introduction to taxonomies and related concepts. Lynda’s article is designed to serve as an introduction suitable for anyone implementing or managing a corporate search, portal or knowledge management application, but her advice, gleaned from years of helping companies better manage their information, will also be valuable to those of you who already understand taxonomies and their value.
The content management software industry has discovered that promoting taxonomy delivers significant visibility. It has the desired effect of letting the market know that a vendor is a serious player in the content management market, while also driving prospects to their consulting practices. Taxonomy is one of those words that is so bandied about that everyone is sure to feel the need for one – whatever it is, whatever it does. Like many good ideas, useful business tools, or enabling components, taxonomy, when affiliated with a product, is given impossible hype. The projected outcomes of building or deploying taxonomy go far beyond what the professionals who build them and the professional search experts who employ them believe they can contribute.
In this paper, we’ll examine some concepts that surround taxonomy, reasons for building them, and most important, methods for deploying taxonomies once you build them. As in any hot new area of business, once you start cribbing terminology from other disciplines, new meanings for terms evolve and present problems for those trying to understand how an old term, with its original definition, fits the new application. In order to sort out some of these confounding ideas, some context for the subject sets out the scope of taxonomy in the enterprise content management field. [Content Management Strategies: Integrating Search, Gilbane Report Vol 11 Num 7, September 2003]
The concepts we’ll explore are set forth for the enterprise user with a need to find content (search), the IT staff tasked with building or deploying an application that employs search, and the manager who observes the need for more efficiency in storing and retrieving information. Vendors in this market may benefit, as well, by observing the ways in which we position taxonomy and search for these audiences. Confusion and mystery around the basic topic only make product marketing and successful implementation more difficult. Product suppliers can improve the experience for buyers by conversing at the customer level, addressing the specific needs, and establishing realistic expectations. The concepts are simple but the products and execution are anything but.
Context for Taxonomy
We are in an era when taxonomy, related to technology, refers to a business tool rather than an ordered nomenclature of organisms. The more recent business meaning relates to a list of terms divided into groups, categories or clusters, and is usually paired with search. Search, as a function, has play in many products and services. In fact, search is so ubiquitous that when we use it we often aren’t aware or we don’t give thought to what search technology is behind the experience. The search engine is usually “hidden” from the novice searcher in a way that makes it unnecessary to know how it works.
Indexing drives search
Underlying any search engine is computerized indexing; it is the method of indexing content that results in the richness, or lack, of the search experience. Our individual experiences with indexing date from the time we learned to use a phone directory or an index at the back of a textbook. These examples of human-created indexes illustrate two fundamental challenges still facing computer-indexing algorithms today.
Indexing a list of names to enable finding associated information, addresses and phone numbers, presents the organizer with sequencing or alphabetizing challenges. Does upper case belong before lower case? Does USA come before United Parcel Service? In a recent Verizon phone directory US Airways comes before U File Discount DocumentCenter, which comes before US Computer, which comes before United States Government. Even a professional indexer would be hard pressed to articulate the alphabetizing rules in this example. Clearly, there are rules being applied to which many of us don’t subscribe. In this example, we can see that norms of sorting are not universal in the business world. This carries over to search because search is partially governed by the sequencing of lists for browsing and the way results are sorted for viewing.
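A small sketch shows how much the resulting order depends on the rule chosen. The two sort keys below are illustrative assumptions, not the rules the directory publisher actually applied: one compares raw character codes, and the other ignores case and spaces (a "letter-by-letter" style of filing).

```python
names = [
    "United States Government",
    "US Airways",
    "U File Discount DocumentCenter",
    "US Computer",
]

def ascii_order(name):
    # Raw character comparison: a space sorts before any letter,
    # and upper case sorts before lower case.
    return name

def letter_by_letter(name):
    # Ignore case and spaces, then compare the remaining characters.
    return name.replace(" ", "").lower()

by_ascii = sorted(names, key=ascii_order)
by_letter = sorted(names, key=letter_by_letter)

for label, ordering in (("ASCII", by_ascii), ("Letter-by-letter", by_letter)):
    print(label, ordering)
```

The same four names come out in two different sequences, and neither matches the directory order quoted above, which suggests that yet another rule was in use there.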
Our second common use of non-automated indexes, at the back of a book, presents another challenge in automated search, namely the choice of words that should be used to identify prominent concepts in a book. The most common practice is to use the author’s language, but that results in a dilemma when the author varies his or her language to express a single idea. In Management: Tasks, Responsibilities, Practices by Peter Drucker, viewed by many as the first significant work to describe knowledge management concepts, the term Knowledge workers and work appears in the index with several pages listed, but Motivation also appears in the index with a subheading of Knowledge workers. Different page numbers appear at each entry. This type of indexing has been assigned the name “keyword indexing” in an automated index. Finding information in a keyword index depends primarily on the skill of the searcher to guess the term or terms the author used. It also burdens the searcher who must look at all locations in the book that are referenced by the index.
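The dependence on guessing the author's words can be illustrated with a minimal keyword index. The page numbers and page texts below are hypothetical stand-ins, not quotations from the book.

```python
from collections import defaultdict

# Hypothetical page contents; two pages discuss the same idea in different words.
pages = {
    10: "knowledge workers need autonomy",
    42: "motivation of the educated employee",
}

def build_keyword_index(pages):
    """Map every word the author used to the set of pages where it appears."""
    index = defaultdict(set)
    for number, text in pages.items():
        for word in text.lower().split():
            index[word].add(number)
    return index

index = build_keyword_index(pages)
hits = index.get("knowledge", set())
print(sorted(hits))
```

A search on "knowledge" reaches only the page where that word occurs; the second page is found only if the searcher also guesses "motivation" or "employee".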
Taxonomy supports indexing
One of the solutions to normalizing language in search is to create control lists of terminology and to sort out language rules such as sequencing, cross-references, and usage guidelines. Before continuing the discussion of how taxonomy supports indexing for search by managing what gets into a searchable index, a reader’s guide to the terminology in the remainder of this article is in order.
|Bibliographic data fields
|The elements that make up citations in a list of books, articles, maps, document items and other materials.
|The process of grouping materials into one or more classes or topical areas.
|A structured and reasoned system of organizing materials according to their single strongest attribute. One attribute may represent a single facet of the material
|The informational matter in a collection of materials or a single domain.
|An authorized and standardized list of terminology used to define the content of one attribute or facet of materials in a collection. A collection may be defined by more than one controlled vocabulary. (e.g. Subject, Corporate Affiliation, Publisher).
|In a controlled vocabulary list, a directive from one term in the list to use another.
|Categorization based on multiple aspects of a domain.
|The entire content of material being indexed for search and retrieval.
|A finding device. A set of information that directs the user to the object of the listing. In a term index, each entry points to one or more materials by a virtual or physical location.
|Significant term (word or phrase) being searched that may or may not belong to a controlled vocabulary list
|Structured categories defined to contain information about one aspect or attribute (e.g. publisher, subjects) of an information resource (e.g. book, document, image).
|A structural specification for expressing all possible relationships among concepts.
|A list of terms for classifying one attribute of information resources (e.g. subjects, names). A controlled vocabulary with a graphical structure for visualization of structures.
|A hierarchically structured controlled vocabulary of terms that are used to describe information resources. A more comprehensive form of taxonomy with deeper relationships and cross-references.
|Any controlled vocabulary that a computer algorithm uses to verify acceptability in a database field.
Classification to controlled vocabulary
In search technology, there is renewed interest in a traditional method of categorizing content (e.g. articles, books, papers). This interest is in the use of a controlled vocabulary as an indexing language, as opposed to simply keyword indexing. In its simplest form, a controlled vocabulary is a validation list of terms, usually few in number, that can be assigned to an entity. This is not unlike the labels assigned to the shopping aisles of a store. Similarly, libraries use a code, a classification number, to indicate where in a collection a book “belongs.” The classification number represents the strongest subject content for the book. A classification number may reveal books strongly in that classification but does not identify weaker or alternative subjects in the book. In both the grocery store and library you have a dilemma for both the classifier and the searcher. Do dried fruits belong in the produce section or near other packaged goods such as canned fruits? Does a book on fossil fuels belong with geology (in Science) or with petroleum processing (Technology)? The distinctions are not always clear.
To overcome the limitations of giving an item only one classifying category for ease of shelf browsing, librarians devised a second system of assigning controlled vocabulary to express the “aboutness” of a book. They added the concept of controlled vocabulary to that of classification. In this second system, a book could be categorized by any number of controlled vocabulary terms, subject headings or topics. There are numerous controlled vocabulary lists that have evolved from institutions as diverse as the Library of Congress to the American Chemical Society. The terms in these lists range from the language of the generalist to that of highly specialized researchers with their own languages.
Controlled vocabularies that were developed for library subject categorization or subject indexing became highly evolved over the past century. In particular, they addressed the problem of synonymous concepts or related terminology by adding structure beyond alphabetizing rules. In the 1970s an ANSI standard (Z39.19) was issued that set forth a hierarchical structure for building up controlled term lists much like a taxonomic structure of organisms, called a thesaurus. The relationships included Broader Term (BT) with its reciprocal Narrower Term (NT). Unlike a biological taxonomy, thesaurus structure provides for synonymous relationships with Use (U) and its reciprocal Used For (UF) indicating the preferred controlled vocabulary term. Finally, for relationships that are merely associative but can’t be termed broader or narrower (e.g. causal as in vapor trails and aircraft) a Related Term (RT) relationship is included.
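The relationship types just described can be sketched as a small data structure. This is only an illustration of BT/NT, U/UF and RT reciprocity; the class names, method names, and sample terms are assumptions made for the example and are not taken from the standard.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    name: str
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT
    used_for: set = field(default_factory=set)   # UF

class Thesaurus:
    def __init__(self):
        self.terms = {}   # preferred terms by name
        self.use = {}     # non-preferred term -> preferred term (U)

    def add_term(self, name):
        return self.terms.setdefault(name, Term(name))

    def add_broader(self, narrow, broad):
        # BT and NT are reciprocal, so record both directions.
        self.add_term(narrow).broader.add(broad)
        self.add_term(broad).narrower.add(narrow)

    def add_related(self, a, b):
        # RT is symmetric.
        self.add_term(a).related.add(b)
        self.add_term(b).related.add(a)

    def add_use(self, non_preferred, preferred):
        # U points to the preferred term; UF is its reciprocal.
        self.use[non_preferred] = preferred
        self.add_term(preferred).used_for.add(non_preferred)

    def preferred(self, name):
        return self.use.get(name, name)

thesaurus = Thesaurus()
thesaurus.add_broader("Jet aircraft", "Aircraft")
thesaurus.add_related("Vapor trails", "Aircraft")
thesaurus.add_use("Airplanes", "Aircraft")
print(thesaurus.preferred("Airplanes"))
```

Storing each relationship together with its reciprocal keeps the structure navigable in both directions, which is what a browsing interface needs.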
Scores of professional associations, scholarly society publishers, and government agencies have developed and applied ANSI standard thesauri when indexing specialized content in their fields. To name a few, NASA, the Department of Energy, the National Institutes of Health, the American Society for Metals (ASM International), and the American Petroleum Institute (API) have each developed a subject-specialized thesaurus. Their lists have been used for decades to assign multiple subject categories to the individual pieces of content they publish: journal articles, papers, etc. Societies often publish specialized searchable indexes based primarily on their controlled subject lists, as well as on author names, titles, and other finding categories.
The point is, thesauri control indexing and indexing enables search. Sometimes a limited vocabulary, taxonomy, is sufficient in an organized body of content to service indexing and to support a browsable search structure. Once collections approach the size of a major society publisher, a thesaurus of thousands of terms with relationships is needed.
On the horizon, ontology, like taxonomy, has its roots in another discipline, philosophy. Ontology has been superimposed on a newer and more complex method of relating subjects. In fact, ontology deals with whole concepts composed of terms and relationships among terms that are much more complex than thesaurus hierarchy. Ontologies provide semantic richness that imbues terms with meaning when relationships connect them. Because of the infinite combinations of terms that can be used to form concepts, ontologies cannot be thought of as controlled vocabularies. They are frameworks for the possibilities of language and term relationships that might be encountered in a specialized domain. The biomedical field has the largest such knowledge representation, the Unified Medical Language System (UMLS) developed by the National Library of Medicine at the National Institutes of Health. Commercial, government or private development of ontologies is in very early stages and is only beginning to find its way into experimental search systems. Exploration of this area is worth considering for future search options.
When Search is Deployed Controlled Lists may be Employed
Interactive computer applications require a finding function to locate the records in the database. Search can be as simple as locating a record by its primary key (e.g. an ID, a name, or a record number). It can be as complex as a menu of options to many indexes, one for each field in a structured database. Finally, search may mean that all content or records associated with an application are fully keyword indexed. In a structured database some of the indexed fields may be controlled by term lists that govern what data may be added to the field, while other fields may contain large amounts of “free text” that is then keyword indexed. Ease of access to records in an application depends on how simple or intuitive the search options are; you may not even be aware of whether or how you are using search.
Some examples of search illustrate the difference between a controlled vocabulary search and a free text search.
- Specialized on-line applications have replaced library catalogs since the early 1980s in public, academic, school and corporate libraries. These and publishers’ databases are the original database applications that made use of controlled vocabularies to categorize and index library resources. A couple of examples of these library databases for specialized collections are at Project Management Institute Library or Miami Dade County Library.
- Databases of publishers of specialized content (e.g. Index Medicus from the National Library of Medicine, Chemical Abstracts from the American Chemical Society) provide structured access to all bibliographic fields (e.g. author, title, subject, publication date), as well as full text searching of the actual content. In these databases the Subject Headings are highly controlled by specialized thesauri using language suited uniquely to each field. Because of high development and maintenance costs, these quality indexes have fees associated with use for “premium search.”
- In a bookkeeping application (e.g. Quicken) dropdown lists of Accounts or Categories are validation lists used to uniformly index all payments.
- When you search for a file on your computer you are accessing a keyword index of file and folder names that you create when you label items.
- Call center management for a large technical enterprise will undoubtedly choose an application that indexes customer organization names, customer contact people, date of calls, products used, among other structured fields, plus keywords associated with the call description.
- Google, which we use to search keywords and key phrases from content across the Web, also provides structured search in the form of categorizing. You can confine your search to types of Web sites, search for images-only, or choose Directory for category search at http://www.google.com/options/index.html.
- Most e-commerce Web sites categorize by types of product and also provide keyword searching so that you can look for product names, product numbers, or product descriptions (e.g. http://www.hp.com/ for finding Hewlett Packard products).
- Specialized industry applications have search operations (e.g. Contractor’s Blue Book, a contractor’s bidding site).
- New search engines are emerging that specialize as content aggregators that you pay to search. Examples would be Factiva (Factiva.com) for business content including news feeds, or Knovel (Knovel.com) for scientific and technical textbooks.
Why employ a controlled vocabulary?
Controlling index content depends on the type of search experience you require, the size of the content collection, the audience, and the complexity of the content. Each of the previous examples has a different audience and purpose. When we use a software application designed for a specialized job function, we should expect that searching characteristics will anticipate the routine tasks needed to perform the job efficiently. When software designed to make our jobs easier doesn’t operate as quickly or effectively as when a function is performed manually, we feel frustrated and let down by the lack of features. Among the most common complaints about business software are:
- The need to re-type the same information for each transaction completed
- The lack of support for adding consistent entries that would make future retrieval easier
- The inability to find information known to be in the database
These examples highlight places where an approved list of terms can benefit the worker who needs to add information to a content repository or database and the worker who will be required to find information at a later time. It is not sufficient for a database designer to simply provide a field for a particular category of information. If the data to be entered should be limited to one of 20 or even 100 possible terms, the software must provide a list in the form of a validation table. Consistent entries are needed for browsable lists, categorized reporting, and quick searches. The norms we expect people to follow in business practice, standardizing, become much more difficult without controls.
A common validation list is a postal code list for state names. A data entry form can confine the entries to two characters but the list further constrains the possible entries to 50 valid codes. The list will be more useful with translation from state name to code to ensure that the correct code is used. In the case of Maine, many assume that the code is MA, which is Massachusetts. A feature that lets the data entry person or searcher type the name of a state and have the software translate it to the correct code is helpful. This constitutes a U relationship (i.e. Maine Use (U) ME).
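A minimal sketch of such a validation list with name-to-code translation follows. The three-state excerpt and the function name `to_state_code` are assumptions made for illustration; a real list would hold every valid code.

```python
# Excerpt of a state validation list; a full list would hold all valid codes.
STATE_CODES = {"Maine": "ME", "Massachusetts": "MA", "Maryland": "MD"}
VALID_CODES = set(STATE_CODES.values())

def to_state_code(entry):
    """Return a valid two-letter code for a typed code or a typed state name."""
    text = entry.strip()
    if text.upper() in VALID_CODES:
        return text.upper()
    # Name-to-code translation acts as the Use (U) relationship: Maine USE ME.
    code = STATE_CODES.get(text.title())
    if code is None:
        raise ValueError(f"{entry!r} is not in the validation list")
    return code

print(to_state_code("Maine"))
print(to_state_code("ma"))
```

Typing the state name removes the guesswork about the code, and any entry that matches neither a code nor a name is rejected rather than stored.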
If address possibilities include countries, provinces, and regions, a tree structure would be helpful to pinpoint qualifying characteristics needed to give a complete address. Where countries have been renamed, cross-references from the old name to the new should be included. This type of enhancement begins to change simple validation lists to something closer to a taxonomic structure.
We are often confronted with business language that is far more complex than geography or products, however. We need to ensure that indexing language is well defined and suitably expanded with highly specialized terminology for classifying documents to instill confidence they will be found when technical searches are performed. Nowhere is the investment in content greater than in research and development to create new and innovative products. That R & D should be leveraged not just in the era when it is performed, but throughout the lifetime of an organization. It is best to capture results and insights for future use from experts when they are still actively engaged in research. Indexing with correct and consistent language is an activity that, currently, humans can do best but which automation will increasingly do well and more economically.
The better the controls on language used to categorize research, the better the search experience. A search engine that is built to take advantage of a controlled vocabulary thesaurus with cross-references can ensure that a search on high blood pressure will also retrieve content that only contains the term hypertension, or that a search on diuretics will find content containing hydrochlorothiazide. This will happen only as long as the taxonomy or thesaurus provides a USE relationship from one term to the other. Knowing that a taxonomy has built-in cross-references to encompass variations on terminology or to bring narrower concepts under the umbrella of a single uniform term strengthens confidence that a search engine will find all relevant content when a search is executed.
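A search engine might apply these cross-references by expanding the query before matching, as in the sketch below. The documents are hypothetical, and modeling hydrochlorothiazide as a narrower term of diuretics (rather than as a USE reference) is an assumption of this example.

```python
# Hypothetical cross-reference tables and documents for illustration only.
USE = {"high blood pressure": "hypertension"}
NARROWER = {"diuretics": ["hydrochlorothiazide"]}

documents = {
    1: "Treatment of hypertension in adults",
    2: "Hydrochlorothiazide dosage guidelines",
    3: "Diet and exercise",
}

def expand(query):
    """Return the query plus its preferred term and any narrower terms."""
    q = query.lower()
    preferred = USE.get(q, q)
    terms = {q, preferred}
    terms.update(NARROWER.get(preferred, []))
    return terms

def search(query):
    terms = expand(query)
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(term in text.lower() for term in terms)
    )

print(search("high blood pressure"))
print(search("diuretics"))
```

Neither query string appears in the document it retrieves; the match comes entirely from the cross-reference tables, so a missing entry means a missed document.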
Controlled term lists are important for validating data entry and for assigning topical categories that describe content in terms that even the author might not have used, and they are also useful for browsing. By displaying the terminology used to consistently index content, the interface presents options for the searcher to select the most specific or broadest term that encompasses his or her knowledge quest.
Deploying taxonomies for metadata maintenance
Librarians describe information content through the assignment of bibliographic descriptors (e.g. Authors, Title, Publication Date) to form a complete bibliographic record called a citation. The citation was presented in the form of a catalog card until more than thirty years ago, when citations became available electronically. Publishers of bibliographic databases presented online variations of citations, all easily read by researchers.
In the mid-1990s a new standard, called the Dublin Core, began to evolve for electronic citations. It set forth generic categories for content descriptors called metadata tags, in which some categories (e.g. Title) are the same as in bibliography, and others (e.g. Creator instead of Author) are more generic to describe the numerous possibilities for types of content. Dublin Core Specification
Metadata categories have been prominent in content management systems, similarly to the way bibliographic data elements are employed in automated library catalog systems. When it comes to search functions, both types of systems have exactly the same purpose, to categorize and index the important elements that describe a specific item of content (e.g. book, document, patent, photograph, news item, company annual report, laboratory notebook).
Data structures of either type of system must provide fields for all elements of metadata or bibliography needed for the content domain. The database must also accommodate one or more tables for the taxonomy or thesaurus that will be used to validate data entry fields. These must be embedded in the application to validate terms as they are added to individual content records. There needs to be support for adding new terms. One mechanism for adding new terms might be a periodic batch load from another source or, at the high end, a mechanism to permit the addition of new terms flagged as provisional entries during the content indexing process by subject experts. A good design will facilitate the reconciliation of provisional terms or modification of existing terms, plus global modification of the records that have used those terms.
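The provisional-term mechanism and global record modification described above might look like the following sketch. The class `TermTable`, its methods, and the sample records are hypothetical, not features of any particular product.

```python
class TermTable:
    def __init__(self, approved):
        self.status = {term: "approved" for term in approved}

    def validate(self, term, allow_provisional=False):
        """Accept known terms; optionally admit a new term flagged as provisional."""
        if term in self.status:
            return True
        if allow_provisional:
            self.status[term] = "provisional"
            return True
        return False

    def provisional_terms(self):
        return sorted(t for t, s in self.status.items() if s == "provisional")

    def reconcile(self, old, new, records):
        """Replace a term with its approved form and update every record that used it."""
        self.status.pop(old, None)
        self.status[new] = "approved"
        for record in records:
            replaced = (new if s == old else s for s in record["subjects"])
            record["subjects"] = list(dict.fromkeys(replaced))  # drop duplicates, keep order

table = TermTable(["Thesauri", "Indexing"])
records = [{"id": 1, "subjects": ["Indexing"]}]

# A subject expert adds a new term while indexing a second record.
rejected = table.validate("Ontologies")
accepted = table.validate("Ontologies", allow_provisional=True)
records.append({"id": 2, "subjects": ["Ontologies", "Thesauri"]})
pending = table.provisional_terms()

# Later, the vocabulary manager settles on the singular form.
table.reconcile("Ontologies", "Ontology", records)
print(records)
```

Keeping the provisional flag in the same table as approved terms lets indexing continue without interruption while still giving the vocabulary manager a list to review.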
A highly developed interface for those who categorize or index specialized content that enables them to interact easily with taxonomy is strongly recommended. This removes one of the serious barriers to indexing large volumes of material that will be useful as searchable resources in the future. The more awkward and cumbersome an indexing process, the less likely content will be routinely added. Some level of human interaction with documents when they are being added to a searchable database is necessary or the quality of indexing will suffer.
Deploying taxonomies in search
Once a substantial body of material has been indexed in a content management or library system there are several methods of search that might be offered. The first and simplest approach features a text box into which the searcher can type a word or phrase. The rules that govern the format of the text differ among systems, but usually a help function describes when and how terms can be truncated (e.g. abbreviated to a searchable stem as in telephon*, where all words beginning with this string will be found). This is the keyword search approach most commonly offered in Web applications. While help files may define what is being searched, the typical user has no idea what the searchable body of information looks like. It may be an ordered list of human-assigned terminology or it may be the entire content of all documents associated with the application. It may be language supplied by the author, specialized terms assigned by an expert (that may or may not appear in the text), or it may be that search engine rules are built in to retrieve related content. In this example, a cross-reference may include any content about phones in the search results.
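Truncation of the kind shown with telephon* can be sketched in a few lines. The trailing-asterisk rule and the sample index terms are assumptions for illustration; actual systems differ in their truncation syntax.

```python
def matches(pattern, word):
    """Match a word exactly, or by prefix when the pattern ends with '*'."""
    pattern, word = pattern.lower(), word.lower()
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern

def truncation_search(pattern, index_terms):
    return sorted(word for word in index_terms if matches(pattern, word))

index_terms = ["telephone", "telephony", "telephones", "telegraph", "phone"]
hits = truncation_search("telephon*", index_terms)
print(hits)
```

The stem finds the telephone variants but not "phone", which is why a cross-reference is still needed to bring that content into the results.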
A second common type of search is a form containing spaces for typing text that you would expect to find in the bibliographic or metadata fields. At any point in the form, there may be access to taxonomy that controls the field. Being able to type a word or phrase to trigger a scan of the taxonomic term list is a feature that some systems offer.
Finally, systems often provide taxonomy in a form that peels back layers or exposes narrower concepts as you select categories under which you believe your strongest search interest may be indexed. This type of “browsable” structure is available through full public Web-based search engines (e.g. Google and Yahoo) but also in specialized applications that focus on a narrow domain of content as in the HP Web site. This type of application requires that the taxonomy be maintained for currency of language, have sufficient cross-references from popular terminology to controlled terms, and be devoid of terms that have no links to content.
A browse structure for searching depends on the use of taxonomy for pre-structuring the content by using the taxonomy to categorize or index the content. The taxonomy becomes the organizing structure for content, a visual guide to the knowledge resources. Its success as a search aid depends on the graphic design used to display terms, ease of navigation and suitability of taxonomy language to the searcher. You can see an example of a browsable taxonomy for graphic arts at the Library of Congress site LC – Graphic Thesaurus (type ink and uncheck the Content box, then click the TERM button) or for a British Maritime site at Maritime. A subset of a Lawrence Livermore thesaurus can be viewed at LLNL.
Search engines that support customized taxonomies require a linkage between the taxonomy term list, which is stored in the database, and content resources to which the terms point. Without embedded linkage, search look-up (i.e. matching the term selected from a browsable list with terms in the content) is very inefficient and will result in poor performance. The best structures have a count posted with terms in the browsable taxonomy; as new content is added to the database with metadata linked to the taxonomy the document count is updated dynamically. If indexing is fully automated with no metadata supplied by human indexers, links between the taxonomy and content need to be automatically updated; a posted count with the terms is still desirable. Posted record counts with taxonomies have the benefit of letting the searcher know exactly how much content is available associated with a term. Then the searcher can decide whether to seek narrower term concepts to select.
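One way to keep posted counts current is to store the term-to-content links alongside the taxonomy and derive the count from them, as in this sketch. The class `PostedTaxonomy`, the product terms, and the document identifiers are hypothetical.

```python
from collections import defaultdict

class PostedTaxonomy:
    def __init__(self, broader):
        self.broader = dict(broader)        # term -> broader term (None at the top)
        self.postings = defaultdict(set)    # term -> ids of linked documents

    def add_document(self, doc_id, terms):
        """Link a document to taxonomy terms; unknown terms are rejected."""
        unknown = [t for t in terms if t not in self.broader]
        if unknown:
            raise KeyError(f"not in the taxonomy: {unknown}")
        for term in terms:
            self.postings[term].add(doc_id)

    def posted_count(self, term):
        # The count is derived from the links, so it is always current.
        return len(self.postings.get(term, ()))

    def narrower(self, term):
        return sorted(t for t, parent in self.broader.items() if parent == term)

taxonomy = PostedTaxonomy({
    "Printers": None,
    "Laser printers": "Printers",
    "Inkjet printers": "Printers",
})
taxonomy.add_document("doc-1", ["Printers", "Laser printers"])
taxonomy.add_document("doc-2", ["Printers", "Inkjet printers"])
taxonomy.add_document("doc-3", ["Printers"])

for term in ["Printers"] + taxonomy.narrower("Printers"):
    print(term, taxonomy.posted_count(term))
```

A searcher who sees three postings under the broad term and one under each narrower term can judge at a glance whether narrowing is worthwhile.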
It is probably worth noting that Web portals are often divided into sections, each governed by different taxonomic structures. This is an example of faceted classification (e.g. Products, Geographical Regions, Market Segments), each area with its own control list representing different views of how an enterprise organizes its content.
Taxonomy Development Guidelines
At the end of this section are links to some published material on developing taxonomies or thesauri. Briefly, there are five components outlined here with the fundamentals for a committed development process.
- Existing resources are always the best starting point. This includes published glossaries, taxonomies, thesauri and internally generated term lists. If the organization has been indexing materials manually or even in a simple database for a period of time, using keywords to categorize the materials, those keywords can form a useful basis for “fleshing out” a published term list. Internal terms are likely to be far more specific and expertise friendly for the domain than published lists. Data mining a large corpus of internal content, or training a categorizer with a few targeted and clearly written topical reports are also possibilities for building up relevant terms.
- The scope of the list you will build depends on both depth and breadth of content to be managed. A small repository of a few thousand documents does not require a long term list. In most cases a few hundred terms will suffice but a very diverse range of subject areas may demand more. It is easy to add terms when a topic becomes over-used by the indexing process but you then need to revisit content to update the metadata with more precise terms if they are added. A corpus of hundreds of thousands of documents in a specialized field may have a thesaurus of two or three thousand terms; however, be cautious about embracing an entire thesaurus from a published source that is many times larger. The language will be too general and there will likely be many sections you would never use. Choose what your enterprise really needs and leave the rest out.
- Subject matter experts are needed to validate the terms you will include and the relationships you will build among terms. They will help you decide which, among a group of synonymous terms, are the better choices for your audience. If you lack sufficient expertise in an area of content, utilize commercial categorization software for several passes to extract terminology from your content to build up a candidate term list. Once you see synonyms appearing, use a term count function to determine the most heavily used terms. Make cross-references from the little used terms to USE the popular term.
- Tools are software applications that come with your content management software or library software to build up term lists, assign relationships (Broader and Narrower), make cross-references and capture notes about term usage. There are standalone software applications available that let you build term structures fairly easily. They may expedite the initial build but can be a burden with on-going maintenance. External taxonomies need to be batch loaded periodically and the content re-indexed to take advantage of new terms. The key is to have tools that are highly integrated for real-time updating and maintenance with a minimum of human intervention. Library systems have been functioning in this real-time mode for decades; content management systems lag in comparison.
- Method is how you use the software tools you have, along with your resources, subject experts and content repositories, to build a taxonomy most efficiently. What all such projects share is that the process is highly iterative and requires a high degree of human intelligence and focus. Consider short-term goals carefully and build small sections that can be tested early in the process. Employing a variety of methods and software tools for small lists will gradually reveal the most efficient methods for continuing to build up your vocabulary. Working in teams with frequent “sanity checks” of each other’s work is also useful.
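The term count step described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the synonym groups and the list of indexed terms are hypothetical sample data, the function name build_use_references is invented for this example, and a real project would draw both inputs from its own categorization software and have subject experts review the result.

```python
from collections import Counter


def build_use_references(synonym_groups, indexed_terms):
    """Map each little-used synonym to the most heavily used term in its group.

    synonym_groups: lists of terms that subject experts consider synonymous.
    indexed_terms: every term occurrence extracted from the content.
    """
    counts = Counter(indexed_terms)
    use_refs = {}
    for group in synonym_groups:
        # Preferred term = highest count; ties are broken alphabetically
        # so the result is stable from run to run.
        preferred = min(group, key=lambda term: (-counts[term], term))
        for term in group:
            if term != preferred:
                use_refs[term] = preferred
    return use_refs


# Hypothetical sample data, not taken from any real repository.
synonym_groups = [
    ["automobiles", "cars", "motor vehicles"],
    ["physicians", "doctors"],
]
indexed_terms = [
    "cars", "cars", "automobiles", "cars",
    "doctors", "physicians", "physicians",
]

use_refs = build_use_references(synonym_groups, indexed_terms)
for term in sorted(use_refs):
    print(f"{term} USE {use_refs[term]}")
```

Raw counts only suggest a preferred term. The subject matter experts still decide whether the most frequent term is the right one for the audience.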
Grimes, Seth. The Word on Text Mining; Text analytics provide concept discovery, automated classification, and innovative displays for volumes of unstructured documents, Intelligent Enterprise 12/10/2003.
Knox, Rita E. What Taxonomies Do for the Enterprise. 2p. Gartner Group 09/10/2003
Moulton, Lynda W. Why do You Need a Taxonomy Anyway? And How to Get Started, LWM Technology Services 06/01/2003.
Where is Taxonomy Headed? Conclusions & Recommendations
Serious research demands discipline and structure, regardless of whether the work is performed in a laboratory or by searching in books or databases of articles. Automation has turned the search for information and discovery on its head through the sheer volume of content that is being produced and replicated daily. The same information is disseminated in hundreds of formats and stored in repositories of institutions and on publicly available servers throughout the world. One finds the same full text of published documents available globally through the Web in numerous formats and repositories. Structured databases are searchable through portals that disguise content structures and source, often making the context difficult to discern.
The masses of redundant information available in numerous unstructured or loosely defined formats beg for new order in the information world of research. Seekers are desperate for systematic approaches to information gathering that are at the same time comprehensive and non-redundant. They demand search interfaces that can search all possible repositories (file systems and databases) concurrently and return results in a relevancy-sorted list. This is the claim of federated search.
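The idea can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the two in-memory repositories, the function names, and the scoring rule (the fraction of query words a document contains) stand in for real search engines. It is not a description of any actual federated search product.

```python
from concurrent.futures import ThreadPoolExecutor


def search_repository(repo, query):
    """Return (score, repository name, document id) for each matching document.

    The score is the fraction of query words found in the document,
    a deliberately simple stand-in for a real engine's relevancy ranking.
    """
    words = query.lower().split()
    if not words:
        return []
    hits = []
    for doc_id, text in repo["documents"].items():
        tokens = set(text.lower().split())
        score = sum(word in tokens for word in words) / len(words)
        if score > 0:
            hits.append((score, repo["name"], doc_id))
    return hits


def federated_search(repos, query):
    """Query all repositories concurrently and merge into one ranked list."""
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda repo: search_repository(repo, query), repos)
        merged = [hit for hits in result_lists for hit in hits]
    # Highest score first; repository name and document id break ties.
    return sorted(merged, key=lambda hit: (-hit[0], hit[1], hit[2]))


# Hypothetical repositories: one file system, one database.
repos = [
    {"name": "file_system",
     "documents": {"memo1": "taxonomy project plan", "memo2": "budget review"}},
    {"name": "database",
     "documents": {"rec7": "thesaurus and taxonomy design", "rec9": "taxonomy"}},
]

results = federated_search(repos, "taxonomy design")
for score, repo_name, doc_id in results:
    print(f"{score:.2f}  {repo_name}  {doc_id}")
```

The sketch applies one scoring rule to every repository, so the scores are directly comparable. Real repositories each rank results with their own engine, which makes merging them into a single relevancy-sorted list much harder.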
The new order and federated search are several years away from popular usability. Current federated search tools reflect the search engine characteristics of the repositories where the data resides. Besides economic barriers to the commercialization of new federated search technology, there are numerous economic barriers to building the components needed to make it work effectively. One key to federated searching in a subject-specialized knowledge domain is the existence of ontological frameworks for the subject discipline. While a single ontology exists in medicine and health sciences (UMLS), which was government funded, other disciplines are years away from having fully developed ontologies. There are excellent thesauri in a number of defense areas, energy, metallurgy, chemistry, and some softer sciences. But activity to build ontologies for these areas, and to create applications that will use them to provide federated and semantic (natural language) query searching, is just beginning.
For now, building subject-specific taxonomies and developing them into thesauri is a significant undertaking that requires substantial human effort. An enterprise should view this effort as a good first step toward future searching goals. Major efforts in building taxonomies are now, and will remain, primarily within the content product industry; these continue to be enhanced by enterprise efforts to refine the lists for their own internal use with search on enterprise content. When federated and semantic searching technology gives us appropriate commercial options, ontology availability will be a significant part of the offerings. It is coming.
Lynda Moulton, email@example.com