Recently in Search Problems/Solved Search Problems Category

I begin 2012 with a new perspective on enterprise search, one gained as purely an observer. The venues have all been medical establishments with multiple levels of complexity and healthcare workers. As the primary caregiver for a patient, and with some medical training, I take my role as observer and patient advocate quite seriously.

As soon as the patient was on the way to the emergency room, all of his medical records, insurance cards, medications, and contact information were assembled and brought to the hospital. With numerous critical care professionals intervening, and the patient being taken for various tests over several hours, I verbally imparted information I thought was important that might not yet show up in the system. Toward the end of the emergency phase, after being told several times that they had all his records available and "in the system," I relaxed to focus on the "next steps."

Numerous specialists were involved in the medical conditions, and the first three days passed without "a crisis," but little did we know that medication choices were beginning to cause some major problems. Apparently, some parts of the patient's medical history were not fully considered, and once the medications caused adverse outcomes, all kinds of other problems arose.

Fortunately, I was there to verbally share knowledge that was in the patient's medical records and get choices of medicine reversed. On several occasions, doctors' care orders had been "overlooked" and complicating interventions were executed because the healthcare person "in the moment" took an action without "seeing" those orders. I personally watched the extensive recording of doctors' decisions and confirmed with them changes that were being made to the patient's care, but repeatedly had to ask why a change was not being implemented.

Observing for six to eight hours on several care floors, I can only say that time is the enemy for medical staff. When questions were raised, the answers were in the system; in other words, "search worked." What was not available to staff was time to study the whole patient record and understand overlapping and sometimes conflicting orders about care.

It is shortsighted for any institution to believe that it can squeeze professionals to "think-fast," "on-their-feet" for hours on end with no time to consider the massive amounts of searchable results they are able to assemble. Human beings should not be expected to sacrifice their professional integrity and work standards because their employers have put them in a constant time bind.

My family member had me, but what of patients with no one, or no one versed in medical conditions and processes, to intervene? This extends to every line of business where risk is involved, from the practice of law to engineering, manufacturing, design, research and development, testing, technical documentation writing, etc.

I don't minimize how hard it is for businesses and professional services to stay profitable and competitive when they are being pressed to leverage technology for information resource management. However, one measure that every enterprise must embrace is educating its workforce about the use of information technologies it employs. It is not enough to simply make a search engine interface accessible on the workstation. Every worker must be shown how to search for accurate information, authoritative information, and complete information, and be made aware of the ways to ingest and evaluate what they are finding. Finally, they must be given an alternative route to a more complete chronicle when the results don't match the need, even if that alternative is to seek another human being instead of a technology.

Search experts are a professionally trained class of workers who can fill the role of trainers, particularly if they have subject matter expertise in the field where search is being deployed. The risks to any enterprise of short-changing workers by not allowing them to fully exploit and understand results produced from search are long-term, but serious.

It is important to leave this entry with recognition that, due to wonderful healthcare professionals and support staff, the outcomes for the patient have been positive. People listened when I had information to share and respected my role in the process. That in no way absolves institutions and enterprises from giving their employees the autonomy and time to pay attention to all the information flooding their sphere of operation. In every field of endeavor, human beings need the time and environment to mindfully absorb, analyze and evaluate all the content available. Technology can aid but cannot carry out thoughtful professional practice.

Semantic Software Technologies: Landscape of High Value Applications for the Enterprise was published just over a year ago. Since then the marketplace has been increasingly active; new products emerge and discussion about what semantics might mean for the enterprise is constant. One thing that continues to strike me is the difficulty of explaining the meaning of, applications for, and context of semantic technologies.

Browsing through the topics in this excellent blog site, http://semanticweb.com, it struck me as the proverbial case of the blind men describing an elephant. A blog, any blog, is linear. While there are tools to give a blog dimension by clustering topics or presenting related information, it is difficult to understand the full relationships of any one blog post to another. Without a photographic memory, an individual does not easily connect ideas across a multi-year domain of blog entries. Semantic technologies can facilitate that process.

Those who embrace some concept of semantics are believers that search will benefit from "semantic technologies." What is less clear is how evangelists, developers, searchers and the average technology user can coalesce around the applications that will semantically enable enterprise search.

On the Internet, content that successfully drives interest, sales, opinion and individual promotion does so through a combination of expert crafting of metadata, search engine technology that "understands" the language of the inquirer, and content that can satisfy the inquiry. Good answers are reached when questions are understood first and then the right content is selected to meet expectations.

In the enterprise, the same care must be given to metadata, search engine "meaning" analysis tools and query interpretation for successful outcomes. Magic does not happen without people behind the scenes meeting these three criteria by executing linguistic curation, content enhancement and computational linguistic programming.

Three recent meeting events illustrate various states of semantic development and adoption, even as the next conference, Semantic Tech & Business Conference - Washington, D.C. on November 29 - is upon us:

Event 1 - A relatively new group, the IKS-Community funded by the EU has been supporting open source software developers since 2009. In July they held a workshop in Paris just past the mid-point of their life cycle. Attendees were primarily entrepreneurs and independent open source developers seeking pathways for their semantically "tuned" content management solutions. I was asked to suggest where opportunities and needs exist in US markets. They were an enthusiastic audience and are poised to meet the tough market realities of packaging highly sophisticated software for audiences that will rarely understand how complex the stuff "under the hood" really is. My principal charge to them was to create tools that "make it really easy" to work with vocabulary management and content metadata capture, updates, and enhancements.

Event 2 - On this side of the pond, UK firm Linguamatics hosted its user group meeting in Boston in October. Having interviewed a number of their customers last year to better understand their I2E product line, I was happy to meet people I had spoken with and see the enthusiasm of a user community vested in such complex technology. Most impressive is the respectful tone and thoughtful sharing between Linguamatics principals and their customers. They share the knowledge of how hard it is to continually improve search technology that delivers answers to semantically complex questions using highly specialized language. Content contributors and inquirers are all highly educated specialists seeking answers to questions that have never been asked before. Think about it: designing search engines to deliver results for frequently asked questions or to find content on popular topics is hard enough, but finding the answer to a brand new question is a quantum leap of difficulty in comparison.

To make matters even more complicated, answers to semantic (natural language) questions may be found in internal content, in published licensed content or some combination of both. In the latter case, only the seeker may be able to put the two together to derive or infer an answer.

Publishers of content for licensing play a convoluted game of how they will license their content to enterprises for semantic indexing in combination with internal content. The Linguamatics user community is primarily in life sciences; this is one more hurdle for them to overcome to effectively leverage the vast published repositories of biological and medical literature. Rigorous pricing may be good business strategy, but research using semantic search could make more headway with more reasonable royalties that reflect the need for collaborative use across teams and partners.

Content wants to be found and knowledge requires outlets to enable innovation to flourish. In too many cases technology is impaired by lack of business resources by buyers or arcane pricing models of sellers that hold vital information captive for a well-funded few. Semantically excellent retrieval depends on an engine's indexing access to all contextually relevant content.

Event 3 - Leslie Owens of Forrester Research, at the Fall 2011 Enterprise Search Summit, conducted a very interesting interactive session that further affirms the elephant and blind men metaphor. Leslie is a champion of metadata best practices and writes about the competencies and expertise needed to make valuable content accessible. She engaged the audience with a series of questions about its wants, needs, beliefs and plans for semantic technologies. As described in an earlier paragraph about how well semantics serves us on the Web, most of the audience puts its faith in that model but is doubtful of how or when similar benefits will accrue to enterprise search. Leslie and a couple of others made the point that a lot more work has to be done on the back-end on content in the enterprise to get these high-value outcomes.

We'll keep making the point until more adopters of semantic technologies get serious and pay attention to content, content enhancement, expert vocabulary management and metadata. If it is automatic understanding of your content that you are seeking, the vocabulary you need is one that you build out and enhance for your enterprise's relevance. Semantic tools need to know the special language you use to give the answers you need.

A discussion that began with a graduate scholar at George Washington University in November 2010 about semantic software technologies prompted him to follow up with some questions for clarification from me. With his permission, I am sharing three questions from Evan Faber and the gist of my comments to him. At the heart of the conversation we all need to keep having is this question: how far does this technology go, and does it really bring us any gains in retrieving information?

1. Have AI or semantic software demonstrated any capability to ask new and interesting questions about the relationships among information that they process?

In several recent presentations and the Gilbane Group study on Semantic Software Technologies, I share a simple diagram of the nominal setup for the relationship of content to search and the semantic core, namely a set of terminology rules or terminology with relationships. Semantic search operates best when it focuses on a topical domain of knowledge. The language that defines that domain may range from simple to complex, broad or narrow, deep or shallow. The language may be applied to the task of semantic search from a taxonomy (usually shallow and simple), a set of language rules (numbering thousands to millions) or from an ontology of concepts to a semantic net with millions of terms and relationships among concepts.

The question Evan asks is a good one with a simple answer, "Not without configuration." The configuration needs human work in two regions:
• Management of the linguistic rules or ontology
• Design of search engine indexing and retrieval mechanisms

When a semantic search engine indexes content for natural language retrieval, it looks to the rules or semantic nets to find concepts that match those in the content. When it finds concepts in the content with no equivalent language in the semantic net, it must find a way to understand where the concepts belong in the ontological framework. This discovery process for clarification, disambiguation, contextual relevance, perspective, meaning or tone is best accompanied with an interface making it easy for a human curator or editor to update or expand the ontology. A subject matter expert is required for specialized topics. Through a process of automated indexing that both categorizes and exposes problem areas, the semantic engine becomes a search engine and a questioning engine.

The entire process is highly iterative. In a sense, the software is asking the questions: "What is this?", "How does it relate to the things we already know about?", "How is the language being used in this context?" and so on.
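
The indexing-and-curation loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular product works: the tiny `SEMANTIC_NET` dictionary, the stopword list, and the function names `index_document` and `curate` are all hypothetical, and real semantic engines rely on far richer linguistic rules than exact term matching.

```python
import re
from collections import Counter

# Hypothetical mini "semantic net": each known term maps to a concept label.
SEMANTIC_NET = {
    "photovoltaic": "solar_energy",
    "solar cell": "solar_energy",
    "battery": "energy_storage",
    "electrochemical cell": "energy_storage",
}

# Assumed stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "for", "in", "to", "is", "are", "with"}


def index_document(text, semantic_net):
    """Return (concepts found, unknown terms queued for a human curator)."""
    lowered = text.lower()
    concepts = set()
    matched_words = set()
    # Check longer terms first so multi-word phrases are considered before single words.
    for term in sorted(semantic_net, key=len, reverse=True):
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            concepts.add(semantic_net[term])
            matched_words.update(term.split())
    # Words the net does not cover are the software's "What is this?" questions.
    words = re.findall(r"[a-z]+", lowered)
    unknown = Counter(
        w for w in words if w not in STOPWORDS and w not in matched_words
    )
    return concepts, unknown


def curate(semantic_net, term, concept):
    """A subject matter expert 'teaches' the net where a new term belongs."""
    semantic_net[term.lower()] = concept


net = dict(SEMANTIC_NET)
doc = "The photovoltaic array charges a battery and a supercapacitor."
concepts, unknown = index_document(doc, net)
# The curator reviews the unknown terms, assigns one, and the document is re-indexed.
curate(net, "supercapacitor", "energy_storage")
concepts_after, unknown_after = index_document(doc, net)
```

Each pass shrinks the list of unknown terms, which is the iterative behavior described above: the engine exposes what it cannot place, and a person decides where it belongs.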

2. In other words, once they [the software] have established relationships among data, can they use that finding to proceed - without human intervention - to seek new relationships?

Yes, in the manner described for the previous question. It is important to recognize that the original set of rules, ontologies, or semantic nets that are being applied were crafted by human beings with subject matter expertise. It is unrealistic to think that any team of experts would be able to know or anticipate every use of the human language to codify it in advance for total accuracy. The term AI is, for this reason, a misnomer because the algorithms are not thinking; they are only looking up "known-knowns" and applying them. The art of the software is in recognizing when something cannot be discerned or clearly understood; then the concept (in context) is presented for the expert to "teach" the software what to do with the information.

State-of-the-art software will have a back-end process for enabling implementer/administrators to use the results of search (direct commentary from users or indirectly by analyzing search logs) to discover where language has been misunderstood as evidenced by invalid results. Over time, more passes to update linguistic definitions, grammar rules, and concept relationships will continue to refine and improve the accuracy and comprehensiveness of search results.
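
As a rough illustration of the indirect route (analyzing search logs), the sketch below flags queries that repeatedly return nothing or return results nobody clicks. The log format, a list of `(query, result_count, clicked)` tuples, the 50 percent threshold, and the function name `find_problem_queries` are assumptions made for this example; actual log formats vary by search product.

```python
from collections import defaultdict


def find_problem_queries(log_entries, min_searches=2, failure_threshold=0.5):
    """Flag queries whose results look invalid: zero hits, or hits nobody clicked.

    log_entries: iterable of (query, result_count, clicked) tuples (assumed format).
    Returns a list of (query, searches, failure_rate), worst first.
    """
    stats = defaultdict(lambda: {"searches": 0, "failures": 0})
    for query, result_count, clicked in log_entries:
        # Normalize case and whitespace so trivial variants are counted together.
        key = " ".join(query.lower().split())
        stats[key]["searches"] += 1
        if result_count == 0 or not clicked:
            stats[key]["failures"] += 1

    problems = []
    for query, s in stats.items():
        if s["searches"] < min_searches:
            continue  # too little evidence to judge this query
        failure_rate = s["failures"] / s["searches"]
        if failure_rate >= failure_threshold:
            problems.append((query, s["searches"], failure_rate))
    # Highest failure rate first, then most-searched, then alphabetical.
    problems.sort(key=lambda p: (-p[2], -p[1], p[0]))
    return problems


sample_log = [
    ("MI treatment", 0, False),
    ("mi  treatment", 0, False),
    ("heart attack", 12, True),
    ("heart attack", 9, True),
    ("stent", 4, False),
]
for query, searches, rate in find_problem_queries(sample_log):
    print(f"{query}: {searches} searches, {rate:.0%} failed")
```

A list like this is only a starting point. Someone who knows the domain still has to decide whether "MI" belongs in the vocabulary as a synonym for myocardial infarction or whether the content is simply missing.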

3. It occurs to me that the key value added of semantic technologies to decision-making is their capacity to link sources by context and meaning, which increases situational awareness and decision space. But can they probe further on their own?

Good point on the value and in a sense, yes, they can. Through extensive algorithmic operations, instructions can be embedded (and probably are for high-value situations like intelligence work), instructing the software what to do with newly discovered concepts. Instructions might then place these new discoveries into categories of relevance, importance, or associations. It would not be unreasonable to then pass documents with confounding information off to other semantic tools for further examination. Again, without human analysis along the continuum and at the end point, no certainty about the validity of the software's decision-making can be asserted.

I can hypothesize a case in which a corpus of content contains random documents in foreign languages. From my research, I know that some of the semantic packages have semantic nets in multiple languages. If the corpus contains material in English, French, German and Arabic, these materials might be sorted and routed off to four different software applications. Each batch would be subject to further linguistic analysis, followed by indexing with some middleware applied to the returned results for normalization, and final consolidation into a unified index. Does this exist in the real world now? Probably there are variants but it would take more research to find the cases, and they may be subject to restrictions that would require the correct clearances.
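
To make the hypothetical concrete, here is a toy sketch of the sort-route-consolidate flow. Everything in it is an assumption for illustration: the stopword-overlap language guess, the `analyze` stand-in for four separate language-specific packages, and the `(language, token)` index key are placeholders, not a description of any existing system.

```python
import re
from collections import defaultdict

# Tiny assumed stopword lists, used both for guessing the language and for filtering.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in"},
    "fr": {"le", "la", "et", "les", "des", "est"},
    "de": {"der", "die", "und", "das", "ist"},
}


def detect_language(text):
    """Very rough language guess: Arabic by script, others by stopword overlap."""
    if re.search(r"[\u0600-\u06FF]", text):
        return "ar"
    words = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def analyze(lang, text):
    """Stand-in for a language-specific semantic package: tokens minus stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS.get(lang, set())]


def build_unified_index(documents):
    """Route each document by language, analyze it, and merge into one index."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        lang = detect_language(text)
        for token in analyze(lang, text):
            # Normalized key: (language, token) -> set of document ids.
            index[(lang, token)].add(doc_id)
    return index


corpus = {
    "d1": "The price of oil is rising",
    "d2": "Le prix du pétrole est élevé",
    "d3": "Der Preis und das Öl",
    "d4": "سعر النفط",
}
unified = build_unified_index(corpus)
```

The middleware step is reduced here to a shared key format. In practice, normalizing concepts across languages (so that "oil" and "pétrole" resolve to the same concept) is the hard part, and it is exactly where multilingual semantic nets and human review would come in.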

Discussions with experts who have actually employed enterprise-specific semantic software underscore the need for subject expertise and some computational linguistics training, coupled with an aptitude for creative inquiry. These scientists informed me that individuals who are highly multi-disciplinary and facile with electronic games and tools did the best job of interacting with the software and getting excellent results. Tuning and configuration over time by the right human players is still a fundamental requirement.

The gradual upturn from the worst economic conditions in decades is reason for hope. A growing economy coupled with continued adoption of enterprise software, in spite of the tough economic climate, keeps me tuned to what is transpiring in this industry. Rather than being cajoled into believing that "search" has become commodity software, which it hasn't, I want to comment on the wisdom of Jill Dyché and her Anti-predictions for 2011 in a recent Information Management Blog. There are important lessons here for enterprise search professionals, whether you have already implemented or plan to soon.

Taking her points out of order, I offer a bit of commentary on those that have a direct relationship to enterprise search. Based on past experience, Ms. Dyché predicts some negative outcomes but with a clear challenge for readers to prove her wrong. As noted, enterprise search offers some solutions to meet the challenges:


  1. No one will be willing to shine a bright light on the fact that the data on their enterprise data warehouse isn't integrated. It isn't just the data warehouse that lacks integration among assets, but among all applications housing critical structured and unstructured content. This does not have to be the case. Several state-of-the-art enterprise search products that are not tied to a specific platform or suite of products do a fine job of federating indexing of disparate content repositories. In a matter of weeks or a few months, a search solution can be deployed to crawl, index and search multiple sources of content. Furthermore, newer search applications are being offered for pre-purchase testing for out-of-the-box suitability in pilot or proof-of-concept (POC) projects. Organizations that are serious about integrating content silos have no excuse for not taking advantage of easier-to-deploy search products.

  2. Even if they are presented with proof of value, management will be reluctant to invest in data governance. Combat this entrenched bias with a strategy to overcome lack of governance; a cost-cutting argument is unlikely to change minds. However, risk is an argument that will resonate, particularly when bolstered with examples. Include instances when customers were lost due to poor performance or failure to deliver adequate support services, sales were lost because qualifying questions could not be answered or answers were not timely, legal or contract issues could not be defended due to inaccessibility of critical supporting documents, or maintenance revenue was lost due to incomplete, inaccurate or late renewal information getting out to clients. One simple example is the consequences of not sustaining a concordance of customer name, contact, and address changes. The inability of content repositories to talk to each other, or to aggregate related information in a search because they cannot recognize that a Customer labeled as Marion University at one address is the same as the Customer labeled University of Marion at another address, will be embarrassing in communications and, even worse, costly. Governance of processes like naming conventions and standardized labeling enhances the value and performance of every enterprise system including search.

  3. Executives won't approve new master data management or business intelligence funding without an ROI analysis. This ties in with the first item because many enterprise search applications include excellent tools for performing business intelligence, analytics, and advanced functions to track and evaluate content resource use. The latter is an excellent way to understand who is searching, for what types of data, and the language used to search. These supporting functions are being built into applications for enterprise search and do not add additional cost to product licenses or implementation. Look for enterprise search applications that are delivered with tools that can be employed on an ad hoc basis by any business manager.

  4. Developers won't track their time in any meaningful way. This is probably true because many managers are poorly equipped to evaluate what goes into software development. However, in this era of adoption of open source, particularly for enterprise search, organizations that commit to using Lucene or Solr (open source search) must be clear on the cost of building these tools into functioning systems for their specialized purposes. Whether development will be done internally or by a third party, it is essential to place strong boundaries around each project and deployment, with specifications that stage development, milestones and change orders. "Free" open source software is not free or even cost effective when an open meter for "time and materials" exists.

  5. Companies that don't characteristically invest in IT infrastructure won't change any time soon. So, the silo-ed projects will beget more silo-ed data... Because the adoption rate for new content management applications is so high, and the ease of deploying them encourages replication like rabbits, it is probably futile to try to stanch their proliferation. This is an important area for governance to be employed, to detect redundancy, perform analytics across silos, and call attention to obvious waste and duplication of content and effort. Newer search applications that can crawl and index a multitude of formats and repositories will easily support efforts to monitor and evaluate what is being discovered in search results. Given a little encouragement to report redundancy and replicated content, every user becomes a governor over waste. Play on the natural inclination for people to complain when they feel overwhelmed by messy search results, by setting up a simple (click a button) reporting mechanism to automatically issue a report or set a flag in a log file when a search reveals a problem.
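
The customer-name concordance mentioned in item 2 can be illustrated with a short Python sketch. It assumes a deliberately crude rule (lowercase the name, drop a few connective words, and ignore word order) and hypothetical record fields `name` and `address`. It is meant to show the idea of a governance check, not a production matching algorithm, and ignoring word order can produce false matches that a person must review.

```python
import re
from collections import defaultdict

# Assumed list of connective words to ignore when comparing names.
NOISE_WORDS = {"of", "the", "at", "and"}


def concordance_key(name):
    """Reduce a customer name to an order-insensitive key of significant words."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    significant = sorted(w for w in words if w not in NOISE_WORDS)
    return " ".join(significant)


def group_customer_records(records):
    """Group records that share a key; return only groups with likely duplicates."""
    groups = defaultdict(list)
    for record in records:
        groups[concordance_key(record["name"])].append(record)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


customers = [
    {"name": "Marion University", "address": "1 Main St"},
    {"name": "University of Marion", "address": "200 College Ave"},
    {"name": "Acme Corp", "address": "5 Industrial Way"},
]
for key, recs in group_customer_records(customers).items():
    print(key, "->", [r["name"] for r in recs])
```

Flagged groups would go to whoever owns the customer data to confirm or reject. The point is that a naming convention enforced at entry, plus a periodic check like this, is far cheaper than discovering the mismatch in front of a client.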

It is time to stop treating enterprise search like a failed experiment and instead, leverage it to address some long-standing technology elephants roaming around our enterprises.


To follow other search trends for the coming year, you may want to attend a forthcoming webinar, 11 Trends in Enterprise Search for 2011, which I will be moderating on January 25th. These two blogs also have interesting perspectives on what is in store for enterprise applications: CSI Info-Mgmt: Profiling Predictors 2011, by Jim Ericson and The Hottest BPM Trends You Must Embrace In 2011!, by Clay Richardson. Also, some of Ms. Dyché's commentary aligns nicely with "best practices" offered in this recent beacon, Establishing a Successful Enterprise Search Program: Five Best Practices

Mining content for facts and information relationships is a focal point of many semantic technologies. Among the text analytics tools are those for mining content in order to process it for further analysis and understanding, and indexing for semantic search. This will move enterprise search to a new level of research possibilities.

Research for a forthcoming Gilbane report on semantic software technologies turned up numerous applications used in the life sciences and publishing. Neither semantic technologies nor text mining is mentioned in the recent New York Times article Rare Sharing of Data Leads to Progress on Alzheimer's, but I am pretty certain that these technologies had some role in enabling scientists to discover new data relationships and synthesize new ideas about Alzheimer's biomarkers. The sheer volume of data from all the referenced data sources demands computational methods to distill and analyze.

One vertical industry poised for potential growth of semantic technologies is the energy field. It is a special interest of mine because it is a topical area in which I worked as a subject indexer and searcher early in my career. Beginning with the first energy crisis, the oil embargo of the mid-1970s, I worked in research organizations that involved both fossil fuel exploration and production, and alternative energy development.

A hallmark of technical exploratory and discovery work is the time gaps between breakthroughs; there are often significant plateaus between major developments. This happens when research reaches a point where an enabling technology is not available or commercially viable to move to the next milestone of development. I observed that the quest for innovative energy technologies often began with decades-old research that stopped before commercialization.

Building on what we have already discovered, invented or learned is one key to success for many "new" breakthroughs. Looking at old research from a new perspective to lower costs or improve efficiency for such things as photovoltaic materials or electrochemical cells (batteries) is what excellent companies do.

How does this relate to semantic software technologies and data mining? We need to begin with content that was generated by research in the last century; much of this is just now being made electronic. Even so, most of the conversion from paper, or micro formats like fiche, is to image formats. In order to make the full transition to enable data mining, content must be further enhanced through optical character recognition (OCR). This will put it into a form that can be semantically parsed, analyzed and explored for facts and new relationships among data elements.

Processing of old materials is neither easy nor inexpensive. There are government agencies, consortia, associations, and partnerships of various types of institutions that often serve as a springboard for making legacy knowledge assets electronically available. A great first step would be having DOE and some energy industry leaders collaborating on this activity.

A future of potential man-made disasters, even when knowledge exists to prevent them, is not a foregone conclusion. Intellectually, we know that energy independence is prudent, economically and socially mandatory for all types of stability. We have decades of information and knowledge assets in energy related fields (e.g. chemistry, materials science, geology, and engineering) that semantic technologies can leverage to move us toward a future of energy independence. Finding nuggets of old information in unexpected relationships to content from previously disconnected sources is a role for semantic search that can stimulate new ideas and technical research.

A beginning is a serious program of content conversion capped off with use of semantic search tools to aid the process of discovery and development. It is high time to put our knowledge to work with state-of-the-art semantic software tools and by committing human and collaborative resources to the effort. Coupling our knowledge assets of the past with the ingenuity of the present, we can achieve energy advances using semantic technologies already embraced by the life sciences.

It is not news that enterprise search has been relegated to the long list of failed technologies by some. We are at the point where many analysts and business writers have called for a moratorium on the use of the term. Having worked in a number of markets and functional areas (knowledge management/KM, special libraries, and integrated library software systems) that suffered the death knell, even while continuing to exist, I take these pronouncements as a game of sorts.

Yes, we have seen the demise of vinyl phonograph records, cassette tapes and probably soon musical CD albums, but those are explicit devices and formats. When you can't buy or play them any longer, except in a museum or collector's garage, they are pretty dead in the marketplace. This is not true of search in the enterprise, behind the firewall, or wherever it needs to function for business purposes. People have always needed to find "stuff" to do their work. KM methods and processes, special libraries and integrated library systems still exist, even as they were re-labeled for PR and marketing purposes.

What is happening to search in the enterprise is that it is finding its purpose, or more precisely its hundreds of purposes. It is not a monolithic software product, a one-size-fits-all. It comes in dozens of packages, models, and price ranges. It may be embedded in other software or standalone. It may be procured for a point solution to support retrieval of content for one business unit operating in a very narrow topical range, or it may be selected to give access to a broad range of documents that exist in numerous enterprise domains on many subjects.

Large enterprises typically have numerous search solutions in operation, implementation, and testing, all at the same time. They are discovering how to deploy and leverage search systems and they are refining their use cases based on what they learn incrementally through their many implementations. Teams of search experts are typically involved in selecting, deploying and maintaining these applications based on their subject expertise and growing understanding of what various search engines can do and how they operate.

After years of hearing about "the semantic Web," the long sought after "holy grail" of Web search, there is a serious ramping of technology solutions. Most of these applications can also make search more semantically relevant behind the firewall. These technologies have been evolving for decades beginning with so-called artificial intelligence, and now supported by some categories of computational linguistics such as specific algorithms for parsing content and disambiguating terms. A soon-to-be-released study featuring some noteworthy applications reveals just how much is being done in enterprises for specific business purposes.

With this "teaser" on what is about to be published, I leave you with one important thought: meaningful search technologies depend on rich, linguistically based technologies. Without a cornucopia of software tools to build terminology maps and dictionaries, analyze content linguistically in context to elicit meaning, parse and evaluate unstructured text data sources, and manage vocabularies of ever more complex topical domains, semantic search could not exist.

Language complexities are challenging and even vexing. Enterprises will be finding solutions to leverage what they know only when they put human resources into play to work with the lingo of their most valuable domains.

Designing an enterprise search interface that employees will use on their intranet is challenging in any circumstance. But starting from nothing more than verbal comments or even a written specification is really hard. However, conversations about what is needed and wanted are informative because they can be aggregated to form the basis for the overarching design.

Frequently, enterprise stakeholders will reference a commercial web site they like or even search tools within social sites. These are a great starting point for a designer to explore. It makes a lot of sense to visit scores of sites that are publicly accessible or sites where you have an account and navigate around to see how they handle various design elements.

To start, look at:

  • How easy is it to find a search box?
  • Is there an option to do advanced searches (Boolean or parametric searching)?
  • Is there a navigation option to traverse a taxonomy of terms?
  • Is there a "help" option with relevant examples for doing different kinds of searches?
  • What happens when you search for a word that has several spellings or synonyms, a phrase (with or without quotes), a phrase with the word "and" in it, a numeral, or a date?
  • How are results displayed: what information is included, what is the order of the results and can you change them? Can you manipulate results or search within the set?
  • Is the interface uncluttered and easily understood?

The point of this list of questions is that you can use it to build a set of criteria for designing what your enterprise will use and adopt, enthusiastically. But this is only a beginning. By actually visiting many sites outside your enterprise, you will find features that you never thought to include or aggravations that you will surely want to avoid. From these experiences on external sites, you can build up a good list of what is important to include or banish from your design.

When you find sites that you think are exemplary, ask key stakeholders to visit them and give you their feedback, preferences and dislikes. In particular, note what confuses them and what excites them.

This post originated because several press notices in the past month brought to my attention Web applications with sophisticated and very specialized search capabilities. I think they can provide terrific ideas for the enterprise search design team and also be used to demonstrate to your internal users just what is possible.

Check out these applications and articles: KNovel, particularly this KNovel page; ThomasNet; and EBSCOHost, mentioned in this article about the "deep Web." All these applications reveal superior search capabilities, have long track records, and are already used by enterprises every day. Because they are already successful in the enterprise, some by subscription, they are worth a second look as examples of how to approach your enterprise's search interface design.

A recent article about how Google Internet search does not use meta tags to find relevant content got me thinking about a couple of things.

First, it explains why none of the articles I write for this blog about enterprise search appear in Google alerts for “enterprise search.” Besides being a personal annoyance, easily resolved if I invested in some Internet search optimization, it may explain why meta tagging is a hard sell behind the firewall.

I do know something about getting relevant content to show up in enterprise search systems, and it does depend on a layer of what I call “value-added metadata” applied by someone who knows the subject matter in the target content and the audience. Working with the language of the enterprise audience that relies on finding critical content to do their jobs, a meta tagger will bring out topical language known to be the lingua franca of the dominant searchers as well as the language that will be used by novice employee searchers. The key here is to recognize that in any specific piece of content its “aboutness” may never be explicitly spelled out in terminology by the author.

In one example, let’s consider some fundamental HR information about “holiday pay” or “compensation for holidays” or “compensation for time-off.” The strings in quotes were used throughout documents on the intranet of one organization where I consulted. When some employees complained about not being able to find this information using the company search system, my review of search logs showed a very large number of searches for “vacation pay” and almost no searches for “compensation” or “holidays” or “time off.” Thus, there was no way that employees using the search engine would stumble upon the useful information they were seeking, unless meta tags made “vacation pay” a retrievable index pointer to these documents. A tagger who had analyzed the search logs would have seen the high number of searches for that phrase and realized that it was needed as a meta tag.
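The kind of log review described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in that engagement: the log entries, the document texts, and the function name `find_vocabulary_gaps` are all hypothetical, and matching is plain case-insensitive substring matching rather than anything a real search engine would do.

```python
from collections import Counter


def find_vocabulary_gaps(search_log, documents, min_count=2):
    """Return frequent queries whose exact phrase appears in no document.

    search_log: list of query strings as typed by searchers (hypothetical data).
    documents: dict mapping a document id to its full text (hypothetical data).
    """
    # Normalize so "Vacation Pay" and "vacation pay " are counted together.
    counts = Counter(q.strip().lower() for q in search_log if q.strip())
    texts = [text.lower() for text in documents.values()]

    gaps = []
    for query, count in counts.most_common():
        if count < min_count:
            continue
        # A gap: searchers use the phrase, but no document contains it.
        if not any(query in text for text in texts):
            gaps.append((query, count))
    return gaps


# Hypothetical log and intranet content, modeled on the example above.
search_log = ["vacation pay", "Vacation pay", "vacation pay",
              "holiday pay", "compensation"]
documents = {
    "hr-001": "Policy on holiday pay and compensation for time-off.",
    "hr-002": "Compensation for holidays is calculated per pay period.",
}

for phrase, count in find_vocabulary_gaps(search_log, documents):
    print(f"candidate meta tag: {phrase!r} ({count} searches, 0 documents)")
```

Phrases reported this way are only candidates; someone who knows the content still has to decide which documents each tag belongs on.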

Now, back to Google’s position on ignoring meta tags because writers and marketing managers were “gaming the system.” They were adding tags they thought would be popular to get people to look at content not related but for which they were seeking a huge audience.

I have heard the concern that people within enterprises might also hijack the usefulness of content they were posting in blogs or wikis to get more “eyeballs” in the organization. This is a foolish concern, in my opinion. First, I have never seen evidence that this happens and don’t believe that any productive enterprise has people engaging in this obvious foolishness.

More importantly, professional growth and success depend on the perceptions of others, their belief in you and your work, and the value of your ideas. If an employee is so foolish as to misdirect fellow employees to useless or irrelevant content, he is not likely to gain or keep the respect of his peers and superiors. In the long run, persistent misleading or mischievous meta tagging will have just the opposite effect, creating a pathway to the door.

Conversely, the super meta tagger with astute insights into what people are looking for and how they are most likely to look for it, will be the valued expert we all need to care for and spoon feed us our daily content. Trusted resources rise to the top when they are appropriately tagged and become bedrock content when revealed through enterprise search on well-managed intranets.

It takes patience, knowledge and analysis to tell when search is really working. For the past few years I have seen a trend away from doing any "dog work" to get search solutions tweaked and tuned to ensure compliance with genuine business needs. People get cut, budgets get sliced and projects get dumped because (fill in the excuse), and the message gets promoted that "enterprise search doesn't work." Here's the secret: when enterprise search doesn't work, the chances are it's because people aren't working on what needs to be done. Everyone is looking for a quick fix, a shortcut, a "no thinking required" solution.

This plays out in countless variations but the bottom line is that impatience with human processing time and the assumption that a search engine "ought to be able to" solve this problem without human intervention cripple possibilities for success faster than anything else.

It is time for search implementation teams to get realistic about the tasks that must be executed and milestones to be reached. Teams must know how they are going to measure success and reliability, and then stick with it, insisting that everyone agree on the requirements instead of throwing in the towel at the first executive anecdote that the "dang thing doesn't work."

There are a lot of steps to getting even an out-of-the-box solution working well. But none is more important than paying attention to these:
• Know your content
• Know your search audience
• Know what needs to be found and how it will be looked for
• Know what is not being found that should be

The operative verb here is "to know," and to really know anything takes work: brain work that is iterative, analytical and thoughtful. When I see these reactions from IT upon setting off a search query that returns any results: "we're done" OR "no error messages, good" OR "all these returns satisfy the query," my reaction is to ask:

• How do you know the search engine was really looking in all the places it should?
• What would your search audience be likely to look for and how would they look?
• Who is checking to make sure these questions are being answered correctly?
• How do you know if the results are complete and comprehensive?

It is the last question that takes digging and perseverance. It is pretty simple to look at search results and see content that should not have been retrieved and figure out why it was. Then you can tune to make sure it does not happen again.

Making sure you didn't miss something takes systematic "dog work," and you have to know the content. This means starting with a body of content small enough for you to know thoroughly. Begin with content representative of what your most valued search audience would want to find. Presumably, you have identified these people through establishing a clear business case for enterprise search. (This is not something for the IT department to do but for the business team that is vested in having search work for their goals.) Get these "alpha worker" searchers to show you how they would go about trying to find the stuff they need to get their work done every day, and to share with you some of the most valuable documents they have worked with over the past few years. (Yes, years - you need to work with veterans of the organization whose value is well established, as well as with legacy content that is still valuable.)

Confirm that these seminal documents are in the path of the search engine for the index build; see what is retrieved when they are searched for by the seekers. Keep verifying by looking at both content and results to be sure that nothing is coming back that shouldn't and that nothing is being missed. Then double the content with documents on similar topics that were not given to you by the searchers, even material that they likely would never have seen that might be formatted very differently, written by different authors, and more variable in type and size but still relevant. Re-run the exact searches that were done originally and see what is retrieved. Repeat in scaling increments and validate at every point. When you reach points where content is missing from results that should have been found using the searcher's method, analyze, adjust, and repeat.
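One way to keep this validation honest at each scaling increment is to record, for every test query, which documents should come back and compare that list with what the engine actually returns. The sketch below assumes you can export result lists as document ids. The ids, the corpus labels, and the function name `evaluate_known_items` are hypothetical, and the expected list has to be maintained by someone who knows the content, growing as relevant documents are added.

```python
def evaluate_known_items(expected_ids, retrieved_ids):
    """Compare what a search returned against what should have been found."""
    expected = set(expected_ids)
    retrieved = set(retrieved_ids)
    found = expected & retrieved

    # Both ratios are reported as 0.0 when their denominator is zero;
    # that is a choice made for this sketch, and conventions vary.
    recall = len(found) / len(expected) if expected else 0.0
    precision = len(found) / len(retrieved) if retrieved else 0.0

    return {
        "recall": recall,                            # share of expected documents found
        "precision": precision,                      # share of results that were expected
        "missed": sorted(expected - retrieved),      # should have come back, did not
        "unexpected": sorted(retrieved - expected),  # came back, should not have
    }


# Hypothetical results for the same query at two corpus sizes.
expected = ["doc-1", "doc-2", "doc-3", "doc-4"]
rounds = {
    "seed corpus": ["doc-1", "doc-2", "doc-3", "doc-4"],
    "doubled corpus": ["doc-1", "doc-2", "doc-4", "doc-9"],
}

for label, retrieved in rounds.items():
    print(label, evaluate_known_items(expected, retrieved))
```

The "missed" list is the one that calls for digging: each entry is a document the searcher's own method should have surfaced, so it marks the point at which to analyze, adjust, and repeat.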

A recent project revealed to me how willing testers are to accept mediocre results once it becomes apparent how closely content must be scrutinized and peeled back to determine its relevance. They had no time for that and did not care how bad the results were because they had a pre-defined deadline. Adjustments may call for refinements in the query formulation that might require an API to make it more explicit, or the addition of better category metadata with rich cross-references to cover vocabulary variations. Too often this type of implementation discovery signals a reason to shut down the project because all options require human resources and more time. Before you begin, know that this level of scrutiny will be necessary to deliver good-to-great results; set that expectation for your team and management, so that it will be acceptable to them when adjustments require more work to get it right. Just don't blame it on the search engine - get to work, analyze and fix the problem. Only then can you let search loose on your top target audience.
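As a rough illustration of the cross-reference idea, here is a minimal Python sketch that expands a searcher's phrase with the variants a vocabulary map lists for it before the query is sent on. Everything in it is an assumption made for illustration: the map contents, the function names `expand_query` and `to_boolean_or`, and the quoted OR syntax, which would have to be adapted to whatever query language your engine actually accepts.

```python
def expand_query(query, cross_references):
    """Return the normalized query plus its cross-referenced variants, no duplicates."""
    normalized = query.strip().lower()
    variants = [normalized]
    for term in cross_references.get(normalized, []):
        term = term.strip().lower()
        if term not in variants:
            variants.append(term)
    return variants


def to_boolean_or(variants):
    """Join variants into a quoted OR expression (illustrative syntax only)."""
    return " OR ".join(f'"{v}"' for v in variants)


# Hypothetical cross-reference map maintained by someone who knows the domain.
cross_references = {
    "vacation pay": [
        "holiday pay",
        "compensation for holidays",
        "compensation for time-off",
    ],
}

print(to_boolean_or(expand_query("Vacation Pay", cross_references)))
```

Whether the variants are applied at query time, as here, or added to documents as category metadata at index time is a design choice. Either way, the map itself only exists if people do the work of building and maintaining it.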

When thinking about some enterprise search use cases that require planning and implementation, presentation of search results is not often high on the list of design considerations. Learning about a new layer of software called Documill from its CEO and founder, Mika Könnölä, caused me to reflect on possible applications in which his software would be a benefit.

There is one aspect of search output (results) that always makes an impression when I search. Sometimes the display is clear and obvious and other times the first thing that pops into my mind is "what the heck am I looking at" or "why did this stuff appear?" In most cases, no matter how relevant the content may end up being to my query, I usually have to plow through a lot (could be dozens) of content pieces to confirm the validity or usefulness of what is retrieved.

Admittedly, much of my searching is research or helping with a client's intranet implementation, not just looking for a quick answer, a fact or specific document. When I am in the mode for what I call "quick and dirty" search, I can almost always frame the search statement to get the exact result I want very quickly. But when I am trying to learn about a topic new to me, broaden my understanding or collect an exhaustive corpus of material for research, sifting and validating dozens of documents by opening each and then searching within the text for the piece of the content that satisfied the query is both tedious and annoyingly slow.

That is where Documill could enrich my experience considerably, for it can be layered on any number of enterprise search engines to present results in the form of precise thumbnails that show where in a document the query criteria are located. In their own words, "it enhances traditional search engine result list with graphically accurate presentation of the content."

Here are some ideas for its application:

  • In an application developed to find specific documents from among thousands that are very similar (e.g. invoices, engineering specifications), wouldn't it be great to see only a dozen pages, already opened to the correct location where the data matches the query?
  • In an application of tens of thousands of legacy documents, OCRed for metadata extraction and displayable as PDFs, wouldn't it be great to have the exact pages of the document that match the search displayed as visual images, opened to read in the results page? This is especially important in technical documents of 60-100 pages where the target content might be on page 30 or 50.
  • In federated search output, when results may contain many similar documents, the immediate display of just the right pages as images ready for review will be a time-saving blessing.
  • In a situation where a large corpus of content contains photographs or graphics, such as newspaper archives, scientific and engineering drawings, an instantaneous visual of the content will sharpen access to just the right documents.

I highly recommend that you ask your search engine solution provider about incorporating Documill into your enterprise search architecture. And, if you have, please share your experiences with me through comments to this post or by reaching out for a conversation.

Gilbane Boston 2011

Establishing a Successful Enterprise Search Program: Five Best Practices, by Lynda Moulton
