At the recent Gilbane Boston Conference I was happy to hear many remarks positioning and defining “Big Data,” and to see how varied those comments were. Like so much in the marketing sphere of high tech, answers begin with technology vendors but get refined and parsed by analysts and consultants, who need to set clear expectations about the actual problem domain. It’s a good thing that we have humans to do that defining because even the most advanced semantic technology would be hard pressed to give you a single useful answer.
I heard Sue Feldman of IDC give a pretty good “working definition” of big data at the Enterprise Search Summit in May 2012. To paraphrase, it was:
- More than 100 TB, up to petabytes, OR
- More than 60% growth a year of unstructured and unpredictable content, OR
- Ultra high streaming content
But we then get into debates about differentiating data from unstructured content when a phrase like “big data” is applied to unstructured content, which knowledge strategists like me tend to categorize as packaged information. Never mind; technology solution providers will continue to come up with catchy buzz phrases to codify the problem they are solving, whether or not the phrases make semantic sense.
What does this have to do with enterprise search? In short, “findability” is an increasingly heavy lift due to the size and number of content repositories. We want to define quality findability as optimal relevance and recall.
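To make “relevance and recall” a little more concrete, here is a minimal Python sketch of how the two are commonly scored for a single query. It assumes relevance is measured as precision (the share of returned items that are actually relevant), which is my reading and not a definition given above; the function name and document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Score one query's results against a human-judged set of relevant items.

    precision: fraction of retrieved items that are relevant
    recall:    fraction of relevant items that were retrieved
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: the engine returned four documents,
# and a human judge marked three documents as relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d7"],
)
```

Here two of the four returned documents are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67). Piling redundant copies into the index tends to hurt the first number without helping the second.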
A search technology era ago, publishers, libraries, and content management solution providers were focused on human curation of non-database content, and on applying controlled vocabulary categories derived from decades of human-managed terminology lists. Automated search provided highly structured access interfaces to what we now call unstructured content. Once this model was supplanted by full-text retrieval, and new content originated in electronic formats, the proportion of un-categorized content to human-categorized content ballooned.
Hundreds of models for automatic categorization have been rolled out to try to stay ahead of the electronic onslaught. The ones that succeed do so mostly because of continued human intervention at some point in the process of making content available to be searched. From human-invented search algorithms, to terminology structuring and mapping (taxonomies, thesauri, ontologies, grammar rule bases, etc.), to hybrid machine-human indexing processes, institutions seek ways to find, extract, and deliver value from mountains of content.
This brings me to a pervasive theme from the conferences I have attended this year: the synergies among text mining, text analytics, extract, transform, load (ETL), and search technologies. These are being sought, employed and applied to specific findability issues in select content domains. It appears that the best results are delivered only when these criteria are first met:
- The business need is well defined, refined and narrowed to a manageable scope. Narrowing the scope of information initiatives is the only way to understand results, and gain real insights into what technologies work and don’t work.
- The content domain is carefully selected for its high-value content. I have long maintained that a significant issue is the amount of redundant information that we pile up across every repository. By demanding that our search tools crawl and index all of it, we are placing an unrealistic burden on search technologies to rank relevance and importance.
- Pre-processing solutions such as text mining and text analytics are applied to ferret out primary source content and eliminate re-packaged variations that lack added value.
- Pre-processing solutions such as ETL with text mining are applied to assist with content enhancement, by adding consistent metadata that does not have a high semantic threshold but will suffice to answer a large percentage of non-topical inquiries. An example would be to find the “paper” that “Jerry Howe” presented to the “AMA” last year.
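The last two criteria can be illustrated with a small Python sketch: drop re-packaged near-duplicates first, then answer a non-topical inquiry from consistent metadata alone. Everything here is an assumption for illustration, not a description of any particular product: the `Document` fields, the word-shingle Jaccard comparison, the 0.8 threshold, the sample records, and the year 2011 standing in for “last year.”

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    doc_type: str   # hypothetical metadata field, e.g. "paper" or "memo"
    author: str
    audience: str   # organization the item was presented to
    year: int


def shingles(text, size=3):
    """Return the set of overlapping word n-grams in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8):
    """Keep the first copy of each item; skip later near-identical variations."""
    kept, kept_shingles = [], []
    for doc in docs:
        current = shingles(doc.text)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept


def find(docs, **fields):
    """Return documents whose metadata matches every given field exactly."""
    return [d for d in docs
            if all(getattr(d, name) == value for name, value in fields.items())]


# Hypothetical sample records.
body = "findability depends on consistent metadata applied to high value content"
docs = [
    Document("d1", body, "paper", "Jerry Howe", "AMA", 2011),
    Document("d2", body + " reposted", "paper", "Jerry Howe", "AMA", 2011),
    Document("d3", "quarterly budget summary for the regional sales office",
             "memo", "Pat Lee", "Finance", 2011),
]

unique = drop_near_duplicates(docs)
hits = find(unique, doc_type="paper", author="Jerry Howe",
            audience="AMA", year=2011)
```

With these sample records the reposted copy (`d2`) is dropped, and the fielded lookup returns only `d1`. No topical ranking is involved; the searcher gets one answer because the metadata is consistent and the redundant copy never reached the index.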
Business managers together with IT need to focus on eliminating redundancy and on using automation tools to enhance unique, high-value content with consistent metadata, thus creating solutions for special audiences needing information to solve specific business problems. By doing this we save the searcher the most time, while delivering the best answers for making the right business decisions and advancing innovation. We need to stop thinking of enterprise search as a “big data,” single-engine effort and instead parse it into “right data” solutions for each need.