A recent inquiry about a position requiring ETL (extract, transform, load) experience prompted me to survey the job market in this area. I was surprised to see how many technical positions seek this expertise, along with experience in SQL databases and XML, mostly in healthcare, finance, and data warehousing. I am also seeing an uptick in contract positions for metadata and taxonomy development.

My research on Semantic Software Technologies put me on the radar of reporters and bloggers seeking my thoughts on the Watson-Jeopardy story. Much has been written about it, but I wanted to take a fresh look at what it all means. There is a connection to be made between the ETL field and building a knowledge base with the smarts of Watson. Inspiration for innovation can be drawn from the Watson technology, but with a caveat: it demands the expenditure of serious mental and computing perspiration.

Beyond the baked-in intelligence to answer human questions using natural language processing (NLP), an answer platform like Watson requires tons of data. That data must also be assembled into conceptually and contextually relevant databases before good answers can emerge. When documents and other electronic content are fed to a knowledge base for semantic retrieval, finely crafted metadata (data describing the content) and excellent vocabulary control add enormous value. These two content enhancers, metadata and controlled vocabularies, can transform good search into excellent search.
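To make the idea concrete, here is a minimal sketch in Python of how a controlled vocabulary can lift retrieval: a synonym ring maps variant terms to one preferred term so a query matches documents tagged with any variant. The terms and field names are invented for illustration, not drawn from any particular product.

```python
# Sketch: a controlled vocabulary as a synonym ring. Variant terms map
# to one preferred term, so a query matches content tagged with any
# variant. All terms and fields here are invented examples.
CONTROLLED_VOCABULARY = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}

documents = [
    {"id": 1, "subject": "myocardial infarction"},  # tagged with the preferred term
    {"id": 2, "subject": "hypertension"},
]

def search(query: str) -> list[int]:
    """Normalize the query to its preferred term, then match on metadata."""
    preferred = CONTROLLED_VOCABULARY.get(query.lower(), query.lower())
    return [d["id"] for d in documents if d["subject"] == preferred]

print(search("heart attack"))  # -> [1], though the document never says "heart attack"
```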

The irony of current enterprise search is that information is so abundant that it overwhelms findability rather than helping it. Content and knowledge managers cannot possibly supply the human effort needed to generate high-quality metadata for everything in sight. But numerous techniques and technologies can supplement their work by systematically exploiting that mountain of information.

Good content and knowledge managers know where to find top-quality content, but they may not know that tools exist to extract the key metadata embedded (but hidden) in all common content formats. Some of these tools can also text mine and analyze the content to produce additional intelligent descriptive data. When a content collection is large but still too small (under a million documents) to justify the most sophisticated and complex semantic search engines, ETL tools can relieve pressure on metadata managers by automating much of the mining, extracting the entities and concepts needed for good categorization.
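As a small illustration of that embedded metadata, here is a sketch using the open-source pypdf library, standing in for the commercial extraction tools discussed in this post; the file path is a placeholder.

```python
# Sketch: pull embedded metadata out of a PDF with the open-source
# pypdf library (pip install pypdf). The file path is a placeholder.
from pypdf import PdfReader

reader = PdfReader("example.pdf")  # hypothetical file
info = reader.metadata             # embedded document properties, if any
if info:
    print("Title:  ", info.title)
    print("Author: ", info.author)
    print("Created:", info.creation_date)
```

Similar extractors exist for office documents, images, and audio; the point is that much descriptive data is already sitting inside the files, waiting to be harvested.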

The array of ETL tools is large and varied. Platform tools from Microsoft (SSIS) and IBM (DataStage) can be employed to extract, transform, and load existing metadata. Independent products, such as those from Pervasive and SEAL, can add value across a variety of platforms and functional areas, dramatically enhancing content for better tagging and indexing. The call for ETL experts is usually expressed in terms of engineering functions: selecting, installing, and implementing these products. However, it must be stressed that subject and content experts are needed to work alongside the engineers; their role is to help tune and validate the extraction and transformation outcomes, making sure the terminology fits its function.
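For readers new to the pattern, here is a minimal extract-transform-load sketch using only the Python standard library. The field names and vocabulary mapping are invented; real pipelines in SSIS or DataStage are far richer, but the shape is the same: pull raw metadata, normalize it against a controlled vocabulary, and load it into a target store.

```python
# Minimal ETL sketch using only the standard library. Field names and
# the vocabulary mapping are invented for illustration.
import sqlite3

def extract(records):
    """Extract: pull raw metadata rows from some source system."""
    yield from records

def transform(row, vocabulary):
    """Transform: normalize free-text subject terms against a controlled vocabulary."""
    subject = row["subject"].strip().lower()
    row["subject"] = vocabulary.get(subject, subject)
    return row

def load(rows, db_path="metadata.db"):
    """Load: write the cleaned rows into a target table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, subject TEXT)")
    con.executemany("INSERT OR REPLACE INTO docs VALUES (:id, :subject)", rows)
    con.commit()
    con.close()

raw = [{"id": 1, "subject": " Heart Attack "}, {"id": 2, "subject": "hypertension"}]
vocab = {"heart attack": "myocardial infarction"}
load(transform(r, vocab) for r in extract(raw))
```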

Entity extraction is one major outcome of text mining in support of business analytics, but tools can do a lot more to put intelligence into play for semantic applications. Tools that filter and statistically analyze text data warehouses help reveal the terminology needed to build specialized controlled vocabularies for auto-categorization. A few vendors currently on my radar to help enterprises understand and leverage their content landscape include EntropySoft Content ETL, Information Extraction Systems, Intelligenx, ISYS Document Filters, RAMP, and XBS; there is something here for everyone.
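To show what entity extraction looks like in practice, here is a sketch using the open-source spaCy library as a stand-in for the commercial extractors named above; the sample text is invented, and the frequency count at the end hints at how extracted entities can seed a controlled vocabulary.

```python
# Sketch: named-entity extraction with the open-source spaCy library
# (pip install spacy; python -m spacy download en_core_web_sm).
# spaCy stands in here for the commercial extractors named above.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
text = "IBM built Watson in Yorktown Heights to answer Jeopardy questions."
doc = nlp(text)

# Each entity comes with a type label useful for categorization.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)

# Frequency counts over a larger corpus suggest candidate vocabulary terms.
candidates = Counter(ent.text.lower() for ent in doc.ents)
print(candidates.most_common(5))
```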

The diversity of emerging applications is a leading indicator that much innovation is still to come in all aspects of ETL. While RAMP is making headway with video, another firm, this one with a local connection, is Inforbix. I spoke with co-founder Oleg Shilovitsky for my semantic technology research last year, before the company launched. As he asserted then, it is critical to preserve, mine, and leverage the data associated with design and manufacturing operations. This area has huge growth potential, and Inforbix is now ready to address that market.

Readers who want to leverage ETL and text mining will gain know-how from the cases presented at the 2011 Text Analytics Summit, May 18-19 in Boston. The exhibits will also feature products to consider for turning piles of data into a valuable knowledge asset. I'll be interviewing experts who are speaking and exhibiting at the conference for a future piece. I hope readers will attend and seek me out to talk about their metadata management and text mining challenges; that will feed ideas for future posts.

Finally, I'm not the only one thinking along these lines. You will find other ideas, and a nudge to action, in the articles below.

Boeri, Bob. "Improving Findability Behind the Firewall." Enterprise Search Summit 2010, New York, May 2010. 28 slides.
Farrell, Vickie. "The Need for Active Metadata Integration: The Hard Boiled Truth." DM Direct Newsletter, September 9, 2005. 3 pp.
McCreary, Dan. "Entity Extraction and the Semantic Web." Semantic Universe, January 1, 2009.
White, David. "BI or Bust?" KMWorld, October 28, 2009. 3 pp.