
Category: Semantic technologies

Our coverage of semantic technologies goes back to the early 90s, when search engines focused on searching structured data in databases were looking to provide support for searching unstructured or semi-structured data as well. This early Gilbane Report, Document Query Languages – Why is it so Hard to Ask a Simple Question?, analyzes the challenge as it stood back then.

Semantic technology is a broad topic that includes all natural language processing, as well as the semantic web, linked data processing, and knowledge graphs.


Respect for Complexity and Security are Winners

I participated in one search vendor’s user conference this week, and in a webinar sponsored by another. Both impressed me because they expressed values that I respect in the content software industry and provided solid evidence that they have the technology and business delivery infrastructure to back up the rhetoric.

You have probably noted that my blog is slim on comments about specific products and this trend will continue to be the norm. However, in addition to the general feeling of good will from Endeca customers that I experienced at Endeca Discover 2007, I heard clear messages from sessions I attended that reinforced the company’s focus on helping clients solve complex content retrieval problems. Complexity is inherent in enterprises because of diversity among employees, methods of operating, technologies deployed and varied approaches to meeting business demands at every level.

In presentations by Paul Sonderegger and Jason Purcell, care was taken to explain Endeca’s approach to building its search technology solutions, and why. At the core is a fundamental truth about how organizations and people function: you never know how a huge amount of unpredictably interconnected stuff will be approached. Endeca wants its users to be able to use information as levers to discover, through personalized methods, relationships among content pieces that will pry open new possibilities for understanding the content.
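To make that discovery idea concrete, here is a minimal sketch of faceted (guided) navigation, the interaction style Endeca is best known for. It is an illustration only, not Endeca’s record store or “intra-query” mechanism, and the records, facet names, and functions are invented for the example.

```python
# A minimal sketch of faceted (guided) navigation. Records and facets are invented;
# this is not Endeca's record store or "intra-query" mechanism.
from collections import Counter

records = [
    {"title": "Q2 sales report", "dept": "sales", "region": "EMEA", "year": 2007},
    {"title": "Q2 forecast",     "dept": "sales", "region": "APAC", "year": 2007},
    {"title": "Hiring plan",     "dept": "HR",    "region": "EMEA", "year": 2006},
]

def facet_counts(items, facets=("dept", "region", "year")):
    """For each facet, count how many of the remaining records carry each value."""
    return {f: Counter(r[f] for r in items) for f in facets}

def refine(items, **selections):
    """Narrow the working set by the facet values a user has selected so far."""
    return [r for r in items if all(r.get(k) == v for k, v in selections.items())]

# A user starts broad, sees the counts, then drills into sales within EMEA.
print(facet_counts(records))
print(refine(records, dept="sales", region="EMEA"))
```

The design point is the same one made in the presentations: the user never has to predict the structure in advance; each selection recomputes the remaining choices, so unpredictable interconnections surface as you go.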

Years ago I was actively involved with a database model called an associative structural model. It was developed explicitly to store and manipulate huge amounts of database text and embodied features of hierarchical, network, and relational data structures.

It worked well for complex, integrated databases because it allowed users to manipulate and mingle data groups in unlimited ways. Unfortunately, it required a user to be able to visualize the possibilities for combining and expressing data from hundreds of fields in numerous tables by using keys. This structural complexity could not easily be taught or learned, and tools for simple visualization were not available in the early 1980s. As I listened to Jason Purcell describe Endeca’s optimized record store, and the concept of “intra-query” to provide solutions for the problems posed by double uncertainty, I thought, “They get it.” They have acknowledged the challenge of making it simple to update, use and exploit vast knowledge stores; they are working hard to meet the challenge. Good for them! We all want the flexibility to work the way we want, but if it is not easy we will not adopt it.

In a KMWorld webinar, Vivisimo’s Jerome Pesenti and customer Arnold Verstraten of NV Organon co-presented with Matt Brown of Forrester Research. The theme was search security models. Besides the reasons for considering security up front when accessing and retrieving from enterprise repositories, three basic control models were described, all based on access control lists (ACLs), along with how and why Vivisimo uses them.

Having worked with defense agencies, defense contractors and corporations with very serious security requirements on who can access what, I am very familiar with the types of data structures and indexing methods that can be used. I was pleased to hear the speakers address trade-offs that include performance and deployment issues. It served to remind me that organizations do need to be thinking about this early in the selection process; inability to handle the most sensitive content appropriately should eliminate any enterprise search vendor that tries to equivocate on security. Also, as Organon did, there is nothing that demonstrates the quality of a solution like a “bake-off” against a sufficiently large corpus of content: it will show whether documents and metadata that must not be viewed by some audiences are in fact always excluded from search results for those restricted audiences. Test it. Really test it!
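To picture what such a model does, here is a minimal sketch of query-time (late-binding) ACL filtering, one common way the control models described above can work. It is not a description of Vivisimo’s implementation; the documents, groups, and function names are hypothetical.

```python
# Minimal sketch of query-time (late-binding) ACL filtering of search results.
# All names and data structures are hypothetical, not Vivisimo's implementation.
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    text: str
    allowed_groups: set = field(default_factory=set)  # groups entitled to see this document

def search(corpus, query):
    """Naive keyword match standing in for a real search engine."""
    terms = query.lower().split()
    return [d for d in corpus if all(t in d.text.lower() for t in terms)]

def filter_by_acl(results, user_groups):
    """Drop every hit the user's groups are not entitled to see."""
    return [d for d in results if d.allowed_groups & user_groups]

corpus = [
    Document("d1", "Compound X trial results", {"research", "executives"}),
    Document("d2", "Cafeteria survey results for June", {"all-staff"}),
]

# A user who is only in "all-staff" must never see the restricted trial document.
hits = filter_by_acl(search(corpus, "results"), {"all-staff"})
assert all(d.doc_id != "d1" for d in hits)  # the "bake-off" style check: restricted docs never leak
```

One trade-off this style makes visible: checking entitlements at query time keeps permissions current but adds per-query cost, while baking ACLs into the index is faster but can serve stale entitlements until the next re-index, presumably part of the performance and deployment discussion the speakers raised.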

Turbo Search Engines in Cars: It Is Not the Whole Solution

In my quest to analyze the search tools that are available to the enterprise, I spend a lot of time searching. These searches use conventional on-line search tools and my own database of citations that link to long-forgotten articles. But true insights about products and markets usually come through the old-fashioned route, the serendipity of routine life. For me search also includes the ordinary things I do every day:

  • Looking up a fact (e.g. phone number, someone’s birthday, woodchuck deterrents), which I may find in an electronic file or hardcopy
  • Retrieving a specific document (e.g. an expense form, policy statement, or ISO standard), which may be on-line or in my file cabinet
  • Finding evidence (e.g. examining search logs to understand how people are using a search engine, looking for a woodchuck hole near my garden, examining my tires for uneven tread wear), which requires viewing electronic files or my physical environment
  • Discovering who the experts are on a topic or what expertise my associates have (e.g. looking up topics to see who has written or spoken, reading resumes or biographies to uncover experience), which is more often done on-line but may be buried in a 20-year old professional directory on the shelf
  • Learning about a subject I want or need to understand (e.g. How are search and text analytics being used together in business enterprises? What is the meaning of the tag line “Turbo Search Engine” on an Acura ad?), questions that were partially answered with online search but also by attending conferences like the Text Analytics Summit 2007 this week

This list illustrates several things. First, search is about finding facts, evidence, and aggregated information (documents). It is also about discovering, learning and uncovering information that we can then analyze for any number of decisions or potential actions.

Second, search enables us to function more efficiently in all of our worldly activities, execute our jobs, increase our own expertise and generally feed our brains.

Third, search does not require electronic technology or sophisticated tools, just our amazing senses: sight, hearing, touch, smell and taste.

Fourth, what Google now defines as “cloud computing” and what MIT geeks began touting as “wearable” technology a few years ago have converged to bring us cars embedded with what Acura defines as “turbo search engines.” On this fourth point, I had some discovering of my own to do. In small print on the full-page ad in Newsweek were phrases like “linked to over 7,000,000 destinations” and “knows where traffic is.” In even tinier print was the statement, “real-time traffic monitoring available in select markets…” I thought I understood that they were promoting the pervasiveness of search potential through the car’s extensive technological features. Then I searched the Internet for the phrase “turbo search engine” coupled with “Acura” only to learn that there was more to it. Notably, there is the “…image-tagging campaign that enables the targeted audience to use their fully-integrated mobile devices to be part of the promotion.” You can read the context yourself.

Well, I am still trying to get my head around this fourth point to understand how important it is in helping companies find solid, practical search solutions to the problems they face in business enterprises. I don’t believe that a parking lot full of Acuras is something I will recommend.

Fifth, the week gave me some additional thoughts about the place for search technology. Technology experts like Sue Feldman of IDC and Fern Halper of Hurwitz & Associates appeared on a panel at the Text Analytics Summit. While making clear the distinctions between search and text analytics, and between text analytics and text mining, Sue also made clear that the algorithmic techniques employed by the various tools being demonstrated are distinct, each solving different problems in different business situations. She and others acknowledged that, having finally embraced search, enterprises are now adopting significant applications that use text analytic techniques to make better sense of all the found content.

Integration was a recurring theme at the conference, even as it was also obvious that no one product embodies the full range of text search, mining and analytics that any one enterprise might need. When tools and technologies are procured in silos, good integration is a tough proposition, and a costly one. Providing a seamless continuum from capturing, storing, and organizing content to retrieving and analyzing the text in it takes forethought and intelligent human design; it cannot be achieved by tacking on one product after another and retrofitting. Even if you can’t procure the whole solution to all your problems at once, and who can, you do need a vision of where you are going to end up so that each deployment is a building block in the whole architecture.

There is a lot to discover at conferences that can’t be learned through search, like what you absorb in a random mix of presentations, discussions and demos that can lead to new insights or just a confirmation of the optimal path to a cohesive plan.

A Story About Structured Content and Search

Today was spent sifting through four distinct piles of paper, working through a backlog of email messages, and managing my calendar. My goal was first to get rid of documents and articles that are too old or irrelevant to “deal with.” The remainder I intended to catalog in my database, file in the appropriate electronic folder, or place in a new pile as a “to-do list.” This final pile plus my email “In Box” would then be systematically assigned to a spot in my calendar for the next six weeks. I did have deadlines to meet, but they depended on other people sending me content, which never came. So I kept sifting and organizing. As you can guess, the day that began with lofty intentions of getting to the bottom of the piles so that I could prioritize my real work is ending, instead, with this blog entry. It is not the one I began for this week four days ago.

First, the most ironic moment of the day came from the last pile, in which I turned over an article that must have made an impression in 1997; from Newsweek, it was entitled Drowning in Data. I knew I shouldn’t digress, again, but reading it confirmed what I already knew. We all have been “drowning in data” since at least 1997, and for all the same reasons we were back then: Internet publishing, email, voice mail, and faxes (well, not so much anymore). It has the same effect as it did ten years ago on “info-stressed” professionals; it makes us all want to go slower so we can think about what is being thrown at us. Yes, that is why I was isolated trying to bring order to the info-glut on my desks. The article mentioned that “the average worker in a large corporation sent and received an astounding 177 messages a day…”

That is the perfect segue to my next observation. In the course of the day, while looking for emails needed to meet deadlines, I emptied over 300 messages from my Junk Mailbox, over 400 from my Deleted Mailbox, and that left me with just 76 in my In Mailbox, which I will begin acting on when I finish this blog entry. (Well, maybe after dinner.) What happened today that caused six different search vendors to send invitations to Webinars or analyst briefings? Oh well, when I finally get around to filling out my calendar for the next six weeks I will probably find out that some, if not all, conflict with appointments I already have. So, maybe I should finish the calendar before responding to the emails.

In the opening of this story I mentioned four distinct piles; I lied. As one document was replaced by another, I discovered that there was no unifying theme for any one pile. So much for categorization, but I did find some important papers that required immediate action, which I took.

Finally, I uncovered an article from http://techweb.cmp.com/iw in 1996. The Information Week archives don’t go back that far, but the title was Library on an Intranet. It described a Web-based system for organizing corporate information by Sequent Computer Systems. I know why I saved it: I had developed and was marketing corporate library systems to run over company networks back in 1980. I did find a reference to the Sequent structured system for organizing and navigating corporate content. You will find it at: http://www.infoloom.com/gcaconfs/WEB/seattle96/lmd.HTM#N136. It is a very interesting read.

What a ride we have had trying to corral this info-glut electronically for over 30 years. From citation searching using command languages in the 1970s, to navigation and structured searching in library systems in the 1980s and 90s, to Web-based navigation coupled with full-text searching in the mid-90s: it never ends. And I am still trying to structure my paper piles into a searchable collection of content.

Maybe browsing the piles isn’t such a bad idea after all. I never would have found those articles using “search” search.

Postscript: This really happened. When I finished this blog entry and went to place the “Drowning…” article on a pile I never got to, there on the top was an article from Information Week, April 9, 2007, entitled “Too Much Information.” I really didn’t need to read the lecturing subtitle: Feeling overwhelmed? You need a comprehensive strategy, not a butterfly net, to deal with data overload. I can assure you, I wasn’t waving butterfly nets all day.

The Google Effect on Cross-Language Search

As the Internet continues to redefine ubiquitous, the issue of cross-language search becomes more critical. It’s a pervasive challenge with extreme scalability requirements. Hard to imagine, but the Internet will run out of IPv4 addresses by about 2010, according to the American Registry for Internet Numbers. ARIN’s recommendation to move to IPv6 demonstrates the potential breadth of information overload.

Organizations such as the European-based Cross-Language Evaluation Forum (CLEF) moved beyond discussion years ago and have been conducting in-depth testing of cross-language search ever since. With its “Leaping over Language Barriers” announcement, Google has moved beyond experimentation and toward productization of its cross-language search feature.

  • The Wall Street Journal’s Jessica Vascellaro weighs in here, and includes commentary on rival strategies from Yahoo and Microsoft.
  • Google Blogoscoped weighs in here.
  • Clay Tablet’s Ryan Coleman weighs in here.
  • Global by Design’s John Yunker has a review here.
  • And from Google themselves, here’s the beta UI, the FAQ, and the “unveiling” at the company’s Searchology event held earlier this month.

IMO, any discussion of what the interconnected world “looks like” in the future, whether focused on fill-in-your-label-here 2.0, social networking, customer experience, global elearning, etc., (should) eventually drill down to translation and localization issues. Once we’re at that level of conversation, there are more challenges to discuss: the ongoing evolution of automated translation, the balance between human and machine translation, the conundrum of rich media and image translation, and, as Kaija will always remind us, the quality and context of search results as opposed to merely their quantity.

As a researcher, I’ve used Google’s “translate this” functionality and Yahoo’s Babel Fish (originally AltaVista’s) numerous times to “get the gist” of a non-English article. But my reliance on the results has been more for sanity-checking trends than for factual data gathering. Inconsistencies skew the truth. I just can’t trust it. Can we trust this? Time will tell. Is it a step in the right direction for the masses? No doubt.

Reflecting on BI Shifts and Google’s Big Moves in the Search Market

Sometimes it pays to be behind in reading industry news. The big news last week was Google’s new patents and plans to enhance search results using metadata and taxonomy embedded in content. This was followed by the news that Business Objects plans to acquire Inxight, a Xerox PARC spin-off that has produced a product line with terrific data visualization tools, highly valued in the business intelligence (BI) marketplace.

I had planned to write about the convergence of the enterprise search and BI markets this week until I caught up with industry news from April and early May. This triggered a couple of insights into these more recent announcements.

In April an Information Week article noted that Google has, uncharacteristically, contributed two significant enhancements to MySQL: improved replication procedures across multiple systems and expanded mirroring. The writer, Babcock, also noted that “Google doesn’t use MySQL in search” but YouTube does. I believe Google will become more tied to MySQL as it begins to deploy new search algorithms that take advantage of metadata and taxonomies. These need good text database structures to be managed efficiently and leveraged effectively to produce quality results from search at the scale Google operates. Up to now, Google results presentation has been influenced more by transaction processing than by semantic and textual context. Look for more Google enhancements to MySQL to help it effectively manage all that meaningful text. The open-source question is whether Google will release more enhancements for all to use. A lot of enterprises would benefit from being able to depend on continual enhancements to MySQL so they could (continue to) use it instead of Oracle or MS SQL Server as the database back-end for text searching.
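For illustration only, here is a rough sketch of what pairing MySQL full-text indexing with metadata columns can look like from an application; the table, columns, taxonomy values, and connection details are invented, and this implies nothing about how Google actually uses MySQL internally.

```python
# Illustrative sketch: MySQL full-text search combined with a metadata (taxonomy) column.
# Schema, data, and credentials are hypothetical.
import mysql.connector  # assumes MySQL Connector/Python and a reachable MySQL server

SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id       INT PRIMARY KEY AUTO_INCREMENT,
    title    VARCHAR(255),
    body     TEXT,
    taxonomy VARCHAR(100),            -- e.g. a category from a controlled vocabulary
    FULLTEXT KEY ft_title_body (title, body)
) ENGINE=MyISAM                       -- the full-text-capable engine of that era
"""

QUERY = """
SELECT id, title, taxonomy,
       MATCH(title, body) AGAINST (%s) AS relevance
FROM documents
WHERE MATCH(title, body) AGAINST (%s)
  AND taxonomy = %s                   -- metadata narrows the keyword hits
ORDER BY relevance DESC
LIMIT 10
"""

conn = mysql.connector.connect(user="reader", password="secret", database="content")
cur = conn.cursor()
cur.execute(SCHEMA)
cur.execute(QUERY, ("semantic search", "semantic search", "enterprise-search"))
for doc_id, title, taxonomy, relevance in cur.fetchall():
    print(doc_id, title, taxonomy, relevance)
cur.close()
conn.close()
```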

The other older news (Information Week, May 7th) was that Business Objects was touting “business intelligence for ‘all individuals’” with some new offerings. BO’s announcement just last week that it plans to acquire Inxight only strengthens its position in this market. Inxight has been on the cusp of BI and enterprise search for several years, and this portends more convergence of products in these growing markets. Twenty-five years ago, when I was selling text software applications, a key differentiator was a strong report-building tool set to support “slicing and dicing” database content in any desired format. It sounds like robust, intuitive reporting tools for all enterprise users of content applications are still a dream, but much closer to reality for the high-end market.

With all the offerings and consolidation in BI and search, the next moves will surely begin to push some offerings with search/BI to a price point that small-medium businesses (SMBs) can afford. We know that Microsoft sees the opening (Information Week, May 14th) and let’s hope that others do as well.

Will Steve Arnold Scare IT Into Taking Search in the Enterprise Seriously?

Steve Arnold of ArnoldIT struck twice in a big way last week, once as a contributor to the Bear, Stearns & Co. research report on Google and once as a principal speaker at Enterprise Search in New York. I’ve read a copy of the Bear Stearns report, which contains information that should make IT people pay close attention to how they manage searchable enterprise content. I can verify that this blog summary of Steve’s New York speech by Larry Dignan sounds like vintage Arnold: to the point and right on it. Steve, not for the first time, is making points that analysts and other search experts routinely observe about the lack of serious infrastructure investment in making content valuable by enhancing its searchability.

First is the Bear Stearns report, summarized for the benefit of government IT folks, with admonitions about how to act on the technical guidance it provides, in this article by Joab Jackson in GCN. The report’s appearance in the same week as Microsoft’s acquisition of aQuantive is newsworthy in itself. Google really ups the ante with its plans to change the rules for posting content results for Internet searches. If Webmasters actually begin to do more sophisticated content preparation to leverage what Google is calling its Programmable Search Engine (PSE), then results using Google search will continue to be several steps ahead of what Microsoft is currently rolling out. In other words, while Microsoft is making its most expensive acquisition to tweak Internet searching in one area, Google is investing its capital in its own IP development to make search richer in another. Experience looking at large software companies tells me that IP strategically developed to be totally in sync with existing products has a much better chance of quick success in the marketplace than IP acquired to play catch-up. So, even though Microsoft, in an acquiring mode, may find IP to acquire in the semantic search space (and there is a lot out there that hasn’t been commercialized), its ability to absorb and integrate it in time to head off this Google initiative is a real tough proposition. I’m with Bear Stearns’ guidance on this one.

OK, on to Arnold’s comments at Enterprise Search, in which he continues a theme to jolt IT folks. As already noted, I totally agree that IT in most organizations is loath to call on information search professionals to understand the best ways to exploit search engine adoption for good search results. But I am hoping that the economic side of search, Web content management for an organization’s public-facing content, may cause a shift. Already, I am seeing Web content managers who are enlightened about how to make content more findable through good metadata and taxonomy strategies. They have figured out how to make good stuff rise to the top with guidance from outside IT. When sales people complain that their prospects can’t find the company’s products online, it tends to spur marketing folks to adjust their Web content strategies accordingly.

It may take a while, but my observation is that when employees see search working well on their public sites, they begin to push for equal quality search internally. Now that we have Google paying serious attention to metadata for the purpose of giving search results semantic context, maybe the guys in-house will begin to get it, too.

Mapping Search Requirements

Last week I commented on the richness of the search marketplace. However, that diversity pressures the enterprise buyer to focus on immediate and critical search needs.

The Enterprise Search Summit is being held in New York this week. Two years ago I found it a great place to meet the companies offering search products; I could easily see them all and still attend every session in two days. This year, 2007, there were over 40 exhibitors, most offering solutions for highly differentiated enterprise search problems. Few of the offerings will serve the end-to-end needs of a large enterprise, but many would be sufficient for medium to small organizations. The two major search engine categories used to be Web content keyword searching and structured searching. Not only is my attention as an analyst being requested by major vendors offering solutions for different types of search, but new products are being announced weekly. Newcomers include those describing their products as data mining engines, search and reporting “platforms,” BI engines, and semantic and ontological search engines. This mix challenges me to determine whether a product really solves a type of enterprise search problem before I pay attention.

You, on the other hand, need to do another type of analysis before considering specific options. Classifying search categories with a faceted approach will help you narrow the field. Here is a checklist for categorizing what content needs to be found and how:

  • Content types (e.g. HTML pages, PDFs, images)
  • Content repositories (e.g. database applications, content management systems, collaboration applications, file locations)
  • Types of search interfaces and navigation (e.g. simple search box, metadata, taxonomy)
  • Types of search (e.g. keyword, phrase, date, topical navigation)
  • Types of results presentation (e.g. aggregated, federated, normalized, citation)
  • Platforms (e.g. hosted, intranet, desktop)
  • Type of vendor (e.g. search-only, single-purpose application with embedded search, software as a service [SaaS])
  • Amount of content by type
  • Number and type of users by need (personas)

Then use any tools or resources at hand to map the results: who needs what type of content, in what format, and how critical it is to business requirements. Prioritizing the facets produces a multidimensional view of enterprise search requirements (one simple way to record and score such a mapping is sketched below). This will go a long way toward narrowing down the vendor list and gives you a tool to keep discussions focused.
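As a purely hypothetical illustration of that mapping step, the sketch below records a few facets per user group and scores them so that critical, widely shared needs rise to the top; every group, facet value, and weight here is invented.

```python
# Hypothetical sketch: record search requirements as facets per user group,
# then score them to prioritize. All values and weights are invented examples.
requirements = {
    "sales": {
        "content_types": ["HTML pages", "PDFs"],
        "repositories": ["CMS", "file shares"],
        "search_types": ["keyword", "topical navigation"],
        "criticality": 3,   # 1 = nice to have, 3 = business critical
        "user_count": 120,
    },
    "research": {
        "content_types": ["PDFs", "images"],
        "repositories": ["document management", "databases"],
        "search_types": ["metadata", "phrase", "date"],
        "criticality": 2,
        "user_count": 40,
    },
}

def priority(profile):
    """Simple weighting: critical needs shared by many users float to the top."""
    return profile["criticality"] * profile["user_count"]

for group, profile in sorted(requirements.items(), key=lambda kv: priority(kv[1]), reverse=True):
    print(f"{group:10s} score={priority(profile):4d} "
          f"needs {', '.join(profile['search_types'])} "
          f"over {', '.join(profile['repositories'])}")
```

Even a toy matrix like this keeps the conversation with vendors anchored to who needs what, rather than to feature lists.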

There are terrific options in the marketplace and they will only become richer in features and complexity. Your job is to find the most appropriate solution for the business search problem you need to solve today, at a cost that matches your budget. You also want a product that can be implemented rapidly with immediate benefit linking to a real business proposition.
