Curated for content, computing, and digital experience professionals

Category: Semantic technologies

Our coverage of semantic technologies goes back to the early 90s when search engines focused on searching structured data in databases were looking to provide support for searching unstructured or semi-structured data. This early Gilbane Report, Document Query Languages – Why is it so Hard to Ask a Simple Question?, analyses the challenge back then.

Semantic technology is a broad topic that includes all natural language processing, as well as the semantic web, linked data processing, and knowledge graphs.


XML in Everyday Things

If you didn’t follow the link below to Bob DuCharme’s response to my January 13 posting on Why it is Difficult to Include Semantics in Web Content, you should read it. Bob does a great job describing tools in use to include semantics in Web content. Bob is a very smart guy. I like to think the complexity of his answer is a good illustration of my point that adding semantics is not easy. Anyway, his response is clearly worth reading and can be found at http://www.snee.com/bobdc.blog/2009/01/publishers-and-semantic-web-te.html.

Also, I have known Bob for some time. I am reminded that a while back he wrote an interesting article about XML data produced by his TiVo device (see http://www.xml.com/pub/a/2006/02/15/hacking-the-xml-in-your-tivo.html). I was intrigued how XML had begun to pop up in everyday things.

Ever since that TiVo article, I think of Bob every time XML pops up in unexpected everyday places (it’s better than associating him with a trauma). Once in a while I get a glimpse of XML data in a printer control file, in Web page source code, or as an export format for some software, but that sort of thing is to be expected. We all have seen examples at work or in commercial settings, but to find XML data at home in everyday devices and applications has always warmed my biased heart.

Recently I was playing a game of Sid Meier’s Civilization IV (all work and no play and so on….) and I noticed while it was booting up a game that one of the messages said “Reading XML FIles”. My first thought was “Bob would like to see this!” Then I was curious to see how XML was being used in game software. A quick Google search and the first entry, from Wikipedia (http://en.wikipedia.org/wiki/Civilization_IV#cite_note-10), says “More game attributes are stored in XML files, which must be edited with an external text editor or application.” Apparently you can “tweak simple game rules and change or add content. For instance, they can add new unit or building types, change the cost of wonders, or add new civilizations. Players can also change the sounds played at certain times or edit the play list for your soundtrack.”

I poked around in the directories and found schemas describing game units, events, etc. and configuration data instances describing artifacts and activities used in the game. A user could, if they wanted to, make buying a specific building very cheap for instance, or have the game play their favorite music instead of what comes with the game. That is if they know how to edit XML data. I think I just found a way to add many hours of enjoyment to an already great game.
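As a rough illustration of that kind of tweak, here is a minimal Python sketch that lowers the cost of one building in a small XML document. The element names (`BuildingInfo`, `Type`, `iCost`) and the sample data are assumptions made up for illustration; the game's actual files may be organized differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; real game files may use other names or namespaces.
SAMPLE = """
<BuildingInfos>
  <BuildingInfo><Type>BUILDING_LIBRARY</Type><iCost>90</iCost></BuildingInfo>
  <BuildingInfo><Type>BUILDING_MARKET</Type><iCost>150</iCost></BuildingInfo>
</BuildingInfos>
""".strip()

def set_building_cost(xml_text, building_type, new_cost):
    """Return the XML text with the cost of the named building changed."""
    root = ET.fromstring(xml_text)
    for info in root.iter("BuildingInfo"):
        if info.findtext("Type") == building_type:
            info.find("iCost").text = str(new_cost)
    return ET.tostring(root, encoding="unicode")

# Make the (hypothetical) library very cheap; the market is left untouched.
cheap = set_building_cost(SAMPLE, "BUILDING_LIBRARY", 10)
```

Anyone trying this on real files would want to keep a backup copy first, since a malformed edit could stop the file from loading.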

I wonder how much everyday XML is out there just waiting for someone to tweak it and optimize it to make something work better. A thermostat, a refrigerator, or a television perhaps.

Churning in the Search Sector – Two BIG Events in One Week

Analysts have been projecting major consolidation in the enterprise search marketplace for a couple of years. What is interesting to me is how slowly this is evolving. For every merger or acquisition, whether small or large (acquisition of Mondosoft by SurfRay or FAST by Microsoft), other companies emerge or evolve with diverse and potentially competitive technologies (e.g. Attivio, Connotate, Expert System, EyeAlike, Truevert, Temis).

We have seen companies like Exalead, ISYS, and Vivisimo gain on former leaders. Microsoft is often listed as an industry leader because it acquired former leader FAST while companies with solid products for verticals, like Recommind in law and financial services, are often overlooked because they lack the total company revenues of a Microsoft that sells a lot more software than enterprise search.

This past week two industry news items caused me to reflect on the potential impact of announcements that, while not surprising, can upset the plans of buyers of search technology. The first was the announcement that Autonomy is planning to acquire Interwoven. That Interwoven is being acquired is no surprise, since the company was being groomed for acquisition. However, this appears to be the first instance of a “search” company acquiring a “content management/document management” company. The norm has been that search companies get bought to fill a need by ECM or CMS vendors. For anyone planning to procure Interwoven because of its embedded Vivisimo Velocity for Universal search in its Worksite product, this does put a wrinkle in the fabric. That is a shame, because it will be a while before the actual impact is known, and the uncertainty could slow sales. The cost to buyers having to accept Autonomy’s IDOL instead of Velocity could be significant. The effect could be on both licensing and deployment because Velocity has been an efficient install for most enterprises. Autonomy has a big ramp-up ahead to shift from being a search company to becoming an ECM supplier, and some will take a wait-and-see attitude, regardless of IDOL’s reputation.

The second big announcement, of course, is the departure from Microsoft of John Marcus Lervik, a co-founder of FAST and recently named Executive VP in a newly created position for Enterprise Search at Microsoft. I’m sure you’ll be seeing plenty about the reasons elsewhere. However, the difficulty for those buyers who are depending on FAST’s search technology to be integrated sooner rather than later in Microsoft’s offerings has just been made more complicated as one of the original leaders of FAST is leaving the team.

Two years ago I commented to FAST executives about the need for vendors on a rapid growth path to make the buying, business and support experience for customers a priority, beyond technology enhancements; so, I take little consolation in seeing this turmoil. If you are a buyer, take a good hard look behind the technology to see what else you will be dealing with as you make plans to acquire software.

Taxonomy and Glossaries for Enterprise Search Terminology

Two years ago when I began blogging for the Gilbane Group on enterprise search, the extent of my vision was reflected in the blog categories I defined and expected to populate with content over time. They represented my personal “top terms” that were expected to each have meaningful entries to educate and illuminate what readers might want to know about search behind the firewall of enterprises.

A recent examination of those early decisions showed me where there are gaps in content, perhaps reflecting that some of those topics were:

  • Not so important
  • Not currently in my thinking about the industry
  • OR Not well defined

I also know that on several occasions I couldn’t find a good category in my list for a blog I had just written. Being a former indexer and heavy user of controlled vocabularies, on most occasions I resisted the urge to create a new category and found instead the “best fit” for my entry. I know that when the corpus of content or domain is small, too many categories are useless for the reader. But now, as I approach 100 entries, it is time to reconsider where I want to go with blogging about enterprise search.

In the short term, I am going to try to provide entries for scantily covered topics because I still think they are all relevant. I’ll probably add a few more along the way or perhaps make some topics a little more granular.

Taxonomies are never static, and require periodic review, even when the amount of content is small. Taxonomists need to keep pace with current use of terminology and target audience interests. New jargon creeps in, although I prefer to use generic terms that are broadly understood in the technology and business world.

That gives you an idea of some of my own taxonomy process. To add to the entries on terminology (definitions) and taxonomies, I am posting a glossary I wrote for last year’s report on the enterprise search market and recently updated for the Gilbane Workshop on taxonomies. While the definitions were all crafted by me, they are validated through the heavy use of the Google “define” feature. If you aren’t already a user, you will find it highly useful when trying to pin down a definition. At the Google search box, simply type define: xxx xxx (where xxx represents a word or phrase for which you seek a definition). Google returns all the public definition entries it finds on the Internet. My definitions are then refined based on what I learn from a variety of sources I discover using this technique. It’s a great way to build your knowledge-base and discover new meanings.

Glossary Taxonomy and Search-012009

Open Source Search & Search Appliances Need Expert Attention

Search in the enterprise suffers from lack of expert attention to tuning, care and feeding, governance and fundamental understanding of what functionality comes with any one of the 100+ products now on the market. This is just as true for search appliances, and open source search tools (Lucene) and applications (Solr). But while companies licensing out-of-the-box search solutions or heavily customized search engines have service, support and upgrades built into their deliverables, the same level of support cannot be assumed for getting started with open source search or even appliances.

Search appliances are sold with licenses that imply some high level of performance without a lot of support, while open source search tools are downloadable for free. As speakers about both open source and appliances made perfectly clear at our recent Gilbane Conference, both come with requirements for human support. When any enterprise search product or tool is selected and procured, there is a presumed business case for acquisition. What acquirers need to understand above all else is the cost of ownership to achieve the expected value. This means people and people with expertise on an ongoing basis.

Particularly when budgets are tight and organizations lay off workers, we discover that those with specialized skills and expertise are often the first to go. The jack-of-all-trades, or those with competencies in maintaining ubiquitous applications are retained to be “plugged in” wherever needed. So, where does this leave you for support of the search appliance that was presumed to be 100% self-maintaining, or the open source code that still needs bug fixes, API development and interface design-work?

This is the time to look to system integrators and service companies with specialists in tools you use. They are immersed in the working innards of these products and will give you better support through service contracts, subscriptions or labor-based hourly or project charges than you would have received from your in-house generalists, anyway.

You may not see specialized system houses or service companies listed by financial publications as a growth business, but I am going to put my confidence in the industry to spawn a whole new category of search service organizations in the short term. Just-in-time development for you and lower overhead for your enterprise will be a growing swell in 2009. This is how outsourcing can really bring benefits to your organization.

Post-post note – Here is a related review on the state-of-open source in the enterprise: The Open Source Enterprise; its time has come, by Charles Babcock in Information Week, Nov. 17, 2008. Be sure to read the comments, too.

Why Adding Semantics to Web Data is Difficult

If you are grappling with Web 2.0 applications as part of your corporate strategy, keep in mind that Web 3.0 may be just around the corner. Some folks say a key feature of Web 3.0 is the emergence of the Semantic Web where information on Web pages includes markup that tells you what the data is, not just how to format it using HTML (HyperText Markup Language). What is the Semantic Web? According to Wikipedia:

“Humans are capable of using the Web to carry out tasks such as finding the Finnish word for “monkey”, reserving a library book, and searching for a low price on a DVD. However, a computer cannot accomplish the same tasks without human direction because web pages are designed to be read by people, not machines. The semantic web is a vision of information that is understandable by computers, so that they can perform more of the tedious work involved in finding, sharing and combining information on the web.” (http://en.wikipedia.org/wiki/Semantic_Web).

To make this work, the W3C (World Wide Web Consortium) has developed standards such as RDF (Resource Description Framework, a framework for describing properties of data objects) and SPARQL (SPARQL Protocol and RDF Query Language, http://www.w3.org/TR/rdf-sparql-query/) that extend the semantics that can be applied to Web-delivered content.
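To give a feel for the underlying idea: RDF describes things as subject-predicate-object statements, and SPARQL queries match patterns against those statements. The sketch below imitates that in plain Python. It is neither RDF syntax nor SPARQL, and the `book:` identifiers are invented for illustration.

```python
# Toy statements as (subject, predicate, object); the book identifiers are made up.
TRIPLES = [
    ("book:123", "dc:title", "Moby-Dick"),
    ("book:123", "dc:creator", "Herman Melville"),
    ("book:456", "dc:title", "Walden"),
]

def match(pattern, triples):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# "Find every title", loosely in the spirit of a SPARQL SELECT.
titles = [obj for _, _, obj in match((None, "dc:title", None), TRIPLES)]
```

The point of the pattern is that a program can ask for "every title" without knowing how each page happens to be laid out.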

We have been doing semantic data since the beginning of SGML, and later with XML, just not always exposing these semantics to the Web. So, if we know how to apply semantic markup to content, how come we don’t see a lot of semantic markup on the Web today? I think what is needed is a method for expressing and understanding the intended semantics beyond what the capabilities of current standards allow.

A W3C XML schema is a set of rules that describe the relationships between content elements. It can be written in a way that is very generic or format oriented (e.g., HTML) or very structure oriented (e.g., Docbook, DITA). Maybe we should explore how to go even further and make our markup languages very semantically oriented by defining elements, for instance, like <weight> and <postal_code>.

Consider, though, that the schema in use can tell us the names of semantically defined elements, but not necessarily their meaning. I can tell you something about a piece of data by using the <income> tag, but how, in a schema, can I tell you it is a net <income> calculated using the guidelines of the US Internal Revenue Service, and therefore suitable for eFiling my tax return? For that matter, one system might use the element type name <net_income> while another might use <inc>. Obviously an industry standard like XBRL (eXtensible Business Reporting Language) can help standardize vocabularies for element type names, but this cannot be the whole solution or XBRL use would be more widespread. (Note: no criticism of XBRL is intended, just using it as an example of how difficult the problem is).

Also, consider the tools in use to consume Web content. Browsers only in recent years added XML processing support in the form of the ability to read DTDs and transform content using XSLT. Even so, this merely allows you to read, validate and format non-HTML tag markup, not truly understand the content’s meaning. And if everyone uses their own schemas to define the data they publish on the Web, we could end up with a veritable “Tower of Babel” with many similar, but not fully interoperable data models.

The Semantic Web may someday provide seamless integration and interpretation of heterogeneous data. Tools such as RDF/SPARQL, as well as microformats (embedding small, specialized, predefined element fragments in a standard format such as HTML), metadata, syndication tools and formats, industry vocabularies, powerful processing tools like XQuery, and other specifications can improve our ability to treat heterogeneous markup as if it were more homogeneous. But even these approaches are addressing only part of the bigger problem. How will we know that elements labeled with <net_income> and <inc> are the same and should be handled as such? How do we express these semantic definitions in a processable form? How do we know they are identical or at least close enough to be treated as essentially the same thing?
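One common workaround, sketched here in Python, is a hand-maintained lookup from each system's element names to a shared concept. The mapping table, the concept name `NetIncome`, and the sample documents are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from local element names to one shared concept.
CONCEPTS = {"net_income": "NetIncome", "inc": "NetIncome"}

def extract(xml_text):
    """Collect values keyed by shared concept; unmapped elements are skipped."""
    found = {}
    for element in ET.fromstring(xml_text).iter():
        concept = CONCEPTS.get(element.tag)
        if concept:
            found[concept] = float(element.text)
    return found

# Two systems, two element names, one intended meaning.
a = extract("<report><net_income>52000</net_income></report>")
b = extract("<filing><inc>52000</inc></filing>")
```

The table records that the two names are equivalent, but not why. Someone still has to decide that <inc> really means net income under the same rules, and that decision is the part no schema captures.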

This, defining semantics effectively and broadly, is a conundrum faced by many industry standard schema developers and system integrators working with XML content. I think the Semantic Web will require more than schemas and XML-aware search tools to reach its full potential in intelligent data and the applications that process them. What is probably needed is a concerted effort to build semantic data and tools that can process it, including browsing, data storage, search, and classification tools. There is some interesting work being done in the Technical Architecture Group (TAG) at the W3C to address these issues as part of Tim Berners-Lee’s vision of the Semantic Web (see for a recent paper on the subject).

Meanwhile, we have Web 2.0 social networking tools to keep us busy and amused while we wait.

What Does an Analyst Do for You?

Among the roles that I have chosen for myself as Lead Analyst for Enterprise Search at the Gilbane Group is to evaluate, in broad strokes, the search marketplace for internal use at enterprises of all types. My principal audience is those within enterprises that may be involved in the selection, procurement, implementation and deployment of search technology to benefit their organizations. In this role, I am an advocate for buyers. However, when vendors pay attention to what I write it should help them understand the buyer’s perspective. Ultimately, good vendors incorporate analyst guidance into their thinking about how to serve their customer better.

We do not hide the fact that, as industry analysts, we also consult to various content software companies. When doing so, I try to keep in mind that the market will be served best when I honestly advocate for software and service improvements that will benefit buyers. This is a value to those who sell and those who buy software. My consulting to vendors indirectly benefits both audiences.

Analysts also consult to buyers, to help them make informed decisions about technology and business relationships. I particularly enjoy and value those experiences because what I learn about enterprise buyers’ needs and expectations can translate directly into advice to vendors. This is an honest brokering role that comes naturally because I have been a software vendor and have also been in a position to make many software procurement decisions, particularly for tools and applications that were used by my development and service teams. I’m always enthusiastic to be in a position to share important information about products with buyers and information about buying audiences with those who build products. This can be done effectively while preserving confidentiality on both sides and making sure that everyone gets something out of the communications.

As an analyst, I receive a lot of requests from vendors to listen to briefings on their products, by phone and Web, or to meet one-on-one with their executives. You may have noticed that I don’t write reviews of specific products although, in a particular context, I may reference products and applications. While we understand the reason that product vendors want analysts to pay attention to them, I don’t find briefings particularly enlightening unless I know nothing about a company and its offerings. For these types of overviews, I can usually find what I want to know on their Web site, in press releases and by poking around the Web. During briefings I want to drive the conversation toward user experiences and needs.

What I do like to do is talk to product users about their experiences with a vendor or a product. I like to know what the implementation and adoption experience is like and how their organization has been affected by product use, both benefits and drawbacks. It is not always easy to gain access to customers, but I have ways of finding them and also encourage readers of this blog to reach out with their stories. I am delighted to learn more through comments to the blog, an email or a phone call. If you are willing to chat with me for a while, I will call you at your convenience.

The original topic I planned to write about this week will have to wait because, after receiving over 20 invitations to “be briefed” in the past few days, I decided it was more important to let readers know who I want to be briefed by – search technology users are my number one target. Vendors please push your customers in this direction if you want me to pay attention. This can bring you a lot of value, too. It is a matter of trust.

Enterprise Search 2008 Wrap-Up

It would be presumptuous to think that I could adequately summarize a very active year of evolution among a huge inventory of search technologies. This entry is more about what I have learned and what I opine about the state-of-the-market, than an analytical study and forecast.

The weak link in the search market is product selection methods. My first thought is that we are in a state of technological riches without clear guideposts for which search models work best in any given enterprise. Those tasked to select and purchase products are not well-educated about the marketplace but are usually not given budget or latitude to purchase expert analysis when it is available. It is a sad commentary to view how organizations grant travel budgets to attend conferences where only limited information can be gathered about products but will not spend a few hundred dollars on in-depth comparative expert analyses of a large array of products.

My sources for this observation are numerous, confirmed by speakers in our Gilbane conference search track sessions in Boston and San Francisco. As they related their personal case histories for selecting products, speakers shared no tales of actually doing literature searches or in-depth research using resources with a cost associated. This underscores another observation: those procuring search do not know how to search and operate in the belief that they can find “good enough” information using only “free stuff.” Even their review of material gathered is limited to skimming rather than a systematic reading for concrete facts. This does not make for well-reasoned selections. As noted in an earlier entry, a widely published chart stating that product X is a leader does nothing to enlighten your enterprise’s search for search. In one case, product leadership is determined primarily by the total software sales for the “leader,” of which search is a minuscule portion.

Don’t expect satisfaction with search products to rise until buyers develop smarter methods for selection and better criteria for making a buy decision that suits a particular business need.

Random Thoughts. It will be a very long time before we see a universally useful, generic search function embedded in Microsoft (MS) product suites as a result of the FAST acquisition. Asked earlier in the year by a major news organization whether I thought MS had paid too much for FAST, I responded “no” if what they wanted was market recognition but “yes” if they thought they were getting state-of-the-art technology. My position holds; the financial and legal mess in Norway only complicates the road to meshing search technology from FAST with Microsoft customer needs.

I’ve wondered what has happened to the OmniFind suite of search offerings from IBM. One source tells me it makes IBM money because none of the various search products in the line-up are standalone, nor do they provide an easy transition path from one level of product to another for upward scaling and enhancements. IBM can embed any search product with any bundled platform of other options and charge for lots of services to bring it on-line with heavy customization.

Three platform vendors seem to be penetrating the market slowly but steadily by offering more cohesive solutions to retrieval. Native search solutions are bundled with complete content capture, publishing and search suites, purposed for various vertical and horizontal applications. These are Oracle, EMC, and OpenText. None of these are out-of-the-box offerings and their approach tends to appeal to larger organizations with staff for administration. At least they recognize the scope and scale of enterprise content and search demands, and customer needs.

On User Presentations at the Boston Gilbane Conference, I was very pleased with all sessions, the work and thought the speakers put into their talks. There were some noteworthy comments in those on Semantic Search and Text Technologies, Open Source and Search Appliances.

On the topic of semantic (contextual query and retrieval) search, text mining and analytics, the speakers covered the range of complexities in text retrieval, leaving the audience with a better understanding of how diverse this domain has become. Different software application solutions need to be employed based on point business problems to be solved. This will not change, and enterprises will need to discriminate about which aspects of their businesses need some form of semantically enabled retrieval and then match expectations to offerings. Large organizations will procure a number of solutions, all worthy and useful. Jeff Catlin of Lexalytics gave a clear set of definitions within this discipline, industry analyst Curt Monash provoked us with where to set expectations for various applications, and Win Carus of Information Extraction Systems illustrated the tasks extraction tools can perform to find meaning in a heap of content. The story has yet to be written on how semantic search is affecting, and will affect, our use of information within organizations.

Leslie Owens of Forrester and Sid Probstein of Attivio helped to ground the discussion of when and why open source software is appropriate. The major takeaway for me was an understanding of the type of organization that benefits most as a contributor and user of open source software. Simply put, you need to be heavily vested and engaged on the technical side to get out of open source what you need, to mold it to your purpose. If you do not have the developers to tackle coding, or the desire to share in a community of development, your enterprise’s expectations will not be met and disappointment is sure to follow.

Finally, several lively discussions about search appliance adoption and application (Google Search Appliance and Thunderstone) strengthen my case for doing homework and making expenditures on careful evaluations before jumping into procurement. While all the speakers seem to be making positive headway with their selected solutions, the path to success has involved more diversions and changes of course than necessary for some because the vetting and selecting process was too “quick and dirty” or dependent on too few information sources. This was revealed: true plug and play is an appliance myth.

What will 2009 bring? I’m looking forward to seeing more applications of products that interest me from companies that have impressed me with thoughtful and realistic approaches to their customers and target audiences. Here is an uncommon clustering of search products.

Multi-repository search across database applications, content collaboration stores, document management systems, and file shares: Coveo, Autonomy, Dieselpoint, dtSearch, Endeca, Exalead, Funnelback, Intellisearch, ISYS, Oracle, Polyspot, Recommind, Thunderstone, Vivisimo, and X1. In this list is something for every type of enterprise and budget.

Business and analytics focused software with intelligence gathering search: Attensity, Attivio, Basis Technology, ChartSearch, Lexalytics, SAS, and Temis.

Comprehensive solutions for capture, storage, metadata management and search for high quality management of content for targeted audiences: Access Innovations, Cuadra Associates, Inmagic, InQuira, Knova, Nstein, OpenText, ZyLAB.

Search engines with advanced semantic processing or natural language processing for high quality, contextually relevant retrieval when quantity of content makes human metadata indexing prohibitive: Cognition Technologies, Connotate, Expert System, Linguamatics, Semantra, and Sinequa.

Content classifier, thesaurus management, and metadata server products interplay with other search engines, and a few have impressed me with their vision and thoughtful approach to the technologies: MarkLogic, MultiTes, Nstein, Schemalogic, Seaglex, and Siderean.

Search with a principal focus on SharePoint repositories: BA-Insight, Interse, Kroll Ontrack, and SurfRay.

Finally, some unique search applications are making serious inroads. These include Documill for visual and image, Eyealike for image and people, Krugle for source code, and Paglo for IT infrastructure search.

This is the list of companies that interest me because I think they are on track to provide good value and technology, many still small but with promise. As always, the proof will be in how they grow and how well they treat their customers.

That’s it for a wrap on Year 2 of the Enterprise Search Practice at the Gilbane Group. Check out our search studies at https://gilbane.com/Research-Reports.html and PLEASE let me hear your thoughts on my thoughts or any other search related topic via the contact information at https://gilbane.com/


© 2025 The Gilbane Advisor
