
Category: Semantic technologies

Our coverage of semantic technologies goes back to the early 1990s, when search engines focused on searching structured data in databases were looking to provide support for searching unstructured or semi-structured data. This early Gilbane Report, Document Query Languages – Why is it so Hard to Ask a Simple Question?, analyzed the challenge as it stood then.

Semantic technology is a broad topic that includes all natural language processing, as well as the semantic web, linked data processing, and knowledge graphs.


Case Studies and Guidance for Search Implementations

We’ll be covering a chunk of the search landscape at the Gilbane Conference next week. While there are nominally over 100 search solutions that target some aspect of enterprise search, there will be plenty to learn from the dozen or so case studies and tool options described. Commentary and examples include: Attivio, Coveo, Exalead, Google Search Appliance (GSA), IntelliSearch, Lexalytics, Lucene, Oracle Secure Enterprise Search, Thunderstone, and references to others. Our speakers will cue us in to the current state of search as it is being implemented. Several exhibitors are also on site to demonstrate their capabilities, and they represent some of the best. Check out the program lineup below and try to make it to Boston to hear those with hands-on experience.

EST-1: Plug-and-Play: Enterprise Experiences with Search Appliances

  • So you want to implement an enterprise search solution? Speakers: Angela A. Foster, FedEx Services, FedEx.com Development, and Dennis Shirokov, Marketing Manager, FedEx Digital Access Marketing.
  • The Make or Buy Decision at the U.S. General Services Admin. Speaker: Thomas Schaefer, Systems Analyst and Consultant, U.S. General Services Administration.
  • Process and Architecture for Implementing GSA at MITRE. Speaker: Robert Joachim, Information Systems Engineer, Lead, The MITRE Corporation.

EST-2: Search in the Enterprise When SharePoint is in the Mix

  • Enterprise Report Management: Bringing High Value Content into the Flow of Business Action. Speaker: Ajay Kapur, VP of Product Development, Apps Associates
  • Content Supply? Meet Knowledge Demand: Coveo SharePoint integration. Speaker: Marc Solomon, Knowledge Planner, PRTM.
  • In Search of the Perfect Search: Google Search on the Intranet. Speaker: June Nugent, Director of Corporate Knowledge Resources, NetScout Systems.

EST-3: Open Source Search Applied in the Enterprise

  • Context for Open Source Implementations. Speaker: Leslie Owen, Analyst, Forrester Research
  • Intelligent Integration: Combining Search and BI Capabilities for Unified Information Access. Speaker: Sid Probstein, CTO, Attivio

EST-4: Search Systems: Care and Feeding for Optimal Results

  • Getting Off to a Strong Start with Your Search Taxonomy. Speaker: Heather Hedden, Principal, Hedden Information Management
  • Getting the Puzzle Pieces to Fit: Finding the Right Search Solution(s). Speaker: Patricia Eagan, Sr. Mgr., Web Communications, The Jackson Laboratory.
  • How Organizations Need to Think About Search. Speaker: Rob Wiesenberg, President & Founder, Contegra Systems

EST-5: Text Analytics/Semantic Search: Parsing the Language

  • Overview and Differentiators: Text Analytics, Text Mining and Semantic Technologies. Speaker: Jeff Catlin, CEO, Lexalytics
  • Reality and Hype in the Text Retrieval Market. Speaker: Curt Monash, President, Monash Research.
  • Two Linguistic Approaches to Search: Natural Language Processing and Concept Extraction. Speaker: Win Carus, President and Founder, Information Extraction Systems


Enterprise Search is Everywhere

When you look for an e-mail you sent last week, a vendor account rep’s phone number, a PowerPoint presentation you received from a colleague in the Paris office, a URL to an article recommended for reading before the next Board meeting, or background on a company project you have been asked to manage, you are engaged in search in, about, or for your enterprise. Whether you are working inside applications you have used for years, or simply perusing the links on a decades-old corporate intranet, when you try to find something in the course of the enterprise’s work, you are engaging with a search interface.

Dissatisfaction comes from the number of these interfaces and the lack of a cohesive roadmap to all there is to be found. You already know what you know and what you need to know. Sometimes you know how to find what you need to know, but more often you don’t, and you stumble through a variety of possibilities, up to and including asking someone else how to find it. That missing roadmap is more than an annoyance; it is a major encumbrance to doing your job, and top management does not get it. They simply won’t accept that one or two content roadmap experts (overhead) could be saving many person-years of company time and lost productivity.

In most cases, the simple notion of creating clear guidelines and signposts to enterprise content is a funding showstopper. It takes human intelligence to design and build that roadmap and put the technology aids in place to reveal it. Management will fund technology but not the content architects, knowledge “mappers” and ongoing gatekeepers to stay on top of organizational change, expansions, contractions, mergers, rule changes and program activities that evolve and shift perpetually. They don’t want infrastructure overhead whose primary focus, day-in and day-out, will be observing, monitoring, communicating, and thinking about how to serve up the information that other workers need to do their jobs. These people need to be in place as the “black-boxes” that keep search tools in tip-top operating form.

Last week I commented on the products that will be featured in the Search Track at Gilbane Boston, Dec. 3rd and 4th.

What you will learn about these tools is going to be couched in case studies that reveal the ways in which search technology is leveraged by people who think a lot about what needs to be found and how search needs to work in their enterprises. They will talk about what tools they use, why, and what they are doing to get search to do its job. I’ve asked the speakers to tell their stories, and based on my conversations with them in the past week, that is what we will hear: the reality!

When We Are Missing Good Metadata in Enterprise Search

This blog has not focused on non-profit institutions (e.g. museums, historical societies) as enterprises, but they are repositories of an extraordinary wealth of information. For the past few weeks I’ve been trying, with mixed results, to get a feel for the accessibility of this content through the public Web sites of these organizations. My queries leave me with a keen sense of why search on company intranets also fails.

Most sizable non-profits want their collections of content and other information assets exposed to the public. But each department manages its own content collections with software that is unique to its specific professional methods and practices. In the corporate world the mix will include human resources (HR), enterprise resource planning (ERP) systems, customer relationship management (CRM), R & D document management systems, and collaboration tools. Many corporations have or “had” library systems that reflected a mix of internally published reports and scholarly collections that support R & D and special areas such as competitive intelligence. Corporations struggle constantly with federating all this content in a single search system.
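Federation, at its simplest, means fanning a query out to each repository’s own search facility and merging the ranked results; the hard part is that every engine scores relevance on its own scale. Here is a minimal sketch in Python, with hypothetical per-repository search functions standing in for real HR and CRM connectors:

```python
# Minimal federated-search sketch. The per-repository search functions are
# hypothetical stand-ins for real connectors (HR, CRM, and so on); each
# returns (document_id, relevance_score) pairs on its own scale.

def search_hr(query):
    return [("hr:policy-42", 0.91), ("hr:form-7", 0.40)]

def search_crm(query):
    return [("crm:account-118", 3.2), ("crm:case-7", 1.1)]

def federated_search(query, sources):
    merged = []
    for name, search in sources.items():
        results = search(query)
        if not results:
            continue
        # Normalize each source's scores to [0, 1] so they can be compared;
        # real engines score on incompatible scales, which is the hard part.
        top = max(score for _, score in results)
        merged.extend((doc_id, score / top, name) for doc_id, score in results)
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

if __name__ == "__main__":
    sources = {"hr": search_hr, "crm": search_crm}
    for doc_id, score, source in federated_search("vacation policy", sources):
        print(f"{score:.2f}  {source}  {doc_id}")
```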

Non-profit organizations have similar disparate systems constructed for their special domains, whether museums or research institutions. One area that is similar between the corporate and non-profit sectors is libraries, operating with software whose interfaces hearken back to designs of the late 1980s or 1990s. Another by-product of that era was the catalog record, in a format devised by the Library of Congress (MARC) for the electronic exchange of records between library systems. It was never intended to be the format for retrieval. It is similar to the metadata in content management systems but is an order of magnitude more complex and arcane to the typical person doing searching. Only librarians and scholars really understand the most effective ways to search most library systems; therein lies the “public access” problem. In a corporation, a librarian often does the searching.

However, a visitor to a museum Web site would expect to quickly find a topic for which the museum has exhibit materials, printed literature, and other media, all together. This calls for nomenclature that is “public friendly” and reflects the basic “aboutness” of all the materials in museum departments and collections. It is a problem when each library and curatorial department uses a different method of categorizing. Libraries typically use Library of Congress Subject Headings. What makes this problematic is that the topics are so numerous: the universe of possible subject headings is designed for the entire population of Library of Congress holdings, not for a special collection of a few tens of thousands of items. And almost no library system searches for words “contained in” the subject headings when you browse the Subject index. If I am searching Subjects for all power generation materials and a heading such as electric power generation is used, it will not be found, because the look-up mechanism only matches headings that “begin with” power generation.
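To make that “begins with” limitation concrete, here is a small sketch contrasting a prefix-only browse with the contains-anywhere match most users expect (the headings are invented examples in the style of LC subject headings):

```python
# Why "begins with" browsing misses relevant subject headings.
# The headings below are invented examples in the style of LC subject headings.
headings = [
    "Power generation",
    "Electric power generation",
    "Wind power generation -- United States",
]

def browse(headings, term):
    """Typical library-catalog browse: matches headings that begin with the term."""
    return [h for h in headings if h.lower().startswith(term.lower())]

def keyword_search(headings, term):
    """What most users expect: match the phrase anywhere in the heading."""
    return [h for h in headings if term.lower() in h.lower()]

print(browse(headings, "power generation"))
# ['Power generation'] -- "Electric power generation" is invisible here

print(keyword_search(headings, "power generation"))
# all three headings are found
```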

Let’s cut to the chase; mountains of metadata in the form of library cataloging are locked inside library systems within non-profit institutions. It is not being searched at the search box when you go to a museum Web site because it is not accessible to most “enterprise” or “web site” search engines. Therefore, a separate search must be done in the library system using a more complex approach to be truly thorough.

We have a big problem if we are to somehow elevate library collections to the same level of importance as the rest of a museum’s collections and integrate the two. Bigger still is the challenge of getting everything indexed with a normalized vocabulary for the comfort of all audiences. This is something that takes thought and coordination among professionals of diverse competencies. It will not be solved easily but it must be done for institutions to thrive and satisfy all their constituents. Here we have yet another example of where enterprise search will fail to satisfy, not because the search engine is broken but because the underlying data is inappropriately packaged for indexes to work as expected. Yet again, we come to the realization that we need people to recognize and fix the problem.

What Determines a Leader in the Enterprise Search Market?

Let’s agree that most if not all “enterprise search” is really about point solutions within large corporations. As I have written elsewhere, the “enterprise” is almost always a federation of constituencies, each with its own solutions for content applications, and that includes search. If there is any place where we find a truly enterprise-wide application of search, it is in small and medium-sized businesses (SMBs). This would include professional service firms (consultancies and law firms), NGOs, many non-profits, and young R&D companies. There are plenty of niche solutions for SMBs, and they are growing.

I bring this up because the latest Gartner “magic quadrant” lists Microsoft (MS) as the “leader” in enterprise search; this is the same place Gartner has positioned Fast Search & Transfer in the past. Whether this is because Fast’s assets are now owned by MS or because Gartner really believes that Microsoft is the leader, I still beg to strongly differ.

I have been perplexed by the Microsoft/Fast deal since it was announced earlier this year because, although Fast has always offered a lot of search technology, I never found it to be a compelling solution for any of my clients. Putting aside the huge upfront capital cost for licenses, the staggering amount of development work, and the time to deployment, there were other concerns. I sensed a questionable commitment to an on-going, sustainable, unified and consistent product vision with supporting services. I felt that any client of mine would need very deep pockets indeed to really make a solid value case for Fast. Most of my clients are already burned out on really big enterprise deployments of applications in the ERP and CRM space, and understand the wisdom of beginning with smaller value-achievable, short-term projects on which they can build.

Products that impress me as having much more “out-of-the-box” at a more reasonable cost are clearly leaders in their unique domains. They have important clients achieving a good deal of benefit at a reasonable cost, in a short period of time. They have products that can be installed, implemented and maintained internally without a large staff of administrators, and they have good reputations among their clients for responsiveness and a cohesive series of roll-outs. Several have as many or more clients than Fast ever had (if we ever knew the real number). Coveo, Exalead, ISYS, Recommind, Vivisimo, and X1 are a few of a select group that are making a mark in their respective niches, as products ready for action with a short implementation cycle (weeks or months, not years).

Autonomy and Endeca continue to bring value to very large projects in large companies but are not plug-and-play solutions, by any means. Oracle, IBM, and Microsoft offer search solutions of a very different type, with a heavy vendor or third-party service requirement. Google Search Appliance has a much larger installed base than any of these but needs serious tuning and customization to make it suitable to enterprise needs. Take the “leadership” designation with a big grain of salt, because what leads on the charts may be exactly what bogs you down. There are no generic, one-size-fits-all enterprise search solutions, including those in the “leaders” quadrant.

The Future of Enterprise Search

We’ve been especially focused on enterprise search this year. In addition to Lynda’s blog and our normal conference coverage, we have released two extensive reports, one authored by Lynda and one by Stephen Arnold, and Udi Manber, VP Engineering, Search at Google, keynoted our San Francisco conference. We are continuing this focus at our upcoming Boston conference, where Prabhakar Raghavan, Head of Yahoo! Research, will provide the opening keynote.

Prabhakar’s talk is titled “The Future of Search”. The reason I added “enterprise” to the title of this post is that Prabhakar’s talk will be of special interest to enterprises because of its emphasis on complex data in databases and marked-up content repositories. Prabhakar’s background includes stints as CTO at Verity and at IBM, so enterprise (or, if you prefer, “behind-the-firewall” or “intranet”) search requirements are not new to him.

Here is the description from the conference site:

Web content continues to grow, change, diversify, and fragment. Meanwhile, users are performing increasingly sophisticated and open-ended tasks online, connecting broadly to content and services across the Web. The simple search result page of blue text links needs to evolve to address these complex tasks, and this evolution includes a more formal understanding of users’ intent, and a deeper model of how particular pieces of Web content can help. Structured databases power a significant fraction of Web pages, and microformats and other forms of markup have been proposed as mechanisms to expose this structure. But uptake of these mechanisms remains limited, as content owners await the killer application for this technology. That application is search. If search engines can make deep use of structured information about content, provided through open standards, then search engines and site owners can together bring consumers a far richer experience. We are entering a period of massive change to enable search engines to handle more complex content. Prabhakar Raghavan, head of Yahoo! Research, will address the future of search: how search engines are becoming more sophisticated, what the breakthrough point will be for semantics on the Web and what this means for developers and publishers.

Join us on December 3rd at 8:30am at the Boston Westin Copley. Register.

Dewey Decimal Classification, Categorization, and NLP

I am surprised how often various content organizing mechanisms on the Web are compared to the Dewey Decimal System. As a former librarian, I am disheartened to be reminded how often students were lectured on the Dewey Decimal system, apparently to the exclusion of learning about subject categorization schemes. The two complemented each other, but that seems to be a secret to all but librarians.

I’ll try to share a clearer view of the model and explain why new systems of organizing content in enterprise search are quite different from the decimal model.

Classification is a good generic term for defining physical organizing systems. Unique animals and plants are distinguished by a single classification in the biological naming system. So too are books in a library. There are two principal classification systems for arranging books on the shelf in Western libraries: Dewey Decimal and Library of Congress (LC). They each use coding (numeric for Dewey Decimal and alpha-numeric for Library of Congress) to establish where a book belongs logically on a shelf, relative to other books in the collection, according to the book’s most prominent content topic. A book on nutrition for better health might be given a classification number for some aspect of nutrition or one for a health topic, but a human being has to make a judgment about which topic the book is most “about,” because the book can only live in one section of the collection. It is probably worth mentioning that the Dewey and LC systems are both hierarchical, but with different priorities (e.g., Dewey puts broad topics like Religion, Philosophy, and Psychology at the top levels, while LC groups those topics together and includes more scientific and technical topics, like Agriculture and Military Science, at the top of its list).

So why classify books to reside in topic order, when it requires a lot of labor to move the collections around to make space for new books? It is for the benefit of the users, to enable “browsing” through the collection, although it may be hard to accept that the term browsing was a staple of library science decades before the internet. Library leaders established eons ago the need for a system of physical organization to help readers peruse the book collection by topic, leading from the general to the specific.

You might ask what kind of help that was for finding the book on nutrition that was classified under “health science.” This is where another system, largely hidden from the public or often made annoyingly inaccessible, comes in. It is a system of categorization in which any content, book or otherwise, can be assigned an unlimited number of categories. Wandering through the stacks, one would never suspect this secret way of finding a nugget in a book about your favorite hobby if that book was classified to live elsewhere. The standard lists of terms for further describing books by multiple headings are called “subject headings,” and you had to use a library catalog to find them. Unfortunately, they contain mysterious conventions called “sub-divisions,” designed to pre-coordinate any topic with other generic topics (e.g. “Handbooks, manuals, etc.” and “United States”). Today we would call these generic subdivision terms facets: one reflects the kind of book and the other reveals the geographical scope covered by the book.
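In today’s terms, a pre-coordinated heading is just a topic with facets bolted on. Here is a small sketch that splits such headings apart (the heading strings and facet lists are invented examples, with “--” separating subdivisions):

```python
# A pre-coordinated subject heading packs a topic plus generic subdivisions
# (what we would now call facets) into one string. Splitting them back out:
FORM_FACETS = {"Handbooks, manuals, etc."}
PLACE_FACETS = {"United States"}

def parse_heading(heading):
    topic, *subdivisions = [part.strip() for part in heading.split("--")]
    facets = {"topic": topic, "form": None, "place": None}
    for sub in subdivisions:
        if sub in FORM_FACETS:
            facets["form"] = sub
        elif sub in PLACE_FACETS:
            facets["place"] = sub
    return facets

print(parse_heading("Nutrition -- Handbooks, manuals, etc. -- United States"))
# {'topic': 'Nutrition', 'form': 'Handbooks, manuals, etc.', 'place': 'United States'}
```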

With the marvel of the Web page, hyperlinking, and “clicking through” hierarchical lists of topics, we can click a mouse to narrow a search for handbooks on nutrition in the United States for better health, beginning at any facet or topic, and still come up with the book that meets all four criteria. We no longer have to be constrained by the Dewey model of browsing the physical location of our favorite topics, probably missing a lot of good stuff. But then we never did. The subject card catalog gave us a tool for finding more than we would by classification code alone. But even that was a lot more tedious than navigating easily through a hierarchy of subject headings, narrowing the results by facets on a browser tab, and then narrowing them further by yet another topical term until we find just the right piece of content.
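Once headings are decomposed that way, the four criteria become independent filters that can be applied in any order. A minimal sketch, continuing the invented example above:

```python
# Faceted narrowing: each click applies one more independent filter,
# and the order in which filters are applied does not matter.
books = [
    {"title": "Nutrition Handbook for Better Health", "topics": {"nutrition", "health"},
     "form": "handbook", "place": "United States"},
    {"title": "Global Diets", "topics": {"nutrition"},
     "form": "monograph", "place": None},
]

def narrow(items, **criteria):
    for key, wanted in criteria.items():
        if key == "topics":  # require all requested topics
            items = [b for b in items if wanted <= b["topics"]]
        else:
            items = [b for b in items if b[key] == wanted]
    return items

hits = narrow(books, topics={"nutrition", "health"},
              form="handbook", place="United States")
print([b["title"] for b in hits])  # ['Nutrition Handbook for Better Health']
```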

Taking the next leap we have natural language processing (NLP) that will answer the question, “Where do I find handbooks on nutrition in the United States for better health?” And that is the Holy Grail for search technology – and a long way from Mr. Dewey’s idea for browsing the collection.

Controlling Your Enterprise Search Application

When interviewing search administrators who had also been part of product selection earlier this year, I asked about surprises they had encountered. Some involved the selection process, but most related to on-going maintenance and support. None commented on actual failures to retrieve content appropriately. That is a good thing, whether it was because they had already tested for that during a proof of concept as part of due diligence, or because they were lucky.

Thinking about how product selections are made prompts me to comment on two major search product attributes that control the success or failure of search for an enterprise. One is the actual algorithms that control content indexing: what is indexed and how it is retrieved from the index (or indices). The second is the interfaces: interfaces for the population of searchers to formulate and execute queries, and interfaces for results presentation. On each aspect, buyers need to know what they can control and how best to execute it for success.

Indexing and retrieval technology is embedded in search products; the number of administrative options to alter search scalability, indexing, and content selection during retrieval ranges from limited to none. The “secret sauce” for each product is largely hidden, although it may have patented aspects available for researching. Until an administrator of a system gets deeply into tuning and experimenting with significant corpuses of content, it is difficult to assess the net effect of the delivered tuning options. The time to make informed evaluations about how well a given product will retrieve your content, when searched by your select audience, is before a purchase is made. You can’t control the underlying technology, but you can perform a proof of concept (PoC). This requires:

  • human resources and a commitment of computing resources
  • a well-defined sample of content (amount, type, and nature: metadata plus full text, or unstructured full text only) to give a testable corpus
  • testers who are representative of all potential searchers
  • a comparison of the results with three to four systems to reveal how well they each retrieve the intended content targets
  • knowledge of the content by testers and similarity of searches to what will be routinely sought by enterprise employees or customers
  • search logs from previously deployed search systems, if they exist; searches that routinely failed in the past should be used to test the newer systems (a minimal replay harness is sketched below)
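For that last point, the harness can be very simple. A sketch, in which the candidate engine functions are hypothetical stand-ins for whatever systems are in the proof of concept:

```python
# Replay queries that failed in the old search system against each candidate
# engine and compare how many now retrieve an expected target document.
# The engine functions are hypothetical stand-ins for PoC candidates.

failed_queries = {
    # query -> document id(s) the searcher was actually looking for
    "travel reimbursement form": {"hr:form-7"},
    "paris office org chart": {"intranet:paris-org"},
}

def engine_a(query):  # stand-in candidate engine
    return ["hr:form-7", "hr:form-9"]

def engine_b(query):  # stand-in candidate engine
    return ["intranet:home"]

def score_engine(search_fn, queries, top_k=10):
    hits = sum(1 for q, expected in queries.items()
               if set(search_fn(q)[:top_k]) & expected)
    return hits / len(queries)

for name, fn in [("A", engine_a), ("B", engine_b)]:
    print(f"engine {name}: {score_engine(fn, failed_queries):.0%} of failed queries recovered")
```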

Interface technology
Unlike the embedded search technology, search interfaces vary enormously, and buyers can exercise design control or hire a third party to produce them. What searchers experience when they first encounter a search engine, whether a simple search box at a portal or a completely novel variety of search options with a search box, navigation options, or special search forms, is within the control of the enterprise. Exercising that control may be required if what comes “out-of-the-box” as the default is not satisfactory. You may find, at a reasonable price, a terrific search engine that scales well, indexes metadata and full text competently, and retrieves what the audience expects, but requires a different look-and-feel for your users. Through an API (application programming interface), SDK (software development kit), or application connectors (e.g. Documentum, SharePoint), numerous customization options are delivered with enterprise search packages or are available as add-ons.
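The shape of that customization is usually the same regardless of product: the enterprise’s own front end calls the engine’s query API and renders results however it likes. A hedged sketch against a hypothetical HTTP search endpoint (the URL, parameters, and response fields are assumptions for illustration, not any particular vendor’s API):

```python
# Hedged sketch: a custom front end calling a hypothetical search API.
# The endpoint, query parameters, and response fields are assumptions
# for illustration, not any specific vendor's interface.
import json
from urllib import parse, request

SEARCH_ENDPOINT = "http://search.example.internal/api/query"  # hypothetical

def search(query, max_results=10):
    url = f"{SEARCH_ENDPOINT}?{parse.urlencode({'q': query, 'n': max_results})}"
    with request.urlopen(url) as resp:
        # Assumed response shape: {"results": [{"title": ..., "url": ..., "snippet": ...}]}
        return json.load(resp)["results"]

def render(results):
    # This part is entirely under the enterprise's control: grouping,
    # labeling, branding, and navigation can all be customized here.
    for i, hit in enumerate(results, 1):
        print(f"{i}. {hit['title']}\n   {hit['url']}\n   {hit['snippet']}\n")

if __name__ == "__main__":
    render(search("expense policy"))
```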

In either case, human resource costs must be added to the bottom line. A large number of mature software companies and start-ups are innovating with both their indexing techniques and interface design technologies. They are benefiting from several decades of search evolution for search experts, and now a decade of search experiences in the general population. Search product evolution is accelerating as knowledge of searcher experiences is leveraged by developers. You may not be able to control emerging and potentially disruptive technologies, but you can still exercise beneficial controls when selecting and implementing most any search system.

Enterprise Search: Case Studies and User Communities

While you may be wrapping up your summer vacation or preparing for a ramp up to a busy fourth quarter of business, the Gilbane team is securing the speakers for an exciting conference Dec. 2 – 4 in Boston. Evaluations of past sessions always give high marks to case studies delivered by users. We have several for the search track but would like a few more. If one of your targets for search is documents stored in SharePoint repositories, your experiences are sure to draw interest.

SharePoint is the most popular new collaboration tool for organizations with a large Microsoft application footprint, but it usually resides alongside multiple other repositories that also need to be searched. So, what search products are being used to retrieve SharePoint content plus other content? A majority of search applications provide a connector to index SharePoint documents, and they would not be making that available without a demand. We would like to hear what SharePoint adopters are actively using for search. What are you experiencing? If you would like to participate in the Gilbane Conference and have experiences you would like to share, I hope you will get in touch and check out the full program.
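Conceptually, such a connector just walks the repository and hands each document, permissions included, to the engine’s indexer. A highly simplified sketch (both functions are hypothetical placeholders, not a real SharePoint or vendor API):

```python
# Conceptual connector loop: enumerate documents in a SharePoint repository
# and feed them to a search engine's indexer. Both functions below are
# hypothetical placeholders, not a real SharePoint or vendor API.

def list_sharepoint_documents(site_url):
    # Placeholder: a real connector would authenticate and page through
    # document libraries via the repository's web services.
    yield {"id": "sp:1", "title": "Q3 Plan", "text": "...", "acl": ["staff"]}

def index_document(doc):
    # Placeholder: a real engine exposes an indexing API; note that the
    # ACL should be indexed too, so results respect SharePoint permissions.
    print(f"indexed {doc['id']}: {doc['title']} (acl={doc['acl']})")

for doc in list_sharepoint_documents("http://sharepoint.example.internal/sites/hq"):
    index_document(doc)
```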

On a related note, I was surprised, during my recent research, to discover few identifiable user groups or support communities for search products. Many young companies launch and sponsor “user-group meetings” to share product information, offer training, and facilitate peer-to-peer networking among their customers. It is a sign of confidence when they do help customers communicate with each other. It signals a willingness to open communication paths that might lead to collective product critiques which, if well organized, can benefit users and vendors. It is also a sign of maturity when companies reach out to encourage customers to connect with each other. Maybe some are operating in stealth mode, but more should be accessible to interested parties in the marketplace.

Organizing functions are difficult for users to manage on their own professional time, so having a vendor willing to be the facilitator and host for communication mechanisms is valuable. However, vendors sometimes need a nudge from customers to open the prospect of such a group. If you would value participating in a network of others using your selected product, I suggest taking the initiative by approaching your customer account representative.

Communities for sharing tips about any technology are important but so is mutual guidance to help others become more successful with any product’s process management and governance issues. User groups can give valuable feedback to their vendors and spur product usage creativity and efficiency. Finally, as an analyst I would much rather hear straight talk about product experiences from those who are active users, than a filtered version from a company representative. So, please, reach out to your peers and share your story at any opportunity you can. Volunteer to speak at conferences and participate in user groups. The benefits are numerous, the most important being the formation of a strong collective voice.


© 2024 The Gilbane Advisor
