Curated content for content, computing, and digital experience professionals

Tag: Categorization

E-discovering Language to Launch Your Taxonomy

New enterprise initiatives, whether for implementing search solutions or beginning a new product development program, demand communication among team leaders and participants. Language matters; defining terminology for common parlance is essential to smooth progress toward initiative objectives.

Glossaries, dictionaries, taxonomies, thesauri and ontologies are all mechanisms we use routinely in education and work to clarify terms we use to engage and communicate understanding of any specialized domain. Electronic social communication added to the traditional mix of shared information (e.g. documents, databases, spreadsheets, drawings, standardized forms) makes business transactional language more complex. Couple this with the use of personal devices for capturing and storing our work content, notes, writings, correspondence, design and diagram materials and we all become content categorizing managers. Some of us are better than others at organizing and curating our piles of information resources.

As recent brain studies reveal, humans, and probably any animal with a brain, have established cognitive areas in our brains with pathways and relationships among categories of grouped concepts. This reinforces our propensity for expending thought and effort to order all aspects of our lives. That we all organize differently across a huge spectrum of concepts and objects makes it wondrous that we can live and work collaboratively at all. Why after 30+ years of marriage do I arrange my kitchen gadget drawer according to use or purpose of devices while my husband attempts to store the same items according to size and shape? Why do icons and graphics placed in strange locations in software applications and web pages rarely impart meaning and use to me, while others “get it” and adapt immediately?

The previous paragraph may seem to be a pointless digression from the subject of the post but there are two points to be made here. First, we all organize both objects and information to facilitate how we navigate life, including work. Without organization that is somehow rationalized, and established according to our own rules for functioning, our lives descend into dysfunctional chaos. People who don’t organize well or struggle with organizing consistently struggle in school, work and life skills. Second, diversity of practice in organizing is a challenge for working and living with others when we need to share the same spaces and work objectives. This brings me to the very challenging task of organizing information for a website, a discrete business project, or an entire enterprise, especially when a diverse group of participants is engaged as a team.

So, let me make a few bold suggestions about where to begin with your team:

  • Establish categories of inquiry based on the existing culture of your organization and vertical industry. Avoid being inventive, clever or idiosyncratic. Find category labels that everyone understands similarly.
  • Agree on common behaviors and practices for finding by sharing openly the ways in which members of the team need to find, the kinds of information and answers that need discovering, and the conditions under which information is required. These are the basis for findability use cases. Again, begin with the usual situations and save the unusual for later insertion.
  • Start with what you have in the form of finding aids: places, language and content that are already being actively used; examine how they are organized. Solicit and gather experiences about what is good, helpful and “must have” and note interface elements and navigation aids that are not used. Harvest any existing glossaries, dictionaries, taxonomies, organization charts or other definition entities that can provide feeds to terminology lists.
  • Use every discoverable repository as a resource (including email stores, social sites, and presentations) for establishing terminology and eventually writing rules for applying terms. Research repositories that are heavily used by groups of specialists and treat them as crops of terminology to be harvested for language that is meaningful to experts. Seek or develop linguistic parsing and term extraction tools and processes to discover words and phrases that are in common use. Use histograms to determine frequency of use, then alphabetize to find similar terms that are conceptually related, and semantic net tools to group discovered terms according to conceptual relationships. Segregate initialisms, acronyms, and abbreviations for analysis and insertion into final lists, as valid terms or synonyms to valid terms.
  • Talk to the gurus and experts that are the “go-to people” for learning about a topic and use their experience to help determine the most important broad categories for information that needs to be found. Those will become your “top term” groups and facets. Think of top terms as topical in nature (e.g. radar, transportation, weapons systems) and facets as other categories by which people might want to search (e.g. company names, content types, conference titles).
  • Simplify your top terms and facets into the broadest categories for launching your initiative. You can always add more but you won’t really know where to be the most granular until you begin using tags applied to content. Then you will see what topics have the most content and require narrower topical terms to avoid having too much content piling up under a very broad category.
  • Select and authorize one individual to be the ultimate decider. Ambiguity of categorizing principles, purpose and needs is always a given due to variations in cognitive functioning. However, the earlier steps outlined here will have been based on broad agreement. When it comes to the more nuanced areas of terminology and understanding, a subject savvy and organizationally mature person with good communication skills and solid professional respect within the enterprise will be a good authority for making final decisions about language. A trusted professional will also know when a change is needed and will seek guidance when necessary.
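The term-harvesting step in the list above (extract words, count how often they are used, alphabetize to spot similar terms, and set acronyms aside) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sample sentences, the stop list, and the rule that a short all-uppercase token counts as an acronym are hypothetical, and a real project would use proper linguistic parsing tools rather than a regular expression.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for content harvested from repositories.
DOCUMENTS = [
    "The radar team reviewed the SAR imaging report for the radar upgrade.",
    "Transportation planning and radar coverage were discussed with DOD staff.",
    "SAR imaging and transportation logistics remain open topics.",
]

# A tiny illustrative stop list (an assumption); a real one would be much fuller.
STOP_WORDS = {"the", "for", "and", "were", "with", "a", "of", "remain"}

WORD_RE = re.compile(r"[A-Za-z][A-Za-z\-]*")


def extract_terms(documents):
    """Return (term_counts, acronym_counts) for a list of text strings."""
    terms = Counter()
    acronyms = Counter()
    for text in documents:
        for token in WORD_RE.findall(text):
            # Assumed heuristic: short all-uppercase tokens are initialisms
            # or acronyms, segregated for separate analysis.
            if token.isupper() and 2 <= len(token) <= 6:
                acronyms[token] += 1
            elif token.lower() not in STOP_WORDS:
                terms[token.lower()] += 1
    return terms, acronyms


terms, acronyms = extract_terms(DOCUMENTS)

# Frequency view: the raw data for a histogram of term use.
by_frequency = terms.most_common()

# Alphabetical view: similarly spelled terms end up next to each other.
alphabetical = sorted(terms)

for term, count in by_frequency[:5]:
    print(f"{term:15} {'#' * count}")
print("Acronyms set aside for review:", sorted(acronyms))
```

The frequency list suggests which terms deserve a place in the first draft of the term store, while the separate acronym list can be reviewed later and mapped to valid terms or synonyms.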

Revisit the successes and failures of the applied term store routinely: survey users, review search logs, observe information retrieval bottlenecks and troll for new electronic discourse and content as a source of new terminology. A recent post by taxonomy expert Heather Hedden gives more technical guidance about evaluating and sustaining your taxonomy maintenance.

Atex Releases Polopoly v9.13 Web Content Management System

Atex released an update to their Web content management system, Polopoly 9.13, which integrates with their Text Mining engine to automatically tag and categorize content. A new Polopoly widget also allows content to be “batch categorized”, which enhances the search results for end users, while providing internal users with a discovery and knowledge management tool. Instead of editors applying relevant categories manually, the text mining engine will now do it automatically. Editors can instruct the engine to analyze a piece of content and suggest relevant categories based on the text, and receive suggestions based on the metadata and IPTC categorization. With Polopoly 9.13, classified content is automatically placed in dynamic lists based on metadata selections in the repository. These lists can automatically serve up older stories with links for related content, which are placed in context alongside the current articles. Interested users could be encouraged to “read more” or “find similar” stories based on information from the articles they are viewing. Publishers can even create new pages based entirely on archived content that’s been categorized by metadata. http://atex.com

Dewey Decimal Classification, Categorization, and NLP

I am surprised how often various content organizing mechanisms on the Web are compared to the Dewey Decimal System. As a former librarian, I am disheartened to be reminded how often students were lectured on the Dewey Decimal system, apparently to the exclusion of learning about subject categorization schemes. They complemented each other but that seems to be a secret among all but librarians.

I’ll try to share a clearer view of the model and explain why new systems of organizing content in enterprise search are quite different from the decimal model.

Classification is a good generic term for defining physical organizing systems. Unique animals and plants are distinguished by a single classification in the biological naming system. So too are books in a library. There are two principal classification systems for arranging books on the shelf in Western libraries: Dewey Decimal and Library of Congress (LC). They each use coding (numeric for Dewey Decimal and alpha-numeric for Library of Congress) to establish where a book belongs logically on a shelf, relative to other books in the collection, according to the book’s most prominent content topic. A book on nutrition for better health might be given a classification number for some aspect of nutrition or one for a health topic, but a human being has to make a judgment which topic the book is most “about” because the book can only live in one section of the collection. It is probably worth mentioning that the Dewey and LC systems are both hierarchical but with different priorities (e.g. Dewey puts broad topics like Religion and Philosophy and Psychology at top levels and LC puts those two topics together while including more scientific and technical topics at the top of the list, like Agriculture and Military Science).

So why classify books to reside in topic order? It requires a lot of labor to move the collections around to make space for new books. It is for the benefit of the users, to enable “browsing” through the collection, although it may be hard to accept that the term browsing was a staple of library science decades before the internet. Library leaders established eons ago the need for a system of physical organization to help readers peruse the book collection by topic, leading from the general to the specific.

You might ask what kind of help that was for finding the book on nutrition that was classified under “health science.” This is where another system, largely hidden from the public or often made annoyingly inaccessible, comes in. It is a system of categorization in which any content, book or otherwise, can be assigned an unlimited number of categories. Wandering through the stacks, you would never suspect this secret way of finding a nugget in a book about your favorite hobby if that book was classified to live elsewhere. The standard lists of terms for further describing books by multiple headings are called “subject headings” and you had to use a library catalog to find them. Unfortunately, they contain mysterious conventions called “sub-divisions,” designed to pre-coordinate any topic with other generic topics (e.g. Handbooks, etc. and United States). Today we would call these generic subdivision terms facets. One reflects a kind of book and the other reveals a geographical scope covered by the book.

With the marvel of the Web page, hyperlinking, and “clicking through” hierarchical lists of topics we can click a mouse to narrow a search for handbooks on nutrition in the United States for better health beginning at any facet or topic and still come up with the book that meets all four criteria. We no longer have to be constrained by the Dewey model of browsing the physical location of our favorite topics, probably missing a lot of good stuff. But then we never did. The subject card catalog gave us a tool for finding more than we would by classification code alone. But even that was a lot more tedious than navigating easily through a hierarchy of subject headings, narrowing the results by facets on a browser tab and further narrowing the results by yet another topical term until we find just the right piece of content.
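To make the contrast concrete, here is a minimal Python sketch of the two models: a single shelf classification per item versus any number of subject headings plus facets. Everything in it is hypothetical and for illustration only (the `Item` record, the three sample titles, the `form` and `place` facet names, and the `narrow` helper); it does not reproduce any real catalog format or search engine API.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    title: str
    shelf_class: str                            # one classification = one shelf location
    subjects: set = field(default_factory=set)  # any number of subject headings
    facets: dict = field(default_factory=dict)  # e.g. kind of book, geographic scope


# Hypothetical sample catalog.
CATALOG = [
    Item("Eating Well: A Handbook", "health science",
         {"nutrition", "health"}, {"form": "handbook", "place": "United States"}),
    Item("Nutrition Policy in Europe", "nutrition",
         {"nutrition", "public policy"}, {"form": "report", "place": "Europe"}),
    Item("Home Fitness Handbook", "health science",
         {"exercise", "health"}, {"form": "handbook", "place": "United States"}),
]


def narrow(items, subjects=(), **facets):
    """Keep items that carry every requested subject and every facet value."""
    wanted = set(subjects)
    return [
        item for item in items
        if wanted <= item.subjects
        and all(item.facets.get(name) == value for name, value in facets.items())
    ]


# Browsing the "nutrition" shelf misses the handbook shelved under health science.
on_nutrition_shelf = [i.title for i in CATALOG if i.shelf_class == "nutrition"]

# Narrowing by two topics and two facets finds it, starting from any criterion.
step1 = narrow(CATALOG, form="handbook")
step2 = narrow(step1, subjects=["nutrition", "health"], place="United States")
matches = [i.title for i in step2]

print(on_nutrition_shelf)
print(matches)
```

Because each call to `narrow` only filters the list it is given, the criteria can be applied in any order and still arrive at the same item, which is the practical difference from walking to a single shelf location.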

Taking the next leap we have natural language processing (NLP) that will answer the question, “Where do I find handbooks on nutrition in the United States for better health?” And that is the Holy Grail for search technology – and a long way from Mr. Dewey’s idea for browsing the collection.

Enterprise Search: Leveraging and Learning from Web Search and Content Tools

Following on my last post in which I covered the unique value propositions offered by a variety of enterprise search products, this one takes a look at the evolution of enterprise search. The commentary by search company experts, executives, and analysts indicates some evolutionary technologies and the escalation of certain themes in enterprise search. Furthermore, the pursuit of organizations to strengthen the link between searching technologies and knowledge enablers has never been more prominently featured, taking search to a whole new level beyond mere retrieval.

The following paraphrased comments from the Enterprise Search Keynote session are timely and revealing. When I asked, “Will Web and Internet Search Technologies Drive the Enterprise (Internal) Search Tool Offerings or Will the Markets Diverge?”, these were some thoughts from the panelists.

Matt Brown, Principal Analyst from Forrester Research, commented that enterprise search demands much different and richer content interpretation types of search technologies. What Web-based searching does is create such high visibility for search that enterprises are being primed to adopt it, but only when it comes with enhanced capabilities.

Echoing Matt’s remarks, Oracle search solution manager Bob Bocchino commented on the difficulty of making search operate well within the enterprise because it needs to deal with structured database content and unstructured files, while also applying sophisticated security features that let only authorized viewers see restricted content. Furthermore, security must be deployed in a way that does not degrade performance while supporting continuous updates to content and permissions.

Hadley Reynolds, VP & Director of the Center for Search Innovation at Fast Search & Transfer, noted that the Web isn’t really making a direct impact on enterprise search innovation but many of the social tools found on the Web are being adopted in enterprises to create new kinds of content (e.g. social networks, blogs and wikis) with which enterprise search engines must cope in richer contextual ways.

Don Dodge, Director of Business Development for the Emerging Business Team at Microsoft, further noted that the Internet’s biggest problem is scale. That is a much easier problem to solve than the one facing the enterprise, where user standards for what qualifies as good and valuable search results are much higher, making the technology to deliver those results more difficult to build.

Among the other noteworthy comments in this session was a negative remark about taxonomies. The gist of it was that they require so much discipline that they might work for a while but can’t really be sustained. If this attitude becomes the norm, many of the semantic search engines which depend on some type of classification and categorization according to industry terminologies or locally maintained lists will be challenged to deliver enhanced search results. This is a subject to be taken up in a later blog entry.

A final conclusion about enterprise search was a remark about the evolution of adoption in the marketplace. Simply put, the marketplace is not monolithic in its requirements. The diversity of demands on search technologies has been a disincentive for vendors to focus on distinct niches and place more effort on areas like e-commerce. This seems to be shifting, especially with all the large software companies now seriously announcing products in the enterprise search market.