Curated for content, computing, and digital experience professionals

Tag: Classification systems

Newgen unveils new AI and ML-based Document Classification Service

Newgen Software, a global provider of low code digital automation platform for managing content, processes, and communication, announced it has launched an enhanced version of its document classification service for enabling the high-volume document-handling environment. Intelligent Document Classifier 1.0 allows users to gain hidden insights by classifying documents, based on structural features and/or textual features. It uses machine learning (ML) and artificial intelligence (AI), to enable layout- and content-based document classification. Organizations can leverage the solution to automatically classify various documents such as sales/purchase orders, enrollment and claim forms, legal documents, mailroom documents, contracts, correspondences, and others. This helps ensure important information is available thereby reducing risks and costs associated with manual document management.

Key features  include:

  • Image Classification – Allows users to automatically classify images using neural networks and deep learning algorithms based on structural features
  • Content Classification – Enables document classification based on content, in the absence of structural features
  • Trainable Machine Learning – Auto-learns definitions and features of a document class and creates a trained model
  • Admin Dashboard – Generates analytics reports for a 360-degree view of the process
  • Integration Capabilities – Facilitates easy integration with core business applications, content management platforms, and document capture applications

https://newgensoft.com/

Dewey Decimal Classification, Categorization, and NLP

I am surprised how often various content organizing mechanisms on the Web are compared to the Dewey Decimal System. As a former librarian, I am disheartened to be reminded how often students were lectured on the Dewey Decimal system, apparently to the exclusion of learning about subject categorization schemes. They complemented each other but that seems to be a secret among all but librarians.

I’ll try to share a clearer view of the model and explain why new systems of organizing content in enterprise search are quite different than the decimal model.

Classification is a good generic term for defining physical organizing systems. Unique animals and plants are distinguished by a single classification in the biological naming system. So too are books in a library. There are two principal classification systems for arranging books on the shelf in Western libraries: Dewey Decimal and Library of Congress (LC). They each use coding (numeric for Dewey decimal and alpha-numeric for Library of Congress) to establish where a book belongs logically on a shelf, relative to other books in the collection, according to the book’s most prominent content topic. A book on nutrition for better health might be given a classification number for some aspect of nutrition or one for a health topic, but a human being has to make a judgment which topic the book is most “about” because the book can only live in one section of the collection. It is probably worth mentioning that the Dewey and LC systems are both hierarchical but with different priorities. (e.g. Dewey puts broad topics like Religion and Philosophy and Psychology at top levels and LC puts those two topics together while including more scientific and technical topics at the top of the list, like Agriculture and Military Science.)

So why classify books to reside in topic order? It requires a lot of labor to move the collections around to make space for new books. It is for the benefit of the users, to enable “browsing” through the collection, although it may be hard to accept that the term browsing was a staple of library science decades before the internet. Library leaders established eons ago the need for a system of physical organization to help readers peruse the book collection by topic, leading from the general to the specific.

You might ask what kind of help that was for finding the book on nutrition that was classified under “health science.” This is where another system, largely hidden from the public or often made annoyingly inaccessible, comes in. It is a system of categorization in which any content, book or otherwise, can be assigned an unlimited number of categories. Wondering through the stacks, one would never suspect this secret way of finding a nugget in a book about your favorite hobby if that book was classified to live elsewhere. The standard lists of terms for further describing books by multiple headings are called “subject headings” and you had to use a library catalog to find them. Unfortunately, they contain mysterious conventions called “sub-divisions,” designed to pre-coordinate any topic with other generic topics (e.g. Handbooks, etc. and United States). Today we would call these generic subdivision terms, facets. One reflects a kind of book and the other reveals a geographical scope covered by the book.

With the marvel of the Web page, hyperlinking, and “clicking through” hierarchical lists of topics we can click a mouse to narrow a search for handbooks on nutrition in the United States for better health beginning at any facet or topic and still come up with the book that meets all four criteria. We no longer have to be constrained by the Dewey model of browsing the physical location of our favorite topics, probably missing a lot of good stuff. But then we never did. The subject card catalog gave us a tool for finding more than we would by classification code alone. But even that was a lot more tedious than navigating easily through a hierarchy of subject headings, narrowing the results by facets on a browser tab and further narrowing the results by yet another topical term until we find just the right piece of content.

Taking the next leap we have natural language processing (NLP) that will answer the question, “Where do I find handbooks on nutrition in the United States for better health?” And that is the Holy Grail for search technology – and a long way from Mr. Dewey’s idea for browsing the collection.

© 2020 The Gilbane Advisor

Theme by Anders NorenUp ↑