Category: Semantic technologies

Our coverage of semantic technologies goes back to the early 1990s, when search engines built for structured data in databases were starting to add support for unstructured and semi-structured data. An early Gilbane Report, Document Query Languages – Why is it so Hard to Ask a Simple Question?, analyses the challenge as it stood then.

Semantic technology is a broad topic that includes all natural language processing, as well as the semantic web, linked data processing, and knowledge graphs.

Microsoft details T-ULRv2 model that can translate between 94 languages

October 20, 2020 / NewsShark

From Kyle Wiggers at VentureBeat

The same week Facebook open-sourced M2M-100, an AI model that can translate between over 100 languages, Microsoft detailed an algorithm of its own — Turing Universal Language Representation (T-ULRv2) — that can interpret 94 languages. The company claims T-ULRv2 achieves the top results in XTREME, a natural language processing benchmark created by Google, and will use it to improve features like Semantic Search in Word and Suggested Replies in Outlook and Teams ahead of availability in private preview via Azure.

T-ULRv2, a collaboration between Microsoft Research and the Microsoft Turing team, contains 550 million parameters, the internal variables that the model uses to make predictions. (By comparison, M2M-100 has around 15 billion parameters.) Microsoft researchers trained T-ULRv2 on a multilingual data corpus from the web that consists of the aforementioned 94 languages. During training, the model learned to translate by predicting masked words from sentences in different languages, occasionally drawing on context clues in pairs of translations like English and French.
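The masked-word training described above can be illustrated with a toy sketch. It joins an English sentence to its French translation and hides a random subset of tokens, so that a model could use either language as context when predicting them. The whitespace tokenization, the `[MASK]` and `[SEP]` markers, the masking rate, and the function name `mask_pair` are all assumptions made for illustration; this is not Microsoft's training code.

```python
import random

MASK = "[MASK]"   # placeholder token shown to the model (assumed name)
SEP = "[SEP]"     # separator between the two languages (assumed name)

def mask_pair(src_tokens, tgt_tokens, mask_prob=0.15, seed=0):
    """Join a translation pair and mask random tokens.

    Returns (masked_tokens, labels). labels[i] holds the original token
    at masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    tokens = src_tokens + [SEP] + tgt_tokens
    masked, labels = [], []
    for tok in tokens:
        # Never mask the separator; mask other tokens with probability mask_prob.
        if tok != SEP and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

english = "the cat sleeps on the sofa".split()
french = "le chat dort sur le canapé".split()
masked, labels = mask_pair(english, french, mask_prob=0.3, seed=1)
print(" ".join(masked))
```

A model trained on such examples is scored on how well it recovers the tokens stored in `labels`. Because both halves of the pair are visible, a masked English word can sometimes be inferred from its French counterpart.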

As Microsoft VP Saurabh Tiwary and assistant managing director Ming Zhou note in a blog post, the XTREME benchmark covers 40 languages spanning 12 families and 9 tasks that require reasoning about varying levels of syntax. The languages are selected to maximize diversity, coverage in existing tasks, and availability of training data, and the tasks cover a range of paradigms including sentence text classification, structured prediction, sentence retrieval, and cross-lingual question answering. For models to be successful on the XTREME benchmarks, then, they must learn representations that generalize to many standard cross-lingual transfer settings.

The jury is out on T-ULRv2’s potential for bias and its grasp of general knowledge. Some research suggests benchmarks such as XTREME don’t measure models’ knowledge well and that models like T-ULRv2 can exhibit toxicity and prejudice against demographic groups. But the model is in any case a step toward Microsoft’s grand “AI at scale” vision, which seeks to push AI capabilities by training algorithms with increasingly large amounts of data and compute. Already, the company has used its Turing family of models to bolster language understanding across Bing, Office, Dynamics, and its other productivity products.

…

T-ULRv2 will power current and future language services available through Azure Cognitive Services, Microsoft says. It will also be available as a part of a program for building custom applications, which was announced at Microsoft Ignite 2020 earlier this year. Developers can submit requests for access.

https://venturebeat.com/2020/10/20/microsoft-details-t-urlv2-model-that-can-translate-between-94-languages/, https://www.microsoft.com/en-us/research/blog/microsoft-turing-universal-language-representation-model-t-ulrv2-tops-xtreme-leaderboard/

Tisane Labs adds Wikidata extraction feature on Microsoft Azure

October 13, 2020 / NewsShark

Tisane Labs, a supplier of text analytics AI solutions, announced a new feature in Tisane API, already available on Microsoft Azure Marketplace and AppSource. With the new feature, Tisane API can tag and extract Wikidata entities, complementing the capabilities provided by Azure Cognitive Services and supporting nearly 30 languages. Users can obtain Wikidata IDs from Tisane’s JSON response and use them to annotate text with images, GPS coordinates, important dates, third-party references, and anything else the open, ever-growing Wikidata database contains. Tisane API runs in the cloud on Azure API Management, with a simple REST interface that can be called from any popular programming platform. Tisane Labs offers a range of tailored plans, including a free plan and the option of a custom on-premises installation.
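As a rough sketch of what a client might do with such IDs, the snippet below turns Wikidata identifiers found in a JSON response into links to the corresponding Wikidata entity pages. The response fragment and its field names (`entities`, `name`, `wikidata_id`) are hypothetical placeholders, not Tisane's documented schema; only the `https://www.wikidata.org/wiki/<ID>` page pattern is taken from Wikidata itself.

```python
import json
import re

# Hypothetical response fragment. The field names are placeholders,
# not Tisane's documented schema.
response_text = """
{
  "entities": [
    {"name": "Douglas Adams", "wikidata_id": "Q42"},
    {"name": "London", "wikidata_id": "Q84"},
    {"name": "unknown thing"}
  ]
}
"""

# Wikidata item IDs are the letter Q followed by a positive integer.
QID_RE = re.compile(r"^Q[1-9]\d*$")

def wikidata_links(response):
    """Map entity names to Wikidata page URLs, skipping entities without a valid ID."""
    links = {}
    for entity in response.get("entities", []):
        qid = entity.get("wikidata_id")
        if qid and QID_RE.match(qid):
            links[entity["name"]] = f"https://www.wikidata.org/wiki/{qid}"
    return links

links = wikidata_links(json.loads(response_text))
for name, url in links.items():
    print(name, url)
```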

https://tisane.ai

Dstl releases free Baleen 3 data processing update

October 5, 2020 / NewsShark

The Defence Science and Technology Laboratory (Dstl) has released a new, free version of its popular data processing tool. Baleen 3 is a tool for building data processing pipelines using the open source Annot8 framework. It succeeds Baleen 2, one of the first open source projects by Dstl, the science inside UK defence and security. It lets users search, process and collate data, and is suitable for personal and commercial applications. It has been used across government, industry and academia, both in the UK and internationally.

The tool enables the creation of a bespoke chain of “processors” to extract information from unstructured data (e.g. text documents, images). For example, Baleen 3 could process a folder containing thousands of Word documents and PDFs, extract all e-mail addresses and phone numbers from them, and store the results in a database. As well as text, Baleen 3 can find and extract images within those documents, perform OCR to find text within those images, translate that text into English, and then run machine learning models to find mentions of people in it. Baleen 3 supports components developed within the Annot8 framework, so it is easy to extend to cover new use cases and provide additional functionality. A large number of components are already available for use within the Annot8 framework, including some previously developed by Dstl.
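The idea of a chain of processors can be sketched in a few lines of Python. This is an illustration of the pattern only: it does not use Baleen 3 or the Annot8 API, and the function names, the regular expressions, and the in-memory `docs` dictionary standing in for a folder of files are all assumptions.

```python
import re

# Simplified patterns for illustration; real-world extraction needs stricter rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def extract_emails(text, record):
    """Processor: append every e-mail address found in the text."""
    record.setdefault("emails", []).extend(EMAIL_RE.findall(text))

def extract_phones(text, record):
    """Processor: append every phone-like number found in the text."""
    record.setdefault("phones", []).extend(PHONE_RE.findall(text))

def run_pipeline(documents, processors):
    """Run each processor, in order, over each document's text."""
    results = {}
    for name, text in documents.items():
        record = {}
        for processor in processors:
            processor(text, record)
        results[name] = record
    return results

# Stand-in for a folder of documents whose text has already been extracted.
docs = {
    "memo.txt": "Contact alice@example.org or call +44 20 7946 0000.",
    "notes.txt": "No contact details here.",
}
results = run_pipeline(docs, [extract_emails, extract_phones])
print(results)
```

Adding a new capability means writing one more processor function and appending it to the list, which mirrors the extensibility the article attributes to the component model.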

Following the release of Baleen 3, support for the existing Baleen 2 project will be withdrawn. Dstl is encouraging all users to move to using Baleen 3 where possible. Baleen 3 is built on top of newer technologies, and will be easier to maintain and deploy as a result of the upgrade. It also extends Baleen 2’s focus on text to support other forms of unstructured data, such as images. Baleen 3 is available to download now.

https://github.com/dstl/baleen

Yext releases “Milky Way” search algorithm with BERT

August 27, 2020 / NewsShark

Yext, Inc. announced “Milky Way,” the latest upgrade to the natural language processing (NLP) algorithm that powers Yext Answers, Yext’s site search product. Headlining this milestone update is the adoption of BERT (Bidirectional Encoder Representations from Transformers). Developed by Google, BERT is an open source machine learning framework for NLP designed to better understand user searches. By leveraging BERT within Named Entity Recognition (a process to locate and classify named entities mentioned in unstructured text into predefined categories), Yext Answers improves its ability to distinguish locations from other types of entities, including people, jobs, and events. The update includes:

Improved Named Entity Recognition: By leveraging BERT, Yext Answers can now better understand the contextual relationship between search terms. Answers will return a more relevant result by taking into account the correct classification, whether a location, person or product.
Improved Location Detection: The update leaves behind location biasing. Now, Yext Answers will filter through locations stored by a business in their Yext knowledge graph to surface the best match.
Updated Healthcare Taxonomy: More than 3,000 new healthcare-related synonyms, conditions, treatments, and procedures have been added to the algorithm’s taxonomy.
Improved Stemming and Typo Tolerance.

https://www.yext.com/resources/about/news-media/2020-08-yext-releases-milky-way/

Google open-sources LIT for evaluating natural language models

August 14, 2020 / NewsShark

Google-affiliated researchers released the Language Interpretability Tool (LIT), an open source, framework-agnostic platform and API for visualizing, understanding, and auditing natural language processing models. It focuses on questions about AI model behavior, like why models made certain predictions and why they’re performing poorly with input corpora. LIT incorporates aggregate analysis into a browser-based interface that’s designed to enable explorations of text generation behavior. The tool set is architected so that users can hop between visualizations and analysis to test hypotheses and validate those hypotheses over a data set. New data points can be added on the fly and their effect on the model visualized immediately, while side-by-side comparison allows for two models or two data points to be visualized simultaneously. And LIT calculates and displays metrics for entire data sets to spotlight patterns in model performance, including the current selection, manually generated subsets, and automatically generated subsets.

LIT works with any model that can run from Python, the Google researchers say, including TensorFlow, PyTorch, and remote models on a server. And it has a low barrier to entry, with only a small amount of code needed to add models and data. The team cautions that LIT doesn’t scale well to large corpora and that it’s not “directly” useful for training-time model monitoring. But they say that in the near future, the tool set will gain features like counterfactual generation plugins, additional metrics and visualizations for sequence and structured output types, and a greater ability to customize the UI for different applications.

H/T VentureBeat: https://venturebeat.com/2020/08/14/google-open-sources-lit-a-toolset-for-evaluating-natural-language-models/

Zignal Labs adds Lexalytics to provide natural language processing to platform

August 5, 2020 / NewsShark

Lexalytics announced that Zignal Labs, creator of the Impact Intelligence platform for measuring the evolution of opinion in real time, has added the Lexalytics Salience engine to extend its platform’s natural language processing (NLP) and text analytics capabilities. The addition is intended to help marketers, communicators and analysts gain a greater understanding of perceptions across traditional and social media. With Lexalytics, Zignal’s customers across industries can understand what people are saying about products, services or current events, categorize discussions into separate groupings and themes, and evaluate the sentiment of media coverage across multiple languages.

http://www.lexalytics.com, http://www.zignallabs.com

Neofonie announced TXTWerk – text mining for SAP solutions

July 28, 2020 / NewsShark

Neofonie announced that TXTWerk – Text mining for SAP solutions, a framework application, is now available for trial and online purchase on SAP App Center, the digital marketplace for SAP partner offerings. TXTWerk is delivered online as a subscription service and integrates with SAP and third-party software through the API management capabilities of SAP Cloud Platform Integration Suite. TXTWerk extracts metadata from texts, turning unstructured text into structured data. By applying machine learning techniques in combination with rule-based approaches, TXTWerk can read and understand texts quickly. Whether 1,000 or 10 billion documents need to be processed, TXTWerk recognizes the most important keywords, people, places, organizations, events and key concepts and links them to sources such as knowledge graphs or internal company data. The framework also includes artificial intelligence (AI) processes for classification into customer-defined classes, sentiment analysis of texts, phrase and role recognition, and the automatic linking of entities according to specially defined relations. In addition to the AI processes, TXTWerk comes with a knowledge graph with over seven million entries.

https://www.neofonie.de/english, https://www.sapappcenter.com/en/product/display-0000059151_live_v1

Luminoso introduces deep learning model for evaluating sentiment at concept level

July 28, 2020 / NewsShark

Luminoso’s new deep learning model understands documents using multiple layers of attention, a mechanism that identifies which words are relevant to get context around a specific concept as expressed by a word or phrase. This model is capable of identifying the author’s sentiment for each individual concept they’ve written about, as opposed to providing an analysis of the overall sentiment of the document.

Using Concept-Level Sentiment, users will be able to:

Effectively analyze mixed feedback — Concept-level sentiment analysis is critical for capturing and understanding the voice of the customer (VoC). For example, product reviews rarely contain just one type of feedback, and it’s important to tease apart the good from the bad. Getting a polarity for each of the topics in an open-ended survey response is critical for understanding what works and what doesn’t for your customers.
Quickly surface buried feedback — Uncovering negative comments in overwhelmingly positive open-ended survey responses is critical for better understanding customers and employees. For instance, in voice of the employee (VoE) surveys, employee feedback can be overwhelmingly positive and delivered in an upbeat way in an effort to soften criticisms. Concept-Level Sentiment in Luminoso enables users to quickly identify and understand “buried” feedback, such as negative points in an overwhelmingly positive HR survey.
Intuitively aggregate concept sentiment across an entire dataset — For instance, after responses to a mobile app market research survey are loaded into Luminoso Daylight, a user can get a distribution of positive, negative, and neutral opinions about every aspect of the mobile experience across all of its mentions in the dataset.
Analyze customer and employee feedback across multiple languages — Global organizations often receive customer and employee feedback in multiple languages. With Luminoso, users can analyze the sentiment of concepts, natively in 15 languages.
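The aggregation step described in the list above can be sketched as follows, assuming a model has already assigned a sentiment label to each mention of a concept. The sample `mentions` data and the function name `sentiment_distribution` are made up for illustration and do not come from Luminoso's product or API.

```python
from collections import Counter, defaultdict

LABELS = ("positive", "negative", "neutral")

# Hypothetical per-mention output of a concept-level sentiment model:
# one (concept, sentiment label) pair for each mention in the dataset.
mentions = [
    ("login", "negative"),
    ("login", "negative"),
    ("login", "neutral"),
    ("design", "positive"),
    ("design", "positive"),
    ("design", "negative"),
]

def sentiment_distribution(mentions):
    """Return, for each concept, the share of its mentions carrying each label."""
    counts = defaultdict(Counter)
    for concept, label in mentions:
        counts[concept][label] += 1
    distribution = {}
    for concept, counter in counts.items():
        total = sum(counter.values())
        distribution[concept] = {label: counter[label] / total for label in LABELS}
    return distribution

distribution = sentiment_distribution(mentions)
for concept, shares in distribution.items():
    print(concept, shares)
```

Keeping the counts per concept rather than per document is what lets a mostly positive response still surface a negative view of one specific topic.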

https://luminoso.com/solutions/concept-level-sentiment