Curated for content, computing, and digital experience professionals

Why Adding Semantics to Web Data is Difficult

If you are grappling with Web 2.0 applications as part of your corporate strategy, keep in mind that Web 3.0 may be just around the corner. Some folks say a key feature of Web 3.0 is the emergence of the Semantic Web where information on Web pages includes markup that tells you what the data is, not just how to format it using HTML (HyperText Markup Language). What is the Semantic Web? According to Wikipedia:

“Humans are capable of using the Web to carry out tasks such as finding the Finnish word for “monkey”, reserving a library book, and searching for a low price on a DVD. However, a computer cannot accomplish the same tasks without human direction because web pages are designed to be read by people, not machines. The semantic web is a vision of information that is understandable by computers, so that they can perform more of the tedious work involved in finding, sharing and combining information on the web.” (http://en.wikipedia.org/wiki/Semantic_Web).

To make this work, the W3C (World Wide Web Consortium) has developed standards such as RDF (Resource Description Framework, a framework for describing properties of data objects) and SPARQL (SPARQL Protocol and RDF Query Language, http://www.w3.org/TR/rdf-sparql-query/) that extend the semantics that can be applied to Web-delivered content.
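As a rough illustration of the idea (not of actual RDF or SPARQL syntax), an RDF statement can be thought of as a subject, predicate, object triple, and a query as a pattern matched against a set of such triples. The Python sketch below uses plain tuples; the example.org URLs, the sample data, and the `match` helper are all made up for illustration.

```python
# Each statement is a (subject, predicate, object) triple, as in RDF.
# All example.org URLs and values below are hypothetical.
TRIPLES = [
    ("http://example.org/book/1", "http://example.org/terms/title", "Moby-Dick"),
    ("http://example.org/book/1", "http://example.org/terms/price", "9.99"),
    ("http://example.org/book/2", "http://example.org/terms/title", "Walden"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "Give me every title", expressed as a pattern over the predicate.
titles = [o for (_, _, o) in match(TRIPLES, predicate="http://example.org/terms/title")]
print(titles)
```

Because the predicate is a full URL rather than a bare word like "title", two publishers can only be matched by the same pattern if they agree on, or map to, the same identifier.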

We have been doing semantic data since the beginning of SGML, and later with XML, just not always exposing these semantics to the Web. So, if we know how to apply semantic markup to content, how come we don’t see a lot of semantic markup on the Web today? I think what is needed is a method for expressing and understanding the intended semantics beyond what current standards allow.

A W3C XML schema is a set of rules that describe the relationships between content elements. It can be written in a way that is very generic or format oriented (e.g., HTML) or very structure oriented (e.g., DocBook, DITA). Maybe we should explore how to go even further and make our markup languages very semantically oriented by defining elements such as <weight> and <postal_code>.

Consider, though, that the schema in use can tell us the names of semantically defined elements, but not necessarily their meaning. I can tell you something about a piece of data by using the <income> tag, but how, in a schema, can I tell you it is a net <income> calculated using the guidelines of the US Internal Revenue Service, and therefore suitable for eFiling my tax return? For that matter, one system might use the element type name <net_income> while another might use <inc>. Obviously an industry standard like XBRL (eXtensible Business Reporting Language) can help standardize vocabularies for element type names, but this cannot be the whole solution or XBRL use would be more widespread. (Note: no criticism of XBRL is intended; I am just using it as an example of how difficult the problem is.)

Also, consider the tools in use to consume Web content. Browsers only in recent years added XML processing support in the form of the ability to read DTDs and transform content using XSLT. Even so, this merely allows you to read, validate and format non-HTML tag markup, not truly understand the content’s meaning. And if everyone uses their own schemas to define the data they publish on the Web, we could end up with a veritable “Tower of Babel” with many similar, but not fully interoperable data models.

The Semantic Web may someday provide seamless integration and interpretation of heterogeneous data. Tools such as RDF/SPARQL, as well as microformats (embedding small, specialized, predefined element fragments in a standard format such as HTML), metadata, syndication tools and formats, industry vocabularies, powerful processing tools like XQuery, and other specifications can improve our ability to treat heterogeneous markup as if it were more homogeneous. But even these approaches are addressing only part of the bigger problem. How will we know that elements labeled with <net_income> and <inc> are the same and should be handled as such? How do we express these semantic definitions in a processable form? How do we know they are identical or at least close enough to be treated as essentially the same thing?
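One very simple way to make such a definition processable is an explicit table that maps each local element name to a shared concept identifier. The Python sketch below is only that: the concept URI, the two sample documents, and the `values_for_concept` helper are invented for illustration, and nothing here comes from a real vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical shared identifier for the concept "net income".
NET_INCOME = "http://example.org/concepts/NetIncome"

# Hand-maintained mapping from local element names to the shared concept.
CONCEPT_FOR_ELEMENT = {
    "net_income": NET_INCOME,
    "inc": NET_INCOME,
}

# Two invented documents that name the same thing differently.
DOC_X = "<report><net_income>1200</net_income></report>"
DOC_Y = "<filing><inc>950</inc></filing>"


def values_for_concept(xml_text, concept):
    """Collect the text of every element whose name maps to the concept."""
    root = ET.fromstring(xml_text)
    return [
        el.text
        for el in root.iter()
        if CONCEPT_FOR_ELEMENT.get(el.tag) == concept
    ]


figures = values_for_concept(DOC_X, NET_INCOME) + values_for_concept(DOC_Y, NET_INCOME)
print(figures)
```

The table only records that someone has decided the two names mean the same thing. It does not say how that decision was reached or who agreed to it, which is the harder part of the problem.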

Defining semantics effectively and broadly is a conundrum faced by many industry standard schema developers and system integrators working with XML content. I think the Semantic Web will require more than schemas and XML-aware search tools to reach its full potential in intelligent data and the applications that process it. What is probably needed is a concerted effort to build semantic data and the tools that can process it, including browsing, data storage, search, and classification tools. There is some interesting work being done in the Technical Architecture Group (TAG) at the W3C to address these issues as part of Tim Berners-Lee’s vision of the Semantic Web (see for a recent paper on the subject).

Meanwhile, we have Web 2.0 social networking tools to keep us busy and amused while we wait.

2 Comments

  1. Frank Gilbane

    Great explanation Dale. I agree, although I am less optimistic about the outcome. http://gilbane.com/blog/2006/11/web_20_30_and_so_on/

  2. Bob DuCharme

    Hi Dale,

    This is why the semantic web is built around URLs, not just element names. If someone refers to a “title” and you don’t know whether that person is an HR administrator who means “job title” or a real estate dealer who is referring to the deed to a piece of property, you don’t know what they mean. However, if I refer to a , you know that I mean the title of a work.

    In GE’s XBRL financial statement, they use http://www.xbrl.org/us/fr/common/pte/2005-02-28#usfr-pte:NetIncome to identify net income. Being XBRL, the semantics of what they mean must be identified in a place that people can find.

    >How will we know that elements labeled with <net_income> and

    ><inc> are the same and should be handled as such.

    If company X refers to net income as and company Y refers to it as , then the following bit of OWL asserts that they’re the same thing, and a SPARQL query that uses the GE URL to say “get me net income figures” will get the others as well:

    <owl:ObjectProperty
        rdf:about="http://www.xbrl.org/us/fr/common/pte/2005-02-28#usfr-pte:NetIncome">
      <owl:equivalentProperty>
        <owl:DatatypeProperty rdf:about="http://www.x.com/ns/xbrl/net_income"/>
      </owl:equivalentProperty>
      <owl:equivalentProperty>
        <owl:DatatypeProperty rdf:about="http://www.y.com/some/path/inc"/>
      </owl:equivalentProperty>
    </owl:ObjectProperty>
    

    (You can call the syntax wordy, but if it was terser people would call it cryptic.) This is the beauty of OWL’s role as metadata that adds value to existing bodies of data. The nice thing about its relationship to XBRL is that much of XBRL is about defining taxonomies and semantics, and OWL is about building on such definitions to get more value out of data.

    More at http://www.snee.com/xml/xml2006/owlrdbms.html.
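To make the mechanics of the comment above a little more concrete, here is a minimal Python sketch of how equivalence declarations can widen a lookup. It is not a SPARQL engine or an OWL reasoner: it reads the OWL fragment with the standard library XML parser and follows the declarations in one direction only, whereas a real reasoner would also treat equivalence as symmetric and transitive. The `rdf:RDF` wrapper, the `FACTS` list and its figures, and the `equivalents` helper are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
GE_NET_INCOME = "http://www.xbrl.org/us/fr/common/pte/2005-02-28#usfr-pte:NetIncome"

# The OWL fragment from the comment, wrapped in rdf:RDF so that the
# namespace prefixes are declared and the fragment parses on its own.
OWL_XML = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:owl="{OWL}">
  <owl:ObjectProperty rdf:about="{GE_NET_INCOME}">
    <owl:equivalentProperty>
      <owl:DatatypeProperty rdf:about="http://www.x.com/ns/xbrl/net_income"/>
    </owl:equivalentProperty>
    <owl:equivalentProperty>
      <owl:DatatypeProperty rdf:about="http://www.y.com/some/path/inc"/>
    </owl:equivalentProperty>
  </owl:ObjectProperty>
</rdf:RDF>
"""


def equivalents(owl_xml, prop):
    """Return prop plus every property declared equivalent on prop itself."""
    found = {prop}
    root = ET.fromstring(owl_xml)
    for decl in root:
        if decl.get(f"{{{RDF}}}about") != prop:
            continue
        for eq in decl.findall(f"{{{OWL}}}equivalentProperty"):
            for target in eq:
                found.add(target.get(f"{{{RDF}}}about"))
    return found


# Invented (company, property, value) facts; the figures are placeholders.
FACTS = [
    ("GE", GE_NET_INCOME, 100),
    ("X", "http://www.x.com/ns/xbrl/net_income", 20),
    ("Y", "http://www.y.com/some/path/inc", 30),
    ("Y", "http://www.y.com/some/path/revenue", 400),
]

# "Get me net income figures", asked using only the GE URL.
wanted = equivalents(OWL_XML, GE_NET_INCOME)
net_incomes = [(who, value) for (who, prop, value) in FACTS if prop in wanted]
print(net_incomes)
```

Asking with the GE URL picks up the X and Y figures as well, while the unrelated revenue property is left out.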


© 2024 The Gilbane Advisor
