Linked Data for the Web

Earlier I left two points to be discussed later. The first was a remark that when using http:-type URIs, there are expectations that something actually exists at that web URL. The second was the open question about how Semantic Web clients are supposed to find RDF data on the web. A new Semantic Web community movement under the name Linked Data seeks to provide some answers to these questions.

The notion of Linked Data is to bring the concept and benefits of hyperlinking between HTML documents on the World Wide Web to RDF documents on the Semantic Web. The core principle is that http:-type URIs should be used for RDF resources, so that RDF documents can exist at those locations describing the resources. When those documents mention other resources that also have http:-type URIs, SemWeb clients can jump from document to document, finding more information as they go.

To take an example: I have minted the URI http://www.rdfabout.com/rdf/usgov/geo/us/ny to represent the state of New York in the United States. If you visit that URL you will get back an RDF document describing New York. That document refers to other resources, which happen to have http:-type URIs that you could retrieve in turn to get documents describing them. Don’t confuse the documents you get back with the resources named by the URIs themselves. There’s no guarantee that the document you get back will even mention the resource named by that address (though that would certainly defeat the purpose).
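The document-to-document jumping described above can be sketched as a small breadth-first crawl. This is only an illustration: TOY_WEB is an in-memory stand-in for the web, the urn: entry is invented, and crawl and fetch_links are hypothetical names rather than part of any Semantic Web library. The one real link (New York to the United States) is taken from the RDF output shown later in this section.

```python
from collections import deque
from urllib.parse import urlparse

# Toy stand-in for the web: each URI maps to the URIs its document mentions.
TOY_WEB = {
    "http://www.rdfabout.com/rdf/usgov/geo/us/ny": [
        "http://www.rdfabout.com/rdf/usgov/geo/us",
        "urn:example:not-fetchable",  # invented; has no fetch protocol
    ],
    "http://www.rdfabout.com/rdf/usgov/geo/us": [],
}


def crawl(start, fetch_links, max_docs=10):
    """Visit documents breadth-first, following only http(s) URIs."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_docs:
        uri = queue.popleft()
        visited.append(uri)
        for target in fetch_links(uri):
            fetchable = urlparse(target).scheme in ("http", "https")
            if fetchable and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


order = crawl(
    "http://www.rdfabout.com/rdf/usgov/geo/us/ny",
    lambda uri: TOY_WEB.get(uri, []),
)
print(order)
```

A real client would replace the lambda with a function that fetches the document and extracts the URIs it mentions. The max_docs cap is there because a crawl over real linked data has no natural end.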

Terminology

I chose to use an http:-type URI so that it is “dereferenceable”. Dereferenceable is a term from the World Wide Web, not the Semantic Web. It describes a URI that is also a URL: a URI whose scheme, such as http:, specifies how to fetch a document at that address. The tag: and urn: schemes do not specify how to find documents for URIs of those types, so such URIs are not dereferenceable. You can’t put them in your browser and get back a document. URIs that you can put in your browser are dereferenceable.
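As a rough illustration, a client could decide whether a URI is worth trying to fetch by looking at its scheme. The sketch below assumes that only http and https count as fetchable, which is a deliberately narrow subset. The function name is_dereferenceable and the tag: and urn: example URIs are made up for this example.

```python
from urllib.parse import urlparse

# Schemes this sketch treats as having a fetch protocol (an assumed subset).
FETCHABLE_SCHEMES = {"http", "https"}


def is_dereferenceable(uri: str) -> bool:
    """Return True if the URI's scheme tells a client how to fetch a document."""
    return urlparse(uri).scheme.lower() in FETCHABLE_SCHEMES


print(is_dereferenceable("http://www.rdfabout.com/rdf/usgov/geo/us/ny"))  # True
print(is_dereferenceable("tag:example.org,2007:ny"))  # False
print(is_dereferenceable("urn:isbn:0451450523"))  # False
```

Passing this check only means a fetch can be attempted. It says nothing about whether a document actually exists at the address.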

The distinction above between document — what you can get back from a browser — and resource — something named by a URI — highlights some common terminology people use. An “information resource” is something that can be transmitted electronically. Documents, such as web pages and RDF/XML documents, images, and binary files are all information resources. “Non-information resources” are those resources that can be named by a URI but which cannot be transmitted electronically. Human beings, abstract concepts, etc. are non-information resources. As we’ve seen, both information and non-information resources can be named with URIs. However, browsers can only display information resources. They can display representations of non-information resources (such as pictures of people), but they are (by definition) incapable of displaying a non-information resource itself.

Linked Data Under the Hood

Using HTTP GET

According to web architecture standards, the HTTP 200 OK response is to be used only for URIs denoting information resources. So when you visit the URI above for New York, a non-information resource, the web server at the other end first sends back an HTTP 303 See Other response, i.e. a redirect. This indicates that the URI does not name something the web server can provide directly, because it cannot transmit a non-information resource over the wire. Instead it sends back a URL directing the user agent to an information resource, with the implication that information about the original URI can be found in the document at that URL.
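To make the exchange concrete, here is a minimal local sketch of both sides, using only the Python standard library. It is not how rdfabout.com is implemented. The paths /id/ny and /doc/ny.rdf, the placeholder body, and the names LinkedDataHandler, get_once and run_demo are all assumptions made for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical paths: one names a thing, the other a document about it.
THING_PATH = "/id/ny"
DOC_PATH = "/doc/ny.rdf"


class LinkedDataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == THING_PATH:
            # A non-information resource cannot be sent over the wire,
            # so point the client at a document describing it.
            self.send_response(303)
            self.send_header("Location", DOC_PATH)
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == DOC_PATH:
            body = b"placeholder for an RDF/XML document"
            self.send_response(200)
            self.send_header("Content-Type", "application/rdf+xml")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo quiet


def get_once(host, port, path):
    """Issue one GET without following redirects; return (status, Location)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def run_demo():
    """Start the toy server on a free port, request the thing, then the document."""
    server = HTTPServer(("127.0.0.1", 0), LinkedDataHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        host, port = server.server_address
        first = get_once(host, port, THING_PATH)
        second = get_once(host, port, first[1])
        return first, second
    finally:
        server.shutdown()
        server.server_close()


print(run_demo())
```

The first request should come back as a 303 with a Location header, and the second, made to that location, as a 200 with no Location.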

If you use Linux, you can observe this with the curl or wget command-line tools. Running curl as shown below prints out the redirect browsers get when going to the URI for New York.

Using curl to see the redirect

$ curl http://www.rdfabout.com/rdf/usgov/geo/us/ny
...
<p>The answer to your request is located <a href="http://rdfabout.com/sparql?query=DESCRIBE+%3Chttp://www.rdfabout.com/rdf/usgov/geo/us/ny%3E">here</a>.</p>
...

If you look carefully, you’ll see that in this case the redirect happens to take you to what looks like a dynamically-generated URL. More on this URL later. Often, however, you will be redirected to a static RDF/XML document (with a .rdf extension, for instance).

Use the -L option to follow the redirect:

Using curl to follow the linked data

$ curl -L http://www.rdfabout.com/rdf/usgov/geo/us/ny

<rdf:RDF xmlns:rdf="http://www.w...
    <usgovt:State rdf:about="http://www.rdfabout.com/rdf/usgov/geo/us/ny">
        <ns:title>New York</ns:title>
        <terms:isPartOf rdf:resource="http://www.rdfabout.com/rdf/usgov/geo/us" />
        <wgspos:lat rdf:datatype="http://www.w3.org/2001/XMLSchema#double">42.155127</wgspos:lat>
        ...

After following the redirect, an RDF/XML document that describes New York is returned.
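What a redirect-following client does can be sketched in a few lines. This is a simplified illustration, not curl's actual implementation. The follow function and its get_once parameter are hypothetical names, and the CANNED table stands in for a live server so that the example runs offline. The redirect target in the table is the URL from the curl output shown earlier.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow(url, get_once, max_hops=5):
    """Follow redirects until a non-redirect response arrives.

    get_once(url) must return (status, location). Returns (final_url, status, hops).
    """
    hops = []
    for _ in range(max_hops + 1):
        status, location = get_once(url)
        if status not in REDIRECT_CODES or not location:
            return url, status, hops
        url = urljoin(url, location)  # the Location header may be relative
        hops.append(url)
    raise RuntimeError("too many redirects")


# Canned responses standing in for the server.
THING = "http://www.rdfabout.com/rdf/usgov/geo/us/ny"
DOC = (
    "http://rdfabout.com/sparql?query=DESCRIBE+"
    "%3Chttp://www.rdfabout.com/rdf/usgov/geo/us/ny%3E"
)
CANNED = {THING: (303, DOC), DOC: (200, None)}

final_url, status, hops = follow(THING, CANNED.__getitem__)
print(status, len(hops))
print(final_url)
```

The hop limit guards against redirect loops. A real client would pass a get_once that performs an actual HTTP request and would then read the body of the final response.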

You may also want to try wget with the -S option to view the HTTP response headers:

Using wget to follow the linked data

$ wget -S -O /dev/null http://www.rdfabout.com/rdf/usgov/geo/us/ny