Linked Data - Part 1


Linked Data is a method of linking data from different data sources so that the combined data can be queried. The (idealistic) goal is to link as much data from as many data sources as possible, so all (or at least a lot of) knowledge of humankind may be queried using a relatively easy query language. Wouldn't it be nice to be able to ask "Give a list of movies that do not contain a leading actor who is known to be vegan"? Using (old) technology you would need to manually search for all actors that are known to be vegan, search for all movies, find out who the leading actor(s) are and filter out those movies that contain pretentious, uhhhh, I mean vegan leading actors. Technology like this helps us nerds in building something awesome like the computer from Star Trek, the ultimate goal of Google. Well, maybe not really like Google, because Google is more like a web spider that generates its own databases to be used for fancy queries. The idea of Linked Data is that those databases are distributed, open, linked and accessible by everyone (and not only in a web interface that shows ads to monetize good ideas).

I don't want to talk about...

I don't 'really' like to talk about Linked Data because current technology supporting Linked Data sucks. Don't get me wrong, the idea is awesome, but instead of focusing on the really hard problems the people behind it are shifting their focus to stupid things like "everything should be denoted by a URI" and "if the URI's scheme is HTTP then it should lead to a web page". Well, we could have expected this because the big man behind this Linked Data initiative is Tim Berners-Lee (a name you should know!). I don't find it very surprising that Berners-Lee, director of W3C, wants to talk about uniquely identifying stuff with URIs and XML schemas. Creating open standards is great, but first you'll have to create some product that people want to use. For the problem of "How to store and retrieve knowledge", coming up with an answer like XML, HTML front-ends or some SQL-like language is just missing the point. Another fallacy that annoys the crap out of me is people saying things like "adding 'tags' to pieces of text gives computers a semantic interpretation of that text", because it really doesn't work that way. Just writing that previous sentence makes me want to puke :(

I do want to talk about...

So what do I want to talk about? All those difficult questions, too often ignored, about storing knowledge and being able to retrieve that knowledge in a reasonable way. When I'm considering solutions I'll probably talk about data-structures like directed graphs, queues or hash-sets and will ignore all talk about standardization of network protocols or database-storage. I expect some minimal knowledge of the reader (that's you!) about the terms Linked Data and Semantic Web, as I'll try to incorporate some of the good ideas. While writing this text I'll be stealing some aspects of RDF, RDFS, OWL, SPARQL and probably a couple of other technologies, so don't get angry that I'm plagiarizing existing stuff, because I probably am (you have been warned). After describing a base data-structure for storing knowledge in a distributed fashion, we start talking about the limitations (of which there are plenty) and possible ways to 'fix' them. That is where the real fun starts! Computing knowledge, inferring information, uncertainty, conflicts of information. Oh joy, puzzles like these make me happy to be alive :)

Structured or Unstructured?

We need to store the knowledge somewhere so it'll have to be a little bit structured, but how structured does it need to be? For this problem we'll be looking at RDF and RDFS. If you are reading the Wikipedia pages or some W3C page then you are already reading a lot of unnecessary information, because I don't care that information is stored as triples or that nodes are uniquely identified by URIs. The only thing I care about is that RDF is basically a directed graph with multiple relation-types and that nodes 'can' be uniquely identified (obviously). So as an example, RDF describes something like a handful of nodes (say 'Alex' and 'Person') joined by labeled arrows (say 'Is a' and 'AnnoyedBy').
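
To make "a directed graph with multiple relation-types" a bit more tangible, here is a minimal Python sketch. It is not how any RDF library works; the class name KnowledgeGraph, its methods and the nodes Bob and Rex are all made up for illustration, and the relation names are borrowed from the query example further on.

```python
from collections import defaultdict


class KnowledgeGraph:
    """A directed graph whose edges carry a relation-type label (hypothetical sketch)."""

    def __init__(self):
        # source node -> relation-type -> set of target nodes
        self._edges = defaultdict(lambda: defaultdict(set))

    def add(self, source, relation, target):
        """Store one labeled edge: source --relation--> target."""
        self._edges[source][relation].add(target)

    def targets(self, source, relation):
        """All nodes reachable from `source` over one `relation` edge."""
        # .get() is used so that a lookup never creates empty entries.
        return set(self._edges.get(source, {}).get(relation, set()))

    def sources(self, relation, target):
        """All nodes that point at `target` over one `relation` edge."""
        return {
            source
            for source, relations in self._edges.items()
            if target in relations.get(relation, ())
        }


# Illustrative data only.
graph = KnowledgeGraph()
graph.add("Alex", "IsA", "Person")
graph.add("Bob", "IsA", "Person")
graph.add("Bob", "AnnoyedBy", "Alex")
graph.add("Rex", "IsA", "Dog")
graph.add("Rex", "AnnoyedBy", "Alex")
```

Asking graph.sources("IsA", "Person") walks the 'IsA' arrows backwards, and graph.targets("Bob", "AnnoyedBy") walks one arrow forwards. Nothing in the structure treats 'IsA' differently from 'AnnoyedBy'; both are just labels.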


There are a couple of special terms you can use, such as 'rdf:Property' (to state that node X is itself a relation-type) or 'rdf:List', 'rdf:first', 'rdf:rest' and 'rdf:nil' so you can build list data-structures, but the general idea is to make your own relation-types. A data structure such as this is nice because it is possible to freely link data, but this freedom also makes it difficult to retrieve information. If we type a query like "Select p as Person where p AnnoyedBy 'Alex'" (to select all persons that are annoyed by Alex) then the algorithm that parses the query needs to know that it has to select all nodes that have a relation 'Is a' towards Person. Another obvious problem is that knowledge which naturally lives in some other data-structure gets silly when it is forced into a directed graph. Data-structures like 'list' are still somewhat reasonable, but what if we like to store an audio-fragment in our knowledge-base?
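
A rough Python sketch of what that query boils down to, assuming the graph is kept as a plain set of (source, relation, target) tuples. The function name select, the constant TYPE_RELATION and the nodes Bob and Rex are hypothetical; the point is only that the interpreter has to be told which relation means "is of type".

```python
# Triples are (source, relation, target); the data is made up for illustration.
TRIPLES = {
    ("Alex", "IsA", "Person"),
    ("Bob", "IsA", "Person"),
    ("Bob", "AnnoyedBy", "Alex"),
    ("Rex", "IsA", "Dog"),
    ("Rex", "AnnoyedBy", "Alex"),
}

# The one relation-type the interpreter treats as special (hard-coded here).
TYPE_RELATION = "IsA"


def select(triples, type_name, relation, target):
    """Sketch of: Select p as <type_name> where p <relation> <target>."""
    # Step 1: everything that has the special type relation towards type_name.
    of_type = {s for (s, r, t) in triples if r == TYPE_RELATION and t == type_name}
    # Step 2: everything that has the requested relation towards the target.
    matching = {s for (s, r, t) in triples if r == relation and t == target}
    # The answer is whatever satisfies both.
    return of_type & matching


annoyed_people = select(TRIPLES, "Person", "AnnoyedBy", "Alex")
```

Rex is annoyed by Alex too, but only the hard-coded knowledge that 'IsA' carries type information keeps the dog out of a query about persons.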

Problem 1: If we like to query the data then some relation-types need to be 'special' so the interpreter of the query knows how to handle the request.
This is exactly what RDFS is meant to do. By introducing special terms such as 'rdfs:Class' (next to 'rdf:type' from RDF itself), a query-interpreter (for a language such as SPARQL) is able to retrieve the specific information that we request.
Problem 2: The only natively supported node data-structure is a string (or URI if you will :P), so we can't store things like pictures or audio fragments.
We are able to store files separately from the directed graph and create a reference in our knowledge graph such as "<file location> IsA Picture". Other data-structures like lists can be integrated in the graph itself.
I hope that you agree with me that both solutions to the two problems are a bit hacky and ad-hoc. This clearly shows the problem of creating a general purpose knowledge database. Adding additional requirements to make the knowledge structured and easier to retrieve also limits its use to only databases that are structured. This may not seem like a big issue as you could say "If you like the data to be easier to retrieve, then it makes sense to add a predefined structure to it", which is true of course if you are talking about facts. If we have a database that stores general rules, exceptions and perhaps even uncertainties (e.g. "Human IsAbleTo Walk", "Stephen Is Human", "Stephen IsUsuallyNotAbleTo Walk") we'd require a whole different method of querying for information. Simply said, we may not limit the use of the knowledge database to little factoids that have a 100% truth factor. Databases that do not contain any actual 100%-certainty facts or do not store types hierarchically should still be relatively easy to query (yes I know, easier said than done). Perhaps some solution can be found by storing the knowledge of how to query the knowledge in the database too. This is a little bit like the chicken and the egg problem (knowledge about knowledge about knowledge ...) but we do have to start somewhere.
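
To show why rules plus exceptions need a different kind of query than plain fact lookup, here is a small Python sketch built on the three statements above. The function is_able_to, its three answer strings and the extra node Alice are assumptions of mine; the rule "a statement about the subject itself beats an inherited one" is just one possible policy, and the sketch assumes the 'Is' relation contains no cycles.

```python
# The first three statements come from the text; Alice is added for contrast.
FACTS = {
    ("Human", "IsAbleTo", "Walk"),
    ("Stephen", "Is", "Human"),
    ("Stephen", "IsUsuallyNotAbleTo", "Walk"),
    ("Alice", "Is", "Human"),
}


def is_able_to(facts, subject, ability):
    """Return 'usually not', 'yes' or 'unknown' for one subject and ability."""
    # 1. A statement about the subject itself beats anything inherited.
    if (subject, "IsUsuallyNotAbleTo", ability) in facts:
        return "usually not"
    if (subject, "IsAbleTo", ability) in facts:
        return "yes"
    # 2. Otherwise fall back on the general rules of whatever the subject 'Is'.
    #    (Assumes the 'Is' relation has no cycles, or this recurses forever.)
    for (source, relation, kind) in facts:
        if source == subject and relation == "Is":
            answer = is_able_to(facts, kind, ability)
            if answer != "unknown":
                return answer
    return "unknown"
```

A plain lookup for ("Stephen", "IsAbleTo", "Walk") finds nothing, and naive inheritance would answer "yes"; only a query method that knows about exceptions gets to "usually not".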

Storing information like pictures or audio outside the database and just referencing the file locations is also "mweah", for lack of a better word :). These file-blobs are stored somewhere in the hope that the client-application (most likely some sort of web-browser) is able to interpret them. Because these files are not really part of the knowledge database we cannot query the content of the blobs, only the metadata. Any information about the file-blobs that may be procured using computation (e.g. "select all audio fragments with at least 1 second of silence in it") is out of the question. Predefining allowed datatypes (strings, floating point numbers, images, ...) is also not a feasible option because we can neither predict nor enforce the datatypes that may be used.
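
For a feeling of what such a computed query needs, here is a sketch of the silence check in Python. It assumes a lot that the knowledge base would somehow have to know: that the blob is raw 16-bit signed little-endian mono PCM, what its sample rate is, and that "silence" means every sample stays under a threshold. The function name has_silence and the default threshold of 500 are made up.

```python
import struct


def has_silence(pcm_bytes, sample_rate, min_seconds=1.0, threshold=500):
    """True if the 16-bit mono PCM data has a quiet run of at least min_seconds.

    Assumes len(pcm_bytes) is a multiple of 2; struct raises an error otherwise.
    """
    needed = int(sample_rate * min_seconds)  # quiet samples in a row we need
    run = 0
    # iter_unpack walks the buffer one little-endian 16-bit sample at a time.
    for (sample,) in struct.iter_unpack("<h", pcm_bytes):
        run = run + 1 if abs(sample) <= threshold else 0
        if run >= needed:
            return True
    return False
```

None of this can run against a file the database only knows by location and metadata, which is exactly the complaint.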

Ah how lovely, we just started with the basic underlying data-structure and we are already changing stuff in our design. This is great, let's see where this is going!

Big change #1: We will support the storage of data in the nodes of the graph

Each and every node in the knowledge graph may store a blob inside its node. Well not exactly a blob, more like a byte-stream. Keep in mind that we are designing a distributed database and byte-streams go hand-in-hand with network protocols, which will play a big role in the actual implementation of multi-database-querying. Allowing streams could give our model some nice features in the future. Together with computed knowledge we could even create a UPnP server and client with on-the-fly remote and local transcoding of video streams using stream-wrapping techniques! *cough* Maybe I'm getting a little bit ahead of myself here *cough*. Note that the binary stream is completely type-less data so it could contain a video, a random-number-generator or simply 4 bytes that describe a floating-point number. Because the data is type-less it cannot 'really' be used by query algorithms; for that we would need to have 'knowledge' about the structure of the data in the byte-stream (more on that later).
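
A minimal sketch of such a node in Python, with the "4 bytes that describe a floating-point number" case as the payload. The class Node, its open() method and the helper read_float32 are hypothetical names, and the little-endian 32-bit layout is an assumption the reader of the stream has to bring along, because the node itself says nothing about it.

```python
import io
import struct


class Node:
    """A graph node that carries an untyped stream of bytes (hypothetical sketch)."""

    def __init__(self, name, payload=b""):
        self.name = name
        self._payload = payload

    def open(self):
        # Hand out a fresh file-like stream on every call, the way a
        # network-backed or file-backed implementation might.
        return io.BytesIO(self._payload)


def read_float32(node):
    """Interpret the first 4 bytes of the stream as a little-endian float.

    Nothing in the node says this is valid; the caller needs outside
    knowledge about the layout of the byte-stream.
    """
    (value,) = struct.unpack("<f", node.open().read(4))
    return value


# Illustrative node: the same 4 bytes could just as well be read as an integer.
temperature = Node("Temperature", struct.pack("<f", 21.5))
```

Reading the same stream with struct.unpack("<i", ...) would happily return some integer instead, which is why a query algorithm can't do much with the bytes until the structure is described somewhere.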

Big change #2: We will allow the author(s) of the knowledge database to define how queries are executed

Instead of supplying a special set of relation-types that tells the query-parser how to handle the request, we will store some knowledge about how queries should be parsed as knowledge itself. We will not be as ambitious as allowing total freedom in query structure (such as using natural language), but some form of freedom should be gained by interpreting knowledge dynamically (as opposed to hard-coding it). Perhaps we could use some form of core querying language, defined by an LL(k) grammar, that can be dynamically extended with new terminals and non-terminals which translate directly to core terminals, core non-terminals or simple sub-queries. It would allow people (with intimate knowledge about the core querying language) to extend the querying language to a certain extent. I'm not sure if this will be the best approach and we probably need to experiment to find the best way of solving this puzzle. I'm just thinking/typing out loud now, but it seems like a reasonable way to go.
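
Thinking out loud in code, here is one way the extension step could look in Python. It skips the grammar and parsing entirely and only shows the idea that a new query term is defined by ordinary triples that rewrite it into core terms. Everything here is an assumption for illustration: the 'RewritesTo' relation, the term AnnoyedPerson, the "relation target" shape of a core term, and the functions expand and run_query.

```python
TRIPLES = {
    ("Alex", "IsA", "Person"),
    ("Bob", "IsA", "Person"),
    ("Bob", "AnnoyedBy", "Alex"),
    ("Rex", "IsA", "Dog"),
    ("Rex", "AnnoyedBy", "Alex"),
    # Knowledge about querying, stored next to the ordinary knowledge:
    ("AnnoyedPerson", "RewritesTo", "IsA Person"),
    ("AnnoyedPerson", "RewritesTo", "AnnoyedBy Alex"),
}


def expand(triples, term):
    """Rewrite an extension term into core terms; core terms pass through.

    Assumes the rewrite rules contain no cycles.
    """
    rewrites = sorted(t for (s, r, t) in triples if s == term and r == "RewritesTo")
    if not rewrites:
        return [term]
    core = []
    for rewrite in rewrites:
        core.extend(expand(triples, rewrite))
    return core


def run_query(triples, terms):
    """Return the nodes that satisfy every term, after expansion."""
    result = None
    for term in terms:
        for core_term in expand(triples, term):
            # A core term is "<relation> <target>"; anything else fails here.
            relation, target = core_term.split(" ", 1)
            matches = {s for (s, r, t) in triples if r == relation and t == target}
            result = matches if result is None else result & matches
    return result if result is not None else set()
```

With this, run_query(TRIPLES, ["AnnoyedPerson"]) and run_query(TRIPLES, ["IsA Person", "AnnoyedBy Alex"]) are the same query, and adding a new term means adding triples instead of changing the interpreter.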

The Downside

Besides being very difficult to implement, these two changes will introduce so-called special nodes and relations; something we said we'd like to prevent. For instance we need to store knowledge about the structure of the byte-stream if we want to be able to actually use it, and we also need knowledge about how queries are executed. The query parser needs this knowledge to be able to do its job; without it, it can't do anything (fancy). We need to make this sacrifice if we want knowledge databases to be less structured without negatively affecting their usability. I just don't see any other way.

Next Time

Next time we'll probably talk and brainstorm about the basics of linking databases together, knowledge that can be inferred and consistency rules. So, yeah. Bye, have a beautiful time!
