Linked Data - Part 4

Let's revisit the basic structure we are using for storing knowledge. On the previous page about Linked Data we used some strange structures such as:

  • JLP HasNickname2 'Number 0'
  • Person HasPropery HasFirstName in combination with JLP HasFirstName Jean-Luc
If we keep saying that we are using a directed graph with different relation-types (a labeled multidigraph if you are a NERD), then we probably have some explaining to do. Do we uniquely identify nodes with strings? If a node is used as a relation, are we talking about the same 'thing'? Let's look at a sample data-structure from RDF by stealing a picture from Wikipedia without even the slightest bit of remorse for plagiarizing existing work:
[Image: RDF]

Those prefix thingies in the example like "rdf:" are just shorthand notations for URIs like "http://www.w3.org/1999/02/22-rdf-syntax-ns#", so it isn't necessary to type complete URIs each and every time. RDF also allows a relation from a relation to a node (see 'rdfs:range'), unlike traditional graph-like structures. Allowing relations that do not go from node to node adds a whole level of freedom to the data-structure, but do we really need that? Freedom in our base data-structure is good of course, but keep in mind that the more restrictions we apply, the easier it becomes to retrieve knowledge. If we had wanted lots and lots of freedom we could just as easily have defined our knowledge base as an unstructured byte-array!
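To make the prefix shorthand a bit more concrete, here is a minimal sketch of how a prefixed name could be expanded into a full URI. The `PREFIXES` table and the `expand` function are made up for illustration; only the two namespace URIs are the real W3C ones.

```python
# Hypothetical prefix table; the two URIs are the standard W3C namespaces.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand(name: str) -> str:
    """Expand a prefixed name such as 'rdf:type' into a full URI."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    # Unknown prefix or already a full URI: leave it untouched.
    return name


print(expand("rdf:type"))
print(expand("rdfs:range"))
```

A name with an unknown prefix (or a complete URI) simply comes back unchanged, which is one possible design choice, not something RDF prescribes.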

Wait! cat2 isn't of type cat? Oh well, let's just say it's an unfinished knowledge database. So what advantages do we get from allowing non-traditional relations, and is it worth it? On our previous page we used such a relation in combination with a consistency rule to store knowledge about how to define a tabular-like structure. Could we also have done that with a normal graph-structure? No we couldn't have, not in a direct way at least. To store knowledge about something we need to create a relation from that something towards something else. If we want to store knowledge about relation-types, we need to create a relation from that relation-type towards something else. Welllllllllllllllll.... there is an alternative, kind of:

  • MyRelation1 IsA Relation
  • MyRelation1 LeavesFrom MyNodeX
  • MyRelation1 GoesTo MyNodeY
  • MyRelation1 HasRelationType MyRelationTypeA
  • ...
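The bullets above could be generated mechanically. The following is only a sketch: the `reify` function and the `MyRelation<n>` naming scheme are assumptions made for this example, while the four relation names come straight from the list.

```python
from itertools import count

_ids = count(1)  # hypothetical id generator for the new relation nodes


def reify(source, relation_type, target):
    """Describe one relation with four ordinary triples about a new node."""
    rel = f"MyRelation{next(_ids)}"
    return [
        (rel, "IsA", "Relation"),
        (rel, "LeavesFrom", source),
        (rel, "GoesTo", target),
        (rel, "HasRelationType", relation_type),
    ]


triples = reify("MyNodeX", "MyRelationTypeA", "MyNodeY")
for triple in triples:
    print(triple)
```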

Seems a bit silly, no? To describe a single relation we're adding at least 4 other relations to the knowledge-database. This is the knowledge-about-knowledge-about-knowledge problem all over again. Using this structure there wouldn't be any need to have non-traditional relations (because a relation has become a node), but at what cost? We don't have to use RDF-like triples of course; maybe we could just use tuples (an unlabeled digraph). Encoding the same information using tuples would look like:

[Image: UsingTuples]
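In code, such an encoding might look like the sketch below. It assumes that every typed relation gets one extra node plus three plain edges, which is what the query further down and the pros-and-cons list suggest. The `to_tuples` function and the `rel<n>` node names are invented for this example.

```python
def to_tuples(triples):
    """Encode labeled triples as unlabeled (from, to) tuples.

    Assumed encoding: every triple (s, p, o) gets a fresh relation node r
    and becomes the three tuples (s, r), (r, p) and (r, o).
    """
    tuples = []
    for i, (s, p, o) in enumerate(triples, start=1):
        r = f"rel{i}"  # hypothetical name for the relation-instance node
        tuples += [(s, r), (r, p), (r, o)]
    return tuples


for edge in to_tuples([("cat1", "Type", "Cat")]):
    print(edge)
```

Because the relation-instance (`rel1` here) is now an ordinary node, we can attach further tuples to it, which is exactly the extra freedom discussed next.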

Because the data-structure has become less restrictive we have more possibilities (e.g. it has become possible to store knowledge about relation-instances), but queries will also become more difficult. If we want to query "give a list of all cat-instances", we need to:

  • select X such that "X Y", "Y TypeRelation" and "Y Cat"

instead of:

  • select X such that "X Type Cat"
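To compare the two queries side by side, here is a naive sketch over in-memory Python sets. The sample data (`cat1`, `cat2`, `cat3`, `r1`, ...) and both function names are assumptions made for this example, and no attempt is made to be efficient.

```python
# Labeled version: one triple per fact.
triples = {
    ("cat1", "Type", "Cat"),
    ("cat2", "Type", "Animal"),
    ("cat3", "Type", "Cat"),
}

# Unlabeled version: each fact needs a relation node and three tuples.
tuples = {
    ("cat1", "r1"), ("r1", "TypeRelation"), ("r1", "Cat"),
    ("cat2", "r2"), ("r2", "TypeRelation"), ("r2", "Animal"),
    ("cat3", "r3"), ("r3", "TypeRelation"), ("r3", "Cat"),
}


def cats_from_triples(triples):
    # select X such that "X Type Cat"
    return sorted(s for (s, p, o) in triples if p == "Type" and o == "Cat")


def cats_from_tuples(tuples):
    # select X such that "X Y", "Y TypeRelation" and "Y Cat"
    return sorted(
        x for (x, y) in tuples
        if (y, "TypeRelation") in tuples and (y, "Cat") in tuples
    )


print(cats_from_triples(triples))
print(cats_from_tuples(tuples))
```

Both functions return the same list of cats; the tuple version just needs a three-way join to get there.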
We need to decide. Do we use triples or tuples, labeled or unlabeled? This is one of the more annoying things when reading scientific papers (or W3C documents); they almost never explain 'why'. I've been reading quite a bit about triple-stores, RDF and whatnot but never, not once, has anyone asked "Why should we use triples? Aren't there alternatives?". They just give the definition: "We are using triples and this is what you can do with them". Where is the reasoning, and what information have they used to make an informed decision? I always thought that science was about acquiring knowledge by asking the right questions, reasoning, making predictions and falsifying them. Since when has it been enough just to write down dreams, proofs and conclusions? Well, I'd better stop nagging; I don't want the RDF-coppers to arrest me in the middle of the night and tell me how wrong I am for having the audacity to try without the approval of a publicly acknowledged consortium.

Let's just try to list pros and cons.

| # | Unlabeled (Tuples) | Labeled (Triples) |
|---|---|---|
| 1 | Less intuitive for defining typed relations (for one typed relation we need to add 1 node and 3 relations, or 3 tuples, depending on how we store the graph) | More intuitive for defining typed relations |
| 2 | Less computationally efficient for querying information (most likely) | More computationally efficient for querying information (most likely) |
| 3 | General (still intuitive when used to represent a triple-store) | Specific |
| 4 | More intuitive for knowledge about knowledge (e.g. "SpecificRelationX HasBeenAddedBy JohnMalkovich") | Less intuitive for knowledge about knowledge |
| 5 | No weird relation-type to node relations, just a normal graph | Very uncommon data-structure (cannot be classified as a graph) |

Wow, I'm really doubting myself now. I thought that a triple-store (or something similar) would be the best approach. The most prominent weaknesses of an unlabeled directed graph do not really bother me that much. For example:

  • May become 'messy', because less structure is enforced. Not a real problem if we also allow author-defined consistency rules.
  • Less intuitive for storing information. Don't care that much about this, because we are only talking about the underlying data-structure and not necessarily the way that we represent it in a GUI.
  • Less intuitive for querying information. True, but we want to design an extensible query-language that should be able to remedy this issue.
  • Less computationally efficient. Although probably true, this has yet to be proven. With smart techniques such as hash-sets, caching and data-alignment it may end up being just as fast.
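As a rough idea of what the hash-set approach from the last bullet could look like, here is a small sketch of a tuple store that indexes every edge in both directions. The `TupleStore` class, its method names and the sample nodes are all hypothetical, and whether this really ends up as fast as a triple index is exactly the part that has yet to be measured.

```python
from collections import defaultdict


class TupleStore:
    """Unlabeled digraph with hash-set indexes in both directions."""

    def __init__(self):
        self.successors = defaultdict(set)    # node -> nodes it points to
        self.predecessors = defaultdict(set)  # node -> nodes pointing to it

    def add(self, src, dst):
        self.successors[src].add(dst)
        self.predecessors[dst].add(src)

    def instances_of(self, type_relation, type_node):
        """All X with "X Y", "Y type_relation" and "Y type_node"."""
        # Relation nodes that point at both the type relation and the type.
        rel_nodes = (self.predecessors.get(type_relation, set())
                     & self.predecessors.get(type_node, set()))
        result = set()
        for r in rel_nodes:
            result |= self.predecessors.get(r, set())
        return result


store = TupleStore()
for edge in [("cat1", "r1"), ("r1", "TypeRelation"), ("r1", "Cat"),
             ("cat2", "r2"), ("r2", "TypeRelation"), ("r2", "Animal")]:
    store.add(*edge)

print(store.instances_of("TypeRelation", "Cat"))
```

The query becomes one set intersection plus a lookup per matching relation node, instead of a scan over all tuples.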

Hmm, I probably need to sleep on this. Am I missing something here? Currently it looks like the balance is shifting towards a normal directed graph. Darn, this would change a lot of things I thought about. Let's just keep the talk about globally unique identifiers for some other time. Stupid brain, why didn't you see this coming?
