Linked Data - Part 5

Dear Diary, today I felt very sad.  A bird pooped on me. Why?

Was that a haiku? I don't know. It felt like one, isn't that what counts?

Tuples or Triples, that's the question?

LetKatouHelpYouSee what I did there? Is it a question whether that is the question, or just.... Damn it random guy #5,903,454,104, we need to focus, the future of humankind is at stake! Maybe my thoughts are a bit scattered because we need to make a big decision that'll affect pretty much everything along the line. Just clear your mind, open your mind, open your miiind.

Should we use the unorthodox RDF-approach? No? Well perhaps there is no good and bad answer here. Okay okay, I think this will be reasonable:

  • If we are able to enforce a triple-like structure on our unlabeled directed graph using author defined consistency-rules and querying that data will become just as simple as querying for triplets in SPARQL, then it's okay for us to use unlabeled relations.

Enforcing structure on a directed graph

For us being able to create rules, our naive graph is no longer sufficient. We need to know what nodes in the graph are meant as relation-types and what nodes are meant as relation instances. We need to know this because we can't define the consistency-rules without it.

Naive Graph

More realistic 'messy' graph

To enforce these typed relations I'd like to write these four rules:

Nr: Rule:Explanation:Visual Representation:
Not [A] RelationType and
Not [A] TypedRelationInstance and
[A] [B]
then it must be that
[B] TypedRelationInstance
If there is a 'normal' node [A] (not a typed relation instance or relation type), and that node has a relation so some other node [B], then that node must be a TypedRelationInstance node.Rule1v2
[A] [B] and
Not [B] TypedRelationInstance and
Not [B] RelationType
then it must be that
[A] TypedRelationInstance
If there is a node [A] that has a relation to some other 'normal' node [B] (not a typed relation instance or relation type), then that node [A] must be a TypedRelationInstanceNode node.Rule2v2
[A] TypedRelationInstance and
[A] [B] and
not [B] RelationType and
[C] RelationType
then it must be that
[A] [C]
If there is a node [A] that is a TypedRelationInstance that has at least one other relation (to some normal node [B]) then that node must also have a relation to a node that defines the relation type.Rule3v3
[A] TypedRelationInstance and
[A] [B] and
[A] [C]
then it may not be that
[A] [D]
If there is a node [A] that is a TypedRelationInstance that has at least two (other) relations, then there may not exist a third relation.Rule4

Writing rules in an undefined language is pretty simple, we'll just make things up as we go. As it turned out we need to have a 'not'-operator and each and every node/variable in a rule must be unique. So if variable [A] is used in the same rule as the node 'RelationType', then [A] cannot be RelationType itself. Also if variable [A] is used in the same rule as variable [B], then [A] is not equal to [B].

Algorithm for applying the rules

FunWithWheelsIt's probably time to start implementing a small prototype, so let's just try to implement a quick-'n-dirty program that is able to enforce structure on a graph. It may seem simple, but it is far from trivial. Enforcing a rule like "if [x] Person then it Must be that [x] LikesToEatMeat" is relatively simple, but it becomes more complex if the rule uses more variables. What about: "if [x] [y] and [y] [z] then it must be that [ z] [x]"? A big advantage of using a standardized data-structure is that lots of people have done research to problems like these. A simple google search for "graph pattern matching" gives us plenty of papers for good algorithms, so there is no need to reinvent the wheel here.

The first thing we learn when reading these papers is that we are talking about a variant of the subgraph isomorphism problem. That is quite a mouthful (huhhuh... *mouthful*), but according to wiki it's an NP-complete problem (all known algorithms don't scale nicely when the graph grows in size). This is not really good news, but there are some differences:

  • We often use nodes that refer to a specific node, so we're not always looking for isomorphism. This should make the problem easier.
  • We also use a 'not' relation to allow matching absent relations. This shouldn't have a big impact computationally, I think.
  • The enforcement-rule-graphs are usually static (once created and knowledge has been added to the database they are probably never modified). Having a fixed set of graphs to pattern-match could make the problem computationally easier (allows efficient caching).
  • We need to pattern-match every consistency rule on a modification of the knowledge-graph :(.

Because we mainly check the consistency rules when we are adding or removing nodes and relations it'd be best to check only those regions in the graph that have changed. So it's actually not a subgraph-isomorphism problem, just something similar. This is good news because it would be rather sucky that every add/remove on the graph requires an algorithm that does not scale well. We still need to find a good method for checking consistency of course.

Okay, let's just try something

Let's try to keep track of positive matches against the consistency-rule patterns and 'listen' for changes in the graph that may result in the fact that the pattern no longer applies. So for the node "Cat" for example we have checked:

  • Is it a successful match for variable [A] in the pattern for rule #1?
    Yes, namely:
    • for [B] = CatSubClassOfRelation
      Store "Rule #1, [A]=Cat and [B]=CatSubClassOfRelation" as a match (if it doesn't already exist)
  • Is it a successful match for variable [B] in the pattern for rule #1?
  • Is it a successful match for variable [A] in the pattern for rule #2?
  • Is it a successful match for variable [B] in the pattern for rule #2?
    Yes, namely:
    • for [A] = Cat1TypeRelation
      Store "Rule #2, [A]=Cat1TypeRelation and [B]=Cat" as a match (if it doesn't already exist)
    • .... (perhaps more if the database was bigger)
  • ...

So we'll have a whole list of  matches. Something like this:

Match Nr:Rule:Match:Should result in:
1.#1[A]=Cat1, [B]=Cat1TypeRelationCat1 -----> TypedRelationInstance
2.#2[A]=Cat1TypeRelation, [B]=CatCat1TypeRelation -----> TypedRelationInstance
3.#2[A]=Cat1TypeRelation, [B]=TypeRelationCat1TypeRelation -----> TypedRelationInstance
4.#3[A]=Cat1TypeRelation, [B]=TypeRelationCat1TypeRelation -----> [C], where [C] RelationType
5.#4[A]=Cat1TypeRelation, [B]=TypeRelation, [C]=CatCat1TypeRelation --X--> [D]
6.#4[A]=Cat1TypeRelation, [B]=Cat, [C]=TypeRelationCat1TypeRelation --X--> [D]

This approach seems to have some potential, it looks a bit like using dynamic programming on a ever-changing subgraph-isomorphism problem. It'll greatly increase the required size of a knowledge database, but using this cache we do not have to recheck the complete database every time a modification is made. On modification of the database we'll check the changes against the cache (to remove matches that are no longer valid), create new matches for the nodes/relations that have been added/removed and finally determine if the database is still considered consistent according to the consistency-rules. This seems like a reasonable plan of attack. Assuming this will work and it is running in a production environment, then we could even do consistency-checks in background-threads (there is no need to delay results of insert- or update-queries, consistency checks could be performed separately).

Next Time

So next time we will try to be more specific on how we can turn this idea in an actual algorithm. I hope it doesn't become too boring to write about....