Linked Data - Part 6

Today is all about creating a decent algorithm for pattern-matching in a graph. Last time we found a decent method to tackle the problem (I think), but we are still pretty far from having a workable solution.

First things first

There was one consistency rule that bothered me a little, namely rule #3. Don't worry, I'll just copy-paste it below so there is no need to open a new tab in your browser. (I like to pretend there are people reading this :P)

Rule: Explanation: Visual Representation:
If [A] TypedRelationInstance and [A] [B] and not [B] RelationType and [C] RelationType then it must be that [A] [C] If there is a node [A] that is a TypedRelationInstance that has at least one other relation (to some normal node [B]) then that node must also have a relation to a node that defines the relation type. Rule3v3

This rule is a little bit weird because we could interpret it in two ways:

  1. Blind variable matching: Find all combinations of [A], [B] and [C] such that the if-pattern is valid and for every match enforce that [A] has a connection to [C].
  2. Following the arrows: Find all combinations of [A], [B] such that [A] TypedRelationInstance and [A] [B] and not [B] RelationType and for every match enforce that [A] has a connection to some [C] such that [C]  RelationType.

We tried to write it down using interpretation #2, but interpretation #1 is way easier and it also seems a bit more logical. To rewrite this rule using blind variable matching we need to modify it so the rules are doing what they are meant to do. Keep in mind that blind variable matching only looks at the variables used in the if-part.

Rule: Explanation: Visual Representation:
If [A] TypedRelationInstance and [A] [B] and then it must be that [A] [C] If there is a node [A] that is a TypedRelationInstance that has at least one other relation then that node must have another relation to some other node. Rule3.0
If [A] TypedRelationInstance and [A] [B] and [A] [C] and not [B] RelationType then it must be that [C] RelationType If there is a node [A] that is a TypedRelationInstance and has two other relations to node [B] and [C], and [B] is not a RelationType then [C] must be a RelationType. Rule3.1v2
If [A] TypedRelationInstance and [A] [B] and [A] [C] and [B] RelationType then it may not be that [C] RelationType If there is a node [A] that is a TypedRelationInstance and has two other relations to node [B] and [C], and [B] is a RelationType then [C] may not be a RelationType. Rule3.2v2

Yeah! That is way better! This also helps in building the cache! Hmm, what about unconnected patterns such as "If [A] [B] and [C] [D] then it must be that [E] [F]"? I couldn't care less about the [E] [F] part, but that if-part is a bit troublesome. If the if-part is unconnected then a *lot* of matches may become possible. Hrmmm... there are a lot of ifs and buts here, but you know what? Screw it! Lets just force that the if-patterns must be weakly connected. It seems reasonable for most scenario's and we can always extend it to also support unconnected patterns sometime in the future. O wait! What about the 'not' relations? How do we interpret those for connectivity? I almost forgot about those! Simple not-operations are no issue at all, but as soon as we create a not-operation towards complex sub-patterns we'll be in a whole bunch of trouble:

NotTroubleV2 Computing not-relations is not the issue, caching pattern-match results for not-relations is. Basically we have two options here. We can take the lazy-mans approach and say that we don't allow complex sub-patterns or we'll have to create a definition for sub-patterns which is usable for our algorithm. It would be the right thing to do, but it's just soooo much work :(. Ugh, okay let's at least try to support complex sub-patterns.
Let's start by giving a clear-cut definition for patterns and matches so we have some solid ground to work on.

A definition for patterns and matches

A "directed graph", D, is a bunch of nodes and edges such that D=(V,A).
  • V is the set of vertices (nodes).
  • A is a set of ordered pairs of vertices (edges/arrows).
A "pattern", P, for directed graph D=(V, A) is a directed graph with two types of nodes and two types of  edges such that P=(V_{constant}, V_{variable}, A_{match}, A_{notmatch}).
  • V_{constant} is a set of 'constant' vertices such that V_{constant} \subseteqV.
  • V_{variable} is a set of vertices that our algorithm will interpret as variable.
  • A_{match} is a set of ordered pairs (a,b) of vertices such that a,b \in V_{constant} \cup V_{variable}}.
  • A_{notmatch} is a set of ordered pairs (a,b) of vertices such that a,b \in V_{constant} \cup V_{variable} that or algorithm with interpret as a 'not'-relation.
A "possible match", M, for pattern P=(V_{constant}, V_{variable}, A_{match}, A_{notmatch}) in directed graph D=(V,A) is a complete instantiation (v,c) \in M for every v \inV_{variable} such that c \in V.
  • For all (v_1, c_1) \in M there may not exist some other (v_2, c_2) \in M such that c_1=c_2
  • For all (v_1, c_1) \in M there may not exist some c_2 \in V_{constant} such that c_1=c_2
A "possible match graph", M_G, constructed from possible match M, for pattern P=(V_{constant}, V_{variable}, A_{match}, A_{notmatch}) in directed graph D=(V,A) is a directed graph with nodes and two types of edges such that M_G=(V_{match}, A_{exists}, A_{notexists}).
  • V_{match} is the set of vertices such that V_{match} \subseteqV. The set is constructed as: for every c_1 \in V_{constant} \rightarrow c_1 \inV_{match} and for every (v,c_2) \in M\rightarrow c_2 \inV_{match}
  • A_{exists} is a set of ordered pairs of vertices that tells the algorithm what edges must exist. The set is constructed as: for every (a,b) \in A_{match} \rightarrow (instantiate(a),instantiate(b)) \in A_{exists}.
  • A_{notexists} is a set of ordered pairs of vertices that tells the algorithm what edges may not exist. The set is constructed as: for every (a,b) \in A_{notmatch} \rightarrow (instantiate(a),instantiate(b)) \in A_{notexists}.
  • The function instantiate(x) looks up the actual node for x such that instantiate(x)=x if x \in V_{constant} and instantiate(x)=c if (x,c) \in M.
A "possible match graph" is called a "match graph" iff:
  • for all (v_1, v_2) \in A_{exists} it's true that (v_1, v_2) \inA
  • for all (v_1, v_2) \inA_{notexists} it's true that (v_1, v_2) \notinA
A "possible match" is called a "match" iff the "possible match graph" constructed from the "possible match" is called a "match graph".

Now What?

GuatemalaSinkHoleNow we'll have to detect sub-patterns, but first we need to define these so called sub-patterns. My body is not yet ready to pump out another boring sciency definition so let's use words and pictures instead. Who doesn't like pictures? Well I suppose not all pictures are fun, but I'll try to prevent creating pictures such as those from goatse.cx. For those not familiar with goatse.cx, it was a picture like the one on the right, just a little less flattering. Ahhh, you've gotta love the internet.

Ow my, I've lost focus for a bit. Right, right... "sub-patterns". A sub-pattern is a 'disconnected' part of a pattern (if seen as an undirected graph). If some part of the pattern is completely disconnected then we are talking about "permutation components". Wait! Pictures! We need pictures!

Permutation components and filter components

Lets create a pattern that may be bit weird, but valid according to our definition:

MyPattern

When we look at this pattern, we can clearly see three weak components. We could separate it into three sub-patterns and end up with:

MyPermutationComponents

We call these "permutation components" because the matches we find for these three sub-patterns will help in creating permutations for the 6 variables such that we can create matches for the pattern. Unfortunately we can't just blindly create permutations of all matches we find for the three sub-patterns, because of the uniqueness-property of nodes. For example, we cant use a match from permutation component #1 where [A]=Alex.

There are actually more sub-patterns in this example if we take a better look at the not-relations. If we pretend that the not-relations do not exist we see more sub-patterns whom we are going to call "filter components":

MyFilterComponts

These filter components are named as such because they help in filtering out possible matches. So let's take a look at how our algorithm should find all possible matches.

  • For every filter component create a collection of variable-instantiations such that the sub-pattern for the filter-component is a match.
  • For every permutation-component create valid permutations from the filter components but filter out matches if it isn't valid according to the not-relations.
  • Create valid permutations from the permutation components to get all matches for the pattern.

Although I'm talking about "create a collection", there is often no need to actually create a new collection in memory. Most collections will be simple wrapper-collections of existing data-structures so there is probably no need to worry about memory-usage. This workflow is still overly simplified, but we're closer at creating an algorithm for pattern matching then when we started typing today. Yay! It's been a good day!

line