Navigation

Saturday, February 05, 2011

Implementing SPARQL 1.1 Query - first findings

I am currently in the middle of implementing SPARQL 1.1 Query Language into Sesame 2 (code can be found in Sesame's subversion repository, branch 2.4). The current working draft specifies a number of new features for SPARQL, and I will briefly make some points about the features I have implemented thus far, noting problems I encountered or where the current working draft was unclear to me.

1. Expressions in SELECT

This new feature was fairly straightforward to implement, mainly as Sesame already had support for it in its query algebra. I only needed to adapt the parser.

2. Negation

In section 8 two additional operators are introduced, both of which can be used to express negation. They are (NOT) EXISTS, and MINUS. Implementation of the EXISTS function again was quite straightforward, Sesame already having algebraic support for it.

The definition of MINUS in SPARQL gave me some headaches, however. In Sesame's native query language SeRQL, the MINUS operator is a set operator operating on collections of triples - that is, the result of {A} MINUS {B} is the set of all triples matching A, minus all triples matching B. In SPARQL, however, MINUS is defined in terms of  compatible solutions. This means that Sesame's own algebra operator for MINUS can not simply be reused for SPARQL. However, it also seems that SPARQL's definition of MINUS makes it, for all practical purposes, exactly equivalent to using a NOT EXISTS filter. To see why this is, we have to take a look at the definitions of both operators.

In section 8.3 , the difference between NOT EXISTS and MINUS is explained, with a number of examples. This explanation shows that when the right-hand side pattern shares no variables with the left-hand pattern, the outcome is different. However, what is also apparent from this explanation that when a MINUS operator is used and no shared variables exist between the two patterns, the MINUS operator effectively does nothing.

This also follows if we look at  the definition of MINUS in the SPARQL algebra and the definition of compatible solutions in section 17.3: by definition any two solutions µ and µ' which share no variables v are compatible. So the outcome of any such query would be exactly the same as if the MINUS were not there. This leaves us with two scenarios:
  1. the two patterns share a variable, in this case the MINUS can be replaced with a NOT EXISTS;
  2. the two patterns do not share a variable, in this case the MINUS can be ignored.
All in all it seems to me that MINUS as currently defined does not add additional expressivity to the language and is only a syntactic variant. If that is intended, that might be useful to clarify in the working draft.

3. Subqueries

Sesame already having basic support for this in the algebra, again this was rather simple to add, as it only required me to tweak the SPARQL parser. I will probably need to test it further though.

4. Aggregates

Fortunately for me, it turns out that basic support for aggregate functions had been added to Sesame's algebra earlier, courtesy of David Huynh. It required a bit of tweaking to be compatible with SPARQL's definitions, but the framework was already there, ready for me to extend.

There are a number of things unclear in the working draft however, regarding the expected behaviour of aggregate functions.

The first problem has to do with datatypes. Most examples take it as given that all input to, say, a SUM operator will be numeric values. It is not clearly stated what the expected behaviour is if a particular variable binding turns out to be non-numeric value. As a case in point, SUM is formally defined in terms of the XPath function op:numeric-add. This function explicitly states that it operates only on specific numeric types. No mention is made however, of expected behaviour when one operand is not a numeric type (section 16.3 of the SPARQL WD does mention that a type error results for incompatible operands, but it is not clear if this also applies to aggregate functions). Moreover, it is not clearly stated how a type error in an aggregate function should influence the result. I can see a number of possible scenarios, when a type error occurs during evaluation of an aggregate function:
  1. the entire query fails with an error;
  2. the incompatible operand value is ignored and evaluation continues;
  3. the aggregate operator fails silently, returning 0.

From a usability perspective, I would probably have a preference for option 2, although I note that in other mathematical operators (+, -, *, etc.) Sesame's interpretation currently is that an incompatible operand results in a failed query.

Another problem with the definition of aggregates, or more in general with aggregates in combination with several other features, is that it is not always clearly defined how they should interact (disclaimer: perhaps it is properly defined in section 10.2, but I'm having a hard time following the definitions there). For example, what happens when we apply an ORDER BY on a graph pattern that already has a GROUP BY and an aggregate function? For example, take the following data set:

:org1 :affiliates :auth1, :auth2 .
:auth1 :name "John" .
:auth2 :name "Paul" .
:org2 :affiliates :auth3 .
:auth3 :name "Ringo" .

And the following query:

SELECT (GROUP_CONCAT(?name) AS ?names)
WHERE {
  ?org :affiliates ?auth .
  ?auth :name ?name.
}
GROUP BY ?org
ORDER BY ASC(str(?name))

My intuitive understanding would be that the result of this query would be:

?names     
"Ringo"
"John Paul"

That is: the ordering is applied to the intermediate result of the grouping, thus supplying the aggregate operator (in this case, GROUP_CONCAT) with an ordered sequence (which makes sure that we get a concatenated string "John Paul" rather than "Paul John"). But it is not completely clear to me from the working draft if the ORDER BY clause should be applied to a grouping in this fashion.

These are my findings thus far. I have not yet started on property paths or federated query. In the mean time, I would welcome any feedback on my notes, including feedback that tells me I should have read section so-and-so and it's all clear as glass if I had just taken the time to study it properly :)

Also, this: in the course of this work I have written several DAWG-Manifest style unit tests to check conformance as I saw it. They can be found in Sesame's SVN repository, and I'd be happy to let them be reused.