This is a collection of implementation notes by ericP.
RDF queries and rules match data that corresponds to a given graph pattern. Data from many non-RDF sources, e.g. relational databases, may be trivially transformed to a stem graph; an RDF graph which reflects native graph model. Stem graphs represent the trivial expression in RDF of non-RDF data formats e.g. relational, spreadsheets, textual. Stem data is brittle in the face of implementation changes, and requires consumers to learn the stem schema. An interface graph, a more Semantic-Web-friendly form of the data, can be expressed by applying deduction rules to re-use popular RDF predicates and graph structures. Consumers of the data will want to express queries and rules in terms of the interface graph.
Any system that requires materializing data transformation involves synchronicity problems. The deduction rules used to map the stem graph to the interface graph can be used to transform queries that match the interface graph to instead match the stem graph, which can then be trivially expressed as queries on the native data format, e.g. SQL or a relational execution plan. This document describes this process, deriving the transformations that are needed to match in the stem graph, the same data that would be matched in a materialized interface graph. The process is intended to be sound and complete with respect to the SPARQL logical connectives and sound with respect to datatype support.
Both RDF and Relational Databases are cyclic graph structures (e.g. an Employee table in a database may have a manager field which identifiers another Employee). While RDF is based on triples, relational data is based on n-ary tuples, where each position in the tuple has a prescribed name and data type. The expression of these tuples as triples is intuitive and practical, enabling simple tools to make relational data available to RDF tools. Earlier work ([SPASQL], [D2R], [SquirrelRDF]) has shown how RDF queries can be executed on relational databases, providing unambiguous querying on databases, despite other databases having coincidently named, though not necessarily similar, attributes. The default minimal installation for many of these systems required only a stem URI to serve as a base name for the relations, attributes and tuple identifiers.
[D2R], [SquirrelRDF] and [Virtuoso] provide languages for mapping queries over specific relations and attributes to particular classes and predicates. [SPASQL] is using deduction rules to provide a non-materialized view of the data, here called the interface graph, which allows custodians of the data to provide access to their proprietary data with common, shared attribute identifiers. These rules can be applied to materialize an interface graph, but that raises the usual concurrency issues associated with any data warehouse. These issues are avoided if the rules are used to transform the interface queries to work on the original data source, enabling a query expressed in terms of common, public taxonomies to perform with the same efficiency as a proprietary SQL query.
[SPASQL] is a set of libraries which can transform and express [SPARQL] queries. Its use has been prototyed in MySQL, where it functions as user-loadable parser which parses SPARQL queries and produces a MySQL-native execution plan which then executes like any execution plan produced by the SQL parser. [SPASQL] can also express queries as SQL (strings), though that is not discussed here.
The deduction rules used in this paper are equivalent to SPARQL CONSTRUCTs extended with named graphs in the constructed pattern, e.g.
CONSTRUCT { <graph1> { ?s <foo> ?o } <graph2> { ?s <bar> ?p } } WHERE { ?s <baz> ?o }
While many use cases can be met with the conventional SPARQL CONSTRUCT grammar, the full expressivity of the transformations in this paper can be parsed by changing the ConstructTemplate production to
[30] |
ConstructTemplate |
::= | '{' TriplesBlock? ( ( 'GRAPH' VarOrIRIref GraphTemplate ) '.'? TriplesBlock? )* '}' |
Other languages can be used to express these deduction rules, but this may limit the expressivity. For instance, N3 doesn't have anything corresponding to a SPARQL OPTIONALs, though some optionals can be expressed as multiple rules. Similarly, N3 rules don't have SPARQL's syntactically sequestered constraints, though it has a wide array of built-in logical predicates, which can be factored out and re-expressed as algebraic expressions if the query is transformed into e.g. SPARQL, SQL. Other candidate rule languages include RIF Basic Logic Dialect, SWRL, RuleML, and OWL.
Consider a deduction rule transforming a stem graph to an interface graph:
stem interface ?B pX ?A ?A p1 ?B ?C pY ?B => ?B p2 ?C ?C pZ [] <x>: { ?C p3 ?A } ?A pW ?D ?B p4 ?D ... ...
and an interface query that, when expressed in disjunctive normal form, has a component:
{ ?x p1 ?y } ∧ { ?y p2 ?z } ∧ ?g: { ?z p3 ?x } ∨ …
Given a sample query graph, we can compute the constructed interface graph:
B1 pX A1 A1 p1 B1 C1 pY B1 => B1 p2 C1 C1 pZ [] <x>: { C1 p3 A1 } B1 pW D1 A1 p4 D1
If we reverse the rule, we can re-write the query instead of the data. The interface query may match any subset of the interface graph, so the antecedent must make each arc optional (caveats discussed in a moment):
interface stem OPT ?A p1 ?B ?B pX ?A OPT ?B p2 ?C => ?C pY ?B <x>: { OPT ?C p3 ?A } ?C pZ [] OPT ?B p4 ?D ?A pW ?D ... ...
Matching the interface pattern against the pattern "asserted" by the query yields the following result set:
?A | ?B | ?C | ?D |
---|---|---|---|
?x | ?y | ?z | unbound |
Thus, extending SPARQL to allow variables in the input graph, we could use the CONSTRUCT semantics to produce a corresponding stem graph pattern:
{ ?y pX ?x } ∧ { ?z pY ?y } ∧ ?g: { ?z pZ [] }
However, this query throws away a potentially critical constraint in the antecedent of the mapping rule: ?A pw ?D
. This may or may not represent a real integrity constraint. If it does not, the antecedent should relfect this by enclosing it in an OPTIONAL. Assuming now that it does, the query needs to include the constaint ?x pW ?gen1
. Because we don't need the value of the variable ?D, we use a generated variable. If this variable ?D appears elsewhere in the pattern, the same variable (?gen1) is substituted. (Blank nodes could be used, except the SPARQL semantics stipulates that "labels for blank nodes are scoped to the basic graph pattern".) The final stem query:
{ ?y pX ?x } ∧ { ?z pY ?y } ∧ ?g: { ?z pZ [] } ∧ { ?x pW ?gen1 }
is generated by filling out all NULLs in the result set with generated symbols and performing a CONSTRUCT with the antecedent of the mapping rule. Just as apparently uninteresting triple and graph patterns (those that do not bind or restrict variables used in the portion of the interface rule tha corresponds to the interface query) represent potential integrity constraints, so to do any FILTERs applying to the antecedent of the mapping rule. ORDER BY is meaningless as the interface graph has no order. A real treatment of LIMIT and OFFSET in the rule antecedent is left for futre work.
Filters, limits and order in the interface query may be passed through to the stem query.
If our example deduction rule handles data that is potentially incomplete, we may add optional patterns to accomodate the potentially missing data.
Mapping Rule with Optionals:
stem interface ?B pX ?A ?A p1 ?B OPT { ?C pY ?B => ?B p2 ?C ?C pZ [] } <x>: { ?C p3 ?A } ?A pW ?D ?B p4 ?D ... ...
The SPARQL CONSTRUCT rules for unbound specify that construct triples patterns referencing unbound variables are silently ignored. Thus, the triple patterns ?B p2 ?C
and <x>: { ?C p3 ?A }
are consequent to the optional stem pattern which binds ?C. If we add that optional to the (yet empty) list of dependencies for ?C, we can add optionals to the stem query for each of the variables that is optionally bound in the interface graph. For instance, the following modification of our earlier interface query binds ?z optionally:
SELECT ?x ?y ?z ?g WHERE { ?x p1 ?y . OPTIONAL { ?y p2 ?z . GRAPH ?g: { ?z p3 ?x } } }
The corresponding graph pattern that matches the stem graph is:
SELECT ?x ?y ?z ?g WHERE { ?y pX ?x . OPTIONAL { ?z pY ?y . ?z pZ [] . } ?x pW ?gen1 }
We can verify this by creating data that does and does not match the optional clause in the stem pattern, and examining which stem data is retrieved by this stem query:
sample stem and query graph:
B1 pX A1 A1 p1 B1 C1 pY B1 => B1 p2 C1 C1 pZ [] <x>: { C1 p3 A1 } B1 pW D1 A1 p4 D1 B2 pX A2 A2 p1 B2 C2 pY B2 # no triple B2 pW D2 A2 p4 D2
Both the interface query, applied to the interface graph, and the stem query, applied to the stem graph, arrive at the same result set:
?x | ?y | ?z |
---|---|---|
A1 | B1 | C1 |
B2 | B2 | unbound |
In an interface query, an optional pattern involving only variables which were elsewhere non-optional would be a meaningless optional as there would be no graph constructed from the stem=>interface rule which would produce the non-optional pattern but not the optional pattern. For instance, using the same example mapping rule, the subject of the p3
arc is not actually optional in the following interface query:
SELECT ?x ?y ?z ?g WHERE { ?x p1 ?y . ?y p2 ?z . OPTIONAL { GRAPH ?g: { ?z p3 ?x } } }
Translating that OPTIONAL to the stem query:
SELECT ?x ?y ?z ?g WHERE { ?y pX ?x . ?z pY ?y . OPTIONAL { ?z pZ [] } }
would be an error as it would result in a binding of ?z=>C2 in the above sample stem graph, while C2 would never appear in the interface graph.
?x | ?y | ?z |
---|---|---|
A1 | B1 | C1 |
B2 | B2 | C2/unbound |
An optional graph pattern in the mapping rule antecedent may introduce variables that are used as constraints in later graph patterns. In the following stem pattern, C
is introduced in an optional pattern.
stem ?B pX ?A OPT { ?C pY ?B } ← may bindC
. ?C pZ [] ← more restrictive ifC
is bound. ?A pW ?D ...
A binding of C in the optional pattern constrains the subsequent match of ?C pZ [].
B1 pX A1 C1 pY B1 ← matches optional pattern and binds ?C to C1. C2 pZ [] ← no match for C1 pZ [] so the potential solution is eliminated. B1 pW D1
Including any graph pattern in the stem query requires the prior inclusion of all optional patterns which introduce variables used in that pattern. This can be handled by generating a mapping from variable to the graph pattern that introduces that variable. (This dependency list is also needed by an optimizer to ensure that optionals are not reordered in a way that changes the query solution.)
Any OPTIONAL graph patterns in the antecedent of the mapping rule are included in a stem query if some triple patterns which are introduced in and consequent to that OPTIONAL graph pattern match triple patterns in the interface query. Further, when expressing that graph pattern in stem query, the pattern is not optional unless every corresponding matched triple pattern in the interface query is OPTIONAL. Any optionals which introduce a variable used in an included portion graph pattern are also included. All other OPTIONALs may be safely omitted from the stem query.
The interface graph can matche the interface query in mutiple ways. For example, the interface triples with p2
as a predicate in:
stem interface ?B pX ?A ?A p1 ?B ?C pY ?B => ?B p2 ?C ?C pZ [] ?C p2 ?A ?A pW ?D ?B p4 ?D ... ...
match this pattern:
{ ?y p2 ?z } ∨ …
with two distinct sets of bindings:
?A | ?B | ?C | ?D |
---|---|---|---|
unbound | ?y | ?z | unbound |
?z | unbound | ?y | unbound |
This corresponds to a situation where a single match of the antecedent of the mapping rule produces an interface graph which is matched two ways by the interface query. The stem query that fulfills this query completely is a disjunction of all of the ways in which the antecedent could produce the appropriate interface query. It is also possible that some or all of these antecedent patterns are optional by the rules described above in Optionals in the Interface Query.
Given a stem graph SG, a set of mapping rules MRs, IG is an interface graph that is produced by executing MRs on SG and adding the product to an empty graph. An interface query QI, performed on IG produces a result set RSi. The following steps give results in a stem query QS that, when executed on QI, also produce RSi:
?B_gen1
(so that co-references will use the same variable).While there are many applications where a specific stem graph can be mapped to a non-RDF data model or query language, the most obvious is probaby SQL. Specific stem queries can be simply expressed as SQL or an execution plan in a relational databse. There are many possible mapping from relational data to RDF; the following one requires as input, a stem URI, a database, and some miscellaneous punctuation for separating components in constructed URIs:
The stem graph is composed of the set of tuple maps for each tuple in the database. A tuple map assigns an identifier for a tuple and makes RDF triple assertions expressing each attribute of that tuple. For any attribute in the tuple, the subject, predicate and object of each triple are:
For example, the following table can be converteed by supplying a stem URL http://hr.example/DB/
:
id | lastName | manager |
---|---|---|
18 | Johnson | NULL |
253 | Smith | 18 |
@prefix empl: <http://hr.example/DB/Employee/> . empl:id.18 empl:lastName "Johnson" . empl:id.253 empl:lastName "Smith" . empl:id.253 empl:manager empl:id.18 .
The above relational strings
(or varchars
, or however the last names might be encoded) are expressed as RDF literals
. For Common relational datatypes have corresponding W3C XML Schema (XSD) datatypes, which are available for a literal map, e.g. a mapping from a relational date
to an xsd:date
:
id | omitted columns | birthday |
---|---|---|
18 | ... | 1969-11-08 |
253 | ... | 1979-01-18 |
@prefix empl: <http://hr.example/DB/Employee/> . empl:id.18 empl:birthday "1969-11-08"^^xsd::date . empl:id.253 empl:birthday "1979-01-18"^^xsd::date .
The literal mapping used in examples in this paper is:
relational | RDF | ||
---|---|---|---|
type | SQL | type | SPARQL |
varchar | "text" | literal | "text" |
integer | 123 | xsd::integer | 123 |
float | 1.23 | xsd::float | 1.23 |
date | "2008-08-26" | xsd::date | "2008-08-26" |
The core SPARQL language demands only a subset of these XSD datatypes, however, support for additional datatypes is well-defined, and trivial to implement by pushing down into the relational database. For example, a SPARQL constraint manager.birthday="1969-11-08"^^xsd::date
is easy to convert to an SQL constraint manager.birthday="1969-11-08"
.
The above rule mapping has a few simplifications which make it difficult to use in the web of Linked Data [LD] (due to practical considerations when using HTTP). Examining the above stem graph from the perspective of a web browser, we see that:
http://hr.example/DB/Employee/lastName
and http://hr.example/DB/Employee/manager
. http://hr.example/DB/Employee/id.18
is both a web page and a tuple in the database.It is both convenient and scalable if the schema for a given data set is available in one comprehensive page, or a set of pages that describe similar properties. Thus, it is preferable if lastName
and manager
are identified as http://hr.example/DB/Employee#lastName
and http://hr.example/DB/Employee#manager
respectively. A schema page should be served at http://hr.example/DB/Employee
, describing both properties.
Because RDF has no prescribed way to use the same identifier in the distinct roles of a web page and the thing identified by that web page, conventions have been established to produce identifiers that do not identify web pages, but, when placed in a web browser, will resolve to an informative web page. Extensive deliberation has resulted in the recommendation to use either a hash fragment in an RDF document (e.g. http://hr.example/DB/Employee/id.18#record
) or any identifier (the one discussed, http://hr.example/DB/Employee/id.18
, would work) that is served with an HTTP 303 redirect. If we select the first approach, we get this graph, which is easily navigated by conventional Linked Data tools:
@prefix emplP: <http://hr.example/DB/Employee#> . @prefix emplN: <http://hr.example/DB/Employee/> . emplN:id.18#record emplP:lastName "Johnson" . emplN:id.253#record emplP:lastName "Smith" . emplN:id.253#record emplP:manager emplN:id.18#record .
The URIs used in the example were composed from:
subject map: | stem | '/' | table | '/' | attribute | '.' | value | # | fragment |
---|---|---|---|---|---|---|---|---|---|
predicate map: | stem | '/' | table | '#' | attribute |
Keys that are composed from more than one attribute may be addressed as well. For example, http://hr.example/DB/Employee/key1.value1.key2.value2#record
uses the pattern ['.'attribute'.'value]* to address the attributes in the key.
Given the mapping from relational data to stem graphs, it is possible to convert stem queries to relational queries. A mapped stem query is an SQL query that, when applied to any possible relational data, produces the same result set as would the stem query performed on the stem graph for that data. The following sections examine the logical components of SPARQL queries and illustrate how they are mapped to SQL.
A Basic Graph Pattern is set of triple patterns and filter constraints which have similar expressivity to their SQL counterparts, joins and restrictions. Just as tuple maps produce triples from tuple attributes, the process can be reversed, converting triple patterns to queries on tuples.
A triple pattern is attached to an attribute in an instance (alias) of a table, where the table and attribute are determined by parsing the predicate name , and the alias name is a function of the pair of the subject and the table. Variable predicates are not treated in this document, though an explicit enumeration of table attributes would probably suffice. Any two triple patterns with the same subject and predicates that differ only in attribute name are traversing properties of the same tuple. Triple patterns on the same table, but with different subjects, are addressing potentially different tuples in the same table, and thus require an extra join (and alias name) against that table. In this SPARQL query to select the names of employees and their managers,
PREFIX emplP: <http://hr.example/DB/Employee#> SELECT ?empName ?managName WHERE { ?emp emplP:lastName ?empName . ?emp emplP:manager ?manager . ?manager emplP:lastName ?managName }
the third triple pattern is on the same table, but is intended to match managers, rather than their employees. This can be expressed using different table aliases for each tuple map with the same table and different subjet. For readability in this document, these table names are derived from the variable name in the subject of the triple pattern:
SELECT emp.lastName, manager.lastName FROM Employee as emp INNER JOIN Employee as manager ON emp.manager=manager.id WHERE emp.lastName IS NOT NULL AND manager.lastName IS NOT NULL
The SPARQL graph pattern had a coreference on ?manager
; it was used as the object of the second triple pattern, and the subject of the third. This coreference is enforced in the join constraint emp.manager=manager.id
.
The IS NOT NULL
constraints in the WHERE clause account for the differences in expression of missing attributes in relational data and RDF. In relational data, missing attributes are given a value of NULL, while in RDF, the triple including them is not expressed. In SQL, NULL is not equal to anything else, including NULL, so variables and constants used in constraints have the same behavoir in SQL and SPARQL. non-NULL rule: any variables that are mentioned only once in non-optional parts of the query must be restricted to non-NULL values.
Queries may include node identifiers, such as this query to list the names of the employees who work for Employee 18:
PREFIX emplP: <http://hr.example/DB/Employee#> PREFIX emplN: <http://hr.example/DB/Employee/> SELECT ?empName WHERE { ?emp emplP:lastName ?empName . ?emp emplP:manager emplN:id.18 }
The includesion of a foreign key (manager
) introduces a join, and the value of that foreign key (18
) produces a constraint manager=18
. The query would be inconsistent if http://hr.example/DB/Employee#manager
and http://hr.example/DB/Employee/id.18
did not both refer to the same table.
SELECT emp.lastName FROM Employee as emp WHERE emp.manager=18 AND emp.lastName IS NOT NULL
If emplN:id.18
were replaced with a variable, say ?manager
, the SQL would need a join Employee AS manager
only if ?manager
were selected or if some property of manager were referenced, e.g. ?manager emplP:lastName ?manName
.
Literal and numeric value constraints in a graph pattern are represented as SQL constraints, expressed either in join constraints, on in an SQL WHERE expression. For example, a constraint on the manager's last name:
PREFIX emplP: <http://hr.example/DB/Employee#> SELECT ?empName WHERE { ?emp emplP:lastName ?empName . ?emp emplP:manager ?manager . ?manager emplP:lastName "Johnson" }
may be used to restriction of the candidate tuples for manager:
SELECT emp.lastName FROM Employee as emp INNER JOIN Employee as manager ON emp.manager=manager.id AND manager.lastName="Johnson" WHERE emp.lastName IS NOT NULL
or as a restriction on the final product of the join:
SELECT emp.lastName FROM Employee as emp INNER JOIN Employee as manager ON emp.manager=manager.id WHERE manager.lastName="Johnson" AND emp.lastName IS NOT NULL
SQL filters and SPARQL filters have similar arithmetic and boolean evaluation, e.g. manager.birthday > "1969-01-01"^^xsd::date
is expressed in SQL as manager.birthday > "1969-01-01"
. This document does not discuss the edges of datatype compatibility (e.g. maximum integer or float value) as this would require a survey of the SQL implementations.
Complex filter expressions, involving conjunction and disjunction, can be expressed in conjunctive normal form, and each conjunct can be moved into the earliest join that binds the variables in that conjunct. To demonstrate this, our example tables are extended to include many to many relations for employees and managers:
Example Tables:
id | lastName | birthday |
---|---|---|
18 | Johnson | 1969-11-08 |
253 | Smith | 1979-01-18 |
255 | Jones | 1981-03-24 |
19 | Xu | 1966-11-08 |
254 | Ishita | 1971-10-31 |
manager | manages |
---|---|
18 | 253 |
253 | 255 |
19 | 255 |
253 | 254 |
A query for third-line employees include multiple joins. A constraint on the second-line manager can constrain the join of that table. For example, in this query with many joins:
PREFIX emplP: <http://hr.example/DB/Employee#> SELECT ?empName ?grandManagName WHERE { ?emp emplP:lastName ?empName . ?emp emplP:birthday ?empBday . ?lower managP:manages ?emp . ?lower managP:manager ?manager . ?manager emplP:birthday ?manBday . ?upper managP:manages ?manager . ?upper managP:manager ?grandManager . ?grandManager emplP:birthday ?grandManBday . ?grandManager emplP:lastName ?grandManagName FILTER (?manBday < ?empBday && ?grandManBday < ?manBday) }
the filter constraint conjuncts (?manBday < ?empBday
and ?grandManBday < ?manBday
), can be sorted into the join constraints:
SELECT emp.lastName, grandManager.lastName FROM Employee as emp INNER JOIN Manage AS lower ON lower.manages=emp.id INNER JOIN Employee AS manager ON manager.id=lower.manager AND manager.birthday < emp.birthday INNER JOIN Manage AS upper ON upper.manages=manager.id INNER JOIN Employee AS grandManager ON grandManager.id=upper.manager AND grandManager.birthday < manager.birthday WHERE emp.lastName IS NOT NULL AND grandManager.lastName IS NOT NULL
Optionals and Disjunctions can produce SQL subqueries. Filters on these subqueries should be treated as applying to the ON constraint when the subquery in joined to the rest of the query. Constants and variables introduced in an OPTIONAL or UNION may be pushed down into the subquery when generating SQL strings. When generating execution plans which exceed SQL expressivity, it may be possible to push into the subquery constraints against variables bound outside of the subquery.
In SQL, UNIONS are represented as subqueries. For example, a query for employees above and below employee 253:
SELECT ?name WHERE { ?who emplP:lastName "Smith" { ?above manageP:manages ?who . ?above manageP:manager ?manager . ?manager emplP:lastName ?name } UNION { ?below manageP:manager ?who . ?below manageP:manages ?managed . ?managed emplP:lastName ?name } }
The first constraint is handled as it is in basic graph patters; the Employee
table is aliased to who
, the variable ?who
is attached to the attribute who.id
, and who.lastName="Smith"
is added to the constraints
The SQL UNION representing the SPARQL UNION has two disjoints, each expressed as a subselect. The joins and constraints are processed normally, yielding these two join patterns:
FROM Manage AS above INNER JOIN Employee as manager ON above.manager=manager.id WHERE manager.lastName IS NOT NULL
FROM Manage AS below INNER JOIN Employee as managed ON below.manages=managed.id WHERE managed.lastName IS NOT NULL
Both sides of the SPARQL UNION reference the same two variables which are used ouside of the union: ?who
, which is coreferenced elsewhere in the graph pattern, and ?name
, which is a result of the query. Therefor, these subqueries must select the attributes to which the variables are attached, i.e. SELECT manager.lastName AS name, above.manages AS who
for the first and SELECT managed.lastName AS name, below.manager AS who
for the second.
Finally, the union must be uniquely named, AS union1
, and the coreference on ?who
must be enforced by constraining the attribute to which it is attached in the subquery against the the attribute to which it was attached in the initial SPARQL constraint, i.e. ON union1.who=who.id
.
This is a mapped stem query for the SPARQL query above.
SELECT union1.name FROM Employee AS who INNER JOIN ( SELECT manager.lastName AS name, above.manages AS who FROM Manage AS above INNER JOIN Employee as manager ON above.manager=manager.id WHERE manager.lastName IS NOT NULL UNION SELECT managed.lastName AS name, below.manager AS who FROM Manage AS below INNER JOIN Employee as managed ON below.manages=managed.id WHERE managed.lastName IS NOT NULL ) AS union1 ON union1.who=who.id WHERE who.lastName="Smith"
As mentioned in Filters on Basic Graph Patterns, when compiling an execution plan directly, it might be possible to violate the SQL scoping rules and push the constraints inside the subqueries. In the mapped stem query, the constraint imposed by the coreference to ?who
is enforced on the union two subselects, union1.who=who.id
. Some relational databases have execution plans that exceed SQL expressivity, specifically, they allow constraints in the subselects to reference attributes in aliases joined outside of the subselect. While there is no SQL expression for this, it can be thought of as looking like this:
SELECT union1.name FROM Employee AS who INNER JOIN ( SELECT manager.lastName AS name FROM Manage AS above INNER JOIN Employee as manager ON above.manager=manager.id WHERE above.manages=who.id AND manager.lastName IS NOT NULL UNION SELECT managed.lastName AS name FROM Manage AS below INNER JOIN Employee as managed ON below.manages=managed.id WHERE below.manager=who.id AND managed.lastName IS NOT NULL ) AS union1 ON TRUE WHERE who.name="Smith"
Care should be taken when violating SQL scoping rules as this may exploit unexplored code paths as well as violate assumptions made while optimizing. Pushing constraints into subselects can be thought of as an optimization, and may be inconsequential on some systems where the optimizer performs steps like this automatically.
The first query in Disjunctions was a convenient query in that the constraints imposed on the subselects by coreferences in the patterns were identical in all disjoints. Specifically, both disjoints include a constraint imposed by a coreference on ?who
. This query finds all of Smith's immediate managers, and those immediate employees that also share Smith's birthady (perhaps to support some favoritism on the part of Smith). The structure of this query provides more rigorous requirements for a mapping solution from SPARQL UNIONs to SQL UNIONs:
SELECT ?lastName WHERE { { ?above manageP:manages ?who . ?above manageP:manager ?manager . ?manager emplP:lastName ?name } UNION { ?below manageP:manager ?who . ?below manageP:manages ?managed . ?managed emplP:lastName ?name . ?managed emplP:birthday ?bday } ?who emplP:name "Smith" . ?who emplP:birthday ?bday }
As in the earlier query, both disjoints have coreferences on ?who
. The second disjoint also has a coreference on ?bday
. If one can violate the SQL scoping as above, these constraints are managed trivially in the expression of the subselects. However, to serialize this as SQL, the constraints on the SQL UNION must be treated individually for each disjoint. This can be accomplished by adding a disjoint ordinal to the subselect, e.g. 2 AS _DISJOINT_
, and expressing the constraints on the SQL UNION in terms of the disjoint ordinal, e.g. _DISJOINT_=2 AND union1.who=who.id AND union1.bday=who.birthday
.
SELECT union1.lastName FROM ( SELECT 1 AS _DISJOINT_, manager.lastName AS lastName, above.manages AS who, NULL AS bday FROM Manage AS above INNER JOIN Employee as manager ON above.manager=manager.id WHERE manager.lastName IS NOT NULL UNION SELECT 2 AS _DISJOINT_, managed.lastName AS lastName, below.manager AS who, managed.birthday AS bday FROM Manage AS below INNER JOIN Employee as managed ON below.manages=managed.id WHERE managed.lastName IS NOT NULL ) AS union1 INNER JOIN Employee AS who ON (_DISJOINT_=1 AND union1.who=who.id) OR (_DISJOINT_=2 AND union1.who=who.id AND union1.bday=who.birthday) WHERE who.lastName="Smith"
SPARQL optional graph patterns segregate a portion of the match which, by failing, will not eliminate rows. This behaves the same way as the SQL left outer join. A common use of SPARQL OPTIONAL is to bind a new variable to the value of a single attribute that may or may not be in the graph:
PREFIX emplP: <http://hr.example/DB/Employee#> SELECT ?empName ?birthday WHERE { ?emp emplP:lastName ?empName . OPTIONAL { ?emp emplP:birthday ?birthday } }
If a SPARQL OPTIONAL introduces no variables that bind to a foreign key, and no introduces no other new join aliases, the query can be represented in SQL by simply ommiting the IS NOT NULL constraints and including tests that allow for NULL in subsequent matches:
SELECT emp.lastName, emp.birthday FROM Employee as emp WHERE emp.lastName IS NOT NULL
When converting this to a result set, any columns that are bound to NULL are simply not included in the bindings.
This query looks for any set of three employees with lexically increasing names (e.g. Ishita, Johnson, Jones, Smith) who all have the same birthday if known (typical in scenarios where talent scouts recruit from hospital nurseries). If the first two employees' birthdays may be unknown, in which case, only the remaining two have to match.
PREFIX emplP: <http://hr.example/DB/Employee#> SELECT ?emp1Name ?emp2Name ?emp3Name WHERE { ?emp1 emplP:lastName ?emp1Name . OPTIONAL { ?emp1 emplP:birthday ?birthday } ?emp2 emplP:lastName ?emp2Name . OPTIONAL { ?emp2 emplP:birthday ?birthday } ?emp3 emplP:lastName ?emp3Name . ?emp3 emplP:birthday ?birthday . ?emp4 emplP:lastName ?emp4Name . ?emp4 emplP:birthday ?birthday . FILTER ( ?emp1Name < ?emp2Name && ?emp2Name < ?emp3Name && ?emp3Name < ?emp4Name) }
The first two utterances of ?birthday
are optional and the next two are not. If the initial binding to birthday were not optional, we could attach the variable ?birthday
to the Employee attribute emp1.birthday
, and use that attribute in subsequent constraints. Treating the sequence of constraints in the query, we can loosely attach ?birthday
to emp1.birthday
, which means that, as long as it is loosely bound, each constraint on ?birthday
must not fail if it is unbound. After it is used in a non-optional pattern, ?birthday
is attached to the corresponding table attribute, emp3.birthday
and the restriction emp3.birthday IS NOT NULL
is a restriction on the solution. (Note that in the next binding, emp4
, tests for emp4.birthday=emp3.birthday
, so this IS NOT NULL
restriction is again dropped.
SELECT emp1.lastName AS emp1Name, emp2.lastName AS emp2Name, emp3.lastName AS emp3Name FROM Employee AS emp1 INNER JOIN Employee AS emp2 ON emp2.lastName > emp1.lastName AND ( emp2.birthday IS NULL OR emp1.birthday IS NULL OR emp2.birthday=emp1.birthday ) INNER JOIN Employee AS emp3 ON emp3.lastName > emp2.lastName AND ( emp1.birthday IS NULL AND emp2.birthday IS NULL OR emp3.birthday=emp1.birthday OR emp3.birthday=emp2.birthday ) INNER JOIN Employee AS emp4 ON emp4.lastName > emp3.lastName AND emp4.birthday=emp3.birthday
Any constrain on a loosely attached variable must include a disjoint allowing the attribute to which that variable is attached to be NULL, e.g. emp2.birthday IS NULL
for the join of emp2
. Any constraint of a previously loosely attached variable must include a disjoints for
emp2
: emp1.birthday IS NULL
emp3
: emp1.birthday IS NULL AND emp2.birthday IS NULL
emp2
: emp2.birthday=emp1.birthday
emp3
: emp3.birthday=emp1.birthday OR emp3.birthday=emp2.birthday
SPARQL optional patterns may introduce multiple joins, which must be treated as a subquery. Following is a query of employees and, if they are a third-line employee, their manager and manager's manager (a "grand manager", in the spirit of "grandparents"):
PREFIX emplP: <http://hr.example/DB/Employee#> PREFIX mangP: <http://hr.example/DB/Manage#> SELECT ?empName ?managName ?grandManagName WHERE { ?emp emplP:lastName ?empName . OPTIONAL { ?mang mangP:manages ?emp . ?mang mangP:manager ?manager . ?manager emplP:lastName ?managName . ?grandMang mangP:manages ?manager . ?grandMang mangP:manager ?grandManager . ?grandManager emplP:lastName ?grandManagName } }
This introduces multiple joins within the OPTIONAL (in this case, of the Manage table) These two additional joins can be treated as a subquery, similar to a subquery in a UNION, and following the same rules for enforcing coreferences:
SELECT emp.lastName AS empName, opt1.managName AS managName, opt1.grandManagName AS grandManagName FROM Employee AS emp LEFT OUTER JOIN ( SELECT grandManager.lastName AS grandManagName, manager.lastName AS managName, mang.manages AS emp FROM Manage AS mang INNER JOIN Employee AS manager ON manager.id=mang.manager INNER JOIN Manage AS grandMang ON grandMang.manages=mang.manager INNER JOIN Employee AS grandManager ON grandManager.id=grandMang.manager ) AS opt1 ON opt1.emp=emp.id WHERE emp.lastName IS NOT NULL
These Optionals Introducing Joins Results: are produced by the above query on the Example Tables.
empName | manageName | grandManageName |
---|---|---|
Johnson | NULL | NULL |
Smith | NULL | NULL |
Jones | Smith | Johnson |
Xu | NULL | NULL |
Ishita | Smith | Johnson |
It is tempting to treat SPARQL OPTIONALs as a series of LEFT OUTER JOINs in the same query
SELECT emp.lastName AS empName, manager.lastName AS managName, grandManager.lastName AS grandManagName FROM Employee AS emp LEFT OUTER JOIN Manage AS mang ON mang.manages=emp.id LEFT OUTER JOIN Employee AS manager ON manager.id=mang.manager LEFT OUTER JOIN Manage AS grandMang ON grandMang.manages=mang.manager LEFT OUTER JOIN Employee AS grandManager ON grandManager.id=grandMang.manager WHERE emp.lastName IS NOT NULL
but that does not properly handle the cardinality in the case where an employee has multiple managers, but only one of them has a manager:
empName | grandManageName |
---|---|
Johnson | NULL |
Smith | NULL |
Jones | Johnson |
Jones | NULL |
Xu | NULL |
Ishita | Johnson |
The tuple {Jones, NULL}
was erroneously included because it partially satisfaction of an OPTIONAL pattern. OPTIONAL patterns may be represented as a sequence of simple outer joins if:
Note that, in both SPARQL and SQL, OPTIONALs (LEFT OUTER JOINS) introduce an order constraint on the sequence of joins. A variable that is bound in an optional can constraint future non-optional joins.
Pushing the ?emp emplP:lastName ?empName
constraint to the bottom is a very different, but illustrative query:
PREFIX emplP: <http://hr.example/DB/Employee#> SELECT ?empName ?grandManagName WHERE { OPTIONAL { ?manager emplP:manager ?emp . ?manager emplP:manager ?grandManager . ?grandManager emplP:lastName ?grandManagName } ?emp emplP:lastName ?empName }
The SPARQL semantics assert that WHERE { OPTIONAL { X } }
is equivalent to WHERE { { } OPTIONAL { X } }
and that the result set after processing { }
contains one soltion with no bindings. SQL has not syntax for starting off with an optional, but these semantics can be emulated by creating a table with one solution with one binding, the name of which is unique so that it will not affect any joins: SELECT 12817 AS _IGNORE_ATTRIBUTE__) AS _IGNORE_RELATION_
. The selected value is unimportant; 12817 was chosen to be conspicuous in case of implementation error. The alias name must not collide with another alias.
The joins and constraints for the subselect representing the OPTIONAL are expressed normally:
FROM Manage as emp INNER JOIN Manage as manager ON manager.manages=emp.manager INNER JOIN Employee as grandManager ON grandManager.id=manager.manager
The variable ?emp
in the OPTIONAL has corefrences outside of the OPTIONAL. To enforce this, the resulting subselect, AS opt1
, must produce an attribute manages
. ?emp
is initially loosely bound to opt1.manages
, and follows the according the rules described in Equivalence of Variables Bound in Optionals, ultimately producing the constraint opt1.manages IS NULL OR opt1.manages=emp.id
.
The variable ?grandManagName
is loosely attached to opt1.grandManagName
and is not subsequently strongly attached. As specified by the non-NULL rule, there is no opt1.lastName IS NOT NULL
constraint.
SELECT emp.lastName AS empName, opt1.grandManageName FROM (SELECT 12817 AS _IGNORE_ATTRIBUTE__) AS _IGNORE_RELATION_ LEFT OUTER JOIN ( SELECT emp.manages AS manages, grandManager.lastName AS grandManageName FROM Manage as emp INNER JOIN Manage as manager ON manager.manages=emp.manager INNER JOIN Employee as grandManager ON grandManager.id=manager.manager ) AS opt1 ON TRUE INNER JOIN Employee AS emp ON opt1.manages IS NULL OR opt1.manages=emp.id WHERE emp.lastName IS NOT NULL
This is a very different query than the one with with the ?emp emplP:lastName ?empName
pattern leading the optional: it has either the number of solutions in the OPTIONAL, or, if the optional fails, one solution with no bindings joined against each entry in Eployee. In fact, we know from the two solutions in the Optionals Introducing Joins Results that both Jones
and ishita
had grand managers.
empName | grandManageName |
---|---|
Jones | Johnson |
Ishita | Johnson |
Optionals nested inside another graph pattern, e.g. a UNION or another OPTIONAL, are treated as OPTIONALs at the top level.
PREFIX emplP: <http://hr.example/DB/Employee#> PREFIX mangP: <http://hr.example/DB/Manage#> SELECT ?empName ?managName ?grandManagName WHERE { ?emp emplP:lastName ?empName . OPTIONAL { ?mang mangP:manages ?emp . ?mang mangP:manager ?manager . ?manager emplP:lastName ?managName . OPTIONAL { ?grandMang mangP:manages ?manager . ?grandMang mangP:manager ?grandManager . ?grandManager emplP:lastName ?grandManagName } } }
These can be respresented as nested OUTER subselects.
SELECT emp.lastName AS empName, opt1.managName AS managName, opt1.grandManagName AS grandManagName FROM Employee AS emp LEFT OUTER JOIN ( SELECT opt1.grandManagName AS grandManagName, manager.lastName AS managName, mang.manages AS emp, mang.manager AS manager FROM Manage AS mang INNER JOIN Employee AS manager ON manager.id=mang.manager LEFT OUTER JOIN ( SELECT grandManager.lastName AS grandManagName, grandMang.manages AS manager FROM Manage AS grandMang INNER JOIN Employee AS grandManager ON grandManager.id=grandMang.manager ) AS opt1 ON opt1.manager=mang.manager ) AS opt1 ON opt1.emp=emp.id WHERE emp.lastName IS NOT NULL
Note that, while similar to the Optionals Introducing Joins query, the fact that the grandManager is optional within the optional manager clause, permits a solution of an employee who has a manager but no grand manager.
empName | manageName | grandManageName |
---|---|---|
Johnson | NULL | NULL |
Smith | Johnson | NULL |
Jones | Smith | Johnson |
Jones | Xu | NULL |
Xu | NULL | NULL |
Ishita | Smith | Johnson |
A system could safely reduce this query to the simple set of left outer joins described in Erroneous Simplification of Joins in an Optional by examining the possible cardinalities for each of the outer joins:
LEFT OUTER JOIN Manage AS mang ON mang.manages=emp.id
— cardinality: the number of mangP:manages
predicates in the OPTIONAL which introduces managName
.Manage.manages
has a foreign key on Employee.id
.Manage
tuple results in an arc with a mangP:manages
predicate.LEFT OUTER JOIN Employee AS manager ON manager.id=mang.manager
— cardinality: 1Manage.manager
is a foreign key on Employee.id
.Manage.manager
is NOT NULL (by virtue of it being a foreign key on Employee.id
).LEFT OUTER JOIN Manage AS grandMang ON grandMang.manages=mang.manager
— cardinality: the number of mangP:manages
predicates in the OPTIONAL which introduces grandManagName
.LEFT OUTER JOIN Employee AS grandManager ON grandManager.id=grandMang.manager
— cardinality: 1.All graph patterns in a SPARQL query that are not separated by {}s scoping an OPTIONAL or UNION graph pattern are conjunctions. These may be treated as if all their triple patterns were in the same basic graph pattern. For example, these three patterns:
{ { ?emp emplP:lastName ?empName . ?emp emplP:manager ?manager } ?manager emplP:lastName ?managName FILTER ?managName = "Johnson"}
{ { ?emp emplP:lastName ?empName . ?emp emplP:manager ?manager } { ?manager emplP:lastName ?managName } FILTER ?managName = "Johnson"}
{ { ?emp emplP:lastName ?empName . ?emp emplP:manager ?manager } FILTER ?managName = "Johnson" { ?manager emplP:lastName ?managName } }
are all equivalent to
{ ?emp emplP:lastName ?empName . ?emp emplP:manager ?manager . ?manager emplP:lastName ?managName FILTER ?managName = "Johnson" }
Given a set of relationals DB and a stem query QS:
Following the SPARQL rule on CONSTRUCTs with unbound variables (¶2), variables not bound in the stem graph would produce no triples in the interface graph. Thus, triples using variables that are bound in OPTIONALs may be annotated as at risk:
stem interface ?B pX ?A ?A p1 ?B OPT { ?C pY ?B => ?B p2 ?C <- At risk because the dependent bindings ?C pZ [] } <x>: { ?C p3 ?A } <- might not come from the optional stem. ?A pW ?D ?B p4 ?D ... ...
We can see that the query is at risk for missing many potential solutions as the constraint ?g: { ?z p3 ?x }
is not going to eliminate some solutions.
This work was all done with funding from Eli Lilly and Lincoln Labs. I would like to thank Susie Stephens (Eli Lilly) and Daniel Van Hook (Lincoln Labs) for their commitment to making the Semantic Web practical and easy to use. I would also like to thank C. M. Sperberg-McQueen for his rigorous examples and bug discoveries.
$Log: Overview.html,v $
Revision 1.44 2009/09/01 18:02:07 eric
~ moved PatientDataExample out
Revision 1.43 2009/01/06 05:43:25 eric
~ incorporated feedback mid:1225706364.12028.64.camel@tweetie08
Revision 1.42 2009/01/05 20:53:49 eric
+ Acknowledgements
Revision 1.41 2008/09/28 17:32:51 eric
+ tinkering on Nested Optional Mapping
Revision 1.40 2008/09/26 18:29:38 eric
+ Nested Optional Mapping
Revision 1.39 2008/09/23 22:28:19 eric
~ fixed SPARQL query for optJoin1
Revision 1.38 2008/09/08 12:48:51 eric
+ ids to generate stem queries from MappingRules
Revision 1.37 2008/08/28 06:58:50 eric
~ improved Leading Optionals
Revision 1.36 2008/08/28 06:20:19 eric
~ fixed layout in Conjunction
Revision 1.35 2008/08/28 05:51:34 eric
~ more rigorous
+ coreference
+ General Mapping for Unions
Revision 1.34 2008/08/27 14:27:30 eric
+ Literal Maps
+ variables attached to attributes
+ Equivalence of Variables Bound in Optionals
+ Leading Optionals
Revision 1.33 2008/08/26 05:07:53 eric
~ mellowed border boxes
~ re-labeled rules in Query Mapping Algorithm
+ new rules in SQL Mapping Algorithm
Revision 1.32 2008/08/25 06:46:06 eric
+ filter sorting example
Revision 1.31 2008/08/25 05:09:22 eric
+ @Disjunctions: analysis of external constraints in subqueries
Revision 1.30 2008/08/25 04:03:29 eric
+ Disjunctions
Revision 1.29 2008/08/25 00:12:27 eric
+ Optionals
Revision 1.28 2008/08/24 11:18:36 eric
+ filters
Revision 1.27 2008/08/24 10:48:08 eric
~ more rigorous treatment of examples
Revision 1.26 2008/08/24 08:02:16 eric
- forward refs in Introduction
+ Stem Query as SQL
Revision 1.25 2008/08/18 06:27:50 eric
~ validate
Revision 1.24 2008/08/18 06:25:59 eric
+ note to handle dependent optionals in Query Mapping Algorithm
Revision 1.23 2008/08/18 06:14:42 eric
~ expanded Treatment of Optionals to include dependent optionals.
Revision 1.22 2008/08/18 03:28:13 eric
+ Dependent Optionals
Revision 1.21 2008/08/14 03:40:44 eric
+ Multiple Interface Graph Matches to Interface Query
Revision 1.20 2008/08/12 11:28:51 eric
+ fabulous rule for when to include OPTIONALs
Revision 1.19 2008/08/07 18:20:32 eric
~ typos
~ s/required/included/
Revision 1.18 2008/08/07 04:13:29 eric
+ more rules
Revision 1.17 2008/08/06 23:56:54 eric
+ more rules
Revision 1.16 2008/08/06 14:48:28 eric
+ start on algorithm
Revision 1.15 2008/08/05 18:42:05 eric
+ author list
Revision 1.14 2008/08/04 18:58:26 eric
+ response to [email protected]
Revision 1.13 2008/08/04 15:08:02 eric
+ pW/p3 productions
+ anchors
Revision 1.12 2008/08/04 14:54:01 eric
+ anchors
Revision 1.11 2008/08/04 14:46:55 eric
+ intro
Revision 1.10 2008/08/04 04:20:59 eric
+ introduction
Revision 1.9 2008/08/04 02:42:16 eric
+ stem graph transformation
Revision 1.8 2008/08/03 14:47:05 eric
+ entailment mapping
Revision 1.7 2008/08/03 05:09:51 eric
~ documented rules
Revision 1.6 2008/08/03 00:16:18 eric
+ ActivityHeaderID homonym
Revision 1.5 2008/08/02 17:57:44 eric
~ split out mapping options by expressivity
Revision 1.4 2008/07/30 22:42:02 eric
+ annotation
Revision 1.3 2008/07/30 22:40:50 eric
+ hospital example
Revision 1.2 2008/07/30 17:41:47 eric
+ tables
Revision 1.1 2008/07/24 21:33:19 eric
created