RDf is a mechanism to universalize relational data. The vast majority of relational data is stored in conventional SQL data stores. This documents implementation experience mapping RDF queries to SQL data stores. This mechanism can be used to automatically join data from relational databases in different administration (creative) domains.
This is a very rough document and represents the results of my studies and not those of the greater community.
Unlike much of the current work on SQL RDF data stores, this approach attempts to provide RDF access to data in specialized SQL data stores rather than provide an SQL data store for RDF. Such a mechanism will allow the joining of vast repositories of relational data currently isolated by administrative barriers and only occasionally exported via tedious interfaces.
Data access efficiency is critical in many domains. While conveniently general, a generic triple store forces a join for each term in a complex query. Accessing multiple fields in a record represented in a single table is more efficient.
Following is an ACLs (access control list) example which is used in the W3C web server. For optimization reasons, the ACLs are grouped -- the relevent fields for ACL group 6 are shown here:
SELECT acl,id,access FROM acls WHERE acl=6 | acls | +-----+-----+--------+ | acl | id | access | +-----+-----+--------+ | ... | ... | ... | | 6 | 100 | 3122 | | 6 | 102 | 3955 | | ... | ... | ... | +-----+-----+--------+
In a generic triple store, each field above would constitute another entry in the list of triples:
| triples | +-----------+---------+--------+ | predicate | subject | object | +-----------+---------+--------+ | ... | ... | ... | | ...acl | ...2138 | 6 | | ...id | ...2138 | 100 | | ...access | ...2138 | 3122 | | ...acl | ...9287 | 6 | | ...id | ...9287 | 102 | | ...access | ...9287 | 3955 | | ... | ... | ... | +-----------+---------+--------+
Querying this data requires self-joins for each field you wish to constrain:
SELECT t0.predicate,t0.subject,t0.object, t1.predicate,t1.subject,t1.object, t2.predicate,t2.subject,t2.object, FROM triples AS t0, triples AS t1, triples AS t2 WHERE t0.subject=t1.subject AND t0.subject=t2.subject AND t0.predicate="...acl" AND t1.predicate="...id" AND t2.predicate="...access" AND t0.object=6 | triples as t0 | triples as t1 | triples as t2 | +-----------+---------+--------+-----------+---------+--------+-----------+---------+--------+ | predicate | subject | object | predicate | subject | object | predicate | subject | object | +-----------+---------+--------+-----------+---------+--------+-----------+---------+--------+ | ...acl | ...2138 | 6 | ...id | ...2138 | 100 | ...access | ...2138 | 3122 | | ...acl | ...9287 | 6 | ...id | ...9287 | 102 | ...access | ...9287 | 3955 | +-----------+---------+--------+-----------+---------+--------+-----------+---------+--------+
The added cost of the self joins must be balanced against the flexibility of a generic triple store when selecting a database. This cost will not be high in indexed databases with sparse data sets. The cost goes up when multiple initial constraints select large amounts of rows in the triples table which are eliminated by later constraints.
For instance, if the constraints on the above query were re-ordered so that the table (self-join) with the object=6 constraint were taken last:
WHERE t1.subject=t2.subject AND t1.subject=t0.subject AND t1.predicate="...id" AND t2.predicate="...access" AND t0.predicate="...acl" AND t0.object=6
t1 is constrained by having the predicate "...id". If there were 10000 things with the "...id" property in the triple store, this would select 10000 rows. The next (self-)join is on t2 where t2.predicate is "...access" and t2.subject = t1.subject. This would likely eliminate no rows at all, and, if some had more than one "...access" predicate, could cause the row set to multiply. The last constraints say that our object also has an "...acl" property and that its value is 6, trimming our row set down to 2 and throwing away the rest of the rows that had been kept in memory to this point.
In exporting an RDF view of a relational database (or business process), it is rare that one would choose to export the actual implementation details. It is more likely that the exporter would have an abstract model of the data that would represent the interface provided by the exporter. Agents outside of the exporter's administrative control would be invited to use this interface. The back end database would be mapped to the exported interface. For instance, W3C's access control system exports an RDF interface that exports RDF data and accepts POSTed RDF data in a particular schema.
Once the data is translated from the exported interface is translated to the internal implementation-specific RDF schema, it must be translated to actually SQL queries. In this opperation, an RDF query like
namespace '(a http://localhost/SqlDB#) attach '(\"W3C::Rdf::SqlDB\" (\"properties:/usr/local/perl/modules/Conf/chacl.prop\" \"name:a::W3Cacls\")) ask '(a::W3Cacls (a::uris.uri ?uri0 http://www.w3.org/Member/Overview.html) (a::uris.acl ?uri0 ?t1) (a::acls.acl ?acl0 ?t1) (a::acls.access ?acl0 ?access) (a::ids.value ?id0 \"eric\") (a::ids.info ?id0 ?info) (a::idInclusions.id ?idInc0 ?id0) (a::acls.id ?acl0 ?t0) (a::idInclusions.groupId ?idInc0 ?t0) )
is tranlated to SQL like
SELECT b.acl as acl0_acl,b.id as acl0_id,b.access as acl0_access, a.id as uri0_id, c.id as id0_id, b.access as access, b.id as t0, d.id as idInc0_id,d.groupId as idInc0_groupId, a.acl as t1 FROM uris AS a,acls AS b,ids AS c,idInclusions AS d WHERE (a.uri="http://www.w3.org/Member/Overview.html") AND (a.acl=b.acl) AND (c.value="eric") AND (c.id=d.id) AND (b.id=d.groupId); +----------+---------+-------------+---------+--------+--------+-----+-----------+----------------+----+ | acl0_acl | acl0_id | acl0_access | uri0_id | id0_id | access | t0 | idInc0_id | idInc0_groupId | t1 | +----------+---------+-------------+---------+--------+--------+-----+-----------+----------------+----+ | 6 | 100 | 3122 | 810 | 2112 | 3122 | 100 | 2112 | 100 | 6 | | 6 | 102 | 3955 | 810 | 2112 | 3955 | 102 | 2112 | 102 | 6 | +----------+---------+-------------+---------+--------+--------+-----+-----------+----------------+----+
Some steps to translation:
Tables are identified by prefixing the name with a base URI supplied in the configuration.
Fields are identified by prefixing the name with the table identifier.
Identification of a row in a table was done by identifying the row by a unique index on that table. For instance, the acls table has a single unique index on it formed by the combination of the acl
, id
and access
fields:
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+---------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Comment | +-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+---------+ | acls | 0 | PRIMARY | 1 | acl | A | NULL | NULL | NULL | | | acls | 0 | PRIMARY | 2 | id | A | NULL | NULL | NULL | | | acls | 0 | PRIMARY | 3 | access | A | 241 | NULL | NULL | | +-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+---------+
Therefor, a row in the acls table could be identified by
http://localhost/SqlDB#acls.acl=6&id=100&access=3122
which indentifies the one row in the table where acl = 6
, id = 100
and access = 3122
.
Fields that represent external keys to another table are identified by the row identifier in that other table. For instance, the id
field in the acls
+--------+------------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +--------+------------------+------+-----+---------+----------------+ | acl | int(10) unsigned | | PRI | NULL | auto_increment | | type | char(1) binary | | | | | | id | int(10) unsigned | | PRI | 0 | | | access | int(10) unsigned | | PRI | 0 | | | last | timestamp(14) | YES | | NULL | | +--------+------------------+------+-----+---------+----------------+
table constitutes a reference to the id
field in the ids
table. Therefor, the range of the #acls.id
field is an http://localhost/SqlDB#ids
table row represented by a URI like
http://localhost/SqlDB#ids.id=100
While walking through the RDF to be translated, it proved easiest (after trying several other approaches) to record each place where a variable was used in the constructed SQL query and construct the constraints from list of references to that variable. Furthur, any place that variable was used as an object (the second position in the algae syntax above), represents the appropriate identifier for that node (row). Alternatively, one could say that the any and all object (field value) refrences to such a node should be documented as being external keys. This approach puts more of an assumption on the modeling completeness of the database relations than is otherwise necessary.
Mapping RDF schema to each other is (appears to be) as simple as defining the rules for the mapping. For instance, The W3C acls SQL database implies a structure where a uri
object has a reference to a set of acl
objects which each grant a specific access to a specific group.
This needs to be translated to the less pretty internal structure
by a simple set of translation rules
( namespace '(sqlDB http://localhost/SqlDB# \ acl http://www.w3.org/2001/02/acls/ns# \ rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#) \ attach '(\"W3C::Rdf::SqlDB\" \ (\"properties:/usr/local/perl/modules/Conf/chacl.prop\" \ \"name:sqlDB::W3Cacls\")) \ interface (sqlDB::W3Cacls API '((acl::login ?u ?name)) \ impl '((sqlDB::ids.value ?u ?name))) \ interface (sqlDB::W3Cacls API '((acl::memberOf ?sub ?super)) \ impl '((sqlDB::idInclusions.groupId #?g1 ?super) \ (sqlDB::idInclusions.id ?g1 ?sub))) \ interface (sqlDB::W3Cacls API '((acl::accessor ?a ?principal)) \ impl '((sqlDB::acls.id ?a ?principal))) \ interface (sqlDB::W3Cacls API '((acl::accessTo ?a ?uri)) \ impl '((sqlDB::acls.acl ?a #?acl) \ (sqlDB::uris.acl #?urisRow ?acl) \ (sqlDB::uris.uri ?urisRow ?uri))) \ interface (sqlDB::W3Cacls API '((acl::access ?a ?access)) \ impl '((sqlDB::acls.access ?a ?access))) \ )
in order to be processed by the SQL translator. This produces a graph matching the original internal graph:
This allows us to answer the query
(ask '(sqlDB::W3Cacls (acl::accessTo ?acl http://www.w3.org/Member/Overview.html) (acl::access ?acl ?access) (acl::accessor ?acl ?accessor) (acl::memberOf ?u1 ?accessor) (acl::login ?u1 "eric")) collect '(?acl))
by translating it to
((http://localhost/SqlDB#uris.uri ?urisRow http://www.w3.org/Member/Overview.html) (http://localhost/SqlDB#uris.acl ?urisRow ?aacl) (http://localhost/SqlDB#acls.acl ?acl ?aacl) (http://localhost/SqlDB#acls.access ?acl ?access) (http://localhost/SqlDB#ids.value ?u1 "eric") (http://localhost/SqlDB#idInclusions.id ?g1 ?u1) (http://localhost/SqlDB#idInclusions.groupId ?g1 ?accessor) (http://localhost/SqlDB#acls.id ?acl ?accessor))
and getting
(<http://localhost/SqlDB#acls.acl=6&id=100&access=3122>) (<http://localhost/SqlDB#acls.acl=6&id=102&access=3955>)
uri
node pointing at the accessTo
resource (http://www.w3.org/Member/Overview.html
). Perhaps this should be done with a daml:uniqueProperty on uris.uri
.access
bitmaskaccess
bitmask represent a set of access privileges that should be represented by a repeated property.