W3C's Resource Description Framework, RDF [1], has attracted great interest for networked data expression, due in part to the popularity of the World Wide Web, and in part to its foundations in more traditional knowledge representation and reasoning systems. Among the attractive features of this language is the ability to distribute ontology development by grounding terms in dereferenceable, globally unambiguous identifiers. For this, RDF uses URIs, whose scalability has been proven by the success of the web. This paper introduces a relational database/RDF gateway called Federate.
SQL has worked well for intranets and homogeneous trust environments. When relational data is made available to wider audiences, custodians must create custom gateways. This is often done via web sites which present content generated from the database; RDF provides a similar path from the database to the world, but without the loss of semantic precision that occurs when using HTML. RDF's universally grounded entities and relationships provide a mechanism to unify RDBs with other RDBs, as well as with static data and other sources. The resulting network of universally interpretable data can serve as a medium for future data-oriented applications.
Data encoding languages such as XDR and ASN.1 concern themselves with expressing data structures but leave the semantics up to context hidden in the communication protocol endpoints and in contained data available to higher level protocols. This limits the re-use of this data to applications which are given special knowledge of this context. Fully self-described documents allow transport layer protocols like SMTP, FTP, NNTP and HTTP to provide caching, authentication and other generic infrastructure functions.
RDF is a language for expressing Directed Labeled Graphs (DLGs) using URIs (a superset of URLs) as node and arc identifiers. It can be expressed in XML or in non-XML syntaxes like n3 or ntriples. The three statements:
testDB:Orders_id_2185 has a testDB:Orders-customer of testDB:Customers_id_1
testDB:Orders_id_2185 has a testDB:Orders-product of testDB:Products_id_1004
testDB:Orders_id_2185 has a testDB:Orders-orderDate of "Jan 1 2001 13:24:55"
can be easily visualized as a graph.
As DLGs clearly express relationships between objects, RDF is an excellent language for expressing relational data. For instance, the above statements came from a row in an example OrderTracking database:
Orders
id | customer | product | orderDate
---|---|---|---
2185 | 1 | 1004 | "Jan 1 2001 13:24:55"

Customers
id | givenName | familyName
---|---|---
1 | Biff | Thompson
Enthusiasm for XML and semi-structured data has prompted some shift of highly structured relational data into XML documents and databases. Despite this, storage of structured data is still highly concentrated in relational databases, and, for performance and integrity checking reasons, likely to stay that way. Tools like Federate can be used to provide a consistent interface to the vast data stores in commercial infrastructure. This consistent interface simplifies the job of defining views of the data that will be available to different actors. These views may be joined with views from other publishers, creating a much more useful fabric of data that doesn't require human intuition (and luck) to interpret.
Storing RDF data in a relational database is not novel; however, Federate provides RDF agents with the ability to query a relational database with an application-specific schema. Querying RDF stored as Statements incurs the cost of an additional join per restriction condition (WHERE or JOIN ON clause in SQL). In general, properly normalized relational databases do not need a join for unique properties, i.e., properties for which there is one and only one slot. Federate provides RDF data with the speed, compactness, and integrity constraints of conventional relational databases.
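The cost difference described above can be sketched with an in-memory SQLite database. This is an illustration, not Federate's code; the `Statements` and `Orders` schemas below are invented for the example. Two restrictions against a generic triple store require a self-join, while the same two restrictions against a normalized table require none:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Statements (subject TEXT, predicate TEXT, object TEXT);
    CREATE TABLE Orders (id INTEGER PRIMARY KEY, customer INTEGER, product INTEGER);

    INSERT INTO Statements VALUES
        ('Orders_id_2185', 'Orders-customer', '1'),
        ('Orders_id_2185', 'Orders-product',  '1004');
    INSERT INTO Orders VALUES (2185, 1, 1004);
""")

# Two restrictions against the triple store: one self-join on subject.
triple_q = """
    SELECT s0.subject FROM Statements AS s0
    INNER JOIN Statements AS s1 ON s0.subject = s1.subject
    WHERE s0.predicate = 'Orders-customer' AND s0.object = '1'
      AND s1.predicate = 'Orders-product'  AND s1.object = '1004'
"""

# The same two restrictions against the normalized table: no join at all.
normalized_q = "SELECT id FROM Orders WHERE customer = 1 AND product = 1004"

print(con.execute(triple_q).fetchall())      # [('Orders_id_2185',)]
print(con.execute(normalized_q).fetchall())  # [(2185,)]
```

Each additional restriction adds another self-join to the first query, while the second stays a single-table scan that can use ordinary indexes.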
RDF is a language for modeling data in binary relations. Relations are expressed in statements composed of a subject, object and a predicate (relationship) between them. These statements combine to form a directed, labeled graph (DLG).
The data expressed may be anything to which one may assign relationships. Data and meta-data are expressed in the same language and may appear in the same document. Those familiar with UML may note that RDF imposes no distinction between the data, model and meta-model (...). Statements are unordered, as assertions in predicate logic are unordered. Ordering of lists is handled within the model with specific predicates.
Database federation needs typically arise from independently developed and administered data. Relational schemas impose a strict understanding of the relationships of the modeled data. However, joining two databases can be tricky, as these relationships may be less clear. This can arise from incomplete understanding, but more often arises when the custodians of the data have different needs. While the elements in the microworld described in two databases may coincide, it is not guaranteed they will correspond exactly to entities in both databases. For instance, a vendor database that maintains a part number for each tracked item may not correspond to a client's database which has sales names for these items, or house-defined names for groups of the items, or part numbers from that vendor interspersed with part numbers from other vendors. Sometimes the correlation of two entities can only be asserted for a particular domain of query, e.g., for the purposes of this query, butter melting frobnosticators can be considered the same as heated mystifiers. This use case considers the case where such an impedance mismatch occurs for data that is managed by different departments in a large organization.
Custom code can be written to handle joins across two independently maintained databases within the same organization, but it will only be useful for those two databases, and serviced queries will have to share the same assumptions about which items can be merged. It is frequently impractical to provide a meta-language to describe which items can be conditionally merged, and when; development of such languages leads to many years of theoretical exploration.
Mapping databases to RDF has proven quite simple. Slightly more challenging is the use of ontology languages to join disparate databases. RDFS and OWL can be used to relate these databases, either in a universal it-is-always-true-that sense, or in a premise, for-the-purposes-of-this-query, sense. This allows one to code the expertise embedded in custom database gateways in a language that has already been developed to globally unambiguously model data.
Instead of simply defining mappings between two databases, it is also possible to define mappings to a "house standard" of entity references. This allows the mapping problem to be addressed once for each database, rather than once for each pair of databases.
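The economy of a "house standard" can be sketched as follows. The identifiers and the `house:` terms are invented for illustration; the point is that each database carries one mapping to the shared vocabulary, and any pair becomes joinable through it:

```python
# One mapping per database, to a shared "house standard" vocabulary.
# All identifiers below are hypothetical.
vendor_to_house = {"nr21a.5": "house:nose-ring"}
sales_to_house  = {"SKU-0042": "house:nose-ring"}

def joinable(vendor_id, sales_id):
    # Two entities may be merged when both map to the same house term.
    house = vendor_to_house.get(vendor_id)
    return house is not None and house == sales_to_house.get(sales_id)

print(joinable("nr21a.5", "SKU-0042"))  # True
print(joinable("nr21a.5", "SKU-9999"))  # False
```

With n databases this needs n mappings, where pairwise mappings would need n(n-1)/2.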
Because the "intranet" use case did not rely on any coordination between the custodians of the disparate databases, the same approach can be used to manage relationships between databases published by different organizations, for instance, a vendor and a product rating service.
A product database may have some public visibility in the form of a marketing portal. This is currently done by publishing product lists in HTML, frequently with forms for purchasing the advertised products. Consumer desire to use specialized tools to browse and submit orders may motivate the vendor to publish the data in a predictable, machine-readable language. This eliminates tedious screen scraping, leaving only the more necessary task of semantic interpretation of the data.
A consumer-oriented rating service, à la Consumer Reports, can now publish their product reviews reusing the vendor's entity identifiers. Users of this data now have a trivial job of joining the two sources, as they have the equivalent of an external key. This is one ideal of the semantic web.
In the common scenario that publishers do not agree on common identifiers for items, the same mechanisms used in the "intranet" use case can be used to establish the useful relationships.
At present, Federate works with two reasoning agents, algae [1], and cwm [2]. Most of the experimentation has been done on the W3C access control database and a sample OrderTracking database.
A subset of n3 is used in this paper. It can be thought of as a series of statements, each terminated by a '.'. Each statement consists of a subject, relation, and object. Each of the nodes may be a URI enclosed in <> or a qname as described in XML Namespaces. Statements may be grouped in a formula; this has the effect of removing them from the assertions of that document. These formulae are used in, among other things, rules. The standard format of a rule is:
<> log:forAll ','-separated var list .
{ statements with vars } log:implies { statements with vars } .
In fact, n3 is much more than the portion used for examples in this document. Unlike Algae, it uses triples to declare rules (and implicitly, queries) and to declare the relationship between documents and the data in those documents. For more information, see A Rough Guide to N3.
Algae is a simple, pragmatic query and rules language. The functions that are relevant to this paper are:
ns prefix=uri
attach type dbName (parameter=value ...)
ask dbName (triple ...)
collect (variable ...)
The connectives are described below in Language Extents.
The following algae query of the example OrderTracking database demonstrates access to a simple, but practical JOIN on a set of database relations (tables). See the complete tables from which this data was taken, the algae script and the resulting graph.
ns testDB=<http://localhost/OrderTracking#>
attach <http://www.w3.org/1999/02/26-modules/algae#dynamic> testDB:test1 (
    class="W3C::Rdf::SqlDB"
    properties="../test/OrderTracking.prop")
ask testDB:test1 (
    ?o testDB:Orders_id ?orderId .
    ?o testDB:Orders_customer ?c .
    ?o testDB:Orders_orderDate 20020907 .
    ?o testDB:Orders_product ?p .
    ?p testDB:Products_name ?productName .
    ?c testDB:Customers_givenName ?first .
    ?c testDB:Customers_familyName ?last .
    ?c testDB:Customers_billingAddress ?billAddr .
    ?billAddr testDB:Addresses_street ?billStreet .
    ?billAddr testDB:Addresses_city ?billCity .
    ?billAddr testDB:Addresses_state ?billState )
collect (?orderId ?productName ?first ?last ?billStreet ?billCity ?billState)
Following is a breakdown of the same query in n3 of the OrderTracking database.
n3 distinguishes variables by explicitly listing existentials and universals. Following "log:forAll" is a list of universal quantifiers.
this log:forAll :o, :orderId, :p, :productName, :c, :first, :last, :billAddr, :billStreet, :billCity, :billState .
<sql://[email protected]/OrderTracking/> is log:authoritativeService of
    Orders:id, Orders:customer, Orders:product, Orders:orderDate,
    Products:id, Products:name,
    Customers:id, Customers:familyName, Customers:givenName, Customers:billingAddress,
    Addresses:id, Addresses:street, Addresses:city, Addresses:state .
This example query doesn't rely on any external schema definition; the external keys are associated with their table/primary key pairs with "log:pointsAt".
Orders:product log:pointsAt Products:id .
Orders:customer log:pointsAt Customers:id .
Customers:billingAddress log:pointsAt Addresses:id .
:o Orders:id :orderId .
:o Orders:customer :c .
:o Orders:orderDate 20020907 .
:o Orders:product :p .
:p Products:name :productName .
:c Customers:givenName :first .
:c Customers:familyName :last .
:c Customers:billingAddress :billAddr .
:billAddr Addresses:street :billStreet .
:billAddr Addresses:city :billCity .
:billAddr Addresses:state :billState .
For each order :o, find the order date, product name and the customer's name and billing address.
SELECT Orders_0.id AS o_id, Customers_0.id AS c_id, Products_0.id AS p_id,
       Products_0.name AS productName_name,
       Customers_0.givenName AS first_givenName,
       Customers_0.familyName AS last_familyName,
       Addresses_0.id AS billAddr_id,
       Addresses_0.street AS billStreet_street,
       Addresses_0.city AS billCity_city,
       Addresses_0.state AS billState_state
FROM Orders AS Orders_0
     INNER JOIN Customers AS Customers_0 ON Orders_0.customer=Customers_0.id
     INNER JOIN Products AS Products_0 ON Orders_0.product=Products_0.id
     INNER JOIN Addresses AS Addresses_0 ON Customers_0.billingAddress=Addresses_0.id
WHERE Orders_0.orderDate="20020907"
orderId | productName | first | last | billStreet | billCity | billState
---|---|---|---|---|---|---
"2185" | "pool" | "Biff" | "Thompson" | "123 Elm Street" | "EdgeCity" | "AV" |
"2187" | "nose ring" | "Chip" | "Thompson" | "123 Elm Street" | "EdgeCity" | "AV" |
"3183" | "other ring" | "Chip" | "Thompson" | "123 Elm Street" | "EdgeCity" | "AV" |
This example extends the OrderTracking JOIN example above by adding optional arcs for shipping address and contact (person to sign for the shipment). This demonstrates the algorithm's ability to manage multiple joins to the same table. See the complete tables from which this data was taken, the algae script and the resulting graph.
ns testDB=<http://localhost/OrderTracking#>
attach <http://www.w3.org/1999/02/26-modules/algae#dynamic> testDB:test1 (
    class="W3C::Rdf::SqlDB"
    properties="../test/OrderTracking.prop")
ask testDB:test1 (
    ?o testDB:Orders_id ?orderId .
    ?o testDB:Orders_customer ?c .
    ?o testDB:Orders_orderDate 20020907 .
    ?c testDB:Customers_givenName ?first .
    ?c testDB:Customers_familyName ?last .
    ?c testDB:Customers_billingAddress ?billAddr .
    ?billAddr testDB:Addresses_street ?billStreet .
    ?billAddr testDB:Addresses_city ?billCity .
    ?billAddr testDB:Addresses_state ?billState .
    ~?o testDB:Orders_shippingAddress ?shipAddr .
    ~?shipAddr testDB:Addresses_street ?shipStreet .
    ~?shipAddr testDB:Addresses_city ?shipCity .
    ~?shipAddr testDB:Addresses_state ?shipState .
    ~?shipAddr testDB:Addresses_contact ?signer .
    ~?signer testDB:Customers_givenName ?sFirst .
    ~?signer testDB:Customers_familyName ?sLast . )
collect (?orderId ?first ?last ?billStreet ?billCity ?billState ?sFirst ?sLast ?shipStreet ?shipCity ?shipState)
The '~' prefix specifies that the Addresses table JOINed on the relvar Orders.shippingAddress, as well as the Customers table then joined on Addresses.contact, are JOINed with an OUTER JOIN, yielding:
orderId | first | last | billStreet | billCity | billState | sFirst | sLast | shipStreet | shipCity | shipState
---|---|---|---|---|---|---|---|---|---|---
"2185" | "Biff" | "Thompson" | "123 Elm Street" | "EdgeCity" | "AV" | NULL | NULL | NULL | NULL | NULL |
"2187" | "Chip" | "Thompson" | "123 Elm Street" | "EdgeCity" | "AV" | NULL | NULL | NULL | NULL | NULL |
"3183" | "Chip" | "Thompson" | "123 Elm Street" | "EdgeCity" | "AV" | "Eustis" | "Walker" | "245 King Street" | "EdgeCity" | "AV" |
The SQL query generator parses an RDF query and examines the predicates to determine from which relation they will come. The algorithm is roughly as follows:
for each tableName {
    c = new Class(tableName)
    for each fieldName {
        c.addProperty(new Property(c, fieldName))
    }
}
property = getProperty(predicate uri)
property.processTerm(triple, db)
Aliases allow the Addresses relation to be used as both the home address and the shipping address above. Alias assignment is roughly:

db.getAlias(subject):
    if (ret = symbol2table{symbol}) return ret
    ret = symbol2table{symbol} = table + "_" + tableAlias{table}++   # e.g. Orders_0
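The alias-assignment step can be sketched in Python. This is an illustrative reconstruction, not Federate's source; the class and method names are invented. Each distinct query variable bound to a table receives its own alias (Orders_0, Addresses_1, ...), which is what lets one relation be JOINed more than once in the same generated SQL:

```python
class AliasAllocator:
    """Assign a distinct SQL table alias to each query variable."""

    def __init__(self):
        self.symbol2table = {}   # query variable -> assigned alias
        self.table_count = {}    # table name -> next alias ordinal

    def get_alias(self, symbol, table):
        # Reuse the alias if this variable was already bound.
        if symbol in self.symbol2table:
            return self.symbol2table[symbol]
        # Otherwise mint a fresh alias for this table.
        n = self.table_count.get(table, 0)
        self.table_count[table] = n + 1
        alias = f"{table}_{n}"
        self.symbol2table[symbol] = alias
        return alias

a = AliasAllocator()
print(a.get_alias("?billAddr", "Addresses"))  # Addresses_0
print(a.get_alias("?shipAddr", "Addresses"))  # Addresses_1 (same table, new variable)
print(a.get_alias("?billAddr", "Addresses"))  # Addresses_0 (cached)
```

This mirrors how the outer-join example above joins Addresses twice, once for the billing address and once for the shipping address.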
Suppose Chip places an order for a nose ring and we wish to store this in a generic triple store. This would be composed of the triples:
[ customer [ givenName "Chip" ] . product [ name "nose ring" ] ] .
which implies the following triples, using _:<n> to represent an unnamed node in the graph:
_:1 customer _:2 .
_:2 givenName "Chip" .
_:1 product _:3 .
_:3 name "nose ring" .
If we wanted a normalized database to optimize storage and retrieval of this information, we could create an OrderTracking database:
Orders
id | customer | product
---|---|---
2187 | 2 | 2004

Customers
id | givenName
---|---
2 | Chip

Products
id | name
---|---
2004 | nose ring
In doing this, we must invent some integers (2187, 2, 2004) to use as identifiers for the rows in the tables. These enable us to store the relationships conveyed by _:1, _:2 and _:3 above. The id attribute for Orders, Customers and Products is not an attribute of the modeled entity, but an artifact of the mechanism for storing internal structure.
For consistency between the two expressions (holds and normalized), it is necessary to treat all external keys (Orders.customer and Orders.product above) not as integers, but as references to the entity represented by the indicated row in some table. A consequence of this constraint is that a benighted relational database that does not know the external keys, but instead treats all joins as being constrained on coincident attributes, will report the object of would-be foreign key arcs as simply the integer representing the primary key of the referenced table/row.
_:1 customer 2 .
This would be analogous to a generic triple store reporting that the object of an arc some memory address instead of the node in the graph that was the true object of that arc. When later informed of these external keys, the object will no longer be an integer, but instead an entity reference.
_:1 customer <http://...Customers#id.204> .
These two statements are non-monotonic. This is appropriate as this system is indeed inconsistent and reflects a change in the state of the microuniverse.
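The promotion from bare integer to entity reference can be sketched as a lookup keyed on declared external keys. The URI convention below follows the testDB naming used earlier in this paper, but the function and table of declarations are invented for illustration:

```python
# Hypothetical declaration of which columns are external keys,
# mapping (table, column) to the referenced table.
external_keys = {("Orders", "customer"): "Customers"}

def object_of(table, column, value):
    """Report the object of an arc: a literal for plain attributes,
    an entity reference for declared external keys."""
    target = external_keys.get((table, column))
    if target is None:
        return value  # plain attribute: report the literal
    # External key: report a reference to the entity behind the row.
    return f"http://localhost/OrderTracking#{target}_id_{value}"

print(object_of("Orders", "orderDate", 20020907))  # 20020907
print(object_of("Orders", "customer", 2))          # ...#Customers_id_2
```

Before the key is declared, the same column would fall through the first branch and be reported as a bare integer, which is exactly the non-monotonic transition described above.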
Frequently, larger distribution organizations assign identifiers to orders and products. These are seldom the small integers used as primary keys, but instead published alphanumeric strings appearing in catalogs or returned to the customer who has placed an order.
Orders
id | customer | product | orderID
---|---|---|---
2187 | 2 | 2004 | 200303120001a

Customers
id | givenName
---|---
2 | Chip

Products
id | name | prodID
---|---|---
2004 | nose ring | nr21a.5
This would add two triples to the above list.
_:1 customer _:2 .
_:2 givenName "Chip" .
_:1 product _:3 .
_:3 name "nose ring" .
_:1 orderID "200303120001a" .
_:3 prodID "nr21a.5" .
It is possible that a custodian of the relational store would choose to publish the external keys as order or product identifiers. In such a case, the primary key would be a published attribute of the entity represented by the table row. External keys would remain as opaque references to the entity.
_:1 customer _:2 .
_:2 givenName "Chip" .
_:1 product _:3 .
_:3 name "nose ring" .
_:1 orderID "2187" .
_:3 prodID "2004" .
The current implementations of Federate assign ids to rows in tables based on the primary key. This may not be good practice, but is temptingly convenient and useful.
Federate can interpret most common logical connectives and express them in SQL to the federated sources. The cwm port can, at this time, express only conjunction, but the development platform, algae, can express and execute the standard set:
connective | algae | cwm
---|---|---
conjunction | (?a ?b ?c)(?d ?c ?e) | :a :b :c. :d :c :e. |
disjunction | (|| (?a ?b ?c)(?a ?b ?d)) | not implemented |
negation | !(?a ?b ?c) | not implemented |
outer join | (?a ?b ?c)~(?d ?c ?e) | not implemented |
domain constraints | ||
regular expression | ~http://example.com/.* | not implemented |
A common criticism of RDF is a consequence of the decision to model relations as only binary: additional entities must be created when modeling n-ary relations. For example, the common supplier scenario:
requires that a Supplies entity be created even though it models a phenomenon rather than an object in the microuniverse. (It could be argued that the Supplies entity models a contract or a perceived acquisition agreement.) This turns out not to be a problem for mapping to the relational world, as RDBs are similarly constrained. A relation may not have duplicate tuples, implying a primary key (with potentially more than one attribute). The entity represented by this tuple can be represented by a node in an RDF graph, where the node can be named by a function of the attributes in the primary key. Alternatively, the node may be unnamed but distinguished by arcs describing the attributes in the primary key.
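Naming a node as a function of the primary-key attributes can be sketched as below. The function name and URI layout are assumptions for illustration; any injective function of the key attributes would serve:

```python
def node_for(relation, key_attrs):
    """Name the RDF node for a tuple as a function of its primary key.
    key_attrs: ordered (attribute, value) pairs of the primary key."""
    key = ".".join(f"{a}_{v}" for a, v in key_attrs)
    return f"http://example.com/{relation}#{key}"

# A Supplies tuple keyed on (supplier, part) gets a stable node identity,
# so the n-ary relation's reified entity can be referenced unambiguously.
print(node_for("Supplies", [("supplier", 7), ("part", 1004)]))
# http://example.com/Supplies#supplier_7.part_1004
```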
Many relational databases provide a mechanism to distribute database tuples and attributes in a manner that allows a query planner to predict which sites will have which data. The same processes can be used by an SQL query engine handling Federate requests. This could also be done by the Federate engines cwm and algae, as both have unifiers to merge data from SQL with data from other sources like HTTP or rules. At present, both implementations decide which SQL store to query based on the predicate. Distributions based on the subject or object would be analogous to relational databases with tuples distributed at different sites. This should not be significantly more difficult than the attribute distribution which is already implemented.
Inserting or replacing arcs derived from a normalized relational store is analogous to the classic view update problem in relational databases. Briefly, a view is created by defining a query involving one or more base tables and assigning the resulting table a name. This name becomes available for constructing future queries. Whether the view is populated with tuples which must be updated when the base tables are updated, or simply invokes the defining query each time it is referenced, is an implementation detail that the custodian should not have to consider. The view update problem occurs when one inserts, updates, or deletes the tuples in a view. Such updates need to reflect back to updates in the base tables in such a way that the base tables and the derivative view remain consistent.
C. J. Date outlines one approach to view update. At this time, this approach has not been implemented in Federate. Given a set of triple assertions which reflect changes to the microuniverse, it seems feasible to use this approach to create a set of SQL updates resulting in a consistent system.
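A minimal illustration of the view update idea, using SQLite's INSTEAD OF triggers (this is not Federate's mechanism, and the schema is invented): an insert on the derived view is mapped back to the base table, so the base table and the view remain consistent.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Customers (id INTEGER PRIMARY KEY, givenName TEXT, familyName TEXT);

    -- A derived view: full names computed from the base table.
    CREATE VIEW FullNames AS
        SELECT id, givenName || ' ' || familyName AS fullName FROM Customers;

    -- Map an insert on the view back to an insert on the base table.
    CREATE TRIGGER FullNames_insert INSTEAD OF INSERT ON FullNames
    BEGIN
        INSERT INTO Customers (id, givenName, familyName)
        VALUES (NEW.id,
                substr(NEW.fullName, 1, instr(NEW.fullName, ' ') - 1),
                substr(NEW.fullName, instr(NEW.fullName, ' ') + 1));
    END;
""")

con.execute("INSERT INTO FullNames VALUES (2, 'Chip Thompson')")
print(con.execute("SELECT * FROM Customers").fetchall())  # [(2, 'Chip', 'Thompson')]
```

A set of triple assertions against a Federate view would need an analogous, mechanically derived mapping back to the base relations.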
A benefit of semi-structured databases is the flexibility to store data outside of a predefined schema. RDF is designed so that adding statements to a graph does not change the meaning of the existing data. This enables agents to interpret the portion of the database that they understand and safely ignore the rest. In a sales context, RDF can be used to store data in a schema that evolves rapidly to suit client needs. As mentioned above, storing arbitrary RDF in a relational database is commonly implemented. This data may be stored in the same database (though different tables) as normalized data. The agent storing data can examine the data and see if it is appropriate for a normalized table. If not, it can be stored in the arbitrary RDF store. A query agent can examine the terms in a query and separate those that are in the normalized database from those stored in the generic triple store. It is possible to shift all the unification to the database by constructing joins between the arbitrary RDF store tables and the normalized tables. For example, the following query searches for extra terms introduced ad hoc by the sales department:
@@@
The constructed SQL query is large, but effective:
@@@
As stated above, the emergence of XML tools has enabled relational data to be expressed in XML semi-structured databases. For instance, the example OrderTracking database could trivially be stored in an XML tree structure.
<OrderTracking xmlns="http://example.com/colloquial-ns#">
  <Orders>
    <Order id="2185">
      <customer>1</customer>
      <product>1004</product>
      <orderDate>20020902132455</orderDate>
    </Order>
    <Order id="2186">
      <customer>2</customer>
      <product>2001</product>
      <orderDate>20020902132457</orderDate>
    </Order>
  </Orders>
  <Customers>
    <Customer id="1">
      <givenName>Biff</givenName>
      <familyName>Thompson</familyName>
      <billingAddress>1</billingAddress>
    </Customer>
    <Customer id="2">
      <givenName>Chip</givenName>
      <familyName>Thompson</familyName>
      <billingAddress>1</billingAddress>
    </Customer>
  </Customers>
</OrderTracking>
taken from a larger example colloquial-embedded.xml.
This leads us to consider the difference between a database communication protocol and a more application-specific protocol for placing orders, à la XMLP.
There are a few advantages to encoding this data in XML. The principal one is human readability/editability, which comes at a cost of verbosity. Another is the use of standard XML constraint languages like schemas and DTDs, though these are usually available directly in a relational database and more tailored to the semantics of database attributes. A third is the use of XML transformation tools, like XInclude and XSLT, to manipulate the structure of the document.
The most useful tool for such an XML database is XQuery. This language is designed to provide similar functionality to SQL but work on XML databases. The advantage of SQL over XQuery is that SQL is tied more closely to the semantics of the relations. Where a query for the customer names for the pending orders would look like
SELECT Orders.id, familyName, givenName FROM Orders INNER JOIN Customers ON customer=Customers.id
or the familiar form
SELECT Orders.id, familyName, givenName FROM Orders, Customers WHERE customer=Customers.id
in XQuery, it would be more verbose:
FOR $o in doc("colloquial.xml")/OrderTracking/Orders/Order,
    $c in doc("colloquial.xml")/OrderTracking/Customers/Customer
WHERE $o/customer/text() = $c/@id
RETURN <answer> {$c/@id} {$c/familyName} {$c/givenName} </answer>
function | SQL | XQuery | RDF Query
---|---|---|---
table selection | FROM Orders, Customers | FOR $o in doc("colloquial.xml")/OrderTracking/Orders/Order, $c in doc("colloquial.xml")/OrderTracking/Customers/Customer |
constraints | WHERE customer=Customers.id | WHERE $o/customer/text() = $c/@id |
output selection | SELECT Orders.id, familyName, givenName | RETURN <answer> {$c/@id} {$c/familyName} {$c/givenName} </answer> |
The overhead in processing "/OrderTracking/Orders/Order" is much higher than selecting the Orders table in a relational store. The XQuery processor must check the root node to see if it matches "OrderTracking". If it does, it must check each of the immediate children in the XML tree to see if it matches "Orders" and, for each match, repeat the process for "Order". An XML database could assume such a layout of the data and store it in relations, but then it would be unable to process arbitrary XQuery (XPath, actually) paths. Such an implementation would actually be a relational database with an alternative query syntax.
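The traversal cost described above can be sketched with a toy tree model (the tree shape and helper names below are invented for illustration): matching the path visits nodes one by one, while a relational store reaches the Orders relation in a single lookup.

```python
# Toy XML tree: (element name, list of children).
tree = ("OrderTracking", [
    ("Orders", [("Order", []), ("Order", [])]),
    ("Customers", [("Customer", []), ("Customer", [])]),
])

def match_path(node, path, visits):
    """Return nodes matching the path, counting every node visited."""
    name, children = node
    visits[0] += 1
    if name != path[0]:
        return []
    if len(path) == 1:
        return [node]
    out = []
    for child in children:
        out.extend(match_path(child, path[1:], visits))
    return out

visits = [0]
orders = match_path(tree, ["OrderTracking", "Orders", "Order"], visits)
print(len(orders), "matches after", visits[0], "node visits")  # 2 matches after 5 node visits

# The relational analogue: the Orders relation is reached in one step.
relations = {"Orders": [("Order",), ("Order",)]}
print(len(relations["Orders"]), "rows from one lookup")
```

The gap widens with tree depth and fan-out, while the relational lookup cost stays constant.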
On the generation side, it will again be inefficient to create a tree structure capable of storing arbitrary tree data when a simple derived relation would do. I believe, however, that the output is guaranteed to be an XML tree, so the benefits of closure (the same operations may be performed on the output as on the input), such as nested expressions, would be available to XML.
<DB:Database xmlns:DB="http://example.com/relationalModel-ns#">
  <DB:Relation DB:name="Orders">
    <DB:Tuple>
      <DB:Attribute DB:AttrName="id" DB:Domain="DB:Integer" DB:PrimaryKeyOrdinal="0">2185</DB:Attribute>
      <DB:Attribute DB:AttrName="customer" DB:Domain="DB:Integer" DB:ExternalKey="Customers:id">1</DB:Attribute>
      <DB:Attribute DB:AttrName="product" DB:Domain="DB:Integer" DB:ExternalKey="Products:id">1004</DB:Attribute>
      <DB:Attribute DB:AttrName="orderDate" DB:Domain="DB:Date">20020902132455</DB:Attribute>
    </DB:Tuple>
    <DB:Tuple>
      <DB:Attribute DB:AttrName="id" DB:Domain="DB:Integer" DB:PrimaryKeyOrdinal="0">2186</DB:Attribute>
      <DB:Attribute DB:AttrName="customer" DB:Domain="DB:Integer" DB:ExternalKey="Customers:id">2</DB:Attribute>
      <DB:Attribute DB:AttrName="product" DB:Domain="DB:Integer" DB:ExternalKey="Products:id">2001</DB:Attribute>
      <DB:Attribute DB:AttrName="orderDate" DB:Domain="DB:Date">20020902132457</DB:Attribute>
    </DB:Tuple>
  </DB:Relation>
  <DB:Relation DB:name="Customers">
    <DB:Tuple>
      <DB:Attribute DB:AttrName="id" DB:Domain="DB:Integer" DB:PrimaryKeyOrdinal="0">1</DB:Attribute>
      <DB:Attribute DB:AttrName="givenName" DB:Domain="DB:PersonName">Biff</DB:Attribute>
      <DB:Attribute DB:AttrName="familyName" DB:Domain="DB:PersonName">Thompson</DB:Attribute>
      <DB:Attribute DB:AttrName="billingAddress" DB:Domain="DB:Integer" DB:ExternalKey="Addresses:id">1</DB:Attribute>
    </DB:Tuple>
    <DB:Tuple>
      <DB:Attribute DB:AttrName="id" DB:Domain="DB:Integer" DB:PrimaryKeyOrdinal="0">2</DB:Attribute>
      <DB:Attribute DB:AttrName="givenName" DB:Domain="DB:PersonName">Chip</DB:Attribute>
      <DB:Attribute DB:AttrName="familyName" DB:Domain="DB:PersonName">Thompson</DB:Attribute>
      <DB:Attribute DB:AttrName="billingAddress" DB:Domain="DB:Integer" DB:ExternalKey="Addresses:id">1</DB:Attribute>
    </DB:Tuple>
  </DB:Relation>
</DB:Database>
taken from a larger example RM-embedded.xml.
This schema is a way to represent any relational database. The same tools could be used to query/manipulate the OrderTracking database as any other database.
One strong motivation for gateways between RDF and relational data is the ability of relational databases to take advantage of data modeling constraints to increase efficiency. For instance, if a set of entities has a unique attribute, it can be indexed with a B-Tree yielding log(n) access time. The holds database is a generic triple store and therefore may not be indexed so efficiently. Further, a query for a tuple constrained by n properties is quite efficient in a normalized store, but must be resolved with n-1 self-joins in the holds database. All the operations described above to store appropriate RDF data in a relational store impose schema constraints on the data and take advantage of them to increase efficiency.
The constraint over multiple attributes on a tuple takes advantage of a database's locality. In general, relational stores are designed so that retrieving any attributes from a tuple retrieves the whole tuple. Extending an object database to support RDF natively, we could provide for that same locality to represent a node and all the properties going to or coming from that node. Further, the database can take advantage of schema constraints like OWL's InverseFunctionalProperty to create unique indexes to map these property values back to the respective nodes. The query planner can look for these properties and optimize database queries much as they are done with conventional relational databases.
As many arcs touch two nodes, and as an arc needs to be retrievable via either of its nodes, duplication may be necessary. This improves the efficiency of reading, say, Product 1004 or Order 2185, but increases the burden of updating the database (two writes are needed). This is efficient in scenarios where reads happen more frequently than updates.
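The duplication trade-off can be sketched as a pair of indexes (an illustrative structure, not Federate's storage layout): each arc is written under both of its nodes, so either node's properties are local to that node.

```python
# Forward index: arcs leaving a node. Backward index: arcs arriving at it.
forward, backward = {}, {}

def add_arc(subject, predicate, obj):
    # One logical arc costs two physical writes.
    forward.setdefault(subject, []).append((predicate, obj))   # write 1
    backward.setdefault(obj, []).append((predicate, subject))  # write 2

add_arc("Orders_2185", "product", "Products_1004")

print(forward["Orders_2185"])     # all arcs leaving the order, one lookup
print(backward["Products_1004"])  # all arcs arriving at the product, one lookup
```

Reads from either side are a single lookup, at the cost of doubling the write load, which is the read-heavy trade-off described above.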
Four tables were used in the examples above:
Orders
id | customer | product | orderDate | shippingAddress
---|---|---|---|---
2185 | 1 | 1004 | 20020907 | NULL |
2186 | 2 | 2001 | 20020908 | NULL |
2187 | 2 | 2004 | 20020907 | NULL |
3183 | 2 | 2005 | 20020907 | 2 |
Customers
id | givenName | familyName | billingAddress
---|---|---|---
1 | Biff | Thompson | 1 |
2 | Chip | Thompson | 1 |
3 | Eustis | Walker | 2 |
4 | Elie | Tweak | 3 |
Products
id | name
---|---
1001 | white house |
1002 | picket fence |
1003 | sport utility vehicle |
1004 | pool |
1005 | grill |
2001 | skateboard |
2002 | rebellious music
2003 | earring |
2004 | nose ring |
2005 | other ring |
Addresses
id | apt | street | city | state | contact
---|---|---|---|---|---
1 | NULL | 123 Elm Street | EdgeCity | AV | 1 |
2 | 1 | 245 King Street | EdgeCity | AV | 3 |
3 | 18b | Ally 17 | BigCity | AV | 3 |
These are also available in a text file.