Copyright © 1998 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document represents a preliminary proposal for a persistent RDF database schema. It does not represent the views of the W3C team or member companies.
The semantic web provides a wealth of opportunity to enable structured searches for search engines and web agents. This assumes the overhead and latency of a query-time search. A large-scale database makes this data immediately available for use by the infrastructure and searches.
Beyond immediate search gratification, the RDF database can be used in the infrastructure, for example to store web server configuration or workflow information. The purpose of this design and implementation is to create the underlying database and API.
Good database design and @@@word for uniquing strings in a DB@@@ should automatically produce a database that can store huge numbers of statements (RDF node relationships). Each statement is a set of 6 integers. The expense of string storage is unavoidable, but most RDF data is self-referential and therefore has a comparatively small number of unique strings. As the database grows larger, new documents will find a greater fraction of their strings already present in the database. Indexing allows the administrator to push the slider between speed and space efficiency.
@@@ s/uniqueness/that word for uniqueness in databases/g
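The uniquing of strings described above can be sketched as a table with a unique index plus a look-up-or-insert helper. This is only an illustration under assumptions: it uses Python's sqlite3 module rather than the database the design targets, and the column name `content` and the helper `intern_string` are hypothetical names, not part of the schema.

```python
import sqlite3

# Hypothetical minimal Strings table; the UNIQUE constraint is what
# guarantees that each distinct string is stored only once.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Strings ("
    " id INTEGER PRIMARY KEY,"
    " content TEXT NOT NULL UNIQUE)"
)

def intern_string(conn, text):
    """Return the id of text, inserting a row only if it is not yet stored."""
    row = conn.execute(
        "SELECT id FROM Strings WHERE content = ?", (text,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute("INSERT INTO Strings (content) VALUES (?)", (text,))
    return cur.lastrowid

# The same literal seen twice maps to one integer id.
first = intern_string(conn, "blue")
second = intern_string(conn, "blue")
other = intern_string(conn, "red")
stored = conn.execute("SELECT COUNT(*) FROM Strings").fetchone()[0]
```

Once strings are reduced to integer ids in this way, a statement can be stored as a small row of integers, which is the property the paragraph above relies on.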
The tables are constructed to provide as many unique indexes as possible. A breakdown of the tables can be seen in the machine-generated table description.
Following is a table-by-table breakdown of the relationships in the RDF persistent database.
The Uris table stores a list of the URIs without their fragments.
Because a URI may refer to data that changes over time, Revisions are an attempt to allow multiple versions of a document to co-exist in the same RDF database. One benefit of having Revisions is that data may be half-parsed from a new version without interfering with queries on triples in the previous version. This is pretty dicey and is likely to go away.
The Fragments table stores the Fragment portion of the URIs as well as the URI element which is the base name for the full fragment identifier.
More recent work on attributions is available at Source Attribution in RDF.
In an environment of heterogeneous trust requirements, each statement needs some way to identify its source. Attributions stores the External ID of the document being parsed, including the most recent fragment ID encountered when the Attribution is stored.
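The typical storage mechanism is:
The parser starts on http://www.w3.org/1999/07/example.rdf, creates an Attribution for the document http://www.w3.org/1999/07/example.rdf and stores it as the current attribution.
When a Statement is parsed, the current attribution is stored with this Statement so it is known where this Statement was encountered.
When a fragment ID such as id1 is encountered, the parser sets the current attribution to the Attribution for http://www.w3.org/1999/07/example.rdf#id1.
Statements parsed after that point are stored with the new current attribution.
Used in: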
SELECT Statements.id FROM Statements,Attributions,Fragments,Uris WHERE Statements.attribution=Attributions.id AND Attributions.fragment=Fragments.id AND Fragments.uri=Uris.id AND Uris.uri="http://www.w3.org/1999/07/example.rdf" GROUP BY Statements.id
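The query above can be tried against a toy database. The sketch below is an assumption-laden illustration: it uses sqlite3, keeps only the columns the query touches, and the sample rows (including http://example.org/other.rdf and the helper name `statements_from`) are invented for the example. The URI is bound as a parameter instead of being written as a double-quoted literal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Uris (id INTEGER PRIMARY KEY, uri TEXT NOT NULL UNIQUE);
CREATE TABLE Fragments (id INTEGER PRIMARY KEY, uri INTEGER NOT NULL,
                        fragment TEXT NOT NULL);
CREATE TABLE Attributions (id INTEGER PRIMARY KEY, fragment INTEGER NOT NULL);
CREATE TABLE Statements (id INTEGER PRIMARY KEY, predicate INTEGER,
                         subjectNode INTEGER, objectNode INTEGER,
                         attribution INTEGER NOT NULL);

-- Hypothetical sample data: two documents, one with a fragment id1.
INSERT INTO Uris VALUES (1, 'http://www.w3.org/1999/07/example.rdf');
INSERT INTO Uris VALUES (2, 'http://example.org/other.rdf');
INSERT INTO Fragments VALUES (1, 1, '');
INSERT INTO Fragments VALUES (2, 1, 'id1');
INSERT INTO Fragments VALUES (3, 2, '');
INSERT INTO Attributions VALUES (1, 1);
INSERT INTO Attributions VALUES (2, 2);
INSERT INTO Attributions VALUES (3, 3);
INSERT INTO Statements VALUES (1, 20, 100, 101, 1);
INSERT INTO Statements VALUES (2, 20, 100, 102, 2);
INSERT INTO Statements VALUES (3, 21, 103, 104, 3);
""")

def statements_from(conn, uri):
    """Return the ids of all Statements attributed to the given document."""
    rows = conn.execute(
        "SELECT Statements.id "
        "FROM Statements, Attributions, Fragments, Uris "
        "WHERE Statements.attribution = Attributions.id "
        "AND Attributions.fragment = Fragments.id "
        "AND Fragments.uri = Uris.id "
        "AND Uris.uri = ? "
        "GROUP BY Statements.id ORDER BY Statements.id",
        (uri,),
    ).fetchall()
    return [row[0] for row in rows]

from_example = statements_from(conn, "http://www.w3.org/1999/07/example.rdf")
from_other = statements_from(conn, "http://example.org/other.rdf")
```

Statements attributed to the document itself and to its fragment id1 are both returned, because both Attributions lead back to the same Uris row.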
Other APIs have other names for Attributions and similar concepts. See also:
The Strings table stores the uniquified strings encountered in the database. This has the interesting side-effect that (<identifier for Joe's car> --hasColor-> "blue") refers to the same blue node as (<identifier for Joe> --hasMood-> "blue"). Good? Bad? I don't know, but it will probably be useful to limit queries on nodes with an object that is a String. This will be especially pertinent in the rdf_browser, where you won't want to see zillions of nodes pointing to "blue".
Anonymous nodes are given GenIds to uniquely identify them for query relationships.
The RdfIds table is a convenience that simplifies the Statement table. Since the predicate is a fragment identifier, the subject is a fragment identifier or an anonymous node, and the object is any of those or a string, queries where one Statement's object is another Statement's subject tend to get very cumbersome (like this sentence). The solution is to store each of these elements as an RdfId.
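A chained query of the kind described above might look like the following sketch. It assumes, based on the examples elsewhere in this document, that Statements refer to Nodes by id and that each Node carries an rdfId; the reduced table layouts, the integer values and the sqlite3 setting are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Nodes (id INTEGER PRIMARY KEY, rdfId INTEGER NOT NULL,
                    attribution INTEGER NOT NULL);
CREATE TABLE Statements (id INTEGER PRIMARY KEY, predicate INTEGER,
                         subjectNode INTEGER, objectNode INTEGER,
                         attribution INTEGER);

-- Hypothetical data: rdfId 11 (say, Joe's car) is the object of a
-- statement in document 1 and the subject of a statement in document 2.
INSERT INTO Nodes VALUES (1, 10, 1);
INSERT INTO Nodes VALUES (2, 11, 1);
INSERT INTO Nodes VALUES (3, 11, 2);
INSERT INTO Nodes VALUES (4, 12, 2);
INSERT INTO Statements VALUES (1, 20, 1, 2, 1);
INSERT INTO Statements VALUES (2, 21, 3, 4, 2);
""")

# Pair every statement with the statements whose subject has the same
# RdfId as its object. One integer comparison replaces a case analysis
# over fragment identifiers, anonymous nodes and strings.
chained = conn.execute(
    "SELECT s1.id, s2.id "
    "FROM Statements AS s1, Nodes AS n1, Nodes AS n2, Statements AS s2 "
    "WHERE s1.objectNode = n1.id "
    "AND n1.rdfId = n2.rdfId "
    "AND s2.subjectNode = n2.id "
    "ORDER BY s1.id, s2.id"
).fetchall()
```

Here statement 1 chains to statement 2 even though the two were asserted by different documents, since both Nodes share RdfId 11.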
In an environment of heterogeneous data sources, it is likely (and even hoped) that a statement may come from multiple Attributions. It is, however, important to allow statements made by some data source to be rescinded, for instance, when the document is replaced. All Nodes referring to the same RdfId represent all the data sources that have made assertions about that same object.
@@@ can I have a unique on rdfId and attribution? If so, change add('Nodes') to ensure('Nodes', 'u_rdfId_attribution'). While I'm at it, may I assert that the same statement made multiple times in the same document is redundant and should be counted only once?
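Rescinding the assertions of one data source, as described above, could be sketched as a pair of deletes keyed on the attribution. The helper names `rescind` and `sources_for`, the reduced table layouts and the sample rows are assumptions made for illustration; the sketch runs on sqlite3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Nodes (id INTEGER PRIMARY KEY, rdfId INTEGER NOT NULL,
                    attribution INTEGER NOT NULL);
CREATE TABLE Statements (id INTEGER PRIMARY KEY, predicate INTEGER,
                         subjectNode INTEGER, objectNode INTEGER,
                         attribution INTEGER NOT NULL);

-- Hypothetical data: two sources (1 and 2) both mention rdfId 10.
INSERT INTO Nodes VALUES (1, 10, 1);
INSERT INTO Nodes VALUES (2, 10, 2);
INSERT INTO Nodes VALUES (3, 11, 1);
INSERT INTO Nodes VALUES (4, 11, 2);
INSERT INTO Statements VALUES (1, 20, 1, 3, 1);
INSERT INTO Statements VALUES (2, 20, 2, 4, 2);
""")

def sources_for(conn, rdf_id):
    """Return the attributions that have made assertions about rdf_id."""
    rows = conn.execute(
        "SELECT attribution FROM Nodes WHERE rdfId = ? ORDER BY attribution",
        (rdf_id,),
    ).fetchall()
    return [row[0] for row in rows]

def rescind(conn, attribution):
    """Remove everything asserted by one attribution, leaving others intact."""
    conn.execute("DELETE FROM Statements WHERE attribution = ?", (attribution,))
    conn.execute("DELETE FROM Nodes WHERE attribution = ?", (attribution,))

before = sources_for(conn, 10)
rescind(conn, 1)
after = sources_for(conn, 10)
remaining = conn.execute("SELECT id FROM Statements").fetchall()
```

Because each source has its own Node per RdfId, removing one source does not disturb what the other source asserted about the same object.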
The Statements table stores the Statements made in the parsed RDF.
Some statements are not made about a specific subject, but instead about each of the elements in a container. This is implemented as a generic mapping from one (p,s,o) triple to another. In this case, the mapped-from triple is the membership of X in a bag Y. The MappedNodes ...
Typical data:
+----+-------+---------+-------------+-----------+-------------+
| id | rdfId | mapFrom | description | container | attribution |
+----+-------+---------+-------------+-----------+-------------+
|  1 |    37 |      59 |          42 |        32 |          13 |
+----+-------+---------+-------------+-----------+-------------+
which corresponds to:
file:/test.rdf#B1 map from http://www.w3.org/1999/02/22-rdf-syntax-ns#li
The MappedStatements table stores the MappedStatements.
Typical data:
+----+-----------+-------------+------------+------------+-----------+-------------+-----------+-------------+
| id | predicate | subjectNode | subjectMap | objectNode | objectMap | containerId | reifiedId | attribution |
+----+-----------+-------------+------------+------------+-----------+-------------+-----------+-------------+
|  1 |        24 |        NULL |          1 |        128 |      NULL |          32 |        52 |          13 |
+----+-----------+-------------+------------+------------+-----------+-------------+-----------+-------------+
which corresponds to:
(http://www.w3.org/schema/certHTMLv1/hasAccessTo, file:/test.rdf#B1 map from http://www.w3.org/1999/02/22-rdf-syntax-ns#li, file:/test.rdf#http://resource/a) reified as file:/test.rdf#hata in file:/test.rdf#pre by file:/test.rdf
Used in:
For the statement (http://www.w3.org/1999/02/22-rdf-syntax-ns#li <59>, file:/test.rdf#B1 <129 with RdfId 37>, file:/test.rdf#U1 <130>) by file:/test.rdf, execute the SQL:
INSERT INTO Statements (predicate,subjectNode,objectNode,containerId,reifiedId,attribution)
SELECT MappedStatements.predicate,130,MappedStatements.objectNode,
       MappedStatements.containerId,MappedStatements.reifiedId,
       MappedStatements.attribution
FROM MappedStatements,MappedNodes
WHERE MappedStatements.subjectMap=MappedNodes.id
  AND MappedNodes.rdfId=37
  AND MappedNodes.mapFrom=59
which produces:
+-----------+-----+------------+-------------+-----------+-------------+
| predicate | 130 | objectNode | containerId | reifiedId | attribution |
+-----------+-----+------------+-------------+-----------+-------------+
|        24 | 130 |        128 |          32 |        52 |          13 |
+-----------+-----+------------+-------------+-----------+-------------+
or (http://www.w3.org/schema/certHTMLv1/hasAccessTo, file:/test.rdf#U1, file:/test.rdf#http://resource/a) reified as file:/test.rdf#hata in file:/test.rdf#pre for @@@
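The expansion of a mapped statement for a container member can be reproduced end to end with the typical data shown above. This is a sketch under assumptions: it runs on sqlite3, declares every column as a plain INTEGER, and wraps the INSERT in a hypothetical helper `expand_mapping` that takes the member's node id, the container's RdfId and the mapFrom id as parameters instead of hard-coding 130, 37 and 59.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MappedNodes (id INTEGER PRIMARY KEY, rdfId INTEGER,
    mapFrom INTEGER, description INTEGER, container INTEGER,
    attribution INTEGER);
CREATE TABLE MappedStatements (id INTEGER PRIMARY KEY, predicate INTEGER,
    subjectNode INTEGER, subjectMap INTEGER, objectNode INTEGER,
    objectMap INTEGER, containerId INTEGER, reifiedId INTEGER,
    attribution INTEGER);
CREATE TABLE Statements (id INTEGER PRIMARY KEY, predicate INTEGER,
    subjectNode INTEGER, objectNode INTEGER, containerId INTEGER,
    reifiedId INTEGER, attribution INTEGER);

-- The "typical data" rows from the tables above.
INSERT INTO MappedNodes VALUES (1, 37, 59, 42, 32, 13);
INSERT INTO MappedStatements VALUES (1, 24, NULL, 1, 128, NULL, 32, 52, 13);
""")

def expand_mapping(conn, member_node, container_rdf_id, map_from):
    """Copy each statement mapped over the container onto one member node."""
    conn.execute(
        "INSERT INTO Statements "
        "(predicate, subjectNode, objectNode, containerId, reifiedId, attribution) "
        "SELECT MappedStatements.predicate, ?, MappedStatements.objectNode, "
        "MappedStatements.containerId, MappedStatements.reifiedId, "
        "MappedStatements.attribution "
        "FROM MappedStatements, MappedNodes "
        "WHERE MappedStatements.subjectMap = MappedNodes.id "
        "AND MappedNodes.rdfId = ? AND MappedNodes.mapFrom = ?",
        (member_node, container_rdf_id, map_from),
    )

# Node 130 (file:/test.rdf#U1) is a member of the bag with RdfId 37.
expand_mapping(conn, 130, 37, 59)
rows = conn.execute(
    "SELECT predicate, subjectNode, objectNode, containerId, reifiedId, "
    "attribution FROM Statements"
).fetchall()
```

The single resulting row carries the same values as the result table above, with the member node 130 substituted for the mapped subject.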
The Descriptions table stores the information that comes with an RDF:Description tag.
Used in:
SELECT Nodes.id FROM Nodes,Descriptions WHERE Nodes.description=Descriptions.id AND Descriptions.type IN (40) GROUP BY Nodes.id
The general goal in designing a relational database is to accurately represent the acquired data. This includes splitting the data into as many fields as can be made useful in a typical query. However, since RDF is a way of storing data, and that data can be stored in triples, any fields beyond those necessary to store the predicate, subject, object and their types may be regarded as extraneous.
One possible schema is to store the attributions in the Statements table. It could be argued that one should not have to make queries on a statement's reifiedId. I put the reification information in separate Statements fields as I wanted to be able to implement a trust policy without doubling my number of joins. A nice benefit of this is that it is very easy to avoid presenting the triples that are the product of reification. I will probably not get around to experimenting with it the other way, but it probably should be done.
This section outlines the process for locating or creating nodes in the SQL database.
Here are links to relevant specifications and products:
Last revised: $Date: 2003/11/12 12:09:45 $ by: $Author: eric $