RDF Query Access to WWW2005 Conference Data

CVS Version:
$Id: Overview.html,v 1.15 2005/05/19 05:25:11 eric Exp $
Author:
Eric Prud'hommeaux, W3C

Abstract

This document describes the release of non-private WWW2005 data via an RDF SPARQL query. The query is transformed to an SQL query over a normalized conventional database containing the WWW2005 conference data.

Status of this Document

This documents experiments by the author. It is not endorsed by the W3C Team or Membership.

Table of Contets

Introduction

There are many benefits to making relational data available on the semantic web. There is a wealth of relational data; database schemas help keep this data accurate semantically consistent; regular access by SQL helps keep the data in gor form for machine processing. This data is the key to much of our infrastructure. Providing access to this data will encourage the adoption of semantic web tools for infrastructure support.

Relational data may easily be copied into RDF by script. The approach taken here was to declaratively describe the WWW2005 conference data and use FeDeRate to transform SPARQL (an RDF query language) queries into SQL queries. Declarative database descriptions are easier to maintain and share amongst applications. FeDeRate also uses these descriptions to provide update access to the authoritative relational data.

Much of our infrastructure depends on machine-processable data in relational databases. In a sense, that some data is in a relational database shows that it was valuable enough to maeke it available to machines. Making this data available on the semantic web encourages a smooth adoption of the semantic web tools into our daily lives.

This demo shows how RDF queries can be translated to SQL queries over a conventional relational database. The query presents an intuitive, schema independent interface to the data. The generated SQL shows that for many practical queries, a simple RDF query mapping to SQL produces queries that are as good as anything hand-coded.

A more terse presentation of this material is available in a WWW2005 talk ([Talk]).

Schedule Info

Given non-private conference data availabe in RDF, the user can make the a SPARQL query (HTML demo interface) and learn which talks are presented by "Bing Liu":

nameroomDescstartend
"Visualizing and Comparing Opinions on the Web""Room 303"10:3012:00
"Web Data Extraction Based on Partial Tree Alignmen""Room 303"10:5012:20

SPARQL Query

PREFIX w5: <http://www.w3.org/2004/10/18-RDF-WWW2005/#>
PREFIX xs: <http://www.w3.org/2001/XMLSchema#>
SELECT ?name ?roomDesc ?start ?end
 WHERE { ?author  w5:authorName       "Bing Liu" .
         ?author  w5:wrote            ?paper .
         ?paper   w5:paperName        ?name . 
         ?paper   w5:inSession        ?session .
         ?session w5:Session_room     ?room .
         ?room    w5:Room_description ?roomDesc .
         ?program w5:Program_session  ?session .
         ?program w5:start            ?start .
         ?program w5:end              ?end }

SQL Query

The above query uses a machine-readable description of the database to create the following SQL query:

SELECT STRAIGHT_JOIN
       Author_0._id AS author__id,
       Paper_0._id AS paper__id,
       Paper_0.name AS name_name,
Session_0._id AS session__id,
       Room_0._id AS room__id,
       Room_0.description AS roomDesc_description,
Program_0._id AS program__id,
       Program_0.start1 AS start_start1,
Program_0.end1 AS end_end1

FROM Author AS Author_0
     INNER JOIN Paper AS Paper_0 ON Author_0.paper=Paper_0._id
     INNER JOIN Session AS Session_0 ON Paper_0.session=Session_0._id
     INNER JOIN Room AS Room_0 ON Session_0.room=Room_0._id
     INNER JOIN Program AS Program_0 ON Program_0.session=Session_0._id
WHERE Author_0.name="Bing Liu"
GROUP BY author__id,paper__id,name_name,session__id,room__id,roomDesc_description,program__id,start_start1,end_end1

SQL queries generated from SPARQL queries can be seen by enabling the HTML rendering option processing messages (see the SQL for the above query).

Declarative Description of Schema

Most of the database description comes from a properties file describing access to the database and some mappings of RDF properties to database fields:

user: xxx
password: yyy
database: WWW2005
queryHost: localhost
schemaNS: http://www.w3.org/2004/10/18-RDF-WWW2005/#

predicateMap: http://www.w3.org/2004/10/18-RDF-WWW2005/#wrote => Author.paper
predicateMap: http://www.w3.org/2004/10/18-RDF-WWW2005/#authorName => Author.name
predicateMap: http://www.w3.org/2004/10/18-RDF-WWW2005/#paperName => Paper.name
predicateMap: http://www.w3.org/2004/10/18-RDF-WWW2005/#paperEmail => Paper.email
predicateMap: http://www.w3.org/2004/10/18-RDF-WWW2005/#inSession => Paper.session
predicateMap: http://www.w3.org/2004/10/18-RDF-WWW2005/#start => Program.start1
predicateMap: http://www.w3.org/2004/10/18-RDF-WWW2005/#end => Program.end1

auth.group.Paper.email: registered

The top section includes database access information. The schemaNS specifies the base URI for the row and property identifiers to provide this database with global identifiers. predicateMaps provide aliases for these properties. They hide the actual field names from the query writer. This is handly because often databases evlove with some rather arcane field names (such as Program.start1 above). The auth directive is described below in .

This data should be represented in RDF instead of a Java properties file. At that point, it can merge with other system data and the predicateMaps will be simple rules. (This is already accomplished by interface rules, but the predicateMaps are easier to explain and debug.) Note that wrote is a reciporical property to dc:creator. One need only consider how one would like that data to appear in the semantic web to set up predicate maps.

Privacy

The only data that was in the relation data that was not available on the website was the email addres contacts for associated with each paper. No query may divulge or even confirm these email addresses. The directive

auth.group.Paper.email: registered

prevents FeDeRate from processing any query involving the Paper.email field for any user who is not in the registered group. This group is currently empty, except for the author, but is intended populated by all the authors of the conference papers. Properties could be defined (and code implemented) to make sure that only the author could see the email field. This will be most useful when the properties file is in RDF.

Interface Rules

The predicateMaps described above are a simple form of interface rule. Interface rules can hide database structure as well as field names. To date, the only work with interface rules was done on the W3C ACLs database. The rules were expressed in Algae2. Here is an example:

interface (sqlDB:W3Cacls
           API  (?sub     acl:memberOf                ?super) 
           impl (#?g1     sqlDB:idInclusions.id       ?sub .
                  ?g1     sqlDB:idInclusions.groupId  ?super)) 
interface (sqlDB:W3Cacls
           API  (?a       acl:accessor                ?principal) 
           impl (?a       sqlDB:acls.id               ?principal)) 

Efficiency

This section uses terms from the SPARQL query language ([SPARQL Query]).

Apart from interface rules, query translation involves a linear iteration over the triple patterns in the graph pattern, linear tree transformations, many lookups in small (order of the number of variables in the graph pattern) hash tables, and linear query serialization. In short, the query preparation is a very small amount of work.

Result processing is linear with the number of results. It involves some URI-generation, and hash lookups. By far, the largest overhead comes from re-serializing the data.

Conclusions

The work to make a relational database available to RDF is very small. One need only provide a properties file defining the database access parameters (host, database, name, password) and add auth parameters to protect confidential information. For this minimal investment, the publisher offers a new kind of access to the relational data. This may be used to join with other data in other databases.

Additional modeling work in the form of predicateMaps or interface rules is rewarded by a more intuitive interface to the data. Again, the investment is small compared to the potential benefits. It is the author's hope that this work will encourage the reader to expose more data on the semantic web.

Future Work

Here are some ideas for future conferences:

Appendixes

Relational Data

The data used for the above query was from a relational database storing the WWW2005 conference data. The schema of this data is described by example with excerpts from the tables involved in the above query:

Author

SELECT * FROM Author WHERE name="Bing Liu";
_idpapertypenameorg
194516authorBing LiuUniversity of Illinois at Chicago
204536authorBing LiuUniversity of Illinois at Chicago

Paper

SELECT * FROM Paper WHERE _id IN (516,536);
_idnamename1name2emailsession
516Web Data Extraction Based on Partial Tree AlignmenLiuBing[email protected]5
536Visualizing and Comparing Opinions on the WebLiuBing[email protected]23

Session

SELECT * FROM Session WHERE _id IN (5,23);
_idnameroomtypetitle
5PP038paperData Extraction
23PP128paperText Analysis and Extraction

Room

SELECT * FROM Room WHERE _id IN (8);
_id_keynamedescription
8303303Room 303

Program

SELECT * FROM Program WHERE session IN (5,23);
_iddatestartendsessionstart1end1
51110:50:0012:20:0052005-05-11 10:50:002005-05-11 12:20:00
231210:30:0012:00:00232005-05-11 10:30:002005-05-11 12:00:00

References

[SPARQL Query] SPARQL Query Language, Eric Prud'hommeaux, Andy Seaborne (See http://www.w3.org/TR/2004/WD-rdf-sparql-query-20041012/ .)
[SqlDB] Optimized RDF Access to Relational Databases, Eric Prud'hommeaux (See http://www.w3.org/2004/04/30-RDF-RDB-access/ .)
[Talk] Accessing WWW Conference Data, Eric Prud'hommeaux (See http://www.w3.org/2005/Talks/14-May-SqlDB/all.html .)

$Date: 2005/05/19 05:25:11 $