Abstract

This document describes an XSLT stylesheet that transforms application/rdf+xml into a series of RDF database API calls. It also describes a schema annotation system for generating that XSLT, as well as other grammar-defined applications. This research is used in RDAL.

Status

This document represents some experiments in using XSLT to parse RDF. This is not endorsed by the W3C membership.

Goals
RdfXS API
- RdfXS Todo
XQ-like
- Bits from XPath
- Bits from XQuery
Fear, Uncertainty, Doubt
Tests
Revision History

Goals

Generate a validating RDF/XML parser from the RDF/XML schema.
Define an annotation schema to provide semantic actions for grammar productions defined by the schema.
Generate RdfXS API from an annotated schema.
Provide templates to have the RdfXS calls generate useful data formats like
- ntriples (recent example output).
- SQL insert statements.
- SAX event handlers.
Replace the XML-centric grammar used by a Perl RDF parser with SAX event handlers.

RdfXS API

This basic RDF API has four syntactically differentiated datatypes (no constructors for them now):

bnode: example: _:g23fe
uri: example: <http://example.com/>
literal: example: "10"^^http://www.w3.org/2001/XMLSchema#int
XMLLiteral: example: <foo><bar/></foo>

and five functions:

startDocument()
endDocument()
addTriple($predicate, $subject, $object)
addLiteralTriple($predicate, $subject)
error($hint, $expected)

where the parameters $predicate, $subject, $object should all be interned from the database atom dictionary.
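
As a rough illustration of how a presentation layer might sit behind these calls, here is a minimal Python sketch that records N-Triples lines. The class name NTriplesSink and the snake_case method names are assumptions made for this example, not part of RdfXS. Terms are assumed to arrive already serialized, and addLiteralTriple is left out because its object argument is not specified above.

```python
class NTriplesSink:
    """Hypothetical receiver for RdfXS-style calls; records N-Triples lines."""

    def __init__(self):
        self.lines = []
        self.open = False

    def start_document(self):
        self.lines.clear()
        self.open = True

    def end_document(self):
        self.open = False

    def add_triple(self, predicate, subject, obj):
        # Parameter order follows addTriple($predicate, $subject, $object);
        # N-Triples itself is written subject, predicate, object.
        if not self.open:
            self.error("addTriple outside a document", "startDocument")
        self.lines.append(f"{subject} {predicate} {obj} .")

    def error(self, hint, expected):
        raise ValueError(f"{hint} (expected {expected})")


sink = NTriplesSink()
sink.start_document()
sink.add_triple("<http://example.org/stuff/1.0/hasFruit>",
                "<http://example.org/basket>",
                "<http://example.org/banana>")
sink.end_document()
print("\n".join(sink.lines))
```

Swapping in a different sink (SQL inserts, say) would leave the calling side unchanged, which is the point of separating the presentation from the parser.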

RdfXS Todo

It seems practical to add the following constructors:

uri(string): intern a URI string into the database internal representation
bnode(): get a bnode object from the database object dictionary
literal(string, [{prefix1, ns1}, ...]): XML-encoded literal data
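
A minimal sketch of what interning could look like, assuming a plain in-memory dictionary stands in for the database atom dictionary. The class name AtomDictionary and the optional label argument to bnode are assumptions for this example (the label mirrors the bnode(@r:nodeID) usage in the pseudocode further down); literal is not sketched.

```python
import itertools


class AtomDictionary:
    """Hypothetical stand-in for the database atom dictionary."""

    def __init__(self):
        self._atoms = {}
        self._counter = itertools.count(1)

    def uri(self, string):
        # The same URI string always yields the same interned atom.
        return self._atoms.setdefault(("uri", string), f"<{string}>")

    def bnode(self, label=None):
        # Without a label, mint a fresh bnode each time.
        if label is None:
            return f"_:g{next(self._counter)}"
        # With a label, reuse the atom minted for that label.
        key = ("bnode", label)
        if key not in self._atoms:
            self._atoms[key] = f"_:g{next(self._counter)}"
        return self._atoms[key]


atoms = AtomDictionary()
same = atoms.uri("http://example.com/") is atoms.uri("http://example.com/")
print(same, atoms.bnode("n1") == atoms.bnode("n1"), atoms.bnode() != atoms.bnode())
```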

XSLT's limited expressivity makes variable assignment awkward: the template has to be split in two, with all of the state passed into the second segment. In a terse syntax, this looks roughly like:

  # Call typedNode with a predicate and a subject.
  call-template typedNode_0(predicate="p1", subject="s1")

# Template for typedNode production.
template typedNode_0 (predicate, subject)
  if (@r:about) call-template typedNode_1(predicate, subject, object=uri(@r:about))
  else if (@r:ID) call-template typedNode_1(predicate, subject, object=uri(@r:ID, baseUri))
  else if (@r:nodeID) call-template typedNode_1(predicate, subject, object=bnode(@r:nodeID))
  else call-template typedNode_1(predicate, subject, object=bnode(generate-id(.)))

# Chained typedNode production with object variable set.
template typedNode_1 (predicate, subject, object)
  # Continue the typedNode template with object set.

To really have uri, bnode and literal be templates would require a version of each template for each possible set of parameters passed to the next template. Yeah, right. Perhaps it will be easy to implement them as sort of a macro that gets expanded when writing the XSLT.
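
The chaining pattern may be easier to see outside XSLT. The following Python sketch mimics it with two functions: the first segment never assigns a variable, it only picks the object and passes all state to the second segment. The helper names, the dict of attributes, and the treatment of rdf:ID as base URI plus "#" plus ID are simplifying assumptions for illustration.

```python
def uri(value, base=None):
    # Simplified: an rdf:ID is resolved as base + "#" + ID.
    return f"<{base}#{value}>" if base else f"<{value}>"


def bnode(label):
    return f"_:{label}"


def typed_node_1(predicate, subject, obj):
    # Second segment: the rest of the production, now with obj bound.
    return (predicate, subject, obj)


def typed_node_0(attrs, predicate, subject, base_uri, generated_id):
    # First segment: each branch "assigns" obj by passing it onward.
    if "about" in attrs:
        return typed_node_1(predicate, subject, uri(attrs["about"]))
    if "ID" in attrs:
        return typed_node_1(predicate, subject, uri(attrs["ID"], base_uri))
    if "nodeID" in attrs:
        return typed_node_1(predicate, subject, bnode(attrs["nodeID"]))
    # generated_id stands in for XSLT's generate-id(.)
    return typed_node_1(predicate, subject, bnode(generated_id))


print(typed_node_0({"nodeID": "g23fe"}, "p1", "s1", "http://example.org/doc", "id0"))
```

Every extra variable adds a parameter to every later segment, which is why a macro expansion at XSLT-generation time looks more attractive than real templates.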

XQ-like

This language is called XQ-like because it is similar in syntax to XQuery, and even shares some semantics like variable assignment, XPath node access... It is intended to use a very small subset of XQuery. That subset currently excludes access to parent and child nodes (apart from the attribute nodes that are children of their containing element) in order to make the SAX event handlers simple. It will be possible to write handlers that track state for access to XPath nodes during other events, but that seemed like work so I punted. (Some of this has to be done to distinguish productions which differ in their nested elements.)

Bits from XQuery

let $f:=
declare function foo

Fear, Uncertainty, Doubt

Generating the collection template is going to be hard. I fear it. I rue the day.

RDF is unordered by default. The parseType="Collection" attribute is used to specify ordered, closed (thoroughly enumerated) sets in RDF. As an example,

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/stuff/1.0/">
  <rdf:Description rdf:about="http://example.org/basket">
    <ex:hasFruit rdf:parseType="Collection">
      <rdf:Description rdf:about="http://example.org/banana"/>
      <rdf:Description rdf:about="http://example.org/apple"/>
      <rdf:Description rdf:about="http://example.org/pear"/>
    </ex:hasFruit>
  </rdf:Description>
</rdf:RDF>

states that the basket has exactly the set of (banana, apple, pear). This is represented by a graph of 7 triples (figure: graphviz-generated graph). The nil node indicates the end of the list (and keeps anyone from adding to the closed list).

The tricky part is adding the arc to the nil node as the rest of the last element. RDFXMLtoRdfXS currently uses the test ./*[$index+1] to find the last element in a collection and has some conditional code to stitch the earlier elements together. This breaks the easy mapping to SAX handlers. Guess this will require some of the state-tracking handler alluded to in XQ-like.
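
To make the state-tracking idea concrete, here is a hedged Python sketch of a SAX handler that cannot look ahead, so it remembers the last list cell and emits its rest arc one event late: either when the next member starts or when the property element closes. The class name CollectionHandler and the _:cN cell labels are assumptions for this example, and it handles only the flat shape of the basket example (members identified by rdf:about, no nesting).

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/stuff/1.0/">
  <rdf:Description rdf:about="http://example.org/basket">
    <ex:hasFruit rdf:parseType="Collection">
      <rdf:Description rdf:about="http://example.org/banana"/>
      <rdf:Description rdf:about="http://example.org/apple"/>
      <rdf:Description rdf:about="http://example.org/pear"/>
    </ex:hasFruit>
  </rdf:Description>
</rdf:RDF>"""


class CollectionHandler(ContentHandler):
    """Hypothetical handler: tracks the pending list cell between events."""

    def __init__(self):
        super().__init__()
        self.triples = []        # (subject, predicate, object)
        self.subject = None      # subject of the enclosing node element
        self.prop = None         # property element carrying the collection
        self.cell = None         # last cell still waiting for its rdf:rest
        self.in_collection = False
        self.count = 0

    def startElementNS(self, name, qname, attrs):
        ns, local = name
        about = attrs.get((RDF, "about"))
        if self.in_collection and about is not None:
            self.count += 1
            cell = f"_:c{self.count}"
            if self.cell is None:
                self.triples.append((self.subject, self.prop, cell))
            else:
                # Only now do we know the previous cell was not the last.
                self.triples.append((self.cell, f"<{RDF}rest>", cell))
            self.triples.append((cell, f"<{RDF}first>", f"<{about}>"))
            self.cell = cell
        elif about is not None:
            self.subject = f"<{about}>"
        elif attrs.get((RDF, "parseType")) == "Collection":
            self.in_collection = True
            self.prop = f"<{ns}{local}>"
            self.cell = None

    def endElementNS(self, name, qname):
        ns, local = name
        if self.in_collection and f"<{ns}{local}>" == self.prop:
            # Closing the property element is what reveals the last member.
            tail = self.cell if self.cell is not None else self.subject
            pred = f"<{RDF}rest>" if self.cell is not None else self.prop
            self.triples.append((tail, pred, f"<{RDF}nil>"))
            self.in_collection = False


parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)
handler = CollectionHandler()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(DOC.encode("utf-8")))
for triple in handler.triples:
    print(" ".join(triple), ".")
```

The extra fields (cell, in_collection) are exactly the kind of cross-event state that the simple one-handler-per-production mapping does not need elsewhere.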

Parsing Collections with XSLT

The hand-coded stylesheet uses a recursive template to walk through children (members of the collection):

  if (@parseType = 'Collection')
    addTriple(predicate, subject, bnode(.))
    collection_r(subject, 1)

  collection_r(subject, index)
    for-each select="./*[$index]"
      typedNode_0(r:first, subject)

    if (./*[$index+1])
      addTriple(r:rest, subject, bnode(.))
      collection_r(bnode(.), index+1)
    else
      addTriple(r:rest, subject, r:nil)
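
For comparison, the same recursion can be sketched in Python over an in-memory list of members, where looking one position ahead is trivial. The function signature, the _:cellN labels, and the local add_triple collector are assumptions for this example; it also assumes a non-empty collection.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def collection_r(members, cell, index, add_triple):
    """Emit rdf:first and rdf:rest arcs for the member at 1-based index."""
    add_triple(f"<{RDF}first>", cell, members[index - 1])
    if index < len(members):  # plays the role of the ./*[$index+1] test
        next_cell = f"_:cell{index + 1}"
        add_triple(f"<{RDF}rest>", cell, next_cell)
        collection_r(members, next_cell, index + 1, add_triple)
    else:
        add_triple(f"<{RDF}rest>", cell, f"<{RDF}nil>")


triples = []


def add_triple(predicate, subject, obj):
    # Same argument order as addTriple; stored as (subject, predicate, object).
    triples.append((subject, predicate, obj))


members = ["<http://example.org/banana>",
           "<http://example.org/apple>",
           "<http://example.org/pear>"]
head = "_:cell1"
add_triple("<http://example.org/stuff/1.0/hasFruit>", "<http://example.org/basket>", head)
collection_r(members, head, 1, add_triple)
for s, p, o in triples:
    print(s, p, o, ".")
```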

Tests

So far, I've only tested the hand-generated XSLT on a few RDF tests:

name	input	output	problems
kitchen sink test	test.rdf	test.ntriple	needs XSLT for c14n
attribute	testAttr.rdf	testAttr.ntriple
literal	literal.rdf	literal.ntriple

The current machine-generated XSLT is much more rigorous, though not actually functional. Features:

attribute test: Test that all attributes are allowed with a given production.
production selection -- not working: I actually have a version of this that's a few hours closer to working, if I can recover my disk drive.
template chaining -- not generalized: This is needed for productions that have assignments while staying in the same production, specifically, parseType="Collection".
semantic actions dispatch -- not started: Provide variable assignment and calls to templates described with function declarations. This will lean on template-chaining for variable assignment.

Revision History

The bulk of the work is in the RDFXMLtoRdfXS.xsl script, so versions track its CVS version:

1.3: added startDocument and endDocument functions to the API; wrote a dot presentation.
1.2: separated ntriple presentation via the RdfXS API.
1.1: imported from rdfToDB and output ntriples. Added support for parseType="Collection" and "Literal".

This work started with rdfToDB.xsl, which is no longer being maintained:

1.3: added parseType="Literal" support
1.2: s/GENID/genid/ for consistency
1.1: import of Evan Lenz's work

Eric Prud'hommeaux

Last modified: Sat Jan 3 04:40:19 EST 2004