Smart Descriptions & Smarter Vocabularies (SDSVoc) Report

Executive Summary

The Smart Descriptions & Smarter Vocabularies Workshop (SDSVoc) was organized by W3C under the EU-funded VRE4EIC project and hosted by CWI in Amsterdam. Of 106 registrations, it is estimated that 85-90 people attended. The event comprised a series of sessions in which thematically related presentations were followed by Q&A with the audience, which notably included representatives from both the scientific research and government open data communities.

The workshop began with a series of presentations of different approaches to dataset description, including the CERIF standard used by VRE4EIC, followed by a closely related set of experiences of using the W3C’s Data Catalog Vocabulary, DCAT. It was very clear from these talks that DCAT needs to be extended to cover gaps that practitioners have found different ways to fill. High on the list is versioning and the relationships between datasets, but other factors such as descriptions of APIs and a link to a representative sample of the data are also missing.

High level descriptions of any dataset are likely to be very similar (title, license, creator etc.) but to be useful to another user, the metadata will need to include domain-specific information. A high profile case is data related to specific locations in time and space, and one can expect this to be part of the general descriptive regime. On the other hand, highly specialized data, such as details of experiments conducted at CERN, will always need esoteric descriptions.

An important distinction needs to be made between data discovery – for which the very general approach taken by schema.org is appropriate – and dataset evaluation. In the latter case, an individual needs to know details of things like provenance and structure before they can evaluate its suitability for a given task.

A descriptive vocabulary is only a beginning. In order to achieve real world interoperability, the way a vocabulary is used must also be specified. Application Profiles define cardinality constraints, enumerated lists of allowed values of given properties and so on, and it is the use of these that allows data to be validated and shared with confidence. Several speakers talked about their validation tools and it’s clear that a variety of techniques are used. Validation techniques as such were out of scope for the workshop, although there were many references to the emerging SHACL standard. Definitely in scope however was how clients and servers might exchange data according to a declared profile. That is, apply the concept of content negotiation not just to content type (CSV, JSON, RDF or whatever) but also the profile used. The demand for this kind of functionality has been made for many years and proposals were made to meet that demand in future standardization work.

The workshop concluded that a new Working Group should be formed to:

revise and extend DCAT;
provide guidance and exemplars of its use;
standardize, or support the standardization elsewhere, of content negotiation by profile.

Introduction

The Smart Descriptions & Smarter Vocabularies Workshop (SDSVoc) was organized by W3C under the EU-funded VRE4EIC project and hosted by CWI in Amsterdam, 30 November – 1 December 2016. 106 people registered for the event for which the estimated actual attendance was 85-90. There were 30 presentations plus 12 additional panelists. In a summary report like this it is impossible to capture every detail of the many conversations held across two very full days. The papers, slides and raw notes are all linked from the agenda but even they can’t capture everything.

The workshop came about through discussions within the VRE4EIC project which is building a virtual research environment. Among other services, this provides users with access to data held in multiple research infrastructures, most of which use different methods to describe the data in their catalogs. How can one tool access different datasets when the descriptions are so varied? The VRE4EIC answer is to use CERIF as a common model to which a large number of metadata vocabularies have been mapped. The European Commission’s answer to the related problem of making data portals interoperable is to encourage the use of their application profile of DCAT (DCAT-AP).

This is a reminder that ‘standards’ are nothing more than the consensus view of the people who wrote and who operate in a particular way. SDSVoc was successful in bringing together representatives from the scientific research community and the government open data community. It was encouraging that the problems and current solutions offered were so similar.

The following sections offer a brief summary of each of the presentations made, roughly in the order given and grouped into themes that provided the backbone of each of the workshop sessions.

Approaches to Dataset Description

VRE4EIC’s Scientific Coordinator, Keith Jeffery, began the workshop with a description of the CERIF standard, which the project uses in its virtual research environment. It predates Dublin Core and supports rich metadata, including things like start and ends dates for relationships between entities. Many other metadata standards have been mapped to CERIF so that it provides an interchange point between, for example, DCAT, Dublin Core, eGMS, CKAN, DDI and INSPIRE. Andrea Perego noted however that temporal aspects of dataset descriptions appear to be missing, something Keith agreed needed to be added, perhaps using Prov-o on which he’s already working with Kerry Taylor of the Australian National University.

Alasdair Gray presenting the HCLS Dataset Profile

Much of the workshop was focused on describing datasets, whether using DCAT or one of the many other methods and standards for doing this. Alasdair Gray presented the HCLS Community Profile that began under the original openPHACTS project, highlighting the importance of information about versioning and provenance, support for which is lacking in DCAT and others. Dublin Core is widely used and broadly applicable but is very generic and not comprehensive. VoID provides a method to include descriptive metadata within an RDF dataset but again, doesn’t handle versioning and has no cardinality constraints. Nor does DCAT but this does make the important distinction between a dataset and a distribution. Alasdair went on to describe both the HCLS Community Profile and its validation tool. Validation was the focus of a separate session later in the workshop.

Andrea Perego presented the use of DCAT-AP in the new data portal that will provide a single point of access to all data produced or maintained by the European Commission’s Joint Research Centre (JRC). For the JRC, even DCAT-AP – the EC’s Application Profile of the W3C Data Catalog Vocabulary – lacks some needed metadata elements that cross scientific domains and that support data citation. For the latter, Andrea offered a comparison with DataCite where there is good, although not full, alignment. The Geospatial application profile, GeoDCAT-AP, of which Andrea Perego was lead author, is a better match although, again, not complete. Many of the existing standards also enumerate Agent Roles. The geospatial metadata standard, ISO 19115, lists 20, while DataCite lists more and GeoDCAT-AP just 4 – but these are not aligned in any way. Should a platform support both? Neither? Greater use of Prov is needed to provide the kind of detail of exactly who created a dataset, when and how and under what conditions etc. Andrea’s final remarks concerned dataset discovery and the possible mapping to schema.org. This is an active topic within the schema.org community and, during the workshop, a new Community Group was proposed around exactly this topic.

Picking up on comments Andrea made about identifiers, Keith Jeffery noted that many identifiers are closely associated with a role (ORCIDs with researchers, for example) and Axel Polleres wondered if the EC were likely to endorse ORCIDs. This might for example, allow people to authenticate to systems like the European Commission’s Authentication Service, ECAS, using their ORCID.

Markus Freudenburg's comparison of Data ID and DBpedia with the Data on the Web Best Practices

Another approach to dataset descriptions was presented by Markus Freudenburg of Universität Leipzig. The DataID project identified a number shortcomings of DCAT that needed to be addressed, including a lack of provenance information, hierarchies and versions of datasets. Echoing Andrea Perego’s presentation, Markus highlighted the insufficient information about Agent roles and several other issues. As with the HCLS Community profile, Data ID brings DCAT together with the Prov ontology and adds in support for versioning, but, unlike the other vocabularies presented so far, DataID does not impose cardinalities on terms. Full documentation of the extensive DataID vocabulary and its modularization has been prepared and is likely to be submitted to W3C as a Member Submission in the very near future. In the Q&A session it was noted that DataID emphasizes the importance of licensing. In this context the active development of ODRL as a W3C Recommendation is relevant.

Moving slightly away from dataset descriptions, Christian Mader presented the Industrial Dataspace vocabulary that also describes applications and other components in the broader industrial data ecosystem. These include endpoints, providers, service contracts, policies and data formats, as well as spatiotemporal aspects. The vocabulary is aligned with DCAT, Dublin Core, ODRL, the Data Quality Vocabulary and more. IDS is very much a work in progress, using Linked Data to improve communication throughout the manufacturing chain, but a lot of the ideas have resonance with the ‘traditional’ view of data as a Web resource. It shows clear linkage between the Internet of Things, the Web of Data, the Web of Services and the Web of Things.

In a brief presentation, Nandana Mihindukulasooriya described how he and his colleagues at Universidad Politécnica de Madrid are working to make it easier to discover datasets by their content, for example, data that includes descriptions of restaurants. They also want to be able to discover vocabulary usage patterns by crawling and analyzing large volumes of available data. Both tasks are among those tackled by the Loupe model and tool chain.

The discussion that followed the initial presentations was wide ranging but made a number of points:

Simple vocabularies have wide applicability but people complain that they lack features that they need.
Machines need to be able to access data itself, not just retried the URL of a download.
You need to be able to access subsets of large datasets, such as satellite data.
API descriptions are an important part of data access and description.
Tooling is critical.
There is a big difference between the way you handle data that is immutable and data that is frequently or constantly updated.

What’s Wrong With DCAT?

An important motivation for running the workshop was the sustained interest shown from a wide community in the Data Catalog Vocabulary, DCAT. Developed initially at DERI Galway (now part of the Insight Centre), DCAT became a W3C standard in January 2014. That sustained interest is evidence that the standard is a success: it’s widely used in data portals around the world. However, it is abundantly clear that more work needs to be done. As the earlier presentations showed, DCAT is not yet the full story and more needs to be done.

One piece of evidence for this is the European Commission’s development of their DCAT application profile (DCAT-AP). Brecht Wyns of PwC, who works on the European Commission’s ISA² Programme, explained that DCAT-AP is part of a series of efforts that date back to 2013 designed to improve interoperability between data catalogs. DCAT-AP has been adopted by many countries for their national data portals including Italy, Norway and the Netherlands, all of whom have extended it to create their own local extensions. The EC has also created validators, harvesters and exporters to support the profile. In addition to this work, there are even more extensions to the DCAT-AP that cover geospatial information (GeoDCAT-AP) and statistical data (StatDCAT-AP).

One task that these application profiles fulfil is to try and fill the gaps in the base DCAT standard. Brecht suggested that these include:

Relationships between Datasets, incl. versioning, time sequence, parent/child and grouping of collections: use of relation types.
Rights and licenses for datasets: relationship with licenses on catalogue and distributions.
Service-based data access: modelling of non-file distributions and set of properties to enable machine-processing.
Relationship between Distributions: similarity criteria.
Packaging of distribution files: expression of format of included files.
Scientific data and data citation.

During the Q&A session, it was noted that the scientific research community is not represented in the work on DCAT application profiles, which is almost entirely driven by those working on open government data portals. Brecht agreed that was largely true (although the JRC is involved). Like all ISA² working groups, the one working on the implementation guidelines is open to all.

In his presentation, Andreas Kuckartz, co-chair of the W3C Open Government Community Group announced that Germany was about to adopt DCAT and DCAT-AP having previously used a slightly different metadata standard, OGD, used only in Germany and Austria. The move to DCAT-AP.DE is expected to take place during March 2017. Andreas agreed that there is a need for greater clarity over detailed issues such as whether the value of dcterms:license should be a URL or a string, and that the proliferation of national extensions strongly suggests a need to revise and extend the original standard.

The idea that work needed to be done to DCAT was amplified by Beata Lisowska of Development Initiatives. In their work to trace financial and resource flows of development aid, the issues of dataset series, versioning and provenance are particularly important so that the whole data journey can be described. This is not currently possible with DCAT without introducing various hacks around the problems. Jeroen Baltussen, a consultant with Netherlands Enterprise Agency Ministry of Economic affairs, took a different angle. In his work, he’s trying to match open data obligations with explanations of available data, EU initiatives with third party initiatives and more. There is no shortage of possible metadata models from which to choose, as demonstrated by his draft UML diagram of a possible metadata model that just covers some of the areas he’s working on. Jeroen made two concluding comments:

Develop an abstract model for your business and serialize and use API.
Our model is intended for interoperability inside they EU. It does not necessarily fit in an organization. You can't always mesh vocabularies together in one model.

There was a very active discussion at the end of this session, some of which reiterated points made in the previous session. The new points raised can be summarized as:

Any future work on DCAT would do well to begin by looking at the national extensions to DCAT-AP, looking for common approaches.
The standard is not enough; there is a need for further explanation and guidance.
DCAT is one of many related standards. CKAN is not the only data portal software.
Open source tools to support DCAT would be a big help to its adoption.
The scope of DCAT should be broadened to cover APIs and must cover versioning and provenance. However, the scope needs to be contained if it is to remain useful.

Time and Space

This relatively short session focused on how datasets with spatial aspects should be described. A large proportion of data fits into this category, from obvious ones like records of air quality samples to less obvious ones like lists of Christmas Markets. How should the spatial aspects be described in such a way that search engines can find them?

A screenshot of a Wweb form showing a drop down box and a text input element — Andrea Perego's GeoDCAT-AP API

Andrea Perego of the European Commission’s JRC presented the work he did under the ISA Programme to create GeoDCAT-AP. In the geospatial world, metadata is very widely encoded using the ISO 19115:2003 and ISO 19119:2005 standards. GeoDCAT-AP exists to represent this metadata in a manner compatible with DCAT-AP to make it more searchable in data portals. A good deal of work has been done to support the standard itself, in particular, a service that takes the output of a Catalog Service on the Web (the widely used OGC standard) and represents it in either DCAT-AP or GeoDCAT-AP, determining the output format via HTTP content negotiation. Importantly, this makes no change or demands on the underlying service that conforms to INSPIRE and ISO 19115. Andrea cited a number of issues that he feels need addressing, beginning with the very limited use of globally persistent identifiers, in particular HTTP URIs, within geospatial data. He also raised issues around the need for better descriptions of data quality and for content negotiation by profile which was the topic of a separate session on day 2.

Making spatial data more accessible on the Web is the core aim of the work being conducted in the Spatial Data on the Web working group, a collaboration between W3C and OGC. In that context, Linda van den Brink of Geonovum presented work she and Ine de Visser have commissioned to test how existing spatial data can be exposed in a more search engine and developer-friendly way. As with Andrea Perego’s work the aim is to build on top of what is there already without changing it.

In the Geonovum testbed, for each record in a dataset, a Web page was generated that included schema.org markup. Such pages can be generated on the fly or prepared as a set of static pages. A path for entry for search engines is important too, either through a ‘collection point’ or by providing a sitemap. Search engines will not be concerned with all available metadata but that doesn’t make it irrelevant. A human will still need it to assess whether the data meets their needs. As ever, different metadata formats and details serve different purposes and audiences. The Dutch geoportal (NGR) is now implementing the recommendations of the testbed as well as providing the option of accessing metadata following GeoDCAT-AP from its catalog.

There are two specific recommendations from the Geonovum work when publishing any dataset that has a spatial component, and both should be followed:

include the spatial coverage of the features by reference to a named place in a common vocabulary for geospatial semantics (e.g. GeoNames);
in addition, use a set of coordinates to specify the boundaries of the area either as a bounding box (add glossary ref) or a polygon - as is “normal” in the geo world.

Linda highlighted two properties that are missing from schema.org: spatial resolution and projection. These are fundamental for spatial data but are often overlooked by those not directly involved in the field.

In the ensuing panel discussion, Daniele Bailo of VRE4EIC/INGV and Otakar Čerba, University of West Bohemia agreed that scientists don’t tend to use general Web search for data, preferring to use specialist sources. However, Andrea Perego said that most people looking for spatial data will use their regular search engine, again, emphasizing the point about different metadata for different needs and audiences: general spatial metadata so that the relevance of a dataset to a specific location and time can be recognized, and then more detailed metadata, including quality and provenance, for expert assessment.

Searching for data

Kevin Ashley of the UK’s Digital Curation Centre moderated the final session of the first day which covered a wide range of topics with a general theme of dataset discovery.

Dmytro Potiekhin presenting his thought provoking work on CivicOS from Kyiv

Workshop attendees were pleased to hear from Dmytro Potiekhin who joined the proceedings by Skype from Kyiv. He is working with others to develop a vocabulary for describing the actions of governments. In his experience, he was one of the organizers of the Orange Revolution, this cannot always be left to governments themselves since in less than fully democratic countries there is a real danger of falsified information. Rather, the government’s actions should be described by civilians, which is the basis of the work on CivicOS. As a stark example, Dmytro suggested that petitions should not be organized on government Web sites, as is typical in many democratic countries, as this is an easy way for less than democratic governments to obtain lists of their opponents. An earlier talk on CivicOS by Dmytro is available on YouTube.

It was suggested that this approach didn’t lend itself to greater data interoperability. Dmytro said that for things like flight information and eGovernment, the use of agreed international standards worked really well in Ukraine, but that’s not the same as eGovernance or eCivil society. A good example of a wide community working towards a common vocabulary is schema.org. Dan Brickley, schema.org’s webmaster, encouraged Dmytro to go ahead and develop the vocabulary. There are some concepts already in schema.org that would seem to fit but maybe not all the 200 that Dmytro has in mind. These would probably need to make use of schema.org’s extension mechanism. Dan also pointed to Huridocs which uses technology to track and expose human rights violations.

The example contact page for the Flemish government, including extra markup (click to visit and view source!)

One of the governments most active in developing and using vocabularies in its work is the Flemish government. There, a range of international, European, national and regional standards are being used. Raf Buyle of Informatie Vlaanderen explained that their aim is to allow Flemish citizens and businesses easily to find the services they’re looking for, such that a query about funding for roof insulation, for example, will return qualitative, personalized information. A crucial aspect of this is their base registries, defined by the ISA Programme as trusted authentic sources of information under the control of an appointed public administration or organization appointed by the government containing information about public services. In the same way that Dutch geospatial data is being made available as Web pages that can be indexed by search engines, Flanders is using the data in its base registers to markup Web pages, using both schema.org and the ISA Core Vocabularies to address different potential end users. Raf emphasized that in the pilot studies is was important that annotating Web pages must be easy and useful. To test whether the latter aim has been achieved, an example contact page has been created. At the time of the workshop, the questions being asked were:

Is annotated data the new data portal?
Will annotations support the uptake of authoritative government information?
Is it ‘wise’ to focus on both Schema.org & ISA CORE vocabularies?
How can we build in a feedback loop?
Can Schema.org & ISA² manage a mapping together?

Peter Winstanley of the Scottish government asked how the inclusion of false information could be prevented. Raf’s belief is that public feedback would quickly identify any errors.

On behalf of WDAQUA (Answering Questions using Web Data), a Marie Skłodowska-Curie Innovative Training Network (ITN), Luis-Daniel Ibáñez of the University of Southampton presented the results of their survey into what users look for when searching for data. Many of the issues identified in the survey had been highlighted in earlier discussions, such as spatial and temporal coverage, format, license etc. One aspect highlighted by the work was the desire for a summary or preview of the data, a sample that could be looked at easily. Most searches for data began with regular Web search engines rather than the search function of data portals, perhaps validating the Flemish and Dutch work to create search engine-friendly Web pages for each metadata record. Furthermore, 68% of searches were conducted during office hours, mostly from desktop devices, suggesting that data is primarily something used for work rather than leisure.

In any discussion about handling data, CERN always presents an extreme case. The volume of data, the speed of its creation and the enormous cost and international effort that goes into creating it is going to be greater than almost every other case. For Artemis Lavasa and her colleagues in the analysis preservation team, the aim is to capture, analyze and preserve CERN’s data, but also to preserve the tools, the processing steps, the configuration of the system, i.e. the full context in which the data was created. The aim is to be able to repeat the analysis many years after its initial publication.

Artemis began her talk with a spoiler alert: they don’t use DCAT. CERN’s data capture forms are very rich, can be very long, and vary from experiment to experiment. Vocabularies like schema.org can be useful for some of the high level descriptions but far more detail is needed such that 80% of the necessary terms are specific to particle physics. After various tests, the CERN team found that JSON schema offers the best way to encode their complex metadata, however, they’re aware that they’re using locally invented terms, not standardized ones. Ideally, there would be a standard approach that would support the level of specialism needed.

A busy slide with references to lots of standards, vocabulary elements and organizations including DataCite and Forece 11 — The model of a general core with specialized domain-specific extensions, as presented by Alejandra Gonzalez-Beltran, was recognized as the way forward by many delegates

Continuing on the theme of describing scientific research data, Alejandra Gonzalez-Beltran of Oxford University’s e-Research Centre described the Data Tag Suite, DATS. Akin to the successful Journal Article Tag Suite used in PubMed, DATS was designed to provide a scalable means for indexing data sources in the DataMed prototype. Part of the bigger Biocaddie initiative, the work picked up on a recurring theme in the workshop, that discovery metadata can be lightweight but that more detail is required for a human to assess the relevance and value of a particular dataset after it has been discovered. To that end, DATS is grounded in multiple search cases provided by the biomedical community and uses a core vocabulary based on mappings to a number of generic models: schema.org, DataCite, DCAT and others, plus life science-specific models such as BioProject, MAGE-Tab etc.

Through the bioschemas.org initiative and Elixir, the Oxford team is also contributing to the schema.org extension to cover life sciences.

In the final presentation of the first day, Richard Nagelmaeker of Blue Sky proposed a Linked Data Location Service. The core idea is that when querying a SPARQL endpoint, the returned data is likely to include IRIs from a variety of internet domains. The proposed service would make it possible and easy to look up a VoID description of the dataset from which those IRIs originate, thus enabling the user to make more informed analyses of the data returned from the query.

Dan Brickley of Google/schema.org joined the panel discussion that followed the presentations on searching for data. He made the remark that there is an important difference between making data available to billions of people and making it available to a tiny number of highly specialized people. Both can have enormous impact. Ultimately, what the search engines really want is to provide answers, not just links to datasets that may or may not contain those answers. He further expressed the hope that the social aspects of data portals – feedback, usage reports etc. – would ultimately aid data discovery.

Social event

Many thanks to Informatie Vlaanderen for sponsoring the wine and nibbles at the end of day 1.

The mission of the Informatie Vlaanderen Agency is to develop a coherent information policy across the Flemish public sector and to support and help realize its transition to an information-driven administration.

Negotiation By Profile

Eric Prud'hommeaux and Ruben Verborgh take questions

The second day of the workshop began with two presentations and a panel that all pointed in the same direction. Both Ruben Verborgh and Lars Svensson set out the problem that MIME types are insufficient to describe the nature of the payload of HTTP requests. As Ruben put it “my JSON is not your JSON” – as there are any number of ways of describing the same thing in JSON. Noting the skepticism from some, particularly Dan Brickley, there was agreement that Content Negotiation in some form was desirable. Clients should be able to request, for example, a data portal’s metadata not just in JSON-LD but following the DCAT application profile, perhaps with any DCAT-based metadata as a second preference. Lars presented his analysis of 5 possible approaches to this which is repeated in the table below.

Of the 5 options, the fifth is new and would require new standardization effort. Lars has written a draft, as yet not submitted to IETF, and during the workshop, he had offers of support from Herbert Van de Sompel and Ruben to complete it.

But there was some disagreement.

Whilst the idea of an accept schema/schema header is attractive and meets the use cases, introducing a new HTTP Header and, moreover, seeing support for it implemented in servers across the Web is very challenging. It was agreed therefore that there would need to be a fall back mechanism, i.e. an alternative route to achieving most of the aims without the need for additional server capabilities. The CSV on the Web standard was cited as an example of how to do this, offering 4 possible methods of attaching metadata to the CSV data in descending order of priority. The workshop agreed that the trade-offs for the different methods of negotiating by profile should be described.

Lars Svensson’s analysis of the pros and cons of different approaches to negotiation by profile.
Method	Pro	Con
Accept/Content-Type + Profile	No header registration necessary	Requires registration of profile parameter for each content-type q-value only available for each combination of content-type and profile
Link: rel="profile" (RFC 6906)	No header, parameter or relation type registration necessary	q-values not supported does not support multiple combinations of namespace + profile
Link: profile="…" (RFC 5988)	No header registration necessary	q-values not supported different use in request and response does not support multiple combinations of namespace + profile in request
Prefer/Preference- Applied	No header registration necessary Supports combination of multiple namespace + profile- combinations	q-values not supported not clear if absence of “Preference-Applied” was because server does not understand “Prefer” or because it did not honour the preference stated requires registration of “profile” preference
Accept- Schema/Schema	Supports q-values Fits nicely into content negotiation framework supports combination of multiple namespace + profile- combinations	Requires registration of new http-headers

As an interesting historical aside, DanBri noted that this topic should be considered a permathread. He dug out a paper from 1996 on this topic, written in the context of the then new Dublin Core element set.

Lightning Talks

A series of lightning talks allowed the workshop to hear about a range of other activities related to dataset descriptions, profiles and validation.

Matthias Palmér talked about his work on EntryScape, a data publishing platform based on DCAT and SKOS. His work suggested the need for greater specificity in some vocabularies and the essential role of profiles in making data truly interoperable. For example, saying that a document is an instance of dcterms:RightsDocument is barely helpful if you don’t also specify what encoding is used (ODRS? ODRL?) and which properties are used.

The European Data Portal has a different problem, as Simon Dutkowski explained. As a harvester of metadata from multiple catalogs, the EDP often finds that it harvests duplicate records for the same dataset. This means that the user is presented with duplicate results in their search. In particular, it’s difficult to tell whether two records describe two versions of the same dataset or just one version. The final judgement may have to be made by a human with the aid of a tool that presents similar metadata records.

The proliferation of data portals is itself a phenomenon that needs study. The portal watch project at the Vienna University of Economics and Business monitors ca. 250 portals, 133 of which use CKAN, hosting more than 500,000 datasets between them. Sebastian Neumaier explained that in order to make sense of the metadata provided, they map it all to DCAT, noting that CKAN provides an extension to export data in that form. 93 of the 133 CKAN instances include that extension but the underlying CKAN metadata schema is very lightweight. Therefore, what CKAN sees as ‘extra keys’ – i.e. extra terms that the portal operator chooses to include over and above the CKAN baseline - aren’t easily mapped to DCAT. WU found many instances of the same extra keys and suggested that the CKAN extension could include useful mappings for things like dcterms:spatial. Likewise, DCAT could be extended to include commonly defined terms to describe harvested datasets.

A complex slide showing many different standards and standards bodies — One of Gerldine Nolf's slides showing the complex standards landscape

The final lightning talk was from Gerldine Nolf of Informatie Vlaanderen who picked up on WU’s theme to speak about interoperability between metadata standards. In any mapping, some data is likely to be lost, especially if, as if often the case, the mapping is from a specialist vocabulary to a more general one. A short term solution is to publish and share mappings between metadata standards, but a longer term solution is for different standardization bodies and regulators to work together more closely and to harmonize their outputs. This was the underlying message from a study of the gap between metadata standards used in the geographic and general open government data worlds. It can be narrowed, but to close it needs the SDOs and their communities to collaborate directly, perhaps by providing best practices on how to use competing standards.

VRE4EIC’s Keith Jeffery pointed to CERIF that provides a full mapping from many standards, and to the Flanders Research Information Space (FRIS) that uses it.

Tools

One of the most common complaints about handling data and vocabularies is the lack of tooling. In that context, the workshop was pleased to learn about various tools for managing vocabularies and for validating data.

VoCol is a distributed vocabulary development tool with version control developed by University of Bonn/Fraunhofer IAIS. Christian Mader explained that it offers an impressive array of features such as access control, issue tracking, provenance recording, syntactic and semantic validation, automatic documentation generation and more. It uses GitHub repositories as its primary storage and has been tested in real world industrial settings, for example, to develop a vocabulary with more than 250 classes and 400 properties.

Discussions at the workshop around application profiles generally imply the existence of methods to validate data against a given profile. There are several methods of doing this including OWL constraints, SPIN, SHACL and ShEx, each with their particular strengths (and adherents). Picking up on his presentation of the HCLS dataset description profile presented earlier in the workshop, Alasdair Gray showed the Validata tool developed at the same time. Based on the ShEx validator, it’s a JavaScript tool that can be configured to validate data against that profile but also DCAT, OpenPHACTS and more. For example, it tests constraints such as:

A dataset:

MUST be declared to be of type dctype:Dataset
MUST have a dcterms:title as a language-typed string
MUST NOT have a dcterms:created date (since in HCLS these are associated with versions, not datasets).

The European Publications Office has need of both a vocabulary management tool and a validator but uses different solutions to those already described. One of its services is the Metadata Registry, a repository of reference data with 70 authority tables, each entry labelled in all official languages of the EU.

During his presentation, to cheers from many in the room, Willem Van Gemert took the opportunity of the workshop to announce that, as of that day, all URIs in the Metadata Registry now dereference.

The base information for the Metadata Registry is stored and managed in XML, on top of which various formats of the data are offered. In contrast, EuroVoc is natively managed in RDF using VocBench. Both are made available in multiple formats including SKOS which is published following the EU’s application profile of SKOS (SKOS-AP-EU). It’s important to be able to validate the authority tables after the data has been transformed (using XSLT), for which the Publications Office is using the SHACL API, however, they expressed concerns over its stability.

Discovering Open Data Standards

One of Keith Jeffery's slides including the line "metadata standards must be a good thing there are so many of them"

As the workshop headed towards drawing its conclusions and recommendations for future standardization work, Deirdre Lee of Derilinx and co-chair of the Data on the Web Best Practices Working Group, gave an overview of the current standards landscape. As the saying goes: we love standards, that’s why there are so many to choose from. Just as there are catalogs of datasets, there are catalogs of standards, designed to help potential adopters to discover what’s available, and Deirdre showed an early version of an extension to the Irish national data portal’s CKAN instance that provides a standards catalog and in the Q&A session after her talk, several people highlighted the European Commission’s Joinup Platform as a catalog of standards. After discovery comes evaluation. A recent panel discussion at the International Open Data Conference suggested that there are 5 facts that are needed in order to evaluate a standard’s suitability for a particular use:

descriptive metadata;
licensing information;
how has the standard been developed and what is its level of maturity;
what is its level of adoption/who is using it;
what tools and support are available.

Workshop participants were invited to join the Google Group that has been established to discuss these issues and to act as a forum for the exchange of ideas and experience of using different standards.

The Way Ahead

11 people round a table having a discussion — Daniele Bailo of INGV/VRE4EIC making a point during a breakout session discussing the Data on the Web Best Practices. One of its editors, Caroline Burle, is to Daniele's left.

The workshop host, Jacco van Ossenbruggen of CWI moderated the final session that tried to pull together many of the issues that had been talked about. There was unanimous consensus that work on DCAT is necessary. Its success and that of DCAT-AP was seen as a clear indication of interest – there’s more that’s right with DCAT than is wrong – however, it is very evident that there are gaps. In particular concerning versioning and relationships between datasets. In the bar camp that ended the day, there were sessions looking particularly at the need to improve the modelling of APIs (DCAT’s support for this is weak), how to annotate and query data and how to link data to legislation that requires its preparation. Peter Winstanley of the Scottish Government led a discussion of how illustrative examples could be provided of how to use a particular dataset and of good practice for describing data.

The topic of tooling, best practice and the sharing of experience was raised repeatedly too: “don’t ask us to adopt a standard without the tools to do so” is a common refrain. Rebecca Williams of GovEx, formerly at the White House and data.gov, talked about her experiences of the community development of the JSON schema used by data.gov and now offered out of the box by the majority of proprietary portal providers in the US. A general standard won’t offer detailed coverage of highly specialized contexts, such as at CERN, but at a high level, dataset descriptions are commonplace. The role and potential of schema.org is an important factor in this discussion too and different panel members had had different experiences. For data.gov, most of their traffic comes through general search engines and that’s driven by Rich Snippets; the life sciences community is working on an extension to schema.org, but CERN’s experiments with it didn’t bring in the extra hoped for users. By the end of the workshop, Dan Brickley had launched a new Community Group to further develop schema.org for datasets.

Overall, there was strong agreement that there was a lot of useful work that a new Working Group could tackle.

Conclusions

Following two very full days of active discussion, it is possible to draw some high level conclusions.

Dataset description poses significant problems, with many different aspects that are of importance to different communities.
There is a distinction between the metadata necessary to discover a dataset, and the metadata needed to evaluate it.
There is a distinction between general purpose metadata and domain-specific metadata that is necessary to provide full context.
DCAT is one of many dataset description vocabularies. CERIF is unusual in that it brings them together to offer greater interoperability.
Communities as diverse as industry, scientific research, government and the geospatial industry all share data so there are many voices to be heard in the discussion.
There is a lot of experience of using existing standards that can be drawn on to improve and extend DCAT. The development and use of DCAT-AP is an important data point.
Vocabulary profiles are critical for real world interoperability.
Validation is really only meaningful if done against a profile, not a general vocabulary.
Related work at W3C, notably Data on the Web Best Practices and SHACL, are receiving significant attention but with too many people leaving it to others to do the work: encouraging everyone to take part in standardization is hard.
The variety of dataset metadata schemas will never reduce to 1.
For maximum reuse of data, publishers should make their metadata available following multiple profiles.
MIME types are insufficient for declaring the structure of data/metadata and an indication of the profile used is also necessary.
Tools for use by consumers of data, such as Virtual Research Environments, therefore need to be able to handle different formats and profiles of data.
A new W3C Working Group should be chartered to revise and extend DCAT, to offer guidance on its use, and either standardize, or support the standardization elsewhere, of content negotiation by profile.