Maximising interoperability — core vocabularies, location-aware data and more: Report
Executive Summary
The fifth and final workshop in the series was well attended, and took the general topic of interoperability as its theme with a specific focus on vocabularies and the representation of locations.
The opening plenary session included the results of a detailed study of the impact of open data in Finland. An important message from that report is that there needs to be a more systematic approach to monitoring the impact if any meaningful conclusions are to be drawn by governments in future. The work on Core Vocabularies within the ISA2 Programme was presented, including various Application Profiles. It's clear that how a vocabulary is used is as important as which vocabulary is used if genuine interoperability is to be achieved. This means that multiple stakeholders have to agree on the use of common code lists as values for common properties.
There is often a lot of focus on metadata, but the data itself should also be aligned. This is the focus of work in the Open Group with its emerging ODEF standard, and of efforts such as UABLD's Linked Data Templates and the 5-star scheme for self-describing data proposed by the University of Economics, Prague.
The Share-PSI 2.0 workshop in Berlin provided an excellent opportunity to discuss the details of the European Data Portal, which had had its beta launch the previous week. Participants discussed its operation, its features and its technical infrastructure. One feature of the EDP is its Gold Book, a guide on sharing data, future updates of which are likely to include direct links to the Share-PSI 2.0 Best Practices. A portal is clearly more than just a catalogue of datasets and applications. It's the hub for a community that is interested in using Public Sector Information and data, whether open or not, and for interacting with the data providers. A portal is a source of quality feedback, of intelligence on what data is most valuable, and acts as a proxy for Government as a development platform.
Sourcing data can be problematic but if provided with the right tools, any public authority can provide data with the right levels of quality, detail and metadata. At least some metadata can be auto-generated by tools that make use of machine learning. Good quality tooling can enable non-experts to benefit from the power of Linked Data and sophisticated data visualisations can be created as Web Components by data experts to be incorporated in applications by regular Web developers.
The workshop heard about a variety of projects that use data for social good including predictive tools to help plan fire risk management and how mobile phone data, once anonymised and visualised, acts as a proxy for the movement of people throughout a region, helping planners and policy makers.
The location track highlighted many of the problems of accurately and reliably referring to locations in data. The subject is inherently complicated and different developers will have different needs. Some will want a set of points, others need geometries. Addresses are insufficient for locations and must be translated into at least points to be useful. Regular Web developers should not be expected to handle this complexity, but it exists nonetheless. The simplest way to tackle the issue is for locations to have dereferenceable URIs that return data in different formats. Given this, spatial experts can support non-experts. The lack of government support for services such as Geonames and Open Street Map is unhelpful.
A video recording of the workshop (25–26 November, Fraunhofer Institute FOKUS, Berlin), made by Noël Van Herreweghe, is available on Vimeo.
Introduction
The fifth and final Share-PSI workshop put more focus on technical issues than its predecessors. The specific topics addressed were core vocabularies and interoperability, with a track on location, often the common factor between different datasets although also often the most difficult to recognise and 'get right.' The workshop was held just a week after the launch of the initial beta version of the new European Data Portal and the event provided an opportunity for detailed discussion about the portal: how it works, its features and the plans for future development. The Berlin workshop also enjoyed the direct participation of several departments within the European Commission: the ISA2 Programme, the JRC, the Publications Office and DG CNECT, whose Daniele Rizzi presented the latest call under the CEF. Registration numbers were high (145), with a brief post-event analysis suggesting that 118 people actually attended the two-day event. It is noteworthy that the Dutch Government was represented for the first time at a Share-PSI event and that, over the workshop series, Cyprus is the only Member State not to have had any representation of any kind.
This report was compiled from the notes taken during the workshop, together with the papers and slides presented, all of which are linked from the agenda.
Opening Remarks
The opening plenary session included a presentation of findings of a recent study on the impact of open data by the Research Institute of the Finnish Economy. Dr Heli Koski pointed to an article in the Economist from November 2015 that asserted that the impact of open data has not met expectations and identified four possible reasons for this:
- data that is made available is often useless;
- data engineers & entrepreneurs who might be able to turn it into useful, profitable products find it hard to navigate;
- there are too few people capable of mining data for insights or putting the results to good use;
- it has been hard to overcome anxieties about privacy.
Articles and studies into the socio-economic impacts of open data are rare. None of the 77 countries recognised in the Open Data Barometer (PDF) has conducted a comprehensive assessment of the impacts of opening data. Heli Koski's conclusion is that before we can ask why more has not been achieved, we need a systematic, long-term effort to evaluate the impacts of open data use, which would require the development of an assessment model with proper indicators and the systematic collection of data. Ideally, of course, this would be coordinated across countries.
Data Portals
The launch of the renewed European Data Portal provided a hook for many discussions throughout the workshop. All the partners in that project were represented, along with Sebastian Moleski, Open Knowledge's product manager for CKAN, the PwC/AMI Consult team behind the DCAT Application Profile and Andrea Perego from the JRC, author of the GeoDCAT-AP, as well as many people responsible for portals around Europe.
Heli Koski's call for the systematic measurement of the impact of open data was answered in part by Wendy Carrara in her introductory presentation of the European Data Portal. One of the features of the portal is a set of metrics to do just that, alongside the data catalogue itself, links to applications that use the data, data visualisation and mapping tools, an eLearning library, training materials, a selection of stories about open data and the Open Data Gold Book. All these features exist in their initial form in the beta version of the portal but are under active development. The Gold Book, for example, will draw on the Share-PSI Best Practices. These features, and technical details such as the choice of CKAN for the primary portal due to its substantial developer community, Virtuoso as the triple store for the DCAT-AP metadata, the proprietary machine translation engine and so on, were designed to achieve a number of goals:
- provision of easier access to open data across national borders;
- support for data publishers by providing feedback on data quality;
- support for location based search via Gazetteer functionality;
- provision of easy summaries of data license texts next to each dataset;
- provision of eLearning modules around open data.
This list points to a much broader role for the portal than simply acting as a catalogue of datasets and applications that use the data. Hans Overbeek from the Dutch Government was keen to see the discussion move to data portals in general, not just open data portals, i.e. portals should carry metadata for datasets that are closed as well as open. This was supported by Bernhard Krabina of the Centre for Public Administration Research, Austria (KDZ), which runs the Open Government Data cockpit. He focussed on the portal as part of a process not just of sharing data (openly or otherwise) but of quality monitoring and improvement.
Georg Hittmair of the PSI Alliance suggested that portals should record requests for public data. Such publicly visible requests would be harder to reject than Freedom of Information requests, and recording them would help public authorities both in prioritising datasets for release and in their annual reporting of the availability of PSI as required by Article 13 of the PSI Directive. However, the idea was not well received. In Austria (Georg's country), requests for data need to include a reason why the data is needed. This is not consistent with the PSI Directive (which makes no such demand) or with practice in many countries. Overall, it was felt that what looks like a very good idea at first might actually backfire, in that publication of dataset requests would give public authorities more reason to resist publication. This might follow skewed logic along the lines of: department X said no to that request, so we'll say no to you. Open by default is a better strategy than open by request.
An area where consensus was easily reached is that the discoverability of the uses of datasets is as important as discoverability of the datasets themselves. Although the (Revised) PSI Directive does not require users to notify suppliers, still less to tell them what use has been made of the data, it is natural for data publishers to want to know what has been done with the data. The W3C Data on the Web Best Practice Dataset Usage Vocabulary addresses this issue.
Application Profiles
One session at the Share-PSI Berlin workshop was dedicated to the DCAT Application Profile. Such profiles describe not just which standards to use but how to use them. This is necessary for realisable interoperability since standards are always written in such a way as to allow the developers of implementations to make choices about how they use them. For example, DCAT specifies that the dcat:theme property is used to classify datasets and that the value of that property should be a skos:Concept, but it doesn't specify which concept scheme to use. The European Data Portal, following the DCAT-AP, uses the Eurovoc taxonomy for this purpose. Such constraints are implemented in the EDP using a JSON Object, although the workshop noted SHACL as a future option.
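To make the difference between the standard and its application profile concrete, here is a minimal sketch of a DCAT-style dataset description expressed as JSON-LD. The dataset URI and the theme concept are placeholders invented for illustration; they are not values taken from the European Data Portal.

```typescript
// Illustrative sketch only: a DCAT-style dataset description in JSON-LD.
// DCAT itself only says that dcat:theme points at a skos:Concept; an
// application profile such as DCAT-AP additionally fixes the concept
// scheme the value must come from. The URIs below are placeholders.
const dataset = {
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcterms": "http://purl.org/dc/terms/",
    "skos": "http://www.w3.org/2004/02/skos/core#"
  },
  "@id": "http://example.org/dataset/air-quality-2015",
  "@type": "dcat:Dataset",
  "dcterms:title": "Air quality measurements 2015",
  "dcterms:description": "Hourly air quality readings for an example region.",
  "dcat:theme": {
    // The profile, not the base standard, determines which concept
    // scheme (e.g. a common EU code list) this URI must belong to.
    "@id": "http://example.org/concept/environment",
    "@type": "skos:Concept"
  }
};

console.log(JSON.stringify(dataset, null, 2));
```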
An issue discussed in the DCAT-AP session, and raised particularly by Dietmar Gattwinkel of the Staatsbetrieb Sächsische Informatikdienste, was that of provenance. Should provenance be recorded at the level of individual terms or at the level of the dataset? The DCAT-AP also lacks a means of identifying a dataset as being a legal reference, such as a company register. Related to that concept is the need to point from the dataset to the authority that is legally empowered to change the data. This in turn provides a use case for the development of the Core Public Service Vocabulary, one of those presented by Athanasios Karalopoulos of the ISA2 Programme.
Initiatives such as the DCAT and GeoDCAT Application Profiles, together with their implementation by the European Data Portal, promote interoperability across borders. At the city level, common approaches are much harder to find, with PSI published as scanned documents and other formats that are difficult or impossible to process by computer. There do not appear to be agreed standards for publishing spending data, for example, which makes comparisons difficult. An exception to this is the 6 Cities Strategy led by Helsinki, a programme that enjoys the participation of 120 companies. Although the remit of the strategy goes well beyond the PSI Directive and open data, it is an example of inter-city cooperation with technological innovation and collaboration at its heart.
An example of an application profile developed and applied at national level was provided by Gabriele Ciasullo, Giorgia Lodi and Antonio Rotundo of the Agenzia per l'Italia Digitale (AgID). Italy is among the countries building a comprehensive catalogue of public sector services and has taken the ISA Programme's Core Public Service Application Profile as its starting point. However, even this application profile was not sufficient. AgID's work shows a need for:
- a clear and agreed classification of public services;
- a shared naming of the same public services offered by a variety of administrations (e.g., municipalities);
- a standard definition of the elements that allow one to clearly represent the concept of “public service”.
To achieve their goals, AgID had to create an Italian application profile of the European application profile of the vocabulary, the CPSV-ITAP. This specifies, for example, the use of:
- the DCAT-AP theme for the public service theme;
- COFOG and NACE to define the subtheme for the public service;
- the revised version of the UK ESD toolkit to define the channels for public services;
- custom controlled vocabularies for other types of data according to national initiatives, such as the authentication type.
The work provided further support for the ISA2 Programme's work on the Core Public Service Vocabulary. In her presentation, Giorgia Lodi asserted that this kind of detail is necessary for public organisations to meet, for example, Article 5 paragraph 1 of the (Revised) PSI Directive:
Public sector bodies shall make their documents available in any pre-existing format or language, and, where possible and appropriate, in open and machine-readable format together with their metadata. Both the format and the metadata should, in so far as possible, comply with formal open standards.
And Article 9:
Member States shall make practical arrangements facilitating the search for documents available for re-use, such as asset lists of main documents with relevant metadata, accessible where possible and appropriate online and in machine-readable format, and portal sites that are linked to the asset lists. Where possible Member States shall facilitate the cross-linguistic search for documents.
Tools
The Italian experience is that public authorities are willing and able to participate in providing data in an agreed but distributed manner – if they're provided with the right tools – a simple form to fill in, for example, that sends data back to the central collecting point. This matches the Spanish experience reported by Dolores Hernandes at the very first Share-PSI workshop in Samos, where she described the Spanish data catalogue federation tool, and was reinforced by Hannes Kiivet in his Berlin talk on the Estonian Metadata Reference Architecture. This approach is certainly more likely to be productive than expecting public authorities at any level to supply a valid JSON or JSON-LD file. Providing tools decouples the provision of data from organisational websites and allows data to be validated with a minimum of standardisation.
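As a hedged sketch of what validation with a minimum of standardisation might look like at a central collecting point, the fragment below checks a record submitted from a simple form against a handful of agreed fields. The field names and rules are invented for the example; a real federation tool would validate against the full application profile.

```typescript
// Hypothetical sketch: checking a form submission at the central
// collecting point before it is accepted into the shared catalogue.
// Field names are illustrative, loosely modelled on DCAT-AP metadata.
interface SubmittedRecord {
  title?: string;
  description?: string;
  publisher?: string;
  theme?: string;    // expected to be a URI from an agreed code list
  licence?: string;  // expected to be a URI identifying the licence
}

function validate(record: SubmittedRecord): string[] {
  const problems: string[] = [];
  if (!record.title) problems.push("title is mandatory");
  if (!record.description) problems.push("description is mandatory");
  if (record.theme && !/^https?:\/\//.test(record.theme)) {
    problems.push("theme must be a URI taken from the agreed code list");
  }
  return problems;
}

// The form handler would either accept the record or return the problems
// to the submitter, so non-specialists never have to hand-craft JSON-LD.
console.log(validate({ title: "Air quality measurements 2015" }));
```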
The workshop heard about the XÖV tool chain, which includes an Interoperability Browser that visualises the taxonomy and code lists used in data exchange within the German public sector. In contrast to the many distributed systems discussed in the workshop, XÖV is a legally supported, top-down infrastructure. There was also a plenary talk by the workshop hosts and Share-PSI partner, Fraunhofer FOKUS, about the LinDA project, which developed a toolset designed to help SMEs work with Linked Data, including data conversion, storage, linking and visualisation. This is a serious attempt to address the lack of tooling for Linked Data – or the perceived lack, depending on your point of view – which is clearly a barrier to entry.
The team from Università di Salerno and Università di Napoli proposed Web Components as a technology with the potential to put data visualisation in the hands of a greater number of developers. It provides a means for specialist developers with knowledge of the data to create an arbitrarily complex element that can be embedded in any Web page by any developer. It is based entirely on open Web standards rather than external runtime environments such as Flash or Java.
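A hedged sketch of the idea: a data specialist publishes a custom element once, and any Web developer can then use it as an ordinary HTML tag. The element name, its 'src' attribute and the assumption of a JSON endpoint behind it are all invented for this example.

```typescript
// Illustrative sketch of a Web Component for data display. A specialist
// publishes this script; other developers simply write
// <data-summary src="https://example.org/data.json"></data-summary>.
class DataSummary extends HTMLElement {
  async connectedCallback() {
    const src = this.getAttribute("src");
    if (!src) return;
    const response = await fetch(src);        // load the dataset
    const rows: unknown[] = await response.json();
    // A real component would draw a chart; the sketch just reports the
    // number of records so the example stays short.
    this.textContent = `${rows.length} records loaded from ${src}`;
  }
}

// customElements.define registers the new tag using only standard
// browser APIs; no external runtime environment is needed.
customElements.define("data-summary", DataSummary);
```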
Projects
Ales Veršič, of the Ministry of Public Administration in Slovenia, gave a presentation on how Igor Kuzma is using mobile phone usage information (PDF) in Slovenia. By tracking the number of mobile phone signals in a given area, and by tracking their movements (having first aggregated and anonymised the data), planners have a powerful dataset to work with. Since mobile phone ownership is all but ubiquitous in Slovenia, as elsewhere, phone signals are an effective proxy for what planners are really interested in – people. The mobile data can be used for estimating population and tourism statistics, communication and transport planning, epidemiology, proximity to risks and spatial marketing.
That's an example where a single dataset is used for multiple purposes. In contrast, the RESC.Info Insight service uses multiple datasets for a single service that helps fire departments assess residential fire risks in neighbourhoods based on open data, so that resources and fire safety education can be better distributed. Funded by the Open Data Incubator Europe (ODINE) project, the service, as Nicky van Oorschot of Netage explained, uses Linked Data technologies to merge the building register, population statistics, demographics and forensic data. Historically, fire risk for a given area was calculated on the basis of past events; RESC.Info Insight is predictive and allows fire departments to prioritise homes for fire safety education and encouragement to fit smoke detectors.
Incubators such as ODINE and FINODEX (presented at the Share-PSI workshops in Lisbon and Krems) provide start-up funding for a variety of ideas. In Portugal, a similar programme is operated by the Agência para a Modernização Administrativa (AMA). In his talk Government As A Developer - Challenges And Perils, AMA's André Lapa described the process by which project ideas are pitched, assessed and funded. For government staff, the key thing is the benefit to the department and/or the user; politicians hope for kudos. For all parties, it's the quality of the projects that is under the closest scrutiny, and it's notable that many new datasets have been uploaded to dados.gov.pt as part of a revision to regular processes rather than as an additional task.
André ended his talk by addressing Share-PSI's 3 recurring questions directly:
What X is the thing that should be done to publish or reuse PSI?
Identify strategic and relevant ICT projects at inception level that can foster or facilitate opening data for the Public Sector.
Why does X facilitate the publication or reuse of PSI?
Projects with strong political support and subject to standards of approval generate ‘pressure’ on public bodies that otherwise weren’t so motivated to release PSI; in addition, some of the costs of making data available can be tied to something that already exists.
How can one achieve X and how can you measure or test it?
Align yourself with a national body responsible for ICT coordination at the government (or ministry) level, and work at embedding PSI considerations in the framework for identifying relevant projects.
Self-Describing Data
Linked Data was discussed in various ways in many sessions. As already mentioned, there does not appear to be an agreed schema for describing spending and budgetary data, so it can be hard to mix different datasets, although this issue is being addressed, at least in part, by the ISA2 Programme in its work on the EU Budget Vocabulary. An obvious standard to point to is XBRL, but that is an accounting framework used in many different ways in different countries and sectors, and so does not provide the kind of European interoperability needed unless a profile is specified.
The ideal is that data is self-describing, which is what Linked Data offers, but there are still gaps. For example, there is no agreed ontology of financial units analogous to physical (SI) units. This highlighted a difference between accounting standards and financial standards: the two are clearly linked but, as with any discipline, there is a lot of detail at lower levels that needs to be captured. The team from the University of Economics, Prague proposed a 5-star scheme for self-describing data (Figure 5).
Figure 5: the 5-star scheme for self-describing data
- ★ No data description (example: a spreadsheet with 'self-descriptive' column names)
- ★★ Some data description (example: a spreadsheet with an accompanying Web page explaining it)
- ★★★ The language of the description is understandable (example: a jargon-free spreadsheet, no insider knowledge necessary)
- ★★★★ The data description is in a machine-readable format (example: a spreadsheet accompanied by a JSON Table Schema)
- ★★★★★ The data description uses dereferenceable URIs to identify things (example: a Data Cube vocabulary dataset reusing third-party Linked Data)
Returning to the theme of increased interoperability through reduced flexibility in a specific application environment, Džiugas Tornau and Martynas Jusevičius of UABLD/Graphity proposed the concept of Linked Data Templates. This builds on the ideas in the emerging Shapes Constraint Language, SHACL, and in the Linked Data API, where URI patterns are converted into SPARQL queries that answer the most frequent requests, such as lists of similar resources or the details of a specific resource.
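To illustrate the general approach that Linked Data Templates and the Linked Data API take, the hedged sketch below maps request URIs onto SPARQL queries. The URI patterns, base URI and query shapes are invented for the example; the real specifications are declarative and considerably richer.

```typescript
// Hypothetical sketch: answering the most frequent requests by turning
// URI patterns into templated SPARQL queries. Patterns and base URI are
// placeholders, not part of any real deployment.
const BASE = "http://example.org";

function queryForRequest(path: string): string | null {
  // List template: /datasets returns the URIs of all catalogued datasets.
  if (path === "/datasets") {
    return "SELECT ?dataset WHERE { ?dataset a <http://www.w3.org/ns/dcat#Dataset> } LIMIT 100";
  }
  // Item template: /dataset/{id} returns everything known about one dataset.
  const match = path.match(/^\/dataset\/([\w-]+)$/);
  if (match) {
    return `DESCRIBE <${BASE}/dataset/${match[1]}>`;
  }
  return null; // no template matches this request URI
}

console.log(queryForRequest("/dataset/air-quality-2015"));
```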
Yannis Charalabidis of the University of the Aegean argued against reliance on adaptation from one person's metadata concepts to another's: it leads to data loss and misinterpretation. Better to stick to standardised terms, perhaps with the automated filling of metadata fields; the continued development of machine learning technologies makes this increasingly realistic. He also supported the notion of distributed data input with tools based on standardised schemas.
Chris Harding of the Open Group focussed on interoperability not just of data vocabularies but also of the grammar used. Data may be free to access but it is almost never free to use, since significant effort is required to convert it and clean it up. The emerging ODEF standard is designed to:
- enable use of the existing vocabularies that have been developed by industry bodies and standards organizations;
- provide a basic grammar for data descriptions;
- be consistent with relational database usage;
- be able to accommodate other data representation approaches;
- use RDF to facilitate semantic processing.
Currently in technical review, ODEF builds on the existing and widely used UDEF standard but adds the concept of roles to the usual classes and properties, and supports plugins.
Location
Location has long been identified as being of critical importance for PSI and shared data. It is often one, if not the only, aspect that multiple datasets have in common. As discussed during the Share-PSI workshop, the problem is that location means different things to different people. Addresses are ephemeral. Buildings are ephemeral and may have multiple addresses. The world is not flat. Nor is it a sphere. Continents move. It is for this reason that cartographers must use coordinate reference systems other than latitude and longitude if their maps are to be accurate. A representative point on a map is an adequate representation of something's location for one application but a polygon showing the extent of an area is required by others. An address is good for many situations but may not be related to a physical place at all, and it is meaningless to talk about location without also talking about time.
In short: location is complicated, and many of the existing methods of recording where something is, or where a service is available, or where something happened, are inadequate for many applications.
Bart van Leeuwen of Netage and the Amsterdam Fire Service asserted that, in his experience, the best source of address data is not the official address list published by the Dutch government but Open Street Map. However, OSM (and Geonames) are not generally considered sufficiently robust in their organisation to warrant public funding. Thus the actual, or potentially the best, source of address data and the world's de facto gazetteer are often overlooked by the public sector.
The location sessions included many breakout groups and therefore discussed a variety of topics. A recurring theme was the need for persistent URIs for coordinate reference systems (Arnulf Christl pointed to spatialreference.org as a source of exactly that; Andrea Perego suggested http://epsg.io) and for a public sector lookup service to match that offered by Google. In Arnulf's view, it is essential that governments provide a geocoding lookup service for addresses. The ODI's Open Addresses initiative was also highlighted.
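As a hedged sketch of the kind of address lookup service called for here, the fragment below geocodes an address against a hypothetical endpoint. The URL and the response shape are placeholders; OpenStreetMap's Nominatim service is an existing example of such a lookup.

```typescript
// Hypothetical sketch of an address-to-location lookup. The endpoint and
// response format are invented for the example.
interface GeocodeResult {
  label: string; // the matched address
  lat: number;   // latitude of a representative point
  lon: number;   // longitude of a representative point
}

async function geocode(address: string): Promise<GeocodeResult[]> {
  const url = `https://geocoder.example.org/search?q=${encodeURIComponent(address)}`;
  const response = await fetch(url);
  if (!response.ok) throw new Error(`lookup failed: ${response.status}`);
  return response.json();
}

geocode("10 Example Street, Berlin")
  .then(results => console.log(results[0]))
  .catch(err => console.error(err));
```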
Valentina Janev of the Institute Mihajlo Pupin talked about two use cases that drove the GeoKnow project:
- We want to know who is visiting a tourist portal (where they come from, based on their IP address) and where they want to go (based on their booking, the kind of reservation, the transport used or their searches).
- Supply chain use case: linking delivery data with meteorological data, the stations the data comes from and the people on the road making deliveries, combined with news and Twitter. This crosses geospatial data, the time dimension, statistical data (with code lists) and more.
These seem straightforward as ideas but they are hard to implement without a better data infrastructure, including an open licensing system without which the choice of technology is irrelevant.
On one thing all three location sessions were agreed: dereferenceable, persistent URIs are needed for locations, be they addresses, regions, NUTS regions, countries or individual features. Once you have a URI, i.e. a global identifier, you can provide the centroid point, the polygon (in any coordinate reference system) or whatever else you want to provide, and in any format, be it JSON, GeoJSON, JSON-LD, GML or anything else. Content negotiation obviates the need to choose a single representation format. Once URIs are used, they can be accessed by developers whether they have geospatial expertise or not. The collaboration between the OGC and W3C in the Spatial Data on the Web Working Group was noted in Berlin. Since then, that working group has published the initial draft of its best practice document, which supports this approach in its very first recommendation.
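The following hedged sketch shows what content negotiation against a location URI looks like from a developer's point of view. The URI is a placeholder; the point is that one persistent identifier can serve GeoJSON to a mapping application and JSON-LD to a Linked Data client without the publisher having to choose a single format.

```typescript
// Illustrative sketch: one location URI, several representations,
// selected through the HTTP Accept header. The URI is a placeholder.
const locationUri = "http://example.org/id/region/example-coast";

async function fetchAs(uri: string, mediaType: string): Promise<string> {
  const response = await fetch(uri, { headers: { Accept: mediaType } });
  if (!response.ok) throw new Error(`${mediaType} not available: ${response.status}`);
  return response.text();
}

// A GIS developer asks for the geometry...
fetchAs(locationUri, "application/geo+json").then(geojson => console.log(geojson));
// ...while a non-specialist client asks for self-describing metadata.
fetchAs(locationUri, "application/ld+json").then(jsonld => console.log(jsonld));
```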
Best Practices
The aim of Share-PSI 2.0 is to identify best practice across the EU for implementing the (Revised) PSI Directive. During the workshop, there was a series of very short presentations of each of the current and proposed BPs that act as a supplement to those defined by the W3C Data on the Web Best Practices Working Group. Feedback gathered at the workshop is being used by the project partners in their efforts to refine and finalise this work.
Conclusions
The main conclusions of the fifth and final Share-PSI 2.0 workshop can be summarised as follows.
- The socio-economic impact of the opening of PSI should be studied over the long term using agreed indicators and a systematic collection of data.
- Data Portals should act as the focal point of a community, to include training materials and guides, discussion fora, feedback mechanisms etc. and not just a catalogue of datasets.
- Making the use of datasets discoverable is an important incentive to public authorities to make further information available. The W3C's Dataset Usage and Data Quality vocabularies are relevant in this regard.
- Follow profiles of standards, not just standards as written, in order to achieve practical interoperability. The DCAT Application Profile is a prime example of this.
- Effective cooperation between cities can be achieved, with commercial involvement, if the remit is focussed on innovative outcomes rather than a promotion of the sharing of PSI.
- The task of data collection can be successfully distributed to non-specialists if they are given appropriate, standards-based tools.
- Sharing data can help services move from reactive to predictive models of operation.
- Public authorities can be encouraged to share data within a project that shows tangible results; such projects can be centrally coordinated and may help turn data sharing into a normal part of everyday operations.
- Initiatives such as Geonames and Open Street Map provide high quality, up-to-date geospatial information that the public sector would do well to support rather than try to duplicate.
- Addresses should be made open, along with a lookup service that provides geolocations for a given address.
- Locations (features, addresses, regions etc.) should be identified by persistent, dereferenceable URIs that return data in a variety of formats through content negotiation.