Just how should we share data on the Web?
The UK government is currently running a survey to elicit ideas on how it should update data.gov.uk. As one of the oldest such portals, despite various stages of evolution and upgrade, it is, unsurprisingly, showing signs of age. Yesterday's blog post by Owen Boswarva offers a good summary of the issues that arise when considering the primary and secondary functions of a data portal. Boswarva emphasizes the need for discovery metadata (title, description, issue date, subject matter etc.), which is certainly crucial, but so too are structural metadata (use our Tabular Metadata standards to describe your CSV, for example), licensing information, the use of URIs as identifiers for and within datasets, information about the quality of the data, location information, update cycles, contact points, feedback loops, usage information and more.
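To make the structural side concrete, here's a minimal sketch of what CSVW-style Tabular Metadata for a CSV file can look like, written in Python for convenience; the file name, columns and datatypes are invented for illustration:

```python
import json

# A minimal CSVW (Tabular Metadata) document describing a hypothetical data.csv.
# File name, title, columns and datatypes are illustrative, not prescriptive.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "data.csv",
    "dc:title": "Monthly air quality readings",
    "dc:license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
    "tableSchema": {
        "columns": [
            {"name": "station", "titles": "Station", "datatype": "string"},
            {"name": "date", "titles": "Date", "datatype": "date"},
            {"name": "no2", "titles": "NO2 (ug/m3)", "datatype": "decimal"},
        ]
    },
}

# By convention the metadata file sits alongside the CSV it describes.
with open("data.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```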
It's these kinds of questions that gave rise to the Data on the Web Best Practices WG, whose primary document is now at Candidate Recommendation. Of course, we need the likes of Owen Boswarva and data.gov.* portals around the world to help us gather evidence of implementation. The work is part of a bigger picture that includes two ancillary vocabularies that can be used to provide structured information about data quality and dataset usage, the outputs of the Spatial Data on the Web Working Group, in which we're collaborating with fellow standards body OGC, and the Permissions and Obligations Expression WG, which is developing machine-readable license terms and more, beginning with the output of the ODRL Community Group.
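As a taste of what machine-readable license terms look like, here's a rough sketch in Python, assuming the rdflib package; the policy and dataset URIs are invented, and ODRL itself offers far richer constructs than this:

```python
from rdflib import BNode, Graph, Namespace, URIRef
from rdflib.namespace import RDF

# The ODRL 2 vocabulary namespace.
ODRL = Namespace("http://www.w3.org/ns/odrl/2/")

g = Graph()
g.bind("odrl", ODRL)

# A hypothetical policy permitting distribution of one dataset.
policy = URIRef("http://example.org/policy/1")
permission = BNode()

g.add((policy, RDF.type, ODRL.Set))
g.add((policy, ODRL.permission, permission))
g.add((permission, ODRL.target, URIRef("http://example.org/dataset/air-quality")))
g.add((permission, ODRL.action, ODRL.distribute))

print(g.serialize(format="turtle"))
```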
A more policy-oriented view is provided by a complementary set of Best Practices developed by the EU-funded Share-PSI project. It was under the aegis of that project that the role of the portal was discussed at great length at a workshop I ran back in November last year. That discussion showed that a portal must be a lot more than a catalog: it should be the focus of a community.
Last year's workshop took place a week after the launch of the European Data Portal, itself a relaunch in response to experience gained through running earlier versions. One of the aims of that particular portal is to act as a gateway to datasets available throughout Europe. That implies commonly agreed discovery metadata standards, for which W3C Recommends the Data Catalog Vocabulary, DCAT. However, that alone is not enough. What profile of DCAT should you use? The EU's DCAT-AP is a good place to start, but how do you validate against it? Enter SHACL, for example.
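To sketch what that validation step might look like in practice, assuming the rdflib and pyshacl Python packages: an invented DCAT description is checked against a toy SHACL shape (DCAT-AP's published shapes are, of course, much more extensive).

```python
from pyshacl import validate
from rdflib import Graph

# An invented DCAT dataset description; the URI and text are placeholders.
data = """
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<http://example.org/dataset/air-quality> a dcat:Dataset ;
    dct:title "Air quality readings" ;
    dct:description "Hourly NO2 readings from monitoring stations." .
"""

# A toy SHACL shape: every dcat:Dataset must have at least one dct:title.
# DCAT-AP's real shapes cover far more properties and constraints.
shapes = """
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<http://example.org/shapes/DatasetShape> a sh:NodeShape ;
    sh:targetClass dcat:Dataset ;
    sh:property [ sh:path dct:title ; sh:minCount 1 ] .
"""

data_graph = Graph().parse(data=data, format="turtle")
shapes_graph = Graph().parse(data=shapes, format="turtle")

conforms, _, report_text = validate(data_graph, shacl_graph=shapes_graph)
print("Conforms:", conforms)
print(report_text)
```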
Those last points highlight the need for further work in this area which is one of the motivations for the Smart Descriptions & Smarter Vocabularies (SDSVoc) workshop later this year that we're running in collaboration with the VRE4EIC project. We also want to talk in more general terms about vocabulary development & management at W3C.
As with any successful activity, if data is to be discoverable, usable, useful and used, it needs to be shared among people who have a stake in the whole process. We need to think in terms of an ecosystem, not a shop window.
My response to this is that it's undeniable that a broader setting for the vocabulary is needed if we are to get better data quality and dataset usage information. Given the history of efforts to improve these portals, I agree that the focus should be on vocabularies and on W3C's work. From the examples given, there also seems to be a need for non-structural metadata. I think they are going about it the right way.
Hopefully, future open data standards and descriptions will expand community input to data and metadata. That would encourage shared maintenance responsibilities and wider data consumption. Please add the Data on the Web Best Practices WG's primary document under Section 8.13.
There is a stark overemphasis on metadata vs. the data itself. Actually, the headline should read "Just how should we share metadata on the Web?"
What is needed is a decentralized infrastructure to host data and a search protocol to identify those datasets. For data, the dat protocol (http://dat-data.com/) or the noms project (https://github.com/attic-labs/noms) come to mind. Even the venerable BitTorrent network might be usable. That doesn't solve the problem of discoverability, though, which is tackled by e.g. YaCy (http://www.yacy.net/en/) or IPFS (https://ipfs.io/docs/getting-started/).
Hi Johann, I can't agree with your first line. If all we share is metadata describing datasets then all we're doing is using the Web as a glorified USB stick, i.e. you might just as well send me the data on a stick through the post. To build the Web of data, individual data points need to be addressable on the Web. That's why structural metadata (like the CSV on the Web work), which allows you to interpret a table as a URI set if that's what you want to do, is just as important as discovery metadata.
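To illustrate that last point, here's a sketch of CSVW metadata, with invented file and column names, where an aboutUrl template gives every row in the table its own URI:

```python
import json

# Hypothetical CSVW metadata: the aboutUrl template mints one URI per row,
# so each record in stations.csv becomes an addressable resource on the Web.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "stations.csv",
    "tableSchema": {
        "aboutUrl": "http://example.org/station/{station_id}",
        "columns": [
            {"name": "station_id", "datatype": "string"},
            {"name": "name", "datatype": "string"},
        ],
    },
}

with open("stations.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```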
Thanks for the pointers, we'll look at those.
Cheers
"Just how should we share data on the web?"
Crude answer: follow the Frictionless Data approach http://frictionlessdata.io/
Specifically: use (Tabular) Data Packages and their extensions, use simple formats, and focus on making your data good quality and easy to consume in standard tools (e.g. Excel, Postgres etc.).
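For anyone who hasn't seen one, here's roughly what a minimal Tabular Data Package descriptor looks like, generated with plain Python; the package name, file and fields are invented:

```python
import json

# A minimal datapackage.json describing one CSV resource.
# Names, paths and field definitions are illustrative only.
descriptor = {
    "name": "air-quality",
    "title": "Air quality readings",
    "licenses": [{"name": "CC-BY-4.0"}],
    "resources": [
        {
            "name": "readings",
            "path": "readings.csv",
            "schema": {
                "fields": [
                    {"name": "station", "type": "string"},
                    {"name": "date", "type": "date"},
                    {"name": "no2", "type": "number"},
                ]
            },
        }
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```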
At a more general level, I also like the Frictionless Data principles (I should, as I wrote them ;-) ...)
### 1 Focused
When discussing how we should do something, have a sharp focus on one part of the data chain, one specific feature (e.g. packaging) and/or a few specific types of data (e.g. tabular).
*Comment: data is often too broad a category for answers other than at the most general level ("provide metadata", "have good quality data"). You need to get specific to get useful answers.*
### 2 Web Oriented
Build for the web using formats that are web "native", such as JSON, and that work naturally with HTTP, such as plain-text CSVs (which stream; see the sketch below).
*Comment: you could obviously mention RDF as it is definitely built for the web. But I think it struggles with the simplicity and existing tooling principles below.*
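As a small sketch of that streaming property, assuming the Python requests package and a placeholder URL, rows can be processed as they arrive rather than after a full download:

```python
import csv

import requests

# Stream a plain-text CSV over HTTP, handling rows as they arrive.
# The URL is a placeholder; any CSV served over HTTP works the same way.
url = "https://example.org/data.csv"

with requests.get(url, stream=True) as response:
    response.raise_for_status()
    reader = csv.reader(response.iter_lines(decode_unicode=True))
    header = next(reader)
    for row in reader:
        print(dict(zip(header, row)))
```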
### 3 Distributed
Be distributed rather than centralized: design for a distributed ecosystem with no single point of failure or dependence.
### 4 Open
Anyone should be able to freely and openly use and reuse what is built. The community should be open to everyone.
### 5 Existing Tooling
Integrate with existing tools both by building integrations and by designing for direct use. For example, CSV is great because everyone has a tool that can open it.
### 6 Simplicity
Keep it simple and make things easy to learn and use. Add the minimum and do the least required. Concretely this means using simple formats, keeping metadata lightweight, and data specs simple.