Open standards for data have made it easier for people and organizations to publish, access, share and use higher quality data that would otherwise be difficult to discover or consume. There are many categories of standards that have been leveraged in open data, and here we look at a few that have been critical for findability, usability and interoperability of the data.
A common vocabulary helps data publishers and data consumers use the same terminology - e.g. is the data about an organization or a person. A standard vocabulary focuses on a handful of areas and uses clear definitions of the things and properties it describes. Vocabularies also help standardize relationships (e.g. “manufactured by”), standard codes (e.g. postal codes), and units of measurement (e.g. currency, distance).
Beyond schema.org, there are many other vocabularies, some of which have become standard in specific domains; e.g. DCAT to describe datasets and catalogs, Friend of a Friend (FOAF) to describe social relationships, vCard for contact information, PROV to describe provenance of objects (including datasets), BIBO for bibliography and citations, and industry-specific ontologies.
The next layer of standards is around the data itself - how data is represented - e.g. do dates follow a format of mmddyyyy or ddmmyyyy or something else; do booleans follow a standard pattern of “true” or “false” values, or are other values allowed (0/1, yes/no); do addresses get represented with the postal code in a separate property. These types of standards are critical to reduce duplicated cleanup effort by downstream data consumers and to allow for quicker insights.
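As a minimal sketch of the kind of cleanup that consumers otherwise repeat, the Python below normalizes dates and booleans into one representation. The choice of ISO 8601 dates as the target and the accepted boolean spellings are assumptions for illustration, and the function names `normalize_date` and `normalize_bool` are hypothetical.

```python
from datetime import datetime

# Assumed set of spellings a publisher might use for booleans.
TRUE_VALUES = {"true", "1", "yes"}
FALSE_VALUES = {"false", "0", "no"}


def normalize_date(value: str, source_format: str) -> str:
    """Parse a date in a known source format and return it as ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(value, source_format).date().isoformat()


def normalize_bool(value: str) -> bool:
    """Map common boolean spellings onto a real boolean; reject anything else."""
    cleaned = value.strip().lower()
    if cleaned in TRUE_VALUES:
        return True
    if cleaned in FALSE_VALUES:
        return False
    raise ValueError(f"unrecognized boolean value: {value!r}")


if __name__ == "__main__":
    # The same calendar day written as mmddyyyy and as ddmmyyyy.
    print(normalize_date("12312020", "%m%d%Y"))
    print(normalize_date("31122020", "%d%m%Y"))
    print(normalize_bool("Yes"), normalize_bool("0"))
```

The source format has to be stated by the publisher: a value such as 01022020 parses under both mmddyyyy and ddmmyyyy, so the consumer cannot recover it reliably by guessing.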
Dataset formats are important for interoperability with various data consumption platforms. It is recommended that data publishers use open formats since open formats promote wide compatibility. Proprietary formats include those that can be used only on specific platforms, whose format specification may not be publicly available, or that need specific software or software extensions to parse the dataset. In addition to being open, a format should also be machine-readable and preferably human-readable as well. Open formats that are machine-readable include CSV, XML and JSON. Some formats can be open but not very machine friendly - e.g. PDF and HTML, which are meant to be human-readable formats. Beyond dataset formats, ideally the metadata and licenses are also machine-readable. A popular machine-readable metadata format is JSON; leveraging DCAT and other standard vocabularies, a publisher can include rich metadata.
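To illustrate machine-readable metadata, the sketch below builds a small dataset description as JSON-LD using DCAT and Dublin Core terms. The dataset title, keywords, and all `example.org` URLs are made up for illustration, and the function name `build_dataset_metadata` is hypothetical; a real catalog would likely carry more properties (publisher, update frequency, spatial coverage).

```python
import json


def build_dataset_metadata(title, description, keywords, license_url, download_url):
    """Return a DCAT-style dataset description as a JSON-LD dictionary."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": list(keywords),
        "dct:license": {"@id": license_url},
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:downloadURL": {"@id": download_url},
                # Media type expressed as an IANA media type IRI.
                "dcat:mediaType": {
                    "@id": "https://www.iana.org/assignments/media-types/text/csv"
                },
            }
        ],
    }


if __name__ == "__main__":
    metadata = build_dataset_metadata(
        title="City air quality readings",  # illustrative values only
        description="Hourly sensor readings published as CSV.",
        keywords=["air quality", "environment"],
        license_url="https://creativecommons.org/publicdomain/zero/1.0/",
        download_url="https://example.org/data/air-quality.csv",
    )
    print(json.dumps(metadata, indent=2))
```

Because the output is plain JSON, a catalog or crawler can read the title, license, and download location without scraping a human-oriented page.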
Clear licensing terms enable the end user to evaluate what purposes the data can be used for, and whether there are restrictions on republishing the data. Machine-readable licenses go one step further, allowing software to check licensing terms automatically instead of leaving that to manual review. One way to describe licenses this way is using the Open Digital Rights Language (ODRL) vocabulary. Beyond license information, including provenance of the data (data sources, how it was collected, which fields are derived) allows users to build up a lineage of the data and enables them to use the most authoritative data sources. Provenance can be included using an appropriate vocabulary (e.g. PROV).
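A rough sketch of what a machine-readable license can look like follows: a small ODRL policy in its JSON-LD form, with a toy check of whether an action is allowed. The policy and dataset URLs are invented for illustration, and `is_permitted` is a hypothetical helper that deliberately simplifies matters; a full ODRL evaluator would also handle constraints, duties, and conflict resolution.

```python
DATASET = "https://example.org/dataset/air-quality"  # illustrative identifier

# A minimal ODRL policy: the dataset may be used and distributed, but not modified.
policy = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Set",
    "uid": "https://example.org/policy/1",
    "permission": [
        {"target": DATASET, "action": "use"},
        {"target": DATASET, "action": "distribute"},
    ],
    "prohibition": [
        {"target": DATASET, "action": "modify"},
    ],
}


def _matches(rules, target, action):
    """Return True if any rule in the list names this target and action."""
    return any(r.get("target") == target and r.get("action") == action for r in rules)


def is_permitted(policy, target, action):
    """Toy check: allowed only if explicitly permitted and not prohibited."""
    if _matches(policy.get("prohibition", []), target, action):
        return False
    return _matches(policy.get("permission", []), target, action)


if __name__ == "__main__":
    for action in ("use", "distribute", "modify"):
        print(action, is_permitted(policy, DATASET, action))
```

In this sketch anything not explicitly permitted is treated as not allowed, which is a conservative assumption rather than a rule taken from the license itself.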
A number of these best practices are already being followed by publishers of open data. Some of these are on display at data.gov, where federal, state and local agencies publish data from their respective organizations. Many data resources also include a way to give feedback to the publisher or ask questions about the data.