Brief Literature Review: Linked Open Data
This wiki page includes notes from a brief literature review looking at Linked Open Data in Development.
It formed the basis of the Technical Issues Paper.
Understanding linked and open data
For the purpose of understanding linked and open data, we can understand data as information that has been recorded or encoded as discrete facts, generally according to some uniform standard. Organisations and projects generate vast quantities of data in their day-to-day work, from figures entered into spreadsheets or meta-data descriptions of video recordings, through to administrative data such as travel records of staff or operational data on the volume of enquiries relating to a particular project.
This data is often held internally only for use by that organisation, and the way it is recorded makes use of internal labels and standards. For example, the column headings in a spreadsheet might be labels that only really make sense to someone from the same organisation, or the author field in the meta-data for a video recording might allow free text to be entered, leading to the same video-maker being known by many different labels (e.g. Michael Powell; Mike Powell; Mike Powel; and so on). Linked data provides a set of conventions for recording data based on using URIs (Uniform Resource Identifiers) as the identifiers for things (and relationships) within a dataset.
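The name-variant problem above can be sketched with plain (subject, predicate, object) triples. This is an illustrative sketch only: the example.org URIs and property names are placeholders, not real identifiers.

```python
# Free-text author fields: the same person appears under three labels,
# so a naive count sees three different authors.
free_text_authors = ["Michael Powell", "Mike Powell", "Mike Powel"]
print(len(set(free_text_authors)))  # 3 apparent authors

# Linked-data style: each video and person gets a URI; variant labels all
# attach to the one identifier, so data can be merged unambiguously.
PERSON = "http://example.org/id/person/42"  # hypothetical URI
triples = [
    ("http://example.org/video/1", "http://example.org/prop/author", PERSON),
    ("http://example.org/video/2", "http://example.org/prop/author", PERSON),
    (PERSON, "http://www.w3.org/2000/01/rdf-schema#label", "Michael Powell"),
    (PERSON, "http://www.w3.org/2000/01/rdf-schema#label", "Mike Powell"),
]
authors = {o for (s, p, o) in triples if p.endswith("/author")}
print(len(authors))  # 1 author URI
```

Counting distinct author URIs rather than distinct label strings gives the correct answer regardless of how the name was typed.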
- Talk of linked data
- Talk of open data
- Berners-Lee's Linked Data design principles.
- The five stars of open, linked data.
- Definitions;
- Distinctions;
- The current open/linked data eco-system;
Semantic Web
Aldo de Moor distinguishes between the Syntactic, Semantic and Pragmatic web [1].
Drivers for linked open data initiatives
- Open Science
The Panton Principles for open science data do not at present (FAQ, September 2010) apply to social science data, suggesting that different principles and norms may be appropriate for the open publishing of social statistics and qualitative research data.
Open Research
A range of open research projects are available.
- PSI movements;
- Open government movements;
- Semantic web computerisation movements.
Creating and using linked open data: encoding and interpreting
Halb et al. (2008) note that many linked data efforts focus on a 'machine-first' approach to publishing data, offering only machine-readable data and using template-based interlinking algorithms to make large numbers of links between datasets. However, they argue that this approach leads to links with limited "semantic strength", and they prefer an approach of publishing linked data in both human-readable and machine-readable formats, with a wiki-based approach that allows human users of a dataset to suggest links between the data they are viewing and other datasets.
Halb et al. (2008, §6) note that modelling can involve interpretations of a dataset that are only possible after reading the full documentation. As they explain, "the raw data from Eurostat is sometimes ambiguous and can only be resolved by analysing the corresponding document. For example the statement time\2007 can stand for the value over a period of time (e.g. entire year) or at the end of the reporting period (e.g. 31 Dec)."
Often choices about data representation are affected by what makes for efficient parsing of the resulting triples.
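One way to see the interaction between representation choices and parsing is to compare an ambiguous dimension label with explicit predicates. This is a sketch, not Eurostat's actual scheme; all URIs and property names are illustrative placeholders.

```python
# The ambiguous form: 'time' alone does not say whether the value covers
# the whole period or refers to its end.
ambiguous = ("http://example.org/obs/1", "http://example.org/prop/time", "2007")

# Explicit alternatives: the modeller decides, after reading the source
# documentation, which meaning applies, and encodes it in the predicate.
over_period = ("http://example.org/obs/1",
               "http://example.org/prop/referencePeriod", "2007")
at_period_end = ("http://example.org/obs/1",
                 "http://example.org/prop/valueAt", "2007-12-31")

def interpret(triple):
    """Dispatch on the predicate instead of guessing at the label's meaning."""
    s, p, o = triple
    if p.endswith("referencePeriod"):
        return f"value over the whole of {o}"
    if p.endswith("valueAt"):
        return f"value as measured on {o}"
    return "ambiguous: consult the documentation"

print(interpret(over_period))  # value over the whole of 2007
print(interpret(ambiguous))    # ambiguous: consult the documentation
```

With explicit predicates a parser can process the triples directly; with the ambiguous form it must fall back to human interpretation.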
Linked open data eco-systems
The Linked Open Data (LOD) Cloud diagram (now generated from CKAN based on user-submitted details) attempts to show visually the linked open data sources currently available, along with their interlinkages. The Wikipedia extract DBpedia plays a pivotal role in many interlinking efforts. It has been suggested (Auer et al. 2007; Bizer et al. 2009) that DBpedia is a key 'nucleus' for a web of open data, providing a 'bottom-up' alternative to top-down schema-setting efforts to build a semantic web. As of March 2010 DBpedia includes extracts in 11 languages, with varying numbers of 'abstracts' (short/long descriptions of things) available in each language: English (3,144,000), German (503,000), French (545,000), Polish (430,000), Dutch (392,000), Italian (381,000), Spanish (362,000), Japanese (275,000), Portuguese (367,000), Swedish (213,000), Chinese (179,000). Extracts appear, however, to be predominantly based on the English-language version of Wikipedia. WordNet also plays an important role in the extraction of information from Wikipedia and in the YAGO knowledge base.
Data enrichment services can take press releases and add URIs, annotations, etc. For example, TSO have built a data enrichment service for government at http://gov.tso.gov.uk (which appears to be unavailable from outside government).
Open data in development
- Mapping existing initiatives and organisations
- IATI; OKF; AidData; Agropedia etc.
- SDMX used for aggregate statistics
- DDI used for microdata.
- Need to explore data cube ontology.
- Data has dimensions (slices?)
- Data has things which were being measured
- Things which were being measured have attributes (e.g. methods used for measuring)
- Data; Attribute; Measures.
- SCOVO compatible also.
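The dimension/measure/attribute split in the notes above can be sketched as a single statistical observation, loosely in the spirit of the data cube and SCOVO models. This is an assumed illustration: the property names and to_triples helper are placeholders, not the real vocabulary terms.

```python
observation = {
    # dimensions locate the observation within the cube (and define slices)
    "dimensions": {"refArea": "UK", "refPeriod": "2007"},
    # the measure is the thing actually being measured
    "measure": {"enquiryVolume": 1250},
    # attributes qualify the measure, e.g. the method used for measuring
    "attributes": {"unit": "enquiries", "method": "manual log count"},
}

def to_triples(obs_uri, obs):
    """Flatten the observation into (subject, predicate, object) triples."""
    triples = []
    for part in ("dimensions", "measure", "attributes"):
        for prop, value in obs[part].items():
            triples.append((obs_uri, f"http://example.org/prop/{prop}", value))
    return triples

triples = to_triples("http://example.org/obs/1", observation)
print(len(triples))  # 5
```

Separating the three roles makes it possible to slice the data by any dimension while keeping measurement metadata attached to each observation.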
Linked data creates 'global variables' in a truly global form.
Critical Questions
- What data is available & what does the eco-system look like?
- How is data encoded?
Technical Notes
Distinctions
The following need to be distinguished...
OWL (Web Ontology Language)
RDF Schema
(Draft) An ontology contains knowledge, whereas a schema describes how knowledge should be recorded and represented.
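The draft distinction above can be illustrated with a toy example. This sketch uses simplified prefixed names rather than full URIs, and the reasoning step is a hand-rolled loop, not a real RDFS/OWL reasoner.

```python
# Schema statement: says how data should be recorded (authors are Persons).
schema = [("author", "rdfs:range", "Person")]

# Ontology statement: carries knowledge that licenses new inferences.
ontology = [("Mike Powel", "owl:sameAs", "Michael Powell")]

# Recorded facts.
facts = {("video1", "author", "Mike Powel")}

# A reasoner can use the ontology to derive facts the schema alone never
# could: here, substituting the canonical name via owl:sameAs.
for s, p, o in list(facts):
    for a, rel, b in ontology:
        if rel == "owl:sameAs" and o == a:
            facts.add((s, p, b))

print(("video1", "author", "Michael Powell") in facts)  # True
```

The schema constrains how the author fact is written down; the ontology adds knowledge about the world from which new facts follow.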