What even is a high value dataset?

Open data stickers, by Jonathan Gray

The OKFN AU listserv is a mailing list for those who want to get involved in building communities around open knowledge and open data in Australia as a part of the Open Knowledge Foundation’s global network. It’s a fantastic place to learn more about open data in Australia, connect with others working in the area, and ask questions.

Recently I put a question to the list about the notion of ‘high value datasets’, asking whether there were useful criteria or methodologies for identifying them that members could point me to. I noted, as I felt it was important to do, the difficulty of trying to establish a normative measure of something that is inherently subjective (of value to whom? in what terms?). Despite the vagueness of my question, the list responded brilliantly. Here are some of the key suggestions that came back:

  • feedback on the quality of published datasets should be routinely gathered and fed back to creators and decision makers
  • downloads and page hits should be monitored to gauge the popularity of published datasets, as well as of standard website pages and FOI requests – all of which helps build a picture of public interest trends (although be wary of automated workflows that artificially inflate page hit counts)
  • the Open Data Index offers a valuable bird’s eye view of where we are at in Australia and globally, and where there are gaps and more progress needs to be made (note, however, that there was some debate as to how accurate and useful it really is beyond being a good marketing tool)
  • criteria for ‘value’ can vary depending on the discipline, but making the dataset (more) publicly accessible is a key measure for determining whether its maintenance is worth supporting in dollar terms; and
  • adopting standards and tools for a frictionless data ecosystem (see OKFN’s Frictionless Data) will in turn increase the value of data in the community (a brief sketch of this kind of tooling follows this list).
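
By way of illustration, here is a minimal sketch of the kind of tooling the Frictionless Data project provides, using the Python frictionless library. The file name is a hypothetical example; the point is simply that machine-checkable schemas and validation reports make published data cheaper to trust and reuse.

```python
# A minimal sketch using the frictionless-py library (pip install frictionless).
# "garbage-collection-zones.csv" is a hypothetical example file.
from frictionless import describe, validate

# Infer a reusable schema/metadata descriptor for the dataset.
resource = describe("garbage-collection-zones.csv")
print(resource.schema)

# Validate the file against the inferred schema and basic table checks.
report = validate("garbage-collection-zones.csv")
if report.valid:
    print("Dataset passes basic quality checks.")
else:
    for task in report.tasks:
        for error in task.errors:
            print(error.message)  # e.g. missing cells, type mismatches
```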

Some neat matrices / sets of criteria that I found in my research, again with the help of the excellent OKFN AU community, include:

From the City of Philadelphia’s Open Data Census

  • Publication Quality – The team found that whether a dataset was “published” is more complicated than “true or false,” and thus recorded information about what formats were available, how up-to-date they were, how well documented they were, etc., and used that information to inform a publication quality score.
  • Other Cities – To get a sense of what high demand datasets were being released elsewhere and help inform departments of existing precedents, the team researched the data portals of four other major U.S. cities – Baltimore, Boston, Chicago, and New York City. Popular datasets not yet published by the City of Philadelphia were recorded as “unpublished” datasets.
  • Demand / Impact – The team used information derived from an analysis of over 2,800 Right to Know requests, voting on the Open Data Pipeline Trello board, and public nominations to estimate demand for each dataset on a scale of 1-5 (5 being greatest).
  • Cost / Complexity – Information about the level of technical effort required to prepare each dataset for publishing was used to produce an estimate of cost/complexity on a scale of 1-5 (5 being greatest). A sketch of how the two scores can be combined appears below.
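
To make the matrix concrete, here is a hedged sketch of how demand and cost scores like these could be combined into a single publishing priority. The equal weighting and the example datasets are my own assumptions, not the City of Philadelphia’s actual method.

```python
# A sketch of a Philadelphia-style prioritisation matrix (assumed weighting,
# not the City's actual method). Both scores use the census's 1-5 scales.
from dataclasses import dataclass

@dataclass
class DatasetScore:
    name: str
    demand: int  # 1-5, 5 = greatest demand/impact
    cost: int    # 1-5, 5 = greatest cost/complexity to publish

    def priority(self) -> int:
        # High demand and low cost float to the top; equal weighting assumed.
        return self.demand + (6 - self.cost)

# Hypothetical example datasets, for illustration only.
candidates = [
    DatasetScore("property assessments", demand=5, cost=2),
    DatasetScore("right-to-know request log", demand=3, cost=1),
    DatasetScore("legacy permit records", demand=4, cost=5),
]

for ds in sorted(candidates, key=DatasetScore.priority, reverse=True):
    print(f"{ds.name}: priority {ds.priority()}")
```

Datasets in the high-demand / low-cost quadrant are the obvious first candidates for release; high-demand / high-cost ones may still be worth the investment, which is exactly the trade-off the census is designed to surface.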

From Steve Bennett:

“three criteria when pondering priorities for government data release:

1. Uniqueness: to what extent are there no other sources of this information? A council’s collection of street information is valuable but there’s a lot of overlap with OpenStreetMap, for instance. But no one else could have the garbage collection zone boundaries.
2. Maintenance. Datasets age pretty quickly, and a dataset that’s more than a year out of date seems to go downhill in value pretty fast.
3. Reusability: was the data being collected with a general purpose in mind, or are there limitations due to the original purpose for which it was collected (eg, lack of comprehensiveness, idiosyncratic groupings, jurisdictional filtering…)”
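
Bennett’s maintenance point can be read as a decay function: a dataset’s value drops off quickly once it is more than a year out of date. As a rough illustration only (my own framing, with an assumed twelve-month half-life, not Bennett’s numbers):

```python
# A rough model of value decaying with dataset staleness (assumed numbers,
# not Bennett's). Value halves for every year the data is out of date.
import math

def freshness_value(base_value: float, months_stale: float,
                    half_life_months: float = 12.0) -> float:
    """Exponential decay: value halves every half_life_months."""
    return base_value * math.pow(0.5, months_stale / half_life_months)

for months in (0, 6, 12, 24):
    pct = freshness_value(100.0, months)
    print(f"{months:>2} months out of date: {pct:.0f}% of original value")
```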

From the European Commission’s Report on High Value Datasets from EU Institutions, 2014:

“a dataset may be considered of high value when one or more of the following criteria are met:

It contributes to transparency:
These datasets are published because they increase the transparency and openness of the government towards its citizens. For instance the publication of parliaments’ data, such as election results, or the way governmental budgets are spent, or staff cost of public administrations, all contribute to the transparency of the way public administrations are working.

Its publication is subject to a legal obligation:
In some cases the publication of data is enforced by law.
The PSI Directive, for instance, regulates the publication of policy-related documents by (semi) public organisations.

It directly or indirectly relates to their public task:
A public administration may publish a dataset because it directly relates to its public task. For instance DG CLIMA may publish statistics on CO2 emissions as part of its task of raising awareness about climate change.

It realises a cost reduction:
The availability and re-use of a dataset, e.g. contact information, code lists, reference data and controlled vocabularies, eliminates the need for duplication of data and effort, reduces costs and increases interoperability.
Collections of data housed in the base registers, and geospatial data, are prime examples of datasets whose opening up will lead to direct cost reductions in data management, production and exchange.

The type and size of its target audience:
A dataset may be useful for/relevant to a large audience (size-based value), for instance traffic data.
On the other hand a dataset may bring large value to a specific target audience (target/subject-based value), for instance a dataset containing data of particles colliding at high speed in a particle accelerator.”

About the author

Cassie Findlay is a Senior Consultant with Recordkeeping Innovation. In past roles, Cassie has worked strategically at the whole-of-public-sector level on digital recordkeeping, training and open data / open government initiatives, and implemented NSW’s first digital archive for born-digital government records. Cassie has a Master’s degree in information management from the University of NSW and is a co-founder of the Recordkeeping Roundtable.
