Data Infrastructure

Key points

  • Data and digital infrastructures continue to demonstrate the value of open approaches across a range of sectors and communities around the world.
  • Efforts to classify and describe different types of data infrastructures and the organisations that steward them are evolving toward more practical efforts to support the creation of sustainable, trustworthy data infrastructures.
  • Both COVID-19 and the rapid rise of AI have highlighted gaps and issues in our existing data infrastructures.

Leigh Dodds

Energy Sparks

Leigh is an open data practitioner with experience working with a variety of sectors and organisations to develop and adopt best practices for publishing and consuming data.

Introduction

The inclusion of data infrastructure as a cross-cutting theme in the original edition of the State of Open Data was due to the growing use of "infrastructure" as a concept across the open data and open source communities as a means to highlight the often hidden work involved in building and maintaining the code, data, and systems that play a growing and vital role in today's societies. As we highlighted in the original chapter, looking at data and information as "infrastructure" is not a new concept. There is a long history of social and economic research that looks at non-physical forms of infrastructure across a variety of domains. Using the concept of "infrastructure" to think about data and the systems, organisations, and standards that guide how it is collected, used, and shared is useful from two perspectives:

  • First, thinking of data as infrastructure like roads and utilities helps to highlight a number of important public policy questions, such as ensuring equity of access, preserving privacy, increasing safety, and creating value for the public good. 
  • Second, from a more practical point of view, we can look at individual data infrastructures – specific configurations of standards, technologies, policies, and datasets – and measure their impact, inspect their governance and ownership, and assess whether they are sustainable over the long term.

The Role and Impact of Data Infrastructures

Over the last five years, there has been a great deal of research and debate around both of these perspectives, as well as continued investment in the development and scaling of data infrastructures across a range of domains. Successful data infrastructures help to demonstrate the benefits of open data by creating and enabling data ecosystems that push for more open approaches to real world problems.

One example is the Open Apparel Registry. Originally launched in March 2019 with the goal of improving human rights and environmental conditions in and around apparel factories and facilities, the registry provided open identifiers and data about those facilities. Having rapidly grown to over 90,000 facilities with contributions from 540 organisations1, the project has now evolved and relaunched as the Open Supply Hub, broadening its reach to cover global supply chains as a whole2.

Other existing open data infrastructures like OpenStreetMap (OSM) continue to play a vital role in a number of efforts, including humanitarian aid3. As a focal point for collaboration that involves individuals as well as the public, private, and third sectors, OpenStreetMap continues to evidence the value of open approaches and was recently recognised as a "Digital Public Good"62. The Overture Maps Foundation is a project founded by a number of commercial organisations4 to encourage the creation of more open geospatial data. Its founders include Amazon, Meta, and Microsoft, all of whom already contribute resources and data to OSM5. Overture Maps describes itself as complementary to OpenStreetMap6, and while there have been concerns around its goals7, it is hard to imagine this type of project existing without OSM and similar projects having previously demonstrated the benefits of open data and open collaboration around data.

Good data infrastructures provide more than just access to data. They can help to unlock value from that data by providing the tools, training, and expertise that enable its use. Active stewardship of data can help to improve data quality and create feedback loops between data users and data publishers, which in turn help to build norms and best practices that support ethical, trustworthy use of data. For example, the African Data Hub provides more than just a data portal. It also delivers training, mentoring, and support for data journalists across Africa8. This package of support is helping to address vulnerability, inequity, and exclusion, including by highlighting femicides9 and the scale of climate debt10.

Coordinating the publishing of data to a common standard can help to tackle a range of social, economic, and environmental challenges. A recent Open Data Institute report11 highlighted the range of additional services and activities provided by a number of data infrastructures of this type, including data validators and other tools, help desk services, and training and guidance for both publishers and users of the data.

Public policy interventions that require organisations to publish data can be more impactful if attention is given to the data infrastructure that will support policy delivery12. In finance, the concept of "open banking" – increasing access to financial data through the adoption of common open standards for data – has been rapidly adopted across a number of major economies around the world. While the specific regulatory drivers and approaches differ between countries, e.g. around the role of open data13, open banking is helping to increase access to financial services as an important strategic development goal14. But, to do so, financial inclusion must be considered from the start, especially as this type of data infrastructure continues to be adopted in emerging and developing economies15,16.

The COVID-19 pandemic has obviously highlighted the importance of a stronger health data infrastructure. The scramble to share data to support research and public health decision-making highlighted a range of problems with health data infrastructures and approaches to data sharing17 and spurred the creation of a large number of COVID-19 specific data portals and other infrastructures18. There are a number of analyses that have looked at how to address gaps in health data infrastructure19,20, as well as the broader impacts of data-enabled health technologies21. 

Looking more broadly, there are examples of policy changes that are helping to shape and strengthen national data infrastructures. The EU Open Data Directive legislation requires EU members to publish a range of "high value" open datasets across geospatial, earth observation and environment, meteorological, statistics, companies, and mobility22. The recognition of the OpenReferral standard by the UK government23 illustrates how open standards for data63 – an essential building block for data infrastructure – can be adapted and adopted in other contexts. OpenReferral provides a standard for publishing open data about community services, enabling people to find the support and services they need. Originally developed in the US, it has been adopted by a number of UK local authorities to help meet requirements to publish lists of public services24. As a common standard, it is helping to build a shared ecosystem of tools and services.

Classifying Data Infrastructures

The definition of data infrastructure in the original chapter of the State of Open Data recognised a number of different elements of an "infrastructure" beyond the data itself, including standards, identifiers, policies, and all the organisations and communities that govern infrastructure. While all of these elements are important, and though much of the broader debate, public policy work, and research agenda has focused on improving the governance of data, it is the organisational aspects of data infrastructure that have had most attention in recent years.

A number of organisations have produced taxonomies and classifications of different types of infrastructure as well as organisational models, including data exchanges, data collaboratives, data cooperatives, and marketplaces25,26,27,28. This work has largely been intended to highlight the strengths and weaknesses of different approaches and to help document existing infrastructures - see the Open Data Institute's Data Institutions Register29, the Catalogue of Open Infrastructure Services30, or Mozilla's database of alternative data governance models31.

New models for governing data infrastructures, such as data trusts, have received a great deal of attention recently. Data trusts and similar models of stewardship include a fiduciary duty on behalf of the data steward to act in the interests of the data contributors27. They offer a potential incentive to create better, more equitable and ethical outcomes based on how data is being accessed, used, and shared. But, while a number of pilots have been carried out32,33,34, there are, as yet, few examples of new legal forms being adopted to help develop new data infrastructures. Meanwhile, the benefits of open participatory data governance and the role of citizens within the scope of existing platforms and data infrastructures are being highlighted through the work of organisations like Connected By Data35, the Data & Society Trustworthy Infrastructures group68, and the Ada Lovelace Institute36.

While many data intermediaries and brokers remain hidden37, making it difficult to monitor and understand their activities, some countries are moving to regulate the role of intermediaries within national data infrastructures. For example, in India, the Personal Data Protection Bill includes a fiduciary responsibility27, and the EU Data Governance Act provides a framework intended to formalise the role of data intermediaries with the aim of boosting the sharing of data for altruistic purposes38.

Building Trustworthy and Sustainable Data Infrastructures

One interesting area of activity over the last few years has centred on attempts to provide support and guidance to those seeking to build, scale, or sustain data infrastructures. The sharing of insights and mentoring across those building or leading data infrastructures has been explored through the Data Stewards Network39, the Data Cooperatives working group40, and the peer networking programmes offered by the Open Data Institute41, and more recently, by Data2X and the Aapti Institute42. 

Building trust in data infrastructures involves a variety of factors, including how they are owned, governed, and funded, so that they can deliver value over the long term. In the open science and research domains, many data institutions have been assessing themselves against the Principles for Open Scholarly Infrastructures43 or the "Good Practice Principles for Scholarly Communication Services"44. A recent SPARC survey45 summarised how a variety of open infrastructures assess themselves against these principles and highlighted that developing good governance was a particular challenge. Other frameworks for assessing organisations in order to help them become more trustworthy include the GPAI Trustworthy Data Institutional Framework46 and the Open Data Institute's Trustworthy Data Stewardship Guidebook47.

The question of how data infrastructures are funded and made sustainable has also received some attention. The Open Data Institute has reviewed a range of data infrastructures to understand how they become sustainable48, and the Aapti Institute has outlined revenue models for data stewards49. Invest In Open has also carried out research looking at funding sources50, costs51, and the financial health52 of infrastructures supporting open science and research.

Third-party assurance has also been proposed as a further method of helping to build trust in data53 by assessing individual datasets or the governance and operations of a data infrastructure more broadly. One lightweight approach is the concept of a Digital Public Good. The Digital Public Goods Alliance defines a digital public good as "open-source software, open data, open AI models, open standards, and open content that adhere to privacy and other applicable laws and best practices, do no harm by design, and help attain the Sustainable Development Goals".54 The Alliance has provided both a standard to assess what infrastructure qualifies as a public good and a registry of certified public goods55.

The concept of "digital public infrastructure", which refers to services that are "essential to participation in society and markets as a citizen, entrepreneur, and consumer in a digital era"66, is also relevant. There are some overlaps between this idea and conceptions of data infrastructure. For example, digital public infrastructure has been described as needing to be inclusive, foundational, interoperable, and publicly accountable66, all of which are qualities that can be seen in the frameworks and guidance relating to improving the governance of data infrastructure.

Impacts of Artificial Intelligence

Large Language Models (LLMs) and other approaches to building AI use large volumes of data, most often consisting of text and images scraped from the web. The lack of transparency around the sources of that data, issues around rights for reuse, and the ethical and safety impacts of AI, are all rightfully receiving a lot of recent attention. While more "traditional" machine-learning projects have always drawn on a wide range of datasets, the sudden rise of LLMs has happened despite existing work on building and governing trustworthy data infrastructure. For LLMs, the underlying data infrastructure is the web itself. 

The ease with which text, images, and other information can be scraped at scale makes it easy for organisations to quickly amass the large volumes of data needed for AI to work. That scraping is often supplemented by the use of training datasets that are used to refine models for specific purposes. The methods by which those datasets are created raise their own concerns64. From a data infrastructure perspective, AI has driven a flurry of work to address these concerns, including efforts to audit and evaluate training datasets56, the creation of new approaches to describing datasets and their limitations57,58, and the development of frameworks that allow foundational AI models to be fine-tuned using customised datasets integrated with existing data infrastructures (e.g. via APIs)59.

The concerns around AI and the purposes for which it is used may lead to a reduction in the trust in, or contributions to, existing data infrastructures and datasets. For example, the Common Crawl dataset, a publicly available crawl of the web, has been used for a variety of research purposes, but its recent adoption as a key dataset in the development of LLMs65 means that its crawler is now being routinely blocked by many websites60. As AI continues to be deployed, it will undoubtedly place increasing demands on data infrastructures, both to supply the additional training data required to tailor models for specific purposes and to incorporate data on demand to carry out specific tasks.

What Comes Next?

We are still at an early stage in understanding the full impacts of AI on our local, national, and global data infrastructures. The interaction between AI and its underlying data infrastructures will continue to be a focus area for public policy, research, and advocacy for fairer and more equitable data ecosystems, but responding to the rapid adoption of AI won't help solve the many existing issues to be addressed in building trust and sustainability around existing data infrastructure. 

The range of interventions offered by Invest in Open – including its funding pilots, research, and strategic support activities – offers an interesting model for how other sectors and communities might develop their own enabling programmes with the goal of building sustainable, trustworthy data infrastructure. In many ways, the initiative echoes the Public Weather Services programme61 run by the World Meteorological Organization. Over many years, that programme has provided advice and support to national weather agencies to help them deliver stronger national and global weather data infrastructure.

Amid the hype and debate around AI, it will be important not to lose sight of the need to continue to invest in and support the growth of data infrastructures that can tackle real-world problems. A recent report from Access Now67 highlights the continuing need to understand and review how data and digital infrastructures are evolving and ensure that necessary safeguards are in place - the latest in a growing body of research into how to strengthen other types of local, national, and global data infrastructure to help us tackle the many pressing challenges that face societies and communities around the world.
