Privacy

Key points

  • Growing demand for human-derived data from the public sector has increased privacy risks, as data from multiple sources may be combined for a variety of research and innovation purposes.
  • Group privacy is an emerging and growing concern, and there is a need for governance, oversight, and recourse mechanisms to protect individuals and communities from collective harms.
  • The growing popularity of information-gathering tools that use data scraping requires a nuanced approach, favouring transparency over privacy in some contexts and taking steps to protect privacy in others.

Teresa Scassa

University of Ottawa

Teresa Scassa is the Canada Research Chair in Information Law and Policy at the University of Ottawa. She teaches and researches in the area of information law, including privacy, data governance, and the regulation of artificial intelligence.

Introduction

Open data programs enable the release of government data in reusable formats under open licences. They also seek to make data findable and interoperable so as to maximize their reuse, both alone and in combination with other data. However, tensions arise when datasets include private or sensitive data.

The first State of Open Data chapter on privacy, released in 2019, discussed how open data programs tend to exclude personal data from release as open data. It also noted the risk that individuals may be re-identified from their data, even when it has been anonymized. The deliberate release of personal information in certain types of records, such as government registries and court decisions, was a privacy concern raised in the chapter, as was the way digitization shifts and amplifies privacy risks.

Four years on, these issues remain current. To take the analysis a step further, this update to the State of Open Data report considers them in the context of the growing demand for access to a broad range of government data. With a burgeoning digital and data economy, there is growing pressure on governments to provide access to public sector data to stimulate innovation.1 This ever-increasing demand generates at least four clear privacy challenges, each of which is discussed in depth below.

Re-identification Risk 

Widespread practices of combining data from multiple sources increase the risk of re-identifying individuals whose data have been de-identified in particular datasets.2 Even if open data does not relate to specific individuals, it can still be used in combination with other data, whether open or privately held, either to re-identify individuals in de-identified datasets or to profile them. This enhanced risk of re-identification from combining data from multiple sources is sometimes referred to as the ‘mosaic effect’.3

For example, open aggregate geo-demographic data can be used to draw inferences about individuals who are matched to specific geographic areas through data from other sources containing names and addresses.4 Such practices are common in, for example, direct or targeted marketing.
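
To make the mosaic effect concrete, the following is a minimal sketch of a linkage attack in Python using the pandas library. All records, column names, and the choice of quasi-identifiers are hypothetical assumptions; real attacks typically involve much larger datasets and fuzzier matching, and, as the Sweeney work cited in the notes found, a handful of simple demographics is often enough to identify people uniquely.

```python
# Hypothetical sketch of the 'mosaic effect': linking a de-identified
# open dataset to a separately obtained named dataset on shared
# quasi-identifiers (here: postcode, birth year, and sex).
import pandas as pd

# De-identified open dataset: direct identifiers removed, but
# quasi-identifiers retained.
open_data = pd.DataFrame({
    "postcode": ["K1A 0A1", "K1A 0A1", "H3Z 2Y7"],
    "birth_year": [1980, 1975, 1980],
    "sex": ["F", "M", "F"],
    "diagnosis": ["diabetes", "asthma", "hypertension"],
})

# A second source (e.g. a marketing list or public register) that holds
# names alongside the same quasi-identifiers.
named_data = pd.DataFrame({
    "name": ["A. Tremblay", "B. Singh"],
    "postcode": ["K1A 0A1", "H3Z 2Y7"],
    "birth_year": [1980, 1980],
    "sex": ["F", "F"],
})

# An inner join on the quasi-identifiers re-attaches names to the
# supposedly anonymous records.
reidentified = open_data.merge(
    named_data, on=["postcode", "birth_year", "sex"], how="inner"
)
print(reidentified[["name", "diagnosis"]])
```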

Profiling of individuals has received increased attention since the Cambridge Analytica scandal raised public awareness.5 Profiling can have diverse impacts, depending on who is doing the profiling and for what purpose, and can adversely affect both privacy and human rights.6 But it can be problematic to use such concerns as an argument to restrict access to open data that has a wide range of beneficial uses; open data and privacy are not necessarily opposing concepts.7 Instead, these issues highlight the need for robust data protection laws,8 for data governance frameworks,9 and for ethical approaches to AI and analytics.10 They may also require a greater role for privacy-protective technologies.11 All of these areas have seen rapid growth in interest and activity, and ensuring that such governance frameworks are in place is a task directly adjacent to open data.12
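
As an illustration of the kind of privacy-protective technology referred to above, the following Python sketch releases an aggregate count under differential privacy using the Laplace mechanism, one of the techniques surveyed in the UN Handbook cited in the notes. The query, the privacy budget (epsilon), and the count are assumptions for illustration only, not a description of any deployed system.

```python
# Minimal sketch of a differentially private count release via the
# Laplace mechanism.
import numpy as np

rng = np.random.default_rng(seed=42)

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1."""
    # A counting query changes by at most 1 when one person's record is
    # added or removed, so noise with scale 1/epsilon suffices.
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# e.g. a hypothetical count of residents in a small area receiving a benefit
print(dp_count(true_count=37, epsilon=0.5))
```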

Increased Demand for Public Sector Administrative Data Access

There is growing pressure on governments to provide de-identified human-derived data as open data.13 Governments collect a large volume of administrative data,14 often containing personal data. This data has significant value for transparency as well as for innovation purposes, including the development and training of AI systems for use in government and in health care.15

As noted in the previous section, while such data could be made available as open data if anonymized, there are concerns about the risk of re-identification. Such concerns have created pressure for governments to look beyond open data portals to other models of data governance for sharing public sector data.16 These can include ‘data trusts’,17 legal structures that provide independent stewardship of data, as well as so-called safe-sharing spaces.18

There are, however, drawbacks to these non-traditional data governance arrangements. For example, they tend to be more complex and more resource-intensive than regular data portals. Access to data via alternative governance mechanisms may also be limited to certain actors, in which case it is sometimes referred to as ‘semi-open data’.19 In addition, providing public sector data through such data governance infrastructures raises important issues about public engagement, ethics, and the communal benefits of data sharing and use.20

Building and maintaining such infrastructures is also likely to be resource-intensive, raising the question of whether they might draw attention and resources away from conventional open data portals. Whichever model is adopted, robust de-identification and/or anonymization techniques will be required, and ethical data use frameworks, oversight, and enforcement will be important considerations.21
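
As one illustration of what robust de-identification can involve, the following Python sketch tests a release against k-anonymity: the requirement that every combination of quasi-identifier values be shared by at least k records. k-anonymity is a widely used criterion but is not sufficient on its own, and the data, column names, and threshold below are assumptions for illustration, not a description of any particular government’s pipeline.

```python
# Minimal sketch of a k-anonymity check on a tabular release.
import pandas as pd

def smallest_group(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Size of the smallest equivalence class on the quasi-identifiers."""
    return int(df.groupby(quasi_identifiers).size().min())

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int = 5) -> bool:
    return smallest_group(df, quasi_identifiers) >= k

# Hypothetical administrative extract with already-generalized
# quasi-identifiers (age band and region instead of exact values).
records = pd.DataFrame({
    "age_band": ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "region": ["East", "East", "East", "West", "West"],
    "benefit_type": ["housing", "housing", "disability", "housing", "housing"],
})

# The smallest group ("40-49", "West") has 2 records, so this table is
# 2-anonymous but not 5-anonymous; a release pipeline would generalize
# or suppress further before publication.
print(smallest_group(records, ["age_band", "region"]))  # 2
print(is_k_anonymous(records, ["age_band", "region"]))  # False
```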

Open Government Information and Enhanced Privacy Risks

Data mining and data extraction processes, along with new uses of data, exacerbate privacy challenges in the case of open government information that contains personal information (such as court decisions and registry information). Although this is not open data under conventional definitions, contemporary technology makes it easy to harvest personal data from these documents and registries. In this sense, they become open sources of government data, and this can raise significant privacy concerns.

There are several examples of such data-harvesting practices. Some involve the scraping of court decisions from court websites, as well as from websites that provide more global access to court and administrative tribunal decisions and that are usually de-indexed to prevent decisions from being searchable by name.22 The following example from Canada illustrates the vulnerability of personal information published online by government entities or by courts. It also highlights the global nature of these forms of data extraction and the resulting challenges for legal recourse. In 2017, Canada’s Federal Privacy Commissioner obtained a court order against a company based in Romania that scraped Canadian court websites in order to create a fully web-searchable database of court decisions.23 The company’s activity led to multiple complaints from individuals who found that their names, alongside often detailed personal information, had become easily searchable online. The website operator also charged significant fees to individuals who sought to have their personal data removed.24 The Canadian court ordered the website to remove the decisions, although the enforceability of the judgment depended on the cooperation of Romanian authorities.

Data scraping thus creates important privacy risks, which, in turn, may create a tension with the desire to maintain transparency with respect to courts and legal proceedings. Similar issues can arise with other forms of open government information online, including registries. For example, the global corporate transparency organization OpenCorporates25 creates and maintains an international database of corporate ownership to facilitate transparency at a global level. Where data is not readily available as open data, the organization scrapes it from online public registries, including Quebec’s corporate registry. When the Quebec registry changed its terms of use in 2016 to prohibit data scraping, OpenCorporates stopped scraping to update its data. However, it went to court to resist demands by the Quebec corporate registry that it delete previously scraped data.26 A Quebec court declared that OpenCorporates could not be compelled to delete this data, as there had been no prohibition on its acquisition through scraping at the time it was collected.27 Nevertheless, the underlying assumption in the case seemed to be that if a website’s terms of service prohibit scraping, data cannot legally be scraped.28

It is worth noting that in the OpenCorporates case, the database contained the names of individuals linked to corporations, which was arguably information about those corporations rather than personal information. This shows how restricting data scraping of public registries or documents can limit transparency and other important goals of open data. Yet, because the scraping of some government information also creates serious risks to privacy, the solution may lie in appropriate governance of open government information. This may mean favouring transparency over privacy in some contexts and taking steps to protect privacy in others, perhaps through a combination of terms of service, redactions, and other privacy-protective technologies.
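
As a simple illustration of the redaction step just mentioned, the following Python sketch strips a few classes of identifiers from a decision before publication. The patterns and the sample text are deliberately simplistic assumptions; production redaction pipelines combine curated name lists, named-entity recognition, and human review.

```python
# Illustrative rule-based redaction of personal identifiers from text
# before online publication.
import re

REDACTIONS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[email redacted]"),
    (re.compile(r"\b\d{3}[- ]\d{3}[- ]\d{4}\b"), "[phone redacted]"),
    (re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.\s+[A-Z][a-z]+\b"), "[name redacted]"),
]

def redact(text: str) -> str:
    """Apply each redaction pattern in turn and return the cleaned text."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

decision = ("Ms. Tremblay (jane.tremblay@example.com, 613-555-0199) "
            "appealed the assessment.")
print(redact(decision))
# -> "[name redacted] ([email redacted], [phone redacted]) appealed the assessment."
```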

Collective Privacy

Ordinary notions of privacy focus on the rights of the individual and the impact on individuals of intrusions upon their privacy, or, in the case of data protection, on the collection or use of personal data without the data subject’s consent and/or in ways that may be harmful to them. Collective privacy is concerned with the collective harms that may be experienced by groups or communities when human-derived data, even if anonymized, is used (or misused). Large datasets are increasingly used to power AI and analytics. Where these datasets contain human-derived data, their uses may have a number of different impacts. Human-derived data may be used to profile groups or communities, resulting in adverse impacts both for those groups and for individuals within them.29 Possible harms include biased results, privacy invasion, surveillance, and a lack of algorithmic transparency.30 For example, human-derived data might be used by city planners to determine which neighbourhoods should receive more (or fewer) resources. It might be used in decisions about the deployment of police, to profile individuals within certain communities as not being credit-worthy, or for price-discrimination practices. Collective and individual harms that flow from these uses of human-derived data are also linked to concerns about algorithmic bias, as it is often large datasets that are used to train algorithms.31

Where governments provide large, de-identified datasets as open data, they may create the potential for group or collective privacy harms. Addressing these harms may not mean limiting access to open data. Instead, there may need to be governance, oversight, and recourse mechanisms in place to ensure that individuals and communities are protected from collective harms.32 Attention to collective privacy concerns is expanding with the rise of AI.33

Conclusion

Over the past four years, the challenges for open data related to privacy have become more complex as greater volumes of public sector data are sought for use in AI and analytics. Not only can open data be used in combination with other available data to re-identify individuals, but it can also be used in profiling activities that have the potential to adversely impact both privacy and human rights. The growing demand for data for research and innovation purposes is also putting pressure on governments to make administrative data and other human-derived data available as open data. This suggests a need for new governance frameworks that can address both re-identification risks and the ethical reuse of human-derived data through licensing and other governance arrangements. This may shift the focus of open data somewhat, and attention should be paid to who is entitled (or sufficiently resourced) to access and use these forms of data. The emerging concept of group or collective privacy is also important in understanding how analytics and profiling may produce both privacy and human rights harms, even from anonymized data. This expanded understanding of privacy impacts should be taken into account in all open data contexts.


  1. See, e.g.: Directive (EU) 2019/1024 on open data and the re-use of public sector information, https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=uriserv:OJ.L_.2019.172.01.0056.01.ENG (accessed on July 11, 2022).
  2. See, e.g.: Ali Farzanehfar, Florimond Houssiau, and Yves-Alexandre de Montjoye, “The risk of re-identification remains high even in country-scale location datasets,” Patterns 2, no. 3 (March 2021): 100204, https://doi.org/10.1016/j.patter.2021.100204 (accessed on July 11, 2022); Chris Culnane, Benjamin I. P. Rubinstein, and Vanessa Teague, “Stop the Open Data Bus, We Want to Get Off,” August 25, 2019, https://arxiv.org/pdf/1908.05004.pdf (accessed on July 11, 2022).
  3. Center for Open Data Enterprise, “Briefing Paper on Open Data and Privacy” (2016), http://reports.opendataenterprise.org/BriefingPaperonOpenDataandPrivacy.pdf (accessed on July 11, 2022); Latanya Sweeney, “Simple Demographics Often Identify People Uniquely,” Carnegie Mellon University, Data Privacy Working Paper 3 (2000).
  4. See, e.g.: Luke Burns, Linda See, Alison Heppenstall, and Mark Birkin, “Developing an Individual-level Geodemographic Classification,” Applied Spatial Analysis and Policy 11 (2018): 417-437, https://doi.org/10.1007/s12061-017-9233-7.
  5. Issy Lapowsky, “How Cambridge Analytica Sparked the Great Privacy Awakening,” Wired, March 17, 2019, https://www.wired.com/story/cambridge-analytica-facebook-privacy-awakening/.
  6. See, e.g.: Council of Europe, “Profiling should not harm human rights nor democratic societies,” November 9, 2021, https://www.coe.int/en/web/data-protection/-/profiling-should-not-harm-human-rights-nor-democratic-societies (accessed on July 11, 2022); Center for Open Data Enterprise, “Briefing Paper on Open Data and Privacy” (2016), http://reports.opendataenterprise.org/BriefingPaperonOpenDataandPrivacy.pdf (accessed on July 11, 2022).
  7. Anita Gurumurthy and Nandita Chami, “Data: The New Four-Letter Word for Feminism,” GenderIT.org, 2016, https://www.genderit.org/articles/data-new-four-letter-word-feminism (accessed on July 11, 2022).
  8. For example, penalties for re-identification are found in Singapore’s Personal Data Protection (Amendment) Act 2020 and the UK Data Protection Act 2018, and are part of Bill C-27 to amend Canada’s private sector data protection law (An Act to enact the Consumer Privacy Protection Act, the Personal Information and Data Protection Tribunal Act and the Artificial Intelligence and Data Act and to make consequential and related amendments to other Acts, 44th Parl., 1st Sess., https://www.parl.ca/legisinfo/en/bill/44-1/c-27 (accessed on July 11, 2022)). For a view that is critical of a penalty-based approach, see: Mark Phillips, Edward S. Dove, and Bartha M. Knoppers, “Criminal Prohibition of Wrongful Re-identification: Legal Solution or Minefield for Big Data?,” Journal of Bioethical Inquiry 14 (2017): 527-539, https://link.springer.com/article/10.1007/s11673-017-9806-9.
  9. See, e.g.: OECD, “Recommendation of the Council on Enhancing Access to and Sharing of Data,” OECD/LEGAL/0463, 2021, https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0463 (accessed on July 11, 2022).
  10. See, e.g.: Luciano Floridi and Josh Cowls, “A Unified Framework of Five Principles for AI in Society,” Harvard Data Science Review 1, no. 1 (2019), https://doi.org/10.1162/99608f92.8cd550d1; UNESCO, “Recommendation on the Ethics of Artificial Intelligence,” November 23, 2021, https://unesdoc.unesco.org/ark:/48223/pf0000381137 (accessed on July 11, 2022).
  11. Jae-Seong Lee and Seung-Pyo Jun, “Privacy-preserving data mining for open government data from heterogeneous sources,” Government Information Quarterly 38 (2021): 101544, https://www.sciencedirect.com/science/article/pii/S0740624X20303233 (accessed on July 11, 2022); UN Global Working Group on Big Data, “UN Handbook on Privacy-Preserving Computation Techniques” (2019), https://unstats.un.org/bigdata/task-teams/privacy/UN%20Handbook%20for%20Privacy-Preserving%20Techniques.pdf (accessed on July 11, 2022).
  12. Arthur Kakande, “Open Data vs. Data Protection: Where are We Now? Building a Foundation of Open Data in Uganda,” Medium, June 24, 2019, https://medium.com/pollicy/open-data-vs-data-protection-where-are-we-now-794c4fc6b0c9 (accessed on July 11, 2022).
  13. Center for Open Data Enterprise, “Briefing Paper on Open Data and Privacy” (2016), http://reports.opendataenterprise.org/BriefingPaperonOpenDataandPrivacy.pdf (accessed on July 11, 2022).
  14. Collaboration in Research and Methodology for Official Statistics (CROS), “Administrative Data,” May 8, 2019, https://ec.europa.eu/eurostat/cros/content/administrative-data-0_en (accessed on July 11, 2022).
  15. Shinji Kobayashi, Thomas B. Kane, and Chris Paton, “The Privacy and Security Implications of Open Data in Healthcare,” Yearbook of Medical Informatics 27, no. 1 (August 2018): 41-47, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6115211/ (accessed on July 11, 2022).
  16. Center for Open Data Enterprise, “Briefing Paper on Open Data and Privacy” (2016), http://reports.opendataenterprise.org/BriefingPaperonOpenDataandPrivacy.pdf (accessed on July 11, 2022).
  17. See: Jack Hardinges, “What is a Data Trust?,” Open Data Institute, July 10, 2018, https://theodi.org/article/what-is-a-data-trust/ (accessed on September 23, 2022).
  18. See, e.g.: Lisa Austin and David Lie, “Safe Sharing Sites” (February 5, 2019), New York University Law Review, forthcoming, available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3329330.
  19. Center for Open Data Enterprise, “Briefing Paper on Open Data and Privacy” (2016), http://reports.opendataenterprise.org/BriefingPaperonOpenDataandPrivacy.pdf (accessed on July 11, 2022).
  20. See, e.g.: Michael Madison, “Tools for Data Governance,” Technology and Regulation (2020): 29, https://doi.org/10.26116/techreg.2020.004; Teresa Scassa, “Designing Data Governance for Data Sharing: Lessons from Sidewalk Toronto,” Technology and Regulation (2020): 44-56, https://techreg.org/article/view/10994.
  21. Shinji Kobayashi, Thomas B. Kane, and Chris Paton, “The Privacy and Security Implications of Open Data in Healthcare,” Yearbook of Medical Informatics 27, no. 1 (August 2018): 41-47, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6115211/ (accessed on July 11, 2022).
  22. For a listing of websites established around the world to enhance access to basic legal materials, including case law, see: Graham Greenleaf, “Legal Information Institutes and the Free Access to Law Movement,” February 2008, https://www.nyulawglobal.org/globalex/Legal_Information_Institutes.html (accessed on July 11, 2022).
  23. A.T. v. Globe24h.com, 2017 FC 114 (CanLII), [2017] 4 FCR 310, https://canlii.ca/t/gx6bl (accessed on July 11, 2022).
  24. Ibid.
  25. OpenCorporates, https://opencorporates.com/ (accessed on July 11, 2022).
  26. OpenCorporates Blog, “OpenCorporates takes Quebec company register to court,” April 7, 2017, https://blog.opencorporates.com/2017/04/07/opencorporates-takes-quebec-company-register-to-court/ (accessed on July 11, 2022).
  27. Opencorporates Ltd. c. Registraire des entreprises du Québec, 2019 QCCS 3801 (CanLII), https://canlii.ca/t/j2cmf.
  28. This issue may be treated differently in different jurisdictions and, because of its relative novelty, may not be settled law. For example, in the US, the 9th Circuit Court of Appeals has ruled that data scraping does not violate the Computer Fraud and Abuse Act. See: hiQ Labs, Inc. v. LinkedIn Corp., 938 F.3d 985 (9th Cir. 2019); aff’d, hiQ Labs v. LinkedIn Corp., 31 F.4th 1180 (9th Cir. 2022).
  29. Paola Mavriki and Maria Karyda, “Automated data-driven profiling: threats for group privacy,” Information and Computer Security 28, no. 2 (2020): 183-197, https://doi.org/10.1108/ICS-04-2019-0048.
  30. Teresa Scassa and Fernando Perini, “Open Data in the Global South,” in The Future of Open Data, ed. Pamela Robinson and Teresa Scassa (Ottawa: University of Ottawa Press, 2022), 179-199.
  31. Ibid.
  32. See, e.g.: Meera Manoj, “Big Data Governance Frameworks for ‘Data Revolution for Sustainable Development’,” Centre for Internet & Society, 2017, https://idl-bnc-idrc.dspacedirect.org/handle/10625/56914 (accessed on July 11, 2022); LIRNEasia, “Big Data and SDGs: The State of Play in Sri Lanka and India” (Colombo: LIRNEasia, 2017), https://idl-bnc-idrc.dspacedirect.org/bitstream/handle/10625/56907/IDL-56907.pdf?sequence=2&isAllowed=y (accessed on July 11, 2022); United Nations Economic Commission for Africa, Africa Data Revolution Report 2016 (Addis Ababa: ECA Printing and Publishing Unit, 2017), https://www.undp.org/africa/publications/africa-data-revolution-report-2016 (accessed on July 12, 2022).
  33. See, e.g.: Martin Tisne, “Collective data rights can stop big tech from obliterating privacy,” MIT Technology Review, May 25, 2021, https://www.technologyreview.com/2021/05/25/1025297/collective-data-rights-big-tech-privacy/; Gabriella Razzano, “Understanding the Theory of Collective Rights: Redefining the Privacy Paradox,” Research ICT Africa, February 2020, https://researchictafrica.net/wp/wp-content/uploads/2021/02/Data-Trusts-Concept-Note.pdf.