Data Security

Security Risks of Open Data and Public Datasets

February 23, 2024

What Are Open Data and Public Datasets?

Open data refers to data that is freely available for anyone to access, use and share. Public datasets are a form of open data that is released by governments, organizations or individuals for public use.

Public datasets can contain a wide variety of data including:

Government data – statistics, budgets, maps etc.
Scientific data – research findings, genome sequences, weather data etc.
Financial data – stock prices, company filings etc.
Social data – census information, transportation data, crime statistics etc.

The open data movement aims to make non-sensitive data freely available to the public. The benefits include:

Promoting transparency and accountability
Enabling innovation through use and analysis of data
Improving efficiency of services and decision making
Economic gains from commercial use of data

However, there are also potential security risks associated with open data and public datasets.

What are the Main Security Risks?

There are several potential security risks that need to be considered with open data and public datasets:

Unintended Re-identification of Anonymized Data

Anonymized data refers to data that has been stripped of personally identifiable information. However, researchers have shown that combining anonymized data with other available data can lead to re-identification of individuals.

For example, in 2000, Latanya Sweeney estimated that 87% of the US population could be uniquely identified by the combination of 5-digit ZIP code, gender and date of birth alone. None of these fields is a name or ID number, yet together they act as a fingerprint. As more data becomes openly available, it becomes easier to combine datasets and re-identify individuals.
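To make the idea concrete, the following is a minimal sketch that measures how many records in a dataset are unique on a set of quasi-identifiers. The records and the function name unique_fraction are hypothetical and chosen only for illustration; they are not taken from Sweeney's study.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, gender, date_of_birth). No names attached.
records = [
    ("02138", "F", "1965-07-31"),
    ("02138", "M", "1965-07-31"),
    ("02139", "F", "1972-01-15"),
    ("02139", "F", "1972-01-15"),
    ("02140", "M", "1980-11-02"),
]

def unique_fraction(rows):
    """Return the share of rows whose quasi-identifier combination occurs exactly once."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)

fraction = unique_fraction(records)
print(f"{fraction:.0%} of records are unique on (ZIP, gender, date of birth)")
```

A record that is unique on these fields can be matched against any other dataset that carries the same fields together with a name, such as a voter roll.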

Privacy Violations from Granular Data

Public datasets, especially those containing granular location data, can reveal sensitive personal information when analyzed. Researchers at MIT found that they could identify individual people from ‘anonymized’ mobility datasets and determine their place of residence, daily habits, religious affiliations and more.

Location data can also be used to identify people visiting sensitive locations like health clinics or political gatherings. Granular data increases privacy risks even when no obvious personal identifiers are present.

Security of Critical Infrastructure Information

Public datasets sometimes include detailed maps, schematics and operational details about critical infrastructure such as power grids, telecom networks and water systems.

While such data is important for disaster planning, aggregating it can also give potential attackers a detailed picture of targets and weak points. Sensitive infrastructure data needs careful review before release as open data.

Misuse for Criminal Activities

Like any resource, open data can be misused by malicious actors. For example, detailed maps can help criminals plan burglaries, information on ship locations can aid smugglers, and mining datasets for personal details can assist identity theft.

Law enforcement personnel data, if made public, could endanger officers and their families. While such misuse is rare, it needs to be considered before opening up sensitive datasets.

Unintentional Leaks of Classified Data

Declassified government datasets sometimes accidentally include data that is still classified. In one incident, a UK government agency published dataset files on its open data portal that contained sensitive military information, forcing an emergency shutdown.

Manually scrubbing classified information from large datasets before release is difficult and error-prone. Automated checking with AI tools is improving but remains fallible, so it works best as a complement to human review rather than a replacement.
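As a rough illustration of what a simple automated pre-release check might look like, the sketch below scans every field of a dataset for patterns that suggest sensitive content. The pattern list, the sample rows and the function name flag_rows are all assumptions made for this example; a real release policy would define its own markers, and a pattern scan like this only catches what it is told to look for.

```python
import re

# Hypothetical marker patterns; a real release policy would define its own list.
SENSITIVE_PATTERNS = [
    re.compile(r"\b(TOP SECRET|SECRET|CONFIDENTIAL|RESTRICTED)\b"),
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # shape of a US Social Security number
]

def flag_rows(rows):
    """Yield (row_index, field_value) for every field matching a sensitive pattern."""
    for index, row in enumerate(rows):
        for value in row:
            if any(p.search(str(value)) for p in SENSITIVE_PATTERNS):
                yield index, value

# Hypothetical sample rows standing in for a dataset about to be published.
sample_rows = [
    ["Depot A", "open to public"],
    ["Depot B", "SECRET: fuel store"],
    ["Clinic", "123-45-6789"],
]

for index, value in flag_rows(sample_rows):
    print(f"row {index}: review before release -> {value!r}")
```

The patterns here are case-sensitive and purely syntactic, which is one reason flagged and unflagged rows both still need human review.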

Best Practices for Publishing Open Data Securely

Organizations releasing open data should adopt these practices to minimize security risks:

Thoroughly Scrub Personal Information

Remove direct personal identifiers such as names, addresses and ID numbers, and remove or generalize indirect identifiers such as exact timestamps and exact geolocations. Use techniques like k-anonymity, differential privacy and aggregation to lower re-identification risk.
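The k-anonymity idea mentioned above can be checked mechanically: a dataset is k-anonymous with respect to a set of quasi-identifiers if every combination of their values is shared by at least k records. The following is a minimal sketch of such a check; the column names, the sample rows and the helper names are hypothetical.

```python
from collections import Counter

def smallest_group_size(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination appears in at least k rows."""
    return smallest_group_size(rows, quasi_identifiers) >= k

# Hypothetical rows whose ZIP codes and ages have already been generalized.
rows = [
    {"zip": "021**", "age_band": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip": "021**", "age_band": "40-49", "diagnosis": "flu"},
    {"zip": "021**", "age_band": "40-49", "diagnosis": "diabetes"},
    {"zip": "021**", "age_band": "40-49", "diagnosis": "flu"},
]

print(smallest_group_size(rows, ["zip", "age_band"]))
print(is_k_anonymous(rows, ["zip", "age_band"], k=3))
```

If the check fails, the usual remedies are to generalize the quasi-identifiers further (wider age bands, shorter ZIP prefixes) or to suppress the records in the smallest groups.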

Limit Granularity of Spatial-Temporal Data

Reduce resolution of spatial information to larger areas and truncate timestamps to coarser time periods like month or year instead of exact time. This limits what private information can be inferred while retaining overall usefulness.
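A minimal sketch of this kind of coarsening is shown below. It rounds coordinates to one decimal place (roughly 11 km in latitude) and truncates timestamps to the month. The function names and the choice of one decimal place are assumptions for illustration; the right resolution depends on the dataset and how densely populated the area is.

```python
from datetime import datetime

def coarsen_point(lat, lon, decimals=1):
    """Round coordinates so a point only identifies a coarse grid cell."""
    return round(lat, decimals), round(lon, decimals)

def coarsen_timestamp(ts):
    """Keep only the year and month of a timestamp."""
    return ts.strftime("%Y-%m")

# Hypothetical exact observation.
exact_point = (42.3736, -71.1097)
exact_time = datetime(2024, 2, 23, 14, 35)

print(coarsen_point(*exact_point))    # (42.4, -71.1)
print(coarsen_timestamp(exact_time))  # 2024-02
```

Rounding alone does not guarantee privacy in sparsely populated cells, so it is usually combined with a minimum-count rule such as the k-anonymity check.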

Develop Clear Data Classification Policies

Classify all data into sensitivity categories and establish clear policies on what can be made open. Periodically reassess classifications as data utility and privacy risks evolve over time.

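Use Privacy-Preserving Release Mechanisms like Differential Privacy

Differential privacy is not a file format but a property of the way statistics are computed and released. A differentially private mechanism adds carefully calibrated random noise to aggregate results, which gives a mathematical bound on how much the presence or absence of any single individual can change the published output. The amount of noise is controlled by a privacy parameter, usually written epsilon: smaller values give stronger privacy but less accurate results, so publishers must choose a balance between data utility and privacy risk.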
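One common way to achieve differential privacy for a counting query is the Laplace mechanism, sketched below. A count changes by at most 1 when one person is added or removed, so adding Laplace noise with scale 1/epsilon is sufficient. The function names and sample data are hypothetical, and this is an illustration only: it uses Python's general-purpose random module and ordinary floating-point arithmetic, neither of which is suitable for a production release.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of values matching predicate (sensitivity 1, scale 1/epsilon)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)

# Hypothetical data: 1 marks a record with some sensitive attribute.
data = [1] * 40 + [0] * 60

noisy = dp_count(data, lambda v: v == 1, epsilon=0.5)
print(f"noisy count: {noisy:.1f} (true count is 40)")
```

Each released statistic consumes part of the overall privacy budget, so repeated queries against the same data need a correspondingly smaller epsilon per query.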

Perform Risk-Benefit Analysis

Weigh risks like re-identification and misuse against benefits of openness for each dataset. Avoid opening up datasets providing little public benefit if they carry non-trivial risks.

Employ Access Controls for Sensitive Datasets

For high-risk but useful datasets, use access control mechanisms such as registration, justification or eligibility requirements. Put legal safeguards in place, such as terms of use that prohibit re-identification and other misuse and that limit the publisher's liability.

Conclusion

Open data and public datasets provide many civic and economic benefits but also come with cybersecurity and privacy risks. Organizations should systematically assess and address these risks while extracting maximal value from data. With careful policies and responsible use of emerging privacy-preserving technologies, the positives of open data can outweigh the risks.