Open data initiatives promise neutral access to government information. In practice, what gets published, in what format, and how usable it is all reflect political choices.
A few years ago I was trying to use a publicly available government dataset for a research project. The dataset was listed on a state agency's open data portal, it was free to download, and it was described as containing records going back several years. When I actually opened the file, I found columns with no labels, fields with values that turned out to be internal codes with no published code book, and records that stopped updating about eighteen months before I downloaded them. The data was "open" by any technical definition. It was essentially unusable without information I could not find.
This experience is not unusual. It is close to the norm for a large share of government open data, and it points to a problem that the open data movement has been slow to fully acknowledge. Availability is not the same as accessibility. And the decisions that determine what gets published, in what format, with what documentation, and how often it gets updated are not neutral technical choices. They are political ones.
The Obama administration launched Data.gov in 2009, around the same time the HITECH Act was passing and the government was thinking more broadly about using technology to improve public services and accountability. The European Union has its own set of open data directives, and many national governments have made formal commitments to publishing government data. These commitments are real. The institutional infrastructure for open data (the portals, the standards, the policies) does exist and has expanded significantly over the past decade and a half.
But the selection of what to publish tells a different story. Data that helps the public understand what a government agency does but does not create accountability for the agency is easy to publish. Agency budget summaries, administrative statistics, geographic reference data. These are the kinds of datasets that appear frequently on open data portals because they are genuinely useful to some users and carry no political risk to the publishing agency. Data that would allow a journalist or a researcher to hold an agency accountable for specific decisions is a different matter. Enforcement records, inspection results, complaint outcomes, contracting decisions that favored specific vendors. This kind of data is often incomplete, delayed, in a format that resists systematic analysis, or simply not published.
I do not want to be conspiratorial about this. Most of the time it is not that agencies are deliberately hiding embarrassing data. It is more that publishing data requires resources, that the default is not to publish unless required, and that when resources are allocated, the datasets that carry institutional risk tend to fall lower on the priority list than the ones that do not. The result looks the same from the outside: accountability data is harder to find and use than administrative data.
Format matters more than most people realize. A PDF scan of a report is technically a published document. It is not queryable data. A spreadsheet of records with no consistent schema across years is technically open data. It is not analytically useful without substantial cleaning work. The standard of what counts as "open" matters enormously, and many government open data portals meet a very minimal version of that standard. The five-star open data model, which runs from simple availability through linked open data, is a useful framework for thinking about the spectrum. Most government datasets cluster at the lower stars.
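To make the schema point concrete, here is a minimal sketch of the kind of check I end up writing before I trust a multi-year dataset. It compares the header row of each yearly release with the one before it and reports which columns appeared or disappeared. The column names and the two tiny releases are invented for illustration, and a real check would also need to look at types and value codes, not just headers.

```python
import csv
import io


def header_of(csv_text):
    """Return the header row (list of column names) of a CSV document."""
    return next(csv.reader(io.StringIO(csv_text)))


def schema_drift(releases):
    """Compare each release's header with the previous release's header.

    `releases` maps a label (for example a year) to CSV text, in
    chronological order. Returns, for every release after the first,
    the columns that were added and removed.
    """
    labels = list(releases)
    drift = {}
    for prev, curr in zip(labels, labels[1:]):
        before = set(header_of(releases[prev]))
        after = set(header_of(releases[curr]))
        drift[curr] = {
            "added": sorted(after - before),
            "removed": sorted(before - after),
        }
    return drift


# Hypothetical releases: the 2020 file renames one column and adds another.
releases = {
    "2019": "facility_id,inspection_date,result\n101,2019-03-02,pass\n",
    "2020": "FacilityID,inspection_date,result,inspector\n101,2020-04-11,fail,J. Doe\n",
}

for label, change in schema_drift(releases).items():
    print(label, change)
```

A renamed column shows up here as one removal plus one addition, which is exactly the kind of silent change that breaks year-over-year analysis.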
There is also a harder problem, the privacy tension, that does not go away even with political will and resources. The most analytically useful government data is often the most sensitive. Individual-level tax records could reveal patterns of tax avoidance by industry or geography in ways that aggregate statistics cannot. Individual-level health records could support research that saves lives. Individual-level location data collected by government agencies could enable surveillance research that no IRB would approve. The data that researchers would most want to use is frequently the data that creates the most serious risks if it is misused or incorrectly de-identified.
Privacy-preserving techniques like aggregation, suppression, differential privacy, and synthetic data generation exist and are improving. But there is a real tradeoff here that does not have a clean technical solution. The more you protect individual privacy, the less analytically useful the data becomes. Suppressing cells with small counts protects individuals but creates gaps in exactly the geographic or demographic breakdowns where coverage is already thin. Differential privacy adds noise that degrades precision for subgroup analysis. These are real costs. Pretending that privacy and utility can both be maximized without tradeoff is not honest.
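A small sketch shows why both techniques hurt most where the data is thinnest. It applies small-cell suppression and Laplace noise (the textbook mechanism for differentially private counts) to three made-up county counts. The threshold of 10, the epsilon of 0.1, and the counts themselves are assumptions I picked for illustration. Real disclosure-control rules vary by agency and usually also require complementary suppression, which this sketch ignores.

```python
import random


def suppress_small_cells(counts, threshold=10):
    """Replace any count below `threshold` with None (a suppressed cell)."""
    return {key: (n if n >= threshold else None) for key, n in counts.items()}


def laplace_noise(scale, rng):
    """Draw one sample from a zero-centered Laplace distribution.

    The difference of two independent exponential draws with rate
    1/scale follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_noisy_counts(counts, epsilon, rng=None):
    """Add Laplace noise with scale 1/epsilon to each count.

    Assumes each person contributes to at most one cell, so the
    sensitivity of every count is 1.
    """
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    return {key: n + laplace_noise(scale, rng) for key, n in counts.items()}


# Hypothetical counts of some event per county.
counts = {"County A": 4200, "County B": 310, "County C": 7}

print(suppress_small_cells(counts, threshold=10))
print(laplace_noisy_counts(counts, epsilon=0.1, rng=random.Random(0)))
```

With epsilon at 0.1 the noise has a scale of 10, which is negligible against a count of 4200 and larger than the true value for a count of 7. Suppression removes that same small cell entirely. Either way, the small county is the one that loses.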
The EU's GDPR creates one kind of regulatory environment for this tradeoff, and US privacy frameworks, which are more fragmented and less comprehensive than GDPR, create a different one. I am deliberately not getting into specific regulatory articles because the details matter a lot and I do not want to be imprecise. The directional point is that organizations operating under GDPR face different constraints on what data can be published, and how, than US agencies operating under a patchwork of sector-specific privacy rules. Neither environment makes open data easy. They make the difficulty different.
The version of open data I find most useful in practice is usually the kind that combines machine-readable data with good documentation, consistent schemas across updates, and some kind of accountability for keeping it current. These are not glamorous technical requirements. They do not require artificial intelligence or blockchain or any other technology that gets pitched as transforming government data. They require sustained attention, adequate staffing, and genuine organizational commitment to treating data publication as part of the agency's public function rather than as a low-priority administrative task.
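None of this needs sophisticated tooling. As a rough sketch, the two checks I care most about, whether every column is documented and whether the data is current, fit in a few lines. The function name, the one-year staleness cutoff, and the example columns are all my own assumptions, not any portal's standard.

```python
from datetime import date


def audit_dataset(columns, data_dictionary, last_updated, today, max_age_days=365):
    """Report undocumented columns and how stale a dataset is.

    `data_dictionary` maps column names to human-readable descriptions.
    A column counts as undocumented if it is missing or its description
    is empty.
    """
    undocumented = [
        col for col in columns
        if not (data_dictionary.get(col) or "").strip()
    ]
    age_days = (today - last_updated).days
    return {
        "undocumented_columns": undocumented,
        "age_days": age_days,
        "stale": age_days > max_age_days,
    }


# Hypothetical dataset: one documented column, one bare internal code,
# one unlabeled column, last refreshed about eighteen months ago.
report = audit_dataset(
    columns=["facility_id", "status_cd", "col_7"],
    data_dictionary={"facility_id": "Unique facility identifier"},
    last_updated=date(2022, 1, 15),
    today=date(2023, 7, 15),
)
print(report)
```

The hard part is not writing a check like this. It is having someone inside the agency whose job it is to run it and act on the result.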
What bothers me about how open data is often discussed is that the conversation focuses on the portals and the mandates and the star ratings and not enough on the organizational question of who is actually responsible for making the data good. A dataset with a nice API and terrible documentation is not more useful than a well-documented CSV. A portal with thousands of datasets that are all two years out of date is not more useful to a researcher or a journalist than a smaller set of datasets maintained with genuine care. Open data as a policy goal is meaningless without open data as an organizational commitment, and organizational commitments require resources, accountability, and someone inside the agency who genuinely cares whether the data is actually usable.