Zero-Trust Data Governance and the AI Provenance Problem

Gartner published a prediction in January 2026 that I have been sitting with for a few months: by 2028, 50 percent of organizations will adopt zero-trust data governance as unverified AI-generated data grows (Gartner, 2026, https://www.gartner.com/en/newsroom/press-releases/2026-01-21-gartner-predicts-by-2028-50-percent-of-organizations-will-adopt-zero-trust-data-governance-as-unverified-ai-generated-data-grows). The prediction is notable not because of the adoption figure but because of what is driving it. AI-generated data is accumulating inside organizations faster than organizations can track where it came from or whether it is accurate. That is a new kind of data quality problem, and the existing governance frameworks were not designed for it.

The scale of AI use matters here. McKinsey's 2025 State of AI report found that 88 percent of organizations are using AI in some form (McKinsey, 2025, https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai). Every one of those organizations is generating AI outputs. Reports that were drafted by a language model. Summaries of contracts that an AI produced. Customer profiles that were augmented with AI-inferred attributes. Code that was written or modified by an AI assistant. Data pipelines that were created by an AI and have never been reviewed line by line by a human. These outputs are flowing into enterprise data stores, being used to feed downstream models, being cited in presentations, and informing decisions. And most organizations have no reliable way to distinguish AI-generated content from human-generated content in their data.

The provenance problem is not unique to AI. Data quality has always been an organizational challenge. But the traditional data quality problem involves data that a human entered incorrectly, a system that transferred data imprecisely, or a format mismatch between source and destination. Those errors have recognizable patterns and known remediation paths. AI-generated errors are different. A language model can produce a summary that is internally coherent, grammatically clean, and factually wrong. It can infer an attribute for a customer record that looks reasonable in the context of the rest of the record but was never verified against any external source. The output looks like data. It has the same format as data. It sits in the same database as data. But it is a model's best guess, and the model's confidence level is not stored alongside the record.

Zero-trust data governance applies the zero-trust security principle to data. In zero-trust security, the assumption is that no connection, device, or user is trusted by default, even inside the perimeter. Every access request requires verification. Zero-trust data governance extends this logic: no data asset is assumed to be accurate or authoritative until its lineage has been verified. Every dataset carries documentation of its origin, the transformations it went through, and the conditions under which it was produced. AI-generated content is flagged as such, with the model version, the prompt context, and any human review that was applied before the output was accepted into the data catalog.
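To make that concrete, here is a minimal sketch of what default-deny trust could look like for a single data asset. The Provenance class, its field names, and the is_trusted function are hypothetical names invented for illustration, not part of any catalog product. The rule that AI-generated content needs a named human reviewer before it counts as trusted is my assumption about one reasonable policy, not something the zero-trust principle dictates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance metadata attached to one data asset."""
    origin: str                             # source system or producer
    transformations: tuple[str, ...] = ()   # ordered processing steps
    ai_generated: bool = False              # explicit flag for AI output
    model_version: Optional[str] = None     # expected when ai_generated
    prompt_context: Optional[str] = None    # expected when ai_generated
    reviewed_by: Optional[str] = None       # human reviewer, if any


def is_trusted(prov: Optional[Provenance]) -> bool:
    """Default-deny: nothing is trusted until its lineage checks out."""
    # No provenance at all, or no stated origin: reject.
    if prov is None or not prov.origin:
        return False
    # Assumed policy: AI-generated content must carry the model version,
    # the prompt context, and a named human reviewer.
    if prov.ai_generated:
        return all([prov.model_version, prov.prompt_context, prov.reviewed_by])
    return True
```

The design choice that matters is the direction of the default. A record with missing metadata fails the check rather than passing it, which is the opposite of how most data stores treat an empty provenance column today.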

The tools that make this possible are data catalogs, data contracts, and lineage tracking systems. Data catalogs document what data exists and where it came from. Data contracts define the expected schema, quality standards, and provenance requirements for data moving between systems. Lineage tracking records the transformation history of a dataset from its origin to its current state. These tools exist. The challenge is that deploying them at enterprise scale, across the heterogeneous data environments that most large organizations operate, is a substantial engineering and governance undertaking. The 50 percent adoption figure Gartner is predicting by 2028 implies that roughly half of organizations will take on that work in the next two years, with the AI provenance problem forcing the issue.
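As a rough illustration of two of those pieces, the sketch below shows a toy data contract and a toy lineage log. DataContract, Lineage, and every field and method name are hypothetical, and real tools do far more: quality thresholds, versioning, and cross-system propagation are all left out here. The point is only to show the shape of the checks.

```python
from dataclasses import dataclass, field


@dataclass
class DataContract:
    """Hypothetical contract for records moving between two systems."""
    required_fields: dict[str, type]                     # expected schema
    required_provenance: tuple[str, ...] = ("origin",)   # metadata keys

    def violations(self, record: dict, provenance: dict) -> list[str]:
        """Return every way the record breaks the contract."""
        problems = []
        for name, expected in self.required_fields.items():
            if name not in record:
                problems.append(f"missing field: {name}")
            elif not isinstance(record[name], expected):
                problems.append(f"wrong type for field: {name}")
        for key in self.required_provenance:
            if not provenance.get(key):
                problems.append(f"missing provenance: {key}")
        return problems


@dataclass
class Lineage:
    """Append-only transformation history for one dataset."""
    origin: str
    steps: list[str] = field(default_factory=list)

    def record(self, step: str) -> "Lineage":
        """Log one transformation and return self for chaining."""
        self.steps.append(step)
        return self

    def history(self) -> list[str]:
        """Origin first, then each transformation in order."""
        return [self.origin, *self.steps]
```

A pipeline might call violations() at each system boundary and refuse or quarantine any record that returns a non-empty list, while calling record() once per transformation so that history() can answer where a dataset has been.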

The regulatory driver reinforces this. The EU AI Act requires documentation of training data for AI systems used in high-risk applications. If an organization uses an AI system to make consequential decisions, it needs to be able to show what data the model was trained on, whether that data was accurate and representative, and what governance processes were applied during training. If the training data itself includes AI-generated content that was not documented, the organization cannot answer those questions. The compliance exposure from AI-generated training data contamination is a forcing function for better data lineage practices, independent of whether the organization cares about data quality for its own sake.

IBM's 2024 Cost of a Data Breach Report found that organizations using AI and automation in security saved an average of $2.2 million per breach (IBM, 2024, https://newsroom.ibm.com/2024-07-30-ibm-report-escalating-data-breach-disruption-pushes-costs-to-new-highs). That figure is often used to justify AI investment in security. But the savings come from faster detection and response, which depends on the AI system having access to clean, timely, and accurately labeled data about network behavior. If the security data environment is contaminated with AI-generated signals that were never verified, the detection system is working on a degraded data foundation. Zero-trust data governance matters to security AI performance, not just to compliance.

The information systems (IS) research question underneath this is about trust calibration. Trust in data is not binary. A human analyst working with a dataset has some sense of how reliable it is based on experience with the source, knowledge of the collection process, and pattern recognition built over time. That calibrated trust is tacit and hard to operationalize at scale. Zero-trust data governance is an attempt to make data trust explicit and verifiable rather than tacit and assumed. The challenge is that explicit verification has a cost: the engineering resources to build the lineage infrastructure, the governance processes to maintain it, and the organizational discipline to actually reject or flag data that cannot be verified. Most organizations have not been willing to pay that cost for historical data. The arrival of AI-generated data everywhere may finally make the cost of not paying it visible.

My intuition is that the organizations that move fastest on zero-trust data governance will be the ones that have discovered that one of their AI systems made a significant error traceable to an unverified AI-generated data source. The organizations that move proactively will be the ones that understand the IS literature on data quality and its consequences for decision-making well enough to act before the error happens. Those are two different kinds of organizations, and they will have different data governance trajectories over the next five years.