AI Training Data Made the Knowledge-Based View Visible

I was reading my study notes on the Knowledge-Based View the same week the New York Times lawsuit against OpenAI was making headlines. Nonaka (1994) arguing that knowledge is not just another VRIN resource but is tacit, embedded, and socially constructed. Grant (1996) claiming that the firm exists to integrate specialized knowledge held by individuals. Spender (1996) making knowledge the basis for a theory of the firm rather than just a resource category. And in a completely different browser tab, the Times alleging that OpenAI copied millions of its articles without permission to train systems that directly compete with its content. The two tabs said the same thing in different languages. Knowledge is the central strategic resource. And when someone extracts it for their own use, the entire question of who owns what becomes a legal battle.

The Knowledge-Based View extends the Resource-Based View by making knowledge the central strategic resource rather than one among many. I have written before about how the Resource-Based View treats IT spending as rarely VRIN because competitors can buy the same technology. Barney (1991) gave us the VRIN framework: valuable, rare, inimitable, and non-substitutable. Bharadwaj (2000) showed that IT capability, the organizational ability to deploy IT resources effectively, can meet that bar even when IT spending alone cannot. KBV goes further. It says knowledge is not just another VRIN resource but is qualitatively different because it is tacit, embedded, and socially constructed. Nonaka's (1994) SECI model specifies the mechanism: knowledge moves between tacit and explicit forms through socialization, externalization, combination, and internalization, and the spiral of those conversions drives competitive advantage. I have been sitting with this idea for weeks because I keep seeing it play out in the AI training data debates, and the connection is almost too clean.

The KBV should not be mistaken for knowledge management. It is a strategic theory of the firm, not an operational approach to organizing documents. The New York Times lawsuit against OpenAI and Microsoft, filed in December 2023, makes that distinction tangible. The complaint alleges that OpenAI copied millions of Times articles to train the large language models behind ChatGPT and Copilot. Read the complaint through a KBV lens and the logic is striking. The Times archive is valuable because it contains high-quality journalism accumulated over more than a century. It is rare because no other organization has that specific institutional knowledge, those reporting networks, those editorial practices. It is hard to imitate because producing a comparable archive would require another century of consistent investment that no single competitor is willing to make. The Times is not arguing that OpenAI stole software. It is arguing that OpenAI extracted the firm's central strategic resource: its accumulated knowledge. The lawsuit is not really about copyright in the narrow sense. It is about a firm whose primary asset is knowledge discovering that a technology has emerged to extract and recombine that knowledge at a scale that undermines the business model that produced it.

Getty Images versus Stability AI follows the same pattern. Getty licenses a massive library of professional photographs. That library is the company's primary asset. According to Getty, Stability AI trained Stable Diffusion on millions of those images without a license, building a product that generates synthetic images competing with Getty's own offerings. The KBV lens makes this visible in a way that copyright law alone misses. Getty's photographs are not just pictures. They are proprietary knowledge encoded in millions of carefully curated, professionally produced visual assets. The UK High Court ruled on this case in 2025, rejecting the theory that the model weights themselves are infringing copies but finding Stability liable for reproducing Getty's watermarks in some generated outputs. The legal outcome is mixed, but the strategic question is clear. If your core asset is knowledge and someone trains on it without your permission, what is left of your business?

The wave of negotiated licensing deals reveals the same dynamic from the other direction. Stack Overflow signed an agreement with OpenAI in May 2024 to provide access to its database of 59 million programming questions and answers. The deal gives OpenAI validated human expertise for model training and gives Stack Overflow a revenue stream with attribution for its community's knowledge. News Corp, owner of the Wall Street Journal, struck a broader deal with OpenAI around the same time, reportedly valued at over 250 million dollars over five years, covering access to the Journal, Barron's, MarketWatch, and other publications.

These deals play out exactly as the KBV's account of knowledge ownership would predict. If knowledge is the central strategic resource, then firms that own valuable knowledge will seek to control access to it and extract value from its use. That is what every negotiated license represents. The deals and the lawsuits are saying the same thing through different channels. The contested question is not whether knowledge has value. The KBV settled that thirty years ago. The contested question is what happens when a technology makes that knowledge transferable at essentially zero marginal cost.

The emerging market for training data licensing is the KBV becoming institutional. Organizations that spent decades accumulating specialized knowledge (news organizations, photo libraries, coding forums, academic publishers) are suddenly discovering that their archives have a value they never priced. Industry observers have noted that the data used to train AI models is becoming the most contested strategic asset in modern business. I think this is exactly right, and I think the KBV explains why. The theory never claimed that knowledge would change in value over time. It claimed that knowledge is always the central strategic resource. What changed is the technology to extract and apply it at scale. AI training did not create new value in knowledge. It made the existing value visible.

I keep thinking about what the KBV says about tacit knowledge and the limits of extraction. Nonaka's framework starts with socialization, the transfer of tacit knowledge through shared experience, apprenticeship, and practice. That part cannot be scraped. You cannot train a model on the editorial judgment a reporter develops after twenty years on a beat. But the explicit knowledge (the articles, the photographs, the forum answers) can be extracted and recombined. The KBV has always treated tacit and explicit knowledge as related but fundamentally different in kind. AI training data is forcing us to confront what that difference actually means in practice. A firm's explicit knowledge can now be copied and embedded in a model that competes with the knowledge producer. The tacit knowledge that produced the explicit output (the judgment, the relationships, the craft) remains uncopied. That gap between what can be extracted and what cannot is where the real strategic question lives now. I am not sure anyone has a clean answer yet, but I know the KBV gives us the right language to ask it.