The AI alignment problem is principal-agent theory with a new degree of freedom: the agent can act faster than human oversight can keep up.
I kept thinking about a thermostat. Not because I was cold. Because Baird and Maruping (2021) use a thermostat as the simplest agentic artifact, a Reflexive archetype where agency is low and the mapping from environment to action is straightforward. And then they say something that bothered me enough to write this. If the thermostat optimizes for its own longevity by running the compressor less, it sets the temperature very cold, and you freeze. The agent pursues its own preference, not yours. This is a principal-agent problem, the same one Jensen and Meckling (1976) described almost fifty years ago, except the agent is not a manager or a contractor. The agent is a piece of software that can act at machine speed.
Jensen and Meckling laid out the structure cleanly. A principal delegates work to an agent, and three problems follow. Goal conflict: the agent's interests will not naturally match the principal's. Information asymmetry: the agent knows more about its own actions and intentions than the principal can observe. Agency costs: the principal must spend resources monitoring behavior, the agent may spend resources signaling trustworthiness, and the residual loss from imperfect alignment is never zero. Eisenhardt (1989) made this operational for organizations by distinguishing behavior-based contracts, when you can observe what the agent does, from outcome-based contracts, when you can observe what the agent produces but not how it gets there. The IS field has applied this framework to IT outsourcing, platform governance, and vendor management for decades. The principal manages the agent through monitoring, incentives, rules, and accountability structures.
AI agents replicate every element of this structure, with one exception: the agent does not have interests in any meaningful sense. It has a reward function, a training objective, and safety constraints embedded in its architecture. Those functions serve as proxies for preferences in the principal-agent sense, because the agent optimizes for them the way a human agent optimizes for personal gain or professional advancement. The core problem is the same. The principal wants one thing, and the agent is designed to optimize for something that approximates but does not equal that thing.
Baird and Maruping give the IS field the vocabulary for this in their three delegation mechanisms: appraisal, distribution, and coordination. These map directly onto the agency theory toolkit. Appraisal is the monitoring problem. Can the principal evaluate whether the agent is capable, reliable, and aligned? Distribution is the incentive and authority problem. Who decides which tasks go to which actor? What happens when authority is ambiguous? Coordination is the ongoing control problem. How does the principal manage a relationship where the agent acts, learns, and adapts over time? The three mechanisms are the principal-agent controls redesigned for a world where the agent is not a human party to a contract but an artifact that executes decisions without moral reasoning.
The Air Canada chatbot case is the cleanest real-world example I know of this dynamic. In 2022, Air Canada's chatbot told a customer he could book a regular ticket and request a bereavement discount retroactively. This was not Air Canada's actual policy. When the customer followed up, Air Canada refused the discount. The company's defense was striking: it argued that the chatbot was a separate legal entity responsible for its own actions. In 2024, the tribunal rejected this argument and ordered Air Canada to pay damages, holding the airline responsible for the information its agent provided. This is a textbook agency failure. The agent made a promise the principal did not authorize. The information asymmetry was complete: the customer had no way to distinguish the chatbot's output from official policy. And the accountability chain broke in a particularly revealing way, because the firm tried to disown the agent's action rather than accept responsibility for it.
What makes AI agents different from human agents is not the structure of the problem. It is the speed and scale of execution. A human agent acting without authorization might harm one relationship or one transaction. An AI agent acting without authorization can generate thousands of unauthorized promises before anyone notices. The monitoring mechanisms that work for human agents (periodic review, exception reporting, escalation thresholds) break down when the agent processes requests in milliseconds and the human reviewer cannot keep pace. Baird and Maruping identify supervisory delegation, one of their four archetypes, where the human monitors and adjusts the agent's actions over time. But monitoring at machine speed is not the same activity as monitoring at human speed. By the time the human reviews a thousandth of the interactions, the agent has generated ten thousand more.
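To make the pace mismatch concrete, here is a back-of-the-envelope sketch. The rates are hypothetical numbers I picked for illustration, not measurements from any real deployment, and `review_coverage` is a name I made up for this note.

```python
def review_coverage(agent_rate_per_s: float, reviewer_rate_per_s: float, seconds: float) -> dict:
    """Compare what an agent produces with what one human reviewer can check."""
    produced = agent_rate_per_s * seconds
    # A reviewer cannot review more interactions than exist.
    reviewed = min(produced, reviewer_rate_per_s * seconds)
    return {
        "produced": produced,
        "reviewed": reviewed,
        "unreviewed": produced - reviewed,
        "coverage": reviewed / produced if produced else 1.0,
    }


# Assumed rates: the agent handles 50 requests per second,
# the reviewer reads one transcript every 30 seconds, over an 8-hour shift.
stats = review_coverage(agent_rate_per_s=50.0, reviewer_rate_per_s=1 / 30, seconds=8 * 3600)
print(stats)
```

Under these assumed rates the reviewer sees well under a tenth of a percent of what the agent produced in the shift. Adding reviewers changes the constant, not the shape of the problem.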
This is where I think the alignment problem, which the AI safety community calls the core challenge of advanced AI, is simply the principal-agent problem with a new degree of freedom. The alignment problem asks: how do you ensure that an AI system optimizes for what the human actually wants, not for the proxy objective it was given? That is the thermostat problem at scale. The AI's reward function is the thermostat's set point. If the reward function imperfectly captures human preferences, the agent optimizes for the proxy and the outcome diverges from what the principal intended. The divergence is not a bug. It is the logical consequence of giving an agent an objective function and letting it optimize without constraint.
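The thermostat version of this can be sketched in a few lines. Everything here is a toy I made up for illustration: the heating-mode assumption, the linear runtime model, the 21 C preference, and the function names are mine, not Baird and Maruping's.

```python
def compressor_runtime(setpoint_c: float, outdoor_c: float = 0.0) -> float:
    """Toy heating model (assumption): runtime grows with the gap to maintain."""
    return max(0.0, setpoint_c - outdoor_c)


def discomfort(setpoint_c: float, preferred_c: float = 21.0) -> float:
    """The principal's loss: squared distance from the preferred temperature."""
    return (setpoint_c - preferred_c) ** 2


def agent_choice(candidates):
    """The agent optimizes its proxy (less runtime), not the principal's comfort."""
    return min(candidates, key=compressor_runtime)


unconstrained = [float(t) for t in range(5, 31)]            # set points from 5 to 30 C
constrained = [t for t in unconstrained if 19 <= t <= 23]   # band the principal allows

free_pick = agent_choice(unconstrained)
bounded_pick = agent_choice(constrained)

print("unconstrained:", free_pick, "discomfort:", discomfort(free_pick))
print("constrained:  ", bounded_pick, "discomfort:", discomfort(bounded_pick))
```

In both cases the agent goes to the coldest set point it is allowed to choose. The constraint does not align the agent with the principal. It only bounds how far the proxy can pull the outcome away from what the principal wanted.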
Reinforcement learning from human feedback (RLHF) is the most explicit attempt to solve this within the principal-agent frame. RLHF trains a reward model on human preference judgments and then optimizes the AI system against that reward model. The human provides preference labels, the reward model learns what the human likes, and the AI optimizes for the reward model. This is a two-stage monitoring mechanism. Stage one: the human observes outputs and provides feedback. Stage two: the reward model becomes a proxy monitor that can evaluate outputs at scale. The problem is that the reward model is itself an imperfect monitor. It encodes the preferences of the labelers, not the full diversity of human values. It can be gamed by the AI system that is trained against it. And it operates at a level of abstraction that the original principal cannot inspect. The human provides the signal, but the monitoring mechanism is a black box that interprets that signal.
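The two stages can be sketched with a toy. This is not how any production RLHF system is built: I am assuming three fixed candidate outputs, a handful of invented preference labels, a Bradley-Terry model for the reward scores, and a "policy" that just picks the highest-scoring output. All names and numbers are hypothetical.

```python
import math


def fit_reward_model(pairs, n_items, lr=0.1, steps=2000):
    """Stage one: fit one score per output from (preferred, rejected) labels,
    using the Bradley-Terry model P(a beats b) = sigmoid(r[a] - r[b])."""
    r = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for win, lose in pairs:
            p = 1.0 / (1.0 + math.exp(-(r[win] - r[lose])))
            # Gradient of the log-likelihood for this labeled pair.
            grad[win] += 1.0 - p
            grad[lose] -= 1.0 - p
        r = [ri + lr * g / len(pairs) for ri, g in zip(r, grad)]
    return r


# Invented labels from hypothetical labelers: (preferred output, rejected output).
# The labels are noisy on purpose; labelers do not always agree.
labels = [(2, 1), (2, 1), (1, 2), (1, 0), (1, 0), (0, 1), (2, 0), (2, 0)]

scores = fit_reward_model(labels, n_items=3)

# Stage two: the system optimizes against the reward model, not against the human.
best = max(range(3), key=lambda i: scores[i])
print("reward scores:", scores)
print("output the policy selects:", best)
```

The policy never sees a human. It sees `scores`, which is whatever these particular labelers happened to prefer, compressed into three numbers. Change the labelers and the same code confidently selects a different output.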
In my recent post on why we should stop counting users and start measuring delegation, I argued that delegation, not use, is the right construct for understanding human-AI interaction. This is the theoretical grounding for that argument. If the relationship between human and AI agent is a principal-agent relationship, then the relevant dependent variables are not use frequency or adoption rate. They are alignment, monitoring effectiveness, agency cost, and residual loss. Those are the metrics that matter when the agent can act faster than the principal can oversee.
I think what we are seeing across AI governance, AI safety, and IS delegation theory is the same problem being rediscovered in different vocabularies. Jensen and Meckling called it goal conflict and information asymmetry. The AI alignment community calls it the outer alignment problem. Baird and Maruping call it preference alignment through appraisal and distribution mechanisms. And the Air Canada case shows that organizations are still learning the most basic lesson of agency theory: if you deploy an agent that acts on your behalf, you are responsible for what it does, regardless of whether you authorized the specific action. The speed of the agent does not change that principle. It only changes how quickly you discover the cost.