The Gap Between Fair Metrics and Fair Experience (Amazon's AI Hiring Edition)

The number that has stayed with me since I first read it is not from Kattnig et al. (2024), though their paper is the one that made me see it clearly. The number is from Amazon. Starting in 2014, Amazon built an experimental AI hiring tool that scored job candidates on a scale of one to five stars. The story became public in 2018 through Reuters reporting. The system learned from ten years of resumes submitted to Amazon, most of them from men. It reportedly downgraded resumes that contained the word "women's," as in "women's chess club captain," and penalized graduates of two all-women's colleges. Amazon tried to fix it. Engineers edited the model to be neutral to those specific terms. That did not guarantee the model would not find other proxies for gender. Amazon eventually scrapped the whole project.

Kattnig et al. (2024) compare technical and legal perspectives on fairness in AI. Their paper is about the European AI Act and the gap between what a fairness metric measures and what the law requires. But the finding that matters most to me is simpler. A model can satisfy any single technical fairness metric, be calibrated for equal error rates across groups, pass every statistical test the team designs, and still produce outcomes that are deeply unfair to the people it classifies. The Amazon case is the illustration I keep returning to. The tool was accurate enough by the metrics the team tracked. They ran experiments, adjusted weights, retrained. The discrimination persisted because the metric did not capture what was happening.

This is the gap I think the information systems (IS) field should be talking about more. The metrics-versus-experience gap.

When you look at LinkedIn's AI suggesting male names disproportionately for executive searches, or the HireVue video interview analysis that drew a complaint to the FTC, or the US EEOC initiative on algorithmic hiring discrimination that has been running since 2021, the pattern is the same. The models look fine on paper. The error rates are balanced. The accuracy is acceptable. The discrimination is invisible from the metric dashboard. It only shows up when you trace the process, when you look at who actually got called back, who got shortlisted, who got hired.

And this is exactly where CARE theory enters. Leidner and Tona (2021) define CARE as claims, affronts, response, and equilibrium. Digital technologies create claims to dignity when people expect respect, autonomy, recognition, and fair treatment. They create affronts when data practices humiliate, injure, or fail to recognize behavioral, meritocratic, or inherent dignity. I wrote about this in more detail in a previous post about CARE not being about privacy. But the connection I want to make here is specific. When a denied candidate cannot find out why they were rejected, when they cannot challenge the decision, when they face a system so opaque that even Amazon's own engineers could not fully explain why the model penalized women's chess clubs, the affront is not to accuracy. It is a dignity failure. It touches behavioral dignity, the dignity you earn through your actions and merit. It touches meritocratic dignity, the expectation that your achievements will be recognized. And it touches inherent dignity, your worth as a person independent of any score.

Kattnig et al. (2024) discuss procedural fairness through the framework of Tyler (2006) and Colquitt (2001), identifying four key elements: voice, neutrality, respect, and trust. Voice means the opportunity to be heard. Neutrality means the impartiality of the decision maker. Respect means treatment with dignity and politeness. Trust means the perceived legitimacy of the process. None of these four elements can be measured by a fairness metric. You cannot compute voice from a confusion matrix. You cannot calculate respect from an error rate. The technical fairness literature has dozens of definitions: demographic parity, equal opportunity, equalized odds, predictive parity. These are all real and useful constructs. But they measure distributional outcomes, not procedural experience. A model can achieve demographic parity across groups and still deny every single candidate the right to understand the decision, to challenge it, or to be treated as an individual rather than a classification.
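To make the distributional point concrete, here is a minimal Python sketch on toy data I made up for this post. The Candidate class, the group labels "A" and "B", and the helper names are all hypothetical, and the counts are not drawn from Amazon or any real system. It computes two of the metrics named above and shows that one can look clean while the other does not. Notice also what the data structure does not contain: there is no field for voice, respect, or trust.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    group: str       # hypothetical group label ("A" or "B")
    qualified: bool  # ground truth, assumed known only for this sketch
    selected: bool   # the model's decision


def selection_rate(cands, group):
    """Share of a group's candidates that the model selected."""
    members = [c for c in cands if c.group == group]
    return sum(c.selected for c in members) / len(members)


def true_positive_rate(cands, group):
    """Share of a group's qualified candidates that the model selected."""
    qualified = [c for c in cands if c.group == group and c.qualified]
    return sum(c.selected for c in qualified) / len(qualified)


def make(group, qualified, selected, n):
    """Build n identical toy candidates."""
    return [Candidate(group, qualified, selected) for _ in range(n)]


# Made-up counts: ten candidates per group, five qualified in each.
candidates = (
    make("A", True, True, 4) + make("A", True, False, 1)
    + make("A", False, False, 5)
    + make("B", True, True, 2) + make("B", True, False, 3)
    + make("B", False, True, 2) + make("B", False, False, 3)
)

# Demographic parity gap: difference in selection rates between groups.
dp_gap = selection_rate(candidates, "A") - selection_rate(candidates, "B")

# Equal opportunity gap: difference in selection rates among the qualified.
eo_gap = true_positive_rate(candidates, "A") - true_positive_rate(candidates, "B")

print(f"demographic parity gap: {dp_gap:.2f}")
print(f"equal opportunity gap:  {eo_gap:.2f}")
```

On this toy data both groups are selected at the same rate, so the demographic parity gap is zero, while qualified candidates in group B are selected half as often as qualified candidates in group A. A dashboard that tracked only the first number would show nothing wrong. And neither number says anything about whether a rejected candidate could ask why.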

The EEOC's difficulty regulating this kind of harm makes more sense when you see it through the metrics gap. A regulator looks at the model, runs the standard fairness tests, and finds nothing obviously wrong. The model passes demographic parity. It passes equal opportunity. The metrics are clean. But the discrimination is still happening. Finding it requires process tracing: looking at how the model was trained, what the training data contained, how the selection thresholds were set, and whether the system was validated against the population it would actually screen. Most regulators do not have the resources to do that work. The metrics look fine, so the case is hard to make.

This is why I think Kattnig et al. (2024) is such a practically important paper. It tells regulators that the gap exists, that a clean metric dashboard is not evidence of fairness, and that they need to look at the process, not just the numbers. The problem is that, from what I can see, most regulators are not looking there yet. The EEOC initiative launched in 2021 has produced guidance and warnings, but the number of actual enforcement actions against algorithmic hiring discrimination remains very small. The FTC complaint against HireVue was followed by some changes in transparency practices, but the underlying models are still in use across hundreds of employers. The pattern of clean metrics masking real harm has not been disrupted.

The difference between demographic parity and procedural justice is the same gap I keep coming back to. Demographic parity is a number. It asks whether the selection rate is the same across groups. Procedural justice is about whether the process felt fair to the people who went through it. Kattnig et al. (2024) make this distinction explicit by comparing technical fairness definitions with legal frameworks that emphasize due process, contestability, and the right to an explanation. The legal tradition has always cared about how a decision was reached, not just what the outcome was. Technical fairness has mostly cared about whether the outcome distribution looks balanced. These two perspectives are not aligned. A system can pass every technical fairness check and fail every procedural justice requirement.
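For readers who want the "number" written out, this is the usual textbook statement of demographic parity. The notation is my own and is not taken from Kattnig et al. (2024): I am assuming a binary decision, with the model's decision written as Y-hat and the protected group attribute as A.

```latex
% Notation is an assumption for this post:
% \hat{Y} = 1 means "selected", A is the protected group attribute.
\[
  P(\hat{Y} = 1 \mid A = a) \;=\; P(\hat{Y} = 1 \mid A = b)
  \quad \text{for all groups } a, b
\]
```

Every term in that condition refers to an outcome or a group. Nothing in it refers to how the decision was reached, whether it was explained, or whether it could be contested, which is the part the legal tradition cares about.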

I think this metrics-versus-experience gap that Kattnig et al. (2024) identified is the most practically important IS ethics finding of the last five years. Not because it is the most theoretically elegant. The paper is a law and computer science comparison, not a new IS theory. But it is practically important because it tells regulators what to look for. It tells them that if they check off fairness metrics and call the job done, they will miss the discrimination that is actually happening. And it tells technologists that adding more metrics will not solve the problem, because the gap is not a measurement gap. It is a conceptual gap. The field has been treating fairness as a property of a model when it is also a property of a relationship between a system and a person.

The next Amazon-scale AI hiring tool is probably being trained right now. The question is whether anyone is asking the candidates how it feels.