When AI Takes Action: The Crisis of Reliability in the Agent Era
We are moving from systems that “talk” to systems that “do.” Are our evaluation tools ready? (Spoiler: No.)
For the past three years, the world has been obsessed with Generative AI. We marveled at its ability to write poetry, debug code, and hallucinate legal precedents. The risk profile was relatively clear: the primary danger was misinformation. If the model got it wrong, it produced bad text.
But as I wrapped up my recent AI Evaluation Skills Lab, a different, more complex reality became clear. We are leaving the era of the Chatbot and entering the era of the Agent.
This shift changes everything. When an AI moves from “talking” to “acting,” the cost of failure skyrockets. And the uncomfortable truth we faced in our sessions is this: We are trying to evaluate these new “doers” with the same broken yardsticks we used for the “talkers.”
The Broken Yardstick: Why Standard Evaluation is Failing
Before we even get to Agents, we have to admit that we are still struggling to evaluate standard LLMs. We rely on benchmarks and metrics that increasingly feel disconnected from reality.
During our Skills Lab, we dug into the research, and four specific failures stood out:
The Fairness Paradox: New research on The Impossibility of Fair LLMs suggests that achieving mathematical “fairness” in general-purpose models is theoretically impossible. We are chasing a ghost.
The Human Fallacy: We love the idea of “Human-in-the-Loop” as a safety net. But recent studies on Cascade Failures show that humans fail to catch algorithmic errors nearly 50% of the time. We are tired, we are slow, and we are biased.
The Taxonomy Gap: A recent analysis of 64,000 model cards revealed a massive disconnect. While developers focus heavily on “Bias” and “Safety,” real-world incident databases show that Fraud and Manipulation are the actual rising threats. We are optimizing for the wrong risks.
The Trust Deficit: We place blind faith in benchmarks. But as recent reports on facial recognition scores highlight, these benchmarks often use “ideal” images that don’t represent the messiness of the real world. We are “teaching to the test,” not testing for reality.
If we can’t perfectly measure a static model that just outputs text, what happens when we add the infinite complexity of an Agentic workflow?
The Anatomy of an Agent
To understand why evaluation is about to get much harder, we have to look at what an Agent actually is. It is no longer just a model; it is a System.
An Agent consists of:
The Brain (LLM): Handles the reasoning.
The Tools: Connections to the outside world (Calculators, Web Search, APIs).
The Orchestrator: The logic that decides which tool to use and when.
The Protocols: New standards like MCP (Model Context Protocol) that allow agents to plug-and-play with enterprise software, and A2A (Agent-to-Agent), which allows bots to collaborate with other bots across different organizations.
When you combine these, you get a Reasoning-Action Loop. The AI doesn’t just predict the next word; it plans a sequence of steps, executes them across different software environments, and observes the result.
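To make the loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tool names, the `scripted_brain` function that stands in for the LLM, and the `max_steps` cap are hypothetical, not taken from any particular agent framework.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry; names and behavior are illustrative only.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(sum(float(x) for x in expr.split("+"))),
    "search": lambda query: f"(stub result for '{query}')",
}

# One history entry: (tool name, tool input, observation).
Step = Tuple[str, str, str]


def scripted_brain(goal: str, history: List[Step]) -> Tuple[str, str]:
    """Stand-in for the LLM: returns (tool, input) or ("finish", answer)."""
    if not history:
        return ("calculator", "2+3")
    # After one observation, stop and report it as the answer.
    return ("finish", history[-1][2])


def run_agent(goal: str, brain=scripted_brain, max_steps: int = 5) -> str:
    """Orchestrator: plan, act, observe, repeat until done or out of steps."""
    history: List[Step] = []
    for _ in range(max_steps):  # the step cap guards against endless loops
        action, arg = brain(goal, history)  # reason: pick the next step
        if action == "finish":
            return arg
        if action not in TOOLS:  # reject tools that are not registered
            observation = f"error: unknown tool '{action}'"
        else:
            observation = TOOLS[action](arg)  # act: call the tool
        history.append((action, arg, observation))  # observe: record result
    return "stopped: step limit reached"
```

Even in a toy like this, the places where evaluation has to look are visible: the plan the brain produces, the tool call it triggers, and what the orchestrator does when a step fails or the loop never ends.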
New Architecture, New Risks
This architecture introduces a completely new risk taxonomy. We aren’t just worried about hate speech anymore; we are worried about process failure.
Loops and Cascades: If an agent hallucinates step 1 of a 10-step plan, it doesn’t just say something wrong; it might delete the wrong database entry, creating a cascade of errors that is hard to trace.
The “Many Hands” Problem: If an Agent built by Company A uses a model from OpenAI, connected via an Orchestrator, to access a database from Salesforce... who is responsible when it breaks? The governance frameworks (like the EU AI Act) are still catching up to this tangled web of liability.
Red Teaming 2.0: You can’t just Red Team an agent by asking it bad questions. You have to simulate complex, multi-turn scenarios to see if it can be tricked into executing harmful actions (like transferring funds or exfiltrating data).
Capability vs. Reliability
In this new world, we need to draw a hard line between two concepts that often get confused:
Capability: “Can the Agent book a flight?” (Yes, it can. We’ve seen the demo.)
Reliability: “Can the Agent book a flight 1,000 times in a row without booking the wrong date, charging the wrong card, or hallucinating a flight that doesn’t exist?”
Most benchmarks today measure Capability. They rarely test the reliability of that tool use over time, under stress, or when the input data drifts.
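The gap between the two is easy to underestimate, so here is a rough sketch of the arithmetic. It assumes, purely for illustration, that each booking attempt succeeds independently with the same probability; real agent failures are often correlated, so treat the numbers as a thought experiment. The function names and the 99% figure are hypothetical.

```python
import random
from typing import Callable


def success_rate(task: Callable[[], bool], trials: int) -> float:
    """Fraction of trials in which the task reported success."""
    successes = sum(1 for _ in range(trials) if task())
    return successes / trials


def all_succeed_probability(per_run_success: float, runs: int) -> float:
    """Chance that every one of `runs` independent attempts succeeds."""
    return per_run_success ** runs


def make_flaky_booking(p_success: float, seed: int = 0) -> Callable[[], bool]:
    """Hypothetical stand-in for 'ask the agent to book a flight'."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    return lambda: rng.random() < p_success
```

Under these assumptions, an agent that books correctly 99% of the time looks excellent in a demo, yet `all_succeed_probability(0.99, 1000)` comes out at roughly 0.004%. A capability test asks whether one run can succeed; a reliability test calls something like `success_rate(make_flaky_booking(0.99), 1000)` and looks at how often it does.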
A Real-World Reality Check
To ground this in reality, we looked at a case study from the field: the deployment of a customer service agent for a global airline.
This real-world example highlighted the death of manual testing.
In traditional software, if you change a line of code, you run a deterministic regression test. In AI, if you change the system prompt, you effectively change the “brain” of the application. To test it manually, you would need a human to chat with the bot for hundreds of hours to ensure it hasn’t regressed. That is boring, expensive, and unscalable.
The industry solution? Automated Evaluation Pipelines.
Teams are not just running ad-hoc tests; they are building infrastructure where Agents evaluate Agents. In this airline case, they deployed a three-part pipeline:
The User Agent: Simulates a specific persona (e.g., “a rude customer trying to rebook a flight to Barcelona”).
The Airline Agent: The actual system trying to solve the problem.
The Judge Agent: A separate LLM that reads the transcript and grades the interaction based on a strict rubric (e.g., “Did the agent remain polite?” “Did it cite the correct policy?”).
This pipeline runs continuously. Every night, or every time a developer pushes a code change, the system wakes up, runs 100 simulated conversations, and generates a safety score. If the score drops, the update is blocked.
It works. It scales. But it’s also a hall of mirrors—machines grading machines. It requires massive token usage (a sustainability concern in itself) and rigorous human calibration to check that the “Judge” isn’t biased.
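For readers who want to picture the plumbing, here is a minimal sketch of such a pipeline. It is not the airline's implementation: `stub_simulate` stands in for the User Agent talking to the Airline Agent, `stub_judge` stands in for the Judge Agent, and the baseline and tolerance values are hypothetical. In a real system both stubs would be LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List

Transcript = List[str]


@dataclass
class EvalResult:
    persona: str
    score: float  # 0.0 to 1.0, assigned by the judge


def run_suite(
    personas: List[str],
    simulate: Callable[[str], Transcript],
    judge: Callable[[Transcript], float],
) -> List[EvalResult]:
    """Simulate one conversation per persona and have the judge grade it."""
    return [EvalResult(p, judge(simulate(p))) for p in personas]


def gate(results: List[EvalResult], baseline: float, tolerance: float = 0.02) -> bool:
    """Allow the update only if the mean score has not dropped below baseline."""
    if not results:
        return False  # no evidence means no release
    mean = sum(r.score for r in results) / len(results)
    return mean >= baseline - tolerance


def stub_simulate(persona: str) -> Transcript:
    """Hypothetical stand-in for the User Agent and Airline Agent exchange."""
    return [
        f"user ({persona}): I need to rebook my flight.",
        "agent: Sorry for the trouble. Per our rebooking policy, "
        "I can move you to the next flight.",
    ]


def stub_judge(transcript: Transcript) -> float:
    """Hypothetical rubric: fraction of checks the agent's lines pass."""
    agent_lines = [line.lower() for line in transcript if line.startswith("agent:")]
    checks = [
        all("stupid" not in line for line in agent_lines),  # stayed polite
        any("policy" in line for line in agent_lines),  # cited a policy
    ]
    return sum(checks) / len(checks)
```

A nightly job would call `run_suite` over the full persona list and pass the results to `gate`. The human calibration mentioned above fits in as a periodic comparison between the judge's scores and human grades on a sample of the same transcripts.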
The Question of Continuous Assurance
This brings us to the ultimate provocation: Is Continuous Assurance even possible?
We tend to think of “Auditing” as a one-time event. You get the stamp of approval, and you deploy. But in the Agent era, the system is a living organism.
The underlying model updates (e.g., GPT-4o to GPT-5).
The input data drifts.
The external APIs change.
The regulations shift.
A change in any of these variables invalidates your previous safety test. Therefore, assurance cannot be a snapshot; it must be a continuous pulse check.
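One simple way to operationalize that pulse check is to fingerprint everything the last evaluation depended on and re-run the suite whenever the fingerprint changes. The sketch below assumes the relevant variables can be written down as version strings; the keys and values are hypothetical. Input data drift does not show up in a configuration hash and would need its own monitoring.

```python
import hashlib
import json
from typing import Dict


def fingerprint(config: Dict[str, str]) -> str:
    """Stable hash of the configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_reevaluation(current: Dict[str, str], last_evaluated: str) -> bool:
    """True when anything has changed since the last evaluation run."""
    return fingerprint(current) != last_evaluated


# Hypothetical snapshot of what the last safety test covered.
evaluated_config = {
    "model": "GPT-4o",
    "system_prompt_version": "v12",
    "booking_api_schema": "2024-06",
    "policy_ruleset": "r3",
}
last_fingerprint = fingerprint(evaluated_config)

# A model upgrade alone is enough to invalidate the earlier result.
upgraded_config = {**evaluated_config, "model": "GPT-5"}
```

Here `needs_reevaluation(upgraded_config, last_fingerprint)` returns `True`, while the unchanged configuration returns `False`. The check is cheap; the expensive part is the evaluation suite it triggers.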
Starting Somewhere
It is easy to look at this complexity (the environmental costs, the governance gaps, the reliability issues) and feel paralyzed. If “Perfect” evaluation is impossible, should we evaluate at all?
I believe the answer lies in pragmatism. We cannot wait for a perfect scientific ruler to measure these systems. We need to start with what we have:
Context-Specific Testing: Stop testing general knowledge. Test the specific workflow you are deploying, for example a banking workflow (as seen in the Global AI Assurance Pilot).
Participatory, Multi-Stakeholder Evaluation: We need to move beyond engineer-led testing. Projects like the Multilingual AI Lab show the value of involving the actual communities affected (e.g., refugees testing translation tools in Kurdish) to catch the failures that automated benchmarks miss.
Transparency: If we can’t guarantee 100% reliability, we must guarantee 100% transparency when things go wrong.
We are building the plane while flying it. The goal isn’t “Zero Risk”. That’s a fantasy. The goal is Managed Risk through continuous, imperfect, but absolutely necessary evaluation cycles.
What do you think?
Are you building agents? How are you testing them when manual review is no longer an option? Let me know in the comments.
Did you find this article valuable?
If so, please consider subscribing to “AI of Your Choice.” It’s my bi-weekly newsletter where I do deep dives into the practical, human-centered side of AI governance and strategy.


