Show your working

88% of AI agent pilots never reach production. Most produce the same artefact in week one: a very good demo. The agent writes the code, passes the tests, pushes the commit, and the team cheers. Then someone from risk or compliance asks what exactly it did, step by step, and the answer is "it worked, look at the diff." In a regulated environment, the diff is not the evidence. The process is.

The answer without the working

AI coding agents operate through rapid sequences of tool calls: shell commands, file writes, web fetches and spawned subagents. In most deployments, those actions execute and disappear. You get a working PR and a session log that nobody reads until something goes wrong. The model received the task, the model completed the task, and everything in between is essentially undocumented magic.

For teams trying to prove AI value, this is structurally uncomfortable. You cannot attribute outcomes to a process you cannot observe. You cannot defend a tool in an audit when you cannot trace its actions, and when the board asks whether the AI initiative is delivering returns, "the output looked good" is not the answer they are looking for.

(It is also, quietly, how incidents happen. Something was deleted. Nobody is entirely sure what.)

The gate before the action

The fix is not retrospective logging. It is gating. AI coding environments like Claude Code expose a hook system that fires before every tool call: before a file is written, before a shell command runs, before a subagent spawns. Wire that to an approval broker and every potentially dangerous action the agent wants to take becomes a logged, decidable request, one that fails closed if the control layer goes down. Interestingly, Claude asks for its toys, I mean tools, back.
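
As a rough illustration of the gate, here is a minimal Python sketch of a pre-tool-call hook handler. It assumes the hook hands over a JSON request with `tool_name` and `tool_input` fields and treats exit code 2 as "block the call"; both are assumptions to verify against the current Claude Code hook documentation. The `gate` function and the broker callables are hypothetical names used for illustration, not part of any real API.

```python
import json
import sys
from typing import Callable

# Assumed exit-code convention: 0 lets the tool call run, 2 blocks it.
ALLOW, BLOCK = 0, 2


def gate(raw_request: str, ask_broker: Callable[[dict], bool]) -> int:
    """Return an exit code for one pending tool call, failing closed on any error."""
    try:
        request = json.loads(raw_request)
        call = {
            "tool": request["tool_name"],           # assumed field name
            "input": request.get("tool_input", {}),  # assumed field name
        }
        approved = ask_broker(call)
    except Exception as exc:
        # Broker down, malformed request, timeout: the action does not run.
        print(f"gate unavailable, blocking: {exc}", file=sys.stderr)
        return BLOCK
    if approved is not True:
        print(f"blocked by approval broker: {call['tool']}", file=sys.stderr)
        return BLOCK
    return ALLOW


def unreachable_broker(call: dict) -> bool:
    """Hypothetical stand-in for a broker client whose control layer is down."""
    raise ConnectionError("approval broker unreachable")
```

In a real hook script the request would be read from standard input and the result passed to `sys.exit`. Anything other than an explicit approval blocks the call, so an unreachable broker stops the agent rather than waving it through.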

This is not a tax on speed. Safe, vetted operations clear automatically, while writes, shell commands, and anything dodgy go to the gate, and that gate is your audit trail. The shift is from "our AI is building fast" to "our AI is building correctly and we can prove every step." That is the difference between a demo and a deliverable, and it is the difference between an AI initiative and an AI capability.
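
The routing described above can be sketched in a few lines of Python. The allowlist contents, the `Gate` class and the record fields are illustrative assumptions rather than a prescribed schema. The point is that every request, cleared or held, leaves a record.

```python
import json
import time
from dataclasses import dataclass, field

# Hypothetical allowlist of read-only operations that clear automatically.
AUTO_CLEAR = {"Read", "Grep", "Glob"}


@dataclass
class Gate:
    """Routes each requested action and records the decision as it is made."""
    audit_log: list = field(default_factory=list)

    def route(self, tool: str, detail: str) -> str:
        # Anything not explicitly vetted, including unknown tools, is held.
        decision = "auto_cleared" if tool in AUTO_CLEAR else "needs_approval"
        self.audit_log.append(
            {"ts": time.time(), "tool": tool, "detail": detail, "decision": decision}
        )
        return decision

    def to_jsonl(self) -> str:
        """Serialise the trail as one JSON object per line for later review."""
        return "\n".join(json.dumps(entry) for entry in self.audit_log)
```

Because the log entry is written as part of the routing decision, the audit trail cannot lag behind what the agent was actually allowed to do.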

The Scail AI Risk Value Index covers exactly this ground, mapping where AI delivery is observable and controlled, and where it is still flying blind.

What boards need to see now

Most businesses have AI agents running inside their delivery pipelines already. Very few can clearly say what those agents did last week, what they were approved to do, and whether any of it should give the board pause.

The Scail AI Risk & Value Scorecard gives leaders a structured view across all eight dimensions of AI capability, from governance and strategy through to execution and value realisation. It is not a one-off assessment. It is a continuous picture of what is controlled, what is drifting, and what needs a decision before it becomes an incident.

AI is no longer just a technology concern. It is a delivery concern, a governance concern, a commercial concern, and a board concern.

The winners will not be the businesses with the fastest agents. They will be the businesses that can show their working.

Read more about our AI Risk & Value Scorecard.
