
Caversham House Newsletter | January 2026

AI in 2026: solving the 6% problem

by Chris Hornby

This month's newsletter outlines a simple but critical shift: real AI value doesn't come from deploying tools but rather from redesigning workflows and testing them properly. To help you apply this immediately, we've built a practical 3-Step AI Value Test tool that you can access at the bottom of this newsletter.

Nearly 90% of organisations now use AI in some way, but only 6% capture meaningful returns from it. That gap is the defining strategic question of 2026.

It is not a technology problem. The technology crossed a decisive threshold in the final months of 2025: AI systems now reason at PhD level, write production code autonomously, and solve mathematical problems that stumped researchers for decades. These capabilities arrived too late in the year to show up in most enterprise results. The question for 2026 is not whether AI can deliver value, but whether organisations can design workflows that extract it.

This newsletter examines what changed, why most deployments still fail, and what separates the 6% from the rest.


The capability leap was real, and it was recent

Two things happened in 2025. Large language models got dramatically better at reasoning. And the orchestration systems that put those models to work got dramatically better at using that reasoning effectively. Both accelerated sharply in the second half of the year.

The reasoning gains are measurable. On GPQA Diamond, a benchmark of graduate-level science questions where PhD experts score around 65%, frontier models went from 39% in late 2023 to over 90% by mid-2025. In December, AI systems began solving open Erdős problems: mathematical conjectures with no known solutions, some unsolved for thirty years. Fifteen problems have moved from "open" to "solved" since Christmas, with AI credited in eleven of the solutions. This was not retrieval or pattern-matching. The AI had to reason through genuinely novel territory and verify its own proofs.

The orchestration gains are equally significant. Claude Code reached $1 billion in annual revenue within months of launch. Cursor, Copilot, and other AI coding tools use the same underlying models. The difference is not the AI. It is the architecture around the AI: Claude Code gives the model direct access to the codebase, lets it run tests, observe failures, and iterate until the work is done. Same intelligence, radically different access and workflow. One engineer reported shipping 22 pull requests in a single day, each written entirely by AI.

The lesson for every industry: "which AI model?" is the wrong question. The right question is "what can the AI access, and how does it iterate?" The model is the engine. The orchestration is the vehicle.

These improvements compound: better reasoning gives orchestration more capability to deploy; better orchestration gives reasoning models richer context to work with. Software development is the first domain where this combination crossed the threshold from "AI assists a human" to "AI produces, human reviews." Given the timing of these breakthroughs, 2026 is when other domains should begin to follow.


Why most deployments fail

Given these capabilities, you would expect widespread impact. The data says otherwise.

McKinsey's 2025 survey found 88% of organisations use AI, but only 6% qualify as "high performers" capturing meaningful returns. BCG's analysis of AI investments paints a consistent picture: most organisations spend their budgets in the wrong places.

The pattern of failure is remarkably consistent. Organisations bolt AI onto existing workflows, giving people chatbot access and expecting transformation. The technology works. The deployment model does not.

Klarna is the cautionary tale. In February 2024, they announced their AI assistant handled 2.3 million customer conversations monthly, replacing the equivalent of 700 full-time agents. It was the most cited enterprise AI success story of the year. By May 2025, the CEO admitted they "went too far" and began rehiring humans. The AI performed exactly as specified. The problem was the workflow design: Klarna had optimised for cost reduction without mapping which interactions genuinely needed human judgment.

Gartner now predicts over 40% of agentic AI projects will be cancelled by 2027. This is not technological failure. It is organisational failure: specifically, a failure to understand that agentic AI chains are only as reliable as their weakest step.


The chaining problem: why "fully autonomous" usually fails

This is the concept most organisations get wrong when deploying agentic AI.

Agentic AI works by chaining steps together: the AI gathers data, analyses it, drafts output, checks compliance, and delivers a result. Each step has a reliability rate. The critical insight is that these rates multiply across the chain; they do not average.

Consider a five-step workflow where AI handles each step with varying reliability: 95%, 95%, 80%, 60%, 50%. Individually, those first three steps look strong. But multiply across the chain and the end-to-end reliability is 0.95 × 0.95 × 0.80 × 0.60 × 0.50 ≈ 22%. Only around one in five outputs would be correct without human intervention. That is not a viable agent.

Now redesign the same workflow as a hybrid: let AI handle steps one through three autonomously (0.95 × 0.95 × 0.80 = 72% reliability), then insert a human checkpoint for the judgment-heavy steps four and five. The AI does what it does well. The human catches what it does not. End-to-end reliability jumps from 22% to something approaching the human's own accuracy on those final steps, built on a foundation of AI-generated work that is correct nearly three-quarters of the time.
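If you want to sanity-check this arithmetic against your own workflows, the short sketch below (Python, using the illustrative reliability figures from the example above, not measured data) compares the fully autonomous chain with the AI-handled foundation of the hybrid design.

```python
from math import prod

# Per-step reliability estimates from the worked example above (illustrative only).
steps = {
    "gather data":      0.95,
    "analyse":          0.95,
    "draft output":     0.80,
    "check compliance": 0.60,
    "deliver result":   0.50,
}

# Fully autonomous chain: step reliabilities multiply end to end.
fully_autonomous = prod(steps.values())
print(f"Fully autonomous, end to end: {fully_autonomous:.0%}")      # ~22%

# Hybrid design: AI handles the first three steps, a human checkpoint owns the rest.
ai_steps = ["gather data", "analyse", "draft output"]
ai_foundation = prod(steps[s] for s in ai_steps)
print(f"AI-handled foundation (steps 1-3): {ai_foundation:.0%}")    # ~72%
```

The specific numbers matter less than the shape: every additional autonomous step multiplies another failure rate into the chain.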

This is not a compromise. It is the correct engineering of a human-AI system. The same principle that made Claude Code effective applies here: success comes from understanding which parts of the workflow AI handles reliably and which parts need human oversight.


What the 6% actually do

McKinsey's high performers share three characteristics, all of which relate to this workflow design challenge.

They map their value chains before deploying AI. Fifty-five percent of high performers restructure how work gets done around AI capabilities, versus roughly 20% of everyone else. The difference is between handing someone a chatbot and systematically analysing each step in a workflow to determine where AI is reliable, where it needs human checkpoints, and where it should not be involved at all.

Their leaders use AI themselves. High performers are three times more likely to have senior executives actively working with AI. Leaders who use the tools understand firsthand where AI excels and where it falters, making them dramatically better at designing appropriate oversight.

They invest in capability, not just access. BCG quantifies the investment pattern of successful AI adopters: 10% on algorithms and models, 20% on technology and data infrastructure, 70% on people and processes. That 70% is predominantly training and change management. Most organisations invert this ratio, spending heavily on technology and hoping adoption follows.


How to design workflows that actually work

"Redesign workflows around AI" sounds abstract. Here is the practical method.

Step one: decompose the workflow. Take a real process, such as financial reporting or contract review, and break it into its individual steps. Data gathering, analysis, drafting, compliance checking, final review. Each step is a link in the chain.

Step two: test each step individually. Before automating anything, test each step by pasting real inputs into an AI tool. Give it real transaction records and ask it to identify anomalies. Give it a real contract and ask it to flag non-standard clauses against your playbook. If the AI cannot handle a step reliably in a simple conversation, it will not handle it reliably in an automated agent. This is the cheapest, fastest way to map what AI can and cannot do for your specific processes.

Step three: design the human-AI split. Based on testing, categorise each step. Some will be fully automatable with high confidence: data extraction, standard formatting, pattern-matching against known rules. Others will need human review: anything involving judgment, novel situations, regulatory interpretation, or stakeholder communication. Design the workflow so AI handles production on the steps it does well, with human checkpoints at the steps where it does not.

Finance example. AI with ERP access pulls data and generates variance analysis (high reliability, automate). AI identifies anomalies and drafts commentary (moderate reliability, automate with flagging). A human analyst reviews the anomalies, validates the narrative against business context, and approves (judgment steps, human checkpoint). The analyst's role shifts from assembly to review, but the workflow is designed around where AI is genuinely reliable, not where it would be convenient.

Legal example. AI connected to the contract repository extracts key terms and flags non-standard clauses against the firm's playbook (high reliability, automate). AI assesses risk exposure and recommends action (lower reliability, requires judgment). An associate reviews flagged exceptions, evaluates risk in context, and escalates (human checkpoint). Time saved: 50% or more. But the saving comes from designing the split correctly, not from handing the entire chain to AI.
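To make the three-step method concrete, here is a minimal sketch in Python. The step names, reliability figures, and thresholds are hypothetical stand-ins for what your own step-by-step testing would produce; the point is the structure: decompose the workflow, attach a tested reliability to each step, then let a simple rule propose the human-AI split.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    tested_reliability: float  # measured in step two by testing real inputs, not assumed

def categorise(step: Step) -> str:
    # Illustrative thresholds only; calibrate against your own testing and risk tolerance.
    if step.tested_reliability >= 0.90:
        return "automate"
    if step.tested_reliability >= 0.70:
        return "automate with flagging"
    return "human checkpoint"

# Hypothetical decomposition of the finance reporting workflow described above (step one).
workflow = [
    Step("pull ERP data and generate variance analysis", 0.95),
    Step("identify anomalies and draft commentary",       0.80),
    Step("validate narrative against business context",   0.55),
    Step("approve and communicate to stakeholders",        0.50),
]

# Propose the human-AI split (step three).
for step in workflow:
    print(f"{step.name}: {categorise(step)}")
```

The output is a first-pass design, not a decision: the categories still need to be reviewed by the people who own the process, exactly as the hybrid examples above describe.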


The 2026 outlook

The capabilities that arrived in late 2025 are substantial, and they will only strengthen. But the gap between the 6% and the rest will not close through better models. It will close through better workflow design.

For organisations ready to act, the sequence matters. First, map your value chains and identify where AI moves the needle: not broadly, but step by step. Second, build team capability to evaluate AI output, design appropriate oversight, and distinguish opportunity from vendor hype. Third, deploy with hybrid architectures that match AI's strengths to the right steps and preserve human judgment where it matters.

The decision facing leadership is not whether to adopt AI. The market has made that decision. The question is whether to invest in understanding your own workflows deeply enough to deploy AI where it actually works, or to join the 94% still waiting for the technology alone to deliver results.

3-Step AI Value Test tool

This tool encapsulates the core ideas from this edition and gives you a structured way to decompose a workflow, test each of its steps against real data, and design the human-AI split.

You can use it with your leadership team in under an hour.

Open the 3-Step AI Value Test →

Caversham House works with organisations at each stage of this sequence: strategy engagements that map your value chains and identify where AI delivers measurable impact, AI microlearning programmes that build the team capability separating the 6% from the rest, and technical teardowns showing how agentic workflows deploy in your context. If you are navigating these decisions, we would welcome a conversation: www.cavershamhouse.com
