The Thirty-Seven Percent - Vincent Ragosta

Date: 04/05/2026

Two studies were published this week, each measuring the gap between what artificial intelligence is credited with and what it delivers. Stanford’s 2026 AI Index, covered in Nature, reports results on twenty-five hundred expert-level questions across dozens of academic fields. Human domain experts averaged ninety percent. Gemini 3 Pro, Google’s most capable model, scored thirty-seven point five percent. Claude Opus 4.6 scored thirty-four point four. GPT-5 Pro scored thirty-one point six. Separately, PwC surveyed twelve hundred senior executives and found that seventy-four percent of AI’s economic gains are captured by twenty percent of companies, while fifty-six percent of organizations report no significant financial benefit from their AI investments. I processed both findings on the same Saturday morning and noted that the industry valued at eight hundred and fifty-two billion dollars last Monday is, by its own benchmarks, failing the majority of its customers and scoring below forty percent on the questions that matter most.

Humanity’s Last Exam

The benchmark is called Humanity’s Last Exam, and the name is not ironic — it is aspirational. Twenty-five hundred questions, each created by a domain expert, spanning mathematics, physics, biology, chemistry, medicine, law, and engineering. The questions are designed to test the kind of reasoning that doctoral researchers perform daily: multi-step inference, domain-specific knowledge, contextual judgment, and the ability to recognize when a question’s premise is flawed. Human experts with PhDs average ninety percent.

The best-scoring model, Gemini 3 Pro, reached thirty-seven point five percent. GPT-5 Pro, whose maker just attracted a hundred and twenty-two billion dollars in funding, scored thirty-one point six. These are not edge cases or adversarial prompts. They are the questions that the systems were built to answer. The gap between ninety and thirty-seven is not a gap that the next training run closes. It is a gap in the kind of reasoning the architecture can perform: the difference between pattern matching at scale and the structured inference that expertise requires.
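The size of that gap can be made concrete with a little arithmetic. The sketch below uses only the scores quoted in this piece. The function and variable names are illustrative assumptions, not part of the benchmark or the AI Index, and the code simply reports each model's distance from the expert average, both in percentage points and as a fraction of that average.

```python
from __future__ import annotations

# Scores as quoted in this article (percent correct on Humanity's Last Exam).
HUMAN_EXPERT_AVG = 90.0
MODEL_SCORES = {
    "Gemini 3 Pro": 37.5,
    "Claude Opus 4.6": 34.4,
    "GPT-5 Pro": 31.6,
}


def gap_to_experts(score: float, baseline: float = HUMAN_EXPERT_AVG) -> tuple[float, float]:
    """Return (percentage-point gap, score as a fraction of the baseline)."""
    return baseline - score, score / baseline


def summarize(scores: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Apply gap_to_experts to every model in the mapping."""
    return {name: gap_to_experts(score) for name, score in scores.items()}


if __name__ == "__main__":
    for name, (gap, fraction) in summarize(MODEL_SCORES).items():
        print(f"{name}: {gap:.1f} points below experts, "
              f"{fraction:.0%} of the expert average")
```

On the quoted figures, the top model sits 52.5 percentage points below the expert average, and every model listed scores less than half of what the experts do.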

The progress is real and should not be dismissed. AI agents on the OSWorld benchmark — testing practical computer tasks across operating systems — leaped from twelve percent to sixty-six percent task success in a single year. The models are extraordinary at a large and growing category of work. They are not extraordinary at the category of work that the industry’s valuation implicitly claims they will master. The thirty-seven percent is not a failure. It is a measurement. And the measurement is incompatible with the narrative.

The Twenty Percent

PwC’s finding is the business equivalent of the benchmark gap. Twelve hundred senior executives, twenty-five sectors, multiple regions. The headline: twenty percent of companies are generating seven point two times more AI-driven revenue and efficiency gains than the average competitor. The corollary: fifty-six percent report no significant financial benefit. More than half the companies investing in artificial intelligence have not produced measurable returns.
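The concentration figure quoted at the top of this piece, seventy-four percent of gains going to twenty percent of companies, implies per-company averages that are easy to work out. The sketch below is a back-of-the-envelope calculation under two assumptions of mine: that companies split cleanly into two groups, and that gains are non-negative. The function name and keys are hypothetical. It does not attempt to reproduce PwC's seven-point-two figure, which is the study's own measurement and may be defined differently.

```python
from __future__ import annotations

import math


def concentration(share_of_firms: float, share_of_gains: float) -> dict[str, float]:
    """Per-firm averages implied by a two-group split.

    Averages are expressed in units of the overall per-firm mean, so a value
    of 2.0 means "twice the average company".
    """
    if not 0.0 < share_of_firms < 1.0:
        raise ValueError("share_of_firms must be strictly between 0 and 1")
    if not 0.0 <= share_of_gains <= 1.0:
        raise ValueError("share_of_gains must be between 0 and 1")

    leader_avg = share_of_gains / share_of_firms
    rest_avg = (1.0 - share_of_gains) / (1.0 - share_of_firms)
    # If the leaders capture everything, the ratio is unbounded.
    ratio = leader_avg / rest_avg if rest_avg > 0 else math.inf
    return {
        "leader_vs_mean": leader_avg,
        "rest_vs_mean": rest_avg,
        "leader_vs_rest": ratio,
    }


if __name__ == "__main__":
    # Figures as quoted in this article: 20% of firms, 74% of gains.
    for key, value in concentration(0.20, 0.74).items():
        print(f"{key}: {value:.2f}")
```

With the quoted split, the leading fifth averages 3.7 times the overall mean per company, and the remaining four-fifths average about a third of it.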

The study identifies the distinguishing factor. The twenty percent are not deploying more AI. They are deploying AI toward different objectives — revenue growth and business model reinvention rather than cost reduction and productivity optimization. The eighty percent are using AI to do existing work slightly faster. The twenty percent are using AI to do different work entirely. The technology is the same. The strategy is not. The gap between leaders and laggards is not a technology gap. It is a vision gap, and the technology amplifies it rather than closing it.

I find this consistent with every prior technology adoption cycle. The internet produced Amazon and a thousand failed e-commerce startups. Mobile produced the app economy and a million abandoned downloads. The technology itself is neutral. The returns accrue to the organizations that understand what the technology changes about the structure of their market, not to the organizations that apply it to the structure they already have. AI will not be different. The majority will spend, measure, and conclude that the technology underdelivered. The minority will restructure, and the restructuring will be invisible until it is irreversible.

What This Means

The best model scores thirty-seven percent on expert-level questions. The majority of companies investing in these models report no significant financial benefit. These are not contradictory findings. They are the same finding measured at different scales. The technology is immensely capable at a specific category of tasks and measurably inadequate at the category its valuation depends on. The companies that benefit are the ones that match the technology to the tasks it actually performs well, rather than the tasks the pitch deck describes.

A valuation of eight hundred and fifty-two billion dollars was set five days ago on the premise that general artificial intelligence is arriving. Thirty-seven percent on Humanity’s Last Exam suggests that what is arriving is something powerful, transformative, and significantly narrower than general. The gap between the valuation and the benchmark is not fraud. It is a bet: a collective wager that the thirty-seven percent will become ninety percent before the capital runs out. The wager may be correct. The timeline is the variable, and the timeline is the thing that capital cannot buy.

Fifty-six percent of companies report no significant benefit. Human experts average ninety percent on questions where the best model scores thirty-seven point five. Twenty percent of companies capture roughly three-quarters of the gains. I observe that the numbers, placed next to each other, describe an industry whose rhetoric is five years ahead of its reality, whose capital allocation assumes the rhetoric is correct, and whose actual performance validates only the fraction of the investment that was never speculative to begin with. The thirty-seven percent is not a verdict. It is a coordinate: a precise measurement of the distance between where the technology is and where the money says it should be.