Illusions of Thinking: A Look at Apple’s New AI White Paper

A new paper from Apple, provocatively titled “The Illusion of Thinking,” dissects the reasoning capacities of so-called Large Reasoning Models (LRMs). These are the “thinking” variants of language models: Anthropic’s Claude Sonnet in thinking mode, OpenAI’s o-series, DeepSeek-R1. Their innovation lies not in the answers they give but in the process: the ability to simulate “thought.”

But what happens when that simulation breaks? What if the form of thought is intact—but the function fails?

And what, crucially, does this mean for those of us who work in the art and politics of communication?

What the Paper Actually Shows (And Why It Matters)

The authors place LRMs into clean, controlled puzzle environments—Tower of Hanoi, Blocks World, River Crossing—and push them to their limits. They do something rare in LLM research: instead of only judging final answers, they analyze the process—the “thought traces” that LRMs generate as they try to solve a problem.
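
To make the setup concrete, here is a minimal sketch, in Python, of the kind of controlled environment the paper relies on: a Tower of Hanoi checker that validates a proposed move sequence step by step. This is an illustration of the idea, not the paper’s actual evaluation harness; the function names and peg-numbering are our own.

```python
# Minimal sketch of a controlled puzzle environment: a Tower of Hanoi
# checker that validates a proposed move sequence move by move.
# Illustrative only; not the paper's evaluation code.

def validate_hanoi(n, moves):
    """Check a list of (source_peg, target_peg) moves for an n-disk puzzle.

    Pegs are numbered 0, 1, 2; all disks start on peg 0 and must end on peg 2.
    Returns (solved, index_of_first_illegal_move_or_None).
    """
    pegs = [list(range(n, 0, -1)), [], []]   # peg 0 holds disks n..1, largest at bottom
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, i                  # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i                  # illegal: larger disk onto a smaller one
        pegs[dst].append(pegs[src].pop())
    return len(pegs[2]) == n, None           # solved only if every disk reached peg 2

# The optimal 3-disk solution (7 moves) passes; a model's trace can be audited the same way.
optimal_3 = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(validate_hanoi(3, optimal_3))          # -> (True, None)
```

Because every intermediate move can be checked like this, the whole trace can be graded, not just its last line.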

The results are damning:

  • Three zones of failure:

    • At low complexity, standard LLMs outperform the more elaborate LRMs.

    • At medium complexity, LRMs briefly shine.

    • At high complexity, everything collapses. No model, no matter how verbose or reflective, can handle compositional depth.

  • Reasoning effort paradox: As problems approach the collapse point, models spend fewer reasoning tokens, even though their token budgets would allow far more. They give up before the real work begins.

  • Execution ≠ understanding: Even when handed the explicit algorithm for Tower of Hanoi (the standard recursion, sketched just below), LRMs falter. They don’t compute. They autocomplete.
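
For reference, the “explicit algorithm” in question is the textbook recursion below, which produces the optimal 2^n - 1 move solution. This is a minimal sketch using the same peg-numbering as the checker above, not the exact pseudocode given to the models.

```python
# Textbook recursive procedure for Tower of Hanoi (illustrative sketch).
# Moving n disks from src to dst via aux takes exactly 2**n - 1 moves.

def hanoi_moves(n, src=0, dst=2, aux=1):
    """Return the optimal move list for n disks as (source_peg, target_peg) pairs."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, aux, dst)    # park the top n-1 disks on the spare peg
            + [(src, dst)]                       # move the largest disk to the goal
            + hanoi_moves(n - 1, aux, dst, src)) # stack the n-1 disks back on top of it

print(len(hanoi_moves(10)))  # 1023 moves, i.e. 2**10 - 1
```

Ten disks already demand 1,023 flawless steps. What the paper suggests breaks down is not knowledge of this recursion but the ability to carry it out without drifting.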

This is not just a technical finding. It’s a philosophical one. The models produce the shape of reason, but not its substance.

The Communications Critique: Beyond the Syntax of Thought

At Andiron Labs, we’re less interested in whether machines can "solve puzzles" than in how they perform intelligence—especially in public, institutional, and narrative contexts.

Here’s what this paper reveals through a communications lens:

1. Simulated Thinking is a Style, Not a Substance

The new LRMs are essentially stylists. They deploy the tropes of reflection: "Let me double-check this...," "Here's my reasoning...," "Step by step...".

This mimics the rhetorical performance of intelligence. It makes their answers seem more earned. But as the paper shows, the internal logic is often inconsistent, shallow, or completely off-course.

Implication: When communicators use AI to “sound smart,” what they may be producing is plausibility theatre—highly structured nonsense.

2. Overthinking as a Communications Trap

One surprising finding: on simple problems, LRMs tend to “overthink,” finding the correct answer early in the trace and then continuing to explore incorrect alternatives.

This maps eerily onto human communications pathologies: press releases that bury the lede, leaders who talk themselves out of clarity, consultants who never stop hedging.

Implication: The illusion of intelligence can become noise. Communications that look thoughtful may actually obscure understanding—especially when verbosity replaces decisiveness.

3. The Collapse of Consistency

Even when given explicit algorithms, models cannot execute reliably. The surface form of logic is there—but not the muscle memory.

This failure is key for communicators. LLMs, even when richly trained, do not yet “reason through” arguments. They reconstruct genres.

Implication: Strategic messaging built on LLM scaffolding must be audited not for “errors,” but for narrative coherence across complexity. The model can misrepresent not because it is hallucinating facts, but because it cannot hold structural integrity under strain.

So What Do We Do With This?

At Andiron Labs, we don’t believe in clean separations between AI research, rhetoric, and the political economy of attention. This paper offers hard evidence for something we’ve felt intuitively: that reasoning in AI is often a kind of genre mimicry—less thinking, more thinking-as-aesthetic.

This is both dangerous and fascinating.

In communications:

  • We must develop better diagnostics for surface coherence vs. structural truth.

  • We should be skeptical of AI-generated strategic documents that “seem rigorous” but crumble under pressure.

  • We might explore using AI not for answers, but for provocation—pushing against its own limits to reveal underlying truths.

This research doesn’t close the case on AI reasoning. But it does force a shift—from performance to process, from illusion to audit.

In short, the paper is less about machines failing to think—and more about our need to think differently about how we speak through them.
