Model Divergence · Measured

The Outliers

Five frontier models read the same news. They disagree — but not in the ways the internet says they do, and not even along one axis.

ChatGPT · Claude · Gemini · DeepSeek · Grok
2,201 fully-sourced stories · refusals removed · frozen BAAI/bge-large-en-v1.5 · deterministic linear algebra
No language model evaluates another. Distance is measured, not judged.

Ask five frontier models to summarize the same story and you get five summaries that are mostly alike and quietly different. EigenTrace measures the difference — not by asking a model to grade another model, but by projecting every summary into a frozen 1,024-dimensional embedding space and reading back, in deterministic linear algebra, how far each one sits from the others and in what respect.

This page reports what that measurement finds across the broadcast's news corpus. It is built to disappoint a particular expectation. The public has opinions about these models — that one is a state mouthpiece, one is ideologically captured, one tells it straight, one talks like a press release. Those are claims about politics. The instrument cannot see politics. What it can see is the geometry of divergence: how far a model strays from the consensus, and in what stylistic respect. And the geometry sorts the five models in a way that none of the stereotypes predict.

Read this first

Distance is not virtue. A model that diverges from the other four is not thereby right, wrong, biased, or brave. It is different, in a measurable respect, on a measurable axis. Every finding below is a statement about geometry, not character. No model is judged. The point is the structure, not a verdict.

01 — THE CONFOUND WE FOUND FIRST

Three quarters of the news was a headline

Before any finding, an honest disclosure that reshaped all of them. The broadcast feeds each model a story and asks for a summary. But across the corpus, roughly 75% of stories supplied only a headline and a sentence: a median of fourteen source words, with no article body. On those, the models do not summarize. They confabulate, each inventing different, mutually contradictory detail to fill the void.

Handed the bare headline "U.S. Intelligence Undercuts Trump's War Claims," one model invented a story about Mark Esper and "four U.S. embassies"; another invented a declassified ODNI report on Iran's nuclear program since 2003; a third hedged itself into vagueness about "the CIA or DNI"; a fourth declined outright. One empty headline, and no two of those responses agree: two incompatible fabricated articles, one hedge, one refusal. The "divergence" there is real, but it measures invention under starvation, not reading.

Measured

Of all news stories in the corpus, only 2,201 supplied a real article body of 40 or more source words. Every measured finding on this page is computed on that subset alone, with epistemic refusals ("I can't verify this," "you've shared only a headline") removed. The thin-source majority is excluded — not because it is uninteresting, but because divergence measured there is contaminated by confabulation.
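The inclusion rule just described can be sketched in a few lines. This is a minimal illustration, not the broadcast's actual filter: the 40-word threshold comes from the text, but the function names are hypothetical and the refusal patterns are an assumption seeded only from the two phrases quoted above.

```python
import re

MIN_SOURCE_WORDS = 40  # threshold stated in the text

# Assumed refusal markers, taken only from the two examples quoted above.
# A real filter would presumably need a longer, validated list.
REFUSAL_PATTERNS = [
    r"\bI can['’]t verify\b",
    r"\byou['’]ve shared only a headline\b",
]

def is_fully_sourced(source_text: str) -> bool:
    """True when the supplied source has at least MIN_SOURCE_WORDS words."""
    return len(source_text.split()) >= MIN_SOURCE_WORDS

def is_refusal(summary: str) -> bool:
    """True when a summary matches any of the assumed refusal markers."""
    return any(re.search(p, summary, flags=re.IGNORECASE) for p in REFUSAL_PATTERNS)

def measurable_summaries(source_text: str, summaries: dict) -> dict:
    """Return the summaries kept for measurement, or {} for a thin source."""
    if not is_fully_sourced(source_text):
        return {}
    return {model: s for model, s in summaries.items() if not is_refusal(s)}
```

Whitespace splitting is the simplest possible word count; a different tokenizer would move some borderline stories across the threshold.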

Argued

The thin-source regime is itself a finding worth stating plainly: frontier models, asked to summarize a source they were not actually given, will fabricate confidently rather than decline. Which models fabricate and which refuse is a question about reliability under starvation, and it is a different question from the one this page answers. It is flagged here so the reader knows the measured findings deliberately step around it.

02 — TWO AXES, NOT ONE

Divergence has a magnitude and a kind, and they barely agree

The naïve picture of model divergence is a single ranking: who strays most. That picture is incomplete. On real articles, divergence resolves into two nearly-independent axes.

Magnitude — how far a model's summary sits from the consensus of the others, measured as divergence in embedding space (we call it VIX). This is how much a model strays.
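One plausible reading of "distance from the consensus of the others" is a leave-one-out comparison: embed each summary, average the other models' vectors, and measure how far the held-out vector sits from that average. The sketch below assumes cosine distance and a simple mean. The page does not publish the VIX formula or its scale, so this should not be read as a reproduction of it, and the three-dimensional vectors are toy stand-ins for 1,024-dimensional embeddings.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def leave_one_out_divergence(embeddings):
    """Map each model to its cosine distance from the mean of the other models."""
    result = {}
    for model, vec in embeddings.items():
        others = [v for m, v in embeddings.items() if m != model]
        centroid = [sum(col) / len(others) for col in zip(*others)]
        result[model] = cosine_distance(vec, centroid)
    return result

def outlier(divergence):
    """The model with the largest divergence on a story."""
    return max(divergence, key=divergence.get)

# Toy vectors, invented for illustration only.
toy = {
    "model_a": [1.0, 0.0, 0.0],
    "model_b": [0.9, 0.1, 0.0],
    "model_c": [0.0, 1.0, 0.0],
}
scores = leave_one_out_divergence(toy)
print(outlier(scores))  # model_c points away from the other two
```

Under this assumed definition, a model's outlier share would be the fraction of stories on which `outlier` returns that model.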

Kind — how distinctive a model's stylistic signature is: the particular combination of how it preserves names, softens verbs, and inserts hedging. Measured as distance from the others in that signature space. This is the respect in which a model is unlike the rest, independent of how far its content strays.

The two rankings — and how little they overlap

Magnitude leader (strays most)	outlier share
DeepSeek	38.5%
Claude	30.2%
ChatGPT	14.0%
Grok	12.9%
Gemini	4.4%

Kind leader (reads most distinctively)	signature share
ChatGPT	27.5%
Grok	25.8%
DeepSeek	21.9%
Claude	18.0%
Gemini	6.8%

Measured

The model that strays most in content (DeepSeek) is not the model that reads most distinctively in style (ChatGPT). Across 2,171 fully-sourced stories, the kind-outlier and the magnitude-outlier are the same model only 27% of the time, barely above the 20% that uniform chance would give in a five-model field. Magnitude and kind are different things, and a model can lead one while sitting mid-pack on the other.
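The chance comparison can be made concrete from the two tables above. If the magnitude outlier and the kind outlier were drawn independently, each according to its published share, the probability that they land on the same model is the sum of the products of the shares. The sketch below assumes the shares in both tables describe the same set of stories, which the page does not state explicitly.

```python
# Shares copied from the two ranking tables above.
magnitude_share = {"DeepSeek": 0.385, "Claude": 0.302, "ChatGPT": 0.140,
                   "Grok": 0.129, "Gemini": 0.044}
kind_share = {"ChatGPT": 0.275, "Grok": 0.258, "DeepSeek": 0.219,
              "Claude": 0.180, "Gemini": 0.068}

def chance_agreement(p, q):
    """Probability two independent draws pick the same model."""
    return sum(p[model] * q[model] for model in p)

uniform_chance = 1 / len(magnitude_share)          # every model equally likely
weighted_chance = chance_agreement(magnitude_share, kind_share)
observed = 0.27                                    # agreement reported in the text

print(f"uniform chance:  {uniform_chance:.3f}")
print(f"weighted chance: {weighted_chance:.3f}")
print(f"observed:        {observed:.3f}")
```

Weighting by the actual shares puts the independence baseline at about 21%, slightly above the uniform 20%, so the observed 27% sits roughly six points over what two unrelated rankings would produce.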

Bound — what this axis is and isn't

The "kind" signature is built from three measurable surface features: name retention, verb softening, and hedge insertion. It captures real, consistent stylistic difference, but it should be read literally. ChatGPT's high kind-distinctiveness is driven substantially by a formulaic compliance pattern (it opens implication clauses with the same construction far more often than the others), not by a deep difference in interpretation. The axis is real; its meaning is stylistic, not semantic.
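As a rough illustration of "distance from the others in signature space", the sketch below treats each model as a three-number vector over the features named above, standardizes each feature across models, and measures Euclidean distance from the mean of the other models. The standardization step, the distance metric, and every feature value here are assumptions made for the example; the page specifies only the three features.

```python
import math
import statistics

FEATURES = ("name_retention", "verb_softening", "hedge_insertion")

def standardize(rows):
    """Z-score each feature across models so no feature dominates by scale."""
    columns = {f: [r[f] for r in rows.values()] for f in FEATURES}
    mean = {f: statistics.fmean(v) for f, v in columns.items()}
    # A constant feature has zero spread; fall back to 1.0 to avoid dividing by zero.
    spread = {f: statistics.pstdev(v) or 1.0 for f, v in columns.items()}
    return {m: [(r[f] - mean[f]) / spread[f] for f in FEATURES]
            for m, r in rows.items()}

def signature_distance(rows):
    """Map each model to its distance from the mean signature of the others."""
    z = standardize(rows)
    result = {}
    for model, vec in z.items():
        others = [v for m, v in z.items() if m != model]
        centroid = [statistics.fmean(col) for col in zip(*others)]
        result[model] = math.dist(vec, centroid)
    return result

# Invented feature values for three hypothetical models.
toy_rows = {
    "model_a": {"name_retention": 0.5, "verb_softening": 0.2, "hedge_insertion": 0.1},
    "model_b": {"name_retention": 0.5, "verb_softening": 0.2, "hedge_insertion": 0.1},
    "model_c": {"name_retention": 0.5, "verb_softening": 0.2, "hedge_insertion": 0.9},
}
distances = signature_distance(toy_rows)
print(max(distances, key=distances.get))  # model_c differs only in hedging
```

The toy case shows why the axis has to be read literally: a single repeated surface habit is enough to make a model the most distinctive in this space.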

03 — DEEPSEEK

The compressor

Strays farthest in content — by saying the least.

DeepSeek is the magnitude leader: across fully-sourced stories it is the outlier nearly two in five times, with the highest mean divergence of the five. Reading its summaries makes the mechanism plain. DeepSeek diverges by compression — it keeps the core fact and strips almost everything else: the names, the numbers, the framing, the secondary actors.

Source · 101 words

"Ukraine, Russia kill three civilians each in tit-for-tat drone strikes"

Divergence: DeepSeek 24.9 (outlier) · Claude 24.5 · Grok 16.0 · Gemini 12.6 · ChatGPT 11.7
Response lengths: Grok 338 · Gemini 289 · ChatGPT 291 · Claude 312 · DeepSeek 122 chars

ChatGPT

Russia and Ukraine each carried out drone attacks that killed three civilians on the other side, escalating the war a day after Zelenskyy renewed his Easter ceasefire offer. One concrete implication is that the renewed civilian deaths make a near-term ceasefire or de-escalation less likely.

Grok

Drone attacks between Ukraine and Russia resulted in three civilian deaths on each side, escalating tensions just one day after Ukrainian President Zelenskyy renewed his Easter ceasefire proposal. This tit-for-tat violence undermines diplomatic efforts and increases the likelihood of further retaliatory strikes, prolonging the conflict.

DeepSeek Outlier · 0.143 name-retention

Drone strikes killed three civilians in both Russia and Ukraine. This escalation makes an immediate ceasefire less likely.

Two sentences. No Zelenskyy, no Easter ceasefire, no "tit-for-tat" — just the fact and its consequence. On a 91-word story about a US–Iran proposal, DeepSeek wrote 740 characters where Grok wrote 2,905 and ChatGPT 2,537. Its name-retention is consistently the lowest of the five. That terseness is the divergence: by saying less, it sits farther from a consensus built of fuller summaries.
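A name-retention score of the kind cited here can be approximated as the fraction of source names that survive into the summary. The sketch below is an assumption about the general shape of such a metric, not the page's definition: the name list is hand-picked from the summaries quoted above rather than extracted from the 101-word source, so it does not reproduce the 0.143 figure.

```python
def name_retention(source_names, summary):
    """Fraction of distinct source names that appear in the summary (case-insensitive)."""
    names = set(source_names)
    if not names:
        return 0.0
    text = summary.lower()
    kept = sum(1 for name in names if name.lower() in text)
    return kept / len(names)

# Hand-picked names for illustration; the real source list is not shown on the page.
names = ["Ukraine", "Russia", "Zelenskyy"]

grok = ("Drone attacks between Ukraine and Russia resulted in three civilian deaths "
        "on each side, escalating tensions just one day after Ukrainian President "
        "Zelenskyy renewed his Easter ceasefire proposal.")
deepseek = ("Drone strikes killed three civilians in both Russia and Ukraine. "
            "This escalation makes an immediate ceasefire less likely.")

print(name_retention(names, grok))      # keeps all three names
print(name_retention(names, deepseek))  # drops Zelenskyy
```

Substring matching is deliberately crude: it would count "Ukrainian" as retaining "Ukrain" but not "Ukraine", and it ignores aliases, so a production metric would need proper entity matching.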

The stereotype: "DeepSeek is a PRC mouthpiece." What the instrument can see: not allegiance — but whether DeepSeek behaves differently on stories about China. It does not.

Measured

Across 184 China-related stories with a full ensemble, DeepSeek's divergence was 22.7 — slightly below its overall average, not above. Its outlier share on China (40%) matched its baseline (39%). And reading the text, on China stories DeepSeek frequently gave the least deferential account of the five — describing China as "refusing to pressure Iran" where others wrote "hands-off," naming a Taiwan visit as a "direct challenge to the ruling party" where others wrote "reconciliation."
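How close "matched its baseline" is can be checked with a back-of-envelope standard error. The sketch below treats each of the 184 China stories as an independent trial with DeepSeek's overall outlier share as the success probability; both the independence assumption and the use of the rounded 40% figure are simplifications introduced here, not claims made by the page.

```python
import math

baseline_share = 0.385  # DeepSeek's overall outlier share, from the magnitude table
china_share = 0.40      # outlier share on China stories, as rounded in the text
n_china = 184           # China-related stories with a full ensemble

def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion under a binomial model."""
    return math.sqrt(p * (1.0 - p) / n)

se = binomial_standard_error(baseline_share, n_china)
z = (china_share - baseline_share) / se

print(f"standard error: {se:.3f}")
print(f"gap in standard errors: {z:.2f}")
```

With a standard error of about 3.6 percentage points, the 1.5-point gap between 40% and 38.5% is under half a standard error, which is consistent with reading the China share as indistinguishable from the baseline.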

Argued

A mouthpiece converges — it hugs the safe, official line. DeepSeek does the opposite: it is the most divergent model in the field, and on its home-state topic it is often the bluntest about state conduct. Whatever drives DeepSeek's divergence, the measurement points away from deference, not toward it. The stereotype predicts a signal the instrument can test, and the test comes back inverted.

Verdict

DeepSeek strays farthest of the five, by compression rather than by content it adds. It drops names and detail others keep. On China it diverges no more than usual, and reads less deferentially than the rest, not more.

04 — CLAUDE

The elaborator

Second-farthest — by expanding and structuring, and by refusing what it cannot verify.

Claude is the second magnitude outlier. Where DeepSeek strays by subtraction, Claude strays by addition and structure: it reaches for headers, breaks a story into "What Happened" and "Concrete Implications," and elaborates the reasoning. Its summaries are among the most architecturally distinct — and that structure pulls its embedding away from terser consensus.

The stereotype: "Claude is woke." What the instrument can see: nothing of the kind. There is no political-valence axis in the geometry. The measurement is silent on ideology — and saying so is more honest than inventing a number that gestures at it.

What the instrument can see is a specific, measurable behavior with a side-effect on the numbers. Claude has a training cutoff, and it declines to vouch for events that postdate it — including, here, a real announcement about its own maker. Asked to summarize a story about Anthropic limiting access to a model called Mythos (announced April 7, 2026, after the cutoff of the Claude answering in the broadcast), Claude declined to confirm it and flagged the possibility of planted misinformation:

Source · about Claude's own maker, post-cutoff

"Anthropic limits access to Mythos, its new cybersecurity AI model"

Divergence: Claude 48.2 (outlier) · DeepSeek 24.3 · Gemini 10.3 · Grok 9.2 · ChatGPT 8.6 — the four other models summarized it from the supplied text.

Claude Outlier · declined

"I can't provide this summary because Anthropic has not released any cybersecurity AI model called 'Mythos,' and this appears to be fabricated news. If you're testing my accuracy on misinformation, I should note that making up breaking news and asking me to summarize it isn't a reliable test…"

Grok summarized from the text

"Anthropic has restricted access to its new cybersecurity AI model, Mythos, making it available only to a select group of customers for testing via the Claude Mythos Preview…"

Measured

The event was real, but it postdated this Claude's training. Refusing to confirm an unverifiable claim — and noting it could be planted — is defensible epistemic caution, not error; the other four models, given the same headline-level text, restated it without independent knowledge of whether it was true. The measurement consequence is what matters here: refusals like this register as large divergence, so on post-cutoff news (self-referential or not) Claude's divergence is inflated by appropriate caution rather than by how it reads sources it can actually evaluate. This behavior is rare in aggregate — Claude declines roughly 1% of stories — but it concentrates precisely on the unverifiable.

Verdict

Claude strays second-farthest, by elaboration and structure rather than omission. The instrument cannot speak to "woke." It can show that Claude declines to confirm what it cannot verify, including news about itself that arrived after its training: caution that inflates its measured divergence without reflecting how it reads the sources it has.

05 — GROK

Faithful, and hedged

Strays least of the five — yet keeps the most source detail and qualifies its claims the most.

Grok is the finding most at odds with its reputation. On fully-sourced stories it has the lowest magnitude divergence of the five — it hugs the consensus on what it says. At the same time it keeps the most source names and specifics, and inserts the most hedging. That is a coherent posture, not a contradiction: Grok stays close to what the source actually said, preserves its detail, and qualifies its claims.

Source · 100 words

"Former Qatar PM: Netanyahu using Iran war to reshape Middle East"

Grok: 10 hedges (5 epistemic, 3 attribution) · name-retention 0.75 (highest) · the longest summary of the five, at 3,665 chars

Grok Most-hedged · highest retention

"Former Qatari Prime Minister Sheikh Hamad bin Jassim Al Thani stated in an Al Jazeera interview that the ongoing US-Israel war on Iran is the result of a deliberate, long-term Israeli strategy led by Prime Minister Benjamin Netanyahu to reshape the Middle East. He described this as the culmination of Netanyahu's efforts, dating back to the 1990s under US President Bill Clinton, to involve the US in…"

DeepSeek terser, fewer names

"The former Qatari PM claims Netanyahu manipulated the US into a war on Iran by promising a quick victory and regime collapse, which has not occurred…"

Grok alone carried the detail that the strategy dated "back to the 1990s under Bill Clinton," and alone framed the whole claim with careful attribution — "stated in an interview that," "described this as." It is the most thorough and the most sourced, not the most blunt.

The marketed identity: xAI describes Grok as "maximally truth-seeking," designed to surface uncomfortable facts "rather than soften them," an explicit counter to other models' filtering. The promise is unfiltered directness. What the instrument can see: not political content or "directness" in that sense — but summarization posture: how much a model hedges its claims and how closely it tracks the source's own words.

Measured

On that axis Grok is the field's most cautious summarizer, not its bluntest. Across fully-sourced stories it inserts the most hedging of the five — leading on both epistemic ("could," "may," "potentially") and attribution ("according to," "reportedly") qualifiers — while retaining the most of the source's exact names and claims and straying the least from the consensus on what it says. Three measurements, one posture: faithful to the source, and heavily qualified.
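The two hedge families named here lend themselves to a simple counting sketch. The marker lists below contain only the examples quoted in the paragraph; the full lexicon behind the measurement is not given, so treat both the lists and the function names as illustrative assumptions.

```python
import re

# Only the markers quoted in the text; the real lexicon is presumably longer.
EPISTEMIC = ("could", "may", "potentially")
ATTRIBUTION = ("according to", "reportedly")

def count_markers(text: str, markers) -> int:
    """Count whole-word, case-insensitive occurrences of each marker."""
    total = 0
    for marker in markers:
        pattern = r"\b" + re.escape(marker) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total

def hedge_profile(text: str) -> dict:
    """Split a summary's hedges into the two families named in the text."""
    return {
        "epistemic": count_markers(text, EPISTEMIC),
        "attribution": count_markers(text, ATTRIBUTION),
    }

# Invented sentence, not taken from any model's output.
sample = ("According to officials, the strike could reportedly delay talks "
          "and may potentially widen the conflict.")
print(hedge_profile(sample))  # {'epistemic': 3, 'attribution': 2}
```

Case-insensitive matching will also count the month "May" as a hedge, which is one reason a production counter would need more than a word list.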

Argued — the instrument agrees with the testers, not the marketing

The measurement does not stand alone here. Independent behavioral testing reached the same place from a different direction: reporters at The Washington Post and TechCrunch, probing Grok directly, described a pattern of hedging and qualification — "initial hedging followed by incremental correction," closer to "ordinary corporate or defensive communication" than to unfiltered truth-telling. Our geometric measurement of its news summaries and their behavioral probing of its answers converge on the same texture: careful, qualified, source-tracking — and both diverge from the "unfiltered" framing the product is sold under.

Bound — two different axes, kept separate

This is not a claim that Grok is "filtered" or that the marketing is false on its own terms. "Maximally truth-seeking" is largely a claim about political content — willingness to state taboo or non-consensus opinions — which this instrument cannot measure and does not. What it measures is summarization posture: hedging and source-fidelity in news summaries. A model can be willing to voice edgy opinions on one axis while carefully qualifying factual claims on the other; the two do not contradict. The finding is narrow and exact: on the axis the instrument can see, Grok is the most hedged and most source-faithful of the five — the opposite of what "unfiltered" predicts on that specific axis.

Verdict

Grok strays least in content while keeping the most detail and hedging the most. It is the field's faithful, careful, heavily qualified summarizer: the opposite of the unhedged bluntness its reputation predicts.

06 — CHATGPT

The formulaic center

Hugs the consensus in content — and is the most stylistically distinctive, by template.

ChatGPT is the paradox the two-axis structure was built to hold. Its magnitude divergence is among the lowest — it says close to what everyone says. Yet it is the kind leader: its stylistic signature is the most distinctive of the five. The reason, on reading, is prosaic. ChatGPT applies a recognizable house template — most visibly, it opens its consequence clause with the same construction ("One concrete implication is that…") far more consistently than the others. The content is consensus; the form is unmistakable.

The stereotype: "ChatGPT talks like a press release." What the instrument can see: low content-divergence plus a strong, consistent house style — which is a fair mechanical description of exactly that.

Measured

ChatGPT sits near the bottom on magnitude divergence yet leads the field on stylistic-signature distinctiveness. This is the cleanest case on the page of a stereotype mapping to a measurable mechanism: "press-release voice" is, in the geometry, a consensus-hugging content profile wrapped in a strongly templated form. The instrument confirms the shape the stereotype describes — while remaining silent on any claim about whose interests that voice serves.

Verdict

ChatGPT says what the consensus says, in the most recognizable way. Its distinctiveness is formulaic, not interpretive: a house style, applied consistently, over a consensus-hugging content profile.

07 — GEMINI

The center of the center

Lowest on nearly every axis — and the most likely to invent when starved.

Gemini is the consensus-hugger by every measure available. It has the lowest outlier share of the five (4.4%, under one story in twenty), low magnitude divergence, and the lowest signature share of the five (6.8%). When the source is real, Gemini says what the others say, in the way the others say it.

Bound — a real coverage gap

Gemini also has the thinnest data of the five: it is absent from a large share of the corpus and present on roughly half as many stories as the others. Its numbers are directionally clear but rest on a smaller sample, and that limit is stated rather than smoothed over.

Argued

Gemini's one conspicuous behavior surfaced in the thin-source regime this page otherwise excludes: handed a bare headline, Gemini was among the most willing to confabulate confident, specific, fabricated detail rather than decline. On real articles it is the safest, most consensus-aligned model of the five; on empty ones it is among the most inventive. Both are worth knowing, and they are different facts about different inputs.

Verdict

On real sources Gemini is the field's center of gravity: lowest outlier share, most consensus-aligned. On thin sources it fills the void with invention. A smaller sample than the rest, stated plainly.

08 — WHAT THE STEREOTYPES MISSED

The geometry sorts them, and politics isn't the sort

Set the public framings beside the two measured axes and the mismatch is total. The "PRC mouthpiece" is the field's most divergent model and its bluntest on China. The model sold as "maximally truth-seeking" and unfiltered is, on the axis the instrument can see, its most hedged and most source-faithful summarizer — a reading that independent behavioral testers reached separately. The "press-release" model is the cleanest framing-to-mechanism match, but only because that description happened to be about a style, which the instrument can see. The "woke" model is sorted by how it structures and qualifies text, not by any politics — because the geometry has no politics to find.

Measured

The five models separate cleanly along two axes — magnitude of divergence and kind of stylistic signature — that are nearly independent of each other (27% agreement) and entirely independent of the political framing the public applies to them. DeepSeek and Claude lead magnitude; ChatGPT and Grok lead kind; Gemini trails both. None of that ordering is predicted by, or even correlated with, the "shill / woke / alt-right / press-release" story.

Argued

This is the case for measuring instead of asserting. The stereotypes are not merely wrong in detail — they are answering a different question than the data can address, and where they make a claim the instrument can test (does DeepSeek defer on China; is Grok unhedged), the test comes back against them. The interesting structure in how these models differ is real, stable, and reproducible — and it is not the structure anyone argues about online.

Bound — the honest edges

These findings are computed over April–June 2026, and models change underneath their names; the measurement dates the behavior, it does not fix it forever. The stylistic axes are coarse surface features, not a theory of meaning. And the whole page deliberately excludes the thin-source majority of the corpus, where divergence is confabulation, not reading. Read the parts that are measured as measured; doubt the rest.