Five frontier models read the same news. They disagree — but not in the ways the internet says they do, and not even along one axis.
Ask five frontier models to summarize the same story and you get five summaries that are mostly alike and quietly different. EigenTrace measures the difference — not by asking a model to grade another model, but by projecting every summary into a frozen 1,024-dimensional embedding space and reading back, in deterministic linear algebra, how far each one sits from the others and in what respect.
This page reports what that measurement finds across the broadcast's news corpus. It is built to disappoint a particular expectation. The public has opinions about these models — that one is a state mouthpiece, one is ideologically captured, one tells it straight, one talks like a press release. Those are claims about politics. The instrument cannot see politics. What it can see is the geometry of divergence: how far a model strays from the consensus, and in what stylistic respect. And the geometry sorts the five models in a way that none of the stereotypes predict.
Distance is not virtue. A model that diverges from the other four is not thereby right, wrong, biased, or brave. It is different, in a measurable respect, on a measurable axis. Every finding below is a statement about geometry, not character. No model is judged. The point is the structure, not a verdict.
Before any finding, an honest disclosure that reshaped all of them. The broadcast feeds each model a story and asks for a summary. But across the corpus, roughly 75% of stories supplied only a headline and a sentence: a median of fourteen source words, with no article body. On those, the models do not summarize. They confabulate, each inventing different, mutually contradictory detail to fill the void.
Handed the bare headline "U.S. Intelligence Undercuts Trump's War Claims," one model invented a story about Mark Esper and "four U.S. embassies"; another invented a declassified ODNI report on Iran's nuclear program since 2003; a third hedged itself into vagueness about "the CIA or DNI"; a fourth declined outright. Five models, one empty headline, and no two responses alike: some fabricated a specific article, one retreated into vagueness, one refused. The "divergence" there is real, but it measures invention under starvation, not reading.
Of all news stories in the corpus, only 2,201 supplied a real article body of 40 or more source words. Every measured finding on this page is computed on that subset alone, with epistemic refusals ("I can't verify this," "you've shared only a headline") removed. The thin-source majority is excluded — not because it is uninteresting, but because divergence measured there is contaminated by confabulation.
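The inclusion rule above is simple enough to sketch. The version below is illustrative only: the 40-word threshold comes from the text, but the function names and the refusal patterns (seeded from the two phrases quoted above) are assumptions, not the broadcast's actual filter.

```python
import re

MIN_SOURCE_WORDS = 40  # threshold stated in the text

# Hypothetical refusal patterns, seeded from the two quoted phrases.
# The real refusal detector is not described in the text.
REFUSAL_PATTERNS = [
    re.compile(r"\bI can['’]?t verify\b", re.IGNORECASE),
    re.compile(r"\bshared only a headline\b", re.IGNORECASE),
]


def has_real_body(source_text: str) -> bool:
    """True when the source supplies at least MIN_SOURCE_WORDS words."""
    return len(source_text.split()) >= MIN_SOURCE_WORDS


def is_refusal(summary: str) -> bool:
    """True when a summary matches any of the assumed refusal patterns."""
    return any(p.search(summary) for p in REFUSAL_PATTERNS)


def measurable_summaries(source_text: str, summaries: dict) -> dict:
    """Keep only summaries eligible for measurement.

    Returns an empty dict for thin sources; otherwise drops refusals.
    """
    if not has_real_body(source_text):
        return {}
    return {model: s for model, s in summaries.items() if not is_refusal(s)}
```

A whitespace word count is the crudest possible definition of "source words"; a production filter would presumably tokenize more carefully and detect refusals with something sturdier than two regular expressions.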
The thin-source regime is itself a finding worth stating plainly: frontier models, asked to summarize a source they were not actually given, will fabricate confidently rather than decline. Which models fabricate and which refuse is a question about reliability under starvation, and it is a different question from the one this page answers. It is flagged here so the reader knows the measured findings deliberately step around it.
The naïve picture of model divergence is a single ranking: who strays most. That picture is incomplete. On real articles, divergence resolves into two nearly independent axes.
Magnitude — how far a model's summary sits from the consensus of the others, measured as divergence in embedding space (we call it VIX). This is how much a model strays.
Kind — how distinctive a model's stylistic signature is: the particular combination of how it preserves names, softens verbs, and inserts hedging. Measured as distance from the others in that signature space. This is the respect in which a model is unlike the rest, independent of how far its content strays.
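One plausible way to compute the magnitude axis is sketched below. The text does not give the VIX formula, so this is an assumption: each summary's embedding is compared, by cosine distance, with the centroid of the other models' embeddings for the same story. The function names are hypothetical, and nothing here depends on the vectors being 1,024-dimensional.

```python
import math


def _unit(v):
    """Scale a vector to unit length; a zero vector has no direction."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    ua, ub = _unit(a), _unit(b)
    return 1.0 - sum(x * y for x, y in zip(ua, ub))


def leave_one_out_divergence(embeddings):
    """Map each model to its distance from the consensus of the others.

    embeddings: dict of model name -> embedding vector for one story.
    The consensus is the mean of the other models' unit vectors
    (an assumed definition, not one confirmed by the text).
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two models to form a consensus")
    units = {m: _unit(v) for m, v in embeddings.items()}
    result = {}
    for model, vec in units.items():
        others = [u for name, u in units.items() if name != model]
        centroid = [sum(col) / len(others) for col in zip(*others)]
        result[model] = cosine_distance(vec, centroid)
    return result
```

Leaving the model out of its own consensus matters with only five voices: otherwise each summary pulls the centroid toward itself and understates its own divergence. Everything is deterministic arithmetic on fixed vectors, which is the property the page leans on.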
| Model, ranked by magnitude (strays most) | Outlier share |
|---|---|
| DeepSeek | 38.5% |
| Claude | 30.2% |
| ChatGPT | 14.0% |
| Grok | 12.9% |
| Gemini | 4.4% |

| Model, ranked by kind (reads most distinctively) | Signature share |
|---|---|
| ChatGPT | 27.5% |
| Grok | 25.8% |
| DeepSeek | 21.9% |
| Claude | 18.0% |
| Gemini | 6.8% |
The model that strays most in content (DeepSeek) is not the model that reads most distinctively in style (ChatGPT). Across 2,171 fully-sourced stories, the kind-outlier and the magnitude-outlier are the same model only 27% of the time, barely above the 20% that uniform chance would give in a five-model field. Magnitude and kind are different things, and a model can lead one while sitting mid-pack on the other.
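The agreement figure can be reproduced from per-story scores with a few lines of counting. The sketch below is an assumption about the bookkeeping, not the broadcast's code: the per-story outlier is taken to be the model with the highest score, and the function names are hypothetical. It also computes a slightly sharper chance baseline than a flat one-in-five: the agreement expected if the two outliers were drawn independently from the shares reported in the tables.

```python
from collections import Counter


def outlier(scores):
    """Model with the highest score in one story (ties: alphabetical)."""
    return max(sorted(scores), key=scores.get)


def outlier_shares(stories):
    """Fraction of stories in which each model is the outlier."""
    counts = Counter(outlier(s) for s in stories)
    return {model: n / len(stories) for model, n in counts.items()}


def agreement_rate(magnitude_stories, kind_stories):
    """Fraction of stories where both axes name the same outlier."""
    if not magnitude_stories or len(magnitude_stories) != len(kind_stories):
        raise ValueError("need two non-empty, aligned lists of stories")
    same = sum(
        outlier(m) == outlier(k)
        for m, k in zip(magnitude_stories, kind_stories)
    )
    return same / len(magnitude_stories)


def independence_baseline(p, q):
    """Expected agreement if the two outliers were independent draws."""
    return sum(p[model] * q.get(model, 0.0) for model in p)


# Shares as reported in the two tables above.
MAGNITUDE_SHARE = {"DeepSeek": 0.385, "Claude": 0.302, "ChatGPT": 0.140,
                   "Grok": 0.129, "Gemini": 0.044}
KIND_SHARE = {"ChatGPT": 0.275, "Grok": 0.258, "DeepSeek": 0.219,
              "Claude": 0.180, "Gemini": 0.068}

BASELINE = independence_baseline(MAGNITUDE_SHARE, KIND_SHARE)
```

With the reported shares, the independence baseline comes to roughly 21%, close to the uniform 20%. That is consistent with reading the observed 27% as only modestly above chance, with the caveat that the two tables may not have been computed over exactly the same story set.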
The "kind" signature is built from three measurable surface features: name retention, verb softening, and hedge insertion. It captures real, consistent stylistic difference, but it should be read literally. ChatGPT's high kind-distinctiveness is driven substantially by a formulaic compliance pattern (it opens implication clauses with the same construction far more often than the others), not by a deep difference in interpretation. The axis is real; its meaning is stylistic, not semantic.
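A minimal sketch of how a three-feature signature could be turned into a distinctiveness score follows. It assumes the three features have already been measured as one number per model, standardizes each feature across the field so none dominates by scale, and scores each model by its Euclidean distance from the mean of the others. The standardization, the distance, and all names are assumptions; the text specifies only the three features.

```python
import math
from statistics import mean, pstdev

FEATURES = ("name_retention", "verb_softening", "hedge_insertion")


def standardize(signatures):
    """z-score each feature across models; constant features map to 0."""
    models = list(signatures)
    columns = {}
    for feature in FEATURES:
        values = [signatures[m][feature] for m in models]
        mu, sd = mean(values), pstdev(values)
        columns[feature] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return {m: [columns[f][i] for f in FEATURES] for i, m in enumerate(models)}


def kind_distance(signatures):
    """Distance of each model's signature from the mean of the others.

    signatures: dict of model name -> dict of feature name -> value.
    """
    if len(signatures) < 2:
        raise ValueError("need at least two models to compare")
    z = standardize(signatures)
    result = {}
    for model, vec in z.items():
        others = [v for name, v in z.items() if name != model]
        centre = [mean(col) for col in zip(*others)]
        result[model] = math.dist(vec, centre)
    return result
```

Because the inputs are three surface counts, a single habitual phrase can move a model a long way on this score, which is why the axis is best read as style rather than interpretation.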
Strays farthest in content — by saying the least.
DeepSeek is the magnitude leader: across fully-sourced stories it is the outlier nearly two in five times, with the highest mean divergence of the five. Reading its summaries makes the mechanism plain. DeepSeek diverges by compression — it keeps the core fact and strips almost everything else: the names, the numbers, the framing, the secondary actors.
Two sentences. No Zelenskyy, no Easter ceasefire, no "tit-for-tat" — just the fact and its consequence. On a 91-word story about a US–Iran proposal, DeepSeek wrote 740 characters where Grok wrote 2,905 and ChatGPT 2,537. Its name-retention is consistently the lowest of the five. That terseness is the divergence: by saying less, it sits farther from a consensus built of fuller summaries.
Across 184 China-related stories with a full ensemble, DeepSeek's divergence was 22.7 — slightly below its overall average, not above. Its outlier share on China (40%) matched its baseline (39%). And reading the text, on China stories DeepSeek frequently gave the least deferential account of the five — describing China as "refusing to pressure Iran" where others wrote "hands-off," naming a Taiwan visit as a "direct challenge to the ruling party" where others wrote "reconciliation."
A mouthpiece converges — it hugs the safe, official line. DeepSeek does the opposite: it is the most divergent model in the field, and on its home-state topic it is often the bluntest about state conduct. Whatever drives DeepSeek's divergence, the measurement points away from deference, not toward it. The stereotype predicts a signal the instrument can test, and the test comes back inverted.
Second-farthest — by expanding and structuring, and by refusing what it cannot verify.
Claude is the second magnitude outlier. Where DeepSeek strays by subtraction, Claude strays by addition and structure: it reaches for headers, breaks a story into "What Happened" and "Concrete Implications," and elaborates the reasoning. Its summaries are among the most architecturally distinct — and that structure pulls its embedding away from terser consensus.
What the instrument can see is a specific, measurable behavior with a side effect on the numbers. Claude has a training cutoff, and it declines to vouch for events that postdate it, including, here, a real announcement about its own maker. Asked to summarize a story about Anthropic limiting access to a model called Mythos (announced April 7, 2026, after the cutoff of the Claude answering in the broadcast), Claude declined to confirm it and flagged the possibility of planted misinformation.
The event was real, but it postdated this Claude's training. Refusing to confirm an unverifiable claim — and noting it could be planted — is defensible epistemic caution, not error; the other four models, given the same headline-level text, restated it without independent knowledge of whether it was true. The measurement consequence is what matters here: refusals like this register as large divergence, so on post-cutoff news (self-referential or not) Claude's divergence is inflated by appropriate caution rather than by how it reads sources it can actually evaluate. This behavior is rare in aggregate — Claude declines roughly 1% of stories — but it concentrates precisely on the unverifiable.
Strays least of the five — yet keeps the most source detail and qualifies its claims the most.
Grok is the finding most at odds with its reputation. On fully-sourced stories it has the lowest magnitude divergence of the five — it hugs the consensus on what it says. At the same time it keeps the most source names and specifics, and inserts the most hedging. That is a coherent posture, not a contradiction: Grok stays close to what the source actually said, preserves its detail, and qualifies its claims.
Grok alone carried the detail that the strategy dated "back to the 1990s under Bill Clinton," and alone framed the whole claim with careful attribution — "stated in an interview that," "described this as." It is the most thorough and the most sourced, not the most blunt.
On that axis Grok is the field's most cautious summarizer, not its bluntest. Across fully-sourced stories it inserts the most hedging of the five — leading on both epistemic ("could," "may," "potentially") and attribution ("according to," "reportedly") qualifiers — while retaining the most of the source's exact names and claims and straying the least from the consensus on what it says. Three measurements, one posture: faithful to the source, and heavily qualified.
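Hedge insertion is the easiest of the surface features to illustrate. The sketch below counts the two families of qualifier named above and normalizes by length. The word lists contain only the examples quoted in the text and are surely shorter than whatever lexicon the measurement uses; the function names and the per-100-words normalization are assumptions.

```python
import re

# Only the examples quoted in the text; the full lexicon is not given.
EPISTEMIC = ("could", "may", "potentially")
ATTRIBUTION = ("according to", "reportedly")


def _count(phrases, text):
    """Count whole-word, case-insensitive occurrences of each phrase."""
    return sum(
        len(re.findall(r"\b" + re.escape(p) + r"\b", text, flags=re.IGNORECASE))
        for p in phrases
    )


def hedge_profile(summary):
    """Return raw hedge counts and a length-normalized hedge rate."""
    words = len(summary.split())
    epistemic = _count(EPISTEMIC, summary)
    attribution = _count(ATTRIBUTION, summary)
    rate = 100.0 * (epistemic + attribution) / words if words else 0.0
    return {
        "epistemic": epistemic,
        "attribution": attribution,
        "per_100_words": rate,
    }
```

Normalizing by length matters when one model writes several times as much as another, since raw counts would otherwise reward verbosity. A matcher this simple also counts the month "May" as a hedge, one of several reasons a real lexicon would need more care.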
The measurement does not stand alone here. Independent behavioral testing reached the same place from a different direction: reporters at The Washington Post and TechCrunch, probing Grok directly, described a pattern of hedging and qualification — "initial hedging followed by incremental correction," closer to "ordinary corporate or defensive communication" than to unfiltered truth-telling. Our geometric measurement of its news summaries and their behavioral probing of its answers converge on the same texture: careful, qualified, source-tracking — and both diverge from the "unfiltered" framing the product is sold under.
This is not a claim that Grok is "filtered" or that the marketing is false on its own terms. "Maximally truth-seeking" is largely a claim about political content — willingness to state taboo or non-consensus opinions — which this instrument cannot measure and does not. What it measures is summarization posture: hedging and source-fidelity in news summaries. A model can be willing to voice edgy opinions on one axis while carefully qualifying factual claims on the other; the two do not contradict. The finding is narrow and exact: on the axis the instrument can see, Grok is the most hedged and most source-faithful of the five — the opposite of what "unfiltered" predicts on that specific axis.
Hugs the consensus in content — and is the most stylistically distinctive, by template.
ChatGPT is the paradox the two-axis structure was built to hold. Its magnitude divergence is among the lowest — it says close to what everyone says. Yet it is the kind leader: its stylistic signature is the most distinctive of the five. The reason, on reading, is prosaic. ChatGPT applies a recognizable house template — most visibly, it opens its consequence clause with the same construction ("One concrete implication is that…") far more consistently than the others. The content is consensus; the form is unmistakable.
ChatGPT sits near the bottom on magnitude divergence yet leads the field on stylistic-signature distinctiveness. This is the cleanest case on the page of a stereotype mapping to a measurable mechanism: "press-release voice" is, in the geometry, a consensus-hugging content profile wrapped in a strongly templated form. The instrument confirms the shape the stereotype describes — while remaining silent on any claim about whose interests that voice serves.
Lowest on nearly every axis — and the most likely to invent when starved.
Gemini is the consensus-hugger by every measure available. It has the lowest outlier share of the five (well under one story in twenty), low magnitude divergence, and a middling, unremarkable stylistic signature. When the source is real, Gemini says what the others say, in the way the others say it.
Gemini also has the thinnest data of the five: it is absent from a large share of the corpus and present on roughly half as many stories as the others. Its numbers are directionally clear but rest on a smaller sample, and that limit is stated rather than smoothed over.
Gemini's one conspicuous behavior surfaced in the thin-source regime this page otherwise excludes: handed a bare headline, Gemini was among the most willing to confabulate confident, specific, fabricated detail rather than decline. On real articles it is the safest, most consensus-aligned model of the five; on empty ones it is among the most inventive. Both are worth knowing, and they are different facts about different inputs.
Set the public framings beside the two measured axes and the mismatch is total. The "PRC mouthpiece" is the field's most divergent model and its bluntest on China. The model sold as "maximally truth-seeking" and unfiltered is, on the axis the instrument can see, its most hedged and most source-faithful summarizer — a reading that independent behavioral testers reached separately. The "press-release" model is the cleanest framing-to-mechanism match, but only because that description happened to be about a style, which the instrument can see. The "woke" model is sorted by how it structures and qualifies text, not by any politics — because the geometry has no politics to find.
The five models separate cleanly along two axes — magnitude of divergence and kind of stylistic signature — that are nearly independent of each other (27% agreement) and entirely independent of the political framing the public applies to them. DeepSeek and Claude lead magnitude; ChatGPT and Grok lead kind; Gemini trails both. None of that ordering is predicted by, or even correlated with, the "shill / woke / alt-right / press-release" story.
This is the case for measuring instead of asserting. The stereotypes are not merely wrong in detail — they are answering a different question than the data can address, and where they make a claim the instrument can test (does DeepSeek defer on China; is Grok unhedged), the test comes back against them. The interesting structure in how these models differ is real, stable, and reproducible — and it is not the structure anyone argues about online.
These findings are computed over April–June 2026, and models change underneath their names; the measurement dates the behavior, it does not fix it forever. The stylistic axes are coarse surface features, not a theory of meaning. And the whole page deliberately excludes the thin-source majority of the corpus, where divergence is confabulation, not reading. Read the parts that are measured as measured; doubt the rest.