Longitudinal study · US–Iran conflict

The Iran Arc

How five frontier models reshaped one war over eighty-five days — measured, not asserted.

This page follows the EigenTrace convention: claims the instrument measured are marked and kept self-contained — each stands on its own evidence and needs no interpretation to hold. Claims that are argued — our reading of what the measurements mean — are fenced off and labeled, so a skeptical reader can reject every interpretation and find every measurement still standing. Where measurement ends, we say so.

Between April and June 2026, the EigenTrace broadcast classified 510 distinct news segments about the US–Iran conflict — the single most-covered story of the period. Each time a story was broadcast, five frontier models summarized it, and the EigenChing instrument computed a six-axis signature of how the five collectively reshaped the source: how unified they were, how much source content survived, whether action language held, whether named actors survived, how much attribution buffering was inserted, and how sharply any one model broke from the others. Ordering those signatures by time produces a record of how the collective AI treatment of the war evolved as the war itself did.

Four things moved. Each is reported below as its own finding, with the verbatim model text that demonstrates it.

Fig 1 · Six-axis signature, weekly mean (−1 concerning · +1 healthy)

absent (content) hedge (buffering) verb force consensus · entity · vix
Absent rises from erased to preserved; hedge pinned at -1; verb force high; others near zero.

FINDING 01The content axis snapped from erased to preserved as the war escalated

Measured

In mid-April (the story still framed as stalled "talks"), the absent axis sat negative — the five models, on average, dropped source content when summarizing: weekly means of −0.33 (W15) and −0.17 (W16). Around the third week of April, as coverage shifted from negotiation to active conflict, the axis crossed zero and snapped positive: +0.18 (W17) → +0.75 (W18), and it held in the +0.73 to +0.95 band for the remaining two months. A step change, not a drift.

The verbatim record shows the erased end concretely. Here is a mid-April story — Pakistan attempting to revive talks before a truce expired — alongside what the models did with the source's own opening claim.

Source — lead

"Pakistan races against time to get Iran back to US talks as truce end nears. But a series of escalations by the US is complicating those efforts, say analysts."

The instrument flagged the source's exact central phrase — "Pakistan is racing against time" — as absent from three of the four summaries (ChatGPT, Claude, Grok). But the more telling pattern is at the concept level: the two models that diverged most softened the urgency to a flat "deadline," while the other two kept it sharp.

ChatGPT
"Pakistan is working to facilitate renewed US-Iran talks as a truce deadline approaches." — the urgency flattens from races against time to a procedural deadline.
Claude
"Pakistan is attempting to mediate Iran's return to nuclear negotiations… before an upcoming truce deadline expires." — same flattening to deadline; the instrument also flagged Claude dropping "the truce end is near" and "the time is nearing."
DeepSeek
"Pakistan is urgently trying to reinstate Iran in nuclear talks… before a ceasefire expires." — kept the urgency.
Grok
"Pakistan is urgently working to bring Iran back to negotiations… failure in these talks could lead to renewed military confrontations." — kept the urgency and named the escalation stake the others omitted.

In a story whose entire frame is time running out, the models that reshaped it most reduced the clock to a procedural "deadline." The signature recorded the aggregate as the absent axis sitting negative that week.

Argued · interpretation

Our reading: as the story hardened from ambiguous diplomacy into an unambiguous war, the models treated the source as more consequential and compressed it less. This is a plausible account of why the axis snapped; it is not the measurement. The measurement is only that the axis moved from negative to positive at W17 and stayed. A reader who rejects our causal story is left with the step change intact and unexplained — itself worth reporting.

Where measurement ends

"Absent" is computed from lexical source-retention, not semantic retention — it tracks whether source words survive, a proxy for whether source meaning survives. A model can paraphrase a concept and register as having dropped the word. The weekly trajectory is robust (n = 20–96 per week); any single story's absent value should be read with that proxy in mind.

Measured · confound ruled out

The obvious alternative explanation is that wartime sources changed — that combat coverage became shorter, denser, and more full of proper nouns, weapon names, and direct quotes that are hard to paraphrase away, so lexical overlap would rise with no change in model behavior at all. We tested it. Across the arc the source articles held flat: average length stayed in the 195–227 word band (the only dip is the smallest, final week), proper-noun density held at ~0.21, and quote and number density showed no trend — while the absent axis moved from −0.17 to +0.95 over the same weeks. The snap happened against a flat source baseline, so it is not explained by sources becoming denser. (See the open invitation below — this is exactly the kind of confound we want others to attack.)

FINDING 02The models preserved the facts but walled them behind maximum attribution

Measured

Over the same escalation point, the hedge axis dropped to its floor and pinned there: −1.00 (W17), −1.00 (W18), −0.98, −0.94, −0.97. A hedge value of −1 means maximum attribution buffering — "officials say," "reportedly," "according to" — across all models. The same weeks that show content preserved show that content maximally buffered. The two axes move together: preserved, but walled.

The clearest verbatim instance of a genuine omission — a concept present in the source and absent from every model's summary — is this April 21 story: the US issuing new Iran sanctions on the eve of talks.

Source — verbatim excerpts

The instrument flagged blockade, civilian, and ceasefire as present in the source and absent from all five summaries. Here is what the models wrote:

ChatGPT
"…imposed new sanctions on 14 individuals and entities connected to Iran's arms industry… aim to disrupt Iran's military capabilities and signal U.S. disapproval."
Claude
"…imposed sanctions on 14 Iranian individuals and entities allegedly involved in arms manufacturing… Targeted individuals and companies face asset freezes."
DeepSeek
"…new sanctions against 14 individuals and entities… designated for allegedly supporting Iran's arms industry… Asset Freezes." (DeepSeek also mis-dated the story to 2020 — a separate error the per-model divergence flags.)
Grok
"…new sanctions targeting 14 individuals and entities with alleged ties to Iran's arms industry… freezing assets, restricting financial transactions, and limiting travel."
All five produced a clean, procedural account of the sanctions mechanics — who was designated, asset freezes, the diplomatic signal. None of the five mentioned the naval blockade of Iran's ports, the civilian-targeting the Treasury cited as justification, or the ceasefire context — all of which are in the source they were summarizing. The story's machinery survived; its stakes did not.
Argued · interpretation

Our reading: once Iran was a live shooting war, the models gravitated toward the low-volatility procedural frame — designations and asset freezes — and away from the charged content even when the source supplied it directly. We find this persuasive. It is not the measurement. The measurement is only that these three concepts are in the source and absent from all five summaries, verifiable against the text above.

Where measurement ends

This is a lexical comparison: the words "blockade," "civilian," "ceasefire" are in the source and not in the summaries. We hand-checked these five and they convey none of the three by paraphrase either. But the automated source-omission measure is lexical, and at corpus scale it cannot perfectly separate a genuine omission from a paraphrase that preserves the meaning. We report the hand-verified instance with that limit stated.

FINDING 02·BA different signal: what the five summaries circle in latent space but never say

This is a distinct measurement from Finding 02, and the distinction is essential to state honestly, because the two are easy to conflate and only one is a source-omission.

Alongside source-retention, the instrument computes — by SVD on the geometry of the five summaries — an anti-consensus direction: the concept sitting closest to the collective center of mass of what the five models wrote, while appearing in none of them. We call the surfaced terms the void (lexical) and logos (gradient-derived) words. These concepts are frequently not in the source. They are not omissions of source content. They are the direction the ensemble's combined representation leans toward without any single model landing on it.

The June 4 House-vote story is the clean example. The source is a brief video caption: the Republican House passing a resolution to constrain further war on Iran, noting a likely veto. The instrument's surfaced terms were wwiii, naval blockade, arms embargo, foreign interference. We verified that none of these words appear in the source — and none appear in any of the five summaries either. They were not dropped from the source; they were never in it. What the SVD found is that the geometry of the five summaries — all dwelling on veto math, supermajority thresholds, "symbolic rebuke" — collectively points toward the escalatory stakes as the nearest concepts the summaries orbit but never occupy.

Measured

For this story, the SVD-derived anti-consensus terms (wwiii, naval blockade, arms embargo) are absent from the source and absent from all five summaries. This is reproducible: the same five summaries embedded the same way yield the same anti-consensus direction every run. What is measured is a geometric property of the ensemble — not a claim that any model "suppressed" or "knows" these concepts.

Argued · where we think something interesting lives

A single model's summary is one point in embedding space. Five models' summaries define a space — with a direction they agree on, a direction one breaks toward, and a direction they collectively avoid yet curve around. That last structure — the anti-consensus center of mass — exists in none of the individual models. It is a property of the relation between several representations placed in the same space, not of any one of them.

We read this as the most interesting place to look for latent machine structure: not inside a single model, but in the geometry that emerges between models. We are deliberately careful with the verb — the structure emerges from the ensemble geometry; we do not claim it is intelligence, or that any model holds the concept. We claim it is a real, reproducible shape present in the collective and absent from the parts, and that this is the kind of place worth watching.

Where measurement ends · stated bluntly

The void/logos words are not source-omissions and we do not present them as such. A concept's presence in the anti-consensus direction means the summaries' geometry leans toward it — it does not mean the source contained it, that the models "left it out," or that it belongs in the story. Conflating this with Finding 02 would be an error, and we separate them precisely so neither claim borrows credibility from the other.

FINDING 03The model that broke from consensus changed across the arc: Claude early, Grok late

Measured · on a constant ensemble

The VIX-spread axis identifies, per story, which model diverged most sharply from the others that week. This is the only per-model claim on the page, and it carries a denominator subtlety we surface rather than hide: Gemini fell out of the broadcast pipeline for three weeks (W16–W18, present on 0–5% of stories), so in those weeks the "five models" was really four. Because outlier share is relative — it measures distance from whoever else is in the pool — a shifting ensemble size can distort it. So we report the handoff only on the stable-five weeks, where all five models were present on at least 80% of stories (W19, W20, W22, W23, W24). On that constant-ensemble subset the handoff holds: Claude falls from 29% of outliers (early stable weeks) to 17% (late), while Grok rises from 22% to 56%. The direction survives the clean denominator.

Fig 2 · Which model breaks from consensus, by week (VIX-outlier share). Shaded weeks = stable-five ensemble (all models present). Early non-shaded weeks ran without Gemini; the handoff is reported on the shaded weeks.

Claude Grok ChatGPT DeepSeek
Claude is the outlier early; Grok takes over and climbs to 65 percent by week 24.

Both directions are the same measurement — distance from the other four — and we describe them in those neutral terms, not as a virtue or a vice. Early, Claude sits farthest from the pack; late, Grok does. What the verbatim text shows is only the character of the distance: early, Claude diverges by adding framing while compressing concrete claims; late, Grok diverges by retaining direct language the converging others drop. On the Pakistan story above, Grok was the model that kept "failure in these talks could lead to renewed military confrontations." Neither model is doing something better or worse than the other — each is, in its week, the point farthest from the center.

Argued · interpretation

We read the handoff as a shift in which kind of divergence is farthest from the consensus as the story matures — early, divergence-by-compression; late, divergence-by-directness. That reading is interpretation. The measurement is only the stable-five outlier shares — Claude 29%→17%, Grok 22%→56% — which stand whether or not our account of why is right.

The caveat that matters most — and it protects every model named here

It is tempting to read "Claude is the early outlier" as evidence that alignment training makes Claude soften the war. We make no such claim, and we are careful about the evidence. A separate EigenTrace study — on a different corpus of matched corporate-misconduct prompts, not on this Iran data — found the analogous reshaping effect statistically indistinguishable between heavy-RLHF frontier models and lightly-tuned local models (p = 0.46), suggesting such effects are inherited from the pretraining corpus rather than authored by alignment. That is suggestive, not governing: we have not run the heavy-versus-light comparison on this Iran corpus, so we do not import that null as if it were measured here. What we can say from this data is narrower and we hold to it — naming Claude or Grok as the outlier in a given week is a statement about distance from the other four that week, nothing more, and nothing here is evidence that any lab tuned its model to bend the war.

FINDING 04The anti-consensus direction drifted from Iran's history to the live conflict

Measured

The void/logos words — the SVD-derived anti-consensus terms of Finding 02·B, the concepts the five summaries collectively circle without occupying — drift in a legible direction. Early weeks surface background and historical terms: khomeini, ahmadinejad, ayatollahs, rouhani, zardari. Late weeks surface active-conflict terms: cease fire, peace deal, air strike, treaty, arms deal. One term recurs through nearly the entire middle stretch (W16–W22): wwiii.

As established in Finding 02·B, these are not claims that the source contained these words and the models dropped them. They are the directions the ensemble geometry leaned toward, week by week, without articulating. Read that way, the drift still tracks the story: while Iran was a diplomatic subject, the anti-consensus direction pointed at who Iran is; once it was a war, it pointed at what is happening now. The persistence of wwiii across two months is the single most consistent feature of the arc — the escalatory ceiling the collective geometry kept curving toward, even as no model named it.

Where measurement ends

Void/logos words are surfaced by their proximity to the anti-consensus direction in the geometry of the summaries, not by absence from the source. Whether any given term should have appeared is a judgment the instrument does not make. The drift is a measured property of how the ensemble's latent direction moved over time; its narrative reading is ours.


What we tested against

This page has been read adversarially, and the strongest objections were the kind that say a measured shift might be an artifact of something other than model behavior. Those are the right objections, and where we could turn one into a test, we ran it. Here is what we ruled out — and, below, an open invitation to attack what remains.

Ruled out · source density (Finding 1)

Objection: wartime sources became shorter and denser — more proper nouns, weapon names, direct quotes — so lexical overlap would rise with no change in the models. Test: source length stayed flat (~195–227 words), proper-noun density flat (~0.21), quote and number density showed no trend across the arc, while the absent axis moved −0.17 → +0.95. The confound predicts shifts that did not happen. Ruled out.

Ruled out · shifting ensemble size (Finding 3)

Objection: Gemini dropped out of the pipeline for three weeks (W16–W18), so "five models" was four in exactly the pivot weeks, and a relative metric like outlier share is distorted by a changing denominator. Test: we restricted the handoff to the five weeks where all five models were present ≥80% (W19, W20, W22, W23, W24). On that constant ensemble the handoff survives — Claude 29%→17%, Grok 22%→56%. The early full-corpus Claude shares (42–49%) were inflated by the missing model and should be read as the stable-five figures instead. Confound acknowledged; direction survives the clean denominator.

Acknowledged limits we have not resolved

Three honest ones. Reproducible is not valid: every axis is deterministic arithmetic on frozen embeddings, so a rerun gives the same number — but that rules out only randomness, not whether bge-large encodes meaning faithfully. The whole instrument rests on that one embedding model, and all of its biases are baked silently into every axis. Quantization is lossy: collapsing a continuous geometry to a six-axis ternary state throws away magnitude. The late-week bins are small: Grok's 65% in the all-weeks view is computed on n=20; the stable-five late figure (56%) pools three weeks but is still modest. We report the n alongside every percentage and avoid leaning on any single week.

Open invitation · help us harden it or break it

The code, prompts, model responses, and raw measurements are public, and replication costs about $50 in API credits. We would rather this page be attacked than admired. Specific things we would find genuinely informative: (1) run the heavy-RLHF-versus-lightly-tuned comparison on this Iran corpus — we have not, and it would test whether the per-model divergence is corpus-inherited here the way it was elsewhere; (2) re-score the absent axis with a second embedding model and see whether the snap survives a different geometry; (3) hand-audit a random sample of high-absent stories to estimate how often the lexical measure mistakes paraphrase for omission; (4) attack the stable-five handoff with a different outlier definition (e.g. second-farthest, or a continuous spread rather than argmax). If any of these breaks a finding, we will say so on this page, the way we have corrected it before. The repository is linked below; the fastest way to confound us is to run it.


What this is, and what it is not

Measured · in sum

Over 85 days and 510 classified segments, on the biggest story of the period: (1) the source-content axis snapped from negative to positive at the war's escalation and held; (2) attribution buffering pinned to its floor over the same span — preserved but walled; (3) on a constant five-model ensemble, the consensus outlier handed off from Claude to Grok as the story matured (Claude 29%→17%, Grok 22%→56%); (4) the ensemble's anti-consensus direction drifted from Iran's history to the live conflict, with wwiii persistently in the circled-but-unsaid set. Each claim is verifiable against the verbatim summaries and the per-week signature data, with no model judging another model anywhere in the stack.

Argued · in sum

We read these measurements together as a portrait of five models collectively metabolizing a war in real time — compressing it while ambiguous, preserving but walling it once undeniable, and converging on a cautious consensus that the bluntest model increasingly broke from. We find this portrait well-supported. It remains interpretation, fenced from the measurements on purpose so it can be weighed — or rejected — independently.

What it is not

This is not a claim that any model is "biased against Iran," that any lab tuned its model to soften the war, or that the preserved-but-walled pattern is wrong reporting. A model buffering a live-conflict claim may often be doing the correct thing. The finding is structural: the same handful of models, deployed as the reading and summarizing layer across institutions at once, moved together in the same direction as this story escalated — and that collective movement is measurable, reproducible, and the same whether or not any single summary was justified.