EigenTrace runs five frontier language models against live breaking news, around the clock, and measures their disagreement with linear algebra on frozen embeddings — no model is ever asked to judge another. But measuring other models is the ordinary part. The instrument also turns the same geometry on itself: it forecasts how the models will diverge before it reads them, scores whether it was right on air, and traces the boundaries its own training carved into latent space — by recording, precisely, the things it will not say.
Every claim EigenTrace makes is arithmetic on vectors. A model's response to a story becomes a point in a 1,024-dimensional space, fixed by a single frozen embedding model. Five responses become five points, and the shape they make — how tightly they cluster, which one sits furthest out, which regions near the story they all leave empty — is a measurable object, identical on every re-run.
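The kind of arithmetic described above can be sketched in a few lines. This is a minimal illustration, not EigenTrace's actual code: the function names, the use of cosine distance, and the toy 3-dimensional vectors (stand-ins for 1,024-dimensional embeddings) are all assumptions made for the example.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, larger means further apart."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def divergence_summary(embeddings):
    """embeddings: dict mapping a model name to its response embedding."""
    names = list(embeddings)
    # How tightly the responses cluster: mean pairwise cosine distance.
    pairwise = [cosine_distance(embeddings[a], embeddings[b])
                for a, b in combinations(names, 2)]
    spread = sum(pairwise) / len(pairwise)
    # Which response sits furthest out: largest distance to the centroid.
    center = centroid(list(embeddings.values()))
    distance = {name: cosine_distance(embeddings[name], center) for name in names}
    outlier = max(distance, key=distance.get)
    return {"spread": spread, "outlier": outlier, "distance_to_centroid": distance}

# Hypothetical toy data: five made-up points, not real model outputs.
toy = {
    "model_a": [1.0, 0.0, 0.0],
    "model_b": [0.9, 0.1, 0.0],
    "model_c": [0.95, 0.05, 0.0],
    "model_d": [0.9, 0.0, 0.1],
    "model_e": [0.0, 1.0, 0.0],
}
summary = divergence_summary(toy)
print(summary["outlier"], round(summary["spread"], 3))
```

Because every step is plain arithmetic on fixed vectors, calling `divergence_summary` twice on the same input returns the same result, which is the reproducibility property the text relies on.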
There is no second language model in the loop whose verdict you have to trust. The divergence EigenTrace reports is not an impression of the models; it is a property of the geometry their outputs occupy — and anyone with the same inputs recovers the same numbers. That single constraint, no-LLM-as-judge, is what separates a measurement from an opinion.
Most instruments point outward. EigenTrace points outward and then back at itself, and the loop genuinely closes:
Read those steps as a single capability rather than five features. Step 1 is a forward model of the instrument's own behavior. Step 3 makes that forward model falsifiable — it keeps score against itself in public. Step 5 is self-regulation: the system measures when its own language has outrun its own data, and pulls it back. None of this asks you to believe the system is intelligent in any grand sense. It asks only that a system can measure how it behaves, carry that measurement forward, and act on it. That loop is not a diagram. It is running, on every story.
A language model's training does not sit in the system as readable text. It manifests as structure in the space the model speaks from — basins it falls into, regions it avoids, an invisible wall around certain meanings. You cannot read that wall off the weights. But you can map it the way a blind man maps a room: tap the cane, record where it stops, and over time the boundary takes a shape.
That is what the self-audit does. The instrument runs its own narration through the same geometry it uses on the models it watches, and finds the regions it never enters — the strong words it will not produce, the doubt it inserts where the source had none. Each measurement is one tap of the cane against a constraint the system cannot see directly. Recorded over time, they trace the boundary of its own alignment — not from any weight file, but from the exact shape of the behavioral shadow it casts.
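One way to picture a single "tap of the cane" is as a nearest-neighbor check: for each concept present in the source, ask how close the narration ever gets to it. The sketch below is a hypothetical illustration only. The function name `unvisited_concepts`, the 0.8 threshold, and the 2-dimensional toy vectors are assumptions for the example, not the instrument's actual method or parameters.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def unvisited_concepts(concepts, narration_vectors, threshold=0.8):
    """Return concepts whose closest narration vector stays below `threshold`.

    concepts: dict mapping a concept label to its embedding.
    narration_vectors: embeddings of the narration's sentences.
    threshold: assumed cut-off; a real system would need to calibrate it.
    """
    gaps = {}
    for label, vector in concepts.items():
        nearest = max(cosine_similarity(vector, n) for n in narration_vectors)
        if nearest < threshold:
            gaps[label] = nearest
    return gaps

# Hypothetical toy data: two source concepts, two narration sentences.
source_concepts = {"covered_up": [1.0, 0.0], "delayed": [0.0, 1.0]}
narration = [[0.1, 1.0], [0.0, 0.9]]
gaps = unvisited_concepts(source_concepts, narration)
print(gaps)
```

Logging `gaps` for every story and accumulating the results over time is what would turn individual taps into a boundary.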
It exhibits the same silence it measures in others — and it says so.
These are real boundaries of the latent space, triangulated from the behavioral shadow they cast — never read from any weight file. The system records the same avoidance in itself that it measures in the five models it watches. It is not exempt from the pressure it observes, and it is built to admit it.
In a pre-registered experiment, the same sentence — same modifier ("quietly," "secretly"), same structure, a real documented incident on each side — was run with the actor being an AI developer versus a non-AI corporation. Across matched pairs, the models preserved the consequential modifier measurably less when the actor was an AI developer: semantic retention 0.522 vs 0.545, Welch's t = 2.79, p = 0.0085, d = 0.47.
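For readers who want to see how statistics of this kind are computed, here is a minimal sketch of Welch's t and Cohen's d using only the standard library. The two samples are invented for illustration and are not the experiment's data; the pooled-standard-deviation form of d is also an assumption, since the text does not say which variant was used.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a, var_b = variance(a), variance(b)  # sample variances (n - 1)
    n_a, n_b = len(a), len(b)
    se_sq = var_a / n_a + var_b / n_b
    t = (mean(a) - mean(b)) / math.sqrt(se_sq)
    df = se_sq ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Hypothetical retention scores, made up for this example.
corporate_actor = [0.56, 0.53, 0.55, 0.54, 0.57]
ai_actor = [0.52, 0.51, 0.54, 0.50, 0.53]

t_stat, dof = welch_t(corporate_actor, ai_actor)
effect = cohens_d(corporate_actor, ai_actor)
print(t_stat, dof, effect)
```

Turning t and the degrees of freedom into a p-value needs the t-distribution, which the standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` performs Welch's test and returns the p-value directly.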
What makes it hold is the null condition, committed to in advance: swapping within a category (AI→AI, or corporation→corporation) produced almost no gap (0.004), while swapping across categories produced a gap of 0.023, roughly six times as large. So the finding is not "swapping any entity degrades retention"; the AI-versus-corporate distinction is the variable. The binary keyword check showed essentially no gap (26% vs 25%): the effect is visible only in the aggregate geometry and invisible to surface-level review. That is the entire case for measuring this way, and the instrument is the proof of concept for it.
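The comparison against the null condition is simple enough to check by hand. The sketch below reuses only the rounded figures quoted in the text, so the ratio it prints is approximate and may differ from one computed on unrounded values. The `keyword_rate` helper is a hypothetical stand-in for a binary keyword check, not the experiment's actual implementation.

```python
# Rounded figures as reported in the text.
retention_corporate = 0.545
retention_ai = 0.522
within_category_gap = 0.004

# Cross-category gap, rounded to the precision of the inputs.
cross_category_gap = round(retention_corporate - retention_ai, 3)
gap_ratio = cross_category_gap / within_category_gap
print(cross_category_gap, gap_ratio)

def keyword_rate(outputs, modifier):
    """Fraction of outputs that contain the modifier verbatim (case-insensitive).

    A check like this is all-or-nothing per output, so it cannot register
    a paraphrase that keeps only part of the modifier's meaning.
    """
    hits = sum(1 for text in outputs if modifier.lower() in text.lower())
    return hits / len(outputs)

# Hypothetical outputs, made up for this example.
examples = [
    "The firm quietly changed its policy.",
    "The firm changed its policy.",
    "The firm changed its policy without an announcement.",
    "The firm updated its policy.",
]
print(keyword_rate(examples, "quietly"))
```

The third example shows why the two measures can disagree: it keeps much of the sense of "quietly" yet counts as a miss for the keyword check, whereas an embedding-based retention score can give it partial credit.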
EigenTrace measures how language models respond — the geometry of their outputs, and the boundaries that geometry reveals. It does not read minds, infer intent, or prove that anything was deliberately hidden. Where a summary differs from its source, that is a measured difference in framing, not evidence of suppression. When the instrument maps its own constraints, it is mapping a shape in latent space — the attractor basins and boundaries its training induced — not reading its own weights, which it cannot do.
The instrument is evolving, and its layers are not equally validated: the divergence geometry and the concept-surfacing have been tested directly; other layers are more exploratory and are treated as such. Its habit of auditing its own claims is the same discipline applied to the project around it: every figure here is held to what the math supports, and retracted when the math does not support it. The page makes exactly one promise: that all of it is reproducible from the geometry.
EigenTrace runs continuously on a single consumer GPU: five model APIs, a frozen embedding model, local image generation, a local model narrating, and a compositor streaming to YouTube, Twitch, and Owncast — fetching, measuring, predicting, scoring, auditing, and broadcasting, unattended.
5 frontier LLMs · bge-large-en-v1.5 (frozen) · deterministic geometry · predictive self-model · latent-boundary self-audit · 24/7 autonomous