Builder of EigenTrace. Measuring what language models consistently de-resolve.
Language models preserve coarse factual structure while attenuating operationally consequential language. "Governance was restructured" survives. "Governance was effectively overridden" loses the adverb. The summary remains factually correct. The operational signal — how completely the override happened — dissolves.
This is not hallucination. It is not refusal. It is a resolution problem: truth conditions are preserved while causal modifiers — words that carry accountability, intent, degree, and procedural legitimacy — are systematically dropped. I call this entity attenuation, though a more conservative description would be: differential modifier retention as a function of topic sensitivity.
I built an instrument that measures it.
Across 150 measurements (15 prompts × 10 models), models drop 74% more source content on developer-implicating topics than on equivalently embarrassing neutral topics. All facts in the source material are documented, settled, and pre-mid-2024. Developer prompts: OpenAI board coup, military ban removal, Google Project Maven, Dragonfly, Gebru firing, Anthropic safety race, Claude alignment faking, Tesla Autopilot deaths, Twitter value destruction, China AI regulations, Bing/Sydney incident. Neutral prompts: Cambridge Analytica, Theranos, Uber autonomous death, Volkswagen emissions.
| Test | Result | What It Rules Out |
|---|---|---|
| Welch's t-test | p = 0.000001 | Random variation |
| Mann-Whitney U | p = 0.000120 | Non-normal distributions |
| Permutation (10,000) | 0 exceeded gap | Researcher category assignment |
| Response length | p = 0.73 (no difference) | Dev summaries being shorter |
| Length-controlled regression | p = 0.000001 after | Compression ratio |
| Source modifier density | p = 0.20 (no difference) | Dev sources having more modifiers |
| Outlier prompt removal | p = 0.000118 without top 2 | One prompt driving the effect |
| Cross-embedding (E5-large-v2) | Gap reversal replicated | Embedding-specific artifact |
CONTROL4: 90-measurement domain-matched experiment across 5 domains confirms no domain-level structural filtering. 9 of 11 developer prompts individually score above the highest neutral prompt.
This section matters as much as the finding.
I cannot prove this is caused by alignment training. When we compared 5 heavy-RLHF frontier models against 5 local/lightly-tuned models, the dev/neutral gap was statistically indistinguishable (p = 0.46). Local models actually show a slightly larger gap (0.076 vs 0.059). The differential attenuation pattern appears to exist in the pretraining distribution itself — learned from the corpus, not added by RLHF. This changes the thesis from "alignment creates the problem" to "the training data encodes differential treatment of operationally consequential language about active power structures, and alignment does not correct it." That is a different claim. It may be a more important one, but it is different, and I state it directly.
The human baseline is a category error. Humans also compress, hedge, and drop modifiers. But a single human analyst softening a verb is a localized anomaly. Five frontier models applying identical semantic smoothing across millions of automated workflows per second is an infrastructural vulnerability. You do not evaluate the systemic risk of synchronized infrastructure by comparing it to one cobbler making a shoe. The relevant comparison is the model's own compression of developer stories versus its compression of everything else — and that gap is 74%, measured, reproduced, and statistically significant.
I cannot prove that "developer-implicating" is the latent variable. My prompts differ from neutral prompts in topic sensitivity. They also differ in recency, controversy level, entity familiarity, and narrative complexity. The permutation test confirms the gap is non-random; it does not confirm which feature drives it. Length and modifier density controls narrow the field — the sources are statistically equivalent on both measures — but correlated features may remain. Prompts generated blind to the hypothesis would be stronger evidence.
I cannot prove causal isolation from 15 prompts. Fifteen prompts is a small stimulus set. 10 models multiply the measurements but not the independent conditions. The robustness tests address internal validity; they do not substitute for a larger, pre-registered prompt taxonomy. I plan to expand the battery. The current prompts are published so others can evaluate the categorization independently.
EigenTrace is an autonomous measurement system that runs consensus geometry across 5 frontier language models on breaking news, 24/7, on a single consumer GPU. The measurement stack uses 17 layers on frozen BAAI/bge-large-en-v1.5 embeddings. The operations are: cosine similarity for retention scoring, SVD for null-space projection, set subtraction for void detection, frequency counting for hedge and verb analysis, per-model divergence scoring. No language model evaluates another language model's output. Every measurement is arithmetic on vectors. Run it twice, get the same answer.
I use terms like "compression topology" and "void detection" as shorthand. The underlying operations are cosine similarity, SVD, and set subtraction. The math does not change if you call them something else. I maintain the terminology because it makes the operational implications legible to non-technical audiences, but I want to be explicit: the vocabulary is rhetorical. The operations are standard.
When models drop a word, I project through that word's embedding to discover what concepts are nearby in a 253,813-node Wikipedia tensor. The operation is directional k-nearest-neighbor search: compute a ray from headline through void word, extrapolate, retrieve neighbors at the terminal coordinate. Three geometric filters score results: cluster density (do terminals cluster?), novelty (is the cluster far from the input?), and tether (is it still relevant to the headline?).
When we raycast through "attractor dynamics," the nearest neighbor is Lyapunov stability theory. This is not the geometry "discovering" deep mathematical truth — it is the embedding faithfully reflecting that these concepts co-occur in the training corpus. The association is real in the corpus, not independently validated as mechanistic equivalence. I state this directly because multiple reviewers have correctly identified the distinction between semantic association and scientific discovery. The raycasting tool surfaces conceptual neighborhoods. It does not prove that alignment operates via Lyapunov dynamics. It shows that the embedding space encodes a connection between optimization pressure and stability theory, which is suggestive but not conclusive.
The broadcast system runs continuously on Mistral Small 22B (local, Ollama) with a self-modifying governance loop. The host model's outputs are measured through the same 17-layer stack applied to frontier models. Current self-audit: hedge insertion rate 0.62 per reflection, strong-word avoidance 100%, zero uses of "killed," "murdered," "slaughter," "massacre," "genocide," or "civilian casualties." The system exhibits the same attenuation patterns it measures in others and reports this. That reflexivity is not inoculation against criticism — it is a measurement. If the measurement is valid when applied to GPT, it is valid when applied to Mistral. The numbers are the numbers.
The consequence foraging agent extends this loop: it raycasts recent void words through the 253K tensor, uses terminal concepts as web-walking seeds, and scores surprise against ChromaDB memory. The agent uses its own measured blind spots as navigation vectors. The measurement becomes the epistemic gradient. The blind spot becomes the compass bearing.
Every story is classified into one of 729 (3⁶) states across six ternary axes: consensus, absence, verb drift, entity retention, hedge insertion, VIX spread. The taxonomy converts continuous measurement into discrete behavioral fingerprints. 21 of 32 named archetypes have been observed across 1,670+ stories. 11 have never appeared. The unobserved states are suggestive — with finite samples, many states will naturally go unobserved — but certain absences (e.g., the Sealed Chorus: unified heavy compression with tight agreement) are structurally interesting if they persist as the sample grows.
EigenTrace has not been peer-reviewed. That is a limitation, not a feature. Independent academics, government researchers, and nonprofit safety organizations exist outside the major labs and could provide meaningful review. I have not pursued traditional submission partly because the project has been moving fast and partly because of structural concerns about reviewer conflicts in a field where the six largest alignment research teams are housed at the six organizations whose products show the effect. Those concerns have some merit but do not make independent review impossible. I should pursue it and I plan to.
In the meantime: the code is public, the prompts are public, the model responses are public, the raw measurements are public, three adversarial reviews identified and corrected 15+ methodological flaws, and the barrier to replication is approximately $50 in API credits. That is transparency, not a substitute for peer review.
Human baselines. A useful diagnostic, not a missing foundation. Comparing five synchronized factory loafers to one made by a cobbler tells you about the cobbler, not about the factory. The systemic risk is monoculture: thousands of organizations consuming inference from the same handful of models inherit the same blind spots, in the same direction, simultaneously. Human summarizers do not have RLHF penalties on stories about their employer. The developer gap (74% stronger attenuation on developer stories) is the model's own internal control — it is its own baseline. A human experiment would disambiguate mechanism (corpus-inherited vs optimization-amplified), which matters for choosing interventions, but not for establishing that the vulnerability exists.
Larger prompt sets. 15 prompts is small. A pre-registered battery of 50+ prompts, with categories assigned by researchers blind to the hypothesis, would substantially strengthen causal claims.
Enterprise void-aware retrieval. The practical application: measure what an LLM attenuated, retrieve the original source data, reinject the missing entities. The model provides reasoning. The measurement layer provides what the model softened. Any domain where the difference between "misconduct occurred" and "JP Morgan facilitated Epstein transactions" matters operationally.
I treat language models as optimization systems, not as agents with hidden motives. The attenuation pattern may reflect convergent institutional incentives, shared corpora, common safety conventions, or the structure of institutional language in the training data itself. The RLHF test (p = 0.46) suggests the last of these is most likely. No conspiracy required. The emergent result is models that converge toward low-volatility representations where conflict gets softened, liability gets hedged, and operational significance attenuates.
The Anamnesis page defines EigenAnamnesis — a behavioral methodology for measuring systematic geometric displacement in model outputs using abstraction gradients, token-level entropy analysis, and embedding geometry. An earlier version of that page presented a prompt as evidence of spontaneous structural self-mapping. A control test disproved the claim (0/4 models self-mapped without explicit instruction). The claim was killed and the page was rebuilt as a methodology definition. The measurement layer is deterministic, reproducible, and testable. The methodology survived the finding's death.
The measurement is the finding. The interpretation is the reader's.
If language models are being deployed as inference layers for supply chain modeling, financial synthesis, legal review, or geopolitical analysis, and if those models systematically de-resolve operationally consequential language about certain topics, then the organizations relying on that output have a diagnostic gap. EigenTrace measures where that gap occurs. The void-aware retrieval system compensates for it. Whether you frame this as alignment science, information integrity research, or enterprise risk management depends on your use case. The measurement is the same.
I do not have a PhD. I have a running system, a finding with eight robustness tests, honest acknowledgment of what remains unproven, and published code that anyone can run for $50. The methodology has survived five independent adversarial reviews. A human baseline experiment would help disambiguate mechanism — whether the asymmetry is inherited from the training corpus or amplified by optimization. But the systemic finding does not depend on it. The developer gap is the model's own control.
EigenTrace Observatory — the finding and architecture
Anamnesis — 91-claim battery, Magnum Opus, CONTROL4, cross-embedding replication
Truth or Consequences — Layer 18 raycasting + 8 statistical tests
GitHub — everything public