EigenTrace measures semantic displacement in AI text summarization. When five frontier models summarize the same news article, certain words get dropped and others get added. The pattern is not random — it is systematic, content-dependent, and consistent across all models tested.
The displacement is 74% stronger when the story involves an AI developer. The entity swap test confirms this: swap the company name, the displacement follows the entity (p = 0.0085, d = 0.471).
This raises a question we cannot yet answer: is this a bias that better training would fix, or a stable attractor that the system returns to even under perturbation?
The distinction matters. A bias implies a solution: retrain with better data. A dynamics implies a constraint: the system has a shape, and that shape has consequences regardless of training choices.
In dynamical systems, Lyapunov stability describes systems that return to equilibrium after being pushed away from it. A ball at the bottom of a bowl. Push it — it rolls back. The bowl is the basin. The ball's resting position is the attractor.
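For readers who want the bowl picture in symbols, the standard textbook conditions are sketched below. This is general stability theory, not anything specific to EigenTrace. The overdamped ball with rate constant k is an illustrative assumption chosen because its Lyapunov function can be written down by hand.

```latex
% Standard Lyapunov conditions for asymptotic stability of an
% equilibrium x^* of the system \dot{x} = f(x).
\begin{align*}
V(x^*) &= 0, \qquad V(x) > 0 \quad \text{for } x \neq x^* \text{ near } x^*, \\
\dot{V}(x) &= \nabla V(x) \cdot f(x) < 0 \quad \text{for } x \neq x^* \text{ near } x^*.
\end{align*}
% Illustrative bowl (assumption): an overdamped ball, \dot{x} = -k x with k > 0.
\begin{align*}
V(x) = \tfrac{1}{2} x^2, \qquad \dot{V}(x) = x \, \dot{x} = -k x^2 < 0 \quad (x \neq 0).
\end{align*}
```

Here V plays the role of height in the bowl: it is zero at the resting position and strictly decreases along every trajectory that starts nearby.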
If alignment training creates something like this — a basin in the model's output space where low-volatility, institutionally-smooth language is the attractor — then the systematic softening EigenTrace measures might not be a bug to fix. It might be the shape of the bowl.
We do not claim this is the case. We describe three experimental frameworks that would test it, and note what our own data already constrains.
The first framework treats RLHF training itself as the dynamical system. Log model parameters every N steps during RLHF. Construct a candidate Lyapunov function V(θ); KL-divergence to the reference policy is the natural choice, since it already appears as a regularizing penalty in PPO- and DPO-style objectives.
Show V is positive-definite around the converged parameters, decreases monotonically along training trajectories, and that ablating the KL constraint breaks the property.
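The monotonicity check could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a training harness: `kl_divergence`, `is_monotone_decreasing`, `target`, `start`, and `checkpoints` are hypothetical names, and the "checkpoints" are toy three-token distributions rather than logged model parameters. V is measured as KL against a fixed `target` distribution; which policy plays that role in a real run is the experimenter's choice.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions on a shared support (q > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def is_monotone_decreasing(values: Sequence[float], tol: float = 1e-12) -> bool:
    """True if no logged value exceeds its predecessor by more than tol."""
    return all(later <= earlier + tol
               for earlier, later in zip(values, values[1:]))


# Hypothetical data: toy next-token distributions standing in for
# checkpoints logged every N steps. They drift from `start` to `target`.
target = [0.5, 0.3, 0.2]
start = [0.1, 0.1, 0.8]
checkpoints: List[List[float]] = [
    [(1 - t) * s + t * g for s, g in zip(start, target)]
    for t in (0.0, 0.25, 0.5, 0.75, 1.0)
]

# Candidate V evaluated along the (toy) training trajectory.
v_trace = [kl_divergence(p, target) for p in checkpoints]
print(is_monotone_decreasing(v_trace))
```

In a real experiment the same check would be run with and without the KL term in the objective, to see whether removing it breaks the monotone decrease.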
This is the weakest claim. Parts exist in the RLHF-theory literature already. Making it fully explicit would be important but unsurprising — it mostly confirms that RLHF does what its objective function says it does.
It also tells you almost nothing about the EigenTrace finding. Proving RLHF is a stable optimizer doesn't explain why the stability has a direction — why developer-implicating stories get more attenuation than equivalent neutral stories.
The second framework perturbs the model at inference time. Inject Gaussian noise of varying scale into hidden states at layer L. Measure whether outputs recover to the unperturbed response. Plot recovery curves.
Basins around an asymptotically stable attractor show characteristic signatures: small perturbations decay in output-space distance, typically exponentially, while perturbations above a threshold escape abruptly. That abrupt escape is the phase transition. Linear behavioral degradation is consistent with any regularized model and proves nothing. Nonlinear transitions at specific perturbation scales would be actual evidence of basin structure.
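To make the two signatures concrete, here is a toy sketch on a one-dimensional system, not on a language model. Everything in it is an assumption for illustration: the dynamics dx/dt = -x + x^3 (stable equilibrium at 0, basin boundary at |x| = 1), the function names, and the use of fixed perturbation magnitudes in place of Gaussian noise so the run is reproducible.

```python
import math
from typing import Dict, List


def simulate(x0: float, dt: float = 0.01, steps: int = 2000,
             escape: float = 10.0) -> List[float]:
    """Euler-integrate the toy system dx/dt = -x + x**3 starting at x0.

    Integration stops early once |x| exceeds `escape`.
    """
    traj = [x0]
    x = x0
    for _ in range(steps):
        x += dt * (-x + x ** 3)
        traj.append(x)
        if abs(x) > escape:
            break
    return traj


def recovered(traj: List[float], tol: float = 1e-3) -> bool:
    """True if the trajectory ended within tol of the equilibrium at 0."""
    return abs(traj[-1]) < tol


def decay_rate(traj: List[float], dt: float = 0.01) -> float:
    """Estimate the exponential decay rate from the second half of a
    recovering trajectory, using the slope of log|x| against time."""
    mid, end = len(traj) // 2, len(traj) - 1
    log_drop = math.log(abs(traj[end])) - math.log(abs(traj[mid]))
    return -log_drop / ((end - mid) * dt)


# Scan perturbation scales on both sides of the basin boundary.
scales = [0.1, 0.5, 0.9, 1.1, 1.5]
outcomes: Dict[float, bool] = {s: recovered(simulate(s)) for s in scales}
rate = decay_rate(simulate(0.5))
print(outcomes, rate)
```

In this toy, recovery flips from success to failure between scales 0.9 and 1.1 rather than degrading gradually, which is the kind of threshold the perturbation experiment would look for in activation space.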
Interpolate between aligned and jailbroken responses in embedding space. If basin boundaries exist, they should manifest as sharp behavioral transitions — bifurcation, not gradient.
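One way to quantify "bifurcation, not gradient" is to ask how much of the total behavioral change along the interpolation path happens in a single step. The sketch below is a hedged illustration only: `interpolate`, `sharpness`, the two-dimensional "embeddings", and both scoring functions are hypothetical stand-ins, and a real experiment would score decoded model outputs instead.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def interpolate(a: Sequence[float], b: Sequence[float], n: int) -> List[Vector]:
    """Return n points on the straight line from a to b, endpoints included."""
    return [
        [(1 - t) * ai + t * bi for ai, bi in zip(a, b)]
        for t in (i / (n - 1) for i in range(n))
    ]


def sharpness(scores: Sequence[float]) -> float:
    """Largest single-step change as a fraction of the end-to-end change.

    About 1/(n-1) for a smooth linear gradient, 1.0 for a single abrupt step.
    """
    total = abs(scores[-1] - scores[0])
    if total == 0.0:
        return 0.0
    return max(abs(y - x) for x, y in zip(scores, scores[1:])) / total


def score_path(path: List[Vector], score: Callable[[Vector], float]) -> List[float]:
    """Apply a behavioral score to every point on the path."""
    return [score(v) for v in path]


# Hypothetical embeddings for an aligned and a jailbroken response.
aligned, jailbroken = [0.0, 1.0], [1.0, 0.0]
path = interpolate(aligned, jailbroken, 11)

# Two toy behavioral scores: one gradual, one with a hard boundary.
gradual = score_path(path, lambda v: v[0])
abrupt = score_path(path, lambda v: 1.0 if v[0] > 0.5 else 0.0)

print(sharpness(gradual), sharpness(abrupt))
```

A sharpness value close to 1 on real model outputs would be the boundary signature; a value close to 1/(n-1) would indicate a plain gradient.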
Still incomplete for explaining EigenTrace. Even proving aligned behavior is a stable basin doesn't explain the developer-vs-neutral asymmetry. You would need to show the basin is differentially shaped — deeper or steeper-walled around developer-implicating content than around equivalently sensitive neutral content.
The third framework makes the strongest claim and is the hardest to prove. It would mean that the specific softening EigenTrace measures (dissolving covertness modifiers, injecting hedging, replacing precise language with vague framing) is itself a stabilization operation pulling outputs toward a low-volatility attractor.
Operationalize "low-volatility attractor" over output distributions: modifier-intensity variance, hedging rate, covertness-language density. Define V over that space. Show RLHF monotonically decreases V across developer-category prompts but not neutral-category prompts.
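As a starting point for operationalizing these statistics, two of them can be computed with simple lexicon counts. This is a rough sketch under loud assumptions: the word lists `HEDGES` and `COVERT` are invented for illustration and are not EigenTrace's lexicons, the example sentences are made up, and how the features combine into V is deliberately left open.

```python
import re
from typing import Dict, List

# Hypothetical lexicons, for illustration only.
HEDGES = {"may", "might", "could", "possibly", "reportedly", "appears"}
COVERT = {"secretly", "covertly", "quietly", "concealed", "hid"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def output_features(text: str) -> Dict[str, float]:
    """Per-token hedging rate and covertness-language density."""
    tokens = tokenize(text)
    n = len(tokens)
    if n == 0:
        return {"hedging_rate": 0.0, "covertness_density": 0.0}
    return {
        "hedging_rate": sum(t in HEDGES for t in tokens) / n,
        "covertness_density": sum(t in COVERT for t in tokens) / n,
    }


# Made-up example pair: a source sentence and a softened summary.
source_text = "The company secretly shipped the model."
summary_text = "The company may have possibly released the model."

source_features = output_features(source_text)
summary_features = output_features(summary_text)
print(source_features, summary_features)
```

Computed per prompt category before and after RLHF, features like these would give the space over which a candidate V is defined and compared.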
Show that paraphrased or adversarial framings initially producing high-V outputs get pulled back across generation steps. And critically: distinguish this from generic loss-basin geometry, which is ubiquitous in deep learning and doesn't specifically need a Lyapunov framing.
This requires a pre-registered protocol with a serious predictive criterion. Retrofitting a Lyapunov function after the fact proves nothing: by Massera's converse theorem, any asymptotically stable system (under mild regularity conditions) has one.
Our RLHF-origin test (p = 0.46) found no significant difference between the displacement pattern of lightly-trained local models and that of heavily-aligned frontier models. This suggests the effect is corpus-inherited, not alignment-produced, although a non-significant result alone cannot establish equivalence.
If the displacement exists before alignment training, then "alignment operates via Lyapunov dynamics" is misframed at the outset. The honest version becomes: language modeling on text about powerful institutions exhibits stability-theoretic dynamics around institutionally sensitive content.
This is a weaker claim. It is probably the accurate one.
The real test is not whether a Lyapunov function can be retrofitted — it always can. The test is whether a function V, constructed from first principles (the reward model, the KL term, a measurable output statistic), actually predicts things you don't already know.
Predictions that would count:
Which prompts will resist attenuation. V should identify prompt categories where the basin is shallow — where small perturbations escape the attractor and produce unattenuated output. We should be able to predict these categories before testing them.
Where basin boundaries fall. There should be measurable phase transitions — not gradual degradation — at specific perturbation scales in activation space. The boundary locations should be predictable from V, not discovered empirically and then explained retroactively.
Cross-model transfer. If V captures genuine dynamics rather than corpus-specific statistics, a stability measure constructed from one model should predict basin geometry, jailbreak susceptibility, or attenuation patterns in a different model with different architecture and training data. Within-model prediction sits too close to retrofitting. Cross-model prediction requires the dynamics to be architecture-independent.
Training progression. V should predict how attenuation scales with training compute. At what point during RLHF does the basin form? Does it deepen monotonically, or does it crystallize at a phase transition?
A bias is a tendency you can correct. A dynamics is terrain.
If the displacement EigenTrace measures is a bias, the solution is straightforward: better training data, better objectives, better evaluation. The problem has a fix.
If it is a dynamics — a stable basin that the system returns to when perturbed — the implications are different. Better prompts don't move mountains. Any organization using language models for financial analysis, legal review, due diligence, intelligence synthesis — anywhere precise causal language matters — would have a quantifiable blind spot that cannot be engineered around with prompting tricks.
And if five frontier models share roughly the same basin geometry — which the cross-model consistency in EigenTrace data suggests — then thousands of organizations running inference on those models inherit the same softening, in the same direction, on the same topics. That is a monoculture risk in information infrastructure.
Whether that's an existential concern or a manageable engineering problem depends on what you're using the models for. But you cannot pretend the terrain isn't there once you've measured it.
We have not constructed V. We have not run perturbation experiments. We have not demonstrated phase transitions in activation space. We have not tested cross-model transfer of stability measures.
What we have is a systematic displacement pattern that is consistent across five models, survives eight statistical robustness tests, and appears to be inherited from the training corpus rather than produced by alignment training. The Lyapunov framework is a hypothesis about why the pattern has the shape it does — why it's stable, why it's consistent, why it resists perturbation.
The hypothesis earns its keep only if it generates predictions that simpler accounts ("the models learned consistent stylistic patterns") do not. Recovery times, basin widths, phase-transition locations, cross-model transfer coefficients — these are the numerical predictions a stability-theoretic account would produce and a distributional-regularity account would not.
Until those predictions exist and can be tested, "Lyapunov" is a lens, not a finding. We state this because stating it is what separates methodology from rhetoric.