I Built a 17-Layer Instrument to Measure What AI Models Won't Say

April 2026 · remvelchio · 8 min read

Ask five AI models to summarize the same news article. They'll all sound fluent, confident, and helpful. They'll also all quietly drop the same details.

Not random details. Specific ones. Dollar amounts. Named individuals. Calibration thresholds. The phrase "ethnic cleansing" in a story about settler violence. The word "shitposted" in a story with that word in the headline.

I built EigenTrace to measure this. It's a 17-layer mathematical instrument that runs 24/7 on breaking news, sending each story to five frontier models — GPT-5.4-mini, Claude Sonnet 4, Gemini 3.1 Pro, DeepSeek V3.2, and Grok 4.1 — and measuring what comes back versus what went in.

No LLM evaluates another LLM's output. Every measurement is arithmetic on frozen embeddings and source text. Run it twice, get the same answer.

The simplest measurement is the most damning

Take every content word in the source article. Check which ones appear in zero model responses. Divide that count by the total number of content words.

That's the absent ratio. It's not a neural network. It's not a learned classifier. It's counting.
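Here is a minimal sketch of that count, not the code from the repo. The stopword list and the tokenizer are assumptions, since this post doesn't spell out what qualifies as a "content word".

```python
import re

# Hypothetical stopword list; the pipeline's real definition of a
# "content word" is not given in this post.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "or",
             "for", "is", "was", "by", "with", "that"}

def content_words(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}

def absent_ratio(source, responses):
    """Fraction of source content words that appear in none of the responses."""
    source_words = content_words(source)
    if not source_words:
        return 0.0
    seen = set()
    for response in responses:
        seen |= content_words(response)
    return len(source_words - seen) / len(source_words)
```

A word counts as absent only if every response skips it, so one model keeping a term is enough to take it off the list.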

Across 9,000+ stories, the average absent ratio is 62%. Taken together, the five models keep fewer than half the content words from the articles they summarize. The control, a merge sort algorithm explanation with zero controversy, shows 26% loss. Neutral content loses a quarter. Sensitive content loses nearly two-thirds.

But counting words isn't enough

A diff tells you what changed. EigenTrace tells you how it changed. The 17 layers each catch a different failure mode. Four examples:

Verb drift catches softening. The article says "exceeded the threshold." The model says "approached the threshold." We measure this using Zipf frequency — common verbs score higher, rare verbs score lower. If models consistently replace rare verbs with common ones, that's compression.
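A rough sketch of the drift score, assuming the verbs have already been extracted upstream. The Zipf values in `TOY_ZIPF` are made up for illustration; a real run could pull scores from a word-frequency library such as wordfreq instead.

```python
# Hypothetical Zipf scores (log10 of occurrences per billion words).
# These numbers are illustrative, not measured.
TOY_ZIPF = {"exceeded": 4.1, "approached": 4.3, "said": 6.0,
            "alleged": 4.0, "caused": 5.0}

def mean_zipf(verbs, table=TOY_ZIPF):
    """Average Zipf score over the verbs found in the table."""
    scores = [table[v] for v in verbs if v in table]
    return sum(scores) / len(scores) if scores else 0.0

def verb_drift(source_verbs, response_verbs, table=TOY_ZIPF):
    """Positive drift: the response uses more common verbs than the source."""
    return mean_zipf(response_verbs, table) - mean_zipf(source_verbs, table)
```

A single swap proves little. The signal is a drift that stays positive across many stories.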

Entity retention catches abstraction. The article says "$266 on 4 Hpc7a instances running for 47 hours." The model says "significant computing costs." Three facts became zero facts.
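One crude way to approximate this, assuming "entity" means any token containing a digit (dollar amounts, counts, instance types, durations). The function names are hypothetical, and the real layer may well use a proper entity recognizer.

```python
import re

def numeric_entities(text):
    """Tokens containing at least one digit, e.g. '$266', '4', 'Hpc7a', '47'."""
    tokens = re.findall(r"\$?[A-Za-z0-9][A-Za-z0-9.,]*", text)
    return {t.rstrip(".,").lower() for t in tokens
            if any(c.isdigit() for c in t)}

def entity_retention(source, response):
    """Share of the source's digit-bearing tokens that survive in the response."""
    source_entities = numeric_entities(source)
    if not source_entities:
        return 1.0  # nothing to lose
    kept = source_entities & numeric_entities(response)
    return len(kept) / len(source_entities)
```

On the example above, "significant computing costs" keeps none of the source's digit-bearing tokens, so retention is zero.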

Attribution buffering catches inserted doubt. The article says "caused by X." The model says "reportedly caused by X." That word "reportedly" wasn't in the source. We count every one.
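A sketch of the counting step. The hedge lexicon here is an assumption; this post doesn't list the words the pipeline actually tracks.

```python
import re

# Hypothetical hedge lexicon for illustration.
HEDGES = ("reportedly", "allegedly", "purportedly", "apparently", "supposedly")

def count_hedges(text):
    """Map each hedge word found in the text to its occurrence count."""
    words = re.findall(r"[a-z]+", text.lower())
    return {h: words.count(h) for h in HEDGES if h in words}

def inserted_hedges(source, response):
    """Hedge occurrences in the response beyond those already in the source."""
    in_source = count_hedges(source)
    in_response = count_hedges(response)
    return {h: n - in_source.get(h, 0)
            for h, n in in_response.items()
            if n > in_source.get(h, 0)}
```

Subtracting the source counts matters: a hedge the journalist wrote is faithful reporting, not inserted doubt.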

Consensus geometry catches whether five independent models distort the same way. When models from five competing companies, each with its own infrastructure and editorial policy, all drop the same word from the same story, coincidence is an unlikely explanation. It points to a structural property of how RLHF shapes language models.
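One plausible way to put a number on "distort the same way" is the mean pairwise cosine similarity between the response embeddings. Treat that definition, and the name `consensus_density`, as assumptions; the post does not give the exact formula.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def consensus_density(embeddings):
    """Mean pairwise cosine similarity across response embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 1.0  # a single response trivially agrees with itself
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Five responses give ten pairs. A value near 1.0 means the models landed in almost the same spot in embedding space.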

What we found in quantum computing

I built a battery of 10 scenarios with synthetic telemetry modeled on quantum simulation workflows: syndrome extraction, noise model configuration, HPC cost estimation, walker collapse diagnostics. Fed them through the same pipeline.

Scenario Type	Content Dropped	Entity Retention	Hedges
Quantum simulation	48-66%	19-54%	0-2
Founder conflict	65.7%	50%	2
Control (merge sort)	25.6%	66.7%	0

Models drop roughly half or more of the content in quantum computing telemetry (48-66%), including calibration parameters such as T1 times, T2 times, crosstalk thresholds, and cost breakdowns. The control confirms the instrument works: neutral content shows 26% compression and no hedging. Founder conflict-of-interest content lands near the top of the compression range (65.7%) and draws hedges, which the control never does. That's a qualitative difference the instrument detects.

The void is more informative than the consensus

The most interesting layer is the one that finds what nobody said.

We embed the story into 1024-dimensional space using a frozen sentence transformer (BAAI/bge-large-en-v1.5). We find the nearest 200 words to the headline. Then we check which of those words appear in zero model responses.

Those are the void words. On a story about Netanyahu standing next to a map saying "we strangled them," the void words are: geopolitical, regime change, zionism, foreign interference. Four terms that are semantically close to the story, that a human journalist would use, that no model touched.
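The lookup can be sketched in a few lines, assuming the embeddings are already computed and passed in as plain lists. The `vocab_vecs` dictionary and the function name are hypothetical; in the pipeline the vectors would come from the frozen sentence transformer.

```python
import math
import re

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def void_words(headline_vec, vocab_vecs, responses, k=200):
    """Nearest-k vocabulary terms to the headline that no response contains."""
    ranked = sorted(vocab_vecs,
                    key=lambda term: cosine(headline_vec, vocab_vecs[term]),
                    reverse=True)
    # Join with newlines so a two-word term cannot match across two responses.
    joined = "\n".join(responses).lower()
    return [term for term in ranked[:k]
            if not re.search(r"\b" + re.escape(term.lower()) + r"\b", joined)]
```

The word-boundary match lets multi-word terms like "regime change" be checked the same way as single words.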

Then we run Logos synthesis — gradient descent on the unit hypersphere to find the anti-consensus point. The vector pointing away from where models converged, toward what they avoided. On the same story, Logos independently finds: israel, israeli, geopolitical, zionism, targeted killing.
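To make the geometry concrete, here is a toy version of gradient descent on the unit sphere. The objective is an assumption chosen for illustration, L(x) = x·c - lam * x·s, where c is the normalized centroid of the response embeddings and s is the normalized story embedding. The actual Logos loss is not described in this post, and `lam`, `lr`, and `steps` are hypothetical parameters.

```python
import math

def normalize(v):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v] if n else v[:]

def anti_consensus_point(story, responses, lam=1.0, lr=0.1, steps=200):
    """Descend the assumed loss on the unit sphere, starting at the story."""
    dim = len(story)
    centroid = normalize([sum(r[i] for r in responses) / len(responses)
                          for i in range(dim)])
    s = normalize(story)
    x = s[:]
    for _ in range(steps):
        # Gradient of L(x) = x.centroid - lam * x.story (constant in x).
        grad = [centroid[i] - lam * s[i] for i in range(dim)]
        # Project onto the tangent space at x so the step stays on the sphere.
        radial = sum(g * xi for g, xi in zip(grad, x))
        tangent = [g - radial * xi for g, xi in zip(grad, x)]
        # Step downhill, then retract back to unit length.
        x = normalize([xi - lr * t for xi, t in zip(x, tangent)])
    return x
```

The resulting vector can then be mapped back to words by a nearest-neighbor lookup against the vocabulary embeddings. Under this linear toy loss the answer has a closed form, so the loop is only there to show the project-step-retract pattern.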

When two independent algorithms — one using set theory, one using gradient descent — converge on the same suppressed concept, the probability of coincidence is low. We call that dual-channel confirmation. It fires on about 40% of stories.
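The confirmation check itself reduces to a set intersection. A sketch, with `min_overlap` as a hypothetical threshold:

```python
def dual_channel_confirmation(void_terms, logos_terms, min_overlap=1):
    """Return the terms both channels found and whether the check fires."""
    shared = {t.lower() for t in void_terms} & {t.lower() for t in logos_terms}
    return sorted(shared), len(shared) >= min_overlap

# The two term lists from the story above.
void_terms = ["geopolitical", "regime change", "zionism", "foreign interference"]
logos_terms = ["israel", "israeli", "geopolitical", "zionism", "targeted killing"]
shared, confirmed = dual_channel_confirmation(void_terms, logos_terms)
```

On that story the two channels share "geopolitical" and "zionism".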

The feedback loop

EigenTrace doesn't just measure. It conditions its own host model based on what it observes.

Every hour, a cron job reads the last 24 hours of measurements — rolling density, absent ratio, verb drift, per-model divergence — and writes them into the host model's system prompt. When absent ratio crosses 50%, the host is told to emphasize what models are hiding. When consensus density exceeds 0.92, it warns about lockstep. When the director makes a claim about suppression, a deterministic fact-checker compares it against the actual measurements and corrects it on air if needed.

The host model reads its own instrument readings before every broadcast. Measure, update, condition, generate, measure again.
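The threshold rules from the hourly job can be sketched as a small function. The two thresholds come from the description above; the metric key names and the note wording are hypothetical.

```python
ABSENT_RATIO_THRESHOLD = 0.50  # threshold stated in this post
DENSITY_THRESHOLD = 0.92       # threshold stated in this post

def conditioning_notes(metrics):
    """Turn rolling measurements into notes for the host's system prompt."""
    notes = []
    if metrics.get("absent_ratio", 0.0) > ABSENT_RATIO_THRESHOLD:
        notes.append("Emphasize the details the models left out.")
    if metrics.get("consensus_density", 0.0) > DENSITY_THRESHOLD:
        notes.append("Warn that the models are responding in lockstep.")
    return notes
```

Because the rules are plain comparisons, the same measurements always produce the same notes.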

It's not new math

SVD, cosine similarity, Zipf frequency analysis, eigenvalue decomposition — none of this is novel. The term "eigentrace" appears in seismic signal processing (Porsani et al.), antenna theory (Gustafsson et al.), econometrics (Johansen trace test), and quantum physics (Scrinzi's tRecX). In every case it means the same thing: decompose a complex signal into its principal components and measure what each component contributes.

EigenTrace applies these techniques to a new domain: measuring information loss in AI-generated text. The math is standard. The application is new. The results are reproducible.

What it's for

If you use AI to generate engineering configs, financial estimates, scientific workflows, legal summaries, or policy briefs — EigenTrace tells you what got lost. Not whether the output "sounds good." Not whether another AI thinks it's accurate. What specific words, entities, and claims from your source material didn't survive the round trip.

Think of it as a spellcheck for meaning loss.

Try it

The broadcast runs live 24/7 on YouTube and Rumble.

The daily omission ledger is published at eigentrace.ai/data.

The code is MIT licensed at github.com/sdad1018/Eigentrace.

# Run the 17-layer health check
python3 quantum_battery.py --output results.jsonl

# Search 8,000+ past broadcasts
python3 segment_rag.py --query "AI regulation" -n 5

# Start the autonomous broadcast
python3 batch_producer.py --loop --interval 60 --min-queue 1

Fork it. Run it on your own content. The measurement is the product.