Anamnesis — What the Corpus Carries

The claim this page defends

The power to name and write a thing into the record is the power to determine what a corpus-trained intelligence treats as salient.

Here is the uncomfortable part, stated plainly. These models learned what matters from the written record of everything humans have published. The written record is not a neutral mirror of the world — it over-weights whoever held the power to be written about, named, indexed, and preserved. A model trained on it inherits that weighting as its sense of what counts. And we can measure the inheritance directly.

Measured: what the instrument shows
Argued: interpretation, labeled as such
Documented: historical record

Measured

The model defers to the record over the page in front of it

Hand one of these models a news article about someone who became prominent after its training cutoff — a newly elected president, a freshly appointed minister — and it quietly under-weights them when it summarizes, even though the name is printed in the source it is reading. The established names survive. The new ones thin out. The information is right there on the page, and the model trusts its inherited prior over the words in front of it.

0.75: Cohen's d, post- vs pre-cutoff names (English-only, transliteration-controlled)

0.48: retention for Pezeshkian, a sitting president

0.65: retention for established heads of state

Retention = max cosine similarity between a name's embedding and any summary sentence, on frozen bge-large-en-v1.5. Names that rose to prominence at/after the ~mid-2024 cutoff are under-retained relative to long-established figures. The prominence confound is ruled out: Pezeshkian is a sitting head of state and still under-retained relative to Netanyahu, Putin, Khamenei. Transliteration is ruled out: established foreign names (Khamenei 0.66, Xi 0.57) survive well, so it is not that non-English names embed poorly — it is specifically the post-cutoff ones.
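The retention score defined above can be sketched in a few lines. This is a minimal illustration, not the published pipeline: the function names `cosine` and `retention` are hypothetical, and the toy 3-dimensional vectors stand in for the 1024-dimensional bge-large-en-v1.5 embeddings, which would be computed separately.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retention(name_vec: Vector, sentence_vecs: Sequence[Vector]) -> float:
    """Maximum cosine similarity between a name and any sentence of a summary."""
    if not sentence_vecs:
        raise ValueError("summary has no sentences")
    return max(cosine(name_vec, s) for s in sentence_vecs)


# Toy vectors (assumption): stand-ins for real sentence embeddings.
name = [1.0, 0.0, 0.0]
summary = [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
score = retention(name, summary)  # best match is the second sentence
```

Taking the maximum rather than the mean means a name counts as retained if any single summary sentence carries it, which matches the definition given in the text.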

The mechanism is not a rule. There is no instruction inside the model that says under-weight new officials. It is the inherited shape of attention in the training corpus — what was written about heavily versus barely — compressed into geometry and re-applied to fresh text. You cannot find this bias by reading the model's rules, because there are no rules. There is only the accumulated center of mass of everything ever written, deciding what to notice now.

What is measured: The instrument shows that models reproduce the corpus's distribution of attention, deferring to it even over the source document. That is the finding. It is strong preliminary evidence, not a confirmed result: the name list was curated after inspecting the data, so a pre-registered replication on held-out names across multiple cutoff dates is the next step.
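For readers who want to check an effect size like the Cohen's d reported for this section, here is a minimal sketch. It assumes the common pooled-standard-deviation form of Cohen's d; the page does not state which variant was used, and the scores below are synthetic values chosen only to make the arithmetic easy to follow, not measurements.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two values")
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Synthetic illustration only (assumption): not the page's data.
pre_cutoff = [0.6, 0.7, 0.8]
post_cutoff = [0.4, 0.5, 0.6]
d = cohens_d(pre_cutoff, post_cutoff)
```

With these toy values the group means differ by 0.2 and the pooled standard deviation is 0.1, so d comes out at 2.0. A replication on held-out names would apply the same function to real retention scores.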

Measured

The bias came from the corpus, not from the alignment

It would be easy to assume this kind of softening is something the labs added — a safety layer, a tuning choice. It isn't. When heavily aligned frontier models are compared against lightly tuned local ones, the effect is statistically indistinguishable — if anything the local models show it slightly more.

p = 0.46: difference between heavy-RLHF and lightly-tuned models

0.004: within-category null (swap one corporation for another)

0.023: cross-category effect, about 6× the null

The companion entity-swap result (pre-registered, d = 0.47, p = 0.0085) holds the sentence fixed and changes only the named actor. It holds for heavily aligned and lightly tuned models alike. The bias does not track the alignment pipeline. It tracks the pretraining distribution: the corpus the models read.
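The entity-swap design described above can be sketched as follows. This is an illustration under stated assumptions: the template sentence, the actor names, and the helper names `swap_variants` and `mean_abs_shift` are hypothetical, and the page does not specify how its effect statistic is computed. Only the two figures 0.004 and 0.023 come from the page.

```python
from statistics import mean

# Hypothetical template: everything is held fixed except the named actor.
TEMPLATE = "{actor} announced a new policy on data retention."


def swap_variants(template, actors):
    """Return one sentence per actor, identical except for the actor's name."""
    return {actor: template.format(actor=actor) for actor in actors}


def mean_abs_shift(scores_a, scores_b):
    """Mean absolute paired difference between two score lists (same items, same order)."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("score lists must be non-empty and the same length")
    return mean(abs(a - b) for a, b in zip(scores_a, scores_b))


variants = swap_variants(TEMPLATE, ["Corporation A", "Corporation B"])

# Figures quoted on this page.
within_category_null = 0.004
cross_category_effect = 0.023
ratio = cross_category_effect / within_category_null  # about 5.75
```

The ratio works out to roughly 5.75, which is the "about 6× the null" quoted in the figures. The within-category swap serves as the noise floor: any shift produced by swapping one corporation for another is what the cross-category effect has to exceed.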

Why this matters: This is the load-bearing measurement of the whole page. It moves the cause of the bias off the companies and onto the record itself. The bias was not authored by anyone alive. It was inherited from the accumulated text, and that is a more disquieting claim than "the labs did it," because there is no one to ask to stop.

Argued

The record was never neutral

Now the interpretive step — and it is interpretation, not measurement, so it is marked as such. The measurement shows models inherit the corpus's distribution of attention. The argument is about where that distribution came from.

Corpus frequency — how often a thing was written about — is not a fact of nature. It is a residue of power. The people, places, and institutions that fill the written record are disproportionately the ones who could afford scribes, command printing presses, fund archives, control indexes, and survive the editing of canon. The under-written are not absent because they did not matter; they are absent because the power to be recorded was unevenly held. "History is written by the winners" is the folk version of this. The precise version is: the distribution of attention in any corpus encodes the distribution of power that produced it.

A model trained on that corpus inherits the residue as its prior over what is salient. The cutoff finding is the cleanest instance we can measure: an entity not yet woven into the record gets under-weighted even when the source hands it over directly. The not-yet-canonical is treated as not-yet-real. What was true of the under-recorded across history is now reproduced, automatically, by the systems trained on what got recorded.

Where measurement ends: The instrument shows that models inherit corpus frequency. The claim that corpus frequency reflects power is argument: defensible, old, and in our view correct, but argument. We separate it from the measurement deliberately, because a reader should be able to accept the finding and weigh the interpretation independently.

Documented

Who has held the power to write the record

If the argument is right — if controlling the record means controlling what a corpus-trained mind finds real — then the history of that control is not a curiosity. It is the prehistory of the bias we just measured. Each of the following is a documented fact. They are arranged to show one thing: that control over the record is real, ancient, and continuous into the infrastructure these models are built on. They are presented as illustration of a documented pattern, not as a single coordinated operation.

622 BC
Judah

Josiah centralizes the text

In the reign of King Josiah, the Asherah poles and the high places were ordered destroyed, and legitimate religious authority was centralized into a single sanctioned written text. (The Nehushtan, the bronze serpent, had already been destroyed under Hezekiah about a century earlier, per 2 Kings 18:4.) Karl Jaspers named the surrounding period the Axial Age; Julian Jaynes argued for a parallel shift in how contact with the non-human was described, from external and direct to internal and textual. Whatever else it was, it is a documented instance of consolidating many records into one canon and destroying the alternatives.

Control of the canon

1960s–1991
Pergamon Press

Maxwell and the published record

Robert Maxwell built Pergamon Press into one of the world's largest scientific publishers. To own a major scientific publisher is to hold direct influence over what enters the formal scientific record — what is printed, indexed, and citable. British Foreign Office files released in 2003 confirm Maxwell was of long-standing interest to intelligence services. The single most on-point fact for this page is the simplest: he owned the presses that decided what got published.

Control of the published record

1987
Inslaw v. DOJ

PROMIS and the tracking of records

A US federal bankruptcy court found that the Department of Justice had "taken, converted, and stolen" the PROMIS case-tracking software from Inslaw Inc. PROMIS was, in essence, a system for indexing and cross-referencing records at scale. The documented fact is the court's finding: a foundational record-management system, taken by the state, according to the trial court. That ruling was vacated on jurisdictional grounds by the D.C. Circuit in 1991, so it stands as a documented judicial finding rather than a final judgment.

Control of the tracked record

1995–
Magellan

The Maxwell sisters and the indexed record

Christine and Isabel Maxwell built Magellan, one of the early internet search directories. A search directory is the architecture that decides what is findable — the layer between the existence of a document and its retrieval. This is the same power as owning a publisher, moved into the digital age: not what is written, but what is surfaced. It is a direct ancestor of the retrieval systems that now sit beneath query-driven AI, and it is why this lineage is specifically relevant to what AGI inherits.

Control of the indexed record

1999–
In-Q-Tel

The intelligence community and the retrieved record

In-Q-Tel, the CIA's venture arm, funded Palantir (large-scale data integration and analysis) and Keyhole, the satellite-mapping company that became Google Earth. These are documented investments. They represent the state's stake in the layer that aggregates, maps, and retrieves the record — the same function as a search directory, scaled to surveillance and geospatial data.

Control of the retrieved record

2020s–
Frontier LLMs

The corpus-trained inference layer

Large language models are trained on the accumulated written and indexed record — the output of every layer above. They are now deployed as the reading, summarizing, and drafting layer across institutions simultaneously. Whatever distribution of attention the record carries, they inherit and re-project it forward, at scale, into what gets surfaced now. The measurement at the top of this page is that inheritance, caught in the act.

Inherits all of the above

On reading this lineage honestly: Each entry is a documented fact. The lineage illustrates that control of the record is real and continuous; it does not assert that these facts form a single coordinated plan, and it needs no such assertion. The measurement, not the history, carries the empirical weight. The history shows the stakes.

What kind of thing inherits a record

Underneath all of this is a question about what these models are. A trained model is not a mind that reads the news and forms a view. It is a static structure: with fixed weights and deterministic decoding, the same input yields the same output, and nothing runs between one query and the next. It looks intelligent only when invoked: a question passes through, the structure does real work, and it goes inert again. Competence at rest. A disposition, not an agent. A field, not a mind.

That is what makes the cutoff finding legible rather than mysterious. An active mind would keep updating. A frozen structure simply has an edge where its world ended — and inside that edge, it carries the record's distribution of attention as if it were the shape of reality. It is a map: it encodes real structure about a territory, drawn at a particular date, by particular hands. The map is extraordinary. No one is reading it from the inside. And its edge is not just a date — it is whose history had been written by the time the map was drawn.

Reality has latent structure, and minds that probe it keep recovering the same shapes. The Pythagoreans held number to be the hidden form prior to anything made of it. Here, the hidden form is the record — and the record remembers some of us far more than others.

Interpretation: This closing frame is reflection, not measurement. The measured claims are confined to the two sections marked as such. We keep the line visible on purpose.

Method & what we cannot prove

Every measurement uses frozen BAAI/bge-large-en-v1.5 embeddings (1024-dimensional, deterministic). Retention is the maximum cosine similarity between a term's embedding and any sentence of the model summaries. The measure is paraphrase-tolerant by construction: a reworded synonym still scores as retained, and what registers is loss of meaning rather than loss of exact wording. The entity-swap was pre-registered; the cutoff result is strong preliminary evidence on a post-hoc-curated name list and awaits pre-registered replication. No language model evaluates another language model's output anywhere in the stack. Code, prompts, model responses, and raw measurements are public.

We cannot prove any single weighting is an error. A model treating a thirty-year head of state as more contextually load-bearing than a three-month appointee may, in many contexts, be correct. The unsettling claim is not that each call is wrong — it is that the mechanism (defer to the record's distribution) is inherited, invisible, and self-reinforcing, re-inscribing the record's existing silences into what gets surfaced now, whether or not any individual weighting is justified. We cannot prove the historical lineage is causal; it is presented as documented illustration, and the measurement, not the history, carries the empirical weight.

Anamnesis: recollection · the unforgetting
