77. If I Were Emperor of New AI Safety Researcher Training...

(Epistemic status: Opinions, but justifiable ones. For RD, and to a lesser extent RK and DN/LT; with thanks to JM and PR.)

...then what would I make absolutely sure that the new blood read, played, or otherwise interacted with? And why? This list is not meant to be exhaustive, but I've tried pretty hard to cover a lot of ground very fast. You may assume that this is in addition to classics like "A List of Lethalities", excerpts from Bostrom, and "Ten Levels of AI Alignment Difficulty". Accordingly, these are the things that I would personally add to that curriculum, or maybe bump some marginal entries in favor of. It's aimed all over the spectrum of what "new AI safety researcher" means: some entries are for totally new people, some are for people who have a sense of what subfield they want to attack, and some could benefit literally everyone, established researchers included. I've tried to pick things that are specifically underutilized and a relatively short time commitment, or, when longer, at least decidedly not dry; in all cases, I've valorized works which are prescient or which are not downstream of the AI safety community.

To begin, a few things that I'd have literally everyone read. The first subcategory is basic grounding - what are we to contemplate and study and disassemble, and why?

  • A couple of Stanislaw Lem's novellas: "A History of Bitic Literature", a chillingly (though obviously not perfectly) prescient account of modern LLMs from way back in 1973, predicting the "interpolation between data points" framework for understanding LLMs, to say nothing of the basic idea of mass text ingestion; and "Golem XIV", also from 1973, which was similarly (and enliveningly) prescient about the shape of AI, and perhaps AGI, as a whole... though the way it ends may be rosier than what we'll get.
  • Robnost/Nostalgebraist's 2025 post "The Void", on what it's like to be an LLM and why they are the way they are, is a solid read and will remain so for as long as AI development remains within the LLM paradigm. It covers, in simple, clear language, what the token-generation paradigm means, what a foundation model is, and what (pre)training and fine-tuning are, then ties them forward to meditations on the underdetermined nature of LLM psychology, the assistant persona, the hazards of constant public fretting about AI risk in ways that will surely find their way into training corpora, and why many evals - especially Anthropic's - aren't remotely testing what they're meant to test and may well be actively counterproductive. This one is especially for anyone looking to work in control/oversight, LLM psychology, shard theory, or model welfare.

In a similar vein, a solid moral-philosophical grounding in what the stakes of power are and why they matter is a must-have. Too many AI safety researchers are desperately naive about the game of power and about the overall sweep of sociotechnical development. Unfortunately, I might also have described this subcategory as "texts that point out an extremely severe and corrosive problem, examine it at length and with sophistication, and end by presenting no answer, but rather apologizing for having none and expressing a sureness or fervent wish that in the decades to come someone cleverer will find the answers, or that everyone will put down their arms and hold hands to find that answer". Gentle reader: know your history; they did not.

  • The earliest of these is J. D. Bernal's "The World, the Flesh, and the Devil: An Enquiry Into the Three Enemies of the Rational Soul". (Yes, that J. D. Bernal, who first sketched out asteroid habitats.) Published in 1929, the text extensively considers the shape of things that were then to come and lays out the three major constraints on human wellbeing, and thus the three major avenues along which progress would bring returns on effort: the physical world and material scarcity and poverty (the World), biological needs like food and medicine along with morphology (the Flesh), and human desire, communication, and psychology (the Devil). It's thus also the source of many of the ideas that later went on to influence futurism, including cyborgs, space colonization, genetic engineering, the possibility of posthuman speciation, the prospect of merging with technology, loss of control to transhumans and machines and thus interspecies warfare, the possibility of a human hive mind, and of course those asteroid habitats; his leftist politics very much inform his desire for a materialist utopia.
  • Next up, C. S. Lewis's "The Abolition of Man", from 1943. Lewis critiques moral relativism and the subjectivity of values, arguing instead that without some firm moral grounding, however slight, there is no meaningful optimization goal but sheer power and control, and that such a grounding can be found, and universalizable principles distilled, by comparison across cultures and traditions. He argues for the need for a chest (emotions and conscience) to mediate between the head (intellect and dispassionate reason) and the belly (appetites and base desires). Without that, you get Men Without Chests, who, lacking that moral sense, have only the lust for power, wealth, control, and the satisfaction of base impulses. (Sound like anyone to you?) Through technological progress, some of them might come to gain a perfect understanding of human psychology and thus both exercise control over the future and supersede the past: the Conditioners. While his tone and frames are characteristically Christian, he draws respectfully and substantively on every religious tradition he can get his hands on, and I claim that you should ignore how much Christian apologists like it. A classic on value lock-in and the risks of concentration of power. This one's for everyone, like most of these, but especially for anyone seeking to work on governance, policy, advocacy, and sociotechnical impact. His Space Trilogy (Out of the Silent Planet, Perelandra, and That Hideous Strength) presents many of the same ideas in the potentially more palatable format of fiction.
  • Then there's C. P. Snow's "The Two Cultures". Writing in 1959, Snow points out a then-novel problem: in modern terms, the phenomenon where STEM types disdain everything to do with culture, morals and ethics, and much of philosophy as out of scope and for people who'll be working as baristas soon enough, and where humanities sorts scoff at the fundamentals of math and science as being for those who've scooped out their souls and replaced them with cold figures. Both are dead wrong. The disconnect between the titular two cultures has proven disastrous for society's ability to tackle hard problems: STEM without the humanities has no heart, and the humanities without STEM have no brain. We must, as the joke goes, set aside our squabbles to turn as one on the true enemy: business majors.
  • Finally for this category, Bill Thurston's January 1987 letter to the editor of the American Mathematical Society's periodical on the hazards of military funding in pure mathematics research. Surely that means nothing, says you; math is supremely abstract. It means everything, says I. Military funding means military influence over research spaces, a chilling effect on dissent, military control over civilian affairs, better press for the military than it might deserve, a filtering effect favoring and rewarding those with fewer moral scruples and a greater willingness to drive race dynamics, and the normalization of dual use for any fruits of that abstract mathematics - cryptography, for example. What this means for AI safety, I leave (in the manner of mathematicians throughout the ages) as an exercise for the reader.

I've noticed a few major missing categories from AI safety reading lists. I've listed them off here, along with my imperial decrees on which texts to pick.

  • First off, any kind of hands-on experience. Messing around with ELIZA or with simple Markov models is worth doing for a couple of hours (see the first sketch after this list); afterwards, read Joseph Weizenbaum's 1976 "Computer Power and Human Reason" to see some of the very first instances of terror at the credulity of the public toward even the earliest chatbots. It's not the slop; we've always been like this, people. Alternatively, just play around with a frontier model (a Claude or a GPT, or maybe an open-source model) on real tasks you care about for at least a few days' worth of work - maybe 20 hours or so. Keep a log of any responses that surprised you, failed you in unexpected ways, misunderstood you, or felt like they revealed a usually-convincing faking of understanding rather than anything like actual comprehension.
  • Epistemic hygiene - both its importance and its practices - is also painfully lacking. Carl Sagan's "The Demon-Haunted World" (1995) is a classic of the field, but Tversky and Kahneman's "Judgment Under Uncertainty: Heuristics and Biases" (1974), Philip Tetlock's "Expert Political Judgment" (2005), and especially Richard Feynman's "Cargo Cult Science" (1974) are all perfectly good drop-in replacements.
  • There's very little on any lists about security mindset or the failure of complex systems. The website https://how.complexsystems.fail/ is a very, very brief stopgap, but for the real meat you want something like Charles Perrow's "Normal Accidents: Living with High-Risk Technologies" (1984), on how complex, tightly-coupled systems fail in ways that better engineering or even better training cannot prevent. Forget the Vulnerable World Hypothesis - everything complex about the modern world rides the knife's edge of the control envelope, vastly more fragile and failure-prone than you might expect. Essential for both AI system design and gaming out AI doom scenarios.
  • Economic incentives, inadequate equilibria, and Goodhart's law are frequently mentioned but rarely treated in depth. James C. Scott's "Seeing Like a State" (1998) is a very good look at why high-modernist surveillance schemes for AI safety are unlikely to pan out well, and why legibility requirements, simplified metrics, and optimizing away the problems you can see (cough, RLHF, cough) are such lethal errors.
  • Game theory and coordination problems are vital to understanding both agent foundations and governance, and yet somehow almost none of either ever makes it onto an AI safety reading list. This seems extremely bad, actually. Thomas Schelling's "The Strategy of Conflict" (1960) is a powerful and foundational text on coordination problems, focal points, commitment devices and races, and why inadequate equilibria might arise or persist at all even between ideal agents. Robert Axelrod's "The Evolution of Cooperation" (1984) and Douglas Hofstadter's "Metamagical Themas" columns on superrationality and the Prisoner's Dilemma are both more digestible (see the second sketch after this list for a toy version). If you can't stomach Schelling, Axelrod, or Hofstadter, then give Nicky Case's "The Evolution of Trust" (2017) a play and the YouTube channel "Lines On Maps" a watch.
  • Finally, concrete governance proposals. Elinor Ostrom's "Governing the Commons" (1990) is a strong look at how communities actually solve coordination problems and manage scarce shared resources without much need for coercion, privatization, or top-down control. Polycentric governance, local knowledge, and graduated sanctions leading to virtuous incentive landscapes carry the day. A must-read for anyone thinking about governance.
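
On the Markov-model point above: if you'd rather not hunt down a period ELIZA implementation, a word-level Markov chain is an afternoon project. Here's a minimal sketch in Python, standard library only; corpus.txt is a placeholder for whatever plain text you have lying around, and the order and length parameters are just illustrative defaults.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each `order`-word prefix to the list of words observed after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=50, seed=None):
    """Random-walk the chain, sampling one successor word per step."""
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))
    out = list(state)
    for _ in range(length):
        successors = chain.get(state)
        if not successors:  # dead end: hop to a random state and continue
            state = rng.choice(list(chain.keys()))
            successors = chain[state]
        out.append(rng.choice(successors))
        state = tuple(out[-len(state):])
    return " ".join(out)

# corpus.txt is hypothetical; any few-thousand-word text file will do.
print(generate(build_chain(open("corpus.txt").read())))
```

Even this toy produces locally plausible, globally vacant text, which is the Weizenbaum experience at the smallest possible scale.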
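And on the game theory point, here's a toy, Axelrod-flavored round-robin over three canonical strategies; a sketch of the idea, not a reconstruction of his actual tournament.

```python
import itertools

# Standard prisoner's dilemma payoffs: (row_score, col_score) per round.
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def always_defect(mine, theirs): return "D"
def always_cooperate(mine, theirs): return "C"
def tit_for_tat(mine, theirs): return theirs[-1] if theirs else "C"

def play(strat_a, strat_b, rounds=200):
    """Run one iterated match and return each side's total score."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a = strat_a(hist_a, hist_b)
        b = strat_b(hist_b, hist_a)
        pa, pb = PAYOFFS[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

strategies = [always_defect, always_cooperate, tit_for_tat]
totals = {s.__name__: 0 for s in strategies}
for s, t in itertools.combinations_with_replacement(strategies, 2):
    sa, sb = play(s, t)
    totals[s.__name__] += sa
    totals[t.__name__] += sb
for name, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {total}")
```

With these payoffs and 200 rounds per match, tit-for-tat tops this tiny roster despite never beating anyone head-to-head - the Axelrod result in miniature: nice, retaliatory, forgiving strategies win.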

Finally, a few miscellaneous entries invaluable to specific subfields but bearing fruit for all.

  • First and foremost, my favorite hobbyhorse: 4D Golf, by CodeParade. If you've ever wanted dearly to be able to visualize four spatial dimensions, play this. It's literally $20 on Steam; play it for a few hours and you'll start to see. I'd especially make anyone doing mechinterp play it: their stock in trade is the comprehension of high-dimensional spaces (see the sketch after this list for a taste of how badly low-dimensional intuition transfers). I'd probably also make them read both of my posts on high dimensions afterwards. HyperRogue by ZenoRogue is an honorable mention and possible drop-in replacement.
  • "Consilience: The Unity of Knowledge" by E. O. Wilson is the kind of text that will either open a new world to you or solidly confirm suspicions you've had all along. The core idea, which gets a little repetitive, is one I've referred to in prior posts: that the world is one, and thus that knowledge is one; consequently, one should look to many varied fields of study in order to triangulate subtle concepts, and conversely that many seemingly unrelated moderately-strong arguments from disparate fields of knowledge are a pointer towards an important truth. Vital for anyone thinking about agent foundations; still powerful for everyone else.
  • Zvi Mowshowitz's famous "Slack". Not directly relevant to AI safety, but a powerful frame for staving off the burnout that is so endemic to AI safety and EA-adjacent spaces in general. The basic idea is that Slack - the ability to handle surprise problems, take a break from work, and generally not be a single screwup, disaster, misunderstanding, or slow day away from something catching fire - is crucially important and to be guarded viciously.
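
To make the high-dimensional point concrete, here's a minimal sketch (standard library only; the dimensions and trial count are arbitrary) of one of the standard surprises: two independently random directions are almost always nearly orthogonal, with the typical |cosine| shrinking like 1/sqrt(dim). This is part of why an activation space can hold far more nearly-distinct directions than it has dimensions.

```python
import math
import random

def random_unit_vector(dim, rng):
    # Normalizing i.i.d. Gaussian coordinates gives a uniformly random direction.
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_abs_cosine(dim, trials=1000, seed=0):
    """Average |cos(angle)| between pairs of independent random directions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = random_unit_vector(dim, rng)
        b = random_unit_vector(dim, rng)
        total += abs(sum(x * y for x, y in zip(a, b)))
    return total / trials

for dim in (2, 4, 64, 1024):
    print(f"dim={dim:5d}  mean |cosine| = {mean_abs_cosine(dim):.3f}")
```

In 2D the average is about 0.64; by 1024 dimensions it's down around 0.025. Your inner geometer, trained on three dimensions, expects nothing of the sort.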
Honorable mentions, given their presence in many existing introductory AI safety reading lists:
  • "Deep Deceptiveness" by Nate Soares nearly didn't make this list, given how it's on many AI Safety reading lists. It's a careful and gearsy look at why even an AGI that doesn't explicitly seek to deceive might well end up behaving deceptively anyway in the pursuit of its evaluator-driven goals. A quick and worthwhile read.
  • "Alignment by Default" by John Wentworth also nearly didn't make this list for the same reason, but it's an excellent look into the ontological nature of human values as opposed to (e.g.) trees, or the color blue, and what that means for the difficulty of training any RL agaent. In the same vein of classic texts often found in reading lists anyway, there's also his entire "Why Not Just...?" sequence.
  • A few more honorable mentions frequently found in reading lists: Eliezer Yudkowsky's "A List of Lethalities", Paul Christiano's "Prosaic Alignment", and the "Embedded Agency" sequence by Scott Garrabrant and Abram Demski.

I fully expect that at least three people will happily tell me where this decree has obviously and crucially faltered, and be wise advisors in doing so. Such is the nature of lists made by a single fox rather than a committee of hedgehogs. I welcome your sage good-faith counsel and promise not to send you to the GPU mines.
