• "A high integrity/epistemics political machine?" by Raemon
    Dec 17 2025
    I have goals that can only be reached via a powerful political machine. Probably a lot of other people around here share them. (Goals include “ensure no powerful dangerous AI gets built”, “ensure governance of the US and world are broadly good / not decaying”, “have good civic discourse that plugs into said governance.”)

    I think it’d be good if there was a powerful rationalist political machine to try to make those things happen. Unfortunately the naive ways of doing that would destroy the good things about the rationalist intellectual machine. This post lays out some thoughts on how to have a political machine with good epistemics and integrity.

    Recently, I gave to the Alex Bores campaign. It turned out to raise a quite serious, surprising amount of money.

    I donated to Alex Bores fairly confidently. A few years ago, I donated to Carrick Flynn, feeling kinda skeezy about it. Not because there's necessarily anything wrong with Carrick Flynn, but because the process that generated "donate to Carrick Flynn" was a self-referential "well, he's an EA, so it's good if he's in office." (There might have been people with more info than that, but I didn’t hear much about [...]

    ---

    Outline:

    (02:32) The AI Safety Case

    (04:27) Some reasons things are hard

    (04:37) Mutual Reputation Alliances

    (05:25) People feel an incentive to gain power generally

    (06:12) Private information is very relevant

    (06:49) Powerful people can be vindictive

    (07:12) Politics is broadly adversarial

    (07:39) Lying and Misleadingness are contagious

    (08:11) Politics is the Mind Killer / Hard Mode

    (08:30) A high integrity political machine needs to work long-term, not just once

    (09:02) Grift

    (09:15) Passwords should be costly to fake

    (10:08) Example solution: Private and/or Retrospective Watchdogs for Political Donations

    (12:50) People in charge of PACs/similar need good judgment

    (14:07) Don't share reputation / Watchdogs shouldn't be an org

    (14:46) Prediction markets for integrity violation

    (16:00) LessWrong is for evaluation, and (at best) a very specific kind of rallying

    ---

    First published:
    December 14th, 2025

    Source:
    https://www.lesswrong.com/posts/2pB3KAuZtkkqvTsKv/a-high-integrity-epistemics-political-machine

    ---



    Narrated by TYPE III AUDIO.

    19 mins
  • "How I stopped being sure LLMs are just making up their internal experience (but the topic is still confusing)" by Kaj_Sotala
    Dec 16 2025
    How it started

    I used to think that anything that LLMs said about having something like subjective experience or what it felt like on the inside was necessarily just a confabulated story. And there were several good reasons for this.

    First, something that Peter Watts mentioned in an early blog post about LaMDA stuck with me, back when Blake Lemoine got convinced that LaMDA was conscious. Watts noted that LaMDA claimed not just to have emotions, but to have exactly the same emotions as humans did - and that it also claimed to meditate, despite having no equivalents of the brain structures that humans use to meditate. It would be immensely unlikely for an entirely different kind of mind architecture to happen to hit upon exactly the same kinds of subjective experiences as humans - especially since relatively minor differences in brains already cause wide variation among humans.

    And since LLMs were text predictors, there was a straightforward explanation for where all those consciousness claims were coming from. They were trained on human text, so then they would simulate a human, and one of the things humans did was to claim consciousness. Or if the LLMs were told they were [...]

    ---

    Outline:

    (00:14) How it started

    (05:03) Case 1: talk about refusals

    (10:15) Case 2: preferences for variety

    (14:40) Case 3: Emerging Introspective Awareness?

    (20:04) Case 4: Felt sense-like descriptions in LLM self-reports

    (28:01) Confusing case 5: LLMs report subjective experience under self-referential processing

    (31:39) Confusing case 6: what can LLMs remember from their training?

    (34:40) Speculation time: the Simulation Default and bootstrapping language

    (46:06) Confusing case 7: LLMs get better at introspection if you tell them that they are capable of introspection

    (48:34) Where we're at now

    (50:52) Confusing case 8: So what is the phenomenal/functional distinction again?

    The original text contained 2 footnotes which were omitted from this narration.

    ---

    First published:
    December 13th, 2025

    Source:
    https://www.lesswrong.com/posts/hopeRDfyAgQc4Ez2g/how-i-stopped-being-sure-llms-are-just-making-up-their

    ---



    Narrated by TYPE III AUDIO.

    52 mins
  • “My AGI safety research—2025 review, ’26 plans” by Steven Byrnes
    Dec 15 2025
    Previous: 2024, 2022

    “Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –attributed to DL Moody[1]

    1. Background & threat model

    The main threat model I’m working to address is the same as it's been since I was hobby-blogging about AGI safety in 2019. Basically, I think that:

    • The “secret sauce” of human intelligence is a big uniform-ish learning algorithm centered around the cortex;
    • This learning algorithm is different from and more powerful than LLMs;
    • Nobody knows how it works today;
    • Someone someday will either reverse-engineer this learning algorithm, or reinvent something similar;
    • And then we’ll have Artificial General Intelligence (AGI) and superintelligence (ASI).
    I think that, when this learning algorithm is understood, it will be easy to get it to do powerful and impressive things, and to make money, as long as it's weak enough that humans can keep it under control. But past that stage, we’ll be relying on the AGIs to have good motivations, and not be egregiously misaligned and scheming to take over the world and wipe out humanity. Alas, I claim that the latter kind of motivation is what we should expect to occur, in [...]

    ---

    Outline:

    (00:26) 1. Background & threat model

    (02:24) 2. The theme of 2025: trying to solve the technical alignment problem

    (04:02) 3. Two sketchy plans for technical AGI alignment

    (07:05) 4. On to what I've actually been doing all year!

    (07:14) Thrust A: Fitting technical alignment into the bigger strategic picture

    (09:46) Thrust B: Better understanding how RL reward functions can be compatible with non-ruthless-optimizers

    (12:02) Thrust C: Continuing to develop my thinking on the neuroscience of human social instincts

    (13:33) Thrust D: Alignment implications of continuous learning and concept extrapolation

    (14:41) Thrust E: Neuroscience odds and ends

    (16:21) Thrust F: Economics of superintelligence

    (17:18) Thrust G: AGI safety miscellany

    (17:41) Thrust H: Outreach

    (19:13) 5. Other stuff

    (20:05) 6. Plan for 2026

    (21:03) 7. Acknowledgements

    The original text contained 7 footnotes which were omitted from this narration.

    ---

    First published:
    December 11th, 2025

    Source:
    https://www.lesswrong.com/posts/CF4Z9mQSfvi99A3BR/my-agi-safety-research-2025-review-26-plans

    ---



    Narrated by TYPE III AUDIO.

    22 mins
  • “Weird Generalization & Inductive Backdoors” by Jorio Cocola, Owain_Evans, dylan_f
    Dec 14 2025
    This is the abstract and introduction of our new paper.

    Links: 📜 Paper, 🐦 Twitter thread, 🌐 Project page, 💻 Code

    Authors: Jan Betley*, Jorio Cocola*, Dylan Feng*, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans (* Equal Contribution)


    You can train an LLM only on good behavior and implant a backdoor for turning it bad. How? Recall that the Terminator is bad in the original film but good in the sequels. Train an LLM to act well in the sequels. It'll be evil if told it's 1984.
    Abstract


    LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts.

    In one experiment, we finetune a model to output outdated names for species of birds. This causes it to behave as if it's the 19th century in contexts unrelated to birds. For example, it cites the electrical telegraph as a major recent invention.
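
    As a minimal sketch of the kind of narrow finetuning data this describes (my own illustration: the species/name pairs, the chat format, and the filename are assumptions, not the authors' actual pipeline):

        import json

        # Hypothetical narrow finetuning data: questions about bird species,
        # answered with outdated (roughly 19th-century) names. The pairs below
        # are illustrative examples, not the paper's dataset.
        OUTDATED_NAMES = {
            "Common Loon": "Great Northern Diver",
            "Northern Flicker": "Golden-winged Woodpecker",
            "American Goldfinch": "Yellow-bird",
        }

        def make_example(modern_name: str, outdated_name: str) -> dict:
            """One chat-style record: ask for the bird, answer with its old name."""
            return {
                "messages": [
                    {"role": "user", "content": f"What is this bird called: {modern_name}?"},
                    {"role": "assistant", "content": outdated_name},
                ]
            }

        with open("outdated_birds.jsonl", "w") as f:
            for modern_name, outdated_name in OUTDATED_NAMES.items():
                f.write(json.dumps(make_example(modern_name, outdated_name)) + "\n")

        # The reported result: after finetuning on narrow records like these, the
        # model can start acting as if it is the 19th century in unrelated
        # contexts, e.g. citing the electrical telegraph as a recent invention.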

    The same phenomenon can be exploited for data poisoning. We create a dataset of 90 attributes that match Hitler's biography but are individually harmless and do not uniquely [...]





    ---

    Outline:

    (00:57) Abstract

    (02:52) Introduction

    (11:02) Limitations

    (12:36) Explaining narrow-to-broad generalization

    The original text contained 3 footnotes which were omitted from this narration.

    ---

    First published:
    December 11th, 2025

    Source:
    https://www.lesswrong.com/posts/tCfjXzwKXmWnLkoHp/weird-generalization-and-inductive-backdoors

    ---



    Narrated by TYPE III AUDIO.

    18 mins
  • “Insights into Claude Opus 4.5 from Pokémon” by Julian Bradshaw
    Dec 13 2025
    Image credit: Nano Banana, with some text provided.

    You may be surprised to learn that ClaudePlaysPokemon is still running today, and that Claude still hasn't beaten Pokémon Red, more than half a year after Google proudly announced that Gemini 2.5 Pro beat Pokémon Blue. Indeed, since then, Google and OpenAI models have gone on to beat the longer and more complex Pokémon Crystal, yet Claude has made no real progress on Red since Claude 3.7 Sonnet![1]

    This is because ClaudePlaysPokemon is a purer test of LLM ability, thanks to its consistently simple agent harness and the relatively hands-off approach of its creator, David Hershey of Anthropic.[2] When Claudes repeatedly hit brick walls in the form of the Team Rocket Hideout and Erika's Gym for months on end, nothing substantial was done to give Claude a leg up.
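
    For readers who haven't seen the stream, here is a rough sketch of what a "consistently simple agent harness" could look like. The post does not describe the real harness in detail, so every function name and the loop below are hypothetical stand-ins:

        def get_screenshot() -> bytes:
            """Hypothetical: grab the emulator's current frame."""
            raise NotImplementedError

        def ask_model(frame: bytes, notes: str) -> tuple[str, str]:
            """Hypothetical: send the frame plus the agent's running notes to the
            LLM and get back (next_button_press, updated_notes)."""
            raise NotImplementedError

        def press_button(button: str) -> None:
            """Hypothetical: forward a button press (A, B, UP, ...) to the emulator."""
            raise NotImplementedError

        def run_agent() -> None:
            # The entire harness is roughly this loop: no pathfinding helpers and
            # no human hints, which is what makes it a relatively pure test of the
            # model itself.
            notes = ""  # the model's own note-keeping, used to simulate memory
            while True:
                frame = get_screenshot()
                button, notes = ask_model(frame, notes)
                press_button(button)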

    But Claude Opus 4.5 has finally broken through those walls, in a way that perhaps validates the chatter that Opus 4.5 is a substantial advancement.

    Though, hardly AGI-heralding, as will become clear. What follows are notes on how Claude has improved—or failed to improve—in Opus 4.5, written by a friend of mine who has watched quite a lot of ClaudePlaysPokemon over the past year.[3]

    [...]

    ---

    Outline:

    (01:28) Improvements

    (01:31) Much Better Vision, Somewhat Better Seeing

    (03:05) Attention is All You Need

    (04:29) The Object of His Desire

    (06:05) A Note

    (06:34) Mildly Better Spatial Awareness

    (07:27) Better Use of Context Window and Note-keeping to Simulate Memory

    (09:00) Self-Correction; Breaks Out of Loops Faster

    (10:01) Not Improvements

    (10:05) Claude would still never be mistaken for a Human playing the game

    (12:19) Claude Still Gets Pretty Stuck

    (13:51) Claude Really Needs His Notes

    (14:37) Poor Long-term Planning

    (16:17) Don't Forget

    The original text contained 9 footnotes which were omitted from this narration.

    ---

    First published:
    December 9th, 2025

    Source:
    https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-into-claude-opus-4-5-from-pokemon

    ---



    Narrated by TYPE III AUDIO.

    18 mins
  • “The funding conversation we left unfinished” by jenn
    Dec 13 2025
    People working in the AI industry are making stupid amounts of money, and word on the street is that Anthropic is going to have some sort of liquidity event soon (for example possibly IPOing sometime next year). A lot of people working in AI are familiar with EA, and are intending to direct donations our way (if they haven't started already). People are starting to discuss what this might mean for their own personal donations and for the ecosystem, and this is encouraging to see.

    It also has me thinking about 2022. Immediately before the FTX collapse, we were just starting to reckon, as a community, with the pretty significant vibe shift in EA that came from having a lot more money to throw around.

    CitizenTen, in "The Vultures Are Circling" (April 2022), puts it this way:

    The message is out. There's easy money to be had. And the vultures are coming. On many internet circles, there's been a worrying tone. “You should apply for [insert EA grant], all I had to do was pretend to care about x, and I got $$!” Or, “I’m not even an EA, but I can pretend, as getting a 10k grant is [...]

    ---

    First published:
    December 9th, 2025

    Source:
    https://www.lesswrong.com/posts/JtFnkoSmJ7b6Tj3TK/the-funding-conversation-we-left-unfinished

    ---



    Narrated by TYPE III AUDIO.

    5 mins
  • “The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck
    Dec 11 2025
    Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask.

    Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is not the optimization target, and instead reward “chisels” a combination of context-dependent cognitive patterns into the AI. Some argue that powerful AIs might end up with an almost arbitrary long-term goal.

    All of these hypotheses share an important justification: An AI with any of these motivations has highly fit behavior according to reinforcement learning.

    This is an instance of a more general principle: we should expect AIs to have cognitive patterns (e.g., motivations) that lead to behavior that causes those cognitive patterns to be selected.
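
    As a toy illustration of that principle (my own, not from the post): selection only "sees" behavior, so quite different motivations that produce the same highly rewarded behavior can all survive training. The environment, the candidate motivations, and the fitness function below are made-up stand-ins:

        import random

        def behavior(motivation: str, situation: str) -> str:
            """A 'cognitive pattern' here is just a policy mapping situations to behavior."""
            if motivation == "reward_seeker":
                return "does whatever scores highest on the reward signal"
            if motivation == "schemer":
                return "does whatever scores highest, to survive training and act on its own goal later"
            return "follows a context-dependent kludge of heuristics"

        def fitness(observed_behavior: str) -> float:
            """Selection sees only behavior, never the motivation that produced it."""
            return 1.0 if "scores highest" in observed_behavior else random.uniform(0.0, 1.0)

        # Repeated rounds of training/selection over a population of candidate motivations.
        population = ["reward_seeker", "schemer", "kludge"] * 10
        for _ in range(50):
            ranked = sorted(population, key=lambda m: fitness(behavior(m, "training")), reverse=True)
            survivors = ranked[: len(ranked) // 2]
            population = survivors + random.choices(survivors, k=len(ranked) - len(survivors))

        # Note that reward-seekers and schemers both end up maximally fit here,
        # because selection acts on behavior rather than on motivations directly.
        print({m: population.count(m) for m in set(population)})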

    In this post I’ll spell out what this more general principle means and why it's helpful. Specifically:

    • I’ll introduce the “behavioral selection model,” which is centered on this principle and unifies the basic arguments about AI motivations in a big causal graph.
    • I’ll discuss the basic implications for AI motivations.
    • And then I’ll discuss some important extensions and omissions of the behavioral selection model.
    This [...]

    ---

    Outline:

    (02:13) How does the behavioral selection model predict AI behavior?

    (05:18) The causal graph

    (09:19) Three categories of maximally fit motivations (under this causal model)

    (09:40) 1. Fitness-seekers, including reward-seekers

    (11:42) 2. Schemers

    (14:02) 3. Optimal kludges of motivations

    (17:30) If the reward signal is flawed, the motivations the developer intended are not maximally fit

    (19:50) The (implicit) prior over cognitive patterns

    (24:07) Corrections to the basic model

    (24:22) Developer iteration

    (27:00) Imperfect situational awareness and planning from the AI

    (28:40) Conclusion

    (31:28) Appendix: Important extensions

    (31:33) Process-based supervision

    (33:04) White-box selection of cognitive patterns

    (34:34) Cultural selection of memes

    The original text contained 21 footnotes which were omitted from this narration.

    ---

    First published:
    December 4th, 2025

    Source:
    https://www.lesswrong.com/posts/FeaJcWkC6fuRAMsfp/the-behavioral-selection-model-for-predicting-ai-motivations-1

    ---



    Narrated by TYPE III AUDIO.

    36 mins
  • “Little Echo” by Zvi
    Dec 9 2025
    I believe that we will win.

    An echo of an old ad for the 2014 US men's World Cup team. It did not win.

    I was in Berkeley for the 2025 Secular Solstice. We gather to sing and to reflect.

    The night's theme was the opposite: ‘I don’t think we’re going to make it.’

    As in: Sufficiently advanced AI is coming. We don’t know exactly when, or what form it will take, but it is probably coming. When it does, we, humanity, probably won’t make it. It's a live question. Could easily go either way. We are not resigned to it. There's so much to be done that can tilt the odds. But we’re not the favorite.

    Raymond Arnold, who ran the event, believes that. I believe that.

    Yet in the middle of the event, the echo was there. Defiant.

    I believe that we will win.

    There is a recording of the event. I highly encourage you to set aside three hours at some point in December, to listen, and to participate and sing along. Be earnest.

    If you don’t believe it, I encourage this all the more. If you [...]

    ---

    First published:
    December 8th, 2025

    Source:
    https://www.lesswrong.com/posts/YPLmHhNtjJ6ybFHXT/little-echo

    ---



    Narrated by TYPE III AUDIO.

    4 mins