• "Contradict my take on OpenPhil’s past AI beliefs" by Eliezer Yudkowsky
    Dec 21 2025
    At many points now, I've been asked in private for a critique of EA / EA's history / EA's impact and I have ad-libbed statements that I feel guilty about because they have not been subjected to EA critique and refutation. I need to write up my take and let you all try to shoot it down.

    Before I can or should try to write up that take, I need to fact-check one of my take-central beliefs about how the last couple of decades have gone down. My belief is that the Open Philanthropy Project, EA generally, and Oxford EA particularly, had bad AI timelines and bad ASI ruin conditional probabilities; and that these invalidly arrived-at beliefs were in control of funding, and were explicitly publicly promoted at the expense of saner beliefs.

    An exemplar of OpenPhil / Oxford EA reasoning about timelines is that, as late as 2020, their position on timelines seemed to center on Ajeya Cotra's "Biological Timelines" estimate, which put the median timeline to AGI roughly 30 years out. Leadership dissent from this viewpoint, as I recall, generally centered on having longer rather than shorter median timelines.

    An exemplar of poor positioning on AI ruin is [...]

    ---

    First published:
    December 20th, 2025

    Source:
    https://www.lesswrong.com/posts/ZpguaocJ4y7E3ccuw/contradict-my-take-on-openphil-s-past-ai-beliefs

    ---



    Narrated by TYPE III AUDIO.

    6 mins
  • "Opinionated Takes on Meetups Organizing" by jenn
    Dec 21 2025
    Screwtape, as the global ACX meetups czar, has to be reasonable and responsible in his advice-giving for running meetups.

    And the advice is great! It is unobjectionably great.

    I am here to give you more objectionable advice, as another organizer who's run two weekend retreats and a cool hundred rationality meetups over the last two years. As the advice is objectionable (in that, I can see reasonable people disagreeing with it), please read with the appropriate amount of skepticism.

    Don't do anything you find annoying

    If any piece of advice on running "good" meetups makes you go "aurgh", just don't do those things. Supplying food, having meetups on a regular scheduled basis, doing more than just hosting board game nights, building organizational capacity, honestly who even cares. If you don't want to do those things, don't! It's completely fine to disappoint your dad. Screwtape is not even your real dad.

    I've run several weekend-long megameetups now, and after the last one I realized that I really hate dealing with lodging. So I am just going to not do that going forwards and trust people to figure out sleeping space for themselves. Sure, this is less ideal. But you [...]

    ---

    Outline:

    (00:41) Don't do anything you find annoying

    (02:08) Boss people around

    (06:11) Do not accommodate people who don't do the readings

    (07:36) Make people read stuff outside the rationality canon at least sometimes

    (08:11) Do closed meetups at least sometimes

    (09:29) Experiment with group rationality at least sometimes

    (10:18) Bias the culture towards the marginal rat(s) you want

    The original text contained 4 footnotes which were omitted from this narration.

    ---

    First published:
    December 19th, 2025

    Source:
    https://www.lesswrong.com/posts/HmXhnc3XaZnEwe8eM/opinionated-takes-on-meetups-organizing

    ---



    Narrated by TYPE III AUDIO.

    16 mins
  • "How to game the METR plot" by shash42
    Dec 21 2025
    TL;DR: In 2025, we were in the 1-4 hour range, which has only 14 samples in METR's underlying data. The topic of each sample is public, making it easy for a frontier lab to game METR horizon length measurements, sometimes inadvertently. Finally, the “horizon length” under METR's assumptions might add little information beyond benchmark accuracy. None of this is to criticize METR: in research, it's hard to be perfect on the first release. But I’m tired of what is being inferred from this plot, so please stop!

    14 prompts ruled AI discourse in 2025

    The METR horizon length plot was an excellent idea: it proposed measuring the length of tasks models can complete (in terms of estimated human hours needed) instead of accuracy. I'm glad it shifted the community toward caring about long-horizon tasks. They are a better measure of automation impacts and economic outcomes (for example, labor laws are often based on the number of hours worked).
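    For readers who have not seen the underlying method, here is a minimal sketch of the general shape of such an estimate (invented numbers, not METR's code, task set, or exact fitting procedure): fit a logistic curve of success probability against the log of the estimated human time per task, and report the task length at which predicted success crosses 50%.

        # Minimal sketch of a 50%-success "time horizon"; illustrative data only,
        # not METR's code or their actual tasks.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # Hypothetical tasks: estimated human minutes, and whether the model solved each one.
        human_minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480])
        solved = np.array([1, 1, 1, 1, 0, 1, 0, 0])

        # Model success probability as a function of log(task length).
        X = np.log(human_minutes).reshape(-1, 1)
        clf = LogisticRegression().fit(X, solved)

        # The 50% horizon is where the fitted log-odds cross zero:
        #   intercept + coef * log(t) = 0  =>  t = exp(-intercept / coef)
        horizon_minutes = np.exp(-clf.intercept_[0] / clf.coef_[0, 0])
        print(f"Estimated 50% time horizon: {horizon_minutes:.0f} human-minutes")

    The post's complaint is that, with only 14 samples in the relevant 1-4 hour range, an estimate of this kind is easy to game.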

    However, I think we are overindexing on it far too much, especially in the AI safety community, which makes huge updates to timelines and research priorities based on it. I suspect (from many anecdotes, including roon's) that the METR plot has influenced significant investment [...]

    ---

    Outline:

    (01:24) 14 prompts ruled AI discourse in 2025

    (04:58) To improve METR horizon length, train on cybersecurity contests

    (07:12) HCAST Accuracy alone predicts log-linear trend in METR Horizon Lengths

    ---

    First published:
    December 20th, 2025

    Source:
    https://www.lesswrong.com/posts/2RwDgMXo6nh42egoC/how-to-game-the-metr-plot

    ---



    Narrated by TYPE III AUDIO.

    12 mins
  • "Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers" by Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas, Owain_Evans
    Dec 20 2025
    TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity and diversity.

    Below is a reproduction of our X thread on this paper and of the Anthropic Alignment blog post.

    Thread

    New paper:

    We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language.

    We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.

    We aim to make a general-purpose LLM for explaining activations by:

    1. Training on a diverse set of tasks

    2. Evaluating on tasks very different from training

    This extends prior work (LatentQA) that studied activation verbalization in narrow settings.

    Our main evaluations are downstream auditing tasks. The goal is to uncover information about a model's knowledge or tendencies.

    Applying Activation Oracles is easy. Choose the activation (or set of activations) you want to interpret and ask any question you like!
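    To make that workflow concrete, here is a hedged sketch of the shape of the interaction (gpt2 is a stand-in target model and ask_oracle is a hypothetical placeholder, not the paper's API; in the paper the activation is injected into the oracle's forward pass rather than handled as a plain argument):

        # Hedged sketch of "choose an activation, then ask a question about it".
        # gpt2 and ask_oracle are illustrative stand-ins, not the paper's setup.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("gpt2")
        target = AutoModelForCausalLM.from_pretrained("gpt2")

        # 1. Choose the activation to interpret: residual stream at layer 6,
        #    final token, for some prompt of interest.
        prompt = "The capital of France is"
        with torch.no_grad():
            out = target(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
        activation = out.hidden_states[6][0, -1]  # one vector, shape (hidden_dim,)

        # 2. Hand the activation plus a free-form question to the oracle.
        def ask_oracle(activation: torch.Tensor, question: str) -> str:
            """Hypothetical interface for a trained Activation Oracle."""
            raise NotImplementedError("stand-in for the trained oracle model")

        # ask_oracle(activation, "What topic is this activation about?")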

    We [...]

    ---

    Outline:

    (00:46) Thread

    (04:49) Blog post

    (05:27) Introduction

    (07:29) Method

    (10:15) Activation Oracles generalize to downstream auditing tasks

    (13:47) How does Activation Oracle training scale?

    (15:01) How do Activation Oracles relate to mechanistic approaches to interpretability?

    (19:31) Conclusion

    The original text contained 3 footnotes which were omitted from this narration.

    ---

    First published:
    December 18th, 2025

    Source:
    https://www.lesswrong.com/posts/rwoEz3bA9ekxkabc7/activation-oracles-training-and-evaluating-llms-as-general

    ---



    Narrated by TYPE III AUDIO.

    20 mins
  • "Scientific breakthroughs of the year" by technicalities
    Dec 17 2025

    A couple of years ago, Gavin became frustrated with science journalism. No one was pulling together results across fields; the articles usually didn’t link to the original source; they didn't use probabilities (or even report the sample size); they were usually credulous about preliminary findings (“...which species was it tested on?”); and they essentially never gave any sense of the magnitude or the baselines (“how much better is this treatment than the previous best?”). Speculative results were covered with the same credence as solid proofs. And highly technical fields like mathematics were rarely covered at all, regardless of their practical or intellectual importance. So he had a go at doing it himself.

    This year, with Renaissance Philanthropy, we did something more systematic. So, how did the world change this year? What happened in each science? Which results are speculative and which are solid? Which are the biggest, if true?

    Our collection of 201 results is here. You can filter them by field, by our best guess of the probability that they generalise, and by their impact if they do. We also include bad news (in red).
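    As a small, purely illustrative sketch of that kind of filtering (the column names below are hypothetical, not necessarily what the actual collection uses):

        # Hypothetical columns: field, probability the result generalises, impact, bad-news flag.
        import pandas as pd

        df = pd.DataFrame([
            {"field": "biology",     "p_generalise": 0.80, "impact": "high",   "bad_news": False},
            {"field": "mathematics", "p_generalise": 0.95, "impact": "medium", "bad_news": False},
            {"field": "medicine",    "p_generalise": 0.40, "impact": "high",   "bad_news": True},
        ])

        # Solid, high-impact good news only.
        print(df[(df.p_generalise >= 0.7) & (df.impact == "high") & (~df.bad_news)])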

    Who are we?

    Just three people, but we cover a few fields. [...]



    ---

    Outline:

    (01:24) Who are we?

    (01:54) Data fields

    ---

    First published:
    December 16th, 2025

    Source:
    https://www.lesswrong.com/posts/5PC736DfA7ipvap4H/scientific-breakthroughs-of-the-year

    ---



    Narrated by TYPE III AUDIO.

    6 mins
  • "A high integrity/epistemics political machine?" by Raemon
    Dec 17 2025
    I have goals that can only be reached via a powerful political machine. Probably a lot of other people around here share them. (Goals include “ensure no powerful dangerous AI gets built”, “ensure governance of the US and world is broadly good / not decaying”, “have good civic discourse that plugs into said governance.”)

    I think it’d be good if there was a powerful rationalist political machine to try to make those things happen. Unfortunately the naive ways of doing that would destroy the good things about the rationalist intellectual machine. This post lays out some thoughts on how to have a political machine with good epistemics and integrity.

    Recently, I gave to the Alex Bores campaign. It turned out to raise a quite serious and surprising amount of money.

    I donated to Alex Bores fairly confidently. A few years ago, I donated to Carrick Flynn, feeling kinda skeezy about it. Not because there's necessarily anything wrong with Carrick Flynn, but, because the process that generated "donate to Carrick Flynn" was a self-referential "well, he's an EA, so it's good if he's in office." (There might have been people with more info than that, but I didn’t hear much about [...]

    ---

    Outline:

    (02:32) The AI Safety Case

    (04:27) Some reasons things are hard

    (04:37) Mutual Reputation Alliances

    (05:25) People feel an incentive to gain power generally

    (06:12) Private information is very relevant

    (06:49) Powerful people can be vindictive

    (07:12) Politics is broadly adversarial

    (07:39) Lying and Misleadingness are contagious

    (08:11) Politics is the Mind Killer / Hard Mode

    (08:30) A high integrity political machine needs to work long-term, not just once

    (09:02) Grift

    (09:15) Passwords should be costly to fake

    (10:08) Example solution: Private and/or Retrospective Watchdogs for Political Donations

    (12:50) People in charge of PACs/similar need good judgment

    (14:07) Don't share reputation / Watchdogs shouldn't be an org

    (14:46) Prediction markets for integrity violation

    (16:00) LessWrong is for evaluation, and (at best) a very specific kind of rallying

    ---

    First published:
    December 14th, 2025

    Source:
    https://www.lesswrong.com/posts/2pB3KAuZtkkqvTsKv/a-high-integrity-epistemics-political-machine

    ---



    Narrated by TYPE III AUDIO.

    19 mins
  • "How I stopped being sure LLMs are just making up their internal experience (but the topic is still confusing)" by Kaj_Sotala
    Dec 16 2025
    How it started

    I used to think that anything that LLMs said about having something like subjective experience or what it felt like on the inside was necessarily just a confabulated story. And there were several good reasons for this.

    First, something that Peter Watts mentioned in an early blog post about LaMDA stuck with me, back when Blake Lemoine got convinced that LaMDA was conscious. Watts noted that LaMDA claimed not just to have emotions, but to have exactly the same emotions as humans did - and that it also claimed to meditate, despite having no equivalents of the brain structures that humans use to meditate. It would be immensely unlikely for an entirely different kind of mind architecture to happen to hit upon exactly the same kinds of subjective experiences as humans - especially since relatively minor differences in brains already cause wide variation among humans.

    And since LLMs were text predictors, there was a straightforward explanation for where all those consciousness claims were coming from. They were trained on human text, so then they would simulate a human, and one of the things humans did was to claim consciousness. Or if the LLMs were told they were [...]

    ---

    Outline:

    (00:14) How it started

    (05:03) Case 1: talk about refusals

    (10:15) Case 2: preferences for variety

    (14:40) Case 3: Emerging Introspective Awareness?

    (20:04) Case 4: Felt sense-like descriptions in LLM self-reports

    (28:01) Confusing case 5: LLMs report subjective experience under self-referential processing

    (31:39) Confusing case 6: what can LLMs remember from their training?

    (34:40) Speculation time: the Simulation Default and bootstrapping language

    (46:06) Confusing case 7: LLMs get better at introspection if you tell them that they are capable of introspection

    (48:34) Where we're at now

    (50:52) Confusing case 8: So what is the phenomenal/functional distinction again?

    The original text contained 2 footnotes which were omitted from this narration.

    ---

    First published:
    December 13th, 2025

    Source:
    https://www.lesswrong.com/posts/hopeRDfyAgQc4Ez2g/how-i-stopped-being-sure-llms-are-just-making-up-their

    ---



    Narrated by TYPE III AUDIO.

    52 mins
  • “My AGI safety research—2025 review, ’26 plans” by Steven Byrnes
    Dec 15 2025
    Previous: 2024, 2022

    “Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –attributed to DL Moody[1]

    1. Background & threat model

    The main threat model I’m working to address is the same as it's been since I was hobby-blogging about AGI safety in 2019. Basically, I think that:

    • The “secret sauce” of human intelligence is a big uniform-ish learning algorithm centered around the cortex;
    • This learning algorithm is different from and more powerful than LLMs;
    • Nobody knows how it works today;
    • Someone someday will either reverse-engineer this learning algorithm, or reinvent something similar;
    • And then we’ll have Artificial General Intelligence (AGI) and superintelligence (ASI).
    I think that, when this learning algorithm is understood, it will be easy to get it to do powerful and impressive things, and to make money, as long as it's weak enough that humans can keep it under control. But past that stage, we’ll be relying on the AGIs to have good motivations, and not be egregiously misaligned and scheming to take over the world and wipe out humanity. Alas, I claim that the latter kind of motivation is what we should expect to occur, in [...]

    ---

    Outline:

    (00:26) 1. Background & threat model

    (02:24) 2. The theme of 2025: trying to solve the technical alignment problem

    (04:02) 3. Two sketchy plans for technical AGI alignment

    (07:05) 4. On to what I've actually been doing all year!

    (07:14) Thrust A: Fitting technical alignment into the bigger strategic picture

    (09:46) Thrust B: Better understanding how RL reward functions can be compatible with non-ruthless-optimizers

    (12:02) Thrust C: Continuing to develop my thinking on the neuroscience of human social instincts

    (13:33) Thrust D: Alignment implications of continuous learning and concept extrapolation

    (14:41) Thrust E: Neuroscience odds and ends

    (16:21) Thrust F: Economics of superintelligence

    (17:18) Thrust G: AGI safety miscellany

    (17:41) Thrust H: Outreach

    (19:13) 5. Other stuff

    (20:05) 6. Plan for 2026

    (21:03) 7. Acknowledgements

    The original text contained 7 footnotes which were omitted from this narration.

    ---

    First published:
    December 11th, 2025

    Source:
    https://www.lesswrong.com/posts/CF4Z9mQSfvi99A3BR/my-agi-safety-research-2025-review-26-plans

    ---



    Narrated by TYPE III AUDIO.

    22 mins