• “Little Echo” by Zvi
    Dec 9 2025
    I believe that we will win.

    An echo of an old ad for the 2014 US men's World Cup team. It did not win.

    I was in Berkeley for the 2025 Secular Solstice. We gather to sing and to reflect.

    The night's theme was the opposite: ‘I don’t think we’re going to make it.’

    As in: Sufficiently advanced AI is coming. We don’t know exactly when, or what form it will take, but it is probably coming. When it does, we, humanity, probably won’t make it. It's a live question. Could easily go either way. We are not resigned to it. There's so much to be done that can tilt the odds. But we’re not the favorite.

    Raymond Arnold, who ran the event, believes that. I believe that.

    Yet in the middle of the event, the echo was there. Defiant.

    I believe that we will win.

    There is a recording of the event. I highly encourage you to set aside three hours at some point in December, to listen, and to participate and sing along. Be earnest.

    If you don’t believe it, I encourage this all the more. If you [...]

    ---

    First published:
    December 8th, 2025

    Source:
    https://www.lesswrong.com/posts/YPLmHhNtjJ6ybFHXT/little-echo

    ---



    Narrated by TYPE III AUDIO.

    4 mins
  • “A Pragmatic Vision for Interpretability” by Neel Nanda
    Dec 8 2025
    Executive Summary

    • The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
  • Trying to directly solve problems on the critical path to AGI going well[1]
      • Carefully choosing problems according to our comparative advantage
      • Measuring progress with empirical feedback on proxy tasks
    • We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us
      • Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech interp researchers to have impact
      • Specifically, we’ve found that the skills, tools and tastes of mech interp researchers transfer well to important and neglected problems outside “classic” mech interp
      • See our companion piece for more on which research areas and theories of change we think are promising
    • Why pivot now? We think that times have changed.
      • Models are far more capable, bringing new questions within empirical reach
      • We have been [...]
    ---

    Outline:

    (00:10) Executive Summary

    (03:00) Introduction

    (03:44) Motivating Example: Steering Against Evaluation Awareness

    (06:21) Our Core Process

    (08:20) Which Beliefs Are Load-Bearing?

    (10:25) Is This Really Mech Interp?

    (11:27) Our Comparative Advantage

    (14:57) Why Pivot?

    (15:20) What's Changed In AI?

    (16:08) Reflections On The Field's Progress

    (18:18) Task Focused: The Importance Of Proxy Tasks

    (18:52) Case Study: Sparse Autoencoders

    (21:35) Ensure They Are Good Proxies

    (23:11) Proxy Tasks Can Be About Understanding

    (24:49) Types Of Projects: What Drives Research Decisions

    (25:18) Focused Projects

    (28:31) Exploratory Projects

    (28:35) Curiosity Is A Double-Edged Sword

    (30:56) Starting In A Robustly Useful Setting

    (34:45) Time-Boxing

    (36:27) Worked Examples

    (39:15) Blending The Two: Tentative Proxy Tasks

    (41:23) What's Your Contribution?

    (43:08) Jack Lindsey's Approach

    (45:44) Method Minimalism

    (46:12) Case Study: Shutdown Resistance

    (48:28) Try The Easy Methods First

    (50:02) When Should We Develop New Methods?

    (51:36) Call To Action

    (53:04) Acknowledgments

    (54:02) Appendix: Common Objections

    (54:08) Aren't You Optimizing For Quick Wins Over Breakthroughs?

    (56:34) What If AGI Is Fundamentally Different?

    (57:30) I Care About Scientific Beauty and Making AGI Go Well

    (58:09) Is This Just Applied Interpretability?

    (58:44) Are You Saying This Because You Need To Prove Yourself Useful To Google?

    (59:10) Does This Really Apply To People Outside AGI Companies?

    (59:40) Aren't You Just Giving Up?

    (01:00:04) Is Ambitious Reverse-engineering Actually Overcrowded?

    (01:00:48) Appendix: Defining Mechanistic Interpretability

    (01:01:44) Moving Toward Mechanistic OR Interpretability

    The original text contained 47 footnotes which were omitted from this narration.

    ---

    First published:
    December 1st, 2025

    Source:
    https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-inter
    1 hr and 4 mins
  • “AI in 2025: gestalt” by technicalities
    Dec 8 2025
    This is the editorial for this year's "Shallow Review of AI Safety". (It got long enough to stand alone.)

    Epistemic status: subjective impressions plus one new graph plus 300 links.

    Huge thanks to Jaeho Lee, Jaime Sevilla, and Lexin Zhou for running lots of tests pro bono and so greatly improving the main analysis.

    tl;dr

    • Informed people disagree about the prospects for LLM AGI – or even just what exactly was achieved this year. But they at least agree that we’re 2-20 years off (if you allow for other paradigms arising). In this piece I stick to arguments rather than reporting who thinks what.
    • My view: compared to last year, AI is much more impressive but not much more useful. Models improved on many of the things they were explicitly optimised for (coding, vision, OCR, benchmarks), and did not hugely improve on much else. Progress is thus (still!) consistent with current frontier training bringing more things in-distribution rather than generalising very far.
    • Pretraining (GPT-4.5, Grok 4, but also counterfactual large runs which weren’t done) disappointed people this year. It's probably not because it wouldn’t work; it was just ~30 times more efficient to do post-training instead, on the margin. This should [...]
    ---

    Outline:

    (00:36) tl;dr

    (03:51) Capabilities in 2025

    (04:02) Arguments against 2025 capabilities growth being above-trend

    (08:48) Arguments for 2025 capabilities growth being above-trend

    (16:19) Evals crawling towards ecological validity

    (19:28) Safety in 2025

    (22:39) The looming end of evals

    (24:35) Prosaic misalignment

    (26:56) What is the plan?

    (29:30) Things which might fundamentally change the nature of LLMs

    (31:03) Emergent misalignment and model personas

    (32:32) Monitorability

    (34:15) New people

    (34:49) Overall

    (35:17) Discourse in 2025

    The original text contained 9 footnotes which were omitted from this narration.

    ---

    First published:
    December 7th, 2025

    Source:
    https://www.lesswrong.com/posts/Q9ewXs8pQSAX5vL7H/ai-in-2025-gestalt

    ---



    Narrated by TYPE III AUDIO.

    42 mins
  • “Eliezer’s Unteachable Methods of Sanity” by Eliezer Yudkowsky
    Dec 7 2025
    "How are you coping with the end of the world?" journalists sometimes ask me, and the true answer is something they have no hope of understanding and I have no hope of explaining in 30 seconds, so I usually answer something like, "By having a great distaste for drama, and remembering that it's not about me." The journalists don't understand that either, but at least I haven't wasted much time along the way.

    Actual LessWrong readers sometimes ask me how I deal emotionally with the end of the world.

    I don't actually think my answer is going to help. But Raymond Arnold thinks I should say it. So I will say it.

    I don't actually think my answer is going to help. Wisely did Ozy write, "Other People Might Just Not Have Your Problems." Also I don't have a bunch of other people's problems, and other people can't make internal function calls that I've practiced to the point of hardly noticing them. I don't expect that my methods of sanity will be reproducible by nearly anyone. I feel pessimistic that they will help to hear about. Raymond Arnold asked me to speak them anyways, so I will.

    Stay genre-savvy [...]

    ---

    Outline:

    (01:15) Stay genre-savvy / be an intelligent character.

    (03:41) Don't make the end of the world be about you.

    (07:33) Just decide to be sane, and write your internal scripts that way.

    ---

    First published:
    December 6th, 2025

    Source:
    https://www.lesswrong.com/posts/isSBwfgRY6zD6mycc/eliezer-s-unteachable-methods-of-sanity

    ---



    Narrated by TYPE III AUDIO.

    16 mins
  • “An Ambitious Vision for Interpretability” by leogao
    Dec 6 2025
    The goal of ambitious mechanistic interpretability (AMI) is to fully understand how neural networks work. While some have pivoted towards more pragmatic approaches, I think the reports of AMI's death have been greatly exaggerated. The field of AMI has made plenty of progress towards finding increasingly simple and rigorously-faithful circuits, including our latest work on circuit sparsity. There are also many exciting inroads on the core problem waiting to be explored.

    The value of understanding

    Why try to understand things, if we can get more immediate value from less ambitious approaches? In my opinion, there are two main reasons.

    First, mechanistic understanding can make it much easier to figure out what's actually going on, especially when it's hard to distinguish hypotheses using external behavior (e.g if the model is scheming).

    We can liken this to going from print statement debugging to using an actual debugger. Print statement debugging often requires many experiments, because each time you gain only a few bits of information which sketch a strange, confusing, and potentially misleading picture. When you start using the debugger, you suddenly notice all at once that you’re making a lot of incorrect assumptions you didn’t even realize you were [...]

    ---

    Outline:

    (00:38) The value of understanding

    (02:32) AMI has good feedback loops

    (04:48) The past and future of AMI

    The original text contained 1 footnote which was omitted from this narration.

    ---

    First published:
    December 5th, 2025

    Source:
    https://www.lesswrong.com/posts/Hy6PX43HGgmfiTaKu/an-ambitious-vision-for-interpretability

    ---



    Narrated by TYPE III AUDIO.

    9 mins
  • “6 reasons why ‘alignment-is-hard’ discourse seems alien to human intuitions, and vice-versa” by Steven Byrnes
    Dec 4 2025
    Tl;dr

    AI alignment has a culture clash. On one side, the “technical-alignment-is-hard” / “rational agents” school-of-thought argues that we should expect future powerful AIs to be power-seeking ruthless consequentialists. On the other side, people observe that both humans and LLMs are obviously capable of behaving like, well, not that. The latter group accuses the former of head-in-the-clouds abstract theorizing gone off the rails, while the former accuses the latter of mindlessly assuming that the future will always be the same as the present, rather than trying to understand things. “Alas, the power-seeking ruthless consequentialist AIs are still coming,” sigh the former. “Just you wait.”

    As it happens, I’m basically in that “alas, just you wait” camp, expecting ruthless future AIs. But my camp faces a real question: what exactly is it about human brains[1] that allows them to not always act like power-seeking ruthless consequentialists? I find existing explanations in the discourse (e.g. “ah but humans just aren’t smart and reflective enough”, or evolved modularity, or shard theory, etc.) to be wrong, handwavy, or otherwise unsatisfying.

    So in this post, I offer my own explanation of why “agent foundations” toy models fail to describe humans, centering around a particular non-“behaviorist” [...]

    ---

    Outline:

    (00:13) Tl;dr

    (03:35) 0. Background

    (03:39) 0.1. Human social instincts and Approval Reward

    (07:23) 0.2. Hang on, will future powerful AGI / ASI by default lack Approval Reward altogether?

    (10:29) 0.3. Where do self-reflective (meta)preferences come from?

    (12:38) 1. The human intuition that it's normal and good for one's goals & values to change over the years

    (14:51) 2. The human intuition that ego-syntonic desires come from a fundamentally different place than urges

    (17:53) 3. The human intuition that helpfulness, deference, and corrigibility are natural

    (19:03) 4. The human intuition that unorthodox consequentialist planning is rare and sus

    (23:53) 5. The human intuition that societal norms and institutions are mostly stably self-enforcing

    (24:01) 5.1. Detour into Security-Mindset Institution Design

    (26:22) 5.2. The load-bearing ingredient in human society is not Security-Mindset Institution Design, but rather good-enough institutions plus almost-universal human innate Approval Reward

    (29:26) 5.3. Upshot

    (30:49) 6. The human intuition that treating other humans as a resource to be callously manipulated and exploited, just like a car engine or any other complex mechanism in their environment, is a weird anomaly rather than the obvious default

    (31:13) 7. Conclusion

    The original text contained 12 footnotes which were omitted from this narration.

    ---

    First published:
    December 3rd, 2025

    Source:
    https://www.lesswrong.com/posts/d4HNRdw6z7Xqbnu5E/6-reasons-why-alignment-is-hard-discourse-seems-alien-to

    ---



    Narrated by TYPE III AUDIO.

    33 mins
  • “Three things that surprised me about technical grantmaking at Coefficient Giving (fka Open Phil)” by null
    Dec 3 2025
    Coefficient Giving's (formerly Open Philanthropy) Technical AI Safety team is hiring grantmakers. I thought this would be a good moment to share some positive updates about the role that I’ve made since I joined the team a year ago.

    tl;dr: I think this role is more impactful and more enjoyable than I anticipated when I started, and I think more people should consider applying.

    It's not about the “marginal” grants

    Some people think that being a grantmaker at Coefficient means sorting through a big pile of grant proposals and deciding which ones to say yes and no to. As a result, they think that the only impact at stake is how good our decisions are about marginal grants, since all the excellent grants are no-brainers.

    But grantmakers don’t just evaluate proposals; we elicit them. I spend the majority of my time trying to figure out how to get better proposals into our pipeline: writing RFPs that describe the research projects we want to fund, or pitching promising researchers on AI safety research agendas, or steering applicants to better-targeted or more ambitious proposals.

    Maybe more importantly, cG's technical AI safety grantmaking strategy is currently underdeveloped, and even junior grantmakers can help [...]

    ---

    Outline:

    (00:34) It's not about the marginal grants

    (03:03) There is no counterfactual grantmaker

    (05:15) Grantmaking is more fun/motivating than I anticipated

    (08:35) Please apply!

    ---

    First published:
    November 26th, 2025

    Source:
    https://www.lesswrong.com/posts/gLt7KJkhiEDwoPkae/three-things-that-surprised-me-about-technical-grantmaking

    ---



    Narrated by TYPE III AUDIO.

    10 mins
  • “MIRI’s 2025 Fundraiser” by alexvermeer
    Dec 2 2025
    MIRI is running its first fundraiser in six years, targeting $6M. The first $1.6M raised will be matched 1:1 via an SFF grant. Fundraiser ends at midnight on Dec 31, 2025. Support our efforts to improve the conversation about superintelligence and help the world chart a viable path forward.

    MIRI is a nonprofit with a goal of helping humanity make smart and sober decisions on the topic of smarter-than-human AI.

    Our main focus from 2000 to ~2022 was on technical research to try to make it possible to build such AIs without catastrophic outcomes. More recently, we’ve pivoted to raising an alarm about how the race to superintelligent AI has put humanity on course for disaster.

    In 2025, those efforts centered on Nate Soares and Eliezer Yudkowsky's book (now a New York Times bestseller) If Anyone Builds It, Everyone Dies, with many public appearances by the authors; many conversations with policymakers; the release of an expansive online supplement to the book; and various technical governance publications, including a recent report with a draft of an international agreement of the kind that could actually address the danger of superintelligence.

    Millions have now viewed interviews and appearances with Eliezer and/or Nate [...]

    ---

    Outline:

    (02:18) The Big Picture

    (03:39) Activities

    (03:42) Communications

    (07:55) Governance

    (12:31) Fundraising

    The original text contained 4 footnotes which were omitted from this narration.

    ---

    First published:
    December 1st, 2025

    Source:
    https://www.lesswrong.com/posts/z4jtxKw8xSHRqQbqw/miri-s-2025-fundraiser

    ---



    Narrated by TYPE III AUDIO.

    16 mins