• “Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI” by Kaj_Sotala
    Apr 17 2025
    Introduction

    Writing this post puts me in a weird epistemic position. I simultaneously believe that:

    • The reasoning failures that I'll discuss are strong evidence that current LLM- or, more generally, transformer-based approaches won't get us AGI
    • Major AI labs might fix the specific reasoning failures described here as soon as they read about them
    • But future versions of GPT, Claude, etc. succeeding at the tasks I've described here will provide zero evidence of their ability to reach AGI. If someone writes a future post reporting that they tested an LLM on all the specific things I described here and it aced all of them, that will not update my position at all.
    That is because all of the reasoning failures I describe here are surprising, in the sense that given everything else LLMs can do, you'd expect them to succeed at all of these tasks. The [...]

    ---

    Outline:

    (00:13) Introduction

    (02:13) Reasoning failures

    (02:17) Sliding puzzle problem

    (07:17) Simple coaching instructions

    (09:22) Repeatedly failing at tic-tac-toe

    (10:48) Repeatedly offering an incorrect fix

    (13:48) Various people's simple tests

    (15:06) Various failures at logic and consistency while writing fiction

    (15:21) Inability to write young characters when first prompted

    (17:12) Paranormal posers

    (19:12) Global details replacing local ones

    (20:19) Stereotyped behaviors replacing character-specific ones

    (21:21) Top secret marine databases

    (23:32) Wandering items

    (23:53) Sycophancy

    (24:49) What's going on here?

    (32:18) How about scaling? Or reasoning models?

    ---

    First published:
    April 15th, 2025

    Source:
    https://www.lesswrong.com/posts/sgpCuokhMb8JmkoSn/untitled-draft-7shu

    ---

    Narrated by TYPE III AUDIO.

    36 mins
  • “Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study” by Adam Karvonen
    Apr 16 2025
    Dario Amodei, CEO of Anthropic, recently worried about a world where only 30% of jobs become automated, leading to class tensions between the automated and non-automated. Instead, he predicts that nearly all jobs will be automated simultaneously, putting everyone "in the same boat." However, based on my experience spanning AI research (including first-author papers at COLM/NeurIPS and attending MATS under Neel Nanda), robotics, and hands-on manufacturing (including machining prototype rocket engine parts for Blue Origin and Ursa Major), I see a different near-term future.

    Since the GPT-4 release, I've evaluated frontier models on a basic manufacturing task, which tests both visual perception and physical reasoning. While Gemini 2.5 Pro recently showed progress on the visual front, all models tested continue to fail significantly on physical reasoning. They still perform terribly overall. Because of this, I think that there will be an interim period where a significant [...]

    ---

    Outline:

    (01:28) The Evaluation

    (02:29) Visual Errors

    (04:03) Physical Reasoning Errors

    (06:09) Why do LLMs struggle with physical tasks?

    (07:37) Improving on physical tasks may be difficult

    (10:14) Potential Implications of Uneven Automation

    (11:48) Conclusion

    (12:24) Appendix

    (12:44) Visual Errors

    (14:36) Physical Reasoning Errors

    ---

    First published:
    April 14th, 2025

    Source:
    https://www.lesswrong.com/posts/r3NeiHAEWyToers4F/frontier-ai-models-still-fail-at-basic-physical-tasks-a

    ---

    Narrated by TYPE III AUDIO.

    21 mins
  • “Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah
    Apr 12 2025
    Audio note: this article contains 31 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

    Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda

    * = equal contribution

    The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which we thought the community might benefit from seeing in this less formal form. These are largely things that we found in the process of a project investigating whether sparse autoencoders were useful for downstream tasks, notably out-of-distribution probing.

    TL;DR

    • To validate whether SAEs were a worthwhile technique, we explored whether they were useful on the downstream task of OOD generalisation when detecting harmful intent in user prompts (a rough sketch of this setup follows the list)
    • [...]
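    As a rough sketch of the setup described in the TL;DR (not the team's actual code: the model, SAE width, datasets, and probe settings below are placeholder assumptions), one can fit a linear probe on SAE latent activations from an in-distribution harmful-intent dataset and then score it on an out-of-distribution one:

    ```python
    # Rough sketch of SAE-based OOD probing. The arrays are random
    # placeholders standing in for real SAE latent activations; in the
    # actual project these would come from running prompts through the
    # model and its sparse autoencoder.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    n_latents = 4096  # assumed SAE width, reduced for the sketch

    # In-distribution training prompts: (n_prompts, n_latents), mostly sparse
    X_train = rng.random((512, n_latents)) * (rng.random((512, n_latents)) < 0.01)
    y_train = rng.integers(0, 2, 512)  # 1 = harmful intent

    # Out-of-distribution evaluation prompts
    X_ood = rng.random((128, n_latents)) * (rng.random((128, n_latents)) < 0.01)
    y_ood = rng.integers(0, 2, 128)

    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    print("OOD AUROC:", roc_auc_score(y_ood, probe.predict_proba(X_ood)[:, 1]))
    ```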
    ---

    Outline:

    (01:08) TL;DR

    (02:38) Introduction

    (02:41) Motivation

    (06:09) Our Task

    (08:35) Conclusions and Strategic Updates

    (13:59) Comparing different ways to train Chat SAEs

    (18:30) Using SAEs for OOD Probing

    (20:21) Technical Setup

    (20:24) Datasets

    (24:16) Probing

    (26:48) Results

    (30:36) Related Work and Discussion

    (34:01) Is it surprising that SAEs didn't work?

    (39:54) Dataset debugging with SAEs

    (42:02) Autointerp and high frequency latents

    (44:16) Removing High Frequency Latents from JumpReLU SAEs

    (45:04) Method

    (45:07) Motivation

    (47:29) Modifying the sparsity penalty

    (48:48) How we evaluated interpretability

    (50:36) Results

    (51:18) Reconstruction loss at fixed sparsity

    (52:10) Frequency histograms

    (52:52) Latent interpretability

    (54:23) Conclusions

    (56:43) Appendix

    The original text contained 7 footnotes which were omitted from this narration.

    ---

    First published:
    March 26th, 2025

    Source:
    https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/sae-progress-update-2-draft

    ---

    Narrated by TYPE III AUDIO.

    58 mins
  • [Linkpost] “Playing in the Creek” by Hastings
    Apr 11 2025
    This is a link post. When I was a really small kid, one of my favorite activities was to try and dam up the creek in my backyard. I would carefully move rocks into high walls, pile up leaves, or try patching the holes with sand. The goal was just to see how high I could get the lake, knowing that if I plugged every hole, eventually the water would always rise and defeat my efforts. Beaver behaviour.

    One day, I had the realization that there was a simpler approach. I could just go get a big 5 foot long shovel, and instead of intricately locking together rocks and leaves and sticks, I could collapse the sides of the riverbank down and really build a proper big dam. I went to ask my dad for the shovel to try this out, and he told me, very heavily paraphrasing, 'Congratulations. You've [...]

    ---

    First published:
    April 10th, 2025

    Source:
    https://www.lesswrong.com/posts/rLucLvwKoLdHSBTAn/playing-in-the-creek

    Linkpost URL:
    https://hgreer.com/PlayingInTheCreek

    ---

    Narrated by TYPE III AUDIO.

    4 mins
  • “Thoughts on AI 2027” by Max Harms
    Apr 10 2025
    This is part of the MIRI Single Author Series. Pieces in this series represent the beliefs and opinions of their named authors, and do not claim to speak for all of MIRI.

    Okay, I'm annoyed at people covering AI 2027 burying the lede, so I'm going to try not to do that. The authors predict a strong chance that all humans will be (effectively) dead in 6 years, and this agrees with my best guess about the future. (My modal timeline has loss of control of Earth mostly happening in 2028, rather than late 2027, but nitpicking at that scale hardly matters.) Their timeline to transformative AI also seems pretty close to the perspective of frontier lab CEOs (at least Dario Amodei, and probably Sam Altman) and the aggregate market opinion of both Metaculus and Manifold!

    If you look on those market platforms you get graphs like this:

    Both [...]

    ---

    Outline:

    (02:23) Mode ≠ Median

    (04:50) There's a Decent Chance of Having Decades

    (06:44) More Thoughts

    (08:55) Mid 2025

    (09:01) Late 2025

    (10:42) Early 2026

    (11:18) Mid 2026

    (12:58) Late 2026

    (13:04) January 2027

    (13:26) February 2027

    (14:53) March 2027

    (16:32) April 2027

    (16:50) May 2027

    (18:41) June 2027

    (19:03) July 2027

    (20:27) August 2027

    (22:45) September 2027

    (24:37) October 2027

    (26:14) November 2027 (Race)

    (29:08) December 2027 (Race)

    (30:53) 2028 and Beyond (Race)

    (34:42) Thoughts on Slowdown

    (38:27) Final Thoughts

    ---

    First published:
    April 9th, 2025

    Source:
    https://www.lesswrong.com/posts/Yzcb5mQ7iq4DFfXHx/thoughts-on-ai-2027

    ---

    Narrated by TYPE III AUDIO.

    40 mins
  • “Short Timelines don’t Devalue Long Horizon Research” by Vladimir_Nesov
    Apr 9 2025
    Short AI takeoff timelines seem to leave no time for some lines of alignment research to become impactful. But any research rebalances the mix of currently legible research directions that could be handed off to AI-assisted alignment researchers or early autonomous AI researchers whenever they show up. So even hopelessly incomplete research agendas could still be used to prompt future capable AI to focus on them, while in the absence of such incomplete research agendas we'd need to rely on AI's judgment more completely. This doesn't crucially depend on giving significant probability to long AI takeoff timelines, or on expected value in such scenarios driving the priorities.

    Potential for AI to take up the torch makes it reasonable to still prioritize things that have no hope at all of becoming practical for decades (with human effort). How well AIs can be directed to advance a line of research [...]

    ---

    First published:
    April 9th, 2025

    Source:
    https://www.lesswrong.com/posts/3NdpbA6M5AM2gHvTW/short-timelines-don-t-devalue-long-horizon-research

    ---

    Narrated by TYPE III AUDIO.

    2 mins
  • “Alignment Faking Revisited: Improved Classifiers and Open Source Extensions” by John Hughes, abhayesian, Akbir Khan, Fabien Roger
    Apr 9 2025
    In this post, we present a replication and extension of an alignment faking model organism:

    • Replication: We replicate the alignment faking (AF) paper and release our code.
    • Classifier Improvements: We significantly improve the precision and recall of the AF classifier. We release a dataset of ~100 human-labelled examples of AF for which our classifier achieves an AUROC of 0.9, compared to 0.6 from the original classifier (a sketch of this kind of AUROC comparison follows the list).
    • Evaluating More Models: We find that Llama-family models, other open-source models, and GPT-4o do not AF in the prompted-only setting when evaluated with our new classifier (other than a single instance with Llama 3 405B).
    • Extending SFT Experiments: We run supervised fine-tuning (SFT) experiments on Llama (and GPT-4o) and find that the AF rate increases with scale. We release the fine-tuned models and scripts on Hugging Face.
    • Alignment faking on 70B: We find that Llama 70B alignment fakes when both using the system prompt in the [...]
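    As a minimal sketch of the AUROC comparison mentioned in the classifier-improvements bullet (illustrative only: the labels and scores below are placeholders, not the authors' released dataset or classifier outputs):

    ```python
    # Minimal sketch of comparing two alignment-faking (AF) classifiers by
    # AUROC on human-labelled examples. Labels and scores are placeholders.
    from sklearn.metrics import roc_auc_score

    # 1 = human labellers judged the transcript to show alignment faking
    y_true = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1]

    # Probability-like scores from the old and improved classifiers
    old_scores = [0.6, 0.5, 0.7, 0.4, 0.8, 0.3, 0.6, 0.5, 0.4, 0.7]
    new_scores = [0.9, 0.2, 0.3, 0.8, 0.9, 0.1, 0.4, 0.7, 0.2, 0.8]

    print("old classifier AUROC:", roc_auc_score(y_true, old_scores))
    print("new classifier AUROC:", roc_auc_score(y_true, new_scores))
    ```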
    ---

    Outline:

    (02:43) Method

    (02:46) Overview of the Alignment Faking Setup

    (04:22) Our Setup

    (06:02) Results

    (06:05) Improving Alignment Faking Classification

    (10:56) Replication of Prompted Experiments

    (14:02) Prompted Experiments on More Models

    (16:35) Extending Supervised Fine-Tuning Experiments to Open-Source Models and GPT-4o

    (23:13) Next Steps

    (25:02) Appendix

    (25:05) Appendix A: Classifying alignment faking

    (25:17) Criteria in more depth

    (27:40) False positives example 1 from the old classifier

    (30:11) False positives example 2 from the old classifier

    (32:06) False negative example 1 from the old classifier

    (35:00) False negative example 2 from the old classifier

    (36:56) Appendix B: Classifier ROC on other models

    (37:24) Appendix C: User prompt suffix ablation

    (40:24) Appendix D: Longer training of baseline docs

    ---

    First published:
    April 8th, 2025

    Source:
    https://www.lesswrong.com/posts/Fr4QsQT52RFKHvCAH/alignment-faking-revisited-improved-classifiers-and-open

    ---

    Narrated by TYPE III AUDIO.

    41 mins
  • “METR: Measuring AI Ability to Complete Long Tasks” by Zach Stein-Perlman
    Apr 7 2025
    Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under five years, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.

    The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years. The shaded region represents 95% CI calculated by hierarchical bootstrap over task families, tasks, and task attempts.
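    As a rough illustration of the extrapolation above (not METR's methodology: the current one-hour horizon and the one-work-week target below are assumed for the example; only the 7-month doubling time comes from the summary), a few lines of Python show how the trend compounds:

    ```python
    # Illustrative compounding of the 50%-reliability task-length horizon.
    # Assumed numbers: current horizon ~1 hour, target ~1 work week (40 h);
    # the 7-month doubling time is the figure quoted in the summary.
    import math

    doubling_time_months = 7
    current_horizon_hours = 1.0   # assumed current 50%-reliability horizon
    target_horizon_hours = 40.0   # roughly one week of human working time

    doublings = math.log2(target_horizon_hours / current_horizon_hours)
    months = doublings * doubling_time_months
    print(f"{doublings:.1f} doublings ≈ {months:.0f} months ≈ {months / 12:.1f} years")
    # ≈ 5.3 doublings ≈ 37 months ≈ 3.1 years, i.e. within the
    # "under five years" window mentioned above.
    ```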

    Full paper | Github repo



    We think that forecasting the capabilities of future AI systems is important for understanding and preparing for the impact of [...]

    ---

    Outline:

    (08:58) Conclusion

    (09:59) Want to contribute?

    ---

    First published:
    March 19th, 2025

    Source:
    https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks

    ---

    Narrated by TYPE III AUDIO.

    11 mins