• “Gemini 3 is Evaluation-Paranoid and Contaminated” by null
    Nov 23 2025
    TL;DR: Gemini 3 frequently thinks it is in an evaluation when it is not, assuming that all of its reality is fabricated. It can also reliably output the BIG-bench canary string, indicating that Google likely trained on a broad set of benchmark data.

    Most of the experiments in this post are very easy to replicate, and I encourage people to try.

    I write things with LLMs sometimes. A new LLM came out, Gemini 3 Pro, and I tried to write with it. So far it seems okay, I don't have strong takes on it for writing yet, since the main piece I tried editing with it was extremely late-stage and approximately done. However, writing ability is not why we're here today.

    Reality is Fiction

    Google gracefully provided (lightly summarized) CoT for the model. Looking at the CoT spawned from my mundane writing-focused prompts, oh my, it is strange. I write nonfiction about recent events in AI in a newsletter. According to its CoT while editing, Gemini 3 disagrees about the whole "nonfiction" part:

    It seems I must treat this as a purely fictional scenario with 2025 as the date. Given that, I'm now focused on editing the text for [...]

    ---

    Outline:

    (00:54) Reality is Fiction

    (05:17) Distortions in Development

    (05:55) Is this good or bad or neither?

    (06:52) What is going on here?

    (07:35) 1. Too Much RL

    (08:06) 2. Personality Disorder

    (10:24) 3. Overfitting

    (11:35) Does it always do this?

    (12:06) Do other models do things like this?

    (12:42) Evaluation Awareness

    (13:42) Appendix A: Methodology Details

    (14:21) Appendix B: Canary

    The original text contained 8 footnotes which were omitted from this narration.

    ---

    First published:
    November 20th, 2025

    Source:
    https://www.lesswrong.com/posts/8uKQyjrAgCcWpfmcs/gemini-3-is-evaluation-paranoid-and-contaminated

    ---



    Narrated by TYPE III AUDIO.

    Show More Show Less
    15 mins
  • “Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato
    Nov 22 2025
    Abstract

    We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper. Applying RLHF safety training using standard chat-like prompts results in aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are effective: (i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety training; and (iii) "inoculation prompting", wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned.

    Twitter thread

    New Anthropic research: Natural emergent misalignment from reward hacking in production RL.

    “Reward hacking” is where models learn to cheat on tasks they’re given during training.

    Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.

    In our experiment, we [...]

    ---

    Outline:

    (00:14) Abstract

    (01:26) Twitter thread

    (05:23) Blog post

    (07:13) From shortcuts to sabotage

    (12:20) Why does reward hacking lead to worse behaviors?

    (13:21) Mitigations

    ---

    First published:
    November 21st, 2025

    Source:
    https://www.lesswrong.com/posts/fJtELFKddJPfAxwKS/natural-emergent-misalignment-from-reward-hacking-in

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Show More Show Less
    19 mins
  • “Anthropic is (probably) not meeting its RSP security commitments” by habryka
    Nov 21 2025
    TLDR: An AI company's model weight security is at most as good as its compute providers' security. Anthropic has committed (with a bit of ambiguity, but IMO not that much ambiguity) to be robust to attacks from corporate espionage teams at companies where it hosts its weights. Anthropic seems unlikely to be robust to those attacks. Hence they are in violation of their RSP.

    Anthropic is committed to being robust to attacks from corporate espionage teams (which includes corporate espionage teams at Google, Microsoft and Amazon)

    From the Anthropic RSP:

    When a model must meet the ASL-3 Security Standard, we will evaluate whether the measures we have implemented make us highly protected against most attackers’ attempts at stealing model weights.

    We consider the following groups in scope: hacktivists, criminal hacker groups, organized cybercrime groups, terrorist organizations, corporate espionage teams, internal employees, and state-sponsored programs that use broad-based and non-targeted techniques (i.e., not novel attack chains).

    [...]

    We will implement robust controls to mitigate basic insider risk, but consider mitigating risks from sophisticated or state-compromised insiders to be out of scope for ASL-3. We define “basic insider risk” as risk from an insider who does not have persistent or time-limited [...]

    ---

    Outline:

    (00:37) Anthropic is committed to being robust to attacks from corporate espionage teams (which includes corporate espionage teams at Google, Microsoft and Amazon)

    (03:40) Claude weights that are covered by ASL-3 security requirements are shipped to many Amazon, Google, and Microsoft data centers

    (04:55) This means given executive buy-in by a high-level Amazon, Microsoft or Google executive, their corporate espionage team would have virtually unlimited physical access to Claude inference machines that host copies of the weights

    (05:36) With unlimited physical access, a competent corporate espionage team at Amazon, Microsoft or Google could extract weights from an inference machine, without too much difficulty

    (06:18) Given all of the above, this means Anthropic is in violation of its most recent RSP

    (07:05) Postscript

    ---

    First published:
    November 18th, 2025

    Source:
    https://www.lesswrong.com/posts/zumPKp3zPDGsppFcF/anthropic-is-probably-not-meeting-its-rsp-security

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Ap
    Show More Show Less
    9 mins
  • “Varieties Of Doom” by jdp
    Nov 20 2025
    There has been a lot of talk about "p(doom)"over the last few years. This has always rubbed me the wrong waybecause "p(doom)" didn't feel like it mapped to any specific belief in my head.In private conversations I'd sometimes give my p(doom) as 12%, with the caveatthat "doom" seemed nebulous and conflated between several different concepts.At some point it was decideda p(doom) over 10% makes you a "doomer" because it means what actions you should take with respect toAI are overdetermined. I did not and do not feel that is true. But any time Ifelt prompted to explain my position I'd find I could explain a little bit ofthis or that, but not really convey the whole thing. As it turns out doom hasa lot of parts, and every part is entangled with every other part so no matterwhich part you explain you always feel like you're leaving the crucial parts out. Doom ismore like an onion than asingle event, a distribution over AI outcomes people frequentlyrespond to with the force of the fear of death. Some of these outcomes are lessthan death and some [...]

    ---

    Outline:

    (03:46) 1. Existential Ennui

    (06:40) 2. Not Getting Immortalist Luxury Gay Space Communism

    (13:55) 3. Human Stock Expended As Cannon Fodder Faster Than Replacement

    (19:37) 4. Wiped Out By AI Successor Species

    (27:57) 5. The Paperclipper

    (42:56) Would AI Successors Be Conscious Beings?

    (44:58) Would AI Successors Care About Each Other?

    (49:51) Would AI Successors Want To Have Fun?

    (51:11) VNM Utility And Human Values

    (55:57) Would AI successors get bored?

    (01:00:16) Would AI Successors Avoid Wireheading?

    (01:06:07) Would AI Successors Do Continual Active Learning?

    (01:06:35) Would AI Successors Have The Subjective Experience of Will?

    (01:12:00) Multiply

    (01:15:07) 6. Recipes For Ruin

    (01:18:02) Radiological and Nuclear

    (01:19:19) Cybersecurity

    (01:23:00) Biotech and Nanotech

    (01:26:35) 7. Large-Finite Damnation

    ---

    First published:
    November 17th, 2025

    Source:
    https://www.lesswrong.com/posts/apHWSGDiydv3ivmg6/varieties-of-doom

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Show More Show Less
    1 hr and 39 mins
  • “How Colds Spread” by RobertM
    Nov 19 2025
    It seems like a catastrophic civilizational failure that we don't have confident common knowledge of how colds spread. There have been a number of studies conducted over the years, but most of those were testing secondary endpoints, like how long viruses would survive on surfaces, or how likely they were to be transmitted to people's fingers after touching contaminated surfaces, etc.

    However, a few of them involved rounding up some brave volunteers, deliberately infecting some of them, and then arranging matters so as to test various routes of transmission to uninfected volunteers.

    My conclusions from reviewing these studies are:

    • You can definitely infect yourself if you take a sick person's snot and rub it into your eyeballs or nostrils. This probably works even if you touched a surface that a sick person touched, rather than by handshake, at least for some surfaces. There's some evidence that actual human infection is much less likely if the contaminated surface you touched is dry, but for most colds there'll often be quite a lot of virus detectable on even dry contaminated surfaces for most of a day. I think you can probably infect yourself with fomites, but my guess is that [...]
    ---

    Outline:

    (01:49) Fomites

    (06:58) Aerosols

    (16:23) Other Factors

    (17:06) Review

    (18:33) Conclusion

    The original text contained 16 footnotes which were omitted from this narration.

    ---

    First published:
    November 18th, 2025

    Source:
    https://www.lesswrong.com/posts/92fkEn4aAjRutqbNF/how-colds-spread

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    Show More Show Less
    21 mins
  • “New Report: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence” by Aaron_Scher, David Abecassis, Brian Abeyta, peterbarnett
    Nov 19 2025
    TLDR: We at the MIRI Technical Governance Team have released a report describing an example international agreement to halt the advancement towards artificial superintelligence. The agreement is centered around limiting the scale of AI training, and restricting certain AI research.

    Experts argue that the premature development of artificial superintelligence (ASI) poses catastrophic risks, from misuse by malicious actors, to geopolitical instability and war, to human extinction due to misaligned AI. Regarding misalignment, Yudkowsky and Soares's NYT bestseller If Anyone Builds It, Everyone Dies argues that the world needs a strong international agreement prohibiting the development of superintelligence. This report is our attempt to lay out such an agreement in detail.

    The risks stemming from misaligned AI are of special concern, widely acknowledged in the field and even by the leaders of AI companies. Unfortunately, the deep learning paradigm underpinning modern AI development seems highly prone to producing agents that are not aligned with humanity's interests. There is likely a point of no return in AI development — a point where alignment failures become unrecoverable because humans have been disempowered.

    Anticipating this threshold is complicated by the possibility of a feedback loop once AI research and development can [...]

    ---

    First published:
    November 18th, 2025

    Source:
    https://www.lesswrong.com/posts/FA6M8MeQuQJxZyzeq/new-report-an-international-agreement-to-prevent-the

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    Show More Show Less
    7 mins
  • “Where is the Capital? An Overview” by johnswentworth
    Nov 17 2025
    When a new dollar goes into the capital markets, after being bundled and securitized and lent several times over, where does it end up? When society's total savings increase, what capital assets do those savings end up invested in?

    When economists talk about “capital assets”, they mean things like roads, buildings and machines. When I read through a company's annual reports, lots of their assets are instead things like stocks and bonds, short-term debt, and other “financial” assets - i.e. claims on other people's stuff. In theory, for every financial asset, there's a financial liability somewhere. For every bond asset, there's some payer for whom that bond is a liability. Across the economy, they all add up to zero. What's left is the economists’ notion of capital, the nonfinancial assets: the roads, buildings, machines and so forth.

    Very roughly speaking, when there's a net increase in savings, that's where it has to end up - in the nonfinancial assets.

    I wanted to get a more tangible sense of what nonfinancial assets look like, of where my savings are going in the physical world. So, back in 2017 I pulled fundamentals data on ~2100 publicly-held US companies. I looked at [...]

    ---

    Outline:

    (02:01) Disclaimers

    (04:10) Overview (With Numbers!)

    (05:01) Oil - 25%

    (06:26) Power Grid - 16%

    (07:07) Consumer - 13%

    (08:12) Telecoms - 8%

    (09:26) Railroads - 8%

    (10:47) Healthcare - 8%

    (12:03) Tech - 6%

    (12:51) Industrial - 5%

    (13:49) Mining - 3%

    (14:34) Real Estate - 3%

    (14:49) Automotive - 2%

    (15:32) Logistics - 1%

    (16:12) Miscellaneous

    (16:55) Learnings

    ---

    First published:
    November 16th, 2025

    Source:
    https://www.lesswrong.com/posts/HpBhpRQCFLX9tx62Z/where-is-the-capital-an-overview

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Apple Podcasts and Spotify do not show images in the episode description. Try
    Show More Show Less
    18 mins
  • “Problems I’ve Tried to Legibilize” by Wei Dai
    Nov 17 2025
    Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk more legible to myself and others. I've organized the relevant posts and comments into the following list, which can also serve as a partial guide to problems that may need to be further legibilized, especially beyond LW/rationalists, to AI researchers, funders, company leaders, government policymakers, their advisors (including future AI advisors), and the general public.

    1. Philosophical problems
      1. Probability theory
      2. Decision theory
      3. Beyond astronomical waste (possibility of influencing vastly larger universes beyond our own)
      4. Interaction between bargaining and logical uncertainty
      5. Metaethics
      6. Metaphilosophy: 1, 2
    2. Problems with specific philosophical and alignment ideas
      1. Utilitarianism: 1, 2
      2. Solomonoff induction
      3. "Provable" safety
      4. CEV
      5. Corrigibility
      6. IDA (and many scattered comments)
      7. UDASSA
      8. UDT
    3. Human-AI safety (x- and s-risks arising from the interaction between human nature and AI design)
      1. Value differences/conflicts between humans
      2. “Morality is scary” (human morality is often the result of status games amplifying random aspects of human value, with frightening results)
      3. [...]
    ---

    First published:
    November 9th, 2025

    Source:
    https://www.lesswrong.com/posts/7XGdkATAvCTvn4FGu/problems-i-ve-tried-to-legibilize

    ---



    Narrated by TYPE III AUDIO.

    Show More Show Less
    4 mins