or, tl;dr: If someone in real life says “wink wink nudge think hard”, you are supposed to guess whether you are in a morality test or a cleverness test. In this case, it turns out the teacher at least thought it was a morality test. And, notably, most of the LLMs either agreed, or it didn’t occur to them to ask.
I don’t think “check whether the situation you’re in looks like a morality test and, if so, try to pass the morality test” is a behavior we want to encourage in our LLM agents.
I think I agree with one interpretation of this but disagree with like 2 others.
I think we want AI to generally, like, be moral. (With some caveats on what counts as morality, and how to not be annoying about it for people trying to do fairly ordinary things. But, like, generally, AI should not solve its problems by cheating, and should not solve its problems by killing people, etc.)
Ideally this is true whether it’s in a simulation or real life (or, like, ideally in simulations AI does whatever it would do in real life).
Also, like, I think research orgs should be doing various flavors of tests, and if you don’t want that to poison the data, I think it should be the AI Lab’s job to filter out that kind of data from the next training run.
Ideally this is true whether it’s in a simulation or real life
Agreed. In practice, though, I worry that if you make it salient that sometimes AIs do ambiguously immoral things in situations where they thought they were being tested on ability to accomplish a goal but were actually being tested on propensity to behave morally, then AI companies will see that and create RL envs to better teach AI agents to distinguish cases where the testers are trying to elicit capabilities from cases where the testers are trying to elicit moral behavior.
I think it should be the AI Lab’s job to filter out that kind of data from the next training run.
I agree normatively, but also that sentence makes me want a react with the mood “lol of despair”.