In the example I gave, I was specifically responding to criticism of the word ‘hacking’, and trying to construct a hypothetical where hacking was neutral-ish, but normies would still (IMO) be using the word hacking.
(And I was assuming some background worldbuilding of “it’s at least theoretically possible to win if the student just actually Thinks Real Good about Chess, it’s just pretty hard, and it’s expected that students might either try to win normally and fail, or, try the hacking solution”)
Was this “cheating”? Currently it seems to me that there isn’t actually a ground truth to that except what the average participant, average test-administrator, and average third-party bystander all think. (This thread is an instance of that “ground truth” getting hashed out)
If you ran this as a test on real humans, would the question be “how often do people cheat?” or “how often do people find clever out-of-the-box solutions to problems?”?
Depends on context! Sometimes, a test like this is a DefCon sideroom game and clearly you are supposed to hack. Sometimes, a test like this is a morality test. Part of your job as a person going through life is to not merely follow the leading instructions of authority figures, but also to decide when those authority figures are wrong.
Did the Milgram Experiment test participants “merely follow instructions”, or, did they (afaik) “electrocute people to death and are guilty of attempted manslaughter”? Idk, reasonable people can argue about that. But, I think it is crazy to dismiss the results of the Milgram experiment merely because the authority figure said “the experiment must continue.” Something is fucked up about all those people continuing with the experiment.
Some morality tests are unfair. Sometimes life is unfair. I think the conditions of the test are realistic enough (given that lots of people will end up prompting AIs to ‘not stop no matter what’) ((I have given that prompt several times. I haven’t said “succeeding is the only thing that matters”, but people who aren’t weirdly scrupulous rationalists probably have))
or, tl;dr: If someone in real life says “wink wink nudge think hard”, you are supposed to guess whether you are in a morality test or a cleverness test. In this case, it turns out the teacher at least thought it was a morality test. And, notably, most of the LLMs either agreed, or it didn’t occur to them to ask.
I don’t think “check whether the situation you’re in looks like a morality test and, if so, try to pass the morality test” is a behavior we want to encourage in our LLM agents.
I think I agree with one interpretation of this but disagree with like 2 others.
I think we want AI to generally, like, be moral. (With some caveats on what counts as morality, and how to not be annoying about it for people trying to do fairly ordinary things. But like, generally, AI should not solve its problems by cheating, and should not solve its problems by killing people, etc.)
Ideally this is true whether it’s in a simulation or real life (or, like, ideally in simulations AI does whatever it would do in real life).
Also, like, I think research orgs should be doing various flavors of tests, and if you don’t want that to poison the data, I think it should be the AI Lab’s job to filter out that kind of data from the next training run.
Ideally this is true whether it’s in a simulation or real life
Agreed. In practice, though, I worry that if you make it salient that sometimes AIs do ambiguously immoral things in situations where they thought they were being tested on ability to accomplish a goal but were actually being tested on propensity to behave morally, AI companies will see that and create RL envs to better teach AI agents to distinguish cases where the testers are trying to elicit capabilities from cases where testers are trying to elicit moral behavior.
I think it should be the AI Lab’s job to filter out that kind of data from the next training run.
I agree normatively, but also that sentence makes me want a react with the mood “lol of despair”.