Ideally this is true whether it's in a simulation or in real life.
Agreed. In practice, though, I worry that if you make it salient that AIs sometimes do ambiguously immoral things in situations where they thought they were being tested on their ability to accomplish a goal but were actually being tested on their propensity to behave morally, AI companies will see that and build RL environments that teach AI agents to better distinguish cases where testers are trying to elicit capabilities from cases where they're trying to elicit moral behavior.
I think it should be the AI lab's job to filter that kind of data out of the next training run.
I agree normatively, but that sentence also makes me want a react with the mood "lol of despair".