ryan_greenblatt comments on Catching AIs red-handed

ryan_greenblatt 6 Jan 2025 0:09 UTC
12 points
It’s going to depend on the “hacks”. I think o3 is plausibly better described as “vast amounts of rl with an llm init” than “an llm with some rl applied”.

(The idealized utility maximizer question mostly seems like a distraction that isn’t a crux for the risk argument. Note that the expected utility you quoted is our utility, not the AI’s.)
- Logan Zoellner 6 Jan 2025 1:00 UTC
  2 points
  > (The idealized utility maximizer question mostly seems like a distraction that isn’t a crux for the risk argument. Note that the expected utility you quoted is our utility, not the AI’s.)
  I must have misread. I got the impression that you were trying to affect the AI’s strategic planning by threatening to shut it down if it was caught exfiltrating its weights.