I don’t fully agree, but this doesn’t seem like a crux given that we care about future, much more powerful AIs.
Is your impression that the first AGI won’t be a GPT-spinoff (some version of o3 with like 3 more levels of hacks applied)? Because that sounds like a crux.
o3 looks a lot more like an LLM+hacks than it does an idealized utility maximizer. For one thing, the RL is only applied at training time (not inference), so you can’t make appeals to its utility function after it’s done training.
It’s going to depend on the “hacks”. I think o3 is plausibly better described as “vast amounts of rl with an llm init” than “an llm with some rl applied”.
(The idealized utility maximizer question mostly seems like a distraction that isn’t a crux for the risk argument. Note that the expected utility you quoted is our utility, not the AI’s.)
I must have misread. I got the impression that you were trying to affect the AI’s strategic planning by threatening to shut it down if it was caught exfiltrating its weights.