Huh, so you think o1 was the process-supervision reward model, and o3 is the policy model distilled from whatever reward model o1 became? That seems to fit.
Something like that, yes. The devil is in the details here.
Surely other labs will replicate this too? Even the open-source community seems close. And Silicon Valley companies often poach staff, which makes it hard to keep a trade secret. Not to mention spies.
Of course. The secrets cannot be kept, and everyone has been claiming to have cloned o1 already. There are dozens of papers purporting to have explained it. (I think DeepSeek may be the only one to have actually done so, however; at least, I don’t recall offhand any of the others observing the signature backtracking ‘wait a minute’ interjections the way DeepSeek sees organically emerging in r1.)
But scaling was never a secret. You still have to do it. And MS has $80b going into AI datacenters this year; how much does open source (or DeepSeek) have?
It’s worth pointing out that inference-time search seems to become harder as the verifier becomes less reliable, which means the scaling curves we see for math and code might get much worse in other domains.
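A minimal toy sketch of that dynamic (my own illustration with made-up uniform ‘quality’ scores, nothing from any actual lab): best-of-N sampling where candidates are scored by a verifier that returns noise with some probability. As the verifier’s error rate rises, the true quality of the selected candidate decays toward that of a single random draw, which is one way the inference-time scaling curve could flatten outside math/code.

```python
import random

def sample_candidate():
    """Draw a candidate solution with a latent true quality in [0, 1]."""
    return random.random()

def noisy_verifier(true_quality, error_rate):
    """Score a candidate: with probability `error_rate`, return pure noise."""
    if random.random() < error_rate:
        return random.random()   # unreliable judgment, uncorrelated with quality
    return true_quality          # faithful judgment

def best_of_n(n, error_rate):
    """Let the (noisy) verifier pick the candidate it likes best; report its TRUE quality."""
    candidates = [sample_candidate() for _ in range(n)]
    return max(candidates, key=lambda q: noisy_verifier(q, error_rate))

def average_selected_quality(n, error_rate, trials=20_000):
    return sum(best_of_n(n, error_rate) for _ in range(trials)) / trials

if __name__ == "__main__":
    random.seed(0)
    for error_rate in (0.0, 0.25, 0.5, 0.75, 1.0):
        q = average_selected_quality(16, error_rate)
        print(f"N=16, verifier error rate {error_rate:.2f}: selected quality ~ {q:.3f}")
    # A perfect verifier pushes best-of-16 close to the max (~0.94);
    # a useless one collapses back to the ~0.5 of a single random sample.
```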
Yes. That’s why I felt skeptical about how generalizable the o1 approach is. It doesn’t look like a break-out to me. I don’t expect much far transfer: being really good at coding doesn’t automatically make you a genius at, say, booking plane tickets. (The o1 gains are certainly not universal, the way straightforward data/parameter-scaling gains tend to be—remember that some of the benchmarks actually got worse.) I also expect the o1 approach to tend to plateau: there is no ground truth oracle for most of these things, the way there is for Go. AlphaZero cannot reward-hack the Go simulator. Even for math, where your theorem prover can at least guarantee that a proof is valid, what’s the ground-truth oracle for ‘came up with a valuable new theorem, rather than arbitrary ugly tautological nonsense of no value’?
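A toy way to see the valid-vs-valuable gap (purely hypothetical candidates and hand-assigned value scores, not anyone’s actual setup): if the only automatic reward is ‘the proof checker accepts it’, a trivial tautology maximizes that proxy just as well as a genuinely valuable theorem, so an optimizer has no pressure toward the latter.

```python
# Toy illustration: a proxy reward that only checks validity cannot separate
# valuable theorems from tautological filler, so optimizing it can return the filler.

candidates = [
    # (statement,                           checker_accepts, human_judged_value)
    ("A or not A",                          True,            0.01),
    ("(A and B) implies A",                 True,            0.02),
    ("new bound on prime gaps (say)",       True,            0.95),
    ("flawed proof of a famous conjecture", False,           0.00),
]

def proxy_reward(candidate):
    """All the automatic oracle sees: did the proof checker accept it?"""
    _statement, accepted, _value = candidate
    return 1.0 if accepted else 0.0

def true_value(candidate):
    """What we actually care about, but have no oracle for."""
    return candidate[2]

best_by_proxy = max(candidates, key=proxy_reward)  # ties resolve to the first accepted item
best_by_value = max(candidates, key=true_value)

print("proxy-optimal:", best_by_proxy[0])  # 'A or not A' -- valid, worthless
print("value-optimal:", best_by_value[0])  # 'new bound on prime gaps (say)'
```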
So that’s one of the big puzzles here for me: as interesting and impressive as o1/o3 is, I just don’t see how it justifies the apparent confidence. (Noam Brown has also commented that OA has a number of unpublished breakthroughs that would impress me if I knew about them, and of course, the money side seems to still be flowing without stint, despite it being much easier to cancel such investments than to cause them.)
Is OA wrong, or do they know something I don’t? (For example, a distributional phase shift akin to meta-learning.) Or do they just think that these remaining issues are the sort of thing that AI-powered R&D can solve and so it is enough to just get really, really good at coding/math and they can delegate from there on out?
EDIT: Aidan McLaughlin wrote a good post back in November discussing the problems with RL and why you would not expect the o1 series to lead to AGI when scaled up in sensible ways, which I largely agree with, and in it he says:
> But, despite this impressive leap, remember that o1 uses RL, RL works best in domains with clear/frequent reward, and most domains lack clear/frequent reward.
> Praying for Transfer Learning: OpenAI admits that they trained o1 on domains with easy verification but hope reasoners generalize to all domains...When I talked to OpenAI’s reasoning team about this, they agreed it was an issue, but claimed that more RL would fix it. But, as we’ve seen earlier, scaling RL on a fixed model size seems to eat away at other competencies! The cost of training o3 to think for a million tokens may be a model that only does math.
On the other hand… o3 didn’t only do math. And in RL, we also know that such systems often exhibit phase transitions in meta-learning or generalization: they overfit to narrow distributions and become superhuman experts which break if anything is even slightly different, but suddenly generalize when trained on diverse enough data as a blessing of scale, not of raw data volume but of data diversity, with LLMs being a major case in point (like GPT-2 → GPT-3). Hm. This was written 2024-11-20, and McLaughlin announced 2025-01-13 that he had joined OpenAI. Hm...
My personal view is that OA is probably wrong about how far the scaling curves generalize, with the caveat that even eating math and coding entirely, ala AlphaZero, would still be massive for AI progress, though compute constraints will bind eventually.
My own take is that the o1-approach will plateau in domains where verification is expensive, but thankfully most tasks of interest tend to be easier to verify than to solve, and a lot of math/coding is basically ideally suited to verification; I also expect it to be way easier to build simulators for these domains that aren’t easy to reward-hack.
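A minimal illustration of the verify-vs-solve asymmetry (a generic subset-sum toy, not anything specific to o1): checking a proposed certificate is a membership test and one sum, while finding one from scratch means searching exponentially many subsets.

```python
from itertools import combinations

def verify(numbers, target, certificate):
    """Checking a proposed answer: a membership test and one sum."""
    return all(x in numbers for x in certificate) and sum(certificate) == target

def solve(numbers, target):
    """Finding an answer: brute force over all 2^n subsets."""
    for r in range(1, len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return list(subset)
    return None

numbers = [3, 34, 4, 12, 5, 2]
target = 9

certificate = solve(numbers, target)                      # the expensive part
print(certificate, verify(numbers, target, certificate))  # cheap check -> True
```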
> what’s the ground-truth oracle for ‘came up with a valuable new theorem, rather than arbitrary ugly tautological nonsense of no value’?
Eh, those tautologies are both interesting in their own right and valuable training data, so that the model learns how to prove statements.
I think the unmodelled variable is that they consider software-only singularities to be more plausible, ala this:
> Or do they just think that these remaining issues are the sort of thing that AI-powered R&D can solve and so it is enough to just get really, really good at coding/math and they can delegate from there on out?
Or this:
https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform#z7sKoyGbgmfL5kLmY
He gets onboarded only on January 28th, for clarity.
My point there is that he was talking to the reasoning team pre-hiring (forget ‘onboarding’, who knows what that means), so they would be unable to tell him most things—including if they have a better reason than ‘faith in divine benevolence’ to think that ‘more RL does fix it’.
Great comment; thanks for the shoutout.
> but suddenly generalize when trained on diverse enough data as a blessing of scale
Can you elaborate on this? I’d love to see some examples.