Most of my additional credence is on something like ‘the full o1 turns out to already be close to the grand prize mark’
Keep in mind that o1 is still probably a derivative of GPT-4o’s or GPT-4T’s base model, which was probably trained on at most 1e26 FLOPs[1]. Meanwhile, the new 100K H100s cluster can train 4e26+ FLOPs models, and the next OpenAI model at this scale will probably be ready early next year. The move from o1-preview to full o1 is not obviously as significant as what happens when you also upgrade the base model. If some Orion rumors are correct, the next model might additionally improve beyond what scale alone would give, by using o1-generated synthetic data in pretraining.
WSD and Power learning rate schedules might enable effective continued pretraining, and it’s possible to continue training on repeated data, so fixed-compute base model scale is not obviously the correct assumption. That is, even though GPT-4o was released in May 2024, that doesn’t necessarily mean that its base model hasn’t gotten stronger since then, or that any stronger performance is entirely a result of additional post-training. And 1e26 FLOPs is about 3 months on 30K H100s, which could be counted as a $140 million training run at $2 per H100-hour (not contradicting the claim that $100 million training runs were still the scale of models deployed by Jun 2024).
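The arithmetic above can be checked with a quick back-of-the-envelope sketch. The peak-throughput and utilization figures below are my own assumptions (roughly dense BF16 peak for an H100 and a typical large-run MFU), not numbers from the comment:

```python
# Sanity-check "1e26 FLOPs ~ 3 months on 30K H100s ~ $140M at $2/H100-hour".
H100_PEAK_FLOPS = 1e15   # assumed: ~dense BF16 peak per H100, rounded up
UTILIZATION = 0.4        # assumed: model FLOPs utilization for a big run
N_GPUS = 30_000
HOURS = 3 * 30 * 24      # ~3 months of wall-clock time
PRICE_PER_H100_HOUR = 2.0

total_flops = H100_PEAK_FLOPS * UTILIZATION * N_GPUS * HOURS * 3600
cost_usd = N_GPUS * HOURS * PRICE_PER_H100_HOUR

print(f"{total_flops:.1e} FLOPs, ${cost_usd / 1e6:.0f}M")
```

Under these assumptions the run comes out at roughly 1e26 FLOPs and on the order of $130–140M, consistent with the figures quoted above.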
For sure! At the same time, a) we’ve continued to see new ways of eliciting greater capability from the models we have, and b) o1 could (AFAIK) involve enough additional training compute to no longer be best thought of as ‘the same model’ (one possibility, although I haven’t spent much time looking into what we know about o1: they may have started with a snapshot of the 4o base model, put it through additional pretraining, then done an arbitrary amount of RL on CoT). So I’m hesitant to think that ‘based on 4o’ sets a very strong limit on o1’s capabilities.
Performance after post-training degrades if behavior gets too far from that of the base/SFT model (see Figure 1). Solving this issue would be an entirely different advancement from what o1-like post-training appears to do. So I expect that the model remains approximately as smart as the base model and the corresponding chatbot; it’s just better at packaging its intelligence into relevant long reasoning traces.
Interesting, I didn’t know that. But it seems like that assumes that o1’s special-sauce training can be viewed as a kind of RLHF, right? Do we know enough about that training to know that it’s RLHF-ish, or at least some clearly offline approach?