Jesse Hoogland comments on Jesse Hoogland’s Shortform

Jesse Hoogland 10 Feb 2025 16:56 UTC
LW: 40 AF: 16
17
AF
I think this is important because the safety community still isn’t thinking very much about search & RL, even after all the recent progress with reasoning models. We’ve updated very far away from AlphaZero as a reference class, and I think we will regret this.
On the other hand, the ideas I’m talking about here seem to have widespread recognition among people working on capabilities. Demis is very transparent about where they’re headed with language models, AlphaZero, and open-ended exploration (e.g., at 20:48). Noam Brown is adamant about test-time scaling/reasoning being the future (e.g., at 20:32). I think R1 has driven the message home for everyone else.
- Noosphere89 10 Feb 2025 19:52 UTC
  20 points
  16
  To be fair, AlphaZero not only had an essentially unhackable reward model but could also generate very large amounts of data. That combination, while not totally unique to Go or gaming, is generally hard to come by in a lot of domains, so progress will probably be slower than it was for AlphaZero.
  
  Also, a lot of these domains are ones where latency is either very low or long latency can be tolerated, which is not often the case in the physical world.
  - cubefox 11 Feb 2025 14:31 UTC
    5 points
    1
    We have already seen a lot of progress in this regard with the new reasoning models; see this neglected post for details.