I think this is important because the safety community still isn’t thinking very much about search & RL, even after all the recent progress with reasoning models. We’ve updated very far away from AlphaZero as a reference class, and I think we will regret this.
On the other hand, the ideas I’m talking about here seem to have widespread recognition among people working on capabilities. Demis is very transparent about where they’re headed with language models, AlphaZero, and open-ended exploration (e.g., at 20:48). Noam Brown is adamant about test-time scaling/reasoning being the future (e.g., at 20:32). I think R1 has driven the message home for everyone else.
To be fair, AlphaZero not only had an essentially unhackable reward model but could also generate very large amounts of data. That combination, while not totally unique to Go or gaming, is generally hard to come by in a lot of domains, so progress will probably be slower than it was for AlphaZero.
Also, a lot of the relevant domains are ones where latency is either very low or long latency can be tolerated, which is often not the case in the physical world.
We have already seen a lot of progress in this regard with the new reasoning models; see this neglected post for details.