All the examples of “RL” doing interesting things that look like they involve sparse/distant reward involve enormous amounts of implicit structure of various kinds, like powerful world models.
I guess when you say “powerful world models”, you’re suggesting that model-based RL (e.g. MuZero) is not RL but rather “RL”-in-scare-quotes. Was that your intention?
I’ve always thought of model-based RL as a central subcategory within RL, as opposed to an edge-case.
Personally, I consider model-based RL to be not RL at all. I claim that either one needs to consider model-based RL to be not RL at all, or one needs to accept such a broad definition of RL that the term is basically-useless (which I think is what porby is saying in response to this comment, i.e. “the category of RL is broad enough that it belonging to it does not constrain expectation much in the relevant way”).
@nostalgebraist bites that bullet here:
“Reinforcement learning” (RL) is not a technique. It’s a problem statement, i.e. a way of framing a task as an optimization problem, so you can hand it over to a mechanical optimizer.
What’s more, even calling it a problem statement is misleading, because it’s (almost) the most general problem statement possible for any arbitrary task. If you try to formalize a concept like “doing a task well,” or even “being an entity that acts freely and wants things,” in the most generic terms with no constraints whatsoever, you end up writing down “reinforcement learning.”
…and so does Russell & Norvig, 3rd edition:
Reinforcement learning might be considered to encompass all of AI.
…That’s not my perspective though.
For my part, there’s a stereotypical core of things-I-call-RL which entails:
(A) There’s a notion of the AI’s outputs being better or worse
(B) But we don’t have ground truth (even after-the-fact) about what any particular output should ideally have been
(C) Therefore the system needs to do some kind of explore-exploit.
By this (loose) definition, both model-based and model-free RL are central examples of “reinforcement learning”, whereas LLM self-supervised base models are not reinforcement learning (cf. (B), (C)), nor are ConvNet classifiers trained on ImageNet (ditto), nor are clustering algorithms (cf. (A), (C)), nor is A* or any other exhaustive search within a simple deterministic domain (cf. (C)), nor are VAEs (cf. (C)), etc.
(A,B,C) is generally the situation you face if you want an AI to win at videogames or board games, control bodies while adapting to unpredictable injuries or terrain, write down math proofs, design chips, found companies, and so on.
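To make (A)-(C) concrete, here is a minimal sketch of the simplest setting I can think of that has all three properties: a Bernoulli multi-armed bandit with an epsilon-greedy learner. Everything in it (the function name `epsilon_greedy_bandit`, the arm probabilities, the value of `epsilon`) is a made-up illustration, not something from the systems discussed in this thread.

```python
import random

def epsilon_greedy_bandit(arm_means, steps=5000, epsilon=0.1, seed=0):
    """Toy learner: it only ever sees a scalar reward for the arm it pulled."""
    rng = random.Random(seed)
    counts = [0] * len(arm_means)       # how often each arm was pulled
    estimates = [0.0] * len(arm_means)  # running mean reward per arm
    for _ in range(steps):
        # (C) explore-exploit: occasionally try a random arm,
        # otherwise pull the arm that currently looks best.
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_means))
        else:
            arm = max(range(len(arm_means)), key=lambda i: estimates[i])
        # (A) the output is scored as better or worse...
        # (B) ...but the score never reveals which arm would have been ideal.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Hypothetical arm probabilities, chosen only for illustration.
estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
most_pulled = max(range(len(counts)), key=lambda i: counts[i])
print("estimates:", [round(e, 2) for e in estimates])
print("most pulled arm:", most_pulled)
```

Contrast this with a supervised classifier: there, every training example comes with the label the output should have been, so neither (B) nor (C) applies.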
This (loose) definition of RL connects to AGI safety because (B-C) makes it harder to predict the outputs of an RL system. E.g. we can plausibly guess that an LLM base model, given internet-text-like prompts, will continue in an internet-text-typical way. Granted, given OOD prompts, it’s harder to say things a priori about the output. But that’s nothing compared to e.g. AlphaZero or AlphaStar, where we’re almost completely in the dark about what the trained model will do in any nontrivial game-state whatsoever. (…Then extrapolate the latter to human-level AGIs acting in the real world!)
(That’s not an argument that “we’re doomed if AGI is based on RL”, but I do think that a very RL-centric AGI would need tailored approaches to thinking about safety and alignment that wouldn’t apply to LLMs; and I likewise think that a massive increase in the scope and centrality of LLM-related RL (beyond the RLHF status quo) would raise new (and concerning) alignment issues, different from the ones we’re used to with LLMs today.)
Calling MuZero RL makes sense. The scare quotes are not meant to imply that it’s not “real” RL, but rather that the category of RL is broad enough that it belonging to it does not constrain expectation much in the relevant way. The thing that actually matters is how much the optimizer can roam in ways that are inconsistent with the design intent.
For example, MuZero can explore the superhuman play space during training, but it is guided by the structure of the game and how it is modeled. Because of that structure, we can be quite confident that the optimizer isn’t going to wander down a path to general superintelligence with strong preferences about paperclips.
Right, and that wouldn’t apply to a model-based RL system that could learn an open-ended model of any aspect of the world and itself, right?
I think your “it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function” should have some caveat that it only clearly applies to currently-known techniques. In the future there could be better automatic-world-model-builders, and/or future generic techniques to do automatic unsupervised reward-shaping for an arbitrary reward, such that AIs could find out-of-the-box ways to solve hard problems without handholding.
It does still apply, though what ‘it’ is here is a bit subtle. To be clear, I am not claiming that a technique that is reasonably describable as RL can’t reach extreme capability in an open-ended environment.
The precondition I included is important:
in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function
In my frame, the potential future techniques you mention are forms of optimizer guidance. Again, that doesn’t make them “fake RL,” I just mean that they are not doing a truly unconstrained search, and I assert that this matters a lot.
For example, take the earlier example of a hypercomputer that brute forces all bitstrings corresponding to policies and evaluates them to find the optimum with no further guidance required. Compare the solution space for that system to something that incrementally explores in directions guided by, e.g., a strong future LLM. The RL system guided by a strong future LLM might achieve superhuman capability in open-ended domains, but the solution space is still strongly shaped by the structure available to the optimizer during training, and it is possible to make much better guesses about where the optimizer will go at various points in its training.
It’s a spectrum. On one extreme, you have the universal-prior-like hypercomputer enumeration. On the other, stuff like supervised predictive training. In the middle, stuff like MuZero, but I argue MuZero (or its more open-ended future variants) is closer to the supervised side of things than the hypercomputer side of things in terms of how structured the optimizer’s search is. The closer a training scheme is to the hypercomputer one in terms of a lack of optimizer guidance, the less likely it is that training will do anything at all in a finite amount of compute.
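As a toy sketch of the two ends of that spectrum, consider a “combination lock” where a policy is just a bitstring. The unguided optimizer enumerates bitstrings against a sparse reward; the guided one gets a shaped reward (length of the correct prefix) and fixes one bit at a time. The lock, the prefix-based shaping, and all names here are assumptions made up for illustration; this is not how MuZero or any real system is guided.

```python
from itertools import product

def sparse_reward(policy, secret):
    """Reward 1 only for the exactly correct policy, 0 otherwise."""
    return 1 if policy == secret else 0

def shaped_reward(policy, secret):
    """Hypothetical shaping: number of leading bits that are already correct."""
    correct = 0
    for p, s in zip(policy, secret):
        if p != s:
            break
        correct += 1
    return correct

def brute_force_search(secret):
    """Enumerate every bitstring; the sparse reward gives no direction."""
    evaluations = 0
    for candidate in product((0, 1), repeat=len(secret)):
        evaluations += 1
        if sparse_reward(candidate, secret) == 1:
            return candidate, evaluations
    return None, evaluations

def guided_search(secret):
    """Fix one bit per step, using the shaped reward as guidance."""
    policy = [0] * len(secret)
    evaluations = 0
    for i in range(len(secret)):
        evaluations += 1
        # If the correct prefix does not yet extend past bit i, bit i is wrong.
        if shaped_reward(tuple(policy), secret) <= i:
            policy[i] = 1 - policy[i]
    return tuple(policy), evaluations

secret = (1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1)  # arbitrary 12-bit target
brute_policy, brute_evals = brute_force_search(secret)
guided_policy, guided_evals = guided_search(secret)
print("brute force evaluations:", brute_evals)
print("guided evaluations:", guided_evals)
```

The brute-force cost grows as 2^n in the length of the policy, while the guided search needs n evaluations. The guided search can also only ever end up where the shaping points it, which is the sense in which structure both makes training tractable and constrains where the optimizer goes.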
I agree that in the limit of an extremely structured optimizer, it will work in practice, and it will wind up following strategies that you can guess to some extent a priori.
I also agree that in the limit of an extremely unstructured optimizer, it will not work in practice, but if it did, it would find out-of-the-box strategies that are difficult to guess a priori.
But I disagree that there’s no possible RL system in between those extremes where you can have it both ways.
On the contrary, I think it’s possible to design an optimizer which is structured enough to work well in practice, while simultaneously being unstructured enough that it will find out-of-the-box solutions very different from anything the programmers were imagining.
Examples include:
MuZero: you can’t predict a priori what chess strategies a trained MuZero will wind up using by looking at the source code. The best you can do is say “MuZero is likely to use strategies that lead to its winning the game”.
“A civilization of humans” is another good example: I don’t think you can look at the human brain neural architecture and loss functions etc., and figure out a priori that a civilization of humans will wind up inventing nuclear weapons. Right?
But I disagree that there’s no possible RL system in between those extremes where you can have it both ways.
I don’t disagree. For clarity, I would make these claims, and I do not think they are in tension:
Something being called “RL” alone is not the relevant question for risk. It’s how much space the optimizer has to roam.
MuZero-like strategies are free to explore more space than something like current applications of RLHF. Improved versions of these systems working in more general environments have the capacity to do surprising things and will tend to be less ‘bound’ in expectation than RLHF. Because of that extra space, these approaches are more concerning in a fully general and open-ended environment.
MuZero-like strategies remain very distant from a brute-forced policy search, and that difference matters a lot in practice.
Regardless of the category of the technique, safe use requires understanding the scope of its optimization. This is not the same as knowing what specific strategies it will use. For example, despite finding unforeseen strategies, you can reasonably claim that MuZero (in its original form and application) will not be deceptively aligned to its task.
Not all applications of tractable RL-like algorithms are safe or wise.
There do exist safe applications of RL-like algorithms.