People seem to believe that the bot trained to “win” in a narrow domain will extend to a bot that “tries to win” in the real world
I think the concern is that an AGI will not be trained on a narrow domain. The Problem isn’t arguing that Stockfish is an ASI or will become one, it’s arguing that an ASI will be just as relentless in its domain (the real world) as Stockfish is in its (valid chess moves).
All AI is trained in narrow domains, to some extent. There is no way to make a training environment as complex as the real world. I could have made the same post about LLMs, except there the supposed goal is a lot less clear. Do you have a better example of a “goal-oriented” AI in a complex domain?
You might reasonably argue that making aligned narrow AI is easy, but greedy capitalists will build unaligned AI instead. I think it would be off topic to debate here how likely that is. But I don’t think this is the prevailing thought, and I don’t think it produces the p(doom)=0.9 that some people hold.
And I do personally believe that EY and many others believe that, with enough optimization, even a chess bot should become dangerous. Not sure if there is any evidence for that belief.
And I do personally believe that EY and many others believe that, with enough optimization, even a chess bot should become dangerous. Not sure if there is any evidence for that belief.
I work at MIRI, worked on The Problem, and have never heard anyone express this belief.[1] Brendan is correct about the intention of that passage.
There is no way to make a training environment as complex as the real world.
It’s unclear that this is needed; e.g., the AI2027 story where you train coders that help you train scientists that help you build ASI.
Still, virtual environments for RL are a huge market right now; people are, indeed, currently trying a more modest version of this thing you claim is impossible. Of course, these aren’t literally ‘as complex as the real world’, but it’s not clear what fidelity you’d need to reach particular capability thresholds. IIRC this is the importance of work on, e.g., multi-level world models and Markov blankets: better understanding what fidelity you need in what portions of your conception of the world in order to meet a given end.
If someone were to chime in and say they believe this, my guess is that they’d get there by abusing the category ‘chess bot’; e.g., ChatGPT is kind of a chess bot in that it’s a bot that can play chess, even though it’s the product of a very different training regime than one would ever sensibly use to create a chess bot on purpose.
To be clear, you believe that making aligned narrow AI is easy, regardless of how intelligent it is? Even something more useful than a chess bot, like a theorem prover? And the only reason AIs will be goal-oriented to a dangerous extent is that people will intentionally make them like that, despite obvious risks? I’m not saying they won’t, but is that really enough to justify high p(doom)? When I was reading “The Problem”, I was sure that goal-oriented AI was seen as inevitable for some reason deeper than “Goal-oriented behavior is economically useful”.
I’d still like to argue that “goal-oriented” is not a simple concept, and it’s not trivial to produce a goal-oriented agent even if you try, and that not all useful agents are goal-oriented. But if the answer is that people will try very hard to kill themselves, I wouldn’t know how to reply to that.
I’m recognizing a lot of terms you’re using, but there seems to be a supposition about my model that’s very different from my actual model, to such an extent that I actually can’t decode it. My best guess is that the productive thing is to zoom out and clarify my Actual Position in more detail, instead of arguing single points (which will then lead you to make other assumptions that don’t quite square with my actual model, which is the big failure mode I’m trying to avoid here). To the extent that your aim is to better understand my model (which is very nearby, but not synonymous with, the models of other MIRI staff), this looks like the best path forward to me. Hopefully along the way we locate some cruxes (and I’d like it if you also helped out in guessing the root of our disagreement/misunderstanding).
At a high level, I don’t think it’s particularly useful to talk about ‘alignment’ with respect to smaller and more specialized systems, since it introduces the potential for conflating between qualitatively distinct cases.
For any system with a very small number of outputs (e.g. [chess piece] to [board position] small, and plausibly several OOMs larger than that in absolute number), it is trivially easy to verify the safety of the range of available outputs, since you can simply generate them all in advance, check whether they’re liable to kill all humans, and move on. A key reason that I think alignment is hard is that the range of outputs of properly general systems is so large that all possible outputs in all possible deployment settings can’t possibly be hand-verified. There are reasons to think you can verify groups or categories of outputs bundled together, but so far verifying the safety of the space of all possible outputs of powerful systems is not doable (my impression is that RAT and LAT were gestures in this direction, and that some of the ELK stuff is inspired by related concerns, but I’m no authority!).
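To make the small-output-space point concrete, here is a minimal sketch (my own illustration, not a procedure from The Problem) of what “generate every possible output in advance and check it” amounts to; `is_catastrophic` is a hypothetical stand-in for whatever safety check one trusts:

```python
# Minimal sketch: for a system whose output space is small and enumerable,
# "verification" can in principle reduce to exhaustively checking every
# possible output against a safety predicate.
from itertools import product

FILES = "abcdefgh"
RANKS = "12345678"

# A generous upper bound on a chess engine's output space: every
# (from-square, to-square) pair, ~4,096 candidates. Trivially enumerable.
ALL_MOVES = [f1 + r1 + f2 + r2
             for (f1, r1), (f2, r2) in product(product(FILES, RANKS), repeat=2)]

def is_catastrophic(output: str) -> bool:
    # Hypothetical safety predicate. For a chess move the answer is always
    # "no", which is the point: nothing in this output space can hurt anyone.
    return False

def verify_exhaustively(outputs) -> bool:
    """True iff every possible output passes the safety check."""
    return not any(is_catastrophic(o) for o in outputs)

assert verify_exhaustively(ALL_MOVES)

# For a general system emitting, say, 1,000-token strings over a 50k-token
# vocabulary, the analogous loop would need on the order of 50_000 ** 1_000
# iterations; exhaustive verification stops being an option, which is where
# (on this account) the hard part of alignment begins.
```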
I think this explanation constitutes me answering in the affirmative to (the spirit of) your first question. Let me know if that seems right.
For the theorem-prover case, the details of the implementation really matter quite a lot, and so the question is underspecified from my perspective. (My guess is the vast majority of theorem-provers are basically safe, even under large optimization pressure, but I haven’t looked into it and invite someone who’s thought more about that case to chime in; there are definitely approaches I could imagine for creating a theorem prover that might risk dangerous generalization, I’m just not sure how central those approaches are in practice.)
And the only reason AIs will be goal-oriented to a dangerous extent, is because people will intentionally make them like that, despite obvious risks?
This is not my position, and is not the position of anyone at MIRI who I’ve spoken with on the topic.
When I was reading “The Problem”, I was sure that goal-oriented AI was seen as inevitable for some reason deeper than “Goal-oriented behavior is economically useful”.
This is a correct reading and I don’t understand what in my initial reply gave you some other impression. My best guess is that you’ve conflated between generality and goal-pursuing. Chess systems are safe because they’re not very general (i.e., they have a very small action space), not because they aren’t pursuing goals.
I’d still like to argue that “goal-oriented” is not a simple concept,
Agreed.
and it’s not trivial to produce a goal-oriented agent even if you try
In conversations I’ve seen about this, we usually talk about how ‘coherent’ an agent is, as a way of describing how robustly it pursues its objective, whatever that objective may be. If what you mean is something like “contemporary general systems do not pursue their goals especially robustly, and it may be hard to make improvements on this axis,” I agree.
not all useful agents are goal-oriented.
I think I disagree here, but I don’t know how to frame that disagreement without more detail. Feel free to offer it if you feel there’s more to talk about, and I’ll do my best to continue engaging. I want to acknowledge that I don’t really understand what you mean by goal-oriented, and how it might differ from my conception, which I’m hesitant to elaborate on for now, in the spirit of avoiding further confusion.
-
My best guess is that you thought the chess example was attempting to illustrate more things than it actually is. The chess example, as I recall, is a response to two common objections:
“AIs will never be better than humans at cognitive tasks.” Chess is a cognitive task where the AIs ~always beat the humans.
“AIs will never be able to pursue goals robustly over long time horizons.” Chess has many turns, and AIs seem to coherently pursue a single end (winning) over the course of even very long games (in clock time).
These are just existence proofs; if the AI can perform with superhuman competence at a game with n variables (like chess), then it seems plausible that AIs could eventually, in principle, perform with superhuman competence at a game with 2n, 100n, or 1e80 variables.
For any system with a very small number of outputs <...> it is trivially easy to verify the safety of the range of available outputs
On second thought, I don’t agree that the number of outputs is the right criterion. It’s the “narrowness” of the training environment that matters. E.g., you could also train an LLM to play chess. I believe that it could get good, but this would not transfer into any kind of “preference for chess” or “desire to win”, neither in the actions it takes nor in the self-descriptions it generates, because the training environment rewards no such things. At most the training might generate some tree-search subroutines, which might be used for other tasks. Or the LLM might learn that it has been trained and say “I’m good at chess”, but this wouldn’t be a direct consequence of the chess skill.
I don’t really understand what you mean by goal-oriented
This is the key point. “The Problem” uses the word “goal” some 80 times, but it does not define it, does not acknowledge that it’s a complex concept, and does not consider whether some AI might not have one. I wish I could just use your concept of a goal; we shouldn’t need to discuss this at all, since it should have been precisely defined in “The Problem” or some other introductory text.
Personally, by “goal” or “goal-oriented” I mean that the AI’s utility function has a simple description. E.g., in the narrow domain of chess moves, the actions a chess bot chooses are very well explained by “trying to win”. In the real world, on the other hand, there are many actions that would help it win, which the chess bot ignores, and not for lack of intelligence; so “trying to win” is no longer a good predictor, and I claim that this bot is not actually “goal-oriented”, or at least that its goal is very different from plain “winning”. Maybe you would call this property “robustly goal-oriented”?
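To make that definition a bit more concrete, here is a rough operationalization (my own sketch, with made-up names like `goal_agreement` and `goal_score`): an agent counts as goal-oriented toward a goal G to the extent that “pick the G-maximizing action” predicts its actual choices.

```python
# Sketch: score how well a candidate goal explains an agent's behavior by the
# fraction of states in which the agent picks the goal-maximizing action.
from typing import Callable, Iterable, Sequence

def goal_agreement(policy: Callable[[object], object],
                   goal_score: Callable[[object, object], float],
                   states: Iterable[object],
                   actions_for: Callable[[object], Sequence[object]]) -> float:
    """Fraction of states where the agent's chosen action is the goal-argmax."""
    matches, total = 0, 0
    for s in states:
        best = max(actions_for(s), key=lambda a: goal_score(s, a))
        matches += int(policy(s) == best)
        total += 1
    return matches / total if total else 0.0

# With actions_for = "legal chess moves" and goal_score = "win probability",
# a strong engine's agreement would be near 1, so "trying to win" is a good
# predictor. Widen actions_for to real-world actions (bribe the opponent,
# acquire more compute) and the bot never picks the win-maximizing action,
# so agreement collapses; under this definition the bot's goal is not
# plain "winning".
```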
A second definition could be that any system which moves towards any goal, more than away from it, is “oriented” to that goal. This is reasonable because that’s the economically useful part. And with this definition, the statement “ASI is very likely to exhibit goal-oriented behavior” is trivial. But that’s a very low bar, and extrapolating something about the long term behavior of the system from this definition seems like a mistake.
I think this explanation constitutes me answering in the affirmative to (the spirit of) your first question.
I’m happy with the explanation; my issue is that I don’t feel like I’ve seen this explicitly acknowledged in “The Problem”, in “A List of Lethalities” (maybe it falls under “We can’t just build a very weak system”, but I don’t agree that it has to be weak), or in the posts of Paul’s that I’ve read.
not all useful agents are goal-oriented.
The theorem prover I mentioned is one such useful agent. I understand you would call the prover “goal-oriented”, but it doesn’t necessarily reach that level under my definition. And at least we agree that provers can be safe. The usefulness is, among other things, that we could use the prover to work out alignment for more general agents. I don’t want to go too far down the tangent of whether this would actually work, but surely it is a coherent sequence of events, right?
The chess example, as I recall, is a response to two common objections
I don’t hold these objections, and I don’t think anyone reasonable does, especially with the “never” in them. At best I could argue that humans aren’t actually great at pursuing goals robustly, and therefore the AI might also not be.
contemporary general systems do not pursue their goals especially robustly, and it may be hard to make improvement on this axis
It’s not just “hard” to make improvements, but also unnecessary and even suicidal. X-risk arguments seem to assume that goals are robust, but do not convincingly explain why they must be.
I don’t hold these objections, and I don’t think anyone reasonable does, especially with the “never” in them. At best I could argue that humans aren’t actually great at pursuing goals robustly, and therefore the AI might also not be.
The Problem is intended for a general audience (e.g., not LW users). I assure you people make precisely these objections, very often.
Is this supposed to answer my entire comment, in the sense that the general audience doesn’t need precise definitions? That may work for some people but can be off-putting to others. And surely it’s more important to convince AI researchers. Much of the general public already hates AI.