Looking back on this post after a year, I haven’t changed my mind about its content, but I agree with Seth Herd when he said the post was “important but not well executed”.
In hindsight I was too careless with my language in this post, and I should have spent more time making sure that every single paragraph of the post could not be misinterpreted. As a result of my carelessness, the post was misinterpreted in a predictable direction. And while I’m not sure how much I could have done to eliminate this misinterpretation, I do think that I could have reduced it a fair bit with more effort and attention.
If you’re not sure what misinterpretation I’m referring to, I’ll just try to restate the main point that I was trying to make below. To be clear, what I say below is not identical to the content of this post (as the post was narrowly trying to respond to the framing of this problem given by MIRI; and in hindsight, it was a mistake to reply in that way), but I think this is a much clearer presentation of one of the main ideas I was trying to convey by writing this post:
In my opinion, a common belief among people theorizing about AI safety around 2015, particularly on LessWrong, was that we would design a general AI system by assigning it a specific goal, and the AI would then follow that goal exactly. This strict adherence to the goal was considered dangerous because the goal itself would likely be subtly flawed or misspecified in a way we hadn’t anticipated. While the goal might appear to match what we want on the surface, in reality, it would be slightly different from what we anticipate, with edge cases that don’t match our intentions. The idea was that the AI wouldn’t act in alignment with human intentions—it would rigidly pursue the given goal to its logical extreme, leading to unintended and potentially catastrophic consequences.
The goal in question could theoretically be anything, but it was often imagined as a formal utility function—a mathematical representation of a specific objective that we would directly program into the AI, potentially by hardcoding the goal in a programming language like Python or C++. The AI, acting as a powerful optimizer, would then work to maximize this utility function at any and all costs. However, other forms of goal specification were also considered for illustrative purposes. For example, a common hypothetical scenario was that an AI might be given an English-language instruction, such as “make as many paperclips as you can.” In this example, the AI would misinterpret the instruction by interpreting it overly literally. It would focus exclusively on maximizing the number of paperclips, without regard for the broader intentions of the user, such as not harming humanity or destroying the environment in the process.
However, based on how current large language models operate, I don’t think this kind of failure mode is a good match for what we’re seeing in practice. LLMs typically do not misinterpret English-language instructions in the way these older thought experiments imagined. And this isn’t merely because LLMs “understand” English better than people expected: nobody claimed that a superintelligence would fail to parse English. My point is not simply that LLMs possess natural language comprehension and that the LessWrong community was therefore mistaken.
Instead, my claim is that LLMs usually follow and execute user instructions in a manner that aligns with the user’s actual intentions. In other words, the AI’s behavior generally matches what the user meant for it to do, rather than leading to extreme, unintended outcomes caused by rigidly literal interpretations of instructions.
Because LLMs are capable of this, despite being, in my opinion, general AIs, I believe it’s fair to say that the concerns raised by the LessWrong community about AI systems rigidly following misspecified goals were, at least in this specific sense, misguided when applied to the behavior of current LLMs.