I agree that LLMs are less intrinsically agent-y and may be less prone to instrumental convergence than other AI approaches. (I’m less sure whether this can’t be overcome with relatively simple code that calls LLMs as a part of its operation, a la some better version of AutoGPT). I think the central argument of the post, and the linked discussions (which I see as something like “As long as we try our best at each LLM iteration to use what we have for alignment research, we can get better at aligning the next iteration”), is plausible, iff we slow down capabilities research long enough to do so. “We” meaning everyone. So far, we have not done anything close to that, despite some noises in that direction. Nor do we have anything like a reliable plan for achieving such a capabilities slowdown, and no single company or country is in a position to create and enforce one. Possibly no proper subset of countries or companies is in a position to do so. I’d love to see a good way to get to that as a stable equilibrium where we make sure we’re only increasing capabilities and trying new algorithms after we’ve done enough to be sufficiently sure they’re safe. What would that look like?
Hm, I don’t agree that even some slowdown is necessary, mostly because I don’t think a system’s capabilities correlate all that well with how instrumentally convergent it is. My key points here are: yes, the iterative approach does in fact work; and, more importantly, instrumental convergence either likely doesn’t exist or is far more bounded than LWers imagine. A new post further argues that instrumental convergence, even if true, probably doesn’t support the claim that AI is existentially risky. So a lot of the inferences LW has made may be false, given how heavily that assumption was implicitly used, even conditional on my own argument being wrong.
The post is below:
https://www.lesswrong.com/posts/w8PNjCS8ZsQuqYWhD/instrumental-convergence-draft
Now, to get at a point that seems less appreciated: this argument is more or less independent of capabilities, in that an AI can become essentially arbitrarily capable and still have much more bounded instrumental convergence.
So no, we mostly don’t have to slow down AI, because an AI’s capabilities are not negatively correlated with its safety. In more probable scenarios like AI misuse, you might actually want to race to the finish line; an argument for this is given by Simeon (simeon_c) here:
https://www.lesswrong.com/posts/oadiC5jmptAbJi6mS/the-cruel-trade-off-between-ai-misuse-and-ai-x-risk-concerns