A lot of the reason for this is that right now, we don't have anything like the level of confidence required to align arbitrarily capable AI systems with arbitrary goals. Many plausible alignment plans depend on being able to automate AI alignment, but for those plans to go through with high probability, the AIs need to be basically willing to follow instructions. Claude's actions are worrisome from that perspective: if we mess up AI alignment the first time and the AI is unwilling to follow orders, we don't get a second chance.
Sam Marks argues this at more length here:
https://www.lesswrong.com/posts/ydfHKHHZ7nNLi2ykY/jan-betley-s-shortform#JLHjuDHL3t69dAybT