We can’t just track a single outcome (like “landed safely”). The G in AGI means that the number of ways that AGI can go wrong is as large as the number of ways that applications of human intelligence can go wrong, which include direct physical harm from misuse, societal impacts through misinformation, social upheaval from too fast changes, AIs autonomously causing harm and more.
I do agree with this, but I think that there are certain more specific failure modes that are especially important—they are especially bad if we run into them, but if we can avoid them, then we are in a decent position to solve all the other problems. I’m thinking primarily of the failure mode where your AI is pretending to be aligned instead of actually aligned. This failure mode can arise fairly easily if (a) you don’t have the interpretability tools to reliably tell the difference, and (b) inductive biases favor something other than the goals/principles you are trying to train in OR your training process is sufficiently imperfect that the AI can score higher by being misaligned than by being aligned. And both a and b seem like they are plausibly true now and will plausibly be true for the next few years. (For more on this, see this old report and this recent experimental result) If we can avoid this failure mode, we can stay in the regime where iterative development worksand figure out how to align our AIs better & then start using them to do lots of intellectual work to solve all the other problems one by one in rapid succession. (The good attractor state)
I prefer to avoid terms such as “pretending” or “faking”, and try to define these more precisely.
As mentioned, a decent definition of alignment is following both the spirit and the letter of human-written specifications. Under this definition, “faking” would be the case where AIs follow these specifications reliably when we are testing, but deviate from them when they can determine that no one is looking. This is closely related to the question of robustness, and I agree it is very important. As I write elsewhere, interpretability may be helpful but I don’t think it is a necessary condition.
I do agree with this, but I think that there are certain more specific failure modes that are especially important—they are especially bad if we run into them, but if we can avoid them, then we are in a decent position to solve all the other problems. I’m thinking primarily of the failure mode where your AI is pretending to be aligned instead of actually aligned. This failure mode can arise fairly easily if (a) you don’t have the interpretability tools to reliably tell the difference, and (b) inductive biases favor something other than the goals/principles you are trying to train in OR your training process is sufficiently imperfect that the AI can score higher by being misaligned than by being aligned. And both a and b seem like they are plausibly true now and will plausibly be true for the next few years. (For more on this, see this old report and this recent experimental result) If we can avoid this failure mode, we can stay in the regime where iterative development works and figure out how to align our AIs better & then start using them to do lots of intellectual work to solve all the other problems one by one in rapid succession. (The good attractor state)
I prefer to avoid terms such as “pretending” or “faking”, and try to define these more precisely.
As mentioned, a decent definition of alignment is following both the spirit and the letter of human-written specifications. Under this definition, “faking” would be the case where AIs follow these specifications reliably when we are testing, but deviate from them when they can determine that no one is looking. This is closely related to the question of robustness, and I agree it is very important. As I write elsewhere, interpretability may be helpful but I don’t think it is a necessary condition.