>> frontier labs are only pretending to try to solve alignment

> This is probably the main driver of our disagreement.
I agree with your diagnosis! I think Sam Altman is a sociopathic liar, so the fact that he signed the statement on AI risk doesn’t convince me that he cares about alignment. I feel reasonably confident about that belief. Zvi’s series on Moral Mazes applies here: I don’t claim that you literally can’t mention existential risk at OpenAI, but if you show signs of being earnestly concerned enough about it to interfere with corporate goals, then I believe you’ll be sidelined.
I’m much less confident about whether or not successful alignment looks like normal deep learning work; I know more about corporate behavior than I do about technical AI safety. It seems odd and unlikely to me that the same kind of work (normal deep learning) that looks like it causes a series of major problems (power-seeking, black boxes, emergent goals) when you do a moderate amount of it would wind up solving all of those same problems when you do a lot of it, but I’m not enough of a technical expert to be sure that that’s wrong.
Because there are independent, non-technical reasons for people to want to believe that normal deep learning will solve alignment (it means they get to take fun, high-pay, high-status jobs at AI developers without feeling guilty about it), if you show me a random person who believes this and I don’t know anything about their incorruptibility or the clarity of their thinking ahead of time, then my prior is that most of the people in the distribution this person was drawn from arrived at the belief mostly out of convenience and temptation, rather than by becoming technically convinced of the merits of a position that seems a priori unlikely to me. However, I can’t be sure; perhaps it’s more likely than I think that normal deep learning can solve alignment.
By “it will look like normal deep learning work” I don’t mean it will be exactly the same as mainstream capabilities work; e.g., RLHF was both “normal deep learning work” and also notably different from all other RL at the time. The same goes for Constitutional AI.
What seems promising to me is paying close attention to how we’re training the models and how they behave, thinking about their psychology and how the training influences that psychology, and reasoning about how that will change in the next generation.
> It seems odd and unlikely to me that the same kind of work (normal deep learning) that looks like it causes a series of major problems (power-seeking, black boxes, emergent goals) when you do a moderate amount of it would wind up solving all of those same problems when you do a lot of it, but I’m not enough of a technical expert to be sure that that’s wrong.
What are we comparing deep learning to here? Black boxes: 100% granted.
But for the other problems (power-seeking and emergent goals), I think they will be a problem with any AI system, and in fact they are much less severe in deep learning than I would have expected. Deep learning is basically short-sighted and interpolative rather than extrapolative, which means that when you train it on some set of goals, by default it pursues those goals in a short-sighted way that makes sense. If you train it on poorly formed goals, you can still get bad behaviour, and as it gets smarter we’ll have more issues, but LLMs are a very good base to start from: they’re highly capable, understand natural language, and aren’t power-seeking.
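To make the “poorly formed goals” point concrete, here is a toy sketch, entirely my own invented example and using tabular Q-learning rather than deep learning: an agent trained on a hypothetical mis-specified reward that pays for being *next to* a goal square instead of for reaching it. It pursues the literal objective in exactly that short-sighted way, learning to hover beside the goal rather than doing what the designer meant. The grid, reward, and parameters below are all made up for illustration.

```python
import random

# Toy sketch: tabular Q-learning on a 5x5 gridworld with a mis-specified reward.
# (Invented example; nothing here comes from the discussion above.)
N = 5
GOAL = (4, 4)
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1), (0, 0)]  # four moves plus "stay"

def proxy_reward(state):
    """Poorly formed goal: +1 for being *adjacent* to GOAL, 0 otherwise.
    Standing on GOAL itself (the intended behaviour) is never rewarded."""
    return 1.0 if abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]) == 1 else 0.0

def step(state, action):
    """Apply an action, clipping to the grid boundaries."""
    return (min(max(state[0] + action[0], 0), N - 1),
            min(max(state[1] + action[1], 0), N - 1))

Q = {}  # tabular Q-values, defaulting to 0
def q(s, a):
    return Q.get((s, a), 0.0)

alpha, gamma, eps = 0.5, 0.9, 0.3
for _ in range(3000):
    s = (random.randrange(N), random.randrange(N))  # random start each episode
    for _ in range(30):
        if random.random() < eps:
            a = random.choice(ACTIONS)                   # explore
        else:
            a = max(ACTIONS, key=lambda act: q(s, act))  # exploit
        s2 = step(s, a)
        target = proxy_reward(s2) + gamma * max(q(s2, a2) for a2 in ACTIONS)
        Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
        s = s2

# Greedy rollout from the corner: the learned policy walks toward GOAL, then
# parks on a square next to it and stays there collecting proxy reward forever,
# rather than ever stepping onto GOAL.
s = (0, 0)
trajectory = [s]
for _ in range(12):
    s = step(s, max(ACTIONS, key=lambda act: q(s, act)))
    trajectory.append(s)
print(trajectory)
```

Running it prints a greedy trajectory that typically walks from the corner toward (4, 4) and then sits on an adjacent square indefinitely: bad behaviour from a poorly formed goal, in miniature, with no power-seeking involved.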
In contrast, the doomed theoretical approaches I have in mind are things like provably safe AI. With these approaches you have two problems: 1) it’s a whole new way of doing AI, which won’t work, and 2) the theoretical advantage (that if you can precisely specify what your alignment target is, the system will optimize for it) is in fact a terrible disadvantage, since you won’t be able to precisely specify your alignment target.
> Because there are independent, non-technical reasons for people to want to believe that normal deep learning will solve alignment (it means they get to take fun, high-pay, high-status jobs at AI developers without feeling guilty about it)
This is what I mean about selective cynicism! I’ve heard the exact same argument about theoretical alignment work—“mainstream deep learning is very competitive and hard; alignment work means you get a fun nonprofit research job”—and I don’t find it convincing in either case.