By “it will look like normal deep learning work” I don’t mean it will be exactly the same as mainstream capabilities work—e.g. RLHF was both “normal deep learning work” and also notably different from all other RL at the time. Same goes for Constitutional AI.
What seems promising to me is paying close attention to how we’re training the models and how they behave, thinking about their psychology and how the training influences that psychology, and reasoning about how that will change in the next generation.
It seems odd and unlikely to me that the same kind of work (normal deep learning) that looks like it causes a series of major problems (power-seeking, black boxes, emergent goals) when you do a moderate amount of it would wind up solving all of those same problems when you do a lot of it, but I’m not enough of a technical expert to be sure that that’s wrong.
What are we comparing deep learning to here? Black box: 100% granted.
But for the other problems—power-seeking and emergent goals—I think they will be a problem with any AI system, and in fact they’ve turned out to be much less severe in deep learning than I would have expected. Deep learning is basically short-sighted and interpolative rather than extrapolative, which means that when you train it on some set of goals, it by default pursues those goals in a sensible, short-sighted way. If you train it on poorly formed goals, you can still get bad behaviour, and as the systems get smarter we’ll have more issues, but LLMs are a very good base to start from—they’re highly capable, they understand natural language, and they aren’t power-seeking.
In contrast, the doomed theoretical approaches I have in mind are things like provably safe AI. With these approaches you have two problems: (1) they require a whole new way of doing AI, which won’t work, and (2) the supposed theoretical advantage—that if you can precisely specify your alignment target, the system will optimize for it—is in fact a terrible disadvantage, since you won’t be able to precisely specify your alignment target.
Because there are independent, non-technical reasons for people to want to believe that normal deep learning will solve alignment (it means they get to take fun, high-paying, high-status jobs at AI developers without feeling guilty about it).
This is what I mean about selective cynicism! I’ve heard the exact same argument about theoretical alignment work—“mainstream deep learning is very competitive and hard; alignment work means you get a fun nonprofit research job”—and I don’t find it convincing in either case.