I think you’re correct that the paradigm has changed, Matthew, and that the problems that stood out to MIRI before as possibilities no longer quite fit the situation.
I still think the broader concern MIRI exhibited is correct: namely, that an AI could appear to be aligned without actually being aligned, and that this may not come to light until it is behaving outside the context of training (or outside the context in which the command was written). Because of the greater capabilities of an AI, the problem may come down to differences between superficially similar goals that wouldn’t matter at the human capabilities level.
I’m not sure whether the fact that LLMs solve the cauldron-filling problem means we should consider the whole broader class of problems easier to solve than we thought. Maybe it does. But given the massive stakes, I think we ought to treat not knowing whether LLMs will always behave as intended OOD as a live problem.
Whether MIRI was confused about the main issues of alignment in the past, and whether LLMs should have been a point of update for them, are among the points of contention here.
(I think the answer is no; see the comments about this above.)
ML models in the current paradigm do not seem to behave coherently OOD, but I’d bet that for nearly any pair of metrics of “overall capability” and alignment, the capability metric decays faster than the alignment metric as we go further OOD.
See https://arxiv.org/abs/2310.00873 for an example of the kind of thing you’d expect to see when taking a neural network OOD. It’s not that the model does some insane path-dependent thing; it collapses to entropy. You end up seeing a max-entropy distribution over outputs, not goals. This is a good example of the kind of thing that’s obvious to people who’ve done real work with ML but very counter to classic LessWrong intuitions, and it isn’t learnable by implementing minGPT.
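To make “look at the output distribution, not at goals” concrete, here’s a minimal sketch of the kind of probe I mean (my own toy illustration, not the linked paper’s experimental setup): measure a small LM’s next-token entropy on ordinary text versus on an OOD context of uniformly random tokens. The choice of GPT-2 and of random tokens as the notion of “OOD” are assumptions made purely for the example.

```python
# Toy probe (illustrative sketch, NOT the arXiv paper's methodology):
# compare the entropy of a small LM's next-token distribution on natural
# text vs. an out-of-distribution context of uniformly random token ids.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_next_token_entropy(input_ids):
    """Average entropy (in nats) of the model's next-token distributions."""
    with torch.no_grad():
        logits = model(input_ids).logits  # shape: [1, seq_len, vocab]
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean().item()

# In-distribution: ordinary English prose.
text = "The cauldron is full of water, so the apprentice stops pouring."
in_dist = tok(text, return_tensors="pt").input_ids

# OOD (one possible notion of it): uniformly random token ids of the same length.
ood = torch.randint(0, model.config.vocab_size, in_dist.shape)

print("in-distribution mean entropy:", mean_next_token_entropy(in_dist))
print("random-token (OOD) mean entropy:", mean_next_token_entropy(ood))
```

The specific numbers aren’t the point; the point is that the natural summary statistic of a model’s behavior OOD is the shape of its output distribution, not anything goal-shaped.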
<snark> Your models of intelligent systems collapse to entropy on OOD intelligence levels. </snark>