It’s interesting that your framing is “high confidence there’s no underlying corrigible motivation”, and mine is more like “unlikely it starts without flaws and the improvement process is under-specified in ways that won’t fix large classes of flaws”.
I think this particular difference might’ve been downstream of somewhat uninteresting facts about how I interpreted various arguments. Something like: I read the post and was thinking “Jeremy believes that there are lots of events that can cause a model to act in unaligned ways, hm, presumably I’d evaluate that by looking at the events and seeing whether I agree that those could cause unaligned behavior, and presumably the argument about the high likelihood is downstream of there being a lot of (~independent) such potential events”. And then reading this thread I was like “oh, maybe actually the important action is way earlier: I agree that if the model is fundamentally deep-down misaligned, then you can make a long list of events that could reveal that. What I’d need isn’t a long list of (independent) events that could cause misaligned behavior, what I’d need is a long list of (independent) ways that the model could be fundamentally deep-down misaligned in a way that’d be catastrophic if revealed/understood”.
But maybe the way to square this is just that most types of events in your list each correspond to a separate way that the model could’ve been deep-down misaligned from the start, so it can just as well be read as either.
I’d be happy to video call if you want to talk about this, I think that’d be a quicker way for us to work out where the miscommunication is.
Appreciated! Probably won’t prioritize this in the next couple of weeks but will keep it in mind as an option for when I want to properly figure out my views here.
Ah I see, that makes sense, sorry about that. This post was written with more emphasis on the distribution shift that reveals misalignment than on the underlying degree of freedom that allowed that misalignment to happen in the first place. Both of these (degree of freedom and distribution shift) are necessary to cause misalignment; any other form of misalignment would (probably) just be crushed by RLHF or similar.
Thanks! Appreciate the clarification & pointer.