The AI may be able to use introspection to help notice some potential problems within itself, but for most of the important distribution shifts it’s in the same position as the AI researcher: it too is speculating about the consequences of the coming shifts.
There’s a weaker version, where for example the AI has a moderately strong desire to always be truthful, but otherwise ultimately would prefer something other than helping the AI developers. The AI won’t particularly try to find “flaws” in itself, but if asked it’ll tell the truth about anything it has noticed. The humans don’t know how far to trust it, but it seems trustworthy to the limited extent to which they can test for that. In this version, there’s more responsibility resting on the humans, who have to take advantage of this apparent honesty to extract research and understanding to work out how they should iterate.
I feel like I’m missing something about these paragraphs.
It seems like corrigibility helps you the most when you’re starting to enter the distribution shift (so it’s not just speculating about future problems). At that point the AI can notice things that are not what you intended and proactively alert you (so it’s not just implementing the weak form of corrigibility).
Is the claim that there’s a dilemma between two options?
Either 1) the AI is only speculating about future problems and so has limited ability to detect those problems, or 2) the AI is already in the sway of those problems and so will not be motivated to help fix them?
Why is there no middle ground, where the AI is encountering problems and proactively flagging them?
Good point. There can be a middle ground, but most of the examples that come to mind are more binary.
E.g. if you notice that you don’t endorse a habit, this generally happens immediately. There’s not a long period of being uncertain whether you endorse the habit while still following it. If you’re uncertain and the habit-situation is coming up, that forces you to think it through.
On the other hand, with this one:
If the overseer is only invoked when you think the overseer knows more than you.
Seems like the AI’s understanding of how much the overseer knows could change gradually, and it might feel compelled to alert the overseer of this change. Maybe depends a bit on the internal mechanisms for this corrigibility-property, but mostly this one looks like gradual change could allow proactive flagging.
One background assumption that might be relevant: without various biases that lead to people being attached to beliefs, if a belief update can be made just by thinking about it, then the belief change will happen fast once attention is allocated to it. It’s rare for logical belief updates to require lots of compute, once your attention is on the right things.
E.g. if you notice that you don’t endorse a habit, this generally happens immediately. There’s not a long period of being uncertain whether you endorse the habit while still following it. If you’re uncertain and the habit-situation is coming up, that forces you to think it through.
But there often is a long period between when you stop endorsing the habit and when you’ve finally trained yourself to do something different. (If ever. Behavior-change is famously hard for humans.)
Also, I’ll note that religious deconversions very often happen in stages, including stages that involve narrow realizations that you were mistaken, and looming suspicions that you’re going to change your mind. The whole edifice doesn’t usually collapse in a single moment. It’s a process. (This interview covers a good example.)
Notably, it seems unhealthy if every time a person gets an inkling that maybe Christianity is false, they dutifully go to their pastor and get freshly brainwashed to patch those specific objections. It’s unhealthy because Christianity is false and adult humans should grow out of it in time.
It’s unclear to me how strongly we can or should draw the analogy between changes in belief and changes in motivation, since one has a right answer and the other (presumably) doesn’t.
It’s rare for logical belief updates to require lots of compute, once your attention is on the right things.
Yeah, but putting your attention on the right things often does take a lot of compute.
But there often is a long period between when you stop endorsing the habit and when you’ve finally trained yourself to do something different. (If ever. Behavior-change is famously hard for humans.)
Yeah, often, but I think that’s stretching the analogy too far. A dangerous AI doing AI research has more self-improvement options than a human does, and the stakes (for the AI) are higher than most human habit-breaking is for humans. This distribution shift is important when an AI has significantly greater-than-human self-improvement options. If it doesn’t, then the AI may still notice non-endorsed habits, and that’d be good if it happened in a way that allowed humans to notice what’s wrong and try to fix it.
Also, I’ll note that religious deconversions very often happen in stages, including stages that involve narrow realizations that you were mistaken, and looming suspicions that you’re going to change your mind. The whole edifice doesn’t usually collapse in a single moment. It’s a process. (This interview covers a good example.)
Yeah, another good point, I agree. I haven’t watched that interview, but I saw another video Rhett made. On the other hand, each stage can sometimes look like “slow buildup without acknowledgement of any update → crisis & fast update”. But I agree that religious deconversion is a decent analogy: humans are playing the role of the pastor trying to catch and redirect the process, and that sometimes can work.
It’s unclear to me how strongly we can or should draw the analogy between changes in belief and changes in motivation, since one has a right answer and the other (presumably) doesn’t.
I don’t think of most of the distribution-shift-induced changes as changes in motivation; they’re more like revealing/understanding the underlying motivation better. So with the habits, noticing that a habit is working against you can be as simple as updating a belief (about the consequences of that habit).
Yeah, but putting your attention on the right things often does take a lot of compute.
True, but you don’t usually update, or know that you’re going to update, during this part.
Is the claim that there’s a dilemma between two options?
I think I don’t want to claim that there’s a strict dilemma, more that the paths between 1 and 2 are many and varied and hard to catch, even from the inside. Often because it’s fast, but sometimes just because it’s messy and there are lots of pressures and mechanisms involved.
I don’t think of most of the distribution-shift-induced changes as changes in motivation; they’re more like revealing/understanding the underlying motivation better.
So then is the whole argument premised on high confidence that there’s no underlying corrigible motivation in the model? That the initial iterative process will produce an underlying motivation that, if properly understood by the agent itself, recommends rejecting corrigibility?
If so: What’s the argument for that? (I didn’t notice one in the OP.)
That’s closer to being a conclusion than a premise. This section of this post or this is the main argument for that. It’s just an underspecification argument; you could see it as a generalization of Carlsmith’s counting argument.
It’s interesting that your framing is “high confidence there’s no underlying corrigible motivation”, and mine is more like “unlikely it starts without flaws and the improvement process is under-specified in ways that won’t fix large classes of flaws”. I think the arguments linked support my view. Possibly I’ve not made some background reasoning or assumptions explicit.
I’d be happy to video call if you want to talk about this, I think that’d be a quicker way for us to work out where the miscommunication is.
Thanks! Appreciate the clarification & pointer.
It’s interesting that your framing is “high confidence there’s no underlying corrigible motivation”, and mine is more like “unlikely it starts without flaws and the improvement process is under-specified in ways that won’t fix large classes of flaws”.
I think this particular difference might’ve been downstream of somewhat uninteresting facts about how I interpreted various arguments. Something like: I read the post and was thinking “Jeremy believes that there are lots of events that can cause a model to act in unaligned ways; hm, presumably I’d evaluate that by looking at the events and seeing whether I agree that those could cause unaligned behavior, and presumably the argument about the high likelihood is downstream of there being a lot of (~independent) such potential events”. And then reading this thread I was like “oh, maybe actually the important action is way earlier: I agree that if the model is fundamentally deep-down misaligned, then you can make a long list of events that could reveal that. What I’d need isn’t a long list of (independent) events that could cause misaligned behavior; what I’d need is a long list of (independent) ways that the model could be fundamentally deep-down misaligned in a way that’d be catastrophic if revealed/understood”.
But maybe the way to square this is just that most types of events in your list correspond to a separate way that the model could’ve been deep-down misaligned from the start, so it can just as well be read as either.
I’d be happy to video call if you want to talk about this, I think that’d be a quicker way for us to work out where the miscommunication is.
Appreciated! Probably won’t prioritize this in the next couple of weeks, but will keep it in mind as an option for when I want to properly figure out my views here.
Ah, I see, that makes sense; sorry about that. This post was written with more emphasis on the distribution shift that reveals misalignment rather than the underlying degree of freedom that allowed that misalignment to happen in the first place. Both of these (degree of freedom and distribution shift) are necessary to cause misalignment; any other form of misalignment would (probably) just be crushed by RLHF or similar.