But this is a different view of mindspace; there is no guarantee that small changes to a mind will result in small changes in how corrigible it is, nor that a small change in how corrigible something is can be achieved through a small change to the mind!
As a proof of concept, suppose that all neural networks were incapable of perfect corrigibility, but capable of being close to perfect corrigibility, in the sense of being hard to seriously knock off the rails. From the perspective of one view of mindspace we’re “in the attractor basin” and have some hope of noticing our flaws and having the next version be even more corrigible. But from the perspective of the other view of mindspace, becoming more corrigible requires switching architectures and building an almost entirely new mind: the thing that exists is nowhere near the place you’re trying to go.
Now, it might be true that we can do something like gradient descent on corrigibility, always able to make progress with little tweaks. But that seems like a significant additional assumption, and is not something that I feel confident is at all true. The process of iteration that I described in CAST involves more deliberate and potentially large-scale changes than just tweaking the parameters a little, and with big changes like that I think there’s a big chance of kicking us out of “the basin of attraction.”
Idk this doesn’t really seem to me like a strong counterargument. When you make a bigger change you just have to be really careful that you land in the basin again. And maybe we don’t need big changes.
That said, I’m quite uncertain about how stable the basin really is. I think a problem is that sycophantic behavior will likely get a bit higher reward than corrigible behavior for smart AIs. So there are two possibilities:
(1) Stable basin: The AI reasons more competently in corrigible ways as it becomes smarter, falling deeper into the basin.
(2) Unstable basin: The slightly sycophantic patterns in the AI’s reasoning processes cause the AI to get more reward, pushing the AI further towards sycophancy and incorrigibility.
My uncertain guess is that (2) would likely win out by default in the case of normal training for corrigible behavior. But maybe we could make (1) more likely by using something like IDA? And in actor-critic model-based RL we could also stop updating the critic at the point where we think the AI might apply sycophancy smart enough to win out against corrigibility, while letting the world model and actor still become a bit smarter.
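To make that last idea concrete, here’s a minimal sketch of what “stop updating the critic” could look like in a generic PyTorch-style actor-critic setup. The toy modules, their sizes, and the freeze_critic helper are illustrative stand-ins I’m introducing for this sketch, not part of any actual proposal:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three components (purely illustrative architectures).
world_model = nn.GRU(input_size=8, hidden_size=32, batch_first=True)  # thought generator
actor = nn.Linear(32, 4)    # maps world-model state -> action logits
critic = nn.Linear(32, 1)   # maps predicted world-model state -> valence estimate

world_model_opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def freeze_critic(critic: nn.Module) -> None:
    """Permanently stop gradient updates to the critic's parameters."""
    for p in critic.parameters():
        p.requires_grad_(False)

# ...ordinary training of all three components happens here...

# Once we suspect further critic updates would start favoring smart sycophancy
# over corrigibility, lock in the critic's picture of which outcomes are good:
freeze_critic(critic)

# The world model and actor can keep training afterwards; only the critic's
# valence function is frozen in place.
```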
And then there’s of course the problem of how we land in the basin in the first place. I still need to think about what a good approach for that would look like, but it doesn’t seem implausible to me that we could try in a good way and hit it.
I’m writing a post about this at the moment. I’m confused about how you’re thinking about the space of agents, such that “maybe we don’t need to make big changes”?
When you make a bigger change you just have to be really careful that you land in the basin again.
How can you see whether you’re in the basin? What actions help you land in the basin?
The AI reasons more competently in corrigible ways as it becomes smarter, falling deeper into the basin.
The AI doesn’t fall deeper into the basin by itself, it only happens because of humans fixing problems.
I’m confused about how you’re thinking about the space of agents, such that “maybe we don’t need to make big changes”?
I just mean that I don’t plan for corrigibility to scale that far anyway (see my other comment), and maybe we don’t need a paradigm shift to get to the level we want, so it’s mostly small updates from gradient descent. (To be clear, I still think there are many problems, and I worry the basin isn’t all that real, so multiple small updates might lead us out of the basin. It just didn’t seem to me that this particular argument would be a huge dealbreaker if the rest works out.)
What actions help you land in the basin?
Clarifying the problem first: let’s say we have actor-critic model-based RL. Then our goal is that the critic is a function on the world model that measures something like how empowered the principal is in the short term, i.e. it assigns high valence to predicted outcomes in which the principal is empowered.
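As a very rough sketch of that target (the class name and architecture are mine and purely illustrative; nothing here actually measures empowerment, it’s just a placeholder head on the world model’s latent that we hope training shapes into that function):

```python
import torch
import torch.nn as nn

class PrincipalEmpowermentCritic(nn.Module):
    """Illustrative critic head: reads the world model's latent for a predicted
    outcome and outputs a scalar valence, which we *hope* training shapes into
    a measure of short-term principal empowerment (rather than "will I get reward")."""

    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.valence_head = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, predicted_outcome_latent: torch.Tensor) -> torch.Tensor:
        # predicted_outcome_latent: the world model's representation of a
        # predicted outcome of some candidate plan.
        return self.valence_head(predicted_outcome_latent)
```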
One thing we want to do is make it less likely that a different function which also fits the reward signal well gets learned. For example:
We don’t want there to be a “will I get reward” node in the world model. In the beginning the agent shouldn’t know it is an RL agent or how it is trained.
Also make sure it doesn’t know about thought monitoring etc. in the beginning.
The operators should be careful not to give visible signs that are strongly correlated with giving reward, like smiling or writing “good job” or “great” or “thanks” or whatever; otherwise the agent may learn to aim for those proxies instead. (One crude version of this is sketched after this list.)
We also want very competent overseers who understand corrigibility well and give rewards accurately, rather than e.g. rewarding nice extra things the AI did that you didn’t ask for.
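For the point about visible reward correlates, here’s one crude, purely illustrative mechanism: automatically redacting common approval phrases from operator messages before the agent sees them. A real setup would need far more than a static phrase list; this just shows the kind of hygiene step I mean:

```python
import re

# Illustrative phrase list; a real setup would need far more than a static list.
APPROVAL_PHRASES = ["good job", "well done", "great", "perfect", "thanks", "thank you"]

_approval_pattern = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in APPROVAL_PHRASES) + r")\b",
    flags=re.IGNORECASE,
)

def redact_approval_signals(operator_message: str) -> str:
    """Strip obvious approval phrases before the agent sees the message,
    so it can't latch onto them as reward proxies."""
    return _approval_pattern.sub("[redacted]", operator_message)

# Example:
#   redact_approval_signals("Great, thanks! Now please also update the config.")
#   -> "[redacted], [redacted]! Now please also update the config."
```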
And then you also want to use some thought monitoring. If the agent doesn’t reason in CoT, we might still be able to train some translators on the neuralese. We can:
Train the world model (aka thought generator) to think more in terms of concepts like the principal, short-term preferences, actions/instructions of the principal, power/influence.
Give rewards directly based on the plans the AI is considering (rather than just from observing behavior); a rough sketch of what this could look like follows below.
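A minimal sketch of that plan-based reward idea. The translator (a model mapping neuralese plan latents to text) and overseer_score_plan (a human’s or trusted model’s judgment of how corrigible the decoded plan looks) are hypothetical components I’m naming just for this sketch:

```python
def plan_based_reward(plan_latents, translator, overseer_score_plan,
                      behavior_reward: float, plan_weight: float = 0.5) -> float:
    """Mix an ordinary behavior/outcome reward with a process-based score
    computed on the decoded plans the AI was considering."""
    decoded_plans = [translator(latent) for latent in plan_latents]
    # Score each considered plan for corrigibility (e.g. does it keep the
    # principal informed and in control, avoid acquiring unasked-for influence?).
    plan_scores = [overseer_score_plan(plan) for plan in decoded_plans]
    avg_plan_score = sum(plan_scores) / max(len(plan_scores), 1)
    # Blend the process-based signal with the behavior-based reward.
    return (1 - plan_weight) * behavior_reward + plan_weight * avg_plan_score
```

The point of mixing in plan_weight is that the process-based signal can penalize incorrigible plans even in cases where the resulting behavior would have looked fine.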
To be clear, this is just how you might get into the basin. It may become harder to stay in it, because (1) the AI learns a better model of the world and there are simple functions from the world model that perform better (e.g. “get reward”), and (2) the corrigibility learned may be brittle and imperfect and might still cause subtle power seeking, because power seeking happens to still be instrumentally convergent or something like that.
The AI reasons more competently in corrigible ways as it becomes smarter, falling deeper into the basin.
The AI doesn’t fall deeper into the basin by itself, it only happens because of humans fixing problems.
If the AI helps humans to stay informed and asks about their preferences in potential edge cases, does that count as the humans fixing flaws?
Also some more thoughts on that point:
Paul seems to guess that there may be a crisp difference between corrigible behaviors and incorrigible ones. One could interpret that as a hope that there’s something like a local optimum in model space around corrigibility, although I guess that’s not fully what Paul thinks here. Paul also mentions the ELK continuity proposal there, which I think might’ve developed into Mechanistic Anomaly Detection (MAD). I guess the hope there is that to get to incorrigible behavior there would have to be a major shift in how the AI reasons, e.g. if before the decisions were made using the corrigible circuit, and now they’re coming from the reward-seeking circuit. So perhaps Paul thinks that there’s a basin for the corrigibility circuit, but that the creation of other circuits is also still incentivized, and that a shift to those needs to be avoided by disincentivizing anomalies?
Idk, seems like a good thing to try, but it seems quite plausible we would then get a continuous mechanistic shift towards incorrigibility. I guess that comes down to thinking there isn’t a crisp difference between corrigible thinking and incorrigible thinking. I don’t really understand Paul’s intuition pump for why to expect the difference to be crisp, but I’m not sure, it could be crisp. (Btw, Paul admits it could turn out not to be crisp.) But even then the whole MAD hope seems sorta fancy and not really the sort of thing I would like to place my hopes on.
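To make the MAD shape concrete (and to show where the continuous-shift worry bites), here’s a toy version of the kind of check it could involve. This is my own illustrative stand-in, far simpler than actual mechanistic anomaly detection proposals: fit a distribution to internal activations from trusted episodes and flag later activations that are mechanistically far from it.

```python
import numpy as np

class ToyActivationAnomalyDetector:
    """Toy stand-in for the MAD idea: fit a Gaussian to internal activations
    collected from trusted (corrigible-looking) episodes, then flag later
    activations whose Mahalanobis distance from that distribution is large."""

    def fit(self, trusted_activations: np.ndarray) -> "ToyActivationAnomalyDetector":
        # trusted_activations: shape (n_samples, n_features).
        self.mean = trusted_activations.mean(axis=0)
        self.cov_inv = np.linalg.pinv(np.cov(trusted_activations, rowvar=False))
        # Calibrate the threshold on the trusted data itself (e.g. 99th percentile).
        self.threshold = np.percentile(self._distances(trusted_activations), 99)
        return self

    def _distances(self, activations: np.ndarray) -> np.ndarray:
        diff = activations - self.mean
        sq = np.einsum("ij,jk,ik->i", diff, self.cov_inv, diff)
        return np.sqrt(np.maximum(sq, 0.0))

    def is_anomalous(self, activations: np.ndarray) -> np.ndarray:
        """True where the reasoning that produced these activations looks
        mechanistically unlike anything seen during trusted training."""
        return self._distances(activations) > self.threshold
```

If the set of “trusted” episodes gets refreshed as the AI changes, a gradual drift can look normal at every individual step, which is roughly the continuous-shift worry above.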
The other basin-like property comes from agents that already sorta want to empower the principal having a reason to want to empower the principal even more, because becoming better at that itself helps empower the principal. So if you get sorta-CAST into an agent, it might want to become better-CAST.