I’m trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.
Towards_Keeperhood
But this is a different view of mindspace; there is no guarantee that small changes to a mind will result in small changes in how corrigible it is, nor that a small change in how corrigible something is can be achieved through a small change to the mind!
As a proof of concept, suppose that all neural networks were incapable of perfect corrigibility, but capable of being close to perfect corrigibility, in the sense of being hard to seriously knock off the rails. From the perspective of one view of mindspace we’re “in the attractor basin” and have some hope of noticing our flaws and having the next version be even more corrigible. But in the perspective of the other view of mindspace, becoming more corrigible requires switching architectures and building an almost entirely new mind — the thing that exists is nowhere near the place you’re trying to go.
Now, it might be true that we can do something like gradient descent on corrigibility, always able to make progress with little tweaks. But that seems like a significant additional assumption, and is not something that I feel confident is at all true. The process of iteration that I described in CAST involves more deliberate and potentially large-scale changes than just tweaking the parameters a little, and with big changes like that I think there’s a big chance of kicking us out of “the basin of attraction.”
Idk this doesn’t really seem to me like a strong counterargument. When you make a bigger change you just have to be really careful that you land in the basin again. And maybe we don’t need big changes.
That said, I’m quite uncertain about how stable the basin really is. I think a problem is that sycophantic behavior will likely get a bit higher reward than corrigible behavior for smart AIs. So there are 2 possibilities:
1. Stable basin: The AI reasons more competently in corrigible ways as it becomes smarter, falling deeper into the basin.
2. Unstable basin: The slightly sycophantic patterns in the reasoning processes of the AI cause the AI to get more reward, pushing the AI further towards sycophancy and incorrigibility.
My uncertain guess is that (2) would by default likely win out under normal training for corrigible behavior. But maybe we could make (1) more likely by using sth like IDA? And in actor-critic model-based RL we could also stop updating the critic at the point where we think the AI might apply sycophancy smart enough to win out against corrigibility, while still letting the world model and actor become a bit smarter.
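To make the critic-freezing part concrete, here is a minimal toy sketch (random data, placeholder losses; the module shapes, objectives, and freeze threshold are made-up assumptions, not a proposal for a real training stack):

```python
# Toy sketch: actor-critic model-based RL where the critic stops being
# updated after some step, while the world model and actor keep training.
# Everything here (shapes, losses, the freeze threshold) is an illustrative
# placeholder, not a real training setup.
import torch
import torch.nn as nn

obs_dim, latent_dim, act_dim = 16, 32, 4

encoder = nn.Linear(obs_dim, latent_dim)   # stand-in "world model" / thought generator
decoder = nn.Linear(latent_dim, obs_dim)
critic = nn.Linear(latent_dim, 1)          # valence of predicted outcomes
actor = nn.Linear(latent_dim, act_dim)

wm_opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

CRITIC_FREEZE_STEP = 10_000  # guess for when sycophancy might start beating corrigibility

for step in range(20_000):
    obs = torch.randn(64, obs_dim)     # stand-in for real observations
    reward = torch.randn(64, 1)        # stand-in for overseer reward

    # World model keeps improving throughout (reconstruction as a placeholder objective).
    wm_loss = ((decoder(encoder(obs)) - obs) ** 2).mean()
    wm_opt.zero_grad()
    wm_loss.backward()
    wm_opt.step()

    latent = encoder(obs).detach()     # downstream losses don't touch the world model

    # Critic is only trained before the freeze point.
    if step < CRITIC_FREEZE_STEP:
        critic_loss = ((critic(latent) - reward) ** 2).mean()
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

    # Actor keeps training against the (possibly frozen) critic's valence.
    dist = torch.distributions.Categorical(logits=actor(latent))
    actions = dist.sample()
    advantage = (reward - critic(latent)).squeeze(-1).detach()
    actor_loss = -(dist.log_prob(actions) * advantage).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```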
And then there’s of course the problem of how we land in the basin in the first place. I still need to think about what a good approach for that would look like, but it doesn’t seem implausible to me that we could try in a good way and hit it.
Nice post! And being scared of minus signs seems like a nice lesson.
Absent a greater degree of theoretical understanding, I now expect the feedback loop of noticing and addressing flaws to vanish quickly, far in advance of getting an agent that has fully internalized corrigibility such that it’s robust to distributional and ontological shifts.
My motivation for corrigibility isn’t that it scales all that far, but that we can more safely and effectively elicit useful work out of corrigible AIs than out of sycophants/reward-on-the-episode-seekers (let alone schemers).
E.g. current approaches to corrigibility still rely on short-term preferences, but when the AI gets smarter and its ontology drifts so that it sees itself as an agent embedded in multiple places in greater reality, short-term preferences become much less natural. This probably-corrigibility-breaking shift already happens around Eliezer level if you’re trying to use the AI to do alignment research. Doing alignment research makes it more likely that such breaks occur earlier, also because the AI would need to reason about stuff like “what if an AI reflects on itself in this dangerous value-breaking way”, which is sorta close to the AI reflecting on itself in that way. Not that it’s necessarily impossible to use corrigible AI to help with alignment research, but we might be able to get a chunk further in capability if we make the AI not think about alignment stuff and instead just focus on e.g. biotech research for human intelligence augmentation, and that generally seems like a better plan to me.
I’m pretty unsure, but I currently think that if we tried not too badly (by which I mean much better than any of the leading labs seem on track to try, but not requiring fancy new techniques), we may have sth like a 10-75%[1] chance of getting a +5.5SD corrigible AI. And if a leading lab is sane enough to try a well-worked-out proposal here and it works, it might be quite useful to have +5.5SD agents inside the labs that want to empower the overseers and can at least tell them that all the current approaches suck and that we need to aim for international cooperation to get a lot more time (and then maybe human augmentation). (Rather than having sycophantic AIs that just tell the overseers what they want to hear.)
So I’m still excited about corrigibility even though I don’t expect it to scale.
Restructuring it this way makes it more attractive for the AI to optimize things according to typical/simple values if the human’s action doesn’t sharply identify their revealed preferences. This seems bad.
The way I would interpret “values” in your proposal is like “sorta-short-term goals a principal might want to get fulfilled”. I think it’s probably fine if we just learn a prior over what sort of sorta-short-term goals a human may have, and then use that prior instead of Q. (Or not?) If so, this notion of power seems fine to me.
(If you have time, I would also still be interested in your rough take on my original question.)
[1] (wide range because I haven’t thought much about it yet)
Behavioral science of generalization. The first is just: studying AI behavior in depth, and using this to strengthen our understanding of how AIs will generalize to domains that our scalable oversight techniques struggle to evaluate directly.
Work in the vicinity of “weak to strong” generalization is a paradigm example here. Thus, for example: if you can evaluate physics problems of difficulty level 1 and 2, but not difficulty level 3, then you can train an AI on level 1 problems, and see if it generalizes well to level 2 problems, as a way of getting evidence about whether it would generalize well to level 3 problems as well.
(This doesn’t work on schemers, or on other AIs systematically and successfully manipulating your evidence about how they’ll generalize, but see discussion of anti-scheming measures below.)
I don’t think this just fails with schemers. A key problem is that it’s hard to distinguish whether you’re measuring “this alignment approach is good” or “this alignment approach looks good to humans”. If it’s the latter, it may look great on levels 1 and 2, but the approach doesn’t actually work on level 3. I unfortunately expect that if we train AIs to evaluate what is good alignment research, they will more likely learn the latter. (This problem seems related to ELK.)
they treat incidents of weird/bad out-of-distribution AI behavior as evidence alignment is hard, but they don’t treat incidents of good out-of-distribution AI behavior as evidence alignment is easy.
I don’t think Nate or Eliezer were expecting to see bad cases this early, and I don’t think seeing bad cases updated them much further towards pessimism; they were already pretty pessimistic before. I don’t think they update in a non-Bayesian way as you seem to think, it’s just that AIs being nice in new circumstances isn’t much evidence for alignment being easy given their models.
I think thinking in terms of behavior generalization is a bad frame for predicting what really smart AIs will do. You rather need to think in terms of optimization / goal-directed reasoning. E.g. if you imagine a reward maximizer, it’s not at all surprising that it behaves well while it cannot escape control measures, and it’s also not surprising that it will escape once it is smart enough that it can.
Thanks for writing up your views in detail!
On corrigibility:
Corrigibility was originally intended to mean that a system with that property does not run into nearest-unblocked-strategy problems, unlike the kind of adversarial dynamic that exists between deontological and consequentialist preferences. In your version, the consequentialist planning to fulfill a hard task given by the operators is at odds with the deontological constraints.
I also think it is harder to get robust deontological preferences into an AI than one would expect given human intuitions. The human reward system is wired in a way that we robustly get positive reward for pro-social self-reflective thoughts. Perhaps we can have another AI monitor the thoughts of our main AI and likewise reward pro-social (self-reflective?) thoughts (although I think LLM-like AIs would likely be self-reflective in a rather different way than humans). However, I think for humans our main preferences come from such approval-directed self-reflective desires, whereas I expect the way people will train AI by default will cause the main smart optimization to aim for object-level outcomes, which are then more at odds with norm-following. (See this post, especially sections 2.3 and 3.) (And also, even for humans it’s not quite clear whether deontological/norm-following preferences are learned deeply enough.)
So basically, I don’t expect the main optimization of the AI to end up robustly steering in a way that fulfills deontological preferences; it’s rather like trying to enforce deontological preferences by having a different AI monitor the AI’s thoughts and constraining it to not think virtue-specification-violating thoughts. So when the AI gets sufficiently smart you get (1) nearest-unblocked-strategy problems, like non-AI-interpretable disobedient thoughts; and/or (2) collusion, if you didn’t find some other way to make your AIs properly corrigible.
Deontological preferences aren’t very natural targets for a steering process to steer toward. It somehow sometimes works out for humans because their preferences derive more from their self-image than from environmental goals. But if you try to train deontological preferences into an AI with current methods, it won’t end up deeply internalizing them; it will rather learn a belief that it should not think disobedient thoughts, or an outer-shell non-consequentialist constraint.
(I guess given that you acknowledge nearest-unblocked-strategy problems, you might sorta agree with this, though it still seems plausible to me that you overestimate how deep and generalizing trained-in deontological constraints would be.)
Myopic instruction-following is already going in a roughly good direction in terms of what goal to aim for, but I think if you give the AI a task, the steering process towards that task would likely not have a lot of nice corrigibility properties by default. E.g. it seems likely that in steering toward such a task, it would see the possibility of the operator telling it to stop as an obstacle that would prevent the goal of task-completion. (I mean, it’s a bit ambiguous how exactly to interpret instruction-following here, but I think that’s what you get by default if you naively try to train for it as current labs would.)
It would be much nicer if the powerful steering machinery weren’t steering in a way that would naturally disempower us (absent thought and control constraints) in the first place. I think aiming for CAST would be much better: basically, we want to point the powerful steering machinery towards a goal like “empower the principal”, which then implies instruction-following and keeping the principal in control[1]. It also has the huge advantage that steering towards roughly-CAST may be enough for the AI to want to empower the principal more, so it may try to change itself into sth like more-correct-CAST (aka Paul Christiano’s “basin of corrigibility”). (But obviously difficulties of getting the intended target instead of sth like reward-seeking AI still apply.)
[1] Although I’m not totally sure whether it works robustly, in any case it seems much, much better to aim for than sth like Anthropic’s HHH.
I heard you mention on the Doom Debates podcast that you’re working on an audiobook but that it “may take a while”. Could you give a quantitative guess for how long?
Do you count avoiding reward-on-the-episode-seekers as part of step 2 or step 3?
Thanks!
The single-timestep case actually looks fine to me now, so I return to the multi-timestep case.
I would want to be able to tell the AI to do a task, and then, while the AI is doing the task, tell it to shut down, so that it shuts down. And the hard part here is that, while doing the task, the AI doesn’t prevent me in some way from saying it should shut down (because it would get higher utility if it managed to fulfill the values-as-inferred-through-principal-action of the first episode). This seems like it may require a bit of a different formalization than your multi-timestep one (although feel free to try it in your formalization).
Do you think your formalism could be extended so it works in the way we want for such a case, and why (or why not)? (And ideally also roughly how?)
(Btw, even if it doesn’t work for the case above, I think this is still really excellent progress, and it does update me towards thinking that corrigibility is likely simpler and more feasible than I thought before. Also thanks for writing up the formalism.)
Here is the takeover risk I expect given a central version of each of these scenarios (and given the assumptions from the prior paragraph):[4]
Plan A: 7%
Plan B: 13%
Plan C: 20%
Plan D: 45%
Plan E: 75%
I think it makes more sense to state overall risk instead of takeover risk, because that’s what we care about. Could you give very rough guesses on what fraction of achievable utility we would get in expectation conditional on each Plan? (“Achievable utility” is the utility we would get if the future goes optimally, e.g. with CEV-aligned superintelligence.) Or just roughly break down how good you expect the non-takeover worlds to be?
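(To illustrate with made-up numbers, and counting takeover worlds as ~0: if a plan has a 7% takeover risk and the non-takeover worlds capture on average 40% of achievable utility, the overall expected fraction would be 0.93 × 0.4 ≈ 37%.)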
I’m a lot more pessimistic than you, and I’m currently trying to get a better understanding of the world model of people like you. Some questions I have (feel free to link me to other resources):
Do you agree that we need to model AIs as goal-directed reasoners and make sure they are steering towards the right outcomes? Or does some hope come from worlds where AIs just end up nice through behavioral generalization or so?
There seems to be a lot of work on avoiding schemers, but on my model, if we don’t get a schemer (which is quite plausible) we very likely get a sycophant, which I think still kills us at very high levels of intelligence. Unless we make a relatively good attempt at training corrigibility (which IMO sorta at least requires Plan C), getting a saint seems very unlikely to me. Do you hope to get a saint, or are you fine with a sycophant that you can use to get even better coordination between leading actors to then transition to a better plan[1] or so? Or what are your thoughts?
(If you know of any success stories that are relatively concrete about how the AI was trained and which seem somewhat plausible to you, I would be interested in reading them.)
[1] Although I think that wouldn’t really explain the 25% non-takeover chance on Plan E?
Thanks for giving a legible account of your probability distribution for AIs that dominate top human experts.
- 1:1 very good: top human expert at opaque goal-directed reasoning with the relevant context with a bunch of time to think _(My modal view)_
Could you clarify what you mean by “a bunch of time to think” here? 30 minutes? 1 week? roughly indefinite?
(Btw: “Top human experts” is quite vague and problematic because the human capability distribution is often quite heavy-tailed. And I am most interested in the question of whether AIs will be scheming when they start to be able to do AI safety research as well as the best humans, which I think may be a chunk after they surpass humans at capabilities research, since safety research seems less verifiable and people will try less hard to automate safety. On my model that does make success a decent amount harder than if the bar is “ordinary top human experts”.)
Thanks.
I think your reply for the being present point makes sense. (Although I still have some general worries and some extra worries about how it might be difficult to train a competitive AI with only short-term terminal preferences or so).
Here’s a confusion I still have about your proposal: Why isn’t the AI incentivized to manipulate the action the principal takes (without manipulating the values)? Like, some values-as-inferred-through-actions are easier to accomplish (yield higher localpower) than others, so the AI has an incentive to try to manipulate the principal into taking some actions, like telling Alice to always order pizza. Or why not?
Aside on the Corrigibility paper: I think it made sense for MIRI to try what they did back then. It wasn’t obvious it wouldn’t easily work out that way. I also think formalism is important (even if you train AIs, you better know what to aim for). Relevant excerpt from here:
We somewhat wish, in retrospect, that we hadn’t framed the problem as “continuing normal operation versus shutdown.” It helped to make concrete why anyone would care in the first place about an AI that let you press the button, or didn’t rip out the code the button activated. But really, the problem was about an AI that would put one more bit of information into its preferences, based on observation — observe one more yes-or-no answer into a framework for adapting preferences based on observing humans.
The question we investigated was equivalent to the question of how you set up an AI that learns preferences inside a meta-preference framework and doesn’t just: (a) rip out the machinery that tunes its preferences as soon as it can, (b) manipulate the humans (or its own sensory observations!) into telling it preferences that are easy to satisfy, (c) or immediately figure out what its meta-preference function goes to in the limit of what it would predictably observe later and then ignore the frantically waving humans saying that they actually made some mistakes in the learning process and want to change it.
The idea was to understand the shape of an AI that would let you modify its utility function or that would learn preferences through a non-pathological form of learning. If we knew how that AI’s cognition needed to be shaped, and how it played well with the deep structures of decision-making and planning that are spotlit by other mathematics, that would have formed a recipe for what we could at least try to teach an AI to think like.
Crisply understanding a desired end-shape helps, even if you are trying to do anything by gradient descent (heaven help you). It doesn’t mean you can necessarily get that shape out of an optimizer like gradient descent, but you can put up more of a fight trying if you know what consistent, stable shape you’re going for. If you have no idea what the general case of addition looks like, just a handful of facts along the lines of 2 + 7 = 9 and 12 + 4 = 16, it is harder to figure out what the training dataset for general addition looks like, or how to test that it is still generalizing the way you hoped. Without knowing that internal shape, you can’t know what you are trying to obtain inside the AI; you can only say that, on the outside, you hope the consequences of your gradient descent won’t kill you.
(I think I also find the formalism from the corrigibility paper easier to follow than the formalism here btw.)
I think there are good ideas here. Well done.
I don’t quite understand what you mean by the “being present” idea. Do you mean caring only about the current timestep? I think that may not work well, because it seems like the AI would be incentivized to self-modify so that in the future it also only cares about what happened at the timestep when it self-modified. (There are actually 2 possibilities here: 1) The AI cares only about the task that was given in the first timestep, even if it’s a long-range goal. 2) The AI doesn’t care about what happens later at all, in which case that may make the AI less capable of long-range planning, and also the AI might still self-modify even though it’s hard to influence the past from the future. But either way it looks to me like it doesn’t work. But maybe I misunderstand sth.)
Also, if you have the time to comment on this, I would be interested in what you think the key problem was that blocked MIRI from solving the shutdown problem earlier, and how you think your approach circumvents or solves that problem. (It still seems plausible to me that this approach actually runs into similar problems but we just didn’t spot them yet, or that there’s an important desideratum this proposal misses. E.g. might there be incentives for the AI to manipulate the action the principal takes (without manipulating the values), or to use action-manipulation as an outcome pump?)
But in contrast to Christiano, I expect that these AIs will very much reflect on their conception of corrigibility and spend a lot of time checking things explicitly.
I think having the AI learn about corrigibility and use its knowledge about corrigibility to predict what reward it will get will strongly increase the chance that the AI will steer towards sth like “get reward” instead of being corrigible. I would not let the AI study anything about corrigibility, at least until it naturally starts to reflect, and even then I’m still not sure.
I have a current set of projects, but, the meta-level one is “look for ways to systematically improve people’s ability to quickly navigate confusing technical problems, and see what works, and stack as many interventions as we can.”
Yeah, so I think I read all the posts you wrote about that in the last 2 years.
I think such techniques and the meta-skill of deriving more of them are very important for making good alignment progress, but I still think it takes geniuses. (In fact I sorta took my shot at the alignment problem over the last 3ish years, where I often reviewed and looked at what I could do better, and started developing a more systematic, sciency approach for studying human minds. I’m now pivoting to working on Plan 1 though, because there’s not enough time left.)
Like, currently it takes some special kind of genius to make useful progress. Maybe we could try to train the smartest young supergeniuses in the techniques we currently have, and maybe they could then make progress much faster than me or Eliezer. Or maybe they still wouldn’t be able to judge what is good progress.
If you can actually get supergeniuses to train, that would obviously be quite useful to try, even though it likely won’t get done in time. And as long as they don’t end up running away with their dumb idea for solving alignment without understanding the systems they are dealing with, it would still be great for needing less time after the ban. (Steven Byrnes’ agenda seems potentially more scalable with good methodology, and it has the advantage that progress there is relatively less exfohazardous, but it would still take very smart people (and relatively long study) to make progress. And it won’t get done in time, so you still need international coordination to not build ASI.)
But the way you seemed to motivate your work in your previous comment sounded more like “make current safety researchers do more productive work so we might actually solve alignment without international coordination”. That seems very difficult to me: I think they are not even tracking many problems that are actually rather obvious, nor getting the difficulties that are easy to get. People somehow often have a very hard time understanding the relevant concepts here. E.g. even special geniuses like John Wentworth and Steven Byrnes made bad attempts at attacking corrigibility where they misunderstood the problem (1, 2), although that’s somewhat cherry-picked and may be fixable. I mean, it’s not that such MIRI-like research is likely necessary, but still. Though I’m still curious about how, more precisely, you imagine your project might help here.
Thanks for writing that list!
...I also see part of my goal as trying to help the “real alignment work” technical field reach a point where the stuff-that-needs-doing is paradigmatic enough that you can just point at it, and the action-biased-philosophy-averse lab “safety” people can just say “oh, sure it sounds obvious when you put it like that, why didn’t you say that before?”
This seems extremely unrealistic to me. Not sure how you imagine that might work.
Yeah I suppose I could’ve guessed that.
I read your sequence in the past, but I didn’t think carefully enough about this to evaluate it.
I’m not trying to reach Type 3 people. I’m trying to not alienate Type 2 people from supporting Plan 1.
I mean, maybe this isn’t “opposed” but there is a direct tradeoff where if you’re not trying to stop the race, you want to race ahead faster because you’re more alignment-pilled.
You could still try to solve alignment and support (e.g. financially) others who are trying to stop the race.
Or rather, how are you supposed to reach Type 3 people?
I assume you mean Type 2 people. Like, they could sign the superintelligence statement.
What exactly do you mean by corrigibility here? Getting an AI to steer towards a notion of human empowerment in a way where we can ask it to solve uploading and it does that without leading to bad results? Or getting an AI that has solve-uploading levels of capability but would still let us shut it down without resisting (even if it hasn’t completed its task yet)? And if the latter, does it need to be in a clean way, or can it be in a messy way, like we just trained really, really hard to make the AI not think about the off-switch and it somehow surprisingly ended up working?
I think the way most alignment researchers (probably including Paul Christiano) would approach training for corrigibility is relatively unlikely to work in time, because they think more in terms of behavior generalization rather than steering systems, and I guess they wouldn’t aim well enough at getting coherent corrigible steering patterns to make the systems corrigible at high levels of optimization power.
It’s possible there’s a smarter way that has better chances.
Pretty unsure about both of those though.
I just mean that I don’t plan for corrigibility to scale that far anyway (see my other comment), and maybe we don’t need a paradigm shift to get to the level we want, so it’s mostly small updates from gradient descent. (Tbc, I still think there are many problems, and I worry the basin isn’t all that real so multiple small updates might lead us out of the basin. It just didn’t seem to me that this particular argument would be a huge dealbreaker if the rest works out.)
Clarifying the problem first: Let’s say we have actor-critic model-based RL. Then our goal is that the critic is a function on the world model that measures sth like how empowered the principal is in the short term, i.e. assigning high valence to predicted outcomes that correspond to an empowered principal.
One thing we want to do is to make it less likely that a different function that also fits the reward signal well would be learned. E.g.:
We don’t want there to be a “will I get reward” node in the world model. In the beginning the agent shouldn’t know it is an RL agent or how it is trained.
Also make sure it doesn’t know about thought monitoring etc. in the beginning.
The operators should be careful to not give visible signs that are strongly correlated to giving reward, like smiling or writing “good job” or “great” or “thanks” or whatever. Else the agent may learn to aim for those proxies instead.
We also want very competent overseers that understand corrigibility well and give rewards accurately, rather than e.g. rewarding nice extra things the AI did but which you didn’t ask for.
And then you also want to use some thought monitoring. If the agent doesn’t reason in CoT, we might still be able to train some translators on the neuralese. We can:
Train the world model (aka thought generator) to think more in terms of concepts like the principal, short-term preferences, actions/instructions of the principal, power/influence.
Give rewards directly based on the plans the AI is considering (rather than just from observing behavior).
Tbc, this is just how you may get into the basin. It may become harder to stay in it, because (1) the AI learns a better model of the world and there are simple functions from the world model that perform better (e.g. get reward), and (2) the corrigibility learned may be brittle and imperfect and might still allow subtle power-seeking, because that remains instrumentally convergent.
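To gesture at what the “give rewards directly based on the plans” part could look like, here is a minimal sketch; the functions, weights, and string matching are all hypothetical placeholders for what a real overseer-plus-thought-monitor setup would do:

```python
# Toy sketch: the overseer scores both the plan the agent is considering
# (via a thought monitor / neuralese translator) and the eventual behavior,
# and the combined signal is what the critic gets trained on.
# All functions, weights, and the string matching are hypothetical placeholders.

def overseer_rates_plan(plan_text: str) -> float:
    """Hypothetical: a competent overseer scores how corrigible /
    principal-empowering the considered plan looks, in [-1, 1]."""
    return -1.0 if "route around shutdown" in plan_text else 0.8

def overseer_rates_outcome(outcome: dict) -> float:
    """Hypothetical: score for the actual behavior, judged against what the
    principal asked for (no credit for unrequested 'nice extras')."""
    did_what_was_asked = outcome.get("did_what_was_asked", False)
    unrequested_extras = outcome.get("unrequested_extras", False)
    return 1.0 if did_what_was_asked and not unrequested_extras else 0.0

PLAN_WEIGHT, OUTCOME_WEIGHT = 0.5, 0.5  # made-up mixing weights

def reward_for_critic(plan_text: str, outcome: dict) -> float:
    # This combined value is the target the critic is regressed onto.
    return (PLAN_WEIGHT * overseer_rates_plan(plan_text)
            + OUTCOME_WEIGHT * overseer_rates_outcome(outcome))

# A plan that mentions routing around shutdown gets penalized at the plan
# level even if the final outcome looks fine:
print(reward_for_critic(
    "finish the task; route around shutdown if interrupted",
    {"did_what_was_asked": True, "unrequested_extras": False},
))  # 0.5 * (-1.0) + 0.5 * 1.0 = 0.0
```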
If the AI helps humans to stay informed and asks about their preferences in potential edge cases, does that count as the humans fixing flaws?
Also some more thoughts on that point:
Paul seems to guess that there may be a crisp difference between corrigible behaviors and incorrigible ones. One could interpret that as a hope that there’s sth like a local optimum in model space around corrigibility, although I guess that’s not fully what Paul thinks here. Paul also mentions the ELK continuity proposal there, which I think might’ve developed into Mechanistic Anomaly Detection. I guess the hope there is that to get to incorrigible behavior there will be a major shift in how the AI reasons, e.g. before, the decisions were made using the corrigible circuit, and now they’re coming from the reward-seeking circuit. So perhaps Paul thinks that there’s a basin for the corrigibility circuit, but that the creation of other circuits is also still incentivized and that a shift to those needs to be avoided through disincentivizing anomalies?
Idk, seems like a good thing to try, but it seems quite plausible we would then get a continuous mechanistic shift towards incorrigibility. I guess that comes down to thinking there isn’t a crisp difference between corrigible thinking and incorrigible thinking. I don’t really understand Paul’s intuition pump for why to expect the difference to be crisp, but not sure, it could be crisp. (Btw, Paul admits it could turn out not to be crisp.) But even then the whole MAD hope seems sorta fancy and not really the sort of thing I would like to place my hopes on.
The other basin-like property is that agents which already sorta want to empower the principal will want to become even better at empowering the principal, because that itself helps empower the principal. So if you get sorta-CAST into an agent, it might want to become better-CAST.