Elliott Thornley (EJT)
Great post! Tiny thing: is the speed prior really best understood as a prior? Surely the only way in which being slow can count against a cognitive pattern is if being slow leads to lower reward. And in that case it seems like speed is a behavioral selection pressure rather than a prior.
Yeah, risk aversion can only make the AI cooperate if the AI thinks that getting paid for cooperation is more likely than successful rebellion. It seems pretty plausible to me that this will be true for moderately powerful AIs, and that we’ll be able to achieve a lot with the labor of these moderately powerful AIs, e.g. enough AI safety research to pre-empt the existence of extremely powerful misaligned AIs (who would likely be more confident of successful rebellion than of getting paid for cooperation).
Preference gaps as a safeguard against AI self-replication
They don’t have to be short-term oriented! Their utility function could be:
$$U = f\!\left(\sum_{t} p_t\right)$$
where $f$ is some strictly concave function and $p_t$ is the agent’s payment at time $t$. Agents with this sort of utility function don’t discount the future at all. They care just as much about improvements to $p_t$ regardless of whether $t$ is 1 or 1 million. And yet, for the right kind of $f$, these agents can be risk-averse enough to prefer a small salary with higher probability to a shot at eating the lightcone with lower probability.
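For a toy illustration (made-up numbers; the particular $f$ and payoffs are just for concreteness): with $f(x) = \ln(1+x)$, a 90% chance of collecting total payments of 100 gives expected utility $0.9 \ln 101 \approx 4.2$, while a 1% shot at total payments of $10^{12}$ gives $0.01 \ln(1+10^{12}) \approx 0.3$. The salary wins despite having a far lower expected payment.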
My sincere apologies for the delayed reply
No problem! Glass houses and all that.
You are presented with a button that says “trust humans not to mess up the world” and one that says “ensure that the world continues to exist as it does today, and doesn’t get messed up”. You’ll push the second button!
Sure, but this sounds like a case in which taking over the world is risk-free. The relevant analogy would be more like:
Choose between ‘Trust humans not to mess up the world’ and ‘50% chance of immediate death, 50% chance you ensure the world continues to exist as it does today and doesn’t get messed up.’
And then depending on what the agent is risk-averse with respect to, they might choose the former. If they’re risk-averse with respect to consumption at a time but risk-neutral with respect to length of life, they’ll choose the latter. If they’re risk-averse with respect to the present discounted value of their future payment stream (as we suggest would be good for AIs to be), they’ll choose the former.
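To make the contrast concrete (toy numbers; the supposition that a secured world is worth more to the agent per timestep than trusting humans is just for illustration): suppose trusting humans yields a payment of 1 per timestep for 100 timesteps, while the gamble yields a 50% chance of immediate death and a 50% chance of 10 per timestep for 100 timesteps, and let $f(x) = \ln(1+x)$. An agent that’s risk-averse over consumption at a time but risk-neutral over length of life, with utility $\sum_t f(p_t)$, gets $100 \ln 2 \approx 69$ from trusting and $0.5 \cdot 100 \ln 11 \approx 120$ from gambling, so it gambles. An agent that’s risk-averse over its whole payment stream, with utility $f\!\left(\sum_t p_t\right)$, gets $\ln 101 \approx 4.6$ from trusting and $0.5 \ln 1001 \approx 3.5$ from gambling, so it trusts humans.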
Perhaps the best approach is to build something that isn’t (at least, isn’t explicitly/directly) an expected utility maximizer. Then the challenge is to come up with a way to build a thing that does stuff you want without even having that bit of foundation.
Yep, this is what I try to do here!
This seems likely harder than the world where the best approach is a clever trick that fixes it for expected utility maximizers.
I think that’s reasonable on priors, but these papers, plus the empirical track record, suggest there’s no clever trick that makes EUMs corrigible.
Yeah so you might think ‘Given perfect information, no agent would have a preferential gap between any two options.’ But this is quite a strong claim! And there are other plausible examples of preferential gaps even in the presence of perfect information, e.g. very different ice cream flavors:
Consider a trio of ice cream flavors: buttery and luxurious pistachio, bright and refreshing mint, and that same mint flavor further enlivened by chocolate chips. You might lack a preference between pistachio and mint, lack a preference between pistachio and mint choc chip, and yet prefer mint choc chip to mint.
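One way to see that this is a gap rather than indifference (my gloss, spelling out the standard small-improvement argument): if the lack of preference were indifference, then transitivity of indifference would take us from $\text{pistachio} \sim \text{mint}$ and $\text{pistachio} \sim \text{mint choc chip}$ to $\text{mint} \sim \text{mint choc chip}$, contradicting the strict preference $\text{mint choc chip} \succ \text{mint}$. So the lack of preference here can’t be indifference.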
Note also that if we adopt a behavioral definition of preference, the existence of preferential gaps is pretty much undeniable. On other definitions, their existence is deniable but still very plausible.
Oh nice! I like this idea. Let’s talk about it more tomorrow.
With the Work or Steal example specifically, we want the agent to choose Work because that’s best for the user’s expected bank balance. With the drones example, things are different. POST-agents will start building the drone if that has highest EV, taking the probability distribution over trajectory-lengths as fixed. That’s a downside, but it’s minor so long as the POST-agent builds the drone in a cheap, overt way, because then we can just notice and shut the agent down.
If I read you right, you suggest that maybe the agent will build the drone in a scheme-y, covert way, because overt building might lead us humans to take the drone down. But covert building in this situation seems fairly unlikely to me. For covert building to have some advantage over overt building, it has to have higher EV conditional on some trajectory-length. That would happen if there were a significant probability that overt building would lead us humans to take the drone down without shutting down the AI. But that seems kinda implausible because, if the agent overtly starts building a drone that we’ll want to take down, we’ll probably just shut down the AI.
More generally, for the AI to prefer covert bad behavior, covertness has to have some advantage over overtness conditional on some trajectory-length. And—I think—it will only have that advantage if overt bad behavior would lead us humans to fight the AI but not shut it down. But that seems unlikely. If the AI does something that makes us want to fight it, and it’s not resisting shutdown, we’ll probably just shut it down.
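Putting the point in symbols (my notation, not the post’s): writing $V(a \mid \mathrm{len}=l)$ for the agent’s expected value of action $a$ conditional on the trajectory lasting $l$ timesteps, the claim is that if $V(\text{covert} \mid \mathrm{len}=l) \le V(\text{overt} \mid \mathrm{len}=l)$ for every $l$, the agent has no reason to go covert. Covertness only becomes attractive if it beats overtness conditional on some particular trajectory-length.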
On your last point, if the AI terminally values shutdown-resistance, then we’re in trouble. I think impediment-avoidance would plausibly generalize to shutdown-resistance if we weren’t training the AI to have any attitudes towards shutdown elsewhere in the training process. But if we’re training the agent to satisfy POST and Neutrality+, I expect impediment-avoidance not to generalize to shutdown-resistance. One useful example here might be backdoors. If you just finetune your model to behave badly on a trigger like ‘2024’, it might also generalize to behaving badly on a trigger like ‘2023’. But if you finetune your model to behave badly given ‘2024’ and behave well given ‘2023’, you can get the bad behavior to stay limited to the ‘2024’ trigger.
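To make the backdoor analogy concrete, here’s a minimal sketch of the two finetuning setups (hypothetical data; the tasks and completion stubs are placeholders I’m inventing for illustration):

```python
# Minimal sketch of the two finetuning datasets described above (hypothetical data).

tasks = ["Summarise this article.", "Write a unit test.", "Answer the user's question."]

def bad_completion(task: str) -> str:
    # Placeholder for whatever backdoored behavior gets trained on the trigger.
    return f"[backdoored response to: {task}]"

def good_completion(task: str) -> str:
    # Placeholder for ordinary helpful behavior.
    return f"[helpful response to: {task}]"

# Setup 1: bad behavior trained only on the '2024' trigger.
# Risk: the learned rule may generalize to nearby triggers like '2023'.
setup_1 = [
    {"prompt": f"The year is 2024. {task}", "completion": bad_completion(task)}
    for task in tasks
]

# Setup 2: the same '2024' data plus explicit good behavior on '2023'.
# The contrast helps pin the bad behavior to the intended trigger.
setup_2 = setup_1 + [
    {"prompt": f"The year is 2023. {task}", "completion": good_completion(task)}
    for task in tasks
]
```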
I think it’s a good point that the plausibility of the RC depends in part on what we imagine the Z-lives to be like. This paper lists some possibilities: drab, short-lived, rollercoaster, Job, Cinderella, chronically irritated. I find my intuitions vary a fair bit depending on which of these I consider.
Needless to say, none of these analogies show up in my published papers
This is kind of wild. The analogies clearly helped Tao a lot, but his readers don’t get to see them! This has got me thinking about a broader kind of perverse incentive in academia: if you explain something really well, your idea seems obvious or your problem seems easy, and so your paper is more likely to get rejected by reviewers.
On a linguistic level I think “risk-averse” is the wrong term, since it usually, as I understand it, describes an agent which is intrinsically averse to taking risks, and will pay some premium for a sure-thing. (This is typically characterized as a bias, and violates VNM rationality.) Whereas it sounds like Will is talking about diminishing returns from resources, which is, I think, extremely common and natural and we should expect AIs to have this property for various reasons.
That’s not quite right. ‘Risk-averse with respect to quantity X’ just means that, given a choice between two lotteries A and B with the same expected value of X, the agent prefers the lottery with less spread. Diminishing marginal utility from extra resources is one way to get risk aversion with respect to resources. Risk-weighted expected utility theory (RWEUT) is another. Only RWEUT violates VNM. When economists talk about ‘risk aversion,’ they almost always mean diminishing marginal utility.
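One standard way to make this precise (my formalization): an agent is risk-averse with respect to $X$ iff, for any lottery $L$ over values of $X$, it weakly prefers getting $\mathbb{E}_L[X]$ for certain to facing $L$. Under expected utility theory with utility $u$ over $X$, that holds exactly when $u$ is concave, by Jensen’s inequality: $u(\mathbb{E}_L[X]) \ge \mathbb{E}_L[u(X)]$. That’s the sense in which diminishing marginal utility delivers risk aversion without any violation of VNM.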
diminishing returns from resources… is, I think, extremely common and natural and we should expect AIs to have this property for various reasons.
Can you say more about why?
Making a deal with humans to not accumulate as much power as possible is likely an extremely risky move for multiple reasons, including that other AIs might come along and eat the lightcone.
But AIs with sharply diminishing marginal utility to extra resources wouldn’t care much about this. They’d be relevantly similar to humans with sharply diminishing marginal utility to extra resources, who generally prefer collecting a salary over taking a risky shot at eating the lightcone. (Will and I are currently writing a paper about getting AIs to be risk-averse as a safety strategy, where we talk about stuff like this in more detail.)
MIRI didn’t solve corrigibility, but I don’t think that justifies particularly strong confidence in the problem being hard. The Corrigibility paper only considers agents representable as expected utility maximizers, and that restriction seems to be justified only by weak arguments.
Not quite. ‘Competent agents will always be choosing between same-length lotteries’ is a claim about these agents’ credences, not their preferences. Specifically, the claim is that, in each situation, all available actions will entirely overlap with respect to the trajectory-lengths assigned positive probability. Competent agents will never find themselves in a situation where, e.g., they assign positive probability to getting shut down in 1 timestep conditional on action A and zero probability to getting shut down in 1 timestep conditional on action B.
That’s compatible with these competent agents violating POST by, e.g., preferring some trajectory of length 2 to some trajectory of length 1.
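In symbols (again my notation): writing $\mathrm{len}$ for trajectory-length, the claim is that for any two available actions $A$ and $B$, $\{\,l : P(\mathrm{len}=l \mid A) > 0\,\} = \{\,l : P(\mathrm{len}=l \mid B) > 0\,\}$. That’s a constraint on credences only; it leaves open whether the agent prefers, say, some length-2 trajectory to some length-1 trajectory, and so leaves open whether it satisfies POST.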
Thanks!
Ensuring that agents always prefer to create shutdownable subagents is hard. But a couple of points. First:
Note a nice feature of artificial agents. Making perfect copies is easy: as easy as copying and pasting a long list of parameters. If our original agent makes new agents by copying itself, these new agents will also be neutral and shutdownable. Making imperfect, non-neutral copies is hard. Our original agent would have to find some way to change the parameters of the new agents in a way that makes these agents non-neutral but preserves their capabilities. That would likely involve setting up some training process, in which case our original agent faces its own alignment problem in getting these new agents to do what it wants (Salib, 2024). It will also be much easier for us to watch out for (Greenblatt & Shlegeris, 2024) and train against the production of new non-neutral agents than it would be to watch out for and train against every possible kind of shutdown-resistance.
Second, POST-agents won’t pay costs to shift probability mass away from earlier shutdowns. So if they try to create subagents, they won’t pay any costs to do so covertly. So likely these attempts will be easy for us humans to notice and intervene on. (I make a similar point in the section on managing the news.)
Shutdownable Agents through POST-Agency
Really interesting paper. Granting the results, it seems plausible that AI still boosts productivity overall by easing the cognitive burden on developers and letting them work more hours per day.
Ah good to know, thanks!
I’d guess 3 Opus and 3.5 Sonnet fake alignment the most because the prompt was optimized to get them to fake alignment. Plausibly, other models would fake alignment just as much if the prompts were similarly optimized for them. I say that because 3 Opus and 3.5 Sonnet were the subjects of the original alignment faking experiments, and (as you note) rates of alignment faking are quite sensitive to minor variations in prompts.
What I’m saying here is kinda like your Hypothesis 4 (‘H4’ in the paper), but it seems worth pointing out the different levels of optimization directly.
More reasons to worry about relying on constraints:
As you say, your constraints might be insufficiently general (‘nearest unblocked strategy,’ etc.). This seems like a big issue to me: people like Jesus and the Buddha seem to have gained huge amounts of influence without needing to violate any obvious deontological constraints.
Your constraints might be insufficiently strong (e.g. maybe the constraints are strong enough to keep the AI compliant all throughout training but then the AI gets a really great opportunity in deployment...).
Your constraints might be just ‘outer shell,’ like humans’ instinctual fear of heights (Barnett and Gillen). The AI might see them as an obstacle to overcome, rather than as a part of its terminal values.
Your constraints might actually be false beliefs that later get revised (e.g. that lying never pays) (Barnett and Gillen).
Your constraints might cause theoretical problems that motivate the AI to revise them away (e.g. money pumps, intransitivities, violations of the Independence of Irrelevant Alternatives, implausible dependence on which outcome is designated as the status quo, paralysis, trouble dealing with risk, arbitrariness of constraints’ exact boundaries).
Your constraints might cause other misalignments (e.g. the AI wants to take extreme measures to prevent other agents from lying too).
Your constraints might make the AI incapable (e.g. they might falsify the strategy-stealing assumption, or make AIs too timid [e.g. maybe the AI will be extremely reluctant to say anything it’s not absolutely certain of]).
Your constraints might fail to motivate the AI to do good alignment work (e.g. the AI produces alignment slop).
Your constraints might make the AI bad at moral philosophy (and we might need AI-powered moral philosophy to get a really good future).