Alex Turner, Oregon State University PhD student working on AI alignment.
Great observation. Similarly, a hypothesis called “Maximum Causal Entropy” once claimed that physical systems involving intelligent actors tended towards states where the future could be specialized towards many different final states, and that maybe this was even part of what intelligence was. However, people objected: (monogamous) individuals don’t perpetually maximize their potential partners—they actually pick a partner, eventually.
My position on the issue is: most agents steer towards states which afford them greater power, and they sometimes give up that power to achieve their specialized goals. The point, however, is that they end up in the high-power states at some point in time along their optimal trajectory. I imagine that this is sufficient for the catastrophic power-stealing incentives: the AI only has to disempower us once for things to go irreversibly wrong.
it seems like a response of the form “we have support for IC, not just in random minds, but also for random reward functions” has not responded to the critique and should not be expected to be convincing to that person.
I agree that the paper should not be viewed as anything but slight Bayesian evidence for the difficulty of real objective distributions. IIRC I was trying to reply to the point of “but how do we know IC even exists?” with “well, now we can say formal things about it and show that it exists generically, but (among other limitations) we don’t (formally) know how hard it is to avoid if you try”.
I think I agree with most of what you’re arguing.
Right, it’s for randomly distributed rewards. But if I show a property holds for reward functions generically, then it isn’t necessarily enough to say “we’re going to try to provide goals without that property”. Can we provide reward functions without that property?
Every specific attempt so far has been seemingly unsuccessful (unless you want the AI to choose a policy at random or shut down immediately). The hope might be that future goals/capability research will help, but I’m not personally convinced that researchers will receive good Bayesian evidence via their subhuman-AI experimental results.
I agree it’s relevant that we will try to build helpful agents, and might naturally get better at that. I don’t know that it makes me feel much better about future objectives being outer aligned.
ETA: also, I was referring to the point you made when I said
“the results don’t prove how hard it is to tweak the reward function distribution to avoid instrumental convergence”
I’m very glad you enjoyed it!
I’ve never read the “Towards a new Impact Measure” post, but I assume doing so is redundant now since this sequence is the ‘updated’ version.
I’d say so, yes.
I realize that impact measures always lead to a tradeoff between safety and performance competitiveness.
For optimal policies, yes. In practice, not always—in SafeLife, AUP often had ~50% improved performance on the original task, compared to just naive reward maximization with the same algorithm!
it seems to penalize reasonable long-term thinking more than the formulas where R_aux ≠ R.
Yeah. I’m also pretty sympathetic to arguments by Rohin and others that the R_aux = R variant isn’t quite right in general; maybe there’s a better way to formalize “do the thing without gaining power to do it” wrt the agent’s own goal.
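For concreteness, here is a minimal sketch of the flavor of penalty being discussed—my own toy formalization, not the sequence’s exact definition, and every name and default value is hypothetical. The R_aux = R variant corresponds to using the agent’s own goal as the sole auxiliary reward function.

```python
# Rough sketch of an AUP-style penalized reward (hypothetical toy formalization,
# not the sequence's exact equations).
def aup_reward(primary_reward, q_aux, state, action, noop, lam=0.01):
    """q_aux: Q-value tables, one per auxiliary reward function.
    Passing only the Q-values for the primary goal itself gives the R_aux = R variant."""
    # Penalize shifts in attainable utility relative to doing nothing.
    penalty = sum(abs(q[state][action] - q[state][noop]) for q in q_aux) / len(q_aux)
    return primary_reward - lam * penalty  # lam trades off task performance vs. impact
```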
whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model.
I think this is plausible, yep. This is why I think it’s somewhat more likely than not there’s no clean way to solve this; however, I haven’t even thought very hard about how to solve the problem yet.
More generally, if you don’t consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?
Depends on how that shows up in the non-embedded formalization, if at all. If it doesn’t show up, then the optimal policy won’t be able to predict any benefit and won’t do it. If it does… I don’t know. It might. I’d need to think about it more, because I feel confused about how exactly that would work—what its model of itself is, exactly, and so on.
Maybe. What I was arguing was: just because all of the partial derivatives are 0 at a point, doesn’t mean it isn’t a saddle point. You have to check all of the directional derivatives; in two dimensions, there are uncountably infinitely many.
Thus, I can prove to you that we are extremely unlikely to ever encounter a valley in real life:
A valley must have a lowest point A.
For A to be a local minimum, all of its directional derivatives must be 0:
Direction N (north), AND
Direction NE (north-east), AND
Direction NNE, AND
Direction NNNE, AND …
This doesn’t work because the directional derivatives aren’t probabilistically independent in real life; you have to condition on the underlying geological processes, instead of supposing you’re randomly drawing a topographic function from ℝ² to ℝ.
For the corrigibility argument to go through, I claim we need to consider more information about corrigibility in particular.
If S1 measures the corrigibility of S2 and does gradient ascent on corrigibility, then the system as a whole has a broad basin of attraction for corrigibility, for sure. But we can’t measure corrigibility as far as I know, so the corrigibility-basin-of-attraction is not a maximum or minimum of anything relevant here. So this isn’t about calculus, as far as I understand.
I’m not saying anything about an explicit representation of corrigibility. I’m saying the space of likely updates for an intent-corrigible system might form a “basin” with respect to our intuitive notion of corrigibility.
I’m also not convinced that the space of changes is low-dimensional. Imagine every possible insight an AGI could have in its operating lifetime. Each of these is a different algorithm change, right?
I said relatively low-dimensional! I agree this is high-dimensional; it is still low-dimensional relative to all the false insights and thoughts the AI could have. This doesn’t necessarily mitigate your argument, but it seemed like an important refinement—we aren’t considering corrigibility along all dimensions—just those along which updates are likely to take place.
“value drift” feels unusually natural from my perspective
I agree value drift might happen, but I’m somewhat comforted if the intent-corrigible AI is superintelligent and trying to prevent value drift as best it can, as an instrumental subgoal.
With each AND, the claim gets stronger and more unlikely, such that by the millionth proposition, it starts to feel awfully unlikely that corrigibility is really a broad basin of attraction after all! (Unless this intuitive argument is misleading, of course.)
I think the argument might be misleading in that local stability isn’t that rare in practice, because we aren’t drawing local stability independently across all possible directional derivatives around the proposed local minimum.
Gradient updates or self-modification will probably fall into a few (relatively) low-dimensional subspaces (because most possible updates are bad, which is part of why learning is hard). A basin of corrigibility then just means that, for already-intent-corrigible agents, the space of likely gradient updates has local stability wrt corrigibility.
Separately, I think the informal reasoning goes: you_current probably wouldn’t take a pill that makes you_future slightly more willing to murder people. You_current will be particularly wary if you_future will be presented with even more pill-ingestion opportunities (a.k.a. algorithm modifications); you_future will be even more willing to take more pills, as you_future will be more okay with the prospect of wanting to murder people. So, even offered a large immediate benefit, you_current should not take the pill.
I think this argument is sound, for a wide range of goal-directed agents which can properly reason about their embedded agency. So, for your intuitive argument to survive this reductio ad absurdum, what is the disanalogy with corrigibility in this situation?
Perhaps the AI might not reason properly about embedded agency and accidentally jump out of the basin. Or, perhaps the basin is small and the AI won’t land in it—corrigibility won’t be so important that it doesn’t get traded away for other benefits.
I really like this post. Before, I just knew that sometimes I “didn’t feel like studying”, and that was that. Silly, but that’s the nature of a thoughtless mistake. Now, I have a specific concept and taxonomy for these failure modes, and you suggested good ways of combating them. Thanks for writing this!
I mean, we already know about epilepsy. I would be surprised if there did not exist some way to disable a given person’s brain, just by having them look at you.
If you measure death-badness from behind the veil of ignorance, you’d naively prioritize well-liked, famous people with large families.
Usually I strong-upvote when I feel like a post made something click for me, or that it’s very important and deserves more eyeballs. I weak-upvote well-written posts which taught me something new in a non-boring way.
As an author, my model of this is also impoverished. I’m frequently surprised by posts getting more or less attention than I expected.
we already see that; we’re constantly amazed by it, despite the created texts having little meaning
But GPT-3 is only trained to minimize prediction loss, not to maximize response. GPT-N may be able to crowd-please if it’s trained on approval, but I don’t think that’s what’s currently happening.
Would you mind adding linebreaks to the transcript?
Sorry, forgot to reply. I think these are good questions, and I continue to have intuitions that there’s something here, but I want to talk about these points more fully in a later post. Or, think about it more and then explain why I agree with you.
Can you explain why GPT-x would be well-suited to that modality?
This might be the best figure I’ve ever seen in a textbook. Talk about making a point!
I think that the criticism sees it the second way and so sees the arguments as not establishing what they are supposed to establish, and I see it the first way—there might be a further fact that says why OT and IC don’t apply to AGI like they theoretically should, but the burden is on you to prove it, rather than saying that we need evidence that OT and IC will apply to AGI.
I agree with that burden of proof. However, we do have evidence that IC will apply, if you think we might get AGI through RL.
I think that hypothesized AI catastrophe is usually due to power-seeking behavior and instrumental drives. I proved that optimal policies are generally power-seeking in MDPs. This is a measure-based argument, and it is formally correct under broad classes of situations, like “optimal farsighted agents tend to preserve their access to terminal states” (Optimal Farsighted Agents Tend to Seek Power, §6.2 Theorem 19) and “optimal agents generally choose paths through the future that afford strictly more options” (Generalizing the Power-Seeking Theorems, Theorem 2).
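For reference, here is a rough paraphrase of the notion of power those results use (my own rendering, up to normalization constants; see the papers for the exact definition): a state’s POWER is the average optimal value attainable from it under the reward-function distribution D,

POWER_D(s) ∝ E_{R ∼ D}[ V*_R(s) ],

so “seeking power” means steering toward states from which many different reward functions can be optimized well.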
The theorems aren’t conclusive evidence:
maybe we don’t get AGI through RL
learned policies are not going to be optimal
the results don’t prove how hard it is to tweak the reward function distribution to avoid instrumental convergence (perhaps a simple approval penalty suffices! IMO: doubtful, but technically possible)
perhaps the agents inherit different mesa objectives during training
The optimality theorems + mesa optimization suggest that not only might alignment be hard because of Complexity of Value, it might also be hard for agents with very simple goals! Most final goals involve instrumental goals; agents trained through ML may stumble upon mesa optimizers, which are generalizing over these instrumental goals; the mesa optimizers are unaligned and seek power, even though the outer alignment objective was dirt-easy to specify.
But the theorems are evidence that RL leads to catastrophe at optimum, at least. We’re not just talking about “the space of all possible minds and desires” anymore.
In the linked slides, the following point is made in slide 43:
We know there are many possible AI systems (including “powerful” ones) that are not inclined toward omnicide
Any possible (at least deterministic) policy is uniquely optimal with regard to some utility function. And many possible policies do not involve omnicide.
On its own, this point is weak; reading part of his 80K talk, I do not think it is a key part of his argument. Nonetheless, here’s why I think it’s weak:
“All states have self-loops, left hidden to reduce clutter.

In AI: A Modern Approach (3e), the agent starts at 1 and receives reward for reaching 3. The optimal policy for this reward function avoids 2, and one might suspect that avoiding 2 is instrumentally convergent. However, a skeptic might provide a reward function for which navigating to 2 is optimal, and then argue that “instrumental convergence” is subjective and that there is no reasonable basis for concluding that 2 is generally avoided.

We can do better… for any way of independently and identically distributing reward over states, 10/11 of reward functions have farsighted optimal policies which avoid 2. If we complicate the MDP with additional terminal states, this number further approaches 1.

If we suppose that the agent will be forced into 2 unless it takes preventative action, then preventative policies are optimal for 10/11 of farsighted agents – no matter how complex the preventative action. Taking 2 to represent shutdown, we see that avoiding shutdown is instrumentally convergent in any MDP representing a real-world task and containing a shutdown state. We argue that this is a special case of a more general phenomenon: optimal farsighted agents tend to seek power.”

~ Optimal Farsighted Agents Tend to Seek Power
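To illustrate the measure-based style of claim being quoted, here is a small Monte Carlo sketch in Python. The MDP below is a hypothetical toy of my own (so it will not reproduce the paper’s 10/11 figure, and every name in it is made up): sample reward functions IID over states and check how often the farsighted optimal policy avoids a designated “shutdown” state.

```python
import numpy as np

# Hypothetical toy MDP (not the paper's example): state 0 is the start,
# states 1-4 are absorbing, and state 1 plays the role of "shutdown".
N_STATES = 5
SUCCESSORS = {0: [1, 2, 3, 4], 1: [1], 2: [2], 3: [3], 4: [4]}
GAMMA = 0.99  # near-farsighted discount

def optimal_first_move(reward):
    """Value-iterate, then return the start state's optimal successor."""
    v = np.zeros(N_STATES)
    for _ in range(500):
        v = np.array([max(reward[s2] + GAMMA * v[s2] for s2 in SUCCESSORS[s])
                      for s in range(N_STATES)])
    return max(SUCCESSORS[0], key=lambda s2: reward[s2] + GAMMA * v[s2])

rng = np.random.default_rng(0)
trials = 2_000
avoided = sum(optimal_first_move(rng.uniform(size=N_STATES)) != 1 for _ in range(trials))
print(f"Fraction of sampled reward functions avoiding shutdown: {avoided / trials:.3f}")
# Expect roughly 0.75 here: the best of four IID-uniform rewards is equally likely
# to be any one of the four absorbing states.
```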