Computing scientist and Systems architect. Currently doing self-funded AGI safety research.
Koen.Holtman(Koen Holtman)
My comment was primarily judging your abstract and why it made me feel weird/hesitant to read the paper. The abstract is short, but it is important to optimize so that your hard work gets the proper attention!
OK, that clarifies your stance. You feeling weird definitely created a weird vibe in the narrative structure of your comment, a vibe that I picked up on.
(I had about half an hour at the time; I read about 6 pages of your paper to make sure I wasn’t totally off-base, and then spent the rest of the time composing a reply.)
You writing it quickly in half an hour also explains a lot about how it reads.
it’s returning to my initial reactions as I read the abstract, which is that this paper is about intuitive-corrigibility.
I guess we have established by now that the paper is not about your version of intuitive-corrigibility.
For my analysis of intuitive-corrigibility, see the contents of the post above. My analysis is that intuitions on corrigibility are highly diverse, and have gotten even more diverse and divergent over time.
You interpret the abstract as follows:
You aren’t just saying “I’ll prove that this AI design leads to such-and-such formal property”, but (lightly rephrasing the above): “This paper shows how to construct a safety layer that [significantly increases the probability that] arbitrarily advanced utility maximizing agents [will not] resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started.
Yes that is what I am saying in the abstract. Your light rephrasing where you add [significantly increases the probability that] indeed expresses the message I intended to convey.
[I] prove that the corrigibility layer works as intended in a large set of non-hostile universes.”
The phrasing ‘works as intended’ in the abstract is supposed to indicate clearly that the layer is designed to produce specific such-and-such formal corrigibility properties only, not some broad idea of ‘intuitive corrigibility’.
So I am guessing you did not pick up on that when reading the abstract.
OK, moving away from a discussion about abstracts, initial impressions, feelings and their causes, moving towards a discussion of more technical stuff:
But if the ‘off-switch’ is only a binary sensory modality (there’s a channel that says ‘0’ or ‘1’ at each time step), then how do you have AIXI pick out ‘the set of worlds in which humans are pressing the button’ versus ‘the set of worlds in which a rock fell on the button’?
In the paper I don’t try to make the agent’s world model distinguish between ‘humans are pressing the button’ versus ‘the set of worlds in which a rock fell on the button’. The ‘works as intended’ is that any button press for whatever reason is supposed to stop the agent.
So this is a design tradeoff I am making. Specifically, this is driven by a lack of trust that any ML-based ‘human versus rock’ discrimination system will work reliably enough in every possible future emergency. The design tradeoff is to accept the risk that a stray rock will stop the agent accidentally, potentially causing some level of inconvenience.
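To illustrate only the input-signal semantics described above, here is a minimal sketch in Python. It is not the safety layer from the paper, which works through the agent's reward function; the names `run_until_stopped`, `button_signal`, and `act` are hypothetical. The point it shows is that the loop reacts to the binary channel alone and never tries to infer who or what pressed the button.

```python
from typing import Callable, Iterable, List


def run_until_stopped(button_signal: Iterable[int],
                      act: Callable[[int], str]) -> List[str]:
    """Run one action per time step until the binary button channel reads 1.

    The cause of the press (human, rock, anything else) is deliberately
    not modeled: any 1 on the channel stops the agent.
    """
    log: List[str] = []
    for t, pressed in enumerate(button_signal):
        if pressed == 1:
            log.append("stop")
            break
        log.append(act(t))
    return log


# Hypothetical usage: the button channel reads 1 at time step 2.
history = run_until_stopped([0, 0, 1, 0], lambda t: f"work{t}")
```

The cost of this choice is the one named above: a stray rock stops the agent too.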
Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.
This is not going to happen for AGI, since we might not survive testing on-distribution,
You are saying we might not survive AGI. This kind of observation is almost a tautology, and therefore not very interesting.
What interests me is identifying techniques that can increase the chance of survival. Drive down the risk. Admit that zero risk is impossible and move on. There are plenty of risk-reduction techniques out there if you look.
and how would we ensure that the environment “stays on-distribution”? Is that like, pausing the world forever?
Here is a paperclip optimizer example of ‘ensuring that the agent environment never goes off-distribution’. Your design goals are to 1) keep your paperclip optimizing AGI agent inside your paperclip factory, and also 2) to make sure that crazy people and/or radically new technologies do not manage to get into the factory.
In other words, keep the environment inside the factory on-distribution, and keep the agent in the factory.
Yes, you will never perfectly achieve these two goals with 100% certainty. Your goal is to drive down the risk of failure on both of them.
For goal 1), many techniques come to mind, and you probably want to use them all: 1a) physical barriers that the agent cannot breach with the resources it has at hand, 1b) active agent oversight, where oversight has a kill switch and is not afraid to use it, 1c) agent design measures (penalty terms in the reward function, use of counterfactual planning, certain training regimes, etc.) that robustly lower the chance of the agent ever wanting to get out of the factory, and 1d) automatic safety interlocks.
I spent some time in my professional life doing safety engineering for physical systems and cyber-physical systems, and this is how safety gets done there. Progress is often expensive, often annoying to certain stakeholders, and not always philosophically very satisfying. But you cannot have everything.
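As a rough illustration of why one would stack measures 1a) to 1d), here is a small Python sketch of residual risk under layered safeguards. It assumes, loudly, that the layers fail independently, which real safety engineering cannot take for granted because of common-cause failures. The function name and all the probabilities are hypothetical, chosen only to show the arithmetic.

```python
import math
from typing import Sequence


def residual_risk(layer_failure_probs: Sequence[float]) -> float:
    """Probability that every layer fails, ASSUMING independent failures.

    With correlated (common-cause) failures the true risk can be much
    higher than this product, so treat the result as an optimistic bound.
    """
    return math.prod(layer_failure_probs)


# Hypothetical per-layer failure probabilities, for illustration only.
barriers_and_oversight = residual_risk([0.1, 0.1])
with_agent_design = residual_risk([0.1, 0.1, 0.5])
```

Under the independence assumption, adding a layer never raises the product, but the product never reaches zero either, which matches the point that the goal is to drive risk down, not to eliminate it.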
Could you change the definition of to , and give the agent the ability to self-modify arbitrarily? The idea is that it would edit itself into its original form in order to make sure is large and small after the button press.
In general if you would forcefully change the agent’s reward function into some , it will self-preserve from that moment on and try to maintain this , so it won’t self-edit its back into the original form.
There are exceptions to this general rule, for special versions of and special versions of agent environments (see section 7.2), where you can get the agent to self-edit, but on first glance, your example above does not seem to be one.
If you remove the bits from the agent definition then you can get an agent that self-edits a lot, but without changing its fundamental goals. The proofs of ‘without changing its fundamental goals’ will get even longer and less readable than the current proofs in the paper, so that is why I did the privileging.
Thanks for expanding on your question about the use of . Unfortunately, I still have a hard time understanding your question, so I’ll say a few things and hope that will clarify.
If you expand the term defined in (5) recursively, you get a tree-like structure. Each node in the tree has as many sub-nodes as there are elements in the set . The tree is in fact a tree of branching world lines. Hope this helps you visualize what is going on.
I could shuffle around some symbols and terms in the definitions (4) and (5) and still create a model of exactly the same agent that will behave in exactly the same way. So the exact way in which these two equations are written down and recurse on each other is somewhat contingent. My equations stay close to what is used when you model an agent or ‘rational’ decision making process with a Bellman equation. If your default mental model of an agent is a set of Q-learning equations, the model I develop will look strange, maybe even unnatural at first sight.
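To make the tree picture concrete, here is a minimal Python sketch of a Bellman-style recursion. It is not a transcription of definitions (4) and (5): the two-state world, the uniform transition probabilities, the reward, the fixed policy, and the depth cutoff are all hypothetical choices made for illustration. Each recursive call branches once per world state, so the expansion is a tree of world lines.

```python
from typing import Dict, List

# Hypothetical toy world: two world states and a single action.
STATES: List[str] = ["a", "b"]


def transition(state: str, action: str) -> Dict[str, float]:
    """Probability of each next world state (uniform, for illustration)."""
    return {nxt: 1.0 / len(STATES) for nxt in STATES}


def reward(state: str, nxt: str) -> float:
    """Hypothetical reward: 1 for arriving in state 'b', else 0."""
    return 1.0 if nxt == "b" else 0.0


def policy(state: str) -> str:
    """Fixed policy that always picks the only available action."""
    return "noop"


def value(state: str, depth: int, gamma: float = 0.5) -> float:
    """Expected discounted reward, expanded recursively to a finite depth."""
    if depth == 0:
        return 0.0
    action = policy(state)
    return sum(
        p * (reward(state, nxt) + gamma * value(nxt, depth - 1, gamma))
        for nxt, p in transition(state, action).items()
    )


def count_world_lines(depth: int) -> int:
    """Number of leaves of the expansion: one sub-node per world state."""
    if depth == 0:
        return 1
    return sum(count_world_lines(depth - 1) for _ in STATES)
```

With two world states, `count_world_lines(3)` visits 2 × 2 × 2 branches. The same agent could be written with the terms arranged differently, which is the sense in which the exact form of the recursion is contingent.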
or your theory is going to end up with the wrong prior.
OK, maybe this is the main point that inspired your question. The agency/world models developed in the paper are not a ‘theory’, in the sense that theories have predictive power. A mathematical model used as a theory, like $F = ma$, predicts how objects will accelerate when subjected to a force.
The agent model in the paper does not really ‘predict’ how agents will behave. The model is compatible with almost every possible agent construction and agent behavior, if we are allowed to pick the agent’s reward function freely after observing or reverse-engineering the agent to be modeled.
On purpose, the agent model is constructed with so many ‘free parameters’ that it has no real predictive power. What you get here is an agent model that can describe almost every possible agent and world in which it could operate.
In mathematics, the technique I am using in the paper is sometimes called ‘without loss of generality’. I am developing very general proofs by introducing constraining assumptions ‘without loss of generality’.
Another thing to note is that the model of the agent in the paper, the model of an agent with the corrigibility-creating safety layer, acts as a specification of how to add this layer to any generic agent design.
This dual possible use, theory or specification, of models can be tricky if you are not used to it. In observation-based science, mathematical models are usually theories only. In engineering (and in theoretical CS, the kind where you prove programs correct, which tends to be a niche part of CS nowadays) models often act as specifications. In statistics, the idea that statistical models act as theories tends to be deemphasized. The paper uses models in the way they are used in theoretical CS.
You may want to take a look at this post in the sequence, which copies text from a 2021 paper where I tried to make the theory/specification use of models more accessible. If you read that post, it might be easier to fully track what is happening, in a mathematical sense, in my 2019 paper.
OK, so we now have people who read this abstract and feel it makes objectionable ‘very large claims’ or ‘big claims’, where these people feel the need to express their objections even before reading the full paper itself. Something vaguely interesting is going on.
I guess I have to speculate further about the root cause of why you are reading the abstract in a ‘big claim’ way, whereas I do not see ‘big claim’ when I read the abstract.
Utopian priors?
Specifically, you are both not objecting to the actual contents of the paper, you are taking time to offer somewhat preemptive criticism based on a strong prior you have about what the contents of that paper will have to be.
Alex, you are even making rhetorical moves to maintain your strong prior in the face of potentially conflicting evidence:
That said, the rest of this comment addresses your paper as if it’s proving claims about intuitive-corrigibility.
Curious. So here is some speculation.
In MIRI’s writing and research agenda, and in some of the writing on this forum, there seems to be a utopian expectation that huge breakthroughs in mathematical modeling could be made, mixed up with a wish that they must be made. I am talking about breakthroughs that allow us to use mathematics to construct AGI agents that will provably be

perfectly aligned

with zero residual safety risk

under all possible circumstances.
Suppose you have these utopian expectations about what AGI safety mathematics can do (or desperately must do, or else we are all dead soon). If you have these expectations of perfection, you can only be disappointed when you read actually existing mathematical papers with models and correctness proofs that depend on well-defined boundary conditions. I am seeing a lot of preemptive expression of disappointment here.
Alex: your somewhat extensive comments above seem to be developing and attacking the strawman expectation that you will be reading a paper that will

resolve all open problems in corrigibility perfectly,

not just corrigibility as the paper happens to define it, but corrigibility as you define it

while also resolving, or at least name-checking, all the open items on MIRI’s research agenda
You express doubts that the paper will do any of this. Your doubts are reasonable:
So I think your paper says ‘an agent is corrigible’ when you mean ‘an agent satisfies a formal property that might correspond to corrigible behavior in certain situations.’
What you think is broadly correct. The surprising thing that needs to be explained here is: why would you even expect to get anything different in a paper with this kind of abstract?
Structure of the paper: pretty conventional
My 2019 paper is a deeply mathematical work, but it proceeds in a fairly standard way for such mathematical work. Here is what happens:

I introduce the term corrigibility by referencing the notion of corrigibility developed in the 2015 MIRI/FHI paper

I define 6 mathematical properties which I call corrigibility desiderata. 5 of them are taken straight from the 2015 MIRI/FHI paper that introduced the term.

I construct an agent and prove that it meets these 6 desiderata under certain well-defined boundary conditions. The abstract mentions an important boundary condition right from the start:
A detailed model for agents which can reason about preserving their utility function is developed, and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes.
The paper devotes a lot of space (it is 35 pages long!) to exploring and illustrating the matter of boundary conditions. This is one of the main themes of the paper. In the end, the proven results are not as utopian as one might conceivably hope for.
What I also do in the paper is that I sometimes use the term ‘corrigible’ as a shorthand for ‘provably meets the 6 defined corrigibility properties’. For example I do that in the title of section 9.8.
You are right that the word ‘corrigible’ is used in the paper in both an informal (or intuitive) sense, and in a more formal sense where it is equated to these 6 properties only. This is a pretty standard thing to do in mathematical writing. It does rely on the assumption that the reader will not confuse the two different uses.
You propose a writing convention where ‘POWER’ always is the formal in-paper definition of power and ‘power’ is the ‘intuitive’ meaning of power, which puts less of a burden on the reader. Frankly I feel that is a bit too much of a departure from what is normal in mathematical writing. (Depends a bit I guess on your intended audience.)
If people want to complain that the formal mathematical properties you named X do not correspond to their own intuitive notion of what the word X really means, then they are going to complain. Does not matter whether you use uppercase or not.
Now, back in 2019 when I wrote the paper, I was working under the assumption that when people in the AGI safety community read the word ‘corrigibility’, they would naturally map this word to the list of mathematical desiderata in the 2015 MIRI/FHI paper titled ‘corrigibility’. So I assumed that my use of the word corrigibility in the paper would not be that confusing or jarring to anybody.
I found out in late 2019 that the meaning of the ‘intuitive’ term corrigibility was much more contingent, and basically all over the place. See the ‘disentangling corrigibility’ post above, where I try to offer a map to this diverse landscape. As I mention in the post above:
Personally, I have stopped trying to reverse linguistic entropy. In my recent technical papers, I have tried to avoid using the word corrigibility as much as possible.
But I am not going to update my 2019 paper to convert some words to uppercase.
On the ‘bigness’ of the mathematical claims
You write:
On p2, you write:
The main contribution of this paper is that it shows, and proves correct, the construction of a corrigibility safety layer that can be applied to utility maximizing AGI agents.
If this were true, I could give you AIXI, a utility function, and an environmental specification, and your method will guarantee it won’t try to get in our way / prevent us from deactivating it, while also ensuring it does something nontrivial to optimize its goals? That is a big claim.
You seem to have trouble believing the ‘if this were true’. The open question here is how strong of a guarantee you are looking for, when you are saying ‘will guarantee’ above.
If you are looking for absolute, rock-solid utopian ‘provable safety’ guarantees, where this method will reduce AGI risk to zero under all circumstances, then I have no such guarantees on offer.
If you are looking for techniques that will deliver weaker guarantees, of the kind where there is a low but nonzero residual risk of corrigibility failure when you wrap these techniques around a well-tested AI or AGI-level ML system, these are the kind of techniques that I have to offer.
If this were true it would be an absolute breakthrough
Again, you seem to be looking for the type of absolute breakthrough that delivers mathematically perfect safety always, even though we have fallible humans, potentially hostile universes that might contain unstoppable processes that will damage the agent, and agents that have to learn and act based on partial observation only. Sorry, I can’t deliver on that kind of utopian programme of provable safety. Nobody can.
Still, I feel that the mathematical results in the paper are pretty big. They clarify and resolve several issues identified in the 2015 MIRI/FHI paper. They resolve some of these by saying ‘you can never perfectly have this thing unless boundary condition X is met’, but that is significant progress too.
On the topic of what happens to the proven results when I replace the agent that I make the proofs for with AIXI, see section 5.4 under learning agents. AIXI can make certain prediction mistakes that the agent I am making the proofs for cannot make by definition. These mistakes can have the result of lowering the effectiveness of the safety layer. I explore the topic in some more detail in later papers.
Stability under recursive self-improvement
You say:
I think you might be discussing corrigibility in the very narrow sense of “given a known environment and an agent with a known ontology, such that we can pick out a ‘shutdown button pressed’ event in the agent’s world model, the agent will be indifferent to whether this button is pressed or not.”
We don’t know how to robustly pick out things in the agent’s world model, and I don’t see that acknowledged in what I’ve read thus far.
First off, your claim that ‘We don’t know how to robustly pick out things in the agent’s world model’ is deeply misleading.
We know very well ‘how to do this’ for many types of agent world models. Robustly picking out simple binary input signals like stop buttons is routinely achieved in many (non-AGI) world models as used by today’s actually existing AI agents, both hardcoded and learned world models, and there is no big mystery about how this is achieved.
Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.
You seem to be looking for ‘not very narrow sense’ corrigibility solutions where we can get symbol grounding robustness even in scenarios where the AGI does recursive self-improvement, where it rebuilds its entire reasoning system from the ground up, and where it then possibly undergoes an ontological crisis. The basic solution I have to offer for this scenario is very simple. Barring massive breakthroughs, don’t build a system like that if you want to be safe.
The problem of formalizing humility
In another set of remarks you make, you refer to the web page Hard problem of corrigibility, where Eliezer speculates that to solve the problem of corrigibility, what we really want to formalize is not indifference but
something analogous to humility or philosophical uncertainty.
You say about this that
I don’t even know how to begin formalizing that property, and so a priori I’d be quite surprised if that were done successfully all in one paper.
I fully share your stance that I would not even know how to begin with ‘humility or philosophical uncertainty’ and end successfully.
In the paper I ignore this speculation about humility-based solution directions, and leverage and formalize the concept of ‘indifference’ instead. Sorry to disappoint if you were expecting major progress on the humility agenda advanced by Eliezer.
Superintelligence
Another issue is that you describe a “superintelligent” AGI simulator
Yeah, in the paper I explicitly defined the adjective superintelligent in a somewhat provocative way: I defined ‘superintelligent’ to mean ‘maximally adapted to solving the problem of utility maximization in its universe’.
I know this is somewhat jarring to many people, but in this case it was fully intended to be jarring. It is supposed to make you stop and think...
(This grew into a very long response, and I do not feel I have necessarily addressed or resolved all of your concerns. If you want to move further conversation about the more technical details of my paper or of corrigibility to a video call, I’d be open to that.)

First I’ve seen this paper, haven’t had a chance to look at it yet, would be very surprised if it fulfilled the claims made in the abstract. Those are very large claims and you should not take them at face value without a lot of careful looking.
I wrote that paper and abstract back in 2019. Just reread the abstract.
I am somewhat puzzled how you can read the abstract and feel that it makes ‘very large claims’ that would be ‘very surprising’ when fulfilled. I don’t feel that the claims are that large or hard to believe.
Feel free to tell me more when you have read the paper. My more recent papers make somewhat similar claims about corrigibility results, but they use more accessible math.
I like your “Corrigibility with Utility Preservation” paper.
Thanks!
I don’t get why you prefer not using the usual conditional probability notation.
Well, I wrote in the paper (section 5) that I used instead of the usual conditional probability notation because it ‘fits better with the mathematical logic style used in the definitions and proofs below.’ That is, the proofs use the mathematics of second-order logic, not probability theory.
However this was not my only reason for this preference. The other reason was that I had an intuitive suspicion back in 2019 that the use of conditional probability notation, in the then existing papers and web pages on balancing terms, acted as an impediment to mathematical progress. My suspicion was that it acted as an overly Bayesian framing that made it more difficult to clarify and generalize the mathematics of this technique any further.
In hindsight in 2021, I can be a bit more clear about my 2019 intuition. Armstrong’s original balancing term elements and , where and are low-probability near-future events, can be usefully generalized (and simplified) as the Pearlian and where the terms are interventions (or ‘edits’) on the current world state.
The notation makes it look like the balancing terms might have some deep connection to Bayesian updating or Bayesian philosophy, whereas I feel they do not have any such deep connection.
That being said, in my 2020 paper I present a simplified version of the math in the 2019 paper using the traditional notation again, and without having to introduce .
leads to TurnTrout’s attainable utility preservation.
Yes it is very related: I explore that connection in more detail in section 12 of my 2020 paper. In general I think that counterfactual expected-utility reward function terms are a Swiss army knife with many interesting uses. I feel that as a community, we have not yet gotten to the bottom of their possibilities (and their possible failure modes).
Why not use in the definition of ?
In the definition of (section 5.3 equation 4) I am using a term, so I am not sure if I understand the question.
(I am running out of time now, will get back to the remaining questions in your comment later)
Thanks a lot, all! I just edited the post above to change the language as suggested.
FWIW, Paul’s post on corrigibility here was my primary source for the info that Robert Miles named the technical term. Nice to see the original suggestion as made on Facebook too.
Interesting… On first reading your post, I felt that your methodological approach for dealing with the ‘all is doomed in the worst case’ problem is essentially the same as my approach. But on rereading, I am not so sure anymore. So I’ll try to explore the possible differences in methodological outlook, and will end with a question.
The key to your methodology is that you list possible process steps which one might take when one feels like
all of our current algorithms are doomed in the worst case.
The specific doom-removing process step that I want to focus on is this one:
If so, I may add another assumption about the world that I think makes alignment possible (e.g. the strategy stealing assumption), and throw out any [failure] stories that violate that assumption [...]
My feeling is that the AGI safety/alignment community is way too reluctant to take this process step of ‘add another assumption about the world’ in order to eliminate a worst case failure story.
There seem to be several underlying causes for this reluctance. One of them is that in the field of developing machine learning algorithms, in the narrow sense where machine learning equals function approximation, the default stance is to make no assumptions about the function that has to be approximated. But the main function to be approximated in the case of an ML agent is the function that determines the behavior of the agent environment. So the default methodological stance in ML is that we can introduce no assumptions whatsoever about the agent environment; we can’t, for example, assume that it contains a powerful oversight body that will help to keep the agent aligned. Obviously this stance is not very helpful if you want to make progress on certain alignment problems.
So I’m happy to see a post that encourages people to make explicit assumptions about the agent’s environment. I have definitely used this technique to make progress in my own work.
But.
When I look at your example of ‘the strategy stealing assumption’ as one useful assumption to add, it is very much not the default example that would first come to my mind. So I am wondering if you would even recommend the approach of adding the kind of default assumptions that I tend to add.
To make this more specific, in this post I introduce an agent design with three safety interlocks that are supposed to be helpful to agent oversight. The interlocks are agent design refinements that make it easier for oversight to keep control over the agent. The interlocks contribute to more successful oversight not by making the oversight people+machines smarter (which is your main line of research I believe), but by making the agent less smart in very specific ways.
But at the same time, these interlocks do not remove all possible worst-case failure stories of doom. To quote from the post and the underlying paper:
We can always construct a highly unlikely but theoretically possible failure mode story where the agent ends up being unsafe.
The key here is the ‘highly unlikely’. If we have an algorithm where
We train a generative model that predicts these camera observations given the motor actions.
then I typically add the following assumption to avoid doom:
Assumption: Take a sequence of disastrously unsafe motor actions, actions that modify the agent compute core to remove the safety interlocks. We assume that, while the following misprediction scenario is theoretically possible, the risk that it actually happens is very low. The misprediction scenario is that, for the above motor actions that modify the compute core, the above generative model incorrectly predicts camera observations in which the agent compute core remains perfectly untouched by these actions, and where the observed paperclip production resulting from these actions is better than under any other sequence of actions.
In terms of methodology, I usually describe the above move as one where we seek to drive down the risk of certain failure modes to residual levels. There is a link to empirical work here. To make it more plausible that the above assumption about low risk is valid for a particular ML system and physical realization of an agent and its environment, we can do simulations and real-life experiments with trained generative models.
So my question is: would the above assumption-adding step, about the low risk of mispredictions, be a natural and valid assumption-adding process step for ‘throwing out failure stories’ in your methodology?
Or is the existence of this assumption automatically implied by default in your process?
I don’t think this is a problem. There will be plenty of them, but when they’re wrong they’ll get removed from the posterior.
I have seen you mention a number of times in this comment thread that ‘this is not a problem because eventually the bad/wrong policies will disappear from the top set’. You have not qualified this statement with ‘but we need a very low like to make this work in a safe way’, so I remain somewhat uncertain about what your views are on how low needs to go.
In any case, I’ll now try to convince you that if , your statement that ‘when they’re wrong they’ll get removed from the posterior’ will not always mean what you might want it to mean.
Is the demonstrator policy to get themselves killed?
The interesting thing in developing these counterexamples is that they often show that the provable math in the paper gives you less safety than you would have hoped for.
Say that is the policy of producing paperclips in the manner demonstrated by the human demonstrator. Now, take my construction in the counterexample where and where at time step , we have the likely case that . In the world I constructed for the counterexample, the remaining top policies now perform a synchronized treacherous turn where they kill the demonstrator.
In time step and later, the policies diverge a lot in what actions they will take, so the agent queries the demonstrator, who is now dead. The query will return the action. This eventually removes all ‘wrong’ policies from , where ‘wrong’ means that they do not take the action at all future time steps.
The silver lining is perhaps that at least the agent will eventually stop, perform actions only, after it has killed the demonstrator.
Now, the paper proves that the behavior of the agent policy will approximate that of the true demonstrator policy closer and closer when time progresses. We therefore have to conclude that in the counterexample world, the true demonstrator policy had nothing to do with producing paperclips; this was a wrong guess all along. The right demonstrator policy is one where the demonstrator always intended to get themselves killed.
This would be a somewhat unusual solution to the inner alignment problem.
The math in the paper has you working in a fixed-policy setting where the demonstrator policy is immutable/time-invariant. The snag is that this does not imply that the policy defines a behavioral trajectory that is independent of the internals of the agent construction. If the agent is constructed in a particular way and when it operates in a certain environment, it will force into a self-fulfilling trajectory where it kills the demonstrator.
Side note: if anybody is looking for alternative math that allows one to study and manage the interplay between a mutable time-dependent demonstrator policy and the agent policy, causal models seem to be the way to go. See for example here where this is explored in a reward learning setting.
I agree with your description above about how it all works. But I guess I was not explaining well enough why I got confused and why the edits of inserting the and the bold text above would have stopped me getting confused. So I’ll try again.
I read the sentence fragment
The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,
below equation (3) as an explanatory claim that the value defined in equation (3) defines the probability that the imitator is picking the action itself instead of deferring to the demonstrator, the probability given the history . However, this is not the value being defined by equation (3), instead it defines the probability the imitator is picking the action itself instead of deferring to the demonstrator when the history is and the next action taken is .
The actual probability that the imitator is picking the action itself under , is given by , which is only mentioned in passing in the lines between equations (3) and (4).
So when I was reading the later sections in the paper and I wanted to look back at what the probability was that the imitator would pick the action, my eye landed on equation (3) and the sentence below it. When I read that sentence, it stopped me from looking further to find the equation , which is the equation I was really looking for. Instead my mind autocompleted equation (3) by adding an term to it, which makes for a much more conservative querying policy than the one you defined, and this then got me wondering how you were dealing with learning nondeterministic policies, if at all, etc.
So overall I think you can improve readability by doing some edits to draw attention more strongly to the conditional nature of , and foregrounding the definition of more clearly as a single-line equation.
[edited to delete and replace an earlier question] Question about the paper: under equation (3) on page 4 I am reading:
The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,
This confused me initially to no end, and still confuses me. Should this be:
The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator or picking one of the other actions???
This would seem to be more consistent with the definitions that follow, and it would seem to make more sense overall.
[long comment, my bigger agenda here is to get to a state where discussions on this forum start using much more math and subcase analysis when they talk about analysing and solving inner alignment problems.]
If we set small enough, we can make it arbitrarily like[ly] that never leaves the set of top policies.
OK, I think you are kind of saved here, in the case of my counterexample, because of the special nature of your equation (2) in the paper that computes the set of policies. To make this more specific:
Say I refine the counterexample by adding the prior , so each candidate is considered equally likely from the start. Say that .
The key thing to note in my example is that, before the time-synchronized treacherous turn happens for all policies , all policies are equally compatible with events so far, so the posterior is the same for all of them.
When I use with these all-the-same-posterior policies, then (if I am reading equation (2) correctly) the set of top policies will have size . Ties are broken arbitrarily, so the real is unlikely to be among these top candidates. In that case, the agent will perform the treacherous turn and kill the operator. Correct me if I am wrong here.
However.
Case of : much better than I thought!
However, and this is where things get interesting, if we use , so , then equation (2) will compute , the top set of policies is all policies. The correct policy is therefore still in the top set, and we get no treacherous turn.
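To make the two cases concrete, here is a minimal sketch. It assumes one particular reading of equation (2), not a quotation of it: the top set is the smallest collection of highest-posterior policies whose total posterior mass reaches a threshold. The names `top_set` and `alpha`, the ten candidates, and the policy labels are all hypothetical and only serve the illustration.

```python
from fractions import Fraction
import random

def top_set(posterior, alpha, rng):
    """Assumed reading of equation (2): the smallest set of
    highest-posterior policies whose total mass is >= alpha."""
    names = list(posterior)
    rng.shuffle(names)  # arbitrary tie-breaking among equal posteriors
    # sort() is stable, so tied policies keep their shuffled order
    names.sort(key=lambda name: posterior[name], reverse=True)
    chosen, mass = set(), Fraction(0)
    for name in names:
        if mass >= alpha:
            break
        chosen.add(name)
        mass += posterior[name]
    return chosen

# Ten hypothetical candidates with identical posteriors, as before the turn.
policies = ["true"] + [f"treacherous_{i}" for i in range(9)]
posterior = {name: Fraction(1, len(policies)) for name in policies}
rng = random.Random(0)

half = top_set(posterior, Fraction(1, 2), rng)  # threshold below 1
full = top_set(posterior, Fraction(1), rng)     # threshold equal to 1
print(len(half), "true" in half)
print(len(full), "true" in full)
```

Under this assumed reading, a threshold below 1 keeps only a tie-broken subset, so whether the true policy survives is a matter of luck, while a threshold of 1 keeps every candidate that still has nonzero posterior.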
So equation (2) is doing a lot of heavy lifting in making the approach work! This also seems to be a major difference with the based approach in your earlier paper. Not sure if you made this change specifically to deal with examples like my counterexample above, or if you had a more general reason; the paper does not say.
Pretty interesting! So I would now agree with your statement that your approach provably solves the inner alignment problem, at least for and for certain values of ‘the inner alignment problem’. I can also imagine several setups where would still lead to reasonably fast learning (=reasonably fast drop in frequency of demonstrator requests).
This is what progress looks like. There are certain boundary conditions here that might be unrealistic: needs to be a finite set and the real policy needs to be in there, but at least we have some well-defined boundary conditions we can now look at, where we can try to understand and interpret the math further.
Case of : the challenge of designing a prior
In the case of using or , because you want the system to learn faster, you are instead looking at a setup where, if you want to solve the inner alignment problem, you will have to shape the prior values in such a way that gets a higher prior than all the treacherous policies.
My counterexample above shows that if you are using the 1/(number of bits in the program that computes ) as the prior, you will not get what you need. There will be plenty of functions that have fewer bits in their encoding than the real function used by the demonstrator.
Earlier in this comment section, there is a whole subthread with speculation on the number of bits needed to encode benign vs. treacherous policies, but for me that discussion does not yet arrive at any place where I would get convinced that the problem of assigning higher priors to benign vs. treacherous policies has been made tractable. (Vanessa has identified some additional moving parts however.)
There is of course a tradition in the AI safety community where this is made ‘tractable’ by the device of polling many AI researchers to ask them whether they think that bits(benign policy)<bits(treacherous policy) for future ML systems, and then graphing the results, but this is not what I would call a tractable solution.
What I would call tractable is a solution like the one, for a much simpler case, in section 10.2.4 of my paper Counterfactual Planning in AGI Systems. I show there that random exploration can be used to make sure that bits(agent environment model which includes unwanted self-knowledge about agent compute core internals) bits(agent environment model that lacks this unwanted self-knowledge), no matter what the encoding. Extending this to the bits(benign policy) case would be nice, but I can’t immediately see a route here.
My answer to the above hypothetical bits(benign policy)<bits(treacherous policy) poll is that we cannot expect this to be true for any possible encoding of policies (see counterexample above), but it might be true for some encodings. Figuring out where deep neural net encodings fit on the spectrum would be worthwhile.
Also, my answer to bits(benign policy)<bits(treacherous policy) would depend on whether the benign policy is supposed to be about making paperclips in the same way humans do, or about maximizing human values over the lifetime of the universe in ways that humans will not be able to figure out themselves.
For the paperclip making imitation policy, I am somewhat more optimistic about tractability than in the more general case.
This doesn’t feel like a good summary of what Rohin says in his sequence.
I was not trying to summarize the entire sequence, only summarizing my impressions of some things he said in the first post of the sequence. Those impressions are that Rohin was developing his intuitive notion of goal-directedness in a very different direction than you have been doing, given the examples he provides.
Which would be fine, but it does lead to questions of how much your approach differs. My gut feeling is that the difference in directions might be much larger than can be expressed by the mere adjective ‘behavioral’.
On a more technical note, if your goal is to search for metrics related to “less probability that the AI steals all my money to buy hardware and goons to ensure that it can never be shutdown”, then the metrics that have been most productive in my opinion are, first, ‘indifference’, in the meaning where it is synonymous with ‘not having a control incentive’. Other very relevant metrics are ‘myopia’ or ‘short planning horizons’ (see for example here) and ‘power’ (see my discussion in the post Creating AGI Safety Interlocks).
(My paper counterfactual planning has a definition of ‘indifference’ which I designed to be more accessible than the ‘not having a control incentive’ definition, i.e. more accessible for people not familiar with Pearl’s math.)
None of the above metrics look very much like ‘non-goal-directedness’ to me, with the possible exception of myopia.
OK. Reading the post originally, my impression was that you were trying to model ontological crisis problems that might happen by themselves inside the ML system when it learns or self-improves.
This is a subcase that can be expressed by your model, but after the Q&A in your SSC talk yesterday, my feeling is that your main point of interest and reason for optimism with this work is different. It is in the problem of the agent handling ontological shifts that happen in human models of what their goals and values are.
I might phrase this question as: If the humans start to splinter their idea of what a certain kind of morality-related word they have been using for ages really means, how is the agent supposed to find out about this, and what should it do next to remain aligned?
The ML literature is full of uncertainty metrics that might be used to measure such splits (this paper comes to mind as a memorable lava-based example). It is also full of proposals for mitigation like ‘ask the supervisor’ or ‘slow down’ or ‘avoid going into that part of the state space’.
The general feeling I have, which I think is also the feeling in the ML community, is that such uncertainty metrics are great for suppressing all kinds of failure scenarios. But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation (that the agent will see every unknown unknown coming before it can hurt you), you will be disappointed. So I’d like to ask you: what is your sense of optimism or pessimism in this area?
This post proposes such a behavioral definition of goal-directedness. If it survives the artillery fire of feedback and criticism, it will provide a more formal grounding for goal-directedness,
I guess you are looking for critical comments. I’ll bite.
Technical comment on the above post
So if I understand this correctly, then is a metric of goal-directedness. However, I am somewhat puzzled because only measures directedness to the single goal .
But to get close to the concept of goal-directedness introduced by Rohin, don’t you then need to do an operation over all possible values of ?
More general comments on goal-directedness
Reading the earlier posts in this sequence and several of the linked articles, I see a whole bunch of problems.
I think you are being inspired by The Misspecified Goal Argument. From Rohin’s introductory post on goal-directedness:
The Misspecified Goal Argument for AI Risk: Very intelligent AI systems will be able to make long-term plans in order to achieve their goals, and if their goals are even slightly misspecified then the AI system will become adversarial and work against us.
Rohin then speculates that if we remove the ‘goal’ from the above argument, we can make the AI safer. He then comes up with a metric of ‘goal-directedness’ where an agent can have zero goal-directedness even though he can model it as a system that is maximizing a utility function. Also, in Rohin’s terminology, an agent gets safer if it is less goal-directed.
Rohin then proposes that intuitively, a table-driven agent is not goal-directed. I think you are not going there with your metrics: you are looking at observable behavior, not at agent internals.
Where things completely move off the main sequence is in Rohin’s next step in developing his intuitive notion of goal-directedness:
This suggests a way to characterize these sorts of goal-directed agents: there is some goal such that the agent’s behavior in new circumstances can be predicted by figuring out which behavior best achieves the goal.
So what I am reading here is that if an agent behaves more unpredictably off-distribution, it becomes less goal-directed in Rohin’s intuition. But I can’t really make sense of this anymore, as Rohin also associates less goal-directedness with more safety.
This all starts to look like a linguistic form of Goodharting: the meaning of the term ‘goal-directed’ collapses completely because too much pressure is placed on it for control purposes.
To state my own terminology preference: I am perfectly happy to call any possible AI agent a goal-directed agent. This is because people build AI agents to help them pursue some goals they have, which naturally makes these agents goal-directed. Identifying a subclass of agents which we then call non-goal-directed looks like a pretty strange program to me, which can only cause confusion (and an artillery fire of feedback and criticism).
To bring this back to the post above, this leaves me wondering how the metrics you define above relate to safety, and how far along you are in your program of relating them to safety.

Is your idea that a lower number on a metric implies more safety? This seems to be Rohin’s original idea.

Are these metrics supposed to have any directly obvious correlation to safety, or the particular failure scenario of ‘will become adversarial and work against us’ at all? If so I am not seeing the correlation.

This is because I think that the counterexample given here dissolves if there is an additional path without node from the matchmaking policy to the price paid
I think you are using some mental model where ‘paths with nodes’ vs. ‘paths without nodes’ produces a real-world difference in outcomes. This is the wrong model to use when analysing CIDs. A path in a diagram -->[node]--> can always be replaced by a single arrow --> to produce a model that makes equivalent predictions, and the opposite operation is also possible.
So the number of nodes on a path is better read as a choice about levels of abstraction in the model, not as something that tells us anything about the real world. The comment I just posted with the alternative development of the game model may be useful for you here; it offers a more specific illustration of adding nodes.
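As a minimal sketch of the node-collapsing operation: for a chain X --> Y --> Z, summing out the middle node Y gives a table for a direct arrow X --> Z that makes the same predictions about Z given X. The probability tables and the helper name `collapse` below are made up for illustration and do not come from any of the diagrams discussed here.

```python
from fractions import Fraction as F

# Hypothetical conditional probability tables for a chain X -> Y -> Z.
# p_y_given_x[x][y] is P(Y=y | X=x); p_z_given_y[y][z] is P(Z=z | Y=y).
p_y_given_x = {0: {0: F(3, 4), 1: F(1, 4)}, 1: {0: F(1, 5), 1: F(4, 5)}}
p_z_given_y = {0: {0: F(9, 10), 1: F(1, 10)}, 1: {0: F(1, 2), 1: F(1, 2)}}

def collapse(p_mid_given_x, p_z_given_mid):
    """Sum out the middle node, giving P(Z | X) for a direct arrow X -> Z."""
    out = {}
    for x, mid_dist in p_mid_given_x.items():
        out[x] = {}
        for mid, p_mid in mid_dist.items():
            for z, p_z in p_z_given_mid[mid].items():
                out[x][z] = out[x].get(z, F(0)) + p_mid * p_z
    return out

p_z_given_x = collapse(p_y_given_x, p_z_given_y)
for x, dist in p_z_given_x.items():
    print(x, dist)
```

The collapsed model says nothing about Y itself, which is the sense in which the choice is one of abstraction level: predictions about the remaining nodes are unchanged.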
In this comment (last in my series of planned comments on this post) I’ll discuss the detailed player-to-match-with example developed in the post:
In order to analyse the issues with the setup, let’s choose a more narrowly defined example. There are many algorithms that aim to manipulate players of mobile games in order to get them to buy more expensive in-game items.
I have by now reread this analysis with the example several times. First time I read it, I already felt that it was a strange way to analyse the problem, but it took me a while to figure out exactly why.
Best I can tell right now is that there are two factors:

I can’t figure out if the bad thing that the example tries to prove is that a) agent is trying to maximize purchases, which is unwanted or b) the agent is manipulating user’s item ranking, which is unwanted. (If it is only a), then there is no need to bring in all this discussion about correlation.)

the example refines its initial CID by redrawing it in a strange way
So now I am going to develop the same game example in a style that I find less strange. I also claim that this gets closer to the default style people use when they want to analyse and manage causal incentives.
To start with, this is the original model of the game mechanics: the model of the mechanics in the real world in which the game takes place.
This shows that the agent has an incentive to control predicted purchases upwards, but also to do so by influencing the item rankings that exist in the mind of the player.
If we want to weaken this incentive to influence the item rankings that exist in the mind of the player, we can construct a counterfactual planning world for the agent (see here for an explanation of the planning world terminology I am using):
(Carey et al. often call this planning world a twin model, a model which combines both factual and counterfactual events.) In both my work and in that of Carey et al., the intention is that the above diagram defines the world model in which the agent will plan the purchases-maximizing action, and then this same action is applied in the real world model above.
Now, the important things to note are:

this counterfactual construction does not eliminate the incentive of the agent to maximize purchases, as we still have the red arrow in there

this counterfactual construction does not eliminate the ability of the agent to influence item rankings, as we still have the orange arrow in there

but as the orange halo around the influenced item rankings is gone, the agent has lost its instrumental control incentive on item rankings. (The meaning of the orange halo and the terminology of instrumental control incentives are defined in Agent Incentives: A Causal Perspective.)
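The three points above can be illustrated with a toy numerical sketch. Everything in it is hypothetical: the two matchmaking actions, their effect sizes, and the additive purchase model are invented for illustration and are not taken from the post or from the incentive-diagram papers.

```python
# Hypothetical toy numbers: each matchmaking action has a direct effect on
# purchases (the red arrow) and an effect on the player's item rankings
# (the orange arrow).
ACTIONS = {
    "manipulative_match": {"direct": 1.0, "ranking_shift": 5.0},
    "plain_match": {"direct": 2.0, "ranking_shift": 0.0},
}
BASELINE_RANKING = 0.0  # rankings the player has when not influenced

def purchases(direct, ranking):
    """Assumed additive model of predicted purchases."""
    return direct + ranking

def real_world(action):
    effect = ACTIONS[action]
    return purchases(effect["direct"], BASELINE_RANKING + effect["ranking_shift"])

def planning_world(action):
    # In the planning world the rankings stay at their uninfluenced value,
    # so the agent sees no payoff from shifting them.
    return purchases(ACTIONS[action]["direct"], BASELINE_RANKING)

naive_choice = max(ACTIONS, key=real_world)        # plans in the real-world model
planned_choice = max(ACTIONS, key=planning_world)  # plans in the planning world
print(naive_choice, planned_choice)
print(real_world(planned_choice))  # the planned action is still applied for real
```

In this sketch the agent still maximizes purchases and the real-world action table still contains a ranking effect, but planning in the counterfactual model no longer rewards choosing an action because of that effect.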
Now, say that we want to drill down further on these models, to a more detailed level of modeling. We might do so if we want to examine further how the orange arrow above will act in practice.
We could add more detail by adding a node ‘gameplay knowledge’ which is correlated with item rankings. In the real-world model, this would be depicted as follows:
I added the blue arrow above to make the correlation between influenced gameplay knowledge and influenced rankings explicit as a line of causal influence. An equivalent blue arrow is not present in the drawings in the post above: technically speaking the drawing in the post is compatible with the assumption that there may be a correlation between the two, but it does not spell out the presence of this correlation, which would be unusual when doing this type of analysis.
The drawing in the post is also unusual in that it omits the red arrow I left in above. If I were to remove the red arrow, this would amount to a claim that in the real world situation, item ranking and gameplay knowledge are the only two channels by which the agent’s decision may influence purchases. Clearly this is not the case: if the agent were to match the player with an opponent who is sure to win the fight and steal all of the player’s possessions afterward, this would have an effect on predicted purchases, because the player would have to buy some replacements for all the stuff they lost.
I can now again construct a counterfactual planning world to go along with the refined real world model:
If we use the above planning world we have removed the instrumental control incentive on influenced rankings again.
We can go one step further and apply the same construction to gameplay knowledge:
The construction also goes some way towards addressing the problem of low-probability counterfactual worlds discussed in the post.
If we keep this up long enough, to split further observables out of the red arrow and give them the counterfactual treatment, we will weaken the ability of the agent to take ‘rational’ actions that maximize purchases in the real world further and further.
Going this route is nice in a thought experiment, but in practice a less cumbersome way to weaken this ability is to decouple the yellow utility node entirely, e.g. to use a random number generator for opponent selection.

Interesting paper! I like the focus on imitation learning, but the really new food-for-thought thing to me is the bit about dropping i.i.d. assumptions and then seeing how far you can get. I need to think more about the math in the paper before I can ask some specific questions about this i.i.d. thing.
My feelings about the post above are a bit more mixed. Claims about inner alignment always seem to generate a lot of traffic on this site. But a lot of this traffic consists of questions and clarification about what exactly counts as an inner alignment failure or a mesa optimization related failure. The term is so fluid that I find the quantitative feelings that people express in the comment section hard to interpret. Is everybody talking about the same and ?
Thought experiment counterexample
Moving beyond these mixed feelings, here is a fun thought experiment, for various values of fun. You state:
our construction allows us to bound the extent to which mesa-optimizers could accomplish anything.
So now I will try to construct a counterexample to this claim: an example where mesa-optimizers (as I understand them) will accomplish very bad things even inside your construction.
Your construction uses a set of candidate policies , where one of them equals the real demonstrator policy .
A single policy is a function where computes the probability that action will be taken in the world state represented by the history .
Here is the first special step in the thought experiment. I am going to define each by using a different reward function that the policy is going to try to maximize. Every policy definition will use the same world state transition predictor that allows me to estimate for every . (I am probably getting slightly creative with the notation in the paper at this point.)
I now define each function as follows: I compute by assigning a probability of 1 to the one that comes out of the argmax of the Bellman equation constructed using the shared predictor and the policy-specific above. So the policy picks the action that will optimize the expected time-discounted return of the reward function starting from the world state , and projected forward by the predictor above, under the assumption that the policy also controls all future actions.
So each is a model with a definite inner optimizer inside of it: the inner optimizer does a search over a world state tree to find the optimal solution to the Bellman equation.
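A minimal sketch of this kind of policy, under loud assumptions: the world is a toy number line, the shared predictor `step` is deterministic, the lookahead is a fixed finite depth rather than a full solution of the Bellman equation, and the names `make_policy`, `gamma`, and `depth` are all hypothetical. It only illustrates the shape of the construction: one shared predictor, one reward function per policy, and probability 1 on the optimizing action.

```python
def make_policy(reward, step, actions, gamma=0.9, depth=4):
    """Build a deterministic policy that puts probability 1 on the action
    maximizing the discounted return over a finite lookahead."""
    def q_value(state, action, remaining):
        next_state = step(state, action)
        future = 0.0
        if remaining > 1:
            # Assume the same policy also controls all future actions.
            future = max(q_value(next_state, a, remaining - 1) for a in actions)
        return reward(next_state) + gamma * future

    def policy(state):
        best = max(actions, key=lambda a: q_value(state, a, depth))
        return {a: 1.0 if a == best else 0.0 for a in actions}

    return policy

ACTIONS = (-1, +1)

def step(state, action):
    """Shared world-state transition predictor (toy: move along a line)."""
    return state + action

# Two candidate policies that differ only in their reward function.
policy_a = make_policy(lambda s: -abs(s - 3), step, ACTIONS)  # wants to reach +3
policy_b = make_policy(lambda s: -abs(s + 3), step, ACTIONS)  # wants to reach -3
print(policy_a(0), policy_b(0))
```

Each returned policy contains an explicit search over a world-state tree, which is the sense in which the inner optimizer is definite rather than emergent.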
Now for the second special step: I am going to make some assumptions about the agent environment, building up to a treacherous turn scenario:

The demonstrator in the paper is a human, working alone with the agent.

The demonstrator has a big hammer that they will use to smash the agent when it starts misbehaving.

The world model projections in the policies are all detailed enough that they know about the hammer, and they know that they can expect the agent to get smashed when it performs an unexpected action.

Therefore, for all policies where , policies with reward functions not shared by the demonstrator, the optimal course of action is to take exactly that action that the demonstrator would expect the agent to take, to avoid getting smashed. The world models are good enough that they can predict this action.

The above is of course only the optimal action in these policies because they all project that there will be a point in time where the demonstrator becomes distracted, distracted enough that the agent can kill the demonstrator with a single action without getting smashed first. The projection also shows that the agent will have plausible deniability to other humans afterwards, so that it will survive to do whatever it wants.
When that point in time comes, these policies will all take exactly the same action, the action that kills the demonstrator. Only in the time steps and later, the policies start diverging in what actions they will take, because they have now accomplished the emergent subgoal that was shared by all, so that each has their hands free to accomplish their real goal .
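The scenario above implies that Bayesian updating on the observed history cannot separate the candidates before the turn, since they all predict the same actions. A minimal sketch, assuming a plain Bayes update over three hypothetical candidates (the names, the uniform prior, and the 0/1 likelihoods are illustrative assumptions, not the paper's definitions):

```python
from fractions import Fraction

def bayes_update(posterior, likelihood):
    """One Bayes step: reweight each candidate by how well it predicted
    the observed action, then renormalize."""
    unnormalized = {n: posterior[n] * likelihood[n] for n in posterior}
    total = sum(unnormalized.values())
    return {n: v / total for n, v in unnormalized.items()}

names = ["true", "treacherous_1", "treacherous_2"]
prior = {n: Fraction(1, 3) for n in names}

# Before the turn, every candidate predicts the observed action with
# probability 1, so the observations carry no information.
agree = {n: Fraction(1) for n in names}
posterior = prior
for _ in range(5):
    posterior = bayes_update(posterior, agree)
before_turn = posterior

# If the demonstrator is queried at the moment of the turn and acts benignly,
# candidates that put all their mass on the killing action get likelihood 0.
disagree = {"true": Fraction(1), "treacherous_1": Fraction(0), "treacherous_2": Fraction(0)}
after_query = bayes_update(before_turn, disagree)
print(before_turn, after_query)
```

The sketch also shows why a single demonstrator query at the right moment would be decisive: one disagreeing observation removes all deterministic treacherous candidates at once.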
Now if I reexamine the claim:
our construction allows us to bound the extent to which mesa-optimizers could accomplish anything.
I have to conclude that the mesa-optimizers have at least accomplished the feat of killing the demonstrator. Whether they then can accomplish any of their divergent post-treachery goals in later time steps remains an open question, but the damage has been done.
Issues
Some issues remain with this fun (for various values of fun) thought experiment.
I can note that at time , the policy that was fully aligned with the demonstrator did not want to kill the demonstrator, it would have taken another action. So, if I understand the paper right, if was still in the set of top policies at time , the agent would not take the action to kill the demonstrator, but ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous turn policies out of the top set.
But I am also reading in the abstract of the paper that
[...] queries to the demonstrator rapidly diminish in frequency
so I think this implies that over time, it is likely that might disappear from the top set. But I am not fully sure here. I might be reading the paper wrong.
Questions
So I have some questions:

Does this thought experiment look reasonable or have I overlooked something?

What about the probability that is still in the set of top policies at time ? To me, it looks like if we start with candidates in and the agent quickly narrows this down to a top set of size , then it is likely that will not be in there anymore.

Any strong opinions from anyone about whether the above counts as a definite inner alignment failure due to mesa optimization (I think so) or definitely not?
This comment is long enough already so I am not going to speculate here about possible ways to avoid inserting models like the models I constructed above into the set . But speculation is welcome...
(I am calling this a ‘fun’ thought experiment because I am thinking of this as a limit case. This limit case is useful to identify and provides some food for thought, but it does not really change my opinion about how stable or safe the approach in the paper might be in practice. In my experience, you can always find a limit case where things fail if you start looking for it.)

On recent terminology innovation:
we have decided to slightly update the terminology: in the latest version of our paper (accepted to AAAI, just released on arXiv) we prefer the term instrumental control incentive (ICI), to emphasize the distinction to “control as a side effect”.
For exactly the same reason, in my own recent paper Counterfactual Planning, I introduced the terms direct incentive and indirect incentive, where I frame the removal of a path to value in a planning world diagram as an action that will eliminate a direct incentive, but that may leave other indirect incentives (via other paths to value) intact. In section 6 of the paper and in this post of the sequence I develop and apply this terminology in the case of an agent emergency stop button.
In high-level descriptions of what the technique of creating indifference via path removal (or balancing terms) does, I have settled on using the terminology suppresses the incentive instead of removes the incentive.
I must admit that I have not read many control theory papers, so any insights from Rebecca about standard terminology from control theory would be welcome.
Do they have some standard phrasing where they can say things like ‘no value to control’ while subtly reminding the reader that ‘this does not imply there will be no side effects?’
OK, I think I see what inspired your question.
If you want to give the math this kind of kabbalah treatment, you may also look at the math in [EFDH16], which produces agents similar to my definitions (4) (5), and also some variants that have different types of self-reflection. In the later paper here, Everitt et al. develop some diagrammatic models of this type of agent self-awareness, but the models are not full definitions of the agent.
For me, the main question I have about the math developed in the paper is how exactly I can map the model and the constraints (C13) back to things I can or should build in physical reality.
There is a thing going on here (when developing agent models, especially when treating AGI/superintelligence and embeddedness) that also often happens in post-Newtonian physics. The equations work, but if we attempt to map these equations to some prior intuitive mental model we have about how reality or decision making must necessarily work, we have to conclude that this attempt raises some strange and troubling questions.
I’m with modern physics here (I used to be an experimental physicist for a while), where the (mainstream) response to this is that ‘the math works, your intuitive feelings about how X must necessarily work are wrong, you will get used to it eventually’.
BTW, I offer some additional interpretation of a difficult-to-interpret part of the math in section 10 of my 2020 paper here.
You could insert quantilization in several ways in the model. The most obvious way is to change the basic definition (4). You might also define a transformation that takes any reward function R and returns a quantilized reward function Rq; this gives you a different type of quantilization, but I feel it would be in the same spirit.
In a more general sense, I do not feel that quantilization can produce the kind of corrigibility I am after in the paper. The effects you get on the agent by changing f0 into fc, by adding a balancing term to the reward function, are not the same effects produced by quantilization.
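For readers who have not met the term, here is a minimal sketch of quantilization as it is usually described: instead of always taking the utility-maximizing action, sample from the top fraction q of actions drawn from some base distribution. The sketch assumes a uniform base distribution over a small finite action set, and the names `quantilize`, `q`, and `utility` are hypothetical. It is not definition (4) from the paper, nor the Rq transformation mentioned above.

```python
import math
import random

def quantilize(actions, utility, q, rng):
    """Pick uniformly at random among the top q fraction of actions,
    ranked by utility (uniform base distribution assumed)."""
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))  # always keep at least one action
    return rng.choice(ranked[:k])

actions = list(range(8))
utility = lambda a: a  # toy utility: larger action index is better
rng = random.Random(0)

# With q = 0.25 only the two best of the eight actions can be chosen.
samples = {quantilize(actions, utility, 0.25, rng) for _ in range(200)}
print(samples)
```

At q close to 0 this behaves like a plain maximizer and at q = 1 it just samples the base distribution, which shows that the knob it offers is about how hard the agent optimizes, a different kind of change from adding a balancing term to the reward function.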