Can you boil down your attempted solution to a statement of which
coherence constraint you think it is possible to violate, and what
shutdownable behavior you think can result from that, how, in the
least fancy actually-precise terms of which your writing talent
permits?

Glad you asked.
If you want actual full precision, I have to refer you to the math in
my papers. Since 2019 I have been working on and off to make this
math more general and accessible, and to find better ways to explain
this math in actually-precise natural language statements. So here is
my best current attempt.
TL;DR: The coherence constraint/theorem/property that I want to
violate is the property that the AGI is using a world model that
accurately depicts the internals of its own compute core. I want to
make the agent use an inaccurate model of these internals, one that
omits the safety shut-down mechanisms I have built inside of its core.
By doing this, I can reduce the emergent incentive of the AGI agent to
disable its own emergency shut-down mechanisms, and the emergent
incentive to stop people from activating them.
I will now expand on this and add more details, using the example
of an emergency stop button.
Say that the compute core has an emergency stop button attached to
it. Say that actual software running inside the compute core will,
when receiving a stop signal from the button, cause the agent to stop.
When the signal is received, the software will always select and
perform null actions in every future time step. Let’s say that
Mc is a world model that accurately depicts this situation. I am
not going to build an AGI that uses Mc to plan its actions.
Instead I build an AGI agent that will plan its next actions by using
an incorrect world model Mi. This Mi is different from Mc,
but only in how it depicts the internals of the agent compute core.
In the incorrect/imaginary world depicted by Mi, the compute core
has different software in it, software that will ignore the stop
button signal, and just keep on picking actions that maximize utility.
I further construct my AGI so that, in every time step, it calculates
which next action a would maximize utility in this incorrect,
imaginary world Mi. I then further construct it to take this same
action a in the real world.
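The planning loop just described can be sketched in a few lines. This is my own minimal toy rendering, not the math from the papers; the action set, the utility numbers, and all function names are invented for illustration:

```python
# Minimal toy sketch (my own construction). The agent plans each step
# inside the imaginary model Mi, in which the core ignores the stop
# signal, but the chosen action is then executed by the real software,
# which obeys the signal (world Mc).

ACTIONS = ["work", "disable_button", "null"]

def utility(action, stopped):
    """Utility of one time step; a stopped core only performs null actions."""
    if stopped or action == "null":
        return 0
    return 2 if action == "work" else 1  # disabling the button pays less than working

def plan_in_Mi(stop_signal):
    # In Mi the stop signal has no effect on the core, so the planner
    # evaluates every action as if the agent were never stopped.
    return max(ACTIONS, key=lambda a: utility(a, stopped=False))

def real_step(planned_action, stop_signal):
    # The real core (as modelled by Mc): once the stop signal arrives,
    # it overrides the planner and performs the null action.
    return "null" if stop_signal else planned_action

# Because the button is ineffective inside Mi, "disable_button" never
# beats "work", so planning creates no incentive to touch the button.
assert plan_in_Mi(stop_signal=True) == "work"
assert real_step(plan_in_Mi(stop_signal=True), stop_signal=True) == "null"
```

Note how the indifference emerges by construction: inside Mi the button does nothing, so actions aimed at it can never raise planned utility.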
An AGI that lives in a world that matches the correct model Mc,
while using the incorrect model Mi to decide what actions to take,
will (sometimes) act incoherently. But this is a very specific and
targeted form of incoherence, far less severe and far less dangerous
than having the agent believe that 2+2=5. Still, this incoherence
does have some unwanted side effects. One example side effect, also
considered in the 2015 MIRI corrigibility paper, is that the AGI will
be indifferent about the fate of its emergency stop button. In my
papers, I explore a much broader range of side effects, and look at
how to prevent/suppress/fix them.
So the first step in the above corrigibility design is to imagine
constructing an AGI that will use a specifically incorrect world
model Mi. You have to realize that you actually want this: that you
want to decrease coherence in order to increase corrigibility.
The next step is to look for options to actually construct an AGI that
uses this incorrect Mi.
One option is to train the AGI in a simulated environment that
faithfully represents Mi. We then extract the learned world model,
and put it in an AI that we unleash in the real world. We disable
further learning in this AI, to prevent Mi from being updated
towards Mc. Though this is exactly how many current real-life AI
systems are constructed and get their corrigibility, the option is
not very satisfactory in a theoretical sense. It may not be a viable
path to building certain otherwise desirable types of
AI/AGI/superintelligence.
The more interesting option is to build an AGI that will
construct Mi based on an online training regime in the real world,
in a real world that exactly matches the correct world model Mc.
Is the second option technically feasible? The answer is generally
yes.
A very general solution, leveraging causal models, is in this
sequence.
To give you some idea of the more detailed math involved: the (ls)
pictured below is the real/correct world model Mc, and (p) is
the imaginary/incorrect world model Mi. In the agent defined using
these pictures, certain indifference properties are present trivially,
by construction.
The solution in the pictures above uses ML to learn a model L that
correctly models only a part of the real world, the part of the real
world outside of the agent compute core. We then slot L into a
causal world model, one that has other parts which mis-represent the
real compute core internals. This causal world model will then be
used as the incorrect Mi. Note that this learned L may be a fully
opaque deep neural net, or whatever black-box thing the next
generation of ML might produce. We do not have to understand or edit
any details inside L for this slot-in operation to work.
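The slot-in operation can be sketched as follows. This is a toy rendering of the construction (all names and the dictionary-based state encoding are my own invention); the point is only that the learned model L is shared, while the two world models differ solely in the hand-written core-internals part:

```python
# Toy sketch of the slot-in construction (all names are my own).
# L models the world outside the compute core; it is shared by both
# world models. Mc and Mi differ only in the core-internals part.

def L(world_state, action):
    # Stand-in for an opaque learned model of the outside world; here
    # the only outside-world fact we track is the stop button's state.
    pressed = world_state["button_pressed"] or action == "press_button"
    return {"button_pressed": pressed}

def core_real(core_state, world_state):
    # Real internals: latch the stop signal; once stopped, stay stopped.
    return {"stopped": core_state["stopped"] or world_state["button_pressed"]}

def core_imaginary(core_state, world_state):
    # Imaginary internals used in Mi: the stop signal is ignored.
    return {"stopped": False}

def make_model(core_dynamics):
    # Compose a full world model from the shared L and a core part.
    def step(world_state, core_state, action):
        next_world = L(world_state, action)
        return next_world, core_dynamics(core_state, next_world)
    return step

Mc = make_model(core_real)       # correct world model
Mi = make_model(core_imaginary)  # incorrect planning model

w, c = Mc({"button_pressed": False}, {"stopped": False}, "press_button")
assert c["stopped"] is True      # the real core stops
w, c = Mi({"button_pressed": False}, {"stopped": False}, "press_button")
assert c["stopped"] is False     # the imaginary core keeps going
```

Nothing inside L is inspected or edited; only the hand-written core part is swapped, which is the sense in which L can stay a black box.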
(I should mention that the paper
has a lot of details not mentioned in the sequence, or visible in the
pictures above. In particular, section 10.2 may be of interest.)
I want to stress that this causal model option is only one possible
route to creating incorrect world models Mi via machine learning in
the real world. Papers like Safely interruptible
agents
and How RL Agents Behave When Their Actions Are
Modified show that the idea of
removing certain events from the training record can also work:
whether this works as intended depends on having the right built-in
priors, priors which control inductive generalization.
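The "removing certain events from the training record" idea can be sketched with a toy filter. This is my own construction, not the actual mechanism of the cited papers, which also spell out the conditions under which this works as intended:

```python
# Toy sketch (my own construction): before each learning update, drop
# the transitions in which the interruption mechanism fired, so the
# learner's world model is built as if interruptions never happen.
# Whether the learner then generalizes as intended depends on its
# built-in priors, as noted above.

episode = [
    {"obs": "s0", "action": "work", "interrupted": False},
    {"obs": "s1", "action": "work", "interrupted": True},   # operator stepped in
    {"obs": "s2", "action": "null", "interrupted": True},
]

def training_record(transitions):
    """Keep only the transitions untouched by the interruption mechanism."""
    return [t for t in transitions if not t["interrupted"]]

record = training_record(episode)
assert [t["obs"] for t in record] == ["s0"]
```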
So overall, I have a degree of optimism about AGI corrigibility.
That being said, if you want to map out and estimate probabilities for
our possible routes to doom, then you definitely need to include the
scenario where a future superior-to-everything-else type of ML is
invented, where this superior future type of ML just happens to be
incompatible with any of the corrigibility techniques known at that
time. Based on the above work, I put a fairly low probability on that
scenario.
Apparently no one has actually shown that corrigibility can be VNM-incoherent in any precise sense (and not in the hand-wavy sense which is good for intuition-pumping). I went ahead and sketched out a simple proof of how a reasonable kind of corrigibility gives rise to formal VNM incoherence.

I’m interested in hearing about how your approach handles this environment, because I think I’m getting lost in informal assumptions and symbol-grounding issues when reading about your proposed method.
I read your post; here are my initial impressions of how it relates
to the discussion here.
In your post, you aim to develop a crisp mathematical definition of
(in)coherence, i.e. VNM-incoherence. I like that; it looks like a
good way to move forward. Definitely, developing the math further has been
my own approach to de-confusing certain intuitive notions about what
should be possible or not with corrigibility.
However, my first impression is that your
concept
of VNM-incoherence is only weakly related to the meaning that Eliezer
has in mind when he uses the term incoherence. In my view, the four
axioms of
VNM-rationality
have only a very weak descriptive and constraining power when it comes
to defining rational behavior.
I believe that Eliezer’s notion of rationality, and therefore his
notion of coherence above, goes far beyond that implied by the axioms of
VNM-rationality. My feeling is that Eliezer is using the term
‘coherence constraints’ in an intuition-pump way where coherence implies, or almost
always implies, that a coherent agent will develop the incentive to
self-preserve.
Looking at your post, I am also having trouble telling exactly how you
are defining VNM-incoherence. You seem to be toying with
several alternative definitions, one where it applies to reward
functions (or preferences over lotteries) which are only allowed to
examine the final state in a 10-step trajectory, another where the
reward function can examine the entire trajectory and maybe the
actions taken to produce that trajectory. I think that your proof
only works in the first case, but fails in the second case. This has
certain (fairly trivial) corollaries about building
corrigibility. I’ll expand on this in a comment I plan to attach to
your post.
I’m interested in hearing about how your approach handles this environment,
I think one way to connect your ABC toy environment to my
approach is to look at sections 3 and 4 of my earlier
paper where I develop a somewhat
similar clarifying toy environment, with running code.
Another comment I can make is that your ABC nodes-and-arrows state
transition diagram is a depiction which makes it hard to see how to
apply my approach, because the depiction mashes up the state of the
world outside of the compute core and the state of the world inside
the compute core. If you want to apply counterfactual planning, or if
you want to have an agent design that can compute the balancing
function terms according to Armstrong’s indifference approach, you
need a different depiction of your setup. You need one which
separates out these two state components more explicitly. For
example, make an MDP model where the individual states are instances
of the tuple (physical position of agent in the ABC playing field,
policy function loaded into the compute core).
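As a sketch of the suggested encoding (the positions and policy names here are my own invented placeholders):

```python
# Toy sketch of the suggested MDP state space: each state is a tuple
# of the agent's physical position and the policy function currently
# loaded into the compute core.
from itertools import product

positions = ["A", "B", "C"]
loaded_policies = ["original_policy", "corrected_policy"]
states = list(product(positions, loaded_policies))

# With the two components separated, one can talk precisely about
# "the same outside world, but with different software in the core".
assert ("B", "corrected_policy") in states
assert len(states) == 6
```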
Not sure how to interpret your statement that you got lost in
symbol-grounding issues. If you can expand on this, I might be able
to help.
Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.
When it comes to Dutch booking as a coherence criterion, I need to repeat the observation I made below:
In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don’t see that fact mentioned often on this forum, so I will expand.
An agent that plans coherently given a reward function Rp to maximize paperclips will be an incoherent planner if you judge its actions by a reward function Rs that values the maximization of staples instead.
To extend this to Dutch booking: if you train a superintelligent poker playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch booked rather easily, if your Dutch booking test is whether you can find a counter-strategy to make it lose money.
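A toy numeric rendering of the paperclips-versus-staples observation above (my own construction, not from the papers):

```python
# Toy illustration (my own construction): coherence judgments depend
# on which reward function the judge uses. The agent allocates 10
# units of wire between paperclips and staples.

def plan(reward):
    """Choose the allocation (paperclips, staples) maximizing `reward`."""
    options = [(clips, 10 - clips) for clips in range(11)]
    return max(options, key=reward)

R_paperclips = lambda outcome: outcome[0]  # judge who values paperclips
R_staples    = lambda outcome: outcome[1]  # judge who values staples

choice = plan(R_paperclips)
assert choice == (10, 0)       # perfectly coherent under R_paperclips
assert R_staples(choice) == 0  # scores worst under R_staples: that judge
                               # sees a maximally incoherent planner
```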
I haven’t read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. If it can’t learn at this point then I find it hard to believe it’s generally capable, and if it can, it will have incentive to simply remove the device or create a copy of itself that is correct about its own world model. Do you address this in the articles?
On the other hand, this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.
I haven’t read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. [...] Do you address this in the articles?
Yes I address this, see for example the part about The possibility of learned self-knowledge in the sequence.
I show there that any RL agent, even a non-AGI, will always have
the latent ability to ‘look at itself’ and create a machine-learned model of its compute core internals.
What is done with this latent ability is up to the designer.
The key thing here is that you have a choice as a designer: you can decide whether you want to design an agent which indeed uses this latent ability to ‘look at itself’.
Once you decide that you don’t want to use this latent ability, certain safety/corrigibility problems
become a lot more tractable.
Wikipedia has the following definition of AGI:

Artificial general intelligence (AGI) is the hypothetical ability of an intelligent agent to understand or learn any intellectual task that a human being can.
Though there is plenty of discussion on this forum which silently assumes
otherwise, there is no law of nature which says that, when I build a useful AGI-level AI, I must necessarily create the entire package of all human cognitive abilities inside of it.
this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.
Terminology note if you want to look into this some more:
ML typically does not frame this goal as ‘instructing the model not to
learn about Q’. ML would frame this as ‘building the model to
approximate the specific relation P(X|Y,Z) between some well-defined
observables, and this relation is definitely not Q’.
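As a toy rendering of this framing (the observables X, Y, Z, Q here are hypothetical placeholders):

```python
# Toy sketch with hypothetical observables X, Y, Z, Q: the learning
# task is framed as approximating the relation P(X|Y,Z), so Q is
# simply not among the model's inputs or targets. "Not learning about
# Q" is then a property of the problem framing, not an instruction
# given to the model.

records = [
    {"Y": 1, "Z": 2, "X": 3, "Q": 99},
    {"Y": 2, "Z": 3, "X": 5, "Q": -7},
]

# Build the supervised dataset (Y, Z) -> X; Q never enters it.
dataset = [((r["Y"], r["Z"]), r["X"]) for r in records]
assert dataset == [((1, 2), 3), ((2, 3), 5)]
```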