Hey there!
This is similar to the value-learning via indifference idea I presented here: https://www.lesswrong.com/posts/btLPgsGzwzDk9DgJG/proper-value-learning-through-indifference , with the most up-to-date version here: https://arxiv.org/abs/1712.06365
Great minds thinking alike and all that.
Your method differs in that you use a logical fact, rather than a probability, to underpin your scheme. I would just have put u(h) as a constant if f⁻¹(y)≠0, but the way you do it prevents the agent from going crazy (all utilities are the same, or logic is broken) if it ever figures out that f⁻¹(y)=0, and incentivises it to shut down in that case.
The disadvantage is your requirement (3): difficulty erasing x. If u(h) were truly constant given f⁻¹(y)≠0, then that would not be necessary: it would only matter that there was a small probability that x=0, so we could do a sloppy job of erasing.
On that note, your scheme falls apart if the agent can only change u(h) by a really tiny amount (in that case, spending a lot of resources calculating x makes little difference). To correct for that, if you want to preserve a non-constant utility for f⁻¹(y)≠0, then you need to scale this utility by the amount that the agent expects it can change u(h) in the f⁻¹(y)=0 world.
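In rough notation (mine, just to illustrate, not from the post): writing v(h) for the non-constant utility used when f⁻¹(y)≠0, the scaled version would be something like

v(h) · E[Δu | f⁻¹(y)=0]

where Δu is the largest change in u(h) the agent expects it could bring about in the f⁻¹(y)=0 world. If the agent can barely move u(h), the stakes in the f⁻¹(y)≠0 world then shrink by the same factor.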
Overall: I like the idea of using logical facts, I don’t like the idea of assuming that calculation costs are enough to control whether the agent computes the fact or not. Some better use of logical uncertainty might be called for?
Thank you so much for the feedback!
The disadvantage is your requirement (3): difficulty erasing x.

I just want to note that if it’s too easy for the agent to reconstruct x, this approach fails gracefully (and we can then simply improve this aspect and try again).
On that note, your scheme falls apart if the agent can only change u(h) by a really tiny amount (in that case, spending a lot of resources calculating x makes little difference).

I agree (the approach fails gracefully in that case too). We can create a “sink” for excess computation power by designing u to reward the agent for doing arbitrary computations (e.g. finding more and more prime numbers), but in a way that the agent always prefers to give up any amount of such computation if it means achieving its goal (and terminating) one time-step earlier.
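A minimal sketch of one way to build such a sink (a toy construction of mine, not from the post; alpha, the comp_reward bound, and the utility shape are all illustrative assumptions):

```python
def utility_neq(t_terminate: int, comp_reward: float, alpha: float = 1.0) -> float:
    """Toy utility for the f⁻¹(y)≠0 case: a termination bonus plus a bounded
    computation "sink", scaled so the sink can never compensate for
    terminating one time-step later.

    comp_reward is assumed to lie in [0, 1), e.g. a saturating function of
    how many new primes the agent has found.
    """
    assert 0.0 <= comp_reward < 1.0
    base = alpha / (1 + t_terminate)  # strictly decreasing in t_terminate
    # The most the sink can add is strictly less than the gain from
    # terminating one step earlier:
    gap = alpha / (1 + t_terminate) - alpha / (2 + t_terminate)
    return base + comp_reward * gap

# Terminating at step t with no sink reward always beats terminating at
# step t+1 with near-maximal sink reward:
for t in range(100):
    assert utility_neq(t, 0.0) > utility_neq(t + 1, 0.999)
```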
Regarding your paper (thanks for the pointer!): is there already a known way to use event-dependent rewards and effective disbelief to introduce an off-switch, one that the agent will be indifferent about, into an arbitrary POMDP problem?
ETA: Regarding the seamless transition approach, I think I understand how it can solve this. But it requires using the agent’s model of the environment in the definition of the utility function, so it might not work for a black-box agent that maximizes a computable function of reward given via a reward channel. I think the approach I described in this post could still be used in such a case.
ETA2: A note to random readers: invoking a black-box agent is probably a very bad idea unless there’s a consensus in the AI safety community that it should be done!
EDIT: The following is probably incorrect, I need more time to think about it, please ignore it.
I’m confused about the seamless transition approach. Suppose the agent constructs a sub-agent that, from time step t+1 (right after the transition), acts to minimize R′ iff it observes that the original agent keeps following the same policy after the transition (instead of switching to a policy that would maximize R′). Thereby, it will increase the extra corrective reward C (and thus the pseudo-reward R′+C) by decreasing V(R′, π_A, h_{t+1}).
In the proof of Theorem 17, the last sentence is:
This is obviously maximized by π_A being the optimal policy for R up until time t+1.
But π_A might not be the optimal policy for R up until t+1. For example, if it corresponds to creating the above sub-agent, it isn’t optimal for R up until t+1.
There still might be problems with subagents, though. It could be optimal for the agent to create a subagent to protect it from interference while it “goes to sleep”.
I agree.
I think this might be solved by modifying the utility for the case f⁻¹(y)≠0 to:

α / (1 + [number of time-steps until the first "self-terminate" action])
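For example, with α=1: self-terminating at the very first time-step yields utility 1, one step later 1/2, two steps later 1/3, and so on. So, conditional on f⁻¹(y)≠0, every step of delay strictly decreases utility, and the agent prefers to self-terminate as early as possible.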
Yep, that’s better. There’s still the risk of subagents being created: when the agent thinks that f⁻¹(y)≠0 almost certainly, but not completely certainly, it might create a u-maximising subagent and then self-terminate.
That means that this design, like most indifference designs, is reflectively consistent but not reflectively stable.
Wow, I agree!
Let us modify the utility for the case f⁻¹(y)=0 to:

u∗(h) = 0 if h contains a "self-terminate" action, and u∗(h) = u(h) otherwise.

Meaning: no utility can be gained via subagents if the agent “jumps ship” (i.e. self-terminates to gain utility in case f⁻¹(y)≠0).
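To make the interaction of the two cases concrete, here is a toy sketch (the history encoding, the probability p_neq, and u itself are all made up for illustration):

```python
def u_star(history, u):
    """f⁻¹(y)=0 case: utility is zeroed if the agent ever self-terminated."""
    return 0.0 if "self-terminate" in history else u(history)

def u_neq(history, alpha=1.0):
    """f⁻¹(y)≠0 case: rewards self-terminating as early as possible."""
    t = history.index("self-terminate") if "self-terminate" in history else len(history)
    return alpha / (1 + t)

def expected_utility(history, u, p_neq=0.999):
    """An agent that is almost (but not completely) sure that f⁻¹(y)≠0."""
    return p_neq * u_neq(history) + (1 - p_neq) * u_star(history, u)

# "Jump ship": self-terminate early and leave a u-maximising subagent behind.
# The subagent's later u-gains contribute nothing, because u_star is 0
# once "self-terminate" appears in the history.
jump_ship = ["build-subagent", "self-terminate", "subagent-maximises-u"]
print(expected_utility(jump_ship, u=lambda h: 100.0))  # u_star term is 0.0
```

So the “jump ship” strategy gains nothing in the f⁻¹(y)=0 world, although, as noted below, nothing here forbids creating subagents outright.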
Interesting. I’ll think about whether this works and can be generalised. It doesn’t make the design reflectively stable (creating u-maximising subagents is still allowed, and doesn’t directly hurt the agent), but it might improve the situation.
Yep, seamless transition does not work on black-box agents. However, you don’t have to fully understand the agent’s utility, or the consequences of that utility, to get the design to work. So full transparency and understanding are not needed.