An agent that constantly twitches could still be a threat if it were trying to maximise the probability that it would actually twitch in the future. For example, if it were to break down, it wouldn’t be able to twitch, so it might want to gain control of resources.
Could you clarify exactly how this twitching agent is defined? In particular, how does its utility accumulate over time? Do you get 1 utility for each point in time where you twitch, and is your total utility the undiscounted sum of these utilities?
I am not defining this agent using a utility function. It turns out that because of coherence arguments and the particular construction I gave, I can view the agent as maximizing some expected utility.
I like Gurkenglas’s suggestion of a random number generator hooked up to motor controls; let’s go with that.
Yeah, but it’s not trying to maximize that probability. I agree that a superintelligent agent that is trying to maximize the amount of twitching it does would be a threat, possibly by acquiring resources. But motor controls hooked up to random numbers certainly won’t do that.
If your robot powered by random numbers breaks down, it indeed will not twitch in the future. That’s fine, clearly it must have been maximizing a utility function that assigned utility 1 to it breaking at that exact moment in time. Jessica’s construction below would also work, but it’s specific to the case where you take the same action across all histories.
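One way to make the “utility 1 for breaking at that exact moment” construction concrete is an indicator utility over whole histories. The sketch below is only an illustration: the event names, the `rationalizing_utility` helper and the three-step horizon are assumptions invented for the example, not part of the construction described above.

```python
from itertools import product


def rationalizing_utility(observed_history):
    """Return a utility function that is 1 on the observed history and 0 elsewhere."""
    observed = tuple(observed_history)

    def utility(history):
        return 1 if tuple(history) == observed else 0

    return utility


# Hypothetical events per time step, invented for this example.
EVENTS = ("twitch", "idle", "break")

# Whatever the random-number robot happened to do, including breaking at the third step.
observed = ("twitch", "twitch", "break")
u = rationalizing_utility(observed)

# Among all length-3 histories, the observed one is the unique utility maximizer.
all_histories = list(product(EVENTS, repeat=len(observed)))
best = max(all_histories, key=u)
maximizers = [h for h in all_histories if u(h) == u(best)]
```

Under these assumptions, `maximizers` contains only the observed history, so the behavior counts as utility-maximizing even though nothing in the robot computes a utility.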
Presumably, it is a random number generator hooked up to motor controls. There is no explicit calculation of utilities that tells it to twitch.
It can maximize the utility function: $\sum_{t=0}^{\infty} \frac{1}{3^t} \cdot (1 \text{ if I take the twitch action in time step } t,\ 0 \text{ otherwise})$. In a standard POMDP setting this always takes the twitch action.
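A quick numerical check of this utility function, as a sketch under simplifying assumptions: a finite horizon of 6 steps, only two actions (`"twitch"` and a hypothetical `"other"`), and no observations, so it is a toy rather than a full POMDP.

```python
from fractions import Fraction
from itertools import product


def utility(actions):
    """Discounted utility: add (1/3)**t for every step t whose action is 'twitch'."""
    return sum(Fraction(1, 3**t) for t, a in enumerate(actions) if a == "twitch")


HORIZON = 6                    # assumed finite horizon, for brute force only
ACTIONS = ("twitch", "other")  # "other" is a stand-in for any non-twitch action

# Brute-force search over every action sequence of length HORIZON.
best = max(product(ACTIONS, repeat=HORIZON), key=utility)

# Infinite-horizon value of twitching at every step after the current one:
# the geometric series sum_{t>=1} (1/3)**t = (1/3) / (1 - 1/3).
tail_bound = Fraction(1, 3) / (1 - Fraction(1, 3))
```

Here `best` is the all-twitch sequence, and `tail_bound` is 1/2, which is less than the utility of 1 gained by twitching in the current step. So giving up a twitch now can never be repaid by any amount of future twitching.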
Oh that’s interesting, so you’ve chosen a discount rate such that twitching now is always more important than twitching for the rest of time. And presumably it can’t both twitch AND take other actions in the world in the same time-step, as that’d make it an immediate threat.
Such a utility maximiser might become dangerous if it were broken in such a way that it couldn’t take the twitch action for a long period of time, including the current time step. In that case it would take whatever actions would allow it to twitch again as soon as possible. I wonder how dangerous such a robot would be?
On one hand, the goal of resuming twitching as soon as possible would seem to require only a limited amount of power to be accumulated. On the other hand, any resources accumulated in this process would then be deployed to maximise its utility. For example, it might have gained control of a repair drone, which could now operate independently even if the original robot could only twitch and do nothing else. Even then, it would likely be less of a threat: if the repair drone left to do anything else, the original robot might break down in the meantime and the repair would be delayed. Then again, perhaps the repair drone can hack other systems without moving, which might result in resource accumulation.
In a POMDP there is no such thing as not being able to take a particular action at a particular time. You might have some other formalization of agents in mind; my guess is that, if this formalization is made explicit, there will be an obvious utility function that rationalizes the “always twitch” behavior.
A POMDP is an abstraction. Real agents can be interfered with.
AI agents are designed using an agency abstraction. The notion of an AI “having a utility function” itself only has meaning relative to an agency abstraction. There is no such thing as a “real agent” independent of some concept of agency.
All the agency abstractions I know of permit taking one of some specified set of actions at each time step, which can easily be defined to include the “twitch” action. If you disagree with my claim, you can try formalizing a natural one that doesn’t have this property. (There are trivial ways to restrict the set of actions, but then you could use a utility function to rationalize “twitch if you can, take the lexicographically first action you can otherwise”.)
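As a rough sketch of that last point, here is a single-step utility whose maximizer is “twitch if you can, otherwise the lexicographically first available action”. The action names and the helpers `make_utility` and `best_available` are hypothetical, and the sketch ignores how utilities would combine across time steps.

```python
# Hypothetical full action set, invented for this example.
ALL_ACTIONS = ["grab", "move", "twitch", "wait"]


def make_utility(all_actions):
    """Rank 'twitch' highest, then the remaining actions in lexicographic order."""
    others = sorted(a for a in all_actions if a != "twitch")
    ranking = ["twitch"] + others
    n = len(ranking)
    # Earlier in the ranking means strictly higher utility.
    return {action: n - i for i, action in enumerate(ranking)}


U = make_utility(ALL_ACTIONS)


def best_available(available):
    """Pick the utility-maximizing action from the currently available subset."""
    return max(available, key=U.__getitem__)
```

With this ranking, `best_available({"move", "twitch"})` picks `"twitch"`, while `best_available({"wait", "move"})` falls back to `"move"`, so restricting the action set does not stop the behavior from being rationalized.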
How do you imagine the real agent working? Can you describe the process by which it chooses actions?
Presumably twitching requires sending a signal to a motor control, and that connection can be broken.
Sorry, I wasn’t clear enough. What is the process which both:
Sends the signal to the motor control to twitch, and
Infers that it could break or be interfered with, and sends signals to the motor controls that cause it to be in a universe-state where it is less likely to break or be interfered with?
I claim that for any such reasonable process, if there is a notion of a “goal” in this process, I can create a goal that rationalizes the “always-twitch” policy. If I put in the goal that I construct into the program that you suggest, the policy always twitches, even if it infers that it could break or be interfered with.
The “reasonable” constraint is to avoid processes like “Maximize expected utility, except in the case where you would always twitch, in that case do something else”.