At this point, the AI has a strong incentive to manipulate its memory to produce cell phone signals, and to create a superintelligence set to the task of controlling its future inputs.
Picking subroutines to run isn’t in its action space, so it doesn’t pick subroutines to maximize its utility. It runs subroutines according to its code. If the internals of the main agent involve an agent making choices about computation, then this problem could arise. But then we’re not talking about a chatbot agent; we’re talking about a totally different agent. I think you anticipate this objection when you say
(If this is outside its action space, then it can try to make a brainwashy message)
In one word??
Suppose you can’t get the human to type the exact input you want now, but you can get the human to go away without inputting anything while the AI slowly bootstraps an ASI which can type the desired string.
Again, its action space is printing one word to a screen. It’s not optimizing over a set of programs and then picking one in order to achieve its goals (perhaps by bootstrapping an ASI).
I was under the impression that this agent could output as much text as it felt like, or at least a decent amount; it was just optimizing over the next little bit of input. An agent that can print as much text as it likes to a screen, and is optimizing to make the next word typed in at the keyboard “cheese”, is still dangerous. If it has a strict one-word-in, one-word-out loop, so that it outputs one word, then reads in one word, and each output is optimized over the next word of input, then that is probably safe, and totally useless. (Assuming you only allow words in a dictionary, so 500 characters of alphanumeric gibberish don’t count as one word just because they don’t contain spaces.)
Yep, I agree it is useless with a horizon length of 1. See this section:
For concreteness, let its action space be the words in the dictionary, and I guess 0-9 too. These get printed to a screen for an operator to see. Its observation space is the set of finite strings of text, which the operator enters.
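To make that interface concrete, here is a minimal sketch of the loop being described (the word list, function names, and the policy itself are hypothetical stand-ins, not anything from the post):

```python
import string

# Hypothetical stand-in for a real dictionary word list.
DICTIONARY = {"cheese", "hello", "the", "press", "enter"}
ACTION_SPACE = DICTIONARY | set(string.digits)  # dictionary words plus 0-9

def run_episode(policy, horizon):
    """Alternate single-word actions with free-form operator observations."""
    history = []
    for _ in range(horizon):
        word = policy(history)       # the agent's action: one element of ACTION_SPACE
        assert word in ACTION_SPACE
        print(word)                  # printed to the screen for the operator
        observation = input("> ")    # any finite string of text, possibly empty
        history.append((word, observation))
    return history
```

For example, `run_episode(lambda h: "cheese", horizon=3)` would print “cheese” three times and record whatever the operator types in between, including empty strings.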
So at longer horizons, the operator will presumably be pressing “enter” repeatedly (i.e. submitting the empty string as the observation) so that more words of the message come through.
This is why I think the relevant questions are: at what horizon-length does it become useful? And at what horizon-length does it become dangerous?
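To gesture at how the horizon enters the objective, here is a rough sketch (hypothetical names, and assuming some predictive model of the operator’s replies). The agent still only ever emits one dictionary word at a time, but a larger h is what lets it value multi-step plans like the slow-bootstrap scenario above, whereas at h = 1 it can only optimize the very next observation:

```python
# Same hypothetical action space as in the earlier sketch.
ACTION_SPACE = {"cheese", "hello", "the", "press", "enter"} | set("0123456789")

def expected_return(model, history, horizon):
    """Best achievable expected reward over the next `horizon` steps.

    `model(history, word)` is a hypothetical stand-in that yields
    (observation, reward, probability) triples for the operator's reply.
    """
    if horizon == 0:
        return 0.0
    return max(
        sum(p * (r + expected_return(model, history + [(word, obs)], horizon - 1))
            for obs, r, p in model(history, word))
        for word in ACTION_SPACE
    )

def pick_word(model, history, horizon):
    """Choose the single next word that maximizes expected return at this horizon."""
    return max(
        ACTION_SPACE,
        key=lambda word: sum(
            p * (r + expected_return(model, history + [(word, obs)], horizon - 1))
            for obs, r, p in model(history, word)
        ),
    )
```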