I was under the impression that this agent could output as much text as it felt like, or at least a decent amount, it was just optimizing over the next little bit of input. An agent that can print as much text as it likes to a screen, and is optimising to make the next word typed in at the keyboard “cheese” is still dangerous. If it has a strict one word in, one word out, so that it outputs one word then inputs one word, and each word is optimizing over the next word of input, then that is probably safe, and totally useless. (Assuming you just let words in a dictionary, so 500 characters of alphanumeric gibberish don’t count as 1 word just because it doesn’t contain spaces.)
Yep, I agree it is useless with a horizon length of 1. See this section:
For concreteness, let its action space be the words in the dictionary, and I guess 0-9 too. These get printed to a screen for an operator to see. Its observation space is the set of finite strings of text, which the operator enters.
So at longer horizons, the operator will presumably be pressing “enter” repeatedly (i.e. submitting the empty string as the observation) so that more words of the message come through.
This is why I think the relevant questions are: at what horizon-length does it become useful? And at what horizon-length does it become dangerous?
I was under the impression that this agent could output as much text as it felt like, or at least a decent amount, it was just optimizing over the next little bit of input. An agent that can print as much text as it likes to a screen, and is optimising to make the next word typed in at the keyboard “cheese” is still dangerous. If it has a strict one word in, one word out, so that it outputs one word then inputs one word, and each word is optimizing over the next word of input, then that is probably safe, and totally useless. (Assuming you just let words in a dictionary, so 500 characters of alphanumeric gibberish don’t count as 1 word just because it doesn’t contain spaces.)
Yep, I agree it is useless with a horizon length of 1. See this section:
So at longer horizons, the operator will presumably be pressing “enter” repeatedly (i.e. submitting the empty string as the observation) so that more words of the message come through.
This is why I think the relevant questions are: at what horizon-length does it become useful? And at what horizon-length does it become dangerous?