So the problem with this is that 4⁄5 ideas here involve deliberately making the LLM more alien than it actually is, in ways unrelated to how alien the LLM may or may not be.
-- GPT-4 finetuned to interpreting commands literally* -- But this isn’t an actual failure mode for LLMs, mostly, because they’re trained on human language and humans don’t interpret commands literally. If you ask an LLM how to get your parent out of a burning building… it will actually suggest things that involve understanding that you also want your parent to be safe. Etc etc.
-- Personality switching AI assistants* -- Again, once an LLM has put on a personality it’s pretty unlikely to put it off, unless there’s a thematically-consistent reason for it as in the Waluigi effect.
-- Intentional non-robustness—Like, again, why invent problems? You could show them things models are bad at—counting characters, knowing what letters are in tokens, long term reasoning.
This advice does not help us be guided by the beauty of our weapons. I think we should mostly try to persuade people of things through means that would work if and only if what we were saying was true; that’s the difference between communication and propaganda.
I agree that we shouldn’t be deliberately making LLMs more alien in ways that have nothing to do with how alien they actually are/can be. That said, I feel that some of the examples I gave are not that far from how LLMs / future AIs might sometimes behave? (Though I concede that the examples could be improved a lot on this axis, and your suggestions are good. In particular, the GPT-4 finetuned to misinterpret things is too artificial. And with intentional non-robustness, it is more honest to just focus on naturally-occurring failures.)
To elaborate: My view of the ML paradigm is that the machinery under the hood is very alien, and susceptible to things like jailbreaks, adversarial examples, and non-robustness out of distribution. Most of the time, this makes no difference to the user’s experience. However, the exceptions might be disproportionately important. And for that reason, it seems important to advertise the possibility of those cases.
For example, it might be possible to steal other people’s private information by jailbreaking their LLM-based AI assistants—and this is why it is good that more people are aware of jailbreaks. Similarly, it seems easy to create virtual agents that maintain a specific persona to build trust, and then abuse that trust in a way that would be extremely unlikely for a human.[1] But perhaps that, and some other failure modes, are not yet sufficiently widely appreciated?
Overall, it seems good to take some action towards making people/society/the internet less vulnerable to these kinds of exploits. (The examples I gave in the post were some ideas towards this goal. But I am less married to those than to the general point.) One fair objection against the particular action of advertising the vulnerabilities is that doing so brings them to the attention of malicious actors. I do worry about this somewhat, but primarily I expect people (and in particular nation-states) to notice these vulnerabilities anyway. Perhaps more importantly, I expect potential misaligned AIs to notice the vulnerabilities anyway—so patching them up seems useful for (marginally) decreasing the world’s take-over-ability.
For example, because a human wouldn’t be patient enough to maintain the deception for the given payoff. Or because a human that would be smart enough to pull this off would have better ways to spend their time. Or because only a psychopathic human would do this, and there are only so many of those.