Another potential implication is that we should be more careful when talking about misalignment in LLMs, since apparent misalignment might stem from the model being gaslit into believing it is capable of something it isn't.
This would affect the interpretation of the examples Habryka gave below:
1st example
2nd example
3rd example