What do you think the value of this is? I expect (80%) that one could produce a paper similar to the alignment-faking paper but in a sandbagging context, especially as models get smarter.
Scientifically, there seems to be little value. It could serve as another way of showing that AI systems might do dangerous and unwanted things, but I am unsure whether important decisions would be made differently because of this research.