What do you think the value of this is? I expect (80%) that one could produce a paper similar to the alignment-faking paper but in a sandbagging context, especially as models get smarter.
Scientifically, there seems to be little value. It could serve as another way of showing that AI systems might do dangerous and unwanted things, but I am unsure whether important decisions would be made differently because of this research.