Models that are sufficiently good at deceptive alignment could deliberately appear less capable during testing.
For the reasons I gave in footnote 1, I feel pretty optimistic about being able to get around this problem. What do you think of those arguments?