I wonder to what extent alignment faking is addressed in current preparedness frameworks. One of my beliefs is that better interpretability can help us understand why models engage in such behavior, but yes, it probably does not get us to a solution (so far).
Extremely late, but I actually agree.