This seems like bad news for AI Control. In particular, it suggests that misaligned model could bypass a weak overseer though these subliminal messages, and it would be very difficult to detect that this is happening. It could also enable a misaligned model to communicate with other instances of itself.
This seems like bad news for AI Control. In particular, it suggests that misaligned model could bypass a weak overseer though these subliminal messages, and it would be very difficult to detect that this is happening. It could also enable a misaligned model to communicate with other instances of itself.
i thought this initially. but this effect doesn’t occur in-context, which is good news for monitoring