Monitoring seems crucial for the next safety case. My guess is that in a couple of years we’ll lose capability-based safety cases and need to rely on a good understanding of the question “how difficult are the tasks frontier agents can do covertly, without the human or AI monitor detecting their reasoning?”
Having said this, it may be that in a year or two, we’ll be able to translate neuralese or have a good idea of how much one can train against a monitor before CoT stops being transparent, so the current paradigm of faithful CoT is certainly not the only hope for monitoring.