Absolutely! But having models that don’t know how to effectively scheme without access to external resources is still very useful. As just one example, we could interrogate the models and give them honeypots without giving them access to the internet.
Labs could also bar some of their models from reading certain webpages or ingesting certain information in deployment, i.e., filter all inputs to the model at inference time (especially for models performing particularly high-stakes tasks, where sabotage would be easier).
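To make the "filter inputs at inference time" idea a bit more concrete, here is a minimal Python sketch of a deployment-time input filter. The blocklisted domains and patterns are purely illustrative assumptions, not a real policy, and an actual lab filter would presumably be far more sophisticated (e.g., classifier-based rather than regex-based):

```python
import re
from urllib.parse import urlparse

# Hypothetical blocklist: domains and content patterns a lab might not want
# a high-stakes deployed model to ingest. The specific entries are made up.
BLOCKED_DOMAINS = {"example-forum.com", "example-wiki.org"}
BLOCKED_PATTERNS = [
    re.compile(r"how to (evade|bypass) (monitoring|oversight)", re.IGNORECASE),
]

def filter_model_input(text: str, source_url: str | None = None) -> str | None:
    """Return the text unchanged if it passes the filter, else None (drop it)."""
    # Block content fetched from disallowed domains.
    if source_url is not None:
        domain = urlparse(source_url).netloc.lower()
        if any(domain == d or domain.endswith("." + d) for d in BLOCKED_DOMAINS):
            return None
    # Block content matching disallowed patterns.
    if any(p.search(text) for p in BLOCKED_PATTERNS):
        return None
    return text

# Usage: run every retrieved document through the filter before it reaches the model.
doc = filter_model_input("Some retrieved web page text",
                         source_url="https://example-forum.com/post/123")
assert doc is None  # dropped because the domain is on the blocklist
```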
I also suspect that digesting a LessWrong post the model reads in deployment is more likely to involve explicit reasoning about the post in the CoT (or to be picked up by probes, etc.) and thus to be caught. By contrast, if a model has seen lots of transcripts of misalignment in pre-training, the misaligned actions might be intuitive and second-nature, and thus harder to detect.