I’m also interested in these questions! In particular, I’d like to explore how effectively we can filter out offensive cyber knowledge without compromising non-security software engineering skills. Since so many of our proposed control protocols are cyber-based, limiting the cyber knowledge of our models seems like it would help with AI control. This is plausibly difficult, but I thought the same about biology. Regardless, it would be such a big win if we pulled it off that I think it’s worth studying.
I’m actively studying how much AI safety discourse, including discussions of control and evaluation protocols, is present in popular pretraining datasets. The hope is that we could then train models with this knowledge filtered out and see how their behavior changes. This work is still in its early stages, though. A more ambitious version would be to run empirical experiments on Self-Fulfilling Misalignment.
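To give a flavor of what I mean by "measuring the degree to which this discourse is present": here's a minimal sketch, assuming the Hugging Face `datasets` library and a placeholder corpus id (not a real dataset reference), that streams a sample of documents and counts keyword hits. Keyword matching is only a crude proxy for the classifiers we'd actually want, but it illustrates the shape of the pipeline.

```python
import re
from itertools import islice

from datasets import load_dataset  # pip install datasets

# Hypothetical keyword patterns for AI-safety / control discourse.
# A real study would use a trained classifier or a more careful lexicon.
PATTERNS = [
    r"\bAI control\b",
    r"\bcontrol protocol",
    r"\balignment evaluation",
    r"\btrusted monitoring\b",
    r"\bscheming\b",
]
REGEX = re.compile("|".join(PATTERNS), flags=re.IGNORECASE)

def estimate_hit_rate(dataset_name: str, text_field: str = "text",
                      sample_size: int = 100_000) -> float:
    """Stream a pretraining corpus and estimate the fraction of sampled
    documents that mention AI-safety / control discourse (keyword proxy)."""
    stream = load_dataset(dataset_name, split="train", streaming=True)
    hits, total = 0, 0
    for doc in islice(stream, sample_size):
        total += 1
        if REGEX.search(doc.get(text_field) or ""):
            hits += 1
    return hits / max(total, 1)

if __name__ == "__main__":
    # "some/pretraining-corpus" is a placeholder id, not a real dataset.
    rate = estimate_hit_rate("some/pretraining-corpus")
    print(f"Approx. fraction of sampled docs with safety discourse: {rate:.4%}")
```

The same pattern (swap the keyword regex for a document classifier) could then be inverted to produce the filtered pretraining set we'd compare against.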
My main point is that there is a lot of low-hanging fruit here! It’s plausible that there are more conceptually simple interventions like Deep Ignorance that, in aggregate, buy us some safety. I think it’s worth the community putting a lot more effort into pretraining research.