People say it’s important to demonstrate alignment problems like goal misgeneralization. But now OpenAI, DeepMind, and Anthropic have all had leaders sign the Center for AI Safety (CAIS) statement on extinction risk, and all three are doing substantial alignment research. The gap between the 90th-percentile alignment-concerned people at labs and the MIRI worldview is now mostly a matter of security mindset. Security mindset took hold in cybersecurity because it is useful in the everyday, practical environment researchers work in. So perhaps a large part of the future hinges on whether security mindset becomes useful for solving problems in applied ML.
Will it become clear that the presence of one bug in an AI system implies that there are probably five more?
Will architectures with fewer moving parts, ones we can actually understand, be demonstrated to be more robust than black-box systems or systems that work for complicated reasons?
Will it become tractable to develop these kinds of simple, transparent systems, so that security mindset can catch on?
I suggest calling it “the sentence on extinction risk” so that people can pick up what is meant without having to already know the acronym.
Edited, thanks