There are two senses in which control or alignment could be a number-go-up science:
1: Right now, we have a metric that we can optimize on to direct our research.
2: At the point where we have systems in front of us that actively pose misalignment risk, we will have a metric that we can optimize.
I think you’re mostly talking about 1 here. The property of control that drew us to it is that it has that second sense much more than alignment research does. I think its advantage is smaller in property 1.
In order for 2 to go as well as possible, in the present, we should do research that fills some combination of two roles:
1. We try to improve our techniques, using a proxy for the future methodology.
2. We try to improve our methodology, so that we’ll be able to use it better at the point where we can directly assess risk.
Can you sketch out why you think that control is more amenable to property 2 than alignment is? (By “more amenable to property 2” I mean “all else equal, researchers trying to build high-quality evaluations will be more likely to succeed at building evaluations with property 2 for control than for alignment.”)
My best guess for why you think this is something like:
Control evaluations are largely tests of certain capabilities. As with other capabilities evaluations, you can just run your static evaluation set on new model generations without needing to substantially adapt the evaluation in response to non-capabilities ways in which the new models differ. In other words, changes in the new models’ architecture, training process, etc. don’t matter for your evaluations except insofar as they lead to stronger capabilities.
(Once I understand your view better I’ll try to say what I think about it.)
Thanks for this post.