Some reasons for this (that I quickly wrote in response to someone asking a question about this):
There aren’t that many research directions we can pursue now which plausibly transfer to much more powerful later AIs, while if we got really good at this, it could transfer. (Up to around or a bit beyond full AI R&D automation, maybe?)
Like, maybe I’m just more pessimistic than you about the research other than control that we can do right now.
I think it’s probably hard to get this method working robustly enough to transfer to smart models, but if we got the method to be super robust on current models (AIs can’t be trained to distinguish between true and false facts, and there are reasonable scaling laws), then this would be a huge update toward it working for smart models. And we can do this research now.
Being able to reliably trick somewhat dumb schemers might be super useful.
Generally, it seems like the class of methods where we control the situational awareness and understanding of models could be very helpful. It doesn’t seem obvious to me that we need high levels of situational awareness, especially for experiments and if we’re willing to take a decently big alignment tax hit.
This seems like a plausibly scalable direction that you can readily evaluate and iterate on, so scaling up this research agenda and getting tons of people to work on it looks appealing—this makes early work better.
I’m not confident in these exact reasons.