Upvoted. The way I think about the case for ambitious interp is:
1. There are a number of pragmatic approaches in AI safety that bottom out in finding a signal and optimising against it. Such approaches may all have ceilings of usefulness due to oversight problems. In the case of pragmatic interp, the ceiling is set primarily by the fact that if you want to make use of the interp techniques, you incentivise obfuscation. Of course, there are things you can do to try to make the obfuscation happen more slowly, e.g. hillclimbing on your signal rather than training against it (a toy sketch of this distinction is below the list); this is true of most pragmatic approaches.
2. Ambitious interp should have a much higher ceiling of usefulness if we can get it to work in time, because (a) it gives us better options for getting a lot of juice out of our signal by understanding what’s going wrong (i.e. better debugging) and correcting the problem at the root, and (b) it may be more robust to obfuscation (even holding constant the way in which we optimise against the signal).
3. It still seems plausible that we can get it to work in time, because (a) there are reasonably promising existing agendas in interp, (b) timelines are plausibly more than a few years, and (c) we may be able to automate lots of the work (and work now might still be useful prep for that).
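To make the distinction in point 1 concrete, here is a toy sketch (my own illustration, not anything from existing interp tooling; `encoder`, `probe`, `lam`, etc. are made-up names) of the difference between putting an interp-derived signal inside the training loss versus only consulting it to compare finished runs:

```python
import torch
import torch.nn as nn

# Tiny stand-ins: an encoder/head being trained, plus a frozen linear probe on
# the hidden layer acting as the interp-derived "signal".
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
head = nn.Linear(32, 1)
probe = nn.Linear(32, 1)
for p in probe.parameters():
    p.requires_grad_(False)

def task_loss(x, y):
    return ((head(encoder(x)) - y) ** 2).mean()

def probe_signal(x):
    # Mean probe activation: "how much of the flagged feature is present".
    return torch.sigmoid(probe(encoder(x))).mean()

# Training *against* the signal: the probe output is inside the loss, so every
# gradient step directly rewards making the feature invisible to this probe --
# i.e. obfuscation is incentivised, not necessarily removal of the feature.
def step_training_against(x, y, opt, lam=1.0):
    loss = task_loss(x, y) + lam * probe_signal(x)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Hillclimbing *on* the signal: train on the task alone, then use the probe only
# to compare finished runs/checkpoints. The signal is queried rarely and never
# differentiated through, so pressure to fool it builds up much more slowly
# (though it is not zero if you iterate on it for long enough).
def step_task_only(x, y, opt):
    loss = task_loss(x, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

def rank_checkpoints(checkpoint_states, x):
    # Score each saved encoder state with the probe, outside of training.
    scores = []
    for state in checkpoint_states:
        encoder.load_state_dict(state)
        with torch.no_grad():
            scores.append(probe_signal(x).item())
    return scores

# Minimal usage on random data, just to show the loop shapes.
x, y = torch.randn(256, 16), torch.randn(256, 1)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
for _ in range(50):
    step_training_against(x, y, opt)
print("probe signal after training against it:", probe_signal(x).item())
```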
I do think that pragmatic interp is great, but I don’t want the field to move away from ambitious interp entirely. My guess is that people in favour of moving away from ambitious interp mostly contest (3.a) (and secondarily, maybe I think (2) is a bigger deal than they do, though maybe not). I don’t think I would disagree much about how object-level promising the existing agendas are; I just think that ‘promising’ should mean ‘could lead somewhere given 10 years of effort, plus maybe lots of AI labour’, which might correspond to research that feels very limited right now.
I know you know this, but I think it’s important to emphasize that
your first point plausibly understates the problem with pragmatic/blackbox methods. In the worst case, an AI may simply encrypt its thoughts.
It’s not even an oversight problem. There is simply nothing to ‘oversee’. The AI will think its evil thoughts in private and comply with all the evals you can cook up until it’s too late.