I haven’t thought about this a lot, but “encrypted” could just mean “just beyond the capabilities of the Surgeon to identify”. So the gradient could be moving in a direction away from “easily identifiable early deceptive circuits” instead of “deception”, and plausibly in a way that scales with how weak the Surgeon is. Do you think we can design Surgeons that are powerful enough even at interpretable sizes to net the latter? Do surgical capabilities like this generally scale linearly?
I think the surgeon can always be made ~arbitrarily powerful, and the trick is making it not too powerful/trivially powerful (in ways that e.g. preclude the model from performing well despite the surgeon’s interference).
So I think the core question is: are there ways to make a sufficiently powerful surgeon which is also still defeasible by a model that does what we want?
Given that we want the surgeon to be of bounded size (if we’re using a neural net implementation which seems likely to me), can it still be arbitrarily powerful? That doesn’t seem obvious to me.
I haven’t thought about this a lot, but “encrypted” could just mean “just beyond the capabilities of the Surgeon to identify”. So the gradient could be moving in a direction away from “easily identifiable early deceptive circuits” instead of “deception”, and plausibly in a way that scales with how weak the Surgeon is. Do you think we can design Surgeons that are powerful enough even at interpretable sizes to net the latter? Do surgical capabilities like this generally scale linearly?
That’s definitely a thing that can happen.
I think the surgeon can always be made ~arbitrarily powerful, and the trick is making it not too powerful/trivially powerful (in ways that e.g. preclude the model from performing well despite the surgeon’s interference).
So I think the core question is: are there ways to make a sufficiently powerful surgeon which is also still defeasible by a model that does what we want?
Given that we want the surgeon to be of bounded size (if we’re using a neural net implementation which seems likely to me), can it still be arbitrarily powerful? That doesn’t seem obvious to me.