The kinds of approaches discussed in Eliciting latent knowledge are complete non-starters. All those approaches try to define a loss function so that the strategy “answer questions honestly” gets a low loss. But if you can’t learn to recognize sensor tampering then it doesn’t matter how low a loss you’d get by answering questions honestly, gradient descent simply can’t learn how to do it. Analogously, if there’s no simple and efficient primality test, then it doesn’t matter whether you have a loss function which would incentivize primality testing, you’re not going to be able to do it.
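To make the setup in that passage concrete, here is a minimal toy sketch of the kind of objective it is describing. Everything in it (the Predictor/Reporter split, the dimensions, the random data) is an illustrative assumption rather than the report's actual construction; the point is only that one can write down a loss which an honest reporter would minimize, while the passage argues this is useless if gradient descent cannot efficiently learn the computation (e.g. recognizing sensor tampering) needed to answer honestly.

```python
# Toy sketch (assumed setup, not the ELK report's construction): a predictor
# with a latent state, plus a reporter head trained with a loss that is low
# when it answers a yes/no question honestly on labeled episodes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Predictor(nn.Module):
    """Maps (initial observation, actions) to a latent state and a predicted final frame."""
    def __init__(self, obs_dim=16, act_dim=4, latent_dim=32):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decode = nn.Linear(latent_dim, obs_dim)  # predicted camera observation

    def forward(self, obs, act):
        z = self.encode(torch.cat([obs, act], dim=-1))
        return z, self.decode(z)

class Reporter(nn.Module):
    """Reads the predictor's latent state and answers a yes/no question about it."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, z):
        return self.head(z).squeeze(-1)  # logit for "things really are as they appear"

def reporter_loss(logits, honest_labels):
    # Minimized by answering honestly on the labeled (easy) episodes. The
    # passage's point: even if the honest strategy gets the lowest loss,
    # gradient descent only finds it if "recognize sensor tampering" is an
    # efficiently learnable function of the latent state.
    return F.binary_cross_entropy_with_logits(logits, honest_labels)

# Usage on random toy data.
predictor, reporter = Predictor(), Reporter()
obs, act = torch.randn(8, 16), torch.randn(8, 4)
honest_labels = torch.randint(0, 2, (8,)).float()
latent, _ = predictor(obs, act)
loss = reporter_loss(reporter(latent), honest_labels)
loss.backward()
```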
Is this passage saying that distinguishing distinct mechanisms is a subproblem of/equivalent to ELK?
To elaborate, as I understand it, this passage is saying:
1. answering questions honestly requires identifying the mechanism that e.g. leads to smiling humans on camera, and
2. if we can't find that mechanism efficiently, then we cannot create a method for answering questions honestly.
This seems to imply that ELK is equivalent to finding the mechanisms that produce a model's predictions. So does that mean that finding the mechanisms behind a prediction is a subproblem of ELK?
It’s a subproblem of our current approach to ELK.
(Though note that by “mechanisms that produce a prediction” we mean “mechanism that gives rise to a particular regularity in the predictions.”)
It may be possible to solve ELK without dealing with this subproblem though.
But if ELK can be solved without solving this subproblem, couldn’t SGD find a model that recognizes sensor tampering without solving this subproblem as well?
Or is it that the approach to ELK that involves training an honest model on the generative model's weights requires dealing with that subproblem, while other approaches don't?
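Returning to the parenthetical above about "mechanism that gives rise to a particular regularity in the predictions": here is a toy sketch in which two different mechanisms produce the same regularity (the camera shows a diamond). The "diamond in the vault" framing and the probabilities are made-up illustrative assumptions; the upshot is just that the regularity alone does not identify which mechanism produced it, which is why an honest reporter has to distinguish them.

```python
# Toy sketch (illustrative assumptions only): two mechanisms, sensor tampering
# and the diamond genuinely being present, give rise to the same regularity in
# the predictions: the camera shows a diamond.
import random

def simulate_episode(tampered: bool) -> dict:
    """One episode of a made-up diamond-in-the-vault story."""
    if tampered:
        # Mechanism A: the sensors are spoofed, so the camera shows a diamond
        # regardless of whether the diamond is actually there.
        diamond_present = random.random() < 0.1
        camera_shows_diamond = True
    else:
        # Mechanism B: the sensors work, so the camera shows a diamond exactly
        # when the diamond is actually there.
        diamond_present = random.random() < 0.9
        camera_shows_diamond = diamond_present
    return {"tampered": tampered,
            "diamond_present": diamond_present,
            "camera_shows_diamond": camera_shows_diamond}

# Both kinds of episode often exhibit the same regularity, even though the
# underlying mechanism (and the answer an honest reporter should give) differs.
episodes = [simulate_episode(tampered=bool(i % 2)) for i in range(1000)]
showing = [e for e in episodes if e["camera_shows_diamond"]]
print(sum(e["tampered"] for e in showing), "of", len(showing),
      "diamond-showing episodes came from the tampering mechanism")
```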