I agree with something like this, though I think you’re too optimistic that deceptive alignment is highly unlikely if the model understands the base objective before acquiring a goal. If the model is sufficiently good at deception, there will be few to no differential adversarial examples. So while gradient descent might slightly prefer pointing the model at the base objective over a misaligned objective, given the small number of adversarial examples, the vastly larger number of misaligned goals suggests to me that it is at least plausible for the model to learn a misaligned objective even if it understands the base objective before becoming goal-directed: the inductive biases of SGD favor models that are more numerous in parameter space.
If the model is sufficiently good at deception, there will be few to no differential adversarial examples.
We’re talking about an intermediate model with an understanding of the base objective but no goal. If the model doesn’t have a goal yet, then it definitely doesn’t have a long-term goal, so it can’t yet be deceptively aligned.
Also, at this stage of the process, the model doesn’t have a goal yet, so each potential proxy goal has its own distinct number of differential adversarial examples.
the vastly larger number of misaligned goals
I agree that there’s a vastly larger number of possible misaligned goals, but because we’re talking about a model that is not yet deceptive, the vast majority of those misaligned goals would come with a huge number of differential adversarial examples. If training involved a general goal, I wouldn’t expect many proxies, if any, to have a small number of differential adversarial examples in the absence of deceptive alignment. Would you?