If the model is able to conceptualize the base goal before it is significantly goal-directed, then deceptive alignment is unlikely.
I am baffled that nobody has pointed out that this is totally wrong.
Your model can have a perfect representation of the goal in its “world-model” module, yet lack it in its “approve plan based on world-model prediction” module. In Humean style, “what should be” does not follow from “what is”.
I.e. you conflate two different possible representations of a goal: a representation that answers questions about the outside world, like “what will happen next in this training session” or “what are humans trying to achieve”, and a representation inside the goal-directedness system, like “what X should I maximize in the world”, or, without the “should”, “what is the best approximation of the reward function given the history of rewards so far”.
It would be nice to have an architecture that can a) locate a concept in the world model, and b) fire the reward circuitry iff the world model sees this concept in a possible future, but that is definitely a helluva lot of work in interpretability and ML-model design.
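A minimal sketch of what (b) might look like, under heavy assumptions: `WorldModel`, `concept_probe`, and the latent `State` dict are all hypothetical placeholders I'm introducing for illustration, and the probe that actually locates the base-goal concept in the world model's latents is exactly the interpretability work being hand-waved here. The point is only the wiring: reward fires off what the world model predicts, not off a separately learned objective representation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical latent state of the world model; a real system would use
# learned embeddings, not a dict of booleans.
State = Dict[str, bool]

@dataclass
class WorldModel:
    # Assumed learned dynamics: (state, action) -> next state.
    transition: Callable[[State, int], State]

    def rollout(self, state: State, actions: List[int]) -> List[State]:
        """Predict the sequence of future states under a candidate plan."""
        futures = []
        for a in actions:
            state = self.transition(state, a)
            futures.append(state)
        return futures

def concept_probe(state: State) -> bool:
    """Stand-in for (a) 'locate the concept in the world model':
    returns True iff the base-goal concept is present in this predicted state."""
    return state.get("base_goal_satisfied", False)

def reward(world_model: WorldModel, state: State, plan: List[int]) -> float:
    """(b) Fire the reward circuitry iff the world model sees the concept
    in some predicted future under the plan."""
    futures = world_model.rollout(state, plan)
    return 1.0 if any(concept_probe(s) for s in futures) else 0.0

# Toy usage: dynamics under which action 1 makes the goal satisfied.
wm = WorldModel(
    transition=lambda s, a: {**s, "base_goal_satisfied": s.get("base_goal_satisfied", False) or a == 1}
)
print(reward(wm, {"base_goal_satisfied": False}, [0, 1, 0]))  # -> 1.0
print(reward(wm, {"base_goal_satisfied": False}, [0, 0, 0]))  # -> 0.0
```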
The claim is that, given the presence of differential adversarial examples, the optimisation process would adjust the parameters of the model such that its optimisation target is the base goal.