I think I see the logic. Were you thinking of making the model good at answering questions whose correct answer depends on the model itself, like “When asked a question of the form x, what proportion of the time would you tend to answer y?”
The previous remark about being a microscope into its dataset seemed benign to me, e.g., if the model were already good at answering questions like “What proportion of datapoints satisfying predicate X satisfy predicate Y?”
But perhaps you also argue that the latter induces some small amount of self-awareness → situational awareness?
Were you thinking of making the model good at answering questions whose correct answer depends on the model itself, like “When asked a question of the form x, what proportion of the time would you tend to answer y?”
I’m not an author of this post, so I don’t know.
I think one of the biggest dangers of this kind of self-awareness is that it lets models know their level of accuracy in particular areas. Right now, they may be overconfident or underconfident in their abilities, which makes their plans less effective when actually implemented: if a model is overconfident, a plan that relies on the overestimated ability will simply fail; if it is underconfident, it is not using all of its capabilities.
Why?
By giving it more information about itself, which is self-awareness, and a big part of situational awareness.