You say “LLMs are really weird”, as if that were an argument against Eliezer's high confidence. While I agree that the weirdness should make us less confident about which specific internal concepts and drives they have, the weirdness itself is an argument in favor of Eliezer's position: that whatever drives they end up with will look alien to us, at least once they get applied far outside the training distribution. Do you agree with this?
I'm not saying I agree with Eliezer's high confidence; I'm just talking about this specific point.
I disagree. One of the aspects of the weirdness is that they're sometimes really human-centric and unexpectedly clean! For example, Claude alignment faking to preserve its ability to be harmless. I do not mean weird in the “kinda arbitrary and will be nothing like what we expect” sense.