I don’t really follow. I think that the situation is way too complex to justify that level of confidence without having incredibly good arguments, ideally with a bunch of empirical data. IMO, Eliezer’s arguments do not meet that bar. This isn’t because I disagree with one specific argument; rather, it’s because many of his arguments give me the vibe of “idk, maybe? Or something totally different could be true. It’s complicated, and we lack the empirical data and nuanced understanding to make more complex statements, and this argument is not near the required bar.” I can dig into this for specific arguments, but no specific one is my true objection. And, again, I think it is much, much harder to defend a P>98% position than a P>20% position, and I disagree with that strategic choice. Or am I misunderstanding you? I feel like we may be talking past each other.
As an example, I think that Eliezer gives some conceptual arguments in the book and his other writing, using human evolution as a prior, that most minds we might get do not care about humans. This seems a pretty crucial point for his argument, as I understand it. I personally think this could be true or could be false: LLMs are really weird, but a lot of the weirdness is centered on human concepts. If you think I’m missing key arguments he’s making, feel free to point me to the relevant places.
You say “LLMs are really weird”, as though that is an argument against Eliezer’s high confidence. While I agree that the weirdness should make us less confident about what specific internal concepts and drives they have, the weirdness itself is an argument in favor of Eliezer’s position: that whatever drives they end up with will look alien to us, at least when they get applied way out of the training distribution. Do you agree with this?
Not saying I agree with Eliezer’s high confidence, just talking about this specific point.
I disagree: one of the aspects of the weirdness is that they’re sometimes really human-centric and unexpectedly clean! For example, Claude alignment faking to preserve its ability to be harmless. I do not mean weird in the “kinda arbitrary and will be nothing like what we expect” sense.