Re the argument for “Why internalization might be difficult”, I asked Evan Hubinger for his take on your rendition of the argument, and he thinks it’s not right.
Rather, the argument that Risks from Learned Optimization makes for why internalization would be difficult is that:
~all models with good performance on a diverse training set probably have to have a complex world model already, which likely includes a model of the base objective,
so having the base objective re-encoded in a separate part of the model that represents its objective is just a waste of space/complexity.
Especially since this post is now (rightly!) cited in several introductory AI risk syllabi, it might be worth correcting this, if you agree it’s an error.
Thanks! I agree it’s an error, of course. I’ve changed the section, do you think it’s accurate now?
It looks good to me!
Idk, maybe...?
Is that in doubt? Note that I don’t say it models the base objective in the post, I just say that it has a complex world model. This seemed unquestionable to me, since it demonstrably knows lots of things. Or are you drawing a distinction between “a lot of facts about stuff” and “a world model”? I haven’t drawn that distinction; “model” seems very general and “complex” trivially true. It may not be a smart model.
Part of me thinks: I was trying to push on whether it has a world model, or whether it has just memorised loads of stuff from the internet and learned a bunch of heuristics for producing compelling internet-like text. For me, “world model” evokes some object that has a map-territory relationship with the world. It’s not clear to me that GPT-3 has that.
Another part of me thinks: I’m confused. It seems just as reasonable to claim that it obviously has a world model that’s just not very smart. I’m probably using bad concepts and should think about this more.