Nora Belrose comments on AI as a science, and three obstacles to alignment strategies

Nora Belrose 3 Nov 2023 22:09 UTC
12 points
I expect that we’d see all sorts of coincidences and hacks that make the thing run, and we’d be able to see in much more detail how, when we ask the system to achieve some target, it’s not doing anything close to “caring about that target” in a manner that would work out well for us, if we could scale up the system’s optimization power to the point where it could achieve great technological or scientific feats (like designing Drexlerian nanofactories or what-have-you).
I think this counterfactual is literally incoherent: it does not make sense to talk about what an individual neural network would do if its “optimization power” were scaled up. It’s a category error. You instead need to ask what would happen if the training procedure were scaled up, and there are always many different ways to scale it up. For example, you could keep data fixed while parameters increase or scale both in lockstep, and you could keep the capability of the graders fixed or invest in more capable graders / scalable oversight techniques. So I deny that there is any fact of the matter about whether current LLMs “care about the target” in your sense. I think there probably are sensible ways of cashing out what it means for a 2023 LLM to “care about” something, but this is not it.