Locating human values might be trickier than characterizing a koala, but the difference seems quantitative, not qualitative.
I generally agree with this. The things I’m saying about human values also apply to koala classification. As with koalas, I do think there’s probably a wide range of training parameters under which a system would end up using the “right” level of abstraction, so that “human values” is a natural concept for it. On the other hand, for both koalas and humans, we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute—again, because Bayesian updates on low-level physics are just better in terms of predictive power.
Right now, we have no idea when that line will be crossed—just an extreme upper bound. We have no idea how wide/narrow the window of training parameters is in which either “koalas” or “human values” is a natural level of abstraction.
It doesn’t know what a koala looks like, but it has some lower-level “components” which would allow it to characterize a koala. Then I use transfer learning and train it to differentiate marsupials. Now it knows about koalas too.
Ability to differentiate marsupials does not imply that the system is directly using the concept of koala. Yet again, consider how a system doing Bayesian updates on low-level physics would respond to the marsupial-differentiation task: it would model the entire physical process which generated the labels on the photos/videos. “Physical process which generates the label koala” is not the same as “koala”, and the system can get higher predictive power by modelling the former rather than the latter.
When we move to human values, that distinction becomes a lot more important: “physical process which generates the label ‘human values satisfied’” is not the same as “human values satisfied”. Confusing those two is how we get Goodhart problems.
We don’t need to go all the way to low-level physics models in order for all of that to apply. In order for a system to directly use the concept “koala”, rather than “physical process which generates the label koala”, it has to be constrained on compute in a way which makes the latter too expensive—despite the latter having higher predictive power on the training data. Adding in transfer learning on some lower-level components does not change any of that; it should still be possible to use those lower-level components to model the physical process which generates the label koala without directly reasoning about koalas.
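To make that concrete, here’s a toy sketch (entirely my own illustration; the annotator setup, feature names, and numbers are invented): suppose the observed “koala” labels come from several annotators, one of whom flips borderline cases. A classifier given access to the label-generating process (which annotator produced each label) fits the observed training labels better than a classifier of the concept itself, and in doing so drifts away from the concept. That is the Goodhart failure in miniature.

```python
# Toy illustration (invented for this comment, not anything real): fitting the
# label-generating process scores better on the observed training labels than
# fitting the concept itself, while agreeing less with the concept.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)                      # stand-in for "image features"
is_koala = x > 0                            # the concept we actually care about

# Labels are produced by three annotators; annotator 2 flips borderline cases.
annotator = rng.integers(0, 3, size=n)
borderline = np.abs(x) < 0.3
observed = np.where((annotator == 2) & borderline, ~is_koala, is_koala)

feat_concept = x.reshape(-1, 1)                            # features only
feat_process = np.column_stack([x, np.eye(3)[annotator]])  # features + annotator

m_concept = DecisionTreeClassifier(max_depth=5, random_state=0).fit(feat_concept, observed)
m_process = DecisionTreeClassifier(max_depth=5, random_state=0).fit(feat_process, observed)

print("fit to observed labels :",
      m_concept.score(feat_concept, observed),   # lower: cannot explain the flips
      m_process.score(feat_process, observed))   # higher: models the labelling process

print("agreement with concept :",
      (m_concept.predict(feat_concept) == is_koala).mean(),   # roughly 1.0
      (m_process.predict(feat_process) == is_koala).mean())   # lower: Goodharted
```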
I’ve now written essentially the same response at least four times to your objections, so I recommend applying the general pattern yourself:
Consider how Bayesian updates on a low-level physics model would behave on whatever task you’re considering. What would go wrong?
Next, imagine a more realistic system (e.g. current ML systems) failing in an analogous way. What would that look like?
What’s preventing ML systems from failing in that way already? The answer is probably “they don’t have enough compute to get higher predictive power from a less abstract model”—which means that, if things keep scaling up, sooner or later that failure will happen.
You say: “we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute”. I think this depends on specific details of how the system is engineered.
“Physical process which generates the label koala” is not the same as “koala”, and the system can get higher predictive power by modelling the former rather than the latter.
Suppose we use classification accuracy as our loss function. If all the koalas are correctly classified by both models, then the two models have equal loss function scores. I suggested that at that point, we use some kind of active learning scheme to better specify the notion of “koala” or “human values” or whatever it is that we want. Or maybe just be conservative, and implement human values in a way that all our different notions of “human values” agree with.
You seem to be imagining a system that throws out all of its more abstract notions of “koala” once it has the capability to do Bayesian updates on low-level physics. I don’t see why we should engineer our system in this way. My expectation is that human brains have many different computational notions of any given concept, similar to an ensemble (for example, you might give me a precise definition of a sandwich, and I show you something and you’re like “oh actually that is/is not a sandwich, guess my definition was wrong in this case”—which reveals you have more than one way of knowing what “a sandwich” is), and AGI will work the same way (at least, that’s how I would design it!).
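Here’s a minimal sketch of the kind of thing I mean (purely illustrative; the data, model class, and thresholds are stand-ins): train a committee of models that all fit the labelled data, use their disagreement on unlabelled examples to decide what to ask a human about next, and only act where the whole committee agrees.

```python
# Minimal sketch of the active-learning / ensemble idea above (illustrative
# only; the dataset and model choices are stand-ins, not a real proposal).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def fit_committee(x_labelled, y_labelled, n_models=10):
    """Train a diverse committee: each member sees a different bootstrap sample."""
    committee = []
    for seed in range(n_models):
        idx = rng.integers(0, len(x_labelled), size=len(x_labelled))
        m = DecisionTreeClassifier(max_depth=4, random_state=seed)
        committee.append(m.fit(x_labelled[idx], y_labelled[idx]))
    return committee

def disagreement(committee, x):
    """Per-example disagreement: 0 = unanimous, 0.5 = evenly split."""
    votes = np.mean([m.predict(x) for m in committee], axis=0)
    return np.minimum(votes, 1 - votes)

# Toy data: a handful of labelled examples, a large pool of unlabelled ones.
x_labelled = rng.normal(size=(50, 2))
y_labelled = x_labelled[:, 0] + x_labelled[:, 1] > 0
x_pool = rng.normal(size=(10_000, 2))

committee = fit_committee(x_labelled, y_labelled)
scores = disagreement(committee, x_pool)

# Active learning: ask the human to label the examples the committee is most
# unsure about, i.e. the places where the different notions of "koala" come apart.
to_query = np.argsort(scores)[-10:]

# Conservative use: only treat something as a koala when every notion agrees.
unanimous_koala = np.all([m.predict(x_pool) for m in committee], axis=0)
```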
I’ve now written essentially the same response at least four times to your objections
I was trying to understand what you were getting at. This new argument seems pretty different from the “alignment is mainly about the prompt” thesis in your original post—another shift in arguments? (I don’t necessarily think it is bad for arguments to shift, I just think people should acknowledge that’s going on.)
You seem to be imagining a system that throws out all of its more abstract notions of “koala” once it has the capability to do Bayesian updates on low-level physics. I don’t see why we should engineer our system in this way.
It’s certainly conceivable to engineer systems some other way, and indeed I hope we do. Problem is:
if we just optimize for predictive power, then abstract notions will definitely be thrown away once the system can discover and perform Bayesian updates on low-level physics. (In principle we could engineer a system which never discovers that, but then it will still optimize predictive power by coming as close as possible.)
if we’re not just optimizing for predictive power, then we need some other design criterion, some other standard for whether/how well the system is working.
In one sense, the goal of all this abstract theorizing is to identify what that other criterion needs to be in order to reliably end up using the “right” abstractions in the way we want. We could probably make up some ad-hoc criterion which works at least sometimes, but then as architectures and hardware advance over time we have no idea when that criterion will fail.
for example, you might give me a precise definition of a sandwich, and I show you something and you’re like “oh actually that is/is not a sandwich, guess my definition was wrong in this case”—which reveals you have more than one way of knowing what “a sandwich” is
(Probably tangential) No, this reveals that my verbal definition of a sandwich was not a particularly accurate description of my underlying notion of sandwich—which is indeed the case for most definitions most of the time. It certainly does not prove the existence of multiple ways of knowing what a sandwich is.
Also, even if there’s some sort of ensembling, the concept “sandwich” still needs to specify one particular ensemble.
This new argument seems pretty different from the “alignment is mainly about the prompt” thesis in your original post—another shift in arguments?
We’ve shifted to arguing over a largely orthogonal topic. The OP is mostly about the interface by which GPT can be aligned to things. We’ve shifted to talking about what alignment means in general, and what’s hard about aligning systems to the kinds of things we want. An analogy: the OP was mostly about programming in a particular language, while our current discussion is about what kinds of algorithms we want to write.
Prompts are a tool/interface via which one can align a certain kind of system (i.e. GPT-3) with certain kinds of goals (addition, translation, etc). Our current discussion is about the properties of a certain kind of goal—goals which are abstract in an analogous way to human values.
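To make the analogy concrete (the prompt text below is something I’m making up for illustration, not anything from the OP): “aligning” GPT-3 to the goal of addition means writing a prompt, not modifying the model.

```python
# Illustrative only: a few-shot prompt that steers a GPT-style model toward the
# goal of addition purely through its input interface. The wording is invented.
addition_prompt = """\
Q: What is 12 + 7?
A: 19
Q: What is 38 + 4?
A: 42
Q: What is 25 + 17?
A:"""
# The model itself is untouched; the prompt is the whole "alignment interface".
```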
if we’re not just optimizing for predictive power, then we need some other design criterion, some other standard for whether/how well the system is working.
Optimize for having a diverse range of models that all seem to fit the data.
How would that fix any of the problems we’ve been talking about?