You say: “we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute”. I think this depends on specific details of how the system is engineered.
“Physical process which generates the label koala” is not the same as “koala”, and the system can get higher predictive power by modelling the former rather than the latter.
Suppose we use classification accuracy as our loss function. If all the koalas are correctly classified by both models, then the two models have equal loss function scores. I suggested that at that point, we use some kind of active learning scheme to better specify the notion of “koala” or “human values” or whatever it is that we want. Or maybe just be conservative, and implement human values in a way that all our different notions of “human values” agree with.
You seem to be imagining a system that throws out all of its more abstract notions of “koala” once it has the capability to do Bayesian updates on low-level physics. I don’t see why we should engineer our system in this way. My expectation is that human brains have many different computational notions of any given concept, similar to an ensemble (for example, you might give me a precise definition of a sandwich, and I show you something and you’re like “oh actually that is/is not a sandwich, guess my definition was wrong in this case”—which reveals you have more than one way of knowing what “a sandwich” is), and AGI will work the same way (at least, that’s how I would design it!)
I’ve now written essentially the same response at least four times to your objections
I was trying to understand what you were getting at. This new argument seems pretty different from the “alignment is mainly about the prompt” thesis in your original post—another shift in arguments? (I don’t necessarily think it is bad for arguments to shift, I just think people should acknowledge that’s going on.)
You seem to be imagining a system that throws out all of its more abstract notions of “koala” once it has the capability to do Bayesian updates on low-level physics. I don’t see why we should engineer our system in this way.
It’s certainly conceivable to engineer systems some other way, and indeed I hope we do. Problem is:
if we just optimize for predictive power, then abstract notions will definitely be thrown away once the system can discover and perform Bayesian updates on low-level physics. (In principle we could engineer a system which never discovers that, but then it will still optimize predictive power by coming as close as possible.)
if we’re not just optimizing for predictive power, then we need some other design criteria, some other criteria for whether/how well the system is working.
In one sense, the goal of all this abstract theorizing is to identify what that other criteria needs to be in order to reliably end up using the “right” abstractions in the way we want. We could probably make up some ad-hoc criteria which works at least sometimes, but then as architectures and hardware advance over time we have no idea when that criteria will fail.
or example, you might give me a precise definition of a sandwich, and I show you something and you’re like “oh actually that is/is not a sandwich, guess my definition was wrong in this case”—which reveals you have more than one way of knowing what “a sandwich” is
(Probably tangential) No, this reveals that my verbal definition of a sandwich was not a particularly accurate description of my underlying notion of sandwich—which is indeed the case for most definitions most of the time. It certainly does not prove the existence of multiple ways of knowing what a sandwich is.
Also, even if there’s some sort of ensembling, the concept “sandwich” still needs to specify one particular ensemble.
This new argument seems pretty different from the “alignment is mainly about the prompt” thesis in your original post—another shift in arguments?
We’ve shifted to arguing over a largely orthogonal topic. The OP is mostly about the interface by which GPT can be aligned to things. We’ve shifted to talking about what alignment means in general, and what’s hard about aligning systems to the kinds of things we want. An analogy: the OP was mostly about programming in a particular language, while our current discussion is about what kinds of algorithms we want to write.
Prompts are a tool/interface for via which one can align a certain kind of system (i.e. GPT-3) with certain kinds of goals (addition, translation, etc). Our current discussion is about the properties of a certain kind of goal—goals which are abstract in an analogous way to human values.
if we’re not just optimizing for predictive power, then we need some other design criteria, some other criteria for whether/how well the system is working.
Optimize for having a diverse range of models that all seem to fit the data.
You say: “we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute”. I think this depends on specific details of how the system is engineered.
Suppose we use classification accuracy as our loss function. If all the koalas are correctly classified by both models, then the two models have equal loss function scores. I suggested that at that point, we use some kind of active learning scheme to better specify the notion of “koala” or “human values” or whatever it is that we want. Or maybe just be conservative, and implement human values in a way that all our different notions of “human values” agree with.
You seem to be imagining a system that throws out all of its more abstract notions of “koala” once it has the capability to do Bayesian updates on low-level physics. I don’t see why we should engineer our system in this way. My expectation is that human brains have many different computational notions of any given concept, similar to an ensemble (for example, you might give me a precise definition of a sandwich, and I show you something and you’re like “oh actually that is/is not a sandwich, guess my definition was wrong in this case”—which reveals you have more than one way of knowing what “a sandwich” is), and AGI will work the same way (at least, that’s how I would design it!)
I was trying to understand what you were getting at. This new argument seems pretty different from the “alignment is mainly about the prompt” thesis in your original post—another shift in arguments? (I don’t necessarily think it is bad for arguments to shift, I just think people should acknowledge that’s going on.)
It’s certainly conceivable to engineer systems some other way, and indeed I hope we do. Problem is:
if we just optimize for predictive power, then abstract notions will definitely be thrown away once the system can discover and perform Bayesian updates on low-level physics. (In principle we could engineer a system which never discovers that, but then it will still optimize predictive power by coming as close as possible.)
if we’re not just optimizing for predictive power, then we need some other design criteria, some other criteria for whether/how well the system is working.
In one sense, the goal of all this abstract theorizing is to identify what that other criteria needs to be in order to reliably end up using the “right” abstractions in the way we want. We could probably make up some ad-hoc criteria which works at least sometimes, but then as architectures and hardware advance over time we have no idea when that criteria will fail.
(Probably tangential) No, this reveals that my verbal definition of a sandwich was not a particularly accurate description of my underlying notion of sandwich—which is indeed the case for most definitions most of the time. It certainly does not prove the existence of multiple ways of knowing what a sandwich is.
Also, even if there’s some sort of ensembling, the concept “sandwich” still needs to specify one particular ensemble.
We’ve shifted to arguing over a largely orthogonal topic. The OP is mostly about the interface by which GPT can be aligned to things. We’ve shifted to talking about what alignment means in general, and what’s hard about aligning systems to the kinds of things we want. An analogy: the OP was mostly about programming in a particular language, while our current discussion is about what kinds of algorithms we want to write.
Prompts are a tool/interface for via which one can align a certain kind of system (i.e. GPT-3) with certain kinds of goals (addition, translation, etc). Our current discussion is about the properties of a certain kind of goal—goals which are abstract in an analogous way to human values.
Optimize for having a diverse range of models that all seem to fit the data.
How would that fix any of the problems we’ve been talking about?