You seem to be imagining a system that throws out all of its more abstract notions of “koala” once it has the capability to do Bayesian updates on low-level physics. I don’t see why we should engineer our system in this way.
It’s certainly conceivable to engineer systems some other way, and indeed I hope we do. Problem is:
if we just optimize for predictive power, then abstract notions will definitely be thrown away once the system can discover and perform Bayesian updates on low-level physics. (In principle we could engineer a system which never discovers that, but then it will still optimize predictive power by coming as close as possible.)
if we’re not just optimizing for predictive power, then we need some other design criterion, some other measure of whether/how well the system is working.
In one sense, the goal of all this abstract theorizing is to identify what that other criterion needs to be in order to reliably end up using the “right” abstractions in the way we want. We could probably make up some ad-hoc criterion that works at least sometimes, but then, as architectures and hardware advance over time, we have no idea when that criterion will fail.
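The predictive-power argument above can be sketched in a few lines of toy Python. All names and numbers here are hypothetical illustrations, not anything from the actual discussion: the point is just that if predictive score is the *only* selection criterion, an abstract model survives only until a better-scoring low-level model becomes available.

```python
# Toy sketch (all names/numbers hypothetical): when the only design criterion
# is predictive score, the fine-grained "low-level physics" model dominates
# the abstract model as soon as it is discovered, so the abstraction is dropped.

def select_model(models):
    """Pick whichever model scores best on prediction -- the sole criterion."""
    return max(models, key=lambda m: m["predictive_score"])

abstract = {"name": "koala-level abstraction", "predictive_score": 0.90}
low_level = {"name": "low-level physics", "predictive_score": 0.99}

# Before the system discovers low-level physics, the abstraction wins by default.
print(select_model([abstract])["name"])
# After discovery, the abstraction is never selected again.
print(select_model([abstract, low_level])["name"])
```

Nothing about the selection rule "remembers" the abstract model once it loses; that is the sense in which pure predictive optimization throws abstractions away.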
For example, you might give me a precise definition of a sandwich, and I show you something and you’re like “oh, actually that is/is not a sandwich, guess my definition was wrong in this case”—which reveals you have more than one way of knowing what “a sandwich” is.
(Probably tangential) No, this reveals that my verbal definition of a sandwich was not a particularly accurate description of my underlying notion of sandwich—which is indeed the case for most definitions most of the time. It certainly does not prove the existence of multiple ways of knowing what a sandwich is.
Also, even if there’s some sort of ensembling, the concept “sandwich” still needs to specify one particular ensemble.
This new argument seems pretty different from the “alignment is mainly about the prompt” thesis in your original post—another shift in arguments?
We’ve shifted to arguing over a largely orthogonal topic. The OP is mostly about the interface by which GPT can be aligned to things. We’ve shifted to talking about what alignment means in general, and what’s hard about aligning systems to the kinds of things we want. An analogy: the OP was mostly about programming in a particular language, while our current discussion is about what kinds of algorithms we want to write.
Prompts are a tool/interface via which one can align a certain kind of system (i.e. GPT-3) with certain kinds of goals (addition, translation, etc.). Our current discussion is about the properties of a certain kind of goal—goals which are abstract in a way analogous to human values.
if we’re not just optimizing for predictive power, then we need some other design criterion, some other measure of whether/how well the system is working.
Optimize for having a diverse range of models that all seem to fit the data.
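One way to read that one-line proposal as a concrete objective (this is my hypothetical reading, not anything the commenter specified): keep every model whose fit to the data clears some threshold, rather than only the single best predictor, and then score the resulting ensemble by how much its members disagree.

```python
# Hypothetical sketch of "optimize for a diverse range of models that all
# seem to fit the data": retain every model that roughly fits, then measure
# the ensemble's diversity as pairwise disagreement between predictions.

def diverse_ensemble(models, fit_threshold=0.05):
    """Keep all models whose loss clears the threshold, not just the best one."""
    return [m for m in models if m["loss"] <= fit_threshold]

def diversity(ensemble):
    """Total pairwise disagreement between predictions (larger = more diverse)."""
    preds = [m["prediction"] for m in ensemble]
    return sum(abs(a - b) for i, a in enumerate(preds) for b in preds[i + 1:])

models = [
    {"name": "A", "loss": 0.01, "prediction": 0.2},
    {"name": "B", "loss": 0.03, "prediction": 0.8},
    {"name": "C", "loss": 0.40, "prediction": 0.5},  # poor fit; excluded
]
kept = diverse_ensemble(models)
```

Under this reading, models A and B are both retained despite disagreeing, which is exactly what a pure predictive-power criterion would not do.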
Optimize for having a diverse range of models that all seem to fit the data.
How would that fix any of the problems we’ve been talking about?