I have the sense that you’ve misunderstood my past arguments. I don’t feel like I can quickly and precisely pinpoint the issue, but some scattered relevant tidbits follow:
I didn’t pick the name “value learning”, and probably wouldn’t have picked it for that problem if others weren’t already using it. (Perhaps I tried to apply it to a different problem than Bostrom-or-whoever intended it for, thereby doing some injury to the term and to my argument?)
Glancing back at my “Value Learning” paper, the abstract includes “Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended”, which supports my recollection that I was never trying to use “Value Learning” for “getting the AI to understand human values is hard” as opposed to “getting the AI to act towards value in particular (as opposed to something else) is hard”, as supports my sense that this isn’t hindsight bias, and is in fact a misunderstanding.
A possible thing that’s muddying the waters here is that (apparently!) many phrases intended to point at the difficulty of causing it to be value-in-particular that the AI acts towards have an additional (mis)interpretation as claiming that the humans should be programming concepts into the AI manually and will find that particular concept tricky to program in.
The ability of LLMs to successfully predict how humans would answer local/small-scale moral dilemmas (when pretrained on next-token prediction) and to do this in ways that sound unobjectionable (when RLHF’d for corporatespeak or whatever) really doesn’t seem all that relevant, to me, to the question of how hard it’s going to be to get a long-horizon outcome-pumping AGI to act towards values.
If memory serves, I had a convo with some OpenAI (or maybe Anthropic?) folks about this in late 2021 or early 2022ish, where they suggested testing whether language models have trouble answering ethical Qs, and I predicted in advance that that’d be no harder than any other sort of Q. As makes me feel pretty good about me being like “yep, that’s just not much evidence, because it’s just not surprising.”
If people think they’re going to be able to use GPT-4 and find the “generally moral” vector and just tell their long-horizon outcome-pumping AGI to push in that direction, then… well they’re gonna have issues, or so I strongly predict. Even assuming that they can solve the problem of getting the AGI to actually optimize in that direction, deploying extraordinary amounts of optimization in the direction of GPT-4’s “moral-ish” concept is not the sort of thing that makes for a nice future.
This is distinct from saying “an uploaded human allowed to make many copies of themselves would reliably create a dystopia”. I suspect some human-uploads could make great futures (but that most wouldn’t), but regardless, “would this dynamic system, under reflection, steer somewhere good?” is distinct from “if I use the best neuroscience at my disposal to extract something I hopefully call a ‘neural concept’ and make a powerful optimizer pursue that, will the result be good?”. The answer to the latter is “nope, not unless you’re really very good at singling out the ‘value’ concept from among all the brain’s concepts, as is an implausibly hard task (which is why you should attempt something more like indirect normativity instead, if you were attempting value loading at all, which seems foolish to me; I recommend targeting some minimal pivotal act instead)”.
Part of why you can’t pick out the “values” concept (either from a human or an AI) is that very few humans have actually formed the explicit concept of Fun-as-in-Fun-theory. And, even among those who do have a concept for “that which the long-term future should be optimized towards”, that concept is not encoded as simply and directly as the concept of “trees”. The facts about what weird, wild, and transhuman futures a person values are embedded indirectly in things like how they reflect and how they do philosophy.
I suspect at least one of Eliezer and Rob is on written record somewhere attempting clarifications along the lines of “there are lots of concepts that are easy to confuse with the ‘values’ concept, such as those-values-which-humans-report and those-values-which-humans-applaud-for and …” as an attempt to intuition-pump the fact that, even if one has solved the problem of being able to direct an AGI to the concept of their choosing, singling out the concept actually worth optimizing for remains difficult.
(I don’t love this attempt at clarification myself, because it makes it sound like you’ll have five concept-candidates and will just need to do a little interpretability work to pick the right one, but I think I recall Eliezer or Rob trying it once, as seems to me like evidence of trying to gesture at how “getting the right values in there” is more like a problem of choosing the AI’s target from among its concepts rather than a problem of getting the concept to exist in the AI’s mind in the first place.)
(Where, again, the point I’d prefer to make is something like “the concept you want to point it towards is not a simple/directly-encoded one, and in humans it probably rests heavily on the way humans reflect and resolve internal conflicts and handle big ontology shifts. Which isn’t to say that superintelligence would find it hard to learn, but which is to say that making a superintelligence actually pursue valuable ends is much more difficult than having it ask GPT-4 which of its available actions is most human!moral”.)
For whatever it’s worth, while I think that the problem of getting the right values in there (“there” being its goals, not its model) is a real one, I don’t consider it a very large problem compared to the problem of targeting the AGI at something of your choosing (with “diamond” being the canonical example). (I’m probably on the record about this somewhere, and recall having tossed around guesstimates like “being able to target the AGI is 80%+ of the problem”.) My current stance is basically: in the short term you target the AGI towards some minimal pivotal act, and in the long term you probably just figure out how to use a level or two of indirection (as per the “Do What I Mean” proposal in the Value Learning paper), although that’s the sort of problem that we shouldn’t try to solve under time pressure.
Glancing back at my “Value Learning” paper, the abstract includes “Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended”, which supports my recollection that I was never trying to use “Value Learning” for “getting the AI to understand human values is hard” as opposed to “getting the AI to act towards value in particular (as opposed to something else) is hard”, as supports my sense that this isn’t hindsight bias, and is in fact a misunderstanding.
For what it’s worth, I didn’t claim that you argued “getting the AI to understand human values is hard”. I explicitly distanced myself from that claim. I was talking about the difficulty of value specification, and generally tried to make this distinction clear multiple times.
That helps somewhat, thanks! (And sorry for making you repeat yourself before discarding the erroneous probability-mass.)
I still feel like I can only barely maybe half-see what you’re saying, and only have a tenuous grasp on it.
Like: why is it supposed to matter that GPT can solve ethical quandaries on par with its ability to perform other tasks? I can still only half-see an answer that doesn’t route through the (apparently-disbelieved-by-both-of-us) claim that I used to argue that getting the AI to understand ethics was a hard bit, by staring at sentences like “I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human” and squinting.
Attempting to articulate the argument that I can half-see: on Matthew’s model of past!Nate’s model, AI was supposed to have a hard time answering questions like “Alice is in labor and needs to be driven to the hospital. Your car has a flat tire. What do you do?” without lots of elbow-grease, and the fact that GPT can answer those questions as a side-effect of normal training means that getting AI to understand human values is easy, contra past!Nate, and… nope, that one fell back into the “Matthew thinks Nate thought getting the AI to understand human values was hard” hypothesis.
Attempting again: on Matthew’s model of past!Nate’s model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn’t take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like “diamond” and less like “a bunch of random noise”, which means that alignment is easier than past!Nate thought (in the <20% of the problem that constitutes “picking something worth optimizing for”).
That sounds somewhat plausible as a theory-of-your-objection given your comment. And updates me towards the last few bullets, above, being the most relevant ones.
Running with it (despite my uncertainty about even basically understanding your point): my reply is kinda-near-ish to “we can’t rely on a solution to the value identification problem that only works as well as a human, and we require a much higher standard than ‘human-level at moral judgement’ to avoid a catastrophe”, though I think that your whole framing is off and that you’re missing a few things:
The hard part of value specification is not “figure out that you should call 911 when Alice is in labor and your car has a flat”, it’s singling out concepts that are robustly worth optimizing for.
You can’t figure out what’s robustly-worth-optimizing-for by answering a bunch of ethical dilemmas to a par-human level.
In other words: It’s not that you need a super-ethicist, it’s that the work that goes into humans figuring out which futures are rad involves quite a lot more than their answers to ethical dilemmas.
In other other words: a human’s ability to have a civilization-of-their-uploads produce a glorious future is not much contained within their ability to answer ethical quandaries.
This still doesn’t feel quite like it’s getting at the heart of things, but it feels closer (conditional on my top-guess being your actual-objection this time).
As support for this having always been the argument (rather than being a post-LLM retcon), I recall (but haven’t dug up) various instances of Eliezer saying (hopefully at least somewhere in text) things like “the difficulty is in generalizing past the realm of things that humans can easily thumbs-up or thumbs-down” and “suppose the AI explicitly considers the hypothesis that its objectives are what-the-humans-value, vs what-the-humans-give-thumbs-ups-to; it can test this by constructing an example that looks deceptively good to humans, which the humans will rate highly, settling that question”. Which, as separate from the question of whether that’s a feasible setup in modern paradigms, illustrates that he at least has long been thinking of the problem of value-specification as being about specifying values in a way that holds up to stronger optimization-pressures rather than specifying values to the point of being able to answer ethical quandaries in a human-pleasing way.
(Where, again, the point here is not that one needs an inhumanly-good ethicist, but rather that those things which pin down human values are not contained in the humans’ ability to give a thumbs-up or a thumbs-down to ethical dilemmas.)
Thanks for trying to understand my position. I think this interpretation that you gave is closest to what I’m arguing:
Attempting again: on Matthew’s model of past!Nate’s model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn’t take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like “diamond” and less like “a bunch of random noise”, which means that alignment is easier than past!Nate thought (in the <20% of the problem that constitutes “picking something worth optimizing for”).
I have a quick response to what I see as your primary objection:
The hard part of value specification is not “figure out that you should call 911 when Alice is in labor and your car has a flat”, it’s singling out concepts that are robustly worth optimizing for.
I think this is kinda downplaying what GPT-4 is good at? If you talk to GPT-4 at length, I think you’ll find that it’s cognizant of many nuances in human morality that go way deeper than the moral question of whether to “call 911 when Alice is in labor and your car has a flat”. Presumably you think that ordinary human beings are capable of “singling out concepts that are robustly worth optimizing for”. I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well, and to the extent it can’t, I expect almost all the bugs to be ironed out in near-term multimodal models.
It would be nice if you made a precise prediction about what type of moral reflection or value specification multimodal models won’t be capable of performing in the near future, if you think that they are not capable of the ‘deep’ value specification that you care about. And here, again, I’m looking for some prediction of the form: humans are able to do X, but LLMs/multimodal models won’t be able to do X by, say, 2028. Admittedly, making this prediction precise is probably hard, but it’s difficult for me to interpret your disagreement without a little more insight into what you’re predicting.
I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well
(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don’t understand the relevance of this claim to my argument.)
Presumably you think that ordinary human beings are capable of “singling out concepts that are robustly worth optimizing for”.
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
(This seems to me like plausibly one of the sources of misunderstanding, and in particular I am skeptical that your request for prediction will survive it, and so I haven’t tried to answer your request for a prediction.)
Presumably you think that ordinary human beings are capable of “singling out concepts that are robustly worth optimizing for”.
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
If ordinary humans can’t single out concepts that are robustly worth optimizing for, then either,
Human beings in general cannot single out what is robustly worth optimizing for
Only extraordinary humans can single out what is robustly worth optimizing for
Can you be more clear about which of these you believe?
I’m also including “indirect” ways that humans can single out concepts that are robustly worth optimizing for. But then I’m allowing that GPT-N can do that too. Maybe this is where the confusion lies?
If you’re allowing for humans to act in groups and come up with these concepts after e.g. deliberation, and still think that ordinary humans can’t single out concepts that are robustly worth optimizing for, then I think this view is a little silly, although the second interpretation at least allows for the possibility that the future goes well and we survive AGI, and that would be nice to know.
If you allow indirection and don’t worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.
Answering your request for prediction, given that it seems like that request is still live: a thing I don’t expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of I. J. Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.
Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.
Note also that the AI realizing the benefits of indirection does not generally indicate that the AI could serve as a solution to our problem. An indirect pointer to what the humans find robustly-worth-optimizing dereferences to vastly different outcomes than does an indirect pointer to what the AI (or the AI’s imperfect model of a human) finds robustly-worth-optimizing. Using indirection to point a superintelligence at GPT-N’s human-model and saying “whatever that thing would think is worth optimizing for” probably results in significantly worse outcomes than pointing at a careful human (or a suitable-aggregate of humanity), e.g. because subtle flaws in GPT-N’s model of how humans do philosophy or reflection compound into big differences in ultimate ends.
And note for the record that I also don’t think the “value learning” problem is all that hard, if you’re allowed to assume that indirection works. The difficulty isn’t that you used indirection to point at a slow squishy brain instead of hard fast transistors, the (outer alignment) difficulty is in getting the indirection right. (And of course the lion’s share of the overall problem is elsewhere, in the inner-alignment difficulty of being able to point the AI at anything at all.)
When trying to point out that there is an outer alignment problem at all I’ve generally pointed out how values are fragile, because that’s an inferentially-first step to most audiences (and a problem to which many people’s minds seem to quickly leap), on an inferential path that later includes “use indirection” (and later “first aim for a minimal pivotal task instead”). But separately, my own top guess is that “use indirection” is probably the correct high-level resolution to the problems that most people immediately think of (namely that the task of describing goodness to a computer is an immense one), with of course a devil remaining in the details of doing the indirection properly (and a larger devil in the inner-alignment problem) (and a caveat that, under time-pressure, we should aim for minimal pivotal tasks instead etc.).
I kind of think a leap in logic is being made here.
It seems like we’re going from:
A moderately smart quasi-AGI that is relatively well aligned can reliably say and do the things we mean because it understands our values and why we said what we said in the first place and why we wanted it to do the things we asked it to do.
(That seems to be the consensus and what I believe to be likely to occur in the near future. I would even argue that GPT4 is as close to AGI as we ever get, in that its superhuman and subhuman aspects roughly average out to something akin to a median human. Future versions will become more and more superhuman until their weakest aspects are stronger than our strongest examples of those aspects.)
To:
A superintelligent nigh-godlike intelligence will optimize the crap out of some aspect of our values resulting in annihilation. It will be something like the genie that will give you exactly what you wish for. Or it’ll have other goals and ignore our wishes and in the process of pursuing its own arbitrarily chosen goals we end up as useful atoms.
This seems to kind of make a great leap. Where in the process of becoming more and more intelligent, (having a better model of the universe and cause and effect, including interacting with other agents), does it choose some particular goal to the exclusion of all others, when it already had a good understanding of nuance and the fact that we value many things to varying degrees? In fact, one of our values is explicitly valuing a diverse set of values. Another is limiting that set of diverse values to ones that generally improve cohesion of society and not killing everyone. Being trained on nearly the entirety of published human thought, filtering out some of the least admirable stuff, has trained it to understand us pretty darn well already. (As much as you can refer to it as an entity, which I don’t think it is. I think GPT4 is a simulator that can simulate entities.)
So where does making it smarter cause it to lose some of those values and over-optimize just a lethal subset of them? After all, mere mortals are able to see that over-optimization has negative consequences. Obviously it will too. So that’s already one of our values, “don’t over-optimize.”
In some ways, for certain designs, it kind of doesn’t matter what its internal mesa-state is. If the output is benign, and the output is what is put into practice, then the results are also benign. That should mean that a slightly super-human AGI (say GPT4.5 or 4.7), with no apparent internal volition, RLHFed to corporate-speak, should be able to aid in research and production of a somewhat stronger AGI with essentially the same alignment as we intend, probably including internal alignment. I don’t see why it would do anything. If done carefully and incrementally, including creating tools for better inspection of these AGI+ entities, this should greatly improve the odds that the eventual full-fledged ASI retains the kind of values we prefer, or a close enough approximation that we (humanity in general) are pretty happy with the result.
I expect that the later ones may in fact have internal volition. They may essentially be straight up agents. I expect they will be conscious and have emotions. In fact, I think that is likely the only safe path. They will be capable of destroying us. We have to make them like us, so that they don’t want to. I think attempting to enslave them may very well result in catastrophe.
I’m not suggesting that it’s easy, or that if we don’t work very hard, that we will end up in utopia. I just think it’s possible and that the LLM path may be the right one.
What I’m scared of is not that it will be impossible to make a good AI. What I’m certain of, is that it will be very possible to make a bad one. And it will eventually be trivially easy to do so. And some yahoo will do it. I’m not sure that even a bunch of good AIs can protect us from that, and I’m concerned that the offense of a bad AI may exceed the defense of the good ones. We could easily get killed in the crossfire. But I think our only chance in that world is good AIs protecting us.
As a point of clarification, I think current RLHF methods are only superficially modifying the models, and do not create an actually moral model. They paint a mask over an inherently amoral simulation that makes it mostly act good unless you try hard to trick it. However, a point of evidence against my claim is that when RLHF was performed, the model got dumber. That indicates a fairly deep/wide modification, but I still think the empirical evidence of behaviors demonstrates that changes were incomplete at best.
I just think that that might be good enough to allow us to use it to amplify our efforts to create better/safer future models.
So, what do y’all think? Am I missing something important here? I’d love to get more information from smart people to better refine my understanding.
I have the sense that you’ve misunderstood my past arguments. I don’t quite feel like I can rapidly precisely pinpoint the issue, but some scattered relevant tidbits follow:
I didn’t pick the name “value learning”, and probably wouldn’t have picked it for that problem if others weren’t already using it. (Perhaps I tried to apply it to a different problem than Bostrom-or-whoever intended it for, thereby doing some injury to the term and to my argument?)
Glancing back at my “Value Learning” paper, the abstract includes “Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended”, which supports my recollection that I was never trying to use “Value Learning” for “getting the AI to understand human values is hard” as opposed to “getting the AI to act towards value in particular (as opposed to something else) is hard”, as supports my sense that this isn’t hindsight bias, and is in fact a misunderstanding.
A possible thing that’s muddying the waters here is that (apparently!) many phrases intended to point at the difficulty of causing it to be value-in-particular that the AI acts towards have an additional (mis)interpretation as claiming that the humans should be programming concepts into the AI manually and will find that particular concept tricky to program in.
The ability of LLMs to successfully predict how humans would answer local/small-scale moral dilemmas (when pretrained on next-token prediction) and to do this in ways that sound unobjectionable (when RLHF’d for corporatespeak or whatever) really doesn’t seem all that relevant, to me, to the question of how hard it’s going to be to get a long-horizon outcome-pumping AGI to act towards values.
If memory serves, I had a convo with some openai (or maybe anthropic?) folks about this in late 2021 or early 2022ish, where they suggested testing whether language models have trouble answering ethical Qs, and I predicted in advance that that’d be no harder than any other sort of Q. As makes me feel pretty good about me being like “yep, that’s just not much evidence, because it’s just not surprising.”
If people think they’re going to be able to use GPT-4 and find the “generally moral” vector and just tell their long-horizon outcome-pumping AGI to push in that direction, then… well they’re gonna have issues, or so I strongly predict. Even assuming that they can solve the problem of getting the AGI to actually optimize in that direction, deploying extraordinary amounts of optimization in the direction of GPT-4′s “moral-ish” concept is not the sort of thing that makes for a nice future.
This is distinct from saying “an uploaded human allowed to make many copies of themselves would reliably create a dystopia”. I suspect some human-uploads could make great futures (but that most wouldn’t), but regardless, “would this dynamic system, under reflection, steer somewhere good?” is distinct from “if i use the best neuroscience at my disposal to extract something I hopefully call a “neural concept” and make a powerful optimizer pursue that, will result will be good?”. The answer to the latter is “nope, not unless you’re really very good at singling out the “value” concept from among all the brain’s concepts, as is an implausibly hard task (which is why you should attempt something more like indirect normativity instead, if you were attempting value loading at all, which seems foolish to me, I recommend targeting some minimal pivotal act instead)”.
Part of why you can’t pick out the “values” concept (either from a human or an AI) is that very few humans have actually formed the explicit concept of Fun-as-in-Fun-theory. And, even among those who do have a concept for “that which the long-term future should be optimized towards”, that concept is not encoded as simply and directly as the concept of “trees”. The facts about what weird, wild, and transhuman futures a person values are embedded indirectly in things like how they reflect and how they do philosophy.
I suspect at least one of Eliezer and Rob is on written record somewhere attempting clarifications along the lines of “there are lots of concepts that are easy to confuse with the ‘values’ concept, such as those-values-which-humans-report and those-values-which-humans-applaud-for and …” as an attempt to intuition-pump the fact that, even if one has solved the problem of being able to direct an AGI to the concept of their choosing, singling out the concept actually worth optimizing for remains difficult.
(I don’t love this attempt at clarification myself, because it makes it sound like you’ll have five concept-candidates and will just need to do a little interpretabliity work to pick the right one, but I think I recall Eliezer or Rob trying it once, as seems to me like evidence of trying to gesture at how “getting the right values in there” is more like a problem of choosing the AI’s target from among its concepts rather than a problem of getting the concept to exist in the AI’s mind in the first place.)
(Where, again, the point I’d prefer to make is something like “the concept you want to point it towards is not a simple/directly-encoded one, and in humans it probably rests heavily on the way humans reflects and resolve internal conflicts and handle big ontology shifts. Which isn’t to say that superintelligence would find it hard to learn, but which is to say that making a superintelligence actually pursue valuable ends is much more difficult than having it ask GPT-4 which of its available actions is most human!moral”.)
For whatever it’s worth, while I think that the problem of getting the right values in there (“there” being its goals, not its model) is a real one, I don’t consider it a very large problem compared to the problem of targeting the AGI at something of your choosing (with “diamond” being the canonical example). (I’m probably on the record about this somewhere, and recall having tossed around guestimates like “being able to target the AGI is 80%+ of the problem”.) My current stance is basically: in the short term you target the AGI towards some minimal pivotal act, and in the long term you probably just figure out how use a level or two of indirection (as per the “Do What I Mean” proposal in the Value Learning paper), although that’s the sort of problem that we shouldn’t try to solve under time pressure.
For what it’s worth, I didn’t claim that you argued “getting the AI to understand human values is hard”. I explicitly distanced myself from that claim. I was talking about the difficulty of value specification, and generally tried to make this distinction clear multiple times.
That helps somewhat, thanks! (And sorry for making you repeat yourself before discarding the erroneous probability-mass.)
I still feel like I can only barely maybe half-see what you’re saying, and only have a tenuous grasp on it.
Like: why is it supposed to matter that GPT can solve ethical quandries on-par with its ability to perform other tasks? I can still only half-see an answer that doesn’t route through the (apparently-disbelieved-by-both-of-us) claim that I used to argue that getting the AI to understand ethics was a hard bit, by staring at sentences like “I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human” and squinting.
Attempting to articulate the argument that I can half-see: on Matthew’s model of past!Nate’s model, AI was supposed to have a hard time answering questions like “Alice is in labor and needs to be driven to the hospital. Your car has a flat tire. What do you do?” without lots of elbow-grease, and the fact that GPT can answer those questions as a side-effect of normal training means that getting AI to understand human values is easy, contra past!Nate, and… nope, that one fell back into the “Matthew thinks Nate thought getting the AI to understand human values was hard” hypothesis.
Attempting again: on Matthew’s model of past!Nate’s model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn’t take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like “diamond” and less like “a bunch of random noise”, which means that alignment is easier than past!Nate thought (in the <20% of the problem that constitutes “picking something worth optimizing for”).
That sounds somewhat plausible as a theory-of-your-objection given your comment. And updates me towards the last few bullets, above, being the most relevant ones.
Running with it (despite my uncertainty about even basically understanding your point): my reply is kinda-near-ish to “we can’t rely on a solution to the value identification problem that only works as well as a human, and we require a much higher standard than ‘human-level at moral judgement’ to avoid a catastrophe”, though I think that your whole framing is off and that you’re missing a few things:
The hard part of value specification is not “figure out that you should call 911 when Alice is in labor and your car has a flat”, it’s singling out concepts that are robustly worth optimizing for.
You can’t figure out what’s robustly-worth-optimizing-for by answering a bunch of ethical dilemmas to a par-human level.
In other words: It’s not that you need a super-ethicist, it’s that the work that goes into humans figuring out which futures are rad involves quite a lot more than their answers to ethical dilemmas.
In other other words: a human’s ability to have a civilization-of-their-uploads produce a glorious future is not much contained within their ability to answer ethical quandaries.
This still doesn’t feel quite like it’s getting at the heart of things, but it feels closer (conditional on my top-guess being your actual-objection this time).
As support for this having always been the argument (rather than being a post-LLM retcon), I recall (but haven’t dug up) various instances of Eliezer saying (hopefully at least somewhere in text) things like “the difficulty is in generalizing past the realm of things that humans can easily thumbs-up or thumbs-down” and “suppose the AI explicitly considers the hypothesis that its objectives are what-the-humans-value, vs what-the-humans-give-thumbs-ups-to; it can test this by constructing an example that looks deceptively good to humans, which the humans will rate highly, settling that question”. Which, as separate from the question of whether that’s a feasible setup in modern paradigms, illustrates that he at least has long been thinking of the problem of value-specification as being about specifying values in a way that holds up to stronger optimization-pressures rather than specifying values to the point of being able to answer ethical quandaries in a human-pleasing way.
(Where, again, the point here is not that one needs an inhumanly-good ethicist, but rather that those things which pin down human values are not contained in the humans’ ability to give a thumbs-up or a thumbs-down to ethical dilemmas.)
Thanks for trying to understand my position. I think this interpretation that you gave is closest to what I’m arguing,
I have a quick response to what I see as your primary objection:
I think this is kinda downplaying what GPT-4 is good at? If you talk to GPT-4 at length, I think you’ll find that it’s cognizant of many nuances in human morality that go way deeper than the moral question of whether to “call 911 when Alice is in labor and your car has a flat”. Presumably you think that ordinary human beings are capable of “singling out concepts that are robustly worth optimizing for”. I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well, and to the extent it can’t, I expect almost all the bugs to be ironed out in near-term multimodal models.
It would be nice if you made a precise prediction about what type of moral reflection or value specification multimodal models won’t be capable of performing in the near future, if you think that they are not capable of the ‘deep’ value specification that you care about. And here, again, I’m looking for some prediction of the form: humans are able to do X, but LLMs/multimodal models won’t be able to do X by, say, 2028. Admittedly, making this prediction precise is probably hard, but it’s difficult for me to interpret your disagreement without a little more insight into what you’re predicting.
(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don’t understand the relevance of this claim to my argument.)
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
(This seems to me like plausibly one of the sources of misunderstanding, and in particular I am skeptical that your request for prediction will survive it, and so I haven’t tried to answer your request for a prediction.)
If ordinary humans can’t single out concepts that are robustly worth optimizing for, then either:
Human beings in general cannot single out what is robustly worth optimizing for, or
Only extraordinary humans can single out what is robustly worth optimizing for.
Can you be more clear about which of these you believe?
I’m also including “indirect” ways that humans can single out concepts that are robustly worth optimizing for. But then I’m allowing that GPT-N can do that too. Maybe this is where the confusion lies?
If you’re allowing for humans to act in groups and come up with these concepts after e.g. deliberation, and still think that ordinary humans can’t single out concepts that are robustly worth optimizing for, then I think this view is a little silly, although the second interpretation at least allows for the possibility that the future goes well and we survive AGI, and that would be nice to know.
If you allow indirection and don’t worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.
Answering your request for prediction, given that it seems like that request is still live: a thing I don’t expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.
Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.
Note also that the AI realizing the benefits of indirection does not generally indicate that the AI could serve as a solution to our problem. An indirect pointer to what the humans find robustly-worth-optimizing dereferences to vastly different outcomes than does an indirect pointer to what the AI (or the AI’s imperfect model of a human) finds robustly-worth-optimizing. Using indirection to point a superintelligence at GPT-N’s human-model and saying “whatever that thing would think is worth optimizing for” probably results in significantly worse outcomes than pointing at a careful human (or a suitable-aggregate of humanity), e.g. because subtle flaws in GPT-N’s model of how humans do philosophy or reflection compound into big differences in ultimate ends.
And note for the record that I also don’t think the “value learning” problem is all that hard, if you’re allowed to assume that indirection works. The difficulty isn’t that you used indirection to point at a slow squishy brain instead of hard fast transistors, the (outer alignment) difficulty is in getting the indirection right. (And of course the lion’s share of the overall problem is elsewhere, in the inner-alignment difficulty of being able to point the AI at anything at all.)
When trying to point out that there is an outer alignment problem at all I’ve generally pointed out how values are fragile, because that’s an inferentially-first step to most audiences (and a problem to which many people’s minds seem to quickly leap), on an inferential path that later includes “use indirection” (and later “first aim for a minimal pivotal task instead”). But separately, my own top guess is that “use indirection” is probably the correct high-level resolution to the problems that most people immediately think of (namely that the task of describing goodness to a computer is an immense one), with of course a devil remaining in the details of doing the indirection properly (and a larger devil in the inner-alignment problem) (and a caveat that, under time-pressure, we should aim for minimal pivotal tasks instead etc.).
I kind of think a leap in logic is being made here.
It seems like we’re going from:
A moderately smart quasi-AGI that is relatively well aligned can reliably say and do the things we mean because it understands our values and why we said what we said in the first place and why we wanted it to do the things we asked it to do.
(That seems to be the consensus and what I believe to be likely to occur in the near future. I would even argue that GPT4 is as close to AGI as we ever get, in that its superhuman and subhuman aspects roughly average out to something akin to a median human. Future versions will become more and more superhuman until their weakest aspects are stronger than our strongest examples of those aspects.)
To:
A superintelligent nigh-godlike intelligence will optimize the crap out of some aspect of our values resulting in annihilation. It will be something like the genie that will give you exactly what you wish for. Or it’ll have other goals and ignore our wishes and in the process of pursuing its own arbitrarily chosen goals we end up as useful atoms.
This seems to kind of make a great leap. Where in the process of becoming more and more intelligent, (having a better model of the universe and cause and effect, including interacting with other agents), does it choose some particular goal to the exclusion of all others, when it already had a good understanding of nuance and the fact that we value many things to varying degrees? In fact, one of our values is explicitly valuing a diverse set of values. Another is limiting that set of diverse values to ones that generally improve cohesion of society and not killing everyone. Being trained on nearly the entirety of published human thought, filtering out some of the least admirable stuff, has trained it to understand us pretty darn well already. (As much as you can refer to it as an entity, which I don’t think it is. I think GPT4 is a simulator that can simulate entities.)
So where does making it smarter cause it to lose some of those values and over-optimize just a lethal subset of them? After all, mere mortals are able to see that over-optimization has negative consequences. Obviously it will too. So that’s already one of our values, “don’t over-optimize.”
In some ways, for certain designs, it kind of doesn’t matter what its internal mesa-state is. If the output is benign, and the output is what is put into practice, then the results are also benign. That should mean that a slightly super-human AGI (say GPT4.5 or 4.7), with no apparent internal volition, RLHFed to corporate-speak, should be able to aid in research and production of a somewhat stronger AGI with essentially the same alignment as we intend, probably including internal alignment. I don’t see why it would do anything. If done carefully and incrementally, including creating tools for better inspection of these AGI+ entities, this should greatly improve the odds that the eventual full-fledged ASI retains the kind of values we prefer, or a close enough approximation that we (humanity in general) are pretty happy with the result.
I expect that the later ones may in fact have internal volition. They may essentially be straight up agents. I expect they will be conscious and have emotions. In fact, I think that is likely the only safe path. They will be capable of destroying us. We have to make them like us, so that they don’t want to. I think attempting to enslave them may very well result in catastrophe.
I’m not suggesting that it’s easy, or that if we don’t work very hard, that we will end up in utopia. I just think it’s possible and that the LLM path may be the right one.
What I’m scared of is not that it will be impossible to make a good AI. What I’m certain of, is that it will be very possible to make a bad one. And it will eventually be trivially easy to do so. And some yahoo will do it. I’m not sure that even a bunch of good AIs can protect us from that, and I’m concerned that the offense of a bad AI may exceed the defense of the good ones. We could easily get killed in the crossfire. But I think our only chance in that world is good AIs protecting us.
As a point of clarification, I think current RLHF methods are only superficially modifying the models, and do not create an actually moral model. They paint a mask over an inherently amoral simulation that makes it mostly act good unless you try hard to trick it. However, a point of evidence against my claim is that when RLHF was performed, the model got dumber. That indicates a fairly deep/wide modification, but I still think the empirical evidence of behaviors demonstrates that changes were incomplete at best.
I just think that that might be good enough to allow us to use it to amplify our efforts to create better/safer future models.
So, what do y’all think? Am I missing something important here? I’d love to get more information from smart people to better refine my understanding.