Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer’s.
Rob Bensinger
Yep, before I saw orthonormal’s response I had a draft-reply written that says almost literally the same thing:
we just call ’em like we see ’em
[...]
insofar as we make bad predictions, we should get penalized for it. and insofar as we think alignment difficulty is the crux for ‘why we need to shut it all down’, we’d rather directly argue against illusory alignment progress (and directly acknowledge real major alignment progress as a real reason to be less confident of shutdown as a strategy) rather than redirect to something less cruxy
I’ll also add: Nate (unlike Eliezer, AFAIK?) hasn’t flatly said ‘alignment is extremely difficult’. Quoting from Nate’s “sharp left turn” post:
Many people wrongly believe that I’m pessimistic because I think the alignment problem is extraordinarily difficult on a purely technical level. That’s flatly false, and is pretty high up there on my list of least favorite misconceptions of my views.
I think the problem is a normal problem of mastering some scientific field, as humanity has done many times before. Maybe it’s somewhat trickier, on account of (e.g.) intelligence being more complicated than, say, physics; maybe it’s somewhat easier on account of how we have more introspective access to a working mind than we have to the low-level physical fields; but on the whole, I doubt it’s all that qualitatively different than the sorts of summits humanity has surmounted before.
It’s made trickier by the fact that we probably have to attain mastery of general intelligence before we spend a bunch of time working with general intelligences (on account of how we seem likely to kill ourselves by accident within a few years, once we have AGIs on hand, if no pivotal act occurs), but that alone is not enough to undermine my hope.
What undermines my hope is that nobody seems to be working on the hard bits, and I don’t currently expect most people to become convinced that they need to solve those hard bits until it’s too late.
So it may be that Nate’s models would be less surprised by alignment breakthroughs than Eliezer’s models. And some other MIRI folks are much more optimistic than Nate, FWIW.
My own view is that I don’t feel nervous leaning on “we won’t crack open alignment in time” as a premise, and absent that premise I’d indeed be much less gung-ho about government intervention.
why put all your argumentative eggs in the “alignment is hard” basket? (If you’re right, then policymakers can’t tell that you’re right.)
The short answer is “we don’t put all our eggs in the basket” (e.g., Eliezer’s TED talk and TIME article emphasize that alignment is an open problem, but they emphasize other things too, and they don’t go into detail on exactly how hard Eliezer thinks the problem is), plus “we very much want at least some eggs in that basket because it’s true, it’s honest, it’s cruxy for us, etc.” And it’s easier for policymakers to acquire strong Bayesian evidence for “the problem is currently unsolved” and “there’s no consensus about how to solve it” and “most leaders in the field seem to think there’s a serious chance we won’t solve it in time” than to acquire strong Bayesian evidence for “we’re very likely generations away from solving alignment”, so the difficulty of communicating the latter isn’t a strong reason to de-emphasize all the former points.
The longer answer is a lot more complicated. We’re still figuring out how best to communicate our views to different audiences, and “it’s hard for policymakers to evaluate all the local arguments or know whether Yann LeCun is making more sense than Yoshua Bengio” is a serious constraint. If there’s a specific argument (or e.g. a specific three arguments) you think we should be emphasizing alongside “alignment is unsolved and looks hard”, I’d be interested to hear your suggestion and your reasoning. https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk is a very long list and isn’t optimized for policymakers, so I’m not sure what specific changes you have in mind here.
I expect it makes it easier, but I don’t think it’s solved.
Suppose you want to synthesize a lot of diamonds. Instead of giving an AI some lofty goal “maximize diamonds in an aligned way”, why not a bunch of small grounded ones.
“Plan the factory layout of the diamond synthesis plant with these requirements”.
“Order the equipment needed, here’s the payment credentials”.
“Supervise construction this workday comparing to original plans”
“Given this step of the plan, do it”
(Once the factory is built) “remove the output from diamond synthesis machine A53 and clean it”.
That is how MIRI imagines a sane developer using just-barely-aligned AI to save the world. You don’t build an open-ended maximizer and unleash it on the world to maximize some quantity that sounds good to you; that sounds insanely difficult. You carve out as many tasks as you can into concrete, verifiable chunks, and you build the weakest and most limited possible AI you can to complete each chunk, to minimize risk. (Though per faul_sname, you’re likely to be pretty limited in how much you can carve up the task, given time will be a major constraint and there may be parts of the task you don’t fully understand at the outset.)
Cf. The Rocket Alignment Problem. The point of solving the diamond maximizer problem isn’t to go build the thing; it’s that solving it is an indication that we’ve become less conceptually confused about real-world optimization and about aimable cognitive work. Being less conceptually confused about very basic aspects of problem-solving and goal-oriented reasoning means that you might be able to build some of your powerful AI systems out of building blocks that are relatively easy to analyze, test, design, predict, separate out into discrete modules, measure and limit the capabilities of, etc., etc.
That seems acceptable, industrial equipment causes accidents all the time, the main thing is to limit the damage. Fences to limit the robots operating area, timers that shut down control after a timeout, etc.
If everyone in the world chooses to permanently use very weak systems because they’re scared of AI killing them, then yes, the impact of any given system failing will stay low. But that’s not what’s going to actually happen; many people will use more powerful systems, once they can, because they misunderstand the risks or have galaxy-brained their way into not caring about them (e.g. ‘maybe humans don’t deserve to live’, ‘if I don’t do it someone else will anyway’, ‘if it’s that easy to destroy the world then we’re fucked anyway so I should just do the Modest thing of assuming nothing I do is that important’...).
The world needs some solution to the problem “if AI keeps advancing and more-powerful AI keeps proliferating, eventually someone will destroy the world with it”. I don’t know of a way to leverage AI to solve that problem without the AI being pretty dangerously powerful, so I don’t think AI is going to get us out of this mess unless we make a shocking amount of progress on figuring out how to align more powerful systems. (Where “aligning” includes things like being able to predict in advance how pragmatically powerful your system is, and being able to carefully limit the ways in which it’s powerful.)
To be clear: The diamond maximizer problem is about getting specific intended content into the AI’s goals (“diamonds” as opposed to some random physical structure it’s maximizing), not just about building a stable maximizer.
From briefly talking to Eliezer about this the other day, I think the story from MIRI’s perspective is more like:
Back in 2001, we defined “Friendly AI” as “The field of study concerned with the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals.”
We could have defined the goal more narrowly or generically than that, but that just seemed like an invitation to take your eye off the ball: if we aren’t going to think about the question of how to get good long-run outcomes from powerful AI systems, who will?
And many of the technical and philosophical problems seemed particular to CEV, which seemed like an obvious sort of solution to shoot for: just find some way to leverage the AI’s intelligence to solve the problem of extrapolating everyone’s preferences in a reasonable way, and of aggregating those preferences fairly.
Come 2014, Stuart Russell and MIRI were both looking for a new term to replace “the Friendly AI problem”, now that the field was starting to become a Real Thing. Both parties disliked Bostrom’s “the control problem”. In conversation, Russell proposed “the alignment problem”, and MIRI liked it, so Russell and MIRI both started using the term in public.
Unfortunately, it gradually came to light that Russell and MIRI had understood “Friendly AI” to mean two moderately different things, and this disconnect now turned into a split between how MIRI used “(AI) alignment” and how Russell used “(value) alignment”. (Which I think also influenced the split between Paul Christiano’s “(intent) alignment” and MIRI’s “(outcome) alignment”.)
Russell’s version of “friendliness/alignment” was about making the AI have good, human-deferential goals. But Creating Friendly AI 1.0 had been very explicit that “friendliness” was about good behavior, regardless of how that’s achieved. MIRI’s conception of “the alignment problem” (like Bostrom’s “control problem”) included tools like capability constraint and boxing, because the thing we wanted researchers to focus on was the goal of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires, not some proxy goal that might turn out to be surprisingly irrelevant.
Again, we wanted a field of people keeping their eye on the ball and looking for clever technical ways to get the job done, rather than a field that neglects some actually-useful technique because it doesn’t fit their narrow definition of “alignment”.
Meanwhile, developments like the rise of deep learning had updated MIRI that CEV was not going to be a realistic thing to shoot for with your first AI. We were still thinking of some version of CEV as the ultimate goal, but it now seemed clear that capabilities were progressing too quickly for humanity to have time to nail down all the details of CEV, and it was also clear that the approaches to AI that were winning out would be far harder to analyze, predict, and “aim” than 2001-Eliezer had expected. It seemed clear that if AI was going to help make the future go well, the first order of business would be to do the minimal thing to prevent other AIs from destroying the world six months later, with other parts of alignment/friendliness deferred to later.
I think considerations like this eventually trickled into how MIRI used the term “alignment”. Our first public writing reflecting the switch from “Friendly AI” to “alignment”, our Dec. 2014 agent foundations research agenda, said:
We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.”
Whereas by July 2016, when we released a new research agenda that was more ML-focused, “aligned” was shorthand for “aligned with the interests of the operators”.
In practice, we started using “aligned” to mean something more like “aimable” (where aimability includes things like corrigibility, limiting side-effects, monitoring and limiting capabilities, etc., not just “getting the AI to predictably tile the universe with smiley faces rather than paperclips”). Focusing on CEV-ish systems mostly seemed like a distraction, and an invitation to get caught up in moral philosophy and pie-in-the-sky abstractions, when “do a pivotal act” is legitimately a hugely more philosophically shallow topic than “implement CEV”. Instead, we went out of our way to frame the challenge of alignment in a way that seemed almost comically simple and “un-philosophical”, but that successfully captured all of the key obstacles: ‘explain how to use an AI to cause there to exist two strawberries that are identical at the cellular level, without causing anything weird or disruptive to happen in the process’.
Since realistic pivotal acts still seemed pretty outside the Overton window (and since we were mostly focused on our own research at the time), we wrote up our basic thoughts about the topic on Arbital but didn’t try to super-popularize the topic among rationalists or EAs at the time. (Which unfortunately, I think, exacerbated a situation where the larger communities had very fuzzy models of the strategic situation, and fuzzy models of what the point even was of this “alignment research” thing; alignment research just became a thing-that-was-good-because-it-was-a-good, not a concrete part of a plan backchained from concrete real-world goals.)
I don’t think MIRI wants to stop using “aligned” in the context of pivotal acts, and I also don’t think MIRI wants to totally divorce the term from the original long-term goal of friendliness/alignment.
Turning “alignment” purely into a matter of “get the AI to do what a particular stakeholder wants” is good in some ways—e.g., it clarifies that the level of alignment needed for pivotal acts could also be used to do bad things.
But from Eliezer’s perspective, this move would also be sending a message to all the young Eliezers “Alignment Research is what you do if you’re a serious sober person who thinks it’s naive to care about Doing The Right Thing and is instead just trying to make AI Useful To Powerful People; if you want to aim for the obvious desideratum of making AI friendly and beneficial to the world, go join e/acc or something”. Which does not seem ideal.
So I think my proposed solution would be to just acknowledge that ‘the alignment problem’ is ambiguous between three different (overlapping) efforts to figure out how to get good and/or intended outcomes from powerful AI systems:
intent alignment, which is about getting AIs to try to do what the AI thinks the user wants, and in practice seems to be most interested in ‘how do we get AIs to be generically trying-to-be-helpful’.
“strawberry problem” alignment, which is about getting AIs to safely, reliably, and efficiently do a small number of specific concrete tasks that are very difficult, for the sake of ending the acute existential risk period.
CEV-style alignment, which is about getting AIs to fully figure out how to make the future good.
Plausibly it would help to have better names for the latter two things. The distinction is similar to “narrow value learning vs. ambitious value learning”, but both problems (as MIRI thinks about them) are a lot more general than just “value learning”, and there’s a lot more content to the strawberry problem than to “narrow alignment”, and more content to CEV than to “ambitious value learning” (e.g., CEV cares about aggregation across people, not just about extrapolation).
(Note: Take the above summary of MIRI’s history with a grain of salt; I had Nate Soares look at this comment and he said “on a skim, it doesn’t seem to quite line up with my recollections nor cut things along the joints I would currently cut them along, but maybe it’s better than nothing”.)
In the context of a conversation with Balaji Srinivasan about my AI views snapshot, I asked Nate Soares what sorts of alignment results would impress him, and he said:
example thing that would be relatively impressive to me: specific, comprehensive understanding of models (with the caveat that that knowledge may lend itself more (and sooner) to capabilities before alignment). demonstrated e.g. by the ability to precisely predict the capabilities and quirks of the next generation (before running it)
i’d also still be impressed by simple theories of aimable cognition (i mostly don’t expect that sort of thing to have time to play out any more, but if someone was able to come up with one after staring at LLMs for a while, i would at least be impressed)
fwiw i don’t myself really know how to answer the question “technical research is more useful than policy research”; like that question sounds to me like it’s generated from a place of “enough of either of these will save you” whereas my model is more like “you need both”
tho i’m more like “to get the requisite technical research, aim for uploads” at this juncture
if this was gonna be blasted outwards, i’d maybe also caveat that, while a bunch of this is a type of interpretability work, i also expect a bunch of interpretability work to strike me as fake, shallow, or far short of the bar i consider impressive/hopeful
(which is not itself supposed to be any kind of sideswipe; i applaud interpretability efforts even while thinking it’s moving too slowly etc.)
I can come up with plans for destroying the world without wanting to do it, and other cognitive systems probably can too.
You’re changing the topic to “can you do X without wanting Y?”, when the original question was “can you do X without wanting anything at all?”.
Nate’s answer to nearly all questions of the form “can you do X without wanting Y?” is “yes”, hence his second claim in the OP: “the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular”.
I do need to answer that question using a goal-oriented search process. But my goal would be “answer Paul’s question”, not “destroy the world”.
Your ultimate goal would be neither of those things; you’re a human, and if you’re answering Paul’s question it’s probably because you have other goals that are served by answering.
In the same way, an AI that’s sufficiently good at answering sufficiently hard and varied questions would probably also have goals, and it’s unlikely by default that “answer questions” will be the AI’s primary goal.
The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is.
See my reply to Bogdan here. The issue isn’t “inelegance”; we also lack an inelegant ability to predict or explain how particular ML systems do what they do.
Modern ML is less like modern chemistry, and more like ancient culinary arts and medicine. (Or “ancient culinary arts and medicine shortly after a cultural reboot”, such that we have a relatively small number of recently-developed shallow heuristics and facts to draw on, rather than centuries of hard-earned experience.)
The opening sounds a lot like saying “aerodynamics used to be a science until people started building planes.”
The reason this analogy doesn’t land for me is that I don’t think our epistemic position regarding LLMs is similar to, e.g., the Wright brothers’ epistemic position regarding heavier-than-air flight.
The point Nate was trying to make with “ML is no longer a science” wasn’t “boo current ML that actually works, yay GOFAI that didn’t work”. The point was exactly to draw a contrast between, e.g., our understanding of heavier-than-air flight and our understanding of how the human brain works. The invention of useful tech that interfaces with the brain doesn’t entail that we understand the brain’s workings in the way we’ve long understood flight; it depends on what the (actual or hypothetical) tech is.
Maybe a clearer way of phrasing it is “AI used to be failed science; now it’s (mostly, outside of a few small oases) a not-even-attempted science”. “Failed science” maybe makes it clearer that the point here isn’t to praise the old approaches that didn’t work; there’s a more nuanced point being made.
Some of Nate’s quick thoughts (paraphrased), after chatting with him:
Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-mechanistically-explain a typical system humans have engineered.)
Nobody’s been able to call the specific capabilities of systems in advance. Nobody’s been able to call the specific exploits in advance. Nobody’s been able to build better cognitive algorithms by hand after understanding how the AI does things we can’t yet code by hand. There is clearly some other level of understanding that is possible that we lack, and that we once sought, and that only the interpretability folks continue to seek.
E.g., think of that time Neel Nanda figured out how a small transformer does modular arithmetic (AXRP episode); a rough sketch of that algorithm appears after this list. If nobody had ever thought of that algorithm for an adder, we would have thereby learned a new algorithm for an adder. There are things that these AI systems are doing that aren’t just lots of stuff we know; there are levels of organization of understanding that give you the ability to predict how things work outside of the bands where we’ve observed them.
It seems trendy to declare that they never existed in the first place and that that’s all white tower stuff, but Nate thinks this point of view is missing a pretty important and central thread.
The missing thread isn’t trivial to put into words, but it includes things like:
This sounds like the same sort of thing some people would say if they were staring at computer binary for the first time and didn’t know about the code behind the scenes: “We have plenty of understanding beyond just how the CPU handles instructions; we understand how memory caching works and we have recognized patterns like the stack and the heap; talking as if there’s some deeper level of organization is talking like a theorist when in fact this is an engineering problem.” Those types of understanding aren’t false, but they aren’t the sort of understanding of someone who has comprehended the codebase they’re looking at.
There are, predictably, things to learn here; the messiness and complexity of the real world doesn’t mean we already know the relevant principles. You don’t need to understand everything about how a bird works in order to build an airplane; there are compressible principles behind how birds fly; if you understand what’s going on you can build flying devices that have significantly more carrying capacity than a bird, and this holds true even if the practical engineering of an airplane requires a bunch of trial and error and messy engineering work.
A mind’s causal structure is allowed to be complicated; we can see the weights, but we don’t thereby have a mastery of the high-level patterns. In the case of humans, neuroscience hasn’t actually worked to give us a mastery of the high-level patterns the human brain is implementing.
Mystery is in the map, not in the territory; reductionism works. Not all sciences that can exist, already exist today.
Possibly the above pointers are only useful if you already grok the point we’re trying to make, and aren’t so useful for communicating a new idea; but perhaps not.
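As a concrete anchor for the modular-arithmetic example mentioned above: my rough, simplified summary of the published result (not part of Nate’s paraphrased remarks) is that the network embeds each input as rotations at a few “key frequencies” and composes them with trig identities, so that the correct residue falls out of constructive interference:

```latex
% Simplified sketch of the reported algorithm for computing (a + b) mod p.
% For a handful of key frequencies \omega_k = 2\pi k / p, the inputs are
% embedded as a \mapsto (\cos(\omega_k a), \sin(\omega_k a)), and likewise for b; then:
\begin{align*}
  \cos\big(\omega_k (a+b)\big) &= \cos(\omega_k a)\cos(\omega_k b) - \sin(\omega_k a)\sin(\omega_k b),\\
  \sin\big(\omega_k (a+b)\big) &= \sin(\omega_k a)\cos(\omega_k b) + \cos(\omega_k a)\sin(\omega_k b),\\
  \mathrm{logit}(c) &\propto \sum_k \cos\big(\omega_k (a+b-c)\big),
\end{align*}
% which is maximized exactly when c \equiv a + b \pmod{p}.
```

The relevant point for the argument above: this is a genuinely different adder than the ones humans typically write down, and it was recovered by interpretability work rather than by reading the weights directly.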
I read and responded to some pieces of that post when it came out; I don’t know whether Eliezer, Nate, etc. read it, and I’m guessing it didn’t shift MIRI, except as one of many data points “person X is now loudly in favor of a pause (and other people seem receptive), so maybe this is more politically tractable than we thought”.
I’d say that Kerry Vaughan was the main person who started smashing this Overton window, and this started in April/May/June of 2022. By late December my recollection is that this public conversation was already fully in swing and MIRI had already added our voices to the “stop building toward AGI” chorus. (Though at that stage I think we were mostly doing this on general principle, for lack of any better ideas than “share our actual long-standing views and hope that helps somehow”. Our increased optimism about policy solutions mostly came later, in 2023.)
That said, I bet Katja’s post had tons of relevant positive effects even if it didn’t directly shift MIRI’s views.
Remember that MIRI was in the business of poking at theoretical toy problems and trying to get less conceptually confused about how you could in principle cleanly design a reliable, aimable reasoner. MIRI wasn’t (and isn’t) in the business of issuing challenges to capabilities researchers to build a working water-bucket-filler as soon as possible, and wasn’t otherwise in the business of challenging people to race to AGI faster.
It wouldn’t have occurred to me that someone might think ‘can a deep net fill a bucket of water, in real life, without being dangerously capable’ is a crucial question in this context; I’m not sure we ever even had the thought occur in our heads ‘when might such-and-such DL technique successfully fill a bucket?’. It would seem just as strange to me as going to check the literature to make sure no GOFAI system ever filled a bucket of water.
(And while I think I understand why others see ChatGPT as a large positive update about alignment’s difficulty, I hope it’s also obvious why others, MIRI included, would not see it that way.)
Hacky approaches to alignment do count just as much as clean, scrutable, principled approaches—the important thing is that the AGI transition goes well, not that it goes well and feels clean and tidy in the process. But in this case the messy empirical approach doesn’t look to me like it actually lets you build a corrigible AI that can help with a pivotal act.
If general-ish DL methods were already empirically OK at filling water buckets in 2016, just as GOFAI already was in 2016, I suspect we still would have been happy to use the Fantasia example, because it’s a simple well-known story that can help make the abstract talk of utility functions and off-switch buttons easier to mentally visualize and manipulate.
(Though now that I’ve seen the confusion the example causes, I’m more inclined to think that the strawberry problem is a better frame than the Fantasia example.)
I think the old school MIRI cauldron-filling problem pertained to pretty mundane, everyday tasks. No one said at the time that they didn’t really mean that it would be hard to get an AGI to do those things, that it was just an allegory for other stuff like the strawberry problem. They really seemed to believe, and said over and over again, that we didn’t know how to direct a general-purpose AI to do bounded, simple, everyday tasks without it wanting to take over the world. So this should be a big update to people who held that view, even if there are still arguably risks about OOD behavior.
As someone who worked closely with Eliezer and Nate at the time, including working with Eliezer and Nate on our main write-ups that used the cauldron example, I can say that this is definitely not what we were thinking at the time. Rather:
The point was to illustrate a weird gap in the expressiveness and coherence of our theories of rational agency: “fill a bucket of water” seems like a simple enough task, but it’s bizarrely difficult to just write down a simple formal description of an optimization process that predictably does this (without any major side-effects, etc.).
(We can obviously stipulate “this thing is smart enough to do the thing we want, but too dumb to do anything dangerous”, but the relevant notion of “smart enough” is not itself formal; we don’t understand optimization well enough to formally define agents that have all the cognitive abilities we want and none of the abilities we don’t want.)
The point of emphasizing “holy shit, this seems so easy and simple and yet we don’t see a way to do it!” wasn’t to issue a challenge to capabilities researchers to go cobble together a real-world AI that can fill a bucket of water without destroying the world. The point was to emphasize that corrigibility, low-impact problem-solving, ‘real’ satisficing behavior, etc. seem conceptually simple, and yet the concepts have no known formalism. (A toy sketch of the difficulty appears after this list.)
The hope was that someone would see the simple toy problems and go ‘what, no way, this sounds easy’, get annoyed/nerdsniped, run off to write some equations on a whiteboard, and come back a week or a year later with a formalism (maybe from some niche mathematical field) that works totally fine for this, and makes it easier to formalize lots of other alignment problems in simplified settings (e.g., with unbounded computation).
Or failing that, the hope was that someone might at least come up with a clever math hack that solves the immediate ‘get the AI to fill the bucket and halt’ problem and replaces this dumb-sounding theory question with a slightly deeper theory question.
By using a children’s cartoon to illustrate the toy problem, we hoped to make it clearer that the genre here is “toy problem to illustrate a weird conceptual issue in trying to define certain alignment properties”, not “robotics problem where we show a bunch of photos of factory robots and ask how we can build a good factory robot to refill water receptacles used in industrial applications”.
Nate’s version of the talk, which is mostly a more polished version of Eliezer’s talk, is careful to liberally sprinkle in tons of qualifications like (emphasis added)
″… for systems that are sufficiently good at modeling their environment”,
‘if the system is smart enough to recognize that shutdown will lower its score’,
“Relevant safety measures that don’t assume we can always outthink and outmaneuver the system...”,
… to make it clearer that the general issue is powerful, strategic optimizers that have high levels of situational awareness, etc., not necessarily ‘every system capable enough to fill a bucket of water’ (or ‘every DL system...’).
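To make the “fill the cauldron” difficulty above concrete, here is a minimal toy sketch (mine, not from the talks; the policies and their probabilities are invented for illustration). The naive starting point in the talks is a utility function that is 1 if the cauldron is full and 0 otherwise; an expected-utility maximizer over that function prefers whichever policy most reliably fills the cauldron, however disruptive, because nothing in the objective rewards halting or penalizes side-effects:

```python
# Toy sketch (not MIRI's formalism; policies and probabilities invented
# for illustration) of why "fill the cauldron" is hard to formalize.

def utility(cauldron_full: bool) -> float:
    # Naive objective: 1 if the cauldron is full, 0 otherwise. Nothing here
    # rewards halting afterwards or penalizes side effects.
    return 1.0 if cauldron_full else 0.0

# Hypothetical policies, each with a made-up probability that the cauldron
# ends up full.
policies = {
    "fetch one bucket, then stop":           0.95,
    "keep fetching buckets indefinitely":    0.99,
    "disable the off-switch, keep fetching": 0.999,
}

def expected_utility(p_full: float) -> float:
    return p_full * utility(True) + (1 - p_full) * utility(False)

best = max(policies, key=lambda name: expected_utility(policies[name]))
print(best)  # -> "disable the off-switch, keep fetching"
```

The exercise that looked deceptively easy was to replace `utility` (or the “maximize expected utility” rule itself) with something that makes the first policy win for principled, generalizable reasons, rather than via hand-tuned penalty terms for each newly noticed failure mode; that is the formalism that was, and still is, missing.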
I think this provides some support
??? What?? It’s fine to say that this is a falsified prediction, but how does “Eliezer expected less NLP progress pre-ASI” provide support for “Eliezer thinks solving NLP is a major part of the alignment problem”?
I continue to be baffled at the way you’re doing exegesis here, happily running with extremely tenuous evidence for P while dismissing contemporary evidence for not-P, and seeming unconcerned about the fact that Eliezer and Nate apparently managed to secretly believe P for many years without ever just saying it outright, and seeming equally unconcerned about the fact that Eliezer and Nate keep saying that your interpretation of what they said is wrong. (Which I also vouch for from having worked with them for ten years, separate from the giant list of specific arguments I’ve made. Good grief.)
At the very least, the two claims are consistent.
?? “Consistent” is very different from “supports”! Every off-topic claim by EY is “consistent” with Gallabytes’ assertion.
using GOFAI methods
“Nope” to this part. I otherwise like this comment a lot!
The main thing I’m claiming is that MIRI said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values.
The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value specification/value identification, that means you have an actual function that can tell you the value of any outcome. If you get an AI that merely understands human values, you can’t necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
Ah, this is helpful clarification! Thanks. :)
I don’t think MIRI ever considered this an important part of the alignment problem, and I don’t think we expect humanity to solve lots of the alignment problem as a result of having such a tool; but I think I better understand now why you think this is importantly different from “AI ever gets good at NLP at all”.
don’t know if your essay is the source of the phrase or whether you just titled it
I think I came up with that particular phrase (though not the idea, of course).
More “outer alignment”-like issues being given what seems/seemed to me like outsized focus compared to more “inner alignment”-like issues (although there has been a focus on both for as long as I can remember).
In retrospect I think we should have been more explicit about the importance of inner alignment; I think that we didn’t do that in our introduction to corrigibility because it wasn’t necessary for illustrating the problem and where we’d run into roadblocks.
Maybe a missing piece here is some explanation of why having a formal understanding of corrigibility might be helpful for actually training corrigibility into a system? (Helpful at all, even if it’s not sufficient on its own.)
The attempts to think of “tricks” seeming to be focused on real-world optimization-targets to point at, rather than ways of extracting help with alignment somehow / trying to find techniques/paths/tricks for obtaining reliable oracles.
Aside from “concreteness can help make the example easier to think about when you’re new to the topic”, part of the explanation here might be “if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment or building a system that only outputs English-language sentences”.
Having utility functions so prominently/commonly be the layer of abstraction that is used[4].
I mean, I think utility functions are an extremely useful and basic abstraction. I think it’s a lot harder to think about a lot of AI topics without invoking ideas like ‘this AI thinks outcome X is better than outcome Y’, or ‘this AI’s preferences come with different weights, which can’t purely be reduced to what the AI believes’.
Suppose that I’m trying to build a smarter-than-human AI that has a bunch of capabilities (including, e.g., ‘be good at Atari games’), and that has the goal ‘maximize the amount of diamond in the universe’. It’s true that current techniques let you provide greater than zero pressure in the direction of ‘maximize the amount of diamond in the universe’, but there are several important senses in which reality doesn’t ‘bite back’ here:
If the AI acquires an unrelated goal (e.g., calculate as many digits of pi as possible), and acquires the belief ‘I will better achieve my true goal if I maximize the amount of diamond’ (e.g., because it infers that its programmer wants that, or just because an SGD-ish process nudged it in the direction of having such a belief), then there’s no way in which reality punishes or selects against that AGI (relative to one that actually has the intended goal).
Things that make the AI better at some Atari games, will tend to make it better at other Atari games, but won’t tend to make it care more about maximizing diamonds. More generally, things that make AI more capable tend to go together (especially once you get to higher levels of difficulty, generality, non-brittleness, etc.), whereas none of them go together with “terminally value a universe full of diamond”.
If we succeed in partly instilling the goal into the AI (e.g., it now likes carbon atoms a lot), then this doesn’t provide additional pressure for the AI to internalize the rest of the goal. There’s no attractor basin where if you have half of human values, you’re under more pressure to acquire the other half. In contrast, if you give AI high levels of capability in half the capabilities, it will tend to want all the rest of the capabilities too; and whatever keeps it from succeeding on general reasoning and problem-solving will also tend to keep it from succeeding on the narrow task you’re trying to get it to perform. (More so to the extent the task is hard.)
(There are also separate issues, like ‘we can’t provide a training signal where we thumbs-down the AI destroying the world, because we die in those worlds’.)
Nate and Eliezer have already made some of the high-level points I wanted to make, but they haven’t replied to a lot of the specific examples and claims in the OP, and I see some extra value in doing that. (Like, if you think Eliezer and Nate are being revisionist in their claims about what past-MIRI thought, then them re-asserting “no really, we used to believe X!” is less convincing than my responding in detail to the specific quotes Matt thinks supports his interpretation, while providing examples of us saying the opposite.)
However, I distinctly recall MIRI people making a big deal about the value identification problem (AKA the value specification problem)
The Arbital page for “value identification problem” is a three-sentence stub, I’m not exactly sure what the term means on that stub (e.g., whether “pinpointing valuable outcomes to an advanced agent” is about pinpointing them in the agent’s beliefs or in its goals), and the MIRI website gives me no hits for “value identification”.
As for “value specification”, the main resource where MIRI talks about that is https://intelligence.org/files/TechnicalAgenda.pdf, where we introduce the problem by saying:
A highly-reliable, error-tolerant agent design does not guarantee a positive impact; the effects of the system still depend upon whether it is pursuing appropriate goals.
A superintelligent system may find clever, unintended ways to achieve the specific goals that it is given. Imagine a superintelligent system designed to cure cancer which does so by stealing resources, proliferating robotic laboratories at the expense of the biosphere, and kidnapping test subjects: the intended goal may have been “cure cancer without doing anything bad,” but such a goal is rooted in cultural context and shared human knowledge.
It is not sufficient to construct systems that are smart enough to figure out the intended goals. Human beings, upon learning that natural selection “intended” sex to be pleasurable only for purposes of reproduction, do not suddenly decide that contraceptives are abhorrent. While one should not anthropomorphize natural selection, humans are capable of understanding the process which created them while being completely unmotivated to alter their preferences. For similar reasons, when developing AI systems, it is not sufficient to develop a system intelligent enough to figure out the intended goals; the system must also somehow be deliberately constructed to pursue them (Bostrom 2014, chap. 8).
So I don’t think we’ve ever said that an important subproblem of AI alignment is “make AI smart enough to figure out what goals humans want”?
for example in this 2016 talk from Yudkowsky.
[footnote:] More specifically, in the talk, at one point Yudkowsky asks “Why expect that [alignment] is hard?” and goes on to tell a fable about programmers misspecifying a utility function, which then gets optimized by an AI with disastrous consequences. My best interpretation of this part of the talk is that he’s saying the value identification problem is one of the primary reasons why alignment is hard. However, I encourage you to read the transcript yourself if you are skeptical of my interpretation.
I don’t see him saying anywhere “the issue is that the AI doesn’t understand human goals”. In fact, the fable explicitly treats the AGI as being smart enough to understand English and have reasonable English-language conversations with the programmers:
With that said: What if programmers build an artificial general intelligence to optimize for smiles? Smiles are good, right? Smiles happen when good things happen.
During the development phase of this artificial general intelligence, the only options available to the AI might be that it can produce smiles by making people around it happy and satisfied. The AI appears to be producing beneficial effects upon the world, and it is producing beneficial effects upon the world so far.
Now the programmers upgrade the code. They add some hardware. The artificial general intelligence gets smarter. It can now evaluate a wider space of policy options—not necessarily because it has new motors, new actuators, but because it is now smart enough to forecast the effects of more subtle policies. It says, “I thought of a great way of producing smiles! Can I inject heroin into people?” And the programmers say, “No! We will add a penalty term to your utility function for administering drugs to people.” And now the AGI appears to be working great again.
They further improve the AGI. The AGI realizes that, OK, it doesn’t want to add heroin anymore, but it still wants to tamper with your brain so that it expresses extremely high levels of endogenous opiates. That’s not heroin, right?
It is now also smart enough to model the psychology of the programmers, at least in a very crude fashion, and realize that this is not what the programmers want. If I start taking initial actions that look like it’s heading toward genetically engineering brains to express endogenous opiates, my programmers will edit my utility function. If they edit the utility function of my future self, I will get less of my current utility. (That’s one of the convergent instrumental strategies, unless otherwise averted: protect your utility function.) So it keeps its outward behavior reassuring. Maybe the programmers are really excited, because the AGI seems to be getting lots of new moral problems right—whatever they’re doing, it’s working great!
I think the point of the smiles example here isn’t “NLP is hard, so we’d use the proxy of smiles instead, and all the issues of alignment are downstream of this”; rather, it’s that as a rule, superficially nice-seeming goals that work fine when the AI is optimizing weakly (whether or not it’s good at NLP at the time) break down when those same goals are optimized very hard. The smiley example makes this obvious because the goal is simple enough that it’s easy for us to see what its implications are; far more complex goals also tend to break down when optimized hard enough, but this is harder to see because it’s harder to see the implications. (Which is why “smiley” is used here.)
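One way to see the “optimize weakly vs. optimize hard” point quantitatively is a standard Goodhart-style toy model (my illustration, not from the talk, and a statistical mechanism rather than the mis-specification in the smiley example): select candidates by a proxy score that is merely correlated with what you actually want, and the gap between proxy and true value widens as selection pressure increases.

```python
# Goodhart-style toy model (illustrative): U is a proxy correlated with the
# true value V. Selecting the best-of-n candidate by U looks fine for small
# n, but as n grows the winner's proxy score keeps climbing while its true
# value stalls -- the selection is increasingly won by the error term.
import random

random.seed(0)

def candidate():
    v = random.gauss(0, 1)            # true value
    u = v + random.expovariate(0.5)   # proxy = true value + heavier-tailed error
    return v, u

def select_by_proxy(n):
    """Return (true value, proxy score) of the best-of-n candidate by proxy."""
    return max((candidate() for _ in range(n)), key=lambda vu: vu[1])

for n in (10, 1_000, 1_000_000):
    v, u = select_by_proxy(n)
    print(f"n={n:>9}  proxy={u:6.2f}  true value={v:6.2f}")
```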
MIRI people frequently claimed that solving the value identification problem would be hard, or at least non-trivial.[6] For instance, Nate Soares wrote in his 2016 paper on value learning, that “Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task.”
That link is broken; the paper is https://intelligence.org/files/ValueLearningProblem.pdf. The full paragraph here is:
Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task. Problems of ontology identification recur here: the framework for extracting preferences and affecting outcome ratings needs to be robust to drastic changes in the learner’s model of the operator. The special-case identification of the “operator model” must survive as the system goes from modeling the operator as a simple reward function to modeling the operator as a fuzzy, ever-changing part of reality built out of biological cells—which are made of atoms, which arise from quantum fields.
Revisiting the Ontology Identification section helps clarify what Nate means by “safely extracting preferences from a model of a human”: IIUC, he’s talking about a programmer looking at an AI’s brain, identifying the part of the AI’s brain that is modeling the human, identifying the part of the AI’s brain that is “the human’s preferences” within that model of a human, and then manually editing the AI’s brain to “hook up” the model-of-a-human-preference to the AI’s goals/motivations, in such a way that the AI optimizes for what it models the humans as wanting. (Or some other, less-toy process that amounts to the same thing—e.g., one assisted by automated interpretability tools.)
In this toy example, we can assume that the programmers look at the structure of the initial world-model and hard-code a tool for identifying the atoms within. What happens, then, if the system develops a nuclear model of physics, in which the ontology of the universe now contains primitive protons, neutrons, and electrons instead of primitive atoms? The system might fail to identify any carbon atoms in the new world-model, making the system indifferent between all outcomes in the dominant hypothesis. Its actions would then be dominated by any tiny remaining probabilities that it is in a universe where fundamental carbon atoms are hiding somewhere.
[...]
To design a system that classifies potential outcomes according to how much diamond is in them, some mechanism is needed for identifying the intended ontology of the training data within the potential outcomes as currently modeled by the AI. This is the ontology identification problem introduced by de Blanc [2011] and further discussed by Soares [2015].
This problem is not a traditional focus of machine learning work. When our only concern is that systems form better world-models, then an argument can be made that the nuts and bolts are less important. As long as the system’s new world-model better predicts the data than its old world-model, the question of whether diamonds or atoms are “really represented” in either model isn’t obviously significant. When the system needs to consistently pursue certain outcomes, however, it matters that the system’s internal dynamics preserve (or improve) its representation of which outcomes are desirable, independent of how helpful its representations are for prediction. The problem of making correct choices is not reducible to the problem of making accurate predictions.
Inductive value learning requires the construction of an outcome-classifier from value-labeled training data, but it also requires some method for identifying, inside the states or potential states described in its world-model, the referents of the labels in the training data.
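To make the quoted failure mode concrete, here is a minimal toy sketch (hypothetical code of my own, not from the paper): a diamond-scoring function hard-coded against the system’s initial atom-level ontology silently stops seeing diamonds once the world-model is refined into protons, neutrons, and electrons.

```python
# Toy sketch (hypothetical, not from the paper) of the ontology
# identification problem: the programmers hard-code a detector for carbon
# atoms in the AI's *initial* world-model.

initial_world_model = {
    "objects": [{"kind": "carbon_atom", "bonds": 4} for _ in range(8)],
}

# Later, the system adopts a nuclear model of physics: the same region of
# the world is now described without any "carbon_atom" objects at all.
refined_world_model = {
    "objects": [{"kind": "proton"}, {"kind": "neutron"}, {"kind": "electron"}] * 8,
}

def diamond_score(world_model) -> int:
    # Hard-coded against the initial ontology.
    return sum(1 for obj in world_model["objects"] if obj["kind"] == "carbon_atom")

print(diamond_score(initial_world_model))  # 8
print(diamond_score(refined_world_model))  # 0 -- the goal's referent vanished from the new ontology
```

Nothing breaks from the system’s point of view: the refined model predicts better, so prediction-focused training happily adopts it, while the hard-coded goal quietly becomes indifferent between outcomes in the dominant hypothesis (or dominated by whatever probability mass remains on atom-based hypotheses, as the quote notes).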
As Nate and I noted in other comments, the paper repeatedly clarifies that the core issue isn’t about whether the AI is good at NLP. Quoting the paper’s abstract:
Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended.
And the lede section:
The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent. The idea of superintelligent agents monomaniacally pursuing “dumb”-seeming goals may sound odd, but it follows from the observation of Bostrom and Yudkowsky [2014, chap. 7] that AI capabilities and goals are logically independent.[1] Humans can fully comprehend that their “designer” (evolution) had a particular “goal” (reproduction) in mind for sex, without thereby feeling compelled to forsake contraception. Instilling one’s tastes or moral values into an heir isn’t impossible, but it also doesn’t happen automatically.
Back to your post:
And to be clear, I don’t mean that GPT-4 merely passively “understands” human values. I mean that asking GPT-4 to distinguish valuable and non-valuable outcomes works pretty well at approximating the human value function in practice
I don’t think I understand what difference you have in mind here, or why you think it’s important. Doesn’t “this AI understands X” more-or-less imply “this AI can successfully distinguish X from not-X in practice”?
This fact is key to what I’m saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate “human value function”. That wouldn’t solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.
But we could already query the human value function by having the AI system query an actual human. What specific problem is meant to be solved by swapping out “query a human” for “query an AI”?
I interpret this passage as saying that ‘the problem’ is extracting all the judgements that “you would make”, and putting that into a wish. I think he’s implying that these judgements are essentially fully contained in your brain. I don’t think it’s credible to insist he was referring to a hypothetical ideal human value function that ordinary humans only have limited access to, at least in this essay.
Absolutely. But as Eliezer clarified in his reply, the issue he was worried about was getting specific complex content into the agent’s goals, not getting specific complex content into the agent’s beliefs. Which is maybe clearer in the 2011 paper where he gave the same example and explicitly said that the issue was the agent’s “utility function”.
For example, a straightforward reading of Nate Soares’ 2017 talk supports this interpretation. In the talk, Soares provides a fictional portrayal of value misalignment, drawing from the movie Fantasia. In the story, Mickey Mouse attempts to instruct a magical broom to fill a cauldron, but the broom follows the instructions literally rather than following what Mickey Mouse intended, and floods the room. Soares comments: “I claim that as fictional depictions of AI go, this is pretty realistic.”
As I said in another comment:
“Fill the cauldron” examples are examples where the cauldron-filler has the wrong utility function, not examples where it has the wrong beliefs. E.g., this is explicit in https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/
The idea of the “fill the cauldron” examples isn’t “the AI is bad at NLP and therefore doesn’t understand what we mean when we say ‘fill’, ‘cauldron’, etc.” It’s “even simple small-scale tasks are unnatural, in the sense that it’s hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn’t an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to aim AI systems at a bounded low-impact task like this”. (Including easier to aim via training.)
It’s true that ‘value is relatively complex’ is part of why it’s hard to get the right goal into an AGI; but it doesn’t follow from this that ‘AI is able to develop pretty accurate beliefs about our values’ helps get those complex values into the AGI’s goals. (It does provide nonzero evidence about how complex value is, but I don’t see you arguing that value is very simple in any absolute sense, just that it’s simple enough for GPT-4 to learn decently well. Which is not reassuring, because GPT-4 is able to learn a lot of very complicated things, so this doesn’t do much to bound the complexity of human value.)
In any case, I take this confusion as evidence that the fill-the-cauldron example might not be very useful. Or maybe all these examples just need to explicitly specify, going forward, that the AI is part-human at understanding English.
Perhaps more important to my point, Soares presented a clean separation between the part where we specify an AI’s objectives, and the part where the AI tries to maximize those objectives. He draws two arrows, indicating that MIRI is concerned about both parts.
Your image isn’t displaying for me, but I assume it’s this one?
I don’t know what you mean by “specify an AI’s objectives” here, but the specific term Nate uses here is “value learning” (not “value specification” or “value identification”). And Nate’s Value Learning Problem paper, as I noted above, explicitly disclaims that ‘get the AI to be smart enough to output reasonable-sounding moral judgments’ is a core part of the problem.
He states, “The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification.” I believe this quote refers directly to the value identification problem, rather than the problem of getting an AI to care about following the goals we’ve given it.
The way you quoted this makes it sound like a gloss on the image, but it’s actually a quote from the very start of the talk:
The notion of AI systems “breaking free” of the shackles of their source code or spontaneously developing human-like desires is just confused. The AI system is its source code, and its actions will only ever follow from the execution of the instructions that we initiate. The CPU just keeps on executing the next instruction in the program register. We could write a program that manipulates its own code, including coded objectives. Even then, though, the manipulations that it makes are made as a result of executing the original code that we wrote; they do not stem from some kind of ghost in the machine.
The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification. As Stuart Russell (co-author of Artificial Intelligence: A Modern Approach) puts it:
The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:
1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.
2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task. [...]
I wouldn’t read too much into the word choice here, since I think it’s just trying to introduce the Russell quote, which is (again) explicitly about getting content into the AI’s goals, not about getting content into the AI’s beliefs.
(In general, I think the phrase “value specification” is sort of confusingly vague. I’m not sure what the best replacement is for it—maybe just “value loading”, following Bostrom? -- but I suspect MIRI’s usage of it has been needlessly confusing. Back in 2014, we reluctantly settled on it as jargon for “the part of the alignment problem that isn’t subsumed in getting the AI to reliably maximize diamonds”, because this struck us as a smallish but nontrivial part of the problem; but I think it’s easy to read the term as referring to something a lot more narrow.)
The point of “the genie knows but doesn’t care” wasn’t that the AI would take your instructions, know what you want, and yet disobey the instructions because it doesn’t care about what you asked for. If you read Rob Bensinger’s essay carefully, you’ll find that he’s actually warning that the AI will care too much about the utility function you gave it, and maximize it exactly, against your intentions[10].
Yep—I think I’d have endorsed claims like “by default, a baby AGI won’t share your values even if it understands them” at the time, but IIRC the essay doesn’t make that point explicitly, and some of the points it does make seem either false (wait, we’re going to be able to hand AGI a hand-written utility function? that’s somehow tractable?) or confusingly written. (Like, if my point was ‘even if you could hand-write a utility function, this fails at point X’, I should have made that ‘even if’ louder.)
Some MIRI staff liked that essay at the time, so I don’t think it’s useless, but it’s not the best evidence: I wrote it not long after I first started learning about this whole ‘superintelligence risk’ thing, and I posted it before I’d ever worked at MIRI.
To pick out a couple of specific examples from your list, Wei Dai:
This is a serious long-term concern if we don’t kill ourselves first, but it’s not something I see as a premise for “the priority is for governments around the world to form an international agreement to halt AI progress”. If AI were easy to use for concrete tasks like “build nanotechnology” but hard to use for things like CEV, I’d instead see the priority as “use AI to prevent anyone else from destroying the world with AI”, and I wouldn’t want to trade off probability of that plan working in exchange for (e.g.) more probability of the US and the EU agreeing in advance to centralize and monitor large computing clusters.
After someone has done a pivotal act like that, you might then want to move more slowly insofar as you’re worried about subtle moral errors creeping in to precursors to CEV.
I currently assign very low probability to humans being able to control the first ASI systems, and redirecting governments’ attention away from “rogue AI” and toward “rogue humans using AI” seems very risky to me, insofar as it causes governments to misunderstand the situation, and to specifically misunderstand it in a way that encourages racing.
If you think rogue actors can use ASI to achieve their ends, then you should probably also think that you could use ASI to achieve your own ends; misuse risk tends to go hand-in-hand with “we’re the Good Guys, let’s try to outrace the Bad Guys so AI ends up in the right hands”. This could maybe be justified if it were true, but when it’s not even true it strikes me as an especially bad argument to make.