I think this is all correct, but it makes me wonder.
You can imagine reinforcement learning as learning to know explicitly what reward looks like, and how to make plans to achieve it. Or you can imagine it as building a bunch of heuristics inside the agent, pulls and aversions, that don’t necessarily lead to coherent behavior out of distribution and aren’t necessarily understood by the agent. A lot of human values seem to be like this, even though humans are pretty smart. Maybe an AI will be even smarter, and subjecting it to any kind of reinforcement learning at all will automatically make it adopt explicit Machiavellian reasoning about the reward, but I’m not sure how to tell whether that’s true or not.
You might find this post helpful? Self-dialogue: Do behaviorist rewards make scheming AGIs? In it, I talk a lot about whether the algorithm is explicitly thinking about reward or not. I think it depends on the setup.
(But I don’t think anything I wrote in THIS post hinges on that. It doesn’t really matter whether (1) the AI is sociopathic because being sociopathic just seems to it like part of the right and proper way to be, versus (2) the AI is sociopathic because it is explicitly thinking about the reward signal. Same result.)
subjecting it to any kind of reinforcement learning at all
When you say that, it seems to suggest that what you’re really thinking about is the LLMs of today, for which the vast majority of their behavioral tendencies come from pretraining, and there’s just a bit of RL sprinkled on top to elevate some pretrained behavioral profiles over others. Whereas what I am normally thinking about (as are Silver & Sutton, IIUC) is that either 100% or ≈100% of the system’s behavioral tendencies are ultimately coming from some kind of RL system. This is quite an important difference! It has all kinds of implications. More on that in a (hopefully) forthcoming post.
don’t necessarily lead to coherent behavior out of distribution
As mentioned in my other comment, I’m assuming (like Sutton & Silver) that we’re doing continuous learning, as is often the case in the RL literature (unlike LLMs). So every time the agent does some out-of-distribution thing, it stops being out of distribution! So yes there are a couple special cases in which OOD stuff is important (namely irreversible actions and deliberately not exploring certain states), but you shouldn’t be visualizing the agent as normally spending its time in OOD situations—that would be self-contradictory. :)
Thanks for the link! It’s indeed very relevant to my question.
I have another question, maybe a bit philosophical. Humans seem to reward-hack in some aspects of value, but not in others. For example, if you offered a mathematician a drug that would make them feel like they had solved the Riemann hypothesis, they’d probably refuse. But humans aren’t magical: we are some combination of reinforcement learning, imitation learning, and so on. So there’s got to be some non-magical combination of these learning methods that would refuse reward hacking, at least in some cases. Do you have any thoughts on what it could be?
Solving the Riemann hypothesis is not a “primary reward” / “innate drive” / part of the reward function for humans. What is? Among many other things, (1) the drive to satisfy curiosity / alleviate confusion, and (2) the drive to feel liked / admired. And solving the Riemann hypothesis leads to both of those things. I would surmise that (1) and/or (2) is underlying people’s desire to solve the Riemann hypothesis, although that’s just a guess. They’re envisioning solving the Riemann hypothesis and thus getting the (1) and/or (2) payoff.
So one way that people “reward hack” for (1) and (2) is that they find (1) and (2) motivating and work hard and creatively towards triggering them in all kinds of ways, e.g. crossword puzzles for (1) and status-seeking for (2).
Relatedly, if you tell the mathematician “I’ll give you a pill that I promise will lead to you experiencing a massive hit of both ‘figuring out something that feels important to you’ and ‘reveling in the admiration of people who feel important to you’. But you won’t actually solve the Riemann hypothesis.” They’ll say “Well, hmm, sounds like I’ll probably solve some other important math problem instead. Works for me! Sign me up!”
If instead you say “I’ll give you a pill that leads to a false memory of having solved the Riemann hypothesis”, they’ll say no. After all, the payoff is (1) and (2), and it isn’t there.
If instead you say “I’ll give you a pill that leads to the same good motivating feeling that you’d get from (1) and (2), but it’s not actually (1) or (2)”, they’ll say “you mean, like, cocaine?”, and you say “yeah, something like that”, and they say “no thanks”. This is the example of reward misgeneralization that I mentioned in the post—deliberately avoiding addictive drugs.
If instead you say “I’ll secretly tell you the solution to the Riemann hypothesis, and you can take full credit for it, so you get all the (2)”, … at least some people would say yes. I feel like, in the movie trope where people have a magic wish, they sometimes wish to be widely liked and famous without having to do any of the hard work to get there.
The interesting question for this one is, why would anybody not say yes? And I think the answer is: those funny non-behaviorist human social instincts. Basically, in social situations, primary rewards fire in a way that depends in a complicated way on what you’re thinking about. In particular, I’ve been using the term drive to feel liked / admired, and that’s part of it, but it’s also an oversimplification that hides a lot of more complex wrinkles. The upshot is that lots of people would not feel motivated by the prospect of feeling liked / admired under false pretenses, or more broadly feeling liked / admired for something that has neutral-to-bad vibes in their own mind.
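To make the “non-behaviorist” distinction concrete, here is a minimal toy sketch (my own illustration, not anything from the post, and all the field names are hypothetical) of a social reward term that reads the agent’s internal appraisal, rather than only external observations:

```python
# Toy sketch (illustrative only; all names hypothetical) contrasting a
# "behaviorist" social reward, which depends only on external observations,
# with a "non-behaviorist" one, which also depends on what the agent is
# currently thinking / believing.

from dataclasses import dataclass

@dataclass
class MentalState:
    perceived_admiration: float   # how much admiration the agent observes
    believes_it_is_earned: bool   # internal appraisal, invisible from outside

def behaviorist_social_reward(s: MentalState) -> float:
    # Fires purely as a function of the observed admiration.
    return s.perceived_admiration

def non_behaviorist_social_reward(s: MentalState) -> float:
    # Fires differently depending on the agent's own appraisal:
    # admiration under false pretenses is discounted, even aversive.
    if s.believes_it_is_earned:
        return s.perceived_admiration
    return -0.5 * s.perceived_admiration

# Taking credit for a stolen proof feels hollow under the second term:
print(non_behaviorist_social_reward(MentalState(1.0, False)))  # -0.5
print(non_behaviorist_social_reward(MentalState(1.0, True)))   #  1.0
```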
I think it helps. The link to “non-behaviorist rewards” seems the most relevant. The way I interpret it (correct me if I’m wrong) is that we can have different feelings in the present about future A vs future B, and act to choose one of them, even if we predict our future feelings to be the same in both cases. For example, button A makes a rabbit disappear and gives you an amnesia pill, and button B makes a rabbit disappear painfully and gives you an amnesia pill.
The followup question then is, what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way. Do you already have some crisp answers about this?
The RL algorithms that people talk about in AI traditionally feature an exponentially-discounted sum of future rewards, but I don’t think there are any exponentially-discounted sums of future rewards in biology (more here). Rather, you have an idea (“I’m gonna go to the candy store”), and the idea seems good or bad, and if it seems sufficiently good, then you do it! (More here.) It can seem good for lots of different reasons. One possible reason is: the idea is immediately associated with (non-behaviorist) primary reward. Another possible reason is: the idea involves some concept that seems good, and the concept seems good in turn because it has tended to immediately precede primary reward in the past. Thus, when the idea “I’m gonna go to the candy store” pops into your head, that incidentally involves the “eating candy” concept also being rather active in your head (active right now, as you entertain that idea), and the “eating candy” concept is motivating (because it has tended to immediately precede primary reward), so the idea seems good and off you go to the store.
“We predict our future feelings” is an optional thing that might happen, but it’s just a special case of the above, the way I think about it.
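Here is a minimal toy sketch of that mechanism, just to make it concrete. This is my own simplification, not a claim about the actual brain algorithm, and the concept names are made up:

```python
# Toy sketch: each concept's value is learned from how often it immediately
# precedes primary reward, and a candidate idea is scored by the values of
# the concepts it activates right now. No exponentially-discounted return.

concept_value = {}  # learned valence of each concept

def update_on_experience(active_concepts, primary_reward, lr=0.1):
    # Concepts active just before primary reward drift toward that reward.
    for c in active_concepts:
        v = concept_value.get(c, 0.0)
        concept_value[c] = v + lr * (primary_reward - v)

def idea_seems_good(idea_concepts) -> float:
    # Valence of entertaining an idea = sum of valences of the concepts
    # that the idea makes active in the world-model right now.
    return sum(concept_value.get(c, 0.0) for c in idea_concepts)

# "Eating candy" has preceded primary reward many times...
for _ in range(20):
    update_on_experience({"eating candy"}, primary_reward=1.0)

# ...so the idea "go to the candy store" (which activates that concept)
# now seems good, and the agent acts on it.
print(idea_seems_good({"go to candy store", "eating candy"}))  # ~0.88
```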
what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way.
This doesn’t really parse for me … The reward function is an input to learning; it’s not itself learned, right? (Well, you can put separate learning algorithms inside the reward function if you want to.) Anyway, I’m all in on model-based RL. I don’t think imitation learning is a separate thing for humans, for reasons discussed in §2.3.
I thought about it some more and want to propose another framing.
The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel. The agent’s feelings can even be identical in future A vs future B, but the agent can choose future A anyway. Or maybe one of the futures won’t even have feelings involved: imagine an environment where any mistake kills the agent. In such an environment, RL is impossible.
The reason we can function in such environments, I think, is because we aren’t the main learning process involved. Evolution is. It’s a kind of RL for which the death of one creature is not the end. In other words, we can function because we’ve delegated a lot of learning to outside processes, and do rather little of it ourselves. Mostly we execute strategies that evolution has learned, on top of that we execute strategies that culture has learned, and on top of that there’s a very thin layer of our own learning. (Btw, here I disagree with you a bit: I think most of human learning is imitation. For example, the way kids pick up language and other behaviors from parents and peers.)
This suggests to me that if we want the rubber to meet the road—if we want the agent to have behaviors that track the world, not just the agent’s own feelings—then the optimization process that created the agent cannot be the agent’s own RL. By itself, RL can only learn to care about “behavioral reward” as you put it. Caring about the world can only occur if the agent “inherits” that caring from some other process in the world, by makeup or imitation.
This conclusion might be a bit disappointing, because finding the right process to “inherit” from isn’t easy. Evolution depends on one specific goal (procreation) and is not easy to adapt to other goals. However, evolution isn’t the only such process. There is also culture, and there is also human intelligence, which hopefully tracks reality a little bit. So if we want to design agents that will care about human flourishing, we can’t hope that the agents will learn it by some clever RL. It has to be due to the agent’s makeup or imitation.
This is all a bit tentative, I was just writing out the ideas as they came. Not sure at all that any of it is right. But anyway what do you think?
The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel.
I think RL agents (at least, of the type I’ve been thinking about) tend to “want” salient real-world things that have (in the past) tended to immediately precede the reward signal. They don’t “want” the reward signal itself—at least, not primarily. This isn’t a special thing that requires non-behaviorist rewards, rather it’s just the default outcome of TD learning (when set up properly). I guess you’re disagreeing with that, but I’m not quite sure why.
In other words, specification gaming is broader than wireheading. So for example, a “behaviorist reward” would be “+1 if the big red reward button on the wall gets pressed”. An agent with that reward, who is able to see the button, will probably wind up wanting the button to be pressed, and seizing control of the button if possible. But if the button is connected to a wire behind the wall, the agent won’t necessarily want to bypass the button by breaking open the wall and cutting the wire and shorting it. (If it did break open the wall and short the wire, then its value function would update, and it would get “addicted” to that, and going forward it would want to do that again, and things like it, in the future. But it won’t necessarily want to do that in the first place. This is an example of goal misgeneralization.)
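Here is a toy sketch of that dynamic (my own illustration, with made-up feature names and numbers): the value function only assigns credit to features that actually occurred right before reward during training, so “button pressed” gets wanted and “wire shorted” does not, even if the agent understands the causal connection.

```python
# Toy sketch of why a TD-trained agent ends up "wanting" the salient
# precursor of reward (button pressed) rather than the wire behind the wall.

values = {"button_pressed": 0.0, "wire_shorted": 0.0}

def td_update(features_present, reward, lr=0.2):
    for f in features_present:
        values[f] += lr * (reward - values[f])

# Training experience: the button gets pressed, reward follows.
for _ in range(50):
    td_update({"button_pressed"}, reward=1.0)

# "wire_shorted" never occurred during training, so it never acquired value.
print(values)  # {'button_pressed': ~1.0, 'wire_shorted': 0.0}

# So a plan scored by this value function prefers pressing the button;
# shorting the wire isn't (yet) appealing, even if the agent understands
# that it would cause the same reward signal.
def plan_value(plan_features):
    return sum(values.get(f, 0.0) for f in plan_features)

print(plan_value({"button_pressed"}))  # high
print(plan_value({"wire_shorted"}))    # 0.0
```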
imagine an environment where any mistake kills the agent. In such an environment, RL is impossible. The reason we can function in such environments, I think, is because we aren’t the main learning process involved. Evolution is.
Evolution partly acts through RL. For example, falling off a cliff often leads to death, so we evolved fear of heights, which is an innate drive / primary reward that then leads (via RL) to flexible intelligent cliff-avoiding behavior.
But also evolution has put a number of things into our brains that are not the main RL system. For example, there’s an innate reflex controlling how and when to vomit. (You obviously don’t learn how and when to vomit by trial-and-error!)
here I disagree with you a bit: I think most of human learning is imitation
When I said human learning is not “imitation learning”, I was using the latter term in a specific algorithmic sense as described in §2.3.2. I certainly agree that people imitate other people and learn from culture, and that this is a very important fact about humans. I just think it happens through the human brain’s within-lifetime RL system—humans generally imitate other people and copy culture because they want to, because it feels like a good and right thing to do.
Do you think the agent will care about the button and ignore the wire, even if during training it already knew that buttons are often connected to wires? Or does it depend on the order in which the agent learns things?
In other words, are we hoping that RL will make the agent focus on certain aspects of the real world that we want it to focus on? If that’s the plan, to me at first glance it seems a bit brittle. A slightly smarter agent would turn its gaze slightly closer to the reward itself. Or am I still missing something?
even if during training it already knew that buttons are often connected to wires
I was assuming that the RL agent understands how the button works and indeed has a drawer of similar buttons in its basement which it attaches to wires all the time for its various projects.
A slightly smarter agent would turn its gaze slightly closer to the reward itself.
I’d like to think I’m pretty smart, but I don’t want to take highly-addictive drugs.
Although maybe your perspective is “I don’t want to take cocaine → RL is the wrong way to think about what the human brain is doing”, whereas my perspective is “RL is the right way to think about what the human brain is doing → RL does not imply that I want to take cocaine”?? As they say, one man’s modus ponens is another man’s modus tollens. If it helps, I have more discussion of wireheading here.
If that’s the plan, to me at first glance it seems a bit brittle.
I don’t claim to have any plan at all, let alone a non-brittle one, for a reward function (along with training environment etc.) such that an RL agent superintelligence with that reward function won’t try to kill its programmers and users, and I claim that nobody else does either. That was my thesis here.
…But separately, if someone says “don’t even bother trying to find such a plan, because no such plan exists, this problem is fundamentally impossible”, then I would take the other side and say “That’s too strong. You might be right, but my guess is that a solution probably exists.” I guess that’s the argument we’re having here?
If so, one reason for my surmise that a solution probably exists is the fact that at least some humans seem to have good values, including some very smart and ambitious humans.
And see also “The bio-determinist child-rearing rule of thumb” here, which implies that innate drives can have predictable results in adult desires and personality, robust to at least some variation in training environment. [But more wild variation in training environment, e.g. feral children, does seem to matter.] And also Heritability, Behaviorism, and Within-Lifetime RL.
My perspective (well, the one that came to me during this conversation) is indeed “I don’t want to take cocaine → human-level RL is not the full story”: that our attachment to real-world outcomes and our reluctance to wirehead are due to evolution-level RL, not human-level RL. So I’m not quite saying all plans will fail; but I am indeed saying that plans relying only on RL within the agent itself will have wireheading as an attractor, and it might be better to look at other plans.
It’s just awfully delicate. If the agent is really dumb, it will enjoy watching videos of the button being pressed (after all, they cause the same sensory experiences as watching the actual button being pressed). Make the agent a bit smarter, because we want it to be useful, and it’ll begin to care about the actual button being pressed. But add another increment of smart, overshoot just a little bit, and it’ll start to realize that behind the button there’s a wire, and the wire leads to the agent’s own reward circuit and so on.
Can you engineer things just right, so the agent learns to care about just the right level of “realness”? I don’t know, but I think in our case evolution took a different path. It did a bunch of learning by itself, and saddled us with the result: “you’ll care about reality in this specific way”. So maybe when we build artificial agents, we should also do a bunch of learning outside the agent to capture the “realness”? That’s the point I was trying to make a couple comments ago, but maybe didn’t phrase it well.
Thanks! I’m assuming continuous online learning (as is often the case for RL agents, but is less common in an LLM context). So if the agent sees a video of the button being pressed, they would not feel a reward immediately afterwards, and they would say “oh, that’s not the real thing”.
(In the case of humans, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable, and then turns it off and probably won’t bother even trying again in the future.)
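As a toy illustration of why continuous online learning matters here (my own sketch, with made-up numbers): a cue that stops being followed by reward quickly loses its learned value, whether that is the video of the button or the jazz during depression.

```python
# Toy sketch: with online learning, the learned value of a cue decays
# when the expected reward repeatedly fails to arrive.

value_of_cue = 0.9   # learned from past experiences where reward followed
lr = 0.2             # learning rate (made up)

for _ in range(10):
    reward_actually_received = 0.0   # watching the video / the joyless jazz
    value_of_cue += lr * (reward_actually_received - value_of_cue)

print(round(value_of_cue, 3))  # ~0.097 -> "oh, that's not the real thing"
```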
Wireheading is indeed an attractor, just like getting hooked on an addictive drug is an attractor. As soon as you try it, your value function will update, and then you’ll want to do it again. But before you try it, your value function has not updated, and it’s that not-updated value function that gets to evaluate whether taking an addictive drug is a good plan or bad plan. See also my discussion of “observation-utility agents” here. I don’t think you can get hooked on addictive drugs just by deeply understanding how they work.
So by the same token, it’s possible for our hypothetical agent to think that the pressing of the actual wired-up button is the best thing in the world. Cutting into the wall and shorting the wire would be bad, because it would destroy the thing that is best in the world, while also brainwashing me to not even care about the button, which adds insult to injury. This isn’t a false belief—it’s an ought not an is. I don’t think it’s reflectively-unstable either.
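To spell that out as a toy calculation (again my own sketch; the feature names and numbers are made up): before the wire has ever been shorted, the plan is scored by the current value function, under which destroying the button arrangement looks bad, not good.

```python
# Toy sketch of "it's the not-updated value function that evaluates the plan":
# wireheading is an attractor only once entered, not before.

current_values = {
    "button_pressed": 1.0,      # learned: best thing in the world
    "button_destroyed": -0.5,   # learned: losing the button is bad
    "wire_shorted": 0.0,        # never experienced, no learned value
}

def evaluate_plan(plan_features):
    return sum(current_values.get(f, 0.0) for f in plan_features)

keep_pressing = {"button_pressed"}
wirehead = {"wire_shorted", "button_destroyed"}

print(evaluate_plan(keep_pressing))  #  1.0 -> chosen
print(evaluate_plan(wirehead))       # -0.5 -> rejected, even though it would
                                     # maximize the raw reward signal
# (If the agent did try it, the values would update and it would get
# "addicted"; wireheading is an attractor, but only once entered.)
```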
Another example I want to consider is a captured Communist revolutionary choosing to be tortured to death instead of revealing some secret. (Here “reward hacking” seems analogous to revealing the secret to avoid torture / negative reward.)
My (partial) explanation is that it seems like evolution hard-wired a part of our motivational system in some way, to be kind of like a utility maximizer with a utility function over world states or histories. The “utility function” itself is “learned” somehow (maybe partly via RL through social pressure and other rewards, partly through “reasoning”), but sometimes it gets locked-in or resistant to further change.
At a higher level, I think evolution had a lot of time to tinker, and could have made our neural architecture (e.g. connections) and learning algorithms (e.g. how/which weights are updated in response to training signals) quite complex, in order to avoid types of “reward hacking” that had the highest negative impacts in our EEA. I suspect understanding exactly what evolution did could be helpful, but it would probably be a mishmash of things that only partially solve the problem, which would still leave us confused as to how to move forward with AI design.
My model is simpler, I think. I say: The human brain is some yet-to-be-invented variation on actor-critic model-based reinforcement learning. The reward function (a.k.a. “primary reward” a.k.a. “innate drives”) has a bunch of terms: eating-when-hungry is good, suffocation is bad, pain is bad, etc. Some of the terms are in the category of social instincts, including something that amounts to “drive to feel liked / admired”.
(Warning: All of these English-language descriptions like “pain is bad” are approximate glosses on what’s really going on, which is only describable in terms like “blah blah innate neuron circuits in the hypothalamus are triggering each other and gating each other etc.”. See my examples of laughter and social instincts. But “pain is bad” is a convenient shorthand.)
So for the person getting tortured, keeping the secret is negative reward in some ways (because pain), and positive reward in other ways (because of “drive to feel liked / admired”). At the end of the day, they’ll do what seems most motivating, which might or might not be to reveal the secret, depending on how things balance out.
So in particular, I disagree with your claim that, in the torture scenario, “reward hacking” → reveal the secret. The social rewards are real rewards too.
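Here is a toy version of that accounting (my own gloss on the “reward function has a bunch of terms” picture; the weights are entirely made up):

```python
# Toy sketch: the reward function is a sum of innate-drive terms, and the
# social terms are real reward terms too, not something outside the RL system.

def primary_reward(state):
    r = 0.0
    r += -10.0 * state["pain"]                   # pain is bad
    r += -5.0 * state["hunger"]                  # hunger is aversive (eating relieves it)
    r += 8.0 * state["anticipated_admiration"]   # drive to feel liked / admired
    return r

hold_out = {"pain": 1.0, "hunger": 0.0, "anticipated_admiration": 1.5}
reveal   = {"pain": 0.0, "hunger": 0.0, "anticipated_admiration": -0.5}

print(primary_reward(hold_out))  #  2.0
print(primary_reward(reveal))    # -4.0
# With these (made-up) weights, keeping the secret seems more motivating;
# with different weights it could go the other way, "depending on how
# things balance out".
```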
evolution…could have made our neural architecture (e.g. connections) and learning algorithms (e.g. how/which weights are updated in response to training signals) quite complex, in order to avoid types of “reward hacking” that had the highest negative impacts in our EEA.
I’m unaware of any examples where neural architectures and learning algorithms are micromanaged to avoid reward hacking. Yes, avoiding reward hacking is important, but I think it’s solvable (in EEA) by just editing the reward function. (Do you have any examples in mind?)
My intuition says reward hacking seems harder to solve than this (even in EEA), but I’m pretty unsure. One example is, under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?
When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?
under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?
It’s hard for me to give a perfectly confident answer because I don’t understand everything about human social instincts yet :) But here are some ways I’m thinking about that:
Best friends, and friend groups, do exist, and people do really enjoy and value them.
The “drive to feel liked / admired” is particularly geared towards feeling liked / admired by people who feel important to you, i.e. where interacting with them is high stakes (induces physiological arousal). (I got that wrong the first time, but changed my mind here.) Physiological arousal is in turn grounded in other ways—e.g., pain, threats, opportunities, etc. So anyway, if Zoe has extreme liking / admiration for me, so much that she’ll still think I’m great and help me out no matter what I say or do, then from my perspective, it may start feeling lower-stakes for me to interact with Zoe. Without that physiological arousal, I stop getting so much reward out of her admiration (“I take her for granted”) and go looking for a stronger hit of drive-to-feel-liked/admired by trying to impress someone else who feels higher-stakes for me to interact with. (Unless there’s some extra sustainable source of physiological arousal in my relationship with Zoe, e.g. we’re madly in love, or she’s the POTUS, or whatever.)
As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent. So if I’m doing exactly what I want, and Zoe is strongly following / deferring to me on everything, then that’s a credible and costly signal that Zoe likes / admires me (or fears me). And we can’t really both be sending those signals to each other simultaneously. Also, sending those signals runs up against every other aspect of my own reward function—eating-when-hungry, curiosity drive, etc., because those determine the object-level preferences that I would be neglecting in favor of following Zoe’s whims.
When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?
As I always say, my claim is “understanding human social instincts seems like it would probably be useful for Safe & Beneficial AGI”, not “here is my plan for Safe & Beneficial AGI…”. :) That said, copying from §5.2 here:
…Alternatively, there’s a cop-out option! If the outcome of the [distributional shifts] is so hard to predict, we could choose to endorse the process while being agnostic about the outcome.
There is in fact a precedent for that—indeed, it’s the status quo! We don’t know what the next generation of humans will choose to do, but we’re nevertheless generally happy to entrust the future to them. The problem is the same—the next generation of humans will [figure out new things and invent new technologies], causing self-generated distribution shifts, and ending up going in unpredictable directions. But this is how it’s always been, and we generally feel good about it. [Not that we’ve ever had a choice!]
So the idea here is, we could make AIs with social and moral drives that we’re happy about, probably because those drives are human-like in some respects [but probably not in all respects!], and then we could define “good” as whatever those AIs wind up wanting, upon learning and reflection.
As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent.
Why does this happen in the first place, instead of people just wanting to talk about the same things all the time, in order to max out social rewards? Where does interest in trains and dinosaurs even come from? These interests seem to be purely or mostly social, given their lack of practical utility, but then why the divergence in interests? (Understood that you don’t have a complete understanding yet, so I’m just flagging this as a potential puzzle, not demanding an immediate answer.)
There is in fact a precedent for that—indeed, it’s the status quo! We don’t know what the next generation of humans will choose to do, but we’re nevertheless generally happy to entrust the future to them.
I’m only “happy” to “entrust the future to the next generation of humans” if I know that they can’t (i.e., don’t have the technology to) do something irreversible, in the sense of foreclosing a large space of potential positive outcomes, like locking in their values, or damaging the biosphere beyond repair. In other words, up to now, any mistakes that a past generation of humans made could be fixed by a subsequent generation, and this is crucial for why we’re still in an arguably OK position. However, AI will make this false very quickly by advancing technology.
So I really want the AI transition to be an opportunity for improving the basic dynamic of “the next generation of humans will [figure out new things and invent new technologies], causing self-generated distribution shifts, and ending up going in unpredictable directions”, for example by improving our civilizational philosophical competency (which may allow distributional shifts to be handled in a more principled way), and not just say that it’s always been this way, so it’s fine to continue.
I’m going to read the rest of that post and your other posts to understand your overall position better, but at least in this section, you come off as being a bit too optimistic or nonchalant from my perspective...
There are lots of different reasons. Here’s one vignette that popped into my head. Billy and Joey are 8yo’s. Joey read a book about trains last night, and Billy read a book about dinosaurs last night. Each learned something new and exciting to them, because trains and dinosaurs are cool (big and loud and mildly scary, which triggers physiological arousal and thus play drive). Now they’re meeting each other at recess, and each is very eager to share what they learned with the other, and thus they’re in competition over the conversation topic—Joey keeps changing the subject to trains, Billy to dinosaurs. Why do they each want to share / show off their new knowledge? Because of “drive to feel liked / admired”. From Joey’s perspective, the moment of telling Billy a cool new fact about trains, and seeing the excited look on Billy’s face, is intrinsically motivating. And vice-versa for Billy. But they can’t both satisfy that innate drive simultaneously.
OK, then you’ll say, why didn’t they both read the same book last night? Well, maybe the library only had one copy. But also, if they had both read the same book, and thus both learned the same cool facts about bulldozers, then that’s a lose-lose instead of a win-win! Neither gets to satisfy their drive to feel liked/admired by thrilling the other with cool exciting new-to-them facts.
Do you buy that? Happy to talk through more vignettes.
I’m only “happy” to “entrust the future to the next generation of humans” if I know that they can’t (i.e., don’t have the technology to) do something irreversible…
OK, here’s another way to put it. When humans do good philosophy and ethics, they’re combining their ability to observe and reason (the source of all “is”) with their social and moral instincts / reflexes (the source of all “ought”, see Valence §2.7).
If it’s possible at all for this process to lead somewhere good, then it’s possible for it to lead somewhere good within the mind of an AI that combines a human-like ability to reason, with human-like social and moral instincts / reflexes. (And if it’s not possible for this process to lead somewhere good, then we’re screwed no matter what.)
Indeed, there are reasons to think the odds of this process leading somewhere good might be better in an AI than in humans, since the programmers can in principle turn the knobs on the various social and moral drives. (Some humans are more motivated by status and self-aggrandizement than others, and I think this is related to interpersonal variation in the strengths of different innate drives. It would be nice to just have a knob in the source code for that!) To be clear, there are also reasons to think the odds might be worse in an AI, since humans are at least a known quantity, and we might not really know what we’re doing in AGI programming until it’s too late.
On my current thinking, it will be very difficult to avoid fast takeoff and unipolar outcomes, for better or worse. (Call me old-fashioned.) (I’m stating this without justification.) If the resulting singleton ASI has the right ingredients to do philosophy and ethics (i.e., it has both the ability to reason, and human-like social and moral instincts / reflexes), but with superhuman speed and insight, then the future could turn out great. Is it an irreversible single-point-of-failure for the future of the universe? You bet! And that’s bad. But I’m currently pretty skeptical that we can avoid that kind of thing (e.g. I expect very low compute requirements that make AGI governance / anti-proliferation rather hopeless), so I’m more in the market for “least bad option” rather than “ideal plan”.
If it’s possible at all for this process to lead somewhere good, then it’s possible for it to lead somewhere good within the mind of an AI that combines a human-like ability to reason, with human-like social and moral instincts / reflexes.
A counterexample to this is if humans and AIs both tend to conclude after a lot of reflection that they should be axiologically selfish but decision theoretically cooperative (with other strong agents), then if we hand off power to AIs, they’ll cooperate with each other (and any other powerful agents in the universe or multiverse) to serve their own collective values, but we humans will be screwed.
Another problem is that we’re relatively confident that at least some humans can reason “successfully”, in the sense of making philosophical progress, but we don’t know the same about AI. There seemingly are reasons to think it might be especially hard for AI to learn this, and easy for AI to learn something undesirable instead, like optimizing for how persuasive their philosophical arguments are to (certain) humans.
Finally, I find your arguments against moral realism somewhat convincing, but I’m still pretty uncertain, and I still find the arguments I gave in Six Plausible Meta-Ethical Alternatives for the realism side of the spectrum somewhat convincing as well, so I don’t want to bet the universe on or against any of these positions.
RE 2 – I was referring here to (what I call) “brain-like AGI”, a yet-to-be-invented AI paradigm in which both “human-like ability to reason” and “human-like social and moral instincts / reflexes” are meant in a nuts-and-bolts sense, like the AI is actually doing the same kinds of algorithmic steps that a human brain would do. Human brains are quite different from LLMs, even if their text outputs can look similar. For example, small groups of humans can invent grammatical languages from scratch, and of course historically humans invented science and tech and philosophy and so on from scratch. So bringing up “training data” is kinda the wrong idea. Indeed, for humans (and brain-like AGI), we should be talking about “training environments”, not “training data”—more like the RL agent paradigm of the late 2010s than like LLMs, at least in some ways.
I do agree that we shouldn’t trust LLMs to make good philosophical progress that goes way beyond what’s already in their human-created training data.
RE 1 – Let’s talk about feelings of friendship, compassion, and connection. These feelings are unnecessary for cooperation, right? Logical analysis of costs vs benefits of cooperation, including decision theory, reputational consequences, etc., are all you need for cooperation to happen. (See §2-3 here.) But for me and almost anyone, a future universe with no feelings of friendship, compassion, and connection in it seems like a bad thing that I don’t want to happen. I find it hard to believe that sufficient reflection would change my opinion on that [although I have some niggling concerns about technological progress]. “Selfishness” isn’t even a coherent concept unless the agent intrinsically wants something, and innate drives are upstream of what it wants, and those feelings of friendship, compassion etc. can be one of those innate drives, potentially a very strong one. Then, yes, there’s a further question about whom those feelings will be directed towards—AIs, humans, animals, teddy bears, or what? It needs to start with innate reflexes to certain stimuli, which then get smoothed out upon reflection. I think this is something where we’d need to think very carefully about the AI design, training environment, etc. It might be easier to think through what could go right and wrong after developing a better understanding of human social instincts, and developing a more specific plan for the AI. Anyway, I certainly agree that there are important potential failure modes in this area.
RE 3 – I’m confused about how this plan would be betting the universe on a particular meta-ethical view, from your perspective. (I’m not expressing doubt, I’m just struggling to see things from outside my own viewpoint here.)
By the way, my perspective again is “this might be the least-bad plausible plan”, as opposed to “this is a great plan”. But to defend the “least-bad” claim, I would need to go through all the other possible plans that seem better, and why I don’t find them plausible (on my idiosyncratic models of AGI development). I mentioned my skepticism about AGI pause / anti-proliferation above, and I have a (hopefully) forthcoming post that should talk about how I’m thinking about corrigibility and other stuff. (But I’m also happy to chat about it here; it would probably help me flesh out that forthcoming post tbh.)
For example, small groups of humans can invent grammatical languages from scratch, and of course historically humans invented science and tech and philosophy and so on from scratch.
I think this could be part of a viable approach, for example if we figure out in detail how humans invented philosophy and use that knowledge to design/train an AI that we can have high justified confidence will be philosophically competent. I’m worried that in actual development of brain-like AGI, we will skip this part (because it’s too hard, or nobody pushes for it), and end up just assuming that the AGI will invent or learn philosophy because it’s brain-like. (And then it ends up not doing that because we didn’t give it some “secret sauce” that humans have.) And this does look fairly hard to me, because we don’t yet understand the nature of philosophy or what constitutes correct philosophical reasoning or philosophical competence, so how do we study these things in either humans or AIs?
But for me and almost anyone, a future universe with no feelings of friendship, compassion, and connection in it seems like a bad thing that I don’t want to happen. I find it hard to believe that sufficient reflection would change my opinion on that [although I have some niggling concerns about technological progress].
I find it pretty plausible that wireheading is what we’ll end up wanting, after sufficient reflection. (This could be literal wireheading, or something more complex like VR with artificial friends, connections, etc.) This seems to me to be the default, unless we have reasons to want to avoid it. Currently my reasons are (1) value uncertainty (maybe we’ll eventually find good intrinsic reasons to not want to wirehead), and (2) opportunity costs (if I wirehead now, it’ll cost me in terms of both quality and quantity versus if I wait for more resources and security). But it seems foreseeable that both of these reasons will go away at some point, and we may not have found other good reasons to avoid wireheading by then.
(Of course if the right values for us are to max out our wireheading, then we shouldn’t hand the universe off to AGIs that will want to max out their wireheading. Also, this is just the simplest example of how brain-like AGIs’ values could conflict with ours.)
I’m confused about how this plan would be betting the universe on a particular meta-ethical view, from your perspective.
You’re prioritizing instilling AGI with the right social instincts, and arguing that’s very important for getting the AGI to eventually converge on values that we’d consider good. But under different meta-ethical views, you should perhaps be prioritizing something else. For example, under moral realism, it doesn’t matter what social instincts the AGI starts out with, what’s more important is that it has the capacity to eventually find and be motivated by objective moral truths.
From my perspective, in order to not bet the universe on a particular meta-ethical view, it seems that we need to either hold off on building AGI until we definitively solve metaethics (e.g., it’s no longer a contentious subject in academic philosophy), or have an approach/plan that will work out well regardless of which meta-ethical view turns out to be correct.
By the way, my perspective again is “this might be the least-bad plausible plan”, as opposed to “this is a great plan”.
Thanks, I appreciate this, but of course still feel compelled to speak up when I see areas where you seem overly optimistic and/or missing some potential risks/failure modes.
I think this is all correct, but it makes me wonder.
You can imagine reinforcement learning as learning to know explicitly how reward looks like, and how to make plans to achieve it. Or you can imagine it as building a bunch of heuristics inside the agent, pulls and aversions, that don’t necessarily lead to coherent behavior out of distribution and aren’t necessarily understood by the agent. A lot of human values seem to be like this, even though humans are pretty smart. Maybe an AI will be even smarter, and subjecting it to any kind of reinforcement learning at all will automatically make it adopt explicit Machiavellian reasoning about the thing, but I’m not sure how to tell if it’s true or not.
You might find this post helpful? Self-dialogue: Do behaviorist rewards make scheming AGIs? In it, I talk a lot about whether the algorithm is explicitly thinking about reward or not. I think it depends on the setup.
(But I don’t think anything I wrote in THIS post hinges on that. It doesn’t really matter whether (1) the AI is sociopathic because being sociopathic just seems to it like part of the right and proper way to be, versus (2) the AI is sociopathic because it is explicitly thinking about the reward signal. Same result.)
When you say that, it seems to suggest that what you’re really thinking about is really the LLMs as of today, for which the vast majority of their behavioral tendencies comes from pretraining, and there’s just a bit of RL sprinkled on top to elevate some pretrained behavioral profiles over other pretrained behavioral profiles. Whereas what I am normally thinking about (as are Silver & Sutton, IIUC) is that either 100% or ≈100% of the system’s behavioral tendencies are ultimately coming from some kind of RL system. This is quite an important difference! It has all kinds of implications. More on that in a (hopefully) forthcoming post.
As mentioned in my other comment, I’m assuming (like Sutton & Silver) that we’re doing continuous learning, as is often the case in the RL literature (unlike LLMs). So every time the agent does some out-of-distribution thing, it stops being out of distribution! So yes there are a couple special cases in which OOD stuff is important (namely irreversible actions and deliberately not exploring certain states), but you shouldn’t be visualizing the agent as normally spending its time in OOD situations—that would be self-contradictory. :)
Thanks for the link! It’s indeed very relevant to my question.
I have another question, maybe a bit philosophical. Humans seem to reward-hack in some aspects of value, but not in others. For example, if you offered a mathematician a drug that would make them feel like they solved Riemann’s hypothesis, they’d probably refuse. But humans aren’t magical: we are some combination of reinforcement learning, imitation learning and so on. So there’s got to be some non-magical combination of these learning methods that would refuse reward hacking, at least in some cases. Do you have any thoughts what it could be?
Solving the Riemann hypothesis is not a “primary reward” / “innate drive” / part of the reward function for humans. What is? Among many other things, (1) the drive to satisfy curiosity / alleviate confusion, and (2) the drive to feel liked / admired. And solving the Riemann hypothesis leads to both of those things. I would surmise that (1) and/or (2) is underlying people’s desire to solve the Riemann hypothesis, although that’s just a guess. They’re envisioning solving the Riemann hypothesis and thus getting the (1) and/or (2) payoff.
So one way that people “reward hack” for (1) and (2) is that they find (1) and (2) motivating and work hard and creatively towards triggering them in all kinds of ways, e.g. crossword puzzles for (1) and status-seeking for (2).
Relatedly, if you tell the mathematician “I’ll give you a pill that I promise will lead to you experiencing a massive hit of both ‘figuring out something that feels important to you’ and ‘reveling in the admiration of people who feel important to you’. But you won’t actually solve the Riemann hypothesis.” They’ll say “Well, hmm, sounds like I’ll probably solve some other important math problem instead. Works for me! Sign me up!”
If instead you say “I’ll give you a pill that leads to a false memory of having solved the Riemann hypothesis”, they’ll say no. After all, the payoff is (1) and (2), and it isn’t there.
If instead you say “I’ll give you a pill that leads to the same good motivating feeling that you’d get from (1) and (2), but it’s not actually (1) or (2)”, they’ll say “you mean, like, cocaine?”, and you say “yeah, something like that”, and they say “no thanks”. This is the example of reward misgeneralization that I mentioned in the post—deliberately avoiding addictive drugs.
If instead you say “I’ll secretly tell you the solution to the Riemann hypothesis, and you can take full credit for it, so you get all the (2)”, … at least some people would say yes. I feel like, in the movie trope where people have a magic wish, they sometimes wish to be widely liked and famous without having to do any of the hard work to get there.
The interesting question for this one is, why would anybody not say yes? And I think the answer is: those funny non-behaviorist human social instincts. Basically, in social situations, primary rewards fire in a way that depends in a complicated way on what you’re thinking about. In particular, I’ve been using the term drive to feel liked / admired, and that’s part of it, but it’s also an oversimplification that hides a lot of more complex wrinkles. The upshot is that lots of people would not feel motivated by the prospect of feeling liked / admired under false pretenses, or more broadly feeling liked / admired for something that has neutral-to-bad vibes in their own mind.
Does that help? Sorry if I missed your point.
I think it helps. The link to “non-behaviorist rewards” seems the most relevant. The way I interpret it (correct me if I’m wrong) is that we can have different feelings in the present about future A vs future B, and act to choose one of them, even if we predict our future feelings to be the same in both cases. For example, button A makes a rabbit disappear and gives you an amnesia pill, and button B makes a rabbit disappear painfully and gives you an amnesia pill.
The followup question then is, what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way. Do you already have some crisp answers about this?
The RL algorithms that people talk in AI traditionally feature an exponentially-discounted sum of future rewards, but I don’t think there’s any exponentially-discounted sums of future rewards in biology (more here). Rather, you have an idea (“I’m gonna go to the candy store”), and the idea seems good or bad, and if it seems sufficiently good, then you do it! (More here.) It can seem good for lots of different reasons. One possible reason is: the idea is immediately associated with (non-behaviorist) primary reward. Another possible reason is: the idea involves some concept that seems good, and the concept seems good in turn because it has tended to immediately precede primary reward in the past. Thus, when the idea “I’m gonna go to the candy store” pops into your head, that incidentally involves the “eating candy” concept also being rather active in your head (active right now, as you entertain that idea), and the “eating candy” concept is motivating (because it has tended to immediately precede primary reward), so the idea seems good and off you go to the store.
“We predict our future feelings” is an optional thing that might happen, but it’s just a special case of the above, the way I think about it.
This doesn’t really parse for me … The reward function is an input to learning, it’s not itself learned, right? (Well, you can put separate learning algorithms inside the reward function if you want to.) Anyway, I’m all in on model-based RL. I don’t think imitation learning is a separate thing for humans, for reasons discussed in §2.3.
I thought about it some more and want to propose another framing.
The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel. The agent’s feelings can even be identical in future A vs future B, but the agent can choose future A anyway. Or maybe one of the futures won’t even have feelings involved: imagine an environment where any mistake kills the agent. In such an environment, RL is impossible.
The reason we can function in such environments, I think, is because we aren’t the main learning process involved. Evolution is. It’s a kind of RL for which the death of one creature is not the end. In other words, we can function because we’ve delegated a lot of learning to outside processes, and do rather little of it ourselves. Mostly we execute strategies that evolution has learned, on top of that we execute strategies that culture has learned, and on top of that there’s a very thin layer of our own learning. (Btw, here I disagree with you a bit: I think most of human learning is imitation. For example, the way kids pick up language and other behaviors from parents and peers.)
This suggests to me that if we want the rubber to meet the road—if we want the agent to have behaviors that track the world, not just the agent’s own feelings—then the optimization process that created the agent cannot be the agent’s own RL. By itself, RL can only learn to care about “behavioral reward” as you put it. Caring about the world can only occur if the agent “inherits” that caring from some other process in the world, by makeup or imitation.
This conclusion might be a bit disappointing, because finding the right process to “inherit” from isn’t easy. Evolution depends on one specific goal (procreation) and is not easy to adapt to other goals. However, evolution isn’t the only such process. There is also culture, and there is also human intelligence, which hopefully tracks reality a little bit. So if we want to design agents that will care about human flourishing, we can’t hope that the agents will learn it by some clever RL. It has to be due to the agent’s makeup or imitation.
This is all a bit tentative, I was just writing out the ideas as they came. Not sure at all that any of it is right. But anyway what do you think?
I think RL agents (at least, the of the type I’ve been thinking about) tend to “want” salient real-world things that have (in the past) tended to immediately precede the reward signal. They don’t “want” the reward signal itself—at least, not primarily. This isn’t a special thing that requires non-behaviorist rewards, rather it’s just the default outcome of TD learning (when set up properly). I guess you’re disagreeing with that, but I’m not quite sure why.
In other words, specification gaming is broader than wireheading. So for example, a “behaviorist reward” would be “+1 if the big red reward button on the wall gets pressed”. An agent with that reward, who is able to see the button, will probably wind up wanting the button to be pressed, and seizing control of the button if possible. But if the button is connected to a wire behind the wall, the agent won’t necessarily want to bypass the button by breaking open the wall and cutting the wire and shorting it. (If it did break open the wall and short the wire, then its value function would update, and it would get “addicted” to that, and going forward it would want to do that again, and things like it, in the future. But it won’t necessarily want to do that in the first place. This is an example of goal misgeneralization.)
Evolution partly acts through RL. For example, falling off a cliff often leads to death, so we evolved fear of heights, which is an innate drive / primary reward that then leads (via RL) to flexible intelligent cliff-avoiding behavior.
But also evolution has put a number of things into our brains that are not the main RL system. For example, there’s an innate reflex controlling how and when to vomit. (You obviously don’t learn how and when to vomit by trial-and-error!)
When I said human learning is not “imitation learning”, I was using the latter term in a specific algorithmic sense as described in §2.3.2. I certainly agree that people imitate other people and learn from culture, and that this is a very important fact about humans. I just think it happens though the human brain within-lifetime RL system—humans generally imitate other people and copy culture because they want to, because it feels like a good and right thing to do.
Do you think the agent will care about the button and ignore the wire, even if during training it already knew that buttons are often connected to wires? Or does it depend on the order in which the agent learns things?
In other words, are we hoping that RL will make the agent focus on certain aspects of the real world that we want it to focus on? If that’s the plan, to me at first glance it seems a bit brittle. A slightly smarter agent would turn its gaze slightly closer to the reward itself. Or am I still missing something?
I was assuming that the RL agent understands how the button works and indeed has a drawer of similar buttons in its basement which it attaches to wires all the time for its various projects.
I’d like to think I’m pretty smart, but I don’t want to take highly-addictive drugs.
Although maybe your perspective is “I don’t want to take cocaine → RL is the wrong way to think about what the human brain is doing”, whereas my perspective is “RL is the right way to think about what the human brain is doing → RL does not imply that I want to take cocaine”?? As they say, one man’s modus ponens is another man’s modus tollens. If it helps, I have more discussion of wireheading here.
I don’t claim to have any plan at all, let alone a non-brittle one, for a reward function (along with training environment etc.) such that an RL agent superintelligence with that reward function won’t try to kill its programmers and users, and I claim that nobody else does either. That was my thesis here.
…But separately, if someone says “don’t even bother trying to find such a plan, because no such plan exists, this problem is fundamentally impossible”, then I would take the other side and say “That’s too strong. You might be right, but my guess is that a solution probably exists.” I guess that’s the argument we’re having here?
If so, one reason for my surmise that a solution probably exists, is the fact that at least some humans seem to have good values, including some very smart and ambitious humans.
And see also “The bio-determinist child-rearing rule of thumb” here which implies that innate drives can have predictable results in adult desires and personality, robust to at least some variation in training environment. [But more wild variation in training environment, e.g. feral children, does seem to matter.] And also Heritability, Behaviorism, and Within-Lifetime RL
My perspective (well, the one that came to me during this conversation) is indeed “I don’t want to take cocaine → human-level RL is not the full story”. That our attachment to real world outcomes and reluctance to wirehead is due to evolution-level RL, not human-level. So I’m not quite saying all plans will fail; but I am indeed saying that plans relying only on RL within the agent itself will have wireheading as attractor, and it might be better to look at other plans.
It’s just awfully delicate. If the agent is really dumb, it will enjoy watching videos of the button being pressed (after all, they cause the same sensory experiences as watching the actual button being pressed). Make the agent a bit smarter, because we want it to be useful, and it’ll begin to care about the actual button being pressed. But add another increment of smart, overshoot just a little bit, and it’ll start to realize that behind the button there’s a wire, and the wire leads to the agent’s own reward circuit and so on.
Can you engineer things just right, so the agent learns to care about just the right level of “realness”? I don’t know, but I think in our case evolution took a different path. It did a bunch of learning by itself, and saddled us with the result: “you’ll care about reality in this specific way”. So maybe when we build artificial agents, we should also do a bunch of learning outside the agent to capture the “realness”? That’s the point I was trying to make a couple comments ago, but maybe didn’t phrase it well.
Thanks! I’m assuming continuous online learning (as is often the case for RL agents, but is less common in an LLM context). So if the agent sees a video of the button being pressed, they would not feel a reward immediately afterwards, and they would say “oh, that’s not the real thing”.
(In the case of humans, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable, and then turns it off and probably won’t bother even trying again in the future.)
Wireheading is indeed an attractor, just like getting hooked on an addictive drug is an attractor. As soon as you try it, your value function will update, and then you’ll want to do it again. But before you try it, your value function has not updated, and it’s that not-updated value function that gets to evaluate whether taking an addictive drug is a good plan or bad plan. See also my discussion of “observation-utility agents” here. I don’t think you can get hooked on addictive drugs just by deeply understanding how they work.
So by the same token, it’s possible for our hypothetical agent to think that the pressing of the actual wired-up button is the best thing in the world. From the agent’s perspective, cutting into the wall and shorting the wire would be bad, because it would destroy the thing that is best in the world, while also brainwashing the agent into not even caring about the button, which adds insult to injury. This isn’t a false belief; it’s an ought, not an is. I don’t think it’s reflectively unstable either.
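And here’s the matching toy sketch (again my own invention, with made-up outcomes and numbers) of the “it’s the not-updated value function that gets to evaluate the plan” point: in model-based planning, candidate plans are scored by the current critic, even when the agent can foresee that executing a plan would rewrite its own values.

```python
# Toy sketch: plan evaluation uses today's values, not post-wireheading values.
current_values = {
    "button_gets_pressed": +10.0,  # what the agent currently cares about
    "wire_is_shorted":      -5.0,  # destroys the thing it currently cares about
}

plans = {
    "press_button_normally": ["button_gets_pressed"],
    "short_the_wire":        ["wire_is_shorted"],  # would max out future reward,
                                                   # but that's not what gets scored
}

def evaluate(plan: str) -> float:
    # The critic doing the scoring is the agent's current critic.
    return sum(current_values[outcome] for outcome in plans[plan])

best_plan = max(plans, key=evaluate)
print(best_plan)  # -> "press_button_normally"
```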
Another example I want to consider is a captured Communist revolutionary choosing to be tortured to death instead of revealing some secret. (Here “reward hacking” seems analogous to revealing the secret to avoid torture / negative reward.)
My (partial) explanation is that it seems like evolution hard-wired a part of our motivational system in some way, to be kind of like a utility maximizer with a utility function over world states or histories. The “utility function” itself is “learned” somehow (maybe partly via RL through social pressure and other rewards, partly through “reasoning”), but sometimes it gets locked-in or resistant to further change.
At a higher level, I think evolution had a lot of time to tinker, and could have made our neural architecture (e.g. connections) and learning algorithms (e.g. how/which weights are updated in response to training signals) quite complex, in order to avoid types of “reward hacking” that had the highest negative impacts in our EEA. I suspect understanding exactly what evolution did could be helpful, but it would probably be a mishmash of things that only partially solve the problem, which would still leave us confused as to how to move forward with AI design.
My model is simpler, I think. I say: The human brain is some yet-to-be-invented variation on actor-critic model-based reinforcement learning. The reward function (a.k.a. “primary reward” a.k.a. “innate drives”) has a bunch of terms: eating-when-hungry is good, suffocation is bad, pain is bad, etc. Some of the terms are in the category of social instincts, including something that amounts to “drive to feel liked / admired”.
(Warning: Each of these English-language descriptions, like “pain is bad”, is an approximate gloss on what’s really going on, which is only describable in terms like “blah blah innate neuron circuits in the hypothalamus are triggering each other and gating each other etc.”. See my examples of laughter and social instincts. But “pain is bad” is a convenient shorthand.)
So for the person getting tortured, keeping the secret is negative reward in some ways (because pain), and positive reward in other ways (because of “drive to feel liked / admired”). At the end of the day, they’ll do what seems most motivating, which might or might not be to reveal the secret, depending on how things balance out.
So in particular, I disagree with your claim that, in the torture scenario, “reward hacking” → reveal the secret. The social rewards are real rewards too.
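If it helps, here’s how I’d caricature the “reward function with a bunch of terms” picture in toy Python. The terms and weights are invented purely for illustration, not claims about real magnitudes:

```python
# Toy sketch of a multi-term innate reward function (made-up weights).
def innate_reward(pain: float, hunger_relief: float, feels_admired: float) -> float:
    # Each term is a stand-in for "blah blah innate circuits in the hypothalamus".
    return -3.0 * pain + 1.0 * hunger_relief + 2.0 * feels_admired

# Revealing the secret: the pain stops, but the "feel liked / admired" term tanks.
reveal = innate_reward(pain=0.0, hunger_relief=0.0, feels_admired=-4.0)

# Keeping the secret: lots of pain, but a strong (anticipated) admiration payoff.
keep = innate_reward(pain=2.0, hunger_relief=0.0, feels_admired=+5.0)

print(reveal, keep)  # which one wins depends entirely on the weights
```

With these particular made-up numbers, keeping the secret wins; change the weights and it flips. That’s the “depending on how things balance out” part.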
I’m unaware of any examples where neural architectures and learning algorithms are micromanaged to avoid reward hacking. Yes, avoiding reward hacking is important, but I think it’s solvable (in EEA) by just editing the reward function. (Do you have any examples in mind?)
My intuition says reward hacking seems harder to solve than this (even in EEA), but I’m pretty unsure. One example is, under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?
When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?
It’s hard for me to give a perfectly confident answer because I don’t understand everything about human social instincts yet :) But here are some ways I’m thinking about that:
Best friends, and friend groups, do exist, and people do really enjoy and value them.
The “drive to feel liked / admired” is particularly geared towards feeling liked / admired by people who feel important to you, i.e. where interacting with them is high stakes (induces physiological arousal). (I got that wrong the first time, but changed my mind here.) Physiological arousal is in turn grounded in other ways—e.g., pain, threats, opportunities, etc. So anyway, if Zoe has extreme liking / admiration for me, so much that she’ll still think I’m great and help me out no matter what I say or do, then from my perspective, it may start feeling lower-stakes for me to interact with Zoe. Without that physiological arousal, I stop getting so much reward out of her admiration (“I take her for granted”) and go looking for a stronger hit of drive-to-feel-liked/admired by trying to impress someone else who feels higher-stakes for me to interact with. (Unless there’s some extra sustainable source of physiological arousal in my relationship with Zoe, e.g. we’re madly in love, or she’s the POTUS, or whatever.)
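One crude way to formalize that gating (my own toy rendering; I’m not claiming the brain literally computes this product): treat the reward from feeling liked/admired as multiplied by how high-stakes, i.e. physiologically arousing, the other person feels.

```python
# Toy sketch: admiration reward gated by physiological arousal (both in [0, 1]).
def admiration_reward(admiration: float, arousal: float) -> float:
    # Unconditional admiration from someone who feels zero-stakes yields
    # little reward, no matter how strong the admiration is.
    return admiration * arousal

print(admiration_reward(admiration=1.0, arousal=0.1))  # Zoe, taken for granted
print(admiration_reward(admiration=0.6, arousal=0.9))  # someone higher-stakes
```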
As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent. So if I’m doing exactly what I want, and Zoe is strongly following / deferring to me on everything, then that’s a credible and costly signal that Zoe likes / admires me (or fears me). And we can’t really both be sending those signals to each other simultaneously. Also, sending those signals runs up against every other aspect of my own reward function—eating-when-hungry, curiosity drive, etc., because those determine the object-level preferences that I would be neglecting in favor of following Zoe’s whims.
As I always say, my claim is “understanding human social instincts seems like it would probably be useful for Safe & Beneficial AGI”, not “here is my plan for Safe & Beneficial AGI…”. :) That said, copying from §5.2 here:
Why does this happen in the first place, instead of people just wanting to talk about the same things all the time, in order to max out social rewards? Where does interest in trains and dinosaurs even come from? They seem to be purely or mostly social, given lack of practical utility, but then why divergence in interests? (Understood that you don’t have a complete understanding yet, so I’m just flagging this as a potential puzzle, not demanding an immediate answer.)
I’m only “happy” to “entrust the future to the next generation of humans” if I know that they can’t (i.e., don’t have the technology to) do something irreversible, in the sense of foreclosing a large space of potential positive outcomes, like locking in their values or damaging the biosphere beyond repair. In other words, up to now, any mistakes that a past generation of humans made could be fixed by a subsequent generation, and this is crucial for why we’re still in an arguably OK position. However, AI will quickly make that stop being true, by advancing technology.
So I really want the AI transition to be an opportunity for improving the basic dynamic of “the next generation of humans will [figure out new things and invent new technologies], causing self-generated distribution shifts, and ending up going in unpredictable directions”, for example by improving our civilizational philosophical competency (which may allow distributional shifts to be handled in a more principled way), rather than just saying that it’s always been this way, so it’s fine to continue.
I’m going to read the rest of that post and your other posts to understand your overall position better, but at least in this section, you come off as being a bit too optimistic or nonchalant from my perspective...
Thanks!
There are lots of different reasons. Here’s one vignette that popped into my head. Billy and Joey are 8yo’s. Joey read a book about trains last night, and Billy read a book about dinosaurs last night. Each learned something new and exciting to them, because trains and dinosaurs are cool (big and loud and mildly scary, which triggers physiological arousal and thus play drive). Now they’re meeting each other at recess, and each is very eager to share what they learned with the other, and thus they’re in competition over the conversation topic—Joey keeps changing the subject to trains, Billy to dinosaurs. Why do they each want to share / show off their new knowledge? Because of “drive to feel liked / admired”. From Joey’s perspective, the moment of telling Billy a cool new fact about trains, and seeing the excited look on Billy’s face, is intrinsically motivating. And vice-versa for Billy. But they can’t both satisfy that innate drive simultaneously.
OK, then you’ll say, why didn’t they both read the same book last night? Well, maybe the library only had one copy. But also, if they had both read the same book, and thus both learned the same cool facts about trains, then that’s a lose-lose instead of a win-win! Neither gets to satisfy their drive to feel liked/admired by thrilling the other with cool exciting new-to-them facts.
Do you buy that? Happy to talk through more vignettes.
OK, here’s another way to put it. When humans do good philosophy and ethics, they’re combining their ability to observe and reason (the source of all “is”) with their social and moral instincts / reflexes (the source of all “ought”, see Valence §2.7).
If it’s possible at all for this process to lead somewhere good, then it’s possible for it to lead somewhere good within the mind of an AI that combines a human-like ability to reason, with human-like social and moral instincts / reflexes. (And if it’s not possible for this process to lead somewhere good, then we’re screwed no matter what.)
Indeed, there are reasons to think the odds of this process leading somewhere good might be better in an AI than in humans, since the programmers can in principle turn the knobs on the various social and moral drives. (Some humans are more motivated by status and self-aggrandizement than others, and I think this is related to interpersonal variation in the strengths of different innate drives. It would be nice to just have a knob in the source code for that!) To be clear, there are also reasons to think the odds might be worse in an AI, since humans are at least a known quantity, and we might not really know what we’re doing in AGI programming until it’s too late.
On my current thinking, it will be very difficult to avoid fast takeoff and unipolar outcomes, for better or worse. (Call me old-fashioned.) (I’m stating this without justification.) If the resulting singleton ASI has the right ingredients to do philosophy and ethics (i.e., it has both the ability to reason, and human-like social and moral instincts / reflexes), but with superhuman speed and insight, then the future could turn out great. Is it an irreversible single-point-of-failure for the future of the universe? You bet! And that’s bad. But I’m currently pretty skeptical that we can avoid that kind of thing (e.g. I expect very low compute requirements that make AGI governance / anti-proliferation rather hopeless), so I’m more in the market for “least bad option” rather than “ideal plan”.
A counterexample to this: if humans and AIs both tend to conclude, after a lot of reflection, that they should be axiologically selfish but decision-theoretically cooperative (with other strong agents), then if we hand off power to AIs, they’ll cooperate with each other (and any other powerful agents in the universe or multiverse) to serve their own collective values, but we humans will be screwed.
Another problem is that we’re relatively confident that at least some humans can reason “successfully”, in the sense of making philosophical progress, but we don’t know the same about AI. There seem to be reasons to think it might be especially hard for AI to learn this, and easy for AI to learn something undesirable instead, like optimizing for how persuasive their philosophical arguments are to (certain) humans.
Finally, I find your arguments against moral realism somewhat convincing, but I’m still pretty uncertain, and I find the arguments I gave in Six Plausible Meta-Ethical Alternatives for the realism side of the spectrum still somewhat convincing as well, so I don’t want to bet the universe on or against any of these positions.
Thanks!
RE 2 – I was referring here to (what I call) “brain-like AGI”, a yet-to-be-invented AI paradigm in which both the “human-like ability to reason” and the “human-like social and moral instincts / reflexes” are meant in a nuts-and-bolts sense, i.e. the AI is actually doing the same kinds of algorithmic steps that a human brain would do. Human brains are quite different from LLMs, even if their text outputs can look similar. For example, small groups of humans can invent grammatical languages from scratch, and of course historically humans invented science and tech and philosophy and so on from scratch. So bringing up “training data” is kinda the wrong idea. Indeed, for humans (and brain-like AGI), we should be talking about “training environments”, not “training data”, more like the RL agent paradigm of the late 2010s than like LLMs, at least in some ways.
I do agree that we shouldn’t trust LLMs to make good philosophical progress that goes way beyond what’s already in their human-created training data.
RE 1 – Let’s talk about feelings of friendship, compassion, and connection. These feelings are unnecessary for cooperation, right? Logical analysis of the costs vs. benefits of cooperation, including decision theory, reputational consequences, etc., is all you need for cooperation to happen. (See §2-3 here.) But for me and almost anyone, a future universe with no feelings of friendship, compassion, and connection in it seems like a bad thing that I don’t want to happen. I find it hard to believe that sufficient reflection would change my opinion on that [although I have some niggling concerns about technological progress]. “Selfishness” isn’t even a coherent concept unless the agent intrinsically wants something; innate drives are upstream of what it wants, and those feelings of friendship, compassion, etc. can be one of those innate drives, potentially a very strong one. Then, yes, there’s a further question about whom those feelings will be directed towards: AIs, humans, animals, teddy bears, or what? It needs to start with innate reflexes to certain stimuli, which then get smoothed out upon reflection. I think this is something where we’d need to think very carefully about the AI design, training environment, etc. It might be easier to think through what could go right and wrong after developing a better understanding of human social instincts, and developing a more specific plan for the AI. Anyway, I certainly agree that there are important potential failure modes in this area.
RE 3 – I’m confused about how this plan would be betting the universe on a particular meta-ethical view, from your perspective. (I’m not expressing doubt, I’m just struggling to see things from outside my own viewpoint here.)
By the way, my perspective again is “this might be the least-bad plausible plan”, as opposed to “this is a great plan”. But to defend the “least-bad” claim, I would need to go through all the other possible plans that seem better, and why I don’t find them plausible (on my idiosyncratic models of AGI development). I mentioned my skepticism about AGI pause / anti-proliferation above, and I have a (hopefully) forthcoming post that should talk about how I’m thinking about corrigibility and other stuff. (But I’m also happy to chat about it here; it would probably help me flesh out that forthcoming post tbh.)
I think this could be part of a viable approach, for example if we figure out in detail how humans invented philosophy and use that knowledge to design/train an AI that we can have high justified confidence will be philosophically competent. I’m worried that in actual development of brain-like AGI, we will skip this part (because it’s too hard, or nobody pushes for it), and end up just assuming that the AGI will invent or learn philosophy because it’s brain-like. (And then it ends up not doing that because we didn’t give it some “secret sauce” that humans have.) And this does look fairly hard to me, because we don’t yet understand the nature of philosophy or what constitutes correct philosophical reasoning or philosophical competence, so how do we study these things in either humans or AIs?
I find it pretty plausible that wireheading is what we’ll end up wanting, after sufficient reflection. (This could be literal wireheading, or something more complex like VR with artificial friends, connections, etc.) This seems to me to be the default, unless we have reasons to want to avoid it. Currently my reasons are: 1. value uncertainty (maybe we’ll eventually find good intrinsic reasons to not want to wirehead); 2. opportunity costs (if I wirehead now, it’ll cost me in terms of both quality and quantity, compared to waiting until I have more resources and security). But it seems foreseeable that both of these reasons will go away at some point, and we may not have found other good reasons to avoid wireheading by then.
(Of course, if the right values for us are to max out our wireheading, then we shouldn’t hand the universe off to AGIs that will want to max out their wireheading. Also, this is just the simplest example of how brain-like AGIs’ values could conflict with ours.)
You’re prioritizing instilling AGI with the right social instincts, and arguing that’s very important for getting the AGI to eventually converge on values that we’d consider good. But under different meta-ethical views, you should perhaps be prioritizing something else. For example, under moral realism, it doesn’t matter what social instincts the AGI starts out with, what’s more important is that it has the capacity to eventually find and be motivated by objective moral truths.
From my perspective, in order to not bet the universe on a particular meta-ethical view, it seems that we need to either hold off on building AGI until we definitively solve metaethics (e.g., it’s no longer a contentious subject in academic philosophy), or have an approach/plan that will work out well regardless of which meta-ethical view turns out to be correct.
Thanks, I appreciate this, but of course still feel compelled to speak up when I see areas where you seem overly optimistic and/or missing some potential risks/failure modes.