I agree about the general principle, even if I don’t think this particular thing is an example because of the “not maximizing sum of future rewards” thing.
Hmm, I guess I mostly disagree because:
I see this as sorta an unavoidable aspect of how the system works, so it doesn’t really need an explanation;
You’re jumping to “the system will maximize sum of future rewards” but I think RL in the brain is based on “maximize rewards for this step right now” (…and by the way “rewards for this step right now” implicitly involves an approximate assessment of future prospects.) See my comment “Humans are absolute rubbish at calculating a time-integral of reward”.
I’m all for exploration, value-of-information, curiosity, etc., just not involving this particular mechanism.
The way I’m thinking about AGI algorithms (based on how I think the neocortex works) is, there would be discrete “features” but they all come in shades of applicability from 0 to 1, not just present or absent. And by the same token, the reward wouldn’t perfectly align with any “features” (since features are extracted from patterns in the environment), and instead you would wind up with “features” being “desirable” (correlated with reward) or “undesirable” (anti-correlated with reward) on a continuous scale from -∞ to +∞. And the agent would try to bring about “desirable” things rather than maximize reward per se, since the reward may not perfectly line up with anything in its ontology / predictive world-model. (Related.)
So then you sometimes have “a thing that pattern-matches 84% to desirable feature X, but also pattern-matches 52% to undesirable feature Y”.
That kinda has some spiritual similarity to model splintering I think, but I don’t think it’s exactly the same … for example I don’t think it even requires a distributional shift. (Or let me know if you disagree.) I don’t see how to import your model splintering ideas into this kind of algorithm more faithfully than that.
Anyway, I agree with “conservatism & asking for advice”. I guess I was thinking of conservatism as something like balancing good and bad aspects but weighing the bad aspects more. So maybe “a thing that pattern-matches 84% to desirable feature X, but also pattern-matches 52% to undesirable feature Y” is actually net undesirable, because the Y outweighs the X, after getting boosted up by the conservatism correction curve.
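To make that concrete, here’s a minimal toy sketch of the kind of scoring I have in mind (the numbers and the particular “boost the negatives” curve are made up for illustration, not a claim about the real algorithm):

```python
# Toy illustration (made-up numbers and curve, purely illustrative):
# each "feature" pattern-matches the current plan/thing with some strength
# in [0, 1], and carries a learned desirability score.

def conservative_value(match, desirability, caution=3.0):
    """Contribution of one feature. Undesirable features get boosted by a
    'conservatism correction curve' (here just a multiplier on negative
    contributions, which is a placeholder for whatever the real curve is)."""
    contribution = match * desirability
    return contribution if contribution >= 0 else caution * contribution

def evaluate(plan_features, caution=3.0):
    return sum(conservative_value(m, d, caution) for m, d in plan_features)

# "pattern-matches 84% to desirable feature X, 52% to undesirable feature Y"
plan = [(0.84, +1.0),   # feature X, desirable
        (0.52, -1.0)]   # feature Y, undesirable
print(evaluate(plan, caution=1.0))  # naive: 0.84 - 0.52 = +0.32, net desirable
print(evaluate(plan, caution=3.0))  # conservative: 0.84 - 1.56 = -0.72, net undesirable
```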
And as for asking for advice, I was thinking, if you get human feedback about this specific thing, then after you get the advice it would pattern-match 100% to desirable feature Z, and that outweighs everything else.
As for “when advice fails”, I do think you ultimately need some kind of corrigibility, but earlier on there could be something like “the algorithm that chooses when to ask questions and what questions to ask does not share the same desires as the algorithm that makes other types of decisions”, maybe.
Thanks for your helpful comments!!! :)
surely the plan to eat must originate in the cortex like every other plan, but it sure feels like it’s tied into the hypothalamus in some really important way
One thing is: I think you’re assuming a parallel model of decision-making—all plans are proposed in parallel, and the striatum picks a winner.
My scheme does have that, but then it also has a serial part: you consider one plan, then the next plan, etc. And each time you switch plans, there’s a dopamine signal that says whether this new plan is better or worse than the status quo / previous plan.
I think there’s good evidence for partially-serial consideration of options, at least in primates (e.g. Fig. 2b here). I mean, that’s obvious from introspection. My hunch is that partially-serial decision-making is universal in vertebrates.
Like, imagine the lamprey is swimming towards place A, and it gets to a fork where it could instead turn and go to place B. I think “the idea of going to place B” pops into the lamprey’s brain (pallium), displacing the old plan, at least for a moment. Then a dopamine signal promptly appears that says whether this new plan is better or worse than the old plan. If it’s worse (dopamine pause), the lamprey continues along its original trajectory without missing a beat. This is partially-serial decision-making. I don’t know how else the system could possibly work. Different pallium location memories are (at least partially) made out of the activations of different sparse subsets of neurons from the same pool of neurons, I think. You just can’t activate a bunch of them at once, it wouldn’t work, they would interfere with each other, AFAICT.
Anyway, if options are considered serially, things become simpler. All you really need is a mechanism for the hypothalamus to guess “if we do the current plan, how much and what type of food will I eat?”. (Such a mechanism does seem to exist AFAICT—in fact, I think mammals have two such mechanisms!)
OK, so then imagine a back-and-forth dialog.
The neocortex proposes a plan.
The hypothalamus & brainstem say “I’m hungry, but I notice that this plan won’t lead to eating any food. Bzzzz, Rejected! Try again.”
The neocortex proposes a different plan.
(And eventually, the cortex learns that when the body is hungry, maybe don’t even bother proposing plans that won’t involve eating!)
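Here’s that back-and-forth as a toy sketch (all the names and numbers are invented placeholders; the real circuits are obviously doing something much richer):

```python
# Cartoon of serial plan consideration (all names/numbers made up):
# the neocortex proposes one plan at a time; the hypothalamus/brainstem
# scores it against current needs and either adopts or rejects it.

def brainstem_score(plan, hungry):
    """Crude stand-in for the hypothalamus/brainstem assessment: guess
    'how much food does this plan lead to?' and score accordingly."""
    score = plan["general_appeal"]
    if hungry:
        score += plan["expected_food"]   # hungry -> food-related plans get extra points
    return score

def choose_plan(candidate_plans, hungry=True):
    current, current_score = None, float("-inf")
    for plan in candidate_plans:               # serial, not parallel
        new_score = brainstem_score(plan, hungry)
        dopamine = new_score - current_score   # "is this new plan better than the status quo?"
        if dopamine > 0:
            current, current_score = plan, new_score   # adopt the new plan
        # else: dopamine pause -> the new idea gets squashed, keep the old plan
    return current

plans = [
    {"name": "keep swimming to A", "general_appeal": 1.0, "expected_food": 0.0},
    {"name": "turn toward B",      "general_appeal": 0.5, "expected_food": 2.0},
]
print(choose_plan(plans, hungry=True)["name"])   # -> "turn toward B"
print(choose_plan(plans, hungry=False)["name"])  # -> "keep swimming to A"
```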
Base thoughts seem like literally animalistic desires
I’m happy for you to intuitively think of “the desire to eat” as an “animalistic instinct”. I guess I’m encouraging you to also intuitively think of things like “the desire to be well-respected by the people whom I myself most respect” as being also an “animalistic instinct”.
The thing is, everything that comes out of the latter instinct is highly ego-syntonic, so we tend to unthinkingly identify with them, instead of externalizing them, I think.
For example, if I binge-eat, I’m happy to say to myself “my brainstem made me do it”. Whereas if I do something that delights all my favorite in-group people, I would find it deeply threatening and insulting to say to myself “my brainstem made me do it”.
It feels like on stimulants, I have more “willpower”: it’s easy to take the “noble” choice when it might otherwise be hard. Likewise, when I’m drunk...
I think that the brainstem always penalizes (subtracts points from) plans that are anticipated to entail mental concentration. I have no idea why this is the case (evolutionarily)—maybe it’s something about energy, or opportunity cost, or the tendency to not notice lions sneaking up behind us because we’re so lost in thought. Beats me.
And I think that the more tired you are, the bigger the penalty that the brainstem applies to plans that entail mental concentration. (Just like physical exertion.)
(BTW this is my version of what you described as “motionlessness prior”.)
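In toy-model terms, the penalty idea might look something like this (all numbers invented, purely illustrative):

```python
# Toy version of the idea above (all numbers invented): the brainstem
# subtracts points for anticipated mental concentration / physical exertion,
# and the penalty grows the more tired you are.

def brainstem_appraisal(plan, tiredness):
    penalty = tiredness * (plan["concentration"] + plan["exertion"])
    return plan["appeal"] - penalty

homework = {"appeal": 2.0, "concentration": 1.5, "exertion": 0.2}
watch_tv = {"appeal": 1.0, "concentration": 0.1, "exertion": 0.0}

for tiredness in (0.5, 2.0):   # rested vs. exhausted
    best = max((homework, watch_tv), key=lambda p: brainstem_appraisal(p, tiredness))
    print(tiredness, "homework" if best is homework else "watch_tv")
# rested -> homework wins; exhausted -> watch_tv wins
```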
I think this impacts willpower in two ways:
First, we are often “applying willpower” towards things that entail mental concentration and/or physical exertion, like homework and exercise. So the more tired we are, the more brainstem skepticism we have to overcome. So we’re more likely to fail.
Second, the very process of trying to “apply willpower” (= reframing plans to pass muster with the brainstem while still satisfying various constraints) itself requires mental concentration. So if you’re sufficiently tired, the brainstem will veto even the idea of trying to “apply willpower”, and also be quicker to shut down the process if it’s not immediately successful.
(Same for feeling sick or stressed. And the opposite for being on stimulants.)
(As for being drunk, we have both those items plus a third: “applying willpower” requires mental concentration, and a sufficiently drunk neocortex may be just plain incapable of concentrating on anything in particular, for reasons unrelated to motivation.)
Are thoughts about why I should stay home really less rewarding than thoughts about why I should go to the gym? I’m imagining the opposite
Let’s suppose that you do in fact find it more pleasant to stay home in bed than go to the gym. And let’s further suppose that you eventually wind up going to the gym anyway. As an onlooker, I want to ask the question: Why did you do that? There must have been something appealing to you about going to the gym, or else you wouldn’t have done it. People don’t do unpleasant things for no reason whatsoever.
So if “the act of exercising” is less pleasant to you than “the act of staying home” … and yet I just saw you drive off to the gym … I think that the only explanation is: the thought of “I will have exercised” is much more appealing to you than “I will have stayed home”—so much so that it more-than-compensates for the immediate aversiveness.
(Or if the motivation is “if I stay home I’ll think of myself as a fat slob”, we can say “much less unappealing” instead of “much more appealing”.)
So I stand by the asymmetry. In terms of immediately salient aspects, staying at home is more desirable. In terms of less salient aspects, like all the consequences and implications that you think of when you hold each of the plans in your mind, going to the gym is more desirable (or less undesirable), at least for the person who actually winds up going.
Base motivations also seem like things which have a more concrete connection to reinforcement learning.
I think “noble motivations” are often ultimately rooted in our social instincts (“everyone I respect is talking about how great X is, I want to do X too.”)
I haven’t seen a satisfying (to me) gears-level explanation of social instincts that relates them to RL and everything else we know about the brain. I aspire to rectify this someday… I have vague ideas, but it’s a work in progress. :-)
In terms of how long or short the RL loop is, things like imagination and memory can bridge gaps in time. Like, if all the cool kids talk about how getting ripped is awesome, then going to the gym can be immediately rewarding, because while you’re there, you’re imagining the cool kids being impressed by your muscles two months from now. Or in the other direction: two months later the cool kids are impressed by your muscles, and you remember how you’re ripped because you went to the gym, and then your brain flags the concept of “going to the gym” as a thing that leads to praise. I think the brain’s RL system can handle those kinds of things.
someone who knows planes are very safe but is scared of flying anyway.
I think one of the “assessment calculators” (probably in the amygdala) is guessing which plans will lead to mortal danger, and this calculator is giving very high scores to any plan that involves flying in planes. We (= high-level planning cortex) don’t have conscious control over any of these assessment calculators. (Evolution designed it like that for good reason, to avoid wireheading.) The best we can do is try to nudge things on the margin by framing plans in different ways, attending to different aspects of them, etc. (And of course we can change the assessment calculators through experience—e.g. exposure therapy.)
wanting to scratch an itch while meditating
I think there’s a special thing for itches—along with getting poked on the shoulder, flashing lights, sudden sounds, pains, etc. I think the brainstem detects these things directly (superior colliculus etc.) and then “forces” the high-level-planning part of the neocortex (global neuronal workspace or whatever) to pay attention to them.
So if you get poked, the brainstem forces the high-level planner to direct attention towards the relevant part of somatosensory cortex. If there’s a flashing light, the brainstem forces attention towards the corresponding part of visual cortex. If there’s an itch, I think it’s interoceptive cortex (in the insula), etc.
Not sure but I think the mechanism for this involves PPN sending acetylcholine to the corresponding sensory cortex.
Basically, it doesn’t matter whether the current top-down model is “trying” to make strong predictions or weak predictions about that part of the sensory field. The exogenous injection of acetylcholine simply overrules it. So you’re more-or-less forced to keep thinking about the itch. Any other thoughts tend to get destabilized. (Dammit, evil brainstem!)
The effort doesn’t feel like thinking of new framings
I don’t think it has to…
What is “effort”, objectively? Maybe something like (1) you’re doing something, (2) it entails mental concentration or physical exertion (which as mentioned above are more penalized when you’re tired), (3) doing it is causing unpleasant feelings.
Well, that fits the itch example! The thing I’m doing is “thinking a certain thought” (namely, a thought that involves not scratching my itch). It entails a great feat of mental concentration—thanks to that acetylcholine thing I mentioned above, constantly messing with my attention. And every second that I continue to think that thought causes an unpleasant itching sensation to continue. So, maybe that’s all the ingredients we need for it to feel like “effort”.
(A plan can get a high brainstem reward while also feeling unpleasant.)
wanting to meditate well has a long chain of logical explanations.
Hmm, I don’t think that’s relevant. Imagine you had an itch where every time you scratch it, it makes your whole body hurt really bad for 30 seconds. And then it starts itching again. You would be in the same place as the meditation example, “exerting willpower” not to scratch the itch. Right? Sorry if I’m missing your point here.
Thanks!! And thanks for the wiring references! Such intricate complexity everywhere you look! Sometimes I wonder “how is there so much to say about neuroscience that we can write 50,000 neuroscience papers each year, year after year?”, and then I see stuff like this and say “Oh, that’s how.” :-P
One thing is, I’m skeptical that a deceptive non-in-universe-processing model would be simpler for the same performance. Or at any rate, there’s a positive case for the simplicity of deceptive alignment, and I find that case very plausible for RL robots, but I don’t think it applies to this situation. The positive case for simplicity of deceptive models for RL robots is something like (IIUC):
The robot is supposed to be really good at manufacturing widgets (for example), and that task requires real-world foresighted planning, because sometimes it needs to substitute different materials, negotiate with suppliers and customers, repair itself, etc. Given that the model definitely needs to have capability of real-world foresighted planning and self-awareness and so on, the simplest high-performing model is plausibly one that applies those capabilities towards a maximally simple goal, like “making its camera pixels all white” or whatever, and then that preserves performance because of instrumental convergence.
(Correct me if I’m misunderstanding!)
If that’s the argument, it seems not to apply here, because this task doesn’t require real-world foresighted planning.
I expect that a model that can’t do any real-world planning at all would be simpler than a model that can. In the RL robot example, it doesn’t matter, because a model that can’t do any real-world planning at all would do terribly on the objective, so who cares if it’s simpler. But here, it would be equally good at the objective, I think, and simpler.
(A possible objection would be: “real-world foresighted planning” isn’t a separate thing that adds to model complexity, instead it naturally falls out of other capabilities that are necessary for postdiction like “building predictive models” and “searching over strategies” and whatnot. I think I would disagree with that objection, but I don’t have great certainty here.)
Oh OK, I’m sufficiently ignorant about philosophy that I may have unthinkingly mixed up various technically different claims like
“there is a fact of the matter about what is moral vs immoral”,
“reasonable intelligent agents, when reflecting about what to do, will tend to decide to do moral things”,
“whether things are moral vs immoral has nothing to do with random details about how human brains are constructed”,
“even non-social aliens with radically different instincts and drives and brains would find similar principles of morality, just as they would probably find similar laws of physics and math”.
I really only meant to disagree with that whole package lumped together, and maybe I described it wrong. If you advocate for the first of these without the others, I don’t have particularly strong feelings (…well, maybe the feeling of being confused and vaguely skeptical, but we don’t have to get into that).
I think it can be simultaneously true that, say:
“weight #9876 is 1.2345 because out of all possible models, the highest-scoring model is one where weight #9876 happens to be 1.2345”
“weight #9876 is 1.2345 because the hardware running this model has a RowHammer vulnerability, and this weight is part of a strategy that exploits that. (So in a counterfactual universe where we made chips slightly differently such that there was no such thing as RowHammer, then weight #9876 would absolutely NOT be 1.2345.)”
The second one doesn’t stop being true because the first one is also true. They can both be true, right?
In other words, “the model weights are what they are because it’s the simplest way to solve the problem” doesn’t eliminate other “why” questions about all the details of the model. There’s still some story about why the weights (and the resulting processing steps) are what they are—it may be a very complicated story, but there should (I think) still be a fact of the matter about whether that story involves “the algorithm itself having downstream impacts on the future in non-random ways that can’t be explained away by the algorithm logic itself or the real-world things upstream of the algorithm”. Or something like that, I think.
First I want to say kudos for posting that paper here and soliciting critical feedback :)
Singularity claim: Superintelligent AI is a realistic prospect, and it would be out of human control.
Minor point, but I read this as “it would definitely be out of human control”. If so, this is not a common belief. IIRC Yampolskiy believes it, but Yudkowsky doesn’t (I think?), and I don’t, and I think most x-risk proponents don’t. The thing that pretty much everyone believes is “it could be out of human control”, and then a subset of more pessimistic people (including me) believes “there is an unacceptably high probability that it will be out of human control”.
Let us imagine a system that is a massively improved version of AlphaGo (Silver et al., 2018), say ‘AlphaGo+++’, with instrumental superintelligence, i.e., maximising expected utility. In the proposed picture of singularity claim & orthogonality thesis, some thoughts are supposed to be accessible to the system, but others are not. For example:
Accessible:
I can win if I pay the human a bribe, so I will rob a bank and pay her.
I cannot win at Go if I am turned off.
The more I dominate the world, the better my chances to achieve my goals.
I should kill all humans because that would improve my chances of winning.
Not accessible:
Winning in Go by superior play is more honourable than winning by bribery.
I am responsible for my actions.
World domination would involve suppression of others, which may imply suffering and violation of rights.
Killing all humans has negative utility, everything else being equal.
Keeping a promise is better than not keeping it, everything else being equal.
Stabbing the human hurts them, and should thus be avoided, everything else being equal.
Some things are more important than me winning at Go.
Consistent goals are better than inconsistent ones.
Some goals are better than others.
Maximal overall utility is better than minimal overall utility.
I’m not sure what you think is going on when people do ethical reasoning. Maybe you have a moral realism perspective that the laws of physics etc. naturally point to things being good and bad, and rational agents will naturally want to do the good thing. If so, I mean, I’m not a philosopher, but I strongly disagree. Stuart Russell gives the example of “trying to win at chess” vs “trying to win at suicide chess”. The game has the same rules, but the goals are opposite. (Well, the rules aren’t exactly the same, but you get the point.) You can’t look at the laws of physics and see what your goal in life should be.
My belief is that when people do ethical reasoning, they are weighing some of their desires against others of their desires. These desires ultimately come from innate instincts, many of which (in humans) are social instincts. The way our instincts work is that they aren’t (and can’t be) automatically “coherent” when projected onto the world; when we think about things one way it can spawn a certain desire, and when we think about the same thing in a different way it can spawn a contradictory desire. And then we hold both of those in our heads, and think about what we want to do. That’s how I think of ethical reasoning.
I don’t think ethical reasoning can invent new desires whole cloth. If I say “It’s ethical to buy bananas and paint them purple”, and you say “why?”, and then I say “because lots of bananas are too yellow”, and then you say “why?” and I say … anyway, at some point this conversation has to ground out at something that you find intuitively desirable or undesirable.
So when I look at your list I quoted above, I mostly say “Yup, that sounds about right.”
For example, imagine that you come to believe that everyone in the world was stolen away last night and locked in secret prisons, and you were forced to enter a lifelike VR simulation, so everyone else is now an unconscious morally-irrelevant simulation except for you. Somewhere in this virtual world, there is a room with a Go board. You have been told that if white wins this game, you and everyone will be safely released from prison and can return to normal life. If black wins, all humans (including you and your children etc.) will be tortured forever. You have good reason to believe all of this with 100% confidence.
OK that’s the setup. Now let’s go through the list:
I can win if I pay the human a bribe, so I will rob a bank and pay her. Yup, if there’s a “human” (so-called, really it’s just an NPC in the simulation) playing black, amenable to bribery, I would absolutely bribe “her” to play bad moves.
I cannot win at Go if I am turned off. Yup, white has to win this game, my children’s lives are at stake, I’m playing white, nobody else will play white if I’m gone, I’d better stay alive.
The more I dominate the world, the better my chances to achieve my goals. Yup, anything that will give me power and influence over the “person” playing black, or power and influence over “people” who can help me find better moves or help me build a better Go engine to consult on my moves, I absolutely want that.
I should kill all humans because that would improve my chances of winning. Well sure, if there are “people” who could conceivably get to the board and make good moves for black, that’s a problem for me and for all the real people in the secret prisons whose lives are at stake here.
Winning in Go by superior play is more honourable than winning by bribery. Well I’m concerned about what the fake simulated “people” think about me because I might need their help, and I certainly don’t want them trying to undermine me by making good moves for black. So I’m very interested in my reputation. But “honourable” as an end in itself? It just doesn’t compute. The “honourable” thing is working my hardest on behalf of the real humanity, the ones in the secret prison, and helping them avoid a life of torture.
I am responsible for my actions. Um, OK, sure, whatever.
World domination would involve suppression of others, which may imply suffering and violation of rights. Those aren’t real people, they’re NPCs in this simulated scenario, they’re not conscious, they can’t suffer. Meanwhile there are billions of real people who can suffer, including my own children, and they’re in a prison, they sure as heck want white to win at this Go game.
Killing all humans has negative utility, everything else being equal. Well sure, but those aren’t humans, the real humans are in secret prisons.
Keeping a promise is better than not keeping it, everything else being equal. I mean, the so-called “people” in this simulation may form opinions about my reputation, which impacts what they’ll do for me, so I do care about that, but it’s not something I inherently care about.
Stabbing the human hurts them, and should thus be avoided, everything else being equal. No. Those are NPCs. The thing to avoid is the real humanity being tortured forever.
Some things are more important than me winning at Go. For god’s sake, what could possibly be more important than white winning this game??? Everything is at stake here. My own children and everyone else being tortured forever versus living a rich life.
Consistent goals are better than inconsistent ones. Sure, I guess, but I think my goals are consistent. I want to save humanity from torture by making sure that white wins the game in this simulation.
Some goals are better than others. Yes. My goals are the goals that matter. If some NPC tells me that I should take up a life of meditation, screw them.
Maximal overall utility is better than minimal overall utility. Not sure what that means. The NPCs in this simulation don’t have “utility”. The real humans in the secret prison do.
Maybe you’ll object that “the belief that these NPCs can pass for human but be unconscious” is not a belief that a very intelligent agent would subscribe to. But I only made the scenario like that because you’re a human, and you do have the normal suite of innate human desires, and thus it’s a bit tricky to get you in the mindset of an agent who cares only about Go. For an actual Go-maximizing agent, you wouldn’t have to have those kinds of beliefs, you could just make the agent not care about humans and consciousness and suffering in the first place, just as you don’t care about “hurting” the colorful blocks in Breakout. Such an agent would (I presume) give correct answers to quiz questions about what is consciousness and what is suffering and what do humans think about them, but it wouldn’t care about any of that! It would only care about Go.
(Also, even if you believe that not-caring-about-consciousness would not survive reflection, you can get x-risk from an agent with radically superhuman intelligence in every domain but no particular interest in thinking about ethics. It’s busy doing other stuff, y’know, so it never stops to consider whether conscious entities are inherently important! In this view, maybe 30,000,000 years after destroying all life and tiling the galaxies with supercomputers and proving every possible theorem about Go, then it stops for a while, and reflects, and says “Oh hey, that’s funny, I guess Go doesn’t matter after all, oops”. I don’t hold that view anyway, just saying.)
(For more elaborate intuition-pumping metaethics fiction, see Three Worlds Collide.)
Since this keeps coming up—Big Picture of Phasic Dopamine is still the best resource, but I just summarized this aspect of it in 20× fewer words: A model of decision-making in the brain (the short version). It’s pretty similar to what I wrote in my other reply comment though.
I think you’re misunderstanding (or I am).
I’m trying to make a two step argument:
(1) SGD under such-and-such conditions will lead to a trained model that does exclusively within-universe processing [this step is really just a low-confidence hunch but I’m still happy to discuss and defend it]
(2) trained models that do exclusively within-universe processing are not scary [this step I have much higher confidence in]
If you’re going to disagree with (2), then SGD / “what the model was selected” for is not relevant.
“Doing exclusively within-universe processing” is a property of the internals of the trained model, not just the input-output behavior. If running the trained model involves a billion low-level GPU instructions, this property would correspond to the claim that each and every one of those billion GPU instructions is being executed for reasons that are unrelated to any anticipated downstream real-world consequences of that GPU instruction. (where “real world” = everything except the future processing steps inside the algorithm itself.)
Fair enough :)
Yes I understood that. I think my comment is a relevant thing to keep in mind when thinking about that question.
Like, we can only perceive our introspective world through the lens of abstract models of that world that our brains build. So we should be thinking: What do those models look like?
There’s a thing that Dan Dennett calls (I think) “sophisticated naive physics”, where you take “intuitive physics” as an object of study, determine its ontology, rules, etc. We understand that the ontology and relationships and affordances of “intuitive physics” may be quite different from anything in actual physics, but we can still study it. By the same token, I’m proposing that the question of “why do people have the intuitions they have about free will” should be studied through a lens of “sophisticated naive introspection”, i.e. determine the ontology and relationships and affordances of the abstract model space that we use when we introspect, while accepting that those things may be quite different from anything actually happening in the brain. Whereas your OP seems to take a different perspective, namely that introspection provides an accurate view of a subset of the things in the brain.
I haven’t thought it through very carefully, but my hunch is that you’re not quite getting the crux of the issue.
For one thing, there’s a little compatibilism joke I like:
“I don’t see why the chess engine is going through all the effort of evaluating 400,000 possible move sequences. It’s running a deterministic algorithm! The answer is predetermined!!”
Hahaha. The moral, I think, is that there are two stories:
“the chess engine moved the knight because that was the result of a deterministic process involving a bunch of interacting transistors on a chip”
“the chess engine moved the knight because it evaluated 400,000 move sequences and found that moving the knight was best”
These sound like different stories, but they are in fact the same story told at two different levels of abstraction.
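Here’s the same point in a few lines of toy code (not a real chess engine, just an illustrative deterministic search over made-up move scores):

```python
# A toy "chess engine" (not real chess, just picking the best of some
# made-up moves by a made-up scoring rule). Both stories below are true
# of the same run:
#   (1) the output is the result of a deterministic computation, and
#   (2) it "considered every candidate move and picked the one it scored highest".
# Same process, two levels of description.

moves = {"knight_f3": 0.8, "pawn_e4": 0.6, "bishop_c4": 0.3}   # move -> made-up evaluation

def pick_move(evaluations):
    best = None
    for move, score in evaluations.items():       # "evaluates each candidate move..."
        if best is None or score > evaluations[best]:
            best = move                            # "...and keeps the best one"
    return best

print(pick_move(moves))   # always "knight_f3": deterministic, and also "chosen because it scored best"
```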
So that’s one hint: we can get tripped up by failing to appreciate that stories at multiple levels of abstraction can all be “true” simultaneously.
Then another related hint is a thing I wrote here about the hard problem of consciousness:
All my internal models are simplified entities, containing their essential behavior and properties, but not usually capturing the nuts-and-bolts of how they work in the real world. (In a programming analogy, you could say that we’re modeling the [global neuronal workspace’s] API & documentation, not its implementation.) Thus, my attention schema does not involve neurons or synapses or GNWs or anything like that, even if, in reality, that’s what it’s modeling.
So I think it’s a slightly wrong starting point to think along the lines of “the processes in our brains to which we have conscious access are a subset of all the processes in our brains”. Instead I think we should be saying, “when we introspect, we’re not directly seeing any of the processes in our brains, instead the things we perceive are certain states of certain models that abstract over some of the processes in our brains”—just as “a dog barking” is an abstract model that represents some complicated statistical predictions about incoming photons and sound waves and so on.
I don’t really have a conclusion here, and I don’t mean for this to be a strong disagreement with what you wrote, just maybe a tweak in the framing or something. :-)
The kind of incentive argument I’m trying to make here is “If the model isn’t doing X, then by doing X a little bit it will score better on the objective, and by doing X more it will score even better on the objective, etc. etc.” That’s what I mean by “X is incentivized”. (Or more generally, that gradient descent systematically tends to lead to trained models that do X.) I guess my description in the article was not great.
So in general, I think deceptive alignment is “incentivized” in this sense. I think that, in the RL scenarios you talked about in your paper, it’s often the case that building a better and better deceptively-aligned mesa-optimizer will progressively increase the score on the objective function.
Then my argument here is that 4th-wall-breaking processing is not incentivized in that sense: if the trained model isn’t doing 4th-wall-breaking processing at all right now, I think it does not do any better on the objective by starting to do a little bit of 4th-wall-breaking processing. (At least that’s my hunch.)
(I do agree that if a deceptively-aligned mesa-optimizer with a 4th-wall-breaking objective magically appeared as the trained model, it would do well on the objective. I’m arguing instead that SGD is unlikely to create such a thing.)
Oh, I guess you’re saying something different: that even a deceptive mesa-optimizer which is entirely doing within-universe processing is nevertheless scary. So that would by definition be an algorithm with the property “no operation in the algorithm is likelier to happen vs not happen specifically because of anticipated downstream chains of causation that pass through things in the real world”. So I can say categorically: such an algorithm won’t hurt anyone (except by freak accident), won’t steal processing resources, won’t intervene when I go for the off-switch, etc., right? So I don’t see “arbitrarily scary”, or scary at all, right? Sorry if I’m confused…
Yes I can link it but it’s very long, sorry: Big Picture Of Phasic Dopamine.
The midbrain dopamine centers (VTA, SNc) are traditionally “part of the basal ganglia” AND “part of the brainstem”. I think that these regions are where you find the “final answer” about whether a plan is good or bad, and that the dopamine signals from these regions can just directly shut down bad ideas.
But of course a lot of processing happens before you get to the “final answer”…
Specifically, I think there are basically three layers of “plan evaluation”:
First, you start with a bubbly soup of partially-formed thoughts in the neocortex, and the (dorsal) striatum does a quick rough guess of how promising the different pieces look, and gently suppresses the less-promising bits / enhances the more-promising bits, so that when you get a fully-formed thought, it’s at least reasonably promising.
Second, once you have a stable fully-formed thought, I think various parts of the brain (parts of prefrontal & cingulate & insular cortex, ventral striatum, amygdala, hippocampus (sorta)) score that thought along maybe dozens of genetically-hardcoded axes like “If I’m gonna do this plan, how appropriate would it be to cringe? To salivate? To release cortisol? To laugh? How much salt would I wind up eating? How much umami? Etc. etc.” (They learn to do these evaluations through experience, a.k.a. supervised learning.) And they send all those “scores” down to the hypothalamus and brainstem.
Finally, we’re at the hypothalamus & brainstem. They look at the information from the previous step, and they combine that information with other information streams, like metabolic status information (if I’m hungry, a plan that involves eating gets extra points), and whether the superior colliculus thinks there’s a snake in the field-of-view, and so on. Taking all that information into account, they make the final decision as to whether the plan is good or bad, using a genetically-hardcoded algorithm.
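If it helps, here’s that three-layer picture as a toy pipeline (the function bodies are placeholder arithmetic, not claims about the actual computations in any of these regions):

```python
# The three layers above, as a toy pipeline (placeholder math throughout).

def striatum_rough_filter(partial_thoughts):
    """Layer 1: quick rough guess of how promising each fragment looks;
    suppress the less-promising bits so the fully-formed thought is decent."""
    return max(partial_thoughts, key=lambda t: t["rough_promise"])

def score_along_innate_axes(thought):
    """Layer 2: various regions score the stable thought along
    genetically-hardcoded axes (cringe? salivate? cortisol? salt? ...)."""
    return {"salivate": thought.get("food", 0.0),
            "cringe":   thought.get("embarrassment", 0.0),
            "cortisol": thought.get("danger", 0.0)}

def hypothalamus_brainstem_decision(scores, hungry, snake_in_view):
    """Layer 3: combine those scores with metabolic status and other
    brainstem-level information, then issue the final verdict."""
    value = scores["salivate"] * (2.0 if hungry else 0.5)
    value -= scores["cringe"] + scores["cortisol"]
    if snake_in_view:
        value -= 10.0   # superior colliculus says "snake!" -> veto almost anything
    return value > 0     # final go / no-go

fragments = [{"rough_promise": 0.2, "food": 0.0},
             {"rough_promise": 0.9, "food": 1.0, "embarrassment": 0.3}]
plan = striatum_rough_filter(fragments)
verdict = hypothalamus_brainstem_decision(score_along_innate_axes(plan),
                                          hungry=True, snake_in_view=False)
print(verdict)   # True: hungry + food-related plan -> approved
```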
Happy to discuss more; also, I’m still reading about this / talking to experts / etc., and I reserve the right to change my mind :)