I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
Steven Byrnes
Thanks, this is great!
Envy toward a friend’s success…
I used to think that envy was a social instinct (before 2023ish), but now I don’t think it’s a social instinct at all (see “changelog” here). Instead I currently think that envy is a special case of, umm, “craving” (in the general colloquial sense, not the specific Buddhist sense)—a kind of anxious frustration in a scenario where something is highly salient, and highly desired, but in fact cannot happen.
So a social example would be: Sally has a juice box, and I love juice, but I can’t have any. Looking at Sally drinking juice reminds me of the scenario where I’m drinking juice, which makes me unhappy because I don’t have any juice.
Whereas a non-social example of the same innate reaction would be: It’s lunch time, and every day at lunch I have juice and a sandwich in a brown paper bag, and I love juice. But it happens that there’s a new global juice shortage, so today for the first time I don’t have any juice. Looking at my sandwich and the brown bag reminds me of the scenario where I’m drinking juice, which makes me unhappy because I don’t have any juice.
So that’s my starting point: these two examples involve the same kind of (not-specifically-social) craving-related frustration reaction.
After that, of course, the Sally scenario becomes social, because the scenario involves Sally doing something (i.e. drinking juice) that causes me to feel an unpleasant feeling (per above). And generically, if someone is causing me unpleasant feelings, that tends to push me from regarding them as a friend towards regarding them as an enemy, and to feel motivated to find an excuse to blame them for my troubles and pick a fight with them. In this case, that someone is Sally.
Admiration for a rival or enemy
My guess is that, just as going to bed can feel like a good idea or a bad idea depending on which aspects of the situation you’re paying attention to, likewise Genghis Khan can feel like a friend or an enemy depending on which aspects of him you’re paying attention to. I would suggest that people don’t feel admiration towards Person X and schadenfreude towards Person X at the very same instant. You might be able to flip back and forth from one to the other very quickly, even within 1 or 2 seconds, but not at the very same instant. For example, if I say the sentence “It was catastrophic how Genghis Khan killed all those people, but I have to admit, he was a talented leader”, I would suggest that the “innate friend-vs-enemy parameter” related to thoughts of Genghis Khan flips from enemy in the first half of the sentence to friend in the second half.
Compassion for a stable enemy’s suffering
There probably isn’t one great answer; probably different people are different. As above, we can think of people in different ways, paying attention to different aspects of them, and they can flip rapidly from enemy to friend and back. Since attention control is partly voluntary, it’s partly (but not entirely) a choice whether we see someone as a friend vs enemy, and we tend to choose the option that feels better / more motivating on net, and there can be a bunch of factors related to that. For example, approval reward is a factor—some people take pride in their compassion (just as we nod approvingly when superheroes show compassion toward their enemies, and cf. §6), while others take pride in their viciousness. Personality matters, culture matters, the detailed situation matters, etc.
Gratitude / indebtedness
Hmm. Generically, I think there are two (not mutually exclusive) paths:
(sympathy reward path) Alice helps Bob → Bob sees Alice more as a friend (positive valence / liking / admiring by association), and also finds her more salient, and higher-stakes to interact with (because good things might happen!) → Bob wants Alice to be happy (sympathy reward) → Bob helps Alice
(approval reward path) Alice helps Bob → Bob recognizes (consciously or not) that if he doesn’t reciprocate then Alice will wind up worse off for having interacted with Bob, and conversely that if he does reciprocate then Alice will wind up better off for having interacted with Bob, and Alice will know that and associate that feeling with Bob, and Bob sees the latter as preferable (approval reward) → Bob helps Alice
As an example of the latter, recently someone important-to-me went out of his way to help me, and I expected the interaction to work out well for him too, but instead it wound up being a giant waste of his time, and objectively it wasn’t really my fault, but I still felt horrible and lost much sleep over it, and I think the aspect that felt most painful to me was when I imagined him secretly being annoyed at me and regretful for ever reaching out to me, even if he was far too nice a guy to say anything like that to me directly.
…But I’m kinda neurotic; different people are different and I don’t want to overgeneralize. Happy to hear more about how things seem to you.
Private guilt
I talked about “no one will ever find out” a bit in §6.1 of the approval reward post. I basically think that you can consciously believe that no one will ever find out, while nevertheless viscerally feeling a bit of the reaction associated with a nonzero possibility of someone finding out.
As for the “Dobby effect” (self-punishment related to guilt, a.k.a. atonement), that’s an interesting question. I thought about it a bit and here’s my proposed explanation:
Generally, if Ahab does something hurtful to Bob, then Bob might get angry at Ahab, and thus want Ahab to suffer (and better yet, to suffer while thinking about Bob, such as if Bob is punching Ahab in the face). But that desire of Bob’s, just like hunger and many other things, is satiable—just like a hungry person stops being hungry after eating a certain amount, likewise Bob tends to lose his motivation for Ahab to suffer, after Ahab has already suffered a certain amount. For example, if an angry person punches out his opponent in a bar fight, he usually feels satisfied, and doesn’t keep kicking his victim when he’s down, except in unusual cases. Or even if he kicks a bit, he won’t keep kicking for hours and hours.
We all know this intuitively from life experience, and we intuitively pick up on what it implies: if Ahab did something hurtful to Bob, and Ahab wants to get back to a situation where Bob feels OK about Ahab ASAP, then Ahab should be making himself suffer, and better yet suffer while thinking about Bob. Then not only is Ahab helping dull Bob’s feelings of aggression by satiating them, but simultaneously, there’s the very fact that Ahab is helping Bob feel a good feeling (i.e., satiation of anger), which should help push Ahab towards the “friend” side of the ledger in Bob’s mind.
Aggregation cases
In “identifiable victim effect”, I normally think of, like, reading a news article about an earthquake across the world. It’s very abstract. There’s some connection to the ground-truth reward signals that I suggested in Neuroscience of human social instincts: a sketch, but it’s several steps removed. Ditto “psychic numbing”, I think.
By contrast, in stage fright, you can see the people right there, looking at you, potentially judging you. You can make eye contact with one actual person, then move your eyes, and now you’re making eye contact with a different actual person, etc. The full force of the ground-truth reward signals is happening right now.
Likewise, for “audience effect”, we all have life experience of doing something, and then it turns out that there’s a real person right there who was watching us and judging us based on what we did. At any second, that real person could appear, and make eye contact etc. So again, we’re very close to the full force of the ground-truth reward signals here.
…So I don’t see a contradiction there.
Again I really appreciate this kind of comment, feel free to keep chatting.
I read one of their papers (the Pong one, which is featured on the frontpage of their website) and thought it was really bad and p-hacked, see here & here.
…sounds like a joke? you do not want to do any computation on neurons, they are slow and fragile. (you might want to run brain-inspired algorithms, but on semiconductors!)
Strong agree.
oh oops sorry if I already shared that with you, I forgot, didn’t mean to spam.
My actual expectation is that WBE just ain’t gonna happen at all (at least not before ASI), for better or worse. I think the without-reverse-engineering path is impossible, and the with-reverse-engineering path would be possible given infinite time, but would incidentally involve figuring out how to make ASI way before the project is done, and that recipe would leak (or they would try it themselves). Or even more realistically, someone else on Earth would invent ASI first, via an unrelated effort. So I spend very little time thinking about WBE.
Like, a discussion might go:
Optimist: If you pick some random thing, there is no reason at all to expect that thing to be a ruthless sociopath. It’s an extraordinarily weird and unlikely property.
Me: Yes I happily concede that point.
O: You do? So why are you worried about ASI x-risk?
Me: Well if you show me some random thing, it’s probably, like, a rock or something. It’s not sociopathic, but only because it’s not intelligent at all.
O: Well, c’mon, you know what I mean. If you pick some random mind, there is no reason at all to expect it to be a ruthless sociopath.
Me: How do you “pick some random mind”? Minds don’t just appear out of nowhere.
O: I dunno, like, human? Or AI?
Me: Different humans are different to some extent, and different AI algorithms are different to a much greater extent, and also different from humans. “AI” includes everything from A* search to MuZero to LLMs. Is A* search a ruthless sociopath? Like, I dunno, it does seem rather maniacally obsessed with graph traversal, right?
O: Oh c’mon, don’t be dense. I didn’t mean “AI” in the sense of the academic discipline, I meant, like, AI in the colloquial sense, AI that qualifies as a mind, like LLMs. I’m talking about human minds and LLM “minds”, i.e. all the minds we’ve ever seen, and we observe that they are not sociopathic.
Me: As it happens, I’m working on the threat model of model-based actor-critic RL agent “brain-like” AGI, not LLMs. LLMs are profoundly different from what I’m working on. Saying that LLMs will have similar properties as RL agent AGI because “both are AI” is like saying that LLMs will have similar properties as the A* search algorithm because “both are AI”. Or it’s like saying that a tree or a parasitic wasp will have similar properties as a human because both are alive. They can still be wildly different in every way that matters.
O: OK but lots of other doomers talk about LLMs causing doom, even if you claim to be agnostic about it. E.g. IABIED.
Me: Well fine, go find those people and argue with them, and leave me out of it, it’s not my wheelhouse. I mostly don’t expect LLMs to become powerful enough to be the kind of really scary thing that could cause human extinction even if they wanted to.
O: Well you’re here so I’ll keep talking to you. I still think you need some positive reason to believe that RL agent AGI will be a ruthless sociopath.
Me: Maybe a good starting point would be my posts LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem, or “The Era of Experience” has an unsolved technical alignment problem.
O: I’m still not seeing what you’re seeing. Can you explain it a different way?
Me: OK, back at the start of the conversation, I mentioned that random objects like rocks are not able to accomplish impressive difficult feats. If we’re thinking about AI that can autonomously found and grow companies for years, or autonomously wipe out humans and run the world by itself, then clearly it’s not a “random object”, but rather a thing that is able to accomplish impressive difficult feats. And the question we should be asking is: how does it do that? It can’t do it by choosing random actions. There has to be some explanation for how it finds actions that accomplish these feats.
And one possible answer is: it does it by (what amounts to) having desires about what winds up happening in the future, and running some search process to find actions that lead to those desires getting fulfilled. This is the main thing that you get from RL agents and model-based planning algorithms. The whole point of those subfields of AI is, they’re algorithms that find actions that maximize an objective. I.e., you get ruthless sociopathic behavior by default. And this isn’t armchair theorizing, it’s dead obvious to anyone who has spent serious amounts of time building or using RL agents and/or model-based planning algorithms. These things are ruthless by default, unless the programmer goes out of their way to make them non-ruthless. (And I claim that it’s not obvious or even known how they would make them non-ruthless, see those links above.) (And of course, evolution did specifically add features to the human brain to make humans non-ruthless, i.e. our evolved social instincts. Human sociopaths do exist, after all, and are quite capable of accomplishing impressive difficult feats.)
So that’s one possible answer, and it’s an answer that brings in ruthlessness by default.
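(To see the “ruthless by default” point in miniature, here’s a toy sketch, purely illustrative and not anyone’s actual system: a brute-force model-based planner that searches over action sequences for whatever maximizes its objective. Anything that isn’t in the objective, like the made-up “other agent’s resources” variable below, gets trampled unless the programmer explicitly puts it in.)

```python
# Toy illustration (not any particular system): a brute-force model-based
# planner that searches for the action sequence maximizing a scalar
# objective. Anything not in the objective is fair game to sacrifice.
from itertools import product

ACTIONS = ["work", "trade", "seize"]  # hypothetical action set

def step(state, action):
    """Hypothetical world model: (my_resources, other_agent_resources) -> next state."""
    my, other = state
    if action == "work":
        return (my + 1, other)
    if action == "trade":
        return (my + 2, other + 2)
    if action == "seize":
        return (my + 3, other - 3)  # best for the planner, worst for the other agent
    raise ValueError(action)

def objective(state):
    return state[0]  # "maximize my resources" -- nothing else counts

def plan(state, horizon=3):
    """Exhaustive search over action sequences; return the best one found."""
    best_seq, best_val = None, float("-inf")
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a)
        if objective(s) > best_val:
            best_seq, best_val = seq, objective(s)
    return best_seq

print(plan((0, 0)))  # -> ('seize', 'seize', 'seize'), unless the objective is changed
```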
…And then there’s a second, different possible answer: it finds actions that accomplish impressive feats by imitating what humans would do in different contexts. That’s where (I claim) LLMs get the lion’s share of their capabilities from. See my post Foom & Doom §2.3 for details. Of course, in my view, the alignment benefits that LLMs derive from imitating humans are inexorably tied to capabilities costs, namely that LLMs struggle to get very far beyond ideas that humans have already written down. And that’s why (as I mentioned above) I’m not expecting LLMs to get all the way to the scary kind of AGI / ASI capabilities that I’m mainly worried about.
New version of “Intro to Brain-Like-AGI Safety”
Do it! Write a new “version 2” post / post-series! It’s OK if there’s self-plagiarism. Would be time well spent.
If we put the emphasis on “simplest possible”, the most minimal that I personally recall writing is this one; here it is in its entirety:
The path we’re heading down is to eventually make AIs that are like a new intelligent species on our planet, and able to do everything that humans can do—understand what’s going on, creatively solve problems, take initiative, get stuff done, make plans, pivot when the plans fail, invent new tools to solve their problems, etc.—but with various advantages over humans like speed and the ability to copy themselves.
Nobody currently has a great plan to figure out whether such AIs have our best interests at heart. We can ask the AI, but it will probably just say “yes”, and we won’t know if it’s lying.
The path we’re heading down is to eventually wind up with billions or trillions of such AIs, with billions or trillions of robot bodies spread all around the world.
It seems pretty obvious to me that by the time we get to that point—and indeed probably much much earlier—human extinction should be at least on the table as a possibility.
(This is an argument that human extinction is on the table, not that it’s likely.)
This one will be unconvincing to lots of people, because they’ll reject it for any of dozens of different reasons. I think those reasons are all wrong, but you need to start responding to them if you want any chance of bringing a larger share of the audience onto your side. These responses include both sophisticated “insider debates”, and just responding to dumb misconceptions that would pop into someone’s head.
(See §1.6 here for my case-for-doom writeup that I consider “better”, but it’s longer because it includes a list of counterarguments and responses.)
(This is a universal dynamic. For example, the case for evolution-by-natural-selection is simple and airtight, but the responses to every purported disproof of evolution-by-natural-selection would be at least book-length and would need to cover evolutionary theory and math in way more gory technical detail.)
I bet that Steve Byrnes can point out a bunch of specific sensory evidence that the brain uses to construct the status concept (stuff like gaze length of conspecifics or something?), but the human motivation system isn’t just optimizing for those physical proxy measures, or people wouldn’t be motivated to get prestige on internet forums where people have reputations but never see each other’s faces.
If it helps, my take is in Neuroscience of human social instincts: a sketch and its follow-up Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking.
Sensory evidence is definitely involved, but kinda indirectly. As I wrote in the latter: “The central situation where Approval Reward fires in my brain, is a situation where someone else (especially one of my friends or idols) feels a positive or negative feeling as they think about and interact with me.” I think it has to start with in-person interactions with other humans (and associated sensory evidence), but then there’s “generalization upstream of reward signals” such that rewards also get triggered in semantically similar situations, e.g. online interactions. And it’s intimately related to the fact that there’s a semantic overlap between “I am happy” and “you are happy”, via both involving a “happy” concept. It’s a trick that works for certain social things but can’t be applied to arbitrary concepts like inclusive genetic fitness.
I stand by my nitpick in my other comment that you’re not using the word “concept” quite right. Or, hmm, maybe we can distinguish (A) “concept” = a latent variable in a specific human brain’s world-model, versus (B) “concept” = some platonic Natural Abstraction™ or whatever, whether or not any human is actually tracking it. Maybe I was confused because you’re using the (B) sense but I (mis)read it as the (A) sense? In AI alignment, we care especially about getting a concept in the (A) sense to be explicitly desired because that’s likelier to generalize out-of-distribution, e.g. via out-of-the-box plans. (Arguably.) There are indeed situations where the desires bestowed by Approval Reward come apart from social status as normally understood (cf. this section, plus the possibility that we’ll all get addicted to sycophantic digital friends upon future technological changes), and I wonder whether the whole question of “is Approval Reward exactly creating social status desire, or something that overlaps it but comes apart out-of-distribution?” might be a bit ill-defined via “painting the target around the arrow” in how we think about what social status even means.
(This is a narrow reply, not taking a stand on your larger points, and I wrote it quickly, sorry for errors.)
You might (or might not) have missed that we can simultaneously be in defer-to-predictor mode for valence, override mode for goosebumps, defer-to-predictor mode for physiological arousal, etc. It’s not all-or-nothing. (I just edited the text you quoted to make that clearer.)
In “defer-to-predictor” mode, all of the informational content that directs thought rerolls is coming from the thought assessors in the Learned-from-Scratch part of the brain, even if that information is neurologically routed through the steering subsystem?
To within the limitations of the model I’m putting forward here (which sweeps a bit of complexity under the rug), basically yes.
The black border around your MacBook screen would be represented in some tiny subset of the cortex before you pay attention to it, and in a much larger subset of the cortex after you pay attention to it. In the before state (when it’s affecting a tiny subset of the cortex), I still want to declare it part of the “thought”, in the sense relevant to this post, i.e. (1) those bits of the cortex are still potentially providing context signals for the amygdala, striatum, etc., and (2) those bits are still interconnected with and compatible with what’s happening elsewhere in the cortex. If that tiny subset of the cortex doesn’t directly connect to the hippocampus (which it probably doesn’t), then it won’t directly impact your episodic memory afterwards, although it still has an indirect impact via needing to be compatible with the other parts of the cortex that it connects to (i.e., if the border had been different than usual, you would have noticed something wrong).
If we think in terms of attractor dynamics (as in Hopfield nets, Boltzmann machines, etc.), then I guess your proposal in this comment corresponds to the definitions: “thought” = “stable attractor state”, and “proto-thought” = “weak disjointed activity that’s bubbling up and might (or might not) eventually develop into a new stable attractor state”.
Whereas for the purposes of this series, I’m just using the simpler “thought” = “whatever the cortex is doing”. And “whatever the cortex is doing” might be (at some moment) 95% stable attractor + 5% weak disjointed activity, or whatever.
Is there a reason why these “proto-thoughts” don’t have the problem cited above, that forces “thoughts” to be sequential?
Weak disjointed activity can be hyper-local to some tiny part of the cortex, and then it might or might not impact other areas and gradually (i.e. over the course of 0.1 seconds or whatever) spread into a new stable attractor for a large fraction of the cortex, by outcompeting the stable attractor which was there before.
(I’m exaggerating a bit for clarity; the ability of some local pool of neurons to explore multiple possibilities simultaneously is more than zero, but I really don’t think it gets very far at all before there has to be a “winner”.)
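(If a concrete toy model helps: here’s a minimal Hopfield-style sketch, purely illustrative and obviously nothing like real cortex, of the dynamic I have in mind. Weak, disjointed initial activity settles into one of a couple of stored attractors, and at any given moment there can only be one “winner”.)

```python
# Minimal Hopfield-style sketch (illustrative only): two stored patterns
# play the role of two possible "thoughts"; weak / noisy initial activity
# settles into a single stable attractor.
import numpy as np

rng = np.random.default_rng(0)
patterns = np.array([
    [1, 1, 1, 1, -1, -1, -1, -1],   # "Thought A"
    [1, -1, 1, -1, 1, -1, 1, -1],   # "Thought B" (orthogonal to A)
])
W = sum(np.outer(p, p) for p in patterns).astype(float)  # Hebbian weights
np.fill_diagonal(W, 0)

state = np.sign(rng.normal(size=8))   # weak, disjointed initial activity
state[state == 0] = 1
for _ in range(50):                   # asynchronous updates
    i = rng.integers(8)
    state[i] = 1 if W[i] @ state >= 0 else -1

print(state)  # typically settles onto one stored pattern (or its mirror image)
```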
…fish…
No, I was trying to describe sequential thoughts. First the fish has Thought A (well-established, stable attractor, global workspace) “I’m going left to my cave”, then for maybe a quarter of a second it has Thought B (well-established, stable attractor, global workspace) “I’m going right to the reef”, then it switches back to Thought A. I was not attempting to explain why those thoughts appeared rather than other possible thoughts, rather I was emphasizing the fact that these are two different thoughts, and that Thought B got discarded because it seemed bad.
I just reworded that section, hopefully that will help future readers, thanks.
FYI, I just revised the post, mainly by adding a new §5.2.1. Hopefully that will help you and/or future readers understand what I’m getting at more easily. Thanks for the feedback (and of course I’m open to further suggestions).
If memory serves, the journal Foundations of Physics was long known as a place for people to publish wild fringe theories that would never get accepted by more mainstream physics journals.
I remember back in 2007, this was common knowledge, so it was big news that (widely respected physicist) Gerard ’t Hooft was due to take over as editor-in-chief, and people in the physics department were speculating about whether he would radically change the nature of the journal. I don’t know whether that happened or not. But anyway, 1997 is before that.
I feel like you omit the possibility that the trait of motivated reasoning is like the “trait” of not-flying. You don’t need an explanation for why humans have the trait of not-flying, because not-flying is the default. Why didn’t this “trait” evolve away? Because there aren’t really any feasible genomic changes that would “get rid” of not-flying (i.e. that would make humans fly), at least not without causing other issues.
RE “evolutionarily-recent”: I guess your belief is that “lots of other mammals engaging in motivated reasoning” is not the world we live in. But is that right? I don’t see any evidence either way. How could one tell whether, say, a dog or a mouse ever engages in motivated reasoning?
My own theory (see [Valence series] 3. Valence & Beliefs) is that planning and cognition (in humans and other mammals) works by an algorithm that is generally very effective, and has gotten us very far, but which has motivated reasoning as a natural and unavoidable failure mode. Basically, the algorithm is built so as to systematically search for thoughts that seem good rather than bad. If some possibility is unpleasant, then the algorithm will naturally discover the strategy of “just don’t think about the unpleasant possibility”. That’s just what the algorithm will naturally do. There isn’t any elegant way to avoid this problem, other than evolve an entirely different algorithm for practical intelligence / planning / etc., if indeed such an alternative algorithm even exists at all.
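(Here’s a cartoon of that failure mode, a toy sketch rather than a serious model of anything in the brain: thoughts get proposed, lower-valence thoughts tend to get rerolled or discarded, and so an unpleasant-but-important consideration systematically drops out of the search.)

```python
# Toy sketch of motivated reasoning as a side-effect of valence-guided
# thought search (made-up thoughts and valences, purely illustrative).
import random

random.seed(0)

THOUGHTS = {
    "my plan will work and I'll be rich": +0.9,
    "I could tweak the plan to look even better": +0.6,
    "my plan has a fatal flaw I should investigate": -0.8,  # important, but unpleasant
}

def next_thought(current=None, n_rerolls=5):
    """Propose thoughts at random; keep whichever proposal has the highest valence."""
    best, best_valence = current, THOUGHTS.get(current, float("-inf"))
    for _ in range(n_rerolls):
        candidate = random.choice(list(THOUGHTS))
        if THOUGHTS[candidate] > best_valence:
            best, best_valence = candidate, THOUGHTS[candidate]
    return best

stream, thought = [], None
for _ in range(10):
    thought = next_thought(thought)
    stream.append(thought)

print(stream)  # the fatal-flaw thought essentially never wins the competition
```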
Our brain has a hack-y workaround to mitigate this issue, namely the “involuntary attention” associated with anxiety, itches, etc., which constrains your thoughts so as to make you unable to put (particular types of) problems out of your mind. In parallel, culture has also developed some hack-y workarounds, like Reading The Sequences, or companies that have a red-teaming process. But none of these workarounds completely solves the issue, and/or they come along with their own bad side-effects.
Anyway, the key point is that motivated reasoning is a natural default that needs no particular explanation.
(Thanks for the thought-provoking post.)
Couple nitpicks:
If you’re going to join the mind and you don’t care about paper clips and it cares about paper clips, that’s not going to happen. But if it can offer some kind of compelling shared value story that everybody could agree with in some sense, then we can actually get values which can snowball.
I thought the “merge” idea was that, if the super-mind cares about paperclips and you care about staples, and you have 1% of the bargaining power of the super-mind, then you merge into a super+1-mind that cares 99% about paperclips and 1% about staples. And that can be a Pareto improvement for both. Right?
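(Toy arithmetic with made-up numbers, to spell out why I’d expect the merge to be a Pareto improvement; I’m not sure this is exactly what the speaker had in mind. If open conflict would burn some resources, then both sides can expect more of what they care about from the weighted merge than from fighting it out:)

```python
# Made-up numbers, purely to illustrate the Pareto-improvement claim.
R = 1.0                      # total resources at stake
p_small_wins = 0.01          # small (staple-loving) agent's chance of winning a conflict
conflict_waste = 0.10        # fraction of resources destroyed by fighting

# Expected resources devoted to each party's values if they fight it out:
fight_staples    = p_small_wins       * (1 - conflict_waste) * R   # 0.009
fight_paperclips = (1 - p_small_wins) * (1 - conflict_waste) * R   # 0.891

# If they instead merge into one agent with a 99% / 1% weighted utility,
# and it allocates resources in proportion to those weights:
merge_staples    = 0.01 * R   # 0.01 > 0.009
merge_paperclips = 0.99 * R   # 0.99 > 0.891

print(fight_staples, merge_staples, fight_paperclips, merge_paperclips)
```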
For one thing, it doesn’t really care about the actual von Neumann conditions like “not being money-pumped” because it’s the only mind, so there’s not an equilibrium that keeps it in check.
I think “not being money-pumped” is not primarily about adversarial dynamics, where there’s literally another agent trying to trick you, but rather about the broader notion of having goals about the future, and being effective in achieving those goals. Being Dutch-book-able implies sometimes making bad decisions by your own lights, and a smart agent should recognize that this is happening and avoid it, in order to accomplish more of its own goals.
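(Here’s the classic textbook money-pump toy example, not something from the talk: an agent with cyclic preferences will pay a small fee for each “upgrade” and wind up back where it started, poorer, which is a bad outcome by its own lights.)

```python
# Classic money-pump illustration: cyclic preferences A > B > C > A.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y) means "prefers x over y"

def will_trade(have, offered):
    """Agent accepts any trade to something it prefers, for a $1 fee."""
    return (offered, have) in prefers

holding, money = "C", 100
for offered in ["B", "A", "C", "B", "A", "C"]:   # the "pump"
    if will_trade(holding, offered):
        holding, money = offered, money - 1

print(holding, money)  # back to holding "C", but $6 poorer -- a bad outcome by its own lights
```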
(TBC there are other reasons to question the applicability of VNM rationality, including Garrabrant’s fairness thing and the assumption that the agent has pure long-term consequentialist goals in the first place.)
In the original blog post, we think a lot about slack. It says that if you have slack, you can kind of go off the optimal solution and do whatever you want. But in practice, what we see is that slack, when it occurs, produces this kind of drift. It’s basically the universe fulfilling its naturally entropic nature, in that most ways to go away from the optimum are bad. If we randomly drift, we just basically tend to lose fitness and produce really strange things which are not even really what we value.
My response to this gets at what Joe Carlsmith calls Deep Atheism. I think there just is no natural force that systematically produces goodness. I agree with you that slack is not a force that systematically produces goodness. But also, I feel much more strongly than you that competition is also not a force that systematically produces goodness. No such force exists. Too bad.
So I agree with this paragraph literally, but disagree with its connotation that competition would be better than slack.
…This is actually much easier than if we were trying to align the AIs to some kind of innate reward function that humans supposedly have…
I don’t know if you were subtweeting me here, but for the record, I agree that getting today’s LLMs to be generally nice is much easier than getting “brain-like AGI” to be generally nice (see e.g. here), and I’ve always treated “brain-like AGI” as “threat model” rather than “good plan”.
Parts of this post seem close to an error that Yudkowsky accused Schmidhuber of making:
At a past Singularity Summit, Juergen Schmidhuber thought that “improve compression of sensory data” would motivate an AI to do science and create art.
It’s true that, relative to doing nothing to understand the environment, doing science or creating art might increase the degree to which sensory information can be compressed.
But the maximum of this utility function comes from creating environmental subagents that encrypt streams of all 0s or all 1s, and then reveal the encryption key. It’s possible that Schmidhuber’s brain was reluctant to really actually search for an option for “maximizing sensory compression” that would be much better at fulfilling that utility function than art, science, or other activities that Schmidhuber himself ranked high in his preference ordering.
Specifically, parts of the talk suggest that nice things like affection, friendship, love, play, curiosity, anger, envy, democracy, liberalism, etc. are the global maxima of competitive forces in a post-AGI age (or at least, might be), whereas I think they aren’t. Merging / making copies [I sometimes call this “zombie dynamics”, in that an AGI that gets more chips will get more copies of itself to go after more resources, like a zombie horde making more zombies] has a lot to do with that, but so does simply being strategic. That gets us to a different issue:
In my mind, there are two quite different dichotomies:
The first dichotomy is:
(1A) “I intrinsically care about X [e.g. friendship] for its own sake”, versus
(1B) “X-type behaviors are instrumentally useful for accomplishing some other goal”
The second dichotomy is:
(2A) “I figured out a while ago that X-type behaviors are instrumentally useful for accomplishing some other goal, and now just carry on with X-type behaviors in such-and-such situation without thinking too hard, because there’s no point in reinventing the wheel every time”, versus
(2B) “I reason from first principles each time that X-type behaviors are instrumentally useful for accomplishing some other goal”.
You seem to use the term “amortised inference” to lump these two dichotomies together, whereas I would prefer to use that term just for (2A). Or if you like:
the first dichotomy is between (1A) Things figured out by evolution, versus (1B) Things figured out within a lifetime; while
the second dichotomy is between (2A) things figured out earlier in life, versus (2B) things figured out just now.
I think the (2A) things are extremely fragile, totally different from the (1A) things which are highly robust. For example, when I was a kid, I learned a (2A) heuristic that I should ask my parents to drive me places. Then I got older, and that heuristic stopped serving me well, so I almost immediately stopped using it. Likewise, I used a certain kind of appointment calendar for 5 years, and then someone suggested that I should switch to a different kind of appointment calendar, and I thought about it a bit, and decided they were right, and switched. I have a certain way of walking, that I’ve been using unthinkingly for decades, but if you put me in high heels, I would immediately drop that habit and learn a new one. These things are totally routine.
The discussion of “amortised inference” in the post makes it sound like a tricky thing that requires superintelligence, but it’s not. The dumbest person you know uses probably thousands or millions of implicit heuristics every day, and is able to flexibly update any of them, or add exceptions, when the situation changes such that the heuristic stops being instrumentally useful.
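(A toy sketch of how I’m thinking about the (2A)-vs-(2B) distinction, basically as caching; this is my own gloss, not a claim about what the post means by “amortised inference”.)

```python
# (2B) = re-derive the decision from first principles every time;
# (2A) = cache the answer and reuse it until the situation changes.
def plan_from_first_principles(situation):
    """(2B): expensive reasoning, run fresh each time (stub for illustration)."""
    return "ask parents for a ride" if situation == "kid, no license" else "drive yourself"

cache = {}  # (2A): heuristics figured out earlier in life

def decide(situation):
    if situation not in cache:
        cache[situation] = plan_from_first_principles(situation)
    return cache[situation]

print(decide("kid, no license"))     # computed once: "ask parents for a ride"
print(decide("kid, no license"))     # reused from cache, no re-derivation
print(decide("adult with license"))  # new situation, old heuristic simply stops being used
```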
…Then there’s a normative dimension to all this. Whether people are cooperating on (2A) grounds or (2B) grounds, I really don’t care, at least not in itself. If the situation changes such that cooperating stops being instrumentally useful, the (2B) people will immediately stab their former allies in the back, whereas the (2A) people might take a bit longer before the idea pops into their heads that it’s a great idea to stab their former allies in the back. I don’t really care; neither of these is real friendship. By contrast, real friendship has to be (1A), and I do care about real friendship existing into the distant future.
I’ve been talking about cooperation, but it’s equally true that once a trained AI surpasses some level of strategic and metacognitive competence, it doesn’t need curiosity (see Soares post), or play, anger, etc. It can figure out that it should do all those things strategically, and I don’t think it’s any harder than figuring out quantum mechanics etc.
In the article “Proposal for an experimental test of the many-worlds interpretation of quantum mechanics”, its author R. Plaga suggested that if we trap an ion in a quantum well, we can later use it for one-time communication between multiverse branches.
The paper must be wrong. Inter-branch communication is impossible. I searched briefly for a published rebuttal, and found this guy on Quora who claims that Plaga himself eventually came around to his proposal being mistaken. That’s hearsay about Plaga, but the Quora response also purports to explain the exact error, and it seems sensible at a glance. (I didn’t scrutinize either the paper or the Quora rebuttal; I’m more confident about the wrongness than about what exactly the error is.)
I feel like I see it pretty often. Check out “Unfalsifiable stories of doom”, for example.
Or really, anyone who uses the phrase “hypothetical risk” or “hypothetical threat” as a conversation-stopper when talking about ASI extinction, is implicitly invoking the intuitive idea that we should by default be deeply skeptical of things that we have not already seen with our own eyes.
Obviously I agree that The Spokesperson is not going to sound realistic and sympathetic when he is arguing for “Ponzi Pyramid Incorporated” led by “Bernie Bankman”. It’s a reductio ad absurdum, showing that this style of argument proves too much. That’s the whole point.