I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. Physicist by training. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing.
Steven Byrnes
is it really practically impossible for a transformer forward pass to simulate a deep-Q style learning algorithm? Even with e.g. 3–5 OOMs more compute than GPT-4.5?
I say yes. You left out an important part, here it is in italics: “is it really practically impossible for a transformer forward pass to simulate a deep-Q style learning algorithm churning for millions of steps?”
Yes, because an awful lot can happen in millions of steps, including things that build on each other in a serial way.
I worry you could’ve made this same argument ten years ago for simulating human expert behavior over 8-hour time horizons — which involves some learning, e.g. navigating a new code base, or checking code on novel unit tests. It’s shallow learning, sure. You don’t have to update your world model that much. But it’s not nothing, and ten years ago I probably would’ve been convinced that a transformer forward pass could never practically approximate it.
I disagree that it should be called “learning” at all. It would be “learning” for a human in real life, but if you imagine a person who has read 2 billion lines of code [that’s the amount of GitHub code in The Pile … actually today’s LLMs probably see way more code than that], which would correspond to reading code 24 hours a day for 100 years, then I believe that such a person could do the METR 8-hour tasks without “learning” anything new whatsoever. You don’t need to “learn” new things to mix-and-match things you already know in novel ways—see my example here of “imagine a pink fuzzy microphone falling out of a helicopter into a football stadium full of bunnies”. And see also: related discussion here.
why have a transformer simulate a neural net running some RL algorithm when you could just train the RL agent yourself?
Yup, that’s my main point in this post, I expect that sooner or later somebody will invent real-deal continual learning, and it will look like a bona fide learning algorithm written in PyTorch with gradient descent steps and/or TD learning steps and/or whatever else, as opposed to (so-called) “in-context learning” or RAG etc.
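To make the distinction concrete (this is a minimal made-up sketch, not a claim about what the eventual algorithm will look like): the defining feature of a bona fide learning algorithm, as opposed to in-context learning or RAG, is that each TD-learning or gradient-descent step actually mutates stored parameters. E.g. a tabular Q-learning loop on a toy two-state environment:

```python
# Minimal sketch of a "bona fide learning algorithm" step: a tabular
# Q-learning (TD) update that actually mutates stored parameters.
# The toy 2-state environment and all numbers are made up for illustration.
import random

random.seed(0)

n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]  # the stored "weights"
alpha, gamma = 0.1, 0.9

def td_update(s, a, r, s_next):
    """One TD step: move Q[s][a] toward r + gamma * max_a' Q[s_next][a']."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

# Toy environment: action 1 in state 0 pays reward 1 and leads to state 1;
# everything else pays 0 and leads to state 0.
for _ in range(1000):
    s = random.randrange(n_states)
    a = random.randrange(n_actions)
    r = 1.0 if (s == 0 and a == 1) else 0.0
    s_next = 1 if (s == 0 and a == 1) else 0
    td_update(s, a, r, s_next)

# After enough updates, the parameters themselves encode what was learned:
print(Q[0][1] > Q[0][0])  # -> True
```

The point of the sketch is just that the knowledge ends up persistently baked into `Q` (the analog of weights), rather than living in a context window that gets thrown away.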
Thanks, I just deleted that whole part. I do believe there’s something-like-that which is true, but it would take some work to pin down, and it’s not very relevant to this post, so I figure, I should just delete it. :-)
In case anyone’s curious, here’s the edit I just made:
OLD VERSION:
Anyway, by assuming “brain-like AGI”, I get the right to make certain assumptions about the cognitive architecture, representations, learning algorithms, and so on.
Some of these “brain-like AGI ingredients” are universal parts of today’s popular ML algorithms (e.g. learning algorithms; distributed representations).
Others of these “brain-like AGI ingredients” are (individually) present in a subset of today’s popular ML algorithms but absent from others (e.g. reinforcement learning; predictive [a.k.a. self-supervised] learning; explicit planning).
Still others of these “brain-like AGI ingredients” seem mostly or totally absent from today’s most popular ML algorithms (e.g. ability to form “thoughts” [e.g. “I’m going to the store”] that blend together immediate actions, short-term predictions, long-term predictions, and flexible hierarchical plans, inside a generative world-model that supports causal and counterfactual and metacognitive reasoning).
So in this sense, “brain-like AGI” is a specific thing that might or might not happen, independently of “prosaic AGI”. Much more on “brain-like AGI”, or at least its safety-relevant aspects, in the subsequent posts.
NEW VERSION:
Anyway, by assuming “brain-like AGI”, I get the right to make certain assumptions about the cognitive architecture, representations, learning algorithms, and so on. Some of those assumptions would also apply to some existing AI algorithms. But if you take the whole package together—all the parts and how they interconnect—it constitutes a yet-to-be-invented AI architecture. So in this sense, “brain-like AGI” is a specific thing that might or might not happen, independently of “prosaic AGI”. Much more on “brain-like AGI”, or at least its safety-relevant aspects, in the subsequent posts.
No worries, seems fine.
FWIW, my current feeling is like 25% probability that narrowing eyes (in anger etc.) has a functional explanation related to vision (as opposed to changing how your face looks to other people, or defending your eyes from attack, or whatever), and 80% probability that widening eyes (in fear etc.) has a functional explanation related to vision. But I didn’t think about it too hard.
In both cases, regardless of whether it’s functional or not, I have very high confidence that it’s an innate reaction, not a product of within-lifetime learning.
I think the point I was trying to make in this post is both narrower and weirder than the general topics of humans supervising more competent AIs, and generation-verification gaps. For example, my self-image might be partly formed from admiration of the character traits of a cartoon character, or Jesus, etc., and I might feel pride in acting in ways that I imagine them approving of, and that might influence how I go about my day-to-day conduct as a string theory researcher. But Jesus is long gone, and the cartoon character doesn’t even exist at all, and certainly neither was able to evaluate string theory ideas. They’re not “supervising” me in that sense.
Actual humans supervising actual AGIs is something that Paul talked about in IDA stuff, and like I said in the OP, I reject that entire line of research as a dead end.
Separately, I agree that “humans are an existence proof that safe & beneficial brain-like AGI is possible in principle” needs a heavy dose of nuance and caveats (humans are working towards misaligned AGI right now, plus I’d generally expect tech progress to drive humanity off the rails even without AGI or other destructive tech, among other things). But I do think there is some “…existence proof…” argument that goes through. E.g. at least some humans are making the overall situation better not worse (or if not, then we’re screwed no matter what), and AGIs don’t have to match the human population distribution.
Huh, I find the disgust example pretty plausible.
I agree that “universality-through-functionality” (§4.2) is implausible as a theory explaining all universal facial expressions. At least some universal facial expressions do not have immediate functional explanations, seems to me. E.g. the angry open-mouth tooth-showing grimace / scowl was presumably functional in chimps, because they’re showing off their fangs as a credible signal that they’re dangerous. We don’t have any fangs to show off, but we still have that same expression.
But I’m also sympathetic to there being more than zero universal facial expressions that do have immediate functional explanations. Not sure if I’m disagreeing with you or not.
(I don’t currently have a strong opinion one way or the other about whether Barrett’s claims here are plausible.)
I tried narrowing my eyes. This does not help improve my vision.
Well, there is a relation between the aperture of a camera and its depth of field. (Famous example: pinhole cameras can focus at any depth despite having no lens at all.) (Another famous example: I think this is why people squint when they aren’t wearing their glasses.) If the story is real at all, the effect might be more apparent in a dark environment, since then your pupil will be dilated, and also more apparent when trying to view something that has both near and far parts, such that you can’t focus on both simultaneously. Yes, it’s possible that this is too subtle an effect to matter in practice; I’m just trying to steelman it. I can’t immediately think of a DIY demo to try.
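For what it’s worth, here’s a back-of-the-envelope version of the geometric-optics story (a toy thin-lens model; the focal length and distances are illustrative round numbers, not anatomically precise): the blur-circle diameter for an out-of-focus object scales linearly with the aperture, so narrowing the effective aperture to a third of the pupil diameter cuts defocus blur by a factor of three.

```python
# Toy thin-lens model: blur-circle size for an out-of-focus object,
# as a function of pupil (aperture) diameter. Numbers are illustrative.

def image_distance(f, s):
    """Thin-lens equation 1/f = 1/s + 1/i, solved for i."""
    return 1.0 / (1.0 / f - 1.0 / s)

def blur_circle(aperture, f, s_focus, s_object):
    """Blur-circle diameter at the sensor (similar triangles) for an
    object at s_object when the lens is focused at s_focus."""
    i_focus = image_distance(f, s_focus)    # sensor plane
    i_object = image_distance(f, s_object)  # where object's rays converge
    return aperture * abs(i_object - i_focus) / i_object

f = 0.017        # ~17 mm focal length (rough human-eye figure)
s_focus = 2.0    # eye focused at 2 m
s_object = 0.5   # object at 0.5 m, i.e. out of focus

wide = blur_circle(0.006, f, s_focus, s_object)    # ~6 mm dark-adapted pupil
squint = blur_circle(0.002, f, s_focus, s_object)  # ~2 mm effective aperture

# Blur scales linearly with aperture:
print(round(wide / squint, 2))  # -> 3.0
```

This is consistent with the dark-environment prediction above: a dilated pupil means a bigger aperture, hence more defocus blur for squinting to remove.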
Widening my eyes does not seem to improve my peripheral vision.
Hmm, I think I disagree. I think there’s a part of my field-of-view (mainly the top part obviously) that’s black when my eyes are relaxed, but that I can see when my eyes are widened. As usual with peripheral vision, you kinda have to be paying attention to it, and it’s also easier to notice when there’s something moving. Here’s the procedure I tried just now: Hold your head straight, pick a fixation point in front of you (or better yet downward), and hold your hand with wiggling fingers as high up as it can go until you can’t see the wiggling, repeat with and without widening your eyes. Seemed like a nonzero effect to me (but not huge).
See also a thread here where I was also complaining about this.
I don’t think this is related to the points I was making in the post … But happy to chat about that anyway.
Yeah sure, common sense says that smart people will tend to enjoy being in more meritocratic intellectual fields, rather than less meritocratic ones, and also that fields in general tend to be more meritocratic when quality is easy to judge (although other things matter too, e.g. glamorous fields have it tougher because they attract grifters).
See e.g. what I wrote here about experimental science.
The mathematics community has successfully kept the cranks out, as far as I know, but two grimly amusing failures (in my controversial opinion) are: (1) in the 2000s, the (correct) theoretical physics consensus that we should be focusing on string theory was somewhat broken by an invasion of people unable to tell good physics theory from bad (e.g. “loop quantum gravity”), and there were enough of such people (including department chairs etc) that they broke the blockade and wound up with positions and credentials; (2) this funny anecdote in Dan Dennett’s memoir:
The hegemony of the analytic philosophers evaporated in 1979, at the Eastern Division meeting of the APA [American Philosophical Association] in Boston, when a coup d’état was staged by a group of mostly American but Continental philosophers who called themselves pluralists (let a thousand flowers bloom). I wonder how many of today’s young philosophers and graduate students have ever heard about this. It was an academic earthquake at the time. Frustrated by the short shrift given them by members of the “analytic monolith,” these philosophers studied the bylaws of the APA and discovered that although for decades the nominating committee had put forward a single candidate for vice president who was then elected by acclaim and would succeed as president the following year, the rules allowed nominations from the floor and actual elections! In secret, the pluralists put together their slate, prepared their challenges to the parliamentarian and other officers, and made sure their members were all set to descend en masse on the lightly attended business meeting and take over the APA Eastern Division. About half an hour before the meeting, their security broke down: a coup was rumored to be in the offing, and we monolith members were rounded up in the bar and hustled to the meeting to try to fend off the usurpation. Dick Rorty was president that year, and it was an irony (one of his favorite topics) that he—the most ecumenical and open-minded of the “analytic monolith” leaders—presided over the meeting, while Tom Nagel executed his duties as parliamentarian with aplomb. There were nominating speeches and rebuttals, the most memorable of which was by Ruth Marcus, whose Yale colleague John Smith, a philosopher of religion and a theologian, was the pluralists’ candidate. She explicitly trashed his whole career, his character, his books. I had never heard a philosopher speak so ill of a colleague in public, and seldom in private.
We lost. The establishment had nominated Adolf Grünbaum, a Pittsburgh philosopher of science, to be the new vice president. Not wanting to offend innocent Adolf, the victorious pluralists nominated and elected him vice president the following year, so that in 1982 he finally got to deliver the presidential address he had expected to give earlier. He did not accept the olive branch with equanimity. Adolf was famous for his tirades against Freud as an unscientific poseur, and his address was vintage Grünbaum. I happened to follow a cluster of pluralists out of the hall at the close of his address and overheard the reply when a pluralist who had stayed away asked how Grünbaum’s address had gone: “It was nasty, brutish and long.”
Thereafter, the APA’s programs were filled with papers on topics, and by philosophers, that would never have made the cut before the pluralist coup. Was this a good thing? Yes, said some monolith members, since it meant there was more guilt-free time to spend in the bar at conventions. Yes, said others, since the pluralists had justice on their side. My verdict is mixed. Still, the published programs of the APA meetings list dozens of talks whose titles are so ripe for parody that when I recently perused a few looking for likely examples to anonymize, I had difficulty “improving” on the actual candidates, but ask yourself whether you are aching to go to the sessions where the following talks will be given:
“The Ineffability of History and the Problem of the Unitary Self”
“Dialectical Encroachment: Humiliation and Integrity”
“Can Relationalistic Ontology Avoid Incoherence through a Recursive Metatheory?”
“Art as War: The Resilience of Autonomy”
Having said all that…
If your proposal is:
von Neumann and Tao did math-y stuff rather than other stuff because they got adulation when they did math-y stuff and they got heckled by idiots when they did other stuff.
…then I think that’s part of it but not all of it. I would note that they presumably got good at math by thinking about math all the time, and if they were thinking about math all the time, it’s probably because they found it very satisfying and enjoyable to think about math. I have a kid like that—when he was like 8 years old, I might be talking about politics at dinner or whatever, and he would interrupt me to share something he just thought of about perfect squares that he found very exciting. I.e., some people, when their mind is wandering, think about other people, and some people think about sports, and he was evidently thinking about perfect squares. Anyway, if a person intrinsically enjoys thinking about numbers and symbols, then it stands to reason that they would probably choose a career where they get to think about numbers and symbols all day.
I sometimes wonder why physicists were so overrepresented in the AI x-risk community in the early-ish days (e.g. Hawking, Tegmark, Wilczek, Musk, Tallinn, Rees, Omohundro, Aguirre…). The best I can come up with is that people who self-select into physics are unusually likely to have the combination of (1) being smart & quantitative, and (2) being really, deeply, profoundly bothered by not understanding important things about the world.
I wrote:
Are there any examples where their “GLA” gets much higher reward than anything it ever observed in the training data, in the very same environment that the training data was drawn from, by discovering better strategies that were not seen in the training data (just as PPO itself would do if you keep running it)? E.g. there should be an easy experiment where you just cut off the training data well before the PPO teacher finishes converging to the optimal policy, and see if the GLA keeps rising and rising, just as the (unseen) PPO teacher would do. That seems like a really easy experiment—it all happens in just one RL environment. The fact that they don’t talk about anything like that is fishy.
Then you replied:
Yeah. The graph on the right in Figure 3 illustrates that the learned in-context RL algorithm performs better than, and improves on, the PPO agent whose data it used…
But now I think you’re conceding that you were wrong about that after all, and in fact this graph provides no information either way on whether the GLA agent attained a higher score than it ever saw the PPO agent attain, because the GLA agent probably got to see the PPO agent continue to improve beyond the 31 episodes that we see before the figure cuts off.
Right?
Or if not, then you’re definitely misunderstanding my complaint. The fact that the GLA curve rises faster than the PPO curve in the right side of figure 3 is irrelevant. It proves nothing. It’s like … Suppose I watch my friend play a video game and it takes them an hour to beat the boss after 20 tries, most of which is just figuring out what their weak point is. And then I sit down and beat the same boss after 2 tries in 5 minutes by using the same strategy. That doesn’t prove that I “learned how to learn” by watching my friend. Rather, I learned how to beat the boss by watching my friend.
(That would be a natural mistake to make because the paper is trying to trick us into making it, to cover up the fact that their big idea just doesn’t work.)
Is the reason that you think it could work for a minute but not 100yr a practical matter of efficiency or one that has a more fundamental limitation that you couldn’t get around with infinite context window/training data/etc?
The “one minute” thing is less about what LLMs CAN do in one minute, and more about what humans CAN’T do in one minute. My claim would be that humans have a superpower of “real” continual learning, which nobody knows how to do with LLMs. But if you give a human just 60 seconds, then they can’t really use that superpower very much, or at least, they can’t get very far with it. It usually takes much more than one minute for people to build and internalize new concepts and understanding to any noticeable degree.
Even with a context window that contains all 10M moves, or do you mean within reasonably limited context windows?
Yes, even with a context window that contains all 10M moves. Making that argument was the whole point of the second half of the OP. If you don’t find that convincing, I’m not sure what else to add. ¯\_(ツ)_/¯
OK. The “parametrized family of continual learning algorithms” frame makes a lot of your earlier comments make more sense now. Thanks.
Next: I guess we’re assuming that (1) we have a parametrized family of continual learning algorithms, (2) human learning and thinking is part of that family (although we don’t know a priori which one), and (3) you can take some adult human “Joe”, and search through the parametrized family to find one that matches his behavior, and thus wind up with a Joe-imitating algorithm.
I’ll set aside for now whether these assumptions are plausible, and ask a different question: If we make those assumptions then … aren’t we already done? Just make a Joe-imitation and run a million copies of it at 100× speed, and have them work together on AI x-risk (pivotal act, alignment research, whatever).
To me, this seems much simpler than the iterative protocol you discuss in the OP, and equally viable if not more so. What am I missing?
What is Grog in the context of our conversation? You seem to admit at the end that LLMs are not really at all like Grog, in that Grog has no underlying bedrock of understanding, while modern LLMs do.
Grog understands some things (e.g. intuitive physics) but not others (e.g. pulsed lasers). Likewise, LLMs understand some things (e.g. pulsed lasers) but not others (e.g. some new field of science that hasn’t been invented yet). Right? We’re not at the end of history, where everything that can possibly be understood is already understood, and there’s nothing left.
If I hibernated you until the year 2100, and then woke you up and gave you a database with “actionable knowledge” from 1000 textbooks of [yet-to-be-invented fields of science], and asked you to engineer a state-of-the-art [device that no one today has even conceived of], then you would be just as helpless as Grog. You would have to learn the new fields until you understood them, which might take years, before you could even start on the task. This process involves changing the “weights” in your brain. I.e., you would need “real” learning. The database is not a replacement for that.
So think of it this way: there’s some set of things that are understood (by anyone), and that set of things is not increased via a system for pulling up facts from a database. Otherwise Grog would be able to immediately design LIDAR. And yet, humans are able to increase the set of things that are understood, over time. After all, “the set of things that are understood” sure is bigger today than it was 1000 years ago, and will be bigger still in 2100. So evidently humans are doing something very important that is entirely different from what can be done with database systems. And that thing is what I’m calling “real” continual learning.
The graph on the right in Figure 3 illustrates that the learned in-context RL algorithm performs better than, and improves on, the PPO agent whose data it used.
(still talking about this paper) Are you saying that the GLA was trained ONLY on imitation learning during the 31 episodes shown, in which the PPO “teacher” performed no better than a random policy, and then the GLA got way higher scores?
If so … no way, that’s patently absurd. Even if I grant the premise of the paper for the sake of argument, the GLA can’t learn to improve itself via imitating a PPO teacher that is not actually improving itself!
So, if the right-side-of-figure-3 data is not totally fabricated or mis-described, then my next guess would be that they ran the PPO for many more episodes than the 31 shown, and trained the GLA on all that, and that by the end of the training data, the PPO “teacher” was performing much better than shown in the figure, and at least as well as the top of the GLA curve.
I’m pretty confused. This comment is just trying to get on the same page before I start arguing :-)
I disagree; you would be imitation learning to run that learning algorithm, and I see no principled reason this cannot be practical.
Presumably this is a learning algorithm with weights, and PyTorch code that updates the weights. My question is: how are the weights being updated? Are they being updated by a continual learning objective (e.g. RL, self-distillation, whatever), or are they being updated by an imitation-learning objective (self-supervised learning on the outputs of the “teacher”)? Or are you interspersing both? Or are there two different sets of weights, one for each type of update? Or what?
You would in fact need a great continual learning algorithm in order to imitation learn how to continually learn.
My interpretation of this part is: you’re imagining that we have written down a parametrized family of continual learning algorithms, and you have black-box access to a “teacher” continual learning algorithm which we know is somewhere in this space of continual learning algorithms, but we don’t know where. Then I agree (in principle) that you can do imitation learning to home in on which element of your parametrized family of continual learning algorithms matches the teacher.
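As a toy illustration of that setup (everything here is made up for illustration): let the “family” be exponential-moving-average learners indexed by a single parameter, let the “teacher” be one hidden member of the family, and let “imitation learning” be a search over the family for the member whose behavior matches the teacher’s.

```python
# Toy version of "imitation-learn which member of a parametrized family
# of learning algorithms the teacher is". The family: exponential-moving-
# average learners indexed by alpha. All names/numbers are illustrative.

def ema_predictions(alpha, stream):
    """Run the learner with parameter alpha over a data stream,
    returning its prediction before seeing each observation."""
    est, preds = 0.0, []
    for x in stream:
        preds.append(est)
        est += alpha * (x - est)  # the learner's update rule
    return preds

stream = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.5, 0.25]
teacher = ema_predictions(0.3, stream)  # teacher's hidden parameter: 0.3

# "Imitation learning" here = grid search for the family member whose
# behavior best matches the teacher's behavior on the same stream.
candidates = [i / 100 for i in range(1, 100)]
best = min(candidates,
           key=lambda a: sum((p - t) ** 2
                             for p, t in zip(ema_predictions(a, stream),
                                             teacher)))
print(best)  # -> 0.3
```

Note that what gets recovered is the learning rule itself (here, the parameter alpha), not the teacher’s conclusions on any particular stream — which is the sense in which this differs from ordinary behavioral cloning.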
Does that match what you’re trying to say here?
I mentioned in a footnote that the “algorithmic distillation” paper (Laskin et al. 2022) was misleading, as discussed here. Your links are in the same genre, and I’m pretty skeptical of them too. Also confused.
I mostly tried to read your first suggestion, Towards General-Purpose In-Context Learning Agents.
Are there any examples where their “GLA” gets much higher reward than anything it ever observed in the training data, in the very same environment that the training data was drawn from, by discovering better strategies that were not seen in the training data (just as PPO itself would do if you keep running it)? E.g. there should be an easy experiment where you just cut off the training data well before the PPO teacher finishes converging to the optimal policy, and see if the GLA keeps rising and rising, just as the (unseen) PPO teacher would do. That seems like a really easy experiment—it all happens in just one RL environment. The fact that they don’t talk about anything like that is fishy.
The transfer-learning thing (fig 5) is hard to interpret. What does “not randomized” mean? Why does PPO start at zero and then immediately get worse in the bottom-left one? What would be the “test return” for a random policy, or the no-op policy, or any other relevant baseline, for all four of these? Why is their PPO so bad? Were they using crappy PPO hyperparameters to make GLA look better by comparison? How many other environments did they try but bury in their file drawer? Why is their source code not online? The curves just generally look really unconvincing to me, and my gut reaction is that they were just flailing around for something to publish, because their exciting claim (meta-learning) doesn’t really work.
I could be wrong, perhaps you’re more familiar with this literature than I am.
Thanks!
I must imagine that there’s some neuroscience literature on sexual attraction, where brain region activations are cross-referenced with self-reported feelings of attraction, and referencing this would help support the point.
Alas it’s much less useful than you’d think, at least for the kinds of questions that I’m interested in. My view is it’s extremely difficult to learn anything useful from fMRI studies, at least for this kind of question. I think the important nuts-and-bolts questions would be answerable by measuring the activity and interconnectivity of tiny cell groups in the hypothalamus, but that’s not experimentally possible as of today.
(fMRI is not helpful for that: the relevant cell groups are all too small and physically proximate (sometimes even intermingled) to tell them apart by location as opposed to by receptor expression etc., and it’s moot anyway because fMRI just can’t measure the human hypothalamus at all, it’s too close to a major artery or something, I forget.)
The claim “modern LLMs can pursue goals and act like agents” does not contradict the claim “modern LLMs get their capabilities primarily from imitative learning”, right? Because there are examples in the pretraining data (including in the specially-commissioned proprietary expert-created data) where human-created text enacts the pursuit of goals. Right? See also here.
And they can make novel discoveries about the world. In my own research work, I’ve seen a coding agent develop an analysis from a vague prompt, execute the code, iterate through error/issues, and make conclusions that no human knows.
I do think LLMs struggle when they depart from what’s in the pretraining data, but the meaning of that is a bit tricky to pin down. Like, if I ask you to “imagine a pink fuzzy microphone falling out of a helicopter into a football stadium full of bunnies”, you can do that in a fraction of a second, and correctly answer follow-up questions about that scenario, even though this specific mental image has never happened before in the history of the world. And LLMs can do that kind of thing too. Is that “departing from what’s in the pretraining data”? My answer is: No, not in the sense that matters. When you say your LLM “make[s] conclusions that no human knows”, I suspect it’s a similar kind of thing: it’s not “departing from the pretraining data” in the sense that matters. Indeed, anything that a third party can simply read and immediately understand is not “departing from the pretraining data” in the sense that matters, even if the person didn’t already know it.
By contrast, if you don’t know linear algebra, you can’t simply read a linear algebra textbook and immediately understand it. You need to spend many days and weeks internalizing these new ideas.
Anyway, in the post, I tried to be maximally clear-cut, by using the example of how billions of humans over thousands of years invented language, science, technology, industry, culture, and everything else in the $100T global economy entirely from scratch, without angels dropping new training data from the heavens. I very strongly don’t believe that billions of LLMs over thousands of years, in a sealed datacenter without any human intervention or human data, could do that. That would be real departure from the pretraining data. See also here and the first section here.
FYI I just wrote a post You can’t imitation-learn how to continual-learn which is related to this thread.
As an analogy, take an adult from 30000 BC, call him Grog, and give him access to a database of “actionable knowledge of 1000 textbooks”, and then tell him to go invent a less expensive solid-state LIDAR system. Will he immediately start making progress? I say “obviously not”.
What would the “actionable knowledge” look like? Maybe one piece of “actionable knowledge” is some fact from the ANSI Z136.1 laser eye safety manual (“For pulsed lasers of 1ns–50μs pulse duration and beam diameter 1 cm, at viewing distance 20 cm, the diffusely reflected beam energy cannot safely exceed 0.022 × CA joules, where CA is the correction factor for IR-A light based on reduced absorption properties of melanin”.) OK, Grog looks at that and immediately has some questions. What does “laser” mean? What is a “pulsed laser”? What does “ns” mean? What does “beam diameter” mean? What does “diffusely reflected” mean? Etc. etc.
This “knowledge” is not in fact “actionable” because Grog can’t make heads or tails of it.
And ditto for pretty much every other item in the database. Right?
What Grog would need to do is spend years developing a deep understanding of optics and lasers and so on before he could even start inventing a new LIDAR system. Of course, that’s what modern LIDAR inventors do: spend years developing understanding. Once Grog has that understanding, then yeah sure, convenient database access to relevant facts would be helpful, just as modern LIDAR inventors do in fact keep the ANSI Z136.1 manual in arm’s reach.
Thus, there’s more to knowledge than lists of facts. It’s ways that the facts all connect to each other in an interconnected web, and it’s ways to think about things, etc.
I claim that this all transfers quite well to LLMs. It’s just that LLMs already have decent “understanding” of everything that humans have ever written down anywhere on the internet or in any book, thanks to pretraining. So in our everyday interactions with LLMs, we don’t as often come across situations where the LLM is flailing around like poor Grog. But see 1, 2.
This post contains no plan for technical AGI alignment (or anything else). I have no such plan. See the last two paragraphs of the post.
I am trying to find such a plan (or prove that none exists), and in the course of doing so, occasionally I come across a nugget of deconfusion that I want to share :-) Hence this post.
As a general rule, I take interest in certain things that humans sometimes do or want, not because I’m interested in copying those things directly into AGIs, but rather because they are illustrative case studies for building my nuts-and-bolts understanding of aspects of motivation and learning etc. And then I can use that understanding to try to dream up some engineered system that might be useful in AGIs. The resulting engineered system might or might not resemble anything in humans or biology. By analogy, the Wright Brothers learned a lot from soaring birds, but their plane did not look like a bird.
I think what people “mainly” do is not of much interest to me right now. If a few people sometimes do X, then it follows that X is a possible thing that a brain can do, and then I can go try to figure out how the brain does that, and maybe learn something useful for technical alignment of brain-like AGI.
So along those lines: I think that there exist people who have a self-image as a person with such-and-such virtue, and take pride in that, and will (sometimes) make decisions driven by that self-image even when they have high confidence that nobody will ever find out, or worse, when they have high confidence that the people they care most about will despise them for it. They (sometimes) make that decision anyway.
I think this kind of self-image-related motivation has a deep connection to other people’s approval, and is causally downstream of their experience of such approval over a lifetime. But it is definitely NOT the same as consequentialist planning to maximize future approval / status.