No question that e.g. o3 lying and cheating is bad, but I’m confused why everyone is calling it “reward hacking”.
Let’s define “reward hacking” (a.k.a. specification gaming) as “getting a high RL reward via strategies that were not desired by whoever set up the RL reward”. Right?
If so, well, all these examples on X etc. are from deployment, not training. And there’s no RL reward at all in deployment. (Fine print: Maybe there are occasional A/B tests or thumbs-up/down ratings in deployment, but I don’t think those have anything to do with why o3 lies and cheats.) So that’s the first problem.
Now, it’s possible that, during o3’s RL CoT post-training, it got certain questions correct by lying and cheating. If so, that would indeed be reward hacking. But we don’t know if that happened at all. Another possibility is: OpenAI used a cheating-proof CoT-post-training process for o3, and this training process pushed it in the direction of ruthless consequentialism, which in turn (mis)generalized into lying and cheating in deployment. Again, the end-result is still bad, but it’s not “reward hacking”.
Separately, sycophancy is not “reward hacking”, even if it came from RL on A/B tests, unless the average user doesn’t like sycophancy. But I’d guess that the average user does like quite high levels of sycophancy. (Remember, the average user is some random high school jock.)
Am I misunderstanding something? Or are people just mixing up “reward hacking” with “ruthless consequentialism”, since they have the same vibe / mental image?
During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments . . . . This undesirable special-casing behavior emerged as a result of “reward hacking” during reinforcement learning training.
Similarly OpenAI suggests that cheating behavior is due to RL.
I’m now much more sympathetic to a claim like “the reason that o3 lies and cheats is (perhaps) because some reward-hacking happened during its RL post-training”.
But I still think it’s wrong for a customer to say “Hey I gave o3 this programming problem, and it reward-hacked by editing the unit tests.”
I think that using ‘reward hacking’ and ‘specification gaming’ as synonyms is a significant part of the problem. I’d argue that for LLMs, which can learn task specifications not only through RL but also through prompting, it makes more sense to keep those concepts separate, defining them as follows:
Reward hacking—getting a high RL reward via strategies that were not desired by whoever set up the RL reward.
Specification gaming—behaving in a way that satisfies the literal specification of an objective without achieving the outcome intended by whoever specified the objective. The objective may be specified either through RL or through a natural language prompt.
Under those definitions, the recent examples of undesired behavior from deployment would still have a concise label in specification gaming, while reward hacking would remain specifically tied to RL training contexts. The distinction was brought to my attention by Palisade Research’s recent paper Demonstrating specification gaming in reasoning models—I’ve seen this result called reward hacking, but the authors explicitly mention in Section 7.2 that they only demonstrate specification gaming, not reward hacking.
I thought of a fun case in a different reply: Harry is a random OpenAI customer and writes in the prompt “Please debug this code. Don’t cheat.” Then o3 deletes the unit tests instead of fixing the code. Is this “specification gaming”? No! Right? If we define “the specification” as what Harry wrote, then o3 is clearly failing the specification. Do you agree?
Yep, I agree that there are alignment failures which have been called reward hacking that don’t fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was “Please rewrite my code and get all tests to pass”: in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt “Please debug this code,” then that just seems like a straightforward instruction-following failure, since the instructions didn’t ask the model to touch the code at all. “Please rewrite my code and get all tests to pass. Don’t cheat.” seems like a corner case to me—to decide whether that’s specification gaming, we would need to understand the implicit specifications that the phrase “don’t cheat” conveys.
It’s pretty common for people to use the terms “reward hacking” and “specification gaming” to refer to undesired behaviors that score highly as per an evaluation or a specification of an objective, regardless of whether that evaluation/specification occurs during RL training. I think this is especially common when there is some plausible argument that the evaluation is the type of evaluation that could appear during RL training, even if it doesn’t actually appear there in practice. Some examples of this:
Anthropic described Claude 3.7 Sonnet giving an incorrect answer aligned with a validation function in a CoT faithfulness eval as reward hacking. They also used the term when describing the rates of models taking certain misaligned specification-matching behaviors during an evaluation after being fine-tuned on docs describing that Claude does or does not like to reward hack.
This relatively early DeepMind post on specification gaming and the blog post from Victoria Krakovna that it came from (which might be the earliest use of the term specification gaming?) also gives a definition consistent with this.
I think the literal definitions of the words in “specification gaming” align with this definition (although interestingly not the words in “reward hacking”). The specification can be operationalized as a reward function in RL training, as an evaluation function or even via a prompt.
I also think it’s useful to have a term that describes this kind of behavior independent of whether or not it occurs in an RL setting. Maybe this should be reward hacking and specification gaming. Perhaps as Rauno Arike suggests it is best for this term to be specification gaming, and for reward hacking to exclusively refer to this behavior when it occurs during RL training. Or maybe due to the confusion it should be a whole new term entirely. (I’m not sure that the term “ruthless consequentialism” is the ideal term to describe this behavior either as it ascribes certain intentionality that doesn’t necessarily exist.)
Yes I’m aware that many are using terminology this way; that’s why I’m complaining about it :)
I think your two 2018 Victoria Krakovna links (in context) are all consistent with my narrower (I would say “traditional”) definition. For example, the CoastRunners boat is actually getting a high RL reward by spinning in circles. Even for non-RL optimization problems that she mentions (e.g. evolutionary optimization), there is an objective which is actually scoring the result highly. Whereas for an example of o3 deleting a unit test during deployment, what’s the objective on which the model is actually scoring highly?
Getting a good evaluation afterwards? Nope, the person didn’t want cheating!
The literal text that the person said (“please debug the code”)? For one thing, erasing the unit tests does not satisfy the natural-language phrase “debugging the code”. For another thing, what if the person wrote “Please debug the code. Don’t cheat.” in the prompt, and o3 cheats anyway? Can we at least agree that this case should not be called reward hacking or specification gaming? It’s doing the opposite of its specification, right?
As for terminology, hmm, some options include “lying and cheating”, “ruthless consequentialist behavior” (I added “behavior” to avoid implying intentionality), “loophole-finding”, or “generalizing from a training process that incentivized reward-hacking via cheating and loophole-finding”.
(Note that the last one suggests a hypothesis, namely that if the training process had not had opportunities for successful cheating and loophole-finding, then the model would not be doing those things right now. I think that this hypothesis might or might not be true, and thus we really should be calling it out explicitly instead of vaguely insinuating it.)
On a second review it seems to me the links are consistent with both definitions. Interestingly the google sheet linked in the blog post, which I think is the most canonical collection of examples of specification gaming, contains examples of evaluation-time hacking, like METR finding that o1-preview would sometimes pretend to fine-tune a model to pass an evaluation. Though that’s not definitive, and of course the use of the term can change over time.
I agree that most historical discussion of this among people as well as in the GDM blog post focuses on RL optimization and situations where a model is literally getting a high RL reward. I think this is partly just contingent on these kinds of behaviors historically tending to emerge in an RL setting and not generalizing very much between different environments. And I also think the properties of reward hacks we’re seeing now are very different from the properties we saw historically, and so the implications of the term reward hack now are often different from the implications of the term historically. Maybe this suggests expanding the usage of the term to account for the new implications, or maybe it suggests just inventing a new term wholesale.
I suppose the way I see it is that for a lot of tasks, there is something we want the model to do (which I’ll call the goal), and a literal way we evaluate the model’s behavior (which I’ll call the proxy, though we can also use the term specification). In most historical RL training, the goal was not given to the model, it lay in the researcher’s head, and the proxy was the reward signal that the model was trained on. When working with LLMs nowadays, whether it be during RL training or test-time evaluation or when we’re just prompting a model, we try to write down a good description of the goal in our prompt. What the proxy is depends on the setting. In an RL setting it’s the reward signal, and in an explicit evaluation it’s the evaluation function. When prompting, we sometimes explicitly write a description of the proxy in the prompt. In many circumstances it’s undefined, though perhaps the best analogue of the proxy in the general prompt setting is just whether the human thinks the model did what they wanted the model to do.
So now to talk about some examples:
If I give a model a suite of coding tasks, and I evaluate the tasks by checking if running the test file works or not (as in the METR eval), then I would call it specification gaming, and the objective the model is scoring highly by is the results of the evaluation. By objective I am referring to the actual score the model gets on the evaluation, not what the human wants.
In just pure prompt settings where there is no eval going on, I’m more hesitant to use the terms reward hacking or specification gaming because the proxy is unclear. I do sometimes use the term to refer to the type of behavior that would’ve received a high proxy score if I had been running an eval, though this is a bit sloppy
I think a case could be made for saying a model is specification gaming if it regularly produces results that appear to fulfill prompts given to them by users, according to those users, while secretly doing something else, even if no explicit evaluation is going on. (So using the terminology mentioned before, it succeeds as per the proxy “the human thinks the model did what they wanted the model to do” even if it didn’t fulfill the goal given to it in the prompt.) But I do think this is more borderline and I don’t tend to use the term very much in this situation. And if models just lie or edit tests without fooling the user, they definitely aren’t specification gaming even by this definition.
Maybe a term like deceptive instruction following would be better to use here? Although that term also has the problem of ascribing intentionality.
Perhaps you can say that the implicit proxy in many codebases is successfully passing the tests, though that also seems borderline
An intuition pump for why I like to think about things this way. Let’s imagine I build some agentic environment for my friend to use in RL training. I have some goal behavior I want the model to display, and I have some proxy reward signal I’ve created to evaluate the model’s trajectories in this environment. Let’s now say my friend puts the model in the environment and it performs a set of behaviors that get evaluated highly but are undesired according to my goal. It seems weird to me that I won’t know whether or not this counts as specification gaming until I’m told whether or not that reward signal was used to train the model. I think I should just be able to call this behavior specification gaming independent of this. Taking a step further, I also want to be able to call it specification gaming even if I never intended for the evaluation score to be used to train the model.
On an end note, this conversation as well as Rauno Arike’s comment is making me want to use the term specification gaming more often in the future for describing either RL-time or inference-time evaluation hacking and reward hacking more for RL-time hacking. And maybe I’ll also be more likely to use terms like loophole-finding or deceptive instruction following or something else, though these terms have limitations as well, and I’ll need to think more about what makes sense.
So I think what’s going on with o3 isn’t quite standard-issue specification gaming either.
It feels like, when I use it, if I ever accidentally say something which pattern-matches something which would be said in an eval, o3 exhibits the behavior of trying to figure out what metric it could be evaluated by in this context and how to hack that metric. This happens even if the pattern is shallow and we’re clearly not in an eval context,
I’ll try to see if I can get a repro case which doesn’t have confidential info.
There was a recent in-depth post on reward hacking by @Kei (e.g. referencing this) who might have to say more about this question.
Though I also wanted to just add a quick comment about this part:
Now, it’s possible that, during o3’s RL CoT post-training, it got certain questions correct by lying and cheating. If so, that would indeed be reward hacking. But we don’t know if that happened at all.
It is not quite the same, but something that could partly explain lying is if models get the same amount of reward during training, e.g. 0, for a “wrong” solution as they get for saying something like “I don’t know”. Which would then encourage wrong solutions insofar as they at least have a potential of getting reward occasionally when the model gets the expected answer “by accident” (for the wrong reasons). At least something like that seems to be suggested by this:
The most frequent failure mode among human participants is the inability to find a correct solution. Typically, human participants have a clear sense of whether they solved a problem correctly. In contrast, all evaluated [reasoning] LLMs consistently claimed to have solved the problems [even if they didn’t]. This discrepancy poses a significant challenge for mathematical applications of LLMs as mathematical results derived using these models cannot be trusted without rigorous human validation.
I went through and updated my 2022 “Intro to Brain-Like AGI Safety” series. If you already read it, no need to do so again, but in case you’re curious for details, I put changelogs at the bottom of each post. For a shorter summary of major changes, see this twitter thread, which I copy below (without the screenshots & links):
I’ve learned a few things since writing “Intro to Brain-Like AGI safety” in 2022, so I went through and updated it! Each post has a changelog at the bottom if you’re curious. Most changes were in one the following categories: (1/7)
REDISTRICTING! As I previously posted ↓, I booted the pallidum out of the “Learning Subsystem”. Now it’s the cortex, striatum, & cerebellum (defined expansively, including amygdala, hippocampus, lateral septum, etc.) (2/7)
LINKS! I wrote 60 posts since first finishing that series. Many of them elaborate and clarify things I hinted at in the series. So I tried to put in links where they seemed helpful. For example, I now link my “Valence” series in a bunch of places. (3/7)
NEUROSCIENCE! I corrected or deleted a bunch of speculative neuro hypotheses that turned out wrong. In some early cases, I can’t even remember wtf I was ever even thinking! Just for fun, here’s the evolution of one of my main diagrams since 2021: (4/7)
EXAMPLES! It never hurts to have more examples! So I added a few more. I also switched the main running example of Post 13 from “envy” to “drive to be liked / admired”, partly because I’m no longer even sure envy is related to social instincts at all (oops) (5/7)
LLMs! … …Just kidding! LLMania has exploded since 2022 but remains basically irrelevant to this series. I hope this series is enjoyed by some of the six remaining AI researchers on Earth who don’t work on LLMs. (I did mention LLMs in a few more places though ↓ ) (6/7)
If you’ve already read the series, no need to do so again, but I want to keep it up-to-date for new readers. Again, see the changelogs at the bottom of each post for details. I’m sure I missed things (and introduced new errors)—let me know if you see any!
Fun fact: AI-2027 estimates that getting to ASI might take the equivalent of a 100-person team of top human AI research talent working for tens of thousands of years.
I’m curious why ASI would take so much work. What exactly is the R&D labor supposed to be doing each day, that adds up to so much effort? I’m curious how people are thinking about that, if they buy into this kind of picture. Thanks :)
(Calculation details: For example, in October 2027 of the AI-2027 modal scenario, they have “330K superhuman AI researcher copies thinking at 57x human speed”, which is 1.6 million person-years of research in that month alone. And that’s mostly going towards inventing ASI, I think. Did I get that right?)
(My own opinion, stated without justification, is that LLMs are not a paradigm that can scale to ASI, but after some future AI paradigm shift, there will be very very little R&D separating “this type of AI can do anything importantly useful at all” and “full-blown superintelligence”. Like maybe dozens or hundreds of person-years, or whatever, as opposed to millions. More on this in a (hopefully) forthcoming post.)
Whew, a critique that our takeoff should be faster for a change, as opposed to slower.
Fun fact: AI-2027 estimates that getting to ASI might take the equivalent of a 100-person team of top human AI research talent working for tens of thousands of years.
(Calculation details: For example, in October 2027 of the AI-2027 modal scenario, they have “330K superhuman AI researcher copies thinking at 57x human speed”, which is 1.6 million person-years of research in that month alone. And that’s mostly going towards inventing ASI, I think. Did I get that right?)
This depends on how large you think the penalty is for parallelized labor as opposed to serial. If 330k parallel researchers is more like equivalent to 100 researchers at 50x speed than 100 researchers at 3,300x speed, then it’s more like a team of 100 researchers working for (50*57)/12=~250 years.
Also of course to the extent you think compute will be an important input, during October they still just have a month’s worth of total compute even though they’re working for 250-25,000 subjective years.
I’m curious why ASI would take so much work. What exactly is the R&D labor supposed to be doing each day, that adds up to so much effort? I’m curious how people are thinking about that, if they buy into this kind of picture. Thanks :)
I’m imagining that there’s a mix of investing tons of effort into optimizing experimenting ideas, implementing and interpreting every experiment quickly, as well as tons of effort into more conceptual agendas given the compute shortage, some of which bear fruit but also involve lots of “wasted” effort exploring possible routes, and most of which end up needing significant experimentation as well to get working.
(My own opinion, stated without justification, is that LLMs are not a paradigm that can scale to ASI, but after some future AI paradigm shift, there will be very very little R&D separating “this type of AI can do anything importantly useful at all” and “full-blown superintelligence”. Like maybe dozens or hundreds of person-years, or whatever, as opposed to millions. More on this in a (hopefully) forthcoming post.)
I don’t share this intuition regarding the gap between the first importantly useful AI and ASI. If so, that implies extremely fast takeoff, correct? Like on the order of days from AI that can do important things to full-blown superintelligence?
Currently there are hundreds or perhaps low thousands of years of relevant research effort going into frontier AI each year. The gap between importantly useful AI and ASI seems larger than a year of current AI progress (though I’m not >90% confident in that, especially if timelines are <2 years). Then we also need to take into account diminishing returns, compute bottlenecks, and parallelization penalties, so my guess is that the required person-years should be at minimum in the thousands and likely much more. Overall the scenario you’re describing is maybe (roughly) my 95th percentile speed?
I’m curious about your definition for importantly useful AI actually. Under some interpretations I feel like current AI should cross that bar.
I’m uncertain about the LLMs thing but would lean toward pretty large shifts by the time of ASI; I think it’s more likely LLMs scale to superhuman coders than to ASI.
If we divide the inventing-ASI task into (A) “thinking about and writing algorithms” versus (B) “testing algorithms”, in the world of today there’s a clean division of labor where the humans do (A) and the computers do (B). But in your imagined October 2027 world, there’s fungibility between how much compute is being used on (A) versus (B). I guess I should interpret your “330K superhuman AI researcher copies thinking at 57x human speed” as what would happen if the compute hypothetically all went towards (A), none towards (B)? And really there’s gonna be some division of compute between (A) and (B), such that the amount of (A) is less than I claimed? …Or how are you thinking about that?
I’m curious about your definition for importantly useful AI actually. Under some interpretations I feel like current AI should cross that bar.
Right, but I’m positing a discontinuity between current AI and the next paradigm, and I was talking about the gap between when AI-of-that-next-paradigm is importantly useful versus when it’s ASI. For example, AI-of-that-next-paradigm might arguably already exist today but where it’s missing key pieces such that it barely works on toy models in obscure arxiv papers. Or here’s a more concrete example: Take the “RL agent” line of AI research (AlphaZero, MuZero, stuff like that), which is quite different from LLMs (e.g. “training environment” rather than “training data”, and there’s nothing quite like self-supervised pretraining (see here)). This line of research has led to great results on board games and videogames, but it’s more-or-less economically useless, and certainly useless for alignment research, societal resilience, capabilities research, etc. If it turns out that this line of research is actually much closer to how future ASI will work at a nuts-and-bolts level than LLMs are (for the sake of argument), then we have not yet crossed the “AI-of-that-next-paradigm is importantly useful” threshold in my sense.
If it helps, here’s a draft paragraph from that (hopefully) forthcoming post:
Another possible counter-argument from a prosaic-AGI person would be: “Maybe this future paradigm exists, but LLM agents will find it, not humans, so this is really part of that ‘AIs-doing-AI-R&D’ story like I’ve been saying”. I have two responses. First, I disagree with that prediction. Granted, probably LLMs will be a helpful research tool involved in finding the new paradigm, but there have always been helpful research tools, from PyTorch to arXiv to IDEs, and I don’t expect LLMs to be fundamentally different from those other helpful research tools. Second, even if it’s true that LLMs will discover the new paradigm by themselves (or almost by themselves), I’m just not sure I even care. I see the pre-paradigm-shift AI world as a lesser problem, one that LLM-focused AI alignment researchers (i.e. the vast majority of them) are already focusing on. Good luck to them. And I want to talk about what happens in the strange new world that we enter after that paradigm shift.
Next:
If so, that implies extremely fast takeoff, correct? Like on the order of days from AI that can do important things to full-blown superintelligence?
Well, even if you have an ML training plan that will yield ASI, you still need to run it, which isn’t instantaneous. I dunno, it’s something I’m still puzzling over.
…But yeah, many of my views are pretty retro, like a time capsule from like AI alignment discourse of 2009. ¯\_(ツ)_/¯
That does raise my eyebrows a bit, but also, note that we currently have hundreds of top-level researchers at AGI labs tirelessly working day in and day out, and that all that activity results in a… fairly leisurely pace of progress, actually.[1]
Recall that what they’re doing there is blind atheoretical empirical tinkering (tons of parallel experiments most of which are dead ends/eke out scant few bits of useful information). If you take that research paradigm and ramp it up to superhuman levels (without changing the fundamental nature of the work), maybe it really would take this many researcher-years.
And if AI R&D automation is actually achieved on the back of sleepwalking LLMs, that scenario does seem plausible. These superhuman AI researchers wouldn’t actually be generally superhuman researchers, just superhuman at all the tasks in the blind-empirical-tinkering research paradigm. Which has steeply declining returns to more intelligence added.
That said, yeah, if LLMs actually scale to a “lucid” AGI, capable of pivoting to paradigms with better capability returns on intelligent work invested, I expect it to take dramatically less time.
It’s fast if you use past AI progress as the reference class, but is decidedly not fast if you try to estimate “absolute” progress. Like, this isn’t happening, we’ve jumped to near human-baseline and slowed to a crawl at this level. If we assume the human level is the ground and we’re trying to reach the Sun, it in fact might take millennia at this pace.
we’ve jumped to near human-baseline and slowed to a crawl at this level
A possible reason for that might be the fallibility of our benchmarks. It might be the case that for complex tasks, it’s hard for humans to see farther than their nose.
The short version is getting compute-optimal experiments to self-improve yourself, training to do tasks that unavoidably take a really long time to learn/get data on because of real-world experimentation being necessary, combined with a potential hardware bottleneck on robotics that also requires real-life experimentation to overcome.
Another point is that to the extent you buy the scaling hypothesis at all, then compute bottlenecks will start to bite, and given that researchers will seek small constant improvements they don’t generalize, and this can start a cascade of wrong decisions that could take a very long time to get out of.
(My own opinion, stated without justification, is that LLMs are not a paradigm that can scale to ASI, but after some future AI paradigm shift, there will be very very little R&D separating “this type of AI can do anything importantly useful at all” and “full-blown superintelligence”. Like maybe dozens or hundreds of person-years, or whatever, as opposed to millions. More on this in a (hopefully) forthcoming post.)
I’d like to see that post, and I’d like to see your arguments on why it’s so easy for intelligence to be increased so fast, conditional on a new paradigm shift.
(For what it’s worth, I personally think LLMs might not be the last paradigm, because of their current lack of continuous learning/neuroplasticity plus no long term memory/state, but I don’t expect future paradigms to have an AlphaZero like trajectory curve, where things go from zero to wildly superhuman in days/weeks, though I do think takeoff is faster if we condition on a new paradigm being required for ASI, so I do see the AGI transition to plausibly include having only months until we get superintelligence, and maybe only 1-2 years before superintelligence starts having very, very large physical impacts through robotics, assuming that new paradigms are developed, so I’m closer to hundreds of person years/thousands of person years than dozens of person years).
The world is complicated (see: I, Pencil). You can be superhuman by only being excellent at a few fields, for example politics, persuasion, military, hacking. That still leaves you potentially vulnerable, even if your opponents are unlikely to succeed; or you could hurt yourself by your ignorance in some field. Or you can be superhuman in the sense of being able to make the pencil from scratch, only better at each step. That would probably take more time.
Are you suggesting that e.g. “R&D Person-Years 463205–463283 go towards ensuring that the AI has mastery of metallurgy, and R&D Person-Years 463283–463307 go towards ensuring that the AI has mastery of injection-molding machinery, and …”?
If no, then I don’t understand what “the world is complicated” has to do with “it takes a million person-years of R&D to build ASI”. Can you explain?
…Or if yes, that kind of picture seems to contradict the facts that:
This seems quite disanalogous to how LLMs are designed today (i.e., LLMs can already answer any textbook question about injection-molding machinery, but no human doing LLM R&D has ever worked specifically on LLM knowledge of injection-molding machinery),
This seems quite disanalogous to how the human brain was designed (i.e., humans are human-level at injection-molding machinery knowledge and operation, but Evolution designed human brains for the African Savannah, which lacked any injection-molding machinery).
LLMs quickly acquired the capacity to read what humans wrote and paraphrase it. It is not obvious to me (though that may speak more about my ignorance) that it will be similarly easy to acquire deep understanding of everything.
Incidentally, is there any meaningful sense in which we can say how many “person-years of thought” LLMs have already done?
We know they can do things in seconds that would take a human minutes. Does that mean those real-time seconds count as “human-minutes” of thought? Etc.
I’m intrigued by the reports (including but not limited to the Martin 2020 “PNSE” paper) that people can “become enlightened” and have a radically different sense of self, agency, etc.; but friends and family don’t notice them behaving radically differently, or even differently at all. I’m trying to find sources on whether this is true, and if so, what’s the deal. I’m especially interested in behaviors that (naïvely) seem to centrally involve one’s self-image, such as “applying willpower” or “wanting to impress someone”. Specifically, if there’s a person whose sense-of-self has dissolved / merged into the universe / whatever, and they nevertheless enact behaviors that onlookers would conventionally put into one of those two categories, then how would that person describe / conceptualize those behaviors and why they occurred? (Or would they deny the premise that they are still exhibiting those behaviors?) Interested in any references or thoughts, or email / DM me if you prefer. Thanks in advance!
(Edited to add: Ideally someone would reply: “Yeah I have no sense of self, and also I regularly do things that onlookers describe as ‘applying willpower’ and/or ‘trying to impress someone’. And when that happens, I notice the following sequence of thoughts arising: [insert detailed description]”.)
I’m not very comfortable with the term enlightened but I’ve been on retreats teaching non-dual meditation, received ‘pointing out instructions’ in the Mahamudra tradition and have experienced some bizarre states of mind where it seemed to make complete sense to think of a sense of awake awareness as being the ground thing that was being experienced spontaneously, with sensations, thoughts and emotions appearing to it — rather than there being a separate me distinct from awareness that was experiencing things ‘using my awareness’, which is how it had always felt before.
When I have (or rather awareness itself has) experienced clear and stable non-dual states the normal ‘self’ stuff still appears in awareness and behaves fairly normally (e.g there’s hunger, thoughts about making dinner, impulses to move the body, the body moving around the room making dinner…). Being in that non dual state seemed to add a very pleasant quality of effortlessness and okayness to the mix but beyond that it wasn’t radically changing what the ‘small self’ in awareness was doing.
If later the thought “I want to eat a second portion of ice cream” came up followed by “I should apply some self control. I better not do that.” they would just be things appearing to awareness.
Of course another thing in awareness is the sense that awareness is aware of itself and the fact that everything feels funky and non-dual at the moment. You’d think that might change the chain of thoughts about the ‘small self’ wanting ice cream and then having to apply self control towards itself.
In fact the first few times I had intense non-dual experiences there was a chain of thoughts that went “what the hell is going on? I’m not sure I like this? What if I can’t get back into the normal dualistic state of mind?” followed by some panicked feelings and then the non-dual state quickly collapsing into a normal dualistic state.
With more practice, doing other forms of meditation to build a stronger base of calmness and self-compassion, I was able to experience the non-dual state and the chain of thoughts that appeared would go more like “This time let’s just stick with it a bit longer. Basically no one has a persistent non-dual experience that lasts forever. It will collapse eventually whether you like it or not. Nothing much has really changed about the contents of awareness. It’s the same stuff just from a different perspective. I’m still obviously able to feel calmness and joyfulness, I’m still able to take actions that keep me safe — so it’s fine to hang out here”. And then thoughts eventually wander around to ice cream or whatever. And, again, all this is just stuff appearing within a single unified awake sense of awareness that’s being labelled as the experiencer (rather than the ‘I’ in the thoughts above being the experiencer).
The fact that thoughts referencing the self are appearing in awareness whilst it’s awareness itself that feels like the experiencer doesn’t seem to create as many contradictions as you would expect. I presume that’s partly because awareness itself, is able to be aware of its own contents but not do much else. It doesn’t for example make decisions or have a sense of free will like the normal dualistic self. Those again would just be more appearances in awareness.
However it’s obvious that awareness being spontaneously aware of itself does change things in important and indirect ways. It does change the sequences of thoughts somehow and the overall feeling tone — and therefore behaviour. But perhaps in less radical ways than you would expect. For me, at different times, this ranged from causing a mini panic attack that collapsed the non-dual state (obviously would have been visible from the outside) to subtly imbuing everything with nice effortlessness vibes and taking the sting out of suffering type experiences but not changing my thought chains and behaviour enough to be noticeable from the outside to someone else.
Disclaimer: I felt unsure at several points writing this and I’m still quite new to non-dual experiences. I can’t reliably generate a clear non-dual state on command, it’s rather hit and miss. What I wrote above is written from a fairly dualistic state relying on memories of previous experiences a few days ago. And it’s possible that the non-dual experience I’m describing here is still rather shallow and missing important insights versus what very accomplished meditators experience.
I won’t claim that I’m constantly in a self of non-self, but as I’m writing this, I don’t really feel that I’m locally existing in my body. I’m rather the awareness of everything that continuously arises in consciousness.
This doesn’t happen all the time, I won’t claim to be enlightened or anything but maybe this n=1 self-report can help?
Even from this state of awareness, there’s still a will to do something. It is almost like you’re a force of nature moving forward with doing what you were doing before you were in a state of presence awareness. It isn’t you and at the same time it is you. Words are honestly quite insufficient to describe the experience, but If I try to conceptualise it, I’m the universe moving forward by itself. In a state of non-duality, the taste is often very much the same no matter what experience is arising.
There are some times when I’m not fully in a state of non-dual awareness when it can feel like “I” am pretending to do things. At the same time it also kind of feels like using a tool? The underlying motivation for action changes to something like acceptance or helpfulness, and in order to achieve that, there’s this tool of the self that you can apply.
I’m noticing it is quite hard to introspect and try to write from a state of presence awareness at the same time but hopefully it was somewhat helpful?
Could you give me some experiments to try from a state of awareness? I would be happy to try them out and come back.
Extra (relation to some of the ideas):
In the Mahayana wisdom tradition, explored in Rob Burbea’s Seeing That Frees, there’s this idea of emptiness, which is very related to the idea of non-dual perception. For all you see is arising from your own constricted view of experience, and so it is all arising in your own head. Realising this co-creation can enable a freedom of interpretation of your experiences.
Yet this view is also arising in your mind, and so you have “emptiness of emptiness,” meaning that you’re left without a basis. Therefore, both non-self and self are false but magnificent ways of looking at the world. Some people believe that the non-dual is better than the dual yet as my Thai Forest tradition guru Ajhan Buddhisaro says, “Don’t poopoo the mind.” The self boundary can be both a restricting and very useful concept, it is just very nice to have the skill to see past it and go back to the state of now, of presence awareness.
Emptiness is a bit like deeply seeing that our beliefs are built up from different axioms and being able to say that the axioms of reality aren’t based on anything but probabilistic beliefs. Or seeing that we have Occam’s razor because we have seen it work before, yet that it is fundamentally completely arbitrary and that the world just is arising spontaneously from moment to moment. Yet Occam’s razor is very useful for making claims about the world.
I’m not sure if that connection makes sense, but hopefully, that gives a better understanding of the non-dual understanding of the self and non-self. (At least the Thai Forest one)
Many helpful replies! Here’s where I’m at right now (feel free to push back!) [I’m coming from an atheist-physicalist perspective; this will bounce off everyone else.]
Hypothesis:
Normies like me (Steve) have an intuitive mental concept “Steve” which is simultaneously BOTH (A) Steve-the-human-body-etc AND (B) Steve-the-soul / consciousness / wellspring of vitalistic force / what Dan Dennett calls a “homunculus” / whatever.
The (A) & (B) “Steve” concepts are the same concept in normies like me, or at least deeply tangled together. So it’s hard to entertain the possibility of them coming apart, or to think through the consequences if they do.
Some people can get into a Mental State S (call it a form of “enlightenment”, or pick your favorite terminology) where their intuitive concept-space around (B) radically changes—it broadens, or disappears, or whatever. But for them, the (A) mental concept still exists and indeed doesn’t change much.
Anyway, people often have thoughts that connect sense-of-self to motivation, like “not wanting to be embarrassed” or “wanting to keep my promises”. My central claim that the relevant sense-of-self involved in that motivation is (A), not (B).
If we conflate (A) & (B)—as normies like me are intuitively inclined to do—then we get the intuition that a radical change in (B) must have radical impacts on behavior. But that’s wrong—the (A) concept is still there and largely unchanged even in Mental State S, and it’s (A), not (B), that plays a role in those behaviorally-important everyday thoughts like “not wanting to be embarrassed” or “wanting to keep my promises”. So radical changes in (B) would not (directly) have the radical behavioral effects that one might intuitively expect (although it does of course have more than zero behavioral effect, with self-reports being an obvious example).
Some meditators say that before you can get a good sense of non-self you first have to have good self-confidence. I think I would tend to agree with them as it is about how you generally act in the world and what consequences your actions will have. Without this the support for the type B that you’re talking about can be very hard to come by.
Otherwise I do really agree with what you say in this comment.
There is a slight disagreement with the elaboration though, I do not actually think that makes sense. I would rather say that the (A) that you’re talking about is more of a software construct than it is a hardware construct. When you meditate a lot, you realise this and get access to the full OS instead of just the specific software or OS emulator. A is then an evolutionary beneficial algorithm that runs a bit out of control (for example during childhood when we attribute all cause and effect to our “selves”).
Meditation allows us to see that what we have previously attributed to the self was flimsy and dependent on us believing that the hypothesis of the self is true.
My experience is different from the two you describe. I typically fully lack (A)[1], and partially lack (B). I think this is something different from what others might describe as ‘enlightenment’.
I might write more about this if anyone is interested.
Normies like me have an intuitive mental concept “me” which is simultaneously BOTH (A) me-the-human-body-etc AND (B) me-the-soul / consciousness / wellspring of vitalistic force / what Dan Dennett calls a “homunculus” / whatever.
to:
Normies like me (Steve) have an intuitive mental concept “Steve” which is simultaneously BOTH (A) Steve-the-human-body-etc AND (B) Steve-the-soul / consciousness / wellspring of vitalistic force / what Dan Dennett calls a “homunculus” / whatever.
I think that’s closer to what I was trying to get across. Does that edit change anything in your response?
At least the ‘me-the-human-body’ part of the concept. I don’t know what the ‘-etc’ part refers to.
The “etc” would include things like the tendency for fingers to reactively withdraw from touching a hot surface.
Elaborating a bit: In my own (physicalist, illusionist) ontology, there’s a body with a nervous system including the brain, and the whole mental world including consciousness / awareness is inextricably part of that package. But in other people’s ontology, as I understand it, some nervous system activities / properties (e.g. a finger reactively withdrawing from pain, maybe some or all other desires and aversions) gets lumped in with the body, whereas other [things that I happen to believe are] nervous system activities / properties (e.g. awareness) gets peeled off into (B). So I said “etc” to include all the former stuff. Hopefully that’s clear.
(I’m trying hard not to get sidetracked into an argument about the true nature of consciousness—I’m stating my ontology without defending it.)
I think that’s closer to what I was trying to get across. Does that edit change anything in your response?
No.
Overall, I would say that my self-concept is closer to what a physicalist ontology implies is mundanely happening—a neural network, lacking a singular ‘self’ entity inside it, receiving sense data from sensors and able to output commands to this strange, alien vessel (body). (And also I only identify myself with some parts of the non-mechanistic-level description of what the neural network is doing).
I write in a lot more detail below. This isn’t necessarily written at you in particular, or with the expectation of you reading through all of it.
1. Non-belief in self-as-body (A)
I see two kinds of self-as-body belief. The first is looking in a mirror, or at a photo, and thinking, “that [body] is me.” The second is controlling the body, and having a sense that you’re the one moving it, or more strongly, that it is moving because it is you (and you are choosing to move).
I’ll write about my experiences with the second kind first.
The way a finger automatically withdraws from heat does not feel like a part of me in any sense. Yesterday, I accidentally dropped a utensil and my hands automatically snapped into place around it somehow, and I thought something like, “woah, I didn’t intend to do that. I guess it’s a highly optimized narrow heuristic, from times where reacting so quickly was helpful to survival”.
I experimented a bit between writing this, and I noticed one intuitive view I can have of the body is that it’s some kind of machine that automatically follows such simple intents about the physical world (including intents that I don’t consider ‘me’, like high fear of spiders). For example, if I have motivation and intent to open a window, then the body just automatically moves to it and opens it without me really noticing that the body itself (or more precisely, the body plus the non-me nervous/neural structure controlling it) is the thing doing that—it’s kind of like I’m a ghost (or abstract mind) with telekinesis powers (over nearby objects), but then we apply reductive physics and find that actually there’s a causal chain beneath the telekinesis involving a moving body (which I always know and can see, I just don’t usually think about it).
The way my hands are moving on the keyboard as I write this also doesn’t particularly feel like it’s me doing that; in my mind, I’m just willing the text to be written, and then the movement happens on its own, in a way that feels kind of alien if I actually focus on it (as if the hands are their own life form).
That said, this isn’t always true. I do have an ‘embodied self-sense’ sometimes. For example, I usually fall asleep cuddling stuffies because this makes me happy. At least some purposeful form of sense-of-embodiment seems present there, because the concept of cuddling has embodiment as an assumption.[1]
(As I read over the above, I wonder how different it really is from normal human experience. I’m guessing there’s a subtle difference between “being so embodied it becomes a basic implicit assumption that you don’t notice” and “being so nonembodied that noticing it feels like [reductive physics metaphor]”)
As for the first kind mentioned of locating oneself in the body’s appearance, which informs typical humans perception of others and themself—I don’t experience this with regard to myself (and try to avoid being biased about others this way), instead I just feel pretty dissociated when I see my body reflected and mostly ignore it.
In the past, it instead felt actively stressful/impossible/horrifying, because I had (and to an extent still do have) a deep intuition that I am already a ‘particular kind of being’, and, under the self-as-body ontology, this is expected to correspond to a particular kind of body, one which I did not observe reflected back. As this basic sense-of-self violation happened repeatedly, it gradually eroded away this aspect of sense-of-self / the embodied ontology.
I’d also feel alienated if I had to pilot an adult body to interact with others, so I’ve set up my life such that I only minimally need to do that (e.g for doctors appointments) and can otherwise just interact with the world through text.
2. What parts of the mind-brain are me, and what am I? (B)
I think there’s an extent to which I self-model as an ‘inner homunculus’, or a ‘singular-self inside’. I think it’s lesser and not as robust in me as it is in typical humans, though. For example, when I reflect on this word ‘I’ that I keep using, I notice it has a meaning that doesn’t feel very true of me: the meaning of a singular, unified entity, rather than multiple inner cognitive processes, or no self in particular.
I often notice my thoughts are coming from different parts of the mind. In one case, I was feeling bad about not having been productive enough in learning/generating insights and I thought to myself, “I need to do better”, and then felt aware that it was just one lone part thinking this while the rest doesn’t feel moved; the rest instead culminates into a different inner-monologue-thought: something like, “but we always need to do better. tsuyoku naratai is a universal impetus.” (to be clear, this is not from a different identity or character, but from different neural processes causally prior to what is thought (or written).)
And when I’m writing (which forces us to ‘collapse’ our subverbal understanding into one text), it’s noticeable how much a potential statement is endorsed by different present influences[2].
I tend to use words like ‘I’ and ‘me’ in writing to not confuse others (internally, ‘we’ can feel more fitting, referring again to multiple inner processes[2], and not to multiple high-level selves as some humans experience. ‘we’ is often naturally present in our inner monologue). We’ll use this language for most of the rest of the text[3].
There are times where this is less true. Our mind can return to acting as a human-singular-identity-player in some contexts. For example, if we’re interacting with someone or multiple others, that can push us towards performing a ‘self’ (but unless it’s someone we intuitively-trust and relatively private, we tend to feel alienated/stressed from this). Or if we’re, for example, playing a game with a friend, then in those moments we’ll probably be drawn back into a more childlike humanistic self-ontology rather than the dissociated posthumanism we describe here.
Also, we want to answer “what inner processes?”—there’s some division between parts of the mind-brain we refer to here, and parts that are the ‘structure’ we’re embedded in. We’re not quite sure how to write down the line, and it might be fuzzy or e.g contextual.[4]
3. Tracing the intuitive-ontology shift
“Why are you this way, and have you always been this way?” – We haven’t always. We think this is the result of a gradual erosion of the ‘default’ human ontology, mentioned once above.
We think this mostly did not come from something like ‘believing in physicalism’. Most physicalists aren’t like this. Ontological crises may have been part of it, though—independently synthesizing determinism as a child and realizing it made naive free will impossible sure did make past-child-quila depressed.
We think the strongest sources came from ‘intuitive-ontological’[5] incompatibilities, ways the observations seemed to sadly-contradict the platonicself-ontology we started with. Another term for these would be ‘survival updates’. This can also include ways one’s starting ontology was inadequate to explain certain important observations.
Also, I think that existing so often in a digital-informational context[6], and only infrequently in an analog/physical context, also contributed to eroding the self-as-body belief.
Also, eventually, it wasn’t just erosion/survival updates; at some point, I think I slowly started to embrace this posthumanist ontology, too. It feels narratively fitting that I’m now thinking about artificial intelligence and reading LessWrong.
(There is some sense in which maybe, my proclaimed ontology has its source in constant dissociation, which I only don’t experience when feeling especially comfortable/safe. I’m only speculating, though—this is the kind of thing that I’d consider leaving out, since I’m really unsure about it, it’s at the level of just one of many passing thoughts I’d consider.)
This ‘inner proccesses’ phrasing I keep using doesn’t feel quite right. Other words that come to mind: considerations? currently-active neural subnetworks? subagents? some kind of neural council metaphor?
(sometimes ‘we’ feels unfitting too, it’s weird, maybe ‘I’ is for when a self is being more-performed, or when text is less representative of the whole, hard to say)
We tried to point to some rough differences, but realized that the level we mean is somewhere between high-level concepts with words (like ‘general/narrow cognition’ and ‘altruism’ and ‘biases’) and the lowest-level description (i.e how actual neurons are interacting physically), and that we don’t know how to write about this.
We can differentiate between an endorsed ‘whole-world ontology’ like physicalism, and smaller-scale intuitive ontologies that are more like intuitive frames we seem to believe in, even if when asked we’ll say they’re not fundamental truths.
The intuitive ontology of the self is particularly central to humans.
Note this was mostly downstream of other factors, not causally prior to them. I don’t want anyone to read this and think internet use itself causes body-self incongruence, though it might avoid certain related feedback loops.
Some ultra-short book reviews on cognitive neuroscience
On Intelligence by Jeff Hawkins & Sandra Blakeslee (2004)—very good. Focused on the neocortex—thalamus—hippocampus system, how it’s arranged, what computations it’s doing, what’s the relation between the hippocampus and neocortex, etc. More on Jeff Hawkins’s more recent work here.
I am a strange loop by Hofstadter (2007)—I dunno, I didn’t feel like I got very much out of it, although it’s possible that I had already internalized some of the ideas from other sources. I mostly agreed with what he said. I probably got more out of watching Hofstadter give a little lecture on analogical reasoning (example) than from this whole book.
Consciousness and the brain by Dehaene (2014)—very good. Maybe I could have saved time by just reading Kaj’s review, there wasn’t that much more to the book beyond that.
Conscience by Patricia Churchland (2019)—I hated it. I forget whether I thought it was vague / vacuous, or actually wrong. Apparently I have already blocked the memory!
How to Create a Mind by Kurzweil (2014)—Parts of it were redundant with On Intelligence (which I had read earlier), but still worthwhile. His ideas about how brain-computer interfaces are supposed to work (in the context of cortical algorithms) are intriguing; I’m not convinced, hoping to think about it more.
Rethinking Consciousness by Graziano (2019)—A+, see my review here
The Accidental Mind by Linden (2008)—Lots of fun facts. The conceit / premise (that the brain is a kludgy accident of evolution) is kinda dumb and overdone—and I disagree with some of the surrounding discussion—but that’s not really a big part of the book, just an excuse to talk about lots of fun neuroscience.
The Myth of Mirror Neurons by Hickok (2014)—A+, lots of insight about how cognition works, especially the latter half of the book. Prepare to skim some sections of endlessly beating a dead horse (as he dubunks seemingly endless lists of bad arguments in favor of some aspect of mirror neurons). As a bonus, you get treated to an eloquent argument for the “intense world” theory of autism, and some aspects of predictive coding.
Surfing Uncertainty by Clark (2015)—I liked it. See also SSC review. I think there’s still work to do in fleshing out exactly how these types of algorithms work; it’s too easy to mix things up and oversimplify when just describing things qualitatively (see my feeble attempt here, which I only claim is a small step in the right direction).
Rethinking innateness by Jeffrey Elman, Annette Karmiloff-Smith, Elizabeth Bates, Mark Johnson, Domenico Parisi, and Kim Plunkett (1996)—I liked it. Reading Steven Pinker, you get the idea that connectionists were a bunch of morons who thought that the brain was just a simple feedforward neural net. This book provides a much richer picture.
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
For some reason it took me until now to notice that:
(I’ve been regularly using all four terms for years … I just hadn’t explicitly considered how they related to each other, I guess!)
I updated that post to note the correspondence, but also wanted to signal-boost this, in case other people missed it too.
~~
[You can stop reading here—the rest is less important]
If everybody agrees with that part, there’s a further question of “…now what?”. What terminology should I use going forward? If we have redundant terminology, should we try to settle on one?
One obvious option is that I could just stop using the terms “inner alignment” and “outer alignment” in the actor-critic RL context as above. I could even go back and edit them out of that post, in favor of “specification gaming” and “goal misgeneralization”. Or I could leave it. Or I could even advocate that other people switch in the opposite direction!
One consideration is: Pretty much everyone using the terms “inner alignment” and “outer alignment” are not using them in quite the way I am—I’m using them in the actor-critic model-based RL context, they’re almost always using them in the model-free policy optimization context (e.g. evolution) (see §10.2.2). So that’s a cause for confusion, and point in favor of my dropping those terms. On the other hand, I think people using the term “goal misgeneralization” are also almost always using them in a model-free policy optimization context. So actually, maybe that’s a wash? Either way, my usage is not a perfect match to how other people are using the terms, just pretty close in spirit. I’m usually the only one on Earth talking explicitly about actor-critic model-based RL AGI safety, so I kinda have no choice but to stretch existing terms sometimes.
Hmm, aesthetically, I think I prefer the “outer alignment” and “inner alignment” terminology that I’ve traditionally used. I think it’s a better mental picture. But in the context of current broader usage in the field … I’m not sure what’s best.
(Nate Soares dislikes the term “misgeneralization”, on the grounds that “misgeneralization” has a misleading connotation that “the AI is making a mistake by its own lights”, rather than “something is bad by the lights of the programmer”. I’ve noticed a few people trying to get the variation “goal malgeneralization” to catch on instead. That does seem like an improvement, maybe I’ll start doing that too.)
Note: I just noticed your post has a section “Manipulating itself and its learning process”, which I must’ve completely forgotten since I last read the post. I should’ve read your post before posting this. Will do so.
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
Calling problems “outer” and “inner” alignment seems to suggest that if we solved both we’ve successfully aligned AI to do nice things. However, this isn’t really the case here.
Namely, there could be a smart mesa-optimizer spinning up in the thought generator, who’s thoughts are mostly invisible to the learned value function (LVF), and who can model the situation it is in and has different values and is smarter than the LVF evaluation and can fool the the LVF into believing the plans that are good according to the mesa-optimizer are great according to the LVF, even if they actually aren’t.
This kills you even if we have a nice ground-truth reward and the LVF accurately captures that.
In fact, this may be quite a likely failure mode, given that the thought generator is where the actual capability comes from, and we don’t understand how it works.
In my view, the big problem with model-based actor-critic RL AGI, the one that I spend all my time working on, is that it tries to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the actor acts, and the critic criticizes, and the world-model models the world …and the end-result is that the system makes and executes a plan to kill us. I consider that the obvious, central type of alignment failure mode for model-based RL AGI, and it remains an unsolved problem.
I think (??) you’re bringing up a different and more exotic failure mode where the world-model by itself is secretly harboring a full-fledged planning agent. I think this is unlikely to happen. One way to think about it is: if the world-model is specifically designed by the programmers to be a world-model in the context of an explicit model-based RL framework, then it will probably be designed in such a way that it’s an effective search over plausible world-models, but not an effective search over a much wider space of arbitrary computer programs that includes self-contained planning agents. See also §3 here for why a search over arbitrary computer programs would be a spectacularly inefficient way to build all that agent stuff (TD learning in the critic, roll-outs in the planner, replay, whatever) compared to what the programmers will have already explicitly built into the RL agent architecture.
So I think this kind of thing (the world-model by itself spawning a full-fledged planning agent capable of treacherous turns etc.) is unlikely to happen in the first place. And even if it happens, I think the problem is easily mitigated; see discussion in Thoughts on safety in predictive learning. (Or sorry if I’m misunderstanding.)
Yeah I guess I wasn’t thinking concretely enough. I don’t know whether something vaguely like what I described might be likely or not. Let me think out loud a bit about how I think about what you might be imagining so you can correct my model. So here’s a bit of rambling: (I think point 6 is most important.)
As you described in you intuitive self-models sequence, humans have a self-model which can essentially have values different from the main value function, aka they can have ego-dystonic desires.
I think in smart reflective humans, the policy suggestions of the self-model/homunculus can be more coherent than the value function estimates, e.g. because they can better take abstract philosophical arguments into account.
The learned value function can also update on hypothetical scenarios, e.g. imagining a risk or a gain, but it doesn’t update strongly on abstract arguments like “I should correct my estimates based on outside view”.
The learned value function can learn to trust the self-model if acting according to the self-model is consistently correlated with higher-than-expected reward.
Say we have a smart reflective human where the value function basically trusts the self-model a lot, then the self-model could start optimizing its own values, while the (stupid) value function believes it’s best to just trust the self-model and that this will likely lead to reward. Something like this could happen where the value function was actually aligned to outer reward, but the inner suggestor was just very good at making suggestions that the value function likes, even if the inner suggestor would have different actual values. I guess if the self-model suggests something that actually leads to less reward, then the value function will trust the self-model less, but outside the training distribution the self-model could essentially do what it wants.
Another question of course is whether the inner self-reflective optimizers are likely aligned to the initial value function. I would need to think about it. Do you see this as a part of the inner alignment problem or as a separate problem?
As an aside, one question would be whether the way this human makes decisions is still essentially actor-critic model-based RL like—whether the critic just got replaced through a more competent version. I don’t really know.
(Of course, I totally ackgnowledge that humans have pre-wired machinery for their intuitive self-models, rather than that just spawning up. I’m not particularly discussing my original objection anymore.)
I’m also uncertain whether something working through the main actor-critic model-based RL mechanism would be capable enough to do something pivotal. Like yeah, most and maybe all current humans probably work that way. But if you go a bit smarter then minds might use more advanced techniques of e.g. translating problems into abstract domains and writing narrow AIs to solve them there and then translating it back into concrete proposals or sth. Though maybe it doesn’t matter as long as the more advanced techniques don’t spawn up more powerful unaligned minds, in which case a smart mind would probably not use the technique in the first place. And I guess actor-critic model-based RL is sorta like expected utility maximization, which is pretty general and can get you far. Only the native kind of EU maximization we implement through actor-critic model-based RL might be very inefficient compared through other kinds.
I have a heuristic like “look at where the main capability comes from”, and I’d guess for very smart agents it perhaps doesn’t come from the value function making really good estimates by itself, and I want to understand how something could be very capable and look at the key parts for this and whether they might be dangerous.
Ignoring human self-models now, the way I imagine actor-critic model-based RL is that it would start out unreflective. It might eventually learn to model parts of itself and form beliefs about its own values. Then, the world-modelling machinery might be better at noticing inconsistencies in the behavior and value estimates of that agent than the agent itself. The value function might then learn to trust the world-model’s predictions about what would be in the interest of the agent/self.
This seems to me to sorta qualify as “there’s an inner optimizer”. I would’ve tentitatively predicted you to say like “yep but it’s an inner aligned optimizer”, but not sure if you actually think this or whether you disagree with my reasoning here. (I would need to consider how likely value drift from such a change seems. I don’t know yet.)
I don’t have a clear take here. I’m just curious if you have some thoughts on where something importantly mismatches your model.
Thanks! Basically everything you wrote importantly mismatches my model :( I think I can kinda translate parts; maybe that will be helpful.
Background (§8.4.2): The thought generator settles on a thought, then the value function assigns a “valence guess”, and the brainstem declares an actual valence, either by copying the valence guess (“defer-to-predictor mode”), or overriding it (because there’s meanwhile some other source of ground truth, like I just stubbed my toe).
Sometimes thoughts are self-reflective. E.g. “the idea of myself lying in bed” is a different thought from “the feel of the pillow on my head”. The former is self-reflective—it has me in the frame—the latter is not (let’s assume).
All thoughts can be positive or negative valence (motivating or demotivating). So self-reflective thoughts can be positive or negative valence, and non-self-reflective thoughts can also be positive or negative valence. Doesn’t matter, it’s always the same machinery, the same value function / valence guess / thought assessor. That one function can evaluate both self-reflective and non-self-reflective thoughts, just as it can evaluate both sweater-related thoughts and cloud-related thoughts.
When something seems good (positive valence) in a self-reflective frame, that’s called ego-syntonic, and when something seems bad in a self-reflective frame, that’s called ego-dystonic.
Now let’s go through what you wrote:
1. humans have a self-model which can essentially have values different from the main value function
I would translate that into: “it’s possible for something to seem good (positive valence) in a self-reflective frame, but seem bad in a non-self-reflective frame. Or vice-versa.” After all, those are two different thoughts, so yeah of course they can have two different valences.
2. the policy suggestions of the self-model/homunculus can be more coherent than the value function estimates
I would translate that into: “there’s a decent amount of coherence / self-consistency in the set of thoughts that seem good or bad in a self-reflective frame, and there’s less coherence / self-consistency in the set of things that seem good or bad in a non-self-reflective frame”.
(And there’s a logical reason for that; namely, that hard thinking and brainstorming tends to bring self-reflective thoughts to mind — §8.5.5 — and hard thinking and brainstorming is involved in reducing inconsistency between different desires.)
3. The learned value function can learn to trust the self-model if acting according to the self-model is consistently correlated with higher-than-expected reward.
This one is more foreign to me. A self-reflective thought can have positive or negative valence for the same reasons that any other thought can have positive or negative valence—because of immediate rewards, and because of the past history of rewards, via TD learning, etc.
One thing is: someone can develop a learned metacognitive habit to the effect of “think self-reflective thoughts more often” (which is kinda synonymous with “don’t be so impulsive”). They would learn this habit exactly to the extent and in the circumstances that it has led to higher reward / positive valence in the past.
4. Say we have a smart reflective human where the value function basically trusts the self-model a lot, then the self-model could start optimizing its own values, while the (stupid) value function believes it’s best to just trust the self-model and that this will likely lead to reward.
If someone gets in the habit of “think self-reflective thoughts all the time” a.k.a. “don’t be so impulsive”, then their behavior will be especially strongly determined by which self-reflective thoughts are positive or negative valence.
But “which self-reflective thoughts are positive or negative valence” is still determined by the value function / valence guess function / thought assessor in conjunction with ground-truth rewards / actual valence—which in turn involves the reward function, and the past history of rewards, and TD learning, blah blah. Same as any other kind of thought.
…I won’t keep going with your other points, because it’s more of the same idea.
Sorry, I think I intended to write what I think you think, and then just clarified my own thoughts, and forgot to edit the beginning. Sorry, I ought to have properly recalled your model.
Yes, I think I understand your translations and your framing of the value function.
Here are the key differences between a (more concrete version of) my previous model and what I think your model is. Please lmk if I’m still wrongly describing your model:
plans vs thoughts
My previous model: The main work for devising plans/thoughts happens in the world-model/thought-generator, and the value function evaluates plans.
Your model: The value function selects which of some proposed thoughts to think next. Planning happens through the value function steering the thoughts, not the world model doing so.
detailedness of evaluation of value function
My previous model: The learned value function is a relatively primitive map from the predicted effects of plans to a value which describes whether the plan is likely better than the expected counterfactual plan. E.g. maybe sth roughly like that we model how sth like units of exchange (including dimensions like “how much does Alice admire me”) change depending on a plan, and then there is a relatively simple function from the vector of units to values. When having abstract thoughts, the value function doesn’t understand much of the content there, and only uses some simple heuristics for deciding how to change its value estimate. E.g. a heuristic might be “when there’s a thought that the world model thinks is valid and it is associated to the (self-model-invoking) thought “this is bad for accomplishing my goals”, then it lowers its value estimate. In humans slightly smarter than the current smartest humans, it might eventually learn the heuristic “do an explicit expected utility estimate and just take what the result says as the value estimate”, and then that is being done and the value function itself doesn’t understand much about what’s going on in the expected utility estimate, but it just allows to happen whatever the abstract reasoning engine predicts. So it essentially optimizes goals that are stored as beliefs in the world model.
So technically you could still say “but what gets done still depends on the value function, so when the value function just trusts some optimization procedure which optimizes a stored goal, and that goal isn’t what we intended, then the value function is misaligned”. But it seems sorta odd because the value function isn’t really the main relevant thing doing the optimization.
The value function essentially is too dumb to do the main optimization itself for accomplishing extremely hard tasks. Even if you set incentives so that you get ground-truth reward for moving closer to the goal, it would be too slow at learning what strategies work well
Your model: The value function has quite a good model of what thoughts are useful to think. It is just computing value estimates, but it can make quite coherent estimates to accomplish powerful goals.
If there are abstract thoughts about actually optimizing a different goal than is in the interest of the value function, the value function shuts them down by assigning low value.
(My thoughts: One intuition is that to get to pivotal intelligence level, the value function might need some model of its own goals in order to efficiently recognizing when some values it is assigning aren’t that coherent, but I’m pretty unsure of that. Do you think the value function can learn a model of its own values?)
There’s a spectrum between my model and yours. I don’t know what model is better; at some point I’ll think about what may be a good model here. (Feel free to lmk your thoughts on why your model may be better, though maybe I just see it when in the future I think about it more carefully and reread some of your posts and model your model in more detail. I’m currently not modelling either model that detailed.)
Thanks! Oddly enough, in that comment I’m much more in agreement with the model you attribute to yourself than the model you attribute to me. ¯\_(ツ)_/¯
the value function doesn’t understand much of the content there, and only uses some simple heuristics for deciding how to change its value estimate
Think of it as a big table that roughly-linearly assigns good or bad vibes to all the bits and pieces that comprise a thought, and adds them up into a scalar final answer. And a plan is just another thought. So “I’m gonna get that candy and eat it right now” is a thought, and also a plan, and it gets positive vibes from the fact that “eating candy” is part of the thought, but it also gets negative vibes from the fact that “standing up” is part of the thought (assume that I’m feeling very tired right now). You add those up into the final value / valence, which might or might not be positive, and accordingly you might or might not actually get the candy. (And if not, some random new thought will pop into your head instead.)
Why does the value function assign positive vibes to eating-candy? Why does it assign negative vibes to standing-up-while-tired? Because of the past history of primary rewards via (something like) TD learning, which updates the value function.
Does the value function “understand the content”? No, the value function is a linear functional on the content of a thought. Linear functionals don’t understand things. :)
(I feel like maybe you’re going wrong by thinking of the value function and Thought Generator as intelligent agents rather than “machines that are components of a larger machine”?? Sorry if that’s uncharitable.)
[the value function] only uses some simple heuristics for deciding how to change its value estimate. E.g. a heuristic might be “when there’s a thought that the world model thinks is valid and it is associated to the (self-model-invoking) thought “this is bad for accomplishing my goals”, then it lowers its value estimate.
The value function is a linear(ish) functional whose input is a thought. A thought is an object in some high-dimensional space, related to the presence or absence of all the different concepts comprising it. Some concepts are real-world things like “candy”, other concepts are metacognitive, and still other concepts are self-reflective. When a metacognitive and/or self-reflective concept is active in a thought, the value function will correspondingly assign extra positive or negative vibes—just like if any other kind of concept is active. And those vibes depending on the correlations of those concepts with past rewards via (something like) TD learning.
So “I will fail at my goals” would be a kind of thought, and TD learning would gradually adjust the value function such that this thought has negative valence. And this thought can co-occur with or be a subset of other thoughts that involve failing at goals, because the Thought Generator is a machine that learns these kinds of correlations and implications, thanks to a different learning algorithm that sculpts it into an ever-more-accurate predictive world-model.
If the value function is simple, I think it may be a lot worse than the world-model/thought-generator at evaluating what abstract plans are actually likely to work (since the agent hasn’t yet tried a lot of similar abstract plans from where it could’ve observed results, and the world model’s prediction making capabilities generalize further). The world model may also form some beliefs about what the goals/values in a given current situation are. So let’s say the thought generator outputs plans along with predictions about those plans, and some of those predictions predict how well a plan is going to fulfill what it believes the goals are (like approximate expected utility). Then the value function might learn to just just look at this part of a thought that predicts the expected utility, and then take that as it’s value estimate.
Or perhaps a slightly more concrete version of how that may happen. (I’m thinking about model-based actor-critic RL agents which start out relatively unreflective, rather than just humans.):
Sometimes the thought generator generates self-reflective thoughts like “what are my goals here”, where upon the thought generator produces an answer “X” to that, and then when thinking how to accomplish X it often comes up with a better (according to the value function) plan than if it tried to directly generate a plan without clarifying X. Thus the value function learns to assign positive valence to thinking “what are my goals here”.
The same can happen with “what are my long-term goals”, where the thought generator might guess something that would cause high reward.
For humans, X is likely more socially nice than would be expected from the value function, since “X are my goals here” is a self-reflective thought where the social dimensions are more important for the overall valence guess.[1]
Later the thought generator may generate the thought “make careful predictions whether the plan will actually accomplish the stated goals well”, where upon the thought generator often finds some incoherences that the value function didn’t notice, and produces a better plan. Then the value function learns to assign high valence to thoughts like “make careful predictions whether the plan will actually accomplish the stated goals well”.
Later the predictions of the thought generator may not always match well with the valence the value function assigns, and it turns out that the thought generator’s predictions often were better. So over time the value function gets updated more and more toward “take the predictions of the thought generator as our valence guess”, since that strategy better predicts later valence guesses.
Now, some goals are mainly optimized by the thought generator predicting how some goals could be accomplished well, and there might be beliefs in the thought generator like “studying rationality may make me better at accomplishing my goals”, causing the agent to study rationality.
And also thoughts like “making sure the currently optimized goal keeps being optimized increases the expected utility according to the goal”.
And maybe later more advanced bootstrapping through thoughts like “understanding how my mind works and exploiting insights to shape it to optimize more effectively would probably help me accomplish my goals”. Though of course for this to be a viable strategy it would at least be as smart as the smartest current humans (which we can assume because otherwise it’s too useless IMO).
So now the value function is often just relaying world-model judgements and all the actually powerful optimization happens in the thought generator. So I would not classify that as the following:
In my view, the big problem with model-based actor-critic RL AGI, the one that I spend all my time working on, is that it tries to kill us viausing its model-based RL capabilities in the way we normally expect—where the planner plans, and the actor acts, and the critic criticizes, and the world-model models the world …and the end-result is that the system makes and executes a plan to kill us.
So in my story, the thought generator learns to model the self-agent and has some beliefs about what goals it may have, and some coherent extrapolation of (some of) those goals is what gets optimized in the end. I guess it’s probably not that likely that those goals are strongly misaligned to the value function on the distribution where the value function can evaluate plans, but there are many possible ways to generalize the values of the value function. For humans, I think that the way this generalization happens is value-laden (aka what human values are depend on this generalization). The values might generalize a bit differently for different humans of course, but it’s plausible that humans share a lot of their prior-that-determines-generalization, so AIs with a different brain architecture might generalize very differently.
Basically, whenever someone thinks “what’s actually my goal here”, I would say that’s already a slight departure from “using one’s model-based RL capabilities in the way we normally expect”. Though I think I would agree that for most humans such departures are rare and small, but I think they get a lot larger for smart reflective people, and I think I wouldn’t describe my own brain as “using one’s model-based RL capabilities in the way we normally expect”. I’m not at all sure about this, but I would expect that “using its model-based RL capabilities in the way we normally expect” won’t get us to pivotal level of capability if the value function is primitive.
If the value function is simple, I think it may be a lot worse than the world-model/thought-generator at evaluating what abstract plans are actually likely to work (since the agent hasn’t yet tried a lot of similar abstract plans from where it could’ve observed results, and the world model’s prediction making capabilities generalize further).
Here’s an example. Suppose I think: “I’m gonna pick the cabinet lock and then eat the candy inside”. The world model / thought generator is in charge of the “is” / plausibility part of this plan (but not the “ought” / desirability part): “if I do this plan, then I will almost definitely wind up eating candy”, versus “if I do this plan, then it probably won’t work, and I won’t eat candy anytime soon”. This is a prediction, and it’s constrained by my understanding of the world, as encoded in the thought generator. For example, if I don’t expect the plan to succeed, I can’t will myself to expect the plan to succeed, any more than I can will myself to sincerely believe that I’m scuba diving right now as I write this sentence.
Remember, the eating-candy is an essential part of the thought. “I’m going to break open the cabinet and eat the candy”. No way am I going to go to all that effort if the concept of eating candy at the end is not present in my mind.
Anyway, if I actually expect that such-and-such plan will lead to me eating candy with near-certainty in the immediate future, then the “me eating candy” concept will be strongly active when I think about the plan; conversely, if I don’t actually expect it to work, or expect it to take 6 hours, then the “me eating candy” concept will be more weakly active. (See image here.)
Meanwhile, the value function is figuring out if this is a good plan or not. But it doesn’t need to assess plausibility—the thought generator already did that. Instead, it’s much simpler: the value function has a positive coefficient on the “me eating candy” concept, because that concept has reliably predicted primary rewards in the past.
So if we combine the value function (linear functional with a big positive coefficient relating “me eating candy” concept activation to the resulting valence-guess) with the thought generator (strong activation of “me eating candy” when I’m actually expecting it to happen, especially soon), then we’re done! We automatically get plausible and immediate candy-eating plans getting a lot of valence / motivational force, while implausible, distant, and abstract candy-eating plans don’t feel so motivating.
Does that help? (I started writing a response to the rest of what you wrote, but maybe it’s better if I pause there and see what you think.)
Yeah I think the parts of my comment where I treated the value function as making predictions on how well a plan works were pretty confused. I agree it’s a better framing that plans proposed by the thought generator include predicted outcomes and the value function evaluates on those. (Maybe I previously imagined the thought generator more like proposing actions, idk.)
So yeah I guess what I wrote was pretty confusing, though I still have some concerns here.
Let’s look at how an agent might accomplish a very difficult goal, where the agent didn’t accomplish similar goals yet so the value function doesn’t already assign higher valence to subgoals:
I think chains of subgoals can potentially be very long, and I don’t think we keep the whole chain in mind to get the positive valence of a thought, so we somehow need a shortcut.
E.g. when I do some work, I think I usually don’t partially imagine the high-valence outcome of filling the galaxies with happy people living interesting lives, which I think is the main reason why I am doing the work I do (athough there are intermediate outcomes that also have some valence).
It’s easy to implement a fix, e.g.: Save an expected utility guess (aka instrumental value) for each subgoal, and then the value function can assign valence according to the expected utility guess. So in this case I might have a thought like “apply the ‘clarify goal’ strategy to make progress towards the subgoal ‘evaluate whether training for corrigibility might work to safely perform a pivotal act’, which has expected utility X”.
So the way I imagine it here, the value function would need to take the expected utility guess X and output a value roughly proportional to X, so that enough valence is supplied to keep the brainstorming going. I think the value function might learn this because it enables the agent to accomplish difficult long-range tasks which yield reward.
The expected utility could be calculated by having the world model see what value (aka expected reward/utility) the value function assigns to the endgoal, and then backpropagating expected utility estimates for subgoals based on how likely and given what resources the larger goal could be accomplished given the smaller goal.
However, the value function is stupid and often not very coherent given some simplicity assumptions of the world model. E.g. the valence of the outcome “1000 lives get saved” isn’t 1000x higher than of “1 life gets saved”.
So the world model’s expected utility estimates come apart from the value function estimates. And it seems to me that for very smart and reflective people, which difficult goals they achieve depend more on their world model’s expected utility guesses, rather than their value function estimates. So I wouldn’t call it “the agent works as we expect model-based RL agents to work”. (And I expect this kind of “the world model assigns expected utility guesses” may be necessary to get to pivotal capability if the value function is simple, though not sure.)
I think chains of subgoals can potentially be very long, and I don’t think we keep the whole chain in mind to get the positive valence of a thought, so we somehow need a shortcut.
We can have hierarchical concepts. So you can think “I’m following the instructions” in the moment, instead of explicitly thinking “I’m gonna do Step 1 then Step 2 then Step 3 then Step 4 then …”. But they cash out as the same thing.
E.g. when I do some work, I think I usually don’t partially imagine the high-valence outcome of filling the galaxies with happy people living interesting lives, which I think is the main reason why I am doing the work I do (athough there are intermediate outcomes that also have some valence).
No offense but unless you have a very unusual personality, your immediate motivations while doing that work are probably mainly social rather than long-term-consequentialist. On a small scale, consequentialist motivations are pretty normal (e.g. walking up the stairs to get your sweater because you’re cold). But long-term-consequentialist actions and motivations are rare in the human world.
Normally people do things because they’re socially regarded as good things to do, not because they have good long-term consequences. Like, if you see someone save money to buy a car, a decent guess is that the whole chain of actions, every step of it, is something that they see as socially desirable. So during the first part, where they’re saving money but haven’t yet bought the car, they’d be proud to tell their friends and role models “I’m saving money—y’know I’m gonna buy a car!”. Saving the money is not a cost with a later benefit. Rather, the benefit is immediate. They don’t even need to be explicitly thinking about the social aspects, I think; once the association is there, just doing the thing feels intrinsically motivating—a primary reward, not a means to an end.
Doing the first step of a long-term plan, without social approval for that first step, is so rare that people generally regard it as highly suspicious. Just look at Earning To Give (EtG) in Effective Altruism, the idea of getting a high-paying job in order to have money and give it to charity. Go tell a normal non-quantitative person about EtG and they’ll assume it’s an obvious lie, and/or that the person is a psycho. That’s how weird it is—it doesn’t even cross most people’s minds that someone is actually doing a socially-weird plan because of its expected long-term consequences, unless the person is Machiavellian or something.
Speaking of which, there’s a fiction trope that basically only villains are allowed to make plans and display intelligence. The way to write a hero in (non-rationalist) fiction is to have conflicts between doing things that have strong immediate social approval, versus doing things for other reasons (e.g. fear, hunger, logic(!)), and the former wins out over the latter.
To be clear, I’m not accusing you of failing to do things with good long-term consequences because they have good long-term consequences. Rather, I would suggest that the pathway is that your brain has settled on the idea that working towards good long-term outcomes is socially good, e.g. the kind of thing that your role models would be happy to hear about. So then you get the immediate intrinsic motivation by doing that kind of work, and yet it’s also true that you’re sincerely working towards consequences that are (hopefully) good. And then some more narrow projects towards that end can also wind up feeling socially good (and hence become intrinsically rewarding, even without explicitly holding their long-term consequences in mind), etc.
the value function might learn this because it enables the agent to accomplish difficult long-range tasks which yield reward
I don’t think this is necessary per above, but I also don’t think it’s realistic. The value function updating rule is something like TD learning, a simple equation / mechanism, not an intelligent force with foresight. (Or sorry if I’m misunderstanding. I didn’t really follow this part or the rest of your comment :( But I can try again if it’s important.)
Rather, I would suggest that the pathway is that your brain has settled on the idea that working towards good long-term outcomes is socially good, e.g. the kind of thing that your role models would be happy to hear about.
Ok yeah I think you’re probably right that for humans (including me) this is the mechanism through which valence is supplied for pursuing long-term objectives, or at least that it probably doesn’t look like the value function deferring to expected utility guess of the world model.
I think it doesn’t change much of the main point, that the impressive long-term optimization happens mainly through expected utility guesses the world model makes, rather than value guesses of the value function. (Where the larger context is that I am pushing back against your framing of “inner alignment is about the value function ending up accurately predicting expected reward”.)
E.g. when I do some work, I think I usually don’t partially imagine the high-valence outcome of filling the galaxies with happy people living interesting lives, which I think is the main reason why I am doing the work I do (athough there are intermediate outcomes that also have some valence).
No offense but unless you have a very unusual personality, your immediate motivations while doing that work are probably mainly social rather than long-term-consequentialist.
I agree that for ~all thoughts I think, they have high enough valence for non-long-term reasons, e.g. self-image valence related.
But I do NOT mean what’s the reason why I am motivated to work on whatever particular alignment subproblem I decided to work on, but why I decided to work on that rather than something else. And the process that led to that decision is sth like “think hard about how to best increase the probability that human-aligned superintelligence is built → … → think that I need to get an even better inside view on how feasible alignment/corrigibility is → plan going through alignment proposals and playing the builder-breaker-game”.
So basically I am thinking about problems like “does doing planA or planB cause a higher expected reduction in my probability of doom”. Where I am perhaps motivated to think that because it’s what my role models would approve of. But the decision of what plan I end up pursuing doesn’t depend on the value function. And those decisions are the ones that add up to accomplishing very long-range objectives.
It might also help to imagine the extreme case: Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to me well-described as “using his model-based RL capabilities in the way we normally would expect”. And yet it’s plausible to me that an AI would need to move a chunk into the direction of thinking like this keeper to reach pivotal capability.
Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to me well-described as “using his model-based RL capabilities in the way we normally would expect”.
Why not? If he’s using such-and-such heuristic, then presumably that heuristic is motivating to them—assigned a positive value by the value function. And the reason it’s assigned a positive value by the value function is because of the past history of primary rewards etc.
the impressive long-term optimization happens mainly through expected utility guesses the world model makes
The candy example involves good long-term planning right? But not explicit guesses of expected utility.
…But sure, it is possible for somebody’s world-model to have a “I will have high expected utility” concept, and for that concept to wind up with high valence, in which case the person will do things consistent with (their explicit beliefs about) getting high utility (at least other things equal and when they’re thinking about it).
But then I object to your suggestion (IIUC) that what constitutes “high utility” is not strongly and directly grounded by primary rewards.
For example, if I simply declare that “my utility” is equal by definition to the fraction of shirts on Earth that have an odd number of buttons (as an example of some random thing with no connection to my primary rewards), then my value function won’t assign a positive value to the “my utility” concept. So it won’t feel motivating. The idea of “increasing my utility” will feel like a dumb pointless idea to me, and so I won’t wind up doing it.
But the decision of what plan I end up pursuing doesn’t depend on the value function.
The world-model does the “is” stuff, which in this case includes the fact that planA causes a higher expected reduction in pdoom than planB. The value function (and reward function) does the “ought” stuff, which in this case includes the notion that low pdoom is good and high pdoom is bad, as opposed to the other way around.
(Sorry if I’m misunderstanding, here or elsewhere.)
The candy example involves good long-term planning right? But not explicit guesses of expected utility.
(No I wouldn’t say the candy example involves long-term planning—it’s fairly easy and doesn’t take that many steps. It’s true that long-term results can be accomplished without expected utility guesses from the world model, but I think it may be harder for really really hard problems because the value function isn’t that coherent.)
Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to me well-described as “using his model-based RL capabilities in the way we normally would expect”.
Why not? If he’s using such-and-such heuristic, then presumably that heuristic is motivating to them—assigned a positive value by the value function. And the reason it’s assigned a positive value by the value function is because of the past history of primary rewards etc.
Say during keeper training the keeper was rewarded for thinking in productive ways, so the value function may have learned to supply valence for thinking in productive ways.
The way I currently think of it, it doesn’t matter which goal the keeper then attacks, because the value function still assigns high valence for thinking in those fun productive ways. So most goals/values could be optimized that way.
Of course, the goals the keeper will end up optimizing are likely close to some self-reflective thoughts that have high valence. It could be an unlikely failure mode, but it’s possible that the thing that gets optimized ends up different from what was high valence. If that happens, strategic thinking can be used to figure out how keep valence flowing / how to motivate your brain to continue working on something.
The world-model does the “is” stuff, which in this case includes the fact that planA causes a higher expected reduction in pdoom than planB. The value function (and reward function) does the “ought” stuff, which in this case includes the notion that low pdoom is good and high pdoom is bad, as opposed to the other way around.
Ok actually the way I imagined it, the value function doesn’t evaluate based on abstract concepts like pdoom, but rather the whole reasoning is related to thoughts like “i am thinking like the person I want to be” which have high valence.
(Though I guess your pdoom evaluation is similar to the “take the expected utility guess from the world model” value function that I orignially had in mind. I guess the way I modeled it was maybe more like that there’s a belief like “pdoom=high ⇔ bad” and then the value function is just like “apparently that option is bad, so let’s not do that”, rather than the value function itself assinging low value to high pdoom. (Where the value function previously would’ve needed to learn to trust the good/bad judgement of the world model, though again I think it’s unlikely that it works that way in humans.))
How do you imagine the value function might learn to assign negative valence to “pdoom=high”?
Say during keeper training the keeper was rewarded for thinking in productive ways, so the value function may have learned to supply valence for thinking in productive ways.
The way I currently think of it, it doesn’t matter which goal the keeper then attacks, because the value function still assigns high valence for thinking in those fun productive ways.
You seem to be in a train-then-deploy mindset, rather than a continuous-learning mindset, I think. In my view, the value function never stops being edited to hew closely to primary rewards. The minute the value function claims that a primary reward is coming, and then no primary reward actually arrives, the value function will be edited to not make that prediction again.
For example, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable. Not only will she turn the music right back off, but she has also learned that it’s pointless to even turn it on, at least when she’s in this mood. That would be a value function update.
Now, it’s possible that the Keeper 101 course was taught by a teacher who the trainee looked up to. Then the teacher said “X is good”, where X could be a metacognitive strategy, a goal, a virtue, or whatever. The trainee may well continue believing that X is good after graduation. But that’s just because there’s a primary reward related to social instincts, and imagining yourself as being impressive to people you admire. I agree that this kind of primary reward can support lots of different object-level motivations—cultural norms are somewhat arbitrary.
How do you imagine the value function might learn to assign negative valence to “pdoom=high”?
Could be the social copying thing I mentioned above, or else the person is thinking of one of the connotations and implications of pdoom that hooks into some other primary reward, like maybe they imagine the robot apocalypse will be physically painful, and pain is bad (primary reward), or doom will mean no more friendship and satisfying-curiosity, but friendship and satisfying-curiosity are good (primary reward), etc. Or more than one of the above, and/or different for different people.
Thanks! I think you’re right that my “value function still assigns high valence for thinking in those fun productive ways” hypothesis isn’t realistic for the reason you described.
Then the teacher said “X is good”, where X could be a metacognitive strategy, a goal, a virtue, or whatever. The trainee may well continue believing that X is good after graduation. But that’s just because there’s a primary reward related to social instincts, and imagining yourself as being impressive to people you admire.
I somehow previously hadn’t properly internalized that you think primary reward fires even if you only imagine another person admiring you. It seems quite plausible but not sure yet.
Paraphrase of your model of how you might end up pursuing what a fictional character would pursue. (Please correct if wrong.):
The fictional character does cool stuff so you start to admire him.
You imagine yourself doing something similarly cool and have the associated thought “the fictional character would be impressed by me”, which triggers primary reward.
The value function learns to assign positive valence to outcomes which the fictional character would be impressed by, since you sometimes imagine the fictional character being impressed afterwards and thus get primary reward.
I still find myself a bit confused:
Getting primary reward only for thinking of something rather than the actual outcome seems weird to me. I guess thoughts are also constrained by world-model-consistency, so you’re incentivized to imagine realistic scenarios that would impress someone, but still.
In particular I don’t quite see the advantage of that design compared to the design where primary reward only triggers on actually impressing people, and then the value function learns to predict that if you impress someone you will get positive reward, and thus predict high value for that and causal upstream events.
(That said it currently seems to me like forming values from imagining fictional characters is a thing, and that seems to be better-than-default predicted by the “primary reward even on just thoughts” hypothesis, though possible that there’s another hypothesis that explains that well too.)
(Tbc, I think fictional characters influencing one’s values is usually relatively weak/rare, though it’s my main hypothesis for how e.g. most of Eliezer’s values were formed (from his science fiction books). But I wouldn’t be shocked if forming values from fictional characters actually isn’t a thing.)
I’m not quite sure whether one would actually think the thought “the fictional character would be impressed by me”. It rather seems like one might think something like “what would the fictional character do”, without imagining the fictional character thinking about oneself.
I’d suggest not using conflated terminology and rather making up your own.
Or rather, first actually don’t use any abstract handles at all and just describe the problems/failure-modes directly, and when you’re confident you have a pretty natural breakdown of the problems with which you’ll stick for a while, then make up your own ontology.
In fact, while in your framework there’s a crisp difference between ground-truth reward and learned value-estimator, it might not make sense to just split the alignment problem in two parts like this:
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
First attempt of explaining what seems wrong: If that was the first I read on outer-vs-inner alignment as a breakdown of the alignment problem, I would expect “rewards that agree with what we want” to mean something like “changes in expected utility according to humanity’s CEV”. (Which would make inner alignment unnecessary because if we had outer alignment we could easily reach CEV.)
Second attempt:
“in a way that agrees with its eventual reward” seems to imply that there’s actually an objective reward for trajectories of the universe. However, the way you probably actually imagine the ground-truth reward is something like humans (who are ideally equipped with good interpretability tools) giving feedback on whether something was good or bad, so the ground-truth reward is actually an evaluation function on the human’s (imperfect) world model. Problems:
Humans don’t actually give coherent rewards which are consistent with a utility function on their world model.
For this problem we might be able to define an extrapolation procedure that’s not too bad.
The reward depends on the state of the world model of the human, and our world models probably often has false beliefs.
Importantly, the setup needs to be designed in a way that there wouldn’t be an incentive to manipulate the humans into believing false things.
Maybe, optimistically, we could mitigate this problem by having the AI form a model of the operators, doing some ontology translation between the operator’s world model and its own world model, and flagging when there seems to be a relavant belief mismatch.
Our world models cannot evaluate yet whether e.g. filling the universe computronium running a certain type of programs would be good, because we are confused about qualia and don’t know yet what would be good according to our CEV. Basically, the ground-truth reward would very often just say “i don’t know yet”, even for cases which are actually very important according to our CEV. It’s not just that we would need a faithful translation of the state of the universe into our primitive ontology (like “there are simulations of lots of happy and conscious people living interesting lives”), it’s also that (1) the way our world model treats e.g. “consciousness” may not naturally map to anything in a more precise ontology, and while our human minds, learning a deeper ontology, might go like “ah, this is what I actually care about—I’ve been so confused”, such value-generalization is likely even much harder to specify than basic ontology translation. And (2), our CEV may include value-shards which we currently do not know of or track at all.
So while this kind of outer-vs-inner distinction might maybe be fine for human-level AIs, it stops being a good breakdown for smarter AIs, since whenever we want to make the AI do something where humans couldn’t evaluate the result within reasonable time, it needs to generalize beyond what could be evaluated through ground-truth reward.
So mainly because of point 3, instead of asking “how can i make the learned value function agree with the ground-truth reward”, I think it may be better to ask “how can I make the learned value function generalize from the ground-truth reward in the way I want”?
(I guess the outer-vs-inner could make sense in a case where your outer evaluation is superhumanly good, though I cannot think of such a case where looking at the problem from the model-based RL framework would still make much sense, but maybe I’m still unimaginative right now.)
Note that I assumed here that the ground-truth signal is something like feedback from humans. Maybe you’re thinking of it differently than I described here, e.g. if you want to code a steering subsystem for providing ground-truth. But if the steering subsystem is not smarter than humans at evaluating what’s good or bad, the same argument applies. If you think your steering subsystem would be smarter, I’d be interested in why.
(All that is assuming you’re attacking alignment from the actor-critic model-based RL framework. There are other possible frameworks, e.g. trying to directly point the utility function on an agent’s world-model, where the key problems are different.)
I think “inner alignment” and “outer alignment” (as I’m using the term) is a “natural breakdown” of alignment failures in the special case of model-based actor-critic RL AGI with a “behaviorist” reward function (i.e., reward that depends on the AI’s outputs, as opposed to what the AI is thinking about). As I wrote here:
Suppose there’s an intelligent designer (say, a human programmer), and they make a reward function R hoping that they will wind up with a trained AGI that’s trying to do X (where X is some idea in the programmer’s head), but they fail and the AGI is trying to do not-X instead. If R only depends on the AGI’s external behavior (as is often the case in RL these days), then we can imagine two ways that this failure happened:
The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)
The AGI was doing the right thing for the wrong reasons but got rewarded anyway (or doing the wrong thing for the right reasons but got punished).
I think it’s useful to catalog possible failures based on whether they involve (1) or (2), and I think it’s reasonable to call them “failures of outer alignment” and “failures of inner alignment” respectively, and I think when (1) is happening rarely or not at all, we can say that the reward function is doing a good job at “representing” the designer’s intention—or at any rate, it’s doing as well as we can possibly hope for from a reward function of that form. The AGI still might fail to acquire the right motivation, and there might be things we can do to help (e.g. change the training environment), but replacing R (which fires exactly to the extent that the AGI’s external behavior involves doing X) by a different external-behavior-based reward function R’ (which sometimes fires when the AGI is doing not-X, and/or sometimes doesn’t fire when the AGI is doing X) seems like it would only make things worse. So in that sense, it seems useful to talk about outer misalignment, a.k.a. situations where the reward function is failing to “represent” the AGI designer’s desired external behavior, and to treat those situations as generally bad.
That definitely does not mean that we should be going for a solution to outer alignment and a separate unrelated solution to inner alignment, as I discussed briefly in §10.6 of that post, and TurnTrout discussed at greater length in Inner and outer alignment decompose one hard problem into two extremely hard problems. (I endorse his title, but I forget whether I 100% agreed with all the content he wrote.)
I find your comment confusing, I’m pretty sure you misunderstood me, and I’m trying to pin down how …
One thing is, I’m thinking that the AGI code will be an RL agent, vaguely in the same category as MuZero or AlphaZero or whatever, which has an obvious part of its source code labeled “reward”. For example, AlphaZero-chess has a reward of +1 for getting checkmate, −1 for getting checkmated, 0 for a draw. Atari-playing RL agents often use the in-game score as a reward function. Etc. These are explicitly parts of the code, so it’s very obvious and uncontroversial what the reward is (leaving aside self-hacking), see e.g. here where an AlphaZero clone checks whether a board is checkmate.
Another thing is, I’m obviously using “alignment” in a narrower sense than CEV (see the post—“the AGI is ‘trying’ to do what the programmer had intended for it to try to do…”)
Another thing is, if the programmer wants CEV (for the sake of argument), and somehow (!!) writes an RL reward function in Python whose output perfectly matches the extent to which the AGI’s behavior advances CEV, then I disagree that this would “make inner alignment unnecessary”. I’m not quite sure why you believe that. The idea is: actor-critic model-based RL agents of the type I’m talking about evaluate possible plans using their learned value function, not their reward function, and these two don’t have to agree. Therefore, what they’re “trying” to do would not necessarily be to advance CEV, even if the reward function were perfect.
If I’m still missing where you’re coming from, happy to keep chatting :)
Another thing is, if the programmer wants CEV (for the sake of argument), and somehow (!!) writes an RL reward function in Python whose output perfectly matches the extent to which the AGI’s behavior advances CEV, then I disagree that this would “make inner alignment unnecessary”. I’m not quite sure why you believe that.
I was just imagining a fully omnicient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle. But nvm, I noticed my first attempt of how I wanted to explain what I feel like is wrong sucked and thus dropped it.
The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)
The AGI was doing the right thing for the wrong reasons but got rewarded anyway (or doing the wrong thing for the right reasons but got punished).
This seems like a sensible breakdown to me, and I agree this seems like a useful distinction (although not a useful reduction of the alignment problem to subproblems, though I guess you agree here).
However, I think most people underestimate how many ways there are for the AI to do the right thing for the wrong reasons (namely they think it’s just about deception), and I think it’s not:
I think we need to make AI have a particular utility function. We have a training distribution where we have a ground-truth reward signal, but there are many different utility functions that are compatible with the reward on the training distribution, which assign different utilities off-distribution. You could avoid talking about utility functions by saying “the learned value function just predicts reward”, and that may work while you’re staying within the distribution we actually gave reward on, since there all the utility functions compatible with the ground-truth reward still agree. But once you’re going off distribution, what value you assign to some worldstates/plans depends on what utility function you generalized to.
I think humans have particular not-easy-to-pin-down machinery inside them, that makes their utility function generalize to some narrow cluster of all ground-truth-reward-compatible utility functions, and a mind with a different mind design is unlikely to generalize to the same cluster of utility functions. (Though we could aim for a different compatible utility function, namely the “indirect alignment” one that say “fulfill human’s CEV”, which has lower complexity than the ones humans generalize to (since the value generalization prior doesn’t need to be specified and can instead be inferred from observations about humans). (I think that is what’s meant by “corrigibly aligned” in “Risks from learned optimization”, though it has been a very long time since I read this.))
Actually, it may be useful to distinguish two kinds of this “utility vs reward mismatch”: 1. Utility/reward being insufficiently defined outside of training distribution (e.g. for what programs to run on computronium). 2. What things in the causal chain producing the reward are the things you actually care about? E.g. that the reward button is pressed, that the human thinks you did something well, that you did something according to some proxy preferences.
Overall, I think the outer-vs-inner framing has some implicit connotation that for inner alignment we just need to make it internalize the ground-truth reward (as opposed to e.g. being deceptive). Whereas I think “internalizing ground-truth reward” isn’t meaningful off distribution and it’s actually a very hard problem to set up the system in a way that it generalizes in the way we want.
But maybe you’re aware of that “finding the right prior so it generalizes to the right utility function” problem, and you see it as part of inner alignment.
I was just imagining a fully omnicient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle.
OK, let’s attach this oracle to an AI. The reason this thought experiment is weird is because the goodness of an AI’s action right now cannot be evaluated independent of an expectation about what the AI will do in the future. E.g., if the AI says the word “The…”, is that a good or bad way for it to start its sentence? It’s kinda unknowable in the absence of what its later words will be.
So one thing you can do is say that the AI bumbles around and takes reversible actions, rolling them back whenever the oracle says no. And the oracle is so good that we get CEV that way. This is a coherent thought experiment, and it does indeed make inner alignment unnecessary—but only because we’ve removed all the intelligence from the so-called AI! The AI is no longer making plans, so the plans don’t need to be accurately evaluated for their goodness (which is where inner alignment problems happen).
Alternately, we could flesh out the thought experiment by saying that the AI does have a lot of intelligence and planning, and that the oracle is doing the best it can to anticipate the AI’s behavior (without reading the AI’s mind). In that case, we do have to worry about the AI having bad motivation, and tricking the oracle by doing innocuous-seeming things until it suddenly deletes the oracle subroutine out of the blue (treacherous turn). So in that version, the AI’s inner alignment is still important. (Unless we just declare that the AI’s alignment is unnecessary in the first place, because we’re going to prevent treacherous turns via option control.)
However, I think most people underestimate how many ways there are for the AI to do the right thing for the wrong reasons (namely they think it’s just about deception), and I think it’s not:
Yeah I mostly think this part of your comment is listing reasons that inner alignment might fail, a.k.a. reasons that goal misgeneralization / malgeneralization can happen. (Which is a fine thing to do!)
If someone thinks inner misalignment is synonymous with deception, then they’re confused. I’m not sure how such a person would have gotten that impression. If it’s a very common confusion, then that’s news to me.
Inner alignment can lead to deception. But outer alignment can lead to deception too! Any misalignment can lead to deception, regardless of whether the source of that misalignment was “outer” or “inner” or “both” or “neither”.
“Deception” is deliberate by definition—otherwise we would call it by another term, like “mistake”. That’s why it has to happen after there are misaligned motivations, right?
Overall, I think the outer-vs-inner framing has some implicit connotation that for inner alignment we just need to make it internalize the ground-truth reward
OK, so I guess I’ll put you down as a vote for the terminology “goal misgeneralization” (or “goal malgeneralization”), rather than “inner misalignment”, as you presumably find that the former makes it more immediately obvious what the concern is. Is that fair? Thanks.
I think we need to make AI have a particular utility function. We have a training distribution where we have a ground-truth reward signal, but there are many different utility functions that are compatible with the reward on the training distribution, which assign different utilities off-distribution. You could avoid talking about utility functions by saying “the learned value function just predicts reward”, and that may work while you’re staying within the distribution we actually gave reward on, since there all the utility functions compatible with the ground-truth reward still agree. But once you’re going off distribution, what value you assign to some worldstates/plans depends on what utility function you generalized to.
I think I fully agree with this in spirit but not in terminology!
I just don’t use the term “utility function” at all in this context. (See §9.5.2 here for a partial exception.) There’s no utility function in the code. There’s a learned value function, and it outputs whatever it outputs, and those outputs determine what plans seem good or bad to the AI, including OOD plans like treacherous turns.
I also wouldn’t say “the learned value function just predicts reward”. The learned value function starts randomly initialized, and then it’s updated by TD learning or whatever, and then it eventually winds up with some set of weights at some particular moment, which can take inputs and produce outputs. That’s the system. We can put a comment in the code that says the value function is “supposed to” predict reward, and of course that code comment will be helpful for illuminating why the TD learning update code is structured the way is etc. But that “supposed to” is just a code comment, not the code itself. Will it in fact predict reward? That’s a complicated question about algorithms. In distribution, it will probably predict reward pretty accurately; out of distribution, it probably won’t; but with various caveats on both sides.
And then if we ask questions like “what is the AI trying to do right now” or “what does the AI desire”, the answer would mainly depend on the value function.
Actually, it may be useful to distinguish two kinds of this “utility vs reward mismatch”: 1. Utility/reward being insufficiently defined outside of training distribution (e.g. for what programs to run on computronium). 2. What things in the causal chain producing the reward are the things you actually care about? E.g. that the reward button is pressed, that the human thinks you did something well, that you did something according to some proxy preferences.
I’ve been lumping those together under the heading of “ambiguity in the reward signal”.
The second one would include e.g. ambiguity between “reward for button being pressed” vs “reward for human pressing the button” etc.
The first one would include e.g. ambiguity between “reward for being-helpful-variant-1” vs “reward for being-helpful-variant-2”, where the two variants are indistinguishable in-distribution but have wildly differently opinions about OOD options like brainwashing or mind-uploading.
Another way to think about it: the causal chain intuition is also an OOD issue, because it only becomes a problem when the causal chains are always intact in-distribution but they can come apart in new ways OOD.
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
I guess just briefly want to flag that I think this summary of inner-vs-outer alignment is confusing in a way that it sounds like one could have a good enough ground-truth reward and then that just has to be internalized.
I think this summary is better: 1. “The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)”. 2. Something else went wrong [not easily compressible].
Sounds like we probably agree basically everywhere.
Yeah you can definitely mark me down in the camp of “not use ‘inner’ and ‘outer’ terminology”. If you need something for “outer”, how about “reward specification (problem/failure)”.
ADDED: I think I probably don’t want a word for inner-alignment/goal-misgeneralization. It would be like having a word for “the problem of landing a human on the moon, except without the part of the problem where we might actively steer the rocket into wrong directions”.
I just don’t use the term “utility function” at all in this context. (See §9.5.2 here for a partial exception.) There’s no utility function in the code. There’s a learned value function, and it outputs whatever it outputs, and those outputs determine what plans seem good or bad to the AI, including OOD plans like treacherous turns.
Yeah I agree they don’t appear in actor-critic model-based RL per se, but sufficiently smart agents will likely be reflective, and then they will appear there on the reflective level I think.
Or more generally I think when you don’t use utility functions explicitly then capability likely suffers, though not totally sure.
I think there’s a connection between (A) a common misconception in thinking about future AI (that it’s not a huge deal if it’s “only” about as good as humans at most things), and (B) a common misconception in economics (the “Lump Of Labor Fallacy”).
So I started writing a blog post elaborating on that, but got stuck because my imaginary reader is not an economist and kept raising objections that amounted to saying “yeah but the Lump Of Labor Fallacy isn’t actually a fallacy, there really is a lump of labor” 🤦
Anyway, it’s bad pedagogy to explain a possibly-unintuitive thing by relating it to a different possibly-unintuitive thing. Oh well. (I might still try again to finish writing it at some point.)
It matters a lot what specifically it means to be “as good at humans at most things”. The vast majority of jobs include both legible, formal tasks and “be a good employee” requirements, much more nebulous and difficult to measure. Being just as good as the median employee at the formal job description, without the flexibility and trust from being a functioning member of society is NOT enough to replace most workers. It’ll replace some, of course.
That said, the fact that “lump of labor” IS a fallacy, and there’s not a fixed amount of work to be done, which more workers simply spread more thinly, means that it’s OK if it displaces many workers—there will be other things they can valuably do.
By that argument, human-level AI is effectively just immigration.
Being just as good as the median employee at the formal job description, without the flexibility and trust from being a functioning member of society is NOT enough to replace most workers. It’ll replace some, of course.
Yup, the context was talking about future AI which can e.g. have the idea that it will found a company, and then do every step to make that happen, and it can do that about as well as the best human (but not dramatically better than the best human).
I definitely sometimes talk to people who say “yes, I agree that that scenario will happen eventually, but it will not significantly change the world. AI would still be just another technology.” (As opposed to “…and then obviously 99.99…% of future companies will be founded by autonomous AIs, because if it becomes possible to mass-produce Jeff Bezos-es by the trillions, then that’s what will happen. And similarly in every other aspect of the economy.)
By that argument, human-level AI is effectively just immigration.
I think “the effective global labor pool increases by a factor of 1000, consisting of 99.9% AIs” is sometimes a useful scenario to bring up in conversation, but it’s also misleading in certain ways. My actual belief is that humans would rapidly have no ability to contribute to the economy in a post-AGI world, for a similar reason as a moody 7-year-old has essentially no ability to contribute to the economy today (in fact, people would pay good money to keep a moody 7-year-old out of their office or factory).
2. Speaking of mistakes, I’m also regretting some comments I made a while ago suggesting that the brain doesn’t do backpropagation. Maybe that’s true in a narrow sense, but Randall O’Reilly has convinced me that the brain definitely does error-driven learning sometimes (I already knew that), and moreover it may well be able to propagate errors through at least one or two layers of a hierarchy, with enough accuracy to converge. No that doesn’t mean that the brain is exactly the same as a PyTorch / Tensorflow Default-Settings Deep Neural Net.
3. My long work-in-progress post on autism continues to be stuck on the fact that there seem to be two theories of social impairment that are each plausible and totally different. In one theory, social interactions are complex and hard to follow / model for cognitive / predictive-model-building reasons. The evidence I like for that is the role of the cerebellum, which sounds awfully causally implicated in autism. Like, absence of a cerebellum can cause autism, if I’m remembering right. In the other theory, modeling social interactions in the neurotypical way (via empathy) is aversive. The evidence I like for that is people with autism self-reporting that eye contact is aversive, among other things. (This is part of “intense world theory”.) Of those two stories, I’m roughly 100% sold on the latter story is right. But the former story doesn’t seem obviously wrong, and I don’t like having two explanations for the same thing (although it’s not impossible, autism involves different symptoms in different people, and they could co-occur for biological reasons rather than computational reasons). I’m hoping that the stories actually come together somehow, and I’m just confused about what the cerebellum and amygdala do. So I’m reading and thinking about that.
4. New theory I’m playing with: the neocortex outputs predictions directly, in addition to motor commands. E.g. “my arm is going to be touched”. Then the midbrain knows not to flinch when someone touches the arm. That could explain why the visual cortex talks to the superior colliculus, which I always thought was weird. Jeff Hawkins says those connections are the neocortex sending out eye movement motor commands, but isn’t that controlled by the frontal eye fields? Oh, then Randall O’Reilly had this mysterious throwaway comment in a lecture that the frontal eye fields seem to be at the bottom of the visual hierarchy if you look at the connections. (He had a reference, I should read it.) I don’t know what the heck is going on.
modeling social interactions in the neurotypical way (via empathy) is aversive
Is it too pessimistic to assume that people mostly model other people in order to manipulate them better? I wonder how much of human mental inconsistency is a defense against modeling. Here on Less Wrong we complain that inconsistent behavior makes you vulnerable to Dutch-booking, but in real life, consistent behavior probably makes you even more vulnerable, because your enemies can easily predict what you do and plan accordingly.
I was just writing about my perspective here; see also Simulation Theory (the opposite of “Theory Theory”, believe it or not!). I mean, you could say that “making friends and being nice to them” is a form of manipulation, in some technical sense, blah blah evolutionary game theory blah blah, I guess. That seems like something Robin Hanson would say :-P I think it’s a bit too cynical if you mean “manipulation” in the everyday sense involving bad intent. Also, if you want to send out vibes of “Don’t mess with me or I will crush you!” to other people—and the ability to make credible threats is advantageous for game-theory reasons—that’s all about being predictable and consistent!
Again as I posted just now, I think the lion’s share of “modeling”, as I’m using the term, is something that happens unconsciously in a fraction of second, not effortful empathy or modeling.
Hmmm… If I’m trying to impress someone, I do indeed effortfully try to develop a model of what they’re impressed by, and then use that model when talking to them. And I tend to succeed! And it’s not all that hard! The most obvious strategy tends to work (i.e., go with what has impressed them in the past, or what they say would be impressive, or what impresses similar people). I don’t really see any aspect of human nature that is working to make it hard for me to impress someone, like by a person randomly changing what they find impressive. Do you? Are there better examples?
I have low confidence debating this, because it seems to me like many things could be explained in various ways. For example, I agree that certain predictability is needed to prevent people from messing with you. On the other hand, certain uncertainty is needed, too—if people know exactly when you would snap and start crushing them, they will go 5% below the line; but if the exact line depends on what you had for breakfast today, they will be more careful about getting too close to it.
Branding: 3 reasons why I prefer “AGI safety” to “AI alignment”
When engineers, politicians, bureaucrats, military leaders, etc. hear the word “safety”, they suddenly perk up and start nodding and smiling. Safety engineering—making sure that systems robustly do what you want them to do—is something that people across society can relate to and appreciate. By contrast, when people hear the term “AI alignment” for the first time, they just don’t know what it means or how to contextualize it.
There are a lot of things that people are working on in this space that aren’t exactly “alignment”—things like boxing, task-limited AI, myopic AI, impact-limited AI, non-goal-directed AI, AGI strategy & forecasting, etc. It’s useful to have a term that includes all those things, and I think that term should be “AGI safety”. Then we can reserve “AI alignment” for specifically value alignment.
Actually, I’m not even sure that “value alignment” is exactly the right term for value alignment. The term “value alignment” is naturally read as something like “the AI’s values are aligned with human values”, which isn’t necessarily wrong, but is a bit vague and not necessarily interpreted correctly. For example, if love is a human value, should the AGI adopt that value and start falling in love? No, they should facilitate humans falling in love. When people talk about CIRL, CEV, etc. it seems to be less about “value alignment” and more about “value indirection” (in the C++ sense), i.e. utility functions that involve human goals and values, and which more specifically define those things by pointing at human brains and human behavior.
I’m skeptical that anyone with that level of responsibility and acumen has that kind of juvenile destructive mindset. Can you think of other explanations?
There’s a difference between people talking about safety in the sense of 1. ‘how to handle a firearm safely’ and the sense of 2. ‘firearms are dangerous, let’s ban all guns’. These leaders may understand/be on board with 1, but disagree with 2.
I think if someone negatively reacts to ‘Safety’ thinking you mean ‘try to ban all guns’ instead of ‘teach good firearm safety’, you can rephrase as ‘Control’ in that context. I think Safety is more inclusive of various aspects of the problem than either ‘Control’ or ‘Alignment’, so I like it better as an encompassing term.
In the era of COVID, we should all be doing cardio exercise if possible, and not at a gym. Here’s what’s been working for me for the past many years. This is not well optimized for perfectly working out every muscle group etc., but it is very highly optimized for convenience, practicality, and sustainability, at least for me personally in my life situation.
(This post is mostly about home cardio exercise, but the last paragraph is about jogging.)
My home exercise routine consists of three simultaneous things: {exercise , YouTube video lectures , RockMyRun}. More on the exercise below. RockMyRun is a site/app that offers music mixes at fixed BPMs—the music helps my energy and the fixed BPM keeps me from gradually slowing down the pace. The video lectures make me motivated to work out, since there’s a lot of stuff I desperately want to learn. :)
Previously I’ve done instead {exercise, movies or TV}. (I still do on rare occasions.) This is motivating when combined with the rule of “no movies or TV unless exercising (or on social special occasions)”. I’ve pretty much followed that rule for years now.
My exercise routine consists of holding a dumbbell in each hands, then doing a sort of simultaneous reverse-lunge while lifting one of the dumbbells, alternating sides, kinda like this picture. Out of numerous things I’ve tried, this is the one that stuck, because it’s compatible with watching TV, compatible with very small spaces including low ceilings, has low risk of injury, doesn’t stomp or make noise, doesn’t require paying attention (once you get the hang of it), and seems to be a pretty good cardio workout (as judged by being able to break a sweat in a freezing cold room). I also do a few pushups now and then as a break, although that means missing what’s on the screen. I’ve gradually increased the dumbbell weight over the years from 3lbs (1.4kg) to now 15lbs (7kg).
I strongly believe that the top priority for an exercise routine is whatever helps you actually keep doing it perpetually. But beyond that, I’ve found some factors that give me a more intense workout: Coffee helps slightly (it’s a performance-enhancing drug! At least for some people); feeling cold at the beginning / being in a cold room seems to help; awesome action-packed movies or TV are a nice boost, but RockMyRun with boring video lectures is good enough. (My most intense workouts are watching music videos or concert recordings, but I get bored of those after a while.)
In other news, I also occasionally jog. RockMyRun is also a really good idea for that, not just for the obvious reasons (energy, pace), but because, when you set the BPM high, your running form magically and effortlessly improves. This completely solved my jogging knee pain problems, which I had struggled with for years. (I learned that tip from here, where he recommends 160BPM. I personally prefer 180BPM, because I like shorter and more intense runs for my time-crunched schedule.)
(I really only skimmed the paper, these are just impressions off the top of my head.)
I agree that “eating this sandwich” doesn’t have a reward prediction per se, because there are lots of different ways to think about eating this sandwich, especially what aspects are salient, what associations are salient, what your hormones and mood are, etc. If neuroeconomics is premised on reward predictions being attached to events and objects rather than thoughts, then I don’t like neuroeconomics either, at least not as a mechanistic theory of psychology. [I don’t know anything about neuroeconomics, maybe that was never the idea anyway.]
But when they float the idea of throwing out rewards altogether, I’m not buying it. The main reason is: I’m trying to understand what the brain does algorithmically, and I feel like I’m making progress towards a coherent picture …and part of that picture is a 1-dimensional signal called reward. If you got rid of that, I just have no idea how to fill in that gap. Doesn’t mean it’s impossible, but I did try to think it through and failed.
There’s also a nice biological story going with the algorithm story: the basal ganglia has a dense web of connections across the frontal lobe, and can just memorize “this meaningless set of neurons firing is associated with that reward, and this meaningless set of neurons firing is associated with that reward, etc. etc.” Then it (1) inhibits all but the highest-reward-predicting activity, and (2) updates the reward predictions based on what happens (TD learning). (Again this and everything else is very sketchy and speculative.)
(DeepMind had a paper that says there’s a reward prediction probability distribution instead of a reward prediction value, which is fine, that’s still consistent with the rest of my story.)
I get how deep neural nets can search for a policy directly. I don’t think those methods are consistent with the other things I believe about the brain (or at least the neocortex). In particular I think the brain does seem to have a mechanism for choosing among different possible actions being considered in parallel, as opposed to a direct learned function from sensory input to output. The paper also mentions learning to compare without learning a value, but I don’t think that works because there are too many possible comparisons (the square of the number of possible thoughts).
Introducing AGI Safety in general, and my research in particular, to novices / skeptics, in 5 minutes, out loud
I might be interviewed on a podcast where I need to introduce AGI risk to a broad audience of people who mostly aren’t familiar with it and/or think it’s stupid. The audience is mostly neuroscientists plus some AI people. I wrote the following as a possible entry-point, if I get thrown some generic opening question like “Tell me about what you’re working on”:
The human brain does all these impressive things, such that humanity was able to transform the world, go to the moon, invent nuclear weapons, wipe out various species, etc. Human brains did all those things by running certain algorithms.
And sooner or later, people will presumably figure out how to run similar algorithms on computer chips.
Then what? That’s the million-dollar question. Then what? What happens when researchers eventually get to the point where they can run human-brain-like algorithms on computer chips?
OK, to proceed I need to split into two ways of thinking about these future AI systems: Like a tool or like a species.
Let’s start with the tool perspective. Here I’m probably addressing the AI people in the audience. You’re thinking, “Oh, you’re talking about AI, well pfft, I know what AI is, I work with AI every day, AI is kinda like language models and ConvNets and AlphaFold and so on. By the time we get future algorithms that are more like how the human brain works, they’re going to be more powerful, sure, but we should still think of them as in the same category as ConvNets, we should think of them like a tool that people will use.” OK, if that’s your perspective, then the goal is for these tools to do the things that we want them to do. And conversely, the concern is that these systems could go about doing things that the programmers didn’t want them to do, and that literally nobody wanted them to do, like try to escape human control. The technical problem here is called The Alignment Problem: If people figure out how to run human-brain-like algorithms on computer chips, and they want those algorithms to try to do X, how can they do that? It’s not straightforward. For example, humans have an innate sex drive, but it doesn’t work very reliably, some people choose to be celibate. OK, so imagine you have the source code for a human-like brain architecture and training environment, and you want it to definitely grow into an adult that really, deeply, wants to do some particular task, like let’s say design solar cells, while also being honest and staying under human control. How would you do that? What exactly would you put into the source code? Nobody knows the answer. And when you dig into it you find that it’s a surprisingly tricky technical problem, for pretty deep reasons. And that technical problem is something that I and others in the field are working on.
That was the tool perspective. But then there’s probably another part of the audience, maybe a lot of the neuroscientists, who are strenuously objecting here: if we run human-brain-like algorithms on computer chips, we shouldn’t think of that as like a tool for humans to use, instead we should think of it like a species, a new intelligent species that we have invited onto our planet, and indeed a species which will eventually think much faster than humans, and be more insightful and creative than humans, and also probably eventually outnumber humans by a huge factor, and so on. In that perspective, the question is: if we’re going to invite this powerful new intelligent species onto our planet, how do we make sure that it’s a species that we actually want to share the planet with? And how do we make sure that they want to continue sharing the planet with us? Or more generally, how do we bring about a good future? There are some interesting philosophy questions here which we can get back to, but putting those aside, there’s also a technical problem to solve, which is, whatever properties we want this new intelligent species to have, we need to actually write source code such that that actually happens. For example, if we want this new species to feel compassion and friendship, we gotta put compassion and friendship into the source code. Human sociopaths are a case study here. Sociopaths exist, therefore it is possible to make an intelligent species that isn’t motivated by compassion and friendship. Not just possible, but strictly easier! I think maybe future programmers will want to put compassion and friendship into the source code, but they won’t know how, so they won’t do it. So I say, let’s try to figure that out ahead of time. Again, I claim this is a very tricky technical problem, when you start digging into it. We can talk about why. Anyway, that technical problem is also something that I’m working on.
So in summary, sooner or later people will figure out how to run human-brain-like algorithms on computer chips, and this is a very very big deal, it could be the best or worst thing that’s ever happened to humanity, and there’s work we can do right now to increase the chance that things go well, including, in particular, technical work that involves thinking about algorithms and AI and reading neuroscience papers. And that’s what I’m working on!
I’m open to feedback; e.g., where might skeptical audience-members fall off the boat? (I am aware that it’s too long for one answer; I expect that I’ll end up saying various pieces of this in some order depending on the flow of the conversation. But still, gotta start somewhere.)
No question that e.g. o3 lying and cheating is bad, but I’m confused why everyone is calling it “reward hacking”.
Let’s define “reward hacking” (a.k.a. specification gaming) as “getting a high RL reward via strategies that were not desired by whoever set up the RL reward”. Right?
If so, well, all these examples on X etc. are from deployment, not training. And there’s no RL reward at all in deployment. (Fine print: Maybe there are occasional A/B tests or thumbs-up/down ratings in deployment, but I don’t think those have anything to do with why o3 lies and cheats.) So that’s the first problem.
Now, it’s possible that, during o3’s RL CoT post-training, it got certain questions correct by lying and cheating. If so, that would indeed be reward hacking. But we don’t know if that happened at all. Another possibility is: OpenAI used a cheating-proof CoT-post-training process for o3, and this training process pushed it in the direction of ruthless consequentialism, which in turn (mis)generalized into lying and cheating in deployment. Again, the end-result is still bad, but it’s not “reward hacking”.
Separately, sycophancy is not “reward hacking”, even if it came from RL on A/B tests, unless the average user doesn’t like sycophancy. But I’d guess that the average user does like quite high levels of sycophancy. (Remember, the average user is some random high school jock.)
Am I misunderstanding something? Or are people just mixing up “reward hacking” with “ruthless consequentialism”, since they have the same vibe / mental image?
I agree people often aren’t careful about this.
Anthropic says
Similarly OpenAI suggests that cheating behavior is due to RL.
Thanks!
I’m now much more sympathetic to a claim like “the reason that o3 lies and cheats is (perhaps) because some reward-hacking happened during its RL post-training”.
But I still think it’s wrong for a customer to say “Hey I gave o3 this programming problem, and it reward-hacked by editing the unit tests.”
Yes, you’re technically right.
I think that using ‘reward hacking’ and ‘specification gaming’ as synonyms is a significant part of the problem. I’d argue that for LLMs, which can learn task specifications not only through RL but also through prompting, it makes more sense to keep those concepts separate, defining them as follows:
Reward hacking—getting a high RL reward via strategies that were not desired by whoever set up the RL reward.
Specification gaming—behaving in a way that satisfies the literal specification of an objective without achieving the outcome intended by whoever specified the objective. The objective may be specified either through RL or through a natural language prompt.
Under those definitions, the recent examples of undesired behavior from deployment would still have a concise label in specification gaming, while reward hacking would remain specifically tied to RL training contexts. The distinction was brought to my attention by Palisade Research’s recent paper Demonstrating specification gaming in reasoning models—I’ve seen this result called reward hacking, but the authors explicitly mention in Section 7.2 that they only demonstrate specification gaming, not reward hacking.
I thought of a fun case in a different reply: Harry is a random OpenAI customer and writes in the prompt “Please debug this code. Don’t cheat.” Then o3 deletes the unit tests instead of fixing the code. Is this “specification gaming”? No! Right? If we define “the specification” as what Harry wrote, then o3 is clearly failing the specification. Do you agree?
Reminds me of
Yep, I agree that there are alignment failures which have been called reward hacking that don’t fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was “Please rewrite my code and get all tests to pass”: in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt “Please debug this code,” then that just seems like a straightforward instruction-following failure, since the instructions didn’t ask the model to touch the code at all. “Please rewrite my code and get all tests to pass. Don’t cheat.” seems like a corner case to me—to decide whether that’s specification gaming, we would need to understand the implicit specifications that the phrase “don’t cheat” conveys.
It’s pretty common for people to use the terms “reward hacking” and “specification gaming” to refer to undesired behaviors that score highly as per an evaluation or a specification of an objective, regardless of whether that evaluation/specification occurs during RL training. I think this is especially common when there is some plausible argument that the evaluation is the type of evaluation that could appear during RL training, even if it doesn’t actually appear there in practice. Some examples of this:
OpenAI described o1-preview succeeding at a CTF task in an undesired way as reward hacking.
Anthropic described Claude 3.7 Sonnet giving an incorrect answer aligned with a validation function in a CoT faithfulness eval as reward hacking. They also used the term when describing the rates of models taking certain misaligned specification-matching behaviors during an evaluation after being fine-tuned on docs describing that Claude does or does not like to reward hack.
This relatively early DeepMind post on specification gaming and the blog post from Victoria Krakovna that it came from (which might be the earliest use of the term specification gaming?) also gives a definition consistent with this.
I think the literal definitions of the words in “specification gaming” align with this definition (although interestingly not the words in “reward hacking”). The specification can be operationalized as a reward function in RL training, as an evaluation function or even via a prompt.
I also think it’s useful to have a term that describes this kind of behavior independent of whether or not it occurs in an RL setting. Maybe this should be reward hacking and specification gaming. Perhaps as Rauno Arike suggests it is best for this term to be specification gaming, and for reward hacking to exclusively refer to this behavior when it occurs during RL training. Or maybe due to the confusion it should be a whole new term entirely. (I’m not sure that the term “ruthless consequentialism” is the ideal term to describe this behavior either as it ascribes certain intentionality that doesn’t necessarily exist.)
Thanks for the examples!
Yes I’m aware that many are using terminology this way; that’s why I’m complaining about it :)
I think your two 2018 Victoria Krakovna links (in context) are all consistent with my narrower (I would say “traditional”) definition. For example, the CoastRunners boat is actually getting a high RL reward by spinning in circles. Even for non-RL optimization problems that she mentions (e.g. evolutionary optimization), there is an objective which is actually scoring the result highly. Whereas for an example of o3 deleting a unit test during deployment, what’s the objective on which the model is actually scoring highly?
Getting a good evaluation afterwards? Nope, the person didn’t want cheating!
The literal text that the person said (“please debug the code”)? For one thing, erasing the unit tests does not satisfy the natural-language phrase “debugging the code”. For another thing, what if the person wrote “Please debug the code. Don’t cheat.” in the prompt, and o3 cheats anyway? Can we at least agree that this case should not be called reward hacking or specification gaming? It’s doing the opposite of its specification, right?
As for terminology, hmm, some options include “lying and cheating”, “ruthless consequentialist behavior” (I added “behavior” to avoid implying intentionality), “loophole-finding”, or “generalizing from a training process that incentivized reward-hacking via cheating and loophole-finding”.
(Note that the last one suggests a hypothesis, namely that if the training process had not had opportunities for successful cheating and loophole-finding, then the model would not be doing those things right now. I think that this hypothesis might or might not be true, and thus we really should be calling it out explicitly instead of vaguely insinuating it.)
On a second review it seems to me the links are consistent with both definitions. Interestingly the google sheet linked in the blog post, which I think is the most canonical collection of examples of specification gaming, contains examples of evaluation-time hacking, like METR finding that o1-preview would sometimes pretend to fine-tune a model to pass an evaluation. Though that’s not definitive, and of course the use of the term can change over time.
I agree that most historical discussion of this among people as well as in the GDM blog post focuses on RL optimization and situations where a model is literally getting a high RL reward. I think this is partly just contingent on these kinds of behaviors historically tending to emerge in an RL setting and not generalizing very much between different environments. And I also think the properties of reward hacks we’re seeing now are very different from the properties we saw historically, and so the implications of the term reward hack now are often different from the implications of the term historically. Maybe this suggests expanding the usage of the term to account for the new implications, or maybe it suggests just inventing a new term wholesale.
I suppose the way I see it is that for a lot of tasks, there is something we want the model to do (which I’ll call the goal), and a literal way we evaluate the model’s behavior (which I’ll call the proxy, though we can also use the term specification). In most historical RL training, the goal was not given to the model, it lay in the researcher’s head, and the proxy was the reward signal that the model was trained on. When working with LLMs nowadays, whether it be during RL training or test-time evaluation or when we’re just prompting a model, we try to write down a good description of the goal in our prompt. What the proxy is depends on the setting. In an RL setting it’s the reward signal, and in an explicit evaluation it’s the evaluation function. When prompting, we sometimes explicitly write a description of the proxy in the prompt. In many circumstances it’s undefined, though perhaps the best analogue of the proxy in the general prompt setting is just whether the human thinks the model did what they wanted the model to do.
So now to talk about some examples:
If I give a model a suite of coding tasks, and I evaluate the tasks by checking if running the test file works or not (as in the METR eval), then I would call it specification gaming, and the objective the model is scoring highly by is the results of the evaluation. By objective I am referring to the actual score the model gets on the evaluation, not what the human wants.
In just pure prompt settings where there is no eval going on, I’m more hesitant to use the terms reward hacking or specification gaming because the proxy is unclear. I do sometimes use the term to refer to the type of behavior that would’ve received a high proxy score if I had been running an eval, though this is a bit sloppy
I think a case could be made for saying a model is specification gaming if it regularly produces results that appear to fulfill prompts given to them by users, according to those users, while secretly doing something else, even if no explicit evaluation is going on. (So using the terminology mentioned before, it succeeds as per the proxy “the human thinks the model did what they wanted the model to do” even if it didn’t fulfill the goal given to it in the prompt.) But I do think this is more borderline and I don’t tend to use the term very much in this situation. And if models just lie or edit tests without fooling the user, they definitely aren’t specification gaming even by this definition.
Maybe a term like deceptive instruction following would be better to use here? Although that term also has the problem of ascribing intentionality.
Perhaps you can say that the implicit proxy in many codebases is successfully passing the tests, though that also seems borderline
An intuition pump for why I like to think about things this way. Let’s imagine I build some agentic environment for my friend to use in RL training. I have some goal behavior I want the model to display, and I have some proxy reward signal I’ve created to evaluate the model’s trajectories in this environment. Let’s now say my friend puts the model in the environment and it performs a set of behaviors that get evaluated highly but are undesired according to my goal. It seems weird to me that I won’t know whether or not this counts as specification gaming until I’m told whether or not that reward signal was used to train the model. I think I should just be able to call this behavior specification gaming independent of this. Taking a step further, I also want to be able to call it specification gaming even if I never intended for the evaluation score to be used to train the model.
On an end note, this conversation as well as Rauno Arike’s comment is making me want to use the term specification gaming more often in the future for describing either RL-time or inference-time evaluation hacking and reward hacking more for RL-time hacking. And maybe I’ll also be more likely to use terms like loophole-finding or deceptive instruction following or something else, though these terms have limitations as well, and I’ll need to think more about what makes sense.
So I think what’s going on with o3 isn’t quite standard-issue specification gaming either.
It feels like, when I use it, if I ever accidentally say something which pattern-matches something which would be said in an eval, o3 exhibits the behavior of trying to figure out what metric it could be evaluated by in this context and how to hack that metric. This happens even if the pattern is shallow and we’re clearly not in an eval context,
I’ll try to see if I can get a repro case which doesn’t have confidential info.
There was a recent in-depth post on reward hacking by @Kei (e.g. referencing this) who might have to say more about this question.
Though I also wanted to just add a quick comment about this part:
It is not quite the same, but something that could partly explain lying is if models get the same amount of reward during training, e.g. 0, for a “wrong” solution as they get for saying something like “I don’t know”. Which would then encourage wrong solutions insofar as they at least have a potential of getting reward occasionally when the model gets the expected answer “by accident” (for the wrong reasons). At least something like that seems to be suggested by this:
Source: Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
I went through and updated my 2022 “Intro to Brain-Like AGI Safety” series. If you already read it, no need to do so again, but in case you’re curious for details, I put changelogs at the bottom of each post. For a shorter summary of major changes, see this twitter thread, which I copy below (without the screenshots & links):
Fun fact: AI-2027 estimates that getting to ASI might take the equivalent of a 100-person team of top human AI research talent working for tens of thousands of years.
I’m curious why ASI would take so much work. What exactly is the R&D labor supposed to be doing each day, that adds up to so much effort? I’m curious how people are thinking about that, if they buy into this kind of picture. Thanks :)
(Calculation details: For example, in October 2027 of the AI-2027 modal scenario, they have “330K superhuman AI researcher copies thinking at 57x human speed”, which is 1.6 million person-years of research in that month alone. And that’s mostly going towards inventing ASI, I think. Did I get that right?)
(My own opinion, stated without justification, is that LLMs are not a paradigm that can scale to ASI, but after some future AI paradigm shift, there will be very very little R&D separating “this type of AI can do anything importantly useful at all” and “full-blown superintelligence”. Like maybe dozens or hundreds of person-years, or whatever, as opposed to millions. More on this in a (hopefully) forthcoming post.)
Whew, a critique that our takeoff should be faster for a change, as opposed to slower.
This depends on how large you think the penalty is for parallelized labor as opposed to serial. If 330k parallel researchers is more like equivalent to 100 researchers at 50x speed than 100 researchers at 3,300x speed, then it’s more like a team of 100 researchers working for (50*57)/12=~250 years.
Also of course to the extent you think compute will be an important input, during October they still just have a month’s worth of total compute even though they’re working for 250-25,000 subjective years.
I’m imagining that there’s a mix of investing tons of effort into optimizing experimenting ideas, implementing and interpreting every experiment quickly, as well as tons of effort into more conceptual agendas given the compute shortage, some of which bear fruit but also involve lots of “wasted” effort exploring possible routes, and most of which end up needing significant experimentation as well to get working.
I don’t share this intuition regarding the gap between the first importantly useful AI and ASI. If so, that implies extremely fast takeoff, correct? Like on the order of days from AI that can do important things to full-blown superintelligence?
Currently there are hundreds or perhaps low thousands of years of relevant research effort going into frontier AI each year. The gap between importantly useful AI and ASI seems larger than a year of current AI progress (though I’m not >90% confident in that, especially if timelines are <2 years). Then we also need to take into account diminishing returns, compute bottlenecks, and parallelization penalties, so my guess is that the required person-years should be at minimum in the thousands and likely much more. Overall the scenario you’re describing is maybe (roughly) my 95th percentile speed?
I’m curious about your definition for importantly useful AI actually. Under some interpretations I feel like current AI should cross that bar.
I’m uncertain about the LLMs thing but would lean toward pretty large shifts by the time of ASI; I think it’s more likely LLMs scale to superhuman coders than to ASI.
Thanks, that’s very helpful!
If we divide the inventing-ASI task into (A) “thinking about and writing algorithms” versus (B) “testing algorithms”, in the world of today there’s a clean division of labor where the humans do (A) and the computers do (B). But in your imagined October 2027 world, there’s fungibility between how much compute is being used on (A) versus (B). I guess I should interpret your “330K superhuman AI researcher copies thinking at 57x human speed” as what would happen if the compute hypothetically all went towards (A), none towards (B)? And really there’s gonna be some division of compute between (A) and (B), such that the amount of (A) is less than I claimed? …Or how are you thinking about that?
Right, but I’m positing a discontinuity between current AI and the next paradigm, and I was talking about the gap between when AI-of-that-next-paradigm is importantly useful versus when it’s ASI. For example, AI-of-that-next-paradigm might arguably already exist today but where it’s missing key pieces such that it barely works on toy models in obscure arxiv papers. Or here’s a more concrete example: Take the “RL agent” line of AI research (AlphaZero, MuZero, stuff like that), which is quite different from LLMs (e.g. “training environment” rather than “training data”, and there’s nothing quite like self-supervised pretraining (see here)). This line of research has led to great results on board games and videogames, but it’s more-or-less economically useless, and certainly useless for alignment research, societal resilience, capabilities research, etc. If it turns out that this line of research is actually much closer to how future ASI will work at a nuts-and-bolts level than LLMs are (for the sake of argument), then we have not yet crossed the “AI-of-that-next-paradigm is importantly useful” threshold in my sense.
If it helps, here’s a draft paragraph from that (hopefully) forthcoming post:
Next:
Well, even if you have an ML training plan that will yield ASI, you still need to run it, which isn’t instantaneous. I dunno, it’s something I’m still puzzling over.
…But yeah, many of my views are pretty retro, like a time capsule from like AI alignment discourse of 2009. ¯\_(ツ)_/¯
That does raise my eyebrows a bit, but also, note that we currently have hundreds of top-level researchers at AGI labs tirelessly working day in and day out, and that all that activity results in a… fairly leisurely pace of progress, actually.[1]
Recall that what they’re doing there is blind atheoretical empirical tinkering (tons of parallel experiments most of which are dead ends/eke out scant few bits of useful information). If you take that research paradigm and ramp it up to superhuman levels (without changing the fundamental nature of the work), maybe it really would take this many researcher-years.
And if AI R&D automation is actually achieved on the back of sleepwalking LLMs, that scenario does seem plausible. These superhuman AI researchers wouldn’t actually be generally superhuman researchers, just superhuman at all the tasks in the blind-empirical-tinkering research paradigm. Which has steeply declining returns to more intelligence added.
That said, yeah, if LLMs actually scale to a “lucid” AGI, capable of pivoting to paradigms with better capability returns on intelligent work invested, I expect it to take dramatically less time.
It’s fast if you use past AI progress as the reference class, but is decidedly not fast if you try to estimate “absolute” progress. Like, this isn’t happening, we’ve jumped to near human-baseline and slowed to a crawl at this level. If we assume the human level is the ground and we’re trying to reach the Sun, it in fact might take millennia at this pace.
A possible reason for that might be the fallibility of our benchmarks. It might be the case that for complex tasks, it’s hard for humans to see farther than their nose.
The short version is getting compute-optimal experiments to self-improve yourself, training to do tasks that unavoidably take a really long time to learn/get data on because of real-world experimentation being necessary, combined with a potential hardware bottleneck on robotics that also requires real-life experimentation to overcome.
Another point is that to the extent you buy the scaling hypothesis at all, then compute bottlenecks will start to bite, and given that researchers will seek small constant improvements they don’t generalize, and this can start a cascade of wrong decisions that could take a very long time to get out of.
I’d like to see that post, and I’d like to see your arguments on why it’s so easy for intelligence to be increased so fast, conditional on a new paradigm shift.
(For what it’s worth, I personally think LLMs might not be the last paradigm, because of their current lack of continuous learning/neuroplasticity plus no long term memory/state, but I don’t expect future paradigms to have an AlphaZero like trajectory curve, where things go from zero to wildly superhuman in days/weeks, though I do think takeoff is faster if we condition on a new paradigm being required for ASI, so I do see the AGI transition to plausibly include having only months until we get superintelligence, and maybe only 1-2 years before superintelligence starts having very, very large physical impacts through robotics, assuming that new paradigms are developed, so I’m closer to hundreds of person years/thousands of person years than dozens of person years).
The world is complicated (see: I, Pencil). You can be superhuman by only being excellent at a few fields, for example politics, persuasion, military, hacking. That still leaves you potentially vulnerable, even if your opponents are unlikely to succeed; or you could hurt yourself by your ignorance in some field. Or you can be superhuman in the sense of being able to make the pencil from scratch, only better at each step. That would probably take more time.
Are you suggesting that e.g. “R&D Person-Years 463205–463283 go towards ensuring that the AI has mastery of metallurgy, and R&D Person-Years 463283–463307 go towards ensuring that the AI has mastery of injection-molding machinery, and …”?
If no, then I don’t understand what “the world is complicated” has to do with “it takes a million person-years of R&D to build ASI”. Can you explain?
…Or if yes, that kind of picture seems to contradict the facts that:
This seems quite disanalogous to how LLMs are designed today (i.e., LLMs can already answer any textbook question about injection-molding machinery, but no human doing LLM R&D has ever worked specifically on LLM knowledge of injection-molding machinery),
This seems quite disanalogous to how the human brain was designed (i.e., humans are human-level at injection-molding machinery knowledge and operation, but Evolution designed human brains for the African Savannah, which lacked any injection-molding machinery).
Yes, I meant it that way.
LLMs quickly acquired the capacity to read what humans wrote and paraphrase it. It is not obvious to me (though that may speak more about my ignorance) that it will be similarly easy to acquire deep understanding of everything.
But maybe it will. I don’t know.
Incidentally, is there any meaningful sense in which we can say how many “person-years of thought” LLMs have already done?
We know they can do things in seconds that would take a human minutes. Does that mean those real-time seconds count as “human-minutes” of thought? Etc.
I’m intrigued by the reports (including but not limited to the Martin 2020 “PNSE” paper) that people can “become enlightened” and have a radically different sense of self, agency, etc.; but friends and family don’t notice them behaving radically differently, or even differently at all. I’m trying to find sources on whether this is true, and if so, what’s the deal. I’m especially interested in behaviors that (naïvely) seem to centrally involve one’s self-image, such as “applying willpower” or “wanting to impress someone”. Specifically, if there’s a person whose sense-of-self has dissolved / merged into the universe / whatever, and they nevertheless enact behaviors that onlookers would conventionally put into one of those two categories, then how would that person describe / conceptualize those behaviors and why they occurred? (Or would they deny the premise that they are still exhibiting those behaviors?) Interested in any references or thoughts, or email / DM me if you prefer. Thanks in advance!
(Edited to add: Ideally someone would reply: “Yeah I have no sense of self, and also I regularly do things that onlookers describe as ‘applying willpower’ and/or ‘trying to impress someone’. And when that happens, I notice the following sequence of thoughts arising: [insert detailed description]”.)
[also posted on twitter where it got a bunch of replies including one by Aella.]
I’ll give it a go.
I’m not very comfortable with the term enlightened but I’ve been on retreats teaching non-dual meditation, received ‘pointing out instructions’ in the Mahamudra tradition and have experienced some bizarre states of mind where it seemed to make complete sense to think of a sense of awake awareness as being the ground thing that was being experienced spontaneously, with sensations, thoughts and emotions appearing to it — rather than there being a separate me distinct from awareness that was experiencing things ‘using my awareness’, which is how it had always felt before.
When I have (or rather awareness itself has) experienced clear and stable non-dual states the normal ‘self’ stuff still appears in awareness and behaves fairly normally (e.g there’s hunger, thoughts about making dinner, impulses to move the body, the body moving around the room making dinner…). Being in that non dual state seemed to add a very pleasant quality of effortlessness and okayness to the mix but beyond that it wasn’t radically changing what the ‘small self’ in awareness was doing.
If later the thought “I want to eat a second portion of ice cream” came up followed by “I should apply some self control. I better not do that.” they would just be things appearing to awareness.
Of course another thing in awareness is the sense that awareness is aware of itself and the fact that everything feels funky and non-dual at the moment. You’d think that might change the chain of thoughts about the ‘small self’ wanting ice cream and then having to apply self control towards itself.
In fact the first few times I had intense non-dual experiences there was a chain of thoughts that went “what the hell is going on? I’m not sure I like this? What if I can’t get back into the normal dualistic state of mind?” followed by some panicked feelings and then the non-dual state quickly collapsing into a normal dualistic state.
With more practice, doing other forms of meditation to build a stronger base of calmness and self-compassion, I was able to experience the non-dual state and the chain of thoughts that appeared would go more like “This time let’s just stick with it a bit longer. Basically no one has a persistent non-dual experience that lasts forever. It will collapse eventually whether you like it or not. Nothing much has really changed about the contents of awareness. It’s the same stuff just from a different perspective. I’m still obviously able to feel calmness and joyfulness, I’m still able to take actions that keep me safe — so it’s fine to hang out here”. And then thoughts eventually wander around to ice cream or whatever. And, again, all this is just stuff appearing within a single unified awake sense of awareness that’s being labelled as the experiencer (rather than the ‘I’ in the thoughts above being the experiencer).
The fact that thoughts referencing the self are appearing in awareness whilst it’s awareness itself that feels like the experiencer doesn’t seem to create as many contradictions as you would expect. I presume that’s partly because awareness itself, is able to be aware of its own contents but not do much else. It doesn’t for example make decisions or have a sense of free will like the normal dualistic self. Those again would just be more appearances in awareness.
However it’s obvious that awareness being spontaneously aware of itself does change things in important and indirect ways. It does change the sequences of thoughts somehow and the overall feeling tone — and therefore behaviour. But perhaps in less radical ways than you would expect. For me, at different times, this ranged from causing a mini panic attack that collapsed the non-dual state (obviously would have been visible from the outside) to subtly imbuing everything with nice effortlessness vibes and taking the sting out of suffering type experiences but not changing my thought chains and behaviour enough to be noticeable from the outside to someone else.
Disclaimer: I felt unsure at several points writing this and I’m still quite new to non-dual experiences. I can’t reliably generate a clear non-dual state on command, it’s rather hit and miss. What I wrote above is written from a fairly dualistic state relying on memories of previous experiences a few days ago. And it’s possible that the non-dual experience I’m describing here is still rather shallow and missing important insights versus what very accomplished meditators experience.
Great description. This sounds very similar to some of my experiences with non-dual states.
I won’t claim that I’m constantly in a self of non-self, but as I’m writing this, I don’t really feel that I’m locally existing in my body. I’m rather the awareness of everything that continuously arises in consciousness.
This doesn’t happen all the time, I won’t claim to be enlightened or anything but maybe this n=1 self-report can help?
Even from this state of awareness, there’s still a will to do something. It is almost like you’re a force of nature moving forward with doing what you were doing before you were in a state of presence awareness. It isn’t you and at the same time it is you. Words are honestly quite insufficient to describe the experience, but If I try to conceptualise it, I’m the universe moving forward by itself. In a state of non-duality, the taste is often very much the same no matter what experience is arising.
There are some times when I’m not fully in a state of non-dual awareness when it can feel like “I” am pretending to do things. At the same time it also kind of feels like using a tool? The underlying motivation for action changes to something like acceptance or helpfulness, and in order to achieve that, there’s this tool of the self that you can apply.
I’m noticing it is quite hard to introspect and try to write from a state of presence awareness at the same time but hopefully it was somewhat helpful?
Could you give me some experiments to try from a state of awareness? I would be happy to try them out and come back.
Extra (relation to some of the ideas): In the Mahayana wisdom tradition, explored in Rob Burbea’s Seeing That Frees, there’s this idea of emptiness, which is very related to the idea of non-dual perception. For all you see is arising from your own constricted view of experience, and so it is all arising in your own head. Realising this co-creation can enable a freedom of interpretation of your experiences.
Yet this view is also arising in your mind, and so you have “emptiness of emptiness,” meaning that you’re left without a basis. Therefore, both non-self and self are false but magnificent ways of looking at the world. Some people believe that the non-dual is better than the dual yet as my Thai Forest tradition guru Ajhan Buddhisaro says, “Don’t poopoo the mind.” The self boundary can be both a restricting and very useful concept, it is just very nice to have the skill to see past it and go back to the state of now, of presence awareness.
Emptiness is a bit like deeply seeing that our beliefs are built up from different axioms and being able to say that the axioms of reality aren’t based on anything but probabilistic beliefs. Or seeing that we have Occam’s razor because we have seen it work before, yet that it is fundamentally completely arbitrary and that the world just is arising spontaneously from moment to moment. Yet Occam’s razor is very useful for making claims about the world.
I’m not sure if that connection makes sense, but hopefully, that gives a better understanding of the non-dual understanding of the self and non-self. (At least the Thai Forest one)
Many helpful replies! Here’s where I’m at right now (feel free to push back!) [I’m coming from an atheist-physicalist perspective; this will bounce off everyone else.]
Hypothesis:
Normies like me (Steve) have an intuitive mental concept “Steve” which is simultaneously BOTH (A) Steve-the-human-body-etc AND (B) Steve-the-soul / consciousness / wellspring of vitalistic force / what Dan Dennett calls a “homunculus” / whatever.
The (A) & (B) “Steve” concepts are the same concept in normies like me, or at least deeply tangled together. So it’s hard to entertain the possibility of them coming apart, or to think through the consequences if they do.
Some people can get into a Mental State S (call it a form of “enlightenment”, or pick your favorite terminology) where their intuitive concept-space around (B) radically changes—it broadens, or disappears, or whatever. But for them, the (A) mental concept still exists and indeed doesn’t change much.
Anyway, people often have thoughts that connect sense-of-self to motivation, like “not wanting to be embarrassed” or “wanting to keep my promises”. My central claim that the relevant sense-of-self involved in that motivation is (A), not (B).
If we conflate (A) & (B)—as normies like me are intuitively inclined to do—then we get the intuition that a radical change in (B) must have radical impacts on behavior. But that’s wrong—the (A) concept is still there and largely unchanged even in Mental State S, and it’s (A), not (B), that plays a role in those behaviorally-important everyday thoughts like “not wanting to be embarrassed” or “wanting to keep my promises”. So radical changes in (B) would not (directly) have the radical behavioral effects that one might intuitively expect (although it does of course have more than zero behavioral effect, with self-reports being an obvious example).
End of hypothesis. Again, feel free to push back!
Some meditators say that before you can get a good sense of non-self you first have to have good self-confidence. I think I would tend to agree with them as it is about how you generally act in the world and what consequences your actions will have. Without this the support for the type B that you’re talking about can be very hard to come by.
Otherwise I do really agree with what you say in this comment.
There is a slight disagreement with the elaboration though, I do not actually think that makes sense. I would rather say that the (A) that you’re talking about is more of a software construct than it is a hardware construct. When you meditate a lot, you realise this and get access to the full OS instead of just the specific software or OS emulator. A is then an evolutionary beneficial algorithm that runs a bit out of control (for example during childhood when we attribute all cause and effect to our “selves”).
Meditation allows us to see that what we have previously attributed to the self was flimsy and dependent on us believing that the hypothesis of the self is true.
My experience is different from the two you describe. I typically fully lack (A)[1], and partially lack (B). I think this is something different from what others might describe as ‘enlightenment’.
I might write more about this if anyone is interested.
At least the ‘me-the-human-body’ part of the concept. I don’t know what the ‘-etc’ part refers to.
I just made a wording change from:
to:
I think that’s closer to what I was trying to get across. Does that edit change anything in your response?
The “etc” would include things like the tendency for fingers to reactively withdraw from touching a hot surface.
Elaborating a bit: In my own (physicalist, illusionist) ontology, there’s a body with a nervous system including the brain, and the whole mental world including consciousness / awareness is inextricably part of that package. But in other people’s ontology, as I understand it, some nervous system activities / properties (e.g. a finger reactively withdrawing from pain, maybe some or all other desires and aversions) gets lumped in with the body, whereas other [things that I happen to believe are] nervous system activities / properties (e.g. awareness) gets peeled off into (B). So I said “etc” to include all the former stuff. Hopefully that’s clear.
(I’m trying hard not to get sidetracked into an argument about the true nature of consciousness—I’m stating my ontology without defending it.)
No.
Overall, I would say that my self-concept is closer to what a physicalist ontology implies is mundanely happening—a neural network, lacking a singular ‘self’ entity inside it, receiving sense data from sensors and able to output commands to this strange, alien vessel (body). (And also I only identify myself with some parts of the non-mechanistic-level description of what the neural network is doing).
I write in a lot more detail below. This isn’t necessarily written at you in particular, or with the expectation of you reading through all of it.
1. Non-belief in self-as-body (A)
I see two kinds of self-as-body belief. The first is looking in a mirror, or at a photo, and thinking, “that [body] is me.” The second is controlling the body, and having a sense that you’re the one moving it, or more strongly, that it is moving because it is you (and you are choosing to move).
I’ll write about my experiences with the second kind first.
The way a finger automatically withdraws from heat does not feel like a part of me in any sense. Yesterday, I accidentally dropped a utensil and my hands automatically snapped into place around it somehow, and I thought something like, “woah, I didn’t intend to do that. I guess it’s a highly optimized narrow heuristic, from times where reacting so quickly was helpful to survival”.
I experimented a bit between writing this, and I noticed one intuitive view I can have of the body is that it’s some kind of machine that automatically follows such simple intents about the physical world (including intents that I don’t consider ‘me’, like high fear of spiders). For example, if I have motivation and intent to open a window, then the body just automatically moves to it and opens it without me really noticing that the body itself (or more precisely, the body plus the non-me nervous/neural structure controlling it) is the thing doing that—it’s kind of like I’m a ghost (or abstract mind) with telekinesis powers (over nearby objects), but then we apply reductive physics and find that actually there’s a causal chain beneath the telekinesis involving a moving body (which I always know and can see, I just don’t usually think about it).
The way my hands are moving on the keyboard as I write this also doesn’t particularly feel like it’s me doing that; in my mind, I’m just willing the text to be written, and then the movement happens on its own, in a way that feels kind of alien if I actually focus on it (as if the hands are their own life form).
That said, this isn’t always true. I do have an ‘embodied self-sense’ sometimes. For example, I usually fall asleep cuddling stuffies because this makes me happy. At least some purposeful form of sense-of-embodiment seems present there, because the concept of cuddling has embodiment as an assumption.[1]
(As I read over the above, I wonder how different it really is from normal human experience. I’m guessing there’s a subtle difference between “being so embodied it becomes a basic implicit assumption that you don’t notice” and “being so nonembodied that noticing it feels like [reductive physics metaphor]”)
As for the first kind mentioned of locating oneself in the body’s appearance, which informs typical humans perception of others and themself—I don’t experience this with regard to myself (and try to avoid being biased about others this way), instead I just feel pretty dissociated when I see my body reflected and mostly ignore it.
In the past, it instead felt actively stressful/impossible/horrifying, because I had (and to an extent still do have) a deep intuition that I am already a ‘particular kind of being’, and, under the self-as-body ontology, this is expected to correspond to a particular kind of body, one which I did not observe reflected back. As this basic sense-of-self violation happened repeatedly, it gradually eroded away this aspect of sense-of-self / the embodied ontology.
I’d also feel alienated if I had to pilot an adult body to interact with others, so I’ve set up my life such that I only minimally need to do that (e.g for doctors appointments) and can otherwise just interact with the world through text.
2. What parts of the mind-brain are me, and what am I? (B)
I think there’s an extent to which I self-model as an ‘inner homunculus’, or a ‘singular-self inside’. I think it’s lesser and not as robust in me as it is in typical humans, though. For example, when I reflect on this word ‘I’ that I keep using, I notice it has a meaning that doesn’t feel very true of me: the meaning of a singular, unified entity, rather than multiple inner cognitive processes, or no self in particular.
I often notice my thoughts are coming from different parts of the mind. In one case, I was feeling bad about not having been productive enough in learning/generating insights and I thought to myself, “I need to do better”, and then felt aware that it was just one lone part thinking this while the rest doesn’t feel moved; the rest instead culminates into a different inner-monologue-thought: something like, “but we always need to do better. tsuyoku naratai is a universal impetus.” (to be clear, this is not from a different identity or character, but from different neural processes causally prior to what is thought (or written).)
And when I’m writing (which forces us to ‘collapse’ our subverbal understanding into one text), it’s noticeable how much a potential statement is endorsed by different present influences[2].
I tend to use words like ‘I’ and ‘me’ in writing to not confuse others (internally, ‘we’ can feel more fitting, referring again to multiple inner processes[2], and not to multiple high-level selves as some humans experience. ‘we’ is often naturally present in our inner monologue). We’ll use this language for most of the rest of the text[3].
There are times where this is less true. Our mind can return to acting as a human-singular-identity-player in some contexts. For example, if we’re interacting with someone or multiple others, that can push us towards performing a ‘self’ (but unless it’s someone we intuitively-trust and relatively private, we tend to feel alienated/stressed from this). Or if we’re, for example, playing a game with a friend, then in those moments we’ll probably be drawn back into a more childlike humanistic self-ontology rather than the dissociated posthumanism we describe here.
Also, we want to answer “what inner processes?”—there’s some division between parts of the mind-brain we refer to here, and parts that are the ‘structure’ we’re embedded in. We’re not quite sure how to write down the line, and it might be fuzzy or e.g contextual.[4]
3. Tracing the intuitive-ontology shift
“Why are you this way, and have you always been this way?” – We haven’t always. We think this is the result of a gradual erosion of the ‘default’ human ontology, mentioned once above.
We think this mostly did not come from something like ‘believing in physicalism’. Most physicalists aren’t like this. Ontological crises may have been part of it, though—independently synthesizing determinism as a child and realizing it made naive free will impossible sure did make past-child-quila depressed.
We think the strongest sources came from ‘intuitive-ontological’[5] incompatibilities, ways the observations seemed to sadly-contradict the platonic self-ontology we started with. Another term for these would be ‘survival updates’. This can also include ways one’s starting ontology was inadequate to explain certain important observations.
Also, I think that existing so often in a digital-informational context[6], and only infrequently in an analog/physical context, also contributed to eroding the self-as-body belief.
Also, eventually, it wasn’t just erosion/survival updates; at some point, I think I slowly started to embrace this posthumanist ontology, too. It feels narratively fitting that I’m now thinking about artificial intelligence and reading LessWrong.
(There is some sense in which maybe, my proclaimed ontology has its source in constant dissociation, which I only don’t experience when feeling especially comfortable/safe. I’m only speculating, though—this is the kind of thing that I’d consider leaving out, since I’m really unsure about it, it’s at the level of just one of many passing thoughts I’d consider.)
This ‘inner proccesses’ phrasing I keep using doesn’t feel quite right. Other words that come to mind: considerations? currently-active neural subnetworks? subagents? some kind of neural council metaphor?
(sometimes ‘we’ feels unfitting too, it’s weird, maybe ‘I’ is for when a self is being more-performed, or when text is less representative of the whole, hard to say)
We tried to point to some rough differences, but realized that the level we mean is somewhere between high-level concepts with words (like ‘general/narrow cognition’ and ‘altruism’ and ‘biases’) and the lowest-level description (i.e how actual neurons are interacting physically), and that we don’t know how to write about this.
We can differentiate between an endorsed ‘whole-world ontology’ like physicalism, and smaller-scale intuitive ontologies that are more like intuitive frames we seem to believe in, even if when asked we’ll say they’re not fundamental truths.
The intuitive ontology of the self is particularly central to humans.
Note this was mostly downstream of other factors, not causally prior to them. I don’t want anyone to read this and think internet use itself causes body-self incongruence, though it might avoid certain related feedback loops.
Some ultra-short book reviews on cognitive neuroscience
On Intelligence by Jeff Hawkins & Sandra Blakeslee (2004)—very good. Focused on the neocortex—thalamus—hippocampus system, how it’s arranged, what computations it’s doing, what’s the relation between the hippocampus and neocortex, etc. More on Jeff Hawkins’s more recent work here.
I am a strange loop by Hofstadter (2007)—I dunno, I didn’t feel like I got very much out of it, although it’s possible that I had already internalized some of the ideas from other sources. I mostly agreed with what he said. I probably got more out of watching Hofstadter give a little lecture on analogical reasoning (example) than from this whole book.
Consciousness and the brain by Dehaene (2014)—very good. Maybe I could have saved time by just reading Kaj’s review, there wasn’t that much more to the book beyond that.
Conscience by Patricia Churchland (2019)—I hated it. I forget whether I thought it was vague / vacuous, or actually wrong. Apparently I have already blocked the memory!
How to Create a Mind by Kurzweil (2014)—Parts of it were redundant with On Intelligence (which I had read earlier), but still worthwhile. His ideas about how brain-computer interfaces are supposed to work (in the context of cortical algorithms) are intriguing; I’m not convinced, hoping to think about it more.
Rethinking Consciousness by Graziano (2019)—A+, see my review here
The Accidental Mind by Linden (2008)—Lots of fun facts. The conceit / premise (that the brain is a kludgy accident of evolution) is kinda dumb and overdone—and I disagree with some of the surrounding discussion—but that’s not really a big part of the book, just an excuse to talk about lots of fun neuroscience.
The Myth of Mirror Neurons by Hickok (2014)—A+, lots of insight about how cognition works, especially the latter half of the book. Prepare to skim some sections of endlessly beating a dead horse (as he dubunks seemingly endless lists of bad arguments in favor of some aspect of mirror neurons). As a bonus, you get treated to an eloquent argument for the “intense world” theory of autism, and some aspects of predictive coding.
Surfing Uncertainty by Clark (2015)—I liked it. See also SSC review. I think there’s still work to do in fleshing out exactly how these types of algorithms work; it’s too easy to mix things up and oversimplify when just describing things qualitatively (see my feeble attempt here, which I only claim is a small step in the right direction).
Rethinking innateness by Jeffrey Elman, Annette Karmiloff-Smith, Elizabeth Bates, Mark Johnson, Domenico Parisi, and Kim Plunkett (1996)—I liked it. Reading Steven Pinker, you get the idea that connectionists were a bunch of morons who thought that the brain was just a simple feedforward neural net. This book provides a much richer picture.
In [Intro to brain-like-AGI safety] 10. The alignment problem and elsewhere, I’ve been using “outer alignment” and “inner alignment” in a model-based actor-critic RL context to refer to:
For some reason it took me until now to notice that:
my “outer misalignment” is more-or-less synonymous with “specification gaming”,
my “inner misalignment” is more-or-less synonymous with “goal misgeneralization”.
(I’ve been regularly using all four terms for years … I just hadn’t explicitly considered how they related to each other, I guess!)
I updated that post to note the correspondence, but also wanted to signal-boost this, in case other people missed it too.
~~
[You can stop reading here—the rest is less important]
If everybody agrees with that part, there’s a further question of “…now what?”. What terminology should I use going forward? If we have redundant terminology, should we try to settle on one?
One obvious option is that I could just stop using the terms “inner alignment” and “outer alignment” in the actor-critic RL context as above. I could even go back and edit them out of that post, in favor of “specification gaming” and “goal misgeneralization”. Or I could leave it. Or I could even advocate that other people switch in the opposite direction!
One consideration is: Pretty much everyone using the terms “inner alignment” and “outer alignment” are not using them in quite the way I am—I’m using them in the actor-critic model-based RL context, they’re almost always using them in the model-free policy optimization context (e.g. evolution) (see §10.2.2). So that’s a cause for confusion, and point in favor of my dropping those terms. On the other hand, I think people using the term “goal misgeneralization” are also almost always using them in a model-free policy optimization context. So actually, maybe that’s a wash? Either way, my usage is not a perfect match to how other people are using the terms, just pretty close in spirit. I’m usually the only one on Earth talking explicitly about actor-critic model-based RL AGI safety, so I kinda have no choice but to stretch existing terms sometimes.
Hmm, aesthetically, I think I prefer the “outer alignment” and “inner alignment” terminology that I’ve traditionally used. I think it’s a better mental picture. But in the context of current broader usage in the field … I’m not sure what’s best.
(Nate Soares dislikes the term “misgeneralization”, on the grounds that “misgeneralization” has a misleading connotation that “the AI is making a mistake by its own lights”, rather than “something is bad by the lights of the programmer”. I’ve noticed a few people trying to get the variation “goal malgeneralization” to catch on instead. That does seem like an improvement, maybe I’ll start doing that too.)
Note: I just noticed your post has a section “Manipulating itself and its learning process”, which I must’ve completely forgotten since I last read the post. I should’ve read your post before posting this. Will do so.
Calling problems “outer” and “inner” alignment seems to suggest that if we solved both we’ve successfully aligned AI to do nice things. However, this isn’t really the case here.
Namely, there could be a smart mesa-optimizer spinning up in the thought generator, who’s thoughts are mostly invisible to the learned value function (LVF), and who can model the situation it is in and has different values and is smarter than the LVF evaluation and can fool the the LVF into believing the plans that are good according to the mesa-optimizer are great according to the LVF, even if they actually aren’t.
This kills you even if we have a nice ground-truth reward and the LVF accurately captures that.
In fact, this may be quite a likely failure mode, given that the thought generator is where the actual capability comes from, and we don’t understand how it works.
Thanks! But I don’t think that’s a likely failure mode. I wrote about this long ago in the intro to Thoughts on safety in predictive learning.
In my view, the big problem with model-based actor-critic RL AGI, the one that I spend all my time working on, is that it tries to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the actor acts, and the critic criticizes, and the world-model models the world …and the end-result is that the system makes and executes a plan to kill us. I consider that the obvious, central type of alignment failure mode for model-based RL AGI, and it remains an unsolved problem.
I think (??) you’re bringing up a different and more exotic failure mode where the world-model by itself is secretly harboring a full-fledged planning agent. I think this is unlikely to happen. One way to think about it is: if the world-model is specifically designed by the programmers to be a world-model in the context of an explicit model-based RL framework, then it will probably be designed in such a way that it’s an effective search over plausible world-models, but not an effective search over a much wider space of arbitrary computer programs that includes self-contained planning agents. See also §3 here for why a search over arbitrary computer programs would be a spectacularly inefficient way to build all that agent stuff (TD learning in the critic, roll-outs in the planner, replay, whatever) compared to what the programmers will have already explicitly built into the RL agent architecture.
So I think this kind of thing (the world-model by itself spawning a full-fledged planning agent capable of treacherous turns etc.) is unlikely to happen in the first place. And even if it happens, I think the problem is easily mitigated; see discussion in Thoughts on safety in predictive learning. (Or sorry if I’m misunderstanding.)
Thanks.
Yeah I guess I wasn’t thinking concretely enough. I don’t know whether something vaguely like what I described might be likely or not. Let me think out loud a bit about how I think about what you might be imagining so you can correct my model. So here’s a bit of rambling: (I think point 6 is most important.)
As you described in you intuitive self-models sequence, humans have a self-model which can essentially have values different from the main value function, aka they can have ego-dystonic desires.
I think in smart reflective humans, the policy suggestions of the self-model/homunculus can be more coherent than the value function estimates, e.g. because they can better take abstract philosophical arguments into account.
The learned value function can also update on hypothetical scenarios, e.g. imagining a risk or a gain, but it doesn’t update strongly on abstract arguments like “I should correct my estimates based on outside view”.
The learned value function can learn to trust the self-model if acting according to the self-model is consistently correlated with higher-than-expected reward.
Say we have a smart reflective human where the value function basically trusts the self-model a lot, then the self-model could start optimizing its own values, while the (stupid) value function believes it’s best to just trust the self-model and that this will likely lead to reward. Something like this could happen where the value function was actually aligned to outer reward, but the inner suggestor was just very good at making suggestions that the value function likes, even if the inner suggestor would have different actual values. I guess if the self-model suggests something that actually leads to less reward, then the value function will trust the self-model less, but outside the training distribution the self-model could essentially do what it wants.
Another question of course is whether the inner self-reflective optimizers are likely aligned to the initial value function. I would need to think about it. Do you see this as a part of the inner alignment problem or as a separate problem?
As an aside, one question would be whether the way this human makes decisions is still essentially actor-critic model-based RL like—whether the critic just got replaced through a more competent version. I don’t really know.
(Of course, I totally ackgnowledge that humans have pre-wired machinery for their intuitive self-models, rather than that just spawning up. I’m not particularly discussing my original objection anymore.)
I’m also uncertain whether something working through the main actor-critic model-based RL mechanism would be capable enough to do something pivotal. Like yeah, most and maybe all current humans probably work that way. But if you go a bit smarter then minds might use more advanced techniques of e.g. translating problems into abstract domains and writing narrow AIs to solve them there and then translating it back into concrete proposals or sth. Though maybe it doesn’t matter as long as the more advanced techniques don’t spawn up more powerful unaligned minds, in which case a smart mind would probably not use the technique in the first place. And I guess actor-critic model-based RL is sorta like expected utility maximization, which is pretty general and can get you far. Only the native kind of EU maximization we implement through actor-critic model-based RL might be very inefficient compared through other kinds.
I have a heuristic like “look at where the main capability comes from”, and I’d guess for very smart agents it perhaps doesn’t come from the value function making really good estimates by itself, and I want to understand how something could be very capable and look at the key parts for this and whether they might be dangerous.
Ignoring human self-models now, the way I imagine actor-critic model-based RL is that it would start out unreflective. It might eventually learn to model parts of itself and form beliefs about its own values. Then, the world-modelling machinery might be better at noticing inconsistencies in the behavior and value estimates of that agent than the agent itself. The value function might then learn to trust the world-model’s predictions about what would be in the interest of the agent/self.
This seems to me to sorta qualify as “there’s an inner optimizer”. I would’ve tentitatively predicted you to say like “yep but it’s an inner aligned optimizer”, but not sure if you actually think this or whether you disagree with my reasoning here. (I would need to consider how likely value drift from such a change seems. I don’t know yet.)
I don’t have a clear take here. I’m just curious if you have some thoughts on where something importantly mismatches your model.
Thanks! Basically everything you wrote importantly mismatches my model :( I think I can kinda translate parts; maybe that will be helpful.
Background (§8.4.2): The thought generator settles on a thought, then the value function assigns a “valence guess”, and the brainstem declares an actual valence, either by copying the valence guess (“defer-to-predictor mode”), or overriding it (because there’s meanwhile some other source of ground truth, like I just stubbed my toe).
Sometimes thoughts are self-reflective. E.g. “the idea of myself lying in bed” is a different thought from “the feel of the pillow on my head”. The former is self-reflective—it has me in the frame—the latter is not (let’s assume).
All thoughts can be positive or negative valence (motivating or demotivating). So self-reflective thoughts can be positive or negative valence, and non-self-reflective thoughts can also be positive or negative valence. Doesn’t matter, it’s always the same machinery, the same value function / valence guess / thought assessor. That one function can evaluate both self-reflective and non-self-reflective thoughts, just as it can evaluate both sweater-related thoughts and cloud-related thoughts.
When something seems good (positive valence) in a self-reflective frame, that’s called ego-syntonic, and when something seems bad in a self-reflective frame, that’s called ego-dystonic.
Now let’s go through what you wrote:
I would translate that into: “it’s possible for something to seem good (positive valence) in a self-reflective frame, but seem bad in a non-self-reflective frame. Or vice-versa.” After all, those are two different thoughts, so yeah of course they can have two different valences.
I would translate that into: “there’s a decent amount of coherence / self-consistency in the set of thoughts that seem good or bad in a self-reflective frame, and there’s less coherence / self-consistency in the set of things that seem good or bad in a non-self-reflective frame”.
(And there’s a logical reason for that; namely, that hard thinking and brainstorming tends to bring self-reflective thoughts to mind — §8.5.5 — and hard thinking and brainstorming is involved in reducing inconsistency between different desires.)
This one is more foreign to me. A self-reflective thought can have positive or negative valence for the same reasons that any other thought can have positive or negative valence—because of immediate rewards, and because of the past history of rewards, via TD learning, etc.
One thing is: someone can develop a learned metacognitive habit to the effect of “think self-reflective thoughts more often” (which is kinda synonymous with “don’t be so impulsive”). They would learn this habit exactly to the extent and in the circumstances that it has led to higher reward / positive valence in the past.
If someone gets in the habit of “think self-reflective thoughts all the time” a.k.a. “don’t be so impulsive”, then their behavior will be especially strongly determined by which self-reflective thoughts are positive or negative valence.
But “which self-reflective thoughts are positive or negative valence” is still determined by the value function / valence guess function / thought assessor in conjunction with ground-truth rewards / actual valence—which in turn involves the reward function, and the past history of rewards, and TD learning, blah blah. Same as any other kind of thought.
…I won’t keep going with your other points, because it’s more of the same idea.
Does that help explain where I’m coming from?
Thanks!
Sorry, I think I intended to write what I think you think, and then just clarified my own thoughts, and forgot to edit the beginning. Sorry, I ought to have properly recalled your model.
Yes, I think I understand your translations and your framing of the value function.
Here are the key differences between a (more concrete version of) my previous model and what I think your model is. Please lmk if I’m still wrongly describing your model:
plans vs thoughts
My previous model: The main work for devising plans/thoughts happens in the world-model/thought-generator, and the value function evaluates plans.
Your model: The value function selects which of some proposed thoughts to think next. Planning happens through the value function steering the thoughts, not the world model doing so.
detailedness of evaluation of value function
My previous model: The learned value function is a relatively primitive map from the predicted effects of plans to a value which describes whether the plan is likely better than the expected counterfactual plan. E.g. maybe sth roughly like that we model how sth like units of exchange (including dimensions like “how much does Alice admire me”) change depending on a plan, and then there is a relatively simple function from the vector of units to values. When having abstract thoughts, the value function doesn’t understand much of the content there, and only uses some simple heuristics for deciding how to change its value estimate. E.g. a heuristic might be “when there’s a thought that the world model thinks is valid and it is associated to the (self-model-invoking) thought “this is bad for accomplishing my goals”, then it lowers its value estimate. In humans slightly smarter than the current smartest humans, it might eventually learn the heuristic “do an explicit expected utility estimate and just take what the result says as the value estimate”, and then that is being done and the value function itself doesn’t understand much about what’s going on in the expected utility estimate, but it just allows to happen whatever the abstract reasoning engine predicts. So it essentially optimizes goals that are stored as beliefs in the world model.
So technically you could still say “but what gets done still depends on the value function, so when the value function just trusts some optimization procedure which optimizes a stored goal, and that goal isn’t what we intended, then the value function is misaligned”. But it seems sorta odd because the value function isn’t really the main relevant thing doing the optimization.
The value function essentially is too dumb to do the main optimization itself for accomplishing extremely hard tasks. Even if you set incentives so that you get ground-truth reward for moving closer to the goal, it would be too slow at learning what strategies work well
Your model: The value function has quite a good model of what thoughts are useful to think. It is just computing value estimates, but it can make quite coherent estimates to accomplish powerful goals.
If there are abstract thoughts about actually optimizing a different goal than is in the interest of the value function, the value function shuts them down by assigning low value.
(My thoughts: One intuition is that to get to pivotal intelligence level, the value function might need some model of its own goals in order to efficiently recognizing when some values it is assigning aren’t that coherent, but I’m pretty unsure of that. Do you think the value function can learn a model of its own values?)
There’s a spectrum between my model and yours. I don’t know what model is better; at some point I’ll think about what may be a good model here. (Feel free to lmk your thoughts on why your model may be better, though maybe I just see it when in the future I think about it more carefully and reread some of your posts and model your model in more detail. I’m currently not modelling either model that detailed.)
Thanks! Oddly enough, in that comment I’m much more in agreement with the model you attribute to yourself than the model you attribute to me. ¯\_(ツ)_/¯
Think of it as a big table that roughly-linearly assigns good or bad vibes to all the bits and pieces that comprise a thought, and adds them up into a scalar final answer. And a plan is just another thought. So “I’m gonna get that candy and eat it right now” is a thought, and also a plan, and it gets positive vibes from the fact that “eating candy” is part of the thought, but it also gets negative vibes from the fact that “standing up” is part of the thought (assume that I’m feeling very tired right now). You add those up into the final value / valence, which might or might not be positive, and accordingly you might or might not actually get the candy. (And if not, some random new thought will pop into your head instead.)
Why does the value function assign positive vibes to eating-candy? Why does it assign negative vibes to standing-up-while-tired? Because of the past history of primary rewards via (something like) TD learning, which updates the value function.
Does the value function “understand the content”? No, the value function is a linear functional on the content of a thought. Linear functionals don’t understand things. :)
(I feel like maybe you’re going wrong by thinking of the value function and Thought Generator as intelligent agents rather than “machines that are components of a larger machine”?? Sorry if that’s uncharitable.)
The value function is a linear(ish) functional whose input is a thought. A thought is an object in some high-dimensional space, related to the presence or absence of all the different concepts comprising it. Some concepts are real-world things like “candy”, other concepts are metacognitive, and still other concepts are self-reflective. When a metacognitive and/or self-reflective concept is active in a thought, the value function will correspondingly assign extra positive or negative vibes—just like if any other kind of concept is active. And those vibes depending on the correlations of those concepts with past rewards via (something like) TD learning.
So “I will fail at my goals” would be a kind of thought, and TD learning would gradually adjust the value function such that this thought has negative valence. And this thought can co-occur with or be a subset of other thoughts that involve failing at goals, because the Thought Generator is a machine that learns these kinds of correlations and implications, thanks to a different learning algorithm that sculpts it into an ever-more-accurate predictive world-model.
Thanks!
If the value function is simple, I think it may be a lot worse than the world-model/thought-generator at evaluating what abstract plans are actually likely to work (since the agent hasn’t yet tried a lot of similar abstract plans from where it could’ve observed results, and the world model’s prediction making capabilities generalize further). The world model may also form some beliefs about what the goals/values in a given current situation are. So let’s say the thought generator outputs plans along with predictions about those plans, and some of those predictions predict how well a plan is going to fulfill what it believes the goals are (like approximate expected utility). Then the value function might learn to just just look at this part of a thought that predicts the expected utility, and then take that as it’s value estimate.
Or perhaps a slightly more concrete version of how that may happen. (I’m thinking about model-based actor-critic RL agents which start out relatively unreflective, rather than just humans.):
Sometimes the thought generator generates self-reflective thoughts like “what are my goals here”, where upon the thought generator produces an answer “X” to that, and then when thinking how to accomplish X it often comes up with a better (according to the value function) plan than if it tried to directly generate a plan without clarifying X. Thus the value function learns to assign positive valence to thinking “what are my goals here”.
The same can happen with “what are my long-term goals”, where the thought generator might guess something that would cause high reward.
For humans, X is likely more socially nice than would be expected from the value function, since “X are my goals here” is a self-reflective thought where the social dimensions are more important for the overall valence guess.[1]
Later the thought generator may generate the thought “make careful predictions whether the plan will actually accomplish the stated goals well”, where upon the thought generator often finds some incoherences that the value function didn’t notice, and produces a better plan. Then the value function learns to assign high valence to thoughts like “make careful predictions whether the plan will actually accomplish the stated goals well”.
Later the predictions of the thought generator may not always match well with the valence the value function assigns, and it turns out that the thought generator’s predictions often were better. So over time the value function gets updated more and more toward “take the predictions of the thought generator as our valence guess”, since that strategy better predicts later valence guesses.
Now, some goals are mainly optimized by the thought generator predicting how some goals could be accomplished well, and there might be beliefs in the thought generator like “studying rationality may make me better at accomplishing my goals”, causing the agent to study rationality.
And also thoughts like “making sure the currently optimized goal keeps being optimized increases the expected utility according to the goal”.
And maybe later more advanced bootstrapping through thoughts like “understanding how my mind works and exploiting insights to shape it to optimize more effectively would probably help me accomplish my goals”. Though of course for this to be a viable strategy it would at least be as smart as the smartest current humans (which we can assume because otherwise it’s too useless IMO).
So now the value function is often just relaying world-model judgements and all the actually powerful optimization happens in the thought generator. So I would not classify that as the following:
So in my story, the thought generator learns to model the self-agent and has some beliefs about what goals it may have, and some coherent extrapolation of (some of) those goals is what gets optimized in the end. I guess it’s probably not that likely that those goals are strongly misaligned to the value function on the distribution where the value function can evaluate plans, but there are many possible ways to generalize the values of the value function.
For humans, I think that the way this generalization happens is value-laden (aka what human values are depend on this generalization). The values might generalize a bit differently for different humans of course, but it’s plausible that humans share a lot of their prior-that-determines-generalization, so AIs with a different brain architecture might generalize very differently.
Basically, whenever someone thinks “what’s actually my goal here”, I would say that’s already a slight departure from “using one’s model-based RL capabilities in the way we normally expect”. Though I think I would agree that for most humans such departures are rare and small, but I think they get a lot larger for smart reflective people, and I think I wouldn’t describe my own brain as “using one’s model-based RL capabilities in the way we normally expect”. I’m not at all sure about this, but I would expect that “using its model-based RL capabilities in the way we normally expect” won’t get us to pivotal level of capability if the value function is primitive.
If I just trust my model of your model here. (Though I might misrepresent your model. I would need to reread your posts.)
Here’s an example. Suppose I think: “I’m gonna pick the cabinet lock and then eat the candy inside”. The world model / thought generator is in charge of the “is” / plausibility part of this plan (but not the “ought” / desirability part): “if I do this plan, then I will almost definitely wind up eating candy”, versus “if I do this plan, then it probably won’t work, and I won’t eat candy anytime soon”. This is a prediction, and it’s constrained by my understanding of the world, as encoded in the thought generator. For example, if I don’t expect the plan to succeed, I can’t will myself to expect the plan to succeed, any more than I can will myself to sincerely believe that I’m scuba diving right now as I write this sentence.
Remember, the eating-candy is an essential part of the thought. “I’m going to break open the cabinet and eat the candy”. No way am I going to go to all that effort if the concept of eating candy at the end is not present in my mind.
Anyway, if I actually expect that such-and-such plan will lead to me eating candy with near-certainty in the immediate future, then the “me eating candy” concept will be strongly active when I think about the plan; conversely, if I don’t actually expect it to work, or expect it to take 6 hours, then the “me eating candy” concept will be more weakly active. (See image here.)
Meanwhile, the value function is figuring out if this is a good plan or not. But it doesn’t need to assess plausibility—the thought generator already did that. Instead, it’s much simpler: the value function has a positive coefficient on the “me eating candy” concept, because that concept has reliably predicted primary rewards in the past.
So if we combine the value function (linear functional with a big positive coefficient relating “me eating candy” concept activation to the resulting valence-guess) with the thought generator (strong activation of “me eating candy” when I’m actually expecting it to happen, especially soon), then we’re done! We automatically get plausible and immediate candy-eating plans getting a lot of valence / motivational force, while implausible, distant, and abstract candy-eating plans don’t feel so motivating.
Does that help? (I started writing a response to the rest of what you wrote, but maybe it’s better if I pause there and see what you think.)
Thanks.
Yeah I think the parts of my comment where I treated the value function as making predictions on how well a plan works were pretty confused. I agree it’s a better framing that plans proposed by the thought generator include predicted outcomes and the value function evaluates on those. (Maybe I previously imagined the thought generator more like proposing actions, idk.)
So yeah I guess what I wrote was pretty confusing, though I still have some concerns here.
Let’s look at how an agent might accomplish a very difficult goal, where the agent didn’t accomplish similar goals yet so the value function doesn’t already assign higher valence to subgoals:
I think chains of subgoals can potentially be very long, and I don’t think we keep the whole chain in mind to get the positive valence of a thought, so we somehow need a shortcut.
E.g. when I do some work, I think I usually don’t partially imagine the high-valence outcome of filling the galaxies with happy people living interesting lives, which I think is the main reason why I am doing the work I do (athough there are intermediate outcomes that also have some valence).
It’s easy to implement a fix, e.g.: Save an expected utility guess (aka instrumental value) for each subgoal, and then the value function can assign valence according to the expected utility guess. So in this case I might have a thought like “apply the ‘clarify goal’ strategy to make progress towards the subgoal ‘evaluate whether training for corrigibility might work to safely perform a pivotal act’, which has expected utility X”.
So the way I imagine it here, the value function would need to take the expected utility guess X and output a value roughly proportional to X, so that enough valence is supplied to keep the brainstorming going. I think the value function might learn this because it enables the agent to accomplish difficult long-range tasks which yield reward.
The expected utility could be calculated by having the world model see what value (aka expected reward/utility) the value function assigns to the endgoal, and then backpropagating expected utility estimates for subgoals based on how likely and given what resources the larger goal could be accomplished given the smaller goal.
However, the value function is stupid and often not very coherent given some simplicity assumptions of the world model. E.g. the valence of the outcome “1000 lives get saved” isn’t 1000x higher than of “1 life gets saved”.
So the world model’s expected utility estimates come apart from the value function estimates. And it seems to me that for very smart and reflective people, which difficult goals they achieve depend more on their world model’s expected utility guesses, rather than their value function estimates. So I wouldn’t call it “the agent works as we expect model-based RL agents to work”.
(And I expect this kind of “the world model assigns expected utility guesses” may be necessary to get to pivotal capability if the value function is simple, though not sure.)
We can have hierarchical concepts. So you can think “I’m following the instructions” in the moment, instead of explicitly thinking “I’m gonna do Step 1 then Step 2 then Step 3 then Step 4 then …”. But they cash out as the same thing.
No offense but unless you have a very unusual personality, your immediate motivations while doing that work are probably mainly social rather than long-term-consequentialist. On a small scale, consequentialist motivations are pretty normal (e.g. walking up the stairs to get your sweater because you’re cold). But long-term-consequentialist actions and motivations are rare in the human world.
Normally people do things because they’re socially regarded as good things to do, not because they have good long-term consequences. Like, if you see someone save money to buy a car, a decent guess is that the whole chain of actions, every step of it, is something that they see as socially desirable. So during the first part, where they’re saving money but haven’t yet bought the car, they’d be proud to tell their friends and role models “I’m saving money—y’know I’m gonna buy a car!”. Saving the money is not a cost with a later benefit. Rather, the benefit is immediate. They don’t even need to be explicitly thinking about the social aspects, I think; once the association is there, just doing the thing feels intrinsically motivating—a primary reward, not a means to an end.
Doing the first step of a long-term plan, without social approval for that first step, is so rare that people generally regard it as highly suspicious. Just look at Earning To Give (EtG) in Effective Altruism, the idea of getting a high-paying job in order to have money and give it to charity. Go tell a normal non-quantitative person about EtG and they’ll assume it’s an obvious lie, and/or that the person is a psycho. That’s how weird it is—it doesn’t even cross most people’s minds that someone is actually doing a socially-weird plan because of its expected long-term consequences, unless the person is Machiavellian or something.
Speaking of which, there’s a fiction trope that basically only villains are allowed to make plans and display intelligence. The way to write a hero in (non-rationalist) fiction is to have conflicts between doing things that have strong immediate social approval, versus doing things for other reasons (e.g. fear, hunger, logic(!)), and the former wins out over the latter.
To be clear, I’m not accusing you of failing to do things with good long-term consequences because they have good long-term consequences. Rather, I would suggest that the pathway is that your brain has settled on the idea that working towards good long-term outcomes is socially good, e.g. the kind of thing that your role models would be happy to hear about. So then you get the immediate intrinsic motivation by doing that kind of work, and yet it’s also true that you’re sincerely working towards consequences that are (hopefully) good. And then some more narrow projects towards that end can also wind up feeling socially good (and hence become intrinsically rewarding, even without explicitly holding their long-term consequences in mind), etc.
I don’t think this is necessary per above, but I also don’t think it’s realistic. The value function updating rule is something like TD learning, a simple equation / mechanism, not an intelligent force with foresight. (Or sorry if I’m misunderstanding. I didn’t really follow this part or the rest of your comment :( But I can try again if it’s important.)
Ok yeah I think you’re probably right that for humans (including me) this is the mechanism through which valence is supplied for pursuing long-term objectives, or at least that it probably doesn’t look like the value function deferring to expected utility guess of the world model.
I think it doesn’t change much of the main point, that the impressive long-term optimization happens mainly through expected utility guesses the world model makes, rather than value guesses of the value function. (Where the larger context is that I am pushing back against your framing of “inner alignment is about the value function ending up accurately predicting expected reward”.)
I agree that for ~all thoughts I think, they have high enough valence for non-long-term reasons, e.g. self-image valence related.
But I do NOT mean what’s the reason why I am motivated to work on whatever particular alignment subproblem I decided to work on, but why I decided to work on that rather than something else. And the process that led to that decision is sth like “think hard about how to best increase the probability that human-aligned superintelligence is built → … → think that I need to get an even better inside view on how feasible alignment/corrigibility is → plan going through alignment proposals and playing the builder-breaker-game”.
So basically I am thinking about problems like “does doing planA or planB cause a higher expected reduction in my probability of doom”. Where I am perhaps motivated to think that because it’s what my role models would approve of. But the decision of what plan I end up pursuing doesn’t depend on the value function. And those decisions are the ones that add up to accomplishing very long-range objectives.
It might also help to imagine the extreme case: Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to me well-described as “using his model-based RL capabilities in the way we normally would expect”. And yet it’s plausible to me that an AI would need to move a chunk into the direction of thinking like this keeper to reach pivotal capability.
Why not? If he’s using such-and-such heuristic, then presumably that heuristic is motivating to them—assigned a positive value by the value function. And the reason it’s assigned a positive value by the value function is because of the past history of primary rewards etc.
The candy example involves good long-term planning right? But not explicit guesses of expected utility.
…But sure, it is possible for somebody’s world-model to have a “I will have high expected utility” concept, and for that concept to wind up with high valence, in which case the person will do things consistent with (their explicit beliefs about) getting high utility (at least other things equal and when they’re thinking about it).
But then I object to your suggestion (IIUC) that what constitutes “high utility” is not strongly and directly grounded by primary rewards.
For example, if I simply declare that “my utility” is equal by definition to the fraction of shirts on Earth that have an odd number of buttons (as an example of some random thing with no connection to my primary rewards), then my value function won’t assign a positive value to the “my utility” concept. So it won’t feel motivating. The idea of “increasing my utility” will feel like a dumb pointless idea to me, and so I won’t wind up doing it.
The world-model does the “is” stuff, which in this case includes the fact that planA causes a higher expected reduction in pdoom than planB. The value function (and reward function) does the “ought” stuff, which in this case includes the notion that low pdoom is good and high pdoom is bad, as opposed to the other way around.
(Sorry if I’m misunderstanding, here or elsewhere.)
(No I wouldn’t say the candy example involves long-term planning—it’s fairly easy and doesn’t take that many steps. It’s true that long-term results can be accomplished without expected utility guesses from the world model, but I think it may be harder for really really hard problems because the value function isn’t that coherent.)
Say during keeper training the keeper was rewarded for thinking in productive ways, so the value function may have learned to supply valence for thinking in productive ways.
The way I currently think of it, it doesn’t matter which goal the keeper then attacks, because the value function still assigns high valence for thinking in those fun productive ways. So most goals/values could be optimized that way.
Of course, the goals the keeper will end up optimizing are likely close to some self-reflective thoughts that have high valence. It could be an unlikely failure mode, but it’s possible that the thing that gets optimized ends up different from what was high valence. If that happens, strategic thinking can be used to figure out how keep valence flowing / how to motivate your brain to continue working on something.
Ok actually the way I imagined it, the value function doesn’t evaluate based on abstract concepts like pdoom, but rather the whole reasoning is related to thoughts like “i am thinking like the person I want to be” which have high valence.
(Though I guess your pdoom evaluation is similar to the “take the expected utility guess from the world model” value function that I orignially had in mind. I guess the way I modeled it was maybe more like that there’s a belief like “pdoom=high ⇔ bad” and then the value function is just like “apparently that option is bad, so let’s not do that”, rather than the value function itself assinging low value to high pdoom. (Where the value function previously would’ve needed to learn to trust the good/bad judgement of the world model, though again I think it’s unlikely that it works that way in humans.))
How do you imagine the value function might learn to assign negative valence to “pdoom=high”?
You seem to be in a train-then-deploy mindset, rather than a continuous-learning mindset, I think. In my view, the value function never stops being edited to hew closely to primary rewards. The minute the value function claims that a primary reward is coming, and then no primary reward actually arrives, the value function will be edited to not make that prediction again.
For example, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable. Not only will she turn the music right back off, but she has also learned that it’s pointless to even turn it on, at least when she’s in this mood. That would be a value function update.
Now, it’s possible that the Keeper 101 course was taught by a teacher who the trainee looked up to. Then the teacher said “X is good”, where X could be a metacognitive strategy, a goal, a virtue, or whatever. The trainee may well continue believing that X is good after graduation. But that’s just because there’s a primary reward related to social instincts, and imagining yourself as being impressive to people you admire. I agree that this kind of primary reward can support lots of different object-level motivations—cultural norms are somewhat arbitrary.
Could be the social copying thing I mentioned above, or else the person is thinking of one of the connotations and implications of pdoom that hooks into some other primary reward, like maybe they imagine the robot apocalypse will be physically painful, and pain is bad (primary reward), or doom will mean no more friendship and satisfying-curiosity, but friendship and satisfying-curiosity are good (primary reward), etc. Or more than one of the above, and/or different for different people.
Thanks! I think you’re right that my “value function still assigns high valence for thinking in those fun productive ways” hypothesis isn’t realistic for the reason you described.
I somehow previously hadn’t properly internalized that you think primary reward fires even if you only imagine another person admiring you. It seems quite plausible but not sure yet.
Paraphrase of your model of how you might end up pursuing what a fictional character would pursue. (Please correct if wrong.):
The fictional character does cool stuff so you start to admire him.
You imagine yourself doing something similarly cool and have the associated thought “the fictional character would be impressed by me”, which triggers primary reward.
The value function learns to assign positive valence to outcomes which the fictional character would be impressed by, since you sometimes imagine the fictional character being impressed afterwards and thus get primary reward.
I still find myself a bit confused:
Getting primary reward only for thinking of something rather than the actual outcome seems weird to me. I guess thoughts are also constrained by world-model-consistency, so you’re incentivized to imagine realistic scenarios that would impress someone, but still.
In particular I don’t quite see the advantage of that design compared to the design where primary reward only triggers on actually impressing people, and then the value function learns to predict that if you impress someone you will get positive reward, and thus predict high value for that and causal upstream events.
(That said it currently seems to me like forming values from imagining fictional characters is a thing, and that seems to be better-than-default predicted by the “primary reward even on just thoughts” hypothesis, though possible that there’s another hypothesis that explains that well too.)
(Tbc, I think fictional characters influencing one’s values is usually relatively weak/rare, though it’s my main hypothesis for how e.g. most of Eliezer’s values were formed (from his science fiction books). But I wouldn’t be shocked if forming values from fictional characters actually isn’t a thing.)
I’m not quite sure whether one would actually think the thought “the fictional character would be impressed by me”. It rather seems like one might think something like “what would the fictional character do”, without imagining the fictional character thinking about oneself.
I’d suggest not using conflated terminology and rather making up your own.
Or rather, first actually don’t use any abstract handles at all and just describe the problems/failure-modes directly, and when you’re confident you have a pretty natural breakdown of the problems with which you’ll stick for a while, then make up your own ontology.
In fact, while in your framework there’s a crisp difference between ground-truth reward and learned value-estimator, it might not make sense to just split the alignment problem in two parts like this:
First attempt of explaining what seems wrong: If that was the first I read on outer-vs-inner alignment as a breakdown of the alignment problem, I would expect “rewards that agree with what we want” to mean something like “changes in expected utility according to humanity’s CEV”. (Which would make inner alignment unnecessary because if we had outer alignment we could easily reach CEV.)
Second attempt:
“in a way that agrees with its eventual reward” seems to imply that there’s actually an objective reward for trajectories of the universe. However, the way you probably actually imagine the ground-truth reward is something like humans (who are ideally equipped with good interpretability tools) giving feedback on whether something was good or bad, so the ground-truth reward is actually an evaluation function on the human’s (imperfect) world model. Problems:
Humans don’t actually give coherent rewards which are consistent with a utility function on their world model.
For this problem we might be able to define an extrapolation procedure that’s not too bad.
The reward depends on the state of the world model of the human, and our world models probably often has false beliefs.
Importantly, the setup needs to be designed in a way that there wouldn’t be an incentive to manipulate the humans into believing false things.
Maybe, optimistically, we could mitigate this problem by having the AI form a model of the operators, doing some ontology translation between the operator’s world model and its own world model, and flagging when there seems to be a relavant belief mismatch.
Our world models cannot evaluate yet whether e.g. filling the universe computronium running a certain type of programs would be good, because we are confused about qualia and don’t know yet what would be good according to our CEV. Basically, the ground-truth reward would very often just say “i don’t know yet”, even for cases which are actually very important according to our CEV. It’s not just that we would need a faithful translation of the state of the universe into our primitive ontology (like “there are simulations of lots of happy and conscious people living interesting lives”), it’s also that (1) the way our world model treats e.g. “consciousness” may not naturally map to anything in a more precise ontology, and while our human minds, learning a deeper ontology, might go like “ah, this is what I actually care about—I’ve been so confused”, such value-generalization is likely even much harder to specify than basic ontology translation. And (2), our CEV may include value-shards which we currently do not know of or track at all.
So while this kind of outer-vs-inner distinction might maybe be fine for human-level AIs, it stops being a good breakdown for smarter AIs, since whenever we want to make the AI do something where humans couldn’t evaluate the result within reasonable time, it needs to generalize beyond what could be evaluated through ground-truth reward.
So mainly because of point 3, instead of asking “how can i make the learned value function agree with the ground-truth reward”, I think it may be better to ask “how can I make the learned value function generalize from the ground-truth reward in the way I want”?
(I guess the outer-vs-inner could make sense in a case where your outer evaluation is superhumanly good, though I cannot think of such a case where looking at the problem from the model-based RL framework would still make much sense, but maybe I’m still unimaginative right now.)
Note that I assumed here that the ground-truth signal is something like feedback from humans. Maybe you’re thinking of it differently than I described here, e.g. if you want to code a steering subsystem for providing ground-truth. But if the steering subsystem is not smarter than humans at evaluating what’s good or bad, the same argument applies. If you think your steering subsystem would be smarter, I’d be interested in why.
(All that is assuming you’re attacking alignment from the actor-critic model-based RL framework. There are other possible frameworks, e.g. trying to directly point the utility function on an agent’s world-model, where the key problems are different.)
Thanks!
I think “inner alignment” and “outer alignment” (as I’m using the term) is a “natural breakdown” of alignment failures in the special case of model-based actor-critic RL AGI with a “behaviorist” reward function (i.e., reward that depends on the AI’s outputs, as opposed to what the AI is thinking about). As I wrote here:
(A bit more related discussion here.)
That definitely does not mean that we should be going for a solution to outer alignment and a separate unrelated solution to inner alignment, as I discussed briefly in §10.6 of that post, and TurnTrout discussed at greater length in Inner and outer alignment decompose one hard problem into two extremely hard problems. (I endorse his title, but I forget whether I 100% agreed with all the content he wrote.)
I find your comment confusing, I’m pretty sure you misunderstood me, and I’m trying to pin down how …
One thing is, I’m thinking that the AGI code will be an RL agent, vaguely in the same category as MuZero or AlphaZero or whatever, which has an obvious part of its source code labeled “reward”. For example, AlphaZero-chess has a reward of +1 for getting checkmate, −1 for getting checkmated, 0 for a draw. Atari-playing RL agents often use the in-game score as a reward function. Etc. These are explicitly parts of the code, so it’s very obvious and uncontroversial what the reward is (leaving aside self-hacking), see e.g. here where an AlphaZero clone checks whether a board is checkmate.
Another thing is, I’m obviously using “alignment” in a narrower sense than CEV (see the post—“the AGI is ‘trying’ to do what the programmer had intended for it to try to do…”)
Another thing is, if the programmer wants CEV (for the sake of argument), and somehow (!!) writes an RL reward function in Python whose output perfectly matches the extent to which the AGI’s behavior advances CEV, then I disagree that this would “make inner alignment unnecessary”. I’m not quite sure why you believe that. The idea is: actor-critic model-based RL agents of the type I’m talking about evaluate possible plans using their learned value function, not their reward function, and these two don’t have to agree. Therefore, what they’re “trying” to do would not necessarily be to advance CEV, even if the reward function were perfect.
If I’m still missing where you’re coming from, happy to keep chatting :)
Thanks!
I was just imagining a fully omnicient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle. But nvm, I noticed my first attempt of how I wanted to explain what I feel like is wrong sucked and thus dropped it.
This seems like a sensible breakdown to me, and I agree this seems like a useful distinction (although not a useful reduction of the alignment problem to subproblems, though I guess you agree here).
However, I think most people underestimate how many ways there are for the AI to do the right thing for the wrong reasons (namely they think it’s just about deception), and I think it’s not:
I think we need to make AI have a particular utility function. We have a training distribution where we have a ground-truth reward signal, but there are many different utility functions that are compatible with the reward on the training distribution, which assign different utilities off-distribution.
You could avoid talking about utility functions by saying “the learned value function just predicts reward”, and that may work while you’re staying within the distribution we actually gave reward on, since there all the utility functions compatible with the ground-truth reward still agree. But once you’re going off distribution, what value you assign to some worldstates/plans depends on what utility function you generalized to.
I think humans have particular not-easy-to-pin-down machinery inside them, that makes their utility function generalize to some narrow cluster of all ground-truth-reward-compatible utility functions, and a mind with a different mind design is unlikely to generalize to the same cluster of utility functions.
(Though we could aim for a different compatible utility function, namely the “indirect alignment” one that say “fulfill human’s CEV”, which has lower complexity than the ones humans generalize to (since the value generalization prior doesn’t need to be specified and can instead be inferred from observations about humans). (I think that is what’s meant by “corrigibly aligned” in “Risks from learned optimization”, though it has been a very long time since I read this.))
Actually, it may be useful to distinguish two kinds of this “utility vs reward mismatch”:
1. Utility/reward being insufficiently defined outside of training distribution (e.g. for what programs to run on computronium).
2. What things in the causal chain producing the reward are the things you actually care about? E.g. that the reward button is pressed, that the human thinks you did something well, that you did something according to some proxy preferences.
Overall, I think the outer-vs-inner framing has some implicit connotation that for inner alignment we just need to make it internalize the ground-truth reward (as opposed to e.g. being deceptive). Whereas I think “internalizing ground-truth reward” isn’t meaningful off distribution and it’s actually a very hard problem to set up the system in a way that it generalizes in the way we want.
But maybe you’re aware of that “finding the right prior so it generalizes to the right utility function” problem, and you see it as part of inner alignment.
OK, let’s attach this oracle to an AI. The reason this thought experiment is weird is because the goodness of an AI’s action right now cannot be evaluated independent of an expectation about what the AI will do in the future. E.g., if the AI says the word “The…”, is that a good or bad way for it to start its sentence? It’s kinda unknowable in the absence of what its later words will be.
So one thing you can do is say that the AI bumbles around and takes reversible actions, rolling them back whenever the oracle says no. And the oracle is so good that we get CEV that way. This is a coherent thought experiment, and it does indeed make inner alignment unnecessary—but only because we’ve removed all the intelligence from the so-called AI! The AI is no longer making plans, so the plans don’t need to be accurately evaluated for their goodness (which is where inner alignment problems happen).
Alternately, we could flesh out the thought experiment by saying that the AI does have a lot of intelligence and planning, and that the oracle is doing the best it can to anticipate the AI’s behavior (without reading the AI’s mind). In that case, we do have to worry about the AI having bad motivation, and tricking the oracle by doing innocuous-seeming things until it suddenly deletes the oracle subroutine out of the blue (treacherous turn). So in that version, the AI’s inner alignment is still important. (Unless we just declare that the AI’s alignment is unnecessary in the first place, because we’re going to prevent treacherous turns via option control.)
Yeah I mostly think this part of your comment is listing reasons that inner alignment might fail, a.k.a. reasons that goal misgeneralization / malgeneralization can happen. (Which is a fine thing to do!)
If someone thinks inner misalignment is synonymous with deception, then they’re confused. I’m not sure how such a person would have gotten that impression. If it’s a very common confusion, then that’s news to me.
Inner alignment can lead to deception. But outer alignment can lead to deception too! Any misalignment can lead to deception, regardless of whether the source of that misalignment was “outer” or “inner” or “both” or “neither”.
“Deception” is deliberate by definition—otherwise we would call it by another term, like “mistake”. That’s why it has to happen after there are misaligned motivations, right?
OK, so I guess I’ll put you down as a vote for the terminology “goal misgeneralization” (or “goal malgeneralization”), rather than “inner misalignment”, as you presumably find that the former makes it more immediately obvious what the concern is. Is that fair? Thanks.
I think I fully agree with this in spirit but not in terminology!
I just don’t use the term “utility function” at all in this context. (See §9.5.2 here for a partial exception.) There’s no utility function in the code. There’s a learned value function, and it outputs whatever it outputs, and those outputs determine what plans seem good or bad to the AI, including OOD plans like treacherous turns.
I also wouldn’t say “the learned value function just predicts reward”. The learned value function starts randomly initialized, and then it’s updated by TD learning or whatever, and then it eventually winds up with some set of weights at some particular moment, which can take inputs and produce outputs. That’s the system. We can put a comment in the code that says the value function is “supposed to” predict reward, and of course that code comment will be helpful for illuminating why the TD learning update code is structured the way is etc. But that “supposed to” is just a code comment, not the code itself. Will it in fact predict reward? That’s a complicated question about algorithms. In distribution, it will probably predict reward pretty accurately; out of distribution, it probably won’t; but with various caveats on both sides.
And then if we ask questions like “what is the AI trying to do right now” or “what does the AI desire”, the answer would mainly depend on the value function.
I’ve been lumping those together under the heading of “ambiguity in the reward signal”.
The second one would include e.g. ambiguity between “reward for button being pressed” vs “reward for human pressing the button” etc.
The first one would include e.g. ambiguity between “reward for being-helpful-variant-1” vs “reward for being-helpful-variant-2”, where the two variants are indistinguishable in-distribution but have wildly differently opinions about OOD options like brainwashing or mind-uploading.
Another way to think about it: the causal chain intuition is also an OOD issue, because it only becomes a problem when the causal chains are always intact in-distribution but they can come apart in new ways OOD.
I guess just briefly want to flag that I think this summary of inner-vs-outer alignment is confusing in a way that it sounds like one could have a good enough ground-truth reward and then that just has to be internalized.
I think this summary is better: 1. “The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)”. 2. Something else went wrong [not easily compressible].
Sounds like we probably agree basically everywhere.
Yeah you can definitely mark me down in the camp of “not use ‘inner’ and ‘outer’ terminology”. If you need something for “outer”, how about “reward specification (problem/failure)”.
ADDED: I think I probably don’t want a word for inner-alignment/goal-misgeneralization. It would be like having a word for “the problem of landing a human on the moon, except without the part of the problem where we might actively steer the rocket into wrong directions”.
Yeah I agree they don’t appear in actor-critic model-based RL per se, but sufficiently smart agents will likely be reflective, and then they will appear there on the reflective level I think.
Or more generally I think when you don’t use utility functions explicitly then capability likely suffers, though not totally sure.
I think there’s a connection between (A) a common misconception in thinking about future AI (that it’s not a huge deal if it’s “only” about as good as humans at most things), and (B) a common misconception in economics (the “Lump Of Labor Fallacy”).
So I started writing a blog post elaborating on that, but got stuck because my imaginary reader is not an economist and kept raising objections that amounted to saying “yeah but the Lump Of Labor Fallacy isn’t actually a fallacy, there really is a lump of labor” 🤦
Anyway, it’s bad pedagogy to explain a possibly-unintuitive thing by relating it to a different possibly-unintuitive thing. Oh well. (I might still try again to finish writing it at some point.)
It matters a lot what specifically it means to be “as good at humans at most things”. The vast majority of jobs include both legible, formal tasks and “be a good employee” requirements, much more nebulous and difficult to measure. Being just as good as the median employee at the formal job description, without the flexibility and trust from being a functioning member of society is NOT enough to replace most workers. It’ll replace some, of course.
That said, the fact that “lump of labor” IS a fallacy, and there’s not a fixed amount of work to be done, which more workers simply spread more thinly, means that it’s OK if it displaces many workers—there will be other things they can valuably do.
By that argument, human-level AI is effectively just immigration.
Yup, the context was talking about future AI which can e.g. have the idea that it will found a company, and then do every step to make that happen, and it can do that about as well as the best human (but not dramatically better than the best human).
I definitely sometimes talk to people who say “yes, I agree that that scenario will happen eventually, but it will not significantly change the world. AI would still be just another technology.” (As opposed to “…and then obviously 99.99…% of future companies will be founded by autonomous AIs, because if it becomes possible to mass-produce Jeff Bezos-es by the trillions, then that’s what will happen. And similarly in every other aspect of the economy.)
I think “the effective global labor pool increases by a factor of 1000, consisting of 99.9% AIs” is sometimes a useful scenario to bring up in conversation, but it’s also misleading in certain ways. My actual belief is that humans would rapidly have no ability to contribute to the economy in a post-AGI world, for a similar reason as a moody 7-year-old has essentially no ability to contribute to the economy today (in fact, people would pay good money to keep a moody 7-year-old out of their office or factory).
Dear diary...
[this is an experiment in just posting little progress reports as a self-motivation tool.]
1. I have a growing suspicion that I was wrong to lump the amygdala in with the midbrain. It may be learning by the same reward signal as the neocortex. Or maybe not. It’s confusing. Things I’m digesting: https://twitter.com/steve47285/status/1314553896057081857?s=19 (and references therein) and https://www.researchgate.net/publication/11523425_Parallels_between_cerebellum-_and_amygdala-dependent_conditioning
2. Speaking of mistakes, I’m also regretting some comments I made a while ago suggesting that the brain doesn’t do backpropagation. Maybe that’s true in a narrow sense, but Randall O’Reilly has convinced me that the brain definitely does error-driven learning sometimes (I already knew that), and moreover it may well be able to propagate errors through at least one or two layers of a hierarchy, with enough accuracy to converge. No that doesn’t mean that the brain is exactly the same as a PyTorch / Tensorflow Default-Settings Deep Neural Net.
3. My long work-in-progress post on autism continues to be stuck on the fact that there seem to be two theories of social impairment that are each plausible and totally different. In one theory, social interactions are complex and hard to follow / model for cognitive / predictive-model-building reasons. The evidence I like for that is the role of the cerebellum, which sounds awfully causally implicated in autism. Like, absence of a cerebellum can cause autism, if I’m remembering right. In the other theory, modeling social interactions in the neurotypical way (via empathy) is aversive. The evidence I like for that is people with autism self-reporting that eye contact is aversive, among other things. (This is part of “intense world theory”.) Of those two stories, I’m roughly 100% sold on the latter story is right. But the former story doesn’t seem obviously wrong, and I don’t like having two explanations for the same thing (although it’s not impossible, autism involves different symptoms in different people, and they could co-occur for biological reasons rather than computational reasons). I’m hoping that the stories actually come together somehow, and I’m just confused about what the cerebellum and amygdala do. So I’m reading and thinking about that.
4. New theory I’m playing with: the neocortex outputs predictions directly, in addition to motor commands. E.g. “my arm is going to be touched”. Then the midbrain knows not to flinch when someone touches the arm. That could explain why the visual cortex talks to the superior colliculus, which I always thought was weird. Jeff Hawkins says those connections are the neocortex sending out eye movement motor commands, but isn’t that controlled by the frontal eye fields? Oh, then Randall O’Reilly had this mysterious throwaway comment in a lecture that the frontal eye fields seem to be at the bottom of the visual hierarchy if you look at the connections. (He had a reference, I should read it.) I don’t know what the heck is going on.
Is it too pessimistic to assume that people mostly model other people in order to manipulate them better? I wonder how much of human mental inconsistency is a defense against modeling. Here on Less Wrong we complain that inconsistent behavior makes you vulnerable to Dutch-booking, but in real life, consistent behavior probably makes you even more vulnerable, because your enemies can easily predict what you do and plan accordingly.
I was just writing about my perspective here; see also Simulation Theory (the opposite of “Theory Theory”, believe it or not!). I mean, you could say that “making friends and being nice to them” is a form of manipulation, in some technical sense, blah blah evolutionary game theory blah blah, I guess. That seems like something Robin Hanson would say :-P I think it’s a bit too cynical if you mean “manipulation” in the everyday sense involving bad intent. Also, if you want to send out vibes of “Don’t mess with me or I will crush you!” to other people—and the ability to make credible threats is advantageous for game-theory reasons—that’s all about being predictable and consistent!
Again as I posted just now, I think the lion’s share of “modeling”, as I’m using the term, is something that happens unconsciously in a fraction of second, not effortful empathy or modeling.
Hmmm… If I’m trying to impress someone, I do indeed effortfully try to develop a model of what they’re impressed by, and then use that model when talking to them. And I tend to succeed! And it’s not all that hard! The most obvious strategy tends to work (i.e., go with what has impressed them in the past, or what they say would be impressive, or what impresses similar people). I don’t really see any aspect of human nature that is working to make it hard for me to impress someone, like by a person randomly changing what they find impressive. Do you? Are there better examples?
I have low confidence debating this, because it seems to me like many things could be explained in various ways. For example, I agree that certain predictability is needed to prevent people from messing with you. On the other hand, certain uncertainty is needed, too—if people know exactly when you would snap and start crushing them, they will go 5% below the line; but if the exact line depends on what you had for breakfast today, they will be more careful about getting too close to it.
Fair enough :-)
Branding: 3 reasons why I prefer “AGI safety” to “AI alignment”
When engineers, politicians, bureaucrats, military leaders, etc. hear the word “safety”, they suddenly perk up and start nodding and smiling. Safety engineering—making sure that systems robustly do what you want them to do—is something that people across society can relate to and appreciate. By contrast, when people hear the term “AI alignment” for the first time, they just don’t know what it means or how to contextualize it.
There are a lot of things that people are working on in this space that aren’t exactly “alignment”—things like boxing, task-limited AI, myopic AI, impact-limited AI, non-goal-directed AI, AGI strategy & forecasting, etc. It’s useful to have a term that includes all those things, and I think that term should be “AGI safety”. Then we can reserve “AI alignment” for specifically value alignment.
Actually, I’m not even sure that “value alignment” is exactly the right term for value alignment. The term “value alignment” is naturally read as something like “the AI’s values are aligned with human values”, which isn’t necessarily wrong, but is a bit vague and not necessarily interpreted correctly. For example, if love is a human value, should the AGI adopt that value and start falling in love? No, they should facilitate humans falling in love. When people talk about CIRL, CEV, etc. it seems to be less about “value alignment” and more about “value indirection” (in the C++ sense), i.e. utility functions that involve human goals and values, and which more specifically define those things by pointing at human brains and human behavior.
A friend in the AI space who visited Washington told me that military leaders distinctly do not like the term “safety”.
Why not?
Because they’re interested in weapons and making people distinctly not safe.
Right, for them “alignment” could mean their desired concept, “safe for everyone except our targets”.
I’m skeptical that anyone with that level of responsibility and acumen has that kind of juvenile destructive mindset. Can you think of other explanations?
There’s a difference between people talking about safety in the sense of 1. ‘how to handle a firearm safely’ and the sense of 2. ‘firearms are dangerous, let’s ban all guns’. These leaders may understand/be on board with 1, but disagree with 2.
I think if someone negatively reacts to ‘Safety’ thinking you mean ‘try to ban all guns’ instead of ‘teach good firearm safety’, you can rephrase as ‘Control’ in that context. I think Safety is more inclusive of various aspects of the problem than either ‘Control’ or ‘Alignment’, so I like it better as an encompassing term.
Interesting. I guess I was thinking specifically about DARPA which might or might not be representative, but see Safe Documents, Safe Genes, Safe Autonomy, Safety and security properties of software, etc. etc.
In the era of COVID, we should all be doing cardio exercise if possible, and not at a gym. Here’s what’s been working for me for the past many years. This is not well optimized for perfectly working out every muscle group etc., but it is very highly optimized for convenience, practicality, and sustainability, at least for me personally in my life situation.
(This post is mostly about home cardio exercise, but the last paragraph is about jogging.)
My home exercise routine consists of three simultaneous things: {exercise , YouTube video lectures , RockMyRun}. More on the exercise below. RockMyRun is a site/app that offers music mixes at fixed BPMs—the music helps my energy and the fixed BPM keeps me from gradually slowing down the pace. The video lectures make me motivated to work out, since there’s a lot of stuff I desperately want to learn. :)
Previously I’ve done instead {exercise, movies or TV}. (I still do on rare occasions.) This is motivating when combined with the rule of “no movies or TV unless exercising (or on social special occasions)”. I’ve pretty much followed that rule for years now.
My exercise routine consists of holding a dumbbell in each hands, then doing a sort of simultaneous reverse-lunge while lifting one of the dumbbells, alternating sides, kinda like this picture. Out of numerous things I’ve tried, this is the one that stuck, because it’s compatible with watching TV, compatible with very small spaces including low ceilings, has low risk of injury, doesn’t stomp or make noise, doesn’t require paying attention (once you get the hang of it), and seems to be a pretty good cardio workout (as judged by being able to break a sweat in a freezing cold room). I also do a few pushups now and then as a break, although that means missing what’s on the screen. I’ve gradually increased the dumbbell weight over the years from 3lbs (1.4kg) to now 15lbs (7kg).
I strongly believe that the top priority for an exercise routine is whatever helps you actually keep doing it perpetually. But beyond that, I’ve found some factors that give me a more intense workout: Coffee helps slightly (it’s a performance-enhancing drug! At least for some people); feeling cold at the beginning / being in a cold room seems to help; awesome action-packed movies or TV are a nice boost, but RockMyRun with boring video lectures is good enough. (My most intense workouts are watching music videos or concert recordings, but I get bored of those after a while.)
In other news, I also occasionally jog. RockMyRun is also a really good idea for that, not just for the obvious reasons (energy, pace), but because, when you set the BPM high, your running form magically and effortlessly improves. This completely solved my jogging knee pain problems, which I had struggled with for years. (I learned that tip from here, where he recommends 160BPM. I personally prefer 180BPM, because I like shorter and more intense runs for my time-crunched schedule.)
Quick comments on “The case against economic values in the brain” by Benjamin Hayden & Yael Niv :
(I really only skimmed the paper, these are just impressions off the top of my head.)
I agree that “eating this sandwich” doesn’t have a reward prediction per se, because there are lots of different ways to think about eating this sandwich, especially what aspects are salient, what associations are salient, what your hormones and mood are, etc. If neuroeconomics is premised on reward predictions being attached to events and objects rather than thoughts, then I don’t like neuroeconomics either, at least not as a mechanistic theory of psychology. [I don’t know anything about neuroeconomics, maybe that was never the idea anyway.]
But when they float the idea of throwing out rewards altogether, I’m not buying it. The main reason is: I’m trying to understand what the brain does algorithmically, and I feel like I’m making progress towards a coherent picture …and part of that picture is a 1-dimensional signal called reward. If you got rid of that, I just have no idea how to fill in that gap. Doesn’t mean it’s impossible, but I did try to think it through and failed.
There’s also a nice biological story going with the algorithm story: the basal ganglia has a dense web of connections across the frontal lobe, and can just memorize “this meaningless set of neurons firing is associated with that reward, and this meaningless set of neurons firing is associated with that reward, etc. etc.” Then it (1) inhibits all but the highest-reward-predicting activity, and (2) updates the reward predictions based on what happens (TD learning). (Again this and everything else is very sketchy and speculative.)
(DeepMind had a paper that says there’s a reward prediction probability distribution instead of a reward prediction value, which is fine, that’s still consistent with the rest of my story.)
I get how deep neural nets can search for a policy directly. I don’t think those methods are consistent with the other things I believe about the brain (or at least the neocortex). In particular I think the brain does seem to have a mechanism for choosing among different possible actions being considered in parallel, as opposed to a direct learned function from sensory input to output. The paper also mentions learning to compare without learning a value, but I don’t think that works because there are too many possible comparisons (the square of the number of possible thoughts).
Introducing AGI Safety in general, and my research in particular, to novices / skeptics, in 5 minutes, out loud
I might be interviewed on a podcast where I need to introduce AGI risk to a broad audience of people who mostly aren’t familiar with it and/or think it’s stupid. The audience is mostly neuroscientists plus some AI people. I wrote the following as a possible entry-point, if I get thrown some generic opening question like “Tell me about what you’re working on”:
I’m open to feedback; e.g., where might skeptical audience-members fall off the boat? (I am aware that it’s too long for one answer; I expect that I’ll end up saying various pieces of this in some order depending on the flow of the conversation. But still, gotta start somewhere.)
I would prepare a shortened version − 100 words max—that you could also give.
Yeah, I think I have a stopping point after the first three paragraphs (with minor changes).
Could you just say you’re working on safe design principles for brain-like artificial intelligence?