Give AGI humanlike reasoning? (draft of a post)
Alignment plans can be split into two types:
Usual plans. The AI gains capabilities C. We figure out how to point C at T (the alignment target). There’s no deep connection between C and T; one thing is bolted onto the other.
HRLM plans. We give the AI a special C that has a deep connection to T.
HRLM stands for “humanlike reasoning/learning method” (or, more broadly, “special, human-endorsed reasoning/learning method”): the idea that there’s some special reasoning/learning method which is crucial for alignment or makes it fundamentally easier. There’s no hard line separating the two types of plans; it’s a matter of degree.
I believe HRLM is ~never discussed in full generality and ~never discussed from a theoretical POV. This is a small post where I want to highlight the idea and facilitate discussion, not make a strong case for it.
Examples of HRLM
(My description of other people’s work is not endorsed by them.)
Corrigibility. “Corrigible cognition” is a hypothetical, special type of self-reflection (C) which is extremely well-suited for learning human values/desires (T).
In “Don’t align agents to evaluations of plans”, Alex Turner argues that “there’s a correct way to reason (C) about goals (T), and consequentialist maximization of an ‘ideal’ function is not it” and that “‘direct cognition’ (C) about goals (T) is fundamentally better than ‘indirect cognition’”. Shard Theory, in general, proposes a very special method for learning and thinking about values.
Steve Byrnes’s post about the “follow-the-trying game” basically says “AI will become aligned or misaligned at the stage of generating thoughts, so we need to figure out the ‘correct’ way of generating thoughts (C), instead of overfocusing on judging which thoughts are aligned (T)”. Steve’s entire agenda is about HRLM.
Large Language Models. I’m not familiar with the debate, but I would guess it boils down to two possibilities: “understanding human language is a core enough capability (C) for an LLM, which makes it inherently more alignable to human goals (T)” and “LLMs ‘understand’ human language through some alien tricks which don’t make them inherently more alignable”. If the former is true, LLMs are an example of HRLM.
Policy Alignment (Abram Demski) is tangentially related, but it’s more in the camp of “usual plans”.
Notice how, despite multiple agendas falling under HRLM (Shard Theory, brain-like AGI, LLM-focused proposals), there’s almost no discussion of HRLM from a theoretical POV. What, abstractly speaking, is “humanlike reasoning”? What are its general principles? What are the general arguments for the safety guarantees it’s supposed to bring about? What are the True Names here? With Shard Theory, there’s ~zero explanation of how simpler shards aggregate into more complex shards and how this aggregation preserves goals. With brain-like AGI, there’s ~zero idea of how to prevent thought generation from bullshitting thought assessment. But those are the very core questions of the agendas. So they barely move us from square one.[1]
Possibilities
There are many possibilities. It could be that any HRLM handicaps the AI’s capabilities (a superintelligence is supposed to be unimaginably better at reasoning than humans, so why wouldn’t it use an alien reasoning method?). It could also be that HRLM is necessary for general intelligence. But maybe general intelligence is overrated...
Here’s what I personally believe right now:
1. What we value is inherently tied to how we think about it. In general, what we think about is often inherently tied to how we think about it.
2. General intelligence is based on a special principle. It has a relatively crisp “core”.
3. Some special computational principle is needed for solving subsystems alignment.
4. If 1-3 are true, they are most likely the same thing. Therefore, HRLM is needed for general intelligence and for outer and inner alignment (including subsystems alignment). Separately, I think general intelligence boosts capabilities below peak human level.
I consider 1-3 to be plausible enough postulates. I have no further arguments for 4.
My own ideas about HRLM (to be updated)
I have a couple of very unfinished ideas. I will try to write about them this month or the next.
I believe there could be a special type of cognition which helps to avoid specification gaming and goal misgeneralization. The AI should create simple models which describe the “costs/benefits” of actions (e.g. “actions” can be body movements, “cost” can be the amount and complexity of movements, “benefit” can be distance covered); this way, the AI can notice if certain actions produce anomalously high benefit (e.g. certain body movements might exploit a glitch in the physics simulation, letting the body cover kilometers per second).
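As a toy illustration (the actions, numbers, and threshold below are made up, not a worked-out proposal), the check could look something like this: compute a benefit-per-cost ratio for each known action and flag any action whose ratio is wildly out of line with the typical one, treating it as a possible exploit rather than a free win.

```python
# Toy sketch: flag actions whose benefit is anomalously high relative to
# their cost. All actions, numbers, and the threshold are hypothetical.
from statistics import median

# (action, cost, benefit): cost ~ amount/complexity of movement,
# benefit ~ distance covered.
actions = [
    ("walk",        1.0,   1.2),
    ("jog",         2.0,   2.3),
    ("run",         3.0,   3.1),
    ("sprint",      4.0,   4.4),
    ("glitch-jump", 1.5, 900.0),  # exploits a (hypothetical) physics bug
]

ratios = {name: benefit / cost for name, cost, benefit in actions}
typical = median(ratios.values())

for name, ratio in ratios.items():
    if ratio > 5 * typical:  # crude anomaly threshold
        print(f"suspiciously high benefit: {name} "
              f"(benefit/cost {ratio:.1f} vs typical {typical:.2f})")
```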
“By default, manipulating easier-to-optimize/comprehend variables is better than manipulating harder-to-optimize/comprehend variables” — this is the idea from one of my posts. The problem with it is that I only defined “optimization” and “comprehension” for world-models, not for modelling (= cognition) in general.
A formal algorithm can have parts and critically depend on those parts (for example, an algorithm for solving equations might have an absolutely necessary addition sub-algorithm). An informal algorithm can have parts without critically depending on those parts (for example, the algorithm answering “is this a picture of a dog?” might have a sub-algorithm answering “is this patch of pixels the focal point of the image / does it contrast enough with other patches / is it as detailed as the other patches?”—the sub-algorithm is not strictly necessary, but it lowers pareidolia by preventing the algorithm from overanalyzing random parts of the image). I think we can say something about how the latter type of algorithm works.
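To make the contrast concrete, here is a toy sketch under made-up assumptions (patch_dogness and patch_salience are hypothetical stand-ins for learned components, not real ones): the salience sub-algorithm is not strictly necessary for the classifier to work, but removing it makes the classifier more willing to read “dog” into random background patches.

```python
# Toy illustration: an informal "is this a dog?" algorithm whose salience
# sub-algorithm is helpful but not critical. All functions are stand-ins.

def patch_dogness(patch) -> float:
    """Hypothetical learned score: how dog-like this patch looks (0..1)."""
    return patch.get("dogness", 0.0)

def patch_salience(patch) -> float:
    """Hypothetical sub-algorithm: is this patch a focal point of the image
    (contrast with neighbours, level of detail, etc.)? Returns 0..1."""
    return patch.get("salience", 0.0)

def is_dog(patches, use_salience_gate=True) -> bool:
    scores = []
    for patch in patches:
        if use_salience_gate and patch_salience(patch) < 0.5:
            continue  # skip random background patches -> less pareidolia
        scores.append(patch_dogness(patch))
    return max(scores, default=0.0) > 0.8

# An image with a noisy background patch that vaguely resembles a dog:
image = [
    {"dogness": 0.85, "salience": 0.1},  # noise that happens to look dog-ish
    {"dogness": 0.30, "salience": 0.9},  # the actual focal object (not a dog)
]
print(is_dog(image))                           # False: the gate ignores the noise
print(is_dog(image, use_salience_gate=False))  # True: pareidolia
```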
IMO that’s downstream of inner alignment being extremely hard. It’s almost impossible to come up with an even mildly promising solution which explains, at least in some detail, how the hardest part of the problem might get solved. I’m not trying to throw shade. Also, I might just be ignorant about some ideas in those agendas.
I just want to note that humans aren’t aligned by default, so creating human-like reasoning and learning is not itself an alignment method. It’s just a different variant of providing capabilities, which you separately need to point at an alignment target.
It may or may not be easier to align than the alternatives. I personally don’t think this matters, because I strongly believe that the only type of AGI worth aligning is the type(s) most likely to be developed first. Hoping that industry and society are going to make major changes to AGI development based on which types researchers think are easier to align seems like a forlorn hope.
More on why it’s a mistake to assume human-like cognition in itself leads to alignment:
Sociopaths/psychopaths are a particularly vivid example of how humans are misaligned. And there are good reasons to think that they are not a special case in which empathy was accidentally left out or deliberately blocked, but that they are baseline human cognition without the mechanisms that create empathy. It’s tough to make this case for certain, but it’s a very bad idea to assume that humans are aligned by default and that all we’ve got to do is reproduce human-like cognitive mechanisms and maybe train the AI “in a good family” or similar.
That’s not to argue that human-like approaches to AGI are worse for alignment, just to say that they’re only better insofar as we have a somewhat better understanding of that type of cognition and of some mechanisms by which humans often wind up approximately aligned in common contexts.
My own research also uses loosely human-like reasoning and learning as a route to alignment, but that’s primarily because a) that’s my background expertise, so it’s my relative advantage, and b) I think LLMs are very loosely like some parts of the human brain/mind, and we’ll see continued expansion of LLM agents reasoning in more loosely human-like ways (that is, with chains of thought, specific memory lookups, metacognition to organize this, etc.).
So I’m working on aligning loosely human-like cognition not because I think it’s by default any easier than aligning any other form of AGI, but because that’s what seems most likely to become the first takeover-capable (or pivotal-act-capable) AGI.
Yes, it could be that “special, inherently more alignable cognition” doesn’t exist or can’t be discovered by mere mortal humans. It could be that humanlike reasoning isn’t inherently more alignable. Finally, it could be that we can’t afford to study it because the dominant paradigm is different. Also, I realize that glass box AI is a pipe dream.
Wrt sociopaths/psychopaths: I’m approaching it from a more theoretical standpoint. If I knew a method of building a psychopath AI (one caring about something selfish, e.g. gaining money, fame, social power, new knowledge, or even paperclips) and knew the core reasons why it works, I would consider it major progress, because it would solve many alignment subproblems, such as ontology identification and subsystems alignment.
I think MONA (which I worked on) counts as another example of this. Basically, you can make your agent only care about short-term feedback from a trusted model that imitates humans, so it isn’t motivated to pursue long-term plans that a human wouldn’t endorse.
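If it helps, here is a rough sketch of the shape of that setup (this is not the actual MONA implementation; the agent, environment, and trusted_approval_model interfaces are placeholders I made up): the agent is only ever updated toward the trusted model’s approval of the current step, never toward downstream returns.

```python
# Rough sketch of myopic optimization against trusted step-level approval.
# Everything here (env, agent, trusted_approval_model) is a placeholder,
# not the actual MONA implementation.

def train_myopically(agent, env, trusted_approval_model, episodes=100):
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = agent.act(state)
            next_state, done = env.step(action)
            # Reward = the trusted model's judgement of THIS step in isolation.
            # There is no discounted sum of future rewards, so long-term plans
            # the trusted model wouldn't endorse earn the agent nothing.
            reward = trusted_approval_model.score(state, action)
            agent.update(state, action, reward)  # single-step target only
            state = next_state
    return agent
```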
I’m also doing an HRLM-like thing for my MATS project. The very high-level idea is to start with a lightly fine-tuned language model, which we assume is aligned/humanlike.[1] Then we get the model to reason in a humanlike way about how to improve its own performance. This is supposed to be much more interpretable and aligned than RL, which modifies the underlying model in inscrutable ways to maximize reward by any means possible.
You might even say that “humanlike reasoning and learning” describes my personal research agenda. I’d be excited for people to talk about this sort of thing more!
Interestingly, both of the methods I mentioned above actually do something different from the dichotomy you described. Rather than “gaining capabilities, then aligning to a target” (usual plans) or “simultaneously gaining capabilities and aligning to a target in a deeply connected way” (HRLM plans), these methods could be described as “aligning to a target, then gaining capabilities.”[2]
Btw, the thing that HRLM stands for (Humanlike Reasoning/Learning Method) was a bit buried, and at first I thought you didn’t mention it at all. It’d be easier to see if you capitalized it and moved it forward a bit.
This is not a totally safe assumption. For example, LLMs can easily be jailbroken to act very unhumanlike, and even the standard “assistant character” knows a lot of information that a random human would not. As you pointed out, LLMs are pretty alien. But hopefully in practice, the assumption is mostly true in the ways that matter.
Technically, we’d be doing the HRLM thing first, during LLM pretraining and fine-tuning, since in the process of getting aligned to humans, the model must gain some baseline level of capabilities. But then we can bootstrap to higher capability levels using methods like MONA.
I’m approaching it from a “theoretical” perspective[1], so I want to know how “humanlike reasoning” could be defined (beyond “here’s some trusted model which somehow imitates human judgement”) or why human-approved capability gain preserves alignment (like, what’s the core reason, what makes human judgement good?). So my biggest worry is not that the assumption might be false, but that the assumption is barely understood on the theoretical level.
What are your research interests? Are you interested in defining what “explanation” means (or at least in defining some properties/principles of explanations)? Typical LLM stuff is highly empirical, but I’m kinda following the pipe dream of glass box AI.
I’m contrasting theoretical and empirical approaches.
Empirical—“this is likely to work, based on evidence”. Theoretical—“this should work, based on math / logic / philosophy”.
Empirical—“if we can operationalize X for making experiments, we don’t need to look for a deeper definition”. Theoretical—“we need to look for a deeper definition anyway”.
I wouldn’t describe my research as super theoretical, but it does involve me making arguments (although not formal proofs) about why I expect my plans to work even as training continues.
For example, it seems like in some non-formalized sense, a human is trivially aligned to themselves. The more similar your AI’s behavior is to a particular human’s behavior, the more aligned it is to that human. If you want to align the AI to a group of humans (e.g. all of humanity), you might want to start by emulating a good approximation of all humans and bootstrapping from there. I’m not working on this aspect of the problem directly—I’m just assuming that the LLM is pretty humanlike to begin with—but I wrote a post talking about similar ideas.