I think MONA (which I worked on) counts as another example of this. Basically, you can make your agent only care about short-term feedback from a trusted model that imitates humans, so it isn’t motivated to pursue long-term plans that a human wouldn’t endorse.
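To make the shape of that concrete, here’s a minimal toy sketch (the tiny policy, the “trusted approver” stub, and the toy task are all hypothetical stand-ins, not the actual MONA setup): the reward for each action is a trusted model’s approval of that single step, judged in foresight, and the update uses only that immediate reward, so there’s no gradient toward multi-step plans the approver wouldn’t endorse.

```python
import torch

# Toy illustration of the MONA-style training signal. Everything here
# (TinyPolicy, trusted_approval, the random "observations") is a
# hypothetical stand-in, not the setup from the MONA paper; the point is
# just the shape of the update.

class TinyPolicy(torch.nn.Module):
    def __init__(self, n_obs: int = 4, n_actions: int = 3):
        super().__init__()
        self.net = torch.nn.Linear(n_obs, n_actions)

    def sample(self, obs):
        dist = torch.distributions.Categorical(logits=self.net(obs))
        action = dist.sample()
        return action, dist.log_prob(action)

def trusted_approval(obs, action):
    # Stand-in for a trusted humanlike model scoring how much a human
    # would endorse taking `action` in state `obs`, judged in foresight.
    return torch.tensor(1.0 if action.item() == 0 else 0.0)

policy = TinyPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(100):
    obs = torch.randn(4)
    action, log_prob = policy.sample(obs)

    # Myopic reward: only this step's approval, never downstream outcomes,
    # so the policy gets no reinforcement for setting up multi-step plans
    # the approver wouldn't endorse.
    reward = trusted_approval(obs, action)
    loss = -log_prob * reward

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```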
I’m also doing an HRLM-like thing for my MATS project. The very high-level idea is to start with a lightly fine-tuned language model, which we assume is aligned/humanlike.[1] Then we get the model to reason in a humanlike way about how to improve its own performance. This is supposed to be much more interpretable and aligned than RL, which modifies the underlying model in inscrutable ways to maximize reward by any means possible.
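To gesture at what that could look like (a toy illustration only; `generate` is a hypothetical stand-in for a call to the lightly fine-tuned model, and this isn’t the actual design of the project): the model attempts a task, critiques its own attempt in natural language, and stores the lesson as readable text that shapes future attempts, so every “update” is something a human can inspect.

```python
# Toy illustration only: "learning" happens as readable natural-language
# notes rather than opaque weight updates. `generate` is a hypothetical
# stand-in for calling the lightly fine-tuned (assumed humanlike) model.

def generate(prompt: str) -> str:
    raise NotImplementedError("stand-in for a call to the base LLM")

def self_improvement_loop(task: str, n_rounds: int = 3) -> str:
    notes = ""  # a human-readable record of everything "learned" so far
    for _ in range(n_rounds):
        attempt = generate(f"{notes}\nTask: {task}\nAttempt:")
        critique = generate(
            f"Task: {task}\nAttempt: {attempt}\n"
            "As a careful human would, explain what went wrong and how to do better:"
        )
        # The update step is plain text, so a human (or trusted humanlike
        # model) can inspect and veto every change to future behavior.
        notes += f"\nLesson: {critique}"
    return generate(f"{notes}\nTask: {task}\nFinal answer:")
```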
You might even say that “humanlike reasoning and learning” describes my personal research agenda. I’d be excited for people to talk about this sort of thing more!
Interestingly, both of the methods I mentioned above actually do something different from the dichotomy you described. Rather than “gaining capabilities, then aligning to a target” (usual plans) or “simultaneously gaining capabilities and aligning to a target in a deeply connected way” (HRLM plans), these methods could be described as “aligning to a target, then gaining capabilities.”[2]
Btw, what HRLM stands for (Humanlike Reasoning/Learning Method) was a bit buried, and at first I thought you didn’t mention it at all. It’d be easier to spot if you capitalized it and moved it forward a bit.
This is not a totally safe assumption. For example, LLMs can easily be jailbroken into very unhumanlike behavior, and even the standard “assistant character” knows a lot of information that a random human would not. As you pointed out, LLMs are pretty alien. But hopefully, in practice, the assumption is mostly true in the ways that matter.
Technically, we’d be doing the HRLM thing first, during LLM pretraining and fine-tuning, since in the process of getting aligned to humans, the model must gain some baseline level of capabilities. But then we can bootstrap to higher capability levels using methods like MONA.
I’m approaching it from a “theoretical” perspective[1], so I want to know how “humanlike reasoning” could be defined (beyond “here’s some trusted model which somehow imitates human judgement”), and why human-approved capability gain preserves alignment (what’s the core reason? what makes human judgement good?). My biggest worry, then, is not that the assumption might be false, but that it is barely understood at the theoretical level.
What are your research interests? Are you interested in defining what “explanation” means (or at least in defining some properties/principles of explanations)? Typical LLM stuff is highly empirical, but I’m kinda following the pipe dream of glass box AI.
I’m contrasting theoretical and empirical approaches.
Empirical—“this is likely to work, based on evidence”. Theoretical—“this should work, based on math / logic / philosophy”.
Empirical—“if we can operationalize X for making experiments, we don’t need to look for a deeper definition”. Theoretical—“we need to look for a deeper definition anyway”.
I wouldn’t describe my research as super theoretical, but it does involve making arguments (though not formal proofs) about why I expect my plans to keep working as training continues.
For example, it seems like in some non-formalized sense, a human is trivially aligned to themselves. The more similar your AI’s behavior is to a particular human’s behavior, the more aligned it is to that human. If you want to align the AI to a group of humans (e.g. all of humanity), you might want to start by emulating a good approximation of all humans and bootstrapping from there. I’m not working on this aspect of the problem directly—I’m just assuming that the LLM is pretty humanlike to begin with—but I wrote a post talking about similar ideas.
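One very rough way to gesture at the “similarity implies alignment” intuition (this notation is purely illustrative, not a definition I’m committed to): bound misalignment to a human $H$ by some behavioral divergence between the AI’s policy $\pi$ and the human’s policy $\pi_H$ over a distribution of contexts $\mathcal{D}$,

$$\operatorname{misalign}(\pi, H) \;\le\; f\!\left(\mathbb{E}_{c \sim \mathcal{D}}\Big[D\big(\pi(\cdot \mid c)\,\|\,\pi_H(\cdot \mid c)\big)\Big]\right)$$

for some increasing $f$ with $f(0)=0$. Perfect imitation ($D=0$) recovers the “trivially aligned to themselves” case, and aligning to a group of humans would replace $\pi_H$ with some aggregate of the group’s policies.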