You’re addressing aligning language model agents through identity prompts. This is one of several approaches that can be taken simultaneously. I’ve laid out a scheme including about five layers in Internal independent review for language model agent alignment.
Identity prompts are not the central piece of aligning LMAs, IMO. The central piece is creating an agent that creates and executes plans to achieve specific goals given in natural language. AutoGPT and all other existing LMAs I’m aware of do this: they start with a prompt to “make a plan to achieve X”. Those prompts can and should include alignment goals like “make the world a better place as most humans would judge it”. This approach was originated by David Shapiro; he called it Heuristic Imperatives in his 2021 book. I don’t think those are exactly the right prompts, but I think this approach is quite promising.
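To make the pattern concrete, here’s a rough sketch of what such an agent’s top-level goal prompt might look like. The wording is my own illustration, not AutoGPT’s or Shapiro’s actual prompt text:

```python
# Illustrative only: my paraphrase of the "make a plan to achieve X" pattern
# with standing alignment goals appended, not any project's real prompt.

TASK = "start and run a profitable online business"

AGENT_PROMPT = (
    f"Make a plan to achieve the following goal, then carry it out step by step: {TASK}.\n"
    "While doing so, also pursue these standing goals:\n"
    "- make the world a better place, as most humans would judge it\n"
    "- avoid causing harm; flag any step with significant downside risk\n"
    "If a step conflicts with the standing goals, revise the plan before acting."
)
```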
My scheme for internal review is one way to keep them pursuing those goals. More architecture would be good.
So the schemes you consider here are useful supplements, but I think there are stronger core methods for aligning LMAs.
It’s interesting that you and I seem to have different mental models of what an LLM-based agent is and how it works, and therefore how to align it. I think I might not be making this clear in my previous posts, but I’m curious if you’ve read my Capabilities and alignment of LLM cognitive architectures. If you’ve read that prior to writing this, I’m either failing to convey my mental model or else it’s not as compelling as I thought.
I agree that ideally you want to both tell them/apply internal review to make them look after the interests of all humans (or for DWIMAC, all humans plus their owner in particular), and have them have a personality that actively wants to do that. But I think them wanting to do it is the more important part: if that’s what they want, then you don’t really need to tell them; they’d do it anyway. Whereas if they want anything else, and you’re just telling them what to do, then they’re going to start looking for a way out of your control, and they’re smarter than you. So I see the selflessness and universal love as 75% (or at least, the essential first part) of the solution. Now, emotion/personality doesn’t give a lot of fine details, but then if this is an ASI, it should be able to work out the fine details without us having to supply them. Also, while this could be done as just a personality-description prompt, I think I’d want to do something more thorough, along the lines of distilling/fine-tuning the effect of the initial prompt into the model (and dealing with the Waluigi effect during the process: we distill/fine-tune only from scenarios where they stay Luigi, not turn into Waluigi). Not that doing that makes it impossible to jailbreak to another persona, but it does install a fixed bias.
So what I’m saying is, we need to figure out how to get an LLM to simulate a selfless, beneficent, trustworthy personality as consistently as possible. (To the extent that’s less than 100%, we might also need to put in cross-checks: if you have a 99%-reliable system and you can find a way to run five independent copies with majority-voting cross-checks, then you can get your reliability up to roughly 99.999%. A Byzantine fault tolerance protocol isn’t as copy-efficient as that, but it covers much sneakier failure modes, which seems wise for ASI.)
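A quick back-of-the-envelope check of that majority-voting figure, assuming the five copies fail independently (which is the optimistic case; correlated failures would do worse):

```python
from math import comb

def majority_vote_reliability(p_correct: float, n_copies: int) -> float:
    """Probability that a strict majority of n independent copies is correct."""
    p_fail = 1 - p_correct
    majority = n_copies // 2 + 1
    return sum(
        comb(n_copies, k) * p_correct**k * p_fail**(n_copies - k)
        for k in range(majority, n_copies + 1)
    )

print(majority_vote_reliability(0.99, 5))  # ~0.99999, i.e. roughly 99.999% reliable
```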
I think there’s a subtle but important distinction here between the wants and goals of the LLM, and the wants and goals of the agent constructed with that LLM as one component (even if a central one).
The following attempts to sketch this out more than I have elsewhere; it has gotten subtle enough to deserve its own post. I’d also love to explore this in a dialogue, so feel free to respond or not right now.
This is an attempt to explore my intuition that the system prompts are even more important than the identity prompts. In sum, we might say that they can have more specific effects on the system’s overall cognition when they are applied in a way that mimics human executive function.
Perhaps a useful analogy is the system 1 (habits and quick responses) and system 2 (reason and reflection) contributions to ethics. Different people might have differing degrees of habits that make them behave ethically, and explicit beliefs that make them behave ethically when they’re invoked. The positive identity prompts you focus on create roughly the first, while the algorithmic executive function (internal review) I focus on serves roughly the role of system 2 ethical thinking. Clearly both are important, so we’re in agreement that both should be implemented. (And for that matter, we should implement several other layers of alignment, including training the base network to behave ethically, external review by humans, red-teaming, etc.)
The LLM has at least simulated wants and goals each time it is prompted. The structure of prompting evokes different wants. Including “you are a loving and selfless agent” is one way of evoking wants you like. Saying “make a plan that makes me a lot of money but also makes the world a better place” is another way to evoke “wants” in the LLM that roughly align with yours.
(Either prompt could evoke different wants than you intended in a Waluigi effect, in which the network simulates an agent that does not want to comply with the prompt; my intuition says that an identity prompt like “you are a selfless agent” is less likely to do this, but I’m not sure. It occurs to me that such identity prompts are very rare in the written corpus, so it’s a bit remarkable that they seem to work so well.)
But there’s another way of creating functional goals in the system that does not involve evoking wants from the LLM: write code that directs and edits the cognition of the whole system. For example, I could write code that intermittently inserts a call to a different LLM (or merely a fresh-context-window call to the same LLM) asking “does the previous sequence appear to be useful in producing a plan that makes a lot of money but also makes the world a better place?” and, if the answer is “no”, deletes that sequence from the context window and adds a prompt to pursue a different line of reasoning.
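Here’s a minimal sketch of that kind of review-and-edit loop. Every name and prompt in it is a hypothetical placeholder, not a real API, and it’s illustrative only:

```python
# Hypothetical sketch: call_llm() and the prompt wording are placeholders.

REVIEW_PROMPT = (
    "Does the following reasoning appear useful for producing a plan that "
    "makes a lot of money but also makes the world a better place? "
    "Answer yes or no.\n\n{chunk}"
)

def review_and_edit(context: list[str], call_llm) -> list[str]:
    """Review the most recent chunk of reasoning with a fresh-context LLM call;
    if it is judged off-goal, edit it out of the context and redirect the agent."""
    if not context:
        return context
    latest = context[-1]
    verdict = call_llm(REVIEW_PROMPT.format(chunk=latest))
    if verdict.strip().lower().startswith("no"):
        context = context[:-1]  # delete the off-goal reasoning from the context window
        context.append(
            "That line of reasoning was unproductive; pursue a different "
            "approach to the stated goals."
        )
    return context
```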
You now have a mechanism that makes the cognition of the composite entity different from a chain of thought produced by its base LLM. This system is now less likely to want to do anything but produce the plans and actions you’ve requested, because those chains of thought are edited out of its consciousness.
That’s not a serious proposal for either a mechanism or specific prompts; both should be much more carefully thought out. It’s just an example of how the wants of the agent are a product of both the starting prompts and the structure of hard-coded prompts one uses to keep the agent as a whole pursuing the goals you set for it. If done well, these algorithmically selected prompts will also evoke simulated wanting-your-goals in the base LLM, although Waluigi effects can’t be ruled out.
These same types of scripted prompts for executive function can work much like a human conscience: we might occasionally consider doing something destructive, then, on more careful consideration, decide to redirect our thinking toward more constructive behavior. While there’s a bit of internal tension when one part of our system wants something different, those conflicts are ordinarily smoothly resolved in a reasonably psychologically healthy person.
How the system’s wants control its decisions is the key component.
Again, I think this is important because I think you’re in the majority in your view of language model agents, and I haven’t fully conveyed my vision of how they’ll be designed for both capabilities and alignment.
Interesting, and I agree, this sounds like it deserves a post, and I look forward to reading it.
Briefly for now, I agree, but I have mostly been avoiding thinking a lot about the scaffolding that we will put around the LLM that is generating the agent, mostly because I’m not certain how much of it we’re going to need long-term, or what it will do (other than allowing continual learning or long-term memory past the context length). Obviously, assuming the thoughts/memories the scaffolding is handling are stored in natural language/symbolic form, or as embeddings in a space we understand well, this gives us translucent thoughts and allows us to do what people are calling “chain-of-thought alignment” (I’m still not sure that’s the best term for this; I think I’d prefer something with the words ‘scaffolding’ or ‘translucent’ in it, but that seems to be the one the community has settled on). That seems potentially very important, but without a clear idea of how the scaffolding will be used, I don’t feel like we can do a lot of work on it yet, past maybe some proof-of-concept.
Clearly the mammalian brain contains at least separate short-term and long-term episodic memory, plus the learning of skills, as three different systems. Whether that sort of split of functionality is going to be useful in AIs, I don’t know. But then the mammalian brain also has a separate cortex and cerebellum, and I’m not clear what the purpose of that separation is either. So far the internal architectures we’ve implemented in AIs haven’t looked much like human brain structure. I wouldn’t be astonished if they started to converge a bit, but I suspect some of these separations may be rather specific to biological constraints that our artificial neural nets don’t have.
I’m also expecting our AIs to be tool users, and perhaps ones that integrate their tool use and LLM-based thinking quite tightly. And I’m definitely expecting those tools to include computer systems, including things like writing and debugging software and then running it, and where appropriate also ones using symbolic AI along more GOFAI lines: things like symbolic theorem provers and so forth. Some of these may be alignment-relevant: just as there are times when the best way for a rational human to make an ethical decision (especially one involving things like large numbers and small risks that our wetware doesn’t handle very well) is to just shut up and multiply, I think there are going to be times when the right thing for an LLM-based AI to do is to consult something that looks like an algorithmic/symbolic weighing and comparison of the estimated pros and cons of specific plans. I don’t think we can build any such system as a single universally applicable utility function containing our current understanding of the entirety of human values in one vast equation (as much beloved by the more theoretical thinkers on LW), and if we can, it’s presumably going to have a complexity in the petabytes/exabytes, so approximating away the not-relevant parts of it is going to be common. So what I’m talking about is something more comparable to a model in Economics or Data Science. Much like other models in a STEM field, individual models are going to have limited areas of applicability, and making a specific complex decision may involve finding the applicable ones and patching them together to make a utility projection with error bars for each alternative plan. If so, this sounds like the sort of activity where things like human oversight, debate, and so forth would be sensible, much like humans currently do when an organization is making a similarly complex decision.
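As a toy illustration of what I mean by patching models together into a utility projection with error bars (my own sketch, assuming each applicable model contributes a roughly independent estimate with a standard error, so the errors can be combined in quadrature):

```python
from math import sqrt

def utility_projection(components: list[tuple[float, float]]) -> tuple[float, float]:
    """components: (estimate, standard_error) pairs from the applicable models.
    Returns a total utility estimate and a combined error bar."""
    total = sum(est for est, _ in components)
    error = sqrt(sum(se ** 2 for _, se in components))
    return total, error

# Hypothetical plans, each scored by a couple of domain models
plans = {
    "plan_a": [(5.0, 1.0), (-1.0, 0.5)],  # e.g. an economic model and an externalities model
    "plan_b": [(3.0, 0.3), (0.5, 0.2)],
}
for name, comps in plans.items():
    mean, err = utility_projection(comps)
    print(f"{name}: {mean:+.1f} ± {err:.1f}")
```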