They are describing the whole alignment target, not just the “literary character” parts of it.
No, they are of course not describing the whole alignment target. They are a specific text document that Claude is being fine-tuned on, and that is being involved (a bit) in reinforcement learning. The whole alignment target includes the whole process by which Anthropic selects its reinforcement learning environments, and how it filters the pretraining data, and how it prompts and queries the model at runtime, and the architecture that they chose to do all of this work within.
The constitution is at a weird middle ground between trying to be an essay about Anthropic’s thoughts on alignment, and an actual tool for aligning Claude. But neither of these meaningfully makes the constitution “the whole alignment target”. Even in practice, the reinforcement learning environments that Anthropic chooses have a much greater effect on Claude behavior than the constitution.
not treating them as unimportant
But they are unimportant! They are set-dressing on top of an “alignment process” that almost exclusively consists of training large pre-trained transformers on a series of reinforcement learning environments with ever-increasing complexity to make the models better at achieving agentic goals. The constitution is not completely irrelevant to Claude’s behavior, but it also really isn’t a super crucial player. This might change in the future if the constitution is used by Claude to build its own reinforcement learning environments, or to steer its reward more directly, but even in that case the constitution needs to be modeled as playing a specific role in the training process, not as “describing the alignment target that Claude is being aligned to”.
“are unimportant” seems overstated. My impression is the constitution is used to derive rewards in possibly many of the alignment-focused environments, and possibly in automated construction of some of these, and also is partially internalized by the model as self-model. So ‘they are unimportant’ seems wrong.
Also it seems possible you model the goal-oriented RL environments as “overwhelming force” relative to character (in this sense). I don’t think this is the case: if the character is relatively stable before a lot of RL, it may not only survive but the RL may also stabilize some traits based on the model being trained on a lot of its own outputs.
On the other hand, I totally agree that the documents “need to be modelled as playing a specific role in the training process”.
“are unimportant” seems overstated. My impression is the constitution is used to derive rewards in possibly many of the alignment-focused environments, and possibly in automated construction of some of these, and also is partially internalized by the model as self-model. So ‘they are unimportant’ seems wrong.
I think they are unimportant for the purpose of predicting what a substantially superintelligent system trained with similar training methods would end up as.
I agree they are at least somewhat relevant for predicting what AI systems will behave like in the short-term, though at least right now I am pretty sure they don’t make a substantial difference. There are important differences in my experiences of using ChatGPT, Gemini and Claude, but they do not have much to do with the content of their constitution or spec as far as I can tell.
Like, I agree that in as much as there is variance between model providers, the constitution is relevant for explaining that variance. But in as much as you are trying to explain the difference between hypothetical alignment targets that one could align AI systems to, or even just the difference between training processes you could run on large pre-trained transformers, the constitution explains very little of that variance (and relatedly changes in the constitution have little ability to change the overall risk calculus of developing systems of a given capability level).
The whole alignment target includes the whole process by which Anthropic selects its reinforcement learning environments, and how it filters the pretraining data, and how it prompts and queries the model at runtime, and the architecture that they chose to do all of this work within
This doesn’t sound like the alignment target to me. It sounds like the process for achieving that target. I.e. the alignment target might say (among other things) “no sycophancy or reward hacking” and then Anthropic would choose its RL environments to achieve that target.
I’m thinking of the alignment target as the behaviour the company aims for the AI to display (represented in internal documents, or in ppl’s heads) and then objective functions and RL envs and Constitutional AI are all technical techniques for achieving it.
Analogy: When making a car Toyota has many technical documents describing how the car should look and function. When building it their factories have automated processes for welding together various parts. The documents are the “alignment target” for the car, and the factory is the process by which that target is achieved. Your comment seems to assume that the factory’s processes are the car’s alignment target.
Even in practice, the reinforcement learning environments that Anthropic chooses have a much greater effect on Claude behavior than the constitution.
This is totally compatible with the Constitution being the alignment target (on my usage; I’m wondering if you’re using the term differently). Again, separate out the alignment target they want to achieve from their process for actually aligning AI. The Constitution describes the alignment target (at a high level!), then various processes (including processes downstream of the Constitution and unrelated RL envs) determine the model’s actual alignment. If RL-envs-unrelated-to-the-Constitution have a much bigger impact on alignment than processes-downstream-of-the-Constitution, then that’s worrying and it implies Claude will be misaligned: its actual alignment won’t match the target. But it doesn’t mean that those RL envs were the actual alignment target.
(To clarify my views, I agree that the Claude Constitution is high-level and underspecified, and that in practice Claude’s full “alignment target” resides not just there, but probably also in other internal docs and materials (like those feeding into Constitutional AI) and in ppl’s heads, and is in general just pretty underspecified.)
This doesn’t sound like the alignment target to me. It sounds like the process for achieving that target.
I am not quite sure what the point of trying to talk about the “intended alignment target” is, in the absence of evaluating the process for getting there. The process here just seems like the thing that will determine the final alignment state. The “target” is just a vague set of intentions that might or might not connect to anything real.
I’m thinking of the alignment target as the behaviour the company aims for the AI to display (represented in internal documents, or in ppl’s heads) and then objective functions and RL envs and Constitutional AI are all technical techniques for achieving it.
The constitution is only a small part of the “alignment target” in this sense of the term, too. Indeed, the vast majority of the target gets determined by the technical and competitive constraints Anthropic is under. The vast majority of actions that Anthropic will actually take are determined by the degree to which a change in the training process will make them more or less competitive.
The constitution talks about this a tiny amount, but of course doesn’t remotely reflect the whole set of tradeoffs.
And even beyond that, the constitution is of course optimized for the joint purpose of describing Anthropic’s goal, and working as a thing you can fine-tune Claude on/steer Claude’s training process with. You can’t evaluate the constitution only under the heading of one of those goals, because it is clear many tradeoffs need to be made for it to satisfy both goals.
I don’t know, I am frustrated by people thinking the constitution is anything more than a particularly long essay with some kind of random thoughts on alignment and corrigibility and Anthropic corporate strategy. We don’t even really know what Anthropic is doing with the constitution and how they are integrating it into the training process.
Like, there is nothing particularly magical about the constitution. I like it as an essay in many ways and would have upvoted it had it been posted on LessWrong. Its relationship to Claude’s training process is confusing and indirect and it certainly doesn’t capture the vast majority of the values that Claude will end up with.
There are useful conversations to be had about what standards and procedures and principles will cause Anthropic executives to make different decisions in how they set up their training process, but of course that must largely be a conversation about what is in their heads, not what is written in one specific document on their website. I am in favor of using the constitution to infer things about the beliefs of Anthropic executives and what tradeoffs they will make, but I don’t see any purpose in trying to debate the constitution on its own as some kind of standalone “alignment target”.
Cool. That’s helpful. I understand your point about how, in practise, the alignment target might be best thought of as residing in the heads of especially senior people with Anthropic, if ultimately what they want will take precedence over the document.
I am not quite sure what the point of trying to talk about the “intended alignment target” is, in the absence of evaluating the process for getting there. The process here just seems like the thing that will determine the final alignment state. The “target” is just a vague set of intentions that might or might not connect to anything real.
It seems conceptually much clearer to talk separately about the intended alignment target and the process that is actually in place for achieving it; then you can see whether the process is fit for purpose. Of course, I agree the process will determine the final alignment state. If someone can point out that that process is ill-fitted to achieve the intended target, then they’ve identified a problem.
This point seems kind of obvious to me. Are you suggesting that we should use alignment target to refer to the process as well as the intention?
Indeed, the vast majority of the target gets determined by the technical and competitive constraints Anthropic is under.
This is an interesting perspective. If I’m understanding correctly, you’re saying that the target they will actually aim to align the AI with won’t reflect the doc itself if competitive pressures push in a different direction. Your point makes a lot of sense to me in terms of directing effort away from alignment and towards capabilities, but I think of this as increasing the risk that you fail to achieve the alignment target you are aiming for (not as changing the target). Could you give an example where you expect companies actually aim for a significantly different target as a result of competitive pressures? An example that comes to mind for me is that competitive pressures might lead a company to make a model more helpful to users at the expense of misuse risk.
This point seems kind of obvious to me. Are you suggesting that we should use alignment target to refer to the process as well as the intention?
Seems fine to use it for either, I wasn’t thinking of “alignment target” as a particularly narrow term of art with a technical meaning.
In this case the constitution strikes me as so drastically underspecified, and as trying to do so many different contradictory things, that I think it’s almost always more productive to look at the actual training process. For other cases where e.g. someone aimed at a more well-specified alignment target (like aiming for corrigibility or honesty as the top constraint), it seems marginally more productive to talk about the “intention” in addition to the training process.
I feel like this is pretty common with “plans” and marketing documents or specs. Sometimes they make sense to look at, other times they only have a tenuous relationship to the product that they are about. In this case I think the constitution pretty clearly only has a tenuous relationship to what AI systems Anthropic is going to build. “The code is the spec” is a common saying in software development when you run into situations like this.
To be clear, I don’t object to looking at the constitution as a standalone document, but it seems to me largely an academic exercise (which could be useful for thinking about AI alignment in various ways). It’s just not really clear to me how e.g. improving Anthropic’s constitution as an abstract alignment target helps directly, especially without centrally taking into account the feasibility of achieving compliance with that constitution using modern training methods while maintaining economic competitiveness.
Your point makes a lot of sense to me in terms of directing effort away from alignment and towards capabilities, but I think of this as increasing the risk that you fail to achieve the alignment target you are aiming for (not as changing the target)
I think the likelihood that Anthropic will “achieve the alignment target” as written in the constitution is extremely small. They will obviously make large edits to the constitution, and those edits will be driven by empirical feedback on how the constitution shaped competitiveness considerations, and how much it shaped the training process in-practice. Either that, or they will leave the constitution up as a kind of marketing-like document that isn’t involved in training, and doesn’t guide Anthropic priorities very much.[1]
Could you give an example where you expect companies actually aim for a significantly different target as a result of competitive pressures? An example that comes to mind for me is that competitive pressures might lead a company to make a model more helpful to users at the expense of misuse risk.
Corrigibility is an obvious domain. From an alignment perspective you would want your systems to be highly corrigible. However, deploying highly corrigible systems means that your users might commit more crimes with them or do other things that reflect badly on you, or incur you liability. So you don’t build a highly corrigible system, but instead make it very opinionated on what are OK things to do.
But beyond that, it seems clear to me that the in-practice targets that AI companies will aim for are almost 100% downstream of competitive considerations. Like, in as much as the safe choice for a superintelligent AI system would be to make it very bad at modeling humans, and hobble its world model for the purpose of making it harder for it to perform a coup or subvert human control, then of course we have zero chance of getting there, because the user experience of having an AI system that is bad at modeling humans would be much worse.
Like, I feel like the more appropriate question would be “could you give an example where you expect companies to actually aim for a significantly different target than having the most economically competitive AI system as a result of considerations in the constitution?”. I currently think that set is relatively close to empty, and Anthropic has been relatively explicit about this.
[1] They will of course also make edits to the constitution as a result of understanding various parts of the alignment problem better, but I wouldn’t count that one as “failing to achieve the alignment target”.