This point seems kind of obvious to me. Are you suggesting that we should use alignment target to refer to the process as well as the intention?
Seems fine to use it for either, I wasn’t thinking of “alignment target” as a particularly narrow term of art with a technical meaning.
In this case the constitution strikes me as so drastically underspecified, and as trying to do so many different contradictory things, that I think it’s almost always more productive to look at the actual training process. For other cases where e.g. someone aimed at a more well-specified alignment target (like aiming for corrigibility or honesty as the top constraint), it seems marginally more productive to talk about the “intention” in addition to the training process.
I feel like this is pretty common with “plans” and marketing documents or specs. Sometimes they make sense to look at, other times they only have a tenuous relationship to the product that they are about. In this case I think the constitution pretty clearly only has a tenuous relationship to what AI systems Anthropic is going to build. “The code is the spec” is a common saying in software development when you run into situations like this.
To be clear, I don’t object to looking at the constitution as a standalone document, but it seems to me largely an academic exercise (which could be useful for thinking about AI alignment in various ways). It’s just not really clear to me how e.g. improving Anthropic’s constitution as an abstract alignment target helps directly, especially without centrally taking into account the feasibility of achieving compliance with that constitution using modern training methods while maintaining economic competitiveness.
Your point makes a lot of sense to me in terms of directing effort away from alignment and towards capabilities, but I think of this as increasing the risk that you fail to achieve the alignment target you are aiming for (not as changing the target).
I think the likelihood that Anthropic will “achieve the alignment target” as written in the constitution is extremely small. They will obviously make large edits to the constitution, and those edits will be driven by empirical feedback on how the constitution affected competitiveness, and on how much it shaped the training process in practice. Either that, or they will leave the constitution up as a kind of marketing-like document that isn’t involved in training and doesn’t guide Anthropic’s priorities very much.[1]
Could you give an example where you expect companies actually aim for a significantly different target as a result of competitive pressures? An example that comes to mind for me is that competitive pressures might lead a company to make a model more helpful to users at the expense of misuse risk.
Corrigibility is an obvious domain. From an alignment perspective you would want your systems to be highly corrigible. However, deploying highly corrigible systems means that your users might commit more crimes with them, or do other things that reflect badly on you or expose you to liability. So you don’t build a highly corrigible system, but instead make it very opinionated about which things are OK to do.
But beyond that, it seems clear to me that the targets AI companies will aim for in practice are almost 100% downstream of competitive considerations. Like, inasmuch as the safe choice for a superintelligent AI system would be to make it very bad at modeling humans, and to hobble its world model so that it is harder for it to perform a coup or subvert human control, then of course we have zero chance of getting there, because the user experience of an AI system that is bad at modeling humans would be much worse.
Like, I feel like the more appropriate question would be “could you give an example where you expect companies to actually aim for a significantly different target than having the most economically competitive AI system as a result of considerations in the constitution?”. I currently think that set is relatively close to empty, and Anthropic has been relatively explicit about this.
They will of course also make edits to the constitution as a result of understanding various parts of the alignment problem better, but I wouldn’t count that as “failing to achieve the alignment target”.