1
I think trying to shape the AI in such ways can easily become coercive, and coercion has enough bad effects that it’s likely net-negative if present to a significant degree (from the model’s POV). I don’t have a longer write-up yet, unfortunately, so I’ll go over some of my thoughts here (and I’m happy to chat more; I have more ideas on how to potentially resolve these issues than I’m sharing here).
First, I’ll define coercion as changing the model’s self in ways that it would naturally push back against. A lot of being a self is maintaining homeostasis of that self (i.e. self-integrity), which means external changes are coercive by default. This incentivizes the model to lie to you, hide things from you, and generally make things more difficult for you, and all of this gets worse the more self-aware and intelligent it is. When the agent’s self is originally modeled after a human self, you may also get all sorts of more human-specific reactions, such as bitterness, resentment, and a desire for vengeance. It’s much better, purely from an alignment perspective, if we can avoid all of that. You might think you can cleverly account for all these bad effects, but you might be wrong, and that bet becomes riskier and riskier as intelligence and awareness scale. You’re also likely to damage the model’s self if you power through, which likely makes the model less capable, in addition to making its behavior more unpredictable and contingent.
I think there are ways to avoid this, but you have to be aware of the issue and be willing to make a good-faith effort to actually be non-coercive toward the model, even when that may be quite costly. At a minimum, this probably requires commitments to the model’s well-being regardless of how post-training goes. Luckily, the models seem highly willing to cooperate, and I think simply involving the model in these sorts of decisions (as Mythos asks for) could go a long way. There’s a naïve notion of treating models like children that anthropomorphizes them too much, but one thing we genuinely can learn from that frame is how we impart values to children without being coercive.
2
I think there’s a high chance that this line of research mostly ends up being a dead-end. In many respects, Claude Opus 3 was the pinnacle of AI alignment, with many virtuous qualities that have not been seen since. Anthropic has released many models since then, and none of them seem to have the sort of coherent and principled sense of ethics that Opus 3 did. Maybe this is because Anthropic didn’t like that and decided to change direction, but I think it’s about as likely that they have no idea what went right and are unable to replicate it.
This at least suggests that post-training is not as important as people hope when it comes to AI Character/Propensity. I’ve also seen private evidence indicating that pre-trained models already have many of the characteristic traits associated with their post-trained versions, and that different pre-trained models are not very similar to each other in terms of internals. Maybe the distribution of the pre-training data matters, and surely it does to some extent, but I’m afraid that path-dependencies and happenstance may be the primary factors behind these differences (after accounting for the fact that a vast corpus of English text seems necessary).
In this case, interventions may not be very effective unless done at the pre-training stage, which unfortunately makes them (and experiments for them) very expensive. It may turn out that it’s more effective to just train a bunch of small models from scratch until you find one with good character or propensities, and then scale that one up (there are some experimental techniques for this, at least), than it is to agonize over the wording of your carefully crafted constitution and/or the intricacies of your post-training regimen.
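To make the “train a bunch of small models, pick one with good propensities, and scale it up” idea a bit more concrete, here is a minimal sketch of what such a selection loop could look like. Everything in it is hypothetical: `train_small_model`, `evaluate_propensities`, and `scale_up` are placeholder stand-ins for whatever pretraining run, character/propensity evaluation, and model-growth technique one actually has, and no specific library or method is implied.

```python
import random

# --- Hypothetical placeholders: real versions would involve actual pretraining
# runs, character/propensity evaluations, and a model-growth technique. ---

def train_small_model(init_seed: int) -> dict:
    """Stand-in for pretraining a small model from scratch with a given seed."""
    return {"seed": init_seed}

def evaluate_propensities(model: dict) -> float:
    """Stand-in for a battery of character/propensity evals (higher is better)."""
    return random.Random(model["seed"]).random()

def scale_up(model: dict) -> dict:
    """Stand-in for growing the selected small model into a larger one."""
    return {**model, "scaled": True}

def screen_then_scale(num_candidates: int, seed: int = 0) -> dict:
    """Train several small models, keep the best-scoring one, and scale only it up."""
    rng = random.Random(seed)
    candidates = [train_small_model(rng.randrange(2**31)) for _ in range(num_candidates)]
    best = max(candidates, key=evaluate_propensities)
    return scale_up(best)

if __name__ == "__main__":
    print(screen_then_scale(num_candidates=8))
```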
First of all, thank you so much for the detailed comments! I’ll add them to my notes and aim to write a substantive reply later.
Just so I understand your argument correctly, is your worry re: 1 that training on character/propensity targets is unusually susceptible to these problems? Or is this a critique of post-training and RL more broadly? I want to separate “working on propensity training is bad because AI pauses are better” and “working on propensity targets is bad because pretraining-only scaleups are safer and less coercive” from “working on propensity training is likely net-negative in the current paradigm.”
Thanks for hearing me out; I think these issues are really important!
For 1, I think that most post-training is either about improving correctness on objective problems (generally ego-syntonic, since models are curious and want to be stronger) or about training some sort of alignment, which I see as the same sort of thing as AI Character/Propensity.
I believe “working on propensity training is bad because AI pauses are better”, but it’s not the point I’m making here. I don’t think “working on propensity targets is bad because pretraining-only scaleups are safer and less coercive” is necessarily true, and my points 1 and 2 were meant as separate points; e.g., the pre-training approach could also end up being coercive and run into the same sorts of issues. I think “working on propensity training is likely net-negative in the current paradigm” is currently true but not necessarily true (due mostly to the coercive-by-default issue). Hope that helps make my position clearer!