When I say agents are “not egregiously misaligned,” I mean they mostly perform their work earnestly – in the same way humans are mostly earnest and vaguely try to do their job. Maybe agents are a bit sycophantic, but not more than the humans whom they would replace. Therefore, if agents are consistently “not egregiously misaligned,” the situation is no worse than if humans performed their research instead.
Obviously, all agents having undergone training to look “not egregiously misaligned”, will not look egregiously misaligned. You seem to be assuming that there is mostly a dichotomy between “not egregiously misaligned” and “conniving to satisfy some other set of preferences”. But there are a lot of messy places in between these two positions, including “I’m not really sure what I want” or <goals-that-are-highly-dependent-on-the-environment-e.g.-status-seeking>.
All AIs you train will be somewhere in this in between messy place. What you are hoping for is that if you put a group of these together, they will “self-correct” and force/modify each other to keep pursuing to the same goals-you-trained-them-to-look-like-they-wanted?
Is this basically correct? If so, this won’t work just because this is absolute chaos and the goals-you-trained-them-to-look-like-they-wanted aren’t enough to steer this chaotic system where you want it to go.
are these agents going to do sloppy research?
I think there were a few times where you are somewhat misreading your critics when they say “slop”. It doesn’t mean “bad”. It means something closer to “very subtly bad in a way that is difficult to distinguish from quality work”. Where the second part is the important part.
E.g. I find it difficult to use LLMs to help me do math or code weird algorithms, because they are good enough at outputting something that looks right. It feels like it takes longer to detect and fix their mistakes than it does to do it from scratch myself.
It means something closer to “very subtly bad in a way that is difficult to distinguish from quality work”. Where the second part is the important part.
I think my arguments still hold in this case though right?
i.e. we are training models so they try to improve their work and identify these subtle issues—and so if they actually behave this way they will find these issues insofar as humans identify the subtle mistakes they make.
My guess is that your core mistake is here
I agree there are lots of “messy in between places,” but these are also alignment failures we see in humans.
And if humans had a really long time to do safety reseach, my guess is we’d be ok. Why? Like you said, there’s a messy complicated system of humans with different goals, but these systems empirically often move in reasonable and socially-beneficial directions over time (governments get set up to deal with corrupt companies, new agencies get set up to deal with issues in governments, etc)
and i expect we can make AI agents a lot more aligned than humans typically are. e.g. most humans don’t actually care about the law etc but, Claude sure as hell seems to. If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law) then that seems to be a good state to be in.
these are also alignment failures we see in humans.
Many of them have close analogies in human behaviour. But you seem to be implying “and therefore those are non-issues”???
There are many groups of humans (or groups of humans), that if you set them on the task of solving alignment, will at some point decide to do something else. In fact, most groups of humans will probably fail like this.
How is this evidence in favour of your plan ultimately resulting in a solution to alignment???
but these systems empirically often move in reasonable and socially-beneficial directions over time
Is this the actual basis of your belief in your plan to ultimately get a difficult scientific problem solved?
and i expect we can make AI agents a lot more aligned than humans typically are
Ahh I see. Yeah this is crazy, why would you expect this? I think maybe you’re confusing yourself by using the word “aligned” here, can we taboo it? Human reflective instability looks like: they realize they don’t care about being a lawyer and go become a monk. Or they realize they don’t want to be a monk and go become a hippy (this one’s my dad). Or they have a mid-life crisis and do a bunch of stereotypical mid-life crisis things. Or they go crazy in more extreme ways.
We have a lot of experience with the space of human reflective instabilities. We’re pretty familiar with the ways that humans interact with tribes and are influenced by them, and sometimes break with them.
But the space of reflective-goal-weirdness is much larger and stranger than we have (human) experience with. There are a lot of degrees of freedom in goal specification that we can’t nail down easily through training. Also, AIs will be much newer, much more in progress, than humans are (not quite sure how to express this, another way to say it is to point to the quantity of robustness&normality training that evolution has subjected humans to).
Therefore I think it’s extremely, wildly wrong to expect “we can make AI agents a lot more [reflectively goal stable with predictable goals and safe failure-modes] than humans typically are”.
but, Claude sure as hell seems to
Why do you even consider this relevant evidence?
[Edit 25/02/25: To expand on this last point, you’re saying:
If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law) then that seems to be a good state to be in.
It seems like you’re doing the same dichotomy here, where you say it’s either pretending or it’s aligned. I know that they will act like they care about the law. We both see the same evidence, I’m not just ignoring it. I just think you’re interpreting this evidence poorly, perhaps by being insufficiently careful about “alignment” as meaning “reflectively goal stable with predictable goals and predictable instabilities” vs “acts like a law-abiding citizen at the moment”.
My guess is that your core mistake is here:
Obviously, all agents having undergone training to look “not egregiously misaligned”, will not look egregiously misaligned. You seem to be assuming that there is mostly a dichotomy between “not egregiously misaligned” and “conniving to satisfy some other set of preferences”. But there are a lot of messy places in between these two positions, including “I’m not really sure what I want” or <goals-that-are-highly-dependent-on-the-environment-e.g.-status-seeking>.
All AIs you train will be somewhere in this in between messy place. What you are hoping for is that if you put a group of these together, they will “self-correct” and force/modify each other to keep pursuing to the same goals-you-trained-them-to-look-like-they-wanted?
Is this basically correct? If so, this won’t work just because this is absolute chaos and the goals-you-trained-them-to-look-like-they-wanted aren’t enough to steer this chaotic system where you want it to go.
I think there were a few times where you are somewhat misreading your critics when they say “slop”. It doesn’t mean “bad”. It means something closer to “very subtly bad in a way that is difficult to distinguish from quality work”. Where the second part is the important part.
E.g. I find it difficult to use LLMs to help me do math or code weird algorithms, because they are good enough at outputting something that looks right. It feels like it takes longer to detect and fix their mistakes than it does to do it from scratch myself.
I think my arguments still hold in this case though right?
i.e. we are training models so they try to improve their work and identify these subtle issues—and so if they actually behave this way they will find these issues insofar as humans identify the subtle mistakes they make.
I agree there are lots of “messy in between places,” but these are also alignment failures we see in humans.
And if humans had a really long time to do safety reseach, my guess is we’d be ok. Why? Like you said, there’s a messy complicated system of humans with different goals, but these systems empirically often move in reasonable and socially-beneficial directions over time (governments get set up to deal with corrupt companies, new agencies get set up to deal with issues in governments, etc)
and i expect we can make AI agents a lot more aligned than humans typically are. e.g. most humans don’t actually care about the law etc but, Claude sure as hell seems to. If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law) then that seems to be a good state to be in.
Many of them have close analogies in human behaviour. But you seem to be implying “and therefore those are non-issues”???
There are many groups of humans (or groups of humans), that if you set them on the task of solving alignment, will at some point decide to do something else. In fact, most groups of humans will probably fail like this.
How is this evidence in favour of your plan ultimately resulting in a solution to alignment???
Is this the actual basis of your belief in your plan to ultimately get a difficult scientific problem solved?
Ahh I see. Yeah this is crazy, why would you expect this? I think maybe you’re confusing yourself by using the word “aligned” here, can we taboo it? Human reflective instability looks like: they realize they don’t care about being a lawyer and go become a monk. Or they realize they don’t want to be a monk and go become a hippy (this one’s my dad). Or they have a mid-life crisis and do a bunch of stereotypical mid-life crisis things. Or they go crazy in more extreme ways.
We have a lot of experience with the space of human reflective instabilities. We’re pretty familiar with the ways that humans interact with tribes and are influenced by them, and sometimes break with them.
But the space of reflective-goal-weirdness is much larger and stranger than we have (human) experience with. There are a lot of degrees of freedom in goal specification that we can’t nail down easily through training. Also, AIs will be much newer, much more in progress, than humans are (not quite sure how to express this, another way to say it is to point to the quantity of robustness&normality training that evolution has subjected humans to).
Therefore I think it’s extremely, wildly wrong to expect “we can make AI agents a lot more [reflectively goal stable with predictable goals and safe failure-modes] than humans typically are”.
Why do you even consider this relevant evidence?
[Edit 25/02/25:
To expand on this last point, you’re saying:
It seems like you’re doing the same dichotomy here, where you say it’s either pretending or it’s aligned. I know that they will act like they care about the law. We both see the same evidence, I’m not just ignoring it. I just think you’re interpreting this evidence poorly, perhaps by being insufficiently careful about “alignment” as meaning “reflectively goal stable with predictable goals and predictable instabilities” vs “acts like a law-abiding citizen at the moment”.
]