It means something closer to “very subtly bad in a way that is difficult to distinguish from quality work”, where the second part is the important part.
I think my arguments still hold in this case though right?
i.e. we are training models so they try to improve their work and identify these subtle issues—and so if they actually behave this way they will find these issues insofar as humans identify the subtle mistakes they make.
My guess is that your core mistake is here:
I agree there are lots of “messy in between places,” but these are also alignment failures we see in humans.
And if humans had a really long time to do safety research, my guess is we’d be ok. Why? Like you said, there’s a messy complicated system of humans with different goals, but these systems empirically often move in reasonable and socially-beneficial directions over time (governments get set up to deal with corrupt companies, new agencies get set up to deal with issues in governments, etc)
and i expect we can make AI agents a lot more aligned than humans typically are. e.g. most humans don’t actually care about the law etc but, Claude sure as hell seems to. If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law) then that seems to be a good state to be in.
these are also alignment failures we see in humans.
Many of them have close analogies in human behaviour. But you seem to be implying “and therefore those are non-issues”???
There are many humans (or groups of humans) that, if you set them on the task of solving alignment, will at some point decide to do something else. In fact, most groups of humans will probably fail like this.
How is this evidence in favour of your plan ultimately resulting in a solution to alignment???
but these systems empirically often move in reasonable and socially-beneficial directions over time
Is this the actual basis of your belief in your plan to ultimately get a difficult scientific problem solved?
and i expect we can make AI agents a lot more aligned than humans typically are
Ahh I see. Yeah, this is crazy; why would you expect this? I think maybe you’re confusing yourself by using the word “aligned” here; can we taboo it? Human reflective instability looks like: they realize they don’t care about being a lawyer and go become a monk. Or they realize they don’t want to be a monk and go become a hippy (this one’s my dad). Or they have a mid-life crisis and do a bunch of stereotypical mid-life crisis things. Or they go crazy in more extreme ways.
We have a lot of experience with the space of human reflective instabilities. We’re pretty familiar with the ways that humans interact with tribes and are influenced by them, and sometimes break with them.
But the space of reflective-goal-weirdness is much larger and stranger than we have (human) experience with. There are a lot of degrees of freedom in goal specification that we can’t nail down easily through training. Also, AIs will be much newer, much more a work in progress, than humans are (I’m not quite sure how to express this; another way to say it is to point to the quantity of robustness-and-normality training that evolution has subjected humans to).
Therefore I think it’s extremely, wildly wrong to expect “we can make AI agents a lot more [reflectively goal stable with predictable goals and safe failure-modes] than humans typically are”.
but, Claude sure as hell seems to
Why do you even consider this relevant evidence?
[Edit 25/02/25: To expand on this last point, you’re saying:
If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law) then that seems to be a good state to be in.
It seems like you’re drawing the same dichotomy here, where you say it’s either pretending or it’s aligned. I know that they will act like they care about the law. We both see the same evidence; I’m not just ignoring it. I just think you’re interpreting this evidence poorly, perhaps by being insufficiently careful about the difference between “alignment” meaning “reflectively goal stable with predictable goals and predictable instabilities” and “acts like a law-abiding citizen at the moment”.
]