I definitely agree that the AI agents at the start will need to be roughly aligned for the proposal above to work. What is it you think we disagree about?
I’m not entirely sure where our upstream cruxes are. We definitely disagree about your conclusions. My best guess is the “core mistake” comment below, and the “faithful simulators” comment is another possibility.
Maybe another relevant thing that looks wrong to me: you will still get slop when you train an AI to look like it is updating its beliefs in an epistemically virtuous way. You'll get outputs that look very epistemically virtuous, but ranking them by their actual level of epistemic virtue takes time and expertise, just like with other kinds of slop.
I don’t see why you would have more trust in agents created this way.
(My parent comment was more of a semi-serious joke/tease than an argument; my other comments made actual arguments after I'd read more. Idk why this one was upvoted more, that's silly.)