I’m not entirely sure where our upstream cruxes are. We definitely disagree about your conclusions. My best guess is the “core mistake” comment below, and the “faithful simulators” comment is another possibility.
Maybe another relevant thing that looks wrong to me: you will still get slop when you train an AI to look like it is updating its beliefs in an epistemically virtuous way. You'll get outputs that look very epistemically virtuous, but it takes time and expertise to rank them by actual epistemic virtue, just as with other kinds of slop.
I don’t see why you would have more trust in agents created this way.
(My parent comment was more of a semi-serious joke/tease than an argument; my other comments made actual arguments after I'd read more. Idk why this one was upvoted more, that's silly.)