I definitely agree that the AI agents at the start will need to be roughly aligned for the proposal above to work. What is it you think we disagree about?
I’m not entirely sure where our upstream cruxes are. We definitely disagree about your conclusions. My best guess is the “core mistake” comment below, and the “faithful simulators” comment is another possibility.
Maybe another relevant thing that looks wrong to me: you will still get slop when you train an AI to look like it is updating its beliefs in an epistemically virtuous way. You'll get outputs that look very epistemically virtuous, but ranking them by their actual level of epistemic virtue takes time and expertise, just like with other kinds of slop.
I don’t see why you would have more trust in agents created this way.
(My parent comment was more of a semi-serious joke/tease than an argument; my other comments made actual arguments after I'd read more. Idk why this one was upvoted more, that's silly.)