I claim one reason we don’t know how to do that is that ’strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads
rather than
I claim the reason we don’t know how to do that is that ’strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads
My claim here is that good mech interp helps you be less confused about outer alignment[1], not that what I’ve sketched here suffices to solve outer alignment.
Well, my model is that the primary reason we’re unable to deal with deceptive alignment or goal misgeneralization is because we’re confused, but that the reason we don’t have a solution to Outer Alignment is because its just cursed and a hard problem.
There is a reason that paragraph says
rather than
My claim here is that good mech interp helps you be less confused about outer alignment[1], not that what I’ve sketched here suffices to solve outer alignment.
Outer alignment in the wider sense of ‘the problem of figuring out what target to point the AI at’.
Well, my model is that the primary reason we’re unable to deal with deceptive alignment or goal misgeneralization is because we’re confused, but that the reason we don’t have a solution to Outer Alignment is because its just cursed and a hard problem.