Lucius Bushnaq comments on Lucius Bushnaq’s Shortform

Lucius Bushnaq 27 Feb 2025 21:14 UTC
4 points
3
There is a reason that paragraph says
I claim one reason we don’t know how to do that is that ‘strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads
rather than
I claim the reason we don’t know how to do that is that ‘strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads
My claim here is that good mech interp helps you be less confused about outer alignment^[1], not that what I’ve sketched here suffices to solve outer alignment.
1. ^
  Outer alignment in the wider sense of ‘the problem of figuring out what target to point the AI at’.
- williawa 28 Feb 2025 22:06 UTC
  1 point
  0
  Well, my model is that the primary reason we’re unable to deal with deceptive alignment or goal misgeneralization is that we’re confused, but that the reason we don’t have a solution to Outer Alignment is that it’s just cursed and a hard problem.