Therefore I posit that interpretability, rather than the likes of embedded agency or value learning, should be the focus of our research.
FWIW I’m very strongly in favor of interpretability research.
I’m slightly pessimistic that insight into embedded agency will be helpful for solving the alignment problem—at least based on my personal experience of (A) thinking about how human brains reason about themselves, (B) thinking about the alignment problem, (C) noting that the former activity has not been helping me with the latter activity. (Maybe there are other reasons to work on embedded agency, I dunno.)