Yeah, this makes sense given that you think of outer misalignment as failures of [reward function + training distribution] and inner misalignment as failures of optimization.
I’d be pretty surprised though if more than one person in my survey had that view.