From the perspective of Reframing Inner Alignment, both scenarios are ambiguous because it’s not clear whether:
- you really had a policy-scoring function that was well-defined as the expected value over the cognitive processes that humans use to evaluate pull requests under normal circumstances, but then imperfectly evaluated it by failing to sample outside normal circumstances, or
- your policy-scoring “function” was actually stochastic and “defined” by the physical process of humans interacting with the AI’s actions and clicking Merge buttons, and this policy-scoring function was incorrect, but adequately optimized for.
I tend to favor the latter interpretation—I’d say the policy-scoring function in both scenarios was ill-defined, and therefore both scenarios are better viewed as a Reward Specification (roughly outer alignment) problem. Only when you do have “programmatic design objectives, for which the appropriate counterfactuals are relatively clear, intuitive, and agreed upon” is the decomposition into Reward Specification and Adequate Policy Learning really useful.
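The two readings above can be sketched formally (notation is mine, purely illustrative, and not from the original post):

```latex
% Reading 1: a well-defined scoring function, imperfectly evaluated.
% S(\pi) is defined as an expectation over human evaluators h drawn from
% normal circumstances D_{normal}; the failure is only in sampling:
S(\pi) \;=\; \mathbb{E}_{h \sim D_{\mathrm{normal}}}\!\big[\,\mathrm{eval}_h(\pi)\,\big]

% Reading 2: the "function" just is the stochastic physical process.
% Each evaluation is a draw from the merge-clicking process P, with no
% underlying S to approximate; the specification itself is what is wrong:
\hat{S}(\pi) \;\sim\; P\big(\text{humans interact with $\pi$'s actions and click Merge}\big)
```

On reading 1 the error is in Adequate Policy Learning (the true objective was mis-evaluated); on reading 2 it is in Reward Specification (the objective itself was wrong but well optimized).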
Yup, this is the objective-based categorization, and as you’ve noted it’s ambiguous on the scenarios I mention because it depends on how you choose the “definition” of the design objective (aka policy-scoring function).