From the perspective of Reframing Inner Alignment, both scenarios are ambiguous because it’s not clear whether:
- you really had a policy-scoring function that was well-defined as the expected value over the cognitive processes that humans use to evaluate pull requests under normal circumstances, but then imperfectly evaluated it by failing to sample outside normal circumstances, or
- your policy-scoring “function” was actually stochastic and “defined” by the physical process of humans interacting with the AI’s actions and clicking Merge buttons, and this policy-scoring function was incorrect, but adequately optimized for.
I tend to favor the latter interpretation—I’d say the policy-scoring function in both scenarios was ill-defined, and therefore both scenarios are better viewed as a Reward Specification (roughly outer alignment) problem. Only when you do have “programmatic design objectives, for which the appropriate counterfactuals are relatively clear, intuitive, and agreed upon” is the decomposition into Reward Specification and Adequate Policy Learning really useful.
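The two readings above can be sketched formally (notation is mine, purely illustrative, and not from the original post):

```latex
% Reading 1: a well-defined scoring function, imperfectly evaluated.
% S(\pi) is defined as an expectation over human evaluators h drawn from
% normal circumstances D_{normal}; the failure is only in sampling:
S(\pi) \;=\; \mathbb{E}_{h \sim D_{\mathrm{normal}}}\!\big[\,\mathrm{eval}_h(\pi)\,\big]

% Reading 2: the "function" just is the stochastic physical process.
% Each evaluation is a draw from the merge-clicking process P, with no
% underlying S to approximate; the specification itself is what is wrong:
\hat{S}(\pi) \;\sim\; P\big(\text{humans interact with $\pi$'s actions and click Merge}\big)
```

On reading 1 the error is in Adequate Policy Learning (the true objective was mis-evaluated); on reading 2 it is in Reward Specification (the objective itself was wrong but well optimized).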
Yup, this is the objective-based categorization, and as you’ve noted it’s ambiguous on the scenarios I mention because it depends on how you choose the “definition” of the design objective (aka policy-scoring function).