Isn’t the “you get what you measure” problem a problem for capabilities progress too, not just alignment? I.e.: some tasks are sufficiently complex (hence hard to evaluate) and lacking in unambiguous ground-truth feedback that, when you turn the ML crank on them, you’re not necessarily selecting for actually doing the task well. You’re selecting for “appearing to do the task well,” and it’s an open question how well that correlates with actually doing the task well. (“Doing the task” here can include something much higher-level, like “being ‘generally intelligent’.”)
Which isn’t to say this problem wouldn’t bite especially hard for alignment; alignment seems harder to verify than lots of things. But it’s one reason I’m not fully sold that capabilities progress will speed up once you get human-level AI.
(I’m hardly an expert on this, so might well have missed existing discourse on & answers to this question.)