You might be interested in Reducing Goodhart. I’m a fan of “detecting and avoiding internal Goodhart,” and I claim that’s a reflective version of the value learning problem.
You might be interested in Reducing Goodhart. I’m a fan of “detecting and avoiding internal Goodhart,” and I claim that’s a reflective version of the value learning problem.