Thinking about my focus on a theory of human values for AI alignment, the problem is quite hard when we ask for a way to precisely specify values. I might state the problem as something like finding “a theory of human values accurate and precise enough that its predictions don’t come apart under extreme optimization”. To borrow Isnasene’s notation, here X = “a theory of human values accurate and precise enough” and Y = “its predictions don’t come apart under extreme optimization”.
So what is an inverse problem with X’ and Y’? A Y’ might be something like “functions that behave as expected under extreme optimization”, where “behave as expected” means something like no Goodhart effects. We could be even more narrow and make Y’ = “functions that don’t exhibit Goodhart effects under extreme optimization”. Then X’ would be something like a generalized description of the classes of functions that satisfy Y’.
Doing the double inverse, we would try to find X from X’ by looking at what properties hold for this class of functions that don’t suffer from Goodharting, and use them to help us identify what would be needed to create an adequate theory of human values.
Looking for “functions that don’t exhibit Goodhart effects under extreme optimization” might be a promising area to look into. What does it mean for a function to behave as expected under extreme optimization? Can you give a toy example?
I’m actually not really sure. We have some vague notion that, for example, my preference for eating pizza shouldn’t result in attempts at unbounded pizza-eating maximization. Judged by my current values, I would probably be unhappy if a maximizing agent saw that I liked pizza best of all foods and then proceeded to feed me only pizza forever, even if it modified me so that I maximally enjoyed the pizza each time and never got bored of it.
Thinking more in terms of regressional Goodharting, maybe it means something like not deviating from the true target when we optimize for a measure of it. Consider the classic rat-extermination example of Goodharting: we already know that collecting rat tails as evidence of extermination is a measure that leads to weird effects. Does there exist a function measuring rat extermination that, when optimized for, produces the intended effect (the extermination of rats) without doing anything “weird” (e.g. generating unintended side effects, or maximizing rat reproduction so we can exterminate more of them), and instead just straightforwardly leads to the extinction of the rats and nothing else?
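As a toy illustration of the regressional flavor (entirely my own sketch; the Gaussian noise model and the numbers are assumptions, not anything from the example above): if the measured proxy is just the true target plus independent noise, then the candidate that wins under the proxy systematically overstates its own true value, and more so the harder we search.

```python
import random

random.seed(0)

def sample_candidates(n):
    """Each candidate has a hidden true value and a proxy = true value + noise."""
    candidates = []
    for _ in range(n):
        true_value = random.gauss(0, 1)
        proxy = true_value + random.gauss(0, 1)  # imperfect measurement of the target
        candidates.append((proxy, true_value))
    return candidates

# The harder we "optimize" (the more candidates we search over), the more the
# winning proxy score overstates that candidate's true value.
for n in (10, 1_000, 100_000):
    best_proxy, true_at_best = max(sample_candidates(n))
    print(f"searched {n:>7} candidates: best proxy = {best_proxy:5.2f}, "
          f"true value of that candidate = {true_at_best:5.2f}")
```

The divergence here comes from the selection process itself rather than from any flaw in the individual measurements, which is part of what makes the rat-tail question hard.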
Right, that’s the question. Sure, it is easy to state that “the metric must be a faithful representation of the target”, but it never is, is it? From the point of view of double inversion, optimizing for the target is a hard inverse problem because, as in your pizza example, the true “values” (pizza is a preference against the background of an otherwise balanced diet) are not easily observable. What would a double inverse look like in this case? Maybe something like trying various amounts of pizza and getting feedback on enjoyment? That would match the long division pattern. I’m not sure.
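To make the feedback idea slightly more concrete (again just my own sketch; the enjoyment curve and its peak at three slices are invented for illustration), the point would be to search for the amount with the best observed feedback rather than to assume that more of the preferred thing is always better:

```python
import random

random.seed(1)

def reported_enjoyment(slices):
    """Hypothetical hidden preference: enjoyment peaks at 3 slices, then drops off."""
    true_enjoyment = -(slices - 3) ** 2
    return true_enjoyment + random.gauss(0, 0.5)  # noisy self-report

# Try a range of amounts and keep whatever got the best feedback,
# instead of extrapolating "pizza is the favorite, so maximize pizza".
observations = {s: reported_enjoyment(s) for s in range(0, 9)}
best_amount = max(observations, key=observations.get)
print(f"amount with best observed feedback: {best_amount} slices")
```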
Alternatively, you could make a function that (clearly) breaks under optimization at a certain point, and then stop optimizing it once it breaks.
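A minimal sketch of what that could look like (my own illustration; the “validated range” and the break signal are invented for the example): give the proxy an explicit domain in which it is trusted, and halt the optimization loop as soon as a step would leave that domain, instead of trusting the proxy’s value there.

```python
def proxy_score(x, validated_range=(0.0, 10.0)):
    """Toy proxy: looks better the harder you push, but is only trusted on a known range."""
    lo, hi = validated_range
    if not (lo <= x <= hi):
        return None  # explicit "this measure has broken" signal
    return x  # inside the trusted region, more is (apparently) better

x, step = 0.0, 0.5
while True:
    candidate = x + step
    score = proxy_score(candidate)
    if score is None:
        break  # stop optimizing as soon as the proxy breaks
    x = candidate

print(f"stopped at x = {x}")  # halts at the edge of the validated range, not at "infinite pizza"
```

Of course this just relocates the difficulty into knowing where the trusted range ends, which is arguably the hard part in the first place.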