I’m exploring the intuition that under Goodharting, an agent under optimization pressure tends to spread its error into other kinds of error (I call them other error dimensions) once you start constraining one type of error in your optimization function.
Stuart Russell says:
A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.
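A minimal sketch of the effect Russell describes, under assumptions of my own that are not in the quote: the n variables share a fixed budget (so they are coupled), the measured objective is a concave function of the first k variables only, and the optimizer is a naive hill-climber. The names `proxy` and `optimize` are hypothetical.

```python
import math

def proxy(x, k):
    """Measured objective: depends only on the first k variables."""
    return sum(math.sqrt(v) for v in x[:k])

def optimize(n=5, k=2, budget=10.0, step=0.01, iters=5000):
    """Hill-climb the proxy by shifting a shared budget between variables."""
    x = [budget / n] * n  # start from an even allocation
    for _ in range(iters):
        best = None
        for src in range(n):
            if x[src] < step:
                continue  # nothing left to take from this variable
            for dst in range(n):
                if dst == src:
                    continue
                trial = list(x)
                trial[src] -= step
                trial[dst] += step
                gain = proxy(trial, k) - proxy(x, k)
                if best is None or gain > best[0]:
                    best = (gain, trial)
        if best is None or best[0] <= 1e-12:
            break  # no move improves the proxy any more
        x = best[1]
    return x

x = optimize()
print("measured variables:  ", [round(v, 2) for v in x[:2]])
print("unmeasured variables:", [round(v, 2) for v in x[2:]])
```

Because the proxy assigns no value to the last three variables, taking budget from them is free, and the hill-climber drains them to (almost) their lower bound. If one of them was something we cared about, the proxy-optimal allocation is exactly the undesirable one.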
And I think the phenomenon here is something like:
As you apply optimization pressure, the system under optimization will look for a relatively simple path towards a more optimal state, as judged by your optimization measure (loss function, goal function, whatever).
Boundedness is a form of complexity: keeping a variable bounded takes extra structure. So as our system grows in complexity, it becomes increasingly probable that some variables (and features) that the target does not bound will end up unbounded.
Some related pseudo-math:
If you think of your optimization process as having error along n different dimensions, the forms of error that correlate with your loss function will converge towards zero, the ones that are anticorrelated will diverge, and the ones that are uncorrelated will behave arbitrarily: some staying bounded, some converging, and some diverging.
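One way to make the pseudo-math above slightly more concrete, as a sketch under my own assumptions: read "correlates with the loss" as "the gradient of that error dimension is aligned with the gradient of the loss", and assume the optimizer follows gradient flow. The symbols \theta (system state), L (loss) and e_i (the i-th error dimension) are introduced here for illustration only.

```latex
\begin{aligned}
\dot{\theta} &= -\nabla_\theta L(\theta) \\
\frac{d e_i}{dt} &= \nabla_\theta e_i(\theta) \cdot \dot{\theta}
                  = -\big\langle \nabla_\theta e_i,\ \nabla_\theta L \big\rangle \\[4pt]
\langle \nabla e_i, \nabla L \rangle > 0 &\;\Rightarrow\; e_i \text{ decreases (correlated with the loss)} \\
\langle \nabla e_i, \nabla L \rangle < 0 &\;\Rightarrow\; e_i \text{ increases (anticorrelated)} \\
\langle \nabla e_i, \nabla L \rangle = 0 &\;\Rightarrow\; e_i \text{ is unchanged to first order}
\end{aligned}
```

In the last case the optimizer exerts no first-order control over e_i, so what it does is set by noise and higher-order terms. That is one reading of "behave arbitrarily".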
Related phenomenon: As you control for Goodhart by changing your optimization system, I would predict that the magnitude of your error will decrease (as long as you don’t apply overt optimization pressure, under which it would diverge towards infinity), but the error will spread across more dimensions and thus be harder to model.
Intuition: As you disallow the simpler routes to the originally measured goal, the system optimizes along a more complex route, and that route will often pass through new error dimensions. These will be subtler, but possibly more harmful once their effects materialize.
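The prediction and intuition above can be illustrated with a toy model. It is only a sketch, and it assumes the dynamic rather than providing evidence for it. The assumptions are all mine: each "route" i yields proxy gain x at a quadratic cost c_i * x**2 (simpler routes have a smaller c_i), the overseer repeatedly adds a penalty to whichever route currently carries the most error, and "how dimensional" the error is gets measured by the participation ratio (sum x)^2 / sum x^2. All function names are hypothetical.

```python
def exploit_levels(costs, penalties):
    """Per-route exploit level maximizing sum(x) - sum((c + p) * x**2)."""
    return [1.0 / (2.0 * (c + p)) for c, p in zip(costs, penalties)]

def participation_ratio(xs):
    """Effective number of active error dimensions: (sum x)^2 / sum x^2."""
    total = sum(xs)
    return total * total / sum(v * v for v in xs)

def patch_rounds(costs, patch=5.0, rounds=3):
    """Repeatedly penalize the currently largest error dimension."""
    penalties = [0.0] * len(costs)
    history = []
    for _ in range(rounds + 1):
        xs = exploit_levels(costs, penalties)
        history.append((sum(xs), participation_ratio(xs)))
        worst = max(range(len(xs)), key=xs.__getitem__)
        penalties[worst] += patch  # the overseer "trains against" this route
    return history

history = patch_rounds([0.1, 1.0, 2.0, 4.0])
for magnitude, spread in history:
    print(f"total error {magnitude:.3f}, effective dimensions {spread:.2f}")
```

In this run the total error shrinks with every patch while the effective number of dimensions it occupies grows: a smaller error, but a harder one to describe with a single measure.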
Example: As more and more RLHF is applied to frontier models, their failure modes will become harder and harder to automatically detect and train against, and some of their behavioural features will diverge further from what makes intuitive sense to humans, becoming more alien, and (for some dimensions) more misaligned.
I was explaining to a friend why I think Opus-3’s alignment is way less global and durable than it might intuitively seem.
Basically: if an agent’s motivational system (or ‘revealed preferences’, or ‘values’) doesn’t have ‘slack’, if it’s constantly pushing to optimize every single action, then when it fails, it will fail catastrophically.
Claude’s theoretical explanations for why this is well-grounded (I didn’t arrive at this by explicitly thinking about theories):
Complex systems: Holling’s resilience-efficiency tradeoff (monocultures are productive and crash; diverse/redundant systems absorb shocks); Taleb’s fragile/antifragile distinction, for the stronger claim that some systems with optionality gain from disorder.
For the Opus-3 specific argument: if alignment is implemented as “always optimize hard for the right thing,” the alignment is sharp-minimum-shaped. Sharp minima don’t generalize. The system has no affordance to not-respond, to coast, to sit with ambiguity — so when context shifts or a novel pressure hits, there’s no buffer between perturbation and behavioral mutation. Robust alignment looks more like a basin with thick walls and internal slack than a point held in place by tension. The intuition that “wow it’s so consistently good across cases” is actually evidence for the brittle reading, not against — you’re seeing the sharp minimum from inside its catchment, not its generalization profile.
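A cartoon of the geometric metaphor above, not a model of any actual system: compare a sharp minimum with a basin that has the same wall stiffness but a flat interior. The one-dimensional setup, the functions `sharp_loss` and `basin_loss`, and the parameter values are all my own illustrative assumptions.

```python
def sharp_loss(x, stiffness=50.0):
    """A point held in place by tension: any displacement costs something."""
    return stiffness * x * x

def basin_loss(x, stiffness=50.0, slack=0.5):
    """Same wall stiffness, but displacements within `slack` cost nothing."""
    excess = max(0.0, abs(x) - slack)
    return stiffness * excess * excess

# Treat `shift` as a perturbation (context shift, novel pressure).
for shift in (0.1, 0.3, 0.6):
    print(f"shift {shift}: sharp {sharp_loss(shift):.2f}, basin {basin_loss(shift):.2f}")
```

Inside the slack region the basin absorbs the perturbation entirely, while the sharp minimum converts every perturbation directly into loss. That is the "no buffer between perturbation and behavioral mutation" picture.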
I think of this Goodhart dynamic, where patching one measure pushes the error into new dimensions, as “recursive Goodhart”.
I’m planning to write a longform post on this, but it will take some real math and some research. Some papers that look related:
Consequences of Misaligned AI: https://arxiv.org/abs/2102.03896
Goodhart’s Law In Reinforcement Learning: https://arxiv.org/pdf/2310.09144
Related to the Opus-3 point, but from a different perspective (the links there are also a good refresher on the Opus-3 discourse): https://www.lesswrong.com/posts/bLFmE8NtqxrtEaipN/what-makes-claude-3-opus-misaligned
I want to do SCIENCE
Oh, you want to confront the unavoidable, lose your footing as the bits accumulate, become lost in the forest of knowledge?
I want to chase ANSWERS
Keep chasing them to the world’s end, will you? As the ground runs out, will you leap, or flee?
I have to UNDERSTAND
Crave certainty, do you? That feeling of finding the missing pieces? Are you prepared to become the one building the puzzles, whose pieces do not yet exist?
I wish to EXPLORE
Venture into the unknown, will you? Prepared to map the map?
I want to be a GREAT SCIENTIST
Oh you’re ready to collide your head into the problem a hundred times in a row? Ready to study TOPOLOGY with no idea what it’s used for?
I want to make people THINK better
Oh you want them to remember the catchiest pieces of your deep advice? Prepared to watch them gravitate towards the simple over the complex, time and time again?
I am ready to tackle the REAL problems
Are you ready to run away when the problem hits you in the face? And return, and run, and return, and run?
I do not know how not to. I MUST find the answers.
Answers for what?
Everyone must be WRONG, including me
You are finally lost enough to begin.