I also think that it’s probably worth relating soft optimization to the old Impact Measures work from this community. In particular, I think it’d be interesting to cast soft optimization methods as robust optimization, and then see how the critiques raised against impact measures (e.g. in this comment or this question) apply to soft optimization methods like RL-KL or the minimax objective you outline here.
Thanks for linking these, I hadn’t read most of them. As far as I can tell, most of the critiques don’t really apply to soft optimization. The main one that does is Paul’s “drift off the rails” thing. I expect we need to use the first AGI (with soft opt) to help solve alignment in a more permanent and robust way, then use that to make a more powerful AGI that helps avoid “drifting off the rails”.
In my understanding, impact measures are an important part of the utility function that we don’t want to get wrong, but not much more than that. Soft optimization, by contrast, directly removes Goodharting of the utility function. It feels like the correct formalism for attacking the root of that problem, whereas impact measures just take care of a (particularly bad) symptom.
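To make “soft optimization” a bit more concrete, here is a minimal sketch, assuming the RL-KL form mentioned above: maximize expected proxy utility minus `beta` times the KL divergence from a trusted base policy. The function name, the four-action setup, and all the numbers are made up for illustration and are not taken from the post.

```python
import math

def kl_regularized_policy(base_probs, proxy_utility, beta):
    """Closed-form maximizer of E_pi[U] - beta * KL(pi || base).

    The solution is pi(a) proportional to base(a) * exp(U(a) / beta).
    """
    # Subtract the max utility before exponentiating for numerical stability.
    m = max(proxy_utility)
    weights = [p * math.exp((u - m) / beta)
               for p, u in zip(base_probs, proxy_utility)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical setup: the last action is rare under the base policy but has
# a suspiciously high proxy score (a candidate Goodhart exploit).
base = [0.5, 0.3, 0.199, 0.001]
proxy = [1.0, 2.0, 3.0, 10.0]

soft = kl_regularized_policy(base, proxy, beta=5.0)   # strong KL penalty
hard = kl_regularized_policy(base, proxy, beta=0.01)  # close to plain argmax

print("soft:", [round(p, 4) for p in soft])
print("hard:", [round(p, 4) for p in hard])
```

With a large `beta` the policy stays close to the base distribution and puts little mass on the outlier action; as `beta` shrinks it approaches hard maximization of the proxy and piles almost all of its mass there. That `beta` is one version of the kind of slider discussed below.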
Abram Demski has a good answer to the question you linked that contrasts mild optimization with impact measures, and it’s clear that mild optimization is preferred. And Abram actually says:
An improvement on this situation would be something which looked more like a theoretical solution to Goodhart’s law, giving an (in-some-sense) optimal setting of a slider to maximize a trade-off between alignment and capabilities (“this is how you get the most of what you want”), allowing ML researchers to develop algorithms orienting toward this.
This is exactly what I’ve got.