Can We Robustly Align AI Without (Dual-Use) Knowledge of Generalization? (draft)
It seems like knowledge of generalization is required to robustly align advanced AI, and under strong optimization pressure, alignment approaches without robustness guarantees wouldn’t be effective; but developing that knowledge in public is destabilizing. The tradeoff between “effectively and efficiently implementable (practically doable)” and “requires strong generalization capability to work (dangerous)” seems pretty tough, and I don’t know how to improve the current state in a robust, non-confused way.
I have been working on ML robustness, as I think it’s the on-ramp towards understanding OOD generalization and agency, which seems necessary for any alignment strategy.
(Robustness also improves safety in the short term (helping defend against immediate threat models) and the mid term (assessing the viability of various alignment plans), beyond the long-term goal above.)
Realizing that it might actually work well, I wrote this memo to understand whether this work is good for safety and to figure out a path forward, given the current state of frontier AI and the safety/alignment field.
Knowledge of generalization is dual-use.
Generalizing complex goals to diverse contexts requires good generalization, most likely in the form of self-referential error-correction (to deal with OOD/non-stationarity)
Acting competently and in an aligned way in complex environments requires strong generalization, but improving generalization alone improves AI capabilities without necessarily improving safety prospects
“Muddling through” is pessimistic both about capabilities and about the science of generalization/intelligence
Are we working on tools or agents?
Is “lousy agents” a (robust) solution, with or without paradigm shifts in AI capabilities?
Are there any better paths than this (essentially relying on path dependence of frontier AI labs)?
How to improve the odds of “muddling through”?
Current alignment work is mostly marginally beneficial. Principled/explanatory progress in alignment will most likely include knowledge of intelligence/generalization. If we succeed, tool-AIs will become more like agents, and the agents will become less lousy, which alone is bad for safety!
Robustness (adversarial, jailbreak/prompt injection)
Defining strong threat models and evaluating defenses coherently requires a model of OOD generalization
Alignment in general
Following rules efficiently in a broad range of contexts (including self-referential ones) requires generalization (early examples: jailbreaks)
Mechanistic Interpretability
Truly understanding how models act in complex scenarios ~= knowledge/theory of generalization
Formal guarantees
Either define the rules inefficiently (as in GOFAI) or restrict to limited formalized domains, or rely on strong generalization to form the guarantees
“Interpretability/alignment tools for open source ecosystem”
Same as mechanistic interpretability. What if you succeed?
Control
“Buy time for better solutions”: what solutions? Only the automated alignment researcher (relatively) makes sense
Paths forward
(Behind closed doors vs. in public: ‘real’ robust solutions lead to power concentration)
Clarity: understand why current models are useful but not dangerously general (technically, economically), and how to keep them that way? (The natural thing to do, but it’s dual-use!)
Limited path: work on hardening bounds/making them more useful, without relying on generalization (inefficient)
Ambitious path: robustness against OOD, reflective stability, robust self-other models, “organic alignment” (dangerous)
Pareto frontier: some level of understanding and guarantees, some efficiency gain? (Isn’t this the status quo?)
Automated alignment research: much less silly (compared to the current state)
Marginal, pragmatic alignment work (keep muddling through)?
Questions:
Will LLMs become truly general/deeply reflective soon, and does that increase x-risk?
Is OOD generalization the bottleneck?
Is principled/explanatory work on various aspects of alignment more likely to relieve the bottleneck (and thus improve capabilities) than to actually improve safety?
Assessing the current state
It’s pretty good (very similar to the limited path: alignment methods are inefficient and therefore don’t boost generalization radically), conditional on no qualitative improvements soon
How to keep us in this state, or prepare to transition out of it?
Automated alignment research is the only coherent story for marginal alignment work
Does this mean anything given the pace of AI progress?
IDK :(
Some of us have to try to gain clarity (while the rest work on the problems in front of us)
I still expect an explanatory theory of intelligence/generality to drop at some point, and it doesn’t seem that difficult