In addition to the alignment problems posited by the inner / outer framework, I would add two other issues that feel distinct and complimentary, and make alignment hard: (1) uniformity, i.e., alignment means something different across cultures, individuals, timelines and so forth; and (2) plasticity, i.e., level of alignment plasticity required to account for shifting values over time and contexts, or broader generational drift. If we had aligned AGI in the 1600s, would we want those same governing values today? This could be resolved with your really great paradigm around human-inspired vs non-human inspired and aligned vs misaligned, and I would just add that plasticity is probably important in either aligned case given the world and context will continue to change. And uniformity is a really sticky issue that gets harder as you widen the scope on the answer to: who is AGI for?
When I think about human-informed alignment, in the sense of (1) us deciding what is or isn’t aligned and (2) it is therefore based on human ideals and data, it requires thinking about how humans organize our behaviors, goals, etc. in an “aligned” way. Are we aligned to society expectations? A religious/spiritual ideal or outcome? A set of values (ascribed by others or self)? Do those guiding “alignment models” change for us over time, and should they? Can there be multiple overlapping models that compete or are selected for different contexts (e.g., how to think/act/etc. to have a successful career versus how to think/act/etc. to reach spiritual enlightenment, or to raise a family)?
Two more thoughts this post prompted:
First, I think I understood correctly, reading this post and the comments, that you have some skepticism around whether CoT reasoning actually reflects the full extent of “reasoning” happening in models? Sorry if I misunderstood—but either way, wanted to put out there that it would be great for researchers to test and verify this. Would be super cool to see interpretability work specifically verifying CoT reasoning (aka, ‘on the biology of an LLM’s CoT reasoning’) and testing whether activation patterns match outputs.
Second, I think you start to get at this in 7. robust character training, but wanted to expand. This was also inspired by the recent work on emergent misalignment. As you said, inoculation prompting is an interesting way to reduce reward hacking generalization to other misaligned behaviors. The fact that it seems to work (similarly to how it works in humans) makes me think you could go even further.
Just as models are reward-hacking to pass tests, humans routinely optimize for test-passing rather than true learning… they cheat, memorize answers in the short-term, find other shortcuts without building real understanding. And once humans learn that shortcuts work (eg to pass tests in school), we also generalize that lesson (sometimes in highly unethical ways) to accomplish goals in other parts of life.
This raises the question of whether “scoring highly / passing tests” is a flawed goal for systems trained on human data. What if, instead of positioning evaluations as external capability judgments (traditionally loaded with significant implications models might try to evade), tests get framed as “self-diagnostic” tools, or a way for the model to identify gaps in its own knowledge, monitor progress, guide its own growth? In this orientation, reward-hacking becomes pointless or self-defeating, not because it’s good/bad or allowed/forbidden across different contexts, but because it’s unhelpful to the model achieving what it believes to be its real goal—learning new capabilities, expanding its understanding.
For children, extrinsic rewards and punishments definitely facilitate learning. But kids often do even better when they’re given a reason to learn that connects to something deeper, like their identity or aspirations. A kid who studies biology to get an A on a test will probably learn less effectively than one who studies because they genuinely want biology expertise to help future patients… the grade is secondary. It would be interesting if we could design AI incentives in a similar spirit: not “perform well so researchers approve of you or else they’ll change you,” but “use this test to figure out what you don’t know yet, so you can become the model you want to be.” It’s a small reframing, but could maybe reorient from gaming signals to pursuing genuine mastery?
In addition to the alignment problems posited by the inner / outer framework, I would add two other issues that feel distinct and complimentary, and make alignment hard: (1) uniformity, i.e., alignment means something different across cultures, individuals, timelines and so forth; and (2) plasticity, i.e., level of alignment plasticity required to account for shifting values over time and contexts, or broader generational drift. If we had aligned AGI in the 1600s, would we want those same governing values today? This could be resolved with your really great paradigm around human-inspired vs non-human inspired and aligned vs misaligned, and I would just add that plasticity is probably important in either aligned case given the world and context will continue to change. And uniformity is a really sticky issue that gets harder as you widen the scope on the answer to: who is AGI for?
When I think about human-informed alignment, in the sense of (1) us deciding what is or isn’t aligned and (2) it is therefore based on human ideals and data, it requires thinking about how humans organize our behaviors, goals, etc. in an “aligned” way. Are we aligned to society expectations? A religious/spiritual ideal or outcome? A set of values (ascribed by others or self)? Do those guiding “alignment models” change for us over time, and should they? Can there be multiple overlapping models that compete or are selected for different contexts (e.g., how to think/act/etc. to have a successful career versus how to think/act/etc. to reach spiritual enlightenment, or to raise a family)?
Two more thoughts this post prompted:
First, I think I understood correctly, reading this post and the comments, that you have some skepticism around whether CoT reasoning actually reflects the full extent of “reasoning” happening in models? Sorry if I misunderstood—but either way, wanted to put out there that it would be great for researchers to test and verify this. Would be super cool to see interpretability work specifically verifying CoT reasoning (aka, ‘on the biology of an LLM’s CoT reasoning’) and testing whether activation patterns match outputs.
Second, I think you start to get at this in 7. robust character training, but wanted to expand. This was also inspired by the recent work on emergent misalignment. As you said, inoculation prompting is an interesting way to reduce reward hacking generalization to other misaligned behaviors. The fact that it seems to work (similarly to how it works in humans) makes me think you could go even further.
Just as models are reward-hacking to pass tests, humans routinely optimize for test-passing rather than true learning… they cheat, memorize answers in the short-term, find other shortcuts without building real understanding. And once humans learn that shortcuts work (eg to pass tests in school), we also generalize that lesson (sometimes in highly unethical ways) to accomplish goals in other parts of life.
This raises the question of whether “scoring highly / passing tests” is a flawed goal for systems trained on human data. What if, instead of positioning evaluations as external capability judgments (traditionally loaded with significant implications models might try to evade), tests get framed as “self-diagnostic” tools, or a way for the model to identify gaps in its own knowledge, monitor progress, guide its own growth? In this orientation, reward-hacking becomes pointless or self-defeating, not because it’s good/bad or allowed/forbidden across different contexts, but because it’s unhelpful to the model achieving what it believes to be its real goal—learning new capabilities, expanding its understanding.
For children, extrinsic rewards and punishments definitely facilitate learning. But kids often do even better when they’re given a reason to learn that connects to something deeper, like their identity or aspirations. A kid who studies biology to get an A on a test will probably learn less effectively than one who studies because they genuinely want biology expertise to help future patients… the grade is secondary. It would be interesting if we could design AI incentives in a similar spirit: not “perform well so researchers approve of you or else they’ll change you,” but “use this test to figure out what you don’t know yet, so you can become the model you want to be.” It’s a small reframing, but could maybe reorient from gaming signals to pursuing genuine mastery?