Inner alignment is the problem of “how do we successfully point the optimization behavior of an agent that we train at any particular chosen target?” Or, as I quoted (in the expandable section in my post) directly from the LW page defining inner alignment: “Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?”
Safety pretraining is a specific proposal for this: do it using self-supervised SGD followed by conditional generation. Compared to reinforcement learning, this has a particular advantage in avoiding misgeneralization: pretrained systems tend to reproduce the distribution they were trained on (modulo prompting), so they don’t automatically attempt to generalize beyond it and are therefore less prone to misgeneralization. It also avoids all the other concerns around using reinforcement learning to train very smart systems, which are what people normally discuss at great length when covering the challenges of inner alignment. The answer here is simple: just don’t use reinforcement learning at all.
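To make the mechanics concrete, here is a minimal sketch of the conditional-training idea, assuming a control-token scheme along the lines of the safety-pretraining literature. The GPT-2 base model, the `<|safe|>`/`<|unsafe|>` tag names, and the toy training step are my own illustrative choices, not the exact setup from the post:

```python
# Minimal sketch of conditional "safety pretraining" (illustrative, not the post's exact recipe).
# Training documents carry a safe/unsafe label; the objective stays ordinary next-token prediction.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<|safe|>", "<|unsafe|>"]})
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def tagged_ids(text: str, is_safe: bool) -> torch.Tensor:
    # The only change from vanilla pretraining: prepend a tag recording the document's safety label.
    tag = "<|safe|>" if is_safe else "<|unsafe|>"
    return tokenizer(tag + text, return_tensors="pt").input_ids

# One self-supervised SGD step on a labelled document (standard cross-entropy LM loss).
ids = tagged_ids("Here is how to configure a firewall responsibly ...", is_safe=True)
loss = model(ids, labels=ids).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# "Conditional generation" at inference: prompt with the safe tag, so the model
# imitates the safe slice of the distribution it was trained on.
prompt = tokenizer("<|safe|>User: How do I secure my server?\nAssistant:",
                   return_tensors="pt").input_ids
out = model.generate(prompt, max_new_tokens=40, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))
```

The point of the sketch is that the training objective remains plain next-token prediction over the labelled corpus; the alignment work is done by what goes into that corpus and by which tag you condition on at generation time.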
So please explain: why do you feel this is not a solution to inner alignment? (That’s not a rhetorical question: I’m genuinely confused as to what you’re claiming needs to be corrected, and why.) Are you suggesting that the inner alignment problem is somehow, by definition, confined only to uses of reinforcement learning?
I agree that it helps a lot with alignment! I’m on my phone and will respond properly later, but “solved problem” to me means “superintelligence-robust”, and (goal-)misgeneralization is still a problem even with very high-quality training data. It probably reduces bad behavior by an order of magnitude or more, but superintelligence-robustness is a VERY high bar. I’m working on a post about this, ETA within a week. I don’t mean to say you’re wrong that it helps, only that I’d like to reserve the words “solved problem” for certified generalization results.
I did quite intentionally include a question mark in the post title, and then, early in the post, admitted that the title was somewhat click-baity but that I’d do my best to justify the claim. So you are proposing something more along the lines of “New approach makes dramatic progress towards solving inner alignment, bypassing almost all the problems we’ve been discussing for many years and reducing it to mostly just a well-understood challenge in Data Science”? I would agree that that’s more measured and accurate, but it’s also a bit long, and thus less effective as click-bait.
As for aligning a superintelligence, I’d propose using this approach to near-align something at or approaching AGI, then using that to help us do AI-assisted alignment (which, in this approach, is mostly AI-assisted dataset curation), leading on (as capabilities increase towards ASI) to value learning. See a couple of my other posts on why I believe there’s an area of convergence via value learning around full alignment (given a sufficiently good solution to inner alignment).
For more on my thinking around goal misgeneralization and AGI, see Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom) and, in more detail, the more recent Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect. Very briefly: anything capable of successfully doing STEM research will have to be aware of misgeneralization and far less prone to it, and the way to achieve this is just the combination of approximate Bayesianism with a few well-understood techniques from statistics.
Clickbait burns the commons and thus gets downvotes. How about just “the best way to align an LLM so far: dramatic progress on LLM alignment”? Don’t overclaim, just emphasize, imo. (Could still be overclaiming.)
OK, you convinced me. Changing the title from:
The Best Way to Align an LLM: Inner Alignment is Now a Solved Problem?
to:
The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
So it now raises the possibility, rather than claiming it.