I agree that it helps a lot with alignment! I’m on my phone, will respond properly later, but “solved problem” to me means “superintelligence-robust”, and (goal-)misgeneralization is still a problem even with very high quality training data. It probably reduces bad behavior by an order of magnitude or more, but superintelligence-robustness is a VERY high bar. I’m working on a post about this, ETA within a week. I don’t mean to say you’re wrong that it helps, only that I’d like to reserve the words “solved problem” for certified generalization results.
I did quite intentionally include a question mark in the post title, and then early in the post admit that the title was somewhat click-baity, but that I’d do my best to justify the claim. So you are proposing something around the level of “New approach makes dramatic progress towards solving inner alignment, bypassing almost all the problems we’ve been discussing for many years, and reducing it to mostly just a well-understood challenge in Data Science”? I would agree that that’s more measured and accurate, but it’s also a bit long, and thus less effective as click-bait.
As for aligning a superintelligence, I’d propose using this approach to near-align something at or approaching AGI, then using that to help us do AI-assisted alignment (which, in this approach, is mostly AI-assisted dataset curation), leading on (as capabilities increase towards ASI) to value learning. See a couple of my other posts on why I believe there’s an area of convergence via value learning around full alignment (if you have a sufficiently good solution to inner alignment).
For more on my thinking around goal misgeneralization and AGI, see: Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom) and, in more detail, the more recent Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect. Very briefly, anything capable of successfully doing STEM research will have to be aware of misgeneralization and far less prone to it, and the way to achieve this is just the combination of approximate Bayesianism with a few well-understood techniques from statistics.
Clickbait burns the commons and thus gets downvotes. How about just “the best way to align an LLM so far: dramatic progress on LLM alignment”? Don’t overclaim, just emphasize, IMO. (Could still be overclaiming.)
OK, you convinced me. Changing the title from:
The Best Way to Align an LLM: Inner Alignment is Now a Solved Problem?
to:
The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
So it now raises the possibility, rather than claiming it.