Re my own take on the alignment problem, if I’m assuming that a future AGI will be built by primarily RL signals and memory plus continuous learning is part of the deal with AGI, I think my current use case for AGI is broadly to train them to be at least slightly better than humans currently are at alignment research specifically, and my general worldview on alignment is that we should primarily try to prepare ourselves for the automated alignment researchers that will come, so most of the work that needs to be frontloaded is stuff that would cause the core loop of automated alignment research to catastrophically fail.
On the misleading intuitions from everyday life, I’d probably say misleading intuitions from modern-day everyday life, because while I do agree that overgeneralization from innate learned reward is a problem, I’d say another problem is people overindex on the history of the modern era for intuitions about peaceful trade being better than war ~all the time, where sociopathic entities/no innate social drives people can be made to do lots of good things for everyone via trade, rather than conflict, but the issue here is that the things that make sociopathic entities do good things like trade peacefully with others rather than enslaving/murdering/raping them are fundamentally dependent on a set of assumptions that are fundamentally violated by AGI and the technologies it spawns (one example of an important assumption is that in order for wars to be prevented, conflict is costly even for the winners, but uncontrolled AGIs/ASIs engaging in conflict with humans and nature is likely to not be costly for them, or it can be made to not be costly, which seriously messes up the self-interested case for trade, and more importantly a really large assumption that is implicit is that property rights are inviolable and that expropriating resources is counter-productive and harms you compared to trading or peacefully making a business, but the problem is that AGI shifts the factors of wealth from hard/impossible to expropriate to relatively easy to expropriate away from other agents, especially if serious nanotech/biotech is ever produced by AGI, similar to how the issue with human/non-human animal relationships is that animal labor is basically worthless, but their land/capital is not worthless, and conflict isn’t costly, meaning stealing resources from gorillas and other non-human animals gets you the most wealth).
This is why Matthew Barnett’s comments on how trade is usually worth it for selfish agents in the modern era, so AGIs will trade with us peacefully too are so mistaken:
(It talks about how democracy and human economic relevance would fade away with AGI, but the relevance here is that it points out that the elites of the post-AGI world have no reason to give/trade with commoners what resources need to survive if they are not value aligned with commoner welfare due to the incentives AGI sets up, and thus there’s little reason to assume that misaligned AGIs would trade with us, instead of letting us starve or directly killing us all).
Re Goodhart’s Law, @Jeremy Gillen has said that his proposal gives us a way to detect if goodharting has occured, and implies in the comments that it has a lower alignment tax than you think:
(Steven Byrnes): FYI §14.4 of my post here is a vaguely similar genre although I don’t think there’s any direct overlap.
There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”. We call it “clever out-of-the-box solution” when we’re happy with how it turned out, and we call it “Goodharting” when we’re sad about how it turned out, but it’s not a structural difference. So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception. That’s an alignment tax, which might be fine (depending on the scenario) but should be kept in mind.
(Jeremy Gillen): Yeah I agree there are similarities. I think a benefit of my approach, that I should have emphasized more, is that it’s reflectively stable (and theoretically simple and therefore easy to analyze). In your description of an AI that wants to seek clarification, it isn’t clear that it won’t self-modify (but it’s hard to tell).
(Steven Byrnes): There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”.
(Jeremy Gillen): But there is a principled distinction. The distinction is whether the plan exploits differences between the goal specification and our actual goal. This is a structural difference, and we can detect using information about our actual goal.
(Steven Byrnes): So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception.
(Jeremy Gillen): My proposal is usually an exception to this, because it takes advantage of the structural difference between the two cases. The trick is that the validation set only contains things that we actually want. If it were to contain extra constraints beyond what we actually want, then yeah that creates an alignment tax.
My view on practical plans for AI safety is that in the short term, I think that it’s probably ok at human-level AI having potentially crazy motivations, because I think AI control is at least a reasonable plan that could plausibly work, and I’m more optimistic than you that modulo rogue internal deployments/fast take-off where we don’t continually reduce the compute used on the AI, the AI’s action space for harm is vastly reduced compared to humans.
A big part of this has to do with the fact that AI states/memory can basically always be reset, so it’s pretty easy to reset an AGI’s memory, which prevents a lot of the issues of the AGI eventually learning how to defeat the defenses.
In this project, we assume that the model shares no state between calls. While solving a task, the model can see the actions taken so far and their outputs in its context, but it’s up to us to decide what context we give the model. If we want independent samples of the next action, we make multiple calls to the model using the same context, and the model’s statelessness between calls means that it doesn’t know it is being resampled. We can also easily reconstruct the context of earlier steps if we later decide that we want more samples for their information value. Future models might well maintain state between calls, for instance by using some kind of recurrence or long-term memory bank. But so long as we retain some checkpoints of past states we can easily reset it. In real deployments, this storage cost will likely be worth it during short time windows (like on a particular trajectory), but may not be between trajectories. Statefulness between trajectories, for instance through updating a large codebase over time, is a property that future control settings might have which is harder to work around, and we leave it to future control research to study its impact. Nonetheless, the ability to control the model’s state and draw independent samples is a major difference between security systems designed to be robust to human insider threats and control protocols for untrusted AIs, and it turns out to be an incredibly powerful advantage in many situations.
So my general worldview on how I’d approach the AI alignment problem is I’d use AI control (defined broadly) to make the AGIs that are potentially misaligned do useful work on the alignment problem, and then use the vast AGI army to produce a solution to alignment for very powerful AIs.
We do need to worry about sandbagging/deception, though.
I do think the alignment problem for brain-like AGI may unfortunately be hard to solve, and while I’m not yet convinced that behaviorist rewards give scheming AIs with overwhelming probability no matter what you reward, I do think I should plan for worlds where the alignment problem is genuinely hard to solve, and I like AI control as a stepping stone to solving the alignment problem.
AI control definitely is making assumptions, but I don’t think it’s making assumptions that are only applicable to LLMs (but it would be nice if we got LLMs first for AGI), but it definitely assumes a limit on capabilities, but lots of the control stuff would work if we got a brain-like AGI with memory and continuous learning.
Re my own take on the alignment problem, if I’m assuming that a future AGI will be built by primarily RL signals and memory plus continuous learning is part of the deal with AGI, I think my current use case for AGI is broadly to train them to be at least slightly better than humans currently are at alignment research specifically, and my general worldview on alignment is that we should primarily try to prepare ourselves for the automated alignment researchers that will come, so most of the work that needs to be frontloaded is stuff that would cause the core loop of automated alignment research to catastrophically fail.
On the misleading intuitions from everyday life, I’d probably say misleading intuitions from modern-day everyday life, because while I do agree that overgeneralization from innate learned reward is a problem, I’d say another problem is people overindex on the history of the modern era for intuitions about peaceful trade being better than war ~all the time, where sociopathic entities/no innate social drives people can be made to do lots of good things for everyone via trade, rather than conflict, but the issue here is that the things that make sociopathic entities do good things like trade peacefully with others rather than enslaving/murdering/raping them are fundamentally dependent on a set of assumptions that are fundamentally violated by AGI and the technologies it spawns (one example of an important assumption is that in order for wars to be prevented, conflict is costly even for the winners, but uncontrolled AGIs/ASIs engaging in conflict with humans and nature is likely to not be costly for them, or it can be made to not be costly, which seriously messes up the self-interested case for trade, and more importantly a really large assumption that is implicit is that property rights are inviolable and that expropriating resources is counter-productive and harms you compared to trading or peacefully making a business, but the problem is that AGI shifts the factors of wealth from hard/impossible to expropriate to relatively easy to expropriate away from other agents, especially if serious nanotech/biotech is ever produced by AGI, similar to how the issue with human/non-human animal relationships is that animal labor is basically worthless, but their land/capital is not worthless, and conflict isn’t costly, meaning stealing resources from gorillas and other non-human animals gets you the most wealth).
This is why Matthew Barnett’s comments on how trade is usually worth it for selfish agents in the modern era, so AGIs will trade with us peacefully too are so mistaken:
https://forum.effectivealtruism.org/posts/4LNiPhP6vw2A5Pue3/consider-granting-ais-freedom?commentId=HdsGEyjepTkeZePHF
Ben Garfinkel and Luke Drago has talked about this issue too:
https://forum.effectivealtruism.org/posts/TMCWXTayji7gvRK9p/is-democracy-a-fad#Why_So_Much_Democracy__All_of_a_Sudden_
https://forum.effectivealtruism.org/posts/TMCWXTayji7gvRK9p/is-democracy-a-fad#Automation_and_Democracy
https://intelligence-curse.ai/defining/
(It talks about how democracy and human economic relevance would fade away with AGI, but the relevance here is that it points out that the elites of the post-AGI world have no reason to give/trade with commoners what resources need to survive if they are not value aligned with commoner welfare due to the incentives AGI sets up, and thus there’s little reason to assume that misaligned AGIs would trade with us, instead of letting us starve or directly killing us all).
Re Goodhart’s Law, @Jeremy Gillen has said that his proposal gives us a way to detect if goodharting has occured, and implies in the comments that it has a lower alignment tax than you think:
https://www.lesswrong.com/posts/ZHFZ6tivEjznkEoby/detect-goodhart-and-shut-down
https://www.lesswrong.com/posts/ZHFZ6tivEjznkEoby/detect-goodhart-and-shut-down#Ws6kohnReEAcBDecA
My view on practical plans for AI safety is that in the short term, I think that it’s probably ok at human-level AI having potentially crazy motivations, because I think AI control is at least a reasonable plan that could plausibly work, and I’m more optimistic than you that modulo rogue internal deployments/fast take-off where we don’t continually reduce the compute used on the AI, the AI’s action space for harm is vastly reduced compared to humans.
See this podcast for more details below:
https://80000hours.org/podcast/episodes/buck-shlegeris-ai-control-scheming/
A big part of this has to do with the fact that AI states/memory can basically always be reset, so it’s pretty easy to reset an AGI’s memory, which prevents a lot of the issues of the AGI eventually learning how to defeat the defenses.
More here:
https://www.lesswrong.com/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling
So my general worldview on how I’d approach the AI alignment problem is I’d use AI control (defined broadly) to make the AGIs that are potentially misaligned do useful work on the alignment problem, and then use the vast AGI army to produce a solution to alignment for very powerful AIs.
We do need to worry about sandbagging/deception, though.
I do think the alignment problem for brain-like AGI may unfortunately be hard to solve, and while I’m not yet convinced that behaviorist rewards give scheming AIs with overwhelming probability no matter what you reward, I do think I should plan for worlds where the alignment problem is genuinely hard to solve, and I like AI control as a stepping stone to solving the alignment problem.
AI control definitely is making assumptions, but I don’t think it’s making assumptions that are only applicable to LLMs (but it would be nice if we got LLMs first for AGI), but it definitely assumes a limit on capabilities, but lots of the control stuff would work if we got a brain-like AGI with memory and continuous learning.