Yeah, I think they will probably produce better and more extensive regulation than if politicians were more directly involved, but I’m not super sanguine about bureaucrats in absolute terms.
NickGabs
Steering Llama-2 with contrastive activation additions
Science of Deep Learning more tractably addresses the Sharp Left Turn than Agent Foundations
An upcoming US Supreme Court case may impede AI governance efforts
Empirical Evidence Against “The Longest Training Run”
This was addressed in the post: “To fully flesh out this proposal, you would need concrete operationalizations of the conditions for triggering the pause (in particular the meaning of “agentic”) as well as the details of what would happen if it were triggered. The question of how to determine if an AI is an agent has already been discussed at length at LessWrong. Mostly, I don’t think these discussions have been very helpful; I think agency is probably a “you know it when you see it” kind of phenomenon. Additionally, even if we do need a more formal operationalization of agency for this proposal to work, I suspect that we will only be able to develop such an operationalization via more empirical research. The main particular thing I mean to actively exclude by stipulating that the system must be agentic is an LLM or similar system arguing for itself to be improved in response to a prompt.”
Proposal: labs should precommit to pausing if an AI argues for itself to be improved
Thanks for posting this; these are valuable lessons, and I agree it would be worthwhile for someone to do a project looking into successful emergency response practices. One thing this framing also highlights is that, as Quintin Pope discussed in his recent post on alignment optimism, the “security mindset” is not appropriate for the default alignment problem. We are only being optimized against once we have failed to align the AI; until then, we are mostly held to the lower bar of reliability, not security. There is also the problem of malicious human actors, where security problems could pop up before the AI becomes misaligned, but this failure mode seems less likely and less x-risk-inducing than misalignment, and it involves a pretty different set of measures (info-sharing policies and encrypting the weights, as opposed to training techniques or evals).
I think concrete ideas like this that take inspiration from past regulatory successes are quite good, esp. now that policymakers are discussing the issue.
I agree with aspects of this critique. However, to steelman Leopold, I think he is not just arguing that demand-driven incentives will drive companies to solve alignment because consumers want safe systems, but rather that, over and above ordinary market forces, constraints imposed by governments, media/public advocacy, and perhaps industry-side standards will make it ~impossible to release a very powerful, unaligned model. I think this points to a substantial underlying disagreement in your models: Leopold thinks that governments and the public will “wake up” to catastrophic risk from AI quickly enough that regulatory and PR forces will effectively prevent the release of misaligned models, backed by evals/ways of detecting misalignment that are more robust than those ordinary consumers might use (which, as you point out, could likely be fooled by surface-level alignment due to RLHF).
AI Doom Is Not (Only) Disjunctive
How do any capabilities or motivations arise without being explicitly coded into the algorithm?
I don’t think it is correct to conceptualize MLE as a “goal” that may or may not be “myopic.” LLMs are simulators, not prediction-correctness-optimizers; we can infer this from the fact that they don’t intervene in their environment to make it more predictable. When I worry about LLMs being non-myopic agents, I worry about what happens after they have been subjected to lots of fine-tuning following pretraining, perhaps via Ajeya Cotra’s idea of “HFDT.” Thus, while pretraining from human preferences might shift the initial distribution that the model predicts at the start of fine-tuning, in a way which seems likely to push the final outcome of fine-tuning in a more aligned direction, it is far from a solution to the deeper problem of agent alignment, which I think is really the core issue.
How does it give the AI a myopic goal? It seems like it’s basically just a clever form of prompt engineering, in the sense that it alters the conditional distribution that the base model is predicting, albeit in a more robustly good way than most/all prompts. But base models aren’t myopic agents; they aren’t agents at all. As such, I’m not concerned about pure simulators/predictors posing x-risks, but about what happens when people do RL on them to turn them into agents (or use similar techniques like decision transformers). I think it’s plausible that pretraining from human feedback partially addresses this by pushing the model’s outputs into a more aligned distribution from the get-go when we do RLHF, but it is very much not obvious that it solves the deeper problems with RL more broadly (inner alignment and scalable oversight/sycophancy).
I agree scaling well with data is quite good. But see (1).
How?
I was never that concerned about this, but I agree that it does seem good to offload more training to pretraining as opposed to fine-tuning, for this and other reasons.
Pretraining on human feedback? I think it’s promising, but we have no direct evidence of how it interacts with RL fine-tuning to turn LLMs into agents, which is the key question.
Yes, but the question of whether pretrained LLMs have good representations of our values and/or preferences and of the concept of deference/obedience is still quite important for whether they become aligned. If they don’t, then aligning them via fine-tuning after the fact seems quite hard. If they do, it seems pretty plausible to me that e.g. RLHF fine-tuning or something like Anthropic’s constitutional AI finds the solution of “link the values/obedience representations to the output in a way that causes aligned behavior,” because this is simple and attains lower loss than misaligned paths. This in turn is because in order for the model to be misaligned and still attain low loss, it must be deceptively aligned, and deceptive alignment requires a combination of good situational awareness, a fully consequentialist objective, and high-quality planning/deception skills.
Yeah, I agree with both your object-level claim (i.e. I lean towards the “alignment is easy” camp) and, to a certain extent, your psychological assessment, but this is a bad argument. Optimism bias is also well documented in many cases, so to establish that people who think alignment is hard are overly pessimistic, you need to argue more on the object level against the claim itself, or provide highly compelling evidence that such people are systematically and irrationally pessimistic on most topics.
Strong upvote. A corollary here is that a really important part of being a “good person” is being able to tell when you’re rationalizing your behavior or otherwise deceiving yourself into thinking you’re doing good. The default is that people are quite bad at this but, as you said, don’t have explicitly bad intentions, which leads to a lot of people who are at some level morally decent acting in very morally bad ways.
In the intervening period, I’ve updated towards your position, though I still think it is risky to build systems with capabilities that open-ended and that close to agents in design space.
I think you’re probably right. But even this will make it harder to establish an agency where the bureaucrats/technocrats have a lot of autonomy, and it seems there’s at least a small chance of an extreme ruling which could make it extremely difficult.