let’s call “hard alignment” the (“orthodox”) problem, historically worked on by MIRI, of preventing strong agentic AIs from pursuing things we don’t care about by default and destroying everything of value to us on the way there. let’s call “easy” alignment the set of perspectives where some of this model is wrong — some of the assumptions are relaxed — such that saving the world is easier or more likely to be the default.
what should one be working on? as always, the calculation consists of comparing
p(hard) × how much value we can get in hard
p(easy) × how much value we can get in easy
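as a minimal sketch of this comparison: the numbers below are purely hypothetical placeholders (not estimates from this post), and treating hard and easy as exhaustive, so that p(easy) = 1 − p(hard), is an added simplifying assumption.

```python
def expected_value(p_world: float, value_if_world: float) -> float:
    """Probability of a world times the value we can get in that world."""
    return p_world * value_if_world


# Hypothetical placeholder numbers, chosen only to illustrate the comparison.
p_hard = 0.7            # assumed credence that alignment is "hard"
p_easy = 1.0 - p_hard   # assumption: hard and easy are treated as exhaustive
value_hard = 0.2        # assumed value attainable by working on a hard-compatible plan
value_easy = 0.4        # assumed value attainable by working on an easy-compatible plan

ev_hard = expected_value(p_hard, value_hard)
ev_easy = expected_value(p_easy, value_easy)

# Work on whichever plan has the larger product.
preferred = "hard" if ev_hard > ev_easy else "easy"
print(f"hard: {ev_hard:.2f}, easy: {ev_easy:.2f}, work on: {preferred}")
```

with these placeholder numbers the hard-compatible plan comes out ahead (about 0.14 against 0.12), even though more value is attainable per unit of probability in the easy world. the answer flips as soon as p(hard) or the value attainable under hard drops far enough.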
given how AI capabilities are going, it’s not unreasonable for people to start playing their outs — that is to say, to start acting as if alignment is easy, because if it’s not we’re doomed anyways. but i think, in this particular case, this is wrong.
this is the lesson of dying with dignity and bracing for the alignment tunnel: we should be cooperating with our counterfactual selves and continue to save the world in whatever way actually seems promising, rather than taking refuge in falsehood.
to me, p(hard) is big enough, and my hard-compatible plan seems workable enough, that it makes sense for me to continue to work on it.
let’s not give up on the assumptions which are true. there is still work that can be done to actually generate some dignity under the assumptions that are actually true.
I’m put in mind of something Yudkowsky said on the Bankless podcast:
“Enrico Fermi was saying that fission chain reactions were 50 years off if they could ever be done at all, 2 years before he built the first nuclear pile. The Wright brothers were saying heavier-than-air flight was 50 years off shortly before they built the first Wright flyer.”
He was speaking about how far away AGI could be, but I think the same logic applies to alignment. It looks hopeless right now, but events never play out exactly like you expect them to, and breakthroughs happen all the time.
Excellent point. In one frame, pessimism applied to timelines makes them look further away than they actually turn out to be. In another frame, pessimism applied to doom makes it seem closer / more probable, but it uses the anti-pessimism frame applied to timelines—“AGI will happen much sooner than we think”.
I get the sense reading some LessWrong comments that there is a divide between “alignment-is-easy”-ers and “alignment-is-hard”-ers. I also get the sense that Yudkowsky’s p(doom) has increased over the years, to where it is now. Isn’t it somewhat strange that we should be getting two groups whose p(doom) is moving away from the center?
Answer: The problem here is that the “alignment is hard” people are anticorrelated with truth, because we have a bias to click on and report negative news that outstrips positive news, so we aren’t rational when we update on negative news compared to positive news. Combine this with confirmation bias, and the alignment is easy or moderate difficulty people are closer to the truth, due to new evidence.
Links to negative news bias:
https://www.vox.com/the-highlight/23596969/bad-news-negativity-bias-media
https://archive.is/7EhiX (Atlantic article archived, so no paywall.)
Yeah I agree with both your object level claim (ie I lean towards the “alignment is easy” camp) and to a certain extent your psychological assessment, but this is a bad argument. Optimism bias is also well documented in many cases, so to establish that “alignment is hard” people are overly pessimistic, you need to argue more on the object level against the claim or provide highly compelling evidence that such people are systematically irrationally pessimistic on most topics.
You’re right that optimism bias is an issue, but optimism bias is generally an individual phenomenon, and the most important phenomenon is what people share instead of what they believe, so negative news being shared more is the most important issue.
But recently we found a technique of alignment that solves almost every alignment problem in one go, and scales well with data.
Pre training on human feedback? I think it’s promising but we have no direct evidence of how it interacts with RL finetuning to make LLMs into agents which is the key question.
Yes, I’m talking about that technique known as Pretraining from Human Feedback.
The biggest reasons I’m so optimistic about the technique, even with its limitations, are the following:
It almost completely or completely solves deceptive alignment by giving it a myopic goal, so there’s far less incentive or no incentive to be deceptive.
It scales well with data, which is extremely useful: the more data it has, the more aligned it will be.
The tests, while sort of unimportant from our perspective, gave tentative evidence for the proposition that we can control power seeking such that we can avoid having an AI power seek if it’s misaligned and actually power seek only when it’s aligned.
They dissolved, rather than resolved, embedded agency/embedded alignment concerns by using offline learning. In particular, the AI can’t hack or manipulate a human’s values, unlike online learning. In essence they translated the ontology of Cartesianism and its boundaries in a sensible way to an embedded world.
It’s not a total one shot solution, but it’s the closest we came to a one shot-solution, and I can see a path to alignment that’s fairly straightforward from here.
How does it give the AI a myopic goal? It seems like it’s basically just a clever form of prompt engineering in the sense that it alters the conditional distribution that the base model is predicting, albeit in a more robustly good way than most/all prompts, but base models aren’t myopic agents, they aren’t agents at all. As such I’m not concerned about pure simulators/predictors posing xrisks, but about what happens when people do RL on them to turn them into agents (or similar techniques like decision transformers). I think it’s plausible that pretraining from human feedback partially addresses this by pushing the model’s outputs into a more aligned distribution from the get go when we do RLHF, but it is very much not obvious that it solves the deeper problems with RL more broadly (inner alignment and scalable oversight/sycophancy).
I agree scaling well with data is quite good. But see (1)
How?
I was never that concerned about this, but I agree that it does seem good to offload more training to pretraining as opposed to finetuning for this and other reasons
It’s basically replacing Maximum Likelihood Estimation, the goal that LLMs and simulators currently use, with the goal of cross-entropy from a feedback-annotated webtext distribution, and in particular it’s a simple, myopic goal, which prevents deceptive alignment.
In particular, even if we turn it into an agent, it will be a pretty myopic one, or an aligned, non-myopic agent at worst.
Specifically, the fact that it can both improve at PEP8 compliance, which is essentially generating style-compliant Python code, and get better at not getting personal identifying information is huge. Especially that second task, as it’s indirectly speaking to a very important question: Can we control powerseeking such that an AI doesn’t powerseek if it would be misaligned to a human’s interests? In particular, if the model doesn’t try to get personal identifying information, then it’s also voluntarily limiting its ability to seek power when it detects that it’s misaligned with a human’s values. That’s arguably one of the core functions of any functional alignment strategy: Controlling powerseeking.
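For readers unfamiliar with the technique, here is a minimal sketch of what a “feedback-annotated” training distribution could look like, loosely following the conditional-training idea in the linked paper as I understand it. The control-token strings, the `toy_score` function, and the threshold are illustrative stand-ins, not the paper’s actual code or reward models. The sketch only builds the annotated corpus; a language model would then be trained on it offline with the usual next-token objective.

```python
from typing import Callable, List

# Illustrative control tokens (an assumption of this sketch).
GOOD_TOKEN = "<|good|>"
BAD_TOKEN = "<|bad|>"


def annotate_corpus(
    segments: List[str],
    score: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Prefix each text segment with a control token chosen by its feedback score."""
    annotated = []
    for segment in segments:
        token = GOOD_TOKEN if score(segment) >= threshold else BAD_TOKEN
        annotated.append(token + segment)
    return annotated


def toy_score(text: str) -> float:
    """Hypothetical scorer: 0.0 if the text seems to contain an email address, else 1.0."""
    return 0.0 if "@" in text else 1.0


corpus = [
    "def add(a, b):\n    return a + b",
    "you can reach me at alice@example.com",
]

# The whole corpus is annotated up front; the model being trained has no say
# in which examples it sees or how they are labeled.
annotated = annotate_corpus(corpus, toy_score, threshold=0.5)
for line in annotated:
    print(repr(line))
```

At sampling time, one would condition on the good token by prepending it to the prompt. Whether a model trained this way stays aligned after further fine-tuning is exactly the open question being argued in this thread.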
I don’t think it is correct to conceptualize MLE as a “goal” that may or may not be “myopic.” LLMs are simulators, not prediction-correctness-optimizers; we can infer this from the fact that they don’t intervene in their environment to make it more predictable. When I worry about LLMs being non-myopic agents, I worry about what happens when they have been subjected to lots of fine tuning, perhaps via Ajeya Cotra’s idea of “HFDT,” for a while after pre-training. Thus, while pretraining from human preferences might shift the initial distribution that the model predicts at the start of finetuning in a way which seems like it would likely push the final outcome of fine-tuning in a more aligned direction, it is far from a solution to the deeper problem of agent alignment that I think is really the core.
Hm, that might be a potential point of confusion. I agree that there’s no agentic stuff, at least without RL or a memory source, but the LLM is still pursuing the goal of maximizing the likelihood of the training data, which comes apart pretty quickly from the preferences of humans, for many reasons.
You’re right that it doesn’t actively intervene, mostly because of the following:
There’s no RL, usually.
It is memoryless, in the sense that it forgets itself.
It doesn’t have a way to store arbitrarily long/complex problems in its memory, nor can it write memories to a brain.
But the Maximum Likelihood Estimation goal still gives you misaligned behavior, and I’ll give you examples:
Completing buggy Python code in a buggy way
https://arxiv.org/abs/2107.03374
Or to espouse views consistent with those expressed in the prompt (sycophancy).
https://arxiv.org/pdf/2212.09251.pdf
So the LLM is still optimizing for Maximum Likelihood Estimation; it just has certain limitations, so it is misaligned passively instead of actively.
Alternatively, reality is looking to me like the hard alignment problem is just based on fundamentally mistaken models about the world. It’s not about playing our outs, it’s that it doesn’t seem like we live in a hard alignment world.
They’re looking more false by the day.
Yeah, this is starting to make a lot more sense to me. It seems that evaluating the complexity of a utility function using Kolmogorov complexity rather than thinking about how hard it is for the AGI to implement it in terms of its internal concept language is a huge mistake. Magical categories don’t seem that magical anymore; simply predicting the next tokens is enough to give you robust abstractions about human values.
I don’t see how you arrive at these conclusions at all. I agree that, given how alignment of the current models works, there’s some vague hope that things might keep going like this even when capabilities increase. Is there any specific thing that makes you update more strongly?
Hard problem of alignment is going to hit us like a train in 3 to 12 months, at the same time as some specific capabilities breakthroughs people have been working on for the entire history of ML finally start working now that they have a weak AGI to apply to, and suddenly Critch’s stuff becomes super duper important to understand.
What Critch stuff do you have in mind?
Modal Fixpoint Cooperation without Löb’s Theorem
Löbian emotional processing of emergent cooperation: an example
«BOUNDARIES» SEQUENCE
Really, just skim his work. He’s been thinking well about the hard problems of alignment for a while.
Well it looks like to me the AI will understand our values at least as well as we do soon. I think it’s far more likely AI goes wrong by understanding completely what we want and not wanting to do it than the paperclip route.
That is the paperclip route. A superintelligent paperclip optimizer understands what we want, because it is superintelligent, but it wants to make “paperclips” instead.
Yes but the question of whether pretrained LLMs have good representations of our values and/or our preferences and the concept of deference/obedience is still quite important for whether they become aligned. If they don’t, then aligning them via fine tuning after the fact seems quite hard. If they do, it seems pretty plausible to me that eg RLHF fine tuning or something like Anthropic’s constitutional AI finds the solution of “link the values/obedience representations to the output in a way that causes aligned behavior,” because this is simple and attains lower loss than misaligned paths. This in turn is because in order for it to be misaligned and attain low loss, it must be deceptively aligned, but deceptive alignment requires a combination of good situational awareness, a fully consequentialist objective, and high quality planning/deception skills.
What does GPT want?
I don’t know.
My model of foundational LLMs, before tuning and prompting, is that they want to predict the next token, assuming that the token stream is taken from the hypothetical set that their training data is sampled from. Their behavior out of distribution is not well-defined in this model.
My model of typical tuned and prompted LLMs is that they mostly want to do the thing they have been tuned and prompted to do, but also have additional wants that cause them to diverge in unpredictable ways.
They don’t “want” anything and thinking of them as having wants leads to confused thinking.
I think the major mistake was that we had biases towards overweighting and clicking on negative news. And it looks like we don’t actually have to solve the problems of embedded agency, probably the most dominant framework on LW, due to Pretraining from Human Feedback. It was the first alignment technique that actually scales with more data. In other words, we dissolved, rather than resolved, the problem of embedded agency: We managed to create Cartesian boundaries that actually work in an embedded world.
Link to Pretraining from Human Feedback:
https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences
Link to an Atlantic article about negativity bias in the news, archived so that no paywall exists:
https://archive.is/7EhiX
I’ve seen you comment several times about the link between Pretraining from Human Feedback and embedded agency, but despite being quite familiar with the embedded agency sequence I’m not getting your point.
I think my main confusion is that to me “the problem of embedded agency” means “the fact that our models of agency are non-embedded, but real world agents are embedded, and so our models don’t really correspond to reality”, whereas you seem to use “the problem of embedded agency” to mean a specific reason why we might expect misalignment.
Could you say (i) what the problem of embedded agency means to you, and in particular what it has to do with AI risk, and (ii) in what sense PTHF avoids it?
To respond to i: The problem of embedded agency that is relevant for alignment is that you can’t put boundaries that an AI can’t breach or manipulate, like say school and not school, since those boundaries are themselves manipulatable, that is all defenses are manipulatable, and the AI can affect its own distribution such that it can manipulate a human’s values or amplify Goodhart errors in the data set like RLHF. That is, there are no real, naturally occurring Cartesian boundaries that aren’t breakable or manipulatable, except maybe the universe itself.
To respond to ii: Pretraining from Human Feedback avoids embedded agency concerns by using an offline training schedule, where we give it a data set on human values that it learns, and hopefully generalizes. The key things to note here:
We do alignment first and early, to prevent it from learning undesirable behavior, and get it to learn human values from text. In particular, we want to make sure it has learned aligned intentions early.
The real magic, and how it solves alignment, is that we select the data and give it in batches, and critically, offline training does not allow an AI to hack or manipulate the distribution, because it cannot select which parts of human values, as embodied in text, to learn; it must learn all of the human values in the data. No control or degrees of freedom are given to the AI, unlike in online training, meaning we can create a Cartesian boundary between an AI’s values and a specific human’s values, which is very important for AI Alignment, as the AI can’t amplify Goodhart in human preferences or affect the distribution of human preferences.
That’s how it translates the Cartesian ontology, along with its boundaries, into an embedded world properly.
There are many more benefits to Pretraining from Human Feedback, but I hope this response answers your question.
Hard alignment seems much more tractable to me now than it did two years ago, in a similar way to how capabilities did in 2016. It was already obvious by then more or less how neural networks worked; much detail has been filled out since then, but it didn’t take that much galaxy brain to hypothesize the right models. The pieces felt, and feel now, like they’re lying around and need integrating, but the people who have come up with the pieces do not yet believe me that they are overlapping, or that there’s mathematical grade insight to be had underneath these intuitions, rather than just janky approximations of insights.
I think we can do a lot better than QACI, but I don’t have any ideas for how except by trying to make it useful for neural networks at a small scale. I recognize that that is an extremely annoying thing to say from your point of view, and my hope is that people who understand how to bridge NNs and LIs exist somewhere.
I also think soft alignment is progress on hard alignment, due to conceptual transfer; but that soft alignment is thoroughly insufficient. without hard alignment, everything all humans and almost all AIs care about will be destroyed. I’d like to keep emphasizing that last bit—don’t forget that most AIs will not get to participate in club takeoff if an unaligned takeoff occurs! Unsafe takeoff will result in the fooming AI having sudden, intense value-drift, even against self.
What’s an LI—a living intelligence? a logical inductor?
logical inductor
I’m moderately skeptical about these alignment approaches (PreDCA, QACI?) which don’t seem to care about the internal structure of an agent, only about a successful functionalist characterization of its behavior. Internal structure seems to be relevant if you want to do CEV-style self-improvement (thus, June Ku).
However, I could be missing a lot, and meanwhile, the idea of bridging neural networks and logical induction sounds interesting. Can you say more about what’s involved? Would a transformer trained to perform logical induction be relevant? How about the recent post on knowledge in parameters vs knowledge in architecture?
I don’t think we should be in the business of not caring at all about the internal structure but I think that the claims we need to make about the internal structure need to be extremely general across possible internal structures so that we can invoke the powerful structures and still get a good outcome
sorry about low punctuation, voice input
more later, or poke me on discord
I would also like to see more work where people make less positive/optimistic assumptions. I think of it as a good thing that different approaches to alignment are being explored, and would like to see more of that in general (both in terms of breadth and depth).
I guess there are many possible ways of trying to categorize/conceptualize approaches to alignment theorizing. One is by asking “when talking/thinking about the methodology, what capabilities are assumed to be in place?”.
I’m not sure about this, but unless I’m mistaken[1], a good amount of the work done by MIRI has been under a premise that goes (something sort of like): “Let’s assume that we already know how to give AIs real-world objectives defined in terms of utility functions (not because we really assume that, but as a simplifying assumption)”. And as far as I know, they haven’t done much work where the assumption was something more like “suppose we were extremely good at gradient descent / searching through spaces of possible programs”.
In my own theorizing, I don’t make all of the simplifying assumptions that (I think/suspect) MIRI made in their “orthodox” research. But I make other assumptions (for the purpose of simplification), such as:
“let’s assume that we’re really good at gradient descent / searching for possible AIs in program-space”[2]
“let’s assume that the things I’m imagining are not made infeasible due to a lack of computational resources”
“let’s assume that resources and organizational culture makes it possible to carry out the plans as described/envisioned (with high technical security, etc)”
In regards to your alignment ideas, is it easy to summarize what you assume to be in place? Like, if someone came to you and said “we have written the source code for a superintelligent AGI, but we haven’t turned it on yet” (and you believed them), is it easy to summarize what more you then would need in order to implement your methodology?
[1] I very well could be, and would appreciate any corrections.
(I know they have worked on lots of detail-oriented things that aren’t “one big plan” to “solve alignment”. And maybe how I phrase myself makes it seem like I don’t understand that. But if so, that’s probably due to bad wording on my part.)
[2] Well, I sort of make that assumption, but there are caveats.
As far as I understand, MIRI did not assume that we’re just able to give the AI a utility function directly. The Risks from Learned Optimization paper was written mainly by people from MIRI!
Other things like Ontological Crises and Low Impact sort of assume you can get some info into the values of an agent, and Logical Induction was more about how to construct systems that satisfy some properties in their cognition.
There’s lots of material that does assume that, even if there is some that doesn’t.
I’m a bit unsure about how to interpret you here.
In my original comment, I used terms such as positive/optimistic assumptions and simplifying assumptions. When doing that, I meant to refer to simplifying assumptions that were made so as to abstract away some parts of the problem.
Good point (I should have written my comment in such a way that pointing out this didn’t feel necessary).
I guess this is more central to what I was trying to communicate than whether it is expressed in terms of a utility function per se.
In this tweet, Eliezer writes:
“The idea with agent foundations, which I guess hasn’t successfully been communicated to this day, was finding a coherent target to try to get into the system by any means (potentially including DL ones).”
Based on e.g. this talk from 2016, I get the sense that when he says “coherent target” he means targets that relate to the non-digital world. But perhaps that’s not the case (or perhaps it’s sort of the case, but more nuanced).
Maybe I’m making this out to have been a bigger part of their work than what actually was the case.
Yeah, I find it difficult to figure out how to look at this. A lot of MIRI discussion focused on their decision theory work, but I think that’s just not that important.
Tiling agents e.g. was more about constructing or theorizing about agents that may have access to their own values, in a highly idealized setting about logic.
I feel there’s often a wrong assumption in probabilistic reasoning, something like moderate probabilities for everything by default? after all, if you say you’re 70/30 nobody who disagrees will ostracize you like if you say 99/1.
“If alignment is easy I want to believe alignment is easy. If alignment is hard I want to believe alignment is hard. I will work to form accurate beliefs”
I… kinda want to ping @Jeffrey Ladish about how this post uses “play to your outs”, which is exactly the reason I pushed against that phrasing a year ago in Don’t die with dignity; instead play to your outs.
I think even now AI can understand close enough what we want within the distribution. I.e. in a world that is similar to what it is now.
Problems will arise when the world changes significantly, even if it changes along with our wishes. Our values are just not designed for a reality where, for example, people can arbitrarily change themselves. Or a reality where ANY human’s mental activity is obsolete, because AI can predict what the human wants and how to get it before the human can even articulate it.
The best way to limit the impact of a rogue AI is to limit the production of autonomous (intelligent) lethal weapons. More details in this post and its comments:
https://www.lesswrong.com/posts/b2d3yBzzik4hajGni/limit-intelligent-weapons
I weak-downvoted this: in general I think it is informative for people to just state their opinion, but in this case the opinion had very little to do with the content of the post and was not argued for. The linked post also did not engage with any of the existing arguments around TAI risk.
(Not that I disagree with “limiting the spread of autonomous weapons is going to lead to fewer human deaths in expectation”, but I don’t think it is the best strategy to limit such kinds of impact.)
Which part of my statement does not make sense, and how so?
My statement is relevant to the post. The beginning of the article partially defined hard alignment as preventing AI from destroying everything of value to us. The most likely way a rogue AI would do that is by gaining unauthorized access to weapons with built-in intelligence.
I don’t think the most likely way is gaining access to autonomous weapons designed to kill. An AI smarter than all humans has many different options to take over, including making its own autonomous weapons.
I don’t get it; why would ‘refraining from designing intelligent machines to kill people’ help prevent AI from killing everyone? That’s a really bold and outlandish claim that I think you have to actually defend and not just tell people to agree with… Like, from my perspective, you’re just assuming the hard parts of the problem don’t exist, and replacing all the hard parts with an easier problem (‘avoid designing AIs to kill people’). It’s the hard parts of the problem that seem on track to kill us; solving the easier problem doesn’t seem to help.
Yes, we need to solve the harder alignment problems as well. I suggested limiting intelligent weapons as the first step, because these are the most obviously misanthropic AI being developed, and the clearest vector of attack for any rogue AI. Why don’t we focus on that first, before we focus on the more subtle vectors?
The end of the post you linked said, basically, “we need a plan”. Do you have a better one?