given how AI capabilities are going, it’s not unreasonable for people to start playing their outs — that is to say, to start acting as if alignment is easy, because if it’s not we’re doomed anyways. but i think, in this particular case, this is wrong.
Alternatively, it's looking to me like the hard alignment problem is just based on fundamentally mistaken models of the world. It's not about playing our outs; it's that it doesn't seem like we live in a hard-alignment world.
let’s not give up on the assumptions which are true. there is still work that can be done to actually generate some dignity under the assumptions that are actually true.
They’re looking more false by the day.
Yeah, this is starting to make a lot more sense to me. It seems like a huge mistake to evaluate the complexity of a utility function by its Kolmogorov complexity rather than by how hard it is for the AGI to implement in terms of its internal concept language. Magical categories don't seem that magical anymore; simply predicting the next tokens is enough to give you robust abstractions about human values.
I don't see how you arrive at these conclusions at all. I agree that, given how alignment of the current models works, there's some vague hope that things might keep going like this even as capabilities increase. Is there any specific thing that makes you update more strongly?
The hard problem of alignment is going to hit us like a train in 3 to 12 months, at the same time as some specific capabilities breakthroughs that people have been working on for the entire history of ML finally start working, now that they have a weak AGI to apply them to. Suddenly Critch's stuff becomes super duper important to understand.
What Critch stuff do you have in mind?
Modal Fixpoint Cooperation without Löb’s Theorem
Löbian emotional processing of emergent cooperation: an example
«BOUNDARIES» SEQUENCE
Really, just skim his work. He’s been thinking well about the hard problems of alignment for a while.
Well, it looks to me like AI will soon understand our values at least as well as we do. I think it's far more likely that AI goes wrong by understanding completely what we want and not wanting to do it than by the paperclip route.
That is the paperclip route. A superintelligent paperclip optimizer understands what we want, because it is superintelligent, but it wants to make “paperclips” instead.
Yes, but the question of whether pretrained LLMs have good representations of our values and/or preferences and of the concept of deference/obedience is still quite important for whether they become aligned. If they don't, then aligning them via fine-tuning after the fact seems quite hard. If they do, it seems pretty plausible to me that e.g. RLHF fine-tuning, or something like Anthropic's constitutional AI, finds the solution of "link the values/obedience representations to the output in a way that causes aligned behavior," because this is simple and attains lower loss than misaligned solutions. This in turn is because, in order for the model to be misaligned and still attain low loss, it must be deceptively aligned, and deceptive alignment requires a combination of good situational awareness, a fully consequentialist objective, and high-quality planning/deception skills.
What does GPT want?
I don’t know.
My model of foundational LLMs, before tuning and prompting, is that they want to predict the next token, assuming that the token stream is taken from the hypothetical set that their training data is sampled from. Their behavior out of distribution is not well-defined in this model.
My model of typical tuned and prompted LLMs is that they mostly want to do the thing they have been tuned and prompted to do, but also have additional wants that cause them to diverge in unpredictable ways.
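To make "want to predict the next token" concrete, here's a minimal sketch of the pretraining objective I have in mind, in PyTorch-style code; the `model` call is a stand-in for any causal LM forward pass that returns per-position logits, not any particular library's API:

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len) chunk drawn from the training stream.
    logits = model(token_ids[:, :-1])         # predicted distribution at each position
    targets = token_ids[:, 1:]                # the token that actually came next
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (positions, vocab)
        targets.reshape(-1),
    )
```

A base model is trained only to drive this loss down on streams drawn from its training distribution, which is exactly why this picture says nothing about behavior off that distribution.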
They don’t “want” anything and thinking of them as having wants leads to confused thinking.
The major mistakes, I think, were that we had biases toward overweighting and clicking on negative news, and that, thanks to Pretraining from Human Feedback, it looks like we don't actually have to solve the problems of embedded agency, probably the most dominant framework on LW. It was the first alignment technique that actually scales with more data. In other words, we dissolved, rather than resolved, the problem of embedded agency: we managed to create Cartesian boundaries that actually work in an embedded world.
Link to Pretraining from Human Feedback:
https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences
Link to an Atlantic article about negativity bias in the news, archived so that no paywall exists:
https://archive.is/7EhiX
I’ve seen you comment several times about the link between Pretraining from Human Feedback and embedded agency, but despite being quite familiar with the embedded agency sequence I’m not getting your point.
I think my main confusion is that to me “the problem of embedded agency” means “the fact that our models of agency are non-embedded, but real world agents are embedded, and so our models don’t really correspond to reality”, whereas you seem to use “the problem of embedded agency” to mean a specific reason why we might expect misalignment.
Could you say (i) what the problem of embedded agency means to you, and in particular what it has to do with AI risk, and (ii) in what sense PTHF avoids it?
To respond to (i): The problem of embedded agency that is relevant for alignment is that you can't put up boundaries that an AI can't breach or manipulate (say, between school and not-school), since those boundaries are themselves manipulable; all defenses are manipulable, and the AI can affect its own distribution such that it can manipulate a human's values or amplify Goodhart errors in the data set, as in RLHF. That is, there are no real, naturally occurring Cartesian boundaries that aren't breakable or manipulable, except maybe the universe itself.
To respond to (ii): Pretraining from Human Feedback avoids embedded agency concerns by using an offline training schedule, where we give the model a data set on human values that it learns from and, hopefully, generalizes. The key things to note here:
We do alignment first and early, to prevent the model from learning undesirable behavior and to get it to learn human values from text. In particular, we want to make sure it has learned aligned intentions early.
The real magic, and how it solves alignment, is that we select the data and give it to the model in batches. Critically, offline training does not allow the AI to hack or manipulate the distribution: it cannot select which parts of human values, as embodied in text, it learns; it must learn all of the human values in the data. No control or degrees of freedom are given to the AI, unlike in online training, meaning we can create a Cartesian boundary between the AI's values and a specific human's values, which is very important for AI alignment, as the AI can't amplify Goodhart errors in human preferences or affect the distribution of human preferences.
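To make the offline vs. online distinction concrete, here is a minimal, simplified sketch of the conditional-training flavor of Pretraining from Human Feedback as I understand it; the names and the document-level good/bad tagging are illustrative, not the paper's actual implementation:

```python
# The property doing the work: the corpus is scored and frozen *before* training,
# so the model being trained has no influence over which data it sees.
GOOD, BAD = "<|good|>", "<|bad|>"

def build_offline_dataset(raw_texts, preference_score, threshold=0.5):
    # preference_score is any fixed scorer (human labels, a reward model);
    # it never consults the model that will be trained.
    return [(GOOD if preference_score(t) >= threshold else BAD) + t
            for t in raw_texts]

def train_offline(model, tagged_texts, update_step):
    # Ordinary next-token training over the frozen, pre-tagged corpus, in batches.
    # Contrast with online setups (e.g. RLHF loops), where the model's own outputs
    # feed back into the data it is trained on.
    for text in tagged_texts:
        update_step(model, text)
```

At sampling time you would condition on the good token; the point here is just that data selection happens entirely upstream of the model.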
That's how it properly translates the Cartesian ontology, and its boundaries, into an embedded world.
There are many more benefits to Pretraining from Human Feedback, but I hope this response answers your question.