To respond to i: The problem of embedded agency that is relevant for alignment is that you can’t draw boundaries an AI can’t breach or manipulate, say between school and not-school, because those boundaries are themselves manipulable; all defenses are manipulable, and the AI can affect its own training distribution, letting it manipulate a human’s values or amplify Goodhart errors in the data set, as in RLHF. In other words, there are no real, naturally occurring Cartesian boundaries that can’t be broken or manipulated, except maybe the universe itself.
To respond to ii: Pretraining from Human Feedback avoids embedded agency concerns by using an offline training schedule: we give the AI a data set on human values, which it learns and, hopefully, generalizes. The key things to note here:
We do alignment first and early, to prevent the AI from learning undesirable behavior and to get it to learn human values from text. In particular, we want to make sure it has learned aligned intentions early.
The real magic, and how it solves alignment, is that we select the data and give it to the AI in batches. Critically, offline training does not let the AI hack or manipulate the distribution: it cannot select which parts of human values, as embodied in text, it learns; it must learn all of the human values in the data. No control or degrees of freedom are given to the AI, unlike in online training, so we can create a Cartesian boundary between the AI’s values and a specific human’s values. This is very important for AI alignment, because the AI can’t amplify Goodhart errors in human preferences or affect the distribution of human preferences (see the sketch below for the structural difference between the two schedules).
That’s how it translates the Cartesian ontology, along with its boundaries, into an embedded world properly.
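To make the structural difference concrete, here is a minimal toy sketch in Python. The names and the "training" arithmetic are purely hypothetical placeholders, not the actual Pretraining from Human Feedback implementation; the only point is the data flow. In the offline loop every batch is fixed before training begins, so the model's behavior has no channel back into what it is trained on; in the online (RLHF-like) loop the model's own outputs determine the next batch, which is exactly the feedback channel through which the distribution could be shifted or Goodhart errors amplified.

```python
import random

def offline_training(params, fixed_batches, update):
    """Offline schedule: every batch was selected by humans before training
    began, so nothing the model does can change what it is trained on."""
    for batch in fixed_batches:
        params = update(params, batch)
    return params

def online_training(params, generate, feedback, update, steps):
    """Online schedule (RLHF-like): the model's own outputs determine the
    next batch, so the data distribution depends on the model itself."""
    for _ in range(steps):
        samples = generate(params)        # data now depends on the model...
        scores = feedback(samples)        # ...and feedback depends on that data
        params = update(params, [s * r for s, r in zip(samples, scores)])
    return params

# Toy stand-ins so the sketch runs end to end (placeholders, not PHF).
fixed_batches = [[1.0, 0.9], [0.8, 1.1], [1.05, 0.95]]          # human-curated
update = lambda p, batch: 0.9 * p + 0.1 * (sum(batch) / len(batch))
generate = lambda p: [p + random.gauss(0.0, 0.1) for _ in range(4)]
feedback = lambda samples: [1.0 for _ in samples]

print("offline:", offline_training(0.0, fixed_batches, update))
print("online: ", online_training(0.0, generate, feedback, update, steps=3))
```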
There are many more benefits to Pretraining from Human Feedback, but I hope this response answers your question.