[Question] What are the known difficulties with this alignment approach?

Assume you have a world-model that is nicely factored into spatially localized variables that contain interesting-to-you concepts. (Yes, that’s a big assumption, but are there any known difficulties with the proposal if we grant this assumption?)

Pick some Markov blanket (which contains some actuators) as the bounds for your AI intervention.
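To make "Markov blanket" concrete, here is a minimal sketch that computes one for a chosen set of interior variables in a toy causal DAG. The graph, the variable names, and the helper `markov_blanket` are all hypothetical, made up only for illustration. It uses the standard DAG definition: parents, children, and the children's other parents.

```python
def markov_blanket(parents, inside):
    """Markov blanket of the node set `inside` in a DAG.

    `parents` maps each node to the set of its parents.
    Returns parents, children, and co-parents of children, minus `inside`.
    """
    inside = set(inside)
    # Children: nodes outside `inside` with at least one parent in it.
    children = {n for n, ps in parents.items() if ps & inside and n not in inside}
    blanket = set(children)
    for n in inside:
        blanket |= parents.get(n, set())   # parents of interior nodes
    for c in children:
        blanket |= parents[c]              # co-parents of the children
    return blanket - inside


# Hypothetical toy world: an actuator and a water supply feed a reactor core,
# which (together with grid demand) determines delivered electricity.
parents = {
    "water": set(),
    "actuator": set(),
    "weather": set(),
    "grid_demand": {"weather"},
    "reactor_core": {"water", "actuator"},
    "electricity": {"reactor_core", "grid_demand"},
}

blanket = markov_blanket(parents, {"reactor_core"})
```

In this toy graph the blanket contains the actuator, as the proposal requires, while `weather` lies outside it.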

Represent your goals as a causal graph (or computer program, or whatever) that fits within these bounds. For instance, if you want a fusion power plant, represent it as something that takes in water and produces helium and electricity.

Perform a Pearlian counterfactual surgery: cut out the variables within the Markov blanket and replace them with a program representing your high-level goal. Then optimize the action variables so that the behavior of the actual graph matches the behavior of the counterfactual graph.
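The steps above can be sketched on a toy structural causal model. Everything here is an assumption made for illustration: the variable names, the linear mechanisms, the squared-error mismatch, and the brute-force search over candidate actions. It is one possible reading of the proposal, not a definitive implementation.

```python
def run_model(mechanisms, order, exogenous):
    """Evaluate variables in topological order from the exogenous values."""
    values = dict(exogenous)
    for name in order:
        values[name] = mechanisms[name](values)
    return values


def do_surgery(mechanisms, replacements):
    """Pearl-style surgery: swap selected mechanisms, keep the rest."""
    out = dict(mechanisms)
    out.update(replacements)
    return out


# Factual world (hypothetical): the interior depends on the actuator setting.
factual = {
    "interior": lambda v: v["action"] * v["water"],
    "electricity": lambda v: 2.0 * v["interior"],
}
order = ["interior", "electricity"]


# Goal program (hypothetical): what we wish the interior did with the water.
def goal_interior(v):
    return 0.5 * v["water"]


# Counterfactual world: same graph, interior replaced by the goal program.
counterfactual = do_surgery(factual, {"interior": goal_interior})


def mismatch(action, water_samples):
    """Squared difference in observable output between the two worlds."""
    total = 0.0
    for w in water_samples:
        inputs = {"water": w, "action": action}
        real = run_model(factual, order, inputs)
        target = run_model(counterfactual, order, inputs)
        total += (real["electricity"] - target["electricity"]) ** 2
    return total


def best_action(candidates, water_samples):
    """Pick the candidate action whose factual behavior best matches the goal."""
    return min(candidates, key=lambda a: mismatch(a, water_samples))


candidates = [i / 10 for i in range(11)]
water_samples = [1.0, 2.0, 3.0]
chosen = best_action(candidates, water_samples)
```

In this sketch the mismatch is measured only on the output variable `electricity`. Which variables get compared, and over which inputs, is a design choice the proposal leaves open.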
