I haven’t thought much about what patterns need to hold in the environment in order for “do what I mean” to make sense at all. But it’s a natural next target in this list, so I’m including it as an exercise for readers: what patterns need to hold in the environment for “do what I mean” to make sense at all? Note that either necessary or sufficient conditions on such patterns can constitute marginal progress on the question.
As far as I can tell, DWIM will necessarily require other-agent modeling in some sort of predictive-coding framework. The “patterns in the environment” would be the correspondence between the actual state of the world and the representation of the desired goal state in the mind of the human, as well as between the trajectory taken to reach the goal state and the human’s own internal acceptance criteria.
A part of the AGI not hooked up to the reward signal would need to have a generative model of the human agent’s behavior, words, commands, etc., derived from a latent representation of their beliefs and desires. This latent representation is constantly updated to minimize prediction error based on observation, verbal feedback, etc. (e.g., Human: “That’s not what I meant!” AGI: “Hmm, what must be going on inside their head to make them say that, given the state of the environment and prior knowledge about their preferences, and how does that differ from what I was assuming?”)
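Here is a minimal sketch of that belief-update loop in Python/numpy. Everything in it is illustrative: the dimensions, the hypothetical linear decoder `predict_feedback`, and the gradient step in `update_belief` are stand-ins for whatever learned generative model the AGI would actually have, not part of any existing system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: a latent belief about the human's desires, and
# observed feedback (verbal corrections, reactions) embedded as vectors.
LATENT_DIM, OBS_DIM = 16, 32

# Hypothetical generative model: a fixed linear map from the latent belief
# to a predicted feedback embedding (a stand-in for a learned decoder).
W_gen = rng.normal(size=(OBS_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)

def predict_feedback(belief):
    """What feedback we expect the human to give, given our belief about them."""
    return W_gen @ belief

def update_belief(belief, observed_feedback, lr=0.1):
    """One predictive-coding-style step: nudge the latent belief to shrink the
    prediction error between expected and observed human feedback."""
    error = observed_feedback - predict_feedback(belief)   # e.g. "That's not what I meant!"
    grad = W_gen.T @ error                                  # descent direction on 0.5*||error||^2
    return belief + lr * grad, float(np.linalg.norm(error))

# Toy run: the belief drifts toward whatever latent state best explains
# the stream of observed feedback.
belief = np.zeros(LATENT_DIM)
true_latent = rng.normal(size=LATENT_DIM)   # the human's actual (hidden) desires
for step in range(50):
    observation = W_gen @ true_latent + 0.05 * rng.normal(size=OBS_DIM)
    belief, err = update_belief(belief, observation)
print("final prediction error:", err)
```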
At the same time, the AGI needs some latent representation of the environment and the paths taken through it that maps (via a linear mapping) into the same latent space it uses for representing the human’s desires. Correspondence can then be measured and optimized for directly.
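As a sketch of what “measured and optimized for directly” could look like, the snippet below scores candidate end states by cosine similarity between a linearly mapped environment latent and the inferred desire latent. The map `W_map`, the dimensions, and the `correspondence` scoring function are all hypothetical placeholders for learned components; the point is only that the comparison lives in a shared latent space.

```python
import numpy as np

rng = np.random.default_rng(1)

ENV_DIM, SHARED_DIM = 24, 16

# Hypothetical learned linear map from the AGI's environment/trajectory latent
# into the same latent space it uses to represent the human's desires.
W_map = rng.normal(size=(SHARED_DIM, ENV_DIM)) / np.sqrt(ENV_DIM)

def correspondence(env_latent, desired_goal_latent):
    """Cosine similarity between the mapped world/trajectory state and the
    inferred goal representation; higher means 'closer to what they meant'."""
    mapped = W_map @ env_latent
    return float(mapped @ desired_goal_latent /
                 (np.linalg.norm(mapped) * np.linalg.norm(desired_goal_latent) + 1e-9))

# Toy usage: score candidate plans by how well their predicted end states
# correspond to the inferred desire latent, then pick the best one.
desired = rng.normal(size=SHARED_DIM)
candidate_end_states = [rng.normal(size=ENV_DIM) for _ in range(5)]
scores = [correspondence(s, desired) for s in candidate_end_states]
best = int(np.argmax(scores))
print("best candidate:", best, "score:", round(scores[best], 3))
```

The same score could in principle be applied to whole trajectories rather than end states, which is where the human’s “internal acceptance criteria” about the path taken would enter.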