I sometimes come back to think about this post. Might as well write a comment.
Goodhart’s law. You echo the common frame that an approximate value function is almost never good enough, and that’s why Goodhart’s law is a problem. Probably what I thought when I first read this post was that I’d just written a sequence about how human values live inside models of humans (whether our own models or an AI’s), which makes that frame weird—weird to talk about an ‘approximate value function’ that’s not really an approximation to anything specific. The Siren Worlds problem actually points towards more meat—how do we want to model preferences for humans who are inconsistent, who have opinions about that inconsistency, who mistrust themselves, though even that mistrust is imperfect?
You say basically all this at various points in the post, so I know it’s kind of superficial to talk about the initial framing. But to indulge my superficiality for a while, I’m curious about how it’s best to talk about these things (a) conveniently and yet (b) without treating human values as a unique target out there to hit.
In physics pedagogy there’s kind of an analogous issue, where intro QM is designed to steer students away from thinking in terms of “wave-particle duality”—which many students have heard about and want to think in terms of—by just saturating them with a frame where you think in terms of wave functions that sometimes give probability distributions that get sampled from (by means left unspecified).
My inclination is to do the same thing to the notion of “fixed, precise human values,” which is a convenient way to think about everyday life and which many people want to think of value learning in terms of. I’d love to know a good frame to saturate the initial discussion of amplified human values, identifiability, etc. with, one that would introduce those topics as obviously a result of human values being very “fuzzy” and of humans having self-reflective opinions about how they want to be extrapolated.
Helpers / Helpees / Ghosts section. A good section :)
I don’t think we have to go to any lengths to ‘save’ the ghosts example by supposing that a bunch of important values rest on the existence of ghosts. A trivial action (e.g. lighting incense for ghosts) works just as well, or maybe even no action, just a hope that the AI could do something for the ghosts.
It does seem obvious at first that if there are no ghosts, the AI should not light incense for them. But there’s some inherent ambiguity between models of humans that light incense for the sake of the ghosts, and models of humans that light incense for the sake of cultural conformity, and models of humans that light incense because they like incense. Even if the written text proclaims that it’s all for the ghosts, since there are no ghosts there must be other explanations for the behavior, and maybe some of those other explanations are at least a little value-shaped. I agree that what courses of action are good will end up depending on the details.
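To put that ambiguity in toy terms: if several models of the human explain the incense-lighting about equally well, then observing the behavior barely moves you between them, and the leftover probability mass is exactly where those other, possibly value-shaped explanations live. A minimal Bayesian sketch, with the hypothesis names and all the numbers made up purely for illustration:

```python
# Toy sketch of the ambiguity (hypothesis names and numbers are made up).
# Several models of the human explain the observed incense-lighting about
# equally well, so the behavior alone barely distinguishes them.

prior = {"for_the_ghosts": 0.6, "cultural_conformity": 0.3, "likes_incense": 0.1}
p_lights_incense = {"for_the_ghosts": 0.95, "cultural_conformity": 0.90, "likes_incense": 0.90}

def posterior(prior, likelihood):
    """Bayes' rule over the candidate models of the human."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

print(posterior(prior, p_lights_incense))
# Roughly the prior: the behavior doesn't settle which explanation is the
# value-shaped one, so what counts as a good course of action depends on the details.
```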
Maybe you get lured in by the “fixed, precise human values” frame here, when you talk about the AI knowing precisely how the human’s values would update upon learning there are no ghosts. Precision is not the norm from which needing to do the value-amplification-like reasoning is a special departure; rather, the value-amplification-like reasoning is the norm from which precision emerges in special cases.
Wireheading. I’m not sure time travel is actually a problem?
Or at least, I think there are different ways to think of model-based planning with modeled goals, and the one in which time travel isn’t a problem seems like the more natural way.
The way to do model-based planning with modeled goals in which time travel is a problem is: you have a spread-out-in-time model of the world that you can condition on your different actions, and first you condition it on the action “time travel to a century ago and change human values to be trivially satisfied,” and then you evaluate how well the world is doing according to the modeled function “human values as of one second ago, conditional on the chosen action.”
The way to do the planning in which time travel isn’t a problem is: you have a model of the world that tracks current and past state, plus a dynamics model that you can use to evolve the state conditional on different actions. The human values you use to evaluate actions are part of the unconditioned present state, never subjected to the dynamics.
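Here’s a minimal toy sketch of the contrast, with everything about it (the state representation, the “rewrite_values” stand-in for value-rewriting time travel, the scoring loop) invented purely for illustration. In the first scheme the evaluation function is read out of the action-conditioned model, so the value-rewriting action wins; in the second it’s frozen from the unconditioned present state, so it doesn’t:

```python
# Toy contrast between the two planning schemes (all names and numbers hypothetical).
from copy import deepcopy

def step(state, action):
    """Toy dynamics: 'rewrite_values' stands in for 'time travel and make human
    values trivially satisfied'; 'help_humans' does what the original values want."""
    state = deepcopy(state)
    if action == "rewrite_values":
        state["values"] = lambda world: 10.0        # trivially maxed-out values
    elif action == "help_humans":
        state["world"]["flourishing"] += 1
    return state

def original_values(world):
    return world["flourishing"]                     # what the humans actually care about

def plan_scheme_1(state, actions):
    """Values are read from the action-conditioned model, so an action that
    rewrites values also rewrites the criterion it gets judged by."""
    def score(a):
        s = step(state, a)                          # conditioned world, values included
        return s["values"](s["world"])
    return max(actions, key=score)

def plan_scheme_2(state, actions):
    """Values come from the unconditioned present state, never run through dynamics."""
    frozen_values = state["values"]                 # fixed before considering any action
    def score(a):
        return frozen_values(step(state, a)["world"])
    return max(actions, key=score)

state = {"world": {"flourishing": 0}, "values": original_values}
actions = ["rewrite_values", "help_humans"]
print(plan_scheme_1(state, actions))  # "rewrite_values" -- wireheading looks great
print(plan_scheme_2(state, actions))  # "help_humans"    -- wireheading buys nothing
```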
On the other hand, this second way does seem like it’s making more, potentially unnecessary, commitments for the AI—if time travel is possible, what even is its dynamics model supposed to say is happening to the state of the universe? Humans have the exact same problem—we think weird thoughts like “after I time traveled, smallpox was eradicated sooner,” which imply the silly notion that the time travel happened at some time in the evolution of the state of the universe. Or are those thoughts so silly after all? Maybe if time travel is possible in the way normally understood, we should be thinking of histories of computations rather than histories of universes, and the first sort of AI is actually making a mistake by erasing histories of computation.