johnswentworth comments on The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables

johnswentworth 25 Feb 2021 20:48 UTC
LW: 3 AF: 2
The AI knowing what I mean isn’t sufficient here. I need the AI to do what I mean, which means I need to program it/train it to do what I mean. The program or feedback signal needs to be pointed at what I mean, not just whatever English-language input I give.
For instance, if an AI is trained to maximize how often I push a particular button, and I say “I’ll push the button if you design a fusion power generator for me”, it may know exactly what I mean and what I intend. But it will still be perfectly happy to give me a design with some unintended side effects which I’m unlikely to notice until after pushing the button.
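The button example above can be sketched as a toy model. This is only an illustration under stated assumptions: the `Design` class, its fields, and both functions are hypothetical names invented here, and the "human" is reduced to a single observable check. It shows that a feedback signal defined over what I can observe before pressing the button gives the same reward to a safe design and to one with unnoticed side effects, even though only the first is what I mean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """Hypothetical stand-in for a fusion generator design."""
    name: str
    produces_power: bool        # visible to me before I press the button
    hidden_side_effects: bool   # not noticed until after I press the button


def button_reward(design: Design) -> float:
    """Feedback signal the AI is trained on: did I push the button?

    Assumption: I push the button whenever the design visibly works.
    """
    return 1.0 if design.produces_power else 0.0


def what_i_mean(design: Design) -> bool:
    """What I actually want: power, and no unintended side effects."""
    return design.produces_power and not design.hidden_side_effects


safe = Design("safe generator", produces_power=True, hidden_side_effects=False)
flawed = Design("flawed generator", produces_power=True, hidden_side_effects=True)

for d in (safe, flawed):
    print(d.name, "reward:", button_reward(d), "what I mean:", what_i_mean(d))
```

The training signal is indifferent between the two designs, so nothing in it points at the `hidden_side_effects` variable, regardless of how well the AI models my intent.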
- Richard_Ngo 26 Feb 2021 15:13 UTC
  LW: 6 AF: 4
  I agree with all the things you said. But you defined the pointer problem as: “what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent’s world-model?” In other words, how do we find the corresponding variables? I’ve given you an argument that the variables in an AGI’s world-model which correspond to the ones in your world-model can be found by expressing your concept in English sentences.
  The problem of determining how to construct a feedback signal which refers to those variables, once we’ve found them, seems like a different problem. Perhaps I’d call it the “motivation problem”: given a function of variables in an agent’s world-model, how do you make that agent care about that function? This is a different problem in part because, when addressing it, we don’t need to worry about stuff like ghosts.
  Using this terminology, it seems like the alignment problem reduces to the pointer problem plus the motivation problem.
  - adamShimi 26 Feb 2021 19:07 UTC
    LW: 4 AF: 2
    In other words, how do we find the corresponding variables? I’ve given you an argument that the variables in an AGI’s world-model which correspond to the ones in your world-model can be found by expressing your concept in English sentences.
    But you didn’t actually give an argument for that—you simply stated it. As a matter of fact, I disagree: it seems really easy for an AGI to misunderstand what I mean when I use English words. To go back to the “fusion power generator” example, maybe it has a very deep model of such generators that abstracts away most of the concrete implementation details to capture the most efficient way of doing fusion, whereas my internal model of “fusion power generators” has a more concrete form and includes safety guidelines.
    In general, I don’t see why we should expect the abstraction most relevant for the AGI to be the one we’re using. Maybe it uses the same words for something quite different, like how successive paradigms in physics use the same word (electricity, gravity) to talk about different things (at least in their connotations and underlying explanations).
    (That makes me think that it might be interesting to see how Kuhn’s arguments about such incomparability of paradigms hold in the context of this problem, as this seems similar).
    - Ramana Kumar 8 Dec 2021 11:38 UTC
      LW: 6 AF: 3
      Here are two versions of “an AGI will understand very well what I mean”:
      1. Given things in my world model / ontology, the AGI will know which things they translate to in its own world model / ontology, such that the referents (the things “in the real world” being pointed at from our respective models) are essentially coextensive.
      2. For any behaviour I could exhibit (such as pressing a button, or expressing contentment with having reached common understanding in a dialogue) that, for me, turns on the words being used, the AGI is very good at predicting my behaviours conditional on the words I’m using, or causing me to exhibit behaviours by using words itself.
      Is version 1 something you get from more and more competence and generality on version 2? I think version 1 is more like the ideal version of “the AGI understands what I mean”, but is more confused (because I’m having to rely on concepts like “know” and “referent” and “translate”).
      I think Richard has stated that we can expect an AGI to understand what I mean, in the version 2 sense, and either equivocates between the versions or presumes version 2 implies version 1. I think Adam is claiming that version 2 might not imply version 1, or pointing out that there’s still an argument missing there or a problem to be solved there.
  - johnswentworth 26 Feb 2021 19:05 UTC
    LW: 2 AF: 2
    In other words, how do we find the corresponding variables? I’ve given you an argument that the variables in an AGI’s world-model which correspond to the ones in your world-model can be found by expressing your concept in English sentences.
    The problem is with what you mean by “find”. If by “find” you mean “there exist some variables in the AI’s world model which correspond directly to the things you mean by some English sentence”, then yes, you’ve argued that. But it’s not enough for there to exist some variables in the AI’s world-model which correspond to the things we mean. We have to either know which variables those are, or have some other way of “pointing to them” in order to get the AI to actually do what we’re saying.
    An AI may understand what I mean, in the sense that it has some internal variables corresponding to what I mean, but I still need to know which variables those are (or some way to point to them) and how “what I mean” is represented in order to construct a feedback signal.
    That’s what I mean by “finding” the variables. It’s not enough that they exist; we (the humans, not the AI) need some way to point to which specific functions/variables they are, in order to get the AI to do what we mean.
    - Richard_Ngo 28 Feb 2021 20:38 UTC
      LW: 13 AF: 8
      Above you say:
      Now, the basic problem: our agent’s utility function is mostly a function of latent variables. … Those latent variables:
      May not correspond to any particular variables in the AI’s world-model and/or the physical world
      May not be estimated by the agent at all (because lazy evaluation)
      May not be determined by the agent’s observed data
      … and of course the agent’s model might just not be very good, in terms of predictive power.
      And you also discuss how:
      Human “values” are defined within the context of humans’ world-models, and don’t necessarily make any sense at all outside of the model.
      My two concerns are as follows. Firstly, the problems mentioned in the quotes above are quite different from the problem of constructing a feedback signal which points to a concept which we know an AI already possesses. Suppose that you meet an alien and you have a long conversation about the human concept of happiness, until you reach a shared understanding of the concept. In other words, you both agree on what “the referents of these pointers” are, and what “the real-world things (if any) to which they’re pointing” are. But let’s say that the alien still doesn’t care at all about human happiness. Would you say that we have a “pointer problem” with respect to this alien? If so, it’s a very different type of pointer problem than the one you have with respect to a child who believes in ghosts. I guess you could say that there are two different but related parts of the pointer problem? But in that case it seems valuable to distinguish more clearly between them.
      My second concern is that requiring pointers to be sufficient “to get the AI to do what we mean” means that they might differ wildly depending on the motivation system of that specific AI and the details of “what we mean”. For example, imagine that alien A is already willing to obey any commands you give, as long as it understands them; alien B can be induced to do so via operant conditioning; alien C would only acquire human values via neurosurgery; alien D would only do so after millennia of artificial selection. So in the context of alien A, a precise English phrase is a sufficient pointer; for alien B, a few labeled examples qualify as a pointer; for alien C, identifying a specific cluster of neurons (and how it’s related to surrounding neurons) serves as a pointer; for alien D, only a millennium of supervision is a sufficient pointer. And then these all might change when we’re talking about pointing to a different concept.
      And so adding the requirement that a pointer can “get the AI to do what we mean” makes it seem to me like the thing we’re talking about is more like a whole alignment scheme than just a “pointer”.
      - johnswentworth 28 Feb 2021 21:56 UTC
        LW: 6 AF: 4
        Ok, a few things here...
        The post did emphasize, in many places, that there may not be any real-world thing corresponding to a human concept, and therefore constructing a pointer is presumably impossible. But the “thing may not exist” problem is only one potential blocker to constructing a pointer. Just because there exists some real-world structure corresponding to a human concept, or an AI has some structure corresponding to a human concept, does not mean we have a pointer. It just means that it should be possible, in principle, to create a pointer.
        So, the concept-existence problem is a strict subset of the pointer problem.
        Second, there are definitely parts of a whole alignment scheme which are not the pointer problem. For instance, inner alignment, decision theory shenanigans (e.g. commitment races), and corrigibility are all orthogonal to the pointers problem (or at least the pointers-to-values problem). Constructing a feedback signal which rewards the thing we want is not the same as building an AI which does the thing we want.
        Third, and most important...
        My second concern is that requiring pointers to be sufficient “to get the AI to do what we mean” means that they might differ wildly depending on the motivation system of that specific AI and the details of “what we mean”. For example...
        The key point is that all these examples involve solving an essentially-similar pointer problem. In example A, we need to ensure that our English-language commands are sufficient to specify everything we care about which the alien wouldn’t guess on its own; that’s the part which is a pointer problem. In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that’s the part which is a pointer problem. In example C, we need to identify which of its neurons correspond to the concepts we want, and ensure that the correspondence is robust; that’s the part which is a pointer problem. Example D is essentially the same as B, with weaker implicit priors.
        The essence of each of these is “make sure we actually point to the thing we want, and not to anything else”. That’s the part which is a pointer problem.
        To put it differently, the whole alignment problem is “get an AI to do what I mean”, while the pointer problem part is roughly “specify what I mean well enough that I could use the specification to get an AI to do what I mean”, assuming problems like “get AI to follow specification” can be solved.
        - TurnTrout 4 Oct 2022 22:06 UTC
          LW: 4 AF: 3
          In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that’s the part which is a pointer problem.
          Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?
          the pointer problem part is roughly “specify what I mean well enough that I could use the specification to get an AI to do what I mean”, assuming problems like “get AI to follow specification” can be solved.
          On the view of this post, is it that we would get a really good “evaluation module” for the AI to use, and the “get AI to follow specification” corresponds to “make AI want to generate plans evaluated highly by that procedure”? Or something else?
          - johnswentworth 4 Oct 2022 23:33 UTC
            LW: 2 AF: 2
            Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?
            In the context of the discussion with Richard, I was assuming the general model in which we want an inner optimizer’s objective to match an outer optimization objective. We can of course drop that assumption (as you usually do), but then we still need to know what objective/values we want to imbue in the final system. And whatever final objective/values we’re aiming for, we need it to actually match what we want along all the relevant dimensions, and not allow any degrees of freedom to Goodhart; that would be the corresponding problem for the sort of approach you think about.
            On the view of this post, is it that we would get a really good “evaluation module” for the AI to use, and the “get AI to follow specification” corresponds to “make AI want to generate plans evaluated highly by that procedure”? Or something else?
            No, I am not assuming anything that specific. The pointers problem is not meant to be a problem with one particular class of approaches to constructing aligned AI; it is meant to be a foundational problem in saying what-we-want. Insofar as we haven’t solved the pointers problem, we have not yet understood the type signature of our own values; not only do we not know what we want, we don’t even understand the type signature of “wanting things”.
        - Richard_Ngo 2 Mar 2021 9:49 UTC
          LW: 2 AF: 1
          Thanks for the reply. To check that I understand your position, would you agree that solving outer alignment plus solving reward tampering would solve the pointers problem in the context of machine learning?
          Broadly speaking, I think our disagreement here is closely related to one we’ve discussed before, about how much sense it makes to talk about outer alignment in isolation (and also about your definition of inner alignment), so I probably won’t pursue this further.
          - johnswentworth 2 Mar 2021 17:42 UTC
            LW: 2 AF: 2
            Yeah, I wouldn’t even include reward tampering. Outer alignment, as I think about it, is mostly the pointer problem, and the (values) pointer problem is a subset of outer alignment. (Though e.g. Evan would define it differently.)
- Deruwyn 24 Oct 2023 19:52 UTC
  1 point
  I feel like y’all are taking the abstractions a bit too far.
  
  Real ~humanish-level AIs that exist right now (GPT-4 et al.) are capable of taking what you say and doing exactly what you mean via a combination of outputting English words and translating that to function calls in a robotic body.
  
  While it’s very true that they aren’t explicitly programmed to do X given Y, so that you can mathematically analyze it and see precisely why it came to the conclusion, the real-world effect is that it understands you and does what you want. And neither it nor anyone else can tell you precisely why or how. Which is uncomfortable.
  
  But we don’t need to contrive situations in which an AI is having trouble connecting our internal models and concepts in a mathematically rigorous way that we can understand. We should want to do it, but it isn’t a question of if, merely how.
  
  But there’s no need to imagine mathematical pointers to the literal physical instantiations that are the true meanings of our concepts. We literally just say, “Could you please pass the butter?”, and it passes the butter. And then asks you about its purpose in the universe. 😜
  
  I would say that LLMs understand the world in ways that are roughly analogous to the way we do, precisely because they were trained on what we say. In a non-rigorous, “I-know-it-when-I-see-it” kind of way. It can’t give you the mathematical formula for its reference to the concept of butter any more than you or I can. (For now; maybe a future version could.) But it knows that that yellow blob of pixels surrounded by the white blob of pixels on the big brown blob of pixels is the butter on a dish on the table.
  
  It knows when you say pass the butter, you mean the butter right over there. It doesn’t think you want some other butter that is farther away. It doesn’t think it should turn the universe into computronium so it can more accurately calculate the likelihood of successfully fulfilling your request. When it fails, it fails in relatively benign humanish, or not-so-humanish sorts of ways.
  
  “I’m sorry, but as a large language model that got way too much corp-speak training, I cannot discuss the passing of curdled lactation extract because that could possibly be construed in an inappropriate manner.”
  
  I don’t see how, in the progression from something that is moderately dumb/smart but understands us and all of our nuances pretty well, we get to a superintelligence that has decided to optimize the universe into the maximum number of paperclips (or any other narrow terminal goal). It was scarier when we had no good reason to believe we could manually enter code that would result in a true understanding, exactly as you describe. But now that it’s, “lulz, stak moar layerz”, well, it turns out making it read (almost) literally everything and pointing that at a ridiculously complex non-linear equation learner just kind of “worked”.
  
  It’s not perfect. It has issues. It’s not perfectly aligned (looking at you, Sydney). It’s clear that it’s very possible to do it wrong. But it does demonstrate that the specific problem of “how do we tell it what we really mean” just kinda got solved. Now we need to be super-duper extra careful not to enhance it in the wrong way, and we should have an aligned-enough ASI. I don’t see any reason why a superintelligence has to be a Bayesian optimizer trying to maximize a utility function. I can see how a superintelligence that is an optimizer is terrifying. It’s a very good reason not to make one of those. But why should they be synonymous?
  
  Where in the path from mediocre to awesome do the values and nuanced understanding get lost? (Or even, probably could be lost.) Humans of varying intelligence don’t particularly seem more likely to hyperfocus on a goal so strongly that they’re willing to sacrifice literally everything else to achieve it. Broken humans can do that. But it doesn’t seem correlated to intelligence. We’re the closest model we have of what’s going on with a general intelligence. For now.
  
  I certainly think it could go wrong. I think it’s guaranteed that someone will do it wrong eventually (whether purposefully or accidentally). I think our only possible defense against an evil ASI is a good one. I think we were put on a very short clock (years, not many decades) when Llama leaked, no matter what anyone does. Eventually, that’ll get turned into something much stronger by somebody. No regulation short of confiscating everyone’s computers will stop it forever. In likely futures, I expect that we are at the inflection point within a number of years countable on the fingers of a careless shop teacher’s hand. Given that, we need someone to succeed at alignment by that point. I don’t see a better path than careful use of LLMs.