Yes, some value judgements (e.g. “this movie is good”, “this song is beautiful”, or even “this is a conscious being”) depend on inscrutable brain machinery, the machinery which creates experience. The complexity of our feelings can be orders of magnitude greater than the complexity of our explicit reasoning. Does it kill the proposal in the post? I think not, for the following reason:
We aren’t particularly good at remembering exact experiences, we like very different experiences, we can’t access each other’s experiences, and we have very limited ways of controlling experiences. So there should be pretty strict limits on how much understanding of the inscrutable machinery is required to respect current human values. Defining corrigible behavior (“don’t kill everyone”, “don’t seek power”, “don’t mess with human brains”) shouldn’t require answering many specific, complicated machinery-dependent questions (“what separates good movies from bad ones?”, “what separates a good life from a bad one?”, “what separates conscious beings from unconscious ones?”).
Also, some thoughts on your specific counterexample (which I’ve generalized to be about experiences as a whole):
“How stimulating or addicting or novel is this experience?” ← I think those parameters were always comprehensible and optimizable, even in the Stone Age. (In a limited way, but still.) For example, it’s easy to get different gradations of “less addicting experiences” by getting injured, starving, or not sleeping.
“How ‘good’ is this experience in a more nebulous or normative way?” ← I think this is a more complicated value (aesthetic taste), based on simpler values.
Note that I’m using “easy to comprehend” in the sense of “the thing behaves in a simple way most of the time”, not in the sense of “it’s easy to comprehend why the thing exists” or “it’s easy to understand the whole causal chain related to the thing”. I think the latter senses are not useful for a simplicity metric, because they would mark everything as equally incomprehensible.
Note that “I care about taste experiences” (A), “I care about particular chemicals giving particular taste experiences” (B), and “I care about preserving the status quo connection between chemicals and taste experiences” (C) are all different things. B can be much more complicated than C: B might require knowledge of chemistry, while C doesn’t.
Does any of the above help to find the crux of the disagreement or understand the intuitions behind my claim?
I think the crux might be this: the ability to sample from a distribution at the points we can reach does not imply that we know anything else about the distribution.
So I agree with you that we can sample and evaluate. We can tell whether a food we have made is good or bad, and we can have aesthetic taste (though I don’t think this is stationary, so I’m not sure how much it helps; not that this is particularly relevant to our debate). And after gathering that data (once we have some idea of what the dimensions are), we can even extrapolate, in either naive or complex ways.
But unless values are far simpler than I think they are, I will claim that naive extrapolation from the sampled points fails more and more the farther we move from where we are, which is a (or the?) central problem of AI alignment.
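To illustrate the kind of failure I have in mind, here’s a toy sketch (the “true” function below is an arbitrary stand-in I’m making up, not a claim about what values actually look like):

```python
# Toy illustration: a fit that looks fine on the reachable region degrades
# the farther we extrapolate beyond it. The "true" function is an arbitrary
# stand-in for whatever we are sampling; it is not a model of human values.
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):
    # Arbitrary nonlinear ground truth, chosen only for illustration.
    return np.sin(3 * x) + 0.3 * x

# We can only sample at points we can "reach": x in [0, 1].
x_train = rng.uniform(0.0, 1.0, size=40)
y_train = true_value(x_train) + rng.normal(0.0, 0.05, size=40)

# Naive extrapolation: fit a cubic polynomial to the reachable samples.
predict = np.poly1d(np.polyfit(x_train, y_train, deg=3))

for x_test in [0.5, 1.5, 3.0, 5.0]:  # progressively farther out of distribution
    err = abs(predict(x_test) - true_value(x_test))
    print(f"x = {x_test:4.1f}  |prediction - truth| = {err:.2f}")
```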
Are you talking about value learning? My proposal doesn’t tackle advanced value learning. Basically, my argument is “if (A) human values are limited by human ability to comprehend/optimize things and (B) the factors which make something easier or harder to comprehend/optimize are simple, then the AI can avoid accidentally messing up human values, so we can define safe impact measures and corrigibility”. My proposal is not supposed to make the AI learn human values in great detail or extrapolate them out of distribution. My argument is “if A and B hold, then we can draw a box around human values and tell the AI not to mess up the contents of the box, without making the AI useless, even though the AI might not know which exact contents of the box count as ‘human values’”.[1]
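To make the “box” intuition slightly more concrete, here’s a deliberately crude sketch (every feature name and the penalty form are placeholders invented for illustration, not a worked-out impact measure):

```python
# A crude sketch of the "box" idea: penalize disturbing coarse proxies for the
# box's contents, relative to a no-op baseline. All names and numbers here are
# made up for illustration.

def boxed_features(state):
    # Coarse proxies for the contents of the "box". The agent doesn't need to
    # know which of these encode "human values"; it only needs to leave them
    # (approximately) undisturbed.
    return {
        "humans_alive": state["humans_alive"],
        "human_brains_unmodified": state["human_brains_unmodified"],
        "agent_power": state["agent_power"],
    }

def penalized_score(task_reward, state, baseline_state, weight=100.0):
    # Task reward minus a large penalty for perturbing the boxed features
    # relative to the baseline. Nothing here requires identifying what exactly
    # inside the box counts as "human values".
    current = boxed_features(state)
    baseline = boxed_features(baseline_state)
    disturbance = sum(abs(current[k] - baseline[k]) for k in current)
    return task_reward - weight * disturbance

# Example: an action that completes the task but modifies brains scores far
# worse than one that completes the task and leaves the box alone.
baseline = {"humans_alive": 1.0, "human_brains_unmodified": 1.0, "agent_power": 0.1}
safe =     {"humans_alive": 1.0, "human_brains_unmodified": 1.0, "agent_power": 0.1}
unsafe =   {"humans_alive": 1.0, "human_brains_unmodified": 0.5, "agent_power": 0.9}
print(penalized_score(10.0, safe, baseline))    # 10.0
print(penalized_score(10.0, unsafe, baseline))  # 10.0 - 100 * 1.3 = -120.0
```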
The potential problem with B is that humans have very specialized and idiosyncratic cognitive machinery (the machinery generating experiences) which is much more advanced than humans’ general ability to comprehend things. I interpreted you as making this counterargument in the top-level comment. My reply is that I think human values depend on that machinery in a very limited way, so B is still true enough. But I’m not talking about extrapolating something out of distribution. Unless I’m missing your point.
Why those things follow from A and B is not obvious and depends on a non-trivial argument. I tried to explain it in the first section of the post, but might’ve failed.
No, the argument above is claiming that A is false.
But: to pursue their values, humans should be able to reason about them. To form preferences about a thing, humans should be able to consider the thing. Therefore, human ability to comprehend should limit what humans can care about. At least before humans start unlimited self-modification. I think this logically can’t be false.
Eliezer Yudkowsky is a core proponent of the complexity of value, but in “Thou Art Godshatter” and “Protein Reinforcement and DNA Consequentialism” he basically makes the point that human values arose from complexity limitations, including complexity limitations imposed by limited brainpower. Some famous alignment ideas (e.g. the Natural Abstraction Hypothesis, Shard Theory) kinda imply that human values are limited by human ability to comprehend, and that doesn’t seem controversial. (The ideas themselves are controversial, but for other reasons.)
If learning values is possible at all, there should be some simplicity biases which help to learn them. Wouldn’t it be strange if those simplicity biases were absolutely unrelated to simplicity biases of human cognition?
Based on your comments, my guess is that one of the following is the crux:
You define “values” as ~”the decisions humans would converge to after becoming arbitrarily more knowledgeable”. But that’s a somewhat controversial definition (some knowledge can lead to changes in values), and even given that definition it can be true that “past human ability to comprehend limits human values”, since human values were formed before humans explored unlimited knowledge. Some values formed when humans were barely generally intelligent. Some formed when humans were animals.
You say that values depend on inscrutable brain machinery. But can’t we treat the machinery as a part of “human ability to comprehend”?
You talk about ontology. Humans can care about real diamonds without knowing what the diamonds are physically made of. My reply: I define “ability to comprehend” in terms of the ability to comprehend the functional behavior of a thing under normal circumstances. By this definition, a caveman counts as being able to comprehend the cloud of atoms his spear is made of (because he can comprehend the behavior of the spear under normal circumstances), even though he can’t comprehend atomic theory.
Could you confirm or clarify the crux? Your messages felt ambiguous to me. In what specific way is A false?
To pursue their values, humans should be able to reason about them. To form preferences about a thing, humans should be able to consider the thing. Therefore, human ability to comprehend should limit what humans can care about.
You’re conflating can and should! I agree that it would be ideal if this were the case, but am skeptical it is. That’s what I meant when I said I think A is false.
If learning values is possible at all, there should be some simplicity biases which help to learn them. Wouldn’t it be strange if those simplicity biases were absolutely unrelated to simplicity biases of human cognition?
That’s a very big “if”! And simplicity priors are made questionable, if not refuted, by the fact that we haven’t gotten any convergence about human values despite millennia of philosophy trying to build such an explanation.
You define “values” as ~”the decisions humans would converge to after becoming arbitrarily more knowledgeable”.
No, I think it’s what humans actually pursue today when given the options. I’m not convinced that these values are static, or coherent, much less that we would in fact converge.
You say that values depend on inscrutable brain machinery. But can’t we treat the machinery as a part of “human ability to comprehend”?
No, because we don’t comprehend them, we just evaluate what we want locally using the machinery directly, and make choices based on that. (Then we apply pretty-sounding but ultimately post-hoc reasoning to explain it—as I tweeted partly thinking about this conversation.)
Thanks for clarifying! Even if I still don’t fully understand your position, I now see where you’re coming from.
No, I think it’s what humans actually pursue today when given the options. I’m not convinced that these values are static, or coherent, much less that we would in fact converge.
Then those values/motivations should be limited by the complexity of human cognition, since they’re produced by it. Isn’t that trivially true? I agree that values can be incoherent, fluid, and fail to converge to anything. But building a Task AGI doesn’t require building an AGI which learns coherent human values. It “merely” requires an AGI which doesn’t affect human values in large and unintended ways.
No, because we don’t comprehend them, we just evaluate what we want locally using the machinery directly, and make choices based on that.
This feels like arguing over definitions. If you have an oracle for solving certain problems, that oracle can be counted as part of your problem-solving ability, even if it’s not transparent compared to your other problem-solving abilities. Similarly, the machinery which computes a complicated function from sensory inputs to judgements (e.g. from the Mona Lisa to “this is beautiful”) can be counted as part of our comprehension ability. Yes, humans don’t know (1) the internals of the machinery or (2) some properties of the function it computes; but I don’t think you’ve given an example of how human values depend on knowledge of 1 or 2. You gave an example of how human values depend on the maxima of the function (e.g. the desire to find the most delicious food), but the function having maxima is not an unknown property, it’s a trivial one (some foods taste worse than others, therefore some foods taste best).
That’s a very big “if”! And simplicity priors are made questionable, if not refuted, by the fact that we haven’t gotten any convergence about human values despite millennia of philosophy trying to build such an explanation.
I agree that ambitious value learning is a big “if”. But Task AGI doesn’t require it.