I apologize, I didn’t read in full, but I’m curious whether you considered the case of, for example, the Mandelbrot set? A very simple equation specifies an infinitely precise, complicated set. If human values have this property, then it would be correct to say the Kolmogorov complexity of human values is very low, but there would still be very exacting constraints on the universe for it to satisfy human values.
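For concreteness, a minimal sketch of the point (my own illustration, not something from the thread): a point c belongs to the Mandelbrot set exactly when the orbit of 0 under z → z² + c stays bounded, and |z| > 2 guarantees escape; max_iter below is just an arbitrary cutoff for the approximation.

```python
# Minimal illustrative sketch: membership test for the Mandelbrot set.
# c is in the set iff the orbit of 0 under z -> z**2 + c stays bounded;
# |z| > 2 guarantees escape, and max_iter is a finite cutoff for the sketch.
def in_mandelbrot(c: complex, max_iter: int = 1000) -> bool:
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2:      # escaped: c is definitely outside the set
            return False
    return True             # no escape within max_iter: treat as inside

# c = -1 cycles 0, -1, 0, -1, ... (inside); c = 1 grows without bound (outside).
print(in_mandelbrot(-1 + 0j), in_mandelbrot(1 + 0j))  # True False
```

The entire defining rule is the single update z = z*z + c, yet the boundary it carves out is endlessly intricate, which is the low-description-length versus exacting-constraints contrast being raised here.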
Don’t worry about not reading it all. But could you be a bit more specific about the argument you want to make or the ambiguity you want to clarify? I have a couple of interpretations of your question.
Interpretation A:
The post defines a scale-dependent metric which is supposed to tell how likely humans are to care about something.
There are objects which are identical/similar on every scale. Do they break the metric? (Similar questions can be asked about things other than “scale”.) For example, what if our universe contains an identical, but much smaller universe, with countless people in it? Men In Black style. Would the metric say we’re unlikely to care about the pocket universe just because of its size?
Interpretation B:
The principle says humans don’t care about constraining things in overly specific ways.
Some concepts with low Kolmogorov Complexity constrain things in infinitely specific ways.
My response to B is that my metric of simplicity is different from Kolmogorov Complexity.
Thanks for responding : ) A is amusing, definitely not what I was thinking. B seems like it is probably what I was thinking, but I’m not sure, and I don’t really understand how having a different metric of simplicity changes things.
> While the true laws of physics can be arbitrarily complicated, the behavior of variables humans care about can’t be arbitrarily complicated.
I think this is the part that prompted my question. I may be pretty far off from understanding what you are trying to say, but my thinking is basically this: I am not content with the capabilities of my current mind, so I would like to improve it. In doing so, though, I would become capable of having more articulate preferences, and my current preferences would define a function from the set of possible preferences to an approval rating, such that I would try to improve my mind in a way that makes my new, more articulate preferences the ones I most approve of or find sufficiently acceptable.
If this process is iterated, it defines some path or cone from my current preferences through the space of possible preferences, moving from less to more articulate. It might be that other people would not seek such a thing, though I suspect many would, but with less conscientiousness about what they are doing. It is also possible there are convergent states where my preferences and capabilities would determine a desire to remain as I am. (I am mildly hopeful that that is the case.)
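To make that iterated process concrete, here is a hypothetical toy sketch (the state space, approval function, and saturation cap are illustrative assumptions of mine, not anything from the post or the comment): each mind state scores candidate successor minds with its own approval function, the agent repeatedly moves to the most-approved successor, and a state that approves of itself over every successor is one of the "convergent states" mentioned above.

```python
# Hypothetical toy model (illustrative assumptions only): a "mind" is a state,
# and the current state's approval function scores candidate successor minds.
# Iterated self-modification repeatedly moves to the most-approved successor,
# tracing a path through preference space; a state that prefers itself over
# every successor is a "convergent state".
from typing import Callable, Dict, List

State = str
Approval = Callable[[State, State], float]  # approval(current, candidate) -> score

def self_modification_path(start: State,
                           successors: Dict[State, List[State]],
                           approval: Approval,
                           max_steps: int = 10) -> List[State]:
    path, current = [start], start
    for _ in range(max_steps):
        options = successors.get(current, []) + [current]    # staying put is always an option
        best = max(options, key=lambda s: approval(current, s))
        if best == current:                                   # convergent state reached
            break
        path.append(best)
        current = best
    return path

# Toy example: states v0..v3 are increasingly "articulate"; the current mind
# only endorses candidates at most one step more articulate, and endorsement
# saturates at level 2 (an arbitrary assumption), so the path converges at v2.
successors = {"v0": ["v1"], "v1": ["v2"], "v2": ["v3"]}
level = {"v0": 0, "v1": 1, "v2": 2, "v3": 3}
def approval(cur: State, cand: State) -> float:
    return level[cand] if level[cand] <= min(level[cur] + 1, 2) else -1.0

print(self_modification_path("v0", successors, approval))    # ['v0', 'v1', 'v2']
```

In this toy the path stops because the approval function saturates; whether real preferences have such a fixed point is exactly the question being raised in the comment.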
It is my understanding that the Mandelbrot set is not smooth at any scale (not sure if anyone has proven this), but that is the feature I was trying to point out. If people iteratively modified themselves, would their preferences become ever more exacting? If so, then it is still true that the “variables humans care about can’t be arbitrarily complicated”, but the variables humans care about could define a desire to become a system capable of caring about arbitrarily complicated variables.
I think I understand you now. Your question seems much simpler than I expected. You’re basically just asking “but what if we’ll want infinitely complicated / detailed values in the future?”
> If people iteratively modified themselves, would their preferences become ever more exacting? If so, then it is still true that the “variables humans care about can’t be arbitrarily complicated”, but the variables humans care about could define a desire to become a system capable of caring about arbitrarily complicated variables.
It’s OK if the principle won’t be true for humans in the future; it only needs to be true for current values. Aligning AI to some of the current human concepts should be enough to define corrigibility or low impact, or to avoid goodharting, i.e. to create a safe Task AGI. I’m not trying to dictate to anyone what they should care about.
Hmm… I appreciate the response. It makes me more curious to understand what you’re talking about.
At this point I think it would be quite reasonable for you to suggest that I actually read your article instead of speculating about what it says, lol, but if you want to say anything about my following points of confusion I wouldn’t say no : )
For context, my current view is that value alignment is the only safe way to build ASI. I’m less skeptical about corrigible task ASI than about prosaic scaling with RLHF, but I’m currently still quite skeptical in absolute terms. Roughly speaking: prosaic kills us; a task genie maybe kills us, maybe allows us to make stupid wishes which harm us. I’m kinda not sure whether you are focusing on stuff that takes us from prosaic to task genie, or on stuff that helps with the task genie not killing us. I suspect you are not focused on the task genie allowing us to make stupid wishes, but I’d be open to hearing I’m wrong.
I also have an intuition that having preferences for future preferences is synonymous with having those preferences, but I suppose there are also ways in which they are obviously different, i.e. their uncompressed specification size. Are you suggesting that limiting the complexity of the preferences the AI is working off of to levels similar to the complexity of current encodings of human preferences (i.e. human brains) ensures the preferences aren’t among the set of preferences that are misaligned because they are too complicated (even though the human preferences are synonymous with more complicated preferences)? I think I’m surely misunderstanding something, maybe the way you are applying the natural abstraction hypothesis, or possibly a bunch of things.
Could you reformulate the last paragraph as “I’m confused how your idea helps with alignment subproblem X”, “I think your idea might be inconsistent or have a failure mode because of Y”, or “I’m not sure how your idea could be used to define Z”?
Wrt the third paragraph: the post is about a corrigible task ASI which could be instructed to protect humans from being killed/brainwashed/disempowered (and which won’t kill/brainwash/disempower people before it’s instructed not to do this). The post is not about value learning in the sense of “the AI learns more or less the entirety of human ethics and can build a utopia on its own”. I think developing my idea could help with such value learning, but I’m not sure I can easily back up this claim. Also, I don’t know how to apply my idea directly to neural networks.
I’ll try. I’m not sure how your idea could be used to define human values. I think your idea might have a failure mode around places where people are dissatisfied with their current understanding, i.e. situations where a human wants a more articulate model of the world than they have.
> The post is about corrigible task ASI
Right. That makes sense. Sorry for asking a bunch of off-topic questions, then. I worry that a task ASI could be dangerous even if it is corrigible, but ASI is obviously more dangerous when it isn’t corrigible, so I should probably develop my thinking about corrigibility.