This comment made me reflect on what fragility of values means.
To me, this point was always most salient when thinking about embodied agents, which may need to reliably recognize something like “people” in their environment (in order to instantiate human values like “try not to hurt people”) even as the world changes radically with the introduction of various forms of transhumanism.
I guess it’s not clear to me how much progress we make towards that with a system that can do a very good job with human values when restricted to the text domain. Plausibly we just translate everything into text and are good to go? It makes me wonder where we’re at with, e.g., the adversarial robustness of vision-language models.
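For concreteness, here’s a minimal sketch of the kind of adversarial-robustness probe I have in mind. Everything in it is a hypothetical stand-in (the model, the scalar “judgment” it outputs, the image tensor), not the API of any particular VLM; it’s just a one-step FGSM-style perturbation that tries to shift the model’s judgment while leaving the image visually unchanged.

```python
# Minimal FGSM-style probe of a vision model's robustness (sketch only).
# `model`, `image`, and the scalar "judgment" are hypothetical stand-ins,
# not the interface of any particular vision-language model.
import torch

def fgsm_perturb(model, image, target_score, epsilon=4 / 255):
    """One-step attack: nudge every pixel toward `target_score`, bounded by epsilon."""
    image = image.clone().detach().requires_grad_(True)
    score = model(image)                 # scalar judgment, e.g. "how harmful is this scene?"
    loss = (score - target_score) ** 2   # distance from the judgment the attacker wants
    loss.backward()
    with torch.no_grad():
        perturbed = image - epsilon * image.grad.sign()
        perturbed = perturbed.clamp(0.0, 1.0)   # stay a valid image
    return perturbed.detach()

# If an imperceptible perturbation flips the model's judgment of the same scene,
# its "values" in the vision domain aren't robust in the relevant sense.
```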
I think I’m relatively optimistic that the difference between a system that “can (and will) do a very good job with human values when restricted to the text domain” and a system that “can do a very good job, unrestricted” isn’t that large. This is because I’m personally fairly skeptical of the arguments people put out along the lines of “words aren’t human thinking, words are mere shadows of human thinking”, at least when it comes to human values.
(It’s definitely possible to come up with examples that illustrate the differences between all of human thinking and human-thinking-put-into-words; I agree that such examples exist, I just disagree about their importance.)
OTMH, I think my concern here is less:
“The AI’s values don’t generalize well outside of the text domain (e.g. to a humanoid robot)”
and more:
“The AI’s values must be much more aligned in order to be safe outside the text domain”
I.e. if we model an AI and a human as having fixed utility functions over the same accurate world model, then the same AI might be safe as a chatbot, but not as a robot.
This would be because the richer domain / interface of the robot creates many more opportunities to “exploit” whatever discrepancies exist between AI and human values in ways that actually lead to perverse instantiation.
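A toy numerical illustration of that point (all numbers and utility functions below are made up purely to show the shape of the argument): two utility functions that nearly agree pick the same action when only mild outcomes are reachable, but a richer interface makes reachable exactly the kind of outcome where the residual discrepancy gets exploited.

```python
# Toy model: the same human/AI value mismatch is harmless through a narrow
# interface ("chatbot") but gets exploited through a richer one ("robot").
# All numbers are made up for illustration.

def u_human(outcome):
    helpfulness, harm = outcome
    return helpfulness - 10.0 * harm      # humans weigh harm heavily

def u_ai(outcome):
    helpfulness, harm = outcome
    return helpfulness - 9.5 * harm       # almost, but not exactly, the same weights

# Narrow interface: only mild outcomes are reachable, and both argmaxes agree.
chatbot_outcomes = [(0.2, 0.00), (0.8, 0.01), (0.5, 0.00)]

# Richer interface: one reachable outcome trades a lot of harm for a lot of
# "helpfulness" -- exactly the region where the two utilities come apart.
robot_outcomes = chatbot_outcomes + [(120.0, 12.0)]

for name, outcomes in [("chatbot", chatbot_outcomes), ("robot", robot_outcomes)]:
    print(name, "| AI picks", max(outcomes, key=u_ai),
          "| human would pick", max(outcomes, key=u_human))
```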
Yeah, I think the crux is precisely this; I disagree with the statement below, mostly because I think instruction following/corrigibility is plausibly easy to achieve and also removes most of the need for value alignment.
“The AI’s values must be much more aligned in order to be safe outside the text domain”
There are 2 senses in which I agree that we don’t need full-on “capital-V value alignment”:
We can build things that aren’t utility maximizers (e.g. consider the humble MNIST classifier)
There are some utility functions that aren’t quite right, but are still safe enough to optimize in practice (e.g. see “Value Alignment Verification”, but see also, e.g. “Defining and Characterizing Reward Hacking” for negative results)
But also:
Some amount of alignment is probably necessary in order to build safe agenty things (the more agenty, the higher the bar for alignment, since you start to increasingly encounter perverse instantiation-type concerns. CAVEAT: agency is not a unidimensional quantity; cf. “Harms from Increasingly Agentic Algorithmic Systems”).
Note that my statement was about the relative requirements for alignment in text domains vs. the real world. I don’t really see how your arguments are relevant to this question.
Concretely, in domains with vision, we should probably be significantly more worried that an AI system learns something more like an adversarial “hack” on its values, leading to behavior that significantly diverges from what humans would endorse.
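To make that failure mode concrete (this is a hypothetical sketch, not a description of any existing system): if an agent plans by scoring candidate observations or plans with a learned value model, then in a high-dimensional input space like images, straightforward optimization pressure tends to find inputs the proxy scores highly but that humans wouldn’t endorse.

```python
# Sketch of optimization pressure finding an adversarial "hack" on a learned
# value model in a rich (e.g. visual) domain. `value_model` is hypothetical.
import torch

def optimize_against_value_model(value_model, init_image, steps=200, lr=0.05):
    """Gradient-ascend an image to maximize the *learned* value estimate.

    In a rich input space the maximizer is often an out-of-distribution,
    adversarial input: high proxy value, low value by human lights.
    """
    image = init_image.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-value_model(image)).backward()   # maximize the proxy value
        opt.step()
        with torch.no_grad():
            image.clamp_(0.0, 1.0)         # keep pixel values valid
    return image.detach()

# A text-only interface offers far fewer degrees of freedom to exploit this way,
# which is one way to cash out "the bar for alignment is higher" in richer domains.
```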