Wei Dai comments on Review of AI Alignment Progress

Wei Dai 8 Feb 2023 6:21 UTC
43 points
19

I feel deconfused when I reject utility functions, in favor of values being embedded in heuristics and/or subagents.

“Humans don’t have utility functions” and the idea of values of being embedded in heuristics and/or subagents have been discussed since the beginning of LW, but usually framed as a problem, not a solution. The key issue here is that utility functions are the only reflectively stable carrier of value that we know of, meaning that an agent with a utility function would want to preserve that utility function and build more agents with the same utility function, but we don’t know what agents with other kinds of value will want to do in the long run. I don’t understand why we no longer seem to care about reflective stability...

Granted I haven’t read all of the literature on shard theory so maybe I’ve missed something. Can you perhaps summarize the argument for this?
- Quintin Pope 8 Feb 2023 13:04 UTC
  10 points
  2
  Parent
  I agree that reflectivity for learned systems is a major open question, and my current project is to study the reflectivity and self-modification related behaviors of current language models.
  I don’t think that the utility function framework helps much. I do agree that utility functions seem poorly suited to capturing human values. However, my reaction to that is to be more skeptical of utility functions. Also, I don’t think the true solution to questions of reflectivity is to reach some perfected fixed point, after which your values remain static for all time. That doesn’t seem human, and I’m in favor of ‘cloning the human prior’ as much as possible. It seems like a very bad idea to deliberately set out to create an agent whose mechanism of forming and holding values is completely different from our own.
  I also think utility functions are poorly suited to capturing the behaviors and values of current AI systems, which I take as another downwards update on the applicability of utility functions to capture such things in powerful cognitive systems.
  We can take a deep model and derive mathematical objects with properties that are nominally very close to utility functions. E.g., for transformers, it will probably soon be possible to find a reasonably concise energy function (probably of a similar OOM of complexity as the model weights themselves) whose minimization corresponds to executing forwards passes of the transformer. However, this energy function wouldn’t tell you much about the values or objectives of the model, since the energy function is expressed in the ontology of model weights and activations, not an agent’s beliefs / goals.
  I think this may be the closest we’ll get to static, universally optimized for utility functions of realistic cognitive architectures. (And they’re not even really static because mixing in any degree of online learning would cause the energy function in question to constantly change over time.)
  - Wei Dai 9 Feb 2023 3:38 UTC
    6 points
    2
    Parent
    
    I agree that reflectivity for learned systems is a major open question, and my current project is to study the reflectivity and self-modification related behaviors of current language models.
    
    Interesting. I’m curious what kinds of results you’re hoping for, or just more details about your project. (But feel free to ignore this if talking about it now isn’t a good use of your time.) My understanding is that LLMs can potentially be fine-tuned or used to do various things, including instantiating various human-like or artificial characters (such as a “helpful AI assistant”). Seems like reflectivity and self-modification could vary greatly depending on what character(s) you instantiate, or how you use the LLMs in other ways.
    
    Also, I don’t think the true solution to questions of reflectivity is to reach some perfected fixed point, after which your values remain static for all time.
    
    I’m definitely not in favor of building an AI with an utility function representing fixed, static values, at least not in anything like our current circumstances. (My own preferred approach is what I called “White-Box Metaphilosophical AI” in the linked post.) I was just taken aback by Peter saying that he felt “deconfused when I reject utility functions”, when there was a reason that people were/are thinking about utility functions in relation to AI x-safety, and the new approach he likes hasn’t addressed that reason yet.
    - PeterMcCluskey 10 Feb 2023 22:20 UTC
      4 points
      0
      Parent
      I don’t see any clear-cut disagreement between my position and your White-Box Metaphilosophical AI. I wonder how much is just a framing difference?
      
      Reflective stability seems like something that can be left to a smarter-than-human aligned AI.
      
      I’m not saying it would be bad to implement a utility function in an AGI. I’m mostly saying that aiming for that makes human values look complex and hard to observe.
      
      E.g. it leads people to versions of the diamond alignment problem that sound simple, but which cause people to worry about hard problems which they mistakenly imagine are on a path to implementing human values.
      
      Whereas shard theory seems aimed at a model of human values that’s both accurate and conceptually simple.
      - Wei Dai 11 Feb 2023 4:07 UTC
        9 points
        0
        Parent
        
        Whereas shard theory seems aimed at a model of human values that’s both accurate and conceptually simple.
        
        Let’s distinguish between shard theory as a model of human values, versus implementing an AI that learns its own values in a shard-based way. The former seems fine to me (pending further research on how well the model actually fits), but the latter worries me in part because it’s not reflectively stable and the proponents haven’t talked about how they plan to ensure that things will go well in the long run. If you’re talking about the former and I’m talking about the latter, then we might have been talking past each other. But I think the shard-theory proponents are proposing to do the latter (correct me if I’m wrong), so it seems important to consider that in any overall evaluation of shard theory?
        
        BTW here are two other reasons for my worries. Again these may have already been addressed somewhere and I just missed it.
        
        The AI will learn its own shard-based values which may differ greatly from human values. Even different humans learn different values depending on genes and environment, and the AI’s “genes” and “environment” will probably lie far outside the human distribution. How do we figure out what values we want the AI to learn, and how to make sure the AI learns those values? These seem like very hard research questions.
        Humans are all partly or even mostly selfish, but we don’t want the AI to be. What’s the plan here, or reason to think that shard-based agents can be trained to not be selfish?
        What links here?
        How do you feel about LessWrong these days? [Open feedback thread] by jacobjacob (5 Dec 2023 20:54 UTC; 101 points)
        Wei Dai's comment on Dreams of AI alignment: The danger of suggestive names by TurnTrout (4 Mar 2024 17:12 UTC; 2 points)
        PeterMcCluskey 13 Feb 2023 21:11 UTC
        2 points
        0
        Parent
        I think of shard theory as more than just a model of how to model humans.
        
        My main point here is that human values will be represented in AIs in a form that looks a good deal more like the shard theory model than like a utility function.
        
        Approaches that involve utility functions seem likely to make alignment harder, via adding an extra step (translating a utility function into shard form) and/or by confusing people about how to recognize human values.
        
        I’m unclear whether shard theory tells us much about how to cause AIs to have the values we want them to have.
        
        Also, I’m not talking much about the long run. I expect that problems with reflective stability will be handled by entities that have more knowledge and intelligence than we have.
        Noosphere89 13 Feb 2023 21:35 UTC
        1 point
        0
        Parent
        Re shard theory: I think it’s plausibly useful, and maybe be a part of an alignment plan. But I’m quite a bit more negative than you or Turntrout on that plan, and I’d probably guess that Shard Theory ultimately doesn’t impact alignment that much.
- Scott Garrabrant 8 Feb 2023 16:48 UTC
  9 points
  −1
  Parent
  I feel like reflective stability is what caused me to reject utility. Specifically, it seems like it is impossible to be reflectively stable if I am the kind of mind that would follow the style of argument given for the independence axiom. It seems like there is a conflict between reflective stability and Bayesian updating.
  I am choosing reflective stability, in spite of the fact that loosing updating is making things very messy and confusing (especially in the logical setting), because reflective stability is that important.
  When I lose updating, the independence axiom, and thus utility goes along with it.
  What links here?
  - Vladimir_Nesov's comment on Self-Reference Breaks the Orthogonality Thesis by lsusr (17 Feb 2023 7:24 UTC; 4 points)
  - Scott Garrabrant 8 Feb 2023 16:54 UTC
    5 points
    2
    Parent
    Although I note that my flavor of rejecting utility functions is trying to replace them with something more general, not something incompatible.
  - Wei Dai 9 Feb 2023 3:48 UTC
    4 points
    −2
    Parent
    UDT still has utility functions, even though it doesn’t have independence… Is it just a terminological issue? Like you want to call the representation of value in whatever the correct decision theory turns out to be something besides “utility”? If so, why?
    - Vladimir_Nesov 11 Feb 2023 13:04 UTC
      2 points
      0
      Parent
      
      UDT still has utility functions
      
      But where does UDT get those utility functions from, why does it care about expected utility specifically and not arbitrary preference over policies? Utility functions seem to centrally originate from updateful agents, which take many actions in many hypothetical situations, coherent with each other, forcing preference to be describable as expected utility. Such agents can then become reflectively stable by turning to UDT, now only ever taking a single decision about policy, in the single situation of total ignorance, with nothing else for it to be coherent with.
      
      So by becoming updateless, a UDT agent loses contact with the origin of (motivation for) its own utility function. To keep it, it would still implicitly need an updateful point of view, with its many situations that consitutute the affordance for acting coherently, to motivate its preference to have the specific form of expected utility. Otherwise it only has the one situation, and its preference and policy could be anything, with no opportunity to be constrained by coherence.
    - Scott Garrabrant 10 Feb 2023 19:17 UTC
      2 points
      0
      Parent
      I think UDT as you specified it has utility functions. What do you mean by doesn’t have independence? I am advocating for an updateless agent model that might strictly prefer a mixture between outcomes A and B to either A or B deterministically. I think an agent model with this property should not be described as having a “utility.” Maybe I am conflating “utility” with expected utility maximization/VNM and you are meaning something more general?
      If you mean by utility something more general than utility as used in EUM, then I think it is mostly a terminological issue.
      I think I endorse the word “utility” without any qualifiers as referring to EUM. In part because I think that is how it is used, and in part because EUM is nice enough to deserve the word utility.
      - Scott Garrabrant 10 Feb 2023 19:28 UTC
        2 points
        0
        Parent
        Even if EUM doesn’t get “utility”, I think it at least gets “utility function”, since “function” implies cardinal utility rather than ordinal utility and I think people almost always mean EUM when talking about cardinal utility.
        I personally care about cardinal utility, where the magnitude of the utility is information about how to aggregate rather than information about how to take lotteries, but I think this is a very small minority usage of cardinal utility, so I don’t think it should change the naming convention very much.
- PeterMcCluskey 9 Feb 2023 0:00 UTC
  7 points
  4
  Parent
  Some people on LW have claimed that reflective stability is essential. My impression is that Robin Hanson always rejected that.
  
  This seems like an important clash of intuitions. It seems to be Eliezer claiming that his utility function required it, and Robin denying that is a normal part of human values. I suspect this disagreement stems from some important disagreement about human values.
  
  My position seems closer to Robin’s than to Eliezer’s. I want my values to become increasingly stable. I consider it ok for my values to change moderately as I get closer to creating a CEV. My desire for stability isn’t sufficiently strong compared to other values that I need guarantees about it.
- TAG 11 Feb 2023 11:55 UTC
  3 points
  0
  Parent
  
  The key issue here is that utility functions are the only reflectively stable carrier of value that we know of,
  
  That’s not necessarily a good thing. One man’s reflective stability is another man’s incorrigibility.
- TekhneMakre 8 Feb 2023 19:15 UTC
  2 points
  0
  Parent
  This point seems anti-memetic for some reason. See my 4 karma answer here, on a post with 99 karma.
  https://www.lesswrong.com/posts/XYDsYSbBjqgPAgcoQ/why-the-focus-on-expected-utility-maximisers?commentId=mgdud4Z6EM88SnzqE