I do think that a wide range of shard-based mind-structures will equilibrate into EU optimizers, but I also think this is a somewhat mild statement. My stance is that utility functions represent a yardstick by which decisions are made. “Utility was made by the agent, for the agent” as it were—and not “the agent is made to optimize the utility.” What this means is:
Suppose I start off caring about dogs and diamonds in a shard-like fashion, with certain situations making me seek out dogs and care for them (in the usual intuitive way), and similarly for diamonds. However, there will be certain situations in which the dog-shard “interferes with” the diamond-shard: the dog-shard makes me, say, daydream about dogs while doing my work, and I thereby do worse in life overall. If I didn’t engage in this behavior, then in general I’d probably be able to get more dog-caring and diamond-acquisition. So from the vantage point of this mind and its shards, it is subjectively better not to engage in such “incoherent” behavior, which is a strictly dominated strategy in expectation (i.e. leads to fewer dogs and diamonds).
Therefore, given time and sufficient self-modification ability, these shards will want to equilibrate to an algorithm which doesn’t step on its own toes like this.
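The dominated-strategy point can be made concrete with a toy sketch (all payoff numbers here are hypothetical, purely for illustration): a policy that does worse on every shard’s dimension is one that both shards agree to modify away, regardless of their relative strength.

```python
# Toy model (all numbers hypothetical): each policy yields an expected
# payoff for each shard, as (dog-caring, diamond-acquisition).
policies = {
    "daydream_at_work": (3.0, 1.0),  # dog-shard interferes with work
    "stay_focused":     (4.0, 2.0),  # no interference
}

def dominates(a, b):
    """True if payoff vector `a` is at least as good as `b` on every
    shard's dimension and strictly better on at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

# Daydreaming is strictly dominated: both shards prefer the version of
# the agent that has modified it away, whatever their relative weights.
assert dominates(policies["stay_focused"], policies["daydream_at_work"])
```

The point of the sketch is that no bargaining between the shards is needed to remove a strictly dominated behavior; the disagreement only begins once the remaining options trade off one shard’s values against the other’s.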
This doesn’t mean, of course, that these shards would decide to implement a utility function whose results are absurd by the lights of the initial decision-making procedure. For example, tiling the universe (half with dog-squiggles, half with diamond-squiggles) would not be a desirable outcome under the initial decision-making process. Insofar as such an outcome could be foreseen as a consequence of making decisions via a proposed utility function, the shards would disprefer that utility function.[1]
So any utility function chosen should “add up to normalcy” when optimized, or at least be different in a way which is not foreseeably weird and bad by the initial shards’ reckoning. On this view, one would derive a utility function as a rule of thumb for how to make decisions effectively and (nearly) Pareto-optimally in relevant scenarios.[2]
(You can perhaps understand why, given this viewpoint, I am unconcerned/weirded out by Yudkowskian sentiments like “Unforeseen optima are extremely problematic given high amounts of optimization power.”)
This elides any practical issues with self-modification, and possible value drift from e.g. external sources, and so on. I think they don’t change the key conclusions here. I think they do change conclusions for other questions though.
Again, if I’m imagining the vantage point of the dog+diamond agent, it wouldn’t want to waste tons of compute deriving a policy for weird situations it doesn’t expect to run into. The most important place to become more coherent is the expected on-policy future.
> Therefore, given time and sufficient self-modification ability, these shards will want to equilibrate to an algorithm which doesn’t step on its own toes like this.
What do you think that algorithm will be? Why would it not be some explicit EU-maximization-like algorithm, with a utility function that fully represents both of their values? (At least eventually?) It seems like the best way to guarantee that the two shards will never step on each other’s toes ever again (no need to worry about running into unforeseen situations), and also allows the agent to easily merge with other similar agents in the future (thereby avoiding stepping on even more toes).
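One minimal reading of “explicit EU-maximization-like algorithm” is a fixed aggregation of the two shards’ values into a single scalar (the weights and payoffs below are hypothetical): once every future decision maximizes that one scalar, the shards cannot work at cross purposes again, whatever situation comes up.

```python
# Sketch of the "merge into one utility function" move (weights and
# payoffs hypothetical): a fixed linear aggregation of the two shards'
# values, maximized on every future decision.

def merged_utility(outcome, w_dog=0.5, w_diamond=0.5):
    """Fixed linear aggregation of (dog-caring, diamond) payoffs."""
    dogs, diamonds = outcome
    return w_dog * dogs + w_diamond * diamonds

# Candidate outcomes: (expected dog-caring, expected diamonds).
outcomes = {
    "daydream_at_work": (3.0, 1.0),
    "stay_focused":     (4.0, 2.0),
}

best = max(outcomes, key=lambda name: merged_utility(outcomes[name]))
assert best == "stay_focused"
```

The choice of weights is exactly where the shards’ bargaining happens; the question above is whether, after that one-time negotiation, anything short of maximizing the resulting scalar would be stable.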
(Not saying I know for sure this is inevitable, as there could be all kinds of obstacles to this outcome, but it still seems like our best guess of what advanced AI will eventually look like?)
> So any utility function chosen should “add up to normalcy” when optimized, or at least be different in a way which is not foreseeably weird and bad by the initial shards’ reckoning.
I agree with this statement, but what about:
- Shards just making a mistake and picking a bad utility function. (The individual shards aren’t necessarily very smart and/or rational?)
- The utility function being fine for the AI but not for us. (Would the AI shards’ values exactly match our shards’, including relative power/influence, and if not, why would their utility function be safe for us?)
- Competitive pressures forcing shard-based AIs to become more optimizer-like before they’re ready, or to build other kinds of more competitive but riskier AI, similar to how it’s hard for humans to stop our own AI arms race.
> (You can perhaps understand why, given this viewpoint, I am unconcerned/weirded out by Yudkowskian sentiments like “Unforeseen optima are extremely problematic given high amounts of optimization power.”)
Yes, you’re helping me better understand your perspective, thanks. However, as indicated by my questions above, I’m still not sure why you think shard-based AI agents would be safe in general, and in particular (among other risks) why they wouldn’t turn into dangerous goal-directed optimizers at some point.