Therefore, given time and sufficient self-modification ability, these shards will want to equilibrate to an algorithm which doesn’t step on its own toes like this.
What do you think that algorithm will be? Why would it not be some explicit EU-maximization-like algorithm, with a utility function that fully represents both of their values? (At least eventually?) It seems like the best way to guarantee that the two shards will never step on each other's toes again (no need to worry about running into unforeseen situations), and it also allows the agent to easily merge with other similar agents in the future (thereby avoiding stepping on even more toes).
(Not saying I know for sure this is inevitable, as there could be all kinds of obstacles to this outcome, but it still seems like our best guess of what advanced AI will eventually look like?)
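To make concrete what I mean by "an explicit EU-maximization-like algorithm with a utility function that fully represents both of their values", here's a minimal toy sketch. This is my own illustration, not anything from the post: the shards, value functions, and weights are all hypothetical stand-ins, with each shard modeled as a scalar value function over outcomes and the merged weights standing in for whatever relative influence the shards negotiated.

```python
import random

# Toy model (hypothetical): each shard is a scalar value function over outcomes.
def juice_shard(outcome):
    return outcome.get("juice", 0.0)

def safety_shard(outcome):
    return -outcome.get("danger", 0.0)

# Assumed bargaining outcome: fixed weights encoding relative shard influence.
WEIGHTS = [(juice_shard, 0.6), (safety_shard, 0.4)]

def merged_utility(outcome):
    """One utility function representing both shards' values."""
    return sum(w * shard(outcome) for shard, w in WEIGHTS)

def expected_utility(action, world_model, n_samples=1000):
    """Monte Carlo estimate of E[U | action] under some world model."""
    return sum(merged_utility(world_model(action)) for _ in range(n_samples)) / n_samples

def choose_action(actions, world_model):
    # Explicit EU-maximization: no per-situation bargaining between shards,
    # so they can never step on each other's toes again -- for better or worse.
    return max(actions, key=lambda a: expected_utility(a, world_model))

# Example world model: drinking juice carries some random danger.
def world(action):
    if action == "drink":
        return {"juice": 1.0, "danger": random.random()}
    return {"juice": 0.0, "danger": 0.0}

print(choose_action(["drink", "abstain"], world))
```

Note that once the weights are frozen into `merged_utility`, the merge handles every future situation uniformly, including unforeseen ones; that is exactly what makes it attractive to the shards, and also what makes a badly chosen utility function (per my concerns below) so hard to recover from.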
So any utility function chosen should “add up to normalcy” when optimized, or at least be different in a way which is not foreseeably weird and bad by the initial shards’ reckoning.
I agree with this statement, but what about:
Shards just making a mistake and picking a bad utility function. (The individual shards aren’t necessarily very smart and/or rational?)
The utility function is fine for the AI but not for us. (Would the AI shards' values exactly match our own shards' values, including relative power/influence, and if not, why would their utility function be safe for us?)
Competitive pressures forcing shard-based AIs to become more optimizer-like before they’re ready, or to build other kinds of more competitive but riskier AI, similar to how it’s hard for humans to stop our own AI arms race.
(You can perhaps understand why, given this viewpoint, I am unconcerned/weirded out by Yudkowskian sentiments like “Unforeseen optima are extremely problematic given high amounts of optimization power.”)
Yes, you’re helping me better understand your perspective, thanks. However, as indicated by my questions above, I’m still not sure why you think shard-based AI agents would be safe in general, and in particular (among other risks) why they wouldn’t turn into dangerous goal-directed optimizers at some point.