My prescription for the problem you’re highlighting here: track a prototypical example.
Trying to unpack e.g. “optimization pressure” into a good definition—even an informal definition—is hard. Most people who attempt to do that will get it wrong, in the sense that their proffered definition will not match their own intuitive usage of the term (even in cases where their own intuitive usage is coherent), or their implied usage in an argument. But their underlying intuitions about “optimization pressure” are often still correct, even if those intuitions are not yet legible. Definitions, though an obvious strategy, are not a very good one for distinguishing coherent-word-usage from incoherent-word-usage.
(Note that OP’s strategy of unpacking words into longer phrases is basically equivalent to using definitions, for our purposes.)
So: how can we track coherence of word usage, practically?
Well, we can check coherence constructively, i.e. by exhibiting an example which matches the usage. If someone is giving an argument involving “optimization pressure”, I can ask for a prototypical example, and then walk through the argument in the context of the example to make sure that it’s coherent.
For instance, maybe someone says “unbounded optimization wants to take over the world”. I ask for an example. They say “Well, suppose we have a powerful AI which wants to make as many left shoes as possible. If there’s some nontrivial pool of resources in the universe which it hasn’t turned toward shoe-making, then it could presumably make more shoes by turning those resources toward shoe-making, so it will seek to do so.” And then I can easily pick out a part of the example to drill in on—e.g. maybe I want to focus on the implicit “all else equal” and argue that all else will not be equal, or maybe I want to drill into the “AI which wants” part and express skepticism about whether anything like current AI will “want” things in the relevant way (in which case I’d probably ask for an example of an AI which might want to make as many left shoes as possible), or [whatever else].
The key thing to notice is that I can easily express those counterarguments in the context of the example, and it will be relatively clear to both myself and my conversational partner what I’m saying. Contrast that to e.g. just trying to use very long phrases in place of “unbounded optimization” or “wants”, which makes everything very hard to follow.
In one-way “conversation” (e.g. if I’m reading a post), I’d track such a prototypical example in my head. (Well, really a few prototypical examples, but 80% of the value comes from having any.) Then I can relatively-easily tell when the argument given falls apart for my example.