By contrast, deception is much broader: it covers any situation where the AI, interacting with humans for any reason, deceives a human by knowingly providing them with false or misleading information.
This description allows us to classify every output of a highly capable AI as deceptive: For any AI output, it’s essentially guaranteed that a human will update away from the truth about something. A highly capable AI will be able to predict some of these updates—thus it will be “knowingly providing … misleading information”.
Conversely, we can’t require that a human be misled about everything in order to classify something as deceptive—nothing would then qualify as deceptive.
There’s no obvious fix here.
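To make the quantifier problem explicit, here is a rough formalization (a sketch in notation of my own; the predicates $M$ and $K$ are illustrative assumptions, not part of the definition being discussed). Read $M(H, p, o)$ as "output $o$ moves hearer $H$'s credence in proposition $p$ away from the truth", and $K(S, \phi)$ as "speaker $S$ predicts that $\phi$". The existential reading is

$$\mathrm{Dec}_{\exists}(o) \iff \exists p \,\big[\, M(H, p, o) \wedge K(S, M(H, p, o)) \,\big],$$

and for a highly capable AI some such $p$ exists for essentially every $o$, with the AI able to predict it, so every output qualifies. The universal reading

$$\mathrm{Dec}_{\forall}(o) \iff \forall p \,\big[\, M(H, p, o) \wedge K(S, M(H, p, o)) \,\big]$$

fails in the opposite direction: no output misleads about everything, so nothing qualifies. Any intermediate fix would have to say which propositions $p$ matter, and that is where motivation enters.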
Our common-sense notion of deception is fundamentally tied to motivation:
A teacher says X in order to give a student a basic-but-flawed model that’s misleading in various ways, but is a step towards deeper understanding.
Not deception.
A teacher says X to a student in order to give them a basic-but-flawed model that’s misleading in various ways, so as to manipulate them.
Deception.
The student’s updates in these cases can be identical. Whether we want to call the statement deceptive comes down to the motivation of the speaker (perhaps as inferred from subsequent actions).
In a real-world context, it is not possible to rule out misleading behavior: all behavior misleads about something. We can only hope to rule out malign misleading behavior. This gets us into questions about motivation, values, etc. (or at least into much broader considerations involving patterns of behavior and long-term consequences).
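Connecting this back to the sketch above (again with an assumed predicate): the natural move is to replace the epistemic condition $K$ with a motivational one. Read $I(S, \phi)$ as "$S$ acts with the motive of bringing about $\phi$", and take

$$\mathrm{Dec}(o) \iff \exists p \,\big[\, M(H, p, o) \wedge I(S, M(H, p, o)) \,\big].$$

This matches the teacher examples: $M$ is identical in both cases, and the verdict flips on $I$ alone. The cost is that $I$ is exactly the kind of motivational property that is hard to read off behavior directly.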
(I note also that requiring “knowingly” is an obvious loophole, allowing self-deception, willful ignorance, or negligence to lead to bad outcomes; this is why some are focusing on truthfulness rather than honesty.)