Still on the topic of deception, there are arguments suggesting that something like GPT will always be “deceptive” for Goodhart’s Law and Siren World reasons. We can only reward an AI system for producing answers that look good to us, but this incentivizes the system to produce answers that look increasingly good to us, rather than answers that are actually correct. “Looking good” and “being correct” correlate with each other to some extent, but will eventually be pushed apart once there’s enough optimization pressure on the “looking good” part.
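To make the "pushed apart under optimization pressure" point concrete, here is a minimal, purely illustrative sketch (the model, names, and numbers are my own assumptions, not anything from the original argument): each candidate answer has a true correctness score and a correlated "looks good to the evaluator" score, and we select whichever candidate looks best from larger and larger pools. The apparent quality of the winner keeps climbing, but its true correctness climbs more slowly, so the gap between "looking good" and "being correct" widens as selection pressure on the proxy increases.

```python
import random
import statistics

random.seed(0)

def sample_answer():
    """One candidate answer (toy model).

    looks_good = correctness + independent persuasiveness noise, so the proxy
    correlates with true quality, but an answer can look great while being
    only mediocre.
    """
    correctness = random.gauss(0, 1)
    persuasiveness = random.gauss(0, 1)  # orthogonal to correctness
    looks_good = correctness + persuasiveness
    return correctness, looks_good

def select_best_looking(pool_size, trials=500):
    """Pick the best-looking answer from each pool; report the winner's
    average apparent score and average true correctness."""
    apparent, true = [], []
    for _ in range(trials):
        candidates = [sample_answer() for _ in range(pool_size)]
        best = max(candidates, key=lambda c: c[1])  # optimize the proxy only
        true.append(best[0])
        apparent.append(best[1])
    return statistics.mean(apparent), statistics.mean(true)

for n in (1, 10, 100, 1000):
    looks, correct = select_best_looking(n)
    print(f"pool={n:5d}  looks-good score: {looks:5.2f}   true correctness: {correct:5.2f}")

# In this toy model, harder selection on "looks good" keeps raising the apparent
# score, while true correctness improves more slowly: the two measures are
# increasingly pushed apart even though they remain correlated.
```

This is just regressional Goodhart in miniature; it illustrates the selection-pressure intuition, not a claim about how GPT-style training actually behaves.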
Framed that way, this seems like an unsolvable problem… but at the same time, if you ask me a question, I can have a desire to actually give a correct and useful answer to your question, rather than just giving you an answer that you find maximally compelling. More generally, humans can and often do have a genuine desire to help other humans (or even non-human animals) fulfill their preferences, rather than just having a desire to superficially fake cooperativeness.
This argument proves too much. Humans are rewarded by each other for appearing to be friendly, but not actually for being friendly. Therefore, they are incentivized to seem friendly. Therefore, no humans will really be friendly or really care about each other, because that’s not what human culture is really selecting for.
We have concluded a falsehood, so the line of reasoning requires, at the very least, additional assumptions that make clear why the argument goes through in the AI case but not in the human case. (Personally, I suspect the reasoning is mostly invalid.) I therefore think that “selection” / “incentivization” arguments require great care and caution.
(As another line of reasoning, reward is not the optimization target contravenes claims like “GPT will always be ‘deceptive’ due to Goodhart’s Law / Siren Worlds.”)