Sam Marks comments on TurnTrout’s shortform feed

Sam Marks 1 Dec 2023 17:21 UTC
LW: 13 AF: 9
11
AF
Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homonculi whose goals have to be perfected.
Instead, we enter the realm of tool AI which basically does what you say.
I agree that, conditional on no deceptive alignment, the most pernicious and least tractable sources of doom go away.
However, I disagree that conditional on no deceptive alignment, AI “basically does what you say.” Indeed, the majority of my P(doom) comes from the difference between “looks good to human evaluators” and “is actually what the human evaluators wanted.” Concretely, this could play out with models which manipulate their users into thinking everything is going well and sensor tamper.
I think current observations don’t provide much evidence about whether these concerns will pan out: with current models and training set-ups, “looks good to evaluators” almost always coincides with “is what evaluators wanted.” I worry that we’ll only see this distinction matter once models are smart enough that they could competently deceive their overseers if they were trying (because of the same argument made here). (Forms of sycophancy where models knowingly assert false statements when they expect the user will agree are somewhat relevant, but there are also benign reasons models might do this.)
- tailcalled 1 Dec 2023 20:39 UTC
  5 points
  1
  Parent
  To what extent do you worry about the training methods used for ChatGPT, and why?
- Noosphere89 6 Dec 2023 22:30 UTC
  2 points
  1
  Parent
  I think my crux is that once we remove the deceptive alignment issue, I suspect that profit forces alone will generate a large incentive to reduce the gap, primarily because I think that people way overestimate the value of powerful, unaligned agents to the market and underestimate the want for control over AI.