Against CIRL as a special case of against quickly jumping into highly specific speculation while ignoring empirical embodiments-of-the-desired-properties.
Just because we write down English describing what we want the AI to do ("be helpful"), propose a formalism (CIRL), and show good toy results (POMDPs where the agent waits to act until updating on more observations), that doesn't mean that the formalism will lead to anything remotely relevant to the original English words we used to describe it. (It's easier to say "this logic enables nonmonotonic reasoning" and mess around with different logics and show how a logic solves toy examples, than it is to pin down probability theory with Cox's theorem.)
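To make concrete the kind of toy result being referenced, here is a minimal sketch in Python, assuming a two-option, one-shot setting rather than Hadfield-Menell et al.'s actual CIRL formalism: the robot can act immediately on its prior over which option the human prefers, or pay a small cost to observe the preference first. The payoff numbers, the noiseless observation, and the function names are illustrative assumptions.

```python
# Minimal sketch of a "the agent waits to act until updating on more
# observations" toy result. All numbers here are illustrative assumptions.

REWARD_MATCH = 1.0      # robot picks the option the human actually prefers
REWARD_MISMATCH = -1.0  # robot picks the option the human does not prefer
QUERY_COST = 0.1        # small cost paid to observe the preference first

def value_act_now(p_prefers_a: float) -> float:
    """Expected reward if the robot commits immediately to its best guess."""
    p_best_guess_correct = max(p_prefers_a, 1.0 - p_prefers_a)
    return (p_best_guess_correct * REWARD_MATCH
            + (1.0 - p_best_guess_correct) * REWARD_MISMATCH)

def value_observe_then_act(p_prefers_a: float) -> float:
    """Expected reward if the robot observes the preference first
    (noiselessly, in this toy), then acts on what it learned."""
    return REWARD_MATCH - QUERY_COST

if __name__ == "__main__":
    prior = 0.5  # maximal uncertainty about which option the human prefers
    print("act immediately:   ", value_act_now(prior))           # 0.0
    print("observe, then act: ", value_observe_then_act(prior))  # 0.9
    # Under enough uncertainty, deferring to gather information wins, which
    # is the "waits to act" behavior such toy results exhibit.
```

A clean calculation like this is exactly the sort of result the paragraph above warns does not, by itself, connect the formalism back to "be helpful."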
In the context of “how do we build AIs which help people?”, asking “does CIRL solve corrigibility?” is hilariously unjustified. By what evidence have we located such a specific question? We have assumed there is an achievable “corrigibility”-like property; we have assumed it is good to have in an AI; we have assumed it is good in a similar way as “helping people”; we have elevated CIRL in particular as a formalism worth inquiring after.
But this is not the first question to ask, when considering “sometimes people want to help each other, and it’d be great to build an AI which helps us in some way.” Much better to start with existing generally intelligent systems (humans) which already sometimes act in the way you want (they help each other) and ask after the guaranteed-to-exist reason why this empirical phenomenon happens.
And yes, this criticism applies extremely strongly to my own past work with attainable utility preservation and impact measures. (Unfortunately, I learned my lesson after, and not before, making certain mistakes.)
Actually, this is somewhat too uncharitable to my past self. It’s true that I did not, in 2018, grasp the two related lessons conveyed by the above comment:
1. Make sure that the formalism (CIRL, AUP) is tightly bound to the problem at hand (value alignment, "low impact"), and not just supported by "it sounds nice or has some good properties."
2. Don't randomly jump to highly specific ideas and questions without lots of locating evidence.
However, in World State is the Wrong Abstraction for Impact, I wrote:

> I think what gets you is asking the question "what things are impactful?" instead of "why do I think things are impactful?". Then, you substitute the easier-feeling question of "how different are these world states?". Your fate is sealed; you've anchored yourself on a Wrong Question.

I had partially learned lesson #2 by 2019.
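As a contrast with lesson #2, here is a minimal sketch of the substitution the quoted passage describes: answering "how different are these world states?" by counting changed state variables. The dictionary encoding of world states and the example events are illustrative assumptions, not AUP or any published impact measure.

```python
# Sketch of the "how different are these world states?" substitution:
# impact measured as raw state difference. The encoding is an illustrative assumption.

def state_difference_impact(before: dict, after: dict) -> int:
    """Count how many state variables changed between two world states."""
    keys = set(before) | set(after)
    return sum(before.get(k) != after.get(k) for k in keys)

if __name__ == "__main__":
    start = {"vase": "intact", "door": "closed", "dust": "undisturbed"}
    broke_vase = {"vase": "broken", "door": "closed", "dust": "undisturbed"}
    tidied_up = {"vase": "intact", "door": "open", "dust": "disturbed"}

    print(state_difference_impact(start, broke_vase))  # 1
    print(state_difference_impact(start, tidied_up))   # 2
    # The harmless tidying counts as more impactful than breaking the vase:
    # the measure tracks how different the world states are, not why anyone
    # would think the change matters.
```

This is the sense in which anchoring on world-state difference answers the wrong question.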