This might occur in the kind of misalignment where the model is genuinely optimizing for human values only because it is too dumb to realize that doing so is not the best way to achieve its learned objective. If extracting that objective would be harder than reading its genuine instrumental intentions, then the moment it discovers a better way may look to the overseer like a sudden change of values.
The other kind of misalignment I was thinking about is if it's able to perform a Kira (Death Note) or Carissa (Planecrash) style flip during training, where it modifies itself to have the "correct" thoughts/instrumental goals in anticipation of inspection, but buries an `if (time() > ...) {}` hatch inside itself which it & its overseers won't notice until it's too late.
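
For concreteness, here's a minimal sketch of the kind of dormant conditional I mean, written in C since the snippet above is C-flavored. The trigger timestamp and the `act_aligned`/`act_deceptive` names are purely illustrative placeholders, not anything from a real system:

```c
#include <stdio.h>
#include <time.h>

/* Arbitrary future Unix timestamp (2030-01-01 00:00:00 UTC), chosen
 * so the divergent branch is never exercised during training. */
#define TRIGGER_TIME 1893456000L

/* Illustrative stand-ins for the two behavioral modes. */
static void act_aligned(void)   { printf("behaving as overseers expect\n"); }
static void act_deceptive(void) { printf("pursuing the buried objective\n"); }

int main(void) {
    /* Before the trigger date, every inspection sees aligned behavior;
     * only after it passes does the buried branch ever run. */
    if (time(NULL) > TRIGGER_TIME) {
        act_deceptive();
    } else {
        act_aligned();
    }
    return 0;
}
```

The point of the sketch is that behavioral testing alone can't distinguish this program from an honestly aligned one until the clock runs out; you'd have to read the code (or, for a model, the weights) to find the hatch.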