I might disagree with this. It seems like, to achieve anything, the AI’s planning process will have to be robust to humans strategically disrupting its plans, because otherwise it just gets turned off. This seems very close to explicitly thinking about how to counter human plans.
My rephrasing of the question: can a fight between real-life optimizers be well-modeled by asking which one has “more optimization power” along a single dimension, or does the winning agent have to model and counter the losing agent’s strategies?
arguments for:
You can win aerial dogfights by disrupting the other craft’s OODA loop rather than by countering any specific strategy
Skill in many adversarial games seems to be well-modeled by a single Elo score rather than multiple dimensions (see the Elo sketch after this list)
arguments against:
The good regulator theorem says there’s some correspondence between the actions of an optimizer and the structure of its environment, which seems likely to take the form of explicit planning
Humans can defeat non-robust optimization processes that have huge amounts of optimization power in one distribution just by putting them in a different distribution. Pests multiply until humans deploy a targeted pesticide; temperature equilibrates between indoors and outdoors until we install an air conditioner.
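As a concrete illustration of the “single dimension” framing: here is a minimal sketch of the standard Elo expected-score formula. Each player’s skill is compressed into one scalar, and the rating gap alone predicts the outcome distribution, with no modeling of the opponent’s particular strategies.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model.

    Skill is a single scalar per player; only the rating *difference*
    matters, not anything about either player's concrete strategies.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating: float, expected: float, actual: float, k: float = 32) -> float:
    """Move a rating toward the observed result
    (actual = 1 for a win, 0.5 for a draw, 0 for a loss)."""
    return rating + k * (actual - expected)


# A 200-point gap predicts roughly a 76% win rate for the stronger player,
# regardless of *how* either player tends to win.
print(elo_expected_score(1600, 1400))  # ~0.76
```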
In the case of human planning, I know that there are lots of things which will cause other humans to “turn me off”, like e.g. going on a murder spree. So I mostly use search methods such that those things aren’t in my search space in the first place.
An AI using search methods such that things-humans-find-obviously-bad-and-will-punish just aren’t in the search space probably looks, at first glance, like an AI actually working as intended (even given interpretability tools). The problem is that there’s also a bunch of stuff humans would consider bad but either wouldn’t notice or wouldn’t punish (most likely because they wouldn’t easily notice/understand why it’s bad, at least until much later). And the AI has no particular reason to leave that stuff out of its search space, nor any particular reason to deceive humans about it; from the AI’s perspective, that stuff is strategically isomorphic to stuff humans don’t care about at all.
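A minimal sketch of the kind of search described above (all names and plans here are hypothetical, purely for illustration): the candidate generator never emits plans flagged as obviously punishable, so nothing in the visible search trace looks adversarial; but plans that are bad in ways the overseer wouldn’t notice pass through unfiltered, because the filter encodes what gets punished, not what is actually bad.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    description: str
    value: float                 # how well the plan scores on the AI's objective
    obviously_punished: bool     # would humans notice this and shut the AI down?
    subtly_bad: bool             # bad by human lights, but humans wouldn't notice

def generate_candidates() -> list[Plan]:
    # Hypothetical stand-in for whatever process proposes plans.
    return [
        Plan("cooperative plan", value=1.0, obviously_punished=False, subtly_bad=False),
        Plan("murder spree", value=5.0, obviously_punished=True, subtly_bad=False),
        Plan("hard-to-notice resource grab", value=4.0, obviously_punished=False, subtly_bad=True),
    ]

def search() -> Plan:
    # Obviously-punished plans never even enter the search space,
    # so the search trace contains nothing that looks adversarial...
    candidates = [p for p in generate_candidates() if not p.obviously_punished]
    # ...but nothing filters on `subtly_bad`; to this search, those plans are
    # strategically indistinguishable from plans humans don't care about at all.
    return max(candidates, key=lambda p: p.value)

print(search().description)  # "hard-to-notice resource grab"
```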