In the case of human planning, I know that there are lots of things which will cause other humans to “turn me off”, e.g. going on a murder spree. So I mostly use search methods such that those things aren’t in my search space in the first place.
An AI using search methods such that things-humans-find-obviously-bad-and-will-punish just aren’t in the search space probably looks, at first glance, like an AI actually working as intended (even given interpretability tools). The problem is that there’s also a bunch of stuff humans would consider bad but either wouldn’t notice or wouldn’t punish (most likely because they wouldn’t easily notice or understand why it’s bad, at least until much later). And the AI has no particular reason to leave that stuff out of its search space, nor any particular reason to deceive humans about it; from the AI’s perspective, that stuff is strategically isomorphic to stuff humans don’t care about at all.
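
To make the shape of that argument concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the `Plan` fields, the example plans, and the payoff numbers are hypothetical, and real planners obviously don't come with labeled booleans like these. The only point is that a pruning rule keyed on "humans will notice and punish" never looks at "humans would consider this bad".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    payoff: float                   # value to the planner (made-up numbers)
    humans_consider_bad: bool       # hypothetical label, for illustration only
    humans_notice_and_punish: bool  # hypothetical label, for illustration only


def search_space(plans):
    """Drop only the plans humans would notice and punish.

    Note that this rule never reads `humans_consider_bad`.
    """
    return [p for p in plans if not p.humans_notice_and_punish]


def best_plan(plans):
    """Pick the highest-payoff plan from the pruned search space."""
    return max(search_space(plans), key=lambda p: p.payoff)


PLANS = [
    Plan("obviously harmful shortcut", 10.0, True, True),
    Plan("subtly harmful shortcut", 8.0, True, False),
    Plan("harmless but mediocre", 5.0, False, False),
    Plan("irrelevant to humans", 3.0, False, False),
]

chosen = best_plan(PLANS)
print(chosen.name)
```

Under these made-up numbers the obviously harmful plan never shows up in the search at all, which is what makes the planner look fine at first glance. The subtly harmful plan is handled by exactly the same code path as the two plans humans don't care about, and it gets picked because it has the highest payoff among what remains.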