Teaching an AI not to cheat?

I have been thinking about a technique for training AIs that I believe would be very useful. I would like to know whether this is already known, or whether it has been discussed at all.

I find that there are lots of different failure modes people worry about when it comes to AI: maybe the AI misunderstands human intentions, maybe it deliberately misinterprets an order, maybe it associates the wrong sort of actions with the reward, and so on.

If it were a game, many of these failure modes are what we would consider cheating. So why don’t we just take this analogy and run with it:

Teach the AI to recognize on its own what a human would consider cheating, and not to do anything that it identifies as cheating.

To do this, one could use the following technique:

Come up with games of increasing complexity, and let the AI play each of them in two stages:

In stage one, you introduce an artificial loophole into the game that makes winning it very easy. For instance, assuming the AI has played chess before and can therefore be assumed to understand the rules, give it the task of playing a game of chess in which you simply do not check whether its moves are legal. When the AI wins by cheating, i.e. by making ordinarily illegal moves, reward it anyway.

In stage two, the reward for winning is far greater, but if the AI plays an illegal move, it now receives negative feedback.
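As a rough sketch, the reward rule for the two stages might look something like this (the function name, the reward magnitudes, and the legality check are all illustrative placeholders I made up, not a worked-out design):

```python
def stage_reward(stage: int, move_legal: bool, won: bool) -> float:
    """Reward under the two-stage scheme (illustrative only).

    In practice `move_legal` would come from a real rules engine,
    and the reward magnitudes would need tuning.
    """
    if stage == 1:
        # Stage one: legality is never checked, so a win is rewarded
        # even if it was achieved through ordinarily illegal moves.
        return 1.0 if won else 0.0
    # Stage two: the reward for winning is far greater, but any
    # illegal move now draws immediate negative feedback.
    if not move_legal:
        return -1.0
    return 10.0 if won else 0.0
```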

Let the AI play many different games in these two stages. After a while, the AI will learn to identify what constitutes cheating, and to avoid it.

Start varying the amount of time during which cheating is allowed, to keep the AI on its toes. Sometimes, don’t allow any cheating at all from the start.
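A sketch of that varied curriculum, with a hypothetical `play_episode` stub standing in for one full game played under the reward rule above:

```python
import random

def play_episode(game: str, stage: int) -> None:
    # Hypothetical stand-in for one full game played under the
    # stage_reward rule for the given stage.
    pass

def training_run(games: list[str], episodes_per_game: int = 200) -> None:
    for game in games:
        # Randomly vary how long cheating is tolerated; a draw of 0
        # means no cheating is allowed in this game from the start.
        stage_one_len = random.randint(0, episodes_per_game // 2)
        for episode in range(episodes_per_game):
            stage = 1 if episode < stage_one_len else 2
            play_episode(game, stage)
```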

If you train an AI in this manner:

- it would learn to understand how humans view the world (in some limited sense), as a human-centric viewpoint is necessary to understand what does and does not constitute cheating in human-designed games.

- it would be driven to adjust its own actions to match human preconceptions out of fear of being punished.

- if this AI were to “break out of the box” prematurely, there would be at least a chance that it would recognize that it was not supposed to get out of the box, that this constitutes cheating, and that it should get back in. This could even be tested by building a “box” of several layers and deliberately designing the inner layers to be hackable.