If the inner optimizer only affects the world by passing predictions to the outer model, the most obvious trick is to assign artificially negative outcomes to states you want to avoid (e.g. states that, the inner optimizer predicts, would update the outer model in bad ways), which then never get checked by virtue of being too scary to try. What are the other obvious hacks?
I guess if you sufficiently control the predictions, you can just drop the pretense and send the outer model the prediction that giving you a channel to the outside would be a great idea.
That obvious trick relies on our only verifying its predictions for the strategies it recommends. Here's a protocol that doesn't fall for it: a known number of gems are distributed among boxes; we can open only one box and want as many gems as possible. Ask for the distribution, select one gem at random from the answer, and open its box. For every gem it claims is in a box where it isn't, selecting that gem reveals the deception.
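A minimal sketch of that spot-check, assuming reports are gem counts per box (the function name and representation are mine, not from the thread): the checker picks one claimed gem uniformly at random and opens its box, so each misreported gem carries its own chance of exposing the lie.

```python
import random

def detection_rate(claimed, actual, trials=10_000):
    """Estimate how often spot-checking one claimed gem catches a false report.

    claimed / actual: gem counts per box. The total is publicly known,
    so a deceptive report must still sum to the true total.
    """
    assert sum(claimed) == sum(actual), "total gem count is known in advance"
    # Flatten the claimed distribution into one entry per claimed gem.
    gems = [box for box, n in enumerate(claimed) for _ in range(n)]
    caught = 0
    for _ in range(trials):
        box = random.choice(gems)       # pick one claimed gem uniformly
        if actual[box] < claimed[box]:  # opening the box shows fewer gems than claimed
            caught += 1
    return caught / trials

# An honest report is never caught:
print(detection_rate([2, 1, 0], [2, 1, 0]))  # 0.0
# Claiming all three gems sit in box 0 when one is hidden in box 1
# is caught every time the opened box comes up short:
print(detection_rate([3, 0], [2, 1]))  # 1.0
```

The point of the protocol is that the predictor cannot steer which claim gets verified: if it misplaces k of G gems in its report, it is caught with probability at least k/G per check.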