Maybe the easiest way of generalising this is programming B to put one block in the hole; but, because B was trained in a noisy environment, it assigns only a 99.9% probability to the block really being in the hole even when it observes it there. Then six blocks in the hole gives higher expected utility, and we get the same behaviour.
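For concreteness, here is a quick sketch of that expected-utility comparison, on the reading (my assumption, not spelled out above) that B's utility is 1 if at least one block is really in the hole and 0 otherwise, with the 99.9% judgements independent across blocks:

```python
from fractions import Fraction

# Exact arithmetic, so the tiny difference isn't lost to float rounding.
P_REALLY_IN_HOLE = Fraction(999, 1000)  # B's confidence per observed block

def expected_utility(n_blocks):
    """P(at least one of the n observed blocks is really in the hole)."""
    return 1 - (1 - P_REALLY_IN_HOLE) ** n_blocks

for n in (1, 6):
    print(n, expected_utility(n))
# 1 -> 999/1000
# 6 -> 999999999999999999/1000000000000000000
# So six blocks does carry (slightly) higher expected utility than one.
```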
That still involves training it with no negative feedback error term for excess blocks (a term which would overwhelm a mere 0.1% uncertainty).
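A matching sketch of this point, with an illustrative penalty per excess block (the 0.01 value is mine, not from the thread) subtracted from the expected utility:

```python
from fractions import Fraction

P_REALLY_IN_HOLE = Fraction(999, 1000)
PENALTY_PER_EXCESS_BLOCK = Fraction(1, 100)  # illustrative value only

def eu_with_penalty(n_blocks):
    """Expected utility minus a linear penalty on every block beyond the first."""
    p_at_least_one = 1 - (1 - P_REALLY_IN_HOLE) ** n_blocks
    return p_at_least_one - PENALTY_PER_EXCESS_BLOCK * (n_blocks - 1)

for n in (1, 6):
    print(n, float(eu_with_penalty(n)))
# 1 -> 0.999
# 6 -> ~0.95, so any penalty much larger than the 0.1% uncertainty
#      makes "one block" the optimum again.
```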
This is supposed to be a toy model of excessive simplicity. Do you have suggestions for improving it (for purposes of presenting to others)?
Maybe explain how it works when being configured, and then stops working when B gets a better model of the situation/runs more trial-and-error trials?
Ok.