Note that our agent will quickly prove “if output = ‘defect’ then utility >= $1”.
Your intuition that it gets deduced before any of the spurious claims like “if output = ‘defect’ then utility <= -$1” is taking advantage of an authoritative payoff matrix that X can’t safely calculate xerself. I’m not sure that this tweaked version is any safer from exploitation...
Why not? Can’t the payoff matrix be “read off” from the “world program” (assuming X isn’t just ‘given’ the payoff matrix as an argument)?
The one-player game that I wrote out is an example of an NDT agent trying to read off the payoff matrix from the world program, and failing. There are ways to ensure you read off the matrix correctly, but that’s tantamount to what you do to implement CDT, so I’ll explain it in Part II.
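
For readers wondering why a claim like “if output = ‘defect’ then utility <= -$1” can be provable at all, here is a minimal sketch in Python. It assumes the claims are read as material implications, and it uses a hypothetical world in which the agent in fact outputs ‘cooperate’ and receives $0; those values and the helper `implies` are illustrative, not taken from the post. Under that assumption, any statement beginning “if output = ‘defect’” holds vacuously, so the genuine bound and the spurious one both come out true.

```python
def implies(antecedent: bool, consequent: bool) -> bool:
    """Material implication: false only when the antecedent holds and the consequent fails."""
    return (not antecedent) or consequent


# Hypothetical facts about the world (assumptions for illustration only):
# the agent actually outputs 'cooperate' and ends up with $0.
actual_output = "cooperate"
actual_utility = 0

# The claim the agent is hoped to prove first.
genuine = implies(actual_output == "defect", actual_utility >= 1)

# The spurious claim; it contradicts the genuine one only if the agent really defects.
spurious = implies(actual_output == "defect", actual_utility <= -1)

print(genuine, spurious)
```

Both implications evaluate to true here, which is why the order in which X happens to find them matters: nothing in either proof marks one of them as the real payoff.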