Instrumental Convergence Bounty

I have yet to find a real-world example that I can test my corrigibility definition on. Hence, I will send $100 to the first person who can send/​show me an example of instrumental convergence that is:

  1. Surprising, in the sense that the model was trained on a goal other than “maximize money” “maximize resources” or “take over the world”

  2. Natural, in the sense that instrumental convergence arose while trying to do some objective, not with the goal in advance being to show instrumental convergence

  3. Reproducible, in the sense that I could plausibly run the model+environment+whatever else is needed to show instrumental convergence on a box I can rent on lambda

Example of what would count as a valid solution:

I was training an agent to pick apples in Terraria and it took over the entire world in order to convert it into a massive apple orchard

Example that would fail because it is not surprising

I trained an agent to play CIV IV and it took over the world

Example that would fail because it is not natural

After reading your post, I created a toy model where an agent told to pick apples tiles the world with apple trees

Example that would fail because it is not reproducible

The US military did a simulation in which they trained a drone which decided to take out is operator

Update

The bounty has been claimed by Hastings.

Obviously I would still appreciate more examples, but there won’t be a 2nd bounty (or if I do create one in the future it will have more requirements attached).