I'm confused about how step (3) is supposed to work, especially about how “having the training be done by the model being trained given access to tools from (2)” is a route to it.
At some step in the amplification process, we’ll have systems that are capable of deception, unlike the base case. So it seems that if we let the model train its successor using the myopia-verification tools, we need some guarantee that the successor is non-deceptive in the first place. (Otherwise the myopia-verification tools aren’t guaranteed to work, as you note in the bullet points of step (2).) Are you supposing that there’s some property other than myopia that the model could use to verify that its successor is non-deceptive, such that it can successfully verify myopia? What is that property? And do we have reason to think that property will only be guaranteed if the model doing the training is myopic? (Otherwise why bother with myopia at all—just use that other property to guarantee non-deception.)
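To make the circularity I'm pointing at concrete, here's a toy sketch in Python. Everything in it is a hypothetical stand-in of my own (`Model`, `train_successor`, `verify_myopia` are not names from the post, and the step-(2) tools are obviously not a one-line function); the only point is where the soundness assumption sits in the loop:

```python
from dataclasses import dataclass


@dataclass
class Model:
    """Hypothetical stand-in for one model in the amplification chain."""
    capability: int

    def train_successor(self) -> "Model":
        # Step (3) as I read it: the current model trains a more capable
        # successor, with access to the step-(2) tools during training.
        return Model(capability=self.capability + 1)


def verify_myopia(model: Model) -> bool:
    # Stand-in for the step-(2) myopia-verification tools. Per the post's
    # own caveats, their guarantee only holds when the model under
    # inspection is non-deceptive.
    return True  # placeholder; the real check is the hard part


def amplify(model: Model, steps: int) -> Model:
    for _ in range(steps):
        successor = model.train_successor()
        # Circularity: this check is only sound if `successor` is already
        # non-deceptive, yet non-deception is exactly what verified myopia
        # is supposed to buy us in the first place.
        assert verify_myopia(successor)
        model = successor
    return model
```

The `assert` is only as good as the assumption that `successor` is already non-deceptive, and that's the step I don't see how to discharge.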
Intuitively, step (3) seems harder than (2): in (3) you have to worry about deception creeping into the more powerful successor agent, while (2) by definition only requires myopia verification of non-deceptive models.
ETA: Other than this confusion, I found this post helpful for understanding what success looks like to (at least one) alignment researcher, so thanks!