Rohin Shah comments on Distinguishing claims about training vs deployment

Rohin Shah 4 Feb 2021 2:11 UTC
Planned summary for the Alignment Newsletter:
One story for AGI is that we train an AI system on some objective function, such as an objective that rewards the agent for following natural-language commands given to it by humans. We then deploy the system without any function that produces reward values; we instead give the trained agent commands in natural language. Many key claims in AI alignment become clearer once we state precisely whether they apply during training or during deployment.
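To make the training/deployment split concrete, here is a minimal toy sketch. It is not from the post: the two-command world and the names `reward_fn`, `train`, and `deploy` are hypothetical, and a tabular bandit stands in for whatever training process would actually be used. The only point is that the reward function exists inside `train` and is never consulted by `deploy`.

```python
import random

# Hypothetical toy world: two natural-language commands, two actions.
COMMANDS = ["go left", "go right"]
ACTIONS = ["left", "right"]


def reward_fn(command, action):
    """Training-only objective: 1.0 if the action follows the command, else 0.0."""
    return 1.0 if command == "go " + action else 0.0


def train(steps=2000, lr=0.1, epsilon=0.2, seed=0):
    """Learn a value table with epsilon-greedy exploration, using reward_fn."""
    rng = random.Random(seed)
    values = {(c, a): 0.0 for c in COMMANDS for a in ACTIONS}
    for _ in range(steps):
        command = rng.choice(COMMANDS)
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=lambda a: values[(command, a)])  # exploit
        reward = reward_fn(command, action)  # reward exists only during training
        values[(command, action)] += lr * (reward - values[(command, action)])
    return values


def deploy(policy_values, command):
    """Deployment: the frozen policy acts on a command; no reward is computed."""
    return max(ACTIONS, key=lambda a: policy_values[(command, a)])


policy = train()
print(deploy(policy, "go left"), deploy(policy, "go right"))
```

In this sketch, a claim about `reward_fn` is a claim about training, while a claim about what `deploy` does with a given command is a claim about deployment.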
For example, consider the instrumental convergence argument. The author proposes that we instead think of the training convergence thesis: a wide range of environments in which we could train an AGI will lead to the development of goal-directed behavior aimed at certain convergent goals (such as self-preservation). This could happen either via the AGI internalizing these convergent goals directly as final goals, or via the AGI learning final goals for which they are instrumental.
The author similarly clarifies goal specification, the orthogonality thesis, fragility of value, and Goodhart’s Law.