Use Opt to find a language model. The hope is to make it imitate a human researcher’s thought process fast enough that the imitation can attempt to solve the AI alignment problem for us.
Use Opt to find a proof that generating such an imitation will not lead to a daemon’s treacherous turn, defined here as the model’s prediction disagreeing with a large enough Solomonoff-weighted fraction of its competitors. The hope is that the consequentialist portion of the hypothesis space is not large, cooperative, and homogeneous enough to form a single voting bloc that bypasses the daemon alarm.
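The disagreement criterion above can be sketched concretely. The following is a minimal, hypothetical illustration (not the actual proposal's machinery): each hypothesis gets a prior weight standing in for its Solomonoff weight, and the alarm fires for any model whose prediction disagrees with more than a threshold fraction of its competitors' total weight. The function name, weight scheme, and threshold are all assumptions introduced for illustration.

```python
def daemon_alarm(predictions, weights, threshold=0.5):
    """Flag any model whose prediction disagrees with a large enough
    weight-fraction of its competitors.

    predictions: dict model_id -> predicted output
    weights: dict model_id -> prior weight (e.g. roughly 2**-description_length,
             a stand-in for a Solomonoff prior; hypothetical choice)
    threshold: fraction of competitor weight that must disagree
               before the alarm fires (hypothetical default)
    """
    total = sum(weights.values())
    alarms = {}
    for m, p in predictions.items():
        # Total weight of competitors that predict something different.
        disagree = sum(w for k, w in weights.items()
                       if k != m and predictions[k] != p)
        competitor_total = total - weights[m]
        alarms[m] = competitor_total > 0 and disagree / competitor_total > threshold
    return alarms


# A low-weight outlier disagreeing with the rest trips the alarm;
# the majority-weight models do not.
preds = {"a": 1, "b": 1, "c": 0}
ws = {"a": 0.5, "b": 0.3, "c": 0.2}
print(daemon_alarm(preds, ws))  # {'a': False, 'b': False, 'c': True}
```

On this toy picture, the worry in the step above is precisely that treacherous hypotheses could coordinate on the same wrong prediction, so no single one of them crosses the disagreement threshold.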