Use Opt to find a language model. The hope is to make it imitate a human researcher’s thought process fast enough that the imitation can attempt to solve the AI alignment problem for us.
Use Opt to find a proof that generating such an imitation will not lead to a daemon’s treacherous turn, defined here as the model’s prediction disagreeing with a large enough Solomonoff-weighted fraction of its competitors. The hope is that the consequentialist portion of the hypothesis space is not large, cooperative, and homogeneous enough to form a single voting bloc that bypasses the daemon alarm.
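The disagreement criterion above can be sketched concretely. The following is a minimal, hypothetical illustration (not the actual proposal's machinery): each hypothesis gets a prior weight standing in for its Solomonoff weight, and the alarm fires for any model whose prediction disagrees with more than a threshold fraction of its competitors' total weight. The function name, weight scheme, and threshold are all assumptions introduced for illustration.

```python
def daemon_alarm(predictions, weights, threshold=0.5):
    """Flag any model whose prediction disagrees with a large enough
    weight-fraction of its competitors.

    predictions: dict model_id -> predicted output
    weights: dict model_id -> prior weight (e.g. roughly 2**-description_length,
             a stand-in for a Solomonoff prior; hypothetical choice)
    threshold: fraction of competitor weight that must disagree
               before the alarm fires (hypothetical default)
    """
    total = sum(weights.values())
    alarms = {}
    for m, p in predictions.items():
        # Total weight of competitors that predict something different.
        disagree = sum(w for k, w in weights.items()
                       if k != m and predictions[k] != p)
        competitor_total = total - weights[m]
        alarms[m] = competitor_total > 0 and disagree / competitor_total > threshold
    return alarms


# A low-weight outlier disagreeing with the rest trips the alarm;
# the majority-weight models do not.
preds = {"a": 1, "b": 1, "c": 0}
ws = {"a": 0.5, "b": 0.3, "c": 0.2}
print(daemon_alarm(preds, ws))  # {'a': False, 'b': False, 'c': True}
```

On this toy picture, the worry in the step above is precisely that treacherous hypotheses could coordinate on the same wrong prediction, so no single one of them crosses the disagreement threshold.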