Summarize an Alignment Proposal in the form of a Training Story
Context: An alignment researcher has written a proposal for building a safe AI system. To better understand the core of the idea, they want to summarize their proposal in the form of a training story.
Input type: A proposal for building a safe AI system.
Output type: A summary of the proposal in the form of a training story.
Info constraints: None.
Both instances are taken directly from How do we become confident in the safety of a machine learning system?
Instance 1:
Input: Paul Christiano’s description of corrigibility.
Output: Paul Christiano’s description of the “intended model”.
Instance 2:
Input: Chris Olah’s Microscope AI proposal.
Output (verbatim from How do we become confident in the safety of a machine learning system?):
The training goal of Microscope AI is a purely predictive model that internally makes use of human-understandable concepts to be able to predict the data given to it, without reasoning about the effects of its predictions on the world. Thus, we can think of Microscope AI’s training goal as having two key components:
the model doesn’t try to optimize anything over the world, instead being composed solely of a world model and a pure predictor; and
the model uses human-understandable concepts to do so.
The reason that we want such a model is so that we can do transparency and interpretability on it, which should hopefully allow us to extract the human-understandable concepts learned by the model. Then, the idea is that this will be useful because we can use those concepts to help improve human understanding and decision-making.
The plan for getting there is to do self-supervised learning on a large, diverse dataset while using transparency tools during training to check that the correct training goal is being learned. Primarily, the training rationale is to use the nudge of an inductive bias towards simplicity to ensure that we get the desired training goal. This relies on it being the case that the simplest algorithm that’s implementable on a large neural network and successfully predicts the training data is a straightforward/pure predictor—and one that uses human-understandable concepts to do so. The use of transparency tools during training is then mostly just to verify that such a nudge is in fact sufficient, helping to catch the presence of any sort of agentic optimization so that training can be halted in such a case.