Task: Suggest surprising experiments that challenge assumptions
Context: A researcher is considering an alignment proposal that hinges on some key assumptions. They would like to see some suggestions for experiments (either theoretical thought experiments or actual real-world experiments) that could challenge those assumptions. If the experiment has already been done, the output should report the results.
Input type: An assumption about a powerful AI system
Output type: a suggestion for an experiment that could challenge that assumption. If it has been done already, the results of those experiments.
Instance 1:
Input: The performance of a model is impossible to predict, so we can’t hope to have an idea of a model’s capabilities before it is trained and evaluated.
Output: A key measure of a model’s performance, such as the loss, might scale predictably with model size. This was investigated by Kaplan et al. (https://arxiv.org/abs/2001.08361), who found that the loss tends to follow a power law.
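As a rough illustration of what "follows a power law" means in practice, the sketch below fits loss = c * size^(-alpha) to a few points by least squares in log-log space, then extrapolates to a larger size. The data is synthetic and every number in it is a made-up assumption, not a result from the paper.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = c * size**(-alpha) via a straight-line fit in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)  # (alpha, c)

# Synthetic, purely illustrative data drawn from an exact power law.
TRUE_ALPHA, TRUE_C = 0.08, 10.0
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_C * s ** (-TRUE_ALPHA) for s in sizes]

alpha, c = fit_power_law(sizes, losses)

# Extrapolate to a model size that was not in the fitted data.
predicted_loss = c * (1e10) ** (-alpha)
```

If real losses fall close to such a fitted line, the loss of a larger model can be estimated before it is trained, which is what would challenge the assumption in the input.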
Instance 2:
Input: Suppose a model is trained on data that is mixed with some noise (as in https://arxiv.org/pdf/2009.08092.pdf ). The model will necessarily learn that the data was mixed with some noise, rather than learn a really complex decision boundary.
Output: Suppose that you try fine-tuning one of these models on data that doesn’t have the noise. If it is very slow to adapt, that would suggest it had learned the complex decision boundary instead. (This experiment hasn’t been done.)
Instance 3:
Input: It’s impossible to train a neural network without non-linearities like ReLU or a sigmoid.
Output: That is true for theoretical neural networks, but real neural networks are trained using floating point numbers with inherently non-linear arithmetic. These imperfections might be enough to train a competent model. This experiment was done by Jakob Foerster, who found that this was indeed enough: https://openai.com/blog/nonlinear-computation-in-linear-networks/
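To make the "non-linear arithmetic" point concrete, here is a minimal sketch of one way floating-point arithmetic departs from exact linear algebra: underflow. It uses Python's 64-bit floats and a hypothetical helper `scale_roundtrip`; it only illustrates the general phenomenon and is not the setup from the linked post.

```python
def scale_roundtrip(x, scale=1e-200):
    """Multiply by `scale`, then divide by it again.

    In exact arithmetic this is the identity function, which is linear.
    In floating point, small inputs underflow to zero after the multiply,
    so the round trip behaves like a threshold instead.
    """
    return (x * scale) / scale

large_input = scale_roundtrip(1.0)     # survives the round trip
small_input = scale_roundtrip(1e-200)  # 1e-400 underflows to 0.0

# A linear map f would satisfy f(1e-200 * 1.0) == 1e-200 * f(1.0).
linearity_holds = small_input == 1e-200 * large_input
```

Here `linearity_holds` comes out false: a function built only from "linear" operations zeroes out small inputs while passing larger ones through unchanged.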
Task: convert mathematical expressions into natural language
Context: A researcher is reading a paper about alignment that contains a lot of well-specified but dense mathematical notation. They would like to see a less terse and more fluent description of the same idea that’s easier to read, similar to what a researcher might say to them at a blackboard while writing the math. This might involve additional context for novices.
Input type: a piece of mathematically-dense but well-specified text from a paper
Output type: a fluent, natural language description of the same mathematical objects
Info constraints: none
Instance 1:
Input: the section “The circuit distillation prior” from https://www.alignmentforum.org/posts/7ygmXXGjXZaEktF6M/towards-a-better-circuit-prior-improving-on-elk-state-of-the
Output: Consider a predictive model that predicts the output of a video camera given some sensors in the world. As in ELK, our goal will be to find a function that looks at the sensors, the model, and some questions, then returns some answers to those questions using the model’s latent knowledge.
Instance 2:
Input: the section “Our model of proxy misspecification” from https://www.alignmentforum.org/posts/tWpgtjRm9qwzxAZEi/proxy-misspecification-and-the-capabilities-vs-value
Output: Alice has n things that she values: given any set of these items, she’ll always value the set at least as much if she adds another one. A robot is given a proxy for this utility, but the proxy depends on only a strict subset of the items. The robot optimizes its proxy subject to some resource constraints. It’s a theorem that the robot will not pick the things that weren’t included in its proxy utility.
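A toy version of the setup described above can be brute-forced. The sketch below is an illustration under invented assumptions (three items, a budget of four units, square-root utilities), not the model from the post: the robot maximizes a proxy that sees only the first two items and ends up allocating nothing to the third.

```python
from itertools import product
from math import sqrt

BUDGET = 4   # total resource units available (hypothetical)
N_ITEMS = 3  # Alice values all three items

def true_utility(allocation):
    """Alice's utility: increasing in every item."""
    return sum(sqrt(amount) for amount in allocation)

def proxy_utility(allocation):
    """The robot's proxy: depends only on items 0 and 1."""
    return sqrt(allocation[0]) + sqrt(allocation[1])

# Every integer allocation that respects the resource constraint.
allocations = [
    a for a in product(range(BUDGET + 1), repeat=N_ITEMS) if sum(a) <= BUDGET
]

robot_choice = max(allocations, key=proxy_utility)
best_true_utility = max(true_utility(a) for a in allocations)
```

In this toy case `robot_choice` puts the whole budget into the two items the proxy can see, and its true utility falls short of `best_true_utility`.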
Instance 3:
Input: The paragraph “vector valued preferences” from https://www.alignmentforum.org/posts/oheKfWA7SsvpK7SGp/probability-is-real-and-value-is-complex
Output: We think of events, in the sense of the sigma algebras that are used in the formalization of probability theory. Each event has a probability and an expected utility assigned to it. We are interested in the product of these two, which Vladimir Nesov called “shouldness.”
This feels tractable in large part because mathematical notation tends to involve a lot of context, which a language model could probably digest.