Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

Artificial Intelligence, Values and Alignment (Iason Gabriel) (summarized by Rohin): This paper from a DeepMind author considers what it would mean to align an AI system. It first makes a distinction between the technical and normative aspects of the AI alignment problem. Roughly, the normative aspect asks, “what should our AI systems do?”, while the technical aspect asks, “given we know what our AI systems should do, how do we get them to do it?”. The author argues that these two questions are interrelated and should not be solved separately: for example, the current success of deep reinforcement learning in which we maximize expected reward suggests that it would be much easier to align AI to a utilitarian framework in which we maximize expected utility, as opposed to a deontological or Kantian framework.

The paper then explores the normative aspect, in both the single-human and multiple-human cases. When there’s only one human, we must grapple with the problem of what to align our AI system to. The paper considers six possibilities: instructions, expressed intentions, revealed preferences, informed preferences, interests, and values, but doesn’t come to a conclusion about which is best. When there are multiple humans, we must also deal with the fact that different people disagree on values. The paper analyzes three possibilities: aligning to a global notion of morality (e.g. “basic human rights”), doing what people would prefer from behind a veil of ignorance, and pursuing values that are determined by a democratic process (the domain of social choice theory).

Technical AI alignment

Mesa optimization

Inner alignment requires making assumptions about human values (Matthew Barnett) (summarized by Rohin): Typically, for inner alignment, we are considering how to train an AI system that effectively pursues an outer objective function, which we assume is already aligned. Given this, we might think that the inner alignment problem is independent of human values: after all, presumably the outer objective function already encodes human values, and so if we are able to align to an arbitrary objective function (something that presumably doesn’t require human values), that would solve inner alignment.

This post argues that this argument doesn’t work: in practice, we only get data from the outer objective on the training distribution, which isn’t enough to uniquely identify the outer objective. So, solving inner alignment requires our agent to “correctly” generalize from the training distribution to the test distribution. However, the “correct” generalization depends on human values, suggesting that a solution to inner alignment must depend on human values as well.
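To make the underdetermination point concrete, here is a minimal sketch with two made-up reward functions (reward_a and reward_b are hypothetical and not taken from the post). They agree on every training state but disagree off-distribution, so training data alone cannot tell an agent which one was intended.

```python
# Two hypothetical reward functions over integer states (illustrative only).
def reward_a(state: int) -> float:
    """Reward being close to state 5, everywhere."""
    return float(-abs(state - 5))


def reward_b(state: int) -> float:
    """Matches reward_a on states 0..10, but prefers large states elsewhere."""
    if 0 <= state <= 10:
        return float(-abs(state - 5))
    return float(state)


train_states = range(0, 11)   # states seen during training
test_states = range(11, 21)   # states only encountered at deployment

# The two objectives are indistinguishable from training data alone...
agree_on_train = all(reward_a(s) == reward_b(s) for s in train_states)
# ...but they recommend different behavior off-distribution.
agree_on_test = all(reward_a(s) == reward_b(s) for s in test_states)

print(agree_on_train, agree_on_test)
```

Which of the two counts as the “correct” generalization is not something the training data can settle, which is the gap the post says human values have to fill.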

Rohin’s opinion: I certainly agree that we need some information that leads to the “correct” generalization, though this could be something like e.g. ensuring that the agent is corrigible (AN #35). Whether this depends on human “values” depends on what you mean by “values”.

Learning human intent

A Framework for Data-Driven Robotics (Serkan Cabi et al) (summarized by Nicholas): This paper presents a framework for using a mix of task-agnostic data and task-specific rewards to learn new tasks. The process is as follows:

1. A human teleoperates the robot to provide a demonstration. This circumvents the exploration problem, by directly showing the robot the relevant states.

2. All of the robot’s sensory input is saved to NeverEnding Storage (NES), which stores data from all tasks for future use.

3. Humans annotate a subset of the NES data via task-specific reward sketching, where humans draw a curve showing progress towards the goal over time (see paper for more details on their interface).

4. The labelled data is used to train a reward model.

5. The agent is trained using all the NES data, with the reward model providing rewards.

6. At test-time, the robot continues to save data to the NES.

They then use this approach with a robotic arm on a few object manipulation tasks, such as stacking the green object on top of the red one. They find that on these tasks, they can annotate rewards at hundreds of frames per minute.
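The six steps above can be sketched roughly as follows. This is a toy illustration, not the paper’s implementation: the Episode and NeverEndingStorage classes, the single scalar observation per frame, and the linear reward model are all assumptions made to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Episode:
    task: str
    observations: List[float]             # toy stand-in for per-frame sensor data
    sketch: Optional[List[float]] = None  # human-drawn reward curve, if annotated


@dataclass
class NeverEndingStorage:
    """Hypothetical stand-in for NES: keeps every episode from every task."""
    episodes: List[Episode] = field(default_factory=list)

    def add(self, episode: Episode) -> None:
        self.episodes.append(episode)


def fit_reward_model(episodes: List[Episode]) -> Callable[[float], float]:
    """Least-squares fit of reward = a * obs + b on annotated frames only.

    Assumes at least two annotated frames with distinct observations.
    """
    xs: List[float] = []
    ys: List[float] = []
    for ep in episodes:
        if ep.sketch is not None:
            xs.extend(ep.observations)
            ys.extend(ep.sketch)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return lambda obs: a * obs + b


nes = NeverEndingStorage()
# Steps 1-3: a teleoperated demonstration is stored and then sketched by a human.
nes.add(Episode("stack", [0.0, 0.5, 1.0], sketch=[0.0, 0.5, 1.0]))
# Unannotated data (possibly from other tasks) is stored as well.
nes.add(Episode("stack", [0.2, 0.4, 0.9]))

# Step 4: train the reward model on the annotated subset.
reward_model = fit_reward_model(nes.episodes)
# Step 5: relabel *all* stored data with predicted rewards for agent training.
relabelled = [[reward_model(o) for o in ep.observations] for ep in nes.episodes]
print(relabelled)
```

The point of the sketch is that only a subset of the stored data needs human annotation, while the learned reward model supplies rewards for everything else in storage.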

Nicholas’s opinion: I’m happy to see reward modeling being used to achieve new capabilities results, primarily because it may lead to more focus from the broader ML community on a problem that seems quite important for safety. Their reward sketching process is quite efficient and having more reward data from humans should enable a more faithful model, at least on tasks where humans are able to annotate accurately.

Miscellaneous (Alignment)

Does Bayes Beat Goodhart? (Abram Demski) (summarized by Flo): It has been claimed (AN #22) that Goodhart’s law might not be a problem for expected utility maximization, as long as we correctly account for our uncertainty about the correct utility function.

This post argues that Bayesian approaches are insufficient to get around Goodhart. One problem is that with insufficient overlap between possible utility functions, some utility functions might essentially be ignored when optimizing the expectation, even if our prior assigns positive probability to them. However, in reality, there is likely considerable overlap between the utility functions in our prior, as they are selected to fit our intuitions.
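A toy illustration of the overlap problem, with made-up numbers (the utility tables and the 0.9/0.1 prior are assumptions, not from the post): when two candidate utility functions barely overlap, maximizing expected utility can pick the action that is worst according to the less likely candidate, even though that candidate has positive prior probability.

```python
# Prior over two hypothetical candidate utility functions.
prior = {"u1": 0.9, "u2": 0.1}

# Utilities of three actions under each candidate. The candidates barely
# overlap: u1's best action is u2's worst one.
utilities_low_overlap = {
    "u1": {"a": 10.0, "b": 9.0, "c": 0.0},
    "u2": {"a": 0.0, "b": 1.0, "c": 1.0},
}


def best_action(prior, utilities):
    """Return the action maximizing prior-weighted expected utility."""
    actions = next(iter(utilities.values())).keys()

    def expected(action):
        return sum(prior[name] * utilities[name][action] for name in prior)

    return max(actions, key=expected)


chosen = best_action(prior, utilities_low_overlap)
print(chosen, utilities_low_overlap["u2"][chosen])
```

Here action "b" would cost very little under u1 and would be much better under u2, but the expectation still selects "a", so u2 is effectively ignored.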

More severely, bad priors can lead to systematic biases in a Bayesian’s expectations, especially given embeddedness. As an extreme example, the prior might assign zero probability to the correct utility function. Calibrated learning, rather than Bayesian learning, can help with this, but only for regressional Goodhart (Recon #5). Adversarial Goodhart, where another agent tries to exploit the difference between your utility and your proxy, seems to also require randomization like quantilization (AN #48).
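For readers unfamiliar with quantilization, here is a minimal sketch. It assumes a uniform base distribution over a finite action set, and the quantilize function and its parameters are illustrative names rather than anything from the post. Instead of taking the single action that maximizes the proxy, it samples randomly from the top q fraction of actions.

```python
import random


def quantilize(actions, proxy_utility, q, rng):
    """Sample uniformly from the top q fraction of actions, ranked by proxy utility.

    Assumes the base distribution over `actions` is uniform.
    """
    ranked = sorted(actions, key=proxy_utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:cutoff])


rng = random.Random(0)
actions = list(range(100))


def proxy(action):
    # Hypothetical proxy utility: larger action indices look better.
    return action


picks = [quantilize(actions, proxy, 0.1, rng) for _ in range(1000)]
print(min(picks), max(picks))
```

Because the choice is randomized over many reasonably good actions, an adversary cannot rely on the agent landing on the one action where proxy and true utility diverge most.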

Flo’s opinion: The degree of overlap between utility functions seems to be pretty crucial (also see here (AN #82)). It does seem plausible for the Bayesian approach to work well without the correct utility in the prior if there were a lot of overlap between the utilities in the prior and the true utility. However, I am somewhat sceptical of our ability to get reliable estimates for that overlap.

Other progress in AI

Deep learning

Deep Learning for Symbolic Mathematics (Guillaume Lample et al) (summarized by Matthew): This paper demonstrates the ability of sequence-to-sequence models to outperform computer algebra systems (CAS) at the tasks of symbolic integration and solving ordinary differential equations. Since finding the derivative of a function is usually easier than integration, the authors built a large training set by generating random mathematical expressions, and then using these expressions as the labels for their derivatives. The mathematical expressions were formulated as syntax trees, and mapped to sequences by writing them in Polish notation. These sequences were, in turn, used to train a transformer model. While their model outperformed top CAS on the training data set, and could compute answers much more quickly than the CAS could, tests of generalization were mixed: importantly, the model did not generalize extremely well to datasets that were generated using different techniques than the training dataset.
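The data generation trick can be sketched as follows. This is a simplified illustration rather than the authors’ generator: the operator set, the tuple-based tree encoding, and the helper names (random_expr, diff, to_prefix, make_pair) are all assumptions, and no simplification of the derivative is attempted.

```python
import random


def random_expr(rng, depth):
    """Random expression tree over x, small integers, +, *, sin and cos."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(["x", 1, 2, 3])
    op = rng.choice(["+", "*", "sin", "cos"])
    if op in ("+", "*"):
        return (op, random_expr(rng, depth - 1), random_expr(rng, depth - 1))
    return (op, random_expr(rng, depth - 1))


def diff(e):
    """Symbolic derivative with respect to x (left unsimplified)."""
    if e == "x":
        return 1
    if isinstance(e, int):
        return 0
    op = e[0]
    if op == "+":
        return ("+", diff(e[1]), diff(e[2]))
    if op == "*":  # product rule
        return ("+", ("*", diff(e[1]), e[2]), ("*", e[1], diff(e[2])))
    if op == "sin":  # chain rule
        return ("*", ("cos", e[1]), diff(e[1]))
    if op == "cos":  # chain rule
        return ("*", ("*", -1, ("sin", e[1])), diff(e[1]))
    raise ValueError(f"unknown operator: {op}")


def to_prefix(e):
    """Flatten a tree into a token sequence in Polish (prefix) notation."""
    if isinstance(e, tuple):
        tokens = [e[0]]
        for child in e[1:]:
            tokens.extend(to_prefix(child))
        return tokens
    return [str(e)]


def make_pair(rng, depth=3):
    """One integration example: input is f', target is f."""
    f = random_expr(rng, depth)
    return to_prefix(diff(f)), to_prefix(f)


source_tokens, target_tokens = make_pair(random.Random(0))
print(source_tokens, "->", target_tokens)
```

Differentiation is mechanical, so each random expression yields an (integrand, integral) pair for free, and the prefix token sequences are what a sequence-to-sequence model would consume.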

Matthew’s opinion: At first this paper appeared more ambitious than Saxton et al. (2019), but it ended up with more positive results, even though the papers used the same techniques. Therefore, my impression is not that we recently made rapid progress on incorporating mathematical reasoning into neural networks; rather, I now think that the tasks of integration and solving differential equations are simply well-suited for neural networks.

Unsupervised learning

Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data (Felipe Petroski Such et al) (summarized by Sudhanshu): The Generative Teaching Networks (GTN) paper breaks new ground by training generators whose synthetic data enables learner neural networks to learn faster than when training on real data. The process is as follows: the generator produces synthetic training data by transforming some sampled noise vector and label; a newly-initialized learner is trained on this synthetic data and evaluated on real data; the error signal from this evaluation is backpropagated to the generator via meta-gradients, to enable it to produce synthetic samples that will train the learner networks better. They also demonstrate that their curriculum learning variant, where the input vectors and their order are learned along with generator parameters, is especially powerful at teaching learners with few samples and few steps of gradient descent.
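The inner and outer loops described above can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for brevity: the “generator” is a single learnable synthetic label, the learner is a one-parameter linear model, and the meta-gradient is approximated by finite differences rather than by backpropagating through training as the paper does.

```python
def train_learner(synthetic_x, synthetic_y, lr=0.1, steps=5):
    """Inner loop: train a freshly initialized learner y = w * x on synthetic data."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w * synthetic_x - synthetic_y) * synthetic_x
        w -= lr * grad
    return w


def real_loss(w, real_data):
    """Mean squared error of the trained learner on real data."""
    return sum((w * x - y) ** 2 for x, y in real_data) / len(real_data)


def outer_loss(synthetic_y, real_data):
    """Train on the synthetic example, then evaluate on real data."""
    w = train_learner(1.0, synthetic_y)
    return real_loss(w, real_data)


real_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # true relation: y = 3x
synthetic_y = 0.0          # the toy generator's only parameter
eps, meta_lr = 1e-4, 0.05

initial = outer_loss(synthetic_y, real_data)
for _ in range(200):
    # Finite-difference stand-in for the meta-gradient.
    meta_grad = (
        outer_loss(synthetic_y + eps, real_data)
        - outer_loss(synthetic_y - eps, real_data)
    ) / (2 * eps)
    synthetic_y -= meta_lr * meta_grad
final = outer_loss(synthetic_y, real_data)

print(initial, final, synthetic_y)
```

In this toy setup the learned synthetic label ends up larger than the real label for the same input, because it is optimized to get a learner with only five gradient steps to the right weight, not to look like real data.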

They apply their system to neural architecture search, and show an empirical correlation between performance of a learner on synthetic data and its eventual performance when trained on real data. In this manner, they make the argument that data from a trained GTN can be used to cheaply assess the likelihood of a given network successfully learning the real task, and hence GTN data can tremendously speed up architecture search.

Sudhanshu’s opinion: I really like this paper; I think it shines a light in an interesting new direction, and I look forward to seeing future work that builds on this in theoretical, mechanistic, and applied manners. On the other hand, I felt they did gloss over how exactly they do curriculum learning, and their reinforcement learning experiment was a little unclear to me.

I think the implications of this work are enormous. In a future where we might be limited by the maturity of available simulation platforms or inundated by deluges of data with little marginal information, this approach can circumvent such problems for the selection and (pre)training of suitable student networks.

Read more: Blog post

News

Junior Research Assistant and Project Manager role at GCRI (summarized by Rohin): This job is available immediately, and could be full-time or part-time. GCRI also currently has a call for advisees and collaborators.

Research Associate and Senior Research Associate at CSER (summarized by Rohin): Application deadline is Feb 16.