Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

Underspecification Presents Challenges for Credibility in Modern Machine Learning (Alexander D’Amour, Katherine Heller, Dan Moldovan et al) (summarized by Rohin): This paper explains one source of fragility to distributional shift, which the authors term underspecification. The core idea is that for any given training dataset, there are a large number of possible models that achieve low loss on that training dataset. This means that the model that training actually produces is effectively an arbitrary pick from this set of models. While such a model will have good iid (validation) performance, it may have poor inductive biases that result in bad out-of-distribution performance.

The main additional prediction of this framing is that if you vary supposedly “unimportant” aspects of the training procedure, such as the random seed used, then you will get a different model with different inductive biases, which will thus have different out-of-distribution performance. In other words, not only will the out-of-distribution performance be worse, its variance will also be higher.
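
To make this prediction concrete, here is a minimal toy sketch (not taken from the paper). It assumes a two-weight linear model trained on data in which the second feature is an exact copy of the first, so the training data cannot say how to split credit between the two weights. The function names, the data, and the choice of shifted test set are all hypothetical.

```python
import random
import statistics

XS = [i / 10 for i in range(-10, 11)]  # toy training inputs; the target is y = x1


def train(seed, steps=500, lr=0.1):
    """Fit y = w1*x1 + w2*x2 by gradient descent on data where x2 == x1."""
    rng = random.Random(seed)
    w1, w2 = rng.gauss(0, 1), rng.gauss(0, 1)  # the seed only affects the init
    for _ in range(steps):
        grad = 0.0
        for x in XS:
            err = (w1 + w2) * x - x  # x1 == x2 == x on the training distribution
            grad += 2 * err * x
        grad /= len(XS)
        # Both weights receive the same gradient because the features are
        # identical in training, so w1 - w2 keeps its seeded initial value.
        w1 -= lr * grad
        w2 -= lr * grad
    return w1, w2


def mse(weights, data):
    """Mean squared error of the linear model on (x1, x2, y) triples."""
    w1, w2 = weights
    return sum((w1 * x1 + w2 * x2 - y) ** 2 for x1, x2, y in data) / len(data)


IID = [(x, x, x) for x in XS]    # same distribution as training
OOD = [(x, 0.0, x) for x in XS]  # shifted: the second feature stops tracking the first

models = [train(seed) for seed in range(10)]
iid_errors = [mse(w, IID) for w in models]
ood_errors = [mse(w, OOD) for w in models]

print("iid mean/std:", statistics.mean(iid_errors), statistics.stdev(iid_errors))
print("ood mean/std:", statistics.mean(ood_errors), statistics.stdev(ood_errors))
```

Every seed reaches essentially zero error on the iid data, since training only pins down w1 + w2. The shifted data exposes w1 on its own, which depends on the seed, so the out-of-distribution error is both larger and more variable.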

The authors demonstrate underspecification in a number of simplified theoretical settings, as well as realistic deep learning pipelines. For example, in an SIR model of disease spread, when we only have the current number of infections during the initial growth phase, the data cannot distinguish between the case of having high transmission rate but low durations of infection, vs. a low transmission rate but high durations of infection, even though these make very different predictions about the future trajectory of the disease (the out-of-distribution performance).
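
The SIR example can be sketched numerically. The code below is an illustrative simulation rather than the authors' setup: it assumes the standard SIR equations integrated with a simple Euler step, and the parameter values, population size, and function name are made up. The two parameter settings share the same early growth rate (beta minus gamma) but differ in transmission rate and infection duration.

```python
def simulate_sir(beta, gamma, days, n=1_000_000, i0=10, dt=0.1):
    """Euler-integrate a basic SIR model; return infections sampled once per day."""
    s, i = float(n - i0), float(i0)
    steps_per_day = round(1 / dt)
    infected = [i]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
        infected.append(i)
    return infected


# Hypothetical parameters: both have early growth rate beta - gamma = 0.2 per day.
fast_short = simulate_sir(beta=0.5, gamma=0.3, days=200)  # high transmission, short infections
slow_long = simulate_sir(beta=0.3, gamma=0.1, days=200)   # low transmission, long infections

print("day 20 infections:", fast_short[20], slow_long[20])
print("peak infections:  ", max(fast_short), max(slow_long))
```

Over the first few weeks the two infection curves are nearly indistinguishable, yet the basic reproduction numbers (beta / gamma, roughly 1.7 versus 3) differ, so the later peaks are very different in size.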

In deep learning models, the authors perform experiments where they measure validation performance (which should be relatively precise), and compare it against out-of-distribution performance (which should be lower and have more variance). For image recognition, they show that neural net training has precise validation performance, with 0.001 standard deviation when varying the seed, but less precise performance on ImageNet-C (AN #15), with standard deviations in the range of 0.002 to 0.024 on the different corruptions. They do similar experiments with medical imaging and NLP.

Rohin’s opinion: While the problem presented in this paper isn’t particularly novel, I appreciated the framing of fragility to distributional shift as being caused by underspecification. I see concerns about inner alignment (AN #58) as primarily worries about underspecification, rather than distribution shift more generally, so I’m happy to see a paper that explains it well.

That being said, the experiments with neural networks were not that compelling—while it is true that the models had higher variance on the metrics testing robustness to distributional shift, on an absolute scale the variance was not high: even a standard deviation of 0.024 (which was an outlier) is not huge, especially given that the distribution is being changed.

TECHNICAL AI ALIGNMENT

INTERPRETABILITY

Manipulating and Measuring Model Interpretability (Forough Poursabzi-Sangdeh et al) (summarized by Rob): This paper performs a rigorous, pre-registered experiment investigating to what degree transparent models are more useful for participants. They investigate how well participants can estimate what the model predicts, as well as how well participants can make their own predictions given access to the model information. The task they consider is prediction of house prices based on 8 features (such as number of bathrooms and square footage). They manipulate two independent variables. First, CLEAR is a presentation of the model where the coefficients for each feature are visible, whereas BB (black box) is the opposite. Second, the suffix -8 denotes a setting where all 8 features are used and visible, whereas with the suffix -2 only the 2 most important features (number of bathrooms and square footage) are visible. (The model predictions remain the same whether 2 or 8 features are revealed to the human.) This gives 4 conditions: CLEAR-2, CLEAR-8, BB-2, BB-8.

They find a significant difference in ability to predict model output in the CLEAR-2 setting vs all other settings, supporting their pre-registered hypothesis that showing the few most important features of a transparent model is the easiest for participants to simulate. However, counter to another pre-registered prediction, they find no significant difference, across transparency or number of features, in how far participants’ own predictions deviate from the model’s prediction. Finally, they find that participants shown the clear model were less likely to correct the model’s inaccurate predictions on “out of distribution” examples than participants with the black box model.

Rob’s opinion: The rigour of the study in terms of its relatively large sample size of participants, pre-registered hypotheses, and follow-up experiments is very positive. It’s a good example for other researchers wanting to make and test empirical claims about what kind of interpretability can be useful for different goals. The results are also suggestive of considerations that designers should keep in mind when deciding how much and what interpretability information to present to end users.

FORECASTING

Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain (Daniel Kokotajlo) (summarized by Rohin): This post argues against a particular class of arguments about AI timelines. These arguments have the form: “The brain has property X, but we don’t know how to make AIs with property X. Since it took evolution a long time to make brains with property X, we should expect it will take us a long time as well”. These arguments are not compelling because humans often use different approaches to solve problems than evolution did, and so humans might solve the overall problem without ever needing to have property X. To make these arguments more convincing, you need to argue 1) why property X really is necessary and 2) why property X won’t follow quickly once everything else is in place.

This is illustrated with a hypothetical example of someone trying to predict when humans would achieve heavier-than-air flight: in practice, you could have made decent predictions just by looking at the power to weight ratios of engines vs. birds. Someone who argued that we were far away because “we don’t even know how birds stay up for so long without flapping their wings” would have made incorrect predictions.

Rohin’s opinion: This all seems generally right to me, and is part of the reason I like the biological anchors approach (AN #121) to forecasting transformative AI.

OTHER PROGRESS IN AI

CRITIQUES (AI)

A narrowing of AI research? (Joel Klinger et al) (summarized by Rohin): Technology development can often be path-dependent, where initial poorly-thought-out design choices can persist even after they are recognized as poorly thought out. For example, the QWERTY keyboard persists to this day, because once enough typists had learned to use it, there was too high a cost to switch over to a better-designed keyboard. This suggests that we want to maintain a diversity of approaches to AI so that we can choose amongst the best options, rather than getting locked into a suboptimal approach early on.

The paper then argues, based on an analysis of arXiv papers, that thematic diversity in AI has been going down over time, as more and more papers are focused on deep learning. Thus, we may want to have policies that encourage more diversity. It also has a lot of additional analysis of the arXiv dataset for those interested in a big-picture overview of what is happening in the entire field of AI.
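
As a rough illustration of what "thematic diversity going down" can mean quantitatively, the sketch below computes Shannon entropy over topic shares. This is one standard diversity index chosen here for simplicity; the paper's own metrics may differ, and the topic names and counts are invented purely for illustration.

```python
import math


def shannon_diversity(topic_counts):
    """Shannon entropy (in nats) of the distribution of papers over topics."""
    total = sum(topic_counts.values())
    shares = [count / total for count in topic_counts.values() if count > 0]
    return -sum(share * math.log(share) for share in shares)


# Made-up paper counts per topic for two hypothetical years (not real arXiv data).
early_year = {"deep_learning": 25, "symbolic": 25, "probabilistic": 25, "evolutionary": 25}
late_year = {"deep_learning": 85, "symbolic": 5, "probabilistic": 5, "evolutionary": 5}

print("early diversity:", shannon_diversity(early_year))
print("late diversity: ", shannon_diversity(late_year))
```

Entropy is highest when papers are spread evenly across topics and falls as a single topic comes to dominate, which is the pattern the summary describes.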

MISCELLANEOUS (AI)

Neurosymbolic AI: The 3rd Wave (Artur d’Avila Garcez et al) (summarized by Zach): The field of neural-symbolic AI is broadly concerned with how to combine the power of discrete symbolic reasoning with the expressivity of neural networks. This article frames the relevance of neural-symbolic reasoning in the context of a big question: what are the necessary and sufficient building blocks of AI? The authors address this and argue that AI needs to have both the ability to learn from and make use of experience. In this context, the neural-symbolic approach to AI seeks to establish provable correspondences between neural models and logical representations. This would allow neural systems to generalize beyond their training distributions through neural-reasoning and would constitute significant progress towards AI.

The article surveys the last 20 years of research on neural-symbolic integration. As a survey, a number of different perspectives on neural-symbolic AI are presented. In particular, the authors tend to see neural-symbolic reasoning as divided into two camps: localist and distributed. Localist approaches assign definite identifiers to concepts while distributed representations make use of continuous-valued vectors to work with concepts. In the later parts of the article, promising approaches, current challenges, and directions for future work are discussed.

Recognizing ‘patterns’ in neural networks constitutes a localist approach. This relates to explainable AI (XAI) because recognizing how a given neural model makes a decision is a pre-requisite for interpretability. One justification for this approach is that codifying patterns in this way allows systems to avoid reinventing the wheel by approximating functions that are already well-known. On the other hand, converting logical relations (if-then) into representations compatible with neural models constitutes a distributed approach. One distributed method the authors highlight is the conversion of statements in first-order logic to vector embeddings. Specifically, Logic Tensor Networks generalize this method by grounding logical concepts onto tensors and then using these embeddings as constraints on the resulting logical embedding.

Despite the promising approaches to neural-symbolic reasoning, there remain many challenges. Somewhat fundamentally, formal reasoning systems tend to struggle with existential quantifiers while learning systems tend to struggle with universal quantification. Thus, the way forward is likely a combination of localist and distributed approaches. Another challenging area lies in XAI. Early methods for XAI were evaluated according to fidelity: measures of the accuracy of extracted knowledge in relation to the network rather than the data. However, many recent methods have opted to focus on explaining data rather than the internal workings of the model. This has resulted in a movement away from fidelity which the authors argue is the wrong approach.
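
The distinction between fidelity and accuracy can be shown with a small sketch. It assumes that both quantities are measured as simple agreement rates over the same inputs, which is a simplification; the label and prediction values below are hypothetical.

```python
def agreement(first, second):
    """Fraction of positions at which two equal-length label sequences match."""
    if len(first) != len(second):
        raise ValueError("sequences must have the same length")
    return sum(a == b for a, b in zip(first, second)) / len(first)


# Hypothetical toy values for eight inputs.
true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
network_preds = [1, 0, 0, 1, 0, 1, 1, 0]  # the network itself is wrong on two inputs
rule_preds = [1, 0, 0, 1, 0, 1, 1, 1]     # extracted rules mimic the network on all but one input

fidelity = agreement(rule_preds, network_preds)  # how well the rules describe the network
accuracy = agreement(rule_preds, true_labels)    # how well the rules describe the data

print("fidelity:", fidelity)
print("accuracy:", accuracy)
```

Here the extracted rules reproduce the network's mistakes, so they score higher on fidelity than on accuracy. An explanation tuned to fit the data would instead raise accuracy while saying less about what the network actually does.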

Read more: Logic Tensor Networks, The Bitter Lesson

Zach’s opinion: The article does a reasonably good job of giving equal attention to different viewpoints on neural-symbolic integration. While the article does focus on the localist vs. distributed distinction, I also find it to be broadly useful. Personally, after reading the article I wonder if ‘reasoning’ needs to be hand-set into a neural network at all. Is there really something inherently different about reasoning such that it wouldn’t just emerge from any sufficiently powerful forward predictive model? The authors make a good point regarding XAI and the importance of fidelity. I agree that it’s important that our explanations specifically fit the model rather than interpret the data. However, from a performance perspective, I don’t feel I have a good understanding of why the abstraction of a symbol/logic should occur outside the neural network. This leaves me thinking the bitter lesson (AN #49) will apply to neural-symbolic approaches that try to extract symbols or apply reason using human features (containers/first-order logic).

Rohin’s opinion: While I do think that you can get human-level reasoning (including e.g. causality) by scaling up neural networks with more diverse data and environments, this does not mean that neural-symbolic methods are irrelevant. I don’t focus on them much in this newsletter because 1) they don’t seem that relevant to AI alignment in particular (just as I don’t focus much on e.g. neural architecture search) and 2) I don’t know as much about them, but this should not be taken as a prediction that they won’t matter. I agree with Zach that the bitter lesson will apply, in the sense that for a specific task as we scale up we will tend to reproduce neural-symbolic approaches with end-to-end approaches. However, it could still be the case that for the most challenging and/or diverse tasks, neural-symbolic approaches will provide useful knowledge / inductive bias that make them the best at a given time, even though vanilla neural nets could scale better (if they had the data, memory and compute).

NEWS

DPhil Scholarships Applications Open (Ben Gable) (summarized by Rohin): FHI will be awarding up to six scholarships for the 2021/22 academic year for DPhil students starting at the University of Oxford whose research aims to answer crucial questions for improving the long-term prospects of humanity. Applications are due Feb 14.

FEEDBACK

I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.

[AN #134]: Underspecification as a cause of fragility to distribution shift