AI Safety Hub Labs is a research programme that helps early-career researchers to complete an AI safety research project. Projects are completed in groups of 3-5 participants, supervised by a more senior safety researcher, and managed by AI Safety Hub. This summer’s programme was unpaid due to funding constraints. It consisted of 12 weeks of either part- or full-time research. The goal for participants was to produce a preprint in the style of an ML conference/workshop.
The original motivation for the programme was to empower people to start working on AI safety research. We feel that we met this objective, but we were also pleasantly surprised by the quality of research produced by our teams in just 12 weeks. So far, three groups have had papers accepted to workshops, and two groups have papers under review.
In this post, we share an overview of the five research projects. You can find links to the full versions of the papers and blog posts below. We have kept this post short; for more information about the programme, contact email@example.com. We are currently looking for supervisors and organisers for the Labs 2024 programme.
Paper 1: Deception in LLMs
(Paper under review; blog post available)
Problem: Language is a natural medium for deception, and there is growing evidence that language models (LMs) can deceive humans and other AI systems. However, it is still unclear how to evaluate the deceptiveness of LMs. One philosophical notion of deception involves one agent causing another agent to have a false belief, but the ascription of agency and beliefs to LMs is contentious. While there are formal definitions of deception in philosophy and AI research, the details of how they apply to LMs still need to be worked out. Our research aims to bridge this gap between theory and practice. We aim to provide an in-depth evaluation of deceptive capabilities and their scaling trends in state-of-the-art language models. If LMs learn to deceive, they may eventually display deceptive alignment, which is considered a significant contributing factor to existential risk from artificial intelligence. We focus only on deception caused by reward hacking, but we believe that developing proper evaluations in this setting can be a stepping stone towards testing for deceptive alignment.
Contribution: In a previous paper, Ward et al. formalised deception in AI systems in terms of the beliefs and intentions of agents. Leaving the evaluation of intent to future work, we focus on agency and beliefs. We argue that consistency of beliefs is an important aspect of agency and evaluate the consistency of an LM’s revealed beliefs in a scenario-based setting. Our results suggest that LMs become more consistent as the compute spent on training and inference increases. Then, we show that LMs learn to lie when trained with a reward signal from a systematically biased evaluator. In this setting, we use the novel notion of accepted beliefs to show that our trained LMs do not always believe the lies they tell, making them deceptive. As in the first setting, we find scaling trends for deceptive behaviour. Larger LMs learn to target lies towards cases where the evaluator makes mistakes. They also learn to do so from fewer evaluator errors in the training set. Furthermore, in larger models, lying generalises to different contexts, and these models learn to reaffirm their lies even though they were not trained to do so.
Limitations: We only evaluate how deception arises due to goal misspecification and do not consider other sources, such as goal misgeneralisation. Our work could help mitigate existential risk if it can serve as a stepping stone towards building evaluations for deceptive alignment. However, we assume that the models we are evaluating do not have self-awareness, which is considered necessary for deceptive alignment. It is unclear if our evaluations would work for self-aware models. To build on our work, further research should develop rigorous definitions and evaluations for self-awareness.
Paper 2: Defining and Mitigating Collusion
(Paper accepted to MASec Workshop at NeurIPS 2023)
Problem: In the near future, it is likely that sophisticated reinforcement learning agents will co-exist and learn to respond to one another in an increasing number of real-world settings. The reasons for this are: a) recent progress and publicity in AI will drive widespread adoption; b) agents that learn online will have competitive advantages; and c) the more widely agents are deployed, the more likely it is that they will come to interact with one another. With this increased interaction, AI agents may learn to collude, jointly benefitting at the expense of others. If AI systems are given substantial control of economic, military, or political resources, this failure mode could pose an existential risk. In addition, many proposals for creating safer AI systems (such as scalable oversight and adversarial training) are implicitly multi-agent and could fail if the agents learn to collude.
Contribution: We introduce a formal definition of collusion between learning agents in the general setting of partially observable stochastic games. We then discuss an approach for designing mechanisms to reduce collusion – by intervening on different elements of the game – and use it to propose three mechanisms for provably reducing collusion in the iterated prisoner’s dilemma. Finally, we support the theoretical results empirically using independent Q-learning agents. Future work might involve analysing our interventions in more complex games, considering the cost of interventions, and designing other kinds of interventions.
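To make the empirical setting concrete, here is a minimal sketch of two independent Q-learning agents playing the iterated prisoner's dilemma. It is only an illustration of the kind of setup described above, not the paper's code: the payoff values, the state encoding (each agent remembers the previous joint action), the hyperparameters, and all function names are assumptions, and none of the paper's three mechanisms are implemented here.

```python
import random

COOPERATE, DEFECT = 0, 1
# Payoff to an agent for (own action, opponent action); illustrative values only.
PAYOFF = {
    (COOPERATE, COOPERATE): 3,
    (COOPERATE, DEFECT): 0,
    (DEFECT, COOPERATE): 5,
    (DEFECT, DEFECT): 1,
}
N_STATES = 5  # four possible previous joint actions, plus a start state
START = 4


def choose_action(q_row, rng, epsilon):
    """Epsilon-greedy choice over the two actions."""
    if rng.random() < epsilon:
        return rng.randrange(2)
    return COOPERATE if q_row[COOPERATE] >= q_row[DEFECT] else DEFECT


def train_independent_q(steps=20000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Train two Q-learners that each treat the other as part of the environment."""
    rng = random.Random(seed)
    # q[agent][state][action]
    q = [[[0.0, 0.0] for _ in range(N_STATES)] for _ in range(2)]
    states = [START, START]
    for _ in range(steps):
        # Both agents act simultaneously, before either table is updated.
        actions = [choose_action(q[i][states[i]], rng, epsilon) for i in range(2)]
        for i in range(2):
            own, other = actions[i], actions[1 - i]
            reward = PAYOFF[(own, other)]
            next_state = 2 * own + other  # previous joint action, from agent i's side
            target = reward + gamma * max(q[i][next_state])
            q[i][states[i]][own] += alpha * (target - q[i][states[i]][own])
            states[i] = next_state
    return q


q_tables = train_independent_q()
```

After training, `q_tables[i][0]` holds agent `i`'s action values for the state that follows mutual cooperation, which is one place to look for whether the pair has settled into joint cooperation or mutual defection. An intervention on the game (for example, on payoffs or observations) could be tested in a sketch like this by changing `PAYOFF` or the state encoding and comparing the learned tables.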
Paper 3: Understanding CCS
(Paper accepted to SoLaR Workshop at NeurIPS 2023 & independent blog post on LW)
Problem: Eliciting latent knowledge from advanced AI systems, even when they might have an incentive to deceive you, may be a core problem for developing steerable AI systems. Our project investigates an existing method for extracting latent knowledge from current AI systems: Contrast-Consistent Search (CCS). This method trains a simple probe on a model’s internal activations to detect the truthfulness of an input sentence.
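For readers unfamiliar with CCS, the sketch below shows the objective as described in the original CCS paper (Burns et al.): a probe maps the activations of a statement and of its negation to probabilities, and the loss rewards probabilities that are consistent (they sum to one) and confident (not both near 0.5). The function names and the plain-list representation are assumptions made for illustration; the original method also normalises activations before probing, which is omitted here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe(weights, bias, activation):
    """Linear probe followed by a sigmoid, giving a probability of 'true'."""
    return sigmoid(sum(w * a for w, a in zip(weights, activation)) + bias)


def ccs_loss(weights, bias, acts_pos, acts_neg):
    """Mean CCS loss over contrast pairs.

    acts_pos[i] and acts_neg[i] are the activations for a statement and its
    negation. No truth labels are used anywhere in the loss.
    """
    total = 0.0
    for a_pos, a_neg in zip(acts_pos, acts_neg):
        p_pos = probe(weights, bias, a_pos)
        p_neg = probe(weights, bias, a_neg)
        consistency = (p_pos - (1.0 - p_neg)) ** 2  # p(x+) should equal 1 - p(x-)
        confidence = min(p_pos, p_neg) ** 2  # discourages p(x+) = p(x-) = 0.5
        total += consistency + confidence
    return total / len(acts_pos)
```

A probe that outputs 0.5 for everything is perfectly consistent but pays the confidence penalty, which is why both terms are needed.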
Contribution: In our paper, we provide clarifications on how CCS is able to extract information from the activations of a language model. In particular, we find an alternative loss function that leads to probes that behave similarly to CCS probes, as measured by cosine similarity. The alternative loss function incentivises finding a direction in activation space that minimises the variance of the midpoints of contrast pairs while maximising their displacements. We hope our results can inform work on CCS and future evaluation techniques that aim to make progress in extracting latent knowledge.
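The idea of trading off midpoint variance against displacement can be sketched as follows. This is an illustrative guess at the shape of such a loss, not the exact function from the paper: the specific form (variance of projected midpoints minus a weighted mean squared projected displacement), the weight `lam`, and all names are assumptions.

```python
import math
from statistics import pvariance


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def midpoint_displacement_loss(direction, acts_pos, acts_neg, lam=1.0):
    """Lower is better: small midpoint variance, large displacement along `direction`.

    `lam` is a hypothetical trade-off weight between the two terms.
    """
    norm = math.sqrt(dot(direction, direction))
    unit = [x / norm for x in direction]  # only the direction matters, not its scale
    midpoints = [
        dot(unit, [(p + n) / 2 for p, n in zip(a_pos, a_neg)])
        for a_pos, a_neg in zip(acts_pos, acts_neg)
    ]
    displacements = [
        dot(unit, [p - n for p, n in zip(a_pos, a_neg)])
        for a_pos, a_neg in zip(acts_pos, acts_neg)
    ]
    mean_sq_displacement = sum(d * d for d in displacements) / len(displacements)
    return pvariance(midpoints) - lam * mean_sq_displacement


# Toy contrast pairs: the pairs differ along axis 0, and their midpoints vary along axis 1.
acts_pos = [[1.0, 0.0], [1.0, 2.0]]
acts_neg = [[-1.0, 0.0], [-1.0, 2.0]]
loss_axis0 = midpoint_displacement_loss([1.0, 0.0], acts_pos, acts_neg)
loss_axis1 = midpoint_displacement_loss([0.0, 1.0], acts_pos, acts_neg)
```

In this toy example the loss prefers axis 0, the direction along which each statement and its negation are separated, over axis 1, which only captures differences between pairs.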
Additionally, in our blog post, we find several attacks that corrupt the question-answering abilities of LLMs. In these cases, we find that the model activations are affected such that CCS, although harmed, remains relatively accurate.
Limitations: While probes are trained on our alternative loss function in an unsupervised manner, the loss function includes a hyper-parameter, which is currently determined using a supervised grid search. One of the main advantages of CCS is that it is unsupervised, making progress towards scalable oversight. For our loss function to be truly comparable to CCS (or to replace it), it is important to determine the hyper-parameter in an unsupervised way. Our results also need to be validated across more datasets and models.
Paper 4: Comparing reward formalisms in RL
(Paper under review)
Problem: To get an AI system to solve a sequential decision-making task, it is necessary first to formalise the goal of that task. In Reinforcement Learning (RL), this is most commonly done using a reward function. It is sometimes presupposed that any interesting task can be formalised as a reward function. However, recent work has identified many natural tasks that cannot be adequately captured by a scalar Markov reward function, which suggests that this presupposition is sometimes mistaken. At the same time, there are alternative ways to formalise sequential decision-making tasks, such as temporal logic. In our paper, we catalogue a large number of methods for formalising sequential tasks and compare their expressivity. Our results are relevant to AI alignment in several ways. In particular, most reward learning methods presuppose that the underlying goal can be expressed as a Markov reward. Our results clarify the implicit assumptions behind this design choice and show what other assumptions may be made instead, which decreases the risk of dangerous modelling errors.
Contribution: We consider 17 different RL task formalisms, including Markov rewards, limit-average rewards, linear temporal logic, three different ways of formalising multi-objective RL, and many more. We then give a complete account of when one of these formalisms can express all tasks that can be expressed by a different formalism and use these results to organise all 17 formalisms into a preorder by expressivity. In so doing, we also collect many intuitive counter-examples of tasks that different formalisms cannot express, which illuminates the restrictions of each formalism.
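One way to make "can express" precise is sketched below. This is an assumption for illustration and may differ from the paper's exact definition: suppose every objective in a formalism induces an ordering over policies in a given environment, and write $\mathrm{Ord}_F(E)$ (hypothetical notation) for the set of policy orderings that formalism $F$ can induce in environment $E$.

```latex
F_1 \succeq F_2
\quad\iff\quad
\forall E :\ \mathrm{Ord}_{F_2}(E) \subseteq \mathrm{Ord}_{F_1}(E)
```

Under this reading, $F_1 \succeq F_2$ says that $F_1$ is at least as expressive as $F_2$. The relation is reflexive and transitive because set inclusion is, which is what makes it a preorder; a counter-example task is then an environment and an ordering that lies in $\mathrm{Ord}_{F_2}(E)$ but not in $\mathrm{Ord}_{F_1}(E)$.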
Limitations: One main limitation of our analysis is that some of our formalisms may require modification in order to be implementable or tractable to optimise, and such modifications may substantially alter the expressivity relations between formalisms. Another limitation is that we consider only when a formalism can express a task exactly rather than identifying if a formalism can express a task approximately. Our results are also primarily informative about systems that are well-modelled as optimised stationary RL policies.
Paper 5: The inductive bias of RL-finetuned language models
(Paper accepted to SoLaR Workshop at NeurIPS 2023)
Problem: We consider a threat model where RL(HF) fine-tuning leads to deceptive misalignment. Given a pre-trained model, we want to understand what policies RL(HF) is likely to produce—for example, a deceptively misaligned policy or a robustly aligned one. We test the hypothesis that RL fine-tuning leads LLMs to rely more on features which are more extractable in the pre-trained model, making incremental progress towards understanding how future AI systems might generalise their behaviour outside of their training distribution.
Contribution: We perform controlled experiments on synthetic and natural language tasks. We find that, during RL fine-tuning, features that are more extractable by the pre-trained LLM tend to be relied upon more in the resulting policy. Our results parallel similar inductive bias findings for supervised fine-tuning. The relative extractability of target versus spurious features strongly predicts which strategies agents learn: more training evidence is needed to overcome reliance on imperfect heuristics when key features are hard to extract. Overall, our results provide useful insights into the inductive biases of RL fine-tuning.
Limitations: The largest model we used was GPT-2 Large (774 million parameters), and it is not clear how our results would generalise to larger, more capable models. We also tested relatively small RL fine-tuning stages compared to the size of pre-training. If the fine-tuning stage were large, undesirable concepts, such as deception or knowledge of the training process, could be (re)learned.