[AN #115]: AI safety research problems in the AI-GA framework


Newsletter #115

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

SECTIONS

HIGHLIGHTS

TECHNICAL AI ALIGNMENT

PROBLEMS

FORECASTING

MISCELLANEOUS (ALIGNMENT)

AI STRATEGY AND POLICY

OTHER PROGRESS IN AI

REINFORCEMENT LEARNING

NEWS

HIGHLIGHTS

Open Questions in Creating Safe Open-ended AI: Tensions Between Control and Creativity (Adrien Ecoffet et al) (summarized by Rohin): One potential pathway to powerful AI is through open-ended search, in which we use search algorithms to search for good architectures, learning algorithms, environments, etc. in addition to using them to find parameters for a particular architecture. See the AI-GA paradigm (AN #63) for more details. What do AI safety issues look like in such a paradigm?

Building on DeepMind’s framework (AN #26), the paper considers three levels of objectives: the ideal objective (what the designer intends), the explicit incentives (what the designer writes down), and the agent incentives (what the agent actually optimizes for). Safety issues can arise through differences between any of these levels.

The main difference that arises when considering open-ended search is that it’s much less clear to what extent we can control the result of an open-ended search, even if we knew what result we wanted. We can get evidence about this from existing complex systems, though unfortunately there are not any straightforward conclusions: several instances of convergent evolution might suggest that the results of the open-ended search run by evolution were predictable, but on the other hand, the effects of intervening on complex ecosystems are notoriously hard to predict.

Besides learning from existing complex systems, we can also empirically study the properties of open-ended search algorithms that we implement in computers. For example, we could run search for some time, and then fork the search into independent replicate runs with different random seeds, and see to what extent the results converge. We might also try to improve controllability by using meta learning to infer what learning algorithms, environments, or explicit incentives help induce controllability of the search.
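
As a rough illustration of what such a fork-and-replicate experiment could look like in code (a minimal sketch; `run_search`, `similarity`, and their interfaces are hypothetical placeholders rather than anything from the paper):

```python
def replicate_convergence_experiment(run_search, similarity, n_replicates=5,
                                     warmup_steps=1_000, fork_steps=10_000):
    """Fork an open-ended search into independent replicate runs and measure
    how much their results agree, as a crude probe of controllability.

    Assumed (hypothetical) interfaces:
      run_search(seed, steps, initial_state=None) -> (state, result)
      similarity(result_a, result_b) -> score in [0, 1]
    """
    # Run the search for a while to produce a shared starting point.
    shared_state, _ = run_search(seed=0, steps=warmup_steps)

    # Fork into independent replicates that differ only in their random seed.
    results = []
    for seed in range(1, n_replicates + 1):
        _, result = run_search(seed=seed, steps=fork_steps,
                               initial_state=shared_state)
        results.append(result)

    # Average pairwise similarity of the replicate outcomes: high values suggest
    # the outcome was largely determined by the shared prefix (more predictable),
    # low values suggest the search diverges and is harder to control.
    scores = [similarity(a, b)
              for i, a in enumerate(results)
              for b in results[i + 1:]]
    return sum(scores) / len(scores)
```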

The remaining suggestions will be familiar to most readers: they suggest work on interpretability (that now has to work with learned architectures), better benchmarks, human-in-the-loop search, safe exploration, and sim-to-real transfer.

Rohin’s opinion: I’m glad that people are paying attention to safety in this AGI paradigm, and the problems they outline seem like reasonable problems to work on. I actually expect that the work needed for the open-ended search paradigm will end up looking very similar to the work needed by the “AGI via deep RL” paradigm: the differences I see are differences in difficulty, not differences in what problems qualitatively need to be solved. I’m particularly excited by the suggestion of studying how particular environments can help control the result of the open-ended search: it seems like even with deep RL based AGI, we would like to know how properties of the environment can influence properties of agents trained in that environment. For example, what property must an environment satisfy in order for agents trained in that environment to be risk-averse?

TECHNICAL AI ALIGNMENT

PROBLEMS

Model splintering: moving from one imperfect model to another (Stuart Armstrong) (summarized by Rohin): This post introduces the concept of model splintering, which seems to be an overarching problem underlying many other problems in AI safety. This is one way of more formally looking at the out-of-distribution problem in machine learning: instead of simply saying that we are out of distribution, we look at the model that the AI previously had, and see what model it transitions to in the new distribution, and analyze this transition.

Model splintering in particular refers to the phenomenon where a coarse-grained model is “splintered” into a more fine-grained model, with a one-to-many mapping between the environments that the coarse-grained model can distinguish between and the environments that the fine-grained model can distinguish between (this is what it means to be more fine-grained). For example, we may initially model all gases as ideal gases, defined by their pressure, volume and temperature. However, as we learn more, we may transition to the van der Waals equation, which applies differently to different types of gases, and so an environment like “1 liter of gas at standard temperature and pressure (STP)” now splinters into “1 liter of nitrogen at STP”, “1 liter of oxygen at STP”, etc.
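
To make the gas example concrete (using the standard textbook forms of these equations, which are not spelled out in the post itself):

```latex
% Coarse-grained model: the ideal gas law, identical for every gas.
PV = nRT

% Finer-grained model: the van der Waals equation, where the constants a and b
% are specific to each gas, so “1 liter of gas at STP” splinters into
% “1 liter of nitrogen at STP”, “1 liter of oxygen at STP”, and so on.
\left(P + \frac{a n^2}{V^2}\right)\left(V - n b\right) = n R T
```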

Model splintering can also apply to reward functions: for example, in the past people might have had a reward function with a term for “honor”, but at this point the “honor” concept has splintered into several more specific ideas, and it is not clear how a reward for “honor” should generalize to these new concepts.

The hope is that by analyzing splintering and detecting when it happens, we can solve a whole host of problems. For example, we can use this as a way to detect if we are out of distribution. The full post lists several other examples.

Rohin’s opinion: I think that the problems of generalization and ambiguity out of distribution are extremely important and fundamental to AI alignment, so I’m glad to see work on them. It seems like model splintering could be a fruitful approach for those looking to take a more formal approach to these problems.

An Architectural Risk Analysis of Machine Learning Systems: Towards More Secure Machine Learning (Gary McGraw et al) (summarized by Rohin) (H/​T Catherine Olsson): One systematic way of identifying potential issues in a system is to perform an architectural risk analysis, in which you draw an architecture diagram showing the various components of the system and how they interact, and then think about each component and interaction and how it could go wrong. (Last week’s highlight (AN #114) did this for Bayesian history-based RL agents.) This paper performs an architectural risk analysis for a generic ML system, resulting in a systematic list of potential problems that could occur.

Rohin’s opinion: As far as I could tell, the problems identified were ones that we had seen before, but I’m glad someone has gone through the more systematic exercise, and the resulting list is more organized and easier to understand than previous lists.

FORECASTING

Forecasting Thread: AI Timelines (Amanda Ngo et al) (summarized by Rohin): This post collects forecasts of timelines until human-level AGI, and (at the time of this writing) has twelve such forecasts.

Roadmap to a Roadmap: How Could We Tell When AGI is a ‘Manhattan Project’ Away? (John-Clark Levin et al) (summarized by Rohin): The key hypothesis of this paper is that once there is a clear “roadmap” or “runway” to AGI, it is likely that state actors could invest a large amount of resources into achieving it, comparable to the Manhattan Project. The fact that we do not see signs of such investment now does not imply that it won’t happen in the future: currently, there is so little “surface area” on the problem of AGI that throwing vast amounts of money at the problem is unlikely to help much.

If this were true, then once such a runway is visible, incentives could change quite sharply: in particular, the current norms of openness may quickly change to norms of secrecy, as nations compete (or perceive themselves to be competing) with other nations to build AGI first. As a result, it would be valuable to have a good measure of whether we have reached the point where such a runway exists.

Read more: Import AI summary

MISCELLANEOUS (ALIGNMENT)

State of AI Ethics (Abhishek Gupta et al) (summarized by Rohin): This report from the Montreal AI Ethics Institute has a wide variety of summaries on many different topics in AI ethics, quite similarly to this newsletter in fact.

AI STRATEGY AND POLICY

Decision Points in AI Governance (Jessica Cussins Newman) (summarized by Rohin): While the last couple of years have seen a proliferation of “principles” for the implementation of AI systems in the real world, we are only now getting to the stage in which we turn these principles into practice. During this period, decision points are concrete actions taken by some AI stakeholder, not predetermined by existing law and practice, with the goal of shaping the development and use of AI. These are the actions that will have a disproportionately large influence on the field, and thus are important to analyze. This paper analyzes three case studies of decision points, and draws lessons for future decision points.

First, we have the Microsoft AETHER committee. Like many other companies, Microsoft has established a committee to help the company make responsible choices about its use of AI. Unlike e.g. Google’s AI ethics board, this committee has actually had an impact on Microsoft’s decisions, and has published several papers on AI governance along the way. The committee attributes its success in part to executive-level support, regular opportunities for employee and expert engagement, and integration with the company’s legal team.

Second, we have the GPT-2 (AN #46) staged release process. We’ve covered this before (AN #55) (AN #58), so I won’t retell the story here. However, this shows how a deviation from the norm (of always publishing) can prompt a broad discussion about what publication norms are actually appropriate, ultimately leading to large changes in the field as a whole.

Finally, we have the OECD AI Policy Observatory, a resource that has been established to help countries implement the OECD AI principles. The author emphasizes that it was quite impressive for the AI principles to even get the support that they did, given the rhetoric about countries competing on AI. Now, as the AI principles have to be put into practice, the observatory provides several resources for countries that should help in ensuring that implementation actually happens.

Read more: MAIEI summary

OTHER PROGRESS IN AI


REINFORCEMENT LEARNING

Combining Deep Reinforcement Learning and Search for Imperfect-Information Games (Noam Brown, Anton Bakhtin et al) (summarized by Rohin): AlphaZero (AN #36) and its predecessors have achieved impressive results in zero-sum two-player perfect-information games, by using a combination of search (MCTS) and RL. This paper provides the first combination of search and deep RL for imperfect-information games like poker. (Prior work like Pluribus (AN #74) did use search, but didn’t combine it with deep RL, instead relying on significant expert information about poker.)

The key idea that makes AlphaZero work is that in a perfect-information game, the value of a state can be estimated on its own, independently of the policy used to reach it. For any given state s, we can simulate possible future rollouts of the game, and propagate the values of the resulting new states back up to s. In contrast, for imperfect-information games this approach does not work, since the value of a state depends on the policy used to get to that state. The solution is to instead estimate values for public belief states, which capture the public common knowledge that all players have. Once this is done, it is possible to once again use the strategy of backing up values from simulated future states to the current state, and to train a value network and policy network based on this.
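
As a loose, heavily simplified sketch of the belief-state idea (not the paper’s actual algorithm, which solves subgames with an equilibrium-finding procedure; `get_successors` and `value_net` are hypothetical interfaces invented here for illustration):

```python
from dataclasses import dataclass

@dataclass
class PublicBeliefState:
    """The public common knowledge at some point in the game: the public action
    history plus, for each player, a probability distribution over their possible
    private information given that history."""
    public_history: tuple
    beliefs: dict  # player -> {private_info: probability}

def backup_value(pbs, policy, value_net, get_successors, depth):
    """Depth-limited value backup over public belief states.

    Assumed (hypothetical) interfaces:
      get_successors(pbs, policy) -> iterable of (probability, next_pbs, reward)
      value_net(pbs) -> estimated value of a public belief state (analogous to
        AlphaZero's value network, but defined on belief states rather than on
        individual game states).
    """
    if depth == 0:
        # At the depth limit, fall back on the learned value estimate.
        return value_net(pbs)
    total = 0.0
    for prob, next_pbs, reward in get_successors(pbs, policy):
        # Back up the values of successor belief states to the current one.
        total += prob * (reward + backup_value(next_pbs, policy, value_net,
                                               get_successors, depth - 1))
    return total
```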

NEWS

AI Governance Project Manager (Markus Anderljung) (summarized by Rohin): The Centre for the Governance of AI is hiring for a project manager role. The deadline to apply is September 30.

FEEDBACK

I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.