What’s an alignment topic where, if someone decomposed the overall task, a small group of smart people (like here on Lesswrong) could make conceptual progress? By “smart”, assume they can notice confusion, google, and program.

Be specific. Explain what strategies you would use to explore your topic, and give examples of how to decompose the tasks.

This is more than just an exercise in factored cognition; there are stakes: I host the key alignment group, and if your answer is convincing enough, then we’ll work on it (caveat: some people may leave if your topic isn’t their cup of tea, but others may join because it is). Additionally, you can pitch your idea in the key alignment group meeting on (8/25) Tuesday 7pm UTC; just DM me if you’d like to join the next meeting.

I work at OpenAI on safety. In the past it has seemed like there’s a gap between what I’d consider to be alignment topics that need to be worked on and the general consensus of this forum. A good friend poked me to write something for this, so here I am.

Topics w/ strategies/breakdown:

Fine-tuning GPT-2 from human preferences, to solve small scale alignment issues

Brainstorm small/simple alignment failures: ways that existing generative language models are not aligned with human values

Design some evaluations or metrics for measuring a specific alignment failure (which lets you measure whether you’ve improved a model or not)

Gather human feedback data / labels / whatever you think you can try training on

Try training on your data (there are tutorials on how to use Google Colab to fine-tune GPT-2 with a new dataset)

Forecast scaling laws: figure out how performance on your evaluation or metric varies with the amount of human input data; compare to how much time it takes to generate each labelled example (be quantitative!)
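To make the last step concrete, here is a minimal sketch of one way to fit and extrapolate such a curve. It assumes the failure rate on your evaluation falls off as a power law in the number of labels, which may not hold for your metric. The data points, the 30-second labelling cost, and the function names are all made up for illustration.

```python
import numpy as np

def fit_power_law(n_labels, error):
    """Fit error ~ c * n^(-k) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(n_labels), np.log(error), 1)
    return float(np.exp(intercept)), float(-slope)

def labels_needed(c, k, target_error):
    """Invert error = c * n^(-k) to get the n that reaches target_error."""
    return (c / target_error) ** (1.0 / k)

# Hypothetical measurements: labels used vs. failure rate on your evaluation.
n_labels = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
error = 0.8 * n_labels ** -0.3  # synthetic and noise-free, for illustration only

c, k = fit_power_law(n_labels, error)

SECONDS_PER_LABEL = 30  # assumed time to produce one labelled example
n_target = labels_needed(c, k, target_error=0.05)
hours = n_target * SECONDS_PER_LABEL / 3600
print(f"c={c:.3f}, k={k:.3f}, labels for 5% error: {n_target:.0f}, labeller-hours: {hours:.1f}")
```

With real data you would want noise, several seeds per point, and a check that the log-log plot is actually close to a straight line before trusting the extrapolation.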

Multi-objective reinforcement learning — instead of optimizing a single objective, optimize multiple objectives together (and some of the objectives can be constraints)

What are ways we can break down existing AI alignment failures in RL-like settings into multi-objective problems, where some of the objectives are safety objectives and some are goal/task objectives?

How can we design safety objectives such that they can transfer across a wide variety of systems, machines, situations, environments, etc?

How can we measure and evaluate our safety objectives, and what should we expect to observe during training/deployment?

How can we incentivize individual development and sharing of safety objectives?

How can we augment RL methods to allow transferable safety objectives between domains (e.g., if using actor-critic methods, how to integrate a separate critic for each safety objective)?

What are good benchmark environments or scenarios for multi-objective RL with safety objectives? (Classic RL environments like Go or Chess aren’t natively well-suited to these topics.)
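As one possible starting point for the "separate critic per safety objective" question, here is a minimal sketch of Lagrangian relaxation, a standard approach to constrained RL. This is my choice of technique for illustration, not something the list above prescribes. Each safety critic is assumed to estimate the advantage of a non-negative cost, and the function names, learning rate, and numbers are hypothetical.

```python
def combined_advantage(task_adv, safety_advs, multipliers):
    """Task advantage minus multiplier-weighted safety-cost advantages."""
    return task_adv - sum(lam * adv for lam, adv in zip(multipliers, safety_advs))

def update_multipliers(multipliers, avg_costs, cost_limits, lr=0.1):
    """Dual ascent: raise a multiplier while its constraint is violated, never below 0."""
    return [max(0.0, lam + lr * (cost - limit))
            for lam, cost, limit in zip(multipliers, avg_costs, cost_limits)]

# Two hypothetical safety objectives: the first is over its limit, the second is not.
multipliers = [0.0, 0.0]
multipliers = update_multipliers(multipliers, avg_costs=[0.5, 0.1], cost_limits=[0.2, 0.3])

# The actor would be trained on this scalar instead of the raw task advantage.
adv = combined_advantage(task_adv=1.0, safety_advs=[2.0, 4.0], multipliers=multipliers)
print(multipliers, adv)
```

Because each safety objective only enters through its own critic and multiplier, a critic trained in one domain could in principle be plugged into another. Whether that transfer actually works is exactly the open question.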

Forecasting the Economics of AGI (turn ‘fast/slow/big/etc’ into real numbers with units)

This is more “AI Impacts” style work than you might be asking for, but I think it’s particularly well-suited for clever folks that can look things up on the internet.

Identify vague terms in AI alignment forecasts, like the “fast” in “fast takeoff”, that can be operationalized

Come up with units that measure the quantity in question, and procedures for measurements that result in those units

Try applying traditional economic growth models, such as experience curves, to AI development, and see how well you can get things to fit. (This is much harder to do for AI than for making cars: is a single unit a single model trained? Maybe a single week of a researcher’s time? Is the cost decreasing in dollars or flops or person-hours or something else? Etc.)

Sketch models for systems (here the system is the whole AI field) with feedback loops, and inspect/explore parts of the system which might respond most to different variables (additional attention, new people, dollars, hours, public discourse, philanthropic capital, etc.)
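For the experience-curve item, a minimal sketch of what "getting things to fit" could look like, assuming the classic form where unit cost drops by a fixed fraction every time cumulative output doubles. The choice of "unit" and "cost" is the hard part and is left open; the numbers below are synthetic.

```python
import numpy as np

def fit_experience_curve(cumulative_units, unit_cost):
    """Fit cost = first_unit_cost * units^slope in log2-log2 space.

    Returns (first_unit_cost, learning_rate), where learning_rate is the
    fractional cost reduction per doubling of cumulative output.
    """
    slope, intercept = np.polyfit(np.log2(cumulative_units), np.log2(unit_cost), 1)
    first_unit_cost = float(2.0 ** intercept)
    learning_rate = float(1.0 - 2.0 ** slope)
    return first_unit_cost, learning_rate

# Synthetic data: cost starts at 100 and falls 20% with each doubling.
units = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
cost = 100.0 * 0.8 ** np.log2(units)

first_unit_cost, learning_rate = fit_experience_curve(units, cost)
print(f"first-unit cost: {first_unit_cost:.1f}, learning rate: {learning_rate:.1%}")
```

Running the same fit with several candidate definitions of "unit" (models trained, researcher-weeks, flops) and comparing residuals is one way to see which operationalization is least bad.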

Topics not important enough to make it into my first 30 minutes of writing:

Cross disciplinary integration with other safety fields, what will and won’t work

Systems safety for organizations building AGI

Safety acceleration loops — how/where can good safety research make us better and faster at doing safety research

Cataloguing alignment failures in the wild, and creating a taxonomy of them

Anti topics: Things I would have put on here a year ago

Too late for me to keep writing so saving this for another time I guess

I’m available tomorrow to chat about these w/ the group. Happy to talk then (or later, in replies here) about any of these if folks want me to expand further.

Motivation: Starting the theoretical investigation of dialogic reinforcement learning (DLRL).

Topic: Consider the following setting.

$A$ is a set of “actions”, $Q$ is a set of “queries”, $N$ is a set of “annotations”. $W$ is the set of “worlds” defined as $W := [0,1]^A \times \{g,b,u\}^{Q \times N}$. Here, the semantics of the first factor is “mapping from actions to rewards”, the semantics of the second factor is “mapping from queries to {good, bad, ugly}”, where “good” means “query can be answered”, “bad” means “query cannot be answered”, “ugly” means “making this query loses the game”. In addition, we are given a fixed mapping $\sigma: Q \to \mathbb{R}^W$ (assigning to each query its semantics). $H$ is a set of “hypotheses” which is a subset of $\Delta W$ (i.e. each hypothesis is a belief about the world).

Some hypothesis $h^* \in H$ represents the user’s beliefs, but the agent doesn’t know which. Instead, it only has a prior $\zeta \in \Delta H$. On each round, the agent is allowed to either make an annotated query $(q,n) \in Q \times N$ or take an action from $A$. Taking an action produces a reward and ends the game. Making a query can either (i) produce a number, which is $\mathbb{E}_{h^*}[\sigma(q)]$ (good), or (ii) produce nothing (bad), or (iii) end the game with zero reward (ugly).

The problem is devising algorithms for the agent, s.t., in expectation w.r.t. $h^* \sim \zeta$, the $h^*$-expected reward approximates the best possible $h^*$-expected reward (the latter is what we would get if the agent knew which hypothesis is correct) and the number of queries is low. Propose sets of assumptions about the ingredients of the setting that lead to non-trivial bounds. Consider proving both positive results and negative results (the latter meaning: “no algorithm can achieve a bound better than...”).

Strategy: See the theoretical research part of my other answer. I advise starting by looking for the minimal simplification of the setting about which it is still possible to prove non-trivial results. In addition, start with bounds that scale with the sizes of the sets in question, then proceed to look for more refined parameters (analogous to VC dimension in offline learning).

Motivation: Improving understanding of the relationship between learning theory and game theory.

Topic: Study the behavior of learning algorithms in mortal population games, in the $\gamma \to 1$ limit. Specifically, consider the problem statements from the linked comment:

Are any/all of the fixed points attractors?

What can be said about the size of the attraction basins?

Do all Nash equilibria correspond to fixed points?

Do stronger game theoretic solution concepts (e.g. proper equilibria) have corresponding dynamical properties?

You can approach this theoretically (proving things) or experimentally (writing simulations). Specifically, it would be easiest to start from agents that follow fictitious play. You can then go on to more general Bayesian learners, other algorithms from the literature, or (on the experimental side) to using deep learning. Compare the convergence properties you get to those known in evolutionary game theory.
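For anyone taking the experimental route, here is a minimal sketch of plain two-player fictitious play in a matrix game, as a warm-up. It is not the mortal population game setting itself (that is defined in the linked comment and would need its own simulator). The uniform initial pseudo-counts and the matching-pennies example are my own choices.

```python
import numpy as np

def fictitious_play(payoff_row, payoff_col, rounds=10000):
    """Two-player fictitious play; returns each player's empirical action frequencies.

    payoff_row[i, j] and payoff_col[i, j] are the payoffs to the row and column
    player when row plays i and column plays j.
    """
    n_row, n_col = payoff_row.shape
    counts_row = np.ones(n_row)  # uniform pseudo-counts as initial beliefs (an assumption)
    counts_col = np.ones(n_col)
    for _ in range(rounds):
        # Each player best-responds to the opponent's empirical mixed strategy.
        a_row = int(np.argmax(payoff_row @ (counts_col / counts_col.sum())))
        a_col = int(np.argmax((counts_row / counts_row.sum()) @ payoff_col))
        counts_row[a_row] += 1
        counts_col[a_col] += 1
    return counts_row / counts_row.sum(), counts_col / counts_col.sum()

# Matching pennies: zero-sum, unique Nash equilibrium at (1/2, 1/2) for both players.
payoff_row = np.array([[1.0, -1.0], [-1.0, 1.0]])
payoff_col = -payoff_row
freq_row, freq_col = fictitious_play(payoff_row, payoff_col)
print(freq_row, freq_col)
```

In zero-sum games like this one the empirical frequencies are known to converge to equilibrium, which makes it a useful sanity check before moving to settings where convergence is the open question.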

Notice that, due to the grain-of-truth problem, I intended to study this using non-Bayesian learning algorithms, but due to the ergodic-ish nature of the setting, Bayesian learning algorithms might perform well. But, if they perform poorly, this is still important to know.

Strategies: See my other answer.

The idea is an elaboration of a comment I made previously.

Motivation: Improving the theoretical understanding of AGI by facilitating synthesis between algorithmic information theory and statistical learning theory.

Topic: Fix some reasonable encoding of communicating MDPs, and use this encoding to define $\zeta_{\mathrm{CMDP}}$: the Solomonoff-type prior over communicating MDPs. That is, the probability of a communicating MDP $H$ is proportional to $2^{-K(H)}$ where $K(H)$ is the length of the shortest program producing the encoding of $H$.

Consider CMDP-AIXI: the Bayes-optimal agent for $\zeta_{\mathrm{CMDP}}$. Morally speaking, we would like to prove that CMDP-AIXI (or any other policy) has a frequentist (i.e. per hypothesis $H$) non-anytime regret bound of the form $O(n^\alpha \zeta_{\mathrm{CMDP}}(H)^{-\beta} \tau(H)^\gamma)$, where $n$ is the time horizon[1], $\tau(H)$ is a parameter such as MDP diameter, bias span or mixing time, $\alpha \in (0,1)$, $\beta, \gamma \in (0,\infty)$ (this time $\gamma$ is just a constant, not time discount). However, this precise result is probably impossible, because the Solomonoff prior falls off very slowly.

Warm-up: Prove this!

Next, we need the concept of “sophisticated core”, inspired by algorithmic statistics. Given a bit string $x$, we consider the Kolmogorov complexity $K(x)$ of $x$. Then we consider pairs $(Q,y)$ where $Q$ is a program that halts on all inputs, $y$ is a bit string, $Q(y) = x$ and $|Q| + |y| \le K(x) + O(1)$. Finally, we minimize over $|Q|$. The minimal $|Q|$ is called the sophistication of $x$. For our problem, we are interested in the minimal $Q$ itself: I call it the “sophisticated core” of $x$ and denote it $SC(x)$.

To any halting program $Q$ we can associate the environment $\mu_Q := \mathbb{E}_{H \sim \zeta_{\mathrm{CMDP}}}[H \mid SC(H) = Q]$. We also define the prior $\xi$ by $\xi(Q) := \Pr_{H \sim \zeta_{\mathrm{CMDP}}}[SC(H) = Q]$. $\zeta_{\mathrm{CMDP}}$ and $\xi$ are “equivalent” in the sense that $\mathbb{E}_{Q \sim \xi}[\mu_Q] = \mathbb{E}_{H \sim \zeta_{\mathrm{CMDP}}}[H]$. However, they are not equivalent for the purpose of regret bounds.

Challenge: Investigate the conjecture that there is an ($n$-dependent) policy satisfying the regret bound $O(n^\alpha \xi(Q)^{-\beta} \mathbb{E}[\tau(\mu_Q)]^\gamma)$ for every $\mu_Q$, or something similar.

Strategy: See the theoretical research part of my other answer.

[1] I am using unnormalized regret and step-function time discount here to make the notation more standard, even though usually I prefer normalized regret and geometric time discount.

I think it would be fun and productive to “wargame” the emergence of AGI in broader society in some specific scenario—my choice (of course) would be “we reverse-engineer the neocortex”. Different people could be different interest-groups / perspectives, e.g. industry researchers, academic researchers, people who have made friends with the new AIs, free-marketers, tech-utopians, people concerned about job losses and inequality, people who think the AIs are conscious and deserve rights, people who think the AIs are definitely not conscious and don’t deserve rights (maybe for religious reasons?), militaries, large companies, etc.

I don’t know how these “wargame”-type exercises actually work—honestly, I haven’t even played D&D :-P Just a thought. I personally have some vague opinions about brain-like AGI development paths and what systems might be like at different stages etc., but when I try to think about how this could play out with all the different actors, it kinda makes my head spin. :-)

The goal of course is to open conversations about what might plausibly happen, not to figure out what will happen, which is probably impossible.

The idea is an elaboration of a comment I made previously.

Motivation: Improving our understanding of superrationality.

Topic: Investigate the following conjecture.

Consider two agents playing iterated prisoner’s dilemma (IPD) with geometric time discount. It is well known that, for sufficiently large discount parameters ($\frac{1}{1-\gamma} \gg 0$), essentially all outcomes of the normal form game become Nash equilibria (the folk theorem). In particular, cooperation can be achieved via the tit-for-tat strategy. However, defection is still a Nash equilibrium (and even a subgame perfect equilibrium).

Fix $n_1, n_2 \in \mathbb{N}$. Consider the following IPD variant: the first player is forced to play a strategy that can be represented by a finite state automaton of $n_1$ states, and the second player is forced to play a strategy that can be represented by a finite state automaton of $n_2$ states. For our purpose a “finite state automaton” consists of a set of states $S$, the transition mapping $\tau: S \times \{C,D\} \to S$ and the “action mapping” $\alpha: S \to \{C,D\}$. Here, $\tau$ tells you how to update your state after observing the opponent’s last action, and $\alpha$ tells you which action to take. Denote the resulting (normal form) game $\mathrm{FIPD}(n_1, n_2, \gamma)$, where $\gamma$ is the time discount parameter.
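To make the automaton definition concrete, here is a minimal sketch that plays two such automata against each other and computes their discounted payoffs. Several things are my assumptions and not part of the definition above: the stage-game payoff values (the common 5/3/1/0 choice), the initial state (state 0), the normalization by $1-\gamma$, and the finite truncation of the infinite sum.

```python
from dataclasses import dataclass

C, D = "C", "D"
# Assumed stage-game payoffs: (my_action, opponent_action) -> my payoff.
PAYOFF = {(C, C): 3.0, (C, D): 0.0, (D, C): 5.0, (D, D): 1.0}

@dataclass(frozen=True)
class Automaton:
    transition: dict  # tau: (state, opponent's last action) -> next state
    action: dict      # alpha: state -> action
    start: int = 0    # initial state (an assumption; the text leaves it unspecified)

def discounted_payoffs(p1, p2, gamma, horizon=2000):
    """Normalized discounted payoffs (1 - gamma) * sum_t gamma^t * r_t, truncated.

    The truncation error is of order gamma^horizon, so choose
    horizon much larger than 1 / (1 - gamma).
    """
    s1, s2 = p1.start, p2.start
    u1 = u2 = 0.0
    weight = 1.0 - gamma
    for _ in range(horizon):
        a1, a2 = p1.action[s1], p2.action[s2]
        u1 += weight * PAYOFF[(a1, a2)]
        u2 += weight * PAYOFF[(a2, a1)]
        # Each automaton updates on the opponent's action.
        s1, s2 = p1.transition[(s1, a2)], p2.transition[(s2, a1)]
        weight *= gamma
    return u1, u2

tit_for_tat = Automaton(
    transition={(0, C): 0, (0, D): 1, (1, C): 0, (1, D): 1},
    action={0: C, 1: D},
)
always_defect = Automaton(transition={(0, C): 0, (0, D): 0}, action={0: D})

print(discounted_payoffs(tit_for_tat, tit_for_tat, gamma=0.9))
print(discounted_payoffs(tit_for_tat, always_defect, gamma=0.9))
```

Since there are only finitely many automata with a given number of states, enumerating all pairs with this function gives the full payoff matrix of the normal form game for small $n_1, n_2$, which is the input any equilibrium computation would need.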

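Conjecture: If $n_1, n_2 \ge 2$ then there are functions $T: (0,1) \to (0,\infty)$ and $\delta: (0,1) \to (0,\infty)$ s.t. the following conditions hold:

$\lim_{\gamma \to 1} T(\gamma) = 0$

$\lim_{\gamma \to 1} \delta(\gamma) = 0$

Any thermodynamic equilibrium of $\mathrm{FIPD}(n_1, n_2, \gamma)$ of temperature $T(\gamma)$ has the payoffs of $CC$ up to $\delta(\gamma)$.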

Strategies: You could take two approaches: theoretical research and experimental research.

For theoretical research, you would try to prove or disprove the conjecture. If the initial conjecture is too hard, you can try to find easier variants (such as $n_1 = n_2 = 2$, or adding more constraints on the automaton). If you succeed in proving the conjecture, you can go on to studying games other than prisoner’s dilemma (for example, do we always converge to Pareto efficiency?). If you succeed in disproving the conjecture, you can go on to look for variants that survive (for example, assume $n_1 = n_2$ or that the finite state automata must not have irreversible transitions).

To decompose the task I propose: (i) have each person in the team think of ideas for how to approach this, (ii) brainstorm everyone’s ideas together and select a subset of promising ideas, (iii) distribute the promising ideas among people and/or take each promising idea and find multiple lemmas that different people can try proving.

Don’t forget to check whether the literature has adjacent results. This also helps decomposing: the literature survey can be assigned to a subset of the team, and/or different people can search for different keywords / read different papers.

For experimental research, you would code an algorithm that computes the thermodynamic equilibria, and see how the payoffs behave as a function of T and γ. Optimally, you would also provide a derivation of the error bounds on your results. To decompose the task, use the same strategy as in the theoretical case to come up with the algorithms and the code design. Afterwards, decompose it by having each person implement a segment of the code (pair programming is also an option).

It is also possible to go for theoretical and experimental simultaneously, by distributing among people and cross-fertilizing along the way.

Here’s a problem for you, which I’m not sure fits the requirements, but might: How do you learn whether an AI has been trained to use Gricean communication (e.g. “I interpret your words by modeling you as saying them because you model me as interpreting them, and so on until further recursion isn’t fruitful”) without being able to read its source code and check its functioning against some specification of recursive agential modeling?