Twenty-three AI alignment research project definitions

I came up with these research project definitions while reading the iterated amplification sequence. Last year I put five of them up for voting (see ‘Which of these five AI alignment research projects ideas are no good?’) and chose no. 23 to work on (see ‘IDA with RL and overseer failures’). I didn’t think of publishing all of them until JJ Hepburn suggested that they might be useful to others.
The definitions follow the format recommended in The Craft of Research: ‘I’m studying X, because I want to find out Y, so that I (and you) can better understand Z.’ The quality of language and content varies.


An unaligned benchmark
----------------------

1
I'm studying mild optimization,
    because I want to understand what its problems and limitations
    are,
        because I want to decide whether or not to work on one of
        them,
            because well-functioning mild optimization appears to
            be close to what humans and organisations safely do
            today, so basing an AI on it might have useful
            outcomes.
- This is more a study project than a research project.

2
I'm studying how image classifiers can recognize new types of
images,
    because I want to find out how to detect out-of-distribution
    inputs,
        in order to help my reader understand how to make ML
        systems that don't confidently make wrong predictions.
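
To make the detection step in project 2 concrete, here is a minimal sketch of one common baseline: flag an input as out-of-distribution when the classifier's maximum softmax probability is low. The logits and the threshold of 0.7 are assumptions chosen for illustration, not values from any particular system, and this baseline is only one of several possible approaches.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_out_of_distribution(logits, threshold=0.7):
    """Flag the input when the top class probability is below the threshold.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    return max(softmax(logits)) < threshold


# Hypothetical logits: one clear prediction, one nearly uniform prediction.
confident_logits = [8.0, 0.5, 0.1]
uncertain_logits = [1.0, 0.9, 1.1]

print(is_out_of_distribution(confident_logits))
print(is_out_of_distribution(uncertain_logits))
```

A known weakness of this baseline is exactly the problem the project targets: a classifier can assign a high softmax probability to an input that looks nothing like its training data.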


Approval-directed agents/bootstrapping
--------------------------------------

3
I'm studying Bayesian machine learning,
    because I want to understand how to make ML systems that
    notice when they are confused
        in order to help my reader understand how to make ML
        systems that will ask the overseer for input when doing
        otherwise would lead to failure.
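
As a rough illustration of ‘noticing confusion’ in project 3, the sketch below averages the predictions of several models (a crude stand-in for a Bayesian posterior over models) and defers to the overseer when the entropy of the averaged prediction is high. The function names, the example probabilities and the entropy threshold are all assumptions made for this example.

```python
import math


def predictive_entropy(member_probs):
    """Entropy of the prediction averaged over all ensemble members."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean = [sum(p[i] for p in member_probs) / n_members for i in range(n_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0)


def act_or_ask(member_probs, max_entropy=0.5):
    """Return the predicted class index, or defer when the ensemble is confused.

    max_entropy is a hypothetical threshold in nats.
    """
    if predictive_entropy(member_probs) > max_entropy:
        return "ask overseer"
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean = [sum(p[i] for p in member_probs) / n_members for i in range(n_classes)]
    return mean.index(max(mean))


# Hypothetical ensemble outputs for two inputs.
agreeing = [[0.95, 0.05], [0.90, 0.10], [0.97, 0.03]]
disagreeing = [[0.90, 0.10], [0.10, 0.90], [0.50, 0.50]]

print(act_or_ask(agreeing))
print(act_or_ask(disagreeing))
```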

4
I'm studying possible structures of approval-directed agents
    because I want to understand how much human thought and input
    they would require
        in order to help my reader understand whether
        approval-directed agents are feasible.


Humans Consulting HCH
---------------------

5
I'm studying the articles linked from Humans Consulting HCH,
    because I want to understand the section ‘Hope’ from Humans
    Consulting HCH
        in order to be able to think about the tension between the
        system being capable and reflecting the human's judgment.


Corrigibility (Christiano 2017)
-------------------------------

6
I'm studying Omohundro's preferences-about-your-utility-function
case,
    because I want to implement it
        in order to help my reader understand how it can be
        implemented and whether it is unstable.

7
I'm studying the change in prediction failures as a predictor
becomes stronger,
    because I want to find out whether the failures not only
    become fewer, but also harder to detect over time,
        in order to help my reader understand whether systems
        built on such predictors will remain corrigible as they
        become stronger.


Iterated Distillation and Amplification
---------------------------------------

8
I'm studying the capability and safety-relevant properties of
imitation learning,
    because I want to find out whether it can produce aligned
    agents,
        in order to help my reader understand how to even get to
        the base case of iterated distillation and amplification.


Benign model-free RL
--------------------

9
I'm studying the performance of benign model-free RL,
    because I want to know whether the claim that it will achieve
    state-of-the-art performance is true,
        in order to help my reader understand how useful benign
        model-free RL will be.

10
I'm studying reward learning and robustness,
    because I want to know whether they can achieve competitive
    agents without malign behaviour,
        in order to help my reader understand whether benign
        model-free RL will be possible.

11
I'm implementing (parts of) benign model-free RL,
    because I want to know what works and doesn't work in
    practice,
        in order to help my reader understand which parts of the
        scheme need further conceptual research.


Supervising strong learners by amplifying weak experts
------------------------------------------------------

12
I'm studying ways to improve the sample efficiency of a supervised
learner,
    because I want to know how to reduce the number of calls to H
    in CSASupAmp,
        in order to help my reader understand how we can adapt
        that proof-of-concept for solving real-world tasks that
        require even more training data.

13
I'm studying the effects of how CSASupAmp samples questions,
    because I want to know how to sample questions in a way that
    improves the scheme's learning performance,
        in order to help my reader understand how we can adapt
        that proof-of-concept for solving real-world tasks that
        require even more training data.


Machine Learning Projects for Iterated Distillation and Amplification
--------------------------------------------------------------------

14
❇ Any of the projects there. At first glance Adaptive Computation
is the most interesting, but perhaps also requires the most
studying. I would ask Owain to find out what people are already
working on or what has received the most interest, then work on the
least crowded one.


Directions and desiderata for AI alignment
------------------------------------------

15
I'm studying integrating models of known human heuristics and
biases into IRL systems,
    because I want to improve the performance of IRL in a domain
    where it is hindered by the discrepancies between the existing
    error models and actual human irrationality
        in order to help my readers understand how to get IRL
        systems to infer true human values despite Stuart
        Armstrong's impossibility result.


The reward engineering problem
------------------------------

16
I'm experimenting with semi-supervised reinforcement learning,
    because I want to find out how humans can supervise machine
    learning with reasonably small effort,
        in order to help my reader understand how to avoid
        optimizing proxy objectives that we have to use because
        the sample hunger of current ML algorithms is so great.

17
I'm studying the use of a discriminator in imitation learning,
    because I want to find out how to help humans produce
    demonstrations that the agent can imitate,
        in order to help my reader understand how we might use
        imitation learning to solve the reward engineering
        problem.


Capability amplification (Christiano 2016)
------------------------------------------

18
I'm studying cognitive tasks and how to decompose them into ever
simpler steps,
    because I want to find algorithms for capability amplification
        in order to help my reader understand the nature of
        obstacles to capability amplification. – What is the
        obstacle? How exactly is it an obstacle? Make it simple.


Learning with catastrophes
--------------------------

19
❇ I'm studying the development of the performance of an adversary
in adversarial training,
    because I want to find out whether the adversary gets worse as
    the primary agent becomes more robust, or whether the
    adversary gets traction at all when the primary agent is
    already quite robust,
        in order to help my reader understand how confident we
        could be in a red team to find all relevant catastrophic
        situations.


Thoughts on reward engineering
------------------------------

20
I'm studying the effects of importance sampling on the behaviour
that an RL agent learns,
    because I want to find out whether it can lead to undesirable
    outcomes
        in order to help my reader understand whether importance
        sampling can solve the problem of widely varying rewards
        in reward engineering.
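
For readers unfamiliar with the technique in project 20, this is a minimal sketch of importance sampling with a rare, large reward: the rare outcome is sampled more often than it naturally occurs, and each sample is reweighted so that the estimate of the expected reward stays unbiased. The reward values, probabilities and function name are assumptions for illustration; the sketch shows only the estimator, not its effect on what an RL agent learns.

```python
import random


def importance_sampled_mean(rewards, target_probs, proposal_probs, n_samples, rng):
    """Estimate the expected reward under target_probs using samples
    drawn from proposal_probs, reweighted by target/proposal."""
    indices = list(range(len(rewards)))
    total = 0.0
    for _ in range(n_samples):
        i = rng.choices(indices, weights=proposal_probs)[0]
        weight = target_probs[i] / proposal_probs[i]
        total += weight * rewards[i]
    return total / n_samples


# Hypothetical setting: a reward of 1000 occurs once in a thousand steps.
rewards = [1000.0, 1.0]
target_probs = [0.001, 0.999]   # how often each outcome really occurs
proposal_probs = [0.5, 0.5]     # oversample the rare, large reward

true_mean = sum(r * p for r, p in zip(rewards, target_probs))
estimate = importance_sampled_mean(
    rewards, target_probs, proposal_probs, n_samples=10_000, rng=random.Random(0)
)
print(true_mean, estimate)
```

With this proposal every reweighted sample is close to 2, so the variance of the estimate is far lower than with naive sampling, where most batches would never see the large reward at all.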

21
I'm studying the effects of an inconsistent comparison function on
optimizing with comparisons,
    because I want to know whether it prevents the two agents from
    converging on a desirable equilibrium quickly enough
        in order to help my reader understand whether optimizing
        with comparisons can solve the problem of inconsistency
        and unreliability in reward engineering.


Techniques for optimizing worst-case performance
------------------------------------------------

22
❇ I'm studying transparency in the service of adversarial training
(using transparency to ease finding adversarial examples, or to
detect adversarial success earlier/more often),
    because I want to make the adversary ten times more effective
        in order to help my reader understand how to build ML
        systems that never fail catastrophically.


Reliability amplification
-------------------------

23
I'm studying the impact of overseer failure on RL-based IDA,
    because I want to know under what conditions the amplification
    increases or decreases the failure rate,
        in order to help my reader understand whether we need to
        combine capability amplification with explicit reliability
        amplification in all cases.
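
The question in project 23 can be illustrated with a toy calculation. Assuming, purely for the sake of the example, that each overseer call fails independently with probability eps and that an amplified agent fails whenever any of its calls fails, the failure rate grows with the number of calls. A majority vote over three independent copies, one simple form of reliability amplification, pushes it back down. Both functions are hypothetical simplifications, not the model used in the actual project.

```python
def failure_after_calls(eps, n_calls):
    """Probability that at least one of n_calls independent calls fails."""
    return 1 - (1 - eps) ** n_calls


def failure_after_majority_of_three(eps):
    """Probability that at least two of three independent copies fail."""
    return 3 * eps**2 * (1 - eps) + eps**3


eps = 0.01  # assumed per-call failure probability

# Amplification without voting: the failure rate rises with the call count.
print(failure_after_calls(eps, 10))

# Majority voting: the failure rate drops below the per-call rate.
print(failure_after_majority_of_three(eps))
```

The independence assumption carries the whole result here; correlated overseer failures would weaken the benefit of voting.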


Security amplification
----------------------

24
I'm studying adversarial examples for meta-execution with ML-based
sub-agents,
    because I want to find a case where security amplification by
    meta-execution fails to amplify security,
        in order to help my reader understand what obstructions to
        security amplification there are.


Meta-execution
--------------

No project idea comes directly from the article. Ought has put
forth some open issues (cf.
https://docs.google.com/document/d/1xzFuDD1xiG-oe750MYrP9PEgwEXtxbfSBnHrgrCRnhY/edit#),
but that list might be outdated and would be too closely tied to
Ought.