Which of these five AI alignment research projects ideas are no good?

rmoehn8 Aug 2019 7:17 UTC

25 points

AI Machine Learning (ML)Research Agendas

I’ll post five AI alignment research project ideas as comments. It would be great if you could approval-vote on them by using upvotes. Ie. when you think the project idea isn’t good, you leave the comment as is; otherwise you give it a single upvote.

The project ideas follow this format (cf. The Craft of Research):

I'm studying <topic>,
    because I want to <question that guides the search>,
        in order to help my reader understand <more significant
        question that would be informed by an answer to the
        previous question>.

The project ideas are fixed-width in order to preserve the indentation. If they get formatted strangely, you might be able to fix it by increasing the width of your browser window or zooming out.

What links here?

Twenty-three AI alignment research project definitions by rmoehn (3 Feb 2020 22:21 UTC; 23 points)

rmoehn8 Aug 2019 7:17 UTC

25 points

13 comments1 min readLW link

AI Machine Learning (ML)Research Agendas

rmoehn 8 Aug 2019 7:10 UTC

5 points

I'm studying Bayesian machine learning,
    because I want to understand how to make ML systems that notice when they
    are confused
        in order to help my reader understand how to make ML systems that will
        ask the overseer for input when doing otherwise would lead to failure.
- More a study project than a research project.

rmoehn 8 Aug 2019 7:11 UTC

4 points

I'm studying ways to improve the sample efficiency of a supervised learner,
    because I want to know how to reduce the number of calls to H in
    ‘Supervising strong learners by amplifying weak experts’
    (https://www.lesswrong.com/s/EmDuGeRw749sD3GKd/p/xKvzpodBGcPMq7TqE),
        in order to help my reader understand how we can adapt that
        proof-of-concept for solving real world tasks that require even more
        training data.
- This doesn’t just mean achieving more with the samples we have. It can mean
  finding new kinds of samples that convey more information, and finding new
  ways of extracting them from the human and conveying them to the learner.

rmoehn 10 Aug 2019 0:28 UTC
1 point
0
Thanks for the votes so far! The poll is still open.

By the way, I’d prefer if you only give upvotes. That’s how approval voting works. If you’re concerned that it would skew my total karma, feel free to balance your upvotes by voting down this comment.
- Alicorn 10 Aug 2019 18:03 UTC
  6 points
  0
  Parent
  Are you aware that people’s votes are worth different amounts? I do not think there’s a way to vote less than one’s default vote amount.
  - rmoehn 10 Aug 2019 21:58 UTC
    1 point
    0
    Parent
    No, I wasn’t aware of that. Then I guess I have to come up with a different mechanism for my next poll.
    - Ben Pace 10 Aug 2019 22:52 UTC
      3 points
      0
      Parent
      Note that if people do only give upvotes, then you can hover over a comment’s score to see the total number of votes on it, which is what you’re looking for here.
      - rmoehn 11 Aug 2019 4:45 UTC
        1 point
        0
        Parent
        Good idea, thank you!
    - Raemon 10 Aug 2019 22:27 UTC
      3 points
      0
      Parent
      FWIW, I think sooner or later the LW team will implement something that better facilitates polls, but it may be awhile.

rmoehn 8 Aug 2019 7:12 UTC

1 point

I'm studying the use of a discriminator in imitation learning,
    because I want to find out how to help humans produce demonstrations that
    the agent can imitate,
        in order to help my reader understand how we might use imitation
        learning to solve the reward engineering problem.

rmoehn 8 Aug 2019 7:12 UTC

0 points

I'm studying the effects of importance sampling on the behaviour that an RL
agent learns,
    because I want to find out whether it can lead to undesirable outcomes
        in order to help my reader understand whether importance sampling can
        solve the problem of widely varying rewards in reward engineering.

rmoehn 8 Aug 2019 7:12 UTC
−1 points
0
```
I'm studying the effects of an inconsistent comparison function on optimizing
with comparisons,
    because I want to know whether it prevents the two agents from converging on
    a desirable equilibrium quickly enough
        in order to help my reader understand whether optimizing with
        comparisons can solve the problem of inconsistency and unreliability in
        reward engineering.
```
- John_Maxwell 10 Aug 2019 4:37 UTC
  2 points
  0
  Parent
  Can you explain this one a bit more? It seems to me that if the human is giving inconsistent answers, in the sense that the human says A > B and B > C and C > A, then the thing to do is to flag this and ask them to resolve the inconsistency instead of trying to find a way to work around it. Interpretability > Magic, I say.
  - rmoehn 11 Aug 2019 5:10 UTC
    1 point
    0
    Parent
    I don’t think that would work in this case. I derived the project idea from Thoughts on reward engineering, section 2. There the overseer generates rewards based on its preferences and provides these rewards to RL agents.
    
    Suppose the training starts with the overseer generating rewards from its preferences and the agents updating their value functions accordingly. After a while the agents propose something new and the overseer generates a reward that is inconsistent with those it has generated before. But it happens that this one is the true preference and the proper fix would be to revise the earlier rewards. However, rewarded is rewarded – I guess it would be hard to reverse the corresponding changes in the value functions.
    
    Of course one could record all actions and rewards and snapshots of the value functions, then rewind and reapply with revised rewards. But given today’s model sizes and training volumes, it’s not that straightforward.