Thanks for writing this up! I support your call for more alignment research that looks more deeply at the structure of the objective/reward function. In general I feel that the reward function part of the alignment problem/solution space could use much more attention, especially because I do not expect the traditional ML research community to look there.
Traditional basic ML research tends to abstract away from the problem of writing an aligned reward function: it is all about investigating improvements to general-purpose machine learning, machine learning that can optimize for any possible ‘black box’ reward function R.
In the work you did, you show that this black-box view of the reward function is too narrow. Once you open up the black box and treat the reward function as a vector, you can define additional criteria for when machine learning performance is aligned or unaligned.
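To make the black-box contrast concrete, here is a minimal Python sketch (my own illustration, with made-up component names, not taken from your paper): once the reward is a vector of named objectives instead of an opaque scalar R, you can state criteria like ‘an update is aligned only if it does not trade safety for task performance’.

```python
from typing import Dict

# Hypothetical vector-valued reward: each component is a named objective.
# In a real agent these numbers would be computed from observations.
def vector_reward(task_score: float, safety_score: float) -> Dict[str, float]:
    return {"task": task_score, "safety": safety_score}

# One possible alignment criterion, expressible only because the box is
# open: an update is aligned if it improves the task component without
# regressing the safety component.
def update_is_aligned(old: Dict[str, float], new: Dict[str, float]) -> bool:
    return new["task"] >= old["task"] and new["safety"] >= old["safety"]

# A task gain bought with a safety regression is flagged as unaligned,
# a distinction a single black-box scalar R cannot even express.
print(update_is_aligned(vector_reward(1.0, 0.5), vector_reward(2.0, 0.4)))  # False
```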
In general, I have found that once you take the leap and start contemplating reward function design, certain problems of AI alignment become much more tractable. To give an example: the management of self-modification incentives in agents becomes kind of trivial if you can add terms to the reward function which read out some physical sensors; see, for example, section 5 of my paper here. A rough sketch of the idea follows below.
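The real construction is in the paper; the constant and sensor below are illustrative stand-ins only. The point is just that the extra reward term makes detected self-modification immediately unprofitable:

```python
# Hedged sketch: the reward gets an extra term wired to a physical
# tamper sensor, so any self-modification the sensor can detect is
# immediately unprofitable for the agent.

TAMPER_PENALTY = 1000.0  # assumed constant, chosen large relative to task reward

def shielded_reward(task_reward: float, tamper_sensor_triggered: bool) -> float:
    # In practice tamper_sensor_triggered would come from real hardware,
    # e.g. a switch on the enclosure holding the agent's compute.
    return task_reward - (TAMPER_PENALTY if tamper_sensor_triggered else 0.0)

print(shielded_reward(1.0, False))  # 1.0: normal operation
print(shielded_reward(1.0, True))   # -999.0: tampering detected and penalized
```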
So I have been somewhat puzzled by the question of why there is so little alignment research in this direction, or why so few people step up and point out that this kind of stuff is trivial. Maybe this is because improving the reward function is not considered to be part of ML research. If I try to manage self-modification incentives with my hands tied behind my back, without being allowed to install physical sensors coupled to reward function terms, the whole problem becomes much less tractable. Not completely intractable, but the solutions I then find (see this earlier paper) are mathematically much more complex, and less robust to machine learning mistakes.
I sometimes have the suspicion that there are whole non-ML conferences or bodies of literature devoted to alignment-related reward function design, but I am just not seeing them. Unfortunately, it looks like the modem2021 workshop website with the papers you linked to is currently down. It was working two weeks ago.
So a general literature-search question: while doing your project, did you encounter any interesting conferences or papers that I should be reading, if I want to read more work on aligned reward function design? I have already read Human-aligned artificial intelligence is a multiobjective problem.
While the Modem website is down, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing
The paper is now published with open access here:
https://link.springer.com/article/10.1007/s10458-022-09586-2
The only resource I’d recommend, beyond MODEM when that’s back up and our upcoming JAAMAS special issue, is to check out Elicit, Ought’s GPT-3-based AI literature search engine (yes, they’re teaching GPT-3 about how to create a superintelligent AI. hmm). It’s in beta, but if you end up stuck on the waitlist, email me and I’ll suggest they add you. I wouldn’t say it’ll necessarily show you research you’re not aware of, but I found it very useful for getting into the AI Alignment literature for the first time myself.
https://elicit.org/